I have a statsforecast arima model, and would like to perform rolling h-step predictions.
My model is trained on a very large quantity of data, so I do not want to retrain it as new data comes in. Essentially, I want to use the actual new observations to predict the next h values, without using those observations as part of training.
How can that be done? model.predict() only predicts h steps after the end of the training set, and does not let me append new observations without training on them first.
I see that the second notebook you sent is on transfer learning. There might have been a misunderstanding, I am not trying to predict a different time series, just the continuation of the time series as new data comes in - precisely the use case for the ARIMA model.
How are the statsforecast models used when they actually feature in a pipeline? Do they have to be retrained from scratch after every observation?
Thank you again for your time and help
07/08/2022, 3:19 PM
Sorry, maybe I wasn’t very clear. The case of predicting a time series with a pre-trained model on that same time series could be understood as a very specific case of transfer learning.
In other words, you could follow the Colab and use the model you trained on your data to predict that same data. Hopefully, that will achieve your practical goal, namely saving time on new training without compromising too much accuracy.
Regarding `statsforecast`: currently you do have to retrain every time you want to update the model with new data. That doesn’t necessarily mean you have to retrain with every new observation.
Example: you want to predict the next 48 hours every 24 hours although you get new info every hour. In that case you would retrain from scratch every 24 hours, say at noon.
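A rough sketch of that schedule, in case it helps. Everything here is an assumption for illustration: `scheduled_forecasts` and `seasonal_naive_forecast` are hypothetical names, and the seasonal-naive function is only a stand-in for whatever fit-and-predict step you actually use (it is not the `statsforecast` ARIMA).

```python
from typing import Callable, List, Sequence, Tuple


def seasonal_naive_forecast(history: Sequence[float], h: int, season: int = 24) -> List[float]:
    """Placeholder model (assumption): repeat the last `season` observations.

    Swap this for your real "fit on history, then predict h steps" routine.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    last = list(history[-season:])
    return [last[i % len(last)] for i in range(h)]


def scheduled_forecasts(
    stream: Sequence[float],
    initial_history: Sequence[float],
    h: int = 48,
    retrain_every: int = 24,
    fit_predict: Callable[[Sequence[float], int], List[float]] = seasonal_naive_forecast,
) -> List[Tuple[int, List[float]]]:
    """Collect every new observation, but only refit and forecast every `retrain_every` steps."""
    history = list(initial_history)
    forecasts = []
    for i, observation in enumerate(stream, start=1):
        history.append(observation)  # new info arrives every hour
        if i % retrain_every == 0:   # refit only once per day
            forecasts.append((len(history), fit_predict(history, h)))
    return forecasts
```

With hourly data and the defaults above, 48 new observations would trigger two refits, each producing a 48-step forecast.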
To oversimplify, people tend to deploy these models into production by orchestrating workflows or pipelines that:
1. Extract data, do some wrangling, and load it into memory at regular intervals or in response to a specific event
2. Run the algorithms. Sometimes people run different models and then choose the best one based on defined error metrics.
3. Output the prediction somewhere
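The three steps above could look roughly like this. It is a minimal sketch under stated assumptions: `run_pipeline`, `naive`, and `historic_mean` are hypothetical names, the two toy forecasters stand in for real models, and "best" is chosen by mean absolute error on the last h observations held out from training.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Forecaster = Callable[[Sequence[float], int], List[float]]


def naive(history: Sequence[float], h: int) -> List[float]:
    """Toy model (assumption): repeat the last observed value."""
    return [history[-1]] * h


def historic_mean(history: Sequence[float], h: int) -> List[float]:
    """Toy model (assumption): repeat the mean of the history."""
    return [mean(history)] * h


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def run_pipeline(series: Sequence[float], h: int, models: Dict[str, Forecaster]) -> dict:
    if len(series) <= h:
        raise ValueError("series must be longer than the forecast horizon")
    # Step 1: the extracted and cleaned data is assumed to arrive as `series`.
    train, holdout = series[:-h], series[-h:]
    # Step 2: run every candidate on the training part and score it on the holdout.
    scores = {name: mae(holdout, model(train, h)) for name, model in models.items()}
    best = min(scores, key=scores.get)
    # Refit the winner on all available data to produce the final forecast.
    forecast = models[best](series, h)
    # Step 3: return the result; a real pipeline would write it to a database or file.
    return {"model": best, "scores": scores, "forecast": forecast}
```

A scheduler such as cron would then call `run_pipeline` at whatever interval you pick.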
For simple pipelines, we have seen people use tools like cron on cloud servers.
For more complicated pipelines, you can use things like Prefect, Airflow, Ray, or MLflow, and even run the computation on clusters of different machines. If you use spot instances, this process is actually quite cheap and fast.
I’m sure you knew most of this stuff but wanted to be as clear as possible.
If you need help with your particular implementation, we could schedule a call.
07/08/2022, 3:21 PM
Hello Max, thank you so much for the thorough response. I think that everything is clear now! Have a great weekend