#general

Nasreddine D

05/07/2023, 3:48 PM
Hi, I am quite new to the time series field and am starting a forecasting project. I would like to handle it properly, with good project management practice and clear communication to other people, and to end up with a kind of template that colleagues can reuse for future needs. I have many questions, but I'll try to keep it simple to start (please note I have been through the documentation for StatsForecast and MLForecast).
Context: A previous POC was made to forecast the monthly company revenue (univariate TS, around 230 months history, 10y history, forecast horizon 24 months) using Holt-Winters (tuned parameters). The libraries used were pandas and statsmodels.
1. Generally speaking, what is a good forecasting process? Below is my understanding; please correct it if I am wrong.
   a. Analysis (finding trends, seasonalities, outliers, missing values...)
      i. Can I do it with the Nixtla stack, or should I also use other libraries?
   b. Choose metrics
      i. Can I choose MAPE for one univariate TS?
   c. Cross-validation
      i. Only train and test, or train/val/test?
      ii. n_window: is there a good number of windows? If I use 5 or 30, the best model won't be the same.
      iii. Should the mean score I get from cross-validation be the score I communicate to management to explain the capability of a model?
   d. Choose the best model
   e. Forecast the desired horizon
2. I would like to use the Nixtla stack. For my time series, where should I start?
   a. StatsForecast, then MLForecast and finally NeuralForecast?
   b. StatsForecast questions:
      i. Should I try all models and compare the results with cross-validation? Is it appropriate to tune these models? (I have not seen how to do it in the documentation.)
   c. MLForecast questions:
      i. My understanding here is that we have to "transform" the target (if needed) and create relevant features.
         1. How can I know which transformation to use for the target?
         2. How can I know which features are best to create, and how many?
      ii. I will try to find relevant external data that can be used to help the model.
         1. Should I do it here, after a first iteration with only the target? Can it also be done with StatsForecast and NeuralForecast?
         2. Are there ways to evaluate whether external data is relevant?
   d. NeuralForecast: I have not been through the documentation yet, so I will send questions later once I have read it.
   e. Hierarchical forecast: this can be a last step; I will have to explore the data deeply before getting to this, and understand all of the above.
      i. Is it correct to try this type of forecasting and then compare it to the others above? More questions when I get there...
Thank you very much for taking the time to read and answer me. I hope my questions are relevant and in the right place.
Best regards,
Nass

Max (Nixtla)

05/08/2023, 8:01 PM
Thanks for the very detailed question. I think this is a great index for a tutorial that we should write. In the meantime, I would recommend the following:
• Follow this End to End Walkthrough: _Model training, evaluation and selection for multiple time series._
Some brief answers to your points:
1. a.) Nixtla does not support exploratory analysis per se. (Here is a tutorial using pandas profiling.) TIP: Speak to your colleagues from business and operations to find important categorical variables. For example, if there is a particular month where the business ran a promotion, you could try to create a categorical variable like promotion = 1 or 0. The same goes for special events like COVID.
1. b.) Short: don't use MAPE, maybe use MAE.
   a. Long read:
      i. Forecast KPIs: RMSE, MAE, MAPE & Bias
      ii. Time Series Forecast Error Metrics You Should Know
2. Yes, start with statsforecast. Then follow: https://nixtla.github.io/statsforecast/examples/statisticalneuralmethods.html
"i. Should I try it with all models and see the result with a crossval? Is it appropriate to tune these models? (I have not seen how to perform it in the documentation)"
1. Start with the auto models. These models find the best parameters for you: AutoTheta, AutoMSTL, etc.
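To make the "don't use MAPE, maybe use MAE" advice concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration, and the helpers `mae` and `mape` are hypothetical names written for this example, not functions from any Nixtla library.

```python
def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error; undefined when an actual value is 0."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Made-up monthly values: the forecast misses by the same 10 units every month.
actual = [100.0, 200.0, 10.0, 400.0]
forecast = [110.0, 210.0, 20.0, 410.0]

print(f"MAE  = {mae(actual, forecast):.1f}")    # every error is 10 units
print(f"MAPE = {mape(actual, forecast):.1f}%")  # dominated by the month with a small actual
```

The absolute miss is identical in every month, yet the single month with a small actual value contributes most of the MAPE. That sensitivity is one reason the percentage metric can be hard to interpret.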

Nasreddine D

05/09/2023, 1:26 PM
Hi Max, Thanks for the answer 🙂. It will help me get started. I confirm that a full tutorial on handling a time series forecasting project using all Nixtla tools (and others) would be great and helpful. I did a first iteration with StatsForecast using the Auto models and got the first results (which seem good). I quickly tried pandas-profiling; it is very simple for now. I am going to follow this tutorial now and see the results: https://nixtla.github.io/statsforecast/examples/statisticalneuralmethods.html

fede (nixtla) (they/them)

05/09/2023, 6:44 PM
Hey @Nasreddine D! Thanks for using the nixtlaverse. Complementing @Max (Nixtla), here are a couple of ideas regarding your questions:
1. a. Yes, the best practice is to start the pipeline with a proper EDA. A good approach to finding trends and seasonalities is to use the MSTL model (https://nixtla.github.io/statsforecast/examples/multipleseasonalities.html) to decompose the series and find relevant patterns. You could also use tsfeatures (https://github.com/Nixtla/tsfeatures) to extract relevant information about your series (such as sparsity, entropy, and autocorrelation strength, among others). Currently, handling missing values is out of the scope of our libraries (we are working on that), but in this tutorial (https://github.com/Nixtla/statsforecast/blob/main/experiments/bigquery/src/statsforecast-fugue-citibikes-trips.ipynb) you can find an approach to filling them.
   b. The MAPE metric is usually extremely hard to judge (https://blog.blueyonder.com/mean-absolute-percentage-error-mape-has-served-its-duty-and-should-now-retire/). Scaled metrics such as MASE or RMSSE might be a better option.
   c. Having more than one cross-validation window is usually better practice. The exact number of windows depends on the quantity of your data and your use case. If you have long time series and are willing to wait, we suggest using all the possible windows.
2. a. Yes, that’s a good approach. StatsForecast contains the simplest models (Naive and SeasonalNaive), so after creating a benchmark with those models, you should start building more complex ones.
   b. In our experience, the automated models (AutoARIMA, AutoETS, AutoTheta, and AutoCES) produce good benchmarks, and the tuning is performed inside them, so there is no need to tune them.
   c. i. This is usually an empirical question. In our experience, an iterative process where you start with the simplest features and transformations (for example, adding differences and lags) and then increase complexity leads to good results. Transformations such as scaling (MinMax, Standardize) and BoxCox are good options to test.
      ii. Yes, the three libraries can handle external regressors (StatsForecast through the AutoARIMA model). To evaluate whether external data is relevant, perform cross-validation with and without that data and compare the models’ performances.
   e. HierarchicalForecast assumes that you have a hierarchical structure (for example, if you want to predict national-level sales and state-level sales and want them to be coherent, meaning that if you add up the forecasts at the state level you’ll get the national level). The algorithms in the library are agnostic: you can use any algorithm to produce forecasts and then reconcile them using the library. Here’s an introduction to the topic: https://nixtla.github.io/hierarchicalforecast/examples/introduction.html.
Please let us know if you have any further questions.
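As a sketch of the scaled-metric idea from point 1.b above, the snippet below computes MASE by hand: the forecast's MAE divided by the in-sample MAE of a seasonal naive forecast. The function `mase` and the data are illustrative assumptions, not the implementation shipped in any Nixtla package.

```python
def mase(train, actual, forecast, season_length=1):
    """Mean absolute scaled error.

    The forecast MAE is divided by the in-sample MAE of a (seasonal) naive
    forecast, so values below 1 mean "better than the naive benchmark".
    """
    naive_errors = [
        abs(train[t] - train[t - season_length])
        for t in range(season_length, len(train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / scale


# Made-up series with a season of length 4 (for example, quarterly data).
train = [10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0]
actual = [14.0, 24.0, 34.0, 44.0]
forecast = [13.0, 25.0, 33.0, 45.0]

score = mase(train, actual, forecast, season_length=4)
print(f"MASE = {score:.2f}")
```

Because the error is expressed relative to a naive benchmark on the same series, the score does not depend on the scale of the data, which makes it easier to communicate than a raw percentage error.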

Nasreddine D

05/10/2023, 2:31 PM
Hi, Thanks again for these details. I will take all these comments into account. I am trying to add exogenous data to the AutoARIMA model and use cross-validation to evaluate whether the performance is better than with the target alone. I checked this tutorial, but it only uses a train/test split: https://nixtla.github.io/statsforecast/examples/exogenous.html#train-model. I can't find how to adapt it, because there is no "X_df" in StatsForecast.cross_validation(). Can you suggest something I can use? Another question regarding this statement: "If the future values of the exogenous regressors are not available, then they must be forecasted or the regressors need to be eliminated from the model. Without them, it is not possible to generate the forecast." My exogenous data is not known in the future; does that mean I have to forecast it before using any model? Could this add bias? Thanks again for your help.
I am trying to create features from tsfeatures, but I don't understand how the features created can be used within my dataset. When I run it, I get the result below. What should I do with it? Thank you very much.

fede (nixtla) (they/them)

05/10/2023, 8:18 PM
Hey @Nasreddine D! Usually those features are used to explore the characteristics of the time series to plan in advance what models would perform better and cluster them. Here’s a reference on the topic: https://otexts.com/fpp3/features.html.
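To illustrate what such a feature looks like, here is a hand-written sketch that computes one summary statistic per series (autocorrelation at two lags). It is not tsfeatures' own code; the function `autocorrelation` and the data are assumptions made for this example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


# Made-up series that repeats every 4 steps.
series = [1.0, 2.0, 3.0, 2.0] * 6

# One row of descriptive features for the whole series, not one value per date.
features = {
    "acf_lag1": autocorrelation(series, 1),
    "acf_lag4": autocorrelation(series, 4),
}
print(features)
```

A high autocorrelation at the seasonal lag hints that seasonal models are worth trying. The result describes the series as a whole, so it is used for exploration or clustering and is not appended as a column to the training data.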
Sorry @Nasreddine D, I missed the previous questions. Here are some ideas:
• The `cross_validation` method automatically handles the exogenous variables. So if you have more variables after the target column `y`, they will be considered exogenous variables and used by the models that allow it. Since, for each window, the exogenous variables of the future are available in cross-validation, their handling is done automatically.
• Yes, an approach to use unknown exogenous variables is to forecast them separately and use the forecasted values to produce forecasts of the target variable.
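The two-step idea in the last bullet can be sketched in plain Python. The models here are deliberately simple stand-ins chosen for this example (a naive forecast for the regressor and a one-variable least-squares line for the target), not AutoARIMA; `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return mean_y - slope * mean_x, slope


# Made-up history: the target is driven by one exogenous variable.
exog_history = [1.0, 2.0, 3.0, 4.0, 5.0]
target_history = [12.0, 14.0, 16.0, 18.0, 20.0]  # y = 10 + 2 * x

horizon = 3

# Step 1: forecast the exogenous variable (here: naive, repeat the last value).
exog_forecast = [exog_history[-1]] * horizon

# Step 2: fit the target model on history, then feed it the forecasted regressor.
intercept, slope = fit_line(exog_history, target_history)
target_forecast = [intercept + slope * x for x in exog_forecast]

print(target_forecast)
```

Any error in the step 1 forecast carries over into step 2, so it is worth checking with cross-validation whether the regressor still helps once its future values are themselves forecasts.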

Nasreddine D

05/12/2023, 9:45 AM
Hi @fede (nixtla) (they/them), Thank you again, I just learned new things about features 🙂! Regarding `cross_validation`:
• Rolling window vs expanding window:
  ◦ Is there a way to choose one or the other? Or what is the best approach?
• This is the configuration I've used to test different models from statsforecast. I would like to do the same with `N-HITS`, but I want it to be comparable (same number of windows). I am not sure how it will work, because with `N-HITS` there must be a validation set to adjust the model and a test set? If I put `n_windows=100` with N-HITS, what will happen to val/test? Hope my question is clear.
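# sf is the StatsForecast object holding the models being compared
crossvalidation_df = sf.cross_validation(
    df=Y_ts,
    h=24,           # forecast horizon: 24 months
    step_size=1,    # shift the window by one month each time
    n_windows=100,  # number of evaluation windows
)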
I am going to start with NeuralForecast or MLForecast:
• Which one should I start with? (Remember my TS has around 180 months of history, so not very long.)
• For these 2 libraries, should I create features? Or is that just for MLForecast?
• Feature engineering: what is the best strategy? I read I could create a bunch of features and then select the best ones (lasso...). Do you know any resource that explains that with code?
• Do I need to normalize the data for N-HITS? And for other models in NeuralForecast?
Thanks again for your valuable time.
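To see what `h=24`, `step_size=1`, `n_windows=100` implies for the size of each training set, the sketch below enumerates the windows by index. It assumes the last window ends at the final observation and each earlier window is shifted back by `step_size`; `cv_windows` is a hypothetical helper written for this illustration, not library code.

```python
def cv_windows(n_obs, h, step_size, n_windows):
    """Return (train_end, test_end) index pairs for each window.

    Training data is series[:train_end]; the test block is
    series[train_end:test_end]. The last window ends at the last observation.
    """
    test_size = h + step_size * (n_windows - 1)
    if test_size >= n_obs:
        raise ValueError("Not enough observations for this configuration.")
    first_train_end = n_obs - test_size
    return [
        (first_train_end + i * step_size, first_train_end + i * step_size + h)
        for i in range(n_windows)
    ]


# 230 monthly observations is the history length given at the start of the thread.
windows = cv_windows(n_obs=230, h=24, step_size=1, n_windows=100)

print(len(windows))  # number of windows
print(windows[0])    # first window: shortest training set
print(windows[-1])   # last window: ends at the final observation
```

Under these assumptions the first window trains on only 107 months, less than half of the history, which is worth keeping in mind when comparing models that need long training sets.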

fede (nixtla) (they/them)

05/17/2023, 8:40 PM
Hey @Nasreddine D! Yes, you can choose rolling window (default behavior) and expanding window (statsforecast and mlforecast) using `refit=False` in `cross_validation` (currently this option is only available for the Auto models in statsforecast). If you set `n_windows=100`, those windows will be treated as a test set (the model will not use those values during training). Usually, a good workflow starts with statsforecast, then mlforecast, and then neuralforecast (less complex to more complex models). About features: neuralforecast does not need them, but you’ll need to specify them when using mlforecast. The best strategy for feature engineering is to start with lags and simple transformations and then add more to see if the cross-validation signal improves. It is always best practice to scale (or normalize) the data when using global models (neuralforecast or mlforecast). The models included in the neuralforecast library can receive the `scaler_type` argument to perform different scaling strategies; here’s an example: https://nixtla.github.io/neuralforecast/examples/longhorizon_probabilistic.html
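As a rough picture of what "start with lags and simple transformations" and "scale the data" mean, here is a plain-Python sketch. Both helpers (`make_lag_rows`, `standardize`) and the data are assumptions for illustration; mlforecast and neuralforecast do this internally through their own arguments.

```python
def make_lag_rows(series, lags):
    """Build (features, target) rows where the features are lagged values."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = {f"lag{lag}": series[t - lag] for lag in lags}
        rows.append((features, series[t]))
    return rows


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


# Made-up series.
series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

scaled = standardize(series)
rows = make_lag_rows(series, lags=[1, 2])

for features, target in rows:
    print(features, "->", target)
```

The first `max(lags)` observations have no complete feature row, so longer lags cost training rows on a short series. In a real evaluation, the scaling statistics would be computed on the training portion of each window only.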

Nasreddine D

05/21/2023, 2:06 PM
Hi @fede (nixtla) (they/them), thank you for your feedback. I am following your advice and process. Regarding feature engineering with cross-validation: let's say I start with lag 1, check the score, then add lag 2, check the score, and so on until lag 12, and then select the best features from the scores I have seen. Then I start the same process with window features... It will take forever. Is there a way of automating this process? Or is this how it should be done? Thank you.
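One possible way to automate the loop described above is greedy forward selection: try each remaining lag, keep the one that improves the score most, and stop when nothing helps. The sketch below is only an illustration under strong assumptions: the "model" is a stand-in that predicts the mean of the selected lagged values, the score is a rolling one-step-ahead MAE, and both function names are hypothetical.

```python
def one_step_mae(series, lags, n_test):
    """Rolling-origin, one-step-ahead MAE of a stand-in model.

    The stand-in model predicts the mean of the selected lagged values; a real
    project would plug in its own model and cross-validation here.
    """
    errors = []
    for t in range(len(series) - n_test, len(series)):
        prediction = sum(series[t - lag] for lag in lags) / len(lags)
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)


def forward_select_lags(series, candidate_lags, n_test):
    """Greedily add the lag that lowers the score most; stop when none helps."""
    selected, best_score = [], float("inf")
    remaining = list(candidate_lags)
    while remaining:
        scores = {
            lag: one_step_mae(series, selected + [lag], n_test) for lag in remaining
        }
        best_lag = min(scores, key=scores.get)
        if scores[best_lag] >= best_score:
            break
        selected.append(best_lag)
        best_score = scores[best_lag]
        remaining.remove(best_lag)
    return selected, best_score


# Made-up series that repeats every 4 steps.
series = [1.0, 5.0, 2.0, 8.0] * 8

selected, score = forward_select_lags(series, candidate_lags=range(1, 7), n_test=8)
print(selected, score)
```

Greedy selection evaluates far fewer combinations than an exhaustive search, at the cost of possibly missing feature sets that only work well together.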