# neural-forecast
It is not clear to me why neuralforecast has switched to preferring a logic based on "steps" rather than one based on "epochs". Correct me if I'm wrong, but when I think in terms of epochs I am thinking in terms of complete passes over all the time series that make up the dataset. Thinking in terms of steps, by contrast, I could for example train my model for a number of steps that falls halfway between 1 epoch and 2 epochs. In that case I am giving more weight to some time series than to others, because some will have contributed to minimizing the loss only once, while others will have contributed twice. So by reasoning in terms of steps (which can correspond to incomplete epochs) I am introducing a bias, giving greater weight to the time series that happen to have been sampled twice. Wouldn't it be "cleaner" to think in terms of epochs, so that all time series are considered the same number of times? Could you explain the logic behind preferring to define steps instead of epochs? Thank you
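To make the concern concrete, here is a minimal sketch (the series count, batch size, and fixed sampling order are all hypothetical, chosen only for illustration) of how stopping halfway between epoch 1 and epoch 2 lets some series contribute twice as many updates as others:

```python
import numpy as np

# Hypothetical setup: 10 series, 4 series per step, drawn in a fixed
# cyclic order (epoch-style, without replacement within an epoch).
n_series = 10
batch_size = 4
steps_per_epoch = int(np.ceil(n_series / batch_size))  # 3 steps = 1 full epoch

# Train for roughly 1.5 epochs' worth of steps.
total_steps = steps_per_epoch + steps_per_epoch // 2   # 4 steps

counts = np.zeros(n_series, dtype=int)
order = np.arange(n_series)  # fixed order purely for illustration
for step in range(total_steps):
    start = (step * batch_size) % n_series
    batch = order[np.arange(start, start + batch_size) % n_series]
    counts[batch] += 1

# After 4 steps some series have been seen twice, others only once:
# exactly the imbalance described above.
print(counts)
```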
Hi @Manuel! This is a very important question. We decided to define it in steps for several reasons:
ā€¢ We first saw the step definition in the NBEATS paper, where they demonstrated that it was possible to achieve SoTA performance on M4 without exhausting all observations.
ā€¢ Following that case, we have repeatedly observed and confirmed in many other experiments (for instance, our NHITS paper) that global models do not necessarily need to observe every possible window of each time series to achieve SoTA performance.
ā€¢ Our sampling operates in two stages. In each step, it first samples a batch of series, and then samples windows from those time series. With enough steps, the model will have observed windows from all time series.
ā€¢ We believe that (for general purposes) it is more flexible to train based on steps. It is much easier to define defaults and to control training times based on steps. Training by epochs is completely determined by the number of time series, so a fixed number (e.g. 10) can be too little or too much, producing sub-par performance (excessive training time, overfitting, etc.) in many cases.
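A rough sketch of what two-stage sampling looks like in principle (all names and sizes here are illustrative assumptions, not neuralforecast's actual internals): each step first draws a subset of series, then draws training windows from those series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 5 series of length 50.
series = [rng.normal(size=50) for _ in range(5)]
window_size = 12
series_per_step = 3   # stage 1: how many series to draw per step
windows_per_step = 8  # stage 2: how many windows to draw from them

def sample_step():
    # Stage 1: sample series uniformly without replacement.
    chosen = rng.choice(len(series), size=series_per_step, replace=False)
    # Stage 2: sample windows (start indices) from the chosen series.
    batch = []
    for _ in range(windows_per_step):
        s = rng.choice(chosen)
        start = rng.integers(0, len(series[s]) - window_size + 1)
        batch.append(series[s][start:start + window_size])
    return np.stack(batch)

batch = sample_step()
print(batch.shape)  # one training batch of shape (windows_per_step, window_size)
```

The first stage is what keeps series sampling uniform even when series have very different lengths; the second stage controls how much of each sampled series is actually used per step.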
@Cristian (Nixtla) Thank you! In my specific case the dataset consists of time series with different patterns, and consuming each time series the same number of times seems important: I have found that otherwise the model tends to make predictions using the patterns of the time series it has seen most often during training, even for time series with slightly different patterns. So I convert the number of epochs to steps using something like this:

```python
import numpy as np

steps_in_epoch = int(np.ceil(Y_train_df["unique_id"].nunique() / batch_size))
```
I am not sure the step and epoch approaches would differ here. Our two-stage approach ensures that all time series are sampled equally in the first stage. DL models are also very flexible (with the appropriate hyperparameters), so one model can forecast many different types of patterns, even if some are over-represented in your data. If you are observing that, I suggest you increase the size of the model and train it longer. And use some form of regularization to avoid overfitting!
@Cristian (Nixtla) Actually I'm using all of these: my best model is trained for full epochs (not a number of steps that leaves the last epoch incomplete). It's a TFT model with hidden_size=256 and n_heads=8; I train it for 230 epochs (about 31,000 steps) and I use dropout and Huber loss. It takes about 2.5 hours to train on an NVIDIA V100 GPU.
sounds good!
Interesting discussion. I think the reasoning for this should be part of the docs. (Maybe it is, but then I haven't seen it.)