This message was deleted Nixtla Community #neural-forecast



# neural-forecast

Slackbot

07/24/2023, 10:45 AM

This message was deleted.

👀 2

💡 1

Cristian (Nixtla)

07/24/2023, 3:25 PM

Hi @Manuel! This is a very important question. We decided to define training in steps for several reasons:
• We first saw the step definition in the NBEATS paper, which demonstrated that it is possible to achieve SoTA performance on M4 without exhausting all observations.
• Following that case, we have repeatedly observed and confirmed in many other experiments (for instance, our NHITS paper) that global models do not necessarily need to observe every possible window of each time series to achieve SoTA performance.
• Our sampling is two-stage. In each step, it first samples batch_size series, and then windows_batch_size windows from those time series. If batch_size*max_steps>n_series, then the model will have observed windows from all time series.
• We believe that (for general purposes) it is more flexible to train based on steps. It is much easier to define defaults and control training times based on steps. The length of an epoch is completely determined by the number of time series, so a fixed number of epochs (e.g. 10) can be too little or too much, producing sub-par performance (excessive training time, overfitting, etc.) in many cases.
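To make the two-stage idea concrete, here is a minimal sketch in plain Python. It is not neuralforecast's actual implementation: the function name `two_stage_batches`, the `window_size` argument, and the choice to walk through a reshuffled list of series are assumptions made for illustration. Only `batch_size`, `windows_batch_size`, and `max_steps` come from the discussion above.

```python
import random

def two_stage_batches(series_lengths, batch_size, windows_batch_size,
                      window_size, max_steps, seed=0):
    """Yield (batch_ids, windows) for each training step.

    Stage 1 (assumed): take the next batch_size series from a shuffled
    list, reshuffling once every series has been visited.
    Stage 2: draw up to windows_batch_size (series_id, start) windows
    at random from the series in that batch.
    """
    rng = random.Random(seed)
    n_series = len(series_lengths)
    order = list(range(n_series))
    rng.shuffle(order)
    pos = 0
    for _ in range(max_steps):
        if pos >= n_series:  # every series visited: start a new pass
            rng.shuffle(order)
            pos = 0
        batch_ids = order[pos:pos + batch_size]
        pos += batch_size
        # Every valid window start for every series in the batch.
        candidates = [(i, start)
                      for i in batch_ids
                      for start in range(series_lengths[i] - window_size + 1)]
        k = min(windows_batch_size, len(candidates))
        yield batch_ids, rng.sample(candidates, k)

# 10 series of length 50; batch_size * max_steps = 12 > n_series = 10.
steps = list(two_stage_batches([50] * 10, batch_size=4, windows_batch_size=8,
                               window_size=12, max_steps=3))
series_seen = {i for batch_ids, _ in steps for i in batch_ids}
print(len(series_seen))  # number of distinct series placed in a batch
```

Under these assumptions, every series lands in at least one batch once batch_size*max_steps reaches n_series. Whether a given series also contributes a window in stage 2 is still random, since windows are drawn from the pooled candidates of the batch.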

Manuel

07/24/2023, 3:39 PM

@Cristian (Nixtla) Thank you! In my specific case the dataset consists of time series with different patterns, and consuming every time series the same number of times seems important. I have found that otherwise the model tends to make predictions using the patterns of the time series it has seen the most during training, even for time series with slightly different patterns. So I convert the number of epochs to steps using something like this:

steps_in_epoch = int(np.ceil(Y_train_df["unique_id"].nunique() / batch_size))
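The conversion above can be wrapped in a small helper. This is only a sketch: the name `epochs_to_steps` and the `n_epochs` argument are hypothetical, and it assumes one epoch means one pass over all series in batches of `batch_size`. It accepts any iterable of series ids, so a column such as `Y_train_df["unique_id"]` could be passed directly.

```python
import math

def epochs_to_steps(unique_ids, batch_size, n_epochs):
    """Return the number of training steps that covers n_epochs full epochs.

    Assumes one epoch is one pass over all distinct series,
    batch_size series per step (the last batch may be smaller).
    """
    n_series = len(set(unique_ids))
    steps_per_epoch = math.ceil(n_series / batch_size)
    return n_epochs * steps_per_epoch

# Toy id column with 5 distinct series (illustrative data only).
ids = ["a", "a", "b", "c", "c", "d", "e"]
max_steps = epochs_to_steps(ids, batch_size=2, n_epochs=10)
print(max_steps)  # ceil(5 / 2) = 3 steps per epoch, times 10 epochs = 30
```

Using `math.ceil` returns an integer, which matters if the result is passed on as a step count.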

Cristian (Nixtla)

07/24/2023, 7:53 PM

I am not sure the step and epoch approaches would differ here. Our two-stage approach ensures that all time series are sampled equally in the first stage. DL models are also very flexible (with the appropriate hyperparameters), so one model can forecast many different types of patterns, even if some are overrepresented in your data. If you are observing that, I suggest you increase the size of the model and train it longer. And use some form of regularization to avoid overfitting!

Manuel

07/24/2023, 8:16 PM

@Cristian (Nixtla) Actually I'm using all of these: my best model is trained for full epochs (not a number of steps that leaves the last epoch incomplete). It's a TFT model with hidden_size=256 and n_heads=8, I train it for 230 epochs (about 31000 steps), and I use dropout and Huber loss. It takes about 2.5 hours to train on an NVIDIA V100 GPU.

Cristian (Nixtla)

07/24/2023, 9:36 PM

sounds good!

Stefan Otte

08/01/2023, 6:30 AM

Interesting discussion. I think the reasoning for this should be part of the docs. (Maybe it is, but then I haven't seen it.)

👀 1
