Arthur LAMBERT
05/03/2024, 1:08 PM
Regarding the `cross_validation` method of NeuralForecast: is there a way to parallelize the process across multiple GPUs? For now it works when running on one GPU, but when I pass devices > 1 (4 in my case) when instantiating my TFT model, I get the following error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/jules.bertrand/dp4p-ai--sales-forecasting-ml/src/modelling/training_script.py", line 159, in <module>
[rank1]: cv_df = nf.cross_validation(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/neuralforecast/core.py", line 981, in cross_validation
[rank1]: return self._no_refit_cross_validation(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/neuralforecast/core.py", line 863, in _no_refit_cross_validation
[rank1]: model_fcsts = model.predict(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/neuralforecast/common/_base_windows.py", line 686, in predict
[rank1]: fcsts = trainer.predict(self, datamodule=datamodule)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in predict
[rank1]: return call._call_and_handle_interrupt(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 903, in _predict_impl
[rank1]: results = self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
[rank1]: self.strategy.setup_environment()
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
[rank1]: super().setup_environment()
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 129, in setup_environment
[rank1]: self.accelerator.setup_device(self.root_device)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 119, in root_device
[rank1]: return self.parallel_devices[self.local_rank]
[rank1]: IndexError: list index out of range
[rank: 1] Child process with PID 56954 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
src/modelling/run_modelling.sh: line 14: 55998 Killed python src/modelling/training_script.py --business_unit "$business_unit" --date "$date"
First script failed. Exiting.
Thanks!
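For reference, a minimal sketch of the kind of setup that reproduces this, assuming a long-format dataframe Y_df with unique_id, ds and y columns (the model, hyperparameters and names are illustrative, not the exact script above):

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import TFT

# devices=4 is forwarded to the PyTorch Lightning trainer, so Lightning
# launches one DDP process per GPU. Training completes, but the predict
# step inside cross_validation then hits the IndexError above on ranks > 0.
model = TFT(h=12, input_size=24, max_steps=500, accelerator="gpu", devices=4)
nf = NeuralForecast(models=[model], freq="M")
cv_df = nf.cross_validation(df=Y_df, n_windows=3)
```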
José Morales
05/03/2024, 4:53 PM
`nf.models[0].trainer_kwargs.update(dict(max_steps=0, devices=1))`
and call the `cross_validation` method to get the predictions from the trained model using a single GPU.
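A minimal sketch of this workaround end to end, assuming the models in nf have already been trained (for example by the multi-GPU run above) and that Y_df is the same dataframe as before:

```python
# Disable further training and restrict the trainer to a single device.
nf.models[0].trainer_kwargs.update(dict(max_steps=0, devices=1))

# cross_validation fits internally before predicting; with max_steps=0 that
# fit is a no-op, so the already-trained weights are kept and the predict
# step runs on one GPU, avoiding the DDP root-device IndexError.
cv_df = nf.cross_validation(df=Y_df, n_windows=3)
```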
Arthur LAMBERT
05/04/2024, 6:19 PM
What about when using an auto model (`AutoTFT`)? It would mean having the training (I would also like to refit the model for each validation fold) and the inference at the same time.
José Morales
05/06/2024, 5:13 PM
Have you tried setting `accelerator='gpu'` in your config?
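A sketch of what that could look like with AutoTFT; the search space is illustrative, config keys that are not model hyperparameters (here accelerator and devices) are forwarded to the PyTorch Lightning trainer of every trial, and the refit argument of cross_validation assumes a recent neuralforecast version:

```python
from ray import tune
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTFT

# Illustrative search space: accelerator/devices are passed through each
# trial's model to the Lightning trainer, so tuning, per-fold refits and
# inference all stay on a single GPU.
config = {
    "input_size": tune.choice([24, 48]),
    "learning_rate": tune.loguniform(1e-4, 1e-2),
    "max_steps": 500,
    "accelerator": "gpu",
    "devices": 1,
}

nf = NeuralForecast(models=[AutoTFT(h=12, config=config, num_samples=10)], freq="M")
cv_df = nf.cross_validation(df=Y_df, n_windows=3, refit=True)  # refit on each window
```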
Arthur LAMBERT
05/07/2024, 7:37 PM
`AutoTFT` will do the job. Many thanks for your help José!