Arthur LAMBERT
05/03/2024, 1:08 PM
Regarding the `cross_validation` method of NeuralForecast: is there a way to parallelize the process across multiple GPUs? For now it works when running on one GPU, but when I pass devices > 1 (4 in my case) when instantiating my TFT model, I get the following error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/jules.bertrand/dp4p-ai--sales-forecasting-ml/src/modelling/training_script.py", line 159, in <module>
[rank1]: cv_df = nf.cross_validation(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/neuralforecast/core.py", line 981, in cross_validation
[rank1]: return self._no_refit_cross_validation(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/neuralforecast/core.py", line 863, in _no_refit_cross_validation
[rank1]: model_fcsts = model.predict(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/neuralforecast/common/_base_windows.py", line 686, in predict
[rank1]: fcsts = trainer.predict(self, datamodule=datamodule)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in predict
[rank1]: return call._call_and_handle_interrupt(
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 903, in _predict_impl
[rank1]: results = self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
[rank1]: self.strategy.setup_environment()
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
[rank1]: super().setup_environment()
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 129, in setup_environment
[rank1]: self.accelerator.setup_device(self.root_device)
[rank1]: File "/home/jules.bertrand/miniconda3/envs/adeo-fcst/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 119, in root_device
[rank1]: return self.parallel_devices[self.local_rank]
[rank1]: IndexError: list index out of range
[rank: 1] Child process with PID 56954 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
src/modelling/run_modelling.sh: line 14: 55998 Killed python src/modelling/training_script.py --business_unit "$business_unit" --date "$date"
First script failed. Exiting.
Thanks!
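For reference, a minimal sketch of the kind of setup that reproduces this, assuming a long-format dataframe Y_df with unique_id, ds and y columns (the model, hyperparameters and names are illustrative, not the exact script above):

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import TFT

# devices=4 is forwarded to the PyTorch Lightning trainer, so Lightning
# launches one DDP process per GPU. Training completes, but the predict
# step inside cross_validation then hits the IndexError above on ranks > 0.
model = TFT(h=12, input_size=24, max_steps=500, accelerator="gpu", devices=4)
nf = NeuralForecast(models=[model], freq="M")
cv_df = nf.cross_validation(df=Y_df, n_windows=3)
```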
José Morales
05/03/2024, 4:53 PM
`nf.models[0].trainer_kwargs.update(dict(max_steps=0, devices=1))`
and call the `cross_validation` method to get the predictions from the trained model using a single GPU.
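A minimal sketch of this workaround end to end, assuming the models in nf have already been trained (for example by the multi-GPU run above) and that Y_df is the same dataframe as before:

```python
# Disable further training and restrict the trainer to a single device.
nf.models[0].trainer_kwargs.update(dict(max_steps=0, devices=1))

# cross_validation fits internally before predicting; with max_steps=0 that
# fit is a no-op, so the already-trained weights are kept and the predict
# step runs on one GPU, avoiding the DDP root-device IndexError.
cv_df = nf.cross_validation(df=Y_df, n_windows=3)
```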
Arthur LAMBERT
05/04/2024, 6:19 PM
What about when using an auto model (`AutoTFT`)? It would mean having the training (I would also like to refit the model for each validation fold) and the inference at the same time.
José Morales
05/06/2024, 5:13 PM
Have you tried setting `accelerator='gpu'` in your config?
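A sketch of what that could look like with AutoTFT; the search space is illustrative, config keys that are not model hyperparameters (here accelerator and devices) are forwarded to the PyTorch Lightning trainer of every trial, and the refit argument of cross_validation assumes a recent neuralforecast version:

```python
from ray import tune
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTFT

# Illustrative search space: accelerator/devices are passed through each
# trial's model to the Lightning trainer, so tuning, per-fold refits and
# inference all stay on a single GPU.
config = {
    "input_size": tune.choice([24, 48]),
    "learning_rate": tune.loguniform(1e-4, 1e-2),
    "max_steps": 500,
    "accelerator": "gpu",
    "devices": 1,
}

nf = NeuralForecast(models=[AutoTFT(h=12, config=config, num_samples=10)], freq="M")
cv_df = nf.cross_validation(df=Y_df, n_windows=3, refit=True)  # refit on each window
```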
Arthur LAMBERT
05/07/2024, 7:37 PM
`AutoTFT` will do the job. Many thanks for your help José!