I asked this in the Ray Slack too, but it's more relevant here since I'm using neuralforecast:
Has anyone experienced issues when using multiple GPUs? When I switch from 1 GPU to more than 1, I get a TuneError complaining about previous errors. The only other error I see is a ValueError raised in trial_runner.py at self._on_training_result(trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]). I should add that this is running in a Jupyter notebook. If I run it as a .py script instead, it hangs indefinitely but gives no errors to use for troubleshooting.
We also had some issues when using Tune with multiple GPUs in notebooks. I don't think it supports interactive environments. We fixed some bugs in multi-GPU training, and it should now work when run from a script.
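(A likely reason notebooks break here: multi-GPU launchers start worker processes with the "spawn" start method, which re-imports the main module; a notebook has no importable main script, so the launch fails or hangs. A minimal stdlib-only sketch of the pattern a script needs, with the `__main__` guard that notebooks cannot provide; the worker here is a stand-in, not actual GPU training.)

```python
import multiprocessing as mp

def worker(rank, queue):
    # In real multi-GPU training, each spawned process would drive one GPU.
    queue.put(rank)

def main():
    # "spawn" re-imports the main module in each child process. That is why
    # this code must live in a .py file behind an if __name__ == "__main__"
    # guard: inside a notebook there is no script to re-import, and the
    # launch can hang or error out.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, queue)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(queue.get() for _ in procs)

if __name__ == "__main__":
    print(main())
```

Running this as `python train.py` works; pasting the same body into a notebook cell (without the guard) is the kind of setup that fails.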
02/23/2023, 4:17 PM
@Cristian (Nixtla) a script didn't work for me either. I converted the same notebook to a .py file, and while it didn't give an error, it got stuck indefinitely. I tried multiple EC2 instances with 2 or 4 GPUs, but the result was the same. I'll wait to see if the Ray team has any ideas.