Arthur LAMBERT
05/16/2024, 1:36 PM
stat_exog_list is composed of 6 features, hist_exog_list of 2 features, and futr_exog_list of 3 features. The overall list of features will grow in the future.
I have a training set of roughly two years of data and a validation set of one year of data.
Here is the configuration that I am using at the moment for the HP tuning process:
num_samples: 20
float_hp:
  dropout:
    base_value: 0.1
    lower: 0.1
    upper: 0.3
    step: 0.1
  attn_dropout:
    base_value: 0.1
    lower: 0.1
    upper: 0.3
    step: 0.1
integer_hp:
  input_size:
    base_value: 52
    lower: 26
    upper: 104
    step: 26
  hidden_size:
    base_value: 64
    lower: 64
    upper: 768
    step: 64
categorical_hp:
  scaler_type:
    base_value: robust
    choices:
      - robust
      - standard
  learning_rate:
    base_value: 0.001
    choices:
      - 0.01
      - 0.001
      - 0.0001
      - 0.00001
  n_head:
    base_value: 4
    choices:
      - 2
      - 4
      - 8
epochs: 50
batch_size: 128
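For concreteness, here is a minimal sketch of how this YAML could map to a Ray Tune search space for a NeuralForecast Auto model. AutoTFT is a guess based on the parameter names (hidden_size, n_head, attn_dropout), and the horizon h, the weekly freq, the loss, and the epochs-to-max_steps mapping are placeholders, not part of the original setup:

# Hypothetical mapping of the YAML ranges above to a Ray Tune search space.
# AutoTFT, h, freq and the loss are assumptions, not confirmed details.
from ray import tune
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTFT
from neuralforecast.losses.pytorch import MAE

config = {
    "input_size": tune.choice([26, 52, 78, 104]),          # lower=26, upper=104, step=26
    "hidden_size": tune.choice(list(range(64, 769, 64))),  # lower=64, upper=768, step=64
    "n_head": tune.choice([2, 4, 8]),
    "dropout": tune.quniform(0.1, 0.3, 0.1),
    "attn_dropout": tune.quniform(0.1, 0.3, 0.1),
    "learning_rate": tune.choice([1e-2, 1e-3, 1e-4, 1e-5]),
    "scaler_type": tune.choice(["robust", "standard"]),
    "batch_size": 128,
    "max_steps": 50,  # rough stand-in for "epochs: 50"; the real mapping may differ
    # stat_exog_list / hist_exog_list / futr_exog_list would also be passed here
}

model = AutoTFT(h=52, loss=MAE(), config=config, num_samples=20)
nf = NeuralForecast(models=[model], freq="W")  # weekly freq is a guess given 52-step windows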
The problem is that, for certain HP combinations, I get the following out-of-memory error on the GPU:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 488.00 MiB. GPU 2 has a total capacity of 14.57 GiB of which 422.75 MiB is free. Including non-PyTorch memory, this process has 14.15 GiB memory in use. Of the allocated memory 13.69 GiB is allocated by PyTorch, and 251.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (<https://pytorch.org/docs/stable/notes/cuda.html#environment-variables>)
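As the error message itself suggests, the allocator option can be set before the process touches CUDA; a minimal sketch is below. Note this only mitigates fragmentation and does not reduce the total memory the model actually needs:

# Set the allocator option suggested by the error message.
# It must be set before the first CUDA allocation to take effect.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # import torch (and build the model) only after the env var is set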
I am running the code on a VM with 4 Tesla T4 GPUs, each with ~15 GiB of memory.
The main parameters draining the memory are batch_size, hidden_size, input_size and n_head, if I am not mistaken.
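As a rough back-of-envelope: in attention-based models the attention score tensors scale with batch_size x n_head x input_size^2, while the per-layer activations scale with batch_size x input_size x hidden_size, so input_size tends to hurt quadratically where the others are roughly linear. The toy estimator below only illustrates that scaling; the layer count and constants are guesses, not measurements of the actual model:

# Toy activation-memory estimator (fp32, illustrative only).
# n_layers and the factor of 2 for backward-pass storage are rough guesses.
def rough_activation_gib(batch_size, input_size, hidden_size, n_head,
                         n_layers=4, bytes_per_el=4):
    attn_scores = batch_size * n_head * input_size ** 2      # L x L attention maps
    activations = batch_size * input_size * hidden_size * 4  # a few tensors per layer
    total = n_layers * 2 * (attn_scores + activations) * bytes_per_el
    return total / 1024 ** 3

# Compare two of the configurations mentioned below:
print(rough_activation_gib(128, 104, 320, 2))  # an OOM combination
print(rough_activation_gib(128, 52, 384, 8))   # a combination that fits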
I tried different HP combinations to see when the GPUs run out of memory and when they do not. Here are combinations where the OOM happens:
- batch_size = 128, hidden_size = 320, input_size = 104, n_head = 2
- batch_size = 128, hidden_size = 256, input_size = 104, n_head = 8
And for the following ones, I don’t have any issues:
- batch_size = 128, hidden_size = 384, input_size = 52, n_head = 8 (14.4 GiB per GPU)
- batch_size = 256, hidden_size = 256, input_size = 104, n_head = 2 (14.3 GiB per GPU)
- batch_size = 128, hidden_size = 192, input_size = 104, n_head = 8 (13.2 GiB per GPU)
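One way to turn these observations into systematic data is to log the peak allocated memory after each trial, so the safe region of the search space can be mapped empirically. The sketch below uses standard PyTorch CUDA statistics; where exactly it is called (e.g. right after a fit) depends on the tuning setup and is an assumption:

# Log peak GPU memory after a trial using standard PyTorch CUDA statistics.
import torch

def log_peak_memory(tag: str):
    for device in range(torch.cuda.device_count()):
        peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
        print(f"{tag} | GPU {device}: peak allocated {peak_gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats(device)

# Example usage after fitting one HP combination:
# log_peak_memory("batch=128 hidden=192 input=104 heads=8")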
Considering that the feature list will grow in the future, and given the problem framing, would you have any recommendation on the HP ranges to choose (maybe some of them can be reduced to benefit others)? And do you think the hardware I am using is in the right range? That would be greatly appreciated.
Thank you!

Cristian (Nixtla)
05/19/2024, 4:31 AM

Arthur LAMBERT
05/21/2024, 9:26 AM

Arthur LAMBERT
05/27/2024, 2:31 PM