Arthur LAMBERT
05/16/2024, 1:36 PM
stat_exog_list is composed of 6 features, hist_exog_list of 2 features, and futr_exog_list of 3 features. The overall list of features will grow in the future.
I have a training set of roughly two years of data and a validation set of one year of data.
Here is the configuration that I am using at the moment for the HP tuning process:
num_samples: 20
float_hp:
  dropout:
    base_value: 0.1
    lower: 0.1
    upper: 0.3
    step: 0.1
  attn_dropout:
    base_value: 0.1
    lower: 0.1
    upper: 0.3
    step: 0.1
integer_hp:
  input_size:
    base_value: 52
    lower: 26
    upper: 104
    step: 26
  hidden_size:
    base_value: 64
    lower: 64
    upper: 768
    step: 64
categorical_hp:
  scaler_type:
    base_value: robust
    choices:
      - robust
      - standard
  learning_rate:
    base_value: 0.001
    choices:
      - 0.01
      - 0.001
      - 0.0001
      - 0.00001
  n_head:
    base_value: 4
    choices:
      - 2
      - 4
      - 8
epochs: 50
batch_size: 128
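For concreteness, here is a minimal sketch of how this YAML could map to a Ray Tune search space for a NeuralForecast Auto model. AutoTFT is a guess based on the parameter names (hidden_size, n_head, attn_dropout), and the horizon h, the weekly freq, the loss, and the epochs-to-max_steps mapping are placeholders, not part of the original setup:

# Hypothetical mapping of the YAML ranges above to a Ray Tune search space.
# AutoTFT, h, freq and the loss are assumptions, not confirmed details.
from ray import tune
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTFT
from neuralforecast.losses.pytorch import MAE

config = {
    "input_size": tune.choice([26, 52, 78, 104]),          # lower=26, upper=104, step=26
    "hidden_size": tune.choice(list(range(64, 769, 64))),  # lower=64, upper=768, step=64
    "n_head": tune.choice([2, 4, 8]),
    "dropout": tune.quniform(0.1, 0.3, 0.1),
    "attn_dropout": tune.quniform(0.1, 0.3, 0.1),
    "learning_rate": tune.choice([1e-2, 1e-3, 1e-4, 1e-5]),
    "scaler_type": tune.choice(["robust", "standard"]),
    "batch_size": 128,
    "max_steps": 50,  # rough stand-in for "epochs: 50"; the real mapping may differ
    # stat_exog_list / hist_exog_list / futr_exog_list would also be passed here
}

model = AutoTFT(h=52, loss=MAE(), config=config, num_samples=20)
nf = NeuralForecast(models=[model], freq="W")  # weekly freq is a guess given 52-step windows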
The problem is that, for certain HP combinations, I get the following out-of-memory error on the GPU:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 488.00 MiB. GPU 2 has a total capacity of 14.57 GiB of which 422.75 MiB is free. Including non-PyTorch memory, this process has 14.15 GiB memory in use. Of the allocated memory 13.69 GiB is allocated by PyTorch, and 251.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (<https://pytorch.org/docs/stable/notes/cuda.html#environment-variables>)
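As the error message itself suggests, the allocator option can be set before the process touches CUDA; a minimal sketch is below. Note this only mitigates fragmentation and does not reduce the total memory the model actually needs:

# Set the allocator option suggested by the error message.
# It must be set before the first CUDA allocation to take effect.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # import torch (and build the model) only after the env var is set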
I am running the code on a VM with 4 Tesla T4 GPUs, each with ~15 GiB of memory.
The main parameters draining the memory are batch_size, hidden_size, input_size and n_head, if I am not mistaken.
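As a rough back-of-envelope: in attention-based models the attention score tensors scale with batch_size x n_head x input_size^2, while the per-layer activations scale with batch_size x input_size x hidden_size, so input_size tends to hurt quadratically where the others are roughly linear. The toy estimator below only illustrates that scaling; the layer count and constants are guesses, not measurements of the actual model:

# Toy activation-memory estimator (fp32, illustrative only).
# n_layers and the factor of 2 for backward-pass storage are rough guesses.
def rough_activation_gib(batch_size, input_size, hidden_size, n_head,
                         n_layers=4, bytes_per_el=4):
    attn_scores = batch_size * n_head * input_size ** 2      # L x L attention maps
    activations = batch_size * input_size * hidden_size * 4  # a few tensors per layer
    total = n_layers * 2 * (attn_scores + activations) * bytes_per_el
    return total / 1024 ** 3

# Compare two of the configurations mentioned below:
print(rough_activation_gib(128, 104, 320, 2))  # an OOM combination
print(rough_activation_gib(128, 52, 384, 8))   # a combination that fits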
I tried different HP combinations to see when the GPUs run out of memory and when they do not. Here are combinations where the OOM happens:
- batch_size = 128, hidden_size = 320, input_size = 104, n_head = 2
- batch_size = 128, hidden_size = 256, input_size = 104, n_head = 8
And for the following ones, I don’t have any issues:
- batch_size = 128, hidden_size = 384, input_size = 52, n_head = 8 (14.4 GiB per GPU)
- batch_size = 256, hidden_size = 256, input_size = 104, n_head = 2 (14.3 GiB per GPU)
- batch_size = 128, hidden_size = 192, input_size = 104, n_head = 8 (13.2 GiB per GPU)
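One way to turn these observations into systematic data is to log the peak allocated memory after each trial, so the safe region of the search space can be mapped empirically. The sketch below uses standard PyTorch CUDA statistics; where exactly it is called (e.g. right after a fit) depends on the tuning setup and is an assumption:

# Log peak GPU memory after a trial using standard PyTorch CUDA statistics.
import torch

def log_peak_memory(tag: str):
    for device in range(torch.cuda.device_count()):
        peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
        print(f"{tag} | GPU {device}: peak allocated {peak_gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats(device)

# Example usage after fitting one HP combination:
# log_peak_memory("batch=128 hidden=192 input=104 heads=8")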
Considering that the feature list will grow in the future, and given the problem framing, would you have any recommendation on the HP ranges to choose (maybe some of them can be reduced to benefit others)? And do you think the hardware I am using is in the right range? That would be greatly appreciated.
Thank you!

Cristian (Nixtla)
05/19/2024, 4:31 AM

Arthur LAMBERT
05/21/2024, 9:26 AM

Arthur LAMBERT
05/27/2024, 2:31 PM