# neural-forecast
a
Hi! I am using `AutoTFT` with an `EarlyStopping` to track the validation loss. I am using the `Optuna` backend and training on 4 GPUs. At the end of the HP tuning I'd like to retrieve the optimal set of HP, but also the epoch at which the model stopped training due to the `EarlyStopping`. My idea was to retrieve the evolution of the validation loss at the end of the `AutoTFT` training (with the `valid_trajectories` attribute) to get it. However, when printing it at the end of training, I obtain different values, one for each GPU (cf. screenshot, where `max_steps = 98` and the validation happens every 48 steps, which corresponds to one epoch). I don't really know how to interpret them, nor what the best methodology would be to achieve my goal. Many thanks for your support! PS: Worth mentioning that I set the `random_seed` in my `tft_config`, and that I also set the seed in my `TPESampler` for Optuna.
j
Hey. There are a couple of things that don't work well with distributed training (the train/valid trajectories being one of them). I think in this case it's better to use the outputs from the PyTorch Lightning logger, since that syncs the validation losses before logging them. So, something like this should work:
```python
import uuid

import pandas as pd
import pytorch_lightning as pl
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATSx
from utilsforecast.data import generate_series

series = generate_series(10, min_length=100, max_length=200)
version = uuid.uuid4()
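# validate every `check_steps` training steps; stop after `patience` checks without improvement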
check_steps = 5
patience = 2
nf = NeuralForecast(
    models=[
        NBEATSx(
            input_size=10,
            h=10,
            max_steps=200,
            logger=pl.loggers.CSVLogger(save_dir='.', version=version),
            val_check_steps=check_steps,
            early_stop_patience_steps=patience,
        )],
    freq='D',
)
nf.fit(series, val_size=10)
logs = pd.read_csv(f'lightning_logs/version_{version}/metrics.csv')
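# training stops after `patience` checks without improvement, so the best
# validation happened `check_steps * patience` steps before the last logged step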
best_steps = logs['step'].max() - check_steps * patience + 1
```
a
Seems to work like a charm, thank you José! May I ask what the difference is between `ptl/val_loss` and `valid_loss`, and between `train_loss_epoch` and `train_loss_step`?
j
The valid losses are the same I think; the only difference is that `valid_loss` doesn't get synchronized across devices. The train ones are computed at the end of each step (batch) and at the end of each epoch, respectively; neither of them is synchronized either.
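For illustration, a quick way to look at those columns side by side in the CSV from the snippet above (a sketch, assuming the same `version` and log directory, and that all four columns are present in `metrics.csv`):

```python
import pandas as pd

logs = pd.read_csv(f'lightning_logs/version_{version}/metrics.csv')

# validation metrics: 'ptl/val_loss' is synced across devices, 'valid_loss' is per-device
print(logs[['step', 'ptl/val_loss', 'valid_loss']].dropna(subset=['ptl/val_loss']))

# training loss logged per batch ('train_loss_step') and per epoch ('train_loss_epoch')
print(logs[['step', 'train_loss_step', 'train_loss_epoch']]
      .dropna(how='all', subset=['train_loss_step', 'train_loss_epoch']))
```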
a
Clear, thanks again José!