# neural-forecast
a
Hi! I am using `AutoTFT` with an `EarlyStopping` to track the validation loss. I am using the `Optuna` backend and training on 4 GPUs. At the end of the HP tuning I'd like to retrieve the optimal set of HP, but also the epoch at which the model stopped training due to the `EarlyStopping`. My idea was to retrieve the evolution of the validation loss at the end of the `AutoTFT` training (with the `valid_trajectories` attribute) to get it. However, when printing it at the end of training, I obtain different values, one for each GPU (cf. screenshot, where `max_steps = 98` and the validation happens every 48 steps, which corresponds to one epoch). I don't really know how to interpret them, nor what the best methodology would be to achieve my goal. Many thanks for your support! PS: Worth mentioning that I set the `random_seed` in my `tft_config`, and that I also set the seed in my `TPESampler` for Optuna.
j
Hey. There are a couple of things that don't work well with distributed training (the train/valid trajectories being one of them). I think in this case it's better to use the outputs from the PyTorch Lightning logger, since that syncs the validation losses before logging them. So, something like this should work:
```python
import uuid

import pandas as pd
import pytorch_lightning as pl
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATSx
from utilsforecast.data import generate_series

series = generate_series(10, min_length=100, max_length=200)
version = uuid.uuid4()
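# validate every `check_steps` training steps; stop after `patience` checks without improvement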
check_steps = 5
patience = 2
nf = NeuralForecast(
    models=[
        NBEATSx(
            input_size=10,
            h=10,
            max_steps=200,
            logger=pl.loggers.CSVLogger(save_dir='.', version=version),
            val_check_steps=check_steps,
            early_stop_patience_steps=patience,
        )],
    freq='D',
)
nf.fit(series, val_size=10)
logs = pd.read_csv(f'lightning_logs/version_{version}/metrics.csv')
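# training stops after `patience` checks without improvement, so the best
# validation happened `check_steps * patience` steps before the last logged step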
best_steps = logs['step'].max() - check_steps * patience + 1
```
a
Seems to work like a charm, thank you José! May I ask what the difference is between `ptl/val_loss` and `valid_loss`, and between `train_loss_epoch` and `train_loss_step`?
j
The valid losses are the same I think; the only difference is that `valid_loss` doesn't get synchronized across devices. The train ones are computed at the end of each step (batch) and at the end of each epoch, respectively; neither of them is synchronized either.
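For illustration, a quick way to look at those columns side by side in the CSV from the snippet above (a sketch, assuming the same `version` and log directory, and that all four columns are present in `metrics.csv`):

```python
import pandas as pd

logs = pd.read_csv(f'lightning_logs/version_{version}/metrics.csv')

# validation metrics: 'ptl/val_loss' is synced across devices, 'valid_loss' is per-device
print(logs[['step', 'ptl/val_loss', 'valid_loss']].dropna(subset=['ptl/val_loss']))

# training loss logged per batch ('train_loss_step') and per epoch ('train_loss_epoch')
print(logs[['step', 'train_loss_step', 'train_loss_epoch']]
      .dropna(how='all', subset=['train_loss_step', 'train_loss_epoch']))
```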
a
Clear, thanks again José!