# mlforecast
q
Hi team, I'm using a 32-core Azure compute instance to run LightGBM cross-validation (`cv = LightGBMCV()`), setting `num_threads=32`. After running `cv.fit()` and then retraining the model with the best iteration using `MLForecast.from_cv()`, the final model I get is one that turns out to be trained with just `num_threads=1`:
MLForecast(models=[LGBMRegressor], freq=5min, lag_features=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag12', 'lag288', 'lag576', 'lag864', 'lag1152', 'exponentially_weighted_mean_lag1_alpha0.5', 'rolling_mean_lag1_window_size12', 'rolling_mean_lag1_window_size24', 'rolling_mean_lag1_window_size288', 'rolling_mean_lag1_window_size864', 'rolling_quantile_lag1_p0.5_window_size12', 'rolling_quantile_lag1_p0.5_window_size288', 'rolling_quantile_lag1_p0.5_window_size864', 'rolling_std_lag1_window_size12', 'rolling_std_lag1_window_size288', 'seasonal_rolling_mean_lag1_season_length288_window_size7', 'seasonal_rolling_std_lag1_season_length288_window_size7', 'seasonal_rolling_quantile_lag1_p0.5_season_length288_window_size7', 'seasonal_rolling_min_lag1_season_length288_window_size7', 'seasonal_rolling_max_lag1_season_length288_window_size7', 'rolling_mean_lag12_window_size288', 'rolling_mean_lag24_window_size288', 'rolling_mean_lag288_window_size12', 'rolling_mean_lag288_window_size288', 'rolling_std_lag288_window_size12', 'rolling_mean_lag576_window_size12', 'rolling_std_lag576_window_size12', 'rolling_mean_lag864_window_size12', 'rolling_std_lag864_window_size12'], date_features=[<function localize_and_get_five_min_index at 0x7f82991b0040>, <function localize_hour at 0x7f8256837a30>, <function localize_and_identify_weekend at 0x7f82569bc310>, <function localize_dayofweek at 0x7f82569bcaf0>], num_threads=1)
I'm training on a very large dataset, and I notice that it takes a very long time to complete.
Separately, I know that we can train a larger model and use that pre-trained model to predict on an entirely different series (transfer learning): https://nixtlaverse.nixtla.io/mlforecast/docs/how-to-guides/transfer_learning.html If the trained model contains exogenous features that are based on the `unique_id` from the (original) training data (e.g. the mean of `y` grouped by time of day and `unique_id`), will we also need to include the same exogenous features in `new_df` to forecast on different series, or is it able to handle a missing column?
j
Hey. For `LightGBMCV` the `num_threads` argument is used to train each of the models in parallel, so I recommend setting it to `n_windows` if possible. You need to provide the same features during predict, otherwise you'll get an error. You can also provide them filled with NaNs, but the predictions will most likely be bad, especially if the features were useful during training.
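A rough sketch of that first suggestion; the `n_windows=4` value, the short lag list, and the LightGBM-level `num_threads` entry in `params` are illustrative assumptions, not values taken from this thread:

```python
# Sketch, assuming 4 CV windows: LightGBMCV's own num_threads parallelizes
# training across the CV windows, while the threads used by each LightGBM
# model are a regular LightGBM parameter passed through `params` in fit().
from mlforecast.lgb_cv import LightGBMCV

cv = LightGBMCV(freq='5min', lags=[1, 2, 3], num_threads=4)  # matches n_windows below
cv_hist = cv.fit(
    transformed_df,              # same training frame as in this thread
    n_windows=4,
    h=288,                       # one day of 5-minute steps
    params={'num_threads': 32},  # per-model LightGBM threads
)
```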
q
Really appreciate the response, @José Morales! You've been so helpful 🙂
Hi @José Morales Just a quick follow-up on another issue that I've been stuck on. When I pass `new_df` into the `new_df` argument, I get an error which says that some of the features are missing in `new_df`, but they're all there...
j
Hey. The `new_df` only needs to have the target for the "training" period, and the exogenous features in the forecast horizon must be provided through `X_df`.
q
Thanks for the quick response. I'm still getting errors, but maybe it is because I have still misunderstood. Suppose I want to perform in-sample forecasting, i.e., forecasting over a period of time that was also seen in the training data. Specifically, I want to forecast for just one of the `unique_id`s (which was also included in the training data), and I only want to forecast one day ahead, from July 25th to July 26th:
start_datetime = '2021-07-25 00:05:00'
end_datetime = '2021-07-26 00:00:00'
The data is at a 5-minute frequency. The model was trained with many lags and lag transformations of `y`. The training data `transformed_df` has exogenous features like the weather forecast and the trend and seasonal components from an `MSTL` decomposition. `LightGBMCV()` was run first, before `.fit()`:
# Imports added for completeness; the localize_* date features, args,
# lgb_params and transformed_df are defined elsewhere in the original script.
from mlforecast import MLForecast
from mlforecast.lgb_cv import LightGBMCV
from mlforecast.lag_transforms import (
    ExponentiallyWeightedMean,
    RollingMean,
    RollingQuantile,
    RollingStd,
    SeasonalRollingMax,
    SeasonalRollingMean,
    SeasonalRollingMin,
    SeasonalRollingQuantile,
    SeasonalRollingStd,
)

cv = LightGBMCV(
    freq='5min',
    lags=[1, 2, 3, 4, 5, 6, 12, 288, 576, 864, 1152],
    lag_transforms={
        1: [
            ExponentiallyWeightedMean(alpha=0.5),
            RollingMean(window_size=12),
            RollingMean(window_size=24),
            RollingMean(window_size=288),
            RollingMean(window_size=864),
            RollingQuantile(window_size=12, p=0.5),
            RollingQuantile(window_size=288, p=0.5),
            RollingQuantile(window_size=864, p=0.5),
            RollingStd(window_size=12),
            RollingStd(window_size=288),
            SeasonalRollingMean(season_length=288, window_size=7),
            SeasonalRollingStd(season_length=288, window_size=7),
            SeasonalRollingQuantile(p=0.5, season_length=288, window_size=7),
            SeasonalRollingMin(season_length=288, window_size=7),
            SeasonalRollingMax(season_length=288, window_size=7),
        ],
        12: [RollingMean(window_size=288)],
        24: [RollingMean(window_size=288)],
        288: [
            RollingMean(window_size=12),
            RollingMean(window_size=288),
            RollingStd(window_size=12),
        ],
        576: [
            RollingMean(window_size=12),
            RollingStd(window_size=12),
        ],
        864: [
            RollingMean(window_size=12),
            RollingStd(window_size=12),
        ],
    },
    date_features=[localize_and_get_five_min_index, localize_and_identify_weekend],
    num_threads=32,
)

cv_hist = cv.fit(
    transformed_df,
    n_windows=args.n_windows,
    h=args.h,
    step_size=args.step_size,
    params=lgb_params,
    num_iterations=args.num_iterations,
    eval_every=args.eval_every,
    early_stopping_evals=args.early_stopping_evals,
    early_stopping_pct=args.early_stopping_pct,
    compute_cv_preds=args.compute_cv_preds,
    metric=args.metric,
    static_features=[],
    dropna=True,
)

final_fcst = MLForecast.from_cv(cv)
assert cv.best_iteration_ == final_fcst.models['LGBMRegressor'].n_estimators
cv.best_iteration_  # number of iterations selected by early stopping
final_fcst.fit(transformed_df, static_features=[], as_numpy=True)
I've saved the model and loaded it back up as `load_model`. Going back to the forecast: what should go into `load_model.predict()` if I would like to forecast one day ahead for just one `unique_id`, '98-1', over the period below:
start_datetime = '2021-07-25 00:05:00'
end_datetime = '2021-07-26 00:00:00'
I just want to be sure that I have a correct understanding of how to use `predict()`, namely its arguments like `new_df` and `X_df`. Thank you! :)
j
If those dates are within `transformed_df`, I believe you can do:
id_mask = transformed_df['unique_id'] == '98-1'
insample_mask = transformed_df['ds'] < '2021-07-25 00:05:00'
new_df = transformed_df.loc[id_mask & insample_mask, ['unique_id', 'ds', 'y']]
X_df = transformed_df[id_mask & ~insample_mask].drop(columns='y')
final_fcst.predict(h=12 * 24, new_df=new_df, X_df=X_df)
So `new_df` must have the previous target values to compute the features, and `X_df` should have the future exogenous features.
q
Thanks for replying so quickly once again! This is almost exactly what I tried earlier. I used a `new_df` that contained just unique_id, ds, and y before '2021-07-25 00:05:00' and an `X_df` that contains unique_id, ds, and the exogenous features from '2021-07-25 00:05:00' onwards. Using your example (as well as what I tried with my similar example), I got this error:
ValueError: Number of features of the model must match the input. Model n_features_ is 117 and input n_features is 37
Some additional information:
• `transformed_df` is the training data and it contains 83 columns
• Running `final_fcst.preprocess(new_df)` gives me a DataFrame with 40 columns
I'm still not sure where to go from here. Is there any additional information that I can provide so that we can solve this? Thanks again! You've always been very helpful!
When it says "input n_features is 37", that tells me that `predict()` has recognised `new_df` and has correctly computed the lags of y and the lag transformations (ignoring unique_id, ds, and y, that gives 37 columns). The exogenous features don't seem to be picked up through `X_df`...
j
Sorry, we retrieve the feature names from `new_df`, so that one should have all the features as well, i.e. it should be like this:
new_df = transformed_df[id_mask & insample_mask]
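Putting that correction together with the earlier snippet, the working call looks roughly like this (same variables as above; only `new_df` changes):

```python
id_mask = transformed_df['unique_id'] == '98-1'
insample_mask = transformed_df['ds'] < '2021-07-25 00:05:00'
new_df = transformed_df[id_mask & insample_mask]                    # full history, all columns
X_df = transformed_df[id_mask & ~insample_mask].drop(columns='y')   # future exogenous only
final_fcst.predict(h=12 * 24, new_df=new_df, X_df=X_df)             # 288 five-minute steps = 1 day
```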
q
Thanks, Jose! Got it working now :)
👍 1
@José Morales Just wanted to follow up with you on another matter. For each training fold used during cross-validation, is it possible to perform the decomposition each time and extract the seasonal and trend components for that respective training fold, then use them as features? So it would be like a dynamic feature generator: when I move to another training fold, I recompute the seasonal and trend features. I don't want the seasonal and trend components used in training to come from future data points. Thanks!
j
Sure. The easiest way is to wrap the logic to do that inside a scikit-learn estimator and provide a pipeline as the model, as in this guide. That way you'll know for sure that the features are being computed on the training set only. Note that you'll have to forecast those components in some way for the predict step and apply them in the `transform` method of the estimator.
👍 1
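A minimal sketch of that pattern, assuming only numpy, scikit-learn and LightGBM; the class name and the toy statistic are illustrative, and fitting (and forecasting) the real trend/seasonal components is the part you still have to fill in:

```python
import numpy as np
import lightgbm as lgb
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class FoldFeatureMaker(BaseEstimator, TransformerMixin):
    """Learns its statistics in fit() (training fold only) and reuses them
    in transform(), so nothing from future windows leaks into the features."""

    def fit(self, X, y=None):
        # In the real use case this is where you would fit the decomposition
        # (e.g. MSTL) on the fold's history instead of a simple column mean.
        self.col_means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Append the fold-specific features; at predict time these have to be
        # values you can produce without looking at the future target.
        return np.hstack([X, X - self.col_means_])


# The pipeline can then be passed wherever a model is expected in the thread
# above, e.g. MLForecast(models=[model], freq='5min', ...)
model = make_pipeline(FoldFeatureMaker(), lgb.LGBMRegressor())
```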