# mlforecast
q
Hi team, I'm using a 32-core Azure compute instance to run LightGBM cross-validation (`cv = LightGBMCV()`), setting `num_threads=32`. After running `cv.fit()` and then retraining the model with the best iteration using `MLForecast.from_cv()`, the final model I get is one that turns out to be trained with just `num_threads=1`:
MLForecast(models=[LGBMRegressor], freq=5min, lag_features=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag12', 'lag288', 'lag576', 'lag864', 'lag1152', 'exponentially_weighted_mean_lag1_alpha0.5', 'rolling_mean_lag1_window_size12', 'rolling_mean_lag1_window_size24', 'rolling_mean_lag1_window_size288', 'rolling_mean_lag1_window_size864', 'rolling_quantile_lag1_p0.5_window_size12', 'rolling_quantile_lag1_p0.5_window_size288', 'rolling_quantile_lag1_p0.5_window_size864', 'rolling_std_lag1_window_size12', 'rolling_std_lag1_window_size288', 'seasonal_rolling_mean_lag1_season_length288_window_size7', 'seasonal_rolling_std_lag1_season_length288_window_size7', 'seasonal_rolling_quantile_lag1_p0.5_season_length288_window_size7', 'seasonal_rolling_min_lag1_season_length288_window_size7', 'seasonal_rolling_max_lag1_season_length288_window_size7', 'rolling_mean_lag12_window_size288', 'rolling_mean_lag24_window_size288', 'rolling_mean_lag288_window_size12', 'rolling_mean_lag288_window_size288', 'rolling_std_lag288_window_size12', 'rolling_mean_lag576_window_size12', 'rolling_std_lag576_window_size12', 'rolling_mean_lag864_window_size12', 'rolling_std_lag864_window_size12'], date_features=[<function localize_and_get_five_min_index at 0x7f82991b0040>, <function localize_hour at 0x7f8256837a30>, <function localize_and_identify_weekend at 0x7f82569bc310>, <function localize_dayofweek at 0x7f82569bcaf0>], num_threads=1)
I'm training on a very large dataset, and I notice that it takes a very long time to complete.
Separately, I know that we can train a larger model and use that pre-trained model to predict on an entirely different series (transfer learning): https://nixtlaverse.nixtla.io/mlforecast/docs/how-to-guides/transfer_learning.html If the trained model contains exogenous features that are based on the `unique_id` from the (original) training data (e.g. the mean of `y` grouped by time of day and `unique_id`), will we also need to include the same exogenous features in `new_df` to forecast on different series, or is it able to handle a missing column?
j
Hey. For `LightGBMCV` the `num_threads` argument is used to train each of the models in parallel, so I recommend setting it to `n_windows` if possible. You need to provide the same features during predict, otherwise you'll get an error. You can also provide them filled with NaNs, but the predictions will most likely be bad, especially if the features were useful during training.
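A rough sketch of that first suggestion; the `n_windows=4` value, the short lag list, and the LightGBM-level `num_threads` entry in `params` are illustrative assumptions, not values taken from this thread:

```python
# Sketch, assuming 4 CV windows: LightGBMCV's own num_threads parallelizes
# training across the CV windows, while the threads used by each LightGBM
# model are a regular LightGBM parameter passed through `params` in fit().
from mlforecast.lgb_cv import LightGBMCV

cv = LightGBMCV(freq='5min', lags=[1, 2, 3], num_threads=4)  # matches n_windows below
cv_hist = cv.fit(
    transformed_df,              # same training frame as in this thread
    n_windows=4,
    h=288,                       # one day of 5-minute steps
    params={'num_threads': 32},  # per-model LightGBM threads
)
```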
q
Really appreciate the response, @José Morales! You've been so helpful 🙂
Hi @José Morales Just a quick follow-up on another issue that I've been stuck on. When I pass `new_df` into the `new_df` argument, I get an error which says that some of the features are missing in `new_df`, but they're all there...
j
Hey. The `new_df` only needs to have the target for the "training" period, and the exogenous features in the forecast horizon must be provided through `X_df`.
q
Thanks for the quick response. I'm still getting errors, but maybe it is because I have still misunderstood. Suppose I want to perform in-sample forecasting, i.e., forecasting over a period of time that was also seen in the training data. Specifically, I want to forecast for just one of the `unique_id`s (which was also included in the training data), and I only want to forecast one day ahead, from July 25th to July 26th:
start_datetime = '2021-07-25 00:05:00'
end_datetime = '2021-07-26 00:00:00'
The data is at a 5-minute frequency. The model was trained with many lags and lag transformations of `y`. The training data `transformed_df` has exogenous features like the weather forecast and the trend and seasonal components from an `MSTL` decomposition. `LightGBMCV()` was run first, before `.fit()`:
# Imports added for completeness; the localize_* date features, args,
# lgb_params and transformed_df are defined elsewhere in the original script.
from mlforecast import MLForecast
from mlforecast.lgb_cv import LightGBMCV
from mlforecast.lag_transforms import (
    ExponentiallyWeightedMean,
    RollingMean,
    RollingQuantile,
    RollingStd,
    SeasonalRollingMax,
    SeasonalRollingMean,
    SeasonalRollingMin,
    SeasonalRollingQuantile,
    SeasonalRollingStd,
)

cv = LightGBMCV(
    freq='5min',
    lags=[1, 2, 3, 4, 5, 6, 12, 288, 576, 864, 1152],
    lag_transforms={
        1: [
            ExponentiallyWeightedMean(alpha=0.5),
            RollingMean(window_size=12),
            RollingMean(window_size=24),
            RollingMean(window_size=288),
            RollingMean(window_size=864),
            RollingQuantile(window_size=12, p=0.5),
            RollingQuantile(window_size=288, p=0.5),
            RollingQuantile(window_size=864, p=0.5),
            RollingStd(window_size=12),
            RollingStd(window_size=288),
            SeasonalRollingMean(season_length=288, window_size=7),
            SeasonalRollingStd(season_length=288, window_size=7),
            SeasonalRollingQuantile(p=0.5, season_length=288, window_size=7),
            SeasonalRollingMin(season_length=288, window_size=7),
            SeasonalRollingMax(season_length=288, window_size=7),
        ],
        12: [RollingMean(window_size=288)],
        24: [RollingMean(window_size=288)],
        288: [
            RollingMean(window_size=12),
            RollingMean(window_size=288),
            RollingStd(window_size=12),
        ],
        576: [
            RollingMean(window_size=12),
            RollingStd(window_size=12),
        ],
        864: [
            RollingMean(window_size=12),
            RollingStd(window_size=12),
        ],
    },
    date_features=[localize_and_get_five_min_index, localize_and_identify_weekend],
    num_threads=32,
)

cv_hist = cv.fit(
    transformed_df,
    n_windows=args.n_windows,
    h=args.h,
    step_size=args.step_size,
    params=lgb_params,
    num_iterations=args.num_iterations,
    eval_every=args.eval_every,
    early_stopping_evals=args.early_stopping_evals,
    early_stopping_pct=args.early_stopping_pct,
    compute_cv_preds=args.compute_cv_preds,
    metric=args.metric,
    static_features=[],
    dropna=True,
)

final_fcst = MLForecast.from_cv(cv)
assert cv.best_iteration_ == final_fcst.models['LGBMRegressor'].n_estimators
cv.best_iteration_  # number of iterations selected by early stopping
final_fcst.fit(transformed_df, static_features=[], as_numpy=True)
I've saved the model and loaded it back up as `load_model`. Going back to the forecast: what should go into `load_model.predict()` if I would like to forecast one day ahead for just one `unique_id`, '98-1', over the period below:
start_datetime = '2021-07-25 00:05:00'
end_datetime = '2021-07-26 00:00:00'
I just want to be sure that I have a correct understanding of how to use `predict()`, namely its arguments like `new_df` and `X_df`. Thank you! :)
j
If those dates are within `transformed_df`, I believe you can do:
id_mask = transformed_df['unique_id'] == '98-1'
insample_mask = transformed_df['ds'] < '2021-07-25 00:05:00'
new_df = transformed_df.loc[id_mask & insample_mask, ['unique_id', 'ds', 'y']]
X_df = transformed_df[id_mask & ~insample_mask].drop(columns='y')
final_fcst.predict(h=12 * 24, new_df=new_df, X_df=X_df)
So `new_df` must have the previous target values to compute the features, and `X_df` should have the future exogenous features.
q
Thanks for replying so quickly once again! This is almost exactly what I tried earlier. I used a `new_df` that contained just unique_id, ds, and y before '2021-07-25 00:05:00' and an `X_df` that contains unique_id, ds, and the exogenous features from '2021-07-25 00:05:00' onwards. Using your example (as well as what I tried with my similar example), I got this error:
ValueError: Number of features of the model must match the input. Model n_features_ is 117 and input n_features is 37
Some additional information:
• `transformed_df` is the training data and it contains 83 columns
• Running `final_fcst.preprocess(new_df)` gives me a DataFrame with 40 columns
I'm still not sure where to go from here. Is there any additional information that I can provide so that we can solve this? Thanks again! You've always been very helpful!
When it says "input n_features is 37", that tells me that `predict()` has recognised `new_df` and has correctly computed the lags of y and the lag transformations (ignoring unique_id, ds, and y, that gives 37 columns). The exogenous features don't seem to be picked up through `X_df`...
j
Sorry, we retrieve the feature names from `new_df`, so that one should have all the features as well, i.e. it should be like this:
new_df = transformed_df[id_mask & insample_mask]
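Putting that correction together with the earlier snippet, the working call looks roughly like this (same variables as above; only `new_df` changes):

```python
id_mask = transformed_df['unique_id'] == '98-1'
insample_mask = transformed_df['ds'] < '2021-07-25 00:05:00'
new_df = transformed_df[id_mask & insample_mask]                    # full history, all columns
X_df = transformed_df[id_mask & ~insample_mask].drop(columns='y')   # future exogenous only
final_fcst.predict(h=12 * 24, new_df=new_df, X_df=X_df)             # 288 five-minute steps = 1 day
```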
q
Thanks, Jose! Got it working now :)
👍 1
@José Morales Just wanted to follow up with you on another matter. For each training fold used during cross-validation, is it possible to perform the decomposition each time and extract the seasonal and trend components for that respective training fold, then use them as features? So it would be like a dynamic feature generator: when I move to another training fold, I recompute the seasonal and trend features. I don't want the seasonal and trend components used in training to come from future data points. Thanks!
j
Sure. The easiest way is to wrap the logic to do that inside a scikit-learn estimator and provide a pipeline as the model, as in this guide. That way you'll know for sure that the features are being computed on the training set only. Note that you'll have to forecast those components in some way for the predict step and apply them in the `transform` method of the estimator.
👍 1
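A minimal sketch of that pattern, assuming only numpy, scikit-learn and LightGBM; the class name and the toy statistic are illustrative, and fitting (and forecasting) the real trend/seasonal components is the part you still have to fill in:

```python
import numpy as np
import lightgbm as lgb
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class FoldFeatureMaker(BaseEstimator, TransformerMixin):
    """Learns its statistics in fit() (training fold only) and reuses them
    in transform(), so nothing from future windows leaks into the features."""

    def fit(self, X, y=None):
        # In the real use case this is where you would fit the decomposition
        # (e.g. MSTL) on the fold's history instead of a simple column mean.
        self.col_means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Append the fold-specific features; at predict time these have to be
        # values you can produce without looking at the future target.
        return np.hstack([X, X - self.col_means_])


# The pipeline can then be passed wherever a model is expected in the thread
# above, e.g. MLForecast(models=[model], freq='5min', ...)
model = make_pipeline(FoldFeatureMaker(), lgb.LGBMRegressor())
```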