Quang Bui
06/07/2024, 2:22 AM
num_threads=32. After running cv.fit(), and then retraining the model with the best iteration using MLForecast.from_cv(), the final model I get is one that turns out to be trained with just num_threads=1.
MLForecast(models=[LGBMRegressor], freq=5min, lag_features=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag12', 'lag288', 'lag576', 'lag864', 'lag1152', 'exponentially_weighted_mean_lag1_alpha0.5', 'rolling_mean_lag1_window_size12', 'rolling_mean_lag1_window_size24', 'rolling_mean_lag1_window_size288', 'rolling_mean_lag1_window_size864', 'rolling_quantile_lag1_p0.5_window_size12', 'rolling_quantile_lag1_p0.5_window_size288', 'rolling_quantile_lag1_p0.5_window_size864', 'rolling_std_lag1_window_size12', 'rolling_std_lag1_window_size288', 'seasonal_rolling_mean_lag1_season_length288_window_size7', 'seasonal_rolling_std_lag1_season_length288_window_size7', 'seasonal_rolling_quantile_lag1_p0.5_season_length288_window_size7', 'seasonal_rolling_min_lag1_season_length288_window_size7', 'seasonal_rolling_max_lag1_season_length288_window_size7', 'rolling_mean_lag12_window_size288', 'rolling_mean_lag24_window_size288', 'rolling_mean_lag288_window_size12', 'rolling_mean_lag288_window_size288', 'rolling_std_lag288_window_size12', 'rolling_mean_lag576_window_size12', 'rolling_std_lag576_window_size12', 'rolling_mean_lag864_window_size12', 'rolling_std_lag864_window_size12'], date_features=[<function localize_and_get_five_min_index at 0x7f82991b0040>, <function localize_hour at 0x7f8256837a30>, <function localize_and_identify_weekend at 0x7f82569bc310>, <function localize_dayofweek at 0x7f82569bcaf0>], num_threads=1)
I'm training on a very large dataset, and I notice that it takes a very long time to complete.
Quang Bui
06/08/2024, 1:55 AM
unique_id from the (original) training data (e.g. mean of y grouped by time of the day and unique_id), will we also need to include the same exogenous feature in new_df to forecast on different series, or is it able to handle a missing column?
José Morales
06/08/2024, 6:24 AM
Quang Bui
06/08/2024, 6:42 AM
Quang Bui
06/10/2024, 4:10 AM
new_df into the new_df argument, I get the following error, which says that some of the features are missing in new_df, but they're all there...
José Morales
06/10/2024, 4:44 PM
new_df only needs to have the target for the "training" period, and the exogenous features in the forecast horizon must be provided through X_df.
Quang Bui
06/10/2024, 10:25 PM
unique_id (which was also included in the training data) and I only want to forecast 1-day ahead from July 25th to July 26th:
start_datetime = '2021-07-25 00:05:00'
end_datetime = '2021-07-26 00:00:00'
The data is 5-minutely. The model was trained with many lags and lag transformations of y. The training data transformed_df has exogenous features like the weather forecast and the trend and seasonal components from an MSTL decomposition. LightGBMCV() was run first, before .fit():
cv = LightGBMCV(
    freq='5min',
    lags=[1, 2, 3, 4, 5, 6, 12, 288, 576, 864, 1152],
    lag_transforms={
        1: [
            ExponentiallyWeightedMean(alpha=0.5),
            RollingMean(window_size=12),
            RollingMean(window_size=24),
            RollingMean(window_size=288),
            RollingMean(window_size=864),
            RollingQuantile(window_size=12, p=0.5),
            RollingQuantile(window_size=288, p=0.5),
            RollingQuantile(window_size=864, p=0.5),
            RollingStd(window_size=12),
            RollingStd(window_size=288),
            SeasonalRollingMean(season_length=288, window_size=7),
            SeasonalRollingStd(season_length=288, window_size=7),
            SeasonalRollingQuantile(p=0.5, season_length=288, window_size=7),
            SeasonalRollingMin(season_length=288, window_size=7),
            SeasonalRollingMax(season_length=288, window_size=7),
        ],
        12: [RollingMean(window_size=288)],
        24: [RollingMean(window_size=288)],
        288: [
            RollingMean(window_size=12),
            RollingMean(window_size=288),
            RollingStd(window_size=12),
        ],
        576: [
            RollingMean(window_size=12),
            RollingStd(window_size=12),
        ],
        864: [
            RollingMean(window_size=12),
            RollingStd(window_size=12),
        ],
    },
    date_features=[localize_and_get_five_min_index, localize_and_identify_weekend],
    num_threads=32,
)
cv_hist = cv.fit(
    transformed_df,
    n_windows=args.n_windows,
    h=args.h,
    step_size=args.step_size,
    params=lgb_params,
    num_iterations=args.num_iterations,
    eval_every=args.eval_every,
    early_stopping_evals=args.early_stopping_evals,
    early_stopping_pct=args.early_stopping_pct,
    compute_cv_preds=args.compute_cv_preds,
    metric=args.metric,
    static_features=[],
    dropna=True,
)
final_fcst = MLForecast.from_cv(cv)
assert cv.best_iteration_ == final_fcst.models['LGBMRegressor'].n_estimators
cv.best_iteration_
final_fcst.fit(transformed_df, static_features=[], as_numpy=True)
I've saved the model and loaded it up as load_model.
Going back to the forecast, what should go into load_model.predict() if I would like to forecast 1-day ahead for just one unique_id, '98-1', over the datetime range below:
start_datetime = '2021-07-25 00:05:00'
end_datetime = '2021-07-26 00:00:00'
I just want to be sure that I have a correct understanding of how to use predict(), namely its arguments like new_df and X_df. Thank you! :)
José Morales
06/10/2024, 10:57 PM
transformed_df I believe you can do:
id_mask = transformed_df['unique_id'] == '98-1'
insample_mask = transformed_df['ds'] < '2021-07-25 00:05:00'
new_df = transformed_df.loc[id_mask & insample_mask, ['unique_id', 'ds', 'y']]
X_df = transformed_df[id_mask & ~insample_mask].drop(columns='y')
final_fcst.predict(h=12 * 24, new_df=new_df, X_df=X_df)
José Morales
06/10/2024, 10:58 PM
Quang Bui
06/11/2024, 1:39 AM
new_df that contained just unique_id, ds, and y before '2021-07-25 00:05:00' and an X_df that contains unique_id, ds and exogenous features from '2021-07-25 00:05:00' onwards. Using your example (as well as what I tried with my similar example), I got this error:
ValueError: Number of features of the model must match the input. Model n_features_ is 117 and input n_features is 37
Some additional information:
• transformed_df is the training data and it contains 83 columns
• Running final_fcst.preprocess(new_df) gives me a DataFrame with 40 columns
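For reference, the numbers in the error and in the two points above are consistent with a simple count. The sketch below only does the arithmetic: the lag, lag-transform and date-feature counts are read off the LightGBMCV configuration earlier in the thread, and it assumes (this is not confirmed anywhere in the thread) that every column of transformed_df other than unique_id, ds and y was used as an exogenous feature, which is plausible since static_features=[] was passed at fit time.

```python
# Counts read off the LightGBMCV configuration in this thread.
n_lags = 11  # lags=[1, 2, 3, 4, 5, 6, 12, 288, 576, 864, 1152]
n_lag_transforms = {1: 15, 12: 1, 24: 1, 288: 3, 576: 2, 864: 2}
n_date_features = 2  # localize_and_get_five_min_index, localize_and_identify_weekend

# Features that can be computed from y and ds alone.
computed = n_lags + sum(n_lag_transforms.values()) + n_date_features

id_time_target = 3  # unique_id, ds, y
# Assumption: all remaining columns of transformed_df are exogenous features.
n_exogenous = 83 - id_time_target

print(computed)                   # 37: "input n_features" in the error
print(computed + id_time_target)  # 40: columns returned by preprocess(new_df)
print(computed + n_exogenous)     # 117: "Model n_features_" in the error
```

Under that assumption, the gap between 117 and 37 is exactly the 80 exogenous columns, i.e. none of them reached the model at predict time.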
I'm still not sure where to go from here. Is there any additional information that I can provide you so that we can solve this?
Thanks again! You've always been very helpful!
When it says "input n_features is 37", that tells me that predict() has recognised new_df, and has correctly computed the lags of y and transformation of lags of y (ignoring unique_id, ds, and y, you have 37 columns). The exogenous features aren't recognised through X_df, it seems...
José Morales
06/11/2024, 4:12 PM
new_df, so that one should have all features as well, i.e. it should be like this:
new_df = transformed_df[id_mask & insample_mask]
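Putting the correction together with the earlier snippet, the split into history and horizon might look like the sketch below. It uses a toy stand-in for transformed_df with a single exogenous column; the column name temp_forecast and the toy values are hypothetical, and the sketch stops short of calling predict() so that it runs with pandas alone.

```python
import pandas as pd

# Toy stand-in for transformed_df: one series of 5-minute data.
# 'temp_forecast' is a hypothetical exogenous column, for illustration only.
ds = pd.date_range("2021-07-24 00:05:00", "2021-07-26 00:00:00", freq="5min")
transformed_df = pd.DataFrame({
    "unique_id": "98-1",
    "ds": ds,
    "y": range(len(ds)),
    "temp_forecast": [20.0] * len(ds),
})

id_mask = transformed_df["unique_id"] == "98-1"
insample_mask = transformed_df["ds"] < "2021-07-25 00:05:00"

# History: the target plus every exogenous column, as in the corrected line above.
new_df = transformed_df[id_mask & insample_mask]
# Horizon: ids, timestamps and exogenous columns only (no target).
X_df = transformed_df[id_mask & ~insample_mask].drop(columns="y")

h = 12 * 24  # 288 five-minute steps, i.e. one day ahead
print(len(new_df), len(X_df), h)
```

With the real data, these two frames would then be passed as final_fcst.predict(h=12 * 24, new_df=new_df, X_df=X_df), as in the earlier message.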
Quang Bui
06/12/2024, 6:31 AM
Quang Bui
06/18/2024, 6:00 AM
José Morales
06/18/2024, 4:45 PM
transform method of the estimator