Naren Castellon
06/02/2024, 3:56 AM

jan rathfelder
06/03/2024, 8:55 PM

Truong Hoang
06/04/2024, 1:05 AM

Quang Bui
06/07/2024, 2:22 AM
06/07/2024, 2:22 AMnum_threads=32
. After running cv.fit()
, and then retraining the model with the best iteration using MLForecast.from_cv()
, the final model I get is on that turns out to be trained with just num_threads=1
.
```
MLForecast(models=[LGBMRegressor], freq=5min, lag_features=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag12', 'lag288', 'lag576', 'lag864', 'lag1152', 'exponentially_weighted_mean_lag1_alpha0.5', 'rolling_mean_lag1_window_size12', 'rolling_mean_lag1_window_size24', 'rolling_mean_lag1_window_size288', 'rolling_mean_lag1_window_size864', 'rolling_quantile_lag1_p0.5_window_size12', 'rolling_quantile_lag1_p0.5_window_size288', 'rolling_quantile_lag1_p0.5_window_size864', 'rolling_std_lag1_window_size12', 'rolling_std_lag1_window_size288', 'seasonal_rolling_mean_lag1_season_length288_window_size7', 'seasonal_rolling_std_lag1_season_length288_window_size7', 'seasonal_rolling_quantile_lag1_p0.5_season_length288_window_size7', 'seasonal_rolling_min_lag1_season_length288_window_size7', 'seasonal_rolling_max_lag1_season_length288_window_size7', 'rolling_mean_lag12_window_size288', 'rolling_mean_lag24_window_size288', 'rolling_mean_lag288_window_size12', 'rolling_mean_lag288_window_size288', 'rolling_std_lag288_window_size12', 'rolling_mean_lag576_window_size12', 'rolling_std_lag576_window_size12', 'rolling_mean_lag864_window_size12', 'rolling_std_lag864_window_size12'], date_features=[<function localize_and_get_five_min_index at 0x7f82991b0040>, <function localize_hour at 0x7f8256837a30>, <function localize_and_identify_weekend at 0x7f82569bc310>, <function localize_dayofweek at 0x7f82569bcaf0>], num_threads=1)
```
I'm training on a very large dataset, and I notice that it takes a very long time to complete.

Johannes Emme
06/07/2024, 7:59 AM
Calling `forecast_fitted_values()` fails when `max_horizon` is larger than 10 (it works for 10 and lower). Minimal reproduction:
```python
from mlforecast import MLForecast
from lightgbm import LGBMRegressor
import pandas as pd

df = pd.concat([
    pd.DataFrame({
        'id': ['A'] * 1000,
        'ds': pd.date_range(start='2020-01-01', periods=1000, freq='H'),
        'y': range(1000),
    }),
    pd.DataFrame({
        'id': ['B'] * 1000,
        'ds': pd.date_range(start='2020-01-01', periods=1000, freq='H'),
        'y': range(1000),
    }),
])

fcst = MLForecast(
    models=LGBMRegressor(),
    freq='H',
    lags=[1, 2, 3],
)
fcst.fit(df, id_col='id', time_col='ds', target_col='y', max_horizon=11, fitted=True)
in_sample_predictions = fcst.forecast_fitted_values()
print(in_sample_predictions)
```
```
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlforecast/forecast.py:412, in MLForecast._compute_fitted_values(self, base, X, y, id_col, time_col, target_col, max_horizon)
    409 for horizon in range(max_horizon):
    410     horizon_base = ufp.copy_if_pandas(base, deep=True)
    411     horizon_base = ufp.assign_columns(
--> 412         horizon_base, target_col, y[:, horizon]
    413     )
    414     horizon_fitted_values.append(horizon_base)
    415 for name, horizon_models in self.models_.items():

IndexError: index 10 is out of bounds for axis 1 with size 10
```
jan rathfelder
06/07/2024, 8:30 PM

```
     45 loaded_model.update(df_update)
     46
---> 47 # apply encoder:

~/miniconda3/envs/demand_env/lib/python3.8/site-packages/mlforecast/forecast.py in update(self, df)
    986         df : pandas or polars DataFrame
    987             Dataframe with new observations."""
--> 988         self.ts.update(df)

~/miniconda3/envs/demand_env/lib/python3.8/site-packages/mlforecast/core.py in update(self, df)
    867         if isinstance(tfm, _BaseGroupedArrayTargetTransform):
    868             ga = GroupedArray(values, indptr)
--> 869             ga = tfm.update(ga)
    870             df = ufp.assign_columns(df, self.target_col, ga.data)
    871         else:

~/miniconda3/envs/demand_env/lib/python3.8/site-packages/mlforecast/target_transforms.py in update(self, ga)
    111         core_ga = CoreGroupedArray(ga.data, ga.indptr, self.num_threads)
    112         for scaler in self.scalers_:
--> 113             transformed = scaler.update(core_ga)
    114             core_ga = core_ga._with_data(transformed)
    115         return GroupedArray(transformed, ga.indptr)

~/miniconda3/envs/demand_env/lib/python3.8/site-packages/coreforecast/scalers.py in update(self, ga)
    348         )
    349         if self.tails_.size != tails_indptr[-1]:
--> 350             raise ValueError("Number of tails doesn't match the number of groups")
    351         tails_ga = GroupedArray(self.tails_, tails_indptr, num_threads=ga.num_threads)
    352         combined = tails_ga._append(ga)

ValueError: Number of tails doesn't match the number of groups
```
Affan M
06/10/2024, 8:11 PM

Weikai Lu
06/12/2024, 9:41 PMcross_validation
function and I had a question about it. I understand that in time series analysis, when we create a validation set, it should only include information that would be available at the time of prediction. This means that lagged features for the validation set should be computed based on data up to the last point in the training set for each window.
I was wondering how the cross_validation
function in Mlforecast handles this. Does it ensure that lagged features for the validation set are only computed based on data up to the last point in the training set for each window?
I hope my question makes sense. Any guidance on this would be really helpful. Thank you so much!Braaannigan
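The requirement described above can be sketched in plain pandas (toy data; variable names are illustrative, this is not MLForecast's internal code): for each window, lag features on the validation side must come only from observations at or before that window's cutoff.

```python
import pandas as pd

# one toy series of 10 hourly observations
y = pd.Series(range(10), index=pd.date_range('2024-01-01', periods=10, freq='H'))

cutoff = y.index[6]   # last timestamp available for training in this window
train = y.loc[:cutoff]

# the lag-1 feature for the first validation step is the last training
# value, never a value from the validation period itself
lag1_for_first_valid_step = train.iloc[-1]
print(lag1_for_first_valid_step)
```

For later validation steps within the same window, the lag values come from the model's own recursive predictions rather than from the held-out actuals, which is what prevents leakage.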
Braaannigan
06/13/2024, 9:11 AM

Braaannigan
06/18/2024, 8:06 PM

Sarim Zafar
06/18/2024, 9:03 PM
How should I set up the `my_init_config` function, as shown in the example on the website? When I use a simple log-difference combination, as I typically do with cross validation, the loss function returns NaN. For the loss function, I am using MAE as described here:

```python
from utilsforecast.losses import mae  # assumed import for the mae used below

def custom_loss(df, train_df):
    return mae(df, models=["model"])["model"].mean()
```

Any guidance on these matters would be greatly appreciated.
Thank you!

jan rathfelder
06/19/2024, 1:17 PM

Olgahan Cat
06/19/2024, 3:55 PM

Biagio Principe
06/25/2024, 4:00 PM

Olgahan Cat
06/25/2024, 6:37 PM

Sarim Zafar
06/26/2024, 8:48 AM

Vítor Barbosa
06/26/2024, 10:38 PM
I'm using `fill_gaps` here:
```python
from utilsforecast.preprocessing import fill_gaps

stocks_basic_pd = fill_gaps(
    stocks_basic_pd, freq='B', start='per_serie', end='per_serie',
    id_col='Ticker', time_col='Date',
)
```

I am getting the error below. Any ideas?
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[54], line 2
      1 from utilsforecast.preprocessing import fill_gaps
----> 2 stocks_basic_pd = fill_gaps(stocks_basic_pd, freq='B', start='per_serie', end='per_serie', id_col='Ticker', time_col='Date')

File c:\Python\miniconda3\envs\openbb\Lib\site-packages\utilsforecast\preprocessing.py:166, in fill_gaps(df, freq, start, end, id_col, time_col)
    164     times += offset.base
    165 idx = pd.MultiIndex.from_arrays([uids, times], names=[id_col, time_col])
--> 166 res = df.set_index([id_col, time_col]).reindex(idx).reset_index()
    167 extra_cols = df.columns.drop([id_col, time_col]).tolist()
    168 if extra_cols:

File c:\Python\miniconda3\envs\openbb\Lib\site-packages\pandas\core\frame.py:5365, in DataFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
   5346 @doc(
   5347     NDFrame.reindex,
   5348     klass=_shared_doc_kwargs["klass"],
   (...)
   5363     tolerance=None,
   5364 ) -> DataFrame:
-> 5365     return super().reindex(
   5366         labels=labels,
   5367         index=index,
   5368         columns=columns,
   5369         axis=axis,
   5370         method=method,
   5371         copy=copy,
   5372         level=level,
   5373         fill_value=fill_value,
   5374         limit=limit,
   5375         tolerance=tolerance,
   5376     )

File c:\Python\miniconda3\envs\openbb\Lib\site-packages\pandas\core\generic.py:5607, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
   5604     return self._reindex_multi(axes, copy, fill_value)
   5606 # perform the reindex on the axes
-> 5607 return self._reindex_axes(
   5608     axes, level, limit, tolerance, method, fill_value, copy
   5609 ).__finalize__(self, method="reindex")

File c:\Python\miniconda3\envs\openbb\Lib\site-packages\pandas\core\generic.py:5630, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   5627     continue
   5629 ax = self._get_axis(a)
-> 5630 new_index, indexer = ax.reindex(
   5631     labels, level=level, limit=limit, tolerance=tolerance, method=method
   5632 )
   5634 axis = self._get_axis_number(a)
   5635 obj = obj._reindex_with_indexers(
   5636     {axis: [new_index, indexer]},
   5637     fill_value=fill_value,
   5638     copy=copy,
   5639     allow_dups=False,
   5640 )

File c:\Python\miniconda3\envs\openbb\Lib\site-packages\pandas\core\indexes\base.py:4426, in Index.reindex(self, target, method, level, limit, tolerance)
   4422     indexer = self.get_indexer(
   4423         target, method=method, limit=limit, tolerance=tolerance
   4424     )
   4425 elif self._is_multi:
-> 4426     raise ValueError("cannot handle a non-unique multi-index!")
   4427 elif not self.is_unique:
   4428     # GH#42568
   4429     raise ValueError("cannot reindex on an axis with duplicate labels")

ValueError: cannot handle a non-unique multi-index!
```
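This error typically means the (`Ticker`, `Date`) pairs are not unique, so the `reindex` inside `fill_gaps` fails. A minimal sketch of checking for and dropping such duplicates (toy data, not the actual `stocks_basic_pd`):

```python
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['AAPL', 'AAPL', 'MSFT'],
    'Date': pd.to_datetime(['2024-01-02', '2024-01-02', '2024-01-02']),
    'Close': [185.0, 185.0, 370.0],
})

# rows sharing a (Ticker, Date) pair make the multi-index non-unique
dup_mask = df.duplicated(subset=['Ticker', 'Date'], keep=False)
print(df[dup_mask])

# dropping the duplicates makes the (Ticker, Date) index unique again
df_unique = df.drop_duplicates(subset=['Ticker', 'Date'])
```

It is worth inspecting the duplicated rows before dropping them, since they may indicate an upstream join or download issue rather than true repeats.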
Johannes Emme
06/29/2024, 2:46 PM
In the first plot I have the `cs_df` and the true target plotted against each other. From this plot, it can be seen that my model is okay at predicting the weekends but has clear difficulties predicting the Mondays.
However, when I used the model for predictions (see plot 2), the uncertainty for the weekends was very large, and the Mondays had small uncertainty. (In plot 2 I forgot the legend: black = true, blue = mean prediction, purple = 10th and 90th percentiles.)
What I have come to realize is that the problem arises from a misalignment between the conformal horizon and the horizon at which I am predicting. With a conformal horizon of 96, the errors collected for a specific timestep do not “belong to the same timeslot.” For instance, the first error in the first window corresponds to Monday 00:00, while for the next window the first hour is Friday 00:00, then Tuesday 00:00, and so on. Hence, when I predict the consumption during Saturday, the quantiles are based on several different days and hours and not on “Saturday hour errors.”
To overcome this issue, I set the conformal horizon to 24*7 (168) so that my conformal windows start on the same day as when I am predicting. Then I get the following result (see plots 3 and 4), where the uncertainty is low for the weekends and high for the Mondays. However, I do not believe this is a sustainable solution, and unfortunately I don't have a great alternative either. Currently, for my case, I have simply rewritten the `_add_conformal_distribution_intervals` function by:
1. Requiring that `n_windows * h >= 168` so that all hours of the week are represented.
2. Joining the `cs_df` and `fcst_df` on `day_of_week` and `hour`.
3. Subtracting and adding the mean to get a distribution around each hour, and then calculating the quantiles.
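Steps 2 and 3 above could be sketched roughly like this (toy error data; the `error` and `yhat` column names are assumptions for illustration, not the real `cs_df`/`fcst_df` schema):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# hypothetical hourly CV errors covering two full weeks
ds = pd.date_range('2024-01-01', periods=24 * 14, freq='H')
errors = pd.DataFrame({'ds': ds, 'error': rng.normal(size=ds.size)})
errors['day_of_week'] = errors['ds'].dt.dayofweek
errors['hour'] = errors['ds'].dt.hour

# 10th/90th error percentiles per (day_of_week, hour) slot
q = (
    errors.groupby(['day_of_week', 'hour'])['error']
    .quantile([0.1, 0.9])
    .unstack()
    .reset_index()
    .rename(columns={0.1: 'q10', 0.9: 'q90'})
)

# join onto the forecasts and shift the point prediction by the quantiles
fcst = pd.DataFrame({
    'ds': pd.date_range('2024-01-15', periods=24, freq='H'),
    'yhat': 100.0,
})
fcst['day_of_week'] = fcst['ds'].dt.dayofweek
fcst['hour'] = fcst['ds'].dt.hour
fcst = fcst.merge(q, on=['day_of_week', 'hour'], how='left')
fcst['lo-80'] = fcst['yhat'] + fcst['q10']
fcst['hi-80'] = fcst['yhat'] + fcst['q90']
```

With only two weeks of errors each slot has very few samples, so in practice the quantiles per (day_of_week, hour) slot need many CV windows to be stable.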
I am very curious to hear your thoughts on this.
Best regards,
Johannes

jan rathfelder
07/01/2024, 10:10 AM

```
UserWarning: Found null values in expanding_std_lag1, rolling_std_lag1_window_size7_min_samples1, rolling_std_lag1_window_size70_min_samples1, rolling_std_lag1_window_size105_min_samples1, seasonal_rolling_std_lag1_season_length7_window_size3_min_samples1.
  warnings.warn(f'Found null values in {", ".join(cols_with_nulls)}.')
```
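Nulls in std-based features are expected at the start of each series: the standard deviation of a single observation is undefined, so for example `expanding_std_lag1` is null for each series' first couple of steps. A minimal pandas illustration:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# the std of a single observation is NaN, so the expanding std starts
# with a null; lagging it by 1 pushes the null one step further
expanding_std = s.expanding().std()
expanding_std_lag1 = expanding_std.shift(1)
print(expanding_std_lag1)
```

If the nulls only appear at the start of each series, the warning is usually harmless; tree models like LightGBM can handle them natively.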
Krystian W.
07/02/2024, 10:13 PM
Is it possible to use the `max_horizon` arg in `DistributedMLForecast`? Or only through some workaround?

Biagio Principe
07/03/2024, 7:52 PM

```python
scaler = TemporalNorm(scaler_type='standard', dim=1)
```
Dinis Timoteo
07/04/2024, 2:51 PM

Krystian W.
07/07/2024, 7:32 PM

```python
cv = fcst.cross_validation(
    spark_train_df,
    n_windows=n_windows,
    h=h,
    static_features=[],
)
```
I tried to run `cv.show()` but I keep getting a KeyError saying those features aren't found in the index. Locally it works just fine.

Ml Club
07/09/2024, 8:43 AM888 elif pandas_requires_conversion and any(d == object for d in dtypes_orig):
889 # Force object if any of the dtypes is an object
890 dtype_orig = object
ValueError: at least one array or dtype is required
Ml Club
07/09/2024, 8:46 AM
If I use `lag=[1]` then it works great. What is the issue? Please help me resolve it. Also, I want to apply a target transformation with `np.log`; how can I do that?
Ml Club
07/11/2024, 7:56 AM

Ml Club
07/12/2024, 6:17 AM

Krystian W.
07/14/2024, 2:06 PM
I get an error with the column `rolling_quantile_lag_1_p=0.5_window_size_7` because of the dot in the parameter.
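If the failure is the storage layer (e.g. Spark writing Parquet) rejecting characters like `.` or `=` in column names, one workaround is renaming the generated features before saving (`sanitize_column` is a hypothetical helper, not an mlforecast API):

```python
import re

def sanitize_column(name: str) -> str:
    # replace characters that Parquet/Spark column names commonly reject
    return re.sub(r'[.=]', '_', name)

cols = ['rolling_quantile_lag_1_p=0.5_window_size_7', 'lag1']
clean = [sanitize_column(c) for c in cols]
print(clean)
```

If the renamed frame is later fed back into a fitted model, the rename would have to be undone (or the model retrained on the sanitized names), so renaming only at write time is the safer pattern.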
Biagio Principe
07/15/2024, 8:35 AM
Does using `max_horizon` with lag `1` introduce data leakage? (see second image)
Many thanks!

Ml Club
07/16/2024, 4:19 PM