# general
m
Hello everyone. I'm having a little trouble with prediction intervals here. The attached file contains my dataframe (with 924 unique ids) and some static variables. Here is my code:
Copy code
train = df_completo_encoded.query("ds < '2023-07-07'")
test = df_completo_encoded.query("ds >= '2023-07-07'")


from mlforecast import MLForecast
from mlforecast.utils import PredictionIntervals
from mlforecast.target_transforms import Differences
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from window_ops.ewm import ewm_mean
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean, seasonal_rolling_mean, rolling_min, rolling_max, rolling_std


mlf = MLForecast(
    freq = 'D',
    models=[ XGBRegressor(n_jobs = -1),LGBMRegressor(n_jobs = -1)],
     target_transforms=[Differences([1,7])],
     lag_transforms={
                        1: [(rolling_mean, 2), (rolling_mean, 3), (rolling_mean, 4), (rolling_mean, 5), (rolling_mean, 6), (rolling_mean, 7),
                            (rolling_mean, 7), (rolling_mean, 14), (rolling_mean, 28), (ewm_mean, 0.9), expanding_mean,
                            (rolling_min, 7), (rolling_min, 14), (rolling_min, 28),
                            (rolling_max, 7), (rolling_max, 14), (rolling_max, 28),
                            (rolling_std, 2), (rolling_std, 3), (rolling_std, 4), (rolling_std, 5), (rolling_std, 6), (rolling_std, 7), (rolling_std, 14), (rolling_std, 28)],
                      },
    lags=[1,7,14,21,28],
   date_features=['month', 'year', 'day_of_week', 'day_of_year','is_month_start','quarter','days_in_month'],
      num_threads=4
                    )
%%time

mlf.fit( train,
    id_col='unique_id',
    #max_horizon = 47,
    prediction_intervals=PredictionIntervals(n_windows=10, window_size=47),
    time_col='ds',
    target_col='y',
    static_features= ['gtin','ADI','CV2','cluster_0','cluster_1','cluster_2','cluster_3','cluster_4'],)
Up to here, everything is ok. But then I get the following error:
Copy code
levels = [50, 80, 95]

forecasts = mlf.predict(47, level = levels )

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[81], line 3
      1 levels = [50, 80, 95]
----> 3 forecasts = mlf.predict(47, level = levels )
      4 forecasts.head()

File /usr/local/lib/python3.10/site-packages/mlforecast/utils.py:186, in old_kw_to_pos.<locals>.decorator.<locals>.inner(*args, **kwargs)
    184                 new_args.append(kwargs.pop(arg_names[i]))
    185             new_args.append(kwargs.pop(old_name))
--> 186 return f(*new_args, **kwargs)

File /usr/local/lib/python3.10/site-packages/mlforecast/forecast.py:532, in MLForecast.predict(self, h, dynamic_dfs, before_predict_callback, after_predict_callback, new_df, level, X_df, ids, horizon, new_data)
    528         model_names = self.models.keys()
    529         conformal_method = _get_conformal_method(
    530             self.prediction_intervals.method
    531         )
--> 532         forecasts = conformal_method(
    533             forecasts,
    534             self._cs_df,
    535             model_names=list(model_names),
    536             level=level_,
    537             cs_h=self.prediction_intervals.h,
    538             cs_n_windows=self.prediction_intervals.n_windows,
    539             n_series=self.ts.ga.ngroups,
    540             horizon=h,
    541         )
    542 return forecasts

File /usr/local/lib/python3.10/site-packages/mlforecast/forecast.py:55, in _add_conformal_distribution_intervals(fcst_df, cs_df, model_names, level, cs_n_windows, cs_h, n_series, horizon)
     53 scores = scores[:, :, :horizon]
     54 mean = fcst_df[model].values.reshape(1, n_series, -1)
---> 55 scores = np.vstack([mean - scores, mean + scores])
     56 quantiles = np.quantile(
     57     scores,
     58     cuts,
     59     axis=0,
     60 )
     61 quantiles = quantiles.reshape(len(cuts), -1)

File /usr/local/lib/python3.10/site-packages/pandas/core/arrays/masked.py:528, in BaseMaskedArray.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    525         return NotImplemented
    527 # for binary ops, use our custom dunder methods
--> 528 result = ops.maybe_dispatch_ufunc_to_dunder_op(
    529     self, ufunc, method, *inputs, **kwargs
    530 )
    531 if result is not NotImplemented:
    532     return result

File /usr/local/lib/python3.10/site-packages/pandas/_libs/ops_dispatch.pyx:113, in pandas._libs.ops_dispatch.maybe_dispatch_ufunc_to_dunder_op()

File /usr/local/lib/python3.10/site-packages/pandas/core/ops/common.py:81, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     77             return NotImplemented
     79 other = item_from_zerodim(other)
---> 81 return method(self, other)

File /usr/local/lib/python3.10/site-packages/pandas/core/arraylike.py:198, in OpsMixin.__rsub__(self, other)
    196 @unpack_zerodim_and_defer("__rsub__")
    197 def __rsub__(self, other):
--> 198     return self._arith_method(other, roperator.rsub)

File /usr/local/lib/python3.10/site-packages/pandas/core/arrays/masked.py:659, in BaseMaskedArray._arith_method(self, other, op)
    657         other = np.asarray(other)
    658     if other.ndim > 1:
--> 659         raise NotImplementedError("can only perform ops with 1-d structures")
    661 # We wrap the non-masked arithmetic logic used for numpy dtypes
    662 #  in Series/Index arithmetic ops.
    663 other = ops.maybe_prepare_scalar_for_op(other, (len(self),))

NotImplementedError: can only perform ops with 1-d structures
j
Hey. Did you run the fit first with max_horizon?
m
Hello José. The max_horizon argument in fit is commented out, I didn't use it. Should max_horizon make a difference?
j
I think if you run it first with max_horizon and then without it you can run into that error (if you don't re-create the MLForecast object)
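A minimal sketch of that point, using the synthetic-series helper from mlforecast.utils purely for illustration (the lags, series lengths and number of windows here are arbitrary; make_fcst is just a local helper for this sketch):
from lightgbm import LGBMRegressor
from mlforecast import MLForecast
from mlforecast.utils import PredictionIntervals, generate_daily_series

series = generate_daily_series(5, min_length=600, max_length=600)

def make_fcst():
    # build a fresh MLForecast each time instead of reusing a fitted one
    return MLForecast(models=[LGBMRegressor()], freq='D', lags=[1, 7])

mlf = make_fcst()
mlf.fit(series, max_horizon=47)  # direct multi-step training

mlf = make_fcst()  # fresh object before switching the setup
mlf.fit(series, prediction_intervals=PredictionIntervals(n_windows=2, window_size=47))
preds = mlf.predict(47, level=[80])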
m
I restarted the kernel and tried again. The training even goes ok; the error occurs at predict time.
Copy code
mlf.fit( train,
    id_col='unique_id',
    prediction_intervals=PredictionIntervals(n_windows=10, window_size=47),
    time_col='ds',
    target_col='y',
    static_features= ['gtin','ADI','CV2','cluster_0','cluster_1','cluster_2','cluster_3','cluster_4'],)

[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.063419 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5559
[LightGBM] [Info] Number of data points in the train set: 378531, number of used features: 44
[LightGBM] [Info] Start training from score -0.000629
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.088819 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5930
[LightGBM] [Info] Number of data points in the train set: 811401, number of used features: 44
[LightGBM] [Info] Start training from score 0.000392
CPU times: user 13min 1s, sys: 4.26 s, total: 13min 5s
Wall time: 3min 55s
[37]:
MLForecast(models=[XGBRegressor, LGBMRegressor], freq=<Day>, lag_features=['lag1', 'lag7', 'lag14', 'lag21', 'lag28', 'rolling_mean_lag1_window_size2', 'rolling_mean_lag1_window_size3', 'rolling_mean_lag1_window_size4', 'rolling_mean_lag1_window_size5', 'rolling_mean_lag1_window_size6', 'rolling_mean_lag1_window_size7', 'rolling_mean_lag1_window_size14', 'rolling_mean_lag1_window_size28', 'ewm_mean_lag1_alpha0.9', 'expanding_mean_lag1', 'rolling_min_lag1_window_size7', 'rolling_min_lag1_window_size14', 'rolling_min_lag1_window_size28', 'rolling_max_lag1_window_size7', 'rolling_max_lag1_window_size14', 'rolling_max_lag1_window_size28', 'rolling_std_lag1_window_size2', 'rolling_std_lag1_window_size3', 'rolling_std_lag1_window_size4', 'rolling_std_lag1_window_size5', 'rolling_std_lag1_window_size6', 'rolling_std_lag1_window_size7', 'rolling_std_lag1_window_size14', 'rolling_std_lag1_window_size28'], date_features=['month', 'year', 'day_of_week', 'day_of_year', 'is_month_start', 'quarter', 'days_in_month'], num_threads=4)
Copy code
mlf.prediction_intervals

PredictionIntervals(n_windows=10, h=47, method='conformal_distribution')
Copy code
levels = [50, 80, 95]

forecasts = mlf.predict(47, level = levels )

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[42], line 3
      1 levels = [50, 80, 95]
----> 3 forecasts = mlf.predict(47, level = levels )
      4 forecasts.head()

[... same traceback as above ...]

NotImplementedError: can only perform ops with 1-d structures
j
Are you on the latest version? This runs fine for me:
Copy code
from mlforecast import MLForecast
from mlforecast.utils import PredictionIntervals
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from mlforecast.target_transforms import Differences
from mlforecast.utils import PredictionIntervals, generate_daily_series
from window_ops.ewm import ewm_mean
from window_ops.rolling import rolling_mean, seasonal_rolling_mean,rolling_min, rolling_max, rolling_std
from window_ops.expanding import expanding_mean

train = generate_daily_series(10, min_length=1_000, max_length=1_000, n_static_features=2, static_as_categorical=False)
mlf = MLForecast(
    freq = 'D',
    models=[ XGBRegressor(n_jobs = -1),LGBMRegressor(n_jobs = -1)],
     target_transforms=[Differences([1,7])],
     lag_transforms={
                        1: [(rolling_mean, 2),(rolling_mean, 3),(rolling_mean, 4),(rolling_mean, 5),(rolling_mean, 6),(rolling_mean, 7),
                            (rolling_mean, 7), (rolling_mean, 14), (rolling_mean, 28),(ewm_mean, 0.9), expanding_mean,
                            (rolling_min,7), (rolling_min,14),(rolling_min,28),
                            (rolling_max,7), (rolling_max,14),(rolling_max,28),
                            (rolling_std,2),(rolling_std,3),(rolling_std,4),(rolling_std,5),(rolling_std,6),(rolling_std,7), (rolling_std,14),(rolling_std,28)],
      
                      },
    lags=[1,7,14,21,28],
   date_features=['month', 'year', 'day_of_week', 'day_of_year','is_month_start','quarter','days_in_month'],
      num_threads=4
                    )
mlf.fit( train,
    id_col='unique_id',
    #max_horizon = 47,
    prediction_intervals=PredictionIntervals(n_windows=10, window_size=47),
    time_col='ds',
    target_col='y',
    static_features=['static_0', 'static_1'])
mlf.predict(47, level = [50, 80, 95])
m
Hello José. Thanks for the reply. I'm investigating whether it could be something in the SageMaker environment I'm running in. I'll be back soon with more information.
j
Also, if you're able to print the shapes of the means and scores it'd help a lot. If you run %debug in your notebook when you get the error, you can go up the stack (with u) to that point and just print the shapes. I think it could also be because of some small series.
💡 1
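As a complement to the %debug route, here is a sketch of checks that can be run directly; mlf._cs_df and mlf.ts.ga.ngroups are internal attributes taken from the traceback above (they may change between versions), and the dtype check is only an inference from the pandas BaseMaskedArray frames in that traceback, which handle nullable dtypes such as Int64/Float64:
# Diagnostic sketch only, not a confirmed fix.
print(train.dtypes)                       # nullable dtypes (Int64/Float64) would explain
                                          # the BaseMaskedArray frames in the traceback
point_fcst = mlf.predict(47)              # point forecasts without level= as a sanity check
print(point_fcst.dtypes)
print('n_series:', mlf.ts.ga.ngroups)
print('forecast rows:', len(point_fcst))  # expected: n_series * 47
print('conformity scores:', mlf._cs_df.shape)
If y or the forecast columns turn out to have a nullable dtype, casting to plain float64 before fitting (for example df_completo_encoded['y'] = df_completo_encoded['y'].astype('float64')) would be one thing to try, though that is only a guess from the traceback.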
m
Yes! I'm investigating whether it could be the size of the series, or whether during cross-validation it's getting a horizon in which all observations are zeros, given the intermittency of the SKUs here at the company.
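A quick sketch to test both of those hypotheses on the training frame (it assumes the unique_id/ds/y columns from the fit call and the n_windows=10, window_size=47 conformal setup):
min_needed = 10 * 47                       # rows consumed by the conformal windows
sizes = train.groupby('unique_id').size()
print('series with at most', min_needed, 'rows:', int((sizes <= min_needed).sum()))

last_window = train.sort_values('ds').groupby('unique_id').tail(47)
all_zero = last_window.groupby('unique_id')['y'].apply(lambda s: (s == 0).all())
print('series whose last 47 days are all zeros:', int(all_zero.sum()))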
José, maybe you can help me with something here. I'm currently facing a challenge in my e-commerce business, which comprises a diverse range of SKUs, each with its own sales behavior and patterns. Our goal is to make demand forecasts for the upcoming 30 days; however, due to the diversity and complexity of the data, finding an efficient strategy to select relevant and predictable SKUs has proven to be quite a challenge.

To date, we've gathered sales data from January 2021 to the present date in August 2023, encompassing over a million GTINs (Global Trade Item Numbers) with sales records. This massive number of time series prompts us to seek an approach that's both effective and manageable. One of the approaches we're considering is applying the Pareto principle to prioritize the GTINs. By doing so, we're able to reduce the number of SKUs from over a million to around 20,000. This initial selection allows us to concentrate our efforts on the most promising time series.

Among the behavior patterns we've identified are:
1. Single sales and halting: some GTINs sell only a single unit and then halt sales completely.
2. Sales decline: other GTINs had a strong sales history in the past but have stopped entirely for months.
3. Extreme intermittency: certain GTINs exhibit extremely intermittent patterns, selling one unit one day, going days without sales, and occasionally selling one unit again. Many of these cases show minimal variance, with the time series inflated by zeros on days without sales.
4. New launches: other GTINs have recently begun selling and have limited sales history.
5. Variable sales: there are GTINs that sell one unit on alternate days but, at specific times, suddenly register extremely high sales, followed by a sales interruption.
6. Diverse time periods: additionally, the time series of the GTINs have varying start and end dates, adding an extra layer of complexity.

Considering these behavior patterns, I'm seeking guidance on the best approach to selecting the most relevant and predictable GTINs for our demand forecasting efforts. Have you experienced a similar situation where you had to filter out "predictable" products? Your insights, advice, and shared experiences would be immensely valuable in refining our strategy and optimizing our outcomes.
In some experiments I even get very interesting predictions, according to the graph. But we still need to add the confidence intervals, and deep down I still doubt whether the strategy to select the GTINs was the best.
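Since ADI and CV2 already appear as static features in the fit calls above, one common screening step for this kind of question is the ADI / CV² demand classification (Syntetos & Boylan). A sketch, assuming the long-format unique_id/ds/y columns used throughout the thread and the usual 1.32 / 0.49 cutoffs:
import numpy as np
import pandas as pd

def classify_demand(g: pd.DataFrame) -> pd.Series:
    nonzero = g.loc[g['y'] > 0, 'y']
    if nonzero.empty:
        return pd.Series({'adi': np.nan, 'cv2': np.nan, 'class': 'no_demand'})
    adi = len(g) / len(nonzero)                          # average inter-demand interval
    cv2 = (nonzero.std(ddof=0) / nonzero.mean()) ** 2    # squared coefficient of variation
    if adi <= 1.32:
        label = 'smooth' if cv2 <= 0.49 else 'erratic'
    else:
        label = 'intermittent' if cv2 <= 0.49 else 'lumpy'
    return pd.Series({'adi': adi, 'cv2': cv2, 'class': label})

profile = df_completo_encoded.groupby('unique_id').apply(classify_demand)
print(profile['class'].value_counts())
Smooth and intermittent series are usually considered the more forecastable buckets; the lumpy ones are where a baseline comparison like the one suggested below tends to be most informative.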
j
Hey. Another possible approach would be using statsforecast with some dummy models (like Naive and SeasonalNaive) and a more sophisticated one like ARIMA or ETS, and seeing whether they outperform the dummy ones. Where they don't, you could say there isn't enough signal to forecast them. Also, since your volume is high, you could try neuralforecast as well; this document shows how you can compare them and even combine them.
👍 1
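A sketch of that comparison with statsforecast; AutoETS stands in for the "more sophisticated" model, and season_length=7, h=30 and n_windows=2 are assumptions based on the daily data and the 30-day goal mentioned above:
import numpy as np
from statsforecast import StatsForecast
from statsforecast.models import Naive, SeasonalNaive, AutoETS

sf = StatsForecast(
    models=[Naive(), SeasonalNaive(season_length=7), AutoETS(season_length=7)],
    freq='D',
    n_jobs=-1,
)
cv = sf.cross_validation(df=train[['unique_id', 'ds', 'y']], h=30, n_windows=2).reset_index()

def mae(model_col):
    # mean absolute error per series for one model column of the CV frame
    return (cv[model_col] - cv['y']).abs().groupby(cv['unique_id']).mean()

errors = {m: mae(m) for m in ['Naive', 'SeasonalNaive', 'AutoETS']}
baseline = np.minimum(errors['Naive'], errors['SeasonalNaive'])
forecastable = errors['AutoETS'] < baseline   # "enough signal" in the sense above
print(f'{forecastable.mean():.1%} of series beat both dummy baselines')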
m
Cool, and for neuralforecast do you recommend balancing the panel (leaving all the time series with the same length), or is it not necessary?
j
Only if they're too short for your windows. We have the fill_gaps function, which may be of help.
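For reference, a sketch of what using fill_gaps could look like; the import path (utilsforecast.preprocessing) is an assumption that should be checked against the installed version, and zero-filling the created rows is a choice specific to sales data:
from utilsforecast.preprocessing import fill_gaps  # assumed import path

# insert the missing dates so every series sits on a contiguous daily grid
filled = fill_gaps(train, freq='D', id_col='unique_id', time_col='ds')
filled['y'] = filled['y'].fillna(0)  # zero-fill days with no sales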