Arsa Nikzad
03/17/2023, 1:12 AM
In mlforecast's fcst.preprocess(), the default value for dropna is True. With intermittent series and features like rolling_std, NaNs can occur anywhere in the data (not just in the initial rows), and all of those observations are dropped. The consequence might be an inaccurate CV estimate when we evaluate the model. Maybe the function could warn the user about the default value of dropna?
Max (Nixtla)
03/17/2023, 1:08 PM
José Morales
03/17/2023, 5:10 PM
Arsa Nikzad
03/17/2023, 6:09 PM
The NaNs appear in rolling_std when the target contains a run of consecutive zeros longer than the window size. The root cause seems to be _rolling_std in window_ops.rolling, which generates large negative numbers in that situation; these negative numbers are then converted to NaN. Attached is an example.
import lightgbm as lgb
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from window_ops.rolling import rolling_std, _rolling_std

# Monthly series with a run of four consecutive zeros (longer than the window of 3)
data = pd.DataFrame({
    'date': pd.date_range(start='2019-01-01', end='2020-12-31', freq='MS'),
    'sprid': 1.,
    'target': [1., 2., 0., 4., 0., 0., 0., 0., 9., 10., 11., 12] * 2
})
models = [lgb.LGBMRegressor()]
fcst = MLForecast(
    models=models,
    freq='MS',
    lags=[1],
    lag_transforms={
        1: [(rolling_std, 3)]
    }
)
# Keep rows with missing features so the NaNs in the middle of the series stay visible
preprocessed_df = fcst.preprocess(data, id_col='sprid', time_col='date', target_col='target', dropna=False)
print(preprocessed_df)

## check _rolling_std
a = np.array([1, 2, 0, 4, 0, 0, 0, 0, 9, 10, 11, 12] * 2)
print(_rolling_std(a, 3))
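
For comparison with the output above, here is a minimal pure-Python sketch of what a rolling standard deviation would be expected to return on one cycle of the same target values. It is not the window_ops implementation: rolling_std_reference is a hypothetical helper, it assumes a sample standard deviation (ddof=1), and it ignores the lag-1 shift that MLForecast applies before the transform. Because the variance is summed from squared deviations, it cannot go negative, so all-zero windows give 0.0 and only the incomplete initial windows are NaN.

```python
import math


def rolling_std_reference(values, window_size):
    """Two-pass sample standard deviation over a trailing window.

    Hypothetical reference helper, not part of window_ops.
    Returns NaN until a full window is available.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            # Not enough observations yet for a full window
            out.append(math.nan)
            continue
        window = values[i + 1 - window_size : i + 1]
        mean = sum(window) / window_size
        # Sum of squared deviations is non-negative by construction;
        # max(..., 0.0) is kept as an explicit guard before the square root.
        var = sum((x - mean) ** 2 for x in window) / (window_size - 1)
        out.append(math.sqrt(max(var, 0.0)))
    return out


# One cycle of the target from the example above
target = [1., 2., 0., 4., 0., 0., 0., 0., 9., 10., 11., 12.]
stds = rolling_std_reference(target, 3)

# Positions that a dropna=True style filter would remove
nan_positions = [i for i, s in enumerate(stds) if math.isnan(s)]
print(stds)
print(nan_positions)
```

With this reference, only the first two positions are NaN, so dropping missing rows would remove just the initial rows rather than observations in the middle of the series.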
José Morales
03/17/2023, 6:33 PM
Arsa Nikzad
03/17/2023, 6:36 PM
rolling_std should generate zeros instead of NaN for these cases.
Max (Nixtla)
03/17/2023, 6:37 PM
Arsa Nikzad
03/17/2023, 6:38 PM
José Morales
03/22/2023, 3:36 AM
Arsa Nikzad
03/22/2023, 1:08 PM