# mlforecast
j
Hey. The configuration you provide to the intervals is used to perform cross validation, so two windows of size 52 would use 104 validation samples. If you have 156 points, that should leave you with 52 training samples in the first window. However, if one of your features requires more than that it would be null, and if your model can't handle nulls then you'd get that error. Are you also getting warnings like "Found nulls in feature X"?
a
Hey José, no, I don't get any warnings from the preprocess step with my lags config for series with len >= 156:
Copy code
from mlforecast import MLForecast
from window_ops.rolling import rolling_mean, rolling_std  # assumed source of the rolling functions

mlf = MLForecast(
    models=[model],
    freq="W",
    lags=list(range(1, 53)),
    lag_transforms = {
        1:  [(rolling_mean, 51), (rolling_std, 51)],
        26: [(rolling_mean, 26), (rolling_std, 26)],
        40: [(rolling_mean, 12), (rolling_std, 12)],
        44: [(rolling_mean, 8), (rolling_std, 8)],
        48: [(rolling_mean, 4), (rolling_std, 4)],
    }
)
But it seems to work if I leave at least 2 * 52 + 1 points per series after the preprocess, by setting
lags=list(range(1, 52))
(i.e. excluding lag 52).
j
lag 52 requires 53 samples, so that could be it
You could do something like this to check the minimum number of required samples:
Copy code
from mlforecast import MLForecast
from utilsforecast.data import generate_series

freq = 'W'
series = generate_series(1, min_length=1_000, max_length=2_000, freq=freq)
fcst = MLForecast(models=[], freq=freq, lags=[52])
prep = fcst.preprocess(series, dropna=False)
min_samples = prep.isnull().sum().max() + 1
min_samples
👍 1
a
df.groupby("unique_id")["ds"].count().min()
==> 156
prep.groupby("unique_id")["ds"].count().min()
==> 104
prep.isnull().sum().max() + 1
==> 1
it's not a lag/lag_transforms issue, it's the
prediction_intervals
process that seems to need more than `n_windows` * `h` observations
j
It needs n_windows * h + min_samples, because it performs CV, so it's like:
• n_samples: 156
• The first window is the earliest one, so we move back 104 timestamps and save the next 52 as the first validation set. At this point you have 52 training samples.
• Run the feature engineering on those 52 samples; if your features require more than that you'll get null values, and if you set dropna=True you could drop your training set entirely, which should raise a different error.
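A rough sketch of that bookkeeping with the numbers above (plain Python, nothing mlforecast-specific):
Copy code
# window bookkeeping for the conformal intervals (numbers from this thread)
n_samples = 156        # points in the shortest series
n_windows, h = 2, 52   # prediction intervals configuration
min_samples = 53       # what lag 52 needs to produce a non-null row

validation = n_windows * h                   # 104 samples reserved for CV
first_window_train = n_samples - validation  # 52 samples left to train on in the first window
required = validation + min_samples          # 157 samples needed to avoid nulls

print(first_window_train, required)  # 52 157 -> 52 < 53, so lag 52 is null there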
a
Oh ok, I got it! So it's the normal behavior and my model crashed because it can't handle NaN values in the first 1-step ahead cross-validation iteration.
So, if I really want to use the conformal predictions while keeping my lags of 52 or more (a very important feature, because it brings seasonality information to my model), what would be the best trick in your opinion? Add a fake first year of history for all short series just for calculating the lag features?
j
Yeah you could do that or use a pipeline that does some kind of imputing, e.g.
Copy code
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression()  # or any other model you're using
)
fcst = MLForecast(models=[model], ...)
that would fill the nulls with zeros before passing them to the model, which is easier to implement I think
One thing to keep in mind is that the predictions from the model are then used to update the target (because of the recursive strategy) so the results of the first window may not be as reliable
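Putting the pieces together, something like this should work (a sketch: df stands for your training frame, the lag config is abbreviated from earlier in the thread, and I'm assuming PredictionIntervals is importable from mlforecast.utils and the rolling functions come from window_ops in your version):
Copy code
from mlforecast import MLForecast
from mlforecast.utils import PredictionIntervals  # assumed import path
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from window_ops.rolling import rolling_mean, rolling_std  # assumed source of the rolling functions

# impute the nulls produced by the long lags before the model sees them
model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression(),
)
fcst = MLForecast(
    models=[model],
    freq='W',
    lags=list(range(1, 53)),
    lag_transforms={1: [(rolling_mean, 51), (rolling_std, 51)]},  # rest of the config omitted
)
# df is your training frame with unique_id / ds / y columns
fcst.fit(df, prediction_intervals=PredictionIntervals(n_windows=2, h=52))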
a
You’re right, in that case, for the first window I will have y=lag52_y and it could add a serious bias to the training. But I don’t see another (simple) way to deal with my problem right now :p
j
I was going to suggest using n_windows=1 but I see we don't allow it haha
🤣 2
To be honest it's more reliable to have one window than doing those kinds of hacks, so you could override our check with the following:
Copy code
class MyPredictionIntervals:
    def __init__(self, h, n_windows, method = 'conformal_distribution'):
        self.h = h
        self.n_windows = n_windows
        self.method = method
and then provide that to fit:
Copy code
fcst.fit(..., prediction_intervals=MyPredictionIntervals(n_windows=1, h=52))
I believe that'd work. Just don't tell anyone I'm suggesting this haha
a
Why not haha. But do you know exactly why this assertion for a minimum of 2 windows exists?
j
Yeah, it's for statistical rigor I suppose. Since we're computing quantiles based on the errors, just having one sample (window) isn't very reliable, but I believe it'd still provide a better estimate than providing fictional samples
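Very roughly, the idea is this (a simplified sketch of the conformal approach, not the library's actual code): the interval width is a quantile over the per-window CV errors, so with a single window there's almost nothing to take a quantile over:
Copy code
import numpy as np

# toy absolute CV errors, shape (n_windows, h)
errors = np.abs(np.random.default_rng(0).normal(size=(2, 52)))

# per-horizon-step half-width of e.g. a 90% interval: a quantile across the windows
half_width = np.quantile(errors, 0.9, axis=0)
print(half_width.shape)  # (52,)

# with n_windows=1 that "quantile" is just the single observed error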
a
Thanks! I'll try it today
Hi @José Morales, so I tested your "hack", using
prediction_intervals = MyPredictionIntervals(h=52, n_windows=1)
and I get the same error
Input X contains NaN
with the LinearRegression model. I also tried setting
h=1
, and I had the same result. I can't figure out what's going on in this CV...
Got it. The hack is working, but I sometimes get exploding (inf) forecasts during the CV on some specific series, and that results in NaNs during the lag calculation in the following recursive step 😕
j
which model are you using?
a
I was testing with a simple linear regression as baseline
j
Can you use one of the techniques here to inspect the inputs? It's weird for the linear regression to predict inf
a
I've already explored this with a debugging tool. On the 15th iteration out of 52 (in my case h=52), after generating the lags, the X input contains inf and NaN, which causes the execution to crash when sklearn checks the inputs. However, there's no problem with a dataset of similar size generated with
generate_series
... My dataset can contain series with quite sparse values and large peaks. From what I've seen, 6 series out of 32,000 are involved.
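For what it's worth, a rough way to shortlist them (hypothetical thresholds, just to flag sparse series with big peaks):
Copy code
stats = df.groupby('unique_id')['y'].agg(['max', 'median', lambda s: (s == 0).mean()])
stats.columns = ['max', 'median', 'zero_frac']
# huge peak relative to the typical level, or mostly zeros
suspects = stats[(stats['max'] > 50 * stats['median'].clip(lower=1)) | (stats['zero_frac'] > 0.5)]
print(len(suspects), 'suspicious series')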
j
Are you using target transformations?
a
Nope, only lags & lag_transforms
j
Which lag transforms?
If you're using the
%debug
magic in IPython you could check which columns contain the NaNs, that'd help a lot. Just running something like
X.isnull().sum()
should show the NaN counts by column
a
rolling_mean
&
rolling_std
j
They both divide by the window size, so I don't think they could divide by zero. So it's probably coming from the forecast. Are you able to see the forecast in the step before? If the linear regression coefficients are somewhat big and it gets one of the spikes, I guess it could produce an inf, but that seems a bit hard (they'd have to be really big values)
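One (hypothetical) way to check that, reusing the imputing pipeline and the df from earlier: fit it once on the preprocessed training matrix and look at the coefficient magnitudes:
Copy code
prep = fcst.preprocess(df, dropna=True)
X, y = prep.drop(columns=['unique_id', 'ds', 'y']), prep['y']
model.fit(X, y)
coefs = model.named_steps['linearregression'].coef_  # step name from make_pipeline
print(abs(coefs).max())  # very large coefficients plus a spike in a lag -> huge forecasts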
a
that’s exactly my conclusion, and that’s why I gave up and switched to LGBM 😛
j
haha ok. Let me know if that produces an error as well
a
The problem is that LGBM handles NaN values natively. I will double-check with XGBoost to be sure!
👍 1
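For reference, a sketch of the swap (LightGBM treats NaNs in the features as missing values by default; check how your XGBoost version handles missing values):
Copy code
from lightgbm import LGBMRegressor
from mlforecast import MLForecast

# no imputing pipeline needed: the NaNs from the long lags are handled as missing values
fcst = MLForecast(models=[LGBMRegressor()], freq='W', lags=list(range(1, 53)))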