# mlforecast
j
Hey. The configuration you provide to the intervals is used to perform cross validation, so two windows of size 52 would use 104 validation samples. If you have 156 points, that should leave you with 52 training samples in the first window. However, if one of your features requires more than that it would be null, and if your model can't handle nulls then you'd get that error. Are you also getting warnings like "Found nulls in feature X"?
a
Hey José, no, I don't get any warnings from the preprocess step with my lags config for series with len >= 156:
Copy code
from mlforecast import MLForecast
from window_ops.rolling import rolling_mean, rolling_std  # assumed source of the rolling functions

mlf = MLForecast(
    models=[model],
    freq="W",
    lags=list(range(1, 53)),
    lag_transforms = {
        1:  [(rolling_mean, 51), (rolling_std, 51)],
        26: [(rolling_mean, 26), (rolling_std, 26)],
        40: [(rolling_mean, 12), (rolling_std, 12)],
        44: [(rolling_mean, 8), (rolling_std, 8)],
        48: [(rolling_mean, 4), (rolling_std, 4)],
    }
)
But it seems to work if I leave at least 2 * 52 + 1 points per series after the preprocess, by setting
lags=list(range(1, 52))
(i.e. excluding lag 52).
j
lag 52 requires 53 samples, so that could be it
You could do something like this to check the minimum number of required samples:
Copy code
from mlforecast import MLForecast
from utilsforecast.data import generate_series

freq = 'W'
series = generate_series(1, min_length=1_000, max_length=2_000, freq=freq)
fcst = MLForecast(models=[], freq=freq, lags=[52])
prep = fcst.preprocess(series, dropna=False)
min_samples = prep.isnull().sum().max() + 1
min_samples
👍 1
a
df.groupby("unique_id")["ds"].count().min()
==> 156
prep.groupby("unique_id")["ds"].count().min()
==> 104
prep.isnull().sum().max() + 1
==> 1
it's not a lag/lag_transforms issue, it's the
prediction_intervals
process that seems to need more than `n_windows` * `h` observations
j
It needs n_windows * h + min_samples, because it performs CV, so it's like:
• n_samples: 156
• The first window is the earliest one, so we move back 104 timestamps and save the next 52 as the first validation set. At this point you have 52 training samples.
• Run the feature engineering on those 52 samples; if your features require more than that you'll get null values, and if you set dropna=True you could drop your training set entirely, which should raise a different error.
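A rough sketch of that bookkeeping with the numbers above (plain Python, nothing mlforecast-specific):
Copy code
# window bookkeeping for the conformal intervals (numbers from this thread)
n_samples = 156        # points in the shortest series
n_windows, h = 2, 52   # prediction intervals configuration
min_samples = 53       # what lag 52 needs to produce a non-null row

validation = n_windows * h                   # 104 samples reserved for CV
first_window_train = n_samples - validation  # 52 samples left to train on in the first window
required = validation + min_samples          # 157 samples needed to avoid nulls

print(first_window_train, required)  # 52 157 -> 52 < 53, so lag 52 is null there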
a
Oh ok, I got it! So it's the normal behavior and my model crashed because it can't handle NaN values in the first 1-step ahead cross-validation iteration.
So, if I really want to use the conformal predictions while keeping my lags of 52 or more (a very important feature, because it brings seasonality information to my model), what would be the best trick in your opinion? Add a fake first year of history for all short series just for calculating the lag features?
j
Yeah you could do that or use a pipeline that does some kind of imputing, e.g.
Copy code
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression()  # or any other model you're using
)
fcst = MLForecast(models=[model], ...)
that would fill the nulls with zeros before passing them to the model, which is easier to implement I think
One thing to keep in mind is that the predictions from the model are then used to update the target (because of the recursive strategy) so the results of the first window may not be as reliable
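Putting the pieces together, something like this should work (a sketch: df stands for your training frame, the lag config is abbreviated from earlier in the thread, and I'm assuming PredictionIntervals is importable from mlforecast.utils and the rolling functions come from window_ops in your version):
Copy code
from mlforecast import MLForecast
from mlforecast.utils import PredictionIntervals  # assumed import path
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from window_ops.rolling import rolling_mean, rolling_std  # assumed source of the rolling functions

# impute the nulls produced by the long lags before the model sees them
model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression(),
)
fcst = MLForecast(
    models=[model],
    freq='W',
    lags=list(range(1, 53)),
    lag_transforms={1: [(rolling_mean, 51), (rolling_std, 51)]},  # rest of the config omitted
)
# df is your training frame with unique_id / ds / y columns
fcst.fit(df, prediction_intervals=PredictionIntervals(n_windows=2, h=52))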
a
You’re right, in that case, for the first window I will have y=lag52_y and it could add a serious bias to the training. But I don’t see another (simple) way to deal with my problem right now :p
j
I was going to suggest using n_windows=1 but I see we don't allow it haha
🤣 2
To be honest it's more reliable to have one window than doing those kinds of hacks, so you could override our check with the following:
Copy code
class MyPredictionIntervals:
    def __init__(self, h, n_windows, method = 'conformal_distribution'):
        self.h = h
        self.n_windows = n_windows
        self.method = method
and then provide that to fit:
Copy code
fcst.fit(..., prediction_intervals=MyPredictionIntervals(n_windows=1, h=52))
I believe that'd work. Just don't tell anyone I'm suggesting this haha
a
Why not haha. But do you know exactly why this assertion for a minimum of 2 windows exists?
j
Yeah, it's for statistical rigor I suppose. Since we're computing quantiles based on the errors, just having one sample (window) isn't very reliable, but I believe it'd still provide a better estimate than providing fictional samples
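Very roughly, the idea is this (a simplified sketch of the conformal approach, not the library's actual code): the interval width is a quantile over the per-window CV errors, so with a single window there's almost nothing to take a quantile over:
Copy code
import numpy as np

# toy absolute CV errors, shape (n_windows, h)
errors = np.abs(np.random.default_rng(0).normal(size=(2, 52)))

# per-horizon-step half-width of e.g. a 90% interval: a quantile across the windows
half_width = np.quantile(errors, 0.9, axis=0)
print(half_width.shape)  # (52,)

# with n_windows=1 that "quantile" is just the single observed error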
a
Thanks! I'll try it today
Hi @José Morales, so I tested your "hack", using
prediction_intervals = MyPredictionIntervals(h=52, n_windows=1)
and I get the same error
Input X contains NaN
with the LinearRegression model. I also tried setting
h=1
, and I had the same result. I can't figure out what's going on in this CV...
Got it. The hack is working, but I sometimes get exploding (inf) forecasts during the CV on some specific series, and that results in NaNs during the lag calculation in the following recursive step 😕
j
which model are you using?
a
I was testing with a simple linear regression as baseline
j
Can you use one of the techniques here to inspect the inputs? It's weird for the linear regression to predict inf
a
I've already explored this with a debugging tool. On the 15th iteration out of 52 (in my case h=52), after generating the lags, the X input contains inf and NaN, which causes the execution to crash when sklearn checks the inputs. However, there's no problem with a dataset of similar size generated with
generate_series
... My dataset can contain series with quite sparse values and large peaks. From what I've seen, 6 series out of 32,000 are involved.
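For what it's worth, a rough way to shortlist them (hypothetical thresholds, just to flag sparse series with big peaks):
Copy code
stats = df.groupby('unique_id')['y'].agg(['max', 'median', lambda s: (s == 0).mean()])
stats.columns = ['max', 'median', 'zero_frac']
# huge peak relative to the typical level, or mostly zeros
suspects = stats[(stats['max'] > 50 * stats['median'].clip(lower=1)) | (stats['zero_frac'] > 0.5)]
print(len(suspects), 'suspicious series')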
j
Are you using target transformations?
a
Nope, only lags & lag_transforms
j
Which lag transforms?
If you're using the
%debug
magic in IPython you could check which columns contain the NaNs, that'd help a lot. Just running something like
X.isnull().sum()
should show the NaN counts by column
a
rolling_mean
&
rolling_std
j
They both divide by the window size, so I don't think they could divide by zero. So it's probably coming from the forecast. Are you able to see the forecast in the step before? If the linear regression coefficients are somewhat big and it gets one of the spikes, I guess it could produce an inf, but that seems a bit hard (they'd have to be really big values)
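One (hypothetical) way to check that, reusing the imputing pipeline and the df from earlier: fit it once on the preprocessed training matrix and look at the coefficient magnitudes:
Copy code
prep = fcst.preprocess(df, dropna=True)
X, y = prep.drop(columns=['unique_id', 'ds', 'y']), prep['y']
model.fit(X, y)
coefs = model.named_steps['linearregression'].coef_  # step name from make_pipeline
print(abs(coefs).max())  # very large coefficients plus a spike in a lag -> huge forecasts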
a
that’s exactly my conclusion, and that’s why I gave up and switched to LGBM 😛
j
haha ok. Let me know if that produces an error as well
a
The problem is that LGBM handles NaN values natively. I will double-check with XGBoost to be sure!
👍 1
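For reference, a sketch of the swap (LightGBM treats NaNs in the features as missing values by default; check how your XGBoost version handles missing values):
Copy code
from lightgbm import LGBMRegressor
from mlforecast import MLForecast

# no imputing pipeline needed: the NaNs from the long lags are handled as missing values
fcst = MLForecast(models=[LGBMRegressor()], freq='W', lags=list(range(1, 53)))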