# general
Andrew Doherty:
Hi everyone, I was wondering if there are plans to implement constant in-sample time series cross-validation (a sliding window rather than an expanding window)?
fede (nixtla):
hey @Andrew Doherty! The `input_size` parameter controls the length of the in-sample time series: https://nixtla.github.io/statsforecast/core.html#statsforecast.cross_validation. You can use it to perform cross-validation with sliding windows.
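A minimal sketch of what that call looks like; the data and argument values here are illustrative, not from the thread:

```python
import numpy as np
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import Naive

# Synthetic daily series in the long format statsforecast expects.
df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2023-01-01", periods=200, freq="D"),
    "y": np.random.rand(200),
})

sf = StatsForecast(models=[Naive()], freq="D")

# With input_size set, each validation window trains on a sliding block of
# the last `input_size` observations before the cutoff, instead of all
# history up to the cutoff (the expanding-window default).
cv_df = sf.cross_validation(
    df=df,
    h=7,            # forecast horizon per window
    n_windows=3,    # number of validation windows
    step_size=7,    # spacing between cutoffs
    input_size=56,  # fixed-length in-sample window
)
```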
Andrew Doherty:
Hey Fede, thanks a lot for getting back to me so quickly. Sorry I missed that!
Hi @fede (nixtla) (they/them), I am now using MLForecast and I was wondering whether there is a plan for the `input_size` argument to be implemented in `MLForecast.cross_validation` to enable a sliding window? Happy to raise an issue and contribute if possible.
fede (nixtla):
Hey @Andrew Doherty! `MLForecast` has the `keep_last_n` argument instead of `input_size` to perform sliding windows. We are working on standardizing argument names across the nixtlaverse. 🙂 Here’s the reference to the cross-validation method: https://nixtla.github.io/mlforecast/forecast.html#mlforecast.cross_validation
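A hedged sketch of the suggested call, on synthetic data. Note that José clarifies further down the thread that `keep_last_n` actually only affects the predict step, and `window_size` as the horizon argument follows the mlforecast version discussed here (it was later renamed):

```python
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression

# Synthetic hourly series in mlforecast's long format.
df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2023-01-01", periods=500, freq="H"),
    "y": np.random.rand(500),
})

mlf = MLForecast(models=[LinearRegression()], freq="H", lags=[1, 24])

cv_df = mlf.cross_validation(
    df,
    window_size=24,  # forecast horizon per validation window
    n_windows=3,
    keep_last_n=48,  # keep only the last 48 observations of each series
)
```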
Andrew Doherty:
Ah, sorry Fede. I thought that was dropping part of the forecast horizon as is done in some trading markets. Thanks for the clarification.
Hi again @fede (nixtla) (they/them). Hope you are well. I have been making good progress evaluating MLForecast for our production solution; however, I am having an issue using `keep_last_n` with `cross_validation`. In the electricity_peak_forecasting notebook, the code fails when using the `keep_last_n` argument if `keep_last_n < (Y_df.shape[0] - window_size)`. In the example, the minimum window that works is `keep_last_n = 6528`.
Am I doing something wrong?
I have carried out different tests and the problem remains with and without differencing/exogenous features.
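A hedged sketch of the failing pattern being described, with synthetic data standing in for the notebook's `Y_df` (sized so that `Y_df.shape[0] - window_size` equals the 6528 from the thread; the real notebook's shapes may differ):

```python
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression

# Stand-in for the notebook's Y_df (hourly load data in the original).
Y_df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2020-01-01", periods=6_552, freq="H"),
    "y": np.random.rand(6_552),
})
window_size = 24

mlf = MLForecast(models=[LinearRegression()], freq="H", lags=[24])

# Works when keep_last_n >= Y_df.shape[0] - window_size ...
mlf.cross_validation(Y_df, window_size=window_size, n_windows=1,
                     keep_last_n=Y_df.shape[0] - window_size)

# ... but reportedly fails for anything smaller, e.g.:
# mlf.cross_validation(Y_df, window_size=window_size, n_windows=1,
#                      keep_last_n=100)
```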
I have been continuing my investigation and have looked at whether this problem is only present when using `cross_validation`. I have just tried the `end_to_end_walkthrough.ipynb` notebook's Training and Forecasting sections, and the same error occurs when `differences=[24]` and `keep_last_n < 1008`. No error is raised if the `differences` argument is not used. This therefore looks like a problem with slicing the data when exogenous features/differences are present, and it is an issue both when using `fit`/`predict` and when using `cross_validation`.
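A similarly hedged sketch of the plain `fit`/`predict` variant just described (synthetic data; the `differences` argument follows the thread-era API, which later releases replaced with target transforms):

```python
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2021-01-01", periods=2_000, freq="H"),
    "y": np.random.rand(2_000),
})

mlf = MLForecast(
    models=[LinearRegression()],
    freq="H",
    lags=[24],
    differences=[24],  # seasonal differencing, as in the walkthrough
)

mlf.fit(df, keep_last_n=1008)  # reportedly fails once keep_last_n < 1008
preds = mlf.predict(24)
```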
Good afternoon @fede (nixtla) (they/them), I have done some digging and I think there might be two bugs when using `keep_last_n` in MLForecast. First, here `self.last_dates` is not correct when using `keep_last_n`. This results in null values for the exogenous features when there is a merge in `_get_features_for_next_step` here. I corrected this using a bit of a hack:

```python
# Rebuild last_dates from the final timestamp of the sorted dataframe.
self.last_dates = pd.DatetimeIndex([sorted_df.index.get_level_values(self.time_col)[-1]])
```
This appears to work for my use case, but I don't know the design of MLForecast well, so this might not be correct for other cases such as multiple `unique_id`s. Secondly, once this was fixed I noticed that the `X` and `y` used in `fit_models` here had all the data and not just the last n samples. I implemented the following hack before `return self.fit_models(X, y)`:

```python
# Train on only the last keep_last_n samples.
if keep_last_n is not None:
    X, y = X[-keep_last_n:], y[-keep_last_n:]
```
This is not the right place to fix it, as I think it should be done in `core.py`, but I just did this quickly to get some results. Do you have any thoughts? Happy to keep digging into the code if that helps, to raise an issue on GitHub, or to share this with someone else if you don't have time. Thanks again!
fede (nixtla):
hey @Andrew Doherty! Thank you for taking the time to dig into the problem. Please help us by raising an issue on GitHub to track the problem. :) @José Morales, do you have any thoughts about the issue?
Andrew Doherty:
No problem at all. I’ll raise an issue tomorrow morning. Let me know if I can help.
José Morales:
Hi. The `keep_last_n` argument is used only for predicting; it is meant to be an efficiency parameter for cases where you have very long series and your updates don't require all the history. For example, if your series are of length 10,000 and your features only require the last 50 days, then setting `keep_last_n=50` means that only the last 50 values of each series are kept and used to compute the updates. This is because in the updates the whole transformation is computed but only the last value is kept. I think it'd be better to add the `input_size` argument to do exactly the same as in statsforecast; I'll work on that and let you know when it's done.
Although those errors you're getting seem a bit odd: the `keep_last_n` argument should only impact the predict step, so I'm not sure why you get errors in the transform. I'd really appreciate it if you could open an issue with a minimal reproducible example.
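To illustrate José's point, a hedged sketch: when the features only look back a handful of steps (lag 1 plus a 7-step rolling mean here), `keep_last_n` can trim the history stored for predict-time updates without changing the forecasts. Data and values are illustrative:

```python
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression
from window_ops.rolling import rolling_mean

# A long series whose features only need the most recent values.
df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2000-01-01", periods=10_000, freq="D"),
    "y": np.random.rand(10_000),
})

mlf = MLForecast(
    models=[LinearRegression()],
    freq="D",
    lags=[1],
    lag_transforms={1: [(rolling_mean, 7)]},  # needs at most the last 8 values
)

# Only the last 50 values of each series are retained for computing the
# recursive feature updates during predict.
mlf.fit(df, keep_last_n=50)
preds = mlf.predict(14)
```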
Andrew Doherty:
Thanks a lot for this @José Morales, that makes sense. Regarding the `input_size` argument, thanks a lot for working on this, as it is really important for my current use case. Once the code is ready (even in a separate branch/fork), if you could let me know, I will start using it to test on my data.
I will create a minimal reproducible example and raise an issue later today. I'll need to have a think about this error based on my new understanding of what `keep_last_n` is doing, so it might be later this evening. I think it might only be occurring in the predict step. I'll tag you in the issue later.
fede (nixtla):
Thank you @José Morales for clarifying the actual behavior of `keep_last_n` and for including the new feature 🙂 Sorry for the misunderstanding @Andrew Doherty 🙌
Andrew Doherty:
Absolutely no problem @fede (nixtla) (they/them). Looking forward to Jose's work going live. 😄
José Morales:
hey @Andrew Doherty, we just merged the PR adding the `input_size` argument, so if you install from the main branch you should be able to use it. Please let us know how it goes. Keep in mind that the number you set there won't necessarily be the number of training samples per series, because series can be shorter than the window, and some rows will be dropped unless you set `dropna=False`.
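A hedged sketch of how the new argument might be used after installing from the main branch (e.g. `pip install git+https://github.com/Nixtla/mlforecast`; argument names follow the thread-era API):

```python
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2021-01-01", periods=5_000, freq="H"),
    "y": np.random.rand(5_000),
})

mlf = MLForecast(models=[LinearRegression()], freq="H", lags=[24])

cv_df = mlf.cross_validation(
    df,
    window_size=24,      # horizon per validation window
    n_windows=3,
    input_size=24 * 28,  # sliding in-sample window per cutoff
    dropna=False,        # keep rows that the lag features would otherwise drop
)
```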
Andrew Doherty:
Amazing José, thanks a lot for the very quick turnaround. I’ll install today and get back to you early next week.