# mlforecast
👋 Not the most exciting question, but what is the heuristic that determines the minimum number of samples needed to avoid the "series is too short for window" error in LightGBMCV? It would be useful to have this in the error message if possible, as this is arguably a more important piece of information than the specific series that are too short.
Hey. Each series must have more than n_windows * h samples. If you're using `dropna=True`, this has to be true after computing the lag features and dropping the nulls they produce, so for example if you're using lag 5 you need n_windows * h + 5 samples in each series. Does that clarify the behavior?
If you're unsure how many rows the `dropna` removes, you can create an MLForecast object with the same features and run `preprocess` on a single series.
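The rule above can be sketched as a small standalone check. Note this is a hypothetical helper, not part of the mlforecast API; it just encodes the `n_windows * h + max_lag` heuristic described in this thread:

```python
def min_required_samples(n_windows: int, h: int, max_lag: int = 0) -> int:
    """Smallest series length that avoids the 'series is too short' error,
    assuming dropna=True and the largest lag feature being `max_lag`."""
    return n_windows * h + max_lag


def too_short_series(lengths: dict, n_windows: int, h: int, max_lag: int = 0) -> dict:
    """Map each offending series id to the number of samples it is missing.

    `lengths` maps series id -> number of samples in that series.
    """
    required = min_required_samples(n_windows, h, max_lag)
    return {uid: required - n for uid, n in lengths.items() if n < required}


# Example: 2 CV windows, horizon 4, lag 5 -> each series needs 13 samples.
lengths = {"A": 52, "B": 12, "C": 1}
print(too_short_series(lengths, n_windows=2, h=4, max_lag=5))  # {'B': 1, 'C': 12}
```

Returning the missing-sample counts like this is also roughly the QA output discussed below.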
yeah, makes sense 👍
We've been thinking of adding some kind of QA functions. What would be most helpful for you in this case? We'd thought of returning the ids of the series that aren't long enough. Would including the number of samples they're missing be helpful?
That would work. Getting the IDs to drop as a pre-CV step would be useful, or perhaps having the CV procedure exclude them (with a warning) as part of the fitting. Or simply having the error say something like "The following series have fewer than the required N samples: [...]" would be enough, I think.
👋 Revisiting this, but are there any recommended strategies for dealing with portfolios of time series of different lengths? In a simplified case, let's say I am a retailer with three products: one that's been on sale for 52 weeks, another that launched 12 weeks ago, and another that launched 1 week ago. One of the benefits of a global model is that you can learn expected behaviours for the newer product from the older ones (overcoming the cold start problem), but at the moment if I try to cross-validate in this scenario I get a "series too short" error.
What would you like to happen in this case? Drop the series that are too short and just CV with the rest?
👍 1
Yeah, I think that would be the desired behaviour. Throw a warning at each step, etc., but ultimately don't prevent the evaluation.
That's how I've handled it when I've built a similar model in the past anyway, and it works. Ultimately the objective of a back-test (with refit) is that I want to see how the model config would have behaved at that point in time, with the information known at that point. A product that launched last week shouldn't prevent a CV fit on data from 12 weeks ago, because 12 weeks ago that product didn't exist. I can probably work around it with a `for` loop for now, but it would be nice if it were implicit. Being able to handle cold-start forecasting would be a use case worth noting on the website with an explicit example, too.
👍 1