# mlforecast
j
More questions (sorry) -- what's the recommended way of handling categorical or boolean exogenous variables? It's not clear from the mlforecast docs whether these count as static or not. Let's say I'm using LightGBM, which has native support for categorical features, and I have a boolean column called "is_missing" that flags whether the data is missing/imputed, and a column called "holidays" that holds the name of the holiday occurring on a given date. As their `dtype` is `category`, I have to put them in as static features. Do I need to handle this differently, or is that correct?
j
That's correct. Since they don't change over time they're static, and since you set the dtype to category, LightGBM will treat them as categorical when training.
j
On the static vs dynamic, it makes sense for something like "store_id" to be static because a store is a store. Repeating a static value is the only way to handle it. But for dynamic features (where my understanding of a dynamic feature is that it's something that varies over time or by date but is not related to the unique_id) wouldn't missingness and holidays be dynamic? A given date can be a holiday or not, and the date can change by year, or it might be missing or not. The convention isn't clear to me.
j
The definition is the following:
• Static: same value for a single id across all timestamps
• Dynamic: more than one unique value for a single id
If you have dynamic features they can also be categorical; the only difference is that you have to provide them through `X_df` when using predict, whereas static features are just repeated automatically for you. Does that make sense?
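The definition above can be checked mechanically. This is a minimal pandas sketch, not part of mlforecast: the helper `split_static_dynamic` is hypothetical, and the column names `unique_id`, `ds`, `y`, and `store_type` are assumed for illustration. It counts the distinct values of each feature per id and calls a feature static only if every id has exactly one.

```python
import pandas as pd


def split_static_dynamic(df, id_col="unique_id", time_col="ds", target_col="y"):
    """Classify feature columns as static or dynamic (hypothetical helper)."""
    feature_cols = [c for c in df.columns if c not in (id_col, time_col, target_col)]
    # Number of distinct values each feature takes within each id.
    n_unique = df.groupby(id_col, observed=True)[feature_cols].nunique(dropna=False)
    # Static: a single value per id across all timestamps.
    static = [c for c in feature_cols if (n_unique[c] <= 1).all()]
    # Dynamic: more than one unique value for at least one id.
    dynamic = [c for c in feature_cols if c not in static]
    return static, dynamic


# Toy data: two series, three dates each (all values are made up).
df = pd.DataFrame(
    {
        "unique_id": ["a", "a", "a", "b", "b", "b"],
        "ds": pd.to_datetime(["2024-12-24", "2024-12-25", "2024-12-26"] * 2),
        "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "store_type": pd.Categorical(["mall"] * 3 + ["street"] * 3),
        "holidays": pd.Categorical(["none", "christmas", "none"] * 2),
        "is_missing": [False, False, True, False, False, False],
    }
)

static, dynamic = split_static_dynamic(df)
print(static)   # columns that are constant within every id
print(dynamic)  # columns that vary within at least one id
```

By this definition "holidays" and "is_missing" come out as dynamic even though both are categorical or boolean; the dtype plays no role in the classification.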
j
So here, a holiday is dynamic, right? It can be 1 or 0, or many values
j
Yes
j
If that's the case, how should it be passed to training? I get "can't cast to {integer or float}" errors if I don't put it as a "static feature".
j
Do those errors come from the model?
j
I don't think so. It's lightgbm, so it's quite happy with categories. And if they're set as static it's fine, but if they're not then I get an error.
I encounter the problem when including a categorical column (`category` type) in an input df, but not setting it as static. It should be relatively easy to reproduce with a basic lightgbm model.
j
Can you paste the stacktrace of the error?
j
I'll reproduce it tomorrow
I figured this out. When doing a train vs test comparison I need to pass the `test` dataset (without the static columns) to the predict method. This wasn't clear from the error message at all.
The error message is raised as
-> 6178 raise KeyError(f"{not_found} not in index")
KeyError: "['holidays'] not in index"
The error could be improved to explicitly state that this is related to the `X_df` set.
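For anyone hitting the same KeyError, here is a minimal pandas sketch of building that frame from a held-out test set. The helper `make_x_df`, the column names, and the `fcst` object in the final comment are assumptions for illustration; only the `X_df` argument of predict comes from this thread.

```python
import pandas as pd


def make_x_df(test, static_features, target_col="y"):
    """Keep id, timestamp, and dynamic features; drop target and static columns."""
    to_drop = [c for c in [target_col, *static_features] if c in test.columns]
    return test.drop(columns=to_drop)


# Toy test set covering a two-step horizon for one series (values are made up).
test = pd.DataFrame(
    {
        "unique_id": ["a", "a"],
        "ds": pd.to_datetime(["2024-12-27", "2024-12-28"]),
        "y": [7.0, 8.0],
        "store_type": pd.Categorical(["mall", "mall"]),
        "holidays": pd.Categorical(["none", "none"]),
        "is_missing": [False, False],
    }
)

X_df = make_x_df(test, static_features=["store_type"])
print(X_df.columns.tolist())

# Assumed usage, with `fcst` being an already fitted forecaster:
# preds = fcst.predict(h=2, X_df=X_df)
```

The dynamic columns in this frame need to cover every id and every timestamp in the forecast horizon, since their future values cannot be repeated automatically the way static ones are.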
j
Thanks for the feedback. Were you not providing `X_df` at all or was it just missing that column?
j
not provided at all, which makes sense in hindsight but puzzled me for a while.
j
I think we can add some more errors in there, I'll work on that