# general
Hi everyone! I'm a data scientist working in retail modeling, and I've been learning Nixtla recently. It seems incredibly useful and well made! Definitely going to make my life easier! That said, I did have a few questions, specifically pertaining to the MLForecast library:

1. Can I save just the fitted model somehow? From what I gather, the fitted model object holds not only the model and transformation parameters but also the entire data series used for training. This is obviously a problem if the datasets are gigabytes and I train multiple models on different sections. Is setting `keep_last_n` to 0 and then feeding in the entire series with `new_df` in the predict step the right way to solve this? Relatedly, how do I access the fitted booster itself to run `LightGBM.plot_importance()`?
2. What happens if the `new_df` passed at prediction time is discontinuous with the series seen at training time? Or if the training series itself has discontinuities? Are zeros filled into the gaps, or NAs?
3. Can I give a forecast date, e.g. forecast all series from a specific date? It seems like forecasts are built individually from the end of each series, even when different series have different last points.
4. Can I derive features based on columns other than the target? For example, if I have a `Price` column, I might also want `Price_lag1`, `Price_std`, `Price_cumrel`, etc. Do I need to calculate these manually?
5. Does any feature processing happen under the hood, e.g. removing a feature if all its values are the same, normalizing a feature, etc.?
6. How are the identifier, target, and date columns passed to LightGBM? Does `target_col` in `MLForecast.fit` overwrite `label_col` in the `LightGBM_Parameters` object? If the date and identifier columns aren't passed to LightGBM at all, does that affect column numbering (e.g. if I indicate in `LightGBM_Parameters` that the weight column is column 4)?
7. What's the difference between `X_df` and `dynamic_dfs` in `MLForecast.predict`?
Hey, thanks for using mlforecast.

1. Many things are saved from the fit step because the initial use case was forecasting the same series (the `new_df` argument was added later). If you're not going to forecast the same series you can safely delete some attributes before saving; off the top of my head, you can probably delete all of the following: `ga`, `_ga`, `static_features_`, `uids`, `last_dates`, `restore_idxs`. The fitted models are saved in the `models_` attribute (a dict keyed by model name), so something like `fcst.models_['LGBMRegressor']` would give you the trained booster object.
2. The assumption is that the training series are complete, so if you have a gap the lags etc. will be wrong. You may find the `fill_gaps` function useful: it produces the full panel, which you can then fill with any method you want.
3. All forecasts start from the respective end of each series, because the assumption is that those are the latest values you have and you want to forecast ahead. If you've seen new values of a series you can use the update method as described here.
4. Not yet; this is something we have on our roadmap, but at the moment you have to do it manually. This issue has a good way of doing it right now.
5. There isn't any hidden processing; we prefer not to give surprises, so you have to be explicit about what you want to happen. If you want to perform scaling, for example, you can use `target_transforms` (guide).
6. The column identifiers are mainly used to perform the feature engineering. The id and dates aren't passed to LightGBM, but you can pass the id if you specify it in the `static_features` argument, and if your dates are integers you can specify a `date_feature` that is just the identity. If you're wondering in which order the features are going to be passed, you can check the order of the features after calling fit. You can also perform just the feature engineering with `preprocess`, then train the model manually in any way you like and then assign it to the `models_` attribute, or use the `fit_models` method like here.
7. They serve the same purpose. The `dynamic_dfs` argument is legacy and will be removed soon; it was meant to save memory by not having repeated values in a single dataframe, but `X_df` is easier to reason about and makes the predict step faster.
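To make the save-only-the-model idea in point 1 concrete, here is a minimal, library-free sketch of the pattern (delete the heavy training-data attributes, then pickle). The class below is a hypothetical stand-in for a fitted MLForecast object, and the attribute names come straight from the list above, so verify they actually exist on your installed version before deleting:

```python
import pickle

# Hypothetical stand-in for a fitted forecaster; in practice this would
# be your fitted MLForecast instance.
class FittedForecaster:
    def __init__(self):
        self.models_ = {"LGBMRegressor": "trained-model"}  # kept: needed to predict
        self.ga = list(range(1_000_000))                   # heavy training data

fcst = FittedForecaster()

# Drop the heavy training-data attributes before serializing. `hasattr`
# guards against attributes that don't exist on your version.
for attr in ["ga", "_ga", "static_features_", "uids", "last_dates", "restore_idxs"]:
    if hasattr(fcst, attr):
        delattr(fcst, attr)

payload = pickle.dumps(fcst)
restored = pickle.loads(payload)
# The fitted models survive; you would then pass the full history
# via `new_df` at predict time.
```

The trade-off is exactly the one described above: the slimmed-down object can no longer forecast "its own" series, so every predict call needs the history supplied explicitly.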
Hey, thanks so much for the prompt reply! To clarify on question 2: the lags will be calculated based on the nearest previous timestamps? So I would get

| ds | y | lag1 |
|----|---|------|
| 1  | 4 | NaN  |
| 2  | 5 | 4    |
| 4  | 6 | 5    |

Is that right? Also, will the series be ordered by (unique_id, ds) if it isn't initially?
That's right, the last lag will be wrong. Yes, they're ordered by id and date first, although `preprocess` returns the dataframe in the same order you passed it (the ordering is just done internally).
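The "last lag will be wrong" behaviour is easy to reproduce with a plain pandas shift, which is effectively what a positional lag does on a series assumed to be contiguous (a sketch of the behaviour, not the library's actual implementation):

```python
import pandas as pd

# A series with a gap: ds jumps from 2 to 4 (ds = 3 is missing).
df = pd.DataFrame({"ds": [1, 2, 4], "y": [4, 5, 6]})

# The lag is taken from the previous row, not from ds - 1, so the
# row at ds = 4 receives the value observed at ds = 2.
df["lag1"] = df["y"].shift(1)
```

Running `fill_gaps` first (and filling the inserted NaNs) is what makes the positional lag line up with the timestamps again.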
Ok, last question (and perhaps not directly Nixtla related, but I can't find it elsewhere): it seems like referring to the columns by name will be safer than trying to do it by number, and I'm struggling to figure out how to do that in Python. According to the LightGBM documentation: *add a prefix `name:` for column name, e.g. `categorical_feature=name:c1,c2,c3` means c1, c2 and c3 are categorical features*. How does this look in Python? Should I have something like: lgb_params = { 'categorical_feature': 'names: "col1", "col2"' }
No worries. That documentation refers to LightGBM in general; for Python you should look at the Python API section. For `LGBMRegressor` it should be a list of ints (indices) or a list of str (feature names). So you should use something like:
categorical_feature=['col1', 'col2']
Btw, if it's `'auto'` (the default) it will use the data types of the columns, and if they're categorical it will set them automatically.
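As a follow-up to the `'auto'` point: since LightGBM's sklearn interface treats pandas `category` columns as categorical by default, a robust name-based approach is to cast the columns before fitting. A short sketch (the column names are placeholders, and the LGBMRegressor call is only mentioned in a comment to keep this self-contained):

```python
import pandas as pd

# Cast the categorical columns to the pandas `category` dtype so that
# categorical_feature='auto' picks them up by dtype rather than position.
X = pd.DataFrame(
    {"col1": ["a", "b", "a"], "col2": ["x", "x", "y"], "num": [1.0, 2.0, 3.0]}
)
for col in ["col1", "col2"]:
    X[col] = X[col].astype("category")

# X is now ready for e.g. LGBMRegressor().fit(X, y) with no explicit
# categorical_feature argument needed.
```

This sidesteps column-numbering questions entirely: the dtype travels with the column name, regardless of where it ends up in the feature matrix.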