# mlforecast
b
Hi, are the Auto formulas random? Sometimes when I run the code, I get different top configurations without changing anything else.
j
what do you mean by top configurations? tuning is always non-deterministic, meaning an algorithm (probably some bayesian search) is trying to optimize some error (or reduce some loss). do you mean this or do you mean something else?
b
When I mentioned "top configurations" I meant the best-performing hyperparameter sets found by AutoMLForecast. Sometimes when I run the function I get different results even though nothing else changed.
j
this is what i explained. the search is always slightly different. it is not deterministic and it should be like this. maybe dont focus too much on the exact params but more on the final outcome. if you allow tuning for long enough the final forecasts should be at least very similar (hopefully 🙂 )
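if you really need two runs to give identical results (for debugging for example), you could try pinning the seed of the optuna sampler. a minimal sketch, and note that study_kwargs being forwarded to optuna is my assumption here, so check the fit signature of your mlforecast version:

```python
import optuna
from mlforecast.auto import AutoMLForecast, AutoLightGBM

auto_mlf = AutoMLForecast(
    models={'lgb': AutoLightGBM()},
    freq="ME",
    season_length=12,
)
auto_mlf.fit(
    df=df_encoded,   # same dataframe you already use
    n_windows=n_windows,
    h=h,
    num_samples=60,
    # assumption: study_kwargs is passed through to optuna.create_study,
    # so a seeded sampler makes the search (and the best config) reproducible
    study_kwargs={'sampler': optuna.samplers.TPESampler(seed=0)},
)
```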
b
thanks! I focus on the final outcomes, each time the results change and sometimes I end up with really bad performance even though nothing changed. that is why i am asking how to mitigate this. currently i have a sample of 60, not sure if that is too small.
i just tried with number of samples 120 and it is performing even worse for all models
j
can you give some code and some background? normally this means your tuning settings are not optimal: not tuning long enough, or not optimal search space. but tbh sample of 60 is not much for ML. especially if you are using complex algorithms like boosted trees
b
So currently I am trying to develop different ML models to compare to a local ARIMA model. It is a hierarchical dataset, so each unique_id has somewhat different dynamics. The code I am running is this:

```python
### MLForecast
auto_mlf = AutoMLForecast(
    freq="ME",
    season_length=12,
    models={
        'lasso': AutoLasso(),
        'lgb': AutoLightGBM(),
        'ridge': AutoRidge(),
        'xgb': AutoXGBoost()  # got stuck with rf, so removed it for the full aggregation
    },
    fit_config=lambda trial: {'static_features': ['unique_id']}
)

auto_mlf.fit(
    df=df_encoded,
    n_windows=n_windows,
    h=h,
    step_size=step_size,
    fitted=True,
    num_samples=120
)
```
```python
model_configs = {
    "lgb": auto_mlf.results_['lgb'].best_trial.user_attrs['config'],
    "ridge": auto_mlf.results_['ridge'].best_trial.user_attrs['config'],
    "xgb": auto_mlf.results_['xgb'].best_trial.user_attrs['config'],
    "lasso": auto_mlf.results_['lasso'].best_trial.user_attrs['config']
}

# Store cross-validation results in a dictionary
cv_results = {}
for model_name, config in model_configs.items():
    print(f"Running CV for {model_name}...")

    # Select the correct model type
    if model_name == "lgb":
        model = LGBMRegressor(**config['model_params'])
    elif model_name == "ridge":
        model = Ridge(**config['model_params'])
    elif model_name == "xgb":
        model = XGBRegressor(**config['model_params'])
    elif model_name == "lasso":
        model = Lasso(**config['model_params'])
    else:
        print(f"Skipping unknown model: {model_name}")
        continue

    # Get MLForecast parameters if available
    mlf_init_params = config.get("mlf_init_params", {})

    # Initialize MLForecast for the current model
    fcst = MLForecast(
        models=[model],
        freq="ME",
        **mlf_init_params
    )

    # Perform cross-validation
    cv_result = fcst.cross_validation(
        df_encoded,
        n_windows=n_windows,
        h=h,
        step_size=step_size,
        static_features=['unique_id']
    )

    # Rename the model's prediction column to avoid overwriting
    model_pred_col = f"{model_name}_Pred"
    cv_result = cv_result.rename(columns={model_name: model_pred_col})

    # Store results
    cv_results[model_name] = cv_result

# Merge all results into a single DataFrame
final_cv_results = list(cv_results.values())[0]
for df in list(cv_results.values())[1:]:
    final_cv_results = final_cv_results.merge(df, on=["unique_id", "ds", "cutoff", "y"], how="left")
```
My LGBM in particular is performing very badly, especially when I make my search space bigger.
So at first I fit the model and then perform CV with the best config
j
what type of features do you use? am i blind or do you not use calendar and lag features?
b
df_encoded has exogenous variables and date features, lag features are handled by automl
i just tried n_samples of 300 and all models are performing worse in comparison to 60
j
this is the beauty of machine learning 🙂 often it is also relevant to include some more specific features. for example if you have very seasonal data, it might make sense to include specific target differences. for example, for weekly data use difference(52) or for monthly data difference(12). and if you have only a few data points (not long series and not many unique_ids) it might very well be that some simple model like ridge performs better than xgb or lightgbm. what does your data look like? is it on article level or is it more aggregated, like stores or countries?
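in mlforecast the seasonal differencing can be done with a target transform. a minimal sketch, assuming your AutoMLForecast version accepts an init_config callable that returns MLForecast constructor arguments (check that name in your version):

```python
from mlforecast.auto import AutoMLForecast, AutoLightGBM
from mlforecast.target_transforms import Differences

auto_mlf = AutoMLForecast(
    models={'lgb': AutoLightGBM()},
    freq="ME",
    season_length=12,
    # assumption: init_config returns the MLForecast constructor arguments,
    # here applying a seasonal difference of 12 before fitting (monthly data)
    init_config=lambda trial: {'target_transforms': [Differences([12])]},
    fit_config=lambda trial: {'static_features': ['unique_id']},
)
```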
b
So I have included seasonality through Fourier terms; there are 975 unique ids and a lot of them have different dynamics. It is a large dataset, monthly data from 2016-2024 per unique id. I have included macroeconomic factors as well as unique-id-specific factors through one-hot encoding. So for example one unique id would be male / age 18-15 / lives in x. The aim is to predict welfare need for each unique id and then aggregate to a total level. I am including XGB, LGB, Ridge and Lasso. I have not performed feature selection and just use one global model, as that is what sets ML apart from StatsForecast. Currently I notice that the models perform better with a smaller num_samples. The largest I have tried is 300.
j
so num_samples should just be the number of iterations your tuning algorithm is using during the search. honestly it is hard to give smart advice from far away. often ml models need smart features and careful model building. also i dont really see your model performance. do you think you get overall bad performance? how have you tested this? do you have a hold-out set where you make forecasts and then compare true vs forecast?

one more point on the hierarchical nature. there is actually a smart way to use hierarchical structures and nixtla is offering hierarchicalforecast. the idea is to forecast on all levels (from highest level to lowest level) and then some hierarchical methods reconcile your values so they align overall. and often forecasting on higher levels is easier. it could very well be that forecasting on person level is pretty hard because you are missing important information, while forecasting on aggregated levels could be easier.

maybe this is too much for you now, but at the end you have 2 options: try to make your ml forecast really good through feature engineering and tuning, or use other methods like hierarchical forecasting to improve some weaker lower-level forecasts. maybe this helps a bit. otherwise i would need more information on what is not working well or what the model is missing atm performance wise.
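to make the reconciliation idea a bit more concrete, a very rough sketch with hierarchicalforecast. the hierarchy spec and dataframe names here are placeholders for your own data and base forecasts, so double check the exact api of the version you install:

```python
from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import BottomUp
from hierarchicalforecast.utils import aggregate

# hypothetical hierarchy: a constant 'total' column plus e.g. a 'region' column
# that you add to your long dataframe, with unique_id at the bottom level
spec = [['total'], ['total', 'region'], ['total', 'region', 'unique_id']]

# df_long: one row per (series, ds) with the hierarchy columns and y
Y_df, S_df, tags = aggregate(df_long, spec)

# Y_hat_df: base forecasts for every aggregated series (e.g. from MLForecast.predict)
hrec = HierarchicalReconciliation(reconcilers=[BottomUp()])
Y_rec_df = hrec.reconcile(Y_hat_df=Y_hat_df, S=S_df, tags=tags)
```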
b
My current model performance based on the validation set across unique ids at the lowest level is this: MAE_LGBMRegressor 2244.0, MAE_XGBRegressor 1914.0, MAE_Ridge 2276.0, MAE_Lasso 1950.0. My baseline is an ARIMA model which has an MAE of 1100. Eventually I will indeed use hierarchical forecasting to reconcile the different levels, but I want to make sure my model forecasts reasonably at the lowest level. I currently have 38 variables. The biggest difference between ARIMA and ML is that the ARIMA models are trained locally (so a separate model per unique id), while ML on the other hand is global, so we use one model and apply it to all unique ids. Also I have not performed feature selection; I thought ML models perform feature selection inherently.
So the MAEs I have provided are aggregated across all unique ids at the lowest level.
I am also quite confused as to how my models perform better with fewer iterations during tuning. Or should I increase it to a very big number (1000)?
j
first thing i see: ml models use a different loss function unless you changed it to mae. so they would minimize mse. when you compare on an error metric that is not the one being optimized, results might be very different. in terms of loss function you could change the objective function to tweedie loss. this sometimes gets you better results, especially if your data is skewed. so try to change your loss functions (mae, mse, tweedie) and compare different metrics on your test set (if you have one), or also check these metrics during cv. maybe the results are not so obvious anymore if you look at other metrics too.

do you have more categorical features? ml models like to get some more information about the data (i think you mentioned some like gender etc, right?). another thing you can try is to replace ohe with something more advanced like a target encoder or catboost encoder, or directly use catboost. and when you build your ml model, dont throw everything in there from the beginning. start small, check performance and only make it more complex if it helps. you are right, tree based models can do some kind of feature selection, but they can also be confused by too much information.
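a rough sketch of both ideas, tuning against mae instead of the default metric and giving lightgbm a tweedie objective. the loss argument, AutoModel and the 'model' column name inside the custom loss are assumptions to verify against your mlforecast version:

```python
import lightgbm as lgb
from mlforecast.auto import AutoMLForecast, AutoModel
from utilsforecast.losses import mae

# lightgbm with a tweedie objective; the config callable defines the search space
tweedie_lgb = AutoModel(
    model=lgb.LGBMRegressor(objective='tweedie'),
    config=lambda trial: {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 8, 128, log=True),
    },
)

# custom tuning loss: average MAE over the trial's CV predictions
# assumption: inside the auto objective the prediction column is called 'model'
def mae_loss(cv_df, train_df):
    return mae(cv_df, models=['model'])['model'].mean()

auto_mlf = AutoMLForecast(
    models={'lgb_tweedie': tweedie_lgb},
    freq="ME",
    season_length=12,
    fit_config=lambda trial: {'static_features': ['unique_id']},
)
auto_mlf.fit(
    df=df_encoded,
    n_windows=n_windows,
    h=h,
    num_samples=60,
    loss=mae_loss,  # assumption: fit exposes a custom loss callable
)
```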
b
In my case MAE is most relevant; I am not sure how I can change this from MSE to MAE as I just fitted the models and cross validated as per the tutorials. During cross validation, I only check MAE. I have not manually set any loss functions. I will have a look at other metrics as well. I have only numerical features; most models apart from LGBM did not accept categorical features. I have added gender etc. as boolean variables, so a column male that is 1 if male. How can I perform feature selection for these MLForecast models? I believe there should be something to view feature importances, but I have been unsuccessful in acquiring that.
j
check tweedie loss for sure, it is very simple to change your objective function normally. what do you mean they did not accept categorical features? you have to convert them to numerical features with ohe or something like that
b
Yes that is what I did. I changed it to numerical features. Is it possible to change the objective function in the AutoML function?
But I think there might be too many features. I was told by the people from NIXTLA that the models do not perform feature selection, that is also why I am kind of confused.
j
No real feature selection. But the splitting in tree based models is doing this a bit by deciding how to split and how to use the features in the end. This is less strict than in an ols for example
b
Then is there a way to see feature importance given by the models that you are aware of?
j
there is a package for feature importance. this is copy paste from chatgpt. something like this should work. maybe you need to play around a bit to make it run:
```python
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

# Suppose you extract the best model (or any fitted model) from MLForecast.
# Note: best_model_ may not exist in your mlforecast version; point this at
# however you access the fitted estimator (e.g. the model you refit in your CV loop).
best_model = auto_mlf.best_model_

# Prepare your validation data in the format expected by the model: it must contain
# exactly the engineered features (lags, date features, exogenous variables) it was
# trained on, plus the target column y.
X_val = validation_df.drop(columns=['y'])
y_val = validation_df['y']

# Compute permutation importance. Choose an appropriate scoring metric
# (e.g. neg_mean_absolute_error, so the drop reflects the impact on MAE).
result = permutation_importance(
    best_model,
    X_val,
    y_val,
    scoring='neg_mean_absolute_error',
    n_repeats=10,
    random_state=42
)

# Plot the mean importance per feature with its standard deviation.
plt.figure(figsize=(10, 6))
plt.bar(range(len(result.importances_mean)), result.importances_mean, yerr=result.importances_std)
plt.xticks(range(len(result.importances_mean)), X_val.columns, rotation=45)
plt.ylabel("Decrease in MAE")
plt.title("Permutation Feature Importance")
plt.show()
```
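for the tree models you can also look at the built-in importances of the boosters themselves. again a rough sketch: that auto_mlf.models_ holds one fitted MLForecast per model name and that preprocess reproduces the training feature matrix are assumptions, so check this against your version:

```python
import pandas as pd

# assumption: after fit, auto_mlf.models_ maps each name ('lgb', 'xgb', ...) to a
# fitted MLForecast object; adjust if your version stores the results differently
mlf = auto_mlf.models_['lgb']

# rebuild the feature matrix (lags, date features, exogenous columns) used for training
prep = mlf.preprocess(df_encoded, static_features=['unique_id'])
feature_cols = [c for c in prep.columns if c not in ('unique_id', 'ds', 'y')]

# the fitted sklearn-style estimators live in MLForecast.models_ after fitting
for name, model in mlf.models_.items():
    if not hasattr(model, 'feature_importances_'):
        continue
    imp = model.feature_importances_
    if len(imp) == len(feature_cols):
        s = pd.Series(imp, index=feature_cols)
    else:
        s = pd.Series(imp)  # fall back to positional indices if the columns don't line up
    print(name)
    print(s.sort_values(ascending=False).head(15))
```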