# general
r
Question: Scikit-learn's RandomForestRegressor and some others support continuous multi-output predictions, essentially allowing one to predict something of shape (n_samples, n_outputs). Could this be used to produce multi-horizon forecasts instead of the current recursive and multi-model (one model per horizon) strategies? @Nixtla Team
j
Is this question about a specific library?
r
Yes. Say we want to use scikit-learn's RandomForestRegressor, for example, in the model:
```
from sklearn.ensemble import RandomForestRegressor
from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, RollingMean

fcst = MLForecast(models=[RandomForestRegressor()], freq='D', lags=[7, 14],
                  lag_transforms={1: [ExpandingMean()], 7: [RollingMean(window_size=28)]},
                  date_features=['dayofweek'], num_threads=1)

# fit the model
fcst.fit(series)
```
And then use the fitted model to generate predictions:
```
predictions = fcst.predict(14)
predictions
```
RandomForestRegressor offers a way to output an array instead of a single 'y' value for each observation. I was wondering if that could be used instead of the current two alternatives (recursive and multiple models) to generate forecasts. That way, a single model could generate outputs for the whole forecast horizon and there would be no compounding of errors. Let me know if that makes sense.
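A minimal sketch of that native multi-output path, with made-up data (the shapes and variables here are illustrative, not MLForecast's actual features):
```
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # n_samples x n_features
Y = rng.normal(size=(100, 14))  # n_samples x n_outputs, e.g. one column per horizon

# a single forest fitted on a 2D target, no wrapper needed
model = RandomForestRegressor(random_state=0).fit(X, Y)
print(model.predict(X[:3]).shape)  # (3, 14): all horizons from one model
```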
j
scikit-learn's MultiOutputRegressor internally fits one estimator per target, which is what we already do with the max_horizon setting, so there would be no difference.
Here's the relevant part in their docs:
"This strategy consists of fitting one regressor per target."
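For reference, a small sketch of what that wrapper does, on made-up data; the point is the one-estimator-per-target behavior:
```
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = rng.normal(size=(100, 14))

wrapped = MultiOutputRegressor(Ridge()).fit(X, Y)
print(len(wrapped.estimators_))  # 14: one fitted Ridge clone per target column
```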
r
Thanks for sharing this. My understanding is that RandomForestRegressor and some other estimators natively support multi-output regression. That's not the same as the MultiOutputRegressor wrapper, which indeed fits a separate model per output. There is, in fact, a comparison between the two approaches: one RandomForestRegressor predicting multiple outputs natively, and another wrapped inside multioutput.MultiOutputRegressor(). The two lead to different results. Link here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_regression_multioutput.html
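A quick way to see the difference on synthetic data (a sketch along the lines of the linked example):
```
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 3))

native = RandomForestRegressor(random_state=0).fit(X, Y)
wrapped = MultiOutputRegressor(RandomForestRegressor(random_state=0)).fit(X, Y)

# native trees split on all targets jointly, so the predictions differ
print(np.allclose(native.predict(X), wrapped.predict(X)))  # expect False
```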
j
That's a strange example to use, since random forest has inherent randomness, and it looks like the multi-output and native approaches differ only because of that. The only reason to prefer the native implementation seems to be speed. Running that example in JupyterLite with Ridge instead, I get the same results with both approaches.
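The same check with Ridge (a sketch): Ridge solves each target independently, so both paths should agree to numerical precision:
```
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 3))

native = Ridge().fit(X, Y)
wrapped = MultiOutputRegressor(Ridge()).fit(X, Y)
print(np.allclose(native.predict(X), wrapped.predict(X)))  # True: no structure shared across targets
```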
r
There is also this paper that goes deeper into multi-output regression trees: https://arxiv.org/pdf/2201.05340
But I will investigate it further.
j
Can you try it and see if it improves your forecasting error? You can use preprocess to generate the targets for the multi-horizon case, manually train the model with that, use an approach like the following to generate the features for the next step, and compare against the approach using max_horizon:
```
with fcst.ts._maybe_subset(None), fcst.ts._backup():
    fcst.ts._predict_setup()
    next_feats = fcst.ts._get_features_for_next_step()
```
I'm OK with implementing it; I just want to see whether it really produces different (better) results.
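A rough sketch of that experiment, assuming the series DataFrame from earlier and an mlforecast version whose preprocess accepts max_horizon and return_X_y (the fcst.ts._* calls are the private helpers quoted above and may change between releases):
```
from sklearn.ensemble import RandomForestRegressor
from mlforecast import MLForecast

h = 14
fcst = MLForecast(models=[], freq='D', lags=[7, 14], date_features=['dayofweek'])

# 1. multi-horizon targets: y has shape (n_samples, h)
X, y = fcst.preprocess(series, max_horizon=h, return_X_y=True)

# 2. one native multi-output model instead of h separate ones
model = RandomForestRegressor(random_state=0).fit(X, y)

# 3. features for the next step, via the internal helpers above
with fcst.ts._maybe_subset(None), fcst.ts._backup():
    fcst.ts._predict_setup()
    next_feats = fcst.ts._get_features_for_next_step()

# 4. a single predict call covers the whole horizon; compare its error
#    against fit(series, max_horizon=h) followed by predict(h)
multi_preds = model.predict(next_feats)
```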
r
will try it out