# mlforecast
i
```python
from sklearn.ensemble import RandomForestRegressor
from data_preparation import Preparation
import pandas as pd
from mlforecast import MLForecast
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_percentage_error
from missing_timestamps import remove_duplicates

# marvin
data = Preparation(r'/home/ieftimska/operato-meteo-1/data/MAS_processed/ELES-MAS-5001.csv.gz', "AMBIENT_TEMPERATURE")
#data = Preparation(r'/home/iva/Desktop/operato-meteo-1/data/MAS_processed/ELES-MAS-5001.csv.gz', "AMBIENT_TEMPERATURE")

# split and clean the series
train, test = data.split()
train_processed = remove_duplicates(train)
#train_processed_ = train_processed["AMBIENT_TEMPERATURE"].copy().squeeze()
train_processed_ = train_processed.asfreq("5T").fillna(method="ffill")
test_processed = remove_duplicates(test)
test_processed_ = test_processed["AMBIENT_TEMPERATURE"].copy().squeeze()

# build the long-format frame mlforecast expects (unique_id, ds, y)
dataset_train = pd.DataFrame()
dataset_train["ds"] = train_processed.index
dataset_train["y"] = train_processed.values
dataset_train["unique_id"] = "AMBIENT_TEMPERATURE"

models = RandomForestRegressor(random_state=123, n_jobs=-1, max_depth=10, n_estimators=100)
model = MLForecast(models=models,
                   freq='5T',
                   lags=[t for t in range(1, 865)],
                   date_features=['dayofweek', 'month'],
                   num_threads=6)
prepared_df = model.preprocess(dataset_train, id_col='unique_id', time_col='ds', target_col='y', static_features=[])
X_train, y_train = prepared_df.drop(columns=['unique_id', 'ds', 'y']), prepared_df['y']
model.fit_models(X_train, y_train)
predictions = model.predict(horizon=864)
```
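Not part of the original message, but one quick sanity check on a pipeline like this (a sketch, assuming the script above has already run) is to look at the forest's feature importances, which foreshadows the lag1 issue discussed further down:

```python
# Sketch (assumes the script above has run): rank the lag and date features
# by the random forest's impurity-based importances.
rf = model.models_['RandomForestRegressor']  # fit_models stores the fitted models by name
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```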
j
Can you provide more information, like how many samples you have? Since you're using `dropna=True` (the default), the preprocessing drops 864 rows, which may be too many for your data. Also, if your data has a trend, the random forest can't extrapolate out of the box, so you could use differences, for example.
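As a side note (my illustration, not from the thread), that row dropping is controlled by the `dropna` argument of `preprocess`:

```python
# dropna=True (the default) drops the first 864 rows, whose lag features are
# incomplete; dropna=False keeps them with NaNs so you can inspect them.
prepared_df = model.preprocess(dataset_train,
                               id_col='unique_id', time_col='ds',
                               target_col='y', static_features=[],
                               dropna=False)
```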
i
I'm using around 200k samples in train and 63k samples in test. I can provide the data if necessary. What I can't understand is that when I use scikit-learn for just a one-step forecast with the same lags I get pretty good results, but when I use mlforecast, predict for multi-step forecasting just returns constant predictions. I can see that the model is fitted well when I look at forecast_fitted_values. I expect the performance of the model to decrease because of the recursive strategy for multi-step forecasting, but I did not expect to get constant predicted values.
This is the data. I should mention that the original data has some missing parts. I am not imputing the values, since no simple method like ffill or interpolate fits well here.
j
mlforecast assumes that the data is complete; if you have missing values then some of the lags will be wrong, e.g. if you have data for 10:05 and then 10:15, the lag1 at 10:15 will be the value at 10:05 (instead of the missing 10:10), which is wrong. Regardless, a constant prediction seems pretty extreme. Are you using the exact code you pasted here?
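To illustrate that point (my sketch, not from the original exchange), one way to make the series complete before computing lags is to reindex it to a regular 5-minute grid; rows created for the missing timestamps come out as NaN, which makes the gaps explicit:

```python
import pandas as pd

# Reindex to a regular 5-minute grid so lag k really means "k * 5 minutes ago".
full_index = pd.date_range(dataset_train['ds'].min(), dataset_train['ds'].max(), freq='5T')
regular = (dataset_train.set_index('ds')
           .reindex(full_index)
           .rename_axis('ds')
           .reset_index())
regular['unique_id'] = 'AMBIENT_TEMPERATURE'  # restore the id on the inserted rows
print(regular['y'].isna().sum(), 'missing timestamps')
```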
i
Yes, I use that code.
j
If you can provide the training dataset I can try to help debug, because there's nothing obviously wrong there.
i
Here is the whole data.
data_preparation.py
missing_timestamps.py
Those are the scripts I am importing in the code
ELES-MAS-5001.csv.gz
The whole data is in the file ELES-MAS-5001.csv.gz; I sent the previous one by mistake.
checking_timestamps.py
j
I just ran it with only lag1 and lag864, and it seems like lag1 has all the importance, so the model basically just predicts the lag1. With the recursive strategy this is a problem, because the prediction becomes the value of the target at the next timestamp, so in this case the prediction is the lag1 over and over again. Something that could help is removing that signal from the input and trying to model the rest by using differences, something like this:
```python
from mlforecast.target_transforms import Differences

model = MLForecast(models=models,
                   freq='5T',
                   lags=[1, 12 * 24, 7 * 12 * 24],  # previous step, previous day, previous week
                   date_features=['dayofweek', 'month'],
                   target_transforms=[Differences([7 * 12 * 24])],  # subtract the value one week earlier
                   num_threads=2)
```
This assumes that there's a weekly seasonality, so we subtract the value at the same time in the previous week. Using that I see a better forecast:
```python
from utilsforecast.plotting import plot_series

plot_series(predictions, target_col='RandomForestRegressor')
plot_series(dataset_train, predictions, max_insample_length=28 * 24 * 14)
```
i
My data has a frequency of 5 min. Why do you subtract with 7*12*24? That is not the same as the lag 864. If the lag1 is the most important feature, then why in the recursive strategy is the lag1 repeated over time, when it should retrain the model over and over again, adding the previous prediction as the current target? 7*12*24 is 2016, which is around 403 min, that is 16 hours; it is not the previous week, or am I getting something wrong here?
Sorry for the wrong assumption, 7*12*24 lags is indeed one week and 12*24 lags is one day. How did you get to these lags, did you use hyperparameter optimization?
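For clarity (my addition), the arithmetic behind those lags at a 5-minute frequency:

```python
steps_per_hour = 60 // 5             # 12 five-minute steps per hour
steps_per_day = steps_per_hour * 24  # 288 steps = one day
steps_per_week = steps_per_day * 7   # 2016 steps = one week
assert steps_per_week == 7 * 12 * 24
```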
j
No, those usually work well (1, previous day same hour, previous week same day of week and same hour). You could also try using the 1st difference and removing the lag1, for example.
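A minimal sketch of that alternative (my illustration, reusing the models object from the script above): difference the target once and drop lag 1 from the features:

```python
from mlforecast.target_transforms import Differences

model = MLForecast(models=models,
                   freq='5T',
                   lags=[12 * 24, 7 * 12 * 24],           # previous day, previous week; no lag 1
                   date_features=['dayofweek', 'month'],
                   target_transforms=[Differences([1])],  # model the first difference
                   num_threads=2)
```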
i
How does the predict function work when using the recursive strategy? For example, if I want to forecast three days, the first day is unknown, so if I'm using just lag1, what is used as the lag at that first timestamp when the target is unknown? The lag is one timestamp before, so the last timestamp in train?
And further, for the second day the predicted value of the first day is used as the target, and what is used as the lag here? The previous lag is the day before, but that is also unknown; I mean, it is predicted, but we are using it as a target.
And does predict just use the predicted values as the new target and recompute the features, or does it also retrain the model when using predicted values as the new target: get a new target, retrain again, and so on?
j
When predicting the first timestamp the lag1 is the last timestamp in train; for the second one the lag1 is the first predicted value, and so on. At every timestamp we just update the target values and recompute the features; we use the same model for every timestamp.
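To make that concrete, here is a conceptual sketch of the recursive strategy (an illustration, not mlforecast's actual internals; build_features and fitted_model are hypothetical names):

```python
# Conceptual sketch of recursive multi-step prediction; an illustration,
# not mlforecast's actual internals. build_features is a hypothetical
# helper that recomputes lags and date features from the series so far.
history = list(y_train)                          # known target values
forecasts = []
for _ in range(horizon):                         # e.g. horizon = 864
    features = build_features(history)
    y_hat = fitted_model.predict([features])[0]  # the same fitted model at every step
    forecasts.append(y_hat)
    history.append(y_hat)                        # the prediction becomes the next lag1
```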