# general
**Quang Bui**
**Daylight savings & time index**

Hi everyone, I was just wondering how one might go about creating a five-minutely time index that is based on the clock datetime (so it accounts for daylight savings) instead of the UTC time in my `ds` column. I've used the function below, which works fine for creating the five-minutely time index based on the UTC time (`ds` column):
```python
def five_min_index(dates):
    """Calculate the 5-minutely index for each datetime (0 to 287)."""
    return (dates.hour * 60 + dates.minute) // 5
```
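As a sanity check of the mapping (my addition, not from the thread): the index runs from 0 at midnight to 287 at 23:55.

```python
import pandas as pd

def five_min_index(dates):
    """Calculate the 5-minutely index for each datetime (0 to 287)."""
    return (dates.hour * 60 + dates.minute) // 5

idx = pd.DatetimeIndex(['2023-01-01 00:00', '2023-01-01 00:05', '2023-01-01 23:55'])
print(five_min_index(idx).tolist())  # [0, 1, 287]
```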
It is used in `LightGBMCV()` as follows:

```python
cv = LightGBMCV(
    freq='5min',
    target_transforms=[Differences([288])],
    lags=[1, 2, 3, 4, 5, 6, 12, 288],
    lag_transforms={
        1: [
            ExponentiallyWeightedMean(alpha=0.5),
            RollingMean(window_size=12),
        ],
        12: [RollingMean(window_size=288)],
    },
    date_features=[five_min_index, 'hour', 'dayofweek'],
    num_threads=4,
)
```
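For context (my reading, not from the thread): with 5-minute data, 288 steps is one day, so `Differences([288])` amounts to daily seasonal differencing, i.e. subtracting the value at the same time on the previous day. A plain-pandas sketch on a toy series:

```python
import pandas as pd

# Hypothetical series at 5-minute frequency; 288 steps = one day.
y = pd.Series(range(600), index=pd.date_range('2023-01-01', periods=600, freq='5min'))
diffed = y.diff(288)     # y_t - y_{t-288}: removes the daily seasonal level
print(diffed.iloc[288])  # 288.0 for this linear toy series
```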
My `y` depends on the clock datetime; it's human driven, so it is influenced by when we start the day and when we end the day. Any thoughts? I'd like to be able to have the index in `date_features` instead of having to create a dynamic exogenous feature...
**José Morales**
Hey, the `ds` column can also have integers. Would that help you here?
**Quang Bui**
Thanks @José Morales! What I ended up doing instead was writing a custom function to put in `date_features`. Here is the function:
```python
def localise_five_min_index(dates):
    """5-minutely index (0 to 287) based on the Adelaide clock time.

    Since `ds` holds UTC timestamps, they must be localised to UTC first
    and then converted (labelling them directly as Adelaide time would
    leave the clock values unchanged). A Series is normalised to a
    DatetimeIndex so the `.hour`/`.minute` attributes are available.
    """
    if isinstance(dates, pd.Series) and pd.api.types.is_datetime64_any_dtype(dates):
        dates = pd.DatetimeIndex(dates)
    if not isinstance(dates, pd.DatetimeIndex):
        raise ValueError("Input must be a pandas DatetimeIndex or datetime64 Series.")
    localised = dates.tz_localize('UTC').tz_convert('Australia/Adelaide')
    return (localised.hour * 60 + localised.minute) // 5
```
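A quick check of the conversion on either side of the DST boundary (Adelaide is UTC+10:30 in the austral summer and UTC+9:30 in winter), using a self-contained copy of the helper; the function name here is mine, for illustration:

```python
import pandas as pd

def utc_to_local_five_min_index(dates, tz='Australia/Adelaide'):
    # Treat naive timestamps as UTC, then convert to the local clock time.
    local = pd.DatetimeIndex(dates).tz_localize('UTC').tz_convert(tz)
    return (local.hour * 60 + local.minute) // 5

idx = pd.DatetimeIndex(['2023-01-01 00:00', '2023-06-01 00:00'])  # UTC
print(utc_to_local_five_min_index(idx).tolist())  # [126, 114]: 10:30 ACDT, 09:30 ACST
```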
And then I include it in `LightGBMCV()` like so:

```python
cv = LightGBMCV(
    freq='5min',
    target_transforms=[Differences([288])],
    lags=[1, 2, 3, 4, 5, 6, 12, 288],
    lag_transforms={
        1: [
            ExponentiallyWeightedMean(alpha=0.5),
            RollingMean(window_size=12),
        ],
        12: [RollingMean(window_size=288)],
    },
    date_features=[localise_five_min_index, 'hour', 'dayofweek'],
    num_threads=4,
)
```
This resulted in reduced errors during cross-validation (as expected) :)
I do have another question, about including exogenous features. I'm trying to forecast household energy consumption, and I have exogenous features like air temperature and humidity, which have been merged into the data. The data has columns `unique_id`, `ds`, `y`, `temp`, and `relative_humidity`. `y` is each household's (`unique_id`'s) energy consumption, and the temperature and humidity come from one weather station, so they are the same for every household. Temperature and humidity change with `ds`; they vary with the time of day.

I've estimated the correlation coefficients between `temp` and `y` as well as between `relative_humidity` and `y`, and also ran tests of statistical significance; all correlations are statistically significantly different from 0. I do this as one of the steps to ensure that the data are merged correctly and that there isn't anything wrong with `temp` and `relative_humidity`. Yet when I train the model with `temp` and `relative_humidity`, it performs worse than without them. This should not be happening, but I cannot figure out why...
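The correlation-and-significance check described above can be sketched without extra dependencies (synthetic data with made-up coefficients; for large n, |t| > 2 roughly corresponds to the 5% significance threshold):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(25, 5, n)           # synthetic temperature readings
y = 0.3 * temp + rng.normal(0, 1, n)  # load constructed to correlate with temp
df = pd.DataFrame({'temp': temp, 'y': y})

r = df['temp'].corr(df['y'])                 # Pearson correlation coefficient
t_stat = r * np.sqrt((n - 2) / (1 - r**2))   # t statistic for H0: rho = 0
print(round(r, 2), t_stat > 2.0)             # strong, significant correlation
```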
```python
cv = LightGBMCV(
    freq='5min',
    target_transforms=[Differences([288]), LocalStandardScaler()],
    lags=[1, 2, 3, 4, 5, 6, 12, 288],
    lag_transforms={
        1: [
            ExponentiallyWeightedMean(alpha=0.5),
            RollingMean(window_size=12),
            RollingMean(window_size=288),
            RollingMean(window_size=864),  # 3 days
            RollingQuantile(window_size=12, p=0.5),
            RollingQuantile(window_size=288, p=0.5),
            RollingQuantile(window_size=864, p=0.5),
            RollingStd(window_size=12),
            RollingStd(window_size=288),
        ],
        12: [RollingMean(window_size=288)],
        24: [RollingMean(window_size=288)],
    },
    date_features=[localise_five_min_index, localize_hour, localize_and_identify_weekend, localize_dayofweek],
    num_threads=4,
)
cv_hist = cv.fit(
    df_filtered_subsample_with_weather,
    n_windows=4,
    h=288,
    params=lgb_params,
    eval_every=5,
    early_stopping_evals=5,
    compute_cv_preds=True,
    metric='rmse',
    static_features=[],
)
```
Here, `df_filtered_subsample_with_weather` is the data used for cross-validation, with columns `unique_id`, `ds`, `y`, `temp`, and `relative_humidity`. Any help would be much appreciated!
**José Morales**
Hmm, it's kind of hard to tell without looking at the data. It's possible that if the features are too good it starts to overfit; are you seeing that it stops earlier? Also, the models are saved in `cv.cv_models_`; can you inspect the feature importance with and without those exog features?
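A sketch of that inspection (`cv.cv_models_` is mentioned above; `feature_name()` and `feature_importance(importance_type='gain')` are LightGBM Booster methods, and the numbers below are placeholders, not real output):

```python
import pandas as pd

# With a fitted cv one would pull importances from each saved model, e.g.:
#     booster = cv.cv_models_[0]
#     names, gains = booster.feature_name(), booster.feature_importance(importance_type='gain')
# Placeholder values stand in for a real model's output here:
names = ['lag1', 'lag288', 'five_min_index', 'temp', 'relative_humidity']
gains = [950.0, 700.0, 310.0, 12.0, 8.0]

importance = pd.Series(gains, index=names).sort_values(ascending=False)
print(importance.head(3))  # the top features by total gain
```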
**Joaquin FERNANDEZ**
@Quang Bui have you figured out why? I am seeing similar performance, also in household forecasting. Best
**Quang Bui**
Hi @Joaquin FERNANDEZ, I did eventually see some improvements, but they don't increase accuracy as much as features that are just transformations of load. A lot of the accuracy gains came when I revisited the data to perform further cleaning on each household's load power. Why adding weather features worsens the model to begin with still doesn't make sense to me.
**Joaquin FERNANDEZ**
Thanks for the answers @Quang Bui. Can you give me some hints about the data cleaning you did?