# neural-forecast
j
how can I make multiple predictions at once? It seems that calling predict with a dataframe with multiple rows only predicts on the last row:
o
Not sure what you mean? The predictions returned depend on the test df that you provide to the .predict function.
r
@Jonathan Mackenzie isn't the 'horizon' parameter you set in the training phase what you want?
j
@Olivier I have horizon=6 so I expect 6 records, but I pass in a test df with 2 rows, one at 11am and another at 2:40pm, I want predictions made at both of those times
m
Hello! Data must be contiguous, meaning that it must be ordered in time, ideally without missing values. Here, by passing a df with values at 11am and 2:40pm, the model uses those two values as input and makes a forecast for 2:50pm and so on. If you want to forecast from 11am, then you must first pass a df that ends at 11am, so that this input is used to forecast 11:10am, and so on. Otherwise, if you are forecasting two different series (two different unique_ids), then id1 can end at 11am and id2 can end at 2:40pm. I hope this helps 🙂
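A minimal pandas sketch of the two cases described above (the `unique_id`/`ds`/`y` column names follow neuralforecast's defaults; the ids, timestamps, and target values are made up, and `nf` in the final comment stands for an already-fitted `NeuralForecast` object):

```python
import pandas as pd

# Case 1: a single series. The input history must be a contiguous
# 10-minute sequence that *ends* at 11:00, so the forecast starts at 11:10.
hist = pd.DataFrame({
    "unique_id": "S034",
    "ds": pd.date_range("2025-01-01 08:00", "2025-01-01 11:00", freq="10min"),
})
hist["y"] = range(len(hist))  # dummy target values

# The last timestamp determines where the forecast begins:
assert hist["ds"].max() == pd.Timestamp("2025-01-01 11:00")

# Case 2: to forecast two series from different points in time, give each
# its own unique_id and let each series end where its forecast should start.
s1 = hist.assign(unique_id="id1")
s2 = hist.assign(
    unique_id="id2",
    ds=hist["ds"] + pd.Timedelta(hours=3, minutes=40),  # ends at 14:40
)
df = pd.concat([s1, s2], ignore_index=True)
# nf.predict(df=df) would then forecast id1 from 11:10 and id2 from 14:50.
```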
j
@Marco thanks, I have another question, I have a dataset that goes from 4am to 8pm every day, at 10 minute intervals and I want to predict data over the whole day (a horizon of 96). Should I make a different unique_id for each day?
m
No, just set your horizon to 96 and all timestamps will be predicted
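For reference, the arithmetic behind that horizon: a day running from 4am to 8pm at 10-minute intervals is 16 hours × 6 steps/hour = 96 steps, which is exactly one forecast horizon (the dates below are placeholders):

```python
import pandas as pd

# One day of observations from 04:00 up to (but not including) 20:00,
# at 10-minute steps: 16 hours * 6 steps/hour = 96 timestamps.
day = pd.date_range(
    "2025-01-01 04:00", "2025-01-01 20:00", freq="10min", inclusive="left"
)
horizon = len(day)
assert horizon == 96
```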
j
I got some pretty bad results doing that
o
We don't know enough about the forecasting problem to provide further guidance... e.g. you could be trying to predict a financial instrument, in which case the entire exercise is pointless, or it could be a misconfiguration, a data issue, an incorrect model setup, etc. Happy to provide further guidance if you give more context, or ideally a piece of code that we can run / look at.
j
@Olivier the data is solar power generation data (I have total kwh generated for the current day, and instantaneous power output), I have 2 years of data at 10 minute intervals, from 6am to 8pm each day. I've also got weather data from a nearby weather station (added it to see if it would help improve predictions). I can't share the whole solar data, but my test dataframe looks like this:
```
test_df.iloc[0]
Out[12]:
timestamp                  2025-01-01 18:10:00
ac_power_site                              0.0
irradiance                                 0.0
daily_energy_site                       375.47
daily_energy_inverter_1                 124.19
ac_power_inverter_1                        0.0
daily_energy_inverter_2                 117.41
ac_power_inverter_2                        0.0
daily_energy_inverter_3                 133.88
ac_power_inverter_3                        0.0
air_temperature                           17.8
air_pressure                            1013.2
humidity                                  84.0
sunshine_duration                          0.0
solar_radiation                            0.0
unique_id                                 S034
```
We also wanted to make a model for the individual inverters that make up the whole solar plant. My code looks like this:
```python
def train(site, inverter, target_prefix="daily_energy"):
    """
    Train a model using the neuralforecast NHITS model for a given site or inverter
    """
    if not inverter:
        target = f"{target_prefix}_site"
    else:
        target = f"{target_prefix}_{inverter}"

    data = load_data(site)
    df = data.reset_index(drop=False)
    print(f"Training site={site} inverter={inverter} target={target}")
    # drop null values
    df = df.dropna(subset=['timestamp', 'irradiance', target])
    split_idx = int(0.8 * len(df))
    train_df = df.iloc[:split_idx]
    test_df = df.iloc[split_idx:]
    horizon = 96  # 96 predictions, 10 minutes apart
    tb_logger = TensorBoardLogger(save_dir="tb_logs", name="solar_tb_logs")
    extra_fields = [
        'irradiance',
        'air_temperature',
        'air_pressure',
        'humidity',
        'sunshine_duration',
        'solar_radiation',
    ]
    # Use your own config or AutoNHITS.default_config
    nhits_config = {
        "learning_rate": tune.choice([1.9e-7]),
        "max_steps": tune.choice([1024]),  # Number of SGD steps
        "input_size": tune.choice([5 * horizon, 3 * horizon, 8 * horizon]),  # input_size = multiplier * horizon
        "batch_size": tune.choice([32, 16, 8]),  # Number of series in windows
        "windows_batch_size": tune.choice([128, 256, 512, 1024]),  # Number of windows in batch
        "n_pool_kernel_size": tune.choice(
            [[2, 2, 1], 3 * [1], 3 * [2], 3 * [4], [8, 4, 1], [16, 8, 1]]
        ),  # MaxPool's kernel size
        "n_freq_downsample": tune.choice(
            [[168, 24, 1], [24, 12, 1], [4, 2, 1], [1, 1, 1]]
        ),  # Interpolation expressivity ratios
        "hist_exog_list": tune.choice([extra_fields]),
        "activation": tune.choice(['ReLU']),  # Type of non-linear activation
        "n_blocks": tune.choice([[1, 1, 1]]),  # Blocks per each of the 3 stacks
        "mlp_units": tune.choice(
            [3 * [[512, 512]], 4 * [[512, 512]], 5 * [[256, 256]]]
        ),  # 2 layers per block for each stack
        "early_stop_patience_steps": tune.choice([4]),
        "interpolation_mode": tune.choice(['linear']),  # Type of multi-step interpolation
        "val_check_steps": tune.choice([20]),  # Compute validation every 20 steps
        "random_seed": tune.randint(1, 20),
        "logger": tune.choice([tb_logger]),
        "callbacks": tune.choice([[LearningRateFinder()]]),
    }

    nf = NeuralForecast(
        models=[
            AutoNHITS(
                h=horizon,
                config=nhits_config,
                num_samples=16,
            ),
        ],
        freq='10min'
    )
    # print("Best config", nf.models[0].results.get_best_result().config)

    nf.fit(
        df=df[['unique_id', 'timestamp', target] + extra_fields],
        time_col="timestamp",
        target_col=target,
        val_size=int(0.15 * len(train_df)),
    )

    if inverter:
        model_name = f"{site}_inverter_{inverter}_h_{horizon}"
    else:
        model_name = f"{site}_site_h_{horizon}"
    model_output = root_path / 'models' / model_name
    nf.save(str(model_output), overwrite=True, save_dataset=False)
    print("Writing to", model_output)
    return nf, train_df, test_df, target, f"{site} {inverter}"
```
o
Thanks - I'll have a look, but the first thing I noticed is that .fit should probably be on `train_df`, not the full `df`, no? Second, do you have more than one unique_id? Otherwise there's an error in how you create the train and test sets - do that with e.g. `df.groupby(["unique_id"], sort=False).tail(20)` if you want to keep the last 20 timesteps. Make sure the dataframe is sorted too before making the split: `df.sort_values(by=["unique_id", "ds"])`. I'll come back later, but these are a few easy things from looking at your code.
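A minimal sketch of that per-series split on toy data (two hypothetical series, deliberately unsorted; in the real code the tail length would be the 96-step horizon rather than 2):

```python
import pandas as pd

# Toy frame with two series, rows out of order on purpose.
df = pd.DataFrame({
    "unique_id": ["id2", "id1", "id1", "id2", "id1", "id2"],
    "ds": pd.to_datetime([
        "2025-01-01 10:20", "2025-01-01 10:00", "2025-01-01 10:10",
        "2025-01-01 10:00", "2025-01-01 10:20", "2025-01-01 10:10",
    ]),
    "y": [20.0, 1.0, 2.0, 10.0, 3.0, 30.0],
})

df = df.sort_values(by=["unique_id", "ds"])            # sort before splitting
test_df = df.groupby("unique_id", sort=False).tail(2)  # last 2 steps per series
train_df = df.drop(test_df.index)                      # everything else
```

This keeps the held-out window at the end of *each* series, instead of a single `iloc` cut that would mix series if there were more than one unique_id.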
j
thanks. I've fixed the train_df bit, and there is only 1 unique_id value in this dataset ("S034"), i.e. site 34
o
Further on, it seems you don't have a `scaler_type` set, which is usually the most important hyperparameter in any algorithm. The learning rate doesn't make sense either, it's way too low. Just start with the default config of AutoNHITS.
j
the plots of the data we might want to predict look like this: [plot not included in export]
I used the `LearningRateFinder` callback, is that not compatible with setting the learning rate?
o
You're overcomplicating things. All of these don't move the needle and should be something you look at at the very end, when you want to squeeze out the last 1% of performance.
E.g. start with something simple:
```python
nhits_config = {
    "scaler_type": tune.choice(["minmax1", "robust", "identity"]),
    "max_steps": tune.choice([500, 1000, 2000, 5000]),
}
```
j
what's wrong with doing early stopping?
o
Nothing, but as I said, you're overcomplicating things, which seems unnecessary at this point. Start simple; you can always add complexity later. Right now you have no clue why performance is bad, because you jumped straight to the most complex pipeline in the history of mankind 🙂