# neural-forecast
I have a model understanding question concerning the NHITS model. I am hoping to get some clarification about the connection between two hyper-parameters: `n_freq_downsample` and `n_pool_kernel_size`. My question may also stem from how these parameters interact with one another. From my understanding, the higher the value in `n_freq_downsample`, the fewer datapoints we will have after downsampling, i.e. if we have daily data and we downsample with `n_freq_downsample = 7`, we will have 1/7 the amount of data. `n_pool_kernel_size` also has a "downsampling effect": if we have an input size of `N` and set `n_pool_kernel_size = 2`, then we will now have `N / 2` datapoints. First of all, is this understanding correct? And if so, it seems like we would want these two hyperparameters to be inversely proportional. I would not want to downsample by a large number and then apply a large-kernel pooling layer; that would dramatically decrease the amount of information being fed into my model. The reason I am asking this question is that I was looking at the AutoNHITS parameter space:
```python
"n_pool_kernel_size": tune.choice(
    [[2, 2, 1], 3 * [1], 3 * [2], 3 * [4], [8, 4, 1], [16, 8, 1]]
),
"n_freq_downsample": tune.choice(
    [
        [168, 24, 1],
        [24, 12, 1],
        [180, 60, 1],
        [60, 8, 1],
        [40, 20, 1],
        [1, 1, 1],
    ]
),
```
Intuitively, I would have assumed that the last two choices for `n_pool_kernel_size` would have been reversed, i.e. `[1, 4, 8]` and `[1, 8, 16]`.
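For concreteness, the back-of-the-envelope arithmetic I have in mind can be sketched like this (the helper functions are mine for illustration, not part of neuralforecast, and they assume pooling with stride equal to the kernel size):

```python
import math

def pooled_input_len(input_size: int, kernel_size: int) -> int:
    """Length of the input window after max pooling with
    stride == kernel_size (the n_pool_kernel_size effect)."""
    return input_size // kernel_size

def theta_out_len(horizon: int, freq_downsample: int) -> int:
    """Number of points a stack predicts in the forecast window
    (the n_freq_downsample effect), at least one point."""
    return max(math.ceil(horizon / freq_downsample), 1)

# Hourly example: one week of history, one day ahead.
input_size, horizon = 168, 24
for k, f in zip([8, 4, 1], [168, 24, 1]):
    print(k, f, pooled_input_len(input_size, k), theta_out_len(horizon, f))
```

With `n_pool_kernel_size = [8, 4, 1]` and `n_freq_downsample = [168, 24, 1]`, each stack sees 21, 42, and 168 input points and emits 1, 1, and 24 forecast points respectively.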
Hi @Phil! Your understanding of both parameters is correct. Regarding the inverse, not necessarily. The `n_pool_kernel_size` downsamples only the inputs, and the `n_freq_downsample` only the outputs. Having both at the same time does not compound, because they affect different parts of the architecture. Here is the diagram from the paper: the kernel controls the MaxPool on the inputs of the MLP stack, and `n_freq_downsample` controls the output dimension of theta (the points in the forecasting window).
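That output side can be sketched as well: each stack predicts only a few theta points, which are then stretched back to the full horizon. This is a simplified linear-interpolation sketch of the idea, not the library's exact implementation:

```python
def interpolate_forecast(theta, horizon):
    """Stretch a low-dimensional stack output (theta) back to the full
    forecast horizon via linear interpolation -- a simplified sketch of
    NHITS-style hierarchical interpolation."""
    n = len(theta)
    if n == 1:
        # A single theta point becomes a flat forecast.
        return [theta[0]] * horizon
    out = []
    for i in range(horizon):
        # Map step i in [0, horizon - 1] onto a position in [0, n - 1].
        pos = i * (n - 1) / (horizon - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(theta[lo] * (1 - frac) + theta[hi] * frac)
    return out

# Three theta points stretched over a 5-step horizon.
print(interpolate_forecast([0.0, 1.0, 0.0], 5))
```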
The intuition for having both larger at the same time is that to produce a lower-dimensional output (higher `n_freq_downsample`) you need less information from the inputs, so the kernel is larger.
With that said, we have observed that larger kernel sizes only help on very high-frequency data; usually keeping a value of 1 or 2 is best. That is why in our default config we kept the option of no downsampling (`[1, 1, 1]`).
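Following that advice, a narrowed search space that keeps kernels small could look like this (a sketch assuming AutoNHITS accepts a custom `config` dict and Ray Tune; the choice lists are taken from the defaults quoted above):

```python
from ray import tune

# Hypothetical narrowed search space: small kernels only, with a subset
# of the default frequency-downsampling choices.
nhits_config = {
    "n_pool_kernel_size": tune.choice([3 * [1], 3 * [2]]),
    "n_freq_downsample": tune.choice([[168, 24, 1], [24, 12, 1], [1, 1, 1]]),
}
```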
I see, that makes more sense! Thank you! Have you observed any effects of varying the number of blocks across frequencies, or is keeping them constant roughly equivalent in performance?
We recommend increasing the number of blocks with larger datasets. For example, NBEATS uses 30 blocks in total for each frequency of the M4 dataset, which has around 30k series.
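For reference, blocks are set per stack; a hedged sketch assuming neuralforecast's `NHITS` constructor exposes `n_blocks` (check your installed version's signature):

```python
from neuralforecast.models import NHITS

# One block per stack for a small dataset...
small = NHITS(h=24, input_size=168, n_blocks=[1, 1, 1])
# ...more blocks per stack for a large one (e.g. M4-scale, ~30k series).
large = NHITS(h=24, input_size=168, n_blocks=[10, 10, 10])
```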