If sequences shorter than input_size are automatically padded with 0 at the beginning, and future exogenous variables are also automatically padded with 0 at the beginning (is this true?), would it make sense to create a "real_data" exogenous variable to distinguish those padding zeros from real zeros in the time series? We would set it to 1 for all time series points and leverage the all-zero padding so that it is automatically 0 for padded data; this would tell the model that padded data have real_data=0 and true data have real_data=1. Could the scaler (e.g. robust) negatively impact this "real_data" indicator, since future exogenous variables get scaled? Is the scaling of exogenous variables performed before or after the automatic zero-padding? Thanks
04/10/2023, 6:07 PM
Hi @Manuel! Yes, all variables are padded at the beginning. We have not tried the "real_data" dummy, but it sounds intuitive. We previously had an "available_mask" dummy to directly mask missing/padded data with zeros. This is a little bit different but related. The scaling is done after padding the variable, but it shouldn't affect the information provided by the variable.
Also note that most models use only the information in the input window of size input_size. If the time series have a lot of history, then it should not be necessary to add the "real_data" variable.
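To make the idea above concrete, here is a minimal sketch in plain NumPy. It does not use the library's internals: left_pad and robust_scale are hypothetical helpers written for illustration, and the median/IQR scaler (with a fallback to scale 1 when the IQR is 0) is only an assumption about what a "robust" scaler does. It left-pads a short series and a "real_data" indicator with zeros, then scales the indicator after padding.

```python
import numpy as np

def left_pad(values, input_size):
    """Left-pad a 1-D array with zeros up to input_size (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    n_pad = max(input_size - len(values), 0)
    return np.concatenate([np.zeros(n_pad), values])

def robust_scale(x):
    """Median/IQR scaling; assumed fallback to scale 1 when the IQR is 0."""
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x - median) / (iqr if iqr > 0 else 1.0)

input_size = 8
y = np.array([3.0, 0.0, 5.0])   # short series that contains a genuine zero
real_data = np.ones_like(y)     # 1 for every observed point

y_padded = left_pad(y, input_size)             # padding zeros plus the real zero
mask_padded = left_pad(real_data, input_size)  # 0 on padding, 1 on real points
mask_scaled = robust_scale(mask_padded)        # scaling applied after padding

print(y_padded)
print(mask_padded)
print(mask_scaled)
```

In y_padded the genuine zero cannot be told apart from the padding, while the indicator still takes one value on padded positions and another on observed positions after scaling, because the scaling is an affine map with a nonzero scale. One thing to watch in this sketch: if the indicator is almost constant within a window, the IQR is 0 and the scaler needs a fallback like the one assumed here.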
04/10/2023, 6:46 PM
@Cristian (Nixtla) Thank you! Yes, the problem in my specific case is that many time series have a limited history, shorter than the input_size I need to model the yearly seasonality acceptably.