# general
j
Hi everyone - I work at H2O.ai and have really enjoyed exploring Nixtla's tools. A lot of my focus in the last decade has been on building large-scale, high-frequency forecasting systems, so I love to see some of the greatest R packages getting some Pythonic love! I'm looking into using `HierarchicalForecast` for a few applications - are there any utility or helper functions to generate the `S` matrix for a given hierarchy? Thanks in advance
👀 1
f
Hi @Jonathan Farland! A good example can be found here: https://nixtla.github.io/hierarchicalforecast/examples/AustralianDomesticTourism.html. If you have a set of time series of the lowest level and want to construct the `S` matrix and the dataset with all hierarchies, you can use the `aggregate` function (`from hierarchicalforecast.utils import aggregate`). The function takes the time series of the lowest level and the hierarchical structure. Please let me know if that example works for your use case. :)
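For readers unfamiliar with the summing matrix, here is a minimal hand-built sketch of what `S` contains for a two-level store/dept hierarchy. It does not call the library's `aggregate` function; the toy store and department names and the `S_df` layout are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical bottom-level series: one per (store, dept) pair.
bottom = pd.DataFrame({
    "store": ["store1", "store1", "store2"],
    "dept": ["dept1", "dept2", "dept1"],
})
bottom["unique_id"] = bottom["store"] + "/" + bottom["dept"]

# Upper-level rows: one indicator row per store, with a 1 wherever a
# bottom-level series belongs to that store.
stores = sorted(bottom["store"].unique())
store_rows = np.array(
    [(bottom["store"] == s).to_numpy(dtype=float) for s in stores]
)

# Bottom-level rows: the identity matrix (each series maps to itself).
S = np.vstack([store_rows, np.eye(len(bottom))])

# Rows are all series in the hierarchy, columns are the bottom-level series.
S_df = pd.DataFrame(
    S,
    index=stores + bottom["unique_id"].tolist(),
    columns=bottom["unique_id"].tolist(),
)
print(S_df)
```

The number of rows equals the total number of series (aggregated plus bottom-level) and the number of columns equals the number of bottom-level series, which is the shape to compare against the training and forecast data.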
j
Nice! That looks like exactly what I was thinking of, I'll give it a shot when I have a second and will let you know if there are any issues. Thanks so much!
🙌 1
Hi @fede (nixtla) (they/them) Thanks again for pointing me to that example previously - it looks like it got taken down? I'm running into a perplexing issue while using a toy example. I have a data set with 672 time series and I can successfully follow the docs and reproduce the forecasts, but when I try to reconcile them, I appear to have a mismatch of dimensions somewhere and I am not sure how to track it down past the sanity checks I've already done. Here's the stacktrace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/01/4mysj8cx1bjg_w8rbw097jwr0000gp/T/ipykernel_22347/4254128446.py in <module>
      7 hrec = HierarchicalReconciliation(reconcilers=reconcilers)
      8 #Y_rec_df = hrec.reconcile(Y_hat_df, Y_df_train, S, tags)
----> 9 Y_rec_df = hrec.reconcile(Y_hat_df=Y_hat_df, Y_df=Y_df_train, S=S, tags=tags)

~/opt/anaconda3/envs/ts_recon/lib/python3.9/site-packages/hierarchicalforecast/core.py in reconcile(self, Y_hat_df, S, tags, Y_df, level, bootstrap)
    148                 kwargs = {key: common_vals[key] for key in kwargs}
    149                 fcsts_model = reconcile_fn(y_hat=y_hat_model, **kwargs)
--> 150                 fcsts[f'{model_name}/{reconcile_fn_name}'] = fcsts_model['mean'].flatten()
    151                 if (pi and has_level and level is not None) or (bootstrap and level is not None):
    152                     for lv in level:

~/opt/anaconda3/envs/ts_recon/lib/python3.9/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3653         else:
   3654             # set column
-> 3655             self._set_item(key, value)
   3656 
   3657     def _setitem_slice(self, key: slice, value):

~/opt/anaconda3/envs/ts_recon/lib/python3.9/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3830         ensure homogeneity.
   3831         """
-> 3832         value = self._sanitize_column(value)
   3833 
   3834         if (

~/opt/anaconda3/envs/ts_recon/lib/python3.9/site-packages/pandas/core/frame.py in _sanitize_column(self, value)
   4533 
   4534         if is_list_like(value):
-> 4535             com.require_length_match(value, self.index)
   4536         return sanitize_array(value, self.index, copy=True, allow_2d=True)
   4537 

~/opt/anaconda3/envs/ts_recon/lib/python3.9/site-packages/pandas/core/common.py in require_length_match(data, index)
    555     """
    556     if len(data) != len(index):
--> 557         raise ValueError(
    558             "Length of values "
    559             f"({len(data)}) "

ValueError: Length of values (4704) does not match length of index (2688)
Dimensions of training data:
Dimensions of forecast data:
More checks and calls to reconcile:
@fede (nixtla) (they/them) if no obvious issue jumps right out at you here, I can try to make a reproducible github issue if that's your suggestion
f
Hey @Jonathan Farland! Thanks for sharing your code and tests. It seems that the problem is related to the `Y_df_train` data. I can see that it contains 672 base time series, but seeing the shape of the summing matrix `S`, it seems that it is constructed for 673 base time series. To create `S`, did you use the `aggregate` function? Perhaps we are missing something
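A mismatch like this can be narrowed down by comparing the series identifiers on both sides. The following is a rough sketch on toy data, assuming `S` is held as a DataFrame indexed by series id (here called `S_df`) and the training data is in long format with `unique_id`, `ds`, and `y` columns; those names and values are assumptions, not taken from the notebook in this thread.

```python
import pandas as pd

# Hypothetical toy inputs: S_df has one row per series in the hierarchy,
# Y_df holds the training data in long format (unique_id, ds, y).
S_df = pd.DataFrame(
    [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    index=["store1", "store1/dept1", "store1/dept2"],
    columns=["store1/dept1", "store1/dept2"],
)
Y_df = pd.DataFrame({
    "unique_id": ["store1"] * 3 + ["store1/dept1"] * 3 + ["store1/dept2"] * 2,
    "ds": pd.to_datetime(
        ["2012-10-12", "2012-10-19", "2012-10-26"] * 2
        + ["2012-10-19", "2012-10-26"]
    ),
    "y": [3.0, 7.0, 6.0, 3.0, 4.0, 2.0, 3.0, 4.0],
})

# Series that S expects but the training data lacks, and vice versa.
ids_in_S = set(S_df.index)
ids_in_Y = set(Y_df["unique_id"])
missing_from_Y = sorted(ids_in_S - ids_in_Y)
missing_from_S = sorted(ids_in_Y - ids_in_S)

# Observations per series: a ragged panel shows up as unequal counts.
counts = Y_df.groupby("unique_id")["ds"].count()
short_series = counts[counts < counts.max()]

print("in S but not in Y:", missing_from_Y)
print("in Y but not in S:", missing_from_S)
print(short_series)
```

Either a non-empty difference between the two id sets or a series with fewer observations than the rest is a candidate explanation for a length mismatch at reconciliation time.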
j
Thanks! Here's how I created `S`:
`df` looks like:
and the only other step that I think is related is:
f
I see. I’m thinking that maybe some rows or time series are being deleted in that step. Would it be possible for you to share your data to explore the problem in detail? Perhaps you could mask `y`
j
I can share the data and my code, the data is public and just a sample from the M5 competition. I'll package it up here. I also see now that I actually have 628 base time series (store x dept) plus 45 store-level time series, which gets us to the 673, and therefore `S` is 673 x 628
but yeah, I still end up with 672 rows in `Y_df_train`
Here's my notebook and toy data set. Really appreciate you taking a look, I will also continue to look for anything silly I am doing here
f
Nice! Thank you @Jonathan Farland! I found the problem. The series `store36/dept6` has only two observations. Also, there are other time series with missing values. I solved the problem by imputing the missing values with zero; since we are dealing with demand data, it makes sense. I’m imputing the missing values for each time series from the first date of the series until the last date of the whole dataset. For example, the first observation of `store36/dept6` is `2012-06-08`, and the last observation of the training set is `2012-10-26`, thus, that series will range from `2012-06-08` until `2012-10-26`. The second image shows this.
I have also added (bootstrapped) prediction intervals to the final forecasts, which I think are a nice feature
❤️ 1
Here’s the nb
Please let me know if you have any questions 🙂
j
Wow! Nice! So basically you just padded every time series to conform to the global begin and end dates, right?
Padded by imputation
Oh I re-read your note - you start from whenever the series begins, but to the global end
f
you start from whenever the series begins, but to the global end
Yes, exactly
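The padding described here (each series filled with zeros from its own first date up to the global last date) might look roughly like the sketch below. This is not the code from the shared notebook; the function name `pad_to_global_end`, the toy data, and the 7-day frequency are assumptions, and the sketch presumes every series sits on the same regular date grid.

```python
import pandas as pd

def pad_to_global_end(df, freq, fill_value=0.0):
    """Reindex each series from its own first date to the global last date."""
    global_end = df["ds"].max()
    padded = []
    for uid, group in df.groupby("unique_id"):
        # Full date grid for this series: own start, shared end.
        full_index = pd.date_range(group["ds"].min(), global_end, freq=freq)
        series = (
            group.set_index("ds")["y"]
            .reindex(full_index)      # missing dates become NaN
            .fillna(fill_value)       # impute gaps (zero demand by default)
            .rename_axis("ds")
            .reset_index()
        )
        series.insert(0, "unique_id", uid)
        padded.append(series)
    return pd.concat(padded, ignore_index=True)

# Hypothetical toy data: series "B" has only two observations.
toy = pd.DataFrame({
    "unique_id": ["A", "A", "A", "B", "B"],
    "ds": pd.to_datetime(
        ["2012-10-12", "2012-10-19", "2012-10-26", "2012-10-05", "2012-10-26"]
    ),
    "y": [5.0, 6.0, 7.0, 1.0, 2.0],
})
padded = pad_to_global_end(toy, freq="7D")
print(padded)
```

Series are not extended backwards before their first observation, so they can still differ in length; only interior gaps and the tail up to the global end are filled.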
j
Ok I’ll take a look myself and get back to you. Thanks for taking a look - maybe we can add some warnings about this issue in the future
f
Sure! We are also working on a library to do time series preprocessing efficiently and to detect inconsistencies in the data. Would that library be valuable for your use cases? Besides missing values, is there any other functionality that you would like to have? Your input will be very useful for us to prioritize development
🙌 1
j
Nothing comes to mind right now in terms of features, but I will say that maybe our two companies can collaborate a little here. Now that you've unblocked me (thanks again), I need a little bit of time to build out what I am thinking but will be happy to share with you here, and then we can take it from there?
f
Yes, sure! 🎉