Nixtla's Open Source Time Series Ecosystem.

Nixtla Community

This is my code:
`sf = StatsForecast(
    df=df,
    models=models,
    freq='W',
    n_jobs=-1,
    ray_address='10.10.10.110:6379'
)

forecasts_df = sf.forecast(h=52, level=[90])` 

I don't have a yaml file though. I start my ray cluster on my EC2 instance and then pass the address to StatsForecast.

Hi <@U040AMG6N69>! Thank you for using statsforecast.

Speedups from a cluster typically appear when you have many time series, usually more than the available CPUs. In your example, since you are handling only 10 time series, a Ray cluster may be less useful. Running your code on a single c6a.8xlarge instance with `n_jobs=-1` might be best. StatsForecast uses a map-reduce approach, so if you have 10 time series and 32 cores available, statsforecast will use only 10 cores to train (one for each series). The training speed for each series depends on the models used and the length of the series. For example, models like AutoARIMA are usually very slow on long time series (more than 100 observations). Models like MSTL tend to be faster.
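To make the core-count argument concrete, here is a minimal sketch of the "one task per series" idea. It is not StatsForecast's internal code: `busy_workers` and `assign_series` are hypothetical helpers that only illustrate why extra cores sit idle once there are more cores than series.

```python
def busy_workers(n_series: int, n_cores: int) -> int:
    """Number of workers that can be kept busy when each series is one task."""
    return min(n_series, n_cores)


def assign_series(series_ids, n_cores):
    """Round-robin assignment of series to workers (illustrative only)."""
    workers = [[] for _ in range(busy_workers(len(series_ids), n_cores))]
    for i, uid in enumerate(series_ids):
        workers[i % len(workers)].append(uid)
    return workers


ids = [f"series_{i}" for i in range(10)]
print(busy_workers(len(ids), 32))                 # workers used on a 32-core machine
print([len(w) for w in assign_series(ids, 8)])    # series per worker on 8 cores
```

With 10 series, a 32-core machine and a 2000-CPU cluster can both keep only 10 workers busy, so the remaining time is dominated by how slow the slowest single model fit is.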

<@U0316L4HWQ7> I am using AutoARIMA with 7 years of weekly data, so the length of each series is 364 and my horizon is 52 weeks. Thanks a lot for your explanation, but I have one question. How did your example on the m5.2xlarge with 8 cores perform so well? That also used AutoARIMA and had millions of series. Is it because the forecast horizon was short in your case (7 days)?

In that case, using the MSTL model (or even Theta or ETS) is probably better. Long seasonal periods (as in the weekly case) often make AutoARIMA much slower.

Your first intuition about the blog post is correct: 250 EC2 instances of 8 cores each were deployed to obtain a cluster with 2000 CPUs. The experiment with the millions of time series used all 2000 CPUs, hence the short run time.

The horizon is not a problem once you fit the model.

Thanks a lot. That clarified the details.

<https://valeman.medium.com/multi-horizon-probabilistic-forecasting-with-conformal-prediction-and-neuralprophet-5ec5af3888c8>

Hi, I just wanted to say that in <https://nixtla.github.io/statsforecast/arima.html> we can see that

but in <https://nixtla.github.io/statsforecast/models.html#autoarima> we can see that

but in <https://github.com/Nixtla/statsforecast/blob/main/statsforecast/models.py#L63> we can see that

imagen.png

Hi <@U04KWSDJWGN>! Thank you for reporting this. We will fix the issue soon. :slightly_smiling_face:

Amazing that you found those discrepancies, thanks.

Thanks to you for making available this amazing package. I will try to help you as much as I can.

hi <@U040AMG6N69>! AutoTheta handles the seasonality automatically

Hi Farzad. That functionality is not available yet, but we're working on it. In the meantime, you can generate a Naive model for all your data and then compare it with your forecast.

IMG_20230125_182817.jpg

Oh! That's weird. Sometimes I get an error if I remove the fallback option but it works when I put the fallback in! I actually don't want it to fall back to any model. Initially I included the fallback because the example from the documentation did it. Now I include it because if I don't, then I get a weird error sometimes.

IMG_20230125_183135.jpg

I think this error might come from AutoCES but it's interesting that when I include the fallback option, it does not error out!

Seems like it has something to do with parallel execution and the async calls? But it doesn't happen with other models and it goes away if fallback option is used!!

If fallback_model is not implemented, then this behavior must be some glitch. Somehow it cancels out the other error.

No, when I said "that functionality is not available" I meant knowing which time series were forecasted using the fallback model

the fallback model is implemented already and works well. In your case, there is probably one or more time series where the AutoCES can't be generated, and hence, it defaults to the seasonal naive.
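The workaround suggested earlier (forecast a naive baseline for every series and compare it with the model output) can be sketched as follows. This is a rough illustration with pandas and NumPy only: `seasonal_naive`, `flag_fallback` and the `CES` column name are assumptions made for this example, and the data is made up. It assumes the rows of each series are already sorted by date.

```python
import numpy as np
import pandas as pd


def seasonal_naive(y: np.ndarray, h: int, season_length: int) -> np.ndarray:
    """Repeat the last observed season until h future steps are covered."""
    last_season = y[-season_length:]
    reps = int(np.ceil(h / season_length))
    return np.tile(last_season, reps)[:h]


def flag_fallback(train, fcst, model_col, h, season_length):
    """Return True for each unique_id whose forecast equals the seasonal naive one."""
    flags = {}
    for uid, grp in train.groupby("unique_id"):
        baseline = seasonal_naive(grp["y"].to_numpy(dtype=float), h, season_length)
        model_fcst = fcst.loc[fcst["unique_id"] == uid, model_col].to_numpy(dtype=float)
        flags[uid] = bool(np.allclose(model_fcst, baseline))
    return pd.Series(flags, name="looks_like_fallback")


# Toy data: series "a" was (hypothetically) forecast by the fallback, "b" was not.
train = pd.DataFrame({
    "unique_id": ["a"] * 8 + ["b"] * 8,
    "y": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0] + [5.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0, 8.0],
})
fcst = pd.DataFrame({
    "unique_id": ["a"] * 4 + ["b"] * 4,
    "CES": [1.0, 2.0, 3.0, 4.0] + [5.5, 6.1, 7.2, 8.3],
})
flags = flag_fallback(train, fcst, "CES", h=4, season_length=4)
print(flags)
```

A match is only a strong hint, not proof: a model could in principle produce the same numbers as the seasonal naive forecast on its own.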

Yeah that was what I was thinking. I misunderstood your first comment. Thanks for clarifying.

No problem, I think I wasn't clear enough and I can see how you misunderstood it :sweat_smile:

All statsforecast models are local; therefore, in your case you would have 100 different models for 100 different individual time series.

hey <@U040AMG6N69>. Thank you for letting us know about this. Does your data have any particular characteristics you can disclose? For example, a lot of zeros.

The `StatsForecast` class has the `fallback_model` argument to prevent the pipeline from failing: <https://nixtla.github.io/statsforecast/core.html#statsforecast>.

No it didn't have zeros. For that example I was using 30 time series and each had six years of weekly data. I don't want my forecasts to fallback to other models because then I wouldn't know which series used which model. But ultimately this is not an issue for me because I decided not to use Ray with StatsForecast. I now handle parallelism outside of StatsForecast. I just shared the issue here in case others see similar behavior. Basically StatsForecast might behave differently for AutoCES if ray_address is provided. Without ray_address, everything is fine.

<https://github.com/Nixtla/statsforecast/releases> contains a list of all releases and what was changed. ctrl+f to search for the model that you want

hi <@U04LW8TSZFX>! to complement <@U04KVSU29FE>, <https://nixtla.github.io/statsforecast/examples/models_intro.html|here> you can find a list of the available models and their main features. The table is up to date with the `main`  branch. We will release a new version to pip soon.

hey <@U04KVSU29FE>! Yes, we are planning to migrate tsfeatures to nbdev. We could use some help. Feel free to open a new PR, I’ll be happy to review it.

are there already some ideas or a design for the direction of the implementation? or is the goal just getting it into a working state?

just so to know what kind of level one should aim for

currently, just getting it in a working state. We would like to have the same structure/functionality but with nbdev. maybe this library could be useful: <https://novetta.github.io/lib2nbdev/tutorial.html>

already have a somewhat workable state, will submit for pr in the next week or so

Hi! StatsForecast yes, tsfeatures no. Here is an example of using Spark with StatsForecast.
<https://www.databricks.com/blog/2022/12/06/intermittent-demand-forecasting-nixtla-databricks.html>

You're welcome! Out of curiosity: what are you working on?

We're forecasting account balances at the bank (the blue one), for 700k time series

so one concern is preprocessing speed. We also want to use tsforecast to segment the time series and identify the most suitable models per cluster

Great! Look, this tutorial we just made might help you.

Max, another question: when I use tsfeatures on time series I get the following error "Exception: Failed to infer frequency from the `ds` column, please provide the frequency using the `freq` argument." My dataframe has this information. To fix it I have to specify the frequency, but I would like it to be inferred automatically. Any suggestions?

I forgot to give you the link to the new tutorial... sorry: <https://nixtla.github.io/statsforecast/examples/statisticalneuralmethods.html>

About the frequencies: <@U0316L4HWQ7> can help you.

hi <@U04N2QZK6PJ>! Internally, tsfeatures uses `pd.infer_freq` to infer the frequency. In some cases the inference can fail if the `ds` column does not have complete dates (for example, if a day is missing in daily data). The safest option in all cases is to pass the `freq` argument. Is there a specific case where passing it is not convenient for you? Maybe we can find a way that is useful for you and integrate it into the library

Thank you very much for your answer <@U0316L4HWQ7>. The problem is precisely that we have missing dates; what we did was impute them using LOCF.
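The frequency-inference failure and the LOCF fix discussed above can be reproduced with plain pandas. This is a small sketch on made-up daily data, not the tsfeatures code path itself; the column names `ds` and `y` follow the convention used in this thread.

```python
import pandas as pd

# Complete daily index: the frequency can be inferred.
full = pd.date_range("2023-01-02", periods=10, freq="D")
print(pd.infer_freq(full))

# Drop one day: the spacing is no longer regular, so inference gives None.
gappy = full.delete(4)
print(pd.infer_freq(gappy))

# Reinsert the missing date and fill it with the last observed value (LOCF).
df = pd.DataFrame({"ds": gappy, "y": range(len(gappy))})
filled = (
    df.set_index("ds")
      .asfreq("D")   # missing timestamps come back as NaN rows
      .ffill()       # last observation carried forward
      .reset_index()
)
print(pd.infer_freq(pd.DatetimeIndex(filled["ds"])))
```

Once the index is regular again the frequency can be inferred, although passing `freq` explicitly remains the safer option mentioned above.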

image.png

<@U035DR8HD6D> a question about StatsForecast. I would like to use exogenous variables for the models. To do so I use StatsForecast.forecast() and pass in the *df* parameter a pyspark dataframe with the following columns (unique_id, ds, y and the exogenous variable columns). But if I don't pass the *X_df* parameter, the pyspark process fails when I try to display the forecast result with "show()". Now, if I do pass the *X_df* parameter, what information should it contain? According to the documentation (<https://nixtla.github.io/statsforecast/core.html#statsforecast.forecast>) it should be information about the *FUTURE* of the exogenous variables... Is this for evaluating the forecast? For example, if I have information for 100 days of two variables (y, var_exog1) and I want to predict the next 5 days (101-105), my understanding is that the exogenous information I should use in X_df is what I have available for days 95-100. Am I correct? Also, if I include the exogenous variables as I just described, the *ds* column of the predictions is returned as a number and not as a date/string column. How can I convert the returned number back to a date? Could you please give me some guidance on these topics? I'm attaching some images about these questions

Hi, <@U04N2QZK6PJ>! Yes, the `X_df` argument takes the exogenous variables for the horizon you want to forecast. For example, if you wanted to predict the next 5 days, you would need the exogenous variables for days 101-105. If these exogenous variables are not available, one option is to do what you mention and use the last 5 available observations. However, this has the disadvantage that the model's performance may be worse.
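As a rough illustration of the layout described above, the sketch below builds a 100-day history and an `X_df` covering days 101-105 with pandas. The data, the series name `series_1` and the future values of `var_exog1` are made up; only the column names come from the question. No forecasting call is made here, it only shows which rows go where.

```python
import numpy as np
import pandas as pd

h = 5  # forecast horizon in days

# History: days 1-100, with the target and one exogenous variable.
history = pd.DataFrame({
    "unique_id": "series_1",
    "ds": pd.date_range("2023-01-01", periods=100, freq="D"),
    "y": np.arange(100, dtype=float),
    "var_exog1": np.linspace(0.0, 1.0, 100),
})

# Dates of the forecast horizon: days 101-105.
future_ds = pd.date_range(history["ds"].max() + pd.Timedelta(days=1), periods=h, freq="D")

# Case 1: the future exogenous values are known (made-up numbers here).
X_df = pd.DataFrame({
    "unique_id": "series_1",
    "ds": future_ds,
    "var_exog1": [1.01, 1.02, 1.03, 1.04, 1.05],
})

# Case 2 (fallback): reuse the last h observed values, stamped with the future dates.
X_df_fallback = X_df.assign(var_exog1=history["var_exog1"].tail(h).to_numpy())

print(X_df)
```

Note that `X_df` has no `y` column and its dates start right after the last date of the history.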

Regarding the dates, thanks for letting us know! I saw that it is due to a bug on our side. I will try to fix it between today and tomorrow, and I'll let you know once it is fixed :slightly_smiling_face:

exogenous.dbc

hi, <@U04N2QZK6PJ>! We now have a version that fixes the date problem:

• This version is not on pip yet, so for now you can try it using `%pip install "statsforecast @ git+https://github.com/Nixtla/statsforecast.git@main"` or by adding `git+https://github.com/nixtla/statsforecast.git@main` to your cluster.
• Starting with version `1.5.0` of statsforecast, a backend is no longer needed: you can pass Spark dataframes directly to `forecast` and the engine will be recognized (as in the image).
• I'm attaching an example of the new usage of statsforecast with exogenous variables in Databricks.
Let me know if you run into any errors or if anything else needs clarification. :slightly_smiling_face:

I fit AutoARIMA to my data and get the (p, d, q) (P, D, Q) out of it and pass that to ARIMA from the statsmodels package and get a different output with lower errors! Has anyone experienced this?

Don't do that! Don't pad with zeros like that. There is a huge difference between 'zero' and 'nonexistent' data! When you do zero padding like that, you are essentially extrapolating into the past. This is very different than intermittent data where sometimes you might be able to use zero padding. When you do this, you introduce lots of zeros into the past and will end up averaging your good data with so many zeros from products that didn't even exist back then! My recommendation is to try to forecast at a higher level of granularity where you have more data. For instance, think of clustering some products together to get more available data for forecasting.

Hi Stephen,

I have had to deal with this problem before when generating a forecast for a large retailer (pet supplies). In general, it is a hard problem, but what has worked for me is to use a hierarchical approach based on forecasted proportions. You can find a description of this method in the following paper by Hyndman et al. (See section 3.3).

<https://robjhyndman.com/papers/hiertourism.pdf>

With this approach, you don't need to remove discontinued items or categories. Just make sure their forecast is zero before doing the splashing.  We currently don't have this approach in Nixtla's hierarchical forecast, but you can generate a forecast for a middle and a bottom level using any of Nixtla's models and then do the splashing described in the paper above.
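A minimal sketch of one parent-to-children step of the forecasted-proportions split described above, using NumPy only. The function name `split_by_forecasted_proportions` and the numbers are hypothetical, and this is not part of Nixtla's hierarchical forecast library. For a deeper hierarchy the same step would be applied level by level.

```python
import numpy as np


def split_by_forecasted_proportions(parent_fcst, child_fcsts, discontinued):
    """Distribute a parent forecast to its children using forecasted proportions.

    parent_fcst: shape (h,), forecast of the middle-level series.
    child_fcsts: shape (n_children, h), base forecasts of the bottom-level items.
    discontinued: boolean mask of shape (n_children,).
    """
    child = np.array(child_fcsts, dtype=float)
    child[np.asarray(discontinued, dtype=bool)] = 0.0  # force zero before splitting
    totals = child.sum(axis=0)
    # Share of each child in the sum of child forecasts, per horizon step.
    shares = np.divide(child, totals, out=np.zeros_like(child), where=totals > 0)
    return shares * np.asarray(parent_fcst, dtype=float)


parent = [100.0, 120.0]                                   # category forecast, h = 2
children = [[30.0, 40.0], [10.0, 20.0], [20.0, 20.0]]     # item base forecasts
discontinued = [False, False, True]                       # third item is discontinued

bottom = split_by_forecasted_proportions(parent, children, discontinued)
print(bottom)
```

The item forecasts add up to the category forecast at every step, and the discontinued item stays at zero.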

Happy to help if you want to discuss this more in detail.

> My recommendation is to try to forecast at a higher level of granularity where you have more data. For instance, think of clustering some products together to get more available data for forecasting.
While I expect this could correct the flaw in assumption, this wouldn’t meet the requirements of the project, since we need to forecast individual items not a category of items.

> With this approach, you don’t need to remove discontinued items or categories. Just make sure their forecast is zero before doing the splashing.
When you use the word “splashing”, are you referring to distributing a forecast down a hierarchy based on proportion? I couldn’t find that phrase in the paper and just wanted to make sure I understand.

Repeating it back so I’m clear. In this scenario, records wouldn’t be padded and forecasts would only be produced for an item when data is available. However, once they are converted to a distribution, they would be padded to facilitate the “splashing”. Meaning, there would always be a bottom level proportion for each base level category, however items or categories which either (a) have not yet launched or (b) have been discontinued would be given a fixed 0% for the distribution.

<@U04MWKTPS3G>  Forecasting is not magic as Max said once. If you have items that aren't launched or have been around for only a month or two, there is just not enough data to understand their behavior. You can try different methods but under the hood they all have to make the inference by looking at other items that have more data. I know communicating these types of scientific facts with the business is often challenging and the requirements might be unreasonable. Wish you good luck with that. Let us know if you find some method that works.

Hi Stephen,

Yes, by "splashing" I meant the process of distributing the forecast down.
And yes, in this scenario, the discontinued items or categories are used as part of the history, but end up with a zero forecast.

hi <@U04N83YHHSP>! Thanks for using statsforecast. Currently, `AutoARIMA` does not calculate p-values, but you can derive them from the standard errors. To compute the standard errors, use:

``` np.sqrt(np.diag(model.model.model_['var_coef'])) ```
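Going from standard errors to p-values can be sketched as below, assuming a normal approximation (a Wald test) for each coefficient. The helper `wald_p_values` and the example numbers are hypothetical; the coefficients and the covariance matrix would have to be taken from the fitted model, and how to extract the coefficients is not shown in this thread.

```python
import math

import numpy as np


def wald_p_values(coef, var_coef):
    """Two-sided p-values from coefficients and their covariance matrix."""
    coef = np.asarray(coef, dtype=float)
    se = np.sqrt(np.diag(np.asarray(var_coef, dtype=float)))  # standard errors
    z = coef / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return se, z, p


# Made-up coefficients and covariance matrix, for illustration only.
coef = [0.6, -0.05]
var_coef = [[0.01, 0.0], [0.0, 0.04]]

se, z, p = wald_p_values(coef, var_coef)
print(se, z, p)
```

In this toy example the first coefficient is many standard errors away from zero, while the second one is not distinguishable from zero.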

hi <@U040AMG6N69>! The only preprocessing `StatsForecast`  does is to convert each time series to `float32` . Are the differences too big?
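To get a feeling for how much a `float32` conversion alone can change the inputs, here is a small NumPy check on made-up values. It only measures the rounding introduced by the cast and says nothing about how a particular model amplifies it.

```python
import numpy as np

y64 = np.array([123456.789, 0.1, 1e-3], dtype=np.float64)
y32 = y64.astype(np.float32)

# Relative rounding error introduced by the cast, measured in float64.
rel_err = np.abs(y32.astype(np.float64) - y64) / np.abs(y64)
print(rel_err)
```

The relative error of each value is bounded by 2**-24 (about 6e-8), so larger discrepancies between two libraries most likely come from the estimation procedure rather than from the cast.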

I don't remember the exact numbers right now. Have to go back and run it again for that particular series. I saw this happening for only one series while my other series always returned same errors both with and without StatsForecast. Will post here if I see it again.

Amazon Fortuna launches conformal prediction forecasting <https://aws-fortuna.readthedocs.io/en/latest/examples/enbpi_ts_regression.html>

Plus 1 for TBATS, TBATS is great. Auto TBATS would be awesome.

If you feel like it, please <https://www.linkedin.com/posts/mergenthaler_together-ray-fugue-and-nixtla-provide-activity-7029860051553001472-fGp1?utm_source=share&utm_medium=member_desktop|show some love.>

They might correct me but I don’t think there is native support with cudf code. You might be able to utilize a GPU through the Dask backend though, but not fully sure

It is integrated with Ray and Ray natively supports GPUs. But I haven't used it myself yet.

Is the missing data nonexistent data, or is it truly missing because it was not collected or because of other issues with data collection/processing? If it is nonexistent data (e.g., an item was not sold on Monday) then you can use intermittent forecasting techniques. Nixtla has a bunch of them (Croston, ADIDA, IMAPA). If the data is truly missing then you need to decide how to impute it.
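For readers unfamiliar with intermittent methods, here is a minimal NumPy sketch of the classic Croston idea: smooth the nonzero demand sizes and the intervals between them separately, then forecast their ratio. The function `croston_classic` and the smoothing value `alpha=0.1` are choices made for this illustration, not Nixtla's implementation.

```python
import numpy as np


def croston_classic(y, alpha=0.1):
    """One-step-ahead demand rate from classic Croston smoothing."""
    y = np.asarray(y, dtype=float)
    nonzero_idx = np.flatnonzero(y)
    if nonzero_idx.size == 0:
        return 0.0  # no demand was ever observed
    demand = y[nonzero_idx]
    # Periods between consecutive nonzero demands (first one counted from the start).
    intervals = np.diff(np.concatenate(([-1], nonzero_idx)))
    z, p = demand[0], float(intervals[0])
    for d, q in zip(demand[1:], intervals[1:]):
        z = alpha * d + (1 - alpha) * z   # smoothed demand size
        p = alpha * q + (1 - alpha) * p   # smoothed inter-demand interval
    return z / p


sales = [0, 2, 0, 2, 0, 2]  # an item sold every second period
print(croston_classic(sales))
```

Here the zeros are real observations of "no sale", which is why they are kept in the series instead of being treated as missing values.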

Is there any way to use a machine learning algorithm like LightGBM in statsforecast for multiple time series?

Hi <@U04QFSNGNH1> my suggestion would be to interpolate the values (can be anything linear, splines, etc.) and then create a dummy variable to indicate that you have imputed for that particular timeslot
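The suggestion above (interpolate, then flag the imputed time slots) can be sketched with pandas as follows. The data and the column name `was_imputed` are made up for this example, and linear interpolation is just one of the possible choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=6, freq="D"),
    "y": [10.0, np.nan, 14.0, np.nan, np.nan, 20.0],
})

# Flag the missing slots first, then fill them by linear interpolation.
df["was_imputed"] = df["y"].isna().astype(int)
df["y"] = df["y"].interpolate(method="linear")

print(df)
```

The `was_imputed` column can then be passed to a model as an exogenous variable, so that it can treat the filled values differently from observed ones.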

sorry if this has already been asked but is there a way to extract the entire test set feature matrix, including features added (such as rolling/lagged) when instantiating an MLForecast object? the preprocess method does this for the training set but I want to recover my test set with the aforementioned features added without having to do the calculations for my test from scratch. When I do `test_sample = model.preprocess(test, id_col='my_id', time_col='ds', target_col='y', static_features=[])` , I only recover the last sample of the test set feature matrix. TIA.