# general
m
Hey - is it OK to ask a question about statsforecast here? It’s not clear to me why we need to set
```
index=pd.Index([0] * ap_test.size, name='unique_id')
```
Is it something like: `0` indicates points belonging to one time series, `1` would indicate points belonging to another time series, and by putting them all into the same dataframe and running `auto_arima` once it’s more efficient than running `auto_arima` separately for each time series?
👀 2
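For context, here is a minimal sketch of the long-format layout being asked about: two toy series stacked into one dataframe, told apart by the `unique_id` index. The names `series_a`/`series_b`, the dates, and the values are all made up for illustration; only the `pd.Index(..., name='unique_id')` pattern comes from the question above.

```python
import pandas as pd

# Two toy series, each with its own unique_id (0 and 1); values are made up.
series_a = pd.DataFrame(
    {'ds': pd.date_range('2020-01-01', periods=3, freq='D'),
     'y': [1.0, 2.0, 3.0]},
    index=pd.Index([0] * 3, name='unique_id'),
)
series_b = pd.DataFrame(
    {'ds': pd.date_range('2020-01-01', periods=3, freq='D'),
     'y': [10.0, 20.0, 30.0]},
    index=pd.Index([1] * 3, name='unique_id'),
)

# One long dataframe: the unique_id index says which series each row belongs to.
series_train = pd.concat([series_a, series_b])
print(series_train.index.unique().tolist())  # → [0, 1]
```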
f
Hi @Marco Gorelli! Yes, using the `StatsForecast` class is more efficient if you have a lot of time series because each time series is fitted in parallel. For example, the following image shows a dataframe with two time series. To run the process in parallel you just have to set the `n_jobs` parameter. For example,
```
fcst = StatsForecast(series_train, models=[(auto_arima, 24)], n_jobs=2)
```
and each time series will be fitted at the same time using multithreading. 🙂
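This is not statsforecast's actual internals - just a standard-library sketch of the idea that each series can be fitted by a separate worker. The `fit_one` function and its "model" (a plain mean) are my own stand-ins for illustration; only the one-worker-per-series idea comes from the message above.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_one(series):
    # Stand-in for fitting one model to one series: here just its mean.
    return sum(series) / len(series)

# Two toy series keyed by unique_id.
all_series = {0: [1.0, 2.0, 3.0], 1: [10.0, 20.0, 30.0]}

# The n_jobs=2 analogue: two workers, each fitting one series concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    fits = dict(zip(all_series, pool.map(fit_one, all_series.values())))

print(fits)  # → {0: 2.0, 1: 20.0}
```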
m
wow, neat!
🙌 2
m
@Marco Gorelli: how is your “coding friday” going? Do you need any assistance?
m
Hey @Max M (Nixtla), thanks for asking! Locally this works great! I then tried running on DataBricks, and it takes significantly longer - but I think the issue might be with DataBricks. Example: the execution time of a sklearn script, as a function of `n_jobs`, takes 9 minutes with 1 core and >10 minutes with more cores. Seems strange - hope we can resolve this so I can use `statsforecast` to its full potential!
m
Interesting issue
Why are you using DataBricks?
m
it’s what they want us to use at work
m
I see, it’s a super powerful tool.
What do you mean by “sklearn script”?
m
I meant:
```
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import *
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

n_cpu = 4

# Hyperparameter grid for the KNN classifier.
parameters = {'n_neighbors': np.arange(2, 25),
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'manhattan',
                         'chebyshev', 'minkowski'],
              'algorithm': ['ball_tree', 'kd_tree']}

def load_and_train(name):
    dataset = globals()['load_' + name]()  # loaded but unused: X, y below are random
    X = np.random.randn(50_000, 1_000)
    y = np.random.randint(low=0, high=3, size=50_000)
    tdelta_list = []
    for i in range(1, n_cpu + 1):
        s = time.time()
        model = KNeighborsClassifier(n_jobs=i)
        clf = GridSearchCV(model, parameters, cv=10)
        clf.fit(X, y)  # fit the grid search (original fitted the bare model)
        e = time.time()
        tdelta_list.append({'time': e - s, 'bin': i})
    return tdelta_list

datasets_list = ['iris']  # , 'digits', 'wine', 'breast_cancer', 'diabetes'
for d in datasets_list:
    tdelta_list = load_and_train(d)
    df = pd.DataFrame(tdelta_list)
    plt.plot(df['bin'], df['time'], label=d)
plt.grid()
plt.legend()
plt.xlabel('N Jobs')
plt.ylabel('Time for fit (sec.)')
plt.title('Time for fit VS number of CPUs')
plt.show()
```
(just to check that it was a generic problem with the way I was running it that `n_jobs` was making things slower, and not something specific to `statsforecast`)
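One quick diagnostic for the slowdown (my suggestion, not from this thread): check how many cores the Python process on the DataBricks driver can actually use, since setting `n_jobs` above that just causes oversubscription overhead.

```python
import os

logical = os.cpu_count()  # logical CPUs the OS reports (may be None)

# On Linux, the scheduler affinity mask can be smaller than cpu_count(),
# e.g. inside a CPU-limited container; guard for platforms without it.
if hasattr(os, 'sched_getaffinity'):
    usable = len(os.sched_getaffinity(0))
else:
    usable = logical or 1

print(f'logical CPUs: {logical}, usable by this process: {usable}')
```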
m
Ok, so it is now excluded that it is something specific to `statsforecast`?
m
yeah, virtually certain it’s not `statsforecast`
Really like the interface BTW! Hopefully will have more feedback once I sort out the development environment
m
How is everything going with your dev env?