Marco Gorelli
03/07/2022, 1:53 PM
index=pd.Index([0] * ap_test.size, name='unique_id')
Is it something like: 0 indicates points belonging to one time series, 1 would indicate points belonging to another time series, and putting them all into the same dataframe and running auto_arima once is more efficient than running auto_arima separately for each time series?
fede (nixtla) (they/them)
03/07/2022, 5:37 PM
The StatsForecast class is more efficient if you have a lot of time series because each time series is fitted in parallel. For example, the following image shows a dataframe with two time series. To run the process in parallel you just have to set the n_jobs parameter. For example, fcst = StatsForecast(series_train, models=[(auto_arima, 24)], n_jobs=2) and each time series will be fitted at the same time using multithreading. 🙂
Marco Gorelli
03/07/2022, 5:38 PM
Max M (Nixtla)
03/11/2022, 4:09 PM
Marco Gorelli
03/11/2022, 4:12 PM
n_jobs, which takes 9 minutes with 1 core and >10 minutes with more cores.
Seems strange - hope we can resolve this so I can use statsforecast to its full potential!
Max M (Nixtla)
03/11/2022, 4:15 PM
Marco Gorelli
03/11/2022, 4:16 PM
Max M (Nixtla)
03/11/2022, 4:17 PM
Marco Gorelli
03/11/2022, 4:21 PM
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

n_cpu = 4
datasets_list = ['iris']  # , 'digits', 'wine', 'breast_cancer', 'diabetes']

# Hyperparameter grid passed to GridSearchCV
parameters = {
    'n_neighbors': np.arange(2, 25),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski'],
    'algorithm': ['ball_tree', 'kd_tree'],
}


def load_and_train(name):
    # The named dataset is loaded but not used; random data is fitted instead
    dataset = getattr(datasets, 'load_' + name)()
    X = np.random.randn(50_000, 1_000)
    y = np.random.randint(low=0, high=3, size=50_000)
    tdelta_list = []
    # Time one fit for each n_jobs value from 1 to n_cpu
    for i in range(1, n_cpu + 1):
        s = time.time()
        model = KNeighborsClassifier(n_jobs=i)
        # The grid search is constructed but never run; only the bare model is fitted
        clf = GridSearchCV(model, parameters, cv=10)
        model.fit(X, y)
        e = time.time()
        tdelta = e - s
        tdelta_list.append({'time': tdelta, 'bin': i})
    return tdelta_list


for d in datasets_list:
    tdelta_list = load_and_train(d)
    df = pd.DataFrame(tdelta_list)
    plt.plot(df['bin'], df['time'], label=d)

plt.grid()
plt.legend()
plt.xlabel('N Jobs')
plt.ylabel('Time for fit (sec.)')
plt.title('Time for fit VS number of CPUs')
plt.show()
(just to check that it was a generic problem with the way I was running it that n_jobs was making things slower, and not something specific to statsforecast)
Max M (Nixtla)
03/11/2022, 4:30 PM
statsforecast?
Marco Gorelli
03/11/2022, 4:34 PM
statsforecast
Really like the interface BTW! Hopefully will have more feedback once I sort out the development environment
Max (Nixtla)
03/21/2022, 9:17 PM