# general
m
Hey - is it OK to ask a question about statsforecast here? It’s not clear to me why we need to set
```
index=pd.Index([0] * ap_test.size, name='unique_id')
```
Is it something like: `0` indicates points belonging to one time series, `1` would indicate points belonging to another time series, and by putting them all into the same dataframe and running `auto_arima` once it’s more efficient than running `auto_arima` separately for each time series?
👀 2
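For context, here is a minimal sketch of the long-format layout being asked about: two toy series stacked into one dataframe, told apart by the `unique_id` index. The names `series_a`/`series_b`, the dates, and the values are all made up for illustration; only the `pd.Index(..., name='unique_id')` pattern comes from the question above.

```python
import pandas as pd

# Two toy series, each with its own unique_id (0 and 1); values are made up.
series_a = pd.DataFrame(
    {'ds': pd.date_range('2020-01-01', periods=3, freq='D'),
     'y': [1.0, 2.0, 3.0]},
    index=pd.Index([0] * 3, name='unique_id'),
)
series_b = pd.DataFrame(
    {'ds': pd.date_range('2020-01-01', periods=3, freq='D'),
     'y': [10.0, 20.0, 30.0]},
    index=pd.Index([1] * 3, name='unique_id'),
)

# One long dataframe: the unique_id index says which series each row belongs to.
series_train = pd.concat([series_a, series_b])
print(series_train.index.unique().tolist())  # → [0, 1]
```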
f
Hi @Marco Gorelli! Yes, using the `StatsForecast` class is more efficient if you have a lot of time series because each time series is fitted in parallel. For example, the following image shows a dataframe with two time series. To run the process in parallel you just have to set the `n_jobs` parameter. For example,
```
fcst = StatsForecast(series_train, models=[(auto_arima, 24)], n_jobs=2)
```
and each time series will be fitted at the same time using multithreading. 🙂
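This is not statsforecast's actual internals - just a standard-library sketch of the idea that each series can be fitted by a separate worker. The `fit_one` function and its "model" (a plain mean) are my own stand-ins for illustration; only the one-worker-per-series idea comes from the message above.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_one(series):
    # Stand-in for fitting one model to one series: here just its mean.
    return sum(series) / len(series)

# Two toy series keyed by unique_id.
all_series = {0: [1.0, 2.0, 3.0], 1: [10.0, 20.0, 30.0]}

# The n_jobs=2 analogue: two workers, each fitting one series concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    fits = dict(zip(all_series, pool.map(fit_one, all_series.values())))

print(fits)  # → {0: 2.0, 1: 20.0}
```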
m
wow, neat!
🙌 2
m
@Marco Gorelli: how is your “coding friday” going? Do you need any assistance?
m
Hey @Max M (Nixtla), thanks for asking! Locally this works great! I then tried running on DataBricks, and it takes significantly longer - but I think the issue might be with DataBricks. Example: the execution time of a sklearn script, as a function of `n_jobs`, takes 9 minutes with 1 core and >10 minutes with more cores. Seems strange - hope we can resolve this so I can use `statsforecast` to its full potential!
m
Interesting issue
Why are you using DataBricks?
m
it’s what they want us to use at work
m
I see, it’s a super powerful tool.
What do you mean by “sklearn script”?
m
I meant:
```
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import *
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

n_cpu = 4

# Hyperparameter grid for the KNN classifier.
parameters = {'n_neighbors': np.arange(2, 25),
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'manhattan',
                         'chebyshev', 'minkowski'],
              'algorithm': ['ball_tree', 'kd_tree']}

def load_and_train(name):
    dataset = globals()['load_' + name]()  # loaded but unused: X, y below are random
    X = np.random.randn(50_000, 1_000)
    y = np.random.randint(low=0, high=3, size=50_000)
    tdelta_list = []
    for i in range(1, n_cpu + 1):
        s = time.time()
        model = KNeighborsClassifier(n_jobs=i)
        clf = GridSearchCV(model, parameters, cv=10)
        clf.fit(X, y)  # fit the grid search (original fitted the bare model)
        e = time.time()
        tdelta_list.append({'time': e - s, 'bin': i})
    return tdelta_list

datasets_list = ['iris']  # , 'digits', 'wine', 'breast_cancer', 'diabetes'
for d in datasets_list:
    tdelta_list = load_and_train(d)
    df = pd.DataFrame(tdelta_list)
    plt.plot(df['bin'], df['time'], label=d)
plt.grid()
plt.legend()
plt.xlabel('N Jobs')
plt.ylabel('Time for fit (sec.)')
plt.title('Time for fit VS number of CPUs')
plt.show()
```
(just to check that it was a generic problem with the way I was running it that `n_jobs` was making things slower, and not something specific to `statsforecast`)
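One quick diagnostic for the slowdown (my suggestion, not from this thread): check how many cores the Python process on the DataBricks driver can actually use, since setting `n_jobs` above that just causes oversubscription overhead.

```python
import os

logical = os.cpu_count()  # logical CPUs the OS reports (may be None)

# On Linux, the scheduler affinity mask can be smaller than cpu_count(),
# e.g. inside a CPU-limited container; guard for platforms without it.
if hasattr(os, 'sched_getaffinity'):
    usable = len(os.sched_getaffinity(0))
else:
    usable = logical or 1

print(f'logical CPUs: {logical}, usable by this process: {usable}')
```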
m
Ok, so it is now excluded that it is something specific to `statsforecast`?
m
yeah, virtually certain it’s not `statsforecast`
Really like the interface BTW! Hopefully will have more feedback once I sort out the development environment
m
How is everything going with your dev env?