# statsforecast
m
Hello all! I have an issue with AutoARIMA: it requires an exogenous variable if I am using Ray or Dask. It does work if I use a regular pandas DataFrame, even if I am not providing the exogenous variable. Is this some kind of bug, or is this how it should be? Is there any way to avoid it? Thank you in advance. This is the code snippet (I use the Google Colab environment):
!pip3 install -U statsforecast==1.7.4
!pip3 install -U ray==2.20.0
!pip3 install -U dask[dataframe]==2023.8.1
import os
os.environ['NIXTLA_ID_AS_COL'] = '1'
from statsforecast.core import StatsForecast
from statsforecast.models import (
    AutoARIMA,
    AutoETS,
)
from statsforecast.utils import generate_series
from sklearn.preprocessing import RobustScaler
max_orders = 30
FIT_SETTINGS = dict(
    d=None,
    D=None,
    max_order=max_orders,
    max_p=max_orders,
    max_d=max_orders,
    max_q=max_orders,
    max_P=max_orders,
    max_D=max_orders,
    max_Q=max_orders,
    start_p=2,
    start_q=2,
    start_P=2,
    start_Q=2,
    test="kpss",
    stepwise=True,
    method='lbfgs',
    seasonal=True,
)
n_series = 1000
horizon = 180
models = [AutoARIMA(season_length=7, **FIT_SETTINGS)]
series = generate_series(n_series, min_length=365*3, max_length=365*3)
sf = StatsForecast(
    verbose=True,
    models=models,
    freq='D',
    n_jobs=-1,
)
# THIS RUNS SMOOTHLY
p_statsforecast = sf.forecast(df=series, h=horizon)
import ray
import logging
ray.init(logging_level=logging.ERROR)
series = series.reset_index()
series['unique_id'] = series['unique_id'].astype(str)
ctx = ray.data.context.DatasetContext.get_current()
ctx.use_streaming_executor = False
ray_series = ray.data.from_pandas(series).repartition(150)
# THIS ONE FAILS
p = sf.forecast(df=ray_series, h=horizon)
j
I don't think the reset_index is necessary. In the Ray version you're passing an extra column, "index", which gets interpreted as an exogenous feature.
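A quick pandas sketch of what's happening (the frame shape here is assumed for illustration; with `NIXTLA_ID_AS_COL='1'`, `unique_id` is already a column, so the frame's index is a plain RangeIndex):

```python
import pandas as pd

# Hypothetical frame shaped like statsforecast input: unique_id, ds, y
# are already columns, so the index is just a RangeIndex.
df = pd.DataFrame({
    'unique_id': ['0', '0'],
    'ds': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'y': [1.0, 2.0],
})

# reset_index() on a RangeIndex keeps the old index as a new 'index'
# column, which a distributed backend then treats as an exogenous
# regressor it expects you to supply at forecast time.
with_extra = df.reset_index()
print(list(with_extra.columns))  # ['index', 'unique_id', 'ds', 'y']

# If you do need to reset, drop=True discards the old index instead
# of keeping it as a column.
clean = df.reset_index(drop=True)
print(list(clean.columns))  # ['unique_id', 'ds', 'y']
```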
m
Ahh, I see. I tried it and it runs without an error now. Thank you for your help!
Hi @José Morales, your suggestion was a very big help in getting the code to run on Ray. Now I want to run it on my company's Spark environment, which has Python 3.7 and PySpark 2.4.7. I've installed Statsforecast 1.6.0 and Fugue 0.8.6 there. I want to try it with a dummy dataset and dummy code (which I retrieved from the Nixtla website right here: https://nixtlaverse.nixtla.io/statsforecast/docs/distributed/spark.html). I also modified the code slightly to adapt it to our Spark environment, so the code is:
from statsforecast.core import StatsForecast
from statsforecast.models import ( 
   AutoARIMA,
   AutoETS,
)
from statsforecast.utils import generate_series

n_series = 4
horizon = 7

series = generate_series(n_series)

sf = StatsForecast(
   models=[AutoETS(season_length=7)],
   freq='D',
)
sf.forecast(df=series, h=horizon).head()
from pyspark.sql import SparkSession

spark = SparkSession \
   .builder \
   .appName('forecast') \
   .config('spark.ui.port', 'xxxx') \
   .config('spark.executor.memory', 'xg') \
   .config('spark.driver.memory', 'xxg') \
   .master('local[*]')\
   .enableHiveSupport() \
   .getOrCreate()

# Make unique_id a column
series = series.reset_index()
series['unique_id'] = series['unique_id'].astype(str)

# Convert to Spark
sdf = spark.createDataFrame(series)

# Returns a Spark DataFrame
sf.forecast(df=sdf, h=horizon, level=[90]).show(5)
When I ran the code, I got an error like the one in the attachment. Am I doing something wrong? What can I do to get it running in the Spark environment?
j
Are you able to upgrade to Python 3.8? There have been several fixes since 1.6.0
m
Unfortunately, I can't upgrade to Python 3.8 right now because the environment is also used by other people in the department. I'll consult with IT Infrastructure first about a new environment. Thank you for your assistance