# statsforecast
b
Can anyone tell me how to do cross-validation while using Spark for distributed processing, or direct me to documentation on it? The page linked below says it's possible, but I can't find any documentation or tutorials on how to do it. I also couldn't find any previous posts with this question. https://nixtla.github.io/statsforecast/docs/getting-started/getting_started_complete.html#train-multiple-models-for-many-series
k
Did you try just passing a Spark DataFrame instead? I forget if that works already.
Otherwise, you can make a backend object like this, and the backend has the cross_validation method.
b
@Kevin Kho thanks for your response. I tried passing the Spark DataFrame to cross_validation, but that just creates a Spark DataFrame itself and doesn't run/train the models. See example here:
sf = StatsForecast(
    models=SF_models,
    freq='M',
    # n_jobs = -1,
    fallback_model = Naive()
)
SF_crossvalidation_df = sf.cross_validation(df = sdf, h = 3, step_size = 1, n_windows = 5)
I also tried the parallel=backend argument from the link you provided, in both StatsForecast and cross_validation. When I put it in the former I get the error "TypeError: __init__() got an unexpected keyword argument 'parallel'". When I put it in the latter I get the error "TypeError: cross_validation() got an unexpected keyword argument 'parallel'".
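For anyone reading along, here is a rough sketch of how h, step_size and n_windows in the call above carve up a single series. It assumes the windows slide forward so that the last test window ends at the final observation, which is my understanding of the layout and is not copied from the statsforecast source. The helper name cv_windows and the 24-month series length are hypothetical.

```python
def cv_windows(n_obs, h, step_size, n_windows):
    """Return (train_size, test_end) pairs for each cross-validation window.

    Assumption: windows slide forward by `step_size` and the last test
    window ends at the final observation of the series.
    """
    # Total span at the end of the series that is covered by test windows.
    test_size = h + step_size * (n_windows - 1)
    if test_size >= n_obs:
        raise ValueError("series is too short for the requested windows")

    windows = []
    for i in range(n_windows):
        # Number of observations available for training in window i.
        train_size = n_obs - test_size + i * step_size
        # The test window covers positions train_size .. train_size + h - 1.
        windows.append((train_size, train_size + h))
    return windows


# Same settings as the call above, on a hypothetical 24-month series.
windows = cv_windows(n_obs=24, h=3, step_size=1, n_windows=5)
for train_size, test_end in windows:
    print(f"train on first {train_size} months, test on months {train_size + 1} to {test_end}")
```

With these settings each series needs more than 7 observations, since the five overlapping 3-step test windows together span the last 7 months.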
k
Ah ok. I meant try:
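from statsforecast.distributed.fugue import FugueBackend

# `spark` is the active SparkSession, `sdf` the Spark DataFrame holding the series
backend = FugueBackend(spark, {"fugue.spark.use_pandas_udf": True})
backend.cross_validation(df = sdf,
                         h = 3,
                         step_size = 1,
                         n_windows = 5)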
Oh my bad, my instructions were very bad
It would use this part of the code
And then it will return a Spark DataFrame, so you might need to do something to trigger the action.
j
The cross_validation method returns a Spark DataFrame, which is evaluated lazily. As Kevin said, in order to trigger the action you have to do something with it, for example:
cv_results = sf.cross_validation(df=spark_df, h=10)
cv_results.write.parquet('cv_results')
If you're using a remote cluster, make sure to save it to shared storage such as S3.
b
Thank you @Kevin Kho and @José Morales, this resolved my issues with statsforecast CV. Appreciate the help! I am now having an issue with doing CV in MLforecast. I can post my question on that in the mlforecast channel.
k
Nice!
Oh, José is right. Your original call was probably correct already; you just needed to save to parquet, call show, or do something else to trigger the action.