# general
d
Hi I got this error on databricks trying to replicate the prediction process in this link: https://nixtla.github.io/mlforecast/distributed.forecast.html#spark Can someone help me with this? Thanks!
I also encountered the same error when I ran this command:
Copy code
from statsforecast.models import ( ADIDA, CrostonClassic, IMAPA, TSB )
k
Not sure but it sounds like you have installation issues? Might be worth trying a fresh environment
d
How should I approach this? I am very new to databricks.
k
Is it the import itself that errors?
How did you install the library? Did you go to the cluster and then to the libraries?
Btw, you don’t need to create a spark variable on Databricks. It already exists when you load the notebook.
If you click the + on the TypeError, can I see the full Traceback?
d
I directly used pip install in the notebook. Also, both errors were on the import command.
Screenshot 2023-06-05 at 4.47.24 PM.png,Screenshot 2023-06-05 at 4.47.45 PM.png
k
Don’t pip install in notebook. It only installs on the driver and not the workers. Go to the cluster settings and install in the Libraries tab
Screen Shot 2023-06-05 at 4.14.34 PM.png
Try installing it with this UI instead and maybe things will work better
d
Cool, what is the source and type for them?
k
PyPI and then just need to type in the library name
d
Thank you! I successfully installed the libraries but the same error still occurred.
Here is the complete error message, which is very long:
OK, the message is too long to copy in here, and too long to screenshot too
k
email me a log at kdykho@gmail.com and i can look
d
Sent!
k
Oh shit lol it’s a Fugue issue. Can you go to a notebook and do:
Copy code
!pip show fugue
To get me the Fugue version?
I will try to replicate
d
Screenshot 2023-06-05 at 6.30.15 PM.png
k
Ok let me spin up a cluster and try this now
Can you also give me Databricks Runtime version? It’s attached to the cluster
d
It’s 13.1
k
I suspect the latest 12 will work but I’m verifying. I think 13 had breaking changes
I can confirm 12 works for me, trying on 13
Works for 13.0, trying to reproduce on 13.1
Pinging @Han Wang
He attached this log with this:
h
can you do this @Dihong Huang
Copy code
!pip show antlr4-python3-runtime
@Kevin Kho can you reproduce the error on 13.1?
d
Screenshot 2023-06-05 at 6.54.06 PM.png
h
ah this is very weird
in your install, can you force
Copy code
antlr4-python3-runtime==4.11.1
the version you installed is incorrect, it is for py 3.7 and it is also too old
i don't know why you were able to do that, fugue 0.8.4 should bring it to 4.11.1 automatically
maybe you didn't install the packages in the correct way
ah i see this is because stix2-patterns requires a very old version of antlr https://github.com/oasis-open/cti-pattern-validator/blob/master/setup.py#L40
@Dihong Huang is stix2-patterns important to you?
k
Just installing statsforecast worked for me for DBR 12 - 13.1 👍
h
yeah, it failed on @Dihong Huang's side because he has a very special dependency, stix2-patterns, requiring a very old version of a package that conflicts with fugue
k
I think that is a dependency of his
synapse
library
h
since it doesn't complain on the DB side, you can simply enforce the version of antlr4-python3-runtime
d
@Han Wang I think this stix2 thing doesn’t really matter
Thank you guys so much! How should I enforce the version of antlr4-python3-runtime?
k
When you install the PyPI libraries, add another one:
Copy code
antlr4-python3-runtime==4.11.1
and I think that should work
d
I installed that but the error still comes up on import, what else could be wrong?
k
You may need a restart of the cluster
h
when you restart the cluster
you should check the version of the pkg again
to see if it is the new one or old one
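something like this in a notebook cell will show it (a minimal sketch using only the standard library):
Copy code
# check which antlr4 runtime the cluster actually has after the restart
from importlib.metadata import version

print(version("antlr4-python3-runtime"))  # you want 4.11.1 here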
d
Oh I forgot to mention that my databricks is the community edition so I might need to create a new cluster and install them again.
Anyway thank you so much! I will try it later and let you know if there are any other questions!
h
sure
we will see if we can make antlr a soft dependency so you won't hit it unless you need it. but this will take a few weeks
d
The import seems to be working, but a new issue comes up. My dataframe is a pyspark.pandas.frame.DataFrame converted from a Spark dataframe by df.pandas_api(), and there is an error when calling StatsForecast(): ValueError: is not allowed
Do I have to use a pure pandas dataframe?
h
you have to use a native spark dataframe
k
That is expected. PySpark Pandas is a different class and will not be compatible with libraries that support Pandas or even with Spark. Statsforecast can take either a Spark DataFrame or a Pandas DataFrame through Fugue
h
we haven't supported pandas-on-Spark dataframes yet
the conversion is trivial
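for example (a sketch, assuming df is your pandas-on-Spark dataframe):
Copy code
# pandas-on-Spark -> native Spark DataFrame
sdf = df.to_spark()

# or, if you start from plain pandas, let Spark distribute it:
# sdf = spark.createDataFrame(pandas_df)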
k
If you’re on Databricks, you should use a Spark DataFrame
h
yes
oh and please also make sure cloudpickle is installed, the new spark no longer installs cloudpickle automatically, but at least for now we need it
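a quick way to check it's there (a minimal sketch for a notebook cell; you can also just add cloudpickle in the cluster's Libraries tab):
Copy code
# verify cloudpickle is importable on the driver
import importlib.util

print(importlib.util.find_spec("cloudpickle") is not None)  # should print True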
d
OK thanks
Do I use FugueBackend as illustrated in this example?
h
no
k
That will work but the latest versions just do it for you under the hood. You just need to pass a Spark DataFrame to statsforecast when you do:
Copy code
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

sf = StatsForecast(
    models = [AutoARIMA(season_length = 12)],
    freq = 'M'
)

sf.fit(df)  # insert Spark DataFrame here
d
Great thanks
Sorry to bother again, but spark dataframe doesn’t work here either.
At the very end of the error message there is an if-not check specifying that the dataframe must be a pd.DataFrame
k
Let me test one second
d
When I convert the dataframe to a pandas dataframe there is a memory issue.
k
I guess the new way hasn’t been released yet. You might need to use FugueBackend for now but it will become more invisible in the future
Here is a working example
Copy code
from statsforecast.distributed.utils import forecast
from statsforecast.distributed.fugue import FugueBackend
from statsforecast.models import AutoARIMA
from statsforecast.core import StatsForecast
from statsforecast.utils import generate_series  # needed for the example data below

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
backend = FugueBackend(spark, {"fugue.spark.use_pandas_udf":True})

series = generate_series(n_series=3, seed=1).reset_index()
sdf = spark.createDataFrame(series)

res = forecast(sdf, 
         [AutoARIMA()], 
         freq="D", 
         h=7, 
         parallel=backend).toPandas()
res.head()
For timeseries, forecast is preferred over fit-predict because the model can be big, as it contains all of the weights and the whole input timeseries. forecast will do the step in one go and be more memory efficient. Operationally, fit-predict makes sense when you want to store the model and then retrieve it to predict new points later. For timeseries, it’s more common to just run the forecast every so often (every day or week)
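For contrast, a fit-predict version would look roughly like this (a local sketch, reusing the series frame and imports from the example above):
Copy code
# fit-predict: keeps the fitted models around for later predictions,
# at the cost of holding the weights and input series in memory
sf = StatsForecast(models=[AutoARIMA()], freq="D")
sf.fit(series)
preds = sf.predict(h=7)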
In the future, you shouldn’t need to specify the FugueBackend, it will just be inferred for Spark, Dask, and Ray DataFrames
Actually no, I think you shouldn’t need FugueBackend. I’ll look into it
d
This seems to run, but I don’t know what’s going on: it has run for 3 hours and is stuck at the same point
k
You can try it on a sample of the data, like 10% of the timeseries. It may be that your compute resources are struggling? You can check utilization, right? Also, I don’t have a good feel for how long the models take to run. I’m on the Fugue project and we just collaborate with Nixtla, so I am more on the Spark side. It might be worth asking a new question in this Slack, but I think you can benchmark the times with the smaller dataset and you’ll get a feel for how expensive each model is.
It might be bottlenecking. Can you add this config to Spark on your cluster
Copy code
"spark.task.cpus": "1"
d
My local environment takes less than 10 mins to finish the same thing, very weird.
I just tried to run it again and found this. My new cluster has exactly the same runtime and libraries as the previous one
k
Will respond more in like 15 mins
Hey, so on the Spark hanging, were you able to set the config I mentioned a bit above and try that? On this new error, we made a release last night and it seems to have broken things. Looking into it
Can you restart the cluster? We removed the package version with the error; if you restart, I think it will install good versions
d
Do I set the config like this?
I am restarting it right now.
k
That is right, but I believe maybe not enough.
You can try these also:
Copy code
"spark.speculation": "true",
"spark.sql.adaptive.enabled": "false",
"spark.task.cpus": "1"
d
I am trying this now, but the restart before didn’t work
k
Maybe uninstall and reinstall statsforecast?
The restart of the cluster did not reinstall the libraries?
You can do
Copy code
!pip show triad
and you want it to be 0.8.9
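or check a few of them at once (a sketch using the standard library):
Copy code
# print the versions of the Fugue dependency stack
from importlib.metadata import version

for pkg in ["fugue", "triad", "qpd"]:
    print(pkg, version(pkg))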
d
I can see them being reinstalled
Screenshot 2023-06-06 at 12.09.47 PM.png
k
If qpd is 0.4.2 and triad is 0.9.0 it should also work. Did you get it working?
d
It’s now like this, should I stop it?
h
@Dihong Huang if you have time we can have a quick meeting
i think that is just a warning
k
Hi @Dihong Huang, the new versions of the Fugue dependencies have been released, so installing statsforecast and Fugue should not give any issues now. Thanks for reporting!
d
Thank you!