#general
d

Dihong Huang

06/02/2023, 9:59 PM
Hi I got this error on databricks trying to replicate the prediction process in this link: https://nixtla.github.io/mlforecast/distributed.forecast.html#spark Can someone help me with this? Thanks!
I also encountered the same error when I ran this command:
from statsforecast.models import ( ADIDA, CrostonClassic, IMAPA, TSB )
k

Kevin Kho

06/05/2023, 8:00 PM
Not sure but it sounds like you have installation issues? Might be worth trying a fresh environment
d

Dihong Huang

06/05/2023, 8:21 PM
How should I approach this? I am very new to databricks.
k

Kevin Kho

06/05/2023, 8:22 PM
Is it importing that literally errors?
How did you install the library? Did you go to the cluster and then to the libraries?
Btw, you don’t need to create the spark variable on Databricks. It already exists when you load the notebook.
If you click the + on TypeError, can I see the full Traceback?
d

Dihong Huang

06/05/2023, 8:46 PM
I directly used pip install in the notebook. Also, both errors were on the import command.
Screenshot 2023-06-05 at 4.47.24 PM.png,Screenshot 2023-06-05 at 4.47.45 PM.png
k

Kevin Kho

06/05/2023, 9:13 PM
Don’t pip install in notebook. It only installs on the driver and not the workers. Go to the cluster settings and install in the Libraries tab
Screen Shot 2023-06-05 at 4.14.34 PM.png
Try installing it with this UI instead and maybe things will work better
d

Dihong Huang

06/05/2023, 9:37 PM
Cool, what is the source and type for them?
k

Kevin Kho

06/05/2023, 9:58 PM
PyPI and then just need to type in the library name
d

Dihong Huang

06/05/2023, 10:17 PM
Thank you! I successfully installed the libraries but the same error still occurred.
Here is the complete error message, which is very long:
OK the message is too long to copy into here and screenshot too
k

Kevin Kho

06/05/2023, 10:19 PM
email me a log at kdykho@gmail.com and i can look
d

Dihong Huang

06/05/2023, 10:25 PM
Sent!
k

Kevin Kho

06/05/2023, 10:27 PM
Oh shit lol it’s a Fugue issue. Can you go to a notebook and do:
!pip show fugue
To get me the Fugue version?
I will try to replicate
d

Dihong Huang

06/05/2023, 10:30 PM
Screenshot 2023-06-05 at 6.30.15 PM.png
k

Kevin Kho

06/05/2023, 10:30 PM
Ok let me spin up a cluster and try this now
Can you also give me Databricks Runtime version? It’s attached to the cluster
d

Dihong Huang

06/05/2023, 10:33 PM
It’s 13.1
k

Kevin Kho

06/05/2023, 10:34 PM
I suspect the latest 12 will work but I’m verifying. I think 13 had breaking changes
I can confirm 12 works for me, trying on 13
Works for 13.0, trying to reproduce on 13.1
Pinging @Han Wang
He attached this log with this:
h

Han Wang

06/05/2023, 10:50 PM
can you do this @Dihong Huang
!pip show antlr4-python3-runtime
@Kevin Kho can you reproduce the error on 13.1?
d

Dihong Huang

06/05/2023, 10:54 PM
Screenshot 2023-06-05 at 6.54.06 PM.png
h

Han Wang

06/05/2023, 10:55 PM
ah this is very weird
in your import can you force
antlr4-python3-runtime==4.11.1
the version you installed is incorrect, it is for py 3.7 and it is also too old
i don't know why you were able to do that, fugue 0.8.4 should bring it to 4.11.1 automatically
maybe you didn't install the packages in the correct way
ah i see this is because stix2-patterns requires a very old version of antlr https://github.com/oasis-open/cti-pattern-validator/blob/master/setup.py#L40
@Dihong Huang is stix2-patterns important to you?
k

Kevin Kho

06/05/2023, 11:01 PM
Just installing statsforecast worked for me for DBR 12 - 13.1 👍
h

Han Wang

06/05/2023, 11:02 PM
yeah it failed on @Dihong Huang's side because he has a very special dependency, stix2-patterns, requiring a very old version of a package that conflicts with fugue
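Note: a minimal sketch of how one might find which installed packages pull in antlr4-python3-runtime, using only the standard library's importlib.metadata. The helper names (requirement_name, find_dependents) are made up for illustration and are not part of Fugue or statsforecast. In a notebook it inspects the driver's Python environment only.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Return the normalized project name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if not match:
        return ""
    # Normalize the same way package indexes do: runs of -, _, . become "-".
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def find_dependents(target: str, dists=None) -> dict:
    """Map each installed distribution that requires `target` to its requirement lines."""
    wanted = requirement_name(target)
    found = {}
    for dist in (dists if dists is not None else metadata.distributions()):
        for requirement in dist.requires or []:
            if requirement_name(requirement) == wanted:
                found.setdefault(dist.metadata["Name"], []).append(requirement)
    return found


if __name__ == "__main__":
    # Shows, for example, whether something other than fugue pins antlr.
    for name, requirements in find_dependents("antlr4-python3-runtime").items():
        print(name, requirements)
```

If a package other than fugue appears in the output with an older pin, that is a likely source of the conflict described above.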
k

Kevin Kho

06/05/2023, 11:03 PM
I think that is a dependency of his synapse library
h

Han Wang

06/05/2023, 11:03 PM
since it doesn't complain on DB side, you can simply enforce the version of antlr4-python3-runtime
d

Dihong Huang

06/05/2023, 11:04 PM
@Han Wang I think this stix2 thing doesn’t really matter
Thank you guys so much! How should I enforce the version of antlr4-python3-runtime?
k

Kevin Kho

06/05/2023, 11:08 PM
When you install the PyPI libraries, add another one:
antlr4-python3-runtime==4.11.1
and I think that should work
d

Dihong Huang

06/05/2023, 11:23 PM
I installed that but the error still comes up when importing. What else could be wrong?
k

Kevin Kho

06/05/2023, 11:38 PM
You may need a restart of the cluster
h

Han Wang

06/05/2023, 11:41 PM
when you restart the cluster
you should check the version of the pkg again
to see if it is the new one or old one
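Note: a small sketch of checking the relevant versions from Python after a restart, as an alternative to running !pip show once per package. It uses only importlib.metadata from the standard library; the helper name installed_versions is hypothetical, and the package list is taken from the packages mentioned in this thread. It reports what is installed on the driver only.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    names = ["statsforecast", "fugue", "triad", "qpd", "antlr4-python3-runtime"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```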
d

Dihong Huang

06/05/2023, 11:42 PM
Oh I forgot to mention that my databricks is the community edition so I might need to create a new cluster and install them again.
Anyway thank you so much! I will try it later and let you know if there are any other questions!
h

Han Wang

06/05/2023, 11:47 PM
sure
we will see if we can make antlr a soft dependency so that you won't hit it unless you need it. but this will take a few weeks
d

Dihong Huang

06/06/2023, 1:00 AM
The import seems to be working, but a new issue has come up. My dataframe is a pyspark.pandas.frame.DataFrame converted from a Spark dataframe by df.pandas_api(), and there is an error when calling StatsForecast(): ValueError: is not allowed
Do I have to use a pure pandas dataframe?
h

Han Wang

06/06/2023, 1:01 AM
you have to use a native spark dataframe
k

Kevin Kho

06/06/2023, 1:02 AM
That is expected. PySpark Pandas is a different class and will not be compatible with libraries that are compatible with Pandas or even with Spark. StatsForecast can take either a Spark DataFrame or a Pandas DataFrame through Fugue
h

Han Wang

06/06/2023, 1:02 AM
we do not support pandas-on-Spark dataframes yet
the conversion is trivial
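Note: a minimal sketch of that conversion. pandas-on-Spark DataFrames expose to_spark(), which returns a native Spark DataFrame; by default it drops the index, so index_col is passed through in case the series id lives in the index. The wrapper name ensure_spark_dataframe is hypothetical, and detecting the type by module name is just one way to avoid importing pyspark in the helper.

```python
def ensure_spark_dataframe(df, index_col=None):
    """Return a native Spark DataFrame, converting from pandas-on-Spark if needed."""
    module = type(df).__module__ or ""
    if module.startswith("pyspark.pandas"):
        # pyspark.pandas.DataFrame.to_spark drops the index unless index_col is set.
        return df.to_spark(index_col=index_col)
    # Anything else (for example a pyspark.sql.DataFrame) is returned unchanged.
    return df
```

For a frame created with df.pandas_api(), calling ensure_spark_dataframe(psdf) should give back something that can be passed on as a regular Spark DataFrame.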
k

Kevin Kho

06/06/2023, 1:02 AM
If you’re on Databricks, you should use a Spark DataFrame
h

Han Wang

06/06/2023, 1:02 AM
yes
oh and please also make sure cloudpickle is installed, the new spark no longer installs cloudpickle automatically, but at least for now we need it
d

Dihong Huang

06/06/2023, 1:05 AM
OK thanks
Do I use FugueBackend as illustrated in this example?
h

Han Wang

06/06/2023, 1:05 AM
no
k

Kevin Kho

06/06/2023, 1:06 AM
That will work but the latest versions just do it for you under the hood. You just need to pass a Spark DataFrame to statsforecast when you do:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

sf = StatsForecast(
    models = [AutoARIMA(season_length = 12)],
    freq = 'M'
)

sf.fit(df)  # insert Spark DataFrame here
d

Dihong Huang

06/06/2023, 1:07 AM
Great thanks
Sorry to bother again, but spark dataframe doesn’t work here either.
At the very end of the error message there is an if not statement specifying that the dataframe must be a pd.DataFrame
k

Kevin Kho

06/06/2023, 1:38 AM
Let me test one second
d

Dihong Huang

06/06/2023, 1:50 AM
When I convert the dataframe to a pandas dataframe, there is a memory issue.
k

Kevin Kho

06/06/2023, 1:51 AM
I guess the new way hasn’t been released yet. You might need to use FugueBackend for now but it will become more invisible in the future
Here is a working example
from statsforecast.distributed.utils import forecast
from statsforecast.distributed.fugue import FugueBackend
from statsforecast.models import AutoARIMA
from statsforecast.core import StatsForecast
from statsforecast.utils import generate_series  # needed for generate_series below

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
backend = FugueBackend(spark, {"fugue.spark.use_pandas_udf":True})

series = generate_series(n_series=3, seed=1).reset_index()
sdf = spark.createDataFrame(series)

res = forecast(sdf, 
         [AutoARIMA()], 
         freq="D", 
         h=7, 
         parallel=backend).toPandas()
res.head()
For timeseries, forecast is preferred over fit-predict because the model can be big, as it contains all of the weights and the whole timeseries on the input. forecast will do the step in one go and be more memory efficient. Operationally, fit-predict makes sense when you want to store the model and then retrieve it to predict new points later. For timeseries, it’s more common to just run the forecast every so often (every day or week)
In the future, you shouldn’t need to specify the FugueBackend, it will just be inferred for Spark, Dask, and Ray DataFrames
Actually no, I think you shouldn’t need FugueBackend. I’ll look into it
d

Dihong Huang

06/06/2023, 5:10 AM
This seems to run, but I don’t know what’s going on: it has been running for 3 hours and is stuck at the same point
k

Kevin Kho

06/06/2023, 5:27 AM
You can try on a sample of the data, like 10% of the timeseries. It may be your compute resources are struggling? You can check utilization, right? Also, I don’t have a good feel for how long the models take to run. I’m on the Fugue project and we just collaborate with Nixtla, so I am more on the Spark side. It might be worth asking a new question in this Slack, but I think you can benchmark the times with the smaller dataset and you’ll get a feel for how expensive each model is.
It might be bottlenecking. Can you add this config to Spark on your cluster
"spark.task.cpus": "1"
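Note: the suggestion above to benchmark on roughly 10% of the series could be sketched like this. The helper sample_series_ids is hypothetical and works on plain Python ids; the unique_id column name in the commented usage is an assumption based on the usual statsforecast input layout.

```python
import random


def sample_series_ids(ids, fraction=0.1, seed=0):
    """Pick a reproducible sample of whole series (by id), never partial series."""
    unique_ids = sorted(set(ids))
    if not unique_ids:
        return []
    # Keep at least one series and never ask for more than exist.
    k = min(len(unique_ids), max(1, round(len(unique_ids) * fraction)))
    return sorted(random.Random(seed).sample(unique_ids, k))


# Possible usage with a Spark DataFrame `sdf` (assumed to have a unique_id column):
#   all_ids = [row["unique_id"] for row in sdf.select("unique_id").distinct().collect()]
#   picked = sample_series_ids(all_ids, fraction=0.1, seed=1)
#   small_sdf = sdf.filter(sdf["unique_id"].isin(picked))
```

Sampling by id rather than by row keeps every selected series complete, which matters because the models are fitted per series.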
d

Dihong Huang

06/06/2023, 1:28 PM
My local environment takes less than 10 mins to finish the same thing, very weird.
I just tried to run it again and found this. My new cluster has exactly the same runtime and libraries as the previous one
k

Kevin Kho

06/06/2023, 3:02 PM
Will respond more in like 15 mins
Hey so on the Spark hanging, were you able to set the config I mentioned a bit above and try that? On this one, we made a release last night and it seems to have broken things. Looking into it
Can you restart the cluster? We removed the package with an error, if you restart, I think it will install good versions
d

Dihong Huang

06/06/2023, 3:35 PM
Do I set the config like this?
I am restarting it right now.
k

Kevin Kho

06/06/2023, 3:37 PM
That is right, but I believe maybe no ":"
You can try these also:
"spark.speculation": "true",
"spark.sql.adaptive.enabled": "false",
"spark.task.cpus": "1"
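Note: on the point about dropping the ":", the Spark config box in the Databricks cluster settings takes one key and value per line separated by a space, rather than JSON-style pairs. A minimal sketch that renders the settings above in that form; the helper name to_cluster_spark_config is hypothetical.

```python
def to_cluster_spark_config(conf: dict) -> str:
    """Render Spark settings as 'key value' lines, one setting per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


settings = {
    "spark.speculation": "true",
    "spark.sql.adaptive.enabled": "false",
    "spark.task.cpus": "1",
}
print(to_cluster_spark_config(settings))
```

The printed lines can be pasted into the cluster's Spark config field; the cluster then needs a restart for them to take effect.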
d

Dihong Huang

06/06/2023, 3:58 PM
I am trying this now, but the restart before didn’t work
k

Kevin Kho

06/06/2023, 3:59 PM
Maybe uninstall and reinstall statsforecast?
The restart of the cluster did not reinstall the libraries?
You can do !pip show triad and you want it to be 0.8.9
d

Dihong Huang

06/06/2023, 4:09 PM
I can see them being reinstalled
Screenshot 2023-06-06 at 12.09.47 PM.png
k

Kevin Kho

06/06/2023, 4:35 PM
If qpd is 0.4.2 and triad is 0.9.0 it should also work. Did you get it working?
d

Dihong Huang

06/06/2023, 4:38 PM
It’s now like this, should I stop it?
h

Han Wang

06/06/2023, 4:39 PM
@Dihong Huang if you have time we can have a quick meeting
i think that is just a warning
k

Kevin Kho

06/06/2023, 7:21 PM
Hi @Dihong Huang, the new versions of the Fugue dependencies have been updated so installing statsforecast and Fugue should not give any issue. Thanks for reporting!
d

Dihong Huang

06/06/2023, 8:24 PM
Thank you!