# general
k
👋 Is there any general guidance on optimizing the Spark config to run distributed forecasts faster? The Nixtla docs have a section called "Helpful Configuration", but they never mention how it was tuned.
j
Please ignore that configuration and set `spark.sql.shuffle.partitions` to a multiple of your executors
k
I see. Are we subject to the general tuning guideline: too large a multiple might go OOM, too small means too much overhead?
j
More or less. Memory isn't really a concern here, since it doesn't spike the way some common ETL tasks do; it's more about reducing overhead. So you can try 1 or 2 as the multiple and it should work
Mainly we want to avoid databricks' default (200)
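A minimal sketch of how that advice could look in practice. The executor count here is a made-up placeholder, not a value from this thread; replace it with your cluster's actual number of executors:

```python
# Hypothetical cluster values -- replace with your own.
num_executors = 8   # e.g. read from spark.conf.get("spark.executor.instances")
multiple = 2        # try 1 or 2, per the advice above

# A small multiple of the executor count, well under Databricks' default of 200.
shuffle_partitions = num_executors * multiple

# Applied to a live session (requires a running SparkSession named `spark`):
# spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
print(shuffle_partitions)
```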
k
thanks...
j
Here's an old thread with some more discussion if you're interested: https://linen.nixtla.io/t/16075533/hello-nixtla-community-i-have-a-few-questions-regarding-dist. It's on a different page because we're on the free tier and slack deletes old messages.