Slackbot
11/20/2023, 5:44 PM

José Morales
11/21/2023, 11:34 PM
• n_jobs controls the parallelism within each partition, but since I think each partition holds a single series it wouldn't make a difference (I'll also verify this).
• There's an nmodels argument that controls how many combinations are tried, so that's an easy one to limit the search space.
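
A minimal sketch of those two knobs, assuming the thread is about statsforecast's AutoARIMA; the season_length and nmodels values below are illustrative, not taken from the conversation:

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

sf = StatsForecast(
    models=[AutoARIMA(season_length=7, nmodels=20)],  # nmodels caps how many candidate models the stepwise search tries
    freq="D",
    n_jobs=1,  # parallelism across series within a partition; with one series per partition it makes little difference
)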

José Morales
11/23/2023, 1:11 AM
n_jobs=1 is hardcoded. How many partitions did you have when you were using 320 CPUs?

Antoine SCHWARTZ -CROIX-
11/23/2023, 9:48 AM

Antoine SCHWARTZ -CROIX-
11/23/2023, 9:51 AM
df = spark.read.parquet(f"{base_s3_dir_path}/{cutoff}/{input_dfs_dir_name}/df.parquet").repartition("unique_id")
df.rdd.getNumPartitions() # 162
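
One plausible explanation for getting 162 partitions while spark.sql.shuffle.partitions is 200 (see the next message) is that adaptive query execution coalesces small shuffle partitions; these standard configs can be checked to confirm, and the commented values are assumptions:

spark.conf.get("spark.sql.adaptive.enabled")                     # often 'true' on managed clusters
spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled")  # 'true' by default when AQE is on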

Antoine SCHWARTZ -CROIX-
11/23/2023, 9:58 AM
spark.conf.get("spark.sql.shuffle.partitions") # 200

Antoine SCHWARTZ -CROIX-
11/23/2023, 10:48 AM

José Morales
11/23/2023, 3:39 PM

Antoine SCHWARTZ -CROIX-
11/24/2023, 10:09 AM

José Morales
11/24/2023, 4:03 PM
spark.conf.set("spark.sql.shuffle.partitions", 640)
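
For context, a sketch of how that setting would combine with the earlier read (the path is reused from the snippet above; the resulting partition count is an assumption and may still be reduced if adaptive execution coalesces the shuffle):

spark.conf.set("spark.sql.shuffle.partitions", 640)
df = spark.read.parquet(f"{base_s3_dir_path}/{cutoff}/{input_dfs_dir_name}/df.parquet").repartition("unique_id")
df.rdd.getNumPartitions()  # 640 expected, unless adaptive query execution coalesces the shuffle again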

Antoine SCHWARTZ -CROIX-
11/24/2023, 6:30 PM
num_worker_cores = 320
spark.conf.set("spark.sql.shuffle.partitions", num_worker_cores*4)
spark.conf.set("spark.sql.adaptive.enabled", "false")

Antoine SCHWARTZ -CROIX-
11/24/2023, 6:30 PM

José Morales
11/24/2023, 6:54 PM

Antoine SCHWARTZ -CROIX-
12/04/2023, 9:03 AM
spark.conf.set("spark.sql.shuffle.partitions", num_worker_cores)
is enough; the problem was simply to overcome the default limit of 200.
Thanks again for your help @José Morales!