Execute multiple notebooks in parallel in PySpark Databricks

The question is simple:

master_dim.py calls dim_1.py and dim_2.py so that they execute in parallel. Is this possible in Databricks PySpark?

The image below shows what I am trying to do; it errors out for some reason. Am I missing something here?

[screenshot of the failing code]

Source: Stack Overflow — Execute multiple notebooks in parallel in pyspark databricks
Answers

Accepted answer

Just for others in case they are after how it worked:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)  # up to 5 notebooks in flight at once
notebooks = ['dim_1', 'dim_2']
# dbutils.notebook.run blocks until the child finishes, so mapping it
# over a thread pool runs the notebooks concurrently.
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
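
This works because dbutils.notebook.run is a blocking call: each worker thread drives one child notebook to completion, so the pool fans the notebooks out in parallel. Below is a minimal sketch of a variant using concurrent.futures that also captures per-notebook failures; the /Test/Threading/ folder and the input-data argument are carried over from the answer above, while the longer timeout and the error-handling shape are assumptions, not part of the original answer.

from concurrent.futures import ThreadPoolExecutor, as_completed

notebooks = ["dim_1", "dim_2"]

def run_notebook(name):
    # Blocks until the child notebook calls dbutils.notebook.exit or times out.
    return dbutils.notebook.run(
        "/Test/Threading/" + name,
        timeout_seconds=600,                 # assumed; 60s may be tight for real workloads
        arguments={"input-data": name},
    )

results = {}
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {executor.submit(run_notebook, n): n for n in notebooks}
    for future in as_completed(futures):
        name = futures[future]
        try:
            results[name] = future.result()  # the child's dbutils.notebook.exit value
        except Exception as exc:             # failures and timeouts surface as exceptions
            results[name] = f"FAILED: {exc}"

print(results)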

Other answers:


Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.

You need to either modify the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),

or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...). Both variants are sketched below.
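
Concretely, the two fixes look like this (a sketch based on this answer; the relative-path variant assumes master_dim lives in the same /Test/Threading/ folder as the child notebooks):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)

# Fix 1: bare notebook names, resolved relative to the calling notebook's folder
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run(path, 60), notebooks)

# Fix 2: keep the subfolder in the list and prepend the absolute workspace root
notebooks = ['Threading/dim_1', 'Threading/dim_2']
pool.map(lambda path: dbutils.notebook.run('/Test/' + path, 60), notebooks)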


Databricks now has Workflows (multi-task jobs). Your master_dim task can have the other tasks execute in parallel after it finishes, passing task values as parameters to dim_1, dim_2, etc.
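
A sketch of that task-values handoff (the task names master_dim / dim_1 and the run_date key are hypothetical placeholders; the parallel fan-out itself is declared in the job's task graph via dependencies, not in notebook code):

# In the master_dim task: publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="run_date", value="2024-01-01")

# In dim_1 / dim_2 (each configured in the job to depend on master_dim,
# so they start in parallel once it finishes):
run_date = dbutils.jobs.taskValues.get(
    taskKey="master_dim",     # name of the upstream task
    key="run_date",
    default=None,
    debugValue="2024-01-01",  # used when running the notebook outside a job
)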