As an intro, my name is Hugo, and I’m one of the founders of Saturn. We focus on managing Jupyter and Dask, as well as deploying data science dashboards, models, and pipelines, hence my interest in Kedro.
I’m actually writing one of our ETL jobs in Kedro to learn more, and I’m planning on submitting a Kedro deployment guide around Dask in the near future.
There are two flavors of how one would use Kedro with Dask:
- One approach would be to run each Kedro node on a Dask cluster; this would enable parallel execution of large Kedro pipelines.
- The other approach would be for the Kedro pipeline to know nothing about Dask, but to have the nodes themselves optionally leverage Dask for computation.
The first approach is only relevant if people have “large” Kedro pipelines, that is, pipelines with a large number of nodes. Is this common? From the docs it seems like the inputs/outputs of nodes in Kedro are manually named (with human-readable and semantically valuable names). This would suggest that most Kedro pipelines are “small”, by which I mean a small number of big nodes, rather than a large number of small nodes.