Kedro pipelines - a few big nodes? Or a large number of small nodes?


As an intro, my name is Hugo, and I’m one of the founders of Saturn. We focus on managing Jupyter and Dask, as well as deploying data science dashboards, models, and pipelines - hence my interest in Kedro.

I’m actually writing one of our ETL jobs in Kedro to learn more, and I’m planning on submitting a Kedro deployment guide around Dask in the near future.

There are two flavors of how one would use Kedro with Dask:

  1. One approach would be to run each Kedro node on a Dask cluster - this would enable parallel execution of large Kedro pipelines.
  2. The other approach would be for the Kedro pipeline to know nothing about Dask, but have the nodes themselves optionally leverage Dask for computation.
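A node following the second approach might look like this (a minimal sketch; the function name and the row-counting workload are invented for illustration, and Dask is treated as an optional dependency):

```python
def total_row_count(partitions):
    """Kedro-style node: count rows across a list of partitions.

    The pipeline knows nothing about Dask; the node itself uses
    dask.bag if it is installed and falls back to plain Python if not.
    """
    try:
        import dask.bag as db  # optional dependency
        return db.from_sequence(partitions).map(len).sum().compute()
    except ImportError:
        return sum(len(p) for p in partitions)


# Behaves identically with or without Dask installed:
print(total_row_count([[1, 2], [3, 4, 5]]))  # 5
```

The appeal of this shape is that the pipeline definition stays runner-agnostic: swapping Dask in or out is a per-node decision, not a framework one.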

The first approach is only relevant if people have “large” Kedro pipelines, that is, pipelines with a large number of nodes. Is this common? From the docs it seems like the inputs/outputs of nodes in Kedro are manually named (with human-readable, semantically meaningful names). This would suggest that most Kedro pipelines are “small” - by which I mean a small number of big nodes, rather than a large number of small nodes.



Hi Hugo,

Thanks for your post. Kedro + Dask is also something I would be very much interested in, so please do share your findings! :smile:

As for your question - are you asking about best practices, Kedro’s vision, or something else? Could you also clarify what you mean by small / large numbers of nodes? :smiley:

My personal opinion is that Kedro pipelines are just one of the many features that the framework offers. The pipelines I have written are so far all fairly small, do all processing in memory and are more often than not run locally. For me it works well for now since the projects I am involved in are still POC data science projects. For production workloads I would probably also look at Dagster/Prefect/Airflow. :slight_smile:

:wave: Welcome to the community @hugo!

How many nodes would be considered large?

Every example of kedro that I have seen has statically created nodes. By that I mean that for every node created there is a line of code that creates that one node and only that node.

I personally find myself creating a lot of pipelines with dynamic nodes. By that I mean that I might have a number of datasets that I want to follow the same pattern. I typically use something like `for layer, dataset in itertools.product(layers, datasets)` to create many nodes at once that all need very similar processing done to them.
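For illustration, that dynamic-node pattern could be sketched like this (kedro itself isn't imported here; each node is represented as a plain dict so the sketch runs standalone, and the layer/dataset names are made up - in a real project you'd call `kedro.pipeline.node` instead):

```python
import itertools

layers = ["raw", "intermediate"]
datasets = ["customers", "orders", "products"]

def make_node(layer, dataset):
    # Stand-in for kedro.pipeline.node(func, inputs=..., outputs=...):
    # the same processing step, parametrized by layer and dataset.
    return {
        "func": f"process_{layer}",
        "inputs": f"{layer}.{dataset}",
        "outputs": f"{layer}.{dataset}_processed",
    }

# One line of code yields len(layers) * len(datasets) nodes.
nodes = [make_node(layer, ds) for layer, ds in itertools.product(layers, datasets)]
print(len(nodes))  # 6
```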

Personally I work on a small team with many small projects. We typically end up with 10s of nodes, maybe low 100s occasionally.

I really like working on pipelines with many small nodes over a few complex nodes. Oftentimes my nodes are not much more than a few lines.

I do not see any reason why kedro would have any upper limit of nodes.

For distributed deployment I am using the kedro Batch Runner. This will run each node as its own batch job, potentially on its own EC2 instance, but will not make an individual node distributed like Dask would.

Thanks @waylonwalker, @sebastianbertoli

I think 10-100 nodes would be considered large - or at least large enough that Dask could be interesting.

In that case I think both methodologies are useful:

  1. Running an individual node in parallel over a Dask Cluster
  2. Running each node as a Dask task.
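The second option amounts to submitting node functions in topological order, passing each one its upstream outputs. Here is a stand-in sketch using the stdlib `graphlib` and a thread pool in place of a real Dask client (the scheduling idea is the same; the toy pipeline and node names are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Toy "pipeline": node name -> (function, upstream dependencies)
pipeline = {
    "load": (lambda: [1, 2, 3], []),
    "double": (lambda xs: [x * 2 for x in xs], ["load"]),
    "total": (lambda xs: sum(xs), ["double"]),
}

def run(pipeline):
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    ts = TopologicalSorter(graph)
    ts.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while ts.is_active():
            # Nodes with no unfinished upstreams can run concurrently.
            # With Dask you would call client.submit here instead.
            futures = {
                name: pool.submit(pipeline[name][0],
                                  *(results[d] for d in pipeline[name][1]))
                for name in ts.get_ready()
            }
            for name, fut in futures.items():
                results[name] = fut.result()
                ts.done(name)
    return results

print(run(pipeline)["total"])  # 12
```

Independent nodes in the same "ready" batch run in parallel, which is exactly what you'd get for free from Dask's scheduler on a cluster.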

@waylonwalker, Dask has a pretty low task latency that might make it easier to work with than the Batch runner.

I’ll ping both of you when the deployment guide is ready, would love to get your feedback.



Looking forward to it @hugo!