Dynamic DAG

Hi,

Thank you for the framework. I’m on 0.16.6.

I noticed from the source code that the Kedro DAG is static. That raises a couple of questions.

  1. I have a data source:
  • an SQL database that returns the names of CSV files for a given DS_ID
  • the next step is to fetch those CSV files with the data; their count varies from run to run

I’d like to parametrize DS_ID in catalog.yml for each production run. The only possibility I found is as follows (sketched below):

  • register a TemplatedConfigLoader in hooks
  • create a conf/concrete_ds_id/globals.yml folder on the fly and pass --env concrete_ds_id to the run
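
For reference, a minimal sketch of what I mean, assuming the register_config_loader registration hook available in 0.16.x (the globals key ds_id is illustrative):

```python
from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths):
        # catalog.yml can then reference ${ds_id}, resolved from whichever
        # environment's globals.yml was generated before the run
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
        )
```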

Is there a better way?

I found the ProjectContext._get_pipelines method. It seems like it could solve the problem. The question is: how good an idea is it to rely on overriding a protected method?
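
What I have in mind is roughly the following (a sketch only; build_pipeline_for and the class attribute values are placeholders for my own project):

```python
from typing import Dict

from kedro.framework.context import KedroContext
from kedro.pipeline import Pipeline, node


def build_pipeline_for(ds_id: str) -> Pipeline:
    # stand-in for my real logic that turns a DS_ID into nodes
    def emit_ds_id() -> str:
        return ds_id

    return Pipeline([node(emit_ds_id, inputs=None, outputs="ds_id")])


class ProjectContext(KedroContext):
    project_name = "my-project"      # illustrative
    project_version = "0.16.6"

    def _get_pipelines(self) -> Dict[str, Pipeline]:
        # ds_id could come from parameters.yml or, e.g., kedro run --params
        ds_id = self.params.get("ds_id", "default")
        return {"__default__": build_pipeline_for(ds_id)}
```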

  2. Next thing: I’d like to compose a pipeline based on the fetched CSV files and see each of them as an input to a node. But I have no access to catalog items in ProjectHooks to create nodes dynamically in register_pipelines. How can I do that?
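
What I’d like to end up with is roughly one node per fetched CSV, something like this (process_csv and the dataset names are purely illustrative):

```python
from typing import List

import pandas as pd

from kedro.pipeline import Pipeline, node


def process_csv(df: pd.DataFrame) -> pd.DataFrame:
    # placeholder for per-file processing
    return df


def build_dynamic_pipeline(csv_dataset_names: List[str]) -> Pipeline:
    # one node per CSV dataset registered in the catalog
    return Pipeline(
        [
            node(
                process_csv,
                inputs=name,
                outputs=f"{name}_processed",
                name=f"process_{name}",
            )
            for name in csv_dataset_names
        ]
    )
```

The blocker is that the list of names is only known after the SQL query runs, which is too late for register_pipelines.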

Sounds like you are on the right track. Another idea that comes to mind is a custom runner; I am thinking of Lorena’s BatchRunner. The way it works is that you submit one pipeline run, then it watches for nodes to complete and submits new batch jobs as they become ready, kinda like kedro run --nodes my_node.

Your use case may be in a similar vein. You want to submit a job that does some thinking, then submits pipeline runs in succession until the problem is complete.

https://kedro.readthedocs.io/en/stable/10_deployment/07_aws_batch.html

Re-reading your problem 1 again: there might be a simpler, less elegant solution. What if your first node collected the list of CSVs, and the second node then returned a dictionary or list of DataFrames?

This approach may require you to use a PickleDataSet or MemoryDataSet rather than a dataset designed for tabular data for at least one step. Something along these lines:
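
A rough sketch of that two-node idea, assuming the file list comes from something like a pandas.SQLQueryDataSet and the intermediate outputs stay as MemoryDataSets unless you register them (e.g. as PickleDataSet) in the catalog; all names are illustrative:

```python
from typing import Dict, List

import pandas as pd

from kedro.pipeline import Pipeline, node


def list_csv_files(file_names: pd.DataFrame) -> List[str]:
    # first node: turn the SQL query result into a plain list of paths
    return file_names["file_name"].tolist()


def load_csvs(paths: List[str]) -> Dict[str, pd.DataFrame]:
    # second node: one DataFrame per fetched CSV, keyed by its path
    return {path: pd.read_csv(path) for path in paths}


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(list_csv_files, inputs="file_names_table", outputs="csv_paths"),
            node(load_csvs, inputs="csv_paths", outputs="csv_frames"),
        ]
    )
```

Downstream nodes then take csv_frames as a single input instead of one input per file, which keeps the DAG static.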