Run same pipeline on different data

Hi! I’m considering adopting Kedro for a project I’m working on, but seem to have arrived at a limitation for my use case.

I would like to run the same pipeline on different sets of input data. Say, I want to run a pipeline for country A and country B. The country is defined in the SQL query as follows:

SELECT * FROM input_data WHERE country = {country}

Is there a way I could define the country as a parameter in the CLI? e.g. kedro run --params country=a. In essence, use params in the Data Catalog (not just in nodes).

Also, is there a way in which I could load the processed data by specifying the country? Sort of like a tag, e.g. country_data = io.load("processed_data:country_a")

Thank you!


Hi Luciano,

have you considered just having a node that loads the dataset you specify as a parameter and passes it on to the remaining nodes in the pipeline? I have not tested v0.17 yet, but perhaps it is worth trying whether the following snippet works (adapted from the url below):

from kedro.framework.session import get_current_session

def dataset_selection(selection):
    # Look up the currently running Kedro session and load the
    # dataset whose name was passed in (e.g. via a parameter).
    session = get_current_session()
    context = session.load_context()
    return context.catalog.load(selection)
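To make the idea concrete, here is a minimal self-contained illustration of the same pattern, with a plain dict standing in for the real DataCatalog (the names ds_a/ds_b are made up for the example):

```python
# Stand-in for context.catalog: maps dataset names to loader callables.
fake_catalog = {
    "ds_a": lambda: {"name": "ds_a", "col2": "some_value"},
    "ds_b": lambda: {"name": "ds_b", "col2": "some_value"},
}

def dataset_selection(selection):
    # In Kedro this would be context.catalog.load(selection);
    # here we just look the loader up in a plain dict.
    return fake_catalog[selection]()

df = dataset_selection("ds_a")
print(df["name"])  # ds_a
```

The point is that the node itself stays generic; only the string passed in (from the CLI parameter) decides which dataset is loaded.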

Hi Sebastian,

Thanks for this suggestion, I have a similar use case. My problem now is how to template the output dataset name in the node definition. Could I do something like this:

             inputs = ["X", "y", "params:model_training_params"],
             outputs = [f"{params:ML_matrix}_cv_result_mean_sorted"]

where ML_matrix is the name of the dataset I want to switch out?

Also, do I have to manually add to the catalog the different output files? I could use Kedro’s Jinja templating functionality and do:

{% for ML_matrix in ['unfiltered', 'filtered'] %}

{{ML_matrix}}_cv_result_mean_sorted:
    type: pandas.CSVDataSet
    filepath: path/to/file/{{ML_matrix}}_cv_result_mean_sorted.csv

{% endfor %}

but now the parameters and the catalog are decoupled, so updating them will be more cumbersome. Can I, instead, leave part of the file name as a templated string and then format it at runtime?

ie in the data catalog:

    type: pandas.CSVDataSet
    filepath: path/to/file/{ML_matrix}_cv_result_mean_sorted.csv

then in the node definition:

             inputs = ["X", "y", "params:model_training_params"],
             outputs = ["{ML_matrix}_cv_result_mean_sorted".format(ML_matrix="params:ML_matrix")]
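To clarify what I mean: my understanding is that output dataset names must be plain strings by the time the Pipeline object is built, so the substitution would have to happen in Python at pipeline-creation time rather than through params: at run time. A purely illustrative helper (the names here are made up):

```python
def make_node_io(ml_matrix: str) -> dict:
    # Hypothetical helper: build the input/output dataset names
    # for one ML_matrix variant at pipeline-construction time.
    return {
        "inputs": ["X", "y", "params:model_training_params"],
        "outputs": [f"{ml_matrix}_cv_result_mean_sorted"],
    }

io_spec = make_node_io("unfiltered")
print(io_spec["outputs"])  # ['unfiltered_cv_result_mean_sorted']
```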

Hi @Bun_Without_B nice use case!

Right off the bat I am not sure either, but I’d like to look into this in the coming days and will come back to this post if I have any insights. Perhaps someone else will be able to chime in before then. 🙂

Hi Sebastian! Thank you for your reply. I did think about that, but it seems a bit hacky, because we wouldn’t actually be using the DataCatalog to abstract away the loading of data.

I think I didn’t quite get the dataset_selection function. Would the parameter be passed as the selection argument? I think catalog.load requires the exact name of a registered dataset, correct?

@Luciano_Viola This would be my very naive approach:

sebastian@DESKTOP-V0HUNTE:~/projects/luciano_iris$ kedro run --params ds_selection:ds_a

2021-01-22 11:42:59,520 - kedro.pipeline.node - INFO - Running node: print_data_head([df]) -> None
   name        col2
0  ds_a  some_value

sebastian@DESKTOP-V0HUNTE:~/projects/luciano_iris$ kedro run --params ds_selection:ds_b

2021-01-22 11:43:04,471 - kedro.pipeline.node - INFO - Running node: print_data_head([df]) -> None
   name        col2
0  ds_b  some_value

How many countries do you have? Can you generate a catalog with all possible countries?

I’ve done this in the past for several of my pipelines, using itertools.product to generate a large number of nodes and catalog entries from a few small lists. In my case, I generated the catalog YAML files with a simple CLI script. In 0.17.0 the TemplatedConfigLoader supports Jinja2, which would make this much easier.
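As a rough sketch of what I mean (the dataset and path names here are invented for illustration):

```python
from itertools import product

# Two small lists that fan out into one catalog entry per combination.
countries = ["a", "b"]
stages = ["raw", "processed"]

catalog = {}
for stage, country in product(stages, countries):
    name = f"{stage}_data_{country}"
    catalog[name] = {
        "type": "pandas.CSVDataSet",
        "filepath": f"data/{stage}/country_{country}.csv",
    }

print(sorted(catalog))
# ['processed_data_a', 'processed_data_b', 'raw_data_a', 'raw_data_b']
```

From there, dumping the dict to a catalog YAML file (e.g. with PyYAML) is a one-liner, and the same lists can drive the generation of the matching pipeline nodes.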