How to dynamically alter the catalog each run

Hi!

I’ve experimented with Kedro for a week now, and absolutely love it. I have also been learning more about hooks (watched DE1’s video on them and read the docs) and find them very relevant to my needs.

Simplified, I have the following situation and problem:

You can assume that I have the latest version of Kedro. I also have a successful implementation of the Spaceflights tutorial. I can run the pipeline from the terminal. Easy peasy.

However, what I would now like is the ability to supply additional input when running the pipeline from the terminal. That input should then, before the pipeline is executed, alter the filepaths of my registered datasets in conf/base/catalog.yml.

More specifically, I would like to be able to change the filepaths below for every run.

    companies:
      type: pandas.CSVDataSet
      filepath: data/01_raw/companies.csv

    reviews:
      type: pandas.CSVDataSet
      filepath: data/01_raw/reviews.csv

It would be fantastic if there was a way to make this work something like this:

kedro run companies_filepath==‘xxx’ reviews_filepath==‘yyy’

I suspect that there actually is a fairly smart, perhaps even straightforward, way to implement this, but I just haven’t been able to solve it yet…

Thanks!

I believe what you are looking for is the TemplatedConfigLoader. You can pass extra_params to it as its globals_dict argument:

    def register_config_loader(
        self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any],
    ) -> ConfigLoader:
        loader = TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict=extra_params,
        )
        return loader

In globals.yml, you would have your default filepath, and catalog.yml would reference it as ${filepath}. Then you can specify the desired dataset using “kedro run --params filepath:yyy”.

Docs:
https://kedro.readthedocs.io/en/0.17.2/kedro.config.TemplatedConfigLoader.html
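
To make the idea above concrete, here is a minimal pure-Python sketch of the templating step. It is not Kedro's implementation: `resolve`, `globals_yml`, `extra_params` and `catalog_template` are hypothetical names used for illustration, and the file paths are made up. It assumes that values passed as globals_dict take precedence over those read from globals.yml.

```python
import re
from typing import Any, Dict

# Matches placeholders of the form ${name}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve(config: Any, values: Dict[str, Any]) -> Any:
    """Recursively replace ${key} placeholders in strings with values[key].

    Unknown placeholders are left untouched.
    """
    if isinstance(config, dict):
        return {k: resolve(v, values) for k, v in config.items()}
    if isinstance(config, list):
        return [resolve(v, values) for v in config]
    if isinstance(config, str):
        return PLACEHOLDER.sub(
            lambda m: str(values.get(m.group(1), m.group(0))), config
        )
    return config


# Hypothetical defaults, as they might appear in conf/base/globals.yml.
globals_yml = {"companies_filepath": "data/01_raw/companies.csv"}

# Hypothetical result of:
#   kedro run --params companies_filepath:data/01_raw/other.csv
extra_params = {"companies_filepath": "data/01_raw/other.csv"}

# Hypothetical catalog entry using a placeholder instead of a fixed path.
catalog_template = {
    "companies": {
        "type": "pandas.CSVDataSet",
        "filepath": "${companies_filepath}",
    }
}

# Assumption: run-time values override the defaults from the globals file.
merged = {**globals_yml, **extra_params}
catalog = resolve(catalog_template, merged)
print(catalog["companies"]["filepath"])
```

With no `--params` given, `extra_params` would be empty and the default from globals.yml would be used instead.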


I think it could also be done with the before_pipeline_run hook, or even the register_catalog hook. What I don’t know is which way is the most recommended (the most aligned with Kedro standards).

Hi Ljam,

Thank you very much for the response. I’m trying to figure out how to make this work, but have run into an issue and am not sure how to proceed. I’ve changed my default ‘hooks.py’ to look like this:

    from typing import Any, Dict, Iterable, Optional

    from kedro.config import ConfigLoader
    from kedro.config import TemplatedConfigLoader
    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog
    from kedro.versioning import Journal


    class ProjectHooks:
        # Default config loader, commented out in favour of the templated one.
        # @hook_impl
        # def register_config_loader(
        #     self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any],
        # ) -> ConfigLoader:
        #     return ConfigLoader(conf_paths)

        @hook_impl
        def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
            return TemplatedConfigLoader(
                conf_paths,
                globals_pattern="*globals.yml",
                globals_dict={"param1": "pandas.CSVDataSet"},
            )

        @hook_impl
        def register_catalog(
            self,
            catalog: Optional[Dict[str, Dict[str, Any]]],
            credentials: Dict[str, Dict[str, Any]],
            load_versions: Dict[str, str],
            save_version: str,
            journal: Journal,
        ) -> DataCatalog:
            return DataCatalog.from_config(
                catalog, credentials, load_versions, save_version, journal
            )

In practice this only means switching out the ConfigLoader for the TemplatedConfigLoader; register_catalog was already there. I tried running this with kedro run and ran into this error:

    raise MissingConfigException(
    kedro.config.config.MissingConfigException: No files found in ['C:\Users\Leyla\Desktop\code\kedro-tutorial\conf\base', 'C:\Users\Leyla\Desktop\code\kedro-tutorial\conf\local'] matching the glob pattern(s): ['*globals.yml']

So I guess I’ll have to ask more about the role of globals.yml, since I don’t quite understand how it links to being able to do something like ‘kedro run --params filepath:yyy’.

Again, thank you very much! Really appreciate it

Hey @evolute !
You only need one config loader, so you would delete the normal one. Your globals.yml basically holds the default (the default folder in my example) used when no param is specified, e.g. a plain “kedro run”.
I made an example pandas iris project with a templated config loader. Feel free to clone it. I added a print(data) in the pipeline. When I run kedro run, it displays the normal iris dataset from the folder 2019 (the default). With “kedro run --params year:2020”, it displays the iris CSV where I shuffled the labels a bit.


Thanks! I’ll try this now. Will return with results!

And btw, my formatting was terrible. In the code above I have actually commented out the normal config loader in favor of the templated one.

Ok, back!

Just followed your guide on my own setup and it worked! Super easy, and the concept is clear to me now as well. Thank you!

EDIT:

So I did actually have a follow-up question regarding the syntax in terminal when sending in multiple parameters, but I managed to solve it myself. For anyone else curious, here’s the syntax:

kedro run --params param1:value1,param2:value2

Please note that there is NO space after the comma. If you add one, you’ll hit an error, most likely because the shell then passes the text after the space as a separate argument.

Again, thank you very much for the help!
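
As an illustration of the multi-parameter syntax above, the sketch below uses Python's `shlex` to mimic how a POSIX-style shell tokenizes the command with and without a space after the comma. `parse_params` is a hypothetical helper written for this example, not Kedro's own parser.

```python
import shlex
from typing import Dict


def parse_params(raw: str) -> Dict[str, str]:
    """Split 'key1:value1,key2:value2' into a dict (illustrative only)."""
    result: Dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = item.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {item!r}")
        result[key.strip()] = value.strip()
    return result


# Without a space, --params receives a single argument holding both pairs.
good = shlex.split("kedro run --params param1:value1,param2:value2")

# With a space, the pairs are split into two separate arguments.
bad = shlex.split("kedro run --params param1:value1, param2:value2")

print(good[3:])
print(bad[3:])
print(parse_params(good[3]))
```

Quoting the whole value is the usual shell way to keep it as one argument, though whether Kedro then accepts the inner space is not confirmed here.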

Thanks everyone, very informative!

Hi again @evolute,

I should have first asked how many different datasets you have.
If you have, say, 1 to 4 different combinations of datasets, using a configuration environment might suit your purpose better. You can create folders in conf named, for instance, “2020” and “2021”, each containing a catalog.yml pointing to data in a specific folder. When you run kedro run --env=2020, the catalog entries in that folder override the defaults in the base catalog (which might point to 2019, for instance). Check out the video below for more info.
Otherwise, if you have a dozen or more combinations, this approach might become impractical, as it requires creating a new folder and catalog each time. In that case, using the templated config loader is a good idea.
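
The environment override described above can be pictured with a simplified model in plain Python. This is only a sketch under the assumption that a dataset entry in the environment's catalog replaces the same-named entry from base; `load_catalog`, the dictionaries and the paths are hypothetical and do not call Kedro.

```python
from typing import Any, Dict


def load_catalog(base: Dict[str, Any], env: Dict[str, Any]) -> Dict[str, Any]:
    """Return base entries, replaced by same-named entries from env."""
    return {**base, **env}


# Hypothetical contents of conf/base/catalog.yml (the 2019 defaults).
base_catalog = {
    "companies": {"type": "pandas.CSVDataSet", "filepath": "data/2019/companies.csv"},
    "reviews": {"type": "pandas.CSVDataSet", "filepath": "data/2019/reviews.csv"},
}

# Hypothetical contents of conf/2020/catalog.yml, used with --env=2020.
env_2020 = {
    "companies": {"type": "pandas.CSVDataSet", "filepath": "data/2020/companies.csv"},
}

catalog_2020 = load_catalog(base_catalog, env_2020)
print(catalog_2020["companies"]["filepath"])
print(catalog_2020["reviews"]["filepath"])
```

In this model, datasets not redefined in the environment keep their base definition, which is why each extra combination only needs to list the entries that differ.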