Update globals.yml with Kedro 0.17.0

Hi,

With the latest Kedro update, I can see that to customise the ConfigLoader using TemplatedConfigLoader, rather than overriding ProjectContext._create_config_loader, I need to move the code into hooks.py under register_config_loader.
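
For reference, what we had in 0.16.x was roughly this (a simplified sketch of our override, not the exact code):

from typing import Iterable

from kedro.config import TemplatedConfigLoader
from kedro.framework.context import KedroContext


class ProjectContext(KedroContext):
    def _create_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader:
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            # extra_params passed at run time are stored on the context
            globals_dict=self._extra_params,
        )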

However, I was passing extra_params (parameters specified at run time) as globals_dict; these are passed in as an input to run.run_package.

How can I do the same in hooks.py?

from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            # how do I get hold of the runtime extra_params here?
            globals_dict=self._extra_params,
        )

Thanks,
Vinay


Ah, I see you’re overriding some of the templates using the CLI parameters, correct?

You can take advantage of the new KedroSession object.

from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl
from kedro.framework.session import get_current_session


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        # Pull the runtime parameters out of the active session's store
        session = get_current_session()
        extra_params = session.store.get("extra_params")
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict=extra_params,
        )

Try this out and let me know; this is untested, haha.

Hi DE1!

Thank you for your reply!

I've tried your solution; however, it now raises this error:

RuntimeError: There is no active Kedro session.

I feel like I'm missing something quite obvious (as usual :slight_smile:) but I'm not sure what.

I'm creating a KedroSession by doing the below in the run.py file:

with KedroSession.create(package_name,
                         project_path=None,
                         save_on_close=True,
                         env=None,
                         extra_params=extra_params) as session:
    session.run(
        pipeline_name=None,
        tags=tags,
        runner=runner,
        node_names=node_names,
        from_nodes=from_nodes,
        to_nodes=to_nodes,
        from_inputs=from_inputs,
        load_versions=load_versions,
    )

Is there anything else I need to do? It does run fine without the TemplatedConfigLoader.

Thanks in advance!

Vinay


Oh, Kedro sessions are currently created in the cli.py file when kedro run is called. There shouldn't be a need to create the session inside your run.py file unless you are running the pipeline in a non-standard way?
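
Under the hood it's doing something roughly like this (a simplified sketch, not the exact cli.py source; names here are illustrative):

from kedro.framework.session import KedroSession

env = "local"                    # from the --env CLI option
params = {"date": "2021-01-15"}  # from the --params CLI option

with KedroSession.create("my_package", env=env, extra_params=params) as session:
    session.run()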


Sorry, I don't think I was clear. We're currently not using the CLI to run Kedro. We simply deploy the code onto Databricks, import the run.py file, and call the main method (passing in tags, extra_params, etc.).
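
Concretely, the notebook side looks something like this (package name and values are illustrative):

# Illustrative: invoking the packaged project from a Databricks notebook
from my_package.run import main  # hypothetical package name

main(
    tags=["daily"],
    extra_params={"environment": "dev", "date": "2021-01-15"},
)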

It sounds like this isn't the recommended approach. Is there a workaround for the way we run our pipelines, or would you recommend moving to CLI commands?

Thanks,
Vinay

So our main method looks like this:

def main(tags: Iterable[str] = None,
         runner: AbstractRunner = None,
         node_names: Iterable[str] = None,
         from_nodes: Iterable[str] = None,
         to_nodes: Iterable[str] = None,
         from_inputs: Iterable[str] = None,
         load_versions: Dict[str, str] = None,
         pipeline_name: str = None,
         extra_params: Dict[str, Any] = None,
         update_hive_col_comments: Dict[str, Any] = None):
    # Entry point for running a Kedro project packaged with `kedro package`,
    # using the `python -m <project_package>.run` command.
    logger = logging.getLogger(__name__)
    # By default, project main will not update hive col comments
    if update_hive_col_comments is None:
        update_hive_col_comments = {"update_flag": 0}
    package_name = Path(__file__).resolve().parent.name
    with KedroSession.create(package_name,
                             project_path=None,
                             save_on_close=True,
                             env=None,
                             extra_params=extra_params) as session:
        session.run(
            pipeline_name=pipeline_name,
            tags=tags,
            runner=runner,
            node_names=node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            load_versions=load_versions,
        )

Ah, I see I see. So this will depend a lot on what kind of parameters you’re interleaving with the TemplatedConfigLoader.

My suggestion would be to see how far you can get using native Kedro implementations; they're quite comprehensive, so you can probably get away from using so many extra params. Perhaps by using environments, and then figuring out a way to move some of the templated variables into the pipeline itself. For instance, something like the sketch below.
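
A sketch of selecting a config environment at run time instead of overriding globals (the environment name "databricks" is illustrative; conf/databricks/ would hold its own globals.yml and catalog.yml):

from kedro.framework.session import KedroSession

with KedroSession.create("my_package", env="databricks") as session:
    session.run()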


I'm mainly using globals.yml to pass the correct environment, mount paths, or specific dates of data, e.g.:

${mnt_base_folders.processed}/${environment}/source/name/${date}
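
At run time we then override these values by passing them in as globals_dict, roughly like this (a minimal sketch with illustrative keys; values from globals_dict take precedence over matching keys loaded via globals_pattern):

from kedro.config import TemplatedConfigLoader

loader = TemplatedConfigLoader(
    ["conf/base"],
    globals_pattern="*globals.yml",
    globals_dict={"environment": "dev", "date": "2021-01-15"},
)
catalog_conf = loader.get("catalog*")  # ${environment}, ${date} now substituted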

The ability to update the globals.yml values at run time is definitely something we love. Is there another way we can pass these in at run time?

I agree that we definitely should be working with native implementations, and want to upgrade to 0.17.0 with that in mind :smiley:


Aha, interesting. How many datasets do you have, and of what kind? I'm assuming they're basically all SparkDataSets?

We have 100+ datasets.
We've created a custom class to read Spark datasets, as we wanted to use Hive as well as build in some other fun features; it's roughly along the lines of the sketch below.
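
A simplified, illustrative sketch (the class and table names are hypothetical; the real class does more):

from typing import Any, Dict

from kedro.io import AbstractDataSet
from pyspark.sql import DataFrame, SparkSession


class HiveSparkDataSet(AbstractDataSet):
    def __init__(self, table: str):
        self._table = table

    def _load(self) -> DataFrame:
        return SparkSession.builder.getOrCreate().table(self._table)

    def _save(self, data: DataFrame) -> None:
        data.write.mode("overwrite").saveAsTable(self._table)

    def _describe(self) -> Dict[str, Any]:
        return dict(table=self._table)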

@DataEngineerOne I tried your register_config_loader hook but also got RuntimeError: There is no active Kedro session. Unlike @vinay, I'm not messing with run.py.

Any ideas?

Hi @facepalm - I posted this issue to Stack Overflow and got a response stating that there will be a fix for this in the next release.