To get the best help, please answer the following questions:
What is the goal you are trying to achieve?
I have an S3 bucket with 20K CSV files in 3 different subfolders. I want to preprocess them, merge them, group them, and save them as 100 aggregated CSV files.
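One way to map ~20K input files onto a fixed set of 100 output groups is a stable hash bucket per file name. This is only an illustrative sketch, not part of the original setup; `output_group` and the file-name pattern are assumptions:

```python
import zlib

def output_group(name: str, n_groups: int = 100) -> int:
    """Stable bucket index so the same file always lands in the same output group."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash().
    return zlib.crc32(name.encode("utf-8")) % n_groups

# Example: assign a few hypothetical keys to one of the 100 aggregated outputs.
keys = [f"subfolder_a/file_{i:05d}.csv" for i in range(5)]
assignments = {k: output_group(k) for k in keys}
```

If the grouping must follow business logic (e.g. by subfolder) rather than an even spread, the bucket function would instead parse the key prefix.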
What have you tried in order to accomplish the goal?
I have built an S3 lister which saves the file paths for all 20K files; I then feed this list into a
`@hook_impl def register_catalog` hook and update the data catalog. So now I have 20K catalog entries.
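A key piece of this approach is turning each S3 key into a unique, subfolder-aware dataset name, so the catalog can be populated programmatically. Here is a minimal sketch of one possible naming scheme; the helper name, the `__` separator, and the example keys are all assumptions, not the author's actual code:

```python
def key_to_dataset_name(key: str) -> str:
    """Turn an S3 key into a catalog-safe dataset name.

    Example: 'subfolder_a/file_001.csv' -> 'subfolder_a__file_001'.
    Keeping the subfolder as a prefix preserves the information needed
    to pick a processing strategy later.
    """
    stem = key[: -len(".csv")] if key.endswith(".csv") else key
    return stem.replace("/", "__")

keys = ["subfolder_a/file_001.csv", "subfolder_b/file_002.csv"]
names = [key_to_dataset_name(k) for k in keys]
```

Each derived name could then be registered in the hook, e.g. with something like `catalog.add(name, CSVDataSet(filepath=f"s3://bucket/{key}"))` (Kedro 0.17.x `pandas.CSVDataSet`), keeping `catalog.yml` untouched between runs.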
When I create the pipeline, I create about 300 nodes to process subgroups differently based on the subfolder they come from. What I am stuck on is that I don't see the name of the dataframe inside the node. Based on the name, I would be able to resolve the strategy. So, my question is: how do I get the name of the dataframe inside the node? And the larger question: is this the right approach at all? I also thought about generating the catalog.yml before run.py and then reading it in, but that would involve creating a new catalog.yml before every run.
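Kedro node functions receive only the loaded data, not the dataset's name, so a common workaround is to resolve the strategy at pipeline-construction time, where the name is known, and bind it into the node function. A hedged sketch, using a toy stand-in for the real preprocessing and assuming names of the form `subfolder__file` (all identifiers here are illustrative):

```python
from functools import partial

def preprocess(rows, strategy):
    # Toy stand-in for per-subfolder preprocessing of a dataframe.
    if strategy == "subfolder_a":
        return [r * 2 for r in rows]
    return list(rows)

def make_processor(dataset_name):
    """Bind the name-derived strategy into the function when the node is built."""
    strategy = dataset_name.split("__", 1)[0]  # the prefix encodes the subfolder
    return partial(preprocess, strategy=strategy)

# When generating the ~300 nodes, the dataset name is in hand, e.g.:
fn = make_processor("subfolder_a__file_001")
# node(func=fn, inputs="subfolder_a__file_001", outputs=...)
```

This avoids needing the name inside the node at all: the closure (or `partial`) carries the decision made at construction time.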
What version of Kedro are you using? (Use `kedro --version`.)
kedro, version 0.17.4
Do you have any custom plugins?
What is the full stack trace of the error (if applicable)?
No error; I just need advice on how to work with a completely dynamic, large list of datasets.