Kedro Transformers - What are your usage scenarios?

I wanted to hear other people’s opinions on the Kedro transformers functionality. In your opinion, are they suitable for basic data processing that is specific to a particular catalog entry, or should I restructure my project instead?

For instance, I do something along these lines (1):

df = catalog.load("te_indicators")

and then (2):

df["country"] = df["country"].replace(to_replace=mappings)
df.indicator = normalise_string_col(df.indicator)
df.date = df.date.apply(lambda dt: dt.replace(day=1))

Would you put (2) in a transformer hook or would you pre-process the te_indicators file and then just use the pre-processed file (it is a tiny file)?

Until just now I thought I would put (2) in a transformer hook, but looking at the implementation, my impression is that their use case is transformations that are more general and apply to multiple datasets (e.g. measuring the loading time). I am aware that I can specify which catalog entries a transformer is applied to, but maybe it all becomes a bit messy (?)
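
For concreteness, this is roughly what I had in mind for (2) as a transformer scoped to one dataset. It is only a sketch based on my reading of the AbstractTransformer interface; the exact import path, method signatures and the way you register it (e.g. catalog.add_transformer) may differ between Kedro versions, and mappings / normalise_string_col are assumed to come from my own code:

from kedro.io import AbstractTransformer

# Hypothetical location of my own helpers
from my_project.utils import mappings, normalise_string_col


class TEIndicatorsTransformer(AbstractTransformer):
    """Cleans te_indicators as it is loaded; passes saves through untouched."""

    def load(self, data_set_name, load):
        df = load()  # run the dataset's normal load
        if data_set_name == "te_indicators":
            df["country"] = df["country"].replace(to_replace=mappings)
            df["indicator"] = normalise_string_col(df["indicator"])
            df["date"] = df["date"].apply(lambda dt: dt.replace(day=1))
        return df

    def save(self, data_set_name, save, data):
        save(data)  # no changes on save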

Any thoughts? Thanks!

Edit: After writing this, I have even more doubts. Maybe it would be better to write my own DataSet type and do these things within the _load() function?
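
Something like this is what I mean by a custom dataset; again just a sketch, assuming the CSVDataSet import path of recent Kedro versions and my own helpers:

import pandas as pd

# Import path varies by Kedro version
from kedro.extras.datasets.pandas import CSVDataSet

# Hypothetical location of my own helpers
from my_project.utils import mappings, normalise_string_col


class TEIndicatorsDataSet(CSVDataSet):
    """CSV dataset that applies the te_indicators-specific clean-up in _load()."""

    def _load(self) -> pd.DataFrame:
        df = super()._load()
        df["country"] = df["country"].replace(to_replace=mappings)
        df["indicator"] = normalise_string_col(df["indicator"])
        df["date"] = df["date"].apply(lambda dt: dt.replace(day=1))
        return df

I believe it could then be referenced from the catalog entry by its full dotted path as the dataset type.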


This feels like something that belongs in a node to me. We follow a layered approach. I would love it if someone has a better description of what the DE (data engineering) layers are, but here is my shot at it, with a sketch of a matching pipeline after the list.


raw: as it came from the source; do not delete it, it may not be easily regenerated, and all data should start here somehow.
int (intermediate): throw some automated sanitization at it; it looks like your normalise_string_col function belongs here.
pri (primary): manual cleansing and type coercion go here. Your apply statement feels like it belongs here.
fea (feature): feature engineering - add new columns/calculations/aggregations/joins.
modin (model input): generally one final join of everything into a giant dataframe, or a few dataframes.
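
As a sketch (the dataset names and node functions here are made up, not from a real project), the matching part of a pipeline could look something like this:

import pandas as pd
from kedro.pipeline import Pipeline, node


def sanitise(df: pd.DataFrame) -> pd.DataFrame:
    # int: automated clean-up only
    df["indicator"] = df["indicator"].str.strip().str.lower()
    return df


def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    # pri: manual cleansing and type coercion, e.g. snap dates to the first of the month
    df["date"] = pd.to_datetime(df["date"]).dt.to_period("M").dt.to_timestamp()
    return df


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(sanitise, "te_indicators_raw", "te_indicators_int"),
            node(coerce_types, "te_indicators_int", "te_indicators_pri"),
        ]
    )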


I try not to overthink raw-pri. I generally load raw, apply our standard function to get to int, and use a lambda x: x passthrough for pri; get the data in, then figure out if there is anything I want to go back and add to pri right away. Again, I often don’t worry too much about over-engineering large datasets with many columns we are not likely to use, and I let the project show what needs to be modified as we go. I’m sure there are differing opinions on that, but it can take days to get a dataset looking how you think it should, versus just moving forward and coming back to fix what needs fixing. I’m sure some projects are more costly to come back to and make corrections on, but not generally the ones I am on. Think through what works best for your team.

I don’t worry about having too many nodes. I lean heavily on nearly one-liner nodes (it might be a big pandas method chain, but it is one statement). I see a lot of newcomers trying to do too much in each node, which makes it harder to see what is happening inside and requires running the node with a debugger or copying the code into an ad-hoc environment and stepping through it.
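
For example, by a nearly one-liner node I mean something like this (the column names are made up):

import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    # One statement, even though the method chain spans several lines
    return (
        df.assign(year=lambda d: d["date"].dt.year)
        .query("value.notna()", engine="python")
        .rename(columns=str.lower)
    )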


Thanks for your insights! Do you strictly follow this layered approach for every dataset, and do you create catalog entries for every intermediate step? When I wrote this question I was thinking about a tiny ancillary dataset (some economic indicators, 2 MB in size) which I thought I would want to transform on the fly, to avoid cluttering the catalog.yml with too many entries for intermediate files.

For the main dataset, which is 4 GB of .csv files, I follow your approach, although I only have a raw-to-processed step and only the processed file in the catalog (not ideal; I will have to refactor my code at some point).


I do. I find that it takes less effort to manage when I am consistent. Don’t put much thought into the layers and just do it.


I can’t really help with the rest of the answer, but I quite like our section of the docs on the data engineering convention. The table describes what should be on each layer quite well:


I’ve found this topic super interesting. Thanks a lot for sharing your experiences. In fact, I was thinking about starting a topic on “good practices” for creating standard pipelines, and I’ve found some good ideas here.