I wanted to hear other people's opinions on the Kedro transformers functionality. In your opinion, are they suitable for basic data processing that is specific to a particular catalog entry, or should I restructure my project instead?
For instance, I do something along these lines (1):
df = catalog.load("te_indicators")
and then (2):
df["country"] = df["country"].replace(to_replace=mappings)
df.indicator = normalise_string_col(df.indicator)
df.date = df.date.apply(lambda dt: dt.replace(day=1))
Would you put (2) in a transformer hook, or would you pre-process the te_indicators file and then just use the pre-processed file (it is a tiny file)?
At first I thought I would put (2) in a transformer hook, but looking at the implementation, my impression is that transformers are meant for more general transformations that apply to multiple datasets (e.g. measuring the loading time). I am aware that I can specify which catalog entry a transformation should be applied to, but maybe it all becomes a bit messy (?)
Any thoughts? Thanks!
Edit: After writing this, I have even more doubts. Maybe it would be better to write my own DataSetType and do these things within the _load() function?
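For reference, whichever option I go with, the logic in (2) would end up in a single function. Here is a minimal, self-contained sketch of it. The MAPPINGS dict, the sample data, the function name preprocess_te_indicators and the body of normalise_string_col are made up for illustration (my real helper and mappings differ), so please read it as a rough outline rather than my actual code:

```python
import pandas as pd

# Hypothetical mapping, for illustration only; the real one is larger.
MAPPINGS = {"UK": "United Kingdom", "USA": "United States"}


def normalise_string_col(col: pd.Series) -> pd.Series:
    """Assumed stand-in for my helper: trim whitespace and lower-case."""
    return col.str.strip().str.lower()


def preprocess_te_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the steps from (2) without mutating the input frame."""
    df = df.copy()
    df["country"] = df["country"].replace(to_replace=MAPPINGS)
    df["indicator"] = normalise_string_col(df["indicator"])
    # Snap every date to the first day of its month.
    df["date"] = df["date"].apply(lambda dt: dt.replace(day=1))
    return df


# Tiny made-up frame standing in for catalog.load("te_indicators").
raw = pd.DataFrame(
    {
        "country": ["UK", "France"],
        "indicator": ["  GDP Growth ", "Inflation Rate"],
        "date": pd.to_datetime(["2020-03-17", "2020-04-02"]),
    }
)
clean = preprocess_te_indicators(raw)
print(clean)
```

A function like this could be called from a node, from a transformer's load step, or from a custom dataset's _load(), so the question is really only about where the call should live.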