Kedro 6 Months In | bazaarvoice: engineering

So am I a Kedro convert? Yah, you betcha.

A glowing review of Kedro!

Incremental Dataset: This support exists for reading data, but it’s lacking for writing datasets. This affected us a few times when we had a node that would take 8-10 hours to run. We lost work if the node failed part of the way through.

The one issue he describes with IncrementalDataSets is a real one. To address it in the past, I’ve had to write custom datasets with their incremental nature built in.
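
As a rough sketch of what "incremental nature built in" could look like (not code from any actual project): the class below writes each chunk to its own file as soon as it is produced and skips chunks that already exist, so a rerun after a failure only redoes the missing work. The name `ResumableChunkWriter`, the JSON file format, and the dict-of-callables input are all assumptions made for illustration. In a real Kedro project this logic would presumably live in a subclass of `kedro.io.AbstractDataSet`, in its `_save` and `_load` methods.

```python
import json
from pathlib import Path


class ResumableChunkWriter:
    """Hypothetical dataset-like class: one JSON file per chunk, resumable."""

    def __init__(self, directory):
        self._dir = Path(directory)

    def save(self, chunks):
        """Persist chunks one at a time.

        `chunks` maps a chunk id to a zero-argument callable that returns
        JSON-serialisable data. Chunks already on disk are skipped.
        Returns the ids that were actually computed in this call.
        """
        self._dir.mkdir(parents=True, exist_ok=True)
        computed = []
        for chunk_id, produce in chunks.items():
            target = self._dir / f"{chunk_id}.json"
            if target.exists():
                continue  # finished in an earlier run
            # Write to a temp file first so a crash never leaves a
            # half-written chunk that looks complete.
            tmp = target.with_suffix(".tmp")
            tmp.write_text(json.dumps(produce()))
            tmp.replace(target)
            computed.append(chunk_id)
        return computed

    def load(self):
        """Return every chunk written so far, keyed by chunk id."""
        return {
            path.stem: json.loads(path.read_text())
            for path in sorted(self._dir.glob("*.json"))
        }
```

If the producer for one chunk raises, the chunks saved before it stay on disk, and calling `save` again with the same mapping picks up from the first missing chunk.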


For sure. I have a 40-hour pull from a DB. My hacky solution is to write a function that accepts arguments for slicing the pull into smaller pieces (year, month, or some other range). Then I generate a catalog entry and a node for each chunk in a simple for loop.

Then I manually bring them together by passing pd.concat as the func to an additional node.
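
The chunking approach described above could be sketched roughly like this. Everything here is illustrative: the helper names (`month_ranges`, `make_pull`, `combine`, `build`), the dataset names, the file paths, and the `pandas.ParquetDataSet` type are assumptions, and lists of rows stand in for DataFrames so the sketch runs without pandas or Kedro installed. In a real project, `combine` would be `pd.concat` wrapped to accept the chunks, each spec would be passed to `kedro.pipeline.node(**spec)`, and the `catalog` dict would be dumped to YAML.

```python
from datetime import date


def month_ranges(start_year, end_year):
    """Yield (chunk_id, first_day, first_day_of_next_month) for each month."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            next_month = date(year + month // 12, month % 12 + 1, 1)
            yield f"{year}_{month:02d}", date(year, month, 1), next_month


def make_pull(start, end):
    """Return a node function that pulls only the rows in [start, end)."""

    def pull():
        # Placeholder for the real DB query restricted to this date range.
        return [{"start": start.isoformat(), "end": end.isoformat()}]

    return pull


def combine(*chunks):
    """Stand-in for pd.concat: join the per-chunk row lists into one list."""
    rows = []
    for chunk in chunks:
        rows.extend(chunk)
    return rows


def build(start_year, end_year):
    """Generate one catalog entry and one node spec per monthly chunk."""
    catalog, node_specs, chunk_names = {}, [], []
    for chunk_id, start, end in month_ranges(start_year, end_year):
        name = f"db_pull_{chunk_id}"
        catalog[name] = {
            "type": "pandas.ParquetDataSet",  # assumed dataset type
            "filepath": f"data/01_raw/{name}.parquet",  # assumed path
        }
        node_specs.append({
            "func": make_pull(start, end),
            "inputs": None,
            "outputs": name,
            "name": f"pull_{chunk_id}",
        })
        chunk_names.append(name)
    # One extra node brings all the chunks back together.
    node_specs.append({
        "func": combine,
        "inputs": chunk_names,
        "outputs": "db_pull_all",
        "name": "combine_chunks",
    })
    return catalog, node_specs
```

Because every chunk has its own catalog entry, each one is persisted as soon as its node finishes, so a failure partway through costs one chunk rather than the whole pull.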

Generating the nodes feels proper, but currently the catalog requires a Python function to generate YAML, and that feels a bit hacky. Maybe @deepyaman’s Jinja PR will change that.

Honestly, I have never used Incremental/Partitioned datasets, and I’m not sure I fully understand them. Every time I take the time to learn them, my data doesn’t quite fit for some reason, but that could be my lack of understanding.
