I have been thinking about creating a Kedro plugin that combines DVC’s version-controlled data with Kedro’s beautiful pipelines. I have read a discussion of this idea somewhere before, but I can’t find it anymore (maybe on GitHub, Discord, etc.).
Background: At work, my jobs took a long time and failed often, but once they ran successfully, all was good. However, after changing parts of the pipeline for a while, you no longer know whether your data is up to date, so you have to rerun everything, and that takes a long time. So I used DVC to track my input data (e.g. CSV, Hadoop data*, etc.) as well as the scripts (.py, .sh, etc.). When any of those changed, the pipeline would rerun only what was necessary. For development this was a game changer.
*) I added an extra step that simply called ‘DESCRIBE my_hive/impala_table’, saved the output somewhere, and then tracked just this proxy with DVC. If the data on Hive changed, DVC would not know, but this was fine.
For Kedro, I was thinking of creating a plugin/hook with very similar behaviour. A node’s inputs are easily tracked by DVC (especially local data). The code is not so easily tracked.
Some difficulties: I cannot track all the code as a whole, because then a change to Node17 would force me to rerun everything, since according to DVC some code changed.
Thus I want a DVC run to look only at the current node’s code and data. I was thinking of using Python’s inspect module to track the code for this node. I think this would work. The problem might be that the node imports code from a different file. If that file then changes, would DVC know? Is it possible to track all relevant code?
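To make the inspect idea a bit more concrete, here is a minimal sketch of what I have in mind. The function names (code_fingerprint, referenced_source_files) are made up for illustration and are not part of Kedro or DVC. The second helper is only a best-effort attempt at the "imported code" problem: it looks at the global names a function refers to directly, so it would miss transitive imports and anything reached through attribute access.

```python
import hashlib
import inspect


def code_fingerprint(func):
    """Return a SHA-256 hex digest of the function's source text.

    Hypothetical helper: if the digest changes, the node's own code changed.
    """
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def referenced_source_files(func):
    """Best-effort list of source files for globals the function names directly.

    Hypothetical helper: only modules, functions and classes that appear as
    global names in the function body are considered.
    """
    files = set()
    for name in func.__code__.co_names:
        obj = func.__globals__.get(name)
        if inspect.ismodule(obj) or inspect.isfunction(obj) or inspect.isclass(obj):
            try:
                path = inspect.getsourcefile(obj)
            except (TypeError, OSError):
                # Built-in objects have no Python source file.
                path = None
            if path:
                files.add(path)
    return sorted(files)
```

The digest could be written to a small file per node, and the files returned by the second helper could be added as extra dependencies, so that DVC only sees a change when this particular node (or something it names directly) changes.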
DVC can only run one task at a time. Thus integrating with kedro-accelerator would be impossible. Right?
To reiterate: The goal of kedro-dvc would not be data version control per se; rather, it would help during development of Kedro pipelines. You could be sure that the code/data is always up to date, but would only have to run individual nodes; the rest can be skipped because their data is already up to date.
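As a rough illustration of the skip logic (not how DVC does it internally, and not an existing Kedro API), the check could look something like the sketch below. All names here (node_is_up_to_date, record_node_run, the JSON state file) are assumptions of mine. A real version would probably also have to check that the node's outputs still exist.

```python
import hashlib
import json
from pathlib import Path


def _fingerprint(code_hash, input_paths):
    """Combine a node's code hash with SHA-256 digests of its input files."""
    inputs = {
        str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in input_paths
    }
    return {"code": code_hash, "inputs": inputs}


def node_is_up_to_date(node_name, code_hash, input_paths, state_file):
    """Return True if the stored fingerprint matches the current one."""
    state_path = Path(state_file)
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    return state.get(node_name) == _fingerprint(code_hash, input_paths)


def record_node_run(node_name, code_hash, input_paths, state_file):
    """Store the current fingerprint after a successful node run."""
    state_path = Path(state_file)
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    state[node_name] = _fingerprint(code_hash, input_paths)
    state_path.write_text(json.dumps(state, indent=2))
```

The idea would be to call node_is_up_to_date before a node runs and record_node_run after it succeeds, so a node is skipped only when neither its code hash nor any of its input files changed since the last successful run.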
The goal is not to keep the data versioned (I guess you should use versioned datasets xD)
Before I embark on this, I would like to ask the community for feedback.