I was wondering whether there is a way (or whether it has been discussed) to speed up Kedro pipelines, similar to what tools like Docker or Ansible do, by only rerunning a node if either its code or its input data has changed. In essence, the idea would be to wrap each node call with a caching wrapper that checks whether the node's underlying code has changed or whether the input datasets have been updated.
This could be achieved fairly easily for local data (by hashing the files) and for functions (by hashing the function body), and it could be an opt-in flag such as `--avoid-reruns` or whatever. It would be harder if the data sits on S3 or another remote location, as we would have to look at the modification timestamps on the underlying file systems to make sure that a file hasn't been updated since it was last processed.
I would, however, assume that the performance gains could be pretty impressive. Once the pipeline DAG has been constructed, the runner would skip every node until it hits one whose code or inputs have changed. That would really aid the development flow, IMHO.
Any thoughts from the community?
Also hi everyone