This will depend on the use case, but if more than one person is going to run the pipeline, then this state definitely needs to be shared somewhere. Even in the case of hashing the data, the “shared” state would be the data itself.
Given the simplicity of the ID, it can really be stored in any medium: a touched file next to the target data in S3, a collection in a remote git repo, rows in a database or key-value store, or even a file on a shared drive, etc.
The difficulty then becomes synchronizing updates to the non-database storage mechanisms, but this is a problem for all shared-state exercises, and there are workarounds.
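Just to make the “any medium” point concrete, here’s a minimal sketch of the touched-file-in-S3 variant (bucket and key names are made up, and the run-ID scheme is whatever we settle on):

```python
# Minimal sketch: the shared "state" is just a run/version ID stored as a
# marker object next to the target data in S3. Bucket/key names are made up.
import boto3

BUCKET = "my-pipeline-bucket"             # hypothetical bucket
MARKER_KEY = "data/target/.last_run_id"   # hypothetical marker path

s3 = boto3.client("s3")

def write_run_id(run_id: str) -> None:
    """Record the latest run/version ID next to the data."""
    s3.put_object(Bucket=BUCKET, Key=MARKER_KEY, Body=run_id.encode("utf-8"))

def read_run_id() -> str | None:
    """Read the shared run/version ID, or None if it has never been written."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=MARKER_KEY)
    except s3.exceptions.NoSuchKey:
        return None
    return obj["Body"].read().decode("utf-8")
```

The same two functions could just as easily be backed by a database row, a git-tracked file, or a shared drive without changing anything downstream.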
Yep, that’s correct, but I still see this as a problem of sharing state rather than of monotonicity as an implementation detail.
DS == data source, right?
If it’s all in Kedro, I don’t see a problem hooking into read/write IO; modifying dataset IO is a well-accepted practice. In fact, with a bit of finagling, we wouldn’t need to change any IO at all, and I think this can be done using only the built-in Kedro hooks:
after_node_run, a custom parameter, and node decorators.
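Roughly what I have in mind (a rough sketch, not tested; the hook implementation accepts a subset of the spec’s arguments, which pluggy allows, and record_state is whatever shared-state writer we pick, e.g. the write_run_id sketch above):

```python
# Rough sketch of the after_node_run idea. In a real project the class would be
# registered via the HOOKS tuple in settings.py.
from kedro.framework.hooks import hook_impl

class StateTrackingHooks:
    def __init__(self, record_state):
        # record_state is whatever "write to shared state" function we settle on;
        # injected here so the hook stays generic.
        self._record_state = record_state

    @hook_impl
    def after_node_run(self, node, outputs):
        # "outputs" maps dataset names to the data the node just produced,
        # so each saved dataset can be marked as updated in the shared state.
        for dataset_name in outputs:
            self._record_state(f"{node.name}:{dataset_name}")
```

The custom parameter and node decorators would be the other two pieces, e.g. to control which nodes and datasets actually participate.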
If we share state correctly, in what cases do we actually need to check timestamps or hash the data?
- The timestamp only changes if we save a new version of the data to the storage mechanism. If we’re going to rely on the storage mechanism’s metadata, why not just rely on our own metadata? It’s cleaner, and we can guarantee its correctness (a timestamp is not guaranteed to be correct, depending on the medium).
- Hashing is only done when a new version of the data has been saved. If we already know we’ve saved new data, why not just store that state as metadata (see the sketch after this list)? Hashing will eventually become untenable as data pipelines grow, data grows, and data formats grow as well.
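For the “store that state as metadata” piece, this is roughly what I mean (sketch only; the manifest path and fields are made up):

```python
# Sketch of "store the state as our own metadata" instead of hashing: a small
# JSON manifest keyed by dataset name, holding a run ID and save time we control.
import json
import time
from pathlib import Path

MANIFEST = Path("shared/state_manifest.json")  # hypothetical shared location

def mark_saved(dataset_name: str, run_id: str) -> None:
    """Record that a dataset was (re)saved, without touching or hashing the data."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    manifest[dataset_name] = {"run_id": run_id, "saved_at": time.time()}
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def is_stale(dataset_name: str, last_seen_run_id: str) -> bool:
    """A consumer only compares run IDs; it never has to re-hash the data itself."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    entry = manifest.get(dataset_name)
    return entry is not None and entry["run_id"] != last_seen_run_id
```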
Hashing at the code level, though, I do think is necessary. That brings up other questions on data versioning during individual development versus release, which is another bag of worms.
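To be concrete about hashing at the code level, something like this is what I have in mind (sketch only; the node function is a made-up stand-in):

```python
# Sketch of code-level hashing: fingerprint a node function's source so we can
# tell when the code (rather than the data) has changed between runs.
import hashlib
import inspect

def code_fingerprint(func) -> str:
    """Hash the source of a node function; cheap compared to hashing data."""
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def preprocess(df):  # stand-in for a real Kedro node function
    return df.dropna()

print(code_fingerprint(preprocess))  # stable until the function body changes
```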
How do you feel about these ideas?