How to deal with a non-static folder of text files?

I have a folder of text files in data/01_raw that I want to read, process, and store in data/02_intermediate as CSV files. The folder is non-static: it contains N files (N not known in advance) with no fixed filename format – so specifying each file in the YAML config file is not possible.

I coded a working prototype using PartitionedDataSet. While it produces the desired outputs, I noticed that the processing happens entirely in memory, i.e. it will read all of the, say, 20 text files and dump the 20 CSV files in one go at the end of processing. (I monitored the destination folder and the files all appear at once instead of one by one.)
This could lead to out-of-memory issues in a production environment with thousands of huge text files.
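
For reference, my node looks roughly like this (simplified; the function name, parsing logic, and column name are placeholders, not my real code):

```python
import pandas as pd

# With PartitionedDataSet, the node's input is a dict of
# {partition_id: load_function} and its output is a dict of
# {partition_id: data}. Every DataFrame in the returned dict is fully
# materialized before Kedro saves anything, which matches the
# all-at-once behaviour I observe.
def process_partitions(partitions: dict) -> dict:
    results = {}
    for partition_id, load_text in partitions.items():
        text = load_text()  # reads one raw text file
        results[partition_id] = pd.DataFrame({"line": text.splitlines()})
    return results  # all N outputs held in memory, then written in one go
```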

What is the correct approach I should take here?

  1. Should I be using a dynamic catalog instead of PartitionedDataSet?
  2. There is a PartitionedDataSet._release() method: does calling it dump the contents out? And where/how should this method be called? (My code never handles a PartitionedDataSet object directly.)

Using Kedro 0.17.0

Hi @Frankie,

Looks like this was addressed just a few days ago: Let PartitionedDataSet lazily materialize data to save by lou-k · Pull Request #744 · quantumblacklabs/kedro · GitHub

@sebastianbertoli Thanks, but I’m not sure what to make of this. Doesn’t lazy save mean the contents are held back (in memory) until they are flushed? That appears to be exactly the issue I am facing, which I think will cause problems for a large number of files. In fact, I would think I want to disable lazy save and just have Kedro output each file sequentially – I don’t have a performance issue that would warrant lazy save.


I have not tried it, but the problem statement seemed quite similar to yours. It could also be that I have misunderstood your question, though. 🙂

Hi Frankie,

Unless I am mistaken, the point of lazy saving is that each partition is processed and saved to disk before moving on to the next partition, instead of everything being materialized first. Unfortunately, this feature only becomes available in Kedro 0.17.4.
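
Based on the PR, on 0.17.4+ the node would return a callable per partition instead of the data itself, and each partition gets computed and written as it is saved. A sketch (untested on my end; names are illustrative):

```python
from functools import partial

import pandas as pd

def _process_one(load_text) -> pd.DataFrame:
    # Only runs when the output PartitionedDataSet saves this partition.
    return pd.DataFrame({"line": load_text().splitlines()})

def process_partitions_lazily(partitions: dict) -> dict:
    # Returning callables instead of data means only one partition's
    # DataFrame needs to exist in memory at any given time.
    return {
        partition_id: partial(_process_one, load_text)
        for partition_id, load_text in partitions.items()
    }
```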

For now, have you considered using IncrementalDataSet? Your first run might still need to go through everything, but future runs will only need to go through new data.
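
A rough sketch of how that could look (the dataset and function names are placeholders, and in a real project the datasets would be registered in the catalog under these names):

```python
import pandas as pd
from kedro.io import IncrementalDataSet
from kedro.pipeline import node

# Same raw folder, but IncrementalDataSet keeps a checkpoint recording
# which partitions have already been processed.
raw_texts = IncrementalDataSet(
    path="data/01_raw",
    dataset="text.TextDataSet",
)

def process_partitions(partitions: dict) -> dict:
    # Unlike PartitionedDataSet, IncrementalDataSet loads partitions
    # eagerly, so the values are file contents rather than load functions.
    return {
        pid: pd.DataFrame({"line": text.splitlines()})
        for pid, text in partitions.items()
    }

# "confirms" updates the checkpoint once the node succeeds, so the next
# run only sees partitions that arrived after this one.
process_node = node(
    process_partitions,
    inputs="raw_texts",
    outputs="intermediate_csvs",
    confirms="raw_texts",
)
```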

Hi ljam, I have looked into IncrementalDataSet, but it does not suit my requirements as I do not need the checkpointing feature.
My data files are downloaded from the Internet; the number of files and the total data size are unknown.
I noticed that Kedro seems to process the entire pipeline in memory, so there is a risk of an out-of-memory error if the total number or size of the files is too large.
What I really want is for Kedro to process each file individually and save the output immediately upon completion, rather than processing the files as one batch.

Are you using MemoryDataSets or datasets that write to a file?