I have a folder of text files in data/01_raw that I want to read, process, and store in data/02_intermediate as CSV files. The folder is non-static: it contains N files (N not known in advance) with no fixed filename format, so specifying each file in the YAML config file is not possible.
I coded a working prototype using PartitionedDataSet. While it produces the desired outputs, I notice the processing happens entirely in memory: it will read all of the, say, 20 text files and dump the 20 CSV files in one go at the end of processing. (I monitored the destination folder and the files all appeared instantly rather than one by one.)
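For reference, my setup looks roughly like the sketch below (dataset names and the `transform` step are simplified stand-ins, not my real code). The catalog entries are shown as comments; the node receives the usual `{partition_id: load_callable}` dict from PartitionedDataSet, so the *loads* are lazy, but the returned dict is only written out after the node returns, which matches the all-at-once write behaviour I observed:

```python
# Hypothetical catalog.yml entries (illustrative names):
#
# raw_text_files:
#   type: PartitionedDataSet
#   path: data/01_raw
#   dataset: text.TextDataSet
#   filename_suffix: ".txt"
#
# intermediate_csvs:
#   type: PartitionedDataSet
#   path: data/02_intermediate
#   dataset: pandas.CSVDataSet
#   filename_suffix: ".csv"

def transform(text: str):
    """Placeholder processing: split comma-separated lines into rows."""
    return [line.split(",") for line in text.splitlines()]

def process_partitions(partitioned_input: dict) -> dict:
    """Node: maps each raw partition to a processed output partition.

    `partitioned_input` is {partition_id: load_callable}; each callable
    reads one raw file on demand. The returned dict is handed back to
    Kedro, which saves every output partition at once after the node ends.
    """
    results = {}
    for partition_id, load_func in partitioned_input.items():
        text = load_func()                     # reads a single text file
        results[partition_id] = transform(text)
    return results                             # all outputs written here
```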
This could lead to out-of-memory issues in a production environment with thousands of huge text files.
What is the correct approach I should take here?
- Should I be using a dynamic catalog instead of PartitionedDataSet?
- There is a PartitionedDataSet._release() method: does calling it flush the contents to disk? And where/how should it be called, given that my code never deals with a PartitionedDataSet object directly?
Using Kedro 0.17.0.