I have a situation where I have very large datasets on Hadoop. About 20 tables exist. They have identical schemas, but are for different customers (e.g. Table_CUSTOMER1_info, Table_CUSTOMER2_info, etc.). The tables are roughly 100 GB in size each. I don't have write access to the Hadoop cluster.
Furthermore, other teams in my company stream data to those tables. I want to do a monthly analysis aggregating a bunch of data. The aggregated info per table is small (1-2 MB each).
Currently I do it as follows:
- I have a text dataset which contains all customer names (arguably I should use parameters).
- I have a text dataset containing a list of months and years.
- A first node combines those into a cross product, so I end up with a long list of customer_name, month, year combinations.
- That text dataset is then used as input to the node which uses Spark to do the big-data processing and saves the aggregated CSV "table_customer1_month_year_aggregate". There are a lot of those; together they form a partitioned CSV dataset.
- While this does work, it is a bit awkward that I have to add a new month to the text file whenever I want to look at the next month.
- Currently my text datasets are effectively metadata datasets for my Hive tables. Again, this works, but it's a bit of a misuse of the Kedro API. Somebody who is new to the project would be very confused as to what is going on.
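For context, the cross-product step I describe above is essentially this (a minimal sketch; the function name, the input lists, and the dict record format are illustrative, not my actual project code):

```python
from itertools import product

def build_customer_month_table(customers: list[str], months: list[str]) -> list[dict]:
    """Cross-join customer names with month/year strings.

    Each output row identifies one (customer, month) aggregation job,
    mirroring the long customer_name/month/year list described above.
    """
    return [
        {"customer": customer, "month": month}
        for customer, month in product(customers, months)
    ]

# Example: 2 customers x 2 months -> 4 aggregation jobs
jobs = build_customer_month_table(
    ["CUSTOMER1", "CUSTOMER2"],
    ["2024-01", "2024-02"],
)
```

The Spark node then iterates over these jobs and writes one aggregate CSV per row.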
My question is: do you have any good ideas? Thanks for any feedback & suggestions.
P.S. Thinking about the month list, I guess I could simply create a node which checks the current date and then appends a new month once the month has passed.