How to work with Partitioned Datasets on SQL/Hive Databases

I have a situation where I have very large datasets on Hadoop. About 20 tables exist. They have identical schemas, but are for different customers (e.g. Table_CUSTOMER1_info, Table_CUSTOMER2_info, etc.). The tables are roughly 100 GB in size each. I don’t have write access to the Hadoop cluster.

Furthermore, other teams in my company stream data to those tables. I want to do a monthly analysis aggregating a bunch of data. The aggregated info per table is small (1-2 MB each).

Currently I do it as follows:

  • I have a text dataset which contains all customer names (arguably I should use parameters).
  • I have a text dataset containing a list of months and years.

A first node cross-joins those, so I end up with a long list of customer_name_month_year entries.

  • This text dataset is then used as input to the node which uses Spark to do the big data processing and saves the aggregated CSV “table_customer1_month_year_aggregate”. There are a lot of those files; together they form a partitioned CSV dataset.
  1. While this does work, it is a bit clumsy that I have to add a new month to the text file whenever I want to look at the next month.
  2. Currently my text datasets act as a kind of metadata dataset for my Hive tables. Again, this works, but it’s a bit of a misuse of the Kedro API. Somebody new to the project would be very confused about what is going on.
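For context, the cross-join step described above could be sketched roughly like this; the dataset and column names (`customer`, `month_year`) are illustrative assumptions, not my actual catalog entries:

```python
from itertools import product

import pandas as pd


def build_customer_month_table(
    customers: pd.DataFrame, months: pd.DataFrame
) -> pd.DataFrame:
    """Cross-join customer names with month/year entries.

    Assumes `customers` has a "customer" column and `months` a
    "month_year" column; both names are illustrative.
    """
    rows = [
        {"customer": c, "month_year": m}
        for c, m in product(customers["customer"], months["month_year"])
    ]
    return pd.DataFrame(rows)
```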

My question is: do you have any good ideas? Thanks for any feedback & suggestions.

P.S. Thinking about 1), I guess I could simply create a node which checks the current date and then appends a new month once the month has passed.
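A minimal sketch of that P.S. idea, with the current date passed in so the function stays testable (the "YYYY-MM" format and the function name are my assumptions):

```python
from datetime import date


def append_previous_month(months: list[str], today: date) -> list[str]:
    """Append the most recently completed month (as "YYYY-MM") if missing."""
    # the month before `today`, handling the January wrap-around
    prev = (
        date(today.year - 1, 12, 1)
        if today.month == 1
        else date(today.year, today.month - 1, 1)
    )
    key = prev.strftime("%Y-%m")
    return months if key in months else months + [key]
```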

Hi @Yy_S,

Forgive me for the naive question, but why not drop the months_years dataset and generate the months programmatically from some date-range parameter? Is it because you have non-contiguous ranges?
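Something like this, for example (a sketch, assuming months are "YYYY-MM" strings and `start`/`end` come from a `parameters.yml` entry rather than a text dataset):

```python
import pandas as pd


def generate_months(start: str, end: str) -> list[str]:
    """Build the month list from a date-range parameter instead of a static file."""
    return [
        p.strftime("%Y-%m")
        for p in pd.period_range(start=start, end=end, freq="M")
    ]
```

The node would then take `params:start` and `params:end` as inputs instead of reading the months text dataset.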

Hi @sebastianbertoli,
that is an idea, I haven’t tried it. But does that work for an IncrementalDataset? (I know I wrote Partitioned.)
If I understood IncrementalDatasets correctly, they check for a node in the pipeline:
[partitioned/incremental dataset] → (node) → [incremental dataset]

and, if there is something new on the left side, run the node to create the new item on the right.

But if I only have a function on the left, Kedro might not do what I want? Not sure.
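For what it’s worth, as I understand the PartitionedDataset API, a node that writes to one can simply return a dict mapping partition keys to data, and Kedro saves one partition per key. A sketch, where the aggregation itself is just a placeholder (the real node would run the Spark query against Hive):

```python
import pandas as pd


def aggregate_per_partition(keys: list[str]) -> dict[str, pd.DataFrame]:
    """Return one small aggregate frame per partition key (e.g. "customer1_2024-01")."""
    out = {}
    for key in keys:
        # placeholder aggregation; the real node would query Hive via Spark here
        out[key] = pd.DataFrame({"partition": [key], "total": [0]})
    return out
```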
