I thought this would help others.
Partitioned Dataset fits distributed jobs very well. You can simply convert a single threaded job to distributed job by using Dask
Example:
Let’s say you have a partitioned data below
df_dict = {
‘patient_1’: pandas DataFrame of patient 1,
‘patient_2’: pandas DataFrame of patient 2,
.
.
}
Each dataframe is a time-series of the encounter. Each row is a point of time…
When I want to do transformation I usually do something like:
For df in dictionary:
Impute or do something of df()
The problem with that statement is it is a single thread job so for 100 of thousands of dataframes it takes forever.
Dask is the distributed version of pandas so the above code can be simply wrapped like this:
Def impute(df):
impute transformation (df)
return df
bag = db.from_sequence(list(df_dict.values()))
bag = bag.map(lambda x: impute(x))
bag_output = bag.compute()
you can convert it back to partitioned dataframe using dictionary comprehension
return {str(x): x for x in bag_output}
This was ~ 10 times faster on my 24 core computer as opposed to the for statement.