What is the best way to have a dynamic filename in a Dataset?

Hello,

If we have a dataset entry in the catalog and we want its filename to change dynamically from Python code, what is the best way to do it?

For example, the cars filename should depend on the variable {datetime} or on another Python variable.

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars_{datetime}.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .

What are the ways to handle this?


Can you explain a bit more on what you mean by dynamic? Do you want to change the filename during a pipeline run or during the catalog loading phase?


Option 1

Based on your example, it's just during catalog load. There are Jinja capabilities in the catalog. I think it's a simple change from the default config loader to a templated config loader, but I have never used it myself.

Option 2

You can use an after_catalog_created hook to change the _filepath attribute inside the dataset.
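A minimal sketch of what Option 2 could look like, in case it helps. The class name DynamicFilepathHooks, the helper render_filepath, and the date format are made up for illustration. It also assumes the catalog entry keeps the literal {datetime} placeholder in filepath, that the dataset stores its path in the private _filepath attribute, and that the catalog exposes _get_dataset; both are private and differ between Kedro versions, so check yours before relying on this.

```python
from datetime import datetime
from pathlib import PurePosixPath

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be read and run without Kedro installed.
    def hook_impl(func):
        return func


def render_filepath(template: str, now: datetime) -> str:
    """Fill the {datetime} placeholder (hypothetical helper, assumed date format)."""
    return template.format(datetime=now.strftime("%Y-%m-%d"))


class DynamicFilepathHooks:
    """Hypothetical hook class; it would be registered in the project's settings."""

    dataset_name = "cars"  # catalog entry from the question

    @hook_impl
    def after_catalog_created(self, catalog):
        # ASSUMPTION: _get_dataset and _filepath are private and version dependent.
        dataset = catalog._get_dataset(self.dataset_name)
        rendered = render_filepath(str(dataset._filepath), datetime.now())
        dataset._filepath = PurePosixPath(rendered)
```

Because this touches private attributes, it can break on upgrade, and it resolves the name once at catalog creation rather than on every save.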


As a side note, in case you just want to timestamp your dataset, you might want to look into Versioning. :slight_smile:


@sebastianbertoli, but with a versioned dataset, could you put the timestamp in the filename instead of having it as a folder?

@waylonwalker I tried option one, but I didn’t like it for just one dataset in the catalog. Option two seems interesting.

In the end we are going with the PartitionedDataset, changing the name when saving the file.
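For anyone landing here, a rough sketch of how that could look. A PartitionedDataSet saves a dictionary whose keys become the file names (plus the configured filename_suffix), so a node can pick the name at save time. The function name to_timestamped_partition, the prefix, and the date format are hypothetical, and the catalog entry is assumed to be something like type: PartitionedDataSet with path: data/01_raw/company, dataset: pandas.CSVDataSet and filename_suffix: ".csv".

```python
from datetime import datetime
from typing import Any, Dict, Optional


def to_timestamped_partition(
    data: Any, now: Optional[datetime] = None, prefix: str = "cars"
) -> Dict[str, Any]:
    """Wrap the data in a one-entry dict keyed by a timestamped partition id.

    When this dict is the node output for a PartitionedDataSet, the key is
    used as the file name (the suffix comes from the catalog entry).
    """
    now = now or datetime.now()
    # Date format is an assumption; adjust to whatever the filename needs.
    partition_id = f"{prefix}_{now:%Y-%m-%d}"
    return {partition_id: data}
```

The node would return this dictionary instead of the bare DataFrame, so no catalog templating or hook is needed.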

You are correct @Javier_Hernandez. If the timestamp needs to be in the filename, the built-in versioning is not suited for that. :slight_smile:

I am not a fan of the default; the nested folder is a bit weird and unnecessarily complicated. You have to open a lot of folders just to open a file.

For example, if I want to get all the output datasets from an experiment, they will be created in different folders and are hard to organize.
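To make the difference concrete, here is a small sketch comparing the nested layout that built-in versioning produces (filepath, then a version folder, then the file name again) with a flat name that carries the timestamp. Both function names are made up, and the version string is only an example value, not something generated by Kedro here.

```python
from pathlib import PurePosixPath


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Nested layout: <filepath>/<version>/<filename>."""
    path = PurePosixPath(filepath)
    return path / version / path.name


def flat_path(filepath: str, version: str) -> PurePosixPath:
    """Flat alternative: the timestamp goes into the file name itself."""
    path = PurePosixPath(filepath)
    return path.with_name(f"{path.stem}_{version}{path.suffix}")


example_version = "2021-03-04T10.30.00.000Z"  # example value only
print(versioned_path("data/07_model_output/cars.csv", example_version))
print(flat_path("data/07_model_output/cars.csv", example_version))
```

With the nested layout every dataset of a run ends up in its own subfolder, which is what makes collecting all outputs of one experiment awkward.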