Appending and reading data from a table

Hi all,

Sorry if it’s a dumb question, but I’m trying to read a table after it’s been appended to, and I want to read the full table rather than just the latest delta.

output_table1 has its save_args set to append/insert, and output_table1 already contains existing data.

node(func=fn1, inputs="input_table", outputs="output_table1")
node(func=fn2, inputs="output_table1", outputs="test")  # how do I ensure that output_table1 is the full table?

Thanks all

If you are using CSV for output_table1, you can append by setting mode: a in save_args and writing without a header row (header: False). E.g.:

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  save_args:
    header: False
    mode: a

Unfortunately that means losing the column names… You could save your column names elsewhere for later reuse, for instance in a TextDataSet or YAMLDataSet.
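
As a rough sketch of that idea (the node names here are made up, and it assumes the header-less CSV is later loaded with load_args header: None):

import pandas as pd

# Hypothetical node: capture the column names once, e.g. into a text.TextDataSet
# catalog entry, before the header-less appends start.
def save_column_names(df: pd.DataFrame) -> str:
    return ",".join(df.columns)

# Hypothetical node: re-attach the stored names after loading the header-less CSV.
def restore_column_names(df: pd.DataFrame, columns: str) -> pd.DataFrame:
    df.columns = columns.split(",")
    return df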

Another option would be to use Kedro-Wings, which lets you get around the fact that Kedro does not allow circular pipelines. You can read your current data from disk, append your new data to it, save the result back to the same location, and then reread it by adding “!” at the start of the input name.
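
A rough sketch of that pattern, assuming the “!” prefix lets a node re-read the dataset it also writes (as described above); the function and dataset names are illustrative, and fn1 would output an intermediate dataset such as new_rows instead of writing output_table1 directly:

import pandas as pd
from kedro.pipeline import node

# Append the freshly computed rows to the data already on disk and write the
# result back to the same catalog entry.
def append_to_table(existing: pd.DataFrame, new_rows: pd.DataFrame) -> pd.DataFrame:
    return pd.concat([existing, new_rows], ignore_index=True)

node(func=append_to_table, inputs=["!output_table1", "new_rows"], outputs="output_table1")
node(func=fn2, inputs="output_table1", outputs="test")  # fn2 now sees the full table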

However, this has the disadvantage of reading a big dataframe from file twice and writing it to file once, rather than just reading it once. That can be expensive if you have a large file on disk and only want to add a small amount of data.

Both methods have advantages and tradeoffs! Hope one meets your needs

Thanks @ljam, I forgot to mention it’s a Hive table. Yes, the append operation works; I can query it from Impala. However, only the dataframe output of fn1 is passed into fn2. How do I get that output plus the existing data in Hive to be read, i.e. the updated table?

Please check out my new updated answer. I submitted the first answer by mistake. Does it solve your problem?

Thanks @ljam, will look into Kedro-Wings :grinning_face_with_smiling_eyes:

I don’t know Hive, but I see there are three write options:
https://kedro.readthedocs.io/en/stable/_modules/kedro/extras/datasets/spark/spark_hive_dataset.html#SparkHiveDataSet

write_mode: ``insert``, ``upsert`` or ``overwrite`` are supported.

I’m guessing you need “insert”. I’m guessing this is equivalent to appending, such that on save you only add new data but on load you get the whole table. I don’t know what upsert is, on the other hand…
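
If that’s right, a minimal sketch of the dataset setup (database and table names are made up; this is the Python equivalent of a spark.SparkHiveDataSet entry in catalog.yml):

from kedro.extras.datasets.spark import SparkHiveDataSet

# With write_mode="insert", save() should append the new rows to the existing
# Hive table, while load() should return the full table as it currently stands.
output_table1 = SparkHiveDataSet(
    database="my_db",        # illustrative
    table="output_table1",   # illustrative
    write_mode="insert",
)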

@ljam thanks! I’m using insert at the moment to append the data, and it’s working fine. Upsert replaces the data in-place. It sometimes gets confusing, at least to me, because a catalog.yml entry refers to database.table_name, so when I write to it with the append option, in the next node of the pipeline I mentally think of that catalog entry as representing the entire updated table, when it’s just the new data.