Run a node only at the end of the pipeline

What is the goal you are trying to achieve?
I am trying to get dataset A from a database, compare it with the latest stored data (dataset B), and only at the end of the pipeline save dataset A as the new dataset B.

What have you tried, in order to accomplish the goal?

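    from kedro.pipeline import Pipeline, node

    # get_diff and sync are my own functions (definitions not shown here).
    def create_pipeline(**kwargs):
        node_consumer = node(
            lambda x: x, ["data_redshift"], "data_csv", name="data_consumer"
        )
        # Compare the fresh data with the previously stored snapshot.
        node_get_diff = node(
            get_diff, ["data_csv", "csv_latest"], "data_to_sync", name="get_diff"
        )
        node_sync = node(sync, ["data_to_sync"], None, name="data_sync")
        # Meant to run last: overwrite the snapshot with the fresh data.
        node_update_latest = node(lambda x: x, ["data_csv"], "csv_latest", name="update_latest")
        return Pipeline([node_consumer, node_get_diff, node_sync, node_update_latest])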

But when I try to run this, I receive a circular dependency error because csv_latest is used as an input in node_get_diff and as an output in node_update_latest. I understand why it happens. But what is the correct way to do this with Kedro?

What version of Kedro are you using? (Use kedro -V)

Do you have any custom plugins?

Welcome to the community @soufraz!

The most common way I have seen to resolve circular dependencies is to add a second dataset to the catalog that points to the same filepath. Make sure you document this well so that your team knows what is going on, as Kedro will treat the two entries as separate sources.
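
To make the idea concrete, here is a rough plain-Python sketch of why the second catalog entry helps. It is only a model of name-based dependency resolution, not Kedro's own code: the entry name `csv_latest_updated`, the path `data/latest.csv`, and the helper `node_dependencies` are all made up for illustration.

```python
from graphlib import TopologicalSorter
from pathlib import Path

# Hypothetical catalog: two dataset names that share one file path.
catalog = {
    "csv_latest": Path("data/latest.csv"),          # read by get_diff
    "csv_latest_updated": Path("data/latest.csv"),  # written by update_latest
}

# Node name -> (input dataset names, output dataset names),
# mirroring the pipeline in the question.
nodes = {
    "data_consumer": (["data_redshift"], ["data_csv"]),
    "get_diff": (["data_csv", "csv_latest"], ["data_to_sync"]),
    "data_sync": (["data_to_sync"], []),
    "update_latest": (["data_csv"], ["csv_latest_updated"]),
}


def node_dependencies(nodes):
    """Map each node to the nodes that produce its inputs, matching by name only."""
    producers = {
        out: name for name, (_, outputs) in nodes.items() for out in outputs
    }
    return {
        name: {producers[i] for i in inputs if i in producers}
        for name, (inputs, _) in nodes.items()
    }


deps = node_dependencies(nodes)
# One valid execution order; several orders can satisfy the same dependencies.
order = list(TopologicalSorter(deps).static_order())
```

Because dependencies are matched by dataset name, `get_diff` no longer depends on `update_latest`, even though both names resolve to the same file. Note that in this model `update_latest` only depends on `data_consumer`, so the names alone do not force it to run after `get_diff`; if that order matters, it may need an extra input from a later node.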


Thank you. Simple and awesome solution hahahaha.


Glad to help!
