Looping same pipeline several times

Hi everyone,

I have several pipelines that process data and use it to train ML models (neural networks + random forest). I want to create a loop where the training pipeline is run several times with different inputs (preferably read from the same file) and different hyper-parameters (which I am guessing can just be read from a file, if I can make my pipeline re-read the file every time it is called again).

The following code in my main pipeline only runs the t_models pipeline once:

model_pipeline = t_models.create_pipeline()
model_pipeline_1 = t_models.create_pipeline()
model_pipeline_2 = t_models.create_pipeline()
return {
    "default": model_pipeline + model_pipeline_1 + model_pipeline_2
}
kedro, version 0.16.3

I’d like to run the t_models pipeline X times in a loop, and in between each run I need to change the data inside some input files or the hyper-parameters.
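Conceptually, what I am after is something like the sketch below (pseudo-code: I am not sure whether the modular-pipeline pipeline() helper supports this kind of remapping in 0.16.3, and the dataset and parameter names are made up):

from functools import reduce
from operator import add

from kedro.pipeline import pipeline  # modular-pipeline helper for remapping inputs/outputs

from my_project.pipelines import t_models  # hypothetical import path


def create_pipelines(**kwargs):
    base = t_models.create_pipeline()
    copies = [
        pipeline(
            base,
            inputs={"training_data": f"training_data_{i}"},                    # made-up dataset names
            parameters={"params:model_options": f"params:model_options_{i}"},  # made-up parameter keys
            outputs={"trained_model": f"trained_model_{i}"},
        )
        for i in range(3)
    ]
    # combine the copies into one pipeline registered under "default"
    return {"default": reduce(add, copies)}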


Hi @mirfan,

I am not sure I one hundred percent understand what you are trying to do. If I understood correctly, you want to test different combinations of features (e.g. once petal_length + sepal_length, and another time just petal_length) as well as different hyper-parameters (learning rates, etc.). In that case, I don’t think there is a need to run a single data science pipeline several times. I solved a similar issue for training a LightGBM model using a for loop for the features and sklearn’s GridSearchCV. I put the features and the hyper-parameters I wanted to use inside the parameters YAML file like so:

target: "SumQuantity"
feature_sets:
    [
        ["Weekday", "Week", "Year"],
        ["Weekday", "Week"],
        ["Weekday", "Week"]
    ]

model_parameters:
    boosting_type: ["gbdt", "goss", "dart"]
    objective: ["regression"]
    learning_rate: [0.01, 0.05, 0.1]
    num_iterations: [100]
    num_leaves: [10, 20, 30]
    feature_fraction: [0.9]
    verbose: [-1]

Then I created the following training node inside my ds pipeline:

import logging
from typing import Dict, List

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV


def train_model(
    X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray, parameters: Dict
) -> List:
    """Train the regression model with a grid search over the parameter grid.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for DailyQuantity.
        X_test: Test data of independent features.
        y_test: Test data for DailyQuantity.
        parameters: Parameters defined in parameters.yml.

    Returns:
        The fitted GridSearchCV object and a DataFrame of cross-validation results.
    """
    model_parameters = parameters["model_parameters"]
    estimator = lgb.LGBMRegressor()
    gbm = GridSearchCV(estimator, model_parameters, cv=6)
    gbm.fit(X_train, y_train)
    logger = logging.getLogger(__name__)
    logger.info("Best: %f using %s" % (gbm.best_score_, gbm.best_params_))
    cv_results = pd.DataFrame.from_dict(gbm.cv_results_)
    return [gbm, cv_results]

I haven’t added it in the code snippet above, but you would also need to loop through your feature sets.
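As a rough sketch of that loop (the single data DataFrame, the train/test split and the column selection are assumptions about how your data is organised):

from typing import Dict, List, Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def train_over_feature_sets(data: pd.DataFrame, parameters: Dict) -> List[Tuple]:
    """Sketch: run the grid search once per feature set defined in parameters.yml."""
    results = []
    for feature_set in parameters["feature_sets"]:
        X = data[feature_set]                # keep only this feature set's columns
        y = data[parameters["target"]]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        gbm, cv_results = train_model(X_train, y_train, X_test, y_test, parameters)
        results.append((feature_set, gbm, cv_results))
    return results

Hope this helps!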


This same question came up in one of our projects as well. E.g. we are processing pages in a document, say a PDF, and we need to run the same pipeline for all the pages. Dynamically creating pipelines with the page number as a parameter looked a bit hackish. Is there a more elegant method for this?


Hi @pavanbi2i,

Once again, I am not sure rerunning the pipeline for each page is the way to go. Have you considered trying PartitionedDataSet or IncrementalDataSet? IncrementalDataSet is perfect if you are, for instance, receiving new pages every day; you could name your pages “2020-12-24-page-001.pdf”, for example. It requires a bit of effort, but you would need to write each node such that it loops through the items of your partitioned/incremental dataset.
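For illustration, a node consuming a PartitionedDataSet receives a dictionary mapping partition ids to load functions, so the loop could look roughly like this (extract_page_info is a made-up stand-in for whatever per-page processing you need):

from typing import Any, Callable, Dict


def process_pages(partitioned_pages: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Sketch: iterate over a PartitionedDataSet and process each page/partition."""
    results = {}
    for partition_id, load_partition in sorted(partitioned_pages.items()):
        page = load_partition()                          # each value is a callable that loads one partition
        results[partition_id] = extract_page_info(page)  # hypothetical per-page logic
    return results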
Here are good examples:

Hope this helps!


Hey @ljam
Expanding on the previous question regarding re-running the same pipeline for different pages of a document.
As you can see in the diagram above, I want to process the pages in the given document in parallel and then combine the results corresponding to the individual pages to get the document-level information. The pipeline that each page goes through would be the same.
Is there a way to implement this kind of solution using Kedro?


Yeah. For the nodes where you process the pages separately, just use the solution I mentioned previously and then you can concatenate it all using a similar method to this:

If you want to parallelize the processing, just parallelize the for loop using pool.apply() or something similar: https://www.machinelearningplus.com/python/parallel-processing-python/. However, I don’t know if this will create problems with, for instance, kedro run --parallel.
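As a very rough sketch of combining per-page results into a document-level output (process_page is a hypothetical per-page function, and this assumes each page’s result is a pandas DataFrame):

from multiprocessing import Pool
from typing import Any, Callable, Dict

import pandas as pd


def combine_pages(partitioned_pages: Dict[str, Callable[[], Any]]) -> pd.DataFrame:
    """Sketch: process every page in parallel, then concatenate into one document-level frame."""
    partition_ids, loaders = zip(*sorted(partitioned_pages.items()))
    pages = [load() for load in loaders]
    with Pool() as pool:
        # process_page is hypothetical and must be defined at module level for pickling
        page_frames = pool.map(process_page, pages)
    return pd.concat(page_frames, keys=partition_ids)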


Hi @Shivalika,

Just out of curiosity, what package do you use to read the PDFs?

Hi @Shivalika ,

Are you also reading PDF documents? If yes, may I also ask what package you use?

Hi @ljam

I use the following Python libraries: pdfrw and PyPDF2.
There’s also pdf-annotate which can be used for annotating PDFs
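For what it’s worth, with PyPDF2’s classic PdfFileReader API, extracting page text looks roughly like this (“document.pdf” is just a placeholder path):

from PyPDF2 import PdfFileReader  # classic API; PyPDF2 3.x renamed this to PdfReader

with open("document.pdf", "rb") as f:  # placeholder path
    reader = PdfFileReader(f)
    for page_number in range(reader.getNumPages()):
        text = reader.getPage(page_number).extractText()
        print(page_number, len(text))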

If you’re interested in a paid option, you can try out PDFTron