Hi @mirfan,
I am not sure I fully understand what you are trying to do. If I understood correctly, you want to test different combinations of features (e.g. petal_length + sepal_length in one run and just petal_length in another) as well as different hyperparameters (learning rates, etc.). In that case, I don't think there is a need to run a single data science pipeline several times. I solved a similar issue when training a LightGBM model by using a for loop over the feature sets together with sklearn's GridSearchCV. I put the feature sets and the hyperparameters I wanted to try inside the parameters YAML file, like so:
target: "SumQuantity"
feature_sets:
  [
    ["Weekday", "Week", "Year"],
    ["Weekday", "Week"]
  ]
model_parameters:
  boosting_type: ["gbdt", "goss", "dart"]
  objective: ["regression"]
  learning_rate: [0.01, 0.05, 0.1]
  num_iterations: [100]
  num_leaves: [10, 20, 30]
  feature_fraction: [0.9]
  verbose: [-1]
Then I created the following training node inside my DS pipeline. Note that model_parameters above is already a dict of lists, which is exactly the param-grid format GridSearchCV expects:
import logging
from typing import Dict, Tuple

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV


def train_model(
    X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray, parameters: Dict
) -> Tuple[lgb.LGBMRegressor, pd.DataFrame]:
    """Train the regression model.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for the target (SumQuantity).
        X_test: Test data of independent features (not used during the grid search).
        y_test: Test data for the target (not used during the grid search).
        parameters: Parameters loaded from the parameters YAML file.

    Returns:
        The best trained model and a dataframe of cross-validation results.
    """
    model_parameters = parameters["model_parameters"]
    estimator = lgb.LGBMRegressor()
    # GridSearchCV takes the dict of hyperparameter lists directly as its param grid
    gbm = GridSearchCV(estimator, model_parameters, cv=6)
    gbm.fit(X_train, y_train)
    logger = logging.getLogger(__name__)
    logger.info("Best: %f using %s", gbm.best_score_, gbm.best_params_)
    cv_results = pd.DataFrame.from_dict(gbm.cv_results_)
    # Return the refitted best estimator so the output matches the annotation
    return gbm.best_estimator_, cv_results
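For completeness, since this lives in a Kedro pipeline, the node would be registered roughly like this. This is just a sketch: the dataset names (X_train, regressor, cv_results, etc.) are placeholders for whatever your catalog defines, while "parameters" is Kedro's built-in entry that exposes the parameters YAML file:

from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=train_model,
                # "parameters" is Kedro's built-in dataset for parameters.yml;
                # the other names are placeholders for your own catalog entries
                inputs=["X_train", "y_train", "X_test", "y_test", "parameters"],
                outputs=["regressor", "cv_results"],
                name="train_model_node",
            ),
        ]
    )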
I haven't included it in the code snippet, but you would also need to loop through your feature sets, subsetting your data to each set before training; see the sketch below.
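For example, the loop could look something like this. train_df, test_df and results are hypothetical names used only for illustration; adapt them to however your pipeline splits the data:

# Hypothetical sketch: train_df/test_df are assumed pandas DataFrames that
# contain the feature columns plus the target column from parameters.yml
results = {}
for feature_set in parameters["feature_sets"]:
    X_train = train_df[feature_set]
    y_train = train_df[parameters["target"]]
    X_test = test_df[feature_set]
    y_test = test_df[parameters["target"]]
    model, cv_results = train_model(X_train, y_train, X_test, y_test, parameters)
    # Keep the results per feature set so you can compare them afterwards
    results[tuple(feature_set)] = (model, cv_results)

Hope this helps!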