Other tools for building a complete MLOps solution

To get the best help, it is suggested that you answer the following questions:

What is the goal you are trying to achieve?

Hello!!! I am new to Kedro, but I have spent hours reading today and now I am a believer!! I’m very excited about all the good things it offers.

I am a fairly junior data scientist with a decent knowledge of ML concepts and algorithms, some grasp of Python, but no experience with MLOps or DevOps. I work for a big company that is just starting its ML journey, and for the moment everything we do is manual and poorly standardized.

So I would be very grateful to learn about other tools you would recommend, at least in your experience, for setting up good MLOps practices. For example, from what I have read:

  • Prefect/Luigi/Airflow for orchestration. Any recommendations?
  • MLflow for experiment tracking and model registry. By the way, is the plugin between Kedro and MLflow recommended?
  • Feast for a feature store
  • Great Expectations for testing data products
  • Dask/PySpark for parallel execution. Any recommendations?

Do you think these libraries are enough (apart from pandas, scikit-learn, and the other obvious choices…) to set up the standards we seek? Of course, I didn’t mention Kedro, which will sit in the middle of everything as the framework for our project and code.

I’m sorry if these are too many questions… But I have been reading about MLOps 8–10 hours a day in blogs, books, and library comparisons, and this is the first time I see light at the end of the tunnel… Thanks in advance!!


Hi @Jaime_Arboleda_Casti,

I have also been reviewing MLOps tools.

Here is yet another comparison of Python pipeline/workflow management packages: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX.

For workflow management, I would recommend Kubeflow if your team is familiar with Kubernetes. Otherwise, perhaps Airflow or Prefect would be good.

Here is yet another comparison of ML life cycle management tools (experiment tracking, model management, etc.): MLflow, DVC, Pachyderm, Sacred, Polyaxon, Allegro Trains, VertaAI ModelDB, Kubeflow Katib, Guild AI, Kubeflow Metadata, Weights & Biases, Neptune.ai, Valohai, Comet.

Among these, I think MLflow would be the best free tool for teams (multiple users). If your company is willing to pay, you can consider other tools as well.

Regarding integrating MLflow with Kedro, PipelineX provides more features than kedro-mlflow, although kedro-mlflow is much better documented. (Disclaimer: I am the author of PipelineX. Better documentation for PipelineX is in the backlog.) I would suggest you consider PipelineX if you are not satisfied with kedro-mlflow.


Thanks a lot, Minyus. I have read both comparisons and they gave me a lot of insights and information!

By the way, PipelineX seems a very nice library, I’ll have to check it out too!

Hi Jaime! Welcome to the Kedro family!

Let me see if I can provide some insight on the tools you’re asking about. A lot of this depends on what skills you have in-house, though.

  • Orchestration: Either Prefect or Airflow; Luigi does not seem to be actively developed when you look at its contributor history. You can find out how to use Kedro with Prefect or Airflow in our documentation. Prefect seems to have a lower learning curve :woman_shrugging:t5: but we like both.
  • Experiment tracking and model registry: Kedro and MLflow seem to be the most common pattern we’ve seen, so much so that we might double down on increased support for it later in the year :wink: We like the Kedro-MLflow plugin quite a bit, the developers are really cool, and we also support basic integration using Hooks in our documentation (a minimal sketch follows this list).
  • Feature store: Feast. I can’t say anything about this yet, but @limdauto is giving a talk on using it with Kedro soon. Sign up!
  • Data validation: We use Great Expectations, and the team behind it is lovely! You can check out our basic integration using Hooks in our documentation (see the second sketch after this list).
  • Parallel execution: Our most commonly used pattern for parallel execution is PySpark, purely because it has a lower learning curve for configuration. If your team has the time to learn how to configure Dask, then try it out. We do support a single Dask dataset in the catalog but never expanded the range when we realised it wasn’t being used.
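
To make the Hooks pattern concrete, here is a minimal sketch of MLflow tracking from a Kedro Hook, in the spirit of the documented integration rather than a copy of it. The class name is illustrative (this is not the kedro-mlflow plugin); `before_pipeline_run`/`after_pipeline_run` are real Hook specs, and the `run_params` keys assume a recent Kedro version:

```python
# A minimal sketch, assuming Kedro >= 0.16 Hook specs and a reachable MLflow
# tracking store; the class name is illustrative, not the kedro-mlflow plugin.
import mlflow
from kedro.framework.hooks import hook_impl


class MLflowTrackingHooks:
    @hook_impl
    def before_pipeline_run(self, run_params):
        # Open one MLflow run per Kedro run and record any extra CLI parameters.
        mlflow.start_run(run_name=run_params["run_id"])
        mlflow.log_params(run_params.get("extra_params") or {})

    @hook_impl
    def after_pipeline_run(self, run_params):
        # Close the run so params and metrics are flushed to the tracking store.
        mlflow.end_run()
```

You would register this via your project settings (e.g. `HOOKS = (MLflowTrackingHooks(),)` in `settings.py` on recent Kedro versions) and could log metrics from `after_node_run` in the same way.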
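
And a similar sketch of the Great Expectations idea: validating every pandas input before a node runs. The expectation here is a trivial placeholder, and everything except the `before_node_run` Hook spec is illustrative; a real project would load a saved expectation suite instead:

```python
# A minimal sketch, assuming pandas inputs and an ad-hoc expectation;
# everything except the before_node_run Hook spec is illustrative.
import great_expectations as ge
import pandas as pd
from kedro.framework.hooks import hook_impl


class DataValidationHooks:
    @hook_impl
    def before_node_run(self, inputs):
        # Check each pandas input before the node consumes it.
        for name, data in inputs.items():
            if isinstance(data, pd.DataFrame):
                result = ge.from_pandas(data).expect_table_row_count_to_be_between(
                    min_value=1
                )
                if not result.success:
                    raise ValueError(f"Validation failed for dataset '{name}'")
```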