Other tools for creating a complete MLOps solution

In order to get the best help, it is suggested to answer the following questions:

What is the goal you are trying to achieve?

Hello!!! I am new to Kedro, but I have been reading for hours today and now I am a believer!! I'm very excited about all the good things it offers.

I am a very junior data scientist with a decent knowledge of ML concepts and algorithms, some grasp of Python, but nothing of MLOps and DevOps. I work for a big company that is starting its ML journey, and for the moment everything we do is manual and poorly standardized.

So I would be very grateful to learn about other tools you would recommend, at least in your experience, for setting up good MLOps practices. For example, from what I have read:

  • Prefect/Luigi/Airflow for orchestration. Any recommendation?
  • MLflow for experiment tracking and model registry. By the way, is the plugin between Kedro and MLflow recommended?
  • Feast for a feature store
  • Great Expectations for tests on data products
  • Dask/PySpark for parallel execution. Any recommendation?

Do you think these libraries can be enough (apart from pandas, sklearn and the other obvious choices…) for setting up the standards we seek? Of course I didn't mention Kedro, which will sit in the middle of everything as the framework for our project and code.

I'm sorry if these are too many questions… But I have been reading about MLOps 8-10 hours a day in blogs, books and library comparisons, and this is the first time I see the light at the end of the tunnel… Thanks in advance!!


Hi @Jaime_Arboleda_Casti ,

I have also been reviewing MLOps tools.

Here is yet another comparison of Python pipeline/workflow management packages: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX:

For workflow management, I would recommend Kubeflow if your team is familiar with Kubernetes. Otherwise, perhaps Airflow or Prefect would be good.

Here is yet another comparison of ML Life Cycle Management (Experiment Tracking, Model Management, etc.): MLflow, DVC, Pachyderm, Sacred, Polyaxon, Allegro Trains, VertaAI ModelDB, Kubeflow Katib, Guild AI, Kubeflow Metadata, Weights & Biases, Neptune.ai, Valohai, Comet:

Among these, I think MLflow would be the best free tool for teams (multiple users). If your company is willing to pay, you can consider other tools as well.

Regarding the integration of MLflow with Kedro, PipelineX provides more features than kedro-mlflow, although kedro-mlflow is much better documented. (Disclaimer: I am the author of PipelineX; better documentation for PipelineX is in the backlog.) I would suggest you consider PipelineX if you are not satisfied with kedro-mlflow.


Thanks a lot, Minyus. I have read both comparisons and they gave me a lot of insights and information!

By the way, PipelineX seems like a very nice library; I'll have to check it out too!


Hi Jaime! Welcome to the Kedro family!

Let me see if I can help provide insight on the tools that you’re asking about. A lot of this depends on what skills you have in house though.

  • Orchestration - Either Prefect or Airflow; Luigi does not seem to be actively developed when you look at its contributor history. You can find out how to use Kedro with Prefect or Airflow in our documentation. Prefect seems to have a lower learning curve :woman_shrugging:t5: but we like both (see the first sketch after this list).
  • Experiment tracking and model registry - Kedro and MLflow seem to be the most common pattern we've seen, so much so that we might double down on increased support for it later in the year :wink: However, we do like the Kedro-MLflow plugin quite a bit, the developers are really cool, and we also support basic integration using Hooks in our documentation (see the second sketch below).
  • Feature store - Feast; I can't say anything about this, but @limdauto is giving a talk on using it with Kedro soon. Sign up!
  • Data validation - We use Great Expectations and the team behind it is lovely! You can check out our basic integration using Hooks in our documentation (see the third sketch below).
  • Parallel execution - Our most commonly used pattern for parallel execution is PySpark, purely because it is easier to configure. If your team has the time to learn how to configure Dask, then try it out. We do support a single Dask dataset in the catalog but never expanded the range when we realised it wasn't being used.
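To give a feel for the orchestration option, here is a minimal sketch of a Prefect flow, assuming the Prefect 1.x API that was current at the time of writing; the task names and data are made up for illustration:

```python
# A minimal Prefect 1.x flow; task names and data are illustrative only.
from prefect import task, Flow


@task
def extract():
    return [1, 2, 3]


@task
def transform(data):
    return [x * 2 for x in data]


# Calling tasks inside the Flow context builds the DAG without running it.
with Flow("example-etl") as flow:
    transform(extract())

flow.run()  # runs locally; in production an agent/scheduler would trigger it
```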
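For the Hooks-based MLflow integration, a rough sketch could look like the following; the class name and what gets logged are assumptions for illustration, not the documented example verbatim, and the hook would be registered in your project settings (e.g. `HOOKS = (MLflowTrackingHooks(),)`):

```python
# A sketch of MLflow experiment tracking via Kedro Hooks.
# What gets logged here is illustrative; adapt it to your project.
import mlflow
from kedro.framework.hooks import hook_impl


class MLflowTrackingHooks:
    @hook_impl
    def before_pipeline_run(self, run_params):
        mlflow.start_run()
        mlflow.log_param("pipeline", run_params.get("pipeline_name") or "__default__")

    @hook_impl
    def after_node_run(self, node, outputs):
        # Log any numeric node outputs as metrics.
        for name, value in outputs.items():
            if isinstance(value, (int, float)):
                mlflow.log_metric(f"{node.name}.{name}", value)

    @hook_impl
    def after_pipeline_run(self):
        mlflow.end_run()
```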
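And a similar sketch for the Great Expectations integration via Hooks; the single not-null expectation and the pandas-only check are placeholders, since a real project would run a curated expectation suite per dataset:

```python
# A sketch of input validation with Great Expectations in a Kedro Hook.
import great_expectations as ge
import pandas as pd
from kedro.framework.hooks import hook_impl


class DataValidationHooks:
    @hook_impl
    def before_node_run(self, node, inputs):
        for dataset_name, data in inputs.items():
            if isinstance(data, pd.DataFrame):
                # Placeholder expectation: first column must not contain nulls.
                result = ge.from_pandas(data).expect_column_values_to_be_not_null(
                    column=data.columns[0]
                )
                if not result.success:
                    raise ValueError(
                        f"Validation failed for '{dataset_name}' before node '{node.name}'"
                    )
```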

Hello Yetunde!!! Thanks a lot for your kind and warm welcome! :slight_smile: I’m very happy to be here.

It’s very good news to know that the Kedro team is planning such interesting extensions.

The reason I asked about Dask is that (at least from what I have read) it is easier to use. But maybe there are some configuration issues that can be time-consuming.

I have signed up for this meeting. I don't know if I will be able to attend in the end, but thanks a lot for mentioning it!!

We're glad to have you here. You might want to check out this issue raised by a team at Saturn that is going to create a Dask guide. Please comment on it so that they cover your use cases.


Hi @Jaime_Arboleda_Casti ,

Documentation regarding MLflow-on-Kedro by PipelineX is ready!

https://pipelinex.readthedocs.io/en/latest/section07.html

The differences (advantages and disadvantages) with the kedro-mlflow package are also explained.

Hope it helps.

Thanks a lot, Minyus! I have read the docs of your project and they are very well explained.

Hi again, @Minyus! I keep wondering which framework/infrastructure I should use for my project. I have a question regarding PipelineX. As it integrates several tools in a very nice way, do you think it is already an end-to-end MLOps solution? We are a very small team, with little experience, no data engineers (only data scientists) and no support for buying cloud solutions or an all-in-one product. So we will need to stick to open-source solutions that help us manage our models and pipelines. If you don't mind sharing here: apart from PipelineX, what other tools do you use?

As we don't have resources (neither personnel nor commercial software), we need the solution to be as lightweight and easy to maintain as possible. We cannot afford an extra burden in our daily work. To give you an example, maybe we could get to Docker, but Kubernetes is beyond our capabilities.

We belong to an organization that has other resources we can/must use. For example:

  • We have SVN (so Git is not an option for us).
  • We have our own workflow orchestrators (Control-M and DataStage).
  • We have Cloudera nodes, but with the bare-minimum license, so we can use Spark, Impala and other tools for distributing processes if needed.

I am very worried about these issues, and I have two contradictory concerns:

  • Not using enough products and arriving, in a couple of years, at a state where we cannot manage our own code, projects and models.
  • Using too many tools and imposing a huge burden on our small team.

So any help will be REALLY appreciated.

I am pretty happy with Kedro, so we think Kedro will be the scaffolding for our projects. But I am not sure whether, with Kedro (and some plugins, or an extension like PipelineX), we will be pretty much covered. We will at least need a way to make our models deployable for real-time predictions, and for that I have been reading about Streamlit and FastAPI. But apart from that, and given what we have at our organization, what would you do?

Hi @Jaime_Arboleda_Casti ,

Kedro and MLflow (with PipelineX) would be good for prototyping and batch jobs, but you would need frameworks for live API services in most cases.

I would recommend adopting the OpenAPI Specification to standardize designing, testing, and documentation.
FastAPI supports the OpenAPI Specification and would be a good tool.
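To make that concrete, here is a minimal sketch of serving a scikit-learn model with FastAPI; the model file and feature layout are hypothetical. FastAPI derives the OpenAPI schema from the type hints and serves interactive docs at /docs automatically:

```python
# A minimal model-serving sketch with FastAPI. The model file and
# feature layout are hypothetical placeholders.
import pickle
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Model API")

with open("model.pkl", "rb") as f:  # hypothetical pickled sklearn model
    model = pickle.load(f)


class Features(BaseModel):
    values: List[float]


@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}
```

You can run it with uvicorn (`uvicorn main:app`), and the same app containerizes easily with a small Dockerfile, in line with the start-small-with-Docker suggestion below.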

In the beginning, it would be good to start small with Docker (or perhaps Docker Compose) on a single server machine.

As your organization grows, you might want to think more about high availability and sustainability, using frameworks/tools like Kubernetes, Prometheus, and CI/CD / GitOps (not sure, but perhaps “SVNOps” in your case).

Good luck!


Thank you very much, Minyus!!
