A Gentle Introduction to MLOps - Yashaswi Nayak - Medium

https://medium.com/@yashaswi_nayak/a-gentle-introduction-to-mlops-7d64a3e890ff • Oct 3, 2021 18:18

Extracto

In this article, we learn what is MLOps, or Machine Learning Operations. I will try to simplify the vast and intriguing world of ML and its associated infrastructure. This article is for people who…

Resumen

Resumen Principal

El artículo "A Gentle Introduction to MLOps" de Yashaswi Nayak ofrece una introducción accesible al concepto de MLOps (Machine Learning Operations), disciplina que busca integrar prácticas de DevOps en el ciclo de vida del machine learning. El autor desglosa de forma clara cómo los modelos de machine learning, una vez desarrollados, requieren de un enfoque sistemático para su implementación, monitoreo y mantenimiento en entornos productivos. Se destaca la importancia de la colaboración entre equipos de ciencia de datos, ingeniería y operaciones, para superar los desafíos comunes como la degradación del modelo, la reproducibilidad y la escalabilidad. El contenido está dirigido a profesionales que buscan comprender cómo estructurar procesos eficientes para la gestión de modelos de machine learning, sin necesidad de conocimientos avanzados en infraestructura técnica.

Elementos Clave

Definición de MLOps: Es una disciplina que combina machine learning con prácticas de DevOps para automatizar y estandarizar el despliegue y mantenimiento de modelos. Su objetivo es acelerar el ciclo de vida de los modelos y garantizar su rendimiento en producción.
Ciclo de vida del modelo: El artículo describe las etapas clave, desde la recopilación de datos y entrenamiento hasta el despliegue, monitoreo y reentrenamiento. Cada fase requiere herramientas y procesos específicos para evitar fallos operativos.
Importancia del monitoreo continuo: Una vez en producción, los modelos pueden degradarse debido a cambios en los datos (data drift) o en el comportamiento del entorno (concept drift). Por ello, es crucial implementar sistemas de alerta y retroalimentación.
Colaboración interdisciplinaria: El éxito de MLOps depende de la coordinación entre científicos de datos, ingenieros de machine learning y equipos de operaciones, quienes deben alinear sus flujos de trabajo para garantizar eficiencia y calidad en la entrega de soluciones.

Análisis e Implicaciones

La adopción de MLOps representa un paso fundamental para las organizaciones que buscan escalar sus iniciativas de machine learning más allá del laboratorio. Al establecer procesos estandarizados, se reduce el tiempo de salida al mercado y se mejora la confiabilidad de los modelos en producción. Esto tiene un impacto directo en la toma de decisiones basada en datos y en la generación de valor empresarial.

Contexto Adicional

El enfoque de este artículo es ideal para profesionales técnicos que están comenzando a explorar cómo integrar modelos de machine learning en sistemas reales. Además, resalta cómo MLOps no solo es una cuestión técnica, sino también de cultura organizacional y gestión de procesos.

Contenido

Hello There!

You want to learn about MLOps, you have come to the right place.

In this article, we learn what is MLOps, or Machine Learning Operations. I will try to simplify the vast and intriguing world of ML and its associated infrastructure. This article is for people who want to understand how an ML model is deployed to production, the stages, the process, and the tears it involves. 😅

Let us begin!

What is MLOps?

Machine Learning Operations involves a set of processes or rather a sequence of steps implemented to deploy an ML model to the production environment. There are several steps to be undertaken before an ML Model is production-ready. These processes ensure that your model can be scaled for a large user base and perform accurately. 💯

Why do we need MLOps?

Creating an ML model that can predict what you want it to predict from the data you have fed is easy. However, creating an ML model that is reliable, fast, accurate, and can be used by a large number of users is difficult.

The necessity of MLOps can be summarized as follows:

ML models rely on a huge amount of data, difficult for a single person to keep track of.
Difficult to keep track of parameters we tweak in ML models. Small changes can lead to enormous differences in results.
We have to keep track of the features the model works with, feature engineering is a separate task that contributes largely to model accuracy.
Monitoring an ML model isn’t like monitoring a deployed software or web app.
Debugging ML model is an extremely complicated art
Models rely on real-world data for predicting, as real-world data changes, so should the model. This means we have to keep track of new data changes and make sure the model learns accordingly.

Remember the old excuse Software Engineers used 😅 We want to avoid it.

DevOps vs MLOps

You must have heard of good old DevOps. The process to build and deploy software applications. You might wonder how MLOps is different.

The DevOps stages are targeted for developing a software application. You plan the features of the application you want to release, write code, build the code, test it, create a release plan and deploy it. You monitor the infrastructure where the app is deployed. And this cycle continues until the app is fully built.

In MLOps, things are different. We implement the following stages

Scoping — We define the project, check if the problem requires Machine Learning to solve it. Perform Requirement Engineering, check if the relevant data is available. Verify if the data is non-biased and reflects the real-world use case.

Data Engineering — This stage involves collecting data, establishing baselines, cleaning the data, formatting the data, labeling, and organizing the data.

Modeling — Now we come to the coding part, here we create the ML model. We train the model with the processed data. Perform error analysis, define error measurement, and track the model performance.

Deployment — Here we package the model, deploy it in the cloud or on edge devices as necessary. Packaging could be model wrapped with an API server, exposing REST or gRPC endpoints, A Docker container, Deployed on serverless cloud infrastructure, or a Mobile app for edge-based models.

Monitoring — Once the Deployment is done, we rely on a Monitoring Infrastructure to help us maintain and update the model. This stage has the following components:

Monitor the infrastructure where we deploy — for load, usage, storage, and health. This tells us about the environment where the ML model is deployed.
Monitor the model for its performance, accuracy, loss, bias, and data drift. This informs us if the model is performing as expected, valid for real-world scenarios or not.

There will sometimes be a feedback loop as some models might require learning from the user inputs and predictions it makes. This lifecycle is valid for most of the ML use-cases.

Equipped with the knowledge of basic lifecycle of an ML project, let’s take a look at how the Infrastructure scene is on the ML side.

ML Production Infrastructure

Now we learn what infrastructure setup we would need for a model to be deployed in production. You can see in the above picture, ML Code is only a small part of it. Let us understand the components one by one.

Data Collection — This step involves collecting data from various sources. ML models require a lot of data to learn. Data Collection involves consolidating all kinds of raw data related to the problem. i.e Image Classification might require you to collect all available images or scrape the web for images. Voice Recognition may require you to collect tons of audio samples.

Data Verification — In this step we check the validity of the data, if the collected data is up to date, reliable, and reflects the real world, is it in a proper consumable format, is the data structured properly.

Feature Extraction — Here, we select the best features for the model to predict. In other words, your model may not require all the data in its entirety for discovering patterns, some columns or parts of data might be not used at all. Some models perform well when a few columns are dropped. We usually rank the features with importance, features with high importance are included, lower ones or near zero ones are dropped.

Configuration — This step involves setting up the protocols for communications, system integrations, and how various components in the pipeline are supposed to talk to each other. You want your Data Pipeline to be connected to the database, you want your ML model to connect to DB with proper access, your model to expose prediction endpoints in a certain way, your model inputs to be formatted in a certain way. All the necessary configurations required for the system need to be properly finalized and documented.

ML Code — Now we, come to the actual coding part. In this stage, we develop a base model, which can learn from the data and predict. There are tons of ML libraries out there with multiple language support. Ex: Tensorflow, Pytorch, Scikit-Learn, Keras, Fast-ai and many more. Once we have a model, we starting improving its performance by tweaking the hyper-parameters, testing different learning approaches until we are satisfied that the model is performing relatively better than it’s previous version.

Machine Resource Management — This step involves the planning of the resources for the ML model. Usually, ML models require heavy resources in terms of CPU, memory, and storage. Deep Learning models are dependent on GPUs and TPUs for computation. Training ML models involves cost in terms of time and money. Slower CPUs involve more time, Powerful CPUs are pricier. The larger the model, bigger the storage you will have to invest in.

Analysis Tool — Once your model is ready, how do you know if the model is performing up to mark. We decide on model analysis in this stage. How do we compute loss, what error measurement do we use, how do we check if the model is drifting, is the prediction result proper, has the model been overfitted or underfit? Usually, the libraries with which we implement the model; ship with analysis kits and error measurements.

Project Management Tool — Tracking an ML project is very important. It’s easy to get lost and mess up while dealing with huge data, features, ML code, resource management. Luckily there are a lot of Project Management Tools out on the Internet to help us out.

Serving Infrastructure — Once the model is developed, tested, and ready to go, we need to deploy it somewhere the users can access it. The majority of the models are deployed on Cloud. Public Cloud providers like AWS, GCP, and Azure even have specific ML-related features for easy deployment of models. Depending on the budget you can select the provider suited for your needs.

If we are dealing with an edge-based model, we need to decide on how the ML model can be used, it could be a mobile application for use cases like image recognition, voice recognition. We could also have a custom chip and processor for certain use-case like Autonomous Driving as in the case of Tesla. Here we have to take into account how much computing capability is available and how large our model size is.

Monitoring — We need to implement a monitoring system to observe our deployed model and the system on which it runs. Collecting model logs, user access logs, and prediction logs will help in maintaining the model. There are several monitoring solutions like Greylog, ElasticStack, and Fluentd available. Cloud providers usually ship their own monitoring systems.

A Walk-through

Now that we have understood how the ML Project Lifecycle works, how the infrastructure scene is in an ML production. We will learn how the ML model is deployed in Production.

To understand this we will look at Jen and her quest for ML Engine.

Jen has a huge pumpkin patch, every year she sells the pumpkins to townsfolk and local pumpkin spice latte factory. Since she has a huge demand every year, it became tedious for her to look at every pumpkin and check if it is good or bad.
So she approaches you to help her develop an ML Engine which will help her predict if a given pumpkin is good or bad.

Jen needs help classifying her pumpkins.

Off the bat, this is a simple classification problem.

Let’s discuss our approach to solve this problem.

First, we collect all the info about the pumpkins in Jen’s patch. All the photos of good pumpkins, ok pumpkins, and bad pumpkins. We then ask the good townsfolk to send in pictures of the pumpkins they had bought from Jen. (Some of them send pictures of watermelon! 😑 Shame on them 😒 ) This step is EDA + Compiling Dataset.
Now that we have a good collection of pictures. It’s time to label these, with Jen’s help we label several hundred pictures. We check the resolution of the pictures, set a standard resolution, discard the low-quality images, format the image contrast and brightness for better readability. This step is Data Preparation
Now we train a Tensorflow model to classify the images. Say we use a Sequential Neural Net with ReLU activation. We define 1 Input Layer, 2 Hidden Layers, and 1 Output layer, just a basic Convolution Neural Net. We split our dataset into training and testing sets, provide the training data as input to ConvoNet. The model gets trained. This step is Model Training.
Once the model is trained, we evaluate the model by using the testing dataset. Based on the prediction, we compare the result and check for the accuracy of the prediction. This step is Model Evaluation.
We tweak the hypermeters of the model, to increase the accuracy, retrain the model and evaluate it again. This iteration is done until we are satisfied that the model is good enough for Jen. This step is Model Analysis.
Now we have our working model, we deploy it so that Jen can use this ML Engine for her daily work. We create a server in cloud with prediction APIs, create an app or a website where she can upload the images and get result in real time. This is Deployment.

We have done all the work manually, from Data Preparation to Deployment. Congratulations! This process of MLOps is called Level-0. We have achieved our deployment, but all things are done manually. You can refer to the diagram below.

Jen is happy. 😄

Now the demand has begun to rise. You cannot train the model every day manually. So you create an automation pipeline, to validate data, prep it, and train the model. You also try to fetch the best available model, by comparing multiple error metrics. The pipeline takes care of it all. This process is Level-1 of MLOps. Here the training and analysis of the model are taken care of automatically. You just have to check if proper data is available and make sure there isn’t a skewed dataset so that the model is trained properly.

Most companies achieve this level of MLOps. This is also achievable by an individual data scientist or ML engineer. This is good enough when you test the model in your dev environment.

Let’s now ask ourselves a few questions.

Is your model able to replicate the result with different varieties of pumpkins?
When new data is added to the dataset, is your model able to retrain?
Can your model be used by hundreds of thousands of people at once? Scaled well?
How do you keep track of models when you deploy them across a large region or even across the globe?

This leads us to Level-2.

It’s time to look at the bigger picture…

Let’s break down the process.

Everything we do above the red line, in the flowchart — is Level-1.
This entire Orchestrated Experiment is now part of the Automated ML Pipeline.
We introduce a Feature Store, which pulls data from various sources and transforms the data to features required by the model. The ML Pipeline uses the data from the Store in batches.
The ML Pipeline is connected to a Metadata Store. Think of it as bookkeeping, since you don’t train the model manually — this store has the records of each stage in the pipeline. Once a stage is completed, the next stage looks up the record list, finds the previous stage records, picks up from there.
The models are then stored in a Model Registry. We have a bunch of models with various accuracy stored here. Based on the requirement, the appropriate model is then sent to a CI/CD pipeline which deploys it as a Prediction Service. Authorized users are able to access the prediction service when desired.
This system is monitored for performance. Say you have a new bunch of genetically modified pumpkins. Your model isn’t aware of this. This is a new dataset with a high possibility of being wrongly classified. This drop in performance would set a trigger, that would lead to the retraining of the model on the new data.
This cycle continues.

We retrain the model, when there is a drop in performance or there is new data available. It’s good to keep the model up to date, so that the real world changes are reflected in the model. Ex: Your model shouldn’t be recommending cassette tapes, when the world has moved onto digital streaming.

Google Cassette tapes if you don’t know what it is 😆

Where to go from here?

Phew! That was quite an adventure. Give yourself a pat on back 😄

We covered what is MLOps? Why would you use it? What would the production infrastructure setup be? And, once you have the infrastructure, how would you implement it — the process.

You can start by creating simple models and automating the steps. Remember, it is an iterative process, will take time to get it right.

Do check out some of the MLOps Tools like MLFlow, Seldon Core, Metaflow, and Kubeflow Pipelines.

Thank you for reading! Stay Safe! Adios! 👋 😄

Morpheus is right!