Solution ideas
This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation of this architecture. Use this article as a starting point to design a well-architected solution that aligns with your workload's specific requirements.
This article provides a machine learning operations (MLOps) architecture and process that uses Azure Databricks. Data scientists and engineers can use this standardized process to move machine learning models and pipelines from development to production.
This solution takes advantage of full automation, continuous monitoring, and robust collaboration, and therefore targets level 4 of the MLOps maturity model. This architecture uses the promote-code approach: you write, test, and promote the code that generates machine learning models, rather than promoting the models themselves. The recommendations in this article include options for automated or manual processes.
Architecture
Download a Visio file of this architecture.
Workflow
The following workflow corresponds to the previous diagram. Use source control and storage components to manage and organize code and data.
Source control: This project's code repository organizes the notebooks, modules, and pipelines. You can create development branches to test updates and new models. Develop code in Git-supported notebooks or integrated development environments (IDEs) that integrate with Git folders so that you can sync with your Azure Databricks workspaces. Source control promotes machine learning pipelines from the development environment, to testing in the staging environment, and to deployment in the production environment.
Lakehouse production data: As a data scientist, you have read-only access to production data in the development environment. The development environment can have mirrored data and redacted confidential data. You also have read and write access in a dev storage environment for development and experimentation. We recommend that you use a lakehouse architecture for data in which you store Delta Lake-format data in Azure Data Lake Storage. A lakehouse provides a robust, scalable, and flexible solution for data management. To define access controls, use table access controls.
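For example, a minimal sketch of a table access control grant might look like the following, where the catalog, schema, table, and group names are all hypothetical. It gives data scientists read-only access to a production table.

```python
# A minimal sketch, run from a Databricks notebook.
# The catalog, schema, table, and group names are hypothetical examples.
spark.sql("""
    GRANT SELECT
    ON TABLE prod.features.customer_features
    TO `data-scientists`
""")
```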
The main workflow consists of the following environments.
Development
In the development environment, you develop machine learning pipelines.
Do exploratory data analysis (EDA). Explore data in an interactive, iterative process. You might not deploy this work to staging or production. Use tools like Databricks SQL, the dbutils.data.summarize command, and Databricks AutoML.
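For example, a quick EDA pass in a notebook might look like the following sketch. The table name and target column are hypothetical, and the AutoML call assumes a classification problem.

```python
# Exploratory data analysis sketch in a Databricks notebook.
# The table and column names are hypothetical examples.
df = spark.read.table("dev.lakehouse.customer_transactions")

# Profile the data: summary statistics and distributions per column.
dbutils.data.summarize(df)

# Optionally use AutoML to establish a baseline model for the problem.
from databricks import automl

summary = automl.classify(dataset=df, target_col="churned", timeout_minutes=30)
```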
Develop model training and other machine learning pipelines. Develop machine learning pipelines as modular code, and orchestrate the code via Databricks notebooks or an MLflow project. In this architecture, the model training pipeline reads data from the feature store and other lakehouse tables. The pipeline trains and tunes models and logs model parameters and metrics to the MLflow tracking server. The feature store API logs the final model. These logs include the model, its inputs, and the training code.
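A condensed sketch of such a training pipeline follows. It assumes the databricks-feature-store client, and the table, column, and lookup-key names are hypothetical.

```python
import mlflow
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from sklearn.ensemble import RandomForestClassifier

fs = FeatureStoreClient()

# Join raw labels with features from the feature store.
# Table, column, and key names are hypothetical examples.
feature_lookups = [
    FeatureLookup(
        table_name="dev.features.customer_features",
        lookup_key="customer_id",
    )
]
labels_df = spark.read.table("dev.lakehouse.labels")
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="churned",
    exclude_columns=["customer_id"],
)
train_pdf = training_set.load_df().toPandas()

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(train_pdf.drop(columns=["churned"]), train_pdf["churned"])
    mlflow.log_param("n_estimators", 100)

    # fs.log_model packages the model together with its feature lineage.
    fs.log_model(
        model,
        artifact_path="model",
        flavor=mlflow.sklearn,
        training_set=training_set,
    )
```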
Commit code. To promote the machine learning workflow toward production, commit the code for featurization, training, and other pipelines to source control. In the code base, place machine learning code and operational code in different folders so that team members can develop code at the same time. Machine learning code is code that's related to the model and data. Operational code is code that's related to Databricks jobs and infrastructure.
This core cycle of activities that you do when you write and test code is referred to as the inner loop process. To carry out the inner loop process for the development phase, use Visual Studio Code (VS Code) in combination with the dev container command-line interface (CLI) and the Databricks CLI. You can write the code and do unit testing locally. You can also submit, monitor, and analyze the model pipelines from the local development environment.
Staging
In the staging environment, continuous integration (CI) infrastructure tests changes to machine learning pipelines in an environment that mimics production.
Submit a merge request. When you submit a merge request or pull request against the staging (or main) branch of the project in source control, a continuous integration and continuous delivery (CI/CD) tool like Azure DevOps runs tests.
Run unit tests and CI tests. Unit tests run in CI infrastructure, and integration tests run in end-to-end workflows on Azure Databricks. If tests pass, the code changes merge.
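For example, a unit test that runs in CI infrastructure might look like the following pytest sketch. The module name and featurization function are hypothetical.

```python
# test_featurization.py - a minimal pytest sketch; names are hypothetical.
import pandas as pd

from my_project.featurization import add_rolling_average  # hypothetical module


def test_add_rolling_average_handles_empty_input():
    empty = pd.DataFrame(columns=["customer_id", "amount"])
    result = add_rolling_average(empty, window=7)
    assert result.empty


def test_add_rolling_average_adds_expected_column():
    df = pd.DataFrame({"customer_id": [1, 1], "amount": [10.0, 20.0]})
    result = add_rolling_average(df, window=7)
    assert "amount_rolling_avg" in result.columns
```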
Build a release branch. When you want to deploy the updated machine learning pipelines to production, you can build a new release. A deployment pipeline in the CI/CD tool redeploys the updated pipelines as new workflows.
Production
Machine learning engineers manage the production environment, where machine learning pipelines directly serve end applications. The key pipelines in production refresh feature tables, train and deploy new models, run inference or serving, and monitor model performance.
Feature table refresh: This pipeline reads data, computes features, and writes to feature store tables. You can set up this pipeline to run continuously in streaming mode, on a schedule, or on a trigger.
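A minimal sketch of such a refresh job follows. The table names are hypothetical, and the featurization function is a hypothetical placeholder for your own logic.

```python
# Scheduled feature-refresh sketch; table names are hypothetical examples.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

raw_df = spark.read.table("prod.lakehouse.transactions")
features_df = compute_customer_features(raw_df)  # hypothetical featurization function

# Upsert the recomputed features into the feature table.
fs.write_table(
    name="prod.features.customer_features",
    df=features_df,
    mode="merge",
)
```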
Model training: In production, you can set up the model training or retraining pipeline to either run on a trigger or a schedule to train a fresh model on the latest production data. Models automatically register to Unity Catalog.
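A minimal sketch of registering a trained model in Unity Catalog follows. The model training is a stand-in, and the three-level model name is hypothetical.

```python
import mlflow
from sklearn.linear_model import LogisticRegression

# Register models in Unity Catalog rather than the workspace model registry.
mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run() as run:
    # Stand-in for real training; replace with your pipeline's training step.
    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
    mlflow.sklearn.log_model(model, artifact_path="model", input_example=[[0.0]])

# Three-level Unity Catalog name (catalog.schema.model); hypothetical example.
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="prod.ml_models.churn_classifier",
)
```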
Model evaluation and promotion: When a new model version is registered, the CD pipeline starts, which runs tests to ensure that the model performs well in production. When the model passes tests, Unity Catalog tracks its progress via model stage transitions. Tests include compliance checks, A/B tests to compare the new model with the current production model, and infrastructure tests. Lakehouse tables record test results and metrics. You can optionally require manual sign-offs before models transition to production.
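One way to implement the promotion gate is sketched below, using MLflow model aliases in Unity Catalog. The model name, version, and validation metric are hypothetical.

```python
from mlflow import MlflowClient

client = MlflowClient(registry_uri="databricks-uc")
model_name = "prod.ml_models.churn_classifier"  # hypothetical name

# Compare the candidate against the current champion; the metric key is illustrative.
candidate = client.get_model_version(model_name, version="2")
champion = client.get_model_version_by_alias(model_name, alias="champion")

candidate_auc = client.get_run(candidate.run_id).data.metrics["val_auc"]
champion_auc = client.get_run(champion.run_id).data.metrics["val_auc"]

# Promote only if the candidate wins; otherwise leave the champion in place.
if candidate_auc > champion_auc:
    client.set_registered_model_alias(model_name, "champion", candidate.version)
```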
Model deployment: When a model enters production, it's deployed for scoring or serving. The most common deployment modes include the following options:
Batch or streaming scoring: For latencies of minutes or longer, batch and streaming are the most cost-effective options. The scoring pipeline reads the latest data from the feature store, loads the latest production model version from Unity Catalog, and performs inference in a Databricks job. It can publish predictions to lakehouse tables, a Java Database Connectivity (JDBC) connection, flat files, message queues, or other downstream systems. A sketch of this mode appears after these options.
Online serving (REST APIs): For low-latency use cases, you generally need online serving. MLflow can deploy models to Mosaic AI Model Serving, cloud provider serving systems, and other systems. In all cases, the serving system initializes with the latest production model from Unity Catalog. For each request, it fetches features from an online feature store and makes predictions. A sketch of this mode also appears after these options.
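The following sketch illustrates the batch scoring mode. It loads the production model as a Spark UDF and writes predictions to a lakehouse table; the model name, alias, and table names are hypothetical.

```python
import mlflow

mlflow.set_registry_uri("databricks-uc")

# Load the current production model as a Spark UDF for distributed scoring.
# The model name, alias, and table names are hypothetical examples.
predict = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/prod.ml_models.churn_classifier@champion"
)

input_df = spark.read.table("prod.features.customer_features")
scored_df = input_df.withColumn("prediction", predict(*input_df.columns))

# Publish predictions to a lakehouse table for downstream consumers.
scored_df.write.mode("overwrite").saveAsTable("prod.lakehouse.churn_predictions")
```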
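The next sketch illustrates the online serving mode by querying a serving endpoint's REST API. The workspace URL, endpoint name, and input record are hypothetical.

```python
import os
import requests

# The workspace URL, endpoint name, and input record are hypothetical examples.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
endpoint_name = "churn-classifier"
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token or Entra ID token

response = requests.post(
    f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"customer_id": 42}]},
)
response.raise_for_status()
print(response.json())
```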
Monitoring: Continuous or periodic workflows monitor input data and model predictions for drift, performance, and other metrics. You can use the Lakeflow Declarative Pipelines framework to automate monitoring for pipelines and store the metrics in lakehouse tables. Databricks SQL, Power BI, and other tools can read from those tables to create dashboards and alerts. To monitor application metrics, logs, and infrastructure, you can also integrate Azure Monitor with Azure Databricks.
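As an illustration, a scheduled job might compute a simple drift statistic, such as the population stability index (PSI), and append it to a lakehouse metrics table. The tables and column names in this sketch are hypothetical.

```python
from datetime import datetime, timezone

import numpy as np


def population_stability_index(expected, actual, bins=10):
    """Compute a simple PSI between a baseline sample and a recent sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the percentages to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Table and column names are hypothetical examples.
baseline = spark.read.table("prod.monitoring.baseline_scores").toPandas()["score"]
recent = spark.read.table("prod.lakehouse.churn_predictions").toPandas()["prediction"]

psi = population_stability_index(baseline.to_numpy(), recent.to_numpy())
metrics_df = spark.createDataFrame(
    [(datetime.now(timezone.utc), psi)], "ts timestamp, psi double"
)
metrics_df.write.mode("append").saveAsTable("prod.monitoring.drift_metrics")
```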
Drift detection and model retraining: This architecture supports both manual and automatic retraining. Schedule retraining jobs to keep models fresh. When drift that's detected in the monitoring step crosses a preconfigured threshold, the retraining pipelines analyze the drift and trigger retraining. You can set up pipelines to start automatically, or you can receive a notification and then run the pipelines manually.
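Continuing the previous monitoring sketch, a minimal automatic trigger might compare the latest drift metric against a threshold and start a retraining job via the Databricks SDK. The threshold value and job ID are hypothetical.

```python
from databricks.sdk import WorkspaceClient

PSI_THRESHOLD = 0.2  # hypothetical threshold; tune for your workload

latest = (
    spark.read.table("prod.monitoring.drift_metrics")
    .orderBy("ts", ascending=False)
    .first()
)

if latest["psi"] > PSI_THRESHOLD:
    # Start the retraining job; the job ID is a hypothetical placeholder.
    WorkspaceClient().jobs.run_now(job_id=123456789)
```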
Components
A data lakehouse architecture unifies the elements of data lakes and data warehouses. This architecture uses a lakehouse to get data management and performance capabilities that you typically find in data warehouses but with the low-cost, flexible object stores that data lakes provide.
We recommend Delta Lake as the open-source data format for a lakehouse. In this architecture, Delta Lake stores all machine learning data in Data Lake Storage and provides a high-performance query engine.
MLflow is an open-source project for managing the end-to-end machine learning life cycle. In this architecture, MLflow tracks experiments, manages model versions, and facilitates model deployment to various inference platforms. MLflow has the following components:
The tracking feature in MLflow is a system for logging and managing machine learning experiments. In this architecture, it records and organizes parameters, metrics, and model artifacts for each experiment run. This capability enables you to compare results, reproduce experiments, and audit model development.
Databricks autologging is an automation feature that extends MLflow automatic logging to track machine learning experiments by capturing model parameters, metrics, files, and lineage information. In this architecture, Databricks autologging ensures consistent experiment tracking and reproducibility by automatically recording these details.
An MLflow model is a standardized packaging format. In this architecture, MLflow models support model storage and deployment across different serving and inference platforms.
Unity Catalog is a data governance solution that provides centralized access control, auditing, lineage, and data-discovery capabilities across Azure Databricks workspaces. In this architecture, it governs access, maintains lineage, and structures models and data across workspaces.
Mosaic AI Model Serving is a service that hosts MLflow models as REST endpoints. In this architecture, it enables deployed machine learning models to serve predictions through APIs.
Azure Databricks is a managed platform for analytics and machine learning. In this architecture, Azure Databricks integrates with enterprise security, provides high availability, and connects MLflow and other machine learning components for end-to-end MLOps.
Databricks Runtime for Machine Learning is a preconfigured environment that automates the creation of clusters optimized for machine learning. In this architecture, it provides ready-to-use clusters that preinstall popular machine learning libraries like TensorFlow, PyTorch, and XGBoost, along with Azure Databricks machine learning tools like AutoML and feature store clients.
A feature store is a centralized repository of features. In this architecture, the feature store supports feature discovery and sharing, and helps prevent data skew between model training and inference.
Databricks SQL is a serverless data warehouse that integrates with different tools so that you can author queries and dashboards in your preferred environments without adjusting to a new platform. In this architecture, Databricks SQL lets you query data and create dashboards for analysis and reporting.
Git folders are integrated workspace directories. In this architecture, Git folders connect Azure Databricks workspaces to your Git provider. This integration improves notebook or code collaboration and IDE integration.
Workflows and jobs provide a way to run non-interactive code in an Azure Databricks cluster. In this architecture, workflows and jobs provide automation for data preparation, featurization, training, inference, and monitoring.
Alternatives
You can tailor this solution to your Azure infrastructure. Consider the following customizations:
Use multiple development workspaces that share a common production workspace.
Exchange one or more architecture components for your existing infrastructure. For example, you can use Azure Data Factory to orchestrate Databricks jobs.
Integrate with your existing CI/CD tooling via Git and Azure Databricks REST APIs.
Use Microsoft Fabric as an alternative service for machine learning capabilities. Fabric provides integrated workloads for data engineering (lakehouses with Apache Spark), data warehousing, and OneLake for unified storage.
Scenario details
This solution provides a robust MLOps process that uses Azure Databricks. You can replace all elements in the architecture, so you can integrate other Azure services and partner services as needed. This architecture and description are adapted from the e-book The Big Book of MLOps: Second Edition. The e-book explores this architecture in more detail.
MLOps helps reduce the risk of failures in machine learning and AI systems and improve the efficiency of collaboration and tooling. For an introduction to MLOps and an overview of this architecture, see Architect MLOps on the lakehouse.
Use this architecture to take the following actions:
Connect your business stakeholders with machine learning and data science teams. Use this architecture to incorporate notebooks and IDEs for development. Business stakeholders can view metrics and dashboards in Databricks SQL, all within the same lakehouse architecture.
Focus your machine learning infrastructure around data. This architecture treats machine learning data just like other data. Machine learning data includes data from feature engineering, training, inference, and monitoring. This architecture reuses tooling for production pipelines, dashboarding, and other general data processing for machine learning data processing.
Implement MLOps in modules and pipelines. As with any software application, use the modularized pipelines and code in this architecture to test individual components and decrease the cost of future refactoring.
Automate your MLOps processes as needed. In this architecture, you can automate steps to improve productivity and reduce the risk of human error, but you don't need to automate each step. Azure Databricks permits user interface (UI) and manual processes, as well as APIs for automation.
Potential use cases
This architecture applies to all types of machine learning, deep learning, and advanced analytics. This architecture uses the following common machine learning and AI techniques:
Classical machine learning, like linear models, tree-based models, and boosting
Modern deep learning, like TensorFlow and PyTorch
Custom analytics, like statistics, Bayesian methods, and graph analytics
The architecture supports both small data on a single machine and large data by using distributed computing and graphics processing unit (GPU)-accelerated resources. At each stage of the architecture, you can choose compute resources and libraries to adapt to your scenario's data size and problem dimensions.
The architecture applies to all types of industries and business use cases. Azure Databricks customers that use this architecture include small and large organizations in the following industries:
- Consumer goods and retail services
- Financial services
- Healthcare and life sciences
- Information technology
For more information, see Databricks customers.
Foundation model fine-tuning in MLOps workflows
As more organizations use large language models for specialized tasks, they must add foundation model fine-tuning to the MLOps process. You can use Azure Databricks to fine-tune foundation models with your data. This capability supports custom applications and a mature MLOps process. In the context of the MLOps architecture in this article, fine-tuning aligns with several best practices:
Modularized pipelines and code: Fine-tuning tasks can be encapsulated as modular components within the training pipeline. This structure enables isolated evaluation and simplifies refactoring.
Experiment (fine-tuning run) tracking: MLflow integration logs each fine-tuning run with specific parameters like the number of epochs and learning rate, and with metrics like loss and cross-entropy. This process improves reproducibility, auditability, and the ability to measure improvements.
Model registry and deployment: Fine-tuned models are automatically registered in Unity Catalog. This automation supports deployment and governance.
Automation and CI/CD: Fine-tuning jobs can be initiated via Databricks workflows or CI/CD pipelines. This process supports continuous learning and model refresh cycles.
This approach lets teams maintain high MLOps maturity while using the flexibility and power of foundation models. For more information, see Foundation model fine-tuning.
Contributors
Microsoft maintains this article. The following contributors wrote this article.
Principal authors:
- Brandon Cowen | Senior Cloud Solution Architect
- Prabal Deb | Principal Software Engineer
Other contributors:
- Rodrigo Rodríguez | Senior Cloud Solution Architect, AI & Quantum
To see nonpublic LinkedIn profiles, sign in to LinkedIn.
Next steps
- AI and machine learning on Databricks
- Databricks machine learning product page and resources
- Train AI and machine learning models on Azure Databricks