CI/CD on Azure Databricks

Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development, and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams deliver releases more reliably than with manual processes.

Databricks provides tools for developing CI/CD pipelines. Because each organization's software development lifecycle is unique, CI/CD approaches vary slightly from organization to organization, and the tooling supports that range of approaches. This page describes the tools available for CI/CD pipelines on Databricks. For CI/CD recommendations and best practices, see Best practices and recommended CI/CD workflows on Databricks.

For an overview of CI/CD for machine learning projects on Azure Databricks, see How does Databricks support CI/CD for machine learning?.

High-level flow

A common flow for an Azure Databricks CI/CD pipeline is:

  1. Version: Store your Azure Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members.
  2. Code: Develop code and unit tests in an Azure Databricks notebook in the workspace or locally using an IDE.
  3. Build: Use Databricks Asset Bundles artifact settings to automatically build artifacts, such as Python wheel files, during deployment.
  4. Deploy: Deploy changes to the Azure Databricks workspace using Databricks Asset Bundles with tools like Azure DevOps, GitHub Actions, or Jenkins (see the example workflow after this list).
  5. Test: Develop and run automated tests to validate your code changes.
    • Use tools like pytest to test your integrations.
  6. Run: Use the Databricks CLI with Databricks Asset Bundles to automate runs in your Azure Databricks workspaces.
  7. Monitor: Monitor the performance of your code and production workloads in Azure Databricks using tools such as jobs monitoring. This helps you identify and resolve any issues that arise in your production environment.
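
For example, steps 4 through 6 are often automated with an external CI/CD tool. The following GitHub Actions workflow is a minimal sketch rather than a complete pipeline: it assumes a bundle defined at the repository root with a target named prod and a job resource key named my_job, and the secret names are placeholders for your own service principal credentials.

```yaml
# Minimal sketch of a GitHub Actions workflow that validates, deploys, and runs
# a Databricks Asset Bundle when changes land on main. The target name (prod),
# job resource key (my_job), and secret names are placeholders.
name: Deploy bundle to production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Databricks client unified authentication variables for an OAuth
      # service principal, stored as repository secrets (placeholder names).
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
      DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
    steps:
      - uses: actions/checkout@v4
      # Install the Databricks CLI, which includes the bundle commands
      - uses: databricks/setup-cli@main
      # Validate the bundle configuration at the repository root
      - run: databricks bundle validate -t prod
      # Deploy the bundle's resources to the prod target workspace
      - run: databricks bundle deploy -t prod
      # Trigger a run of the deployed job (my_job is a placeholder resource key)
      - run: databricks bundle run -t prod my_job
```

Azure DevOps and Jenkins pipelines follow the same pattern: check out the repository, install the Databricks CLI, authenticate, and run bundle commands.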

Available tools

The following tools support CI/CD core principles: version all files and unify asset management, define infrastructure as code, isolate environments, automate testing, and monitor and automate rollbacks.

Use these tools when you want to:

  • Databricks Asset Bundles: Programmatically define, deploy, and run resources, including Lakeflow Jobs, Lakeflow Spark Declarative Pipelines, and MLOps Stacks, using CI/CD best practices and flows.
  • Databricks Terraform provider: Provision and manage Databricks workspaces and infrastructure using Terraform.
  • Continuous integration and delivery on Azure Databricks using Azure DevOps: Develop a CI/CD pipeline for Azure Databricks that uses Azure DevOps.
  • Authenticate with Azure DevOps on Azure Databricks: Authenticate with Azure DevOps.
  • GitHub Actions: Include a GitHub Action developed for Azure Databricks in your CI/CD flow.
  • CI/CD with Jenkins on Azure Databricks: Develop a CI/CD pipeline for Azure Databricks that uses Jenkins.
  • Orchestrate Lakeflow Jobs with Apache Airflow: Manage and schedule a data pipeline that uses Apache Airflow.
  • Service principals for CI/CD: Use service principals, instead of users, with CI/CD.
  • Authenticate access to Azure Databricks using OAuth token federation: Use workload identity federation for CI/CD authentication, which eliminates the need for Databricks secrets, making it the most secure way to authenticate to Databricks.
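
For example, a pipeline can authenticate as a service principal without storing a client secret by using OAuth token federation with the CI system's workload identity. The following GitHub Actions workflow is a minimal sketch; it assumes a federation policy is already configured for the service principal, that the installed Databricks CLI version supports the github-oidc authentication type, and that the workspace URL and variable names shown are placeholders for your own values.

```yaml
# Minimal sketch: deploy a bundle from GitHub Actions using workload identity
# (OIDC) federation instead of a stored client secret. Assumes a federation
# policy exists for the service principal and that the Databricks CLI version
# installed by setup-cli supports the github-oidc auth type.
name: Deploy with OAuth token federation

on:
  workflow_dispatch:

permissions:
  id-token: write   # allow the job to request a GitHub OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: https://adb-<workspace-id>.<n>.azuredatabricks.net  # placeholder workspace URL
      DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_SP_CLIENT_ID }}            # service principal application ID (placeholder variable)
      DATABRICKS_AUTH_TYPE: github-oidc                                    # exchange the GitHub OIDC token for a Databricks token
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t prod
```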

Databricks Asset Bundles

Databricks Asset Bundles are the recommended approach to CI/CD on Databricks. Use Databricks Asset Bundles to describe Databricks resources such as jobs and pipelines as source files, and bundle them together with other assets to provide an end-to-end definition of a deployable project. These bundles of files can be source controlled, and you can use external CI/CD automation such as GitHub Actions to trigger deployments.

Bundles include many features, such as custom templates for enforcing consistency and best practices across your organization, and comprehensive support for deploying the code files and configuration for many Databricks resources. Authoring a bundle requires some knowledge of bundle configuration syntax.
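
For illustration, a minimal bundle can be as small as the following databricks.yml sketch, which defines a single job with one notebook task and two deployment targets. The notebook path, workspace URLs, and resource names are placeholders, and settings a real project typically adds, such as compute, permissions, and a run_as service principal, are omitted.

```yaml
# databricks.yml: minimal sketch of a bundle with one job resource and two
# deployment targets. Paths, URLs, and names are placeholders.
bundle:
  name: my_project

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: run_notebook
          notebook_task:
            notebook_path: ./src/my_notebook.ipynb   # relative path to a notebook in the bundle

targets:
  dev:
    default: true
    mode: development   # prefixes resource names and pauses schedules for per-user deployments
    workspace:
      host: https://adb-<dev-workspace-id>.<n>.azuredatabricks.net
  prod:
    workspace:
      host: https://adb-<prod-workspace-id>.<n>.azuredatabricks.net
```

You can then validate and deploy the configuration with databricks bundle validate and databricks bundle deploy -t prod, either locally or from a CI/CD pipeline such as the GitHub Actions workflow shown earlier.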

For recommendations on how to use bundles in CI/CD, see Best practices and recommended CI/CD workflows on Databricks.

Other tools for source control

As an alternative to full CI/CD with Databricks Asset Bundles, Databricks offers options to source-control and deploy only code files and notebooks.

  • Git folder: Git folders can be used to mirror the state of a remote Git repository. You can create a Git folder for production that holds source-controlled files and notebooks, then either manually pull the Git folder to the latest state or use external CI/CD tools such as GitHub Actions to pull the Git folder on merge (see the example workflow after this list). Use this approach when you don't have access to external CI/CD pipelines.

    This approach works for external orchestrators such as Airflow, but note that only the code files, such as notebooks and dashboard drafts, are in source control. Configurations for jobs or pipelines that run assets in the Git folder and configurations for publishing dashboards are not in source control.

  • Git with jobs: Git with jobs enables you to configure some job types to use a remote Git repository as the source for code files. When a job run begins, Databricks takes a snapshot of the repository and runs all tasks against that version. This approach supports only a limited set of job task types, and only the code files (notebooks and other files) are source-controlled. Job configurations such as task sequences, compute settings, and schedules are not source-controlled, making this approach less suitable for multi-environment, cross-workspace deployments.
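
For example, an external workflow can pull a production Git folder whenever changes are merged to the main branch. The following GitHub Actions workflow is a minimal sketch that assumes a production Git folder already exists at a placeholder workspace path and that service principal credentials are stored as repository secrets; check the databricks repos update arguments against your Databricks CLI version, because some versions expect the Git folder's numeric ID rather than its path.

```yaml
# Minimal sketch: update a production Git folder to the latest commit on main
# after a merge. The Git folder path, secret names, and branch are placeholders,
# and the repos update argument may need to be the Git folder ID on some CLI versions.
name: Update production Git folder

on:
  push:
    branches: [main]

jobs:
  pull-git-folder:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
      DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
    steps:
      - uses: databricks/setup-cli@main
      # Check out the latest commit of the main branch in the production Git folder
      - run: databricks repos update /Repos/production/my-project --branch main
```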