MLflow 3 for GenAI

MLflow 3 for GenAI is an open platform that unifies tracking, evaluation, and observability for GenAI apps and agents throughout the development and production lifecycle. It includes real-time trace logging, built-in and custom scorers, incorporation of human feedback, and version tracking, so you can efficiently evaluate and improve app quality during development and keep tracking and improving quality in production.

Managed MLflow on Databricks extends open source MLflow with capabilities designed for production GenAI applications, including enterprise-ready governance, fully managed hosting, production-level scaling, and integration with your data in the Databricks lakehouse and Unity Catalog.

For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide. For MLflow 3, the Agent Evaluation SDK methods have been integrated with Databricks-managed MLflow.

For a set of tutorials to get you started, see Get started.

How MLflow 3 helps optimize GenAI app quality

Evaluating GenAI applications and agents is more complex than evaluating traditional software. Inputs and outputs are often free-form text, and many different outputs can be considered correct. Quality depends not only on correctness but also on factors like precision, length, completeness, appropriateness, and other criteria specific to the use case. Because LLMs are inherently non-deterministic, and GenAI agents include additional components such as retrievers and tools, their responses can vary from run to run.

Developers need concrete quality metrics, automated evaluation, and continuous monitoring to build and deploy robust AI apps. MLflow 3 for GenAI provides these capabilities to support efficient development, deployment, and continuous improvement.

Using MLflow 3 on Databricks, you can bring AI to your data to deeply understand and improve quality. Unity Catalog provides consistent governance for prompts, apps, and traces, and MLflow works with any model or framework, supporting you throughout the development loop and into production.

Get started

Start building better GenAI applications with comprehensive observability and evaluation tools.

Quick start guide: Get up and running in minutes with step-by-step instructions for instrumenting your first application with tracing, running evaluation, and collecting human feedback.
Get started: MLflow Tracing for GenAI (Databricks Notebook): Instrument a simple GenAI app to automatically capture detailed traces for debugging and optimization.
Tutorial: Evaluate and improve a GenAI application: Steps you through evaluating an email generation app that uses Retrieval-Augmented Generation (RAG).
10-minute demo: Collect human feedback: Collect end-user feedback, add developer annotations, create expert review sessions, and use that feedback to evaluate your GenAI app's quality.
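The feedback-collection workflow in the last item can also be driven programmatically. Below is a minimal sketch, assuming the MLflow 3 assessment API (`mlflow.log_feedback` with an `AssessmentSource`); the trace ID, feedback name, and user ID are placeholders.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Placeholder: identifies a trace already logged by your instrumented app.
trace_id = "<your-trace-id>"

# Attach a human assessment to the trace, e.g. a thumbs-up from an end user.
mlflow.log_feedback(
    trace_id=trace_id,
    name="user_satisfaction",
    value=True,
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="user-123",
    ),
    rationale="Response answered the question directly.",
)
```

Feedback logged this way is attached to the trace, so it can later be reviewed in the UI or pulled into an evaluation dataset.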

Tracing

MLflow Tracing provides observability and logs the trace data required for evaluation and monitoring.

MLflow Tracing: End-to-end observability for GenAI applications, including complex agent-based systems. Track inputs, outputs, intermediate steps, and metadata for a complete picture of how your app behaves.
What is tracing? Introduction to tracing concepts.
Review your app's behavior and performance: Capture prompts, retrievals, tool calls, responses, latency, and costs for complete visibility into each execution.
Production observability: Use the same instrumentation in development and production environments for consistent evaluation.
Build evaluation datasets: Analyze traces to identify quality issues, select representative traces, create evaluation datasets, and systematically improve your application.
Tracing integrations: MLflow Tracing integrates with many libraries and frameworks, so automatic tracing gives you immediate observability into your GenAI applications with minimal setup.
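As an illustration of how these pieces fit together, the following sketch enables automatic tracing for OpenAI calls and wraps application logic with the `@mlflow.trace` decorator so both appear in the same trace. It is a minimal example, not a prescribed setup; the experiment name, model, and prompt are placeholders, and the same pattern applies to the other supported integrations.

```python
import mlflow
import openai

mlflow.set_experiment("genai-quickstart")  # placeholder experiment name

# Automatically capture every OpenAI call (inputs, outputs, latency) as spans.
mlflow.openai.autolog()

client = openai.OpenAI()

# Manually trace your own application logic so it shows up in the same trace tree.
@mlflow.trace
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer_question("What does MLflow Tracing capture?")
```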

Evaluation and monitoring

Replace manual testing with automated evaluation using built-in and custom LLM judges and scorers that match human expertise and can be applied in both development and production. Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.

Evaluate and monitor GenAI agents: Overview of evaluating and monitoring agents using MLflow 3 on Databricks.
LLM judges and scorers: MLflow 3 includes built-in LLM judges for safety, relevance, correctness, retrieval quality, and more. You can also create custom LLM judges and code-based scorers for your specific business requirements.
Evaluation: Run evaluation during development or as part of a release process.
Production monitoring: Continuously monitor a sample of production traffic using LLM judges and scorers.
Collect human feedback: Collect and use feedback from domain experts and end users, both during development and in production, for continuous improvement.
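To make the evaluation workflow concrete, here is a minimal sketch using `mlflow.genai.evaluate` with a few built-in scorers. The dataset and the app under test are placeholder examples; the exact scorer classes available depend on your MLflow 3 version, and on Databricks the LLM judges run against managed judge models.

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# A tiny, hand-written evaluation dataset; in practice this typically comes
# from traces or a curated evaluation dataset.
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "expectations": {
            "expected_facts": ["MLflow Tracing provides observability for GenAI apps."]
        },
    },
]

# Placeholder app under test; replace with your real agent or app entry point.
def my_app(question: str) -> str:
    return "MLflow Tracing provides observability for GenAI apps."

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Correctness(), RelevanceToQuery(), Safety()],
)
```

The same scorers can be attached to production monitoring so that a sample of live traffic is judged with the criteria used during development.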

Manage the GenAI app lifecycle

Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.

Application versioning: Track code, parameters, and evaluation metrics for each version.
Prompt Registry: Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration (see the sketch after this list).
Enterprise integration: Unity Catalog provides unified governance for all AI assets with enterprise security, access control, and compliance features. Data intelligence connects your GenAI data to your business data in the Databricks lakehouse to deliver custom analytics to your business stakeholders. Mosaic AI Agent Serving deploys agents to production with scaling and operational rigor.
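As a companion to the Prompt Registry item above, the following is a minimal sketch assuming the MLflow 3 prompt registry API (`mlflow.genai.register_prompt` and `mlflow.genai.load_prompt`); the prompt name, Unity Catalog path, and template are illustrative.

```python
import mlflow

# Register a new prompt version. On Databricks, the name is a Unity Catalog
# path (catalog.schema.prompt_name); "main.genai.support_prompt" is an example.
prompt = mlflow.genai.register_prompt(
    name="main.genai.support_prompt",
    template="Answer the user's question concisely.\n\nQuestion: {{question}}",
    commit_message="Initial version",
)

# Load a specific version later (for example, from serving code) and fill in
# the template variables.
loaded = mlflow.genai.load_prompt(f"prompts:/main.genai.support_prompt/{prompt.version}")
print(loaded.format(question="How do I enable tracing?"))
```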