MLflow 3 for GenAI is an open platform that unifies tracking, evaluation, and observability for GenAI apps and agents throughout the development and production lifecycle. It includes real-time trace logging, built-in and custom scorers, human feedback collection, and version tracking, so you can efficiently evaluate and improve app quality during development and keep tracking and improving quality in production.
Managed MLflow on Databricks extends open source MLflow with capabilities designed for production GenAI applications, including enterprise-ready governance, fully managed hosting, production-level scaling, and integration with your data in the Databricks lakehouse and Unity Catalog.
For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide. For MLflow 3, the Agent Evaluation SDK methods have been integrated with Databricks-managed MLflow.
For a set of tutorials to get you started, see Get started.
How MLflow 3 helps optimize GenAI app quality
Evaluating GenAI applications and agents is more complex than evaluating traditional software. Inputs and outputs are often free-form text, and many different outputs can be considered correct. Quality depends not only on correctness but also on factors like precision, length, completeness, appropriateness, and other criteria specific to the use case. Because LLMs are inherently non-deterministic, and GenAI agents include additional components such as retrievers and tools, their responses can vary from run to run.
Developers need concrete quality metrics, automated evaluation, and continuous monitoring to build and deploy robust AI apps. MLflow 3 for GenAI provides these key pieces for efficient development, deployment, and continuous improvement:
- Tracing automatically logs inputs, intermediate steps, and outputs and provides the data foundation for evaluation and monitoring.
- Built-in and custom LLM judges and scorers let you define various aspects of quality and customize metrics to your use case.
- Review apps for expert feedback let you collect and label datasets for evaluation and align automated judges and scorers with expert judgment.
- Automated evaluation and monitoring leverage the same judges and scorers during development and production.
- App and prompt versioning allow you to compare versions and track improvements over iterations.
Using MLflow 3 on Databricks, you bring AI to your data to deeply understand and improve quality. Unity Catalog provides consistent governance for prompts, apps, and traces, and MLflow supports any model or framework throughout the development loop and into production.
Get started
Start building better GenAI applications with comprehensive observability and evaluation tools.
| Task | Description |
|---|---|
| Quick start guide | Get up and running in minutes with step-by-step instructions for instrumenting your first application with tracing, running evaluation, and collecting human feedback. |
| Get started: MLflow Tracing for GenAI (Databricks Notebook) | Instrument a simple GenAI app to automatically capture detailed traces for debugging and optimization. |
| Tutorial: Evaluate and improve a GenAI application | Steps you through evaluating an email generation app that uses Retrieval-Augmented Generation (RAG). |
| 10-minute demo: Collect human feedback | Collect end-user feedback, add developer annotations, create expert review sessions, and use that feedback to evaluate your GenAI app's quality. |
Tracing
MLflow Tracing provides observability and logs the trace data required for evaluation and monitoring. A minimal instrumentation sketch follows the table below.
| Feature | Description |
|---|---|
| MLflow Tracing | End-to-end observability for GenAI applications, including complex agent-based systems. Track inputs, outputs, intermediate steps, and metadata for a complete picture of how your app behaves. |
| What is tracing? | Introduction to tracing concepts. |
| Review your app's behavior and performance | Complete execution visibility allows you to capture prompts, retrievals, tool calls, responses, latency, and costs. |
| Production observability | Use the same instrumentation in development and production environments for consistent evaluation. |
| Build evaluation datasets | Analyze traces to identify quality issues, select representative traces, create evaluation datasets, and systematically improve your application. |
| Tracing integrations | MLflow Tracing integrates with many libraries and frameworks for automatic tracing, giving you immediate observability into your GenAI applications with minimal setup. |
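To make this concrete, here is a minimal tracing sketch using the `@mlflow.trace` decorator. The experiment path and the toy retriever are hypothetical placeholders; in a real app you would decorate your actual functions or rely on an automatic tracing integration for your LLM client library.

```python
import mlflow

# Hypothetical experiment path; replace with a location in your workspace.
mlflow.set_experiment("/Shared/genai-tracing-demo")

# @mlflow.trace captures inputs, outputs, exceptions, and latency for each
# decorated function, and nests spans when traced functions call each other.
@mlflow.trace
def retrieve_docs(query: str) -> list[str]:
    # Toy retriever standing in for a vector-search lookup.
    return [f"Document about {query}"]

@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve_docs(query)
    # A real app would call an LLM here; an autologging integration such as
    # mlflow.openai.autolog() would capture that call as a child span.
    return f"Answer synthesized from {len(docs)} retrieved document(s)."

answer("What is MLflow Tracing?")
```

Each call to `answer` produces one trace with a nested span for the retrieval step, which you can inspect in the MLflow UI and later use to build evaluation datasets.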
Evaluation and monitoring
Replace manual testing with automated evaluation using built-in and custom LLM judges and scorers that match human expertise and can be applied in both development and production. Every production interaction becomes an opportunity to improve through integrated feedback and evaluation workflows. A short evaluation sketch using built-in and custom scorers follows the table below.
| Feature | Description |
|---|---|
| Evaluate and monitor GenAI agents | Overview of evaluating and monitoring agents using MLflow 3 on Databricks. |
| LLM judges and scorers | MLflow 3 includes built-in LLM judges for safety, relevance, correctness, retrieval quality, and more. You can also create custom LLM judges and code-based scorers for your specific business requirements. |
| Evaluation | Run evaluation during development or as part of a release process. |
| Production monitoring | Continuously monitor a sample of production traffic using LLM judges and scorers. |
| Collect human feedback | Collect and use feedback from domain experts and end users during development and during production for continuous improvement. |
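To show how these pieces fit together, here is a short evaluation sketch, assuming the MLflow 3 `mlflow.genai` evaluation API: a tiny inline dataset, a placeholder app, one built-in LLM judge (`Safety`), and one custom code-based scorer defined with the `@scorer` decorator. The app and dataset are hypothetical, and built-in judges need access to a judge LLM, which Databricks provides by default.

```python
import mlflow
from mlflow.genai.scorers import Safety, scorer

# Tiny inline dataset; in practice, build evaluation datasets from traces.
eval_data = [
    {"inputs": {"question": "What is MLflow Tracing?"}},
    {"inputs": {"question": "How do scorers work?"}},
]

# Placeholder app under test; substitute your real predict function.
# Keys in each "inputs" dict are passed to predict_fn as keyword arguments.
def my_app(question: str) -> str:
    return f"A placeholder answer to: {question}"

# Custom code-based scorer: flags answers that run too long.
@scorer
def is_concise(outputs) -> bool:
    return len(str(outputs)) <= 500

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Safety(), is_concise],
)
```

Because the same scorers can be attached to production monitoring, the quality bar you define during development carries over unchanged to live traffic.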
Manage the GenAI app lifecycle
Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.
| Feature | Description |
|---|---|
| Application versioning | Track code, parameters, and evaluation metrics for each version. |
| Prompt Registry | Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration. A registration sketch follows this table. |
| Enterprise integration | **Unity Catalog:** Unified governance for all AI assets with enterprise security, access control, and compliance features. **Data intelligence:** Connect your GenAI data to your business data in the Databricks lakehouse and deliver custom analytics to your business stakeholders. **Mosaic AI Agent Serving:** Deploy agents to production with scaling and operational rigor. |
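As a sketch of the Prompt Registry workflow, assuming MLflow 3's `mlflow.genai` prompt APIs (the `main.default.summarizer` name is a hypothetical Unity Catalog location):

```python
import mlflow

# Register a new prompt version; on Databricks, prompts live in Unity
# Catalog under a three-level <catalog>.<schema>.<name> identifier.
prompt = mlflow.genai.register_prompt(
    name="main.default.summarizer",
    template="Summarize the following text in {{num_sentences}} sentences: {{text}}",
    commit_message="Initial version",
)

# Load a pinned version from app code or an evaluation job, then fill in
# the template variables at call time.
loaded = mlflow.genai.load_prompt(f"prompts:/main.default.summarizer/{prompt.version}")
print(loaded.format(num_sentences=2, text="MLflow 3 unifies tracking and evaluation."))
```

Pinning a prompt version in code keeps each app version reproducible and lets you compare prompt iterations side by side during evaluation.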