Develop applications with LlamaIndex and Microsoft Foundry

Note

This article includes steps for both the Microsoft Foundry (classic) and Microsoft Foundry (new) portals. Where the portals differ, steps for each experience are shown.

In this article, you learn how to use LlamaIndex with models deployed from the Foundry model catalog in Microsoft Foundry portal.

Models deployed to Microsoft Foundry can be used with LlamaIndex in two ways:

  • Using the model's provider-specific API: Some models, like OpenAI, Cohere, or Mistral, offer their own set of APIs and extensions for LlamaIndex. Those extensions might include specific functionality that the model supports, so they're a good fit if you want to take advantage of it. When working with LlamaIndex, install the extension specific to the model you want to use, like llama-index-llms-openai or llama-index-llms-cohere (see the sketch after this list).

  • Using the Azure AI Model Inference API: All models deployed to Foundry support the Model Inference API, which offers a common set of capabilities that can be used with most of the models in the catalog. The benefit of this API is that, because it's the same for all models, changing from one to another is as simple as changing the model deployment being used; no further changes are required in the code. When working with LlamaIndex, install the extensions llama-index-llms-azure-inference and llama-index-embeddings-azure-inference.
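
As a minimal sketch of the first approach, the following example uses the llama-index-llms-openai extension; the model name and API key are placeholders, and the rest of this article uses the Model Inference API instead.

# Illustrative only: the provider-specific path requires `pip install llama-index-llms-openai`.
# The model name and API key below are placeholders.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", api_key="<your-openai-api-key>")
print(llm.complete("Hello"))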

Important

If you're currently using an Azure AI Inference beta SDK with Microsoft Foundry Models or Azure OpenAI service, we strongly recommend that you transition to the generally available OpenAI/v1 API, which uses an OpenAI stable SDK.

For more information on how to migrate to the OpenAI/v1 API by using an SDK in your programming language of choice, see Migrate from Azure AI Inference SDK to OpenAI SDK.

In this example, we're working with the Model Inference API.

Prerequisites

To run this tutorial, you need:

  • An Azure account with an active subscription. If you don't have one, create a free Azure account, which includes a free trial subscription.

  • A Foundry project.

  • A model deployment that supports the Model Inference API. This example uses a Mistral-Large deployment, but you can use any model of your preference. To use embedding capabilities in LlamaIndex, you need an embedding model like cohere-embed-v3-multilingual.

  • Python 3.8 or later installed, including pip.

  • LlamaIndex installed. You can do it with:

    pip install llama-index
    
  • Because this example uses the Model Inference API, install the following packages:

    pip install -U llama-index-llms-azure-inference
    pip install -U llama-index-embeddings-azure-inference
    

    Important

    Using the Foundry Models service requires version 0.2.4 or later of llama-index-llms-azure-inference or llama-index-embeddings-azure-inference.

Configure the environment

To use LLMs deployed in Microsoft Foundry portal, you need the endpoint and credentials to connect to it. Follow these steps to get the information you need from the model you want to use:

Tip

Because you can customize the left pane in the Microsoft Foundry portal, you might see different items than shown in these steps. If you don't see what you're looking for, select ... More at the bottom of the left pane.

  1. Sign in to Microsoft Foundry. Make sure the New Foundry toggle is off. These steps refer to Foundry (classic).

  2. Open the project where the model is deployed, if it isn't already open.

  3. Go to Models + endpoints and select the model you deployed as indicated in the prerequisites.

  4. Copy the endpoint URL and the key.

    Screenshot of the option to copy endpoint URI and keys from an endpoint.

    Tip

    If your model was deployed with Microsoft Entra ID support, you don't need a key.

In this scenario, set the endpoint URL and key as environment variables. (If the endpoint you copied includes additional text after /models, remove it so the URL ends at /models as shown below.)

export AZURE_INFERENCE_ENDPOINT="https://<resource>.services.ai.azure.com/models"
export AZURE_INFERENCE_CREDENTIAL="<your-key-goes-here>"

If you're using the Microsoft Foundry (new) portal instead, follow these steps:

  1. Sign in to Microsoft Foundry. Make sure the New Foundry toggle is on. These steps refer to Foundry (new).

  2. Open the project where the model is deployed, if it isn't already open.

  3. Copy the endpoint URL and the key.

As with the classic portal, set the endpoint URL and key as the AZURE_INFERENCE_ENDPOINT and AZURE_INFERENCE_CREDENTIAL environment variables shown earlier. If the endpoint you copied includes additional text after /models, remove it so the URL ends at /models.

Once configured, create a client to connect to the endpoint.

import os
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
)

Tip

If your model deployment is hosted in Azure OpenAI in Foundry Models or in a Foundry Tools resource, configure the client as indicated in Azure OpenAI models and Foundry Models service.

If your endpoint serves more than one model, as with the Foundry Models service or GitHub Models, you have to indicate the model_name parameter:

import os
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="mistral-large-2411",
)

Alternatively, if your endpoint supports Microsoft Entra ID, you can use the following code to create the client:

import os
from azure.identity import DefaultAzureCredential
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

Note

When using Microsoft Entra ID, make sure that the endpoint was deployed with that authentication method and that you have the required permissions to invoke it.

If you're planning to use asynchronous calling, it's a best practice to use the asynchronous version for the credentials:

import os

from azure.identity.aio import (
    DefaultAzureCredential as DefaultAzureCredentialAsync,
)
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredentialAsync(),
)
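
For example, a minimal sketch of an asynchronous call might look like the following; it assumes the achat method exposed by LlamaIndex LLM clients and a signed-in Azure identity that DefaultAzureCredential can pick up:

import asyncio
import os

from azure.identity.aio import DefaultAzureCredential as DefaultAzureCredentialAsync
from llama_index.core.llms import ChatMessage
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredentialAsync(),
)

async def main() -> None:
    # achat is the asynchronous counterpart of chat in LlamaIndex's LLM interface.
    messages = [ChatMessage(role="user", content="Hello")]
    response = await llm.achat(messages)
    print(response)

asyncio.run(main())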

Azure OpenAI models and Foundry Models service

If you're using Azure OpenAI or the Foundry Models service, ensure you have at least version 0.2.4 of the LlamaIndex integration. Use the api_version parameter if you need to select a specific API version.

For the Foundry Models service, you need to pass the model_name parameter:

import os

from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="mistral-large-2411",
)

For Azure OpenAI in Foundry Models:

import os

from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint="https://<resource>.openai.azure.com/openai/deployments/<deployment-name>",
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    api_version="2024-05-01-preview",
)

Tip

Check which API version your deployment is using. Using a wrong api_version, or one not supported by the model, results in a ResourceNotFound exception.
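
As a minimal sketch of how you might surface that error, assuming it's raised as azure.core.exceptions.ResourceNotFoundError:

import os

from azure.core.exceptions import ResourceNotFoundError
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint="https://<resource>.openai.azure.com/openai/deployments/<deployment-name>",
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    api_version="2024-05-01-preview",
)

try:
    print(llm.complete("Hello"))
except ResourceNotFoundError as ex:
    # Assumption: a wrong or unsupported api_version surfaces as a 404 from the service.
    print(f"Check the api_version for this deployment: {ex.message}")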

Inference parameters

You can configure how inference is performed for all the operations that use this client by setting extra parameters. This avoids having to indicate them on each call you make to the model.

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    temperature=0.0,
    model_kwargs={"top_p": 1.0},
)

For parameters that aren't supported in the Model Inference API (see the reference) but are available in the underlying model, you can use the model_extras argument. In the following example, the safe_prompt parameter, which is only available for Mistral models, is passed.

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    temperature=0.0,
    model_kwargs={"model_extras": {"safe_prompt": True}},
)

Use LLM models

You can use the client directly, or you can configure the models used by your code in LlamaIndex, as described in Configure the models used by your code. To use the model directly, use the chat method for chat-instruction models:

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with colorful personality."
    ),
    ChatMessage(role="user", content="Hello"),
]

response = llm.chat(messages)
print(response)

You can also stream the outputs:

response = llm.stream_chat(messages)
for r in response:
    print(r.delta, end="")

The complete method is still available for models of type chat-completions. In those cases, your input text is converted to a message with role="user".
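
For example, using the client created earlier:

# The prompt is converted to a message with role="user" before being sent to the model.
response = llm.complete("Write a one-line greeting from a pirate.")
print(response)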

Use embeddings models

In the same way you create an LLM client, you can connect to an embeddings model. In the following example, we set the environment variables to point to an embeddings model:

export AZURE_INFERENCE_ENDPOINT="<your-model-endpoint-goes-here>"
export AZURE_INFERENCE_CREDENTIAL="<your-key-goes-here>"

Then create the client:

from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

embed_model = AzureAIEmbeddingsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model="<your-model-name>",
)

The following example shows a simple test to verify it works:

from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text="Before college the two main things I worked on, "
        "outside of school, were writing and programming."
    )
]
response = embed_model(nodes=nodes)
print(response[0].embedding)
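
You can also embed single strings directly; the following sketch uses the get_text_embedding and get_query_embedding helpers that LlamaIndex embedding models expose:

# Embed a single passage and a single query string.
text_embedding = embed_model.get_text_embedding("LlamaIndex works with models deployed to Foundry.")
query_embedding = embed_model.get_query_embedding("Which frameworks work with Foundry?")
print(len(text_embedding), len(query_embedding))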

Configure the models used by your code

You can use the LLM or embeddings model client individually in the code you develop with LlamaIndex, or you can configure the entire session using the Settings options. Configuring the session has the advantage that all your code uses the same models for all operations.

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

However, there are scenarios where you want to use a general model for most operations but a specific one for a given task. In those cases, it's useful to set the LLM or embedding model for each LlamaIndex construct. In the following example, we set a specific model:

from llama_index.core.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(llm=llm)

In general, you use a combination of both strategies.
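
For example, a sketch of that combination might look like the following; the documents list is a placeholder created only for illustration:

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.evaluation import RelevancyEvaluator

# Global defaults: used by any LlamaIndex construct that doesn't override them.
Settings.llm = llm
Settings.embed_model = embed_model

# Placeholder document purely for illustration.
documents = [Document(text="Models deployed to Foundry can be used from LlamaIndex.")]
index = VectorStoreIndex.from_documents(documents)  # uses Settings.embed_model
query_engine = index.as_query_engine()              # uses Settings.llm

# Override the model for a specific construct only.
relevancy_evaluator = RelevancyEvaluator(llm=llm)

print(query_engine.query("Which frameworks can use models deployed to Foundry?"))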