Create a "BYOM" endpoint to use for Edge RAG Preview enabled by Azure Arc

If you plan to bring your own language model (BYOM) instead of using one of the models included in Edge RAG, you must set up an OpenAI API-compatible endpoint for your Edge RAG deployment. Choose one of the methods in this article to create your endpoint.

By bringing your own model, you can enable advanced search types, like hybrid multimodal and deep search, that aren't available with the Edge RAG-provided models. For deep search, we recommend OpenAI GPT-4o, GPT-4.1-mini, or a later version.

After you create your endpoint, use the endpoint when you deploy the extension for Edge RAG and choose to add your own language model.

Important

Edge RAG Preview, enabled by Azure Arc, is currently in PREVIEW. See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.

Azure AI Foundry

To use your own model with Edge RAG, you can deploy a language model and create an endpoint by using Azure AI Foundry.

  1. Go to Azure AI Foundry and sign in with your Azure account.

  2. Create a new Azure AI Foundry resource or go to an existing resource.

  3. On the Azure AI Foundry resource, select Models + endpoints.

  4. Select Deploy model > Deploy base model.

  5. Choose a chat completion model from the list, such as gpt-4o.

  6. Select Confirm.

  7. Edit the following fields as appropriate for your scenario:

    • Deployment name: Choose a deployment name. The default is the name of the model you selected.
    • Deployment type: Select a deployment type. The default is Global Standard.
  8. Select Deploy to selected resource.

  9. Wait for the deployment to complete and for the State to show Succeeded.

  10. Get the endpoint and API key by selecting the deployed model. The endpoint looks like the following URL; a sample request that you can use to confirm the deployment responds is shown after these steps.

    https://<Azure AI Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>
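
The following sample request is a minimal sketch that uses placeholder values for the resource name, deployment name, API version, and API key; substitute your own values. The request body uses the OpenAI chat completions format.

curl "https://<Azure AI Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>" \
    -H "Content-Type: application/json" \
    -H "api-key: <Your API Key>" \
    -d '{
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "What is the capital of Japan?" }
        ]
    }'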

For more information, see the Azure AI Foundry documentation.

KAITO

To deploy an AI model by using the Kubernetes AI toolchain operator (KAITO) on AKS enabled by Azure Arc, see Deploy an AI model on AKS enabled by Azure Arc with the Kubernetes AI toolchain operator.

Foundry Local

To deploy an AI model using Foundry Local, see GitHub - microsoft/Foundry-Local.

Ollama

You can set up Ollama as a language model endpoint on your Kubernetes cluster, using either a CPU or a GPU.

  1. If you're using Ollama with a GPU, you must apply the following taint and label to the GPU node. Replace <moc-gpunode> with the name of your GPU node.

    kubectl taint nodes <moc-gpunode> ollamasku=ollamagpu:NoSchedule --overwrite
    
    kubectl label node <moc-gpunode> hardware=ollamagpu
    
  2. Create a YAML file by using one of the following snippets, depending on whether you're using a GPU or CPU for your model.

    • GPU yaml:

      # ollama-deploy.yaml
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ollama-deploy
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: ollama-deploy
        template:
          metadata:
            labels:
              app: ollama-deploy
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: hardware
                          operator: In
                          values:
                            - ollamagpu
            containers:
              - name: ollama
                image: ollama/ollama
                args: ["serve"]
                ports:
                  - containerPort: 11434
                volumeMounts:
                  - name: ollama-data
                    mountPath: /root/.ollama
                resources:
                  limits:
                    nvidia.com/gpu: "1"
            volumes:
              - name: ollama-data
                emptyDir: {}
            tolerations:
              - effect: NoSchedule
                key: ollamasku
                operator: Equal
                value: ollamagpu
      
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: ollama-llm
        namespace: default
      spec:
        selector:
          app: ollama-deploy
        ports:
          - port: 11434
            targetPort: 11434
            protocol: TCP
      
    • CPU yaml:

      # ollama-deploy.yaml
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ollama-deploy
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: ollama-deploy
        template:
          metadata:
            labels:
              app: ollama-deploy
          spec:
            containers:
              - name: ollama
                image: ollama/ollama
                args: ["serve"]
                ports:
                  - containerPort: 11434
                volumeMounts:
                  - name: ollama-data
                    mountPath: /root/.ollama
            volumes:
              - name: ollama-data
                emptyDir: {}
      
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: ollama-llm
        namespace: default
      spec:
        selector:
          app: ollama-deploy
        ports:
          - port: 11434
            targetPort: 11434
            protocol: TCP
      
  3. Deploy Ollama in the default namespace by applying the YAML file with the following command. This creates the Ollama deployment and service in the default namespace.

    kubectl apply -f ollama-deploy.yaml
    
  4. Download a model by using one of the following commands. Get the latest supported models here: Ollama Search.

    kubectl exec -n default -it deploy/ollama-deploy -- bash -c "ollama pull <model_name>"
    

    Or use k9s to connect to the pod and execute the following command inside the ollama pod:

    ollama pull <model_name>
    
  5. Use the following endpoint value as you configure the Edge RAG extension deployment. Optional checks to confirm the Ollama setup is ready are sketched after these steps.

    http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions
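
Before you configure the extension, you can optionally confirm that the Ollama setup from the previous steps is ready. The following sketch assumes the node, deployment, and service names used above; adjust them if yours differ.

# Confirm the GPU node label and taint from step 1 (GPU deployments only).
kubectl get nodes --show-labels | grep ollamagpu
kubectl describe node <moc-gpunode> | grep -i taint

# Confirm the Ollama pod is running and the service exposes port 11434.
kubectl get pods -n default -l app=ollama-deploy
kubectl get svc -n default ollama-llm

# Confirm the model you pulled is available.
kubectl exec -n default -it deploy/ollama-deploy -- bash -c "ollama list"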

After you deploy the Edge RAG extension, verify that the model can be accessed from another namespace. Run the following curl command from the inference flow pod in the arc-rag namespace.

curl http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3:8b",
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "What is the capital of Japan?" }
        ]
    }'
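
If you prefer not to open an interactive shell, a sketch of the same check using kubectl exec follows. The pod name is a placeholder; list the pods in the arc-rag namespace first, and note that this assumes curl is available in the inference flow container image.

# Find the inference flow pod in the arc-rag namespace.
kubectl get pods -n arc-rag

# Run the request from inside that pod (replace the placeholder pod name).
kubectl exec -n arc-rag -it <inference-flow-pod-name> -- \
    curl http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3:8b", "messages": [{"role": "user", "content": "What is the capital of Japan?"}]}'

A successful call returns a JSON chat completion, which confirms that the Edge RAG pods can reach the Ollama service across namespaces.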

Next step

Deploy the extension for Edge RAG and choose to add your own language model, using the endpoint you created.