Create a "BYOM" endpoint to use for Edge RAG Preview enabled by Azure Arc

If you plan to bring your own language model (BYOM) instead of using one of the models included in Edge RAG, you must set up an OpenAI API-compatible endpoint for your Edge RAG deployment. Choose one of the methods in this article to create your endpoint.

By bringing your own model, you can enable advanced search types, like hybrid multimodal and deep search, that aren't available with the Edge RAG-provided models. For deep search, we recommend OpenAI GPT-4o, GPT-4.1-mini, or a later version.

After you create your endpoint, use the endpoint when you deploy the extension for Edge RAG and choose to add your own language model.

Important

Edge RAG Preview, enabled by Azure Arc, is currently in PREVIEW. See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.

Azure AI Foundry

To use your own model with Edge RAG, you can deploy a language model and create an endpoint by using Azure AI Foundry.

  1. Go to Azure AI Foundry and sign in with your Azure account.

  2. Create a new Azure AI Foundry resource or go to an existing resource.

  3. On the Azure AI Foundry resource, select Models + endpoints.

  4. Select Deploy model > Deploy base model.

  5. Choose a chat completion model from the list, such as gpt-4o.

  6. Select Confirm.

  7. Edit the following fields as appropriate for your scenario:

    • Deployment name: Choose a deployment name. The default is the name of the model you selected.
    • Deployment type: Select a deployment type. The default is Global Standard.
  8. Select Deploy to selected resource.

  9. Wait for the deployment to complete and for the State to show Succeeded.

  10. Get the endpoint and API key by selecting the deployed model. The endpoint looks like the following URL; a sample request that you can use to confirm the deployment responds is shown after these steps.

    https://<Azure AI Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>
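
The following sample request is a minimal sketch that uses placeholder values for the resource name, deployment name, API version, and API key; substitute your own values. The request body uses the OpenAI chat completions format.

curl "https://<Azure AI Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>" \
    -H "Content-Type: application/json" \
    -H "api-key: <Your API Key>" \
    -d '{
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "What is the capital of Japan?" }
        ]
    }'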

For more information, see the Azure AI Foundry documentation.

KAITO

To deploy an AI model by using the Kubernetes AI toolchain operator (KAITO) on AKS enabled by Azure Arc, see Deploy an AI model on AKS enabled by Azure Arc with the Kubernetes AI toolchain operator.

Foundry Local

To deploy an AI model using Foundry Local, see GitHub - microsoft/Foundry-Local.

Ollama

You can set up Ollama as a language model endpoint on your Kubernetes cluster, using either a CPU or a GPU.

  1. If you're using Ollama with a GPU, you must apply the following taint and label to the GPU node. Replace <moc-gpunode> with the name of your GPU node.

    kubectl taint nodes <moc-gpunode> ollamasku=ollamagpu:NoSchedule --overwrite
    
    kubectl label node <moc-gpunode> hardware=ollamagpu
    
  2. Create a YAML file by using one of the following snippets, depending on whether you're using a GPU or CPU for your model.

    • GPU yaml:

      # ollama-deploy.yaml
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ollama-deploy
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: ollama-deploy
        template:
          metadata:
            labels:
              app: ollama-deploy
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: hardware
                          operator: In
                          values:
                            - ollamagpu
            containers:
              - name: ollama
                image: ollama/ollama
                args: ["serve"]
                ports:
                  - containerPort: 11434
                volumeMounts:
                  - name: ollama-data
                    mountPath: /root/.ollama
                resources:
                  limits:
                    nvidia.com/gpu: "1"
            volumes:
              - name: ollama-data
                emptyDir: {}
            tolerations:
              - effect: NoSchedule
                key: ollamasku
                operator: Equal
                value: ollamagpu
      
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: ollama-llm
        namespace: default
      spec:
        selector:
          app: ollama-deploy
        ports:
          - port: 11434
            targetPort: 11434
            protocol: TCP
      
    • CPU yaml:

      # ollama-deploy.yaml
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ollama-deploy
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: ollama-deploy
        template:
          metadata:
            labels:
              app: ollama-deploy
          spec:
            containers:
              - name: ollama
                image: ollama/ollama
                args: ["serve"]
                ports:
                  - containerPort: 11434
                volumeMounts:
                  - name: ollama-data
                    mountPath: /root/.ollama
            volumes:
              - name: ollama-data
                emptyDir: {}
      
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: ollama-llm
        namespace: default
      spec:
        selector:
          app: ollama-deploy
        ports:
          - port: 11434
            targetPort: 11434
            protocol: TCP
      
  3. Deploy Ollama in the default namespace by applying the YAML file with the following command. This creates the Ollama deployment and service in the default namespace.

    kubectl apply -f ollama-deploy.yaml
    
  4. Download a model by using one of the following commands. Get the latest supported models here: Ollama Search.

    kubectl exec -n default -it deploy/ollama-deploy -- bash -c "ollama pull <model_name>"
    

    Or use k9s to connect to the pod and execute the following command inside the ollama pod:

    ollama pull <model_name>
    
  5. Use the following endpoint value as you configure the Edge RAG extension deployment. Optional checks to confirm the Ollama setup is ready are sketched after these steps.

    http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions
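
Before you configure the extension, you can optionally confirm that the Ollama setup from the previous steps is ready. The following sketch assumes the node, deployment, and service names used above; adjust them if yours differ.

# Confirm the GPU node label and taint from step 1 (GPU deployments only).
kubectl get nodes --show-labels | grep ollamagpu
kubectl describe node <moc-gpunode> | grep -i taint

# Confirm the Ollama pod is running and the service exposes port 11434.
kubectl get pods -n default -l app=ollama-deploy
kubectl get svc -n default ollama-llm

# Confirm the model you pulled is available.
kubectl exec -n default -it deploy/ollama-deploy -- bash -c "ollama list"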

After you deploy the Edge RAG extension, verify that the model can be accessed from another namespace. Run the following curl command from the inference flow pod in the arc-rag namespace.

curl http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3:8b",
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "What is the capital of Japan?" }
        ]
    }'
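
If you prefer not to open an interactive shell, a sketch of the same check using kubectl exec follows. The pod name is a placeholder; list the pods in the arc-rag namespace first, and note that this assumes curl is available in the inference flow container image.

# Find the inference flow pod in the arc-rag namespace.
kubectl get pods -n arc-rag

# Run the request from inside that pod (replace the placeholder pod name).
kubectl exec -n arc-rag -it <inference-flow-pod-name> -- \
    curl http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3:8b", "messages": [{"role": "user", "content": "What is the capital of Japan?"}]}'

A successful call returns a JSON chat completion, which confirms that the Edge RAG pods can reach the Ollama service across namespaces.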

Next step

Deploy the extension for Edge RAG and choose to add your own language model, using the endpoint you created.