If you plan to bring your own language model (BYOM) instead of using one of the models included in Edge RAG, you must set up an OpenAI API-compatible endpoint for your Edge RAG deployment. Choose one of the methods in this article to create your endpoint.
By bringing your own model, you can enable advanced search types, like hybrid multimodal and deep search, that aren't available with the Edge RAG-provided models. For deep search, we recommend OpenAI GPT-4o, GPT-4.1-mini, or a later version.
After you create your endpoint, use the endpoint when you deploy the extension for Edge RAG and choose to add your own language model.
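Whichever method you choose, the endpoint must accept OpenAI-style chat completion requests. The following is only a minimal sketch of that request shape; the URL, model name, and API key are placeholders, not values provided by Edge RAG:

```bash
# Illustrative only: replace the URL, API key, and model name with your own values.
# A compatible endpoint accepts an OpenAI-style chat completions payload like this.
curl https://<your-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "<model-name>",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello" }
    ]
  }'
```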
Important
Edge RAG Preview, enabled by Azure Arc is currently in PREVIEW. See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
Azure AI Foundry
To use your own model with Edge RAG, you can deploy a language model and create an endpoint by using Azure AI Foundry.
Go to Azure AI Foundry and sign in with your Azure account.
Create a new Azure AI Foundry resource or go to an existing resource.
On the Azure AI Foundry resource, select Models + endpoints.
Select Deploy model > Deploy base model.
Choose a chat completion model from the list, such as gpt-4o, and then select Confirm.
Edit the following fields as appropriate for your scenario:

| Field | Description |
|---|---|
| Deployment name | Choose a deployment name. The default is the name of the model you selected. |
| Deployment type | Select a deployment type. The default is Global Standard. |

Select Deploy to selected resource.
Wait for the deployment to complete and the State to show Succeeded.
Get the endpoint and API key by selecting the deployed model. For example, the endpoint looks like the following URL.
https://<Azure AI Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>
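Optionally, you can send a quick test request to confirm the deployment is reachable before you configure Edge RAG. This is a minimal sketch; the resource name, deployment name, API version, and key are placeholders, and Azure OpenAI-style endpoints typically accept the key in an api-key header:

```bash
# Placeholder values: substitute your resource name, deployment name, API version, and key.
curl "https://<Azure AI Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>" \
  -H "Content-Type: application/json" \
  -H "api-key: <your-api-key>" \
  -d '{
    "messages": [
      { "role": "user", "content": "Hello" }
    ]
  }'
```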
For more information, see the Azure AI Foundry documentation.
KAITO
To deploy an AI model by using the Kubernetes AI toolchain operator (KAITO), see Deploy an AI model on AKS enabled by Azure Arc with the Kubernetes AI toolchain operator.
Foundry Local
To deploy an AI model using Foundry Local, see GitHub - microsoft/Foundry-Local.
Ollama
You can set up Ollama as a language model endpoint on your Kubernetes cluster. Use either CPU or GPU.
If you're using Ollama with a GPU, you must set the following taint and label on the GPU node. Replace <moc-gpunode> with the name of your GPU node.

```bash
kubectl taint nodes <moc-gpunode> ollamasku=ollamagpu:NoSchedule --overwrite
kubectl label node <moc-gpunode> hardware=ollamagpu
```
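Optionally, you can confirm that the taint and label took effect before you deploy Ollama. These are standard kubectl checks, not a required part of the Edge RAG procedure:

```bash
# Optional check: the node should show the ollamasku taint and the hardware=ollamagpu label.
kubectl describe node <moc-gpunode> | grep -i taint
kubectl get node <moc-gpunode> --show-labels
```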
Create a yaml file by using one of the following snippets, depending on whether you're using GPU or CPU for your model.

GPU yaml:
```yaml
# ollama-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-deploy
  template:
    metadata:
      labels:
        app: ollama-deploy
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hardware
                    operator: In
                    values:
                      - ollamagpu
      containers:
        - name: ollama
          image: ollama/ollama
          args: ["serve"]
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
          resources:
            limits:
              nvidia.com/gpu: "1"
      volumes:
        - name: ollama-data
          emptyDir: {}
      tolerations:
        - effect: NoSchedule
          key: ollamasku
          operator: Equal
          value: ollamagpu
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-llm
  namespace: default
spec:
  selector:
    app: ollama-deploy
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
```

CPU yaml:
```yaml
# ollama-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deploy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-deploy
  template:
    metadata:
      labels:
        app: ollama-deploy
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          args: ["serve"]
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-llm
  namespace: default
spec:
  selector:
    app: ollama-deploy
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
```
Deploy Ollama in the default namespace by using the following command. The yaml file creates an Ollama deployment and service in the default namespace.
```bash
kubectl apply -f ollama-deploy.yaml
```
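Optionally, wait for the Ollama deployment to become ready before you pull a model. These are standard kubectl checks, not a required part of the procedure:

```bash
# Optional: wait for the rollout to finish, then confirm the pod is running.
kubectl rollout status deploy/ollama-deploy -n default
kubectl get pods -n default -l app=ollama-deploy
```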
Download a model by using one of the following commands. Get the latest supported models here: Ollama Search.

```bash
kubectl exec -n default -it deploy/ollama-deploy -- bash -c "ollama pull <model_name>"
```

Or use k9s to connect to the pod and execute the following command inside the ollama pod:
```bash
ollama pull <model_name>
```

Use the following endpoint value as you configure the Edge RAG extension deployment:
http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions
After you deploy the Edge RAG extension, verify that the model can be accessed from another namespace. Run the following curl command from the inference flow pod in the arc-rag namespace.
```bash
curl http://ollama-llm.default.svc.cluster.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "What is the capital of Japan?" }
    ]
  }'
```
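If the endpoint is reachable, Ollama returns an OpenAI-style chat completion. The exact fields depend on the model and Ollama version; the following is only an illustration of the general response shape, with placeholder values:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "llama3:8b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "The capital of Japan is Tokyo." },
      "finish_reason": "stop"
    }
  ]
}
```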