Online Endpoint Creation

Mark Ritcey 20 Reputation points
2025-12-04T18:04:30.54+00:00

The Online Endpoints resource type is available, but deployments fail with provisioningState=Failed and no compute attached. Can you advise on how to correct this, or on whether a real-time inference feature flag needs to be enabled on the back end for my subscription?

Azure Machine Learning

Answer accepted by question author
  1. Sridhar M 2,840 Reputation points Microsoft External Staff Moderator
    2025-12-04T19:41:44.0966667+00:00

    Hi Mark Ritcey

    You do not need a special real-time inference feature flag enabled for Azure Machine Learning managed Online Endpoints. The capability is standard for all AML workspaces and subscriptions. Failures during endpoint provisioning are almost always due to configuration, quota, or network constraints — not missing features.

    When an endpoint deployment fails before compute is attached, it generally means the service could not allocate or prepare the compute for the deployment. Failures at this stage typically stem from one of the following:

    • Insufficient quota or SKU availability for the VM size you selected.
    • Networking restrictions, especially in VNet-injected workspaces where outbound access to ACR, storage, or control-plane endpoints is blocked.
    • ACR permission issues, such as missing AcrPull role for the managed identity.
    • Container startup errors, including failed environment setup or errors in score.py.
    • Endpoint name conflicts, especially if a previously deleted endpoint with the same name has not been fully purged.
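
    For the ACR permission item in the list above, one way to check the role is sketched here. It assumes the endpoint uses a system-assigned managed identity and that its ID appears under identity.principal_id in the az ml online-endpoint show output. The function name check_acr_pull and its argument order are hypothetical, not part of the CLI.

```shell
# Hypothetical helper: list AcrPull role assignments held by an endpoint's identity.
# Arguments: endpoint name, resource group, workspace name, ACR name.
# Assumes a system-assigned identity exposed as identity.principal_id.
check_acr_pull() {
    # Look up the principal ID of the endpoint's managed identity.
    principal_id=$(az ml online-endpoint show \
        --name "$1" --resource-group "$2" --workspace-name "$3" \
        --query identity.principal_id --output tsv)
    # Resolve the full resource ID of the container registry.
    acr_id=$(az acr show --name "$4" --query id --output tsv)
    # An empty result suggests the identity has no AcrPull role on this registry.
    az role assignment list \
        --assignee "$principal_id" \
        --scope "$acr_id" \
        --role AcrPull \
        --output table
}
```

    Example call: check_acr_pull <endpoint> <rg> <ws> <acr-name>. If the endpoint uses a user-assigned identity, the lookup of the principal ID would need to change accordingly.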

    Azure ML only surfaces a generic “Failed” state in the UI. To see the actual cause, retrieve logs from the deployment:

    • Use the CLI:
    az ml online-deployment get-logs \
        --name <deployment> \
        --endpoint-name <endpoint> \
        --resource-group <rg> \
        --workspace-name <ws>
    

    Alternatively, view the deployment logs in the Azure portal under the endpoint’s deployment tab. These logs usually reveal the precise problem (image-pull errors, missing dependencies, health-probe failures, etc.).
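
    As a small extension of the CLI command above, the sketch below pulls logs from both containers of a deployment, using the --container and --lines options of get-logs. The wrapper name fetch_deployment_logs and the choice of 200 lines are assumptions made for illustration.

```shell
# Hypothetical helper: print recent logs from both containers of a deployment.
# Arguments: deployment name, endpoint name, resource group, workspace name.
fetch_deployment_logs() {
    # inference-server runs the scoring code; storage-initializer downloads
    # the model and code assets before the server starts.
    for container in inference-server storage-initializer; do
        echo "===== $container ====="
        az ml online-deployment get-logs \
            --name "$1" \
            --endpoint-name "$2" \
            --resource-group "$3" \
            --workspace-name "$4" \
            --container "$container" \
            --lines 200
    done
}
```

    Example call: fetch_deployment_logs <deployment> <endpoint> <rg> <ws>. If the deployment failed before any container was created, the output may be empty, in which case the quota and network checks are the next place to look.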

    Recommended troubleshooting sequence

    To validate your environment, follow these steps:

    1. Deploy a minimal test endpoint using a small CPU SKU and a standard AML environment (no custom container).
    2. If the test deployment succeeds, reintroduce custom components gradually: VNet, custom container, environment, etc.
    3. Check compute quota for the region you are using.
    4. Ensure the identity used by the endpoint has AcrPull permissions if a private ACR is involved.
    5. Verify that egress traffic to required Azure services is allowed if using VNets or firewalls.
    6. Use a fresh endpoint name, avoiding names of recently deleted endpoints.
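
    The quota check and the fresh-name step could be scripted roughly as follows. This is a sketch under assumptions: it relies on az ml compute list-usage accepting --location, and both function names (fresh_endpoint_name, show_region_usage) are hypothetical helpers, not CLI commands.

```shell
# Hypothetical helper: build an endpoint name that has not been used before
# by appending the current Unix time to a prefix.
fresh_endpoint_name() {
    echo "$1-$(date +%s)"
}

# Hypothetical helper: show Azure ML compute usage and limits for a region.
# Arguments: region, resource group, workspace name.
show_region_usage() {
    az ml compute list-usage \
        --location "$1" \
        --resource-group "$2" \
        --workspace-name "$3" \
        --output table
}
```

    Example calls: show_region_usage <region> <rg> <ws> to compare current usage against the limit for the VM family you plan to use, and fresh_endpoint_name test-ep to get a name for the minimal test endpoint. Keep the prefix short, since endpoint names are length-limited.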

