To address the issue with your Azure Kubernetes Service (AKS) cluster being in a degraded state and the node pools (`newpool1`, `newpool2`) showing `provisioningState: Failed`, you can follow these troubleshooting steps (illustrative command sketches for each step follow the list):
- Check Node Pool Status: Run `az aks nodepool show` to check the status of your node pools. Look for any specific error messages or codes that give more insight into the failure.
- VM Scale Set Status: Check the status of the VM scale set backing each node pool with `az vmss show`. Again, look for error messages or codes.
- Inspect Individual VMs: Use `az vmss list-instances` to check the status of the individual VMs in the node pools. If any VMs are in a `Failed` or `Unhealthy` state, investigate their error messages.
- Quota and Capacity Check: Verify the quota and capacity for your region and subscription using `az vm list-usage`. If you have reached a limit, consider requesting an increase or deleting unused resources.
- Policy and Role Assignments: Check the policy and role assignments affecting your node pools using `az policy` and `az role`. Ensure that no assignment prevents the creation or scaling of nodes.
- Resource Locks: Check for resource locks on your node pools using `az lock`. If locks prevent scaling or updates, you may need to adjust or remove them.
- Node Image Version: The error `InvalidGalleryImageRef` suggests an issue with the specified node image version. Ensure that the image version `AKSUbuntu-1804gen2containerd-202505.27.0` is valid and still available in your region; if it is deprecated or unavailable, upgrade to a newer image version.
- Reconciliation: If necessary, trigger a reconciliation of the managed cluster with `az resource update --ids <AKS cluster id>`. This can help align the actual state with the desired configuration.
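
For the node pool status check, a minimal sketch; the resource group and cluster names are placeholders for your environment:

```bash
# Placeholders: replace myResourceGroup / myAKSCluster with your values.
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name newpool1 \
  --output json
# provisioningState and any embedded error details appear in the JSON output.
```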
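The scale sets backing AKS node pools live in the cluster's managed node resource group, so resolve that first; a sketch, with the same placeholder names as above:

```bash
# Resolve the managed node resource group (usually MC_<rg>_<cluster>_<region>).
NODE_RG=$(az aks show --resource-group myResourceGroup \
  --name myAKSCluster --query nodeResourceGroup -o tsv)

# List the scale sets, then inspect the one backing the failed pool.
az vmss list --resource-group "$NODE_RG" --output table
az vmss show --resource-group "$NODE_RG" --name <vmss-name> --output json
```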
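To inspect individual instances, a sketch that assumes the `$NODE_RG` variable from the previous snippet:

```bash
# Show each instance and its provisioning state.
az vmss list-instances --resource-group "$NODE_RG" \
  --name <vmss-name> --output table

# Drill into one instance's runtime and health status.
az vmss get-instance-view --resource-group "$NODE_RG" \
  --name <vmss-name> --instance-id <instance-id>
```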
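For the quota check, substitute the region your cluster runs in:

```bash
# Compare CurrentValue against Limit for the VM family your pools use,
# e.g. "Standard DSv3 Family vCPUs".
az vm list-usage --location <region> --output table
```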
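For policy and role assignments, a sketch using `az policy assignment list` and `az role assignment list`; the resource-group scope shown is an assumption, so widen it to the subscription if your assignments live there:

```bash
# Policies applied at the cluster's resource group scope.
az policy assignment list --resource-group myResourceGroup --output table

# Role assignments on the same scope; verify the cluster and kubelet
# identities still have the permissions they need.
az role assignment list --resource-group myResourceGroup --output table
```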
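For resource locks, check both the cluster's resource group and the managed node resource group:

```bash
# A ReadOnly or CanNotDelete lock on either group can block node operations.
az lock list --resource-group myResourceGroup --output table
az lock list --resource-group "$NODE_RG" --output table
```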
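For the node image step, a sketch that lists the versions in use and, if the pinned image is deprecated, moves a pool onto the latest available node image via `az aks nodepool upgrade`:

```bash
# Show the node image version each pool is running.
az aks nodepool list --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --query "[].{name:name, image:nodeImageVersion}" --output table

# Upgrade a failed pool to the latest node image for its Kubernetes version.
az aks nodepool upgrade --resource-group myResourceGroup \
  --cluster-name myAKSCluster --name newpool1 --node-image-only
```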
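For the reconciliation step, the `az resource update` call from the list, with the cluster ID resolved first:

```bash
# An update with no property changes nudges AKS to reconcile the cluster
# back toward its desired (goal) state.
CLUSTER_ID=$(az aks show --resource-group myResourceGroup \
  --name myAKSCluster --query id -o tsv)
az resource update --ids "$CLUSTER_ID"
```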
If these steps do not resolve the issue, consider reaching out to Azure support for further assistance, especially since this is a production outage.