This release introduces highly requested new features, addresses several key issues, and improves overall performance.
New features
Azure CycleCloud offers greatly improved node health monitoring and reporting via the new HealthAgent (see the Azure CycleCloud HealthAgent Project).
You can disable the Azure CycleCloud HealthAgent by setting the node configuration property cyclecloud.healthagent.disable=true.
Azure CycleCloud now offers node, GPU, and Slurm scheduler metrics for monitoring and alerting. Monitoring is provided via Azure Monitor Workspace and Managed Grafana. See the Azure CycleCloud Monitoring Project for details.
Azure CycleCloud Slurm cluster changes:
- Azure CycleCloud Slurm clusters support Slurm version 25.05.2.
- Azure CycleCloud Slurm clusters support Ubuntu 22/24, Alma 8/9, and RedHat 8/9 images.
- Azure CycleCloud Slurm clusters support ARM64 images and machine types.
- Azure CycleCloud Slurm clusters offer built-in, continuous health-checking, reporting, and recovery for cluster nodes by automatically configuring the Slurm HealthCheckProgram, Prolog, and Epilog scripts to use the Azure CycleCloud HealthAgent.
- Azure CycleCloud Slurm clusters offer built-in metrics collection and monitoring in Azure Monitor Workspace.
- The Azure CycleCloud Slurm cluster creation UI provides a new Monitoring section to support enabling and configuring the new metrics collection and monitoring capabilities (disabled by default).
- Azure CycleCloud Slurm configures and starts the slurmrestd service automatically to support monitoring.
- Azure CycleCloud Slurm clusters offer built-in, automated topology plugin configuration for both the tree and block topology plugins via the azslurm topology CLI. Automatic topology configuration is supported for clusters with Virtual Machine Scale Sets topology, SHARP, or the NVLink Domain for Slurm topology-aware scheduling.
- Azure CycleCloud Slurm clusters include a new azslurmd system service that synchronizes shared Slurm and Azure CycleCloud state. For example, azslurmd synchronizes Azure CycleCloud's node keep-alive setting with Slurm's native keep-alive feature.
- Azure CycleCloud Slurm clusters now include prolog and epilog scripts to automatically configure the "Nvidia IMEX" service on a per-job basis for Nvidia GPU clusters.
- Azure CycleCloud Slurm clusters using the cyclecloud-slurm project, version 4.x and later, no longer require Chef for node configuration.
Jetpack CLI changes
- The Jetpack CLI includes a new jetpack props command to support reading and writing node data (properties) from cluster nodes for use in cluster-init scripts. Properties are stored back to Azure CycleCloud as the NodeProperties type in the Azure CycleCloud datastore.
- The Jetpack CLI includes a new jetpack condition command used to report node health conditions to Azure CycleCloud.
Azure CycleCloud UI Changes
- The cluster-level Issues button now opens as a full page and aggregates allocation and health issues for easier viewing.
- The cluster-level Activity Log tab in the Cluster UI was repositioned alongside the Event Log pane.
- The node-level Show Details dialog Overview tab was redesigned and updated with direct links to the Azure portal and copy buttons for all fields.
- The node-level Show Details dialog includes a new action bar that provides node-specific operations including Restart and Reimage for node health remediation.
- The node-level Show Details dialog now shows just the first node health condition and provides a link to a new Issues tab to display all current node conditions.
NVMe device support
- Azure CycleCloud automatically mounts and formats NVMe storage devices on Linux nodes on machine types with NVMe ephemeral disks.
- Linux nodes mount NVMe ephemeral disks at /nvme.
- Machine types with NVMe boot disks, such as the v6, HBv5, and HBv6 machine types, are now supported.
ARM64 support
- Azure CycleCloud and Jetpack support ARM64 nodes and ARM64 images if the cluster type provides ARM64 support. Currently, only the Slurm cluster type provides built-in ARM64 support.
- ARM64 packages for Jetpack are available for installation in custom images.
Azure CycleCloud now provides Reimage and Restart actions on Virtual Machine Scale Set nodes for node recovery and repair.
The new Restart and Reimage actions are available via the new Azure CycleCloud REST APIs: /clusters/{cluster}/nodes/restart and /clusters/{cluster}/nodes/reimage.
Azure CycleCloud node arrays now support attaching precreated Virtual Machine Scale Sets (also known as bring-your-own Virtual Machine Scale Sets) by setting the new PredefinedScaleSetId node attribute.
You can configure Linux nodes to run without the legacy Chef framework for nodes that don't require Chef.
Chef is disabled by default for new Slurm clusters, unless required by specific node configurations.
All filesystem mounts for cluster nodes are now persisted to /etc/fstab. This change ensures that filesystems properly remount upon reboot.
Linux nodes now bind the temporary directory (/tmp) to a directory created on the ephemeral disk (if the machine type provides an ephemeral disk) to reduce OS disk usage.
Azure CycleCloud supports Blobfuse2 as a mount type in cluster templates.
When you modify node configuration settings on running clusters, you can apply changes to running nodes by issuing a reconverge command on the nodes.
Azure CycleCloud now uses the Azure Compute RP API version 2024-11-01.
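As a rough illustration of the Restart and Reimage REST endpoints listed in this section, the following Python sketch builds (but does not send) a request for one of the two paths. The paths come from these notes; everything else is an assumption made for illustration: the POST method, the JSON body with a names list (modeled on the shape other CycleCloud node-management endpoints use), the helper name build_node_action_request, and the example host and node names. Authentication is omitted; check the REST API reference for the actual request schema before relying on this.

```python
import json
import urllib.parse
import urllib.request


def build_node_action_request(base_url, cluster, action, node_names):
    """Build (without sending) a request for a node restart/reimage action.

    Hypothetical helper: only the URL paths are taken from the release notes.
    """
    if action not in ("restart", "reimage"):
        raise ValueError(f"unsupported action: {action}")

    # Path from the release notes: /clusters/{cluster}/nodes/{action}
    path = f"/clusters/{urllib.parse.quote(cluster, safe='')}/nodes/{action}"

    # ASSUMPTION: POST with a JSON body that selects nodes by name.
    # The real request schema may differ.
    body = json.dumps({"names": list(node_names)}).encode("utf-8")

    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Example values are placeholders, not real hosts or nodes.
request = build_node_action_request(
    "https://cyclecloud.example.com", "my-cluster", "reimage", ["htc-1", "htc-2"]
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and body, and to add whatever authentication your CycleCloud installation requires before calling urllib.request.urlopen.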
Resolved problems
- The Azure CycleCloud UI formatting made converge errors difficult to interpret.
- The /c/{cluster_name} URL for direct-linking to clusters in the UI redirected to a blank page for unauthenticated users.
- Cloud-init errors weren't reported correctly.
- Cloud-init failures didn't differentiate user-script errors from image-level errors.
- The azslurm nodes CLI command sometimes failed and showed the message: "missing 'buckets' param."
- When used by non-root users, log rotation for the azslurm CLI failed due to log file ownership and user permissions.
- Azure CycleCloud Slurm clusters stored private IP addresses in the Slurm node data. This problem led to Slurm rejecting nodes under certain conditions.
- The Azure CycleCloud UI lost the active cluster selection when it refreshed the Issues panel.
- The Keep Alive toggle in the node status report didn't work.
- Pressing Enter on the sign-in page didn't submit the authentication form.
- The default shell selection in Linux was inconsistent for different OS images.
- The jetpack users CLI command provided no output for some cluster types.
- Azure CycleCloud CLI installation failed on macOS.
- The jetpack report_issue CLI command failed to upload the generated log bundle.
- Using the Azure CLI az vm run-command on an Azure CycleCloud node caused Azure CycleCloud to flag the node as failed with the message: "An unspecified error occurred."
- Updating a cluster could fail and report an "Attribute mismatch error" for the TerminateNotificationTimeout and MaxPrice node array attributes, even when the value is unchanged.
- Azure reported an incorrect GPU count and memory size for GB200 and the incorrect data was reflected in Azure CycleCloud machine data for scheduling.
- Azure CycleCloud threw an exception during node creation if the StartTime attribute wasn't set on the node record.
- Cluster nodes sometimes failed to reconverge after a Reimage operation because cluster-init marker files stored on the node's ephemeral disk weren't removed by the operation.
Breaking changes
- The Jetpack package is now installed by default for custom images.
  - To revert to the old behavior, set InstallJetpack=false on the node in the cluster template.
- The Azure CycleCloud Slurm cluster now defaults to ReturnProxy=false.
  - To revert to the original behavior, set the ReturnProxy parameter to true during cluster creation.
- For better default security, Azure CycleCloud Slurm clusters now disable public IPs by default.
  - To revert to the original behavior, set the UsePublicNetwork parameter to true during cluster creation.
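If you create Slurm clusters from scripts rather than the UI, one way to restore the previous ReturnProxy and UsePublicNetwork defaults is a parameter file. The Python sketch below only generates that JSON. The two parameter names come from these notes; the helper name build_revert_overrides and the assumption that the file is passed to the CLI with cyclecloud create_cluster and its -p option are illustrative, so verify both against your installed CLI and template. InstallJetpack is a node attribute set in the cluster template, so it isn't included here.

```python
import json


def build_revert_overrides(return_proxy=True, use_public_network=True):
    """Return cluster parameters that restore the pre-release defaults.

    Hypothetical helper: only the parameter names come from the release notes.
    """
    return {
        "ReturnProxy": return_proxy,
        "UsePublicNetwork": use_public_network,
    }


# Serialize the overrides; the output could be saved as, for example, params.json.
# ASSUMPTION: the file is then supplied at cluster creation, e.g. via
#   cyclecloud create_cluster <template> <cluster-name> -p params.json
params_json = json.dumps(build_revert_overrides(), indent=2, sort_keys=True)
print(params_json)
```

Enabling public IPs again weakens the new, more secure default, so consider overriding only the parameter you actually need.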
Known issues
- The new Restart and Reimage actions are available only for nodes in node arrays (Virtual Machine Scale Set instances). Single nodes (individual VMs) don't yet support Restart or Reimage. For single nodes, use the Azure portal or the Azure CLI to restart or reimage the VM.
- The Azure CycleCloud HPC Pack cluster type fails to converge.