Azure CycleCloud version 8.8.0

This release introduces highly requested new features, addresses several key issues, and improves overall performance.

New features

  • Azure CycleCloud offers greatly improved node health monitoring and reporting via the new HealthAgent (see the Azure CycleCloud HealthAgent Project).

  • You can disable the Azure CycleCloud HealthAgent by setting the node configuration property cyclecloud.healthagent.disable=true.

  • Azure CycleCloud now offers node, GPU, and Slurm scheduler metrics for monitoring and alerting. Monitoring is provided via Azure Monitor Workspace and Managed Grafana. See the Azure CycleCloud Monitoring Project for details.

  • Azure CycleCloud Slurm cluster changes:

    • Azure CycleCloud Slurm clusters support Slurm version 25.05.2.
    • Azure CycleCloud Slurm clusters support Ubuntu 22/24, Alma 8/9, and Red Hat 8/9 images.
    • Azure CycleCloud Slurm clusters support ARM64 images and machine types.
    • Azure CycleCloud Slurm clusters offer built-in, continuous health-checking, reporting, and recovery for cluster nodes by automatically configuring the Slurm HealthCheckProgram, Prolog, and Epilog scripts to use the Azure CycleCloud HealthAgent.
    • Azure CycleCloud Slurm clusters offer built-in metrics collection and monitoring in Azure Monitor Workspace.
    • The Azure CycleCloud Slurm cluster creation UI provides a new Monitoring section to support enabling and configuring the new metrics collection and monitoring capabilities (disabled by default).
    • Azure CycleCloud Slurm configures and starts the slurmrestd service automatically to support monitoring.
    • Azure CycleCloud Slurm clusters offer built-in, automated topology plugin configuration for both the tree and block topology plugins via the azslurm topology CLI. Automatic topology configuration is supported for clusters with Virtual Machine Scale Sets topology, SHARP, or the NVLink Domain for Slurm topology-aware scheduling.
    • Azure CycleCloud Slurm clusters include a new azslurmd system service that synchronizes shared Slurm and Azure CycleCloud state. For example, azslurmd synchronizes Azure CycleCloud's node keep-alive setting with Slurm’s native keep-alive feature.
    • Azure CycleCloud Slurm clusters now include prolog and epilog scripts that automatically configure the NVIDIA IMEX service on a per-job basis for NVIDIA GPU clusters.
    • Azure CycleCloud Slurm clusters using the cyclecloud-slurm project, version 4.x and later, no longer require Chef for node configuration.
  • Jetpack CLI changes

    • The Jetpack CLI includes a new jetpack props command to support reading and writing node data (properties) from cluster nodes for use in cluster-init scripts. Properties are stored back to Azure CycleCloud as the NodeProperties type in the Azure CycleCloud datastore.
    • The Jetpack CLI includes a new jetpack condition command used to report node health conditions to Azure CycleCloud.
  • Azure CycleCloud UI changes

    • The cluster-level Issues button now opens as a full page and aggregates allocation and health issues for easier viewing.
    • The cluster-level Activity Log tab in the Cluster UI was repositioned alongside the Event Log pane.
    • The node-level Show Details dialog Overview tab was redesigned and updated with direct links to the Azure portal and copy buttons for all fields.
    • The node-level Show Details dialog includes a new action bar that provides node-specific operations including Restart and Reimage for node health remediation.
    • The node-level Show Details dialog now shows just the first node health condition and provides a link to a new Issues tab to display all current node conditions.
  • NVMe device support

    • Azure CycleCloud automatically formats and mounts NVMe storage devices on Linux nodes for machine types with NVMe ephemeral disks.
    • Linux nodes mount NVMe ephemeral disks at /nvme.
    • Machine types with NVMe boot disks, such as the v6, HBv5, and HBv6 machine types, are now supported.
  • ARM64 support

    • Azure CycleCloud and Jetpack support ARM64 nodes and ARM64 images if the cluster type provides ARM64 support. Currently, only the Slurm cluster type provides built-in ARM64 support.
    • ARM64 packages for Jetpack are available for installation in custom images.
  • Azure CycleCloud now provides Reimage and Restart actions on Virtual Machine Scale Set nodes for node recovery and repair.

  • The new Restart and Reimage actions are available via the new Azure CycleCloud REST APIs: /clusters/{cluster}/nodes/restart and /clusters/{cluster}/nodes/reimage.

  • Azure CycleCloud node arrays now support attaching precreated Virtual Machine Scale Sets (also known as bring-your-own Virtual Machine Scale Sets) by setting the new PredefinedScaleSetId node attribute.

  • You can configure Linux nodes to run without the legacy Chef framework for nodes that don't require Chef.

  • Chef is disabled by default for new Slurm clusters, unless required by specific node configurations.

  • All filesystem mounts for cluster nodes are now persisted to /etc/fstab. This change ensures that filesystems properly remount upon reboot.

  • Linux nodes now bind the temporary directory (/tmp) to a directory created on the ephemeral disk (if the machine type provides an ephemeral disk) to reduce OS disk usage.

  • Azure CycleCloud supports Blobfuse2 as a mount type in cluster templates.

  • When you modify node configuration settings on running clusters, you can apply changes to running nodes by issuing a reconverge command on the nodes.

  • Azure CycleCloud now uses the Azure Compute RP API version 2024-11-01.
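The new Restart and Reimage REST endpoints listed above could be called along the following lines. This is a minimal sketch, not a definitive client: the endpoint paths come from these notes, but the POST method and the JSON body shape (a `names` list, modeled on the existing node management endpoints) are assumptions, as are the helper name `build_node_action_request` and the example host. Authentication is omitted; check the REST API reference before relying on any of this.

```python
import json
import urllib.parse
import urllib.request

# The two node actions introduced in this release.
ACTIONS = ("restart", "reimage")


def build_node_action_request(base_url, cluster, action, node_names):
    """Build (but do not send) a request for /clusters/{cluster}/nodes/{action}.

    Assumption: the endpoint accepts POST with a JSON body that selects
    nodes by name. Verify the body schema against the API reference.
    """
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    cluster_path = urllib.parse.quote(cluster, safe="")
    url = f"{base_url.rstrip('/')}/clusters/{cluster_path}/nodes/{action}"
    body = json.dumps({"names": list(node_names)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host, cluster, and node names, for illustration only.
    req = build_node_action_request(
        "https://cyclecloud.example.com", "my-slurm", "reimage", ["hpc-1", "hpc-2"]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the sketch free of network access. To send it, pass the request to `urllib.request.urlopen` after adding whatever credentials your Azure CycleCloud installation requires. Per the known issues in these notes, the actions apply only to nodes in node arrays.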

Resolved problems

  • The Azure CycleCloud UI formatting made converge errors difficult to interpret.
  • The /c/{cluster_name} URL for direct-linking to clusters in the UI redirected to a blank page for unauthenticated users.
  • Cloud-init errors weren't reported correctly.
  • Cloud-init failures didn't differentiate user-script errors from image-level errors.
  • The azslurm nodes CLI command sometimes failed and showed the message: "missing 'buckets' param."
  • When used by non-root users, log rotation for the azslurm CLI failed due to log file ownership and user permissions.
  • Azure CycleCloud Slurm clusters stored private IP addresses in the Slurm node data. This problem led to Slurm rejecting nodes under certain conditions.
  • The Azure CycleCloud UI lost the active cluster selection when it refreshed the Issues panel.
  • The Keep Alive toggle in the node status report didn't work.
  • Pressing Enter on the sign-in page didn't submit the authentication form.
  • The default shell selection in Linux was inconsistent for different OS images.
  • The jetpack users CLI command provided no output for some cluster types.
  • Azure CycleCloud CLI installation failed on macOS.
  • The jetpack report_issue CLI command failed to upload the generated log bundle.
  • Using the Azure CLI az vm run-command on an Azure CycleCloud node caused Azure CycleCloud to flag the node as failed with the message: "An unspecified error occurred."
  • Updating a cluster could fail and report an "Attribute mismatch error" for the TerminateNotificationTimeout and MaxPrice node array attributes, even when the values were unchanged.
  • Azure reported an incorrect GPU count and memory size for GB200, and the incorrect data was reflected in the Azure CycleCloud machine data used for scheduling.
  • Azure CycleCloud threw an exception during node creation if the StartTime attribute wasn't set on the node record.
  • Cluster nodes sometimes failed to reconverge after a Reimage operation because cluster-init marker files stored on the node's ephemeral disk weren't removed by the operation.

Breaking changes

  • The Jetpack package is now installed by default for custom images.
    • To revert to the old behavior, set InstallJetpack=false on the node in the cluster template.
  • The Azure CycleCloud Slurm cluster now defaults to ReturnProxy=false.
    • To revert to the original behavior, set the ReturnProxy parameter to true during cluster creation.
  • For better default security, Azure CycleCloud Slurm clusters now disable public IPs by default.
    • To revert to the original behavior, set the UsePublicNetwork parameter to true during cluster creation.
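If you want the previous Slurm defaults across several clusters, one option is to keep the two overrides above in a reusable parameters mapping. The sketch below assumes you supply cluster parameters as a flat JSON object (for example, through a parameters file at cluster creation). The parameter names ReturnProxy and UsePublicNetwork come from these notes, while the helper name `with_legacy_slurm_defaults` and the sample parameter are hypothetical. InstallJetpack is a node attribute in the cluster template rather than a cluster parameter, so it isn't covered here.

```python
import json

# Pre-8.8.0 values for the two Slurm cluster parameters whose defaults changed.
LEGACY_SLURM_DEFAULTS = {
    "ReturnProxy": True,
    "UsePublicNetwork": True,
}


def with_legacy_slurm_defaults(params):
    """Return a copy of params with the legacy defaults filled in.

    Values already present in params are kept, so an explicit choice
    made by the caller is never overwritten.
    """
    merged = dict(LEGACY_SLURM_DEFAULTS)
    merged.update(params)
    return merged


if __name__ == "__main__":
    # "Region" is an illustrative parameter, not taken from these notes.
    params = with_legacy_slurm_defaults({"Region": "westus2"})
    print(json.dumps(params, indent=2, sort_keys=True))
```

Because public IPs are now disabled by default for security reasons, consider restoring UsePublicNetwork only where the cluster genuinely needs it.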

Known issues

  • The new Restart and Reimage actions are available only for nodes in node arrays (Virtual Machine Scale Set instances). Single nodes (individual VMs) don't yet support Restart or Reimage. For single nodes, use the Azure portal or the Azure CLI to restart or reimage the VM.
  • The Azure CycleCloud HPC Pack cluster type fails to converge.