Machine Learning Studio jobs fail due to DATA_CAPABILITY timeout

Alert101 5 Reputation points
2025-12-01T08:41:40.15+00:00

I'm running a custom deep learning experiment with TensorFlow/Keras in Azure ML Studio. The data is in TFRecord files, each around 1 GB. Smaller datasets (~10 GB) have worked fine, but now that I'm ready to use my main dataset, which is over 200 files, I'm getting data errors that break the training.

Failed to execute command group with error Docker container `a5a95ec6912f43219377f97c4d2ed5e3-lifecycler` failed with status code `1`. System unhealthy with error: Component DATA_CAPABILITY unhealthy with err Service 'DATA_CAPABILITY' returned invalid response: status: Unavailable, message: "ping timeout", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

The error happens after the first epoch, I assume during the saving of the epoch checkpoint. The weird thing is that reaching the end of the epoch takes around 30-40 minutes, but after that the job basically stalls and the error then hits anywhere from 5 minutes to 5 hours later. I have two mounts: one RO for the training data and one RW for the checkpoints. The error started happening after I added the following mount settings to the job env vars. Without them the VM runs out of disk space: 200+ files at ~1 GB each is far more than the node's 64 GB local disk, so a full local cache can't fit.

Region: East US
Storage account: General Purpose V2 Standard/LRS
VM: Standard_NC24ads_A100_v4 (low priority)
Mount settings (a sketch of how these and the two mounts are attached to the job follows the list):

  • "DATASET_MOUNT_CACHE_SIZE": "-4096 MB"
  • "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": "true"
  • "DATASET_MOUNT_MEMORY_CACHE_SIZE": "32768"
  • "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": "true"

My guess is that the mount settings aren't compatible with the RW mount for the checkpoints. Any ideas on how to troubleshoot this?

The error seems to indicate that the job is unable to access data from the mount, so I will try changing the storage account from LRS to ZRS tonight. However, that doesn't explain why the job stalls when trying to save the checkpoint, hence this post.

Azure Machine Learning

2 answers

  1. Alert101 5 Reputation points
    2025-12-01T15:17:41.6+00:00

    Update: I forgot that the conversion to ZRS is handled by Azure, so it's going to take a while.

    Disregard my guess about the mount settings somehow affecting writing to the mount. I tested this with a smaller dataset and it successfully saved the weights between epochs, so the problem lies elsewhere. I'll keep testing while I wait for the ZRS conversion to finish. I started a new job with the full dataset and will report back tomorrow.

    Any ideas are welcome.


  2. Alert101 5 Reputation points
    2025-12-02T15:42:58.49+00:00

    Thank you for the comprehensive answer, it helped a lot.

    You are right about running out of memory; that was the culprit, and it was caused by the training script calling .cache() on both the training and validation datasets. I removed that and the job got past the first epoch. However, after that fix I ran out of disk space again.
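
    For anyone hitting the same thing, before getting to the disk issue below: a minimal illustration of why the .cache() calls blew up host RAM, with toy datasets standing in for the real splits. .cache() with no argument keeps every element in memory, which is only safe for a split that fits in RAM; passing a filename makes it spill to disk instead (the path below is purely illustrative).

        import tensorflow as tf

        # Toy stand-ins for the real splits; only the placement of .cache() matters.
        train_ds = tf.data.Dataset.range(1_000_000)   # imagine the ~200 GB TFRecord split
        val_ds = tf.data.Dataset.range(10_000)        # imagine a small validation split

        # In-memory cache is only safe when the split fits in host RAM.
        val_ds = val_ds.cache()

        # For the large training split, dropping .cache() avoids the OOM; if caching
        # is still wanted, a filename makes tf.data cache to disk instead of RAM.
        # train_ds = train_ds.cache("/mnt/scratch/tfdata_cache/train")   # illustrative path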

    AzureMLCompute job failed
    DiskFullError: Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 64197 MB, available space: 1665 MB (under AZ_BATCH_NODE_ROOT_DIR).
    	Appinsights Reachable: Some(true)
    

    For some reason the negative DATASET_MOUNT_CACHE_SIZE doesn't seem to work, so I'm now running another test with the following mount options:

        "DATASET_RESERVED_FREE_DISK_SPACE":         "4096",
        "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED":  "true",   
        "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED":   "true",
        "DATASET_MOUNT_MEMORY_CACHE_SIZE":          "65536", 
    
    

    And in my training script:

        import tensorflow as tf

        # Pipeline options: parallel map/batch work, non-deterministic element order
        options = tf.data.Options()
        options.threading.private_threadpool_size = 12
        options.experimental_optimization.map_parallelization = True
        options.experimental_optimization.map_and_batch_fusion = True
        options.experimental_optimization.parallel_batch = True
        options.experimental_deterministic = False

        # Interleave reads across the TFRecord files
        dataset = tf.data.Dataset.from_tensor_slices(file_paths).interleave(
            lambda filename: tf.data.TFRecordDataset(
                filename,
                buffer_size=64 * 1024 * 1024,   # 64 MB read buffer per file
                num_parallel_reads=tf.data.AUTOTUNE
            ),
            cycle_length=min(12, len(file_paths)),
            block_length=4,
            num_parallel_calls=tf.data.AUTOTUNE
        )

        # Attach the options so the optimizations above actually take effect
        dataset = dataset.with_options(options)
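
    For completeness, this is roughly how the rest of the pipeline continues from there; the map/batch optimizations in the options only matter once those stages exist. The parse function, feature spec, and batch size below are illustrative placeholders, not the real ones from my script.

        # Hypothetical continuation: parse, batch, prefetch.
        def parse_example(serialized):
            # Illustrative feature spec; the real one depends on how the
            # TFRecords were written.
            features = tf.io.parse_single_example(
                serialized,
                {"x": tf.io.FixedLenFeature([128], tf.float32),
                 "y": tf.io.FixedLenFeature([], tf.int64)},
            )
            return features["x"], features["y"]

        BATCH_SIZE = 256
        dataset = (
            dataset
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(BATCH_SIZE, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE)   # overlap the input pipeline with training
        )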

    You mentioned disabling the block-based cache on the RW mount. How do I do that, given that the mount options are env vars for the job/VM itself?

    I'm using the mount directly as the destination for Keras's BackupAndRestore callback because I'm using spot instances. This is a personal project, so I need to keep costs down. :) Syncing the checkpoints back and forth when spot instances are killed and come back online seems like a bit of a hassle right now. I'll keep it in mind and look into it when I find the extra time.
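
    For context, the checkpoint setup is essentially the standard BackupAndRestore pattern pointed at the RW mount; the directory below is a placeholder for wherever the output mount resolves at runtime, not the real path.

        import tensorflow as tf

        # Placeholder for the resolved RW output mount path at runtime
        backup_dir = "/mnt/azureml/outputs/checkpoints/backup"

        callbacks = [
            # Writes epoch-level training state to backup_dir and restores it
            # automatically when a preempted spot job is resumed.
            tf.keras.callbacks.BackupAndRestore(backup_dir=backup_dir),
        ]

        # model.fit(dataset, epochs=EPOCHS, callbacks=callbacks)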

