Machine Learning Studio jobs fail due to DATA_CAPABILITY timeout

Alert101 5 Reputation points
2025-12-01T08:41:40.15+00:00

I'm running a custom deep learning experiment with TensorFlow/Keras in Azure ML Studio. The data is in TFRecord files, each around 1 GB. Smaller datasets (~10 GB) have worked fine, but now that I'm ready to use my main dataset, which is over 200 files, I'm getting data errors that break the training.

Failed to execute command group with error Docker container `a5a95ec6912f43219377f97c4d2ed5e3-lifecycler` failed with status code `1`. System unhealthy with error: Component DATA_CAPABILITY unhealthy with err Service 'DATA_CAPABILITY' returned invalid response: status: Unavailable, message: "ping timeout", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

The error happens after the first epoch, I assume during the saving of the epoch checkpoint. The weird thing is that reaching the end of the epoch takes around 30-40 minutes, but after that the job basically stalls and the error then hits anywhere from 5 minutes to 5 hours later. I have two mounts: one RO for the training data and one RW for the checkpoints. The error started happening after I added the following mount settings to the job env vars. Without them the VM runs out of disk space: 200+ files at ~1 GB each is far more than the node's 64 GB local disk, so a full local cache can't fit.

Region: East US
Storage account: General Purpose V2 Standard/LRS
VM: Standard_NC24ads_A100_v4 (low priority)
Mount settings (a sketch of how these and the two mounts are attached to the job follows the list):

  • "DATASET_MOUNT_CACHE_SIZE": "-4096 MB"
  • "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": "true"
  • "DATASET_MOUNT_MEMORY_CACHE_SIZE": "32768"
  • "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": "true"

My guess is that the mount settings aren't compatible with the RW mount for the checkpoints. Any ideas on how to troubleshoot this?

The error seems to indicate that the job is unable to access data from the mount, so I will try changing the storage account from LRS to ZRS tonight. However, that doesn't explain why the job stalls when trying to save the checkpoint, hence this post.

Azure Machine Learning

2 answers

  1. Alert101 5 Reputation points
    2025-12-01T15:17:41.6+00:00

    Update: I forgot that the conversion to ZRS is handled by Azure, so it's going to take a while.

    Disregard my guess about the mount settings somehow affecting writing to the mount. I tested this with a smaller dataset and it successfully saved the weights between epochs, so the problem lies elsewhere. I'll keep testing while I wait for the ZRS conversion to finish. I started a new job with the full dataset and will report back tomorrow.

    Any ideas are welcome.


  2. Alert101 5 Reputation points
    2025-12-02T15:42:58.49+00:00

    Thank you for the comprehensive answer, it helped a lot.

    You are right about running out of memory; that was the culprit, and it was caused by the training script calling .cache() on both the training and validation datasets. I removed that and the job got past the first epoch. However, after that fix I ran out of disk space again.
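
    For anyone hitting the same thing, before getting to the disk issue below: a minimal illustration of why the .cache() calls blew up host RAM, with toy datasets standing in for the real splits. .cache() with no argument keeps every element in memory, which is only safe for a split that fits in RAM; passing a filename makes it spill to disk instead (the path below is purely illustrative).

        import tensorflow as tf

        # Toy stand-ins for the real splits; only the placement of .cache() matters.
        train_ds = tf.data.Dataset.range(1_000_000)   # imagine the ~200 GB TFRecord split
        val_ds = tf.data.Dataset.range(10_000)        # imagine a small validation split

        # In-memory cache is only safe when the split fits in host RAM.
        val_ds = val_ds.cache()

        # For the large training split, dropping .cache() avoids the OOM; if caching
        # is still wanted, a filename makes tf.data cache to disk instead of RAM.
        # train_ds = train_ds.cache("/mnt/scratch/tfdata_cache/train")   # illustrative path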

    AzureMLCompute job failed
    DiskFullError: Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 64197 MB, available space: 1665 MB (under AZ_BATCH_NODE_ROOT_DIR).
    	Appinsights Reachable: Some(true)
    

    For some reason the negative DATASET_MOUNT_CACHE_SIZE doesn't seem to work, so I'm now running another test with the following mount options:

        "DATASET_RESERVED_FREE_DISK_SPACE":         "4096",
        "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED":  "true",   
        "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED":   "true",
        "DATASET_MOUNT_MEMORY_CACHE_SIZE":          "65536", 
    
    

    And in my training script:

        import tensorflow as tf

        # Pipeline options: parallel map/batch work, non-deterministic element order
        options = tf.data.Options()
        options.threading.private_threadpool_size = 12
        options.experimental_optimization.map_parallelization = True
        options.experimental_optimization.map_and_batch_fusion = True
        options.experimental_optimization.parallel_batch = True
        options.experimental_deterministic = False

        # Interleave reads across the TFRecord files
        dataset = tf.data.Dataset.from_tensor_slices(file_paths).interleave(
            lambda filename: tf.data.TFRecordDataset(
                filename,
                buffer_size=64 * 1024 * 1024,   # 64 MB read buffer per file
                num_parallel_reads=tf.data.AUTOTUNE
            ),
            cycle_length=min(12, len(file_paths)),
            block_length=4,
            num_parallel_calls=tf.data.AUTOTUNE
        )

        # Attach the options so the optimizations above actually take effect
        dataset = dataset.with_options(options)
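
    For completeness, this is roughly how the rest of the pipeline continues from there; the map/batch optimizations in the options only matter once those stages exist. The parse function, feature spec, and batch size below are illustrative placeholders, not the real ones from my script.

        # Hypothetical continuation: parse, batch, prefetch.
        def parse_example(serialized):
            # Illustrative feature spec; the real one depends on how the
            # TFRecords were written.
            features = tf.io.parse_single_example(
                serialized,
                {"x": tf.io.FixedLenFeature([128], tf.float32),
                 "y": tf.io.FixedLenFeature([], tf.int64)},
            )
            return features["x"], features["y"]

        BATCH_SIZE = 256
        dataset = (
            dataset
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(BATCH_SIZE, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE)   # overlap the input pipeline with training
        )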

    You mentioned disabling the block-based cache on the RW mount. How do I do that, given that the mount options are env vars for the job/VM itself?

    I'm using the mount directly as the destination for Keras's BackupAndRestore callback because I'm using spot instances. This is a personal project, so I need to keep costs down. :) Syncing the checkpoints back and forth when spot instances are killed and come back online seems like a bit of a hassle right now. I'll keep it in mind and look into it when I find the extra time.
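
    For context, the checkpoint setup is essentially the standard BackupAndRestore pattern pointed at the RW mount; the directory below is a placeholder for wherever the output mount resolves at runtime, not the real path.

        import tensorflow as tf

        # Placeholder for the resolved RW output mount path at runtime
        backup_dir = "/mnt/azureml/outputs/checkpoints/backup"

        callbacks = [
            # Writes epoch-level training state to backup_dir and restores it
            # automatically when a preempted spot job is resumed.
            tf.keras.callbacks.BackupAndRestore(backup_dir=backup_dir),
        ]

        # model.fit(dataset, epochs=EPOCHS, callbacks=callbacks)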

