Thank you for the comprehensive answer; it helped a lot.
You are right about running out of memory: that was the culprit, and it was caused by the training script calling .cache() on both the training and validation datasets. I disabled that and training got past the first epoch. However, after that fix I ran out of disk space again:
AzureMLCompute job failed
DiskFullError: Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 64197 MB, available space: 1665 MB (under AZ_BATCH_NODE_ROOT_DIR).
Appinsights Reachable: Some(true)
For some reason the negative DATASET_MOUNT_CACHE_SIZE value doesn't seem to have any effect, so I'm now running another test with the following mount options:
"DATASET_RESERVED_FREE_DISK_SPACE": "4096",
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": "true",
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": "true",
"DATASET_MOUNT_MEMORY_CACHE_SIZE": "65536",
And in my training script:
options = tf.data.Options()
options.threading.private_threadpool_size = 12
options.experimental_optimization.map_parallelization = True
options.experimental_optimization.map_and_batch_fusion = True
options.experimental_optimization.parallel_batch = True
options.experimental_deterministic = False
dataset = tf.data.Dataset.from_tensor_slices(file_paths).interleave(
    lambda filename: tf.data.TFRecordDataset(
        filename,
        buffer_size=64 * 1024 * 1024,
        num_parallel_reads=tf.data.AUTOTUNE
    ),
    cycle_length=min(12, len(file_paths)),
    block_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)
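The options then get applied to the dataset further down; roughly like this, where parse_example and BATCH_SIZE are placeholders for what the script actually uses:

# Apply the tf.data options from above and finish the input pipeline.
# parse_example and BATCH_SIZE are placeholders, not the exact code.
dataset = (
    dataset.with_options(options)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)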
You mentioned disabling the block-based cache on the RW mount. How do I do that, given that the mount options are environment variables set for the whole job/VM?
I'm using the mount directly as the destination for Keras's BackupAndRestore callback because I'm running on spot instances; this is a personal project, so I need to keep costs down. :) Syncing the checkpoints back and forth when spot instances are preempted and come back online seems like a bit of a hassle right now. I'll keep it in mind and look into it when I find the extra time.
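For context, the checkpoint setup is roughly the following; the backup directory and the CHECKPOINT_DIR variable are placeholders for wherever the rw_mount ends up in my job:

import os
import tensorflow as tf

# Placeholder: points at the rw_mount path passed into the job.
backup_dir = os.environ.get("CHECKPOINT_DIR", "/mnt/azureml/checkpoints")

callbacks = [
    # BackupAndRestore writes periodic backups so training can resume
    # after a spot-instance preemption restarts the job.
    tf.keras.callbacks.BackupAndRestore(backup_dir=backup_dir),
]

# model, dataset, and EPOCHS come from the rest of the training script.
model.fit(dataset, epochs=EPOCHS, callbacks=callbacks)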