Best practices for Serverless GPU compute

This article presents best practice recommendations for using serverless GPU compute in your notebooks and jobs.

Following these recommendations improves the productivity, cost efficiency, and reliability of your workloads on Azure Databricks.

Use the right compute

  • Use Serverless GPU compute. This option comes with torch, CUDA, and torchvision versions that are optimized for compatibility. Exact package versions depend on the environment version.
  • Select your accelerator in the environment side panel.
    • For remote distributed training workloads, use an A10 GPU, which acts as the client that submits the job to the remote H100 GPUs.
    • For running large interactive jobs in the notebook itself, attach your notebook to H100, which takes up one full node (8 H100 GPUs).
  • To avoid taking up GPUs, attach your notebook to a CPU cluster for operations such as git clone and converting a Spark DataFrame to Mosaic Data Shard (MDS) format (see the sketch below).
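
The following is a minimal sketch of converting a Spark DataFrame to MDS format with the mosaicml-streaming package. The DataFrame df, its columns, and the Unity Catalog volume output path are placeholders, and the streaming package must be available in your environment.

from streaming.base.converters import dataframe_to_mds

# Write the Spark DataFrame `df` out as MDS shards on a Unity Catalog volume
# (hypothetical path). merge_index=True produces a single merged index.json.
dataframe_to_mds(
    df.repartition(8),  # one shard group per partition
    merge_index=True,
    mds_kwargs={
        "out": "/Volumes/my_catalog/my_schema/my_volume/mds_dataset",
        "columns": {"id": "str", "text": "str"},
    },
)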

MLflow recommendations

For an optimal ML development cycle, use MLflow 3 on Databricks. Follow these tips:

  • Upgrade your environment's MLflow to version 3.6 or newer and follow the workflow described in MLflow 3 deep learning workflow.

  • Set the step parameter in MLFlowLogger so that metrics are logged every reasonable number of batches rather than every batch. MLflow has a limit of 10 million metric steps that can be logged. See Resource limits.

  • Enable mlflow.pytorch.autolog() if PyTorch Lightning is used as the trainer (see the sketch below).
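
    A minimal sketch, assuming PyTorch Lightning is the trainer; MyLightningModule and train_dataloader are placeholders for your own module and dataloader:

    import mlflow
    import pytorch_lightning as pl

    # Automatically log metrics, parameters, and the trained model from the Lightning trainer.
    mlflow.pytorch.autolog()

    trainer = pl.Trainer(max_epochs=3)
    trainer.fit(MyLightningModule(), train_dataloader)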

  • Customize your MLflow run name by encapsulating your model training code within the mlflow.start_run() API scope. This gives you control over the run name and enables you to restart from a previous run. You can customize the run name using the run_name parameter in mlflow.start_run(run_name="your-custom-name") or in third-party libraries that support MLflow (for example, Hugging Face Transformers). Otherwise, the default run name is jobTaskRun-xxxxx.

    from transformers import TrainingArguments
    args = TrainingArguments(
        report_to="mlflow",
        run_name="llama7b-sft-lr3e5",  # <-- MLflow run name
        logging_steps=50,
    )
    
  • The serverless GPU API launches an MLflow experiment to log system metrics. By default, the experiment name is /Users/{WORKSPACE_USER}/{get_notebook_name()} unless you override it with the MLFLOW_EXPERIMENT_NAME environment variable.

    • When setting the MLFLOW_EXPERIMENT_NAME environment variable, use an absolute path. For example, /Users/<username>/my-experiment.
    • The experiment name must not collide with the name of an existing folder. For example, if my-experiment is an existing folder, the example above errors out.
    import os
    from serverless_gpu import distributed

    os.environ['MLFLOW_EXPERIMENT_NAME'] = '/Users/{WORKSPACE_USER}/my_experiment'

    @distributed(gpus=num_gpus, gpu_type=gpu_type, remote=True)
    def run_train():
        # my training code
        ...
    
  • To resume training from a previous run, set the MLFLOW_RUN_ID environment variable to the ID of the previous run, as follows.

    import os

    os.environ['MLFLOW_RUN_ID'] = '<previous_run_id>'
    run_train.distributed()
    

Model checkpoint

Save model checkpoints to Unity Catalog volumes, which provide the same governance as other Unity Catalog objects. Use the following path format to reference files in volumes from a Databricks notebook:

/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>

Save checkpoints to volumes the same way you save them to local storage.

The example below shows how to write a PyTorch checkpoint to Unity Catalog volumes:

import torch

checkpoint = {
    "epoch": epoch,  # last finished epoch
    "model_state_dict": model.state_dict(),  # weights & buffers
    "optimizer_state_dict": optimizer.state_dict(),  # optimizer state
    "loss": loss,  # optional current loss
    "metrics": {"val_acc": val_acc},  # optional metrics
    # Add scheduler state, RNG state, and other metadata as needed.
}
checkpoint_path = "/Volumes/my_catalog/my_schema/model/checkpoints/ckpt-0001.pt"
torch.save(checkpoint, checkpoint_path)
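
Loading the checkpoint back from the volume also works the same way as from local storage. A minimal sketch, assuming the model and optimizer have already been constructed:

# Restore the training state saved above (hypothetical resume step).
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # resume from the next epoch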

This approach also works for distributed checkpoints from multiple nodes. The example below shows distributed model checkpointing with the Torch Distributed Checkpoint API:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

def save_checkpoint(model, optimizer, checkpoint_path):
    # Collect the (possibly sharded) model and optimizer state across ranks.
    model_state, optimizer_state = get_state_dict(model, optimizer)
    dcp.save(
        {"model": model_state, "optimizer": optimizer_state},
        checkpoint_id=checkpoint_path,
    )

# Call on every rank; each rank writes its shards to the volume directory.
save_checkpoint(model, optimizer, "/Volumes/my_catalog/my_schema/model/checkpoints")
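
To resume from a distributed checkpoint, load it back into an already constructed model and optimizer. A minimal sketch; the state-dict helpers from torch.distributed.checkpoint.state_dict are an assumption about how the state is gathered and restored:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def load_checkpoint(model, optimizer, checkpoint_path):
    # Recreate the state-dict structure used at save time, load the shards
    # into it in place, then push the values back into the live objects.
    model_state, optimizer_state = get_state_dict(model, optimizer)
    state_dict = {"model": model_state, "optimizer": optimizer_state}
    dcp.load(state_dict, checkpoint_id=checkpoint_path)
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state_dict["model"],
        optim_state_dict=state_dict["optimizer"],
    )

load_checkpoint(model, optimizer, "/Volumes/my_catalog/my_schema/model/checkpoints")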

Multi-user collaboration

  • To ensure all users can access shared code (for example, helper modules or environment.yaml), create Git folders in /Workspace/Repos or /Workspace/Shared instead of user-specific folders like /Workspace/Users/<your_email>/.
  • For code that is in active development, use Git folders in user-specific folders /Workspace/Users/<your_email>/ and push to remote Git repos. This allows multiple users to have a user-specific clone (and branch) but still use a remote Git repo for version control. See best practices for using Git on Databricks.
  • Collaborators can share and comment on notebooks.

Global limits in Databricks

See Resource limits.