This article presents best practice recommendations for using serverless GPU compute in your notebooks and jobs.
By following these recommendations, you will enhance the productivity, cost efficiency, and reliability of your workloads on Azure Databricks.
Use the right compute
- Use Serverless GPU compute. This option comes with `torch`, `cuda`, and `torchvision` optimized for compatibility. Exact package versions depend on the environment version.
- Select your accelerator in the environment side panel.
- For remote distributed training workloads, use an A10 GPU, which acts as the client that submits jobs to the remote H100 GPUs.
- For large interactive jobs that run on the notebook itself, attach your notebook to an H100, which takes up one node (8 H100 GPUs).
- To avoid occupying GPUs, attach your notebook to a CPU cluster for operations such as `git clone` and converting a Spark DataFrame to Mosaic Data Shard (MDS) format (see the sketch after this list).
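The MDS conversion mentioned above can be done with the `mosaicml-streaming` package. The following is a minimal sketch rather than an exact recipe; the table name, output volume path, and column-to-type mapping are placeholder assumptions:

from streaming.base.converters import dataframe_to_mds

# Read a training table and write it out as MDS shards to a Unity Catalog volume.
df = spark.read.table("my_catalog.my_schema.training_data")  # hypothetical table
dataframe_to_mds(
    df,
    merge_index=True,
    mds_kwargs={
        "out": "/Volumes/my_catalog/my_schema/my_volume/mds/train",  # hypothetical volume path
        "columns": {"prompt": "str", "response": "str"},  # map DataFrame columns to MDS types
    },
)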
MLflow recommendations
For an optimal ML development cycle, use MLflow 3 on Databricks. Follow these tips:
- Upgrade your environment's MLflow to version 3.6 or newer and follow the MLflow 3 deep learning workflow.
- Set the `step` parameter in `MLFlowLogger` to a reasonable number of batches. MLflow has a limit of 10 million metric steps that can be logged. See Resource limits.
- Enable `mlflow.pytorch.autolog()` if PyTorch Lightning is used as the trainer.
- Customize your MLflow run name by encapsulating your model training code within the `mlflow.start_run()` API scope. This gives you control over the run name and enables you to restart from a previous run. You can customize the run name using the `run_name` parameter in `mlflow.start_run(run_name="your-custom-name")` or in third-party libraries that support MLflow (for example, Hugging Face Transformers). Otherwise, the default run name is `jobTaskRun-xxxxx`.

  from transformers import TrainingArguments

  args = TrainingArguments(
      report_to="mlflow",
      run_name="llama7b-sft-lr3e5",  # <-- MLflow run name
      logging_steps=50,
  )
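  If you call MLflow directly rather than through a third-party integration, a minimal sketch looks like the following, assuming `train()` is a training function you define elsewhere:

  import mlflow

  mlflow.pytorch.autolog()  # optional: autolog metrics, params, and the model when using PyTorch Lightning

  with mlflow.start_run(run_name="llama7b-sft-lr3e5"):  # custom run name
      train()  # everything logged inside this scope belongs to the run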
- The serverless GPU API launches an MLflow experiment to log system metrics. By default, it uses the name `/Users/{WORKSPACE_USER}/{get_notebook_name()}` unless the user overwrites it with the environment variable `MLFLOW_EXPERIMENT_NAME`.
  - When setting the `MLFLOW_EXPERIMENT_NAME` environment variable, use an absolute path. For example, `/Users/<username>/my-experiment`.
  - The experiment name must not collide with an existing folder name. For example, if `my-experiment` is an existing folder, the example above errors out.

  import os
  from serverless_gpu import distributed

  os.environ['MLFLOW_EXPERIMENT_NAME'] = '/Users/{WORKSPACE_USER}/my_experiment'

  @distributed(gpus=num_gpus, gpu_type=gpu_type, remote=True)
  def run_train():
      # my training code
- To resume training from a previous run, specify the `MLFLOW_RUN_ID` of that run as follows.
  import os

  os.environ['MLFLOW_RUN_ID'] = '<previous_run_id>'
  run_train.distributed()
Model checkpoint
Save model checkpoints to Unity Catalog volumes, which provide the same governance as other Unity Catalog objects. Use the following path format to reference files in volumes from a Databricks notebook:
/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>
Save checkpoints to volumes the same way you save them to local storage.
The example below shows how to write a PyTorch checkpoint to Unity Catalog volumes:
import torch
checkpoint = {
"epoch": epoch, # last finished epoch
"model_state_dict": model.state_dict(), # weights & buffers
"optimizer_state_dict": optimizer.state_dict(), # optimizer state
"loss": loss, # optional current loss
"metrics": {"val_acc": val_acc}, # optional metrics
# Add scheduler state, RNG state, and other metadata as needed.
}
checkpoint_path = "/Volumes/my_catalog/my_schema/model/checkpoints/ckpt-0001.pt"
torch.save(checkpoint, checkpoint_path)
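To resume training later, load the checkpoint back from the volume. A minimal sketch, assuming the same `model` and `optimizer` objects have already been constructed:

import torch

checkpoint_path = "/Volumes/my_catalog/my_schema/model/checkpoints/ckpt-0001.pt"
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue from the next epoch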
This approach also works for distributed checkpoints from multiple nodes. The example below shows distributed model checkpointing with the Torch Distributed Checkpoint API:
import torch.distributed.checkpoint as dcp

# Method on a custom trainer class: gather the (possibly sharded) model and
# optimizer state, then write a distributed checkpoint to the given path.
def save_checkpoint(self, checkpoint_path):
    state_dict = self.get_state_dict(model, optimizer)
    dcp.save(state_dict, checkpoint_id=checkpoint_path)

trainer.save_checkpoint("/Volumes/my_catalog/my_schema/model/checkpoints")
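To restore a distributed checkpoint, each rank rebuilds a state dict with the same structure that was saved and loads the shards in place. A minimal sketch, assuming a simple layout with a "model" key; the keys must match whatever your trainer's `get_state_dict` actually produced:

import torch.distributed.checkpoint as dcp

state_dict = {"model": model.state_dict()}  # hypothetical layout; keys must match the saved checkpoint
dcp.load(state_dict, checkpoint_id="/Volumes/my_catalog/my_schema/model/checkpoints")
model.load_state_dict(state_dict["model"])  # optimizer state can be restored the same way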
Multi-user collaboration
- To ensure all users can access shared code (e.g., helper modules, environment.yaml), create Git folders in `/Workspace/Repos` or `/Workspace/Shared` instead of user-specific folders like `/Workspace/Users/<your_email>/` (see the sketch after this list).
- For code that is in active development, use Git folders in user-specific folders under `/Workspace/Users/<your_email>/` and push to remote Git repos. This allows multiple users to have a user-specific clone (and branch) while still using a remote Git repo for version control. See best practices for using Git on Databricks.
- Collaborators can share and comment on notebooks.
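Notebooks can import shared helper modules directly from a workspace path. A minimal sketch, assuming a hypothetical shared Git folder `/Workspace/Shared/ml-utils` that contains a `helpers.py` module:

import sys

sys.path.append("/Workspace/Shared/ml-utils")  # hypothetical shared Git folder
from helpers import build_dataloader  # hypothetical helper module and function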
Global limits in Databricks
See Resource limits.