Multi-GPU and multi-node workloads

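You can use the serverless GPU Python API to launch distributed workloads across multiple GPUs, either within a single node or across multiple nodes. The API provides a simple, unified interface that abstracts away the details of GPU provisioning, environment setup, and workload distribution. With minimal code changes, you can move from single-GPU training to distributed execution across remote GPUs from the same notebook.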

Quick start

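The serverless GPU API for distributed training is preinstalled in serverless GPU compute environments for Databricks notebooks. GPU environment 4 or later is recommended. To use it for distributed training, import the distributed decorator and apply it to your training function.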

The code snippet below shows the basic usage of @distributed.

# Import the distributed decorator
from serverless_gpu import distributed

# Decorate your training function with @distributed and specify the number of GPUs, the GPU type,
# and whether or not the GPUs are remote
@distributed(gpus=8, gpu_type='A10', remote=True)
def run_train():
    ...

The following is a full example that trains a multilayer perceptron (MLP) model across 8 A10 GPUs from a notebook.

Set up the model and define utility functions.


# Define the model
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def setup():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

class SimpleMLP(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=64, output_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
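
The setup() helper above relies on environment variables that the launcher is expected to provide to each process. As a minimal sketch, assuming the torchrun-style variables RANK, LOCAL_RANK, and WORLD_SIZE are set (only LOCAL_RANK appears in the example; the other two are assumptions), a small hypothetical helper such as read_dist_env can collect them in one place, for example to restrict logging to rank 0.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global rank across all processes
    local_rank: int  # index of this process's GPU on its node
    world_size: int  # total number of processes

    @property
    def is_main(self) -> bool:
        # Rank 0 is conventionally the process that logs and saves checkpoints
        return self.rank == 0


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    # Hypothetical helper: falls back to single-process defaults
    # when the variables are absent (for example, in a local test).
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


# Example with explicit values instead of the real environment
info = read_dist_env({"RANK": "3", "LOCAL_RANK": "0", "WORLD_SIZE": "8"})
print(info.rank, info.world_size, info.is_main)
```

Inside a training function, a check such as read_dist_env().is_main could guard calls that should happen only once per job.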

Import the serverless_gpu library and the distributed module.
```
import serverless_gpu
from serverless_gpu import distributed
```

Wrap the model training code in a function and decorate the function with the @distributed decorator.

@distributed(gpus=8, gpu_type='A10', remote=True)
def run_train(num_epochs: int, batch_size: int) -> None:
    import mlflow
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # 1. Set up multi node environment
    setup()
    device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")

    # 2. Wrap the model in PyTorch DistributedDataParallel (DDP) for data-parallel training.
    model = SimpleMLP().to(device)
    model = DDP(model, device_ids=[device])

    # 3. Create and load dataset.
    x = torch.randn(5000, 10)
    y = torch.randn(5000, 1)

    dataset = TensorDataset(x, y)
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)

    # 4. Define the training loop.
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)
        model.train()
        total_loss = 0.0
        for step, (xb, yb) in enumerate(dataloader):
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
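            # Log loss to MLflow with a global step so that later epochs
            # do not reuse the step values of earlier ones
            mlflow.log_metric("loss", loss.item(), step=epoch * len(dataloader) + step)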

            loss.backward()
            optimizer.step()
            total_loss += loss.item() * xb.size(0)

        mlflow.log_metric("total_loss", total_loss)
        print(f"Total loss for epoch {epoch}: {total_loss}")

    cleanup()

Run the distributed training by calling the distributed function with user-defined arguments.
```
run_train.distributed(num_epochs=3, batch_size=1)
```
When the call runs, an MLflow run link appears in the notebook cell output. Click the link, or find the run in the Experiment panel, to view the results.
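
To make the data sharding in this example concrete: DistributedSampler gives each process its own slice of the dataset, so with 5000 samples, 8 GPUs, and batch_size=1, each GPU runs 625 steps per epoch. The sketch below reproduces that arithmetic in plain Python. It mirrors the sampler's behavior without shuffling (pad to a multiple of the world size, then take every world_size-th index) and assumes a non-empty dataset; shard_indices is a hypothetical helper, not part of PyTorch or the serverless GPU API.

```python
import math


def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Indices that one rank would see in an epoch, without shuffling."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Pad by repeating indices from the start so every rank gets the same count
    padding = total - num_samples
    if padding:
        indices += (indices * math.ceil(padding / num_samples))[:padding]
    # Each rank takes every world_size-th index, starting at its own rank
    return indices[rank:total:world_size]


batch_size = 1
shards = [shard_indices(5000, 8, rank) for rank in range(8)]
steps_per_epoch = math.ceil(len(shards[0]) / batch_size)
print(len(shards[0]), steps_per_epoch)
```

With a larger batch size, the per-GPU step count shrinks accordingly, while the effective global batch is batch_size multiplied by the number of GPUs.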

Distributed execution details

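The serverless GPU API consists of several key components:

Compute manager: handles resource allocation and management
Runtime environment: manages Python environments and dependencies
Launcher: orchestrates job execution and monitoring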

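When running in distributed mode:

The function is serialized and distributed across the specified number of GPUs.
Each GPU runs a copy of the function with the same parameters.
The environment is synchronized across all nodes.
Results are collected and returned from all GPUs.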
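
The steps above can be pictured with a purely conceptual, single-process simulation. It is not how the serverless GPU API is implemented: the real launcher also serializes the function itself and runs each copy on a separate GPU process, while this sketch only pickles the arguments and loops over ranks to show the shape of the flow. The names simulate_distributed and train_step, and the use of a RANK environment variable, are assumptions made for illustration.

```python
import os
import pickle


def train_step(scale: float) -> str:
    # Every copy receives the same arguments; only the rank differs
    rank = int(os.environ.get("RANK", "0"))
    return f"rank {rank} ran with scale={scale}"


def simulate_distributed(fn, world_size: int, **kwargs) -> list:
    # 1. Serialize the call arguments once
    payload = pickle.dumps(kwargs)
    previous = os.environ.get("RANK")
    results = []
    try:
        for rank in range(world_size):
            # 2. Each simulated worker deserializes the same payload
            #    and runs its own copy of the function
            worker_kwargs = pickle.loads(payload)
            os.environ["RANK"] = str(rank)
            results.append(fn(**worker_kwargs))
    finally:
        # Restore the caller's environment
        if previous is None:
            os.environ.pop("RANK", None)
        else:
            os.environ["RANK"] = previous
    # 3. Return one result per rank
    return results


results = simulate_distributed(train_step, world_size=4, scale=0.5)
print(results)
```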

If remote is set to True, the workload is distributed across remote GPUs. If remote is set to False, the workload runs on the single GPU node attached to the current notebook. If that node has multiple GPUs, all of them are used.

The API supports popular parallel training libraries such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), DeepSpeed, and Ray.

You can find more realistic distributed training scenarios that use these libraries in the notebook examples.

Launch with Ray

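The serverless GPU API can also launch distributed training with Ray through the @ray_launch decorator, which is layered on top of @distributed. Each ray_launch task first bootstraps a PyTorch distributed rendezvous to decide the Ray head worker and gather IP addresses. Rank zero runs ray start --head (with metrics export, if enabled), sets RAY_ADDRESS, and runs the decorated function as the Ray driver. The other nodes join through ray start --address and wait until the driver writes a completion marker.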
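
The bootstrap sequence above can be sketched as a small function that decides what each node would run. This is only an illustration of the described flow, not the library's actual code: ray_bootstrap_command and ray_address are hypothetical helpers, the port 6379 is an assumption (Ray's conventional head port), the IP address is a placeholder, and the commands are built as argument lists rather than executed.

```python
def ray_bootstrap_command(rank: int, head_ip: str, port: int = 6379) -> list[str]:
    if rank == 0:
        # Rank zero becomes the Ray head node
        return ["ray", "start", "--head", f"--port={port}"]
    # Every other node joins the head at the gathered IP address
    return ["ray", "start", f"--address={head_ip}:{port}"]


def ray_address(head_ip: str, port: int = 6379) -> str:
    # Value that rank zero would export as RAY_ADDRESS before running the driver
    return f"{head_ip}:{port}"


for rank in range(3):
    print(rank, " ".join(ray_bootstrap_command(rank, "10.0.0.1")))
print("RAY_ADDRESS =", ray_address("10.0.0.1"))
```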

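Additional configuration details:

To enable Ray system metrics collection on each node, use RayMetricsMonitor with remote=True.
Define Ray runtime options (actors, datasets, placement groups, and scheduling) inside the decorated function using standard Ray APIs.
Manage cluster-wide controls (GPU count and type, remote versus local mode, async behavior, and Databricks pool environment variables) outside the function, in the decorator arguments or the notebook environment.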

The example below shows how to use @ray_launch.

from serverless_gpu.ray import ray_launch
@ray_launch(gpus=16, remote=True, gpu_type='A10')
def foo():
    import os
    import ray
    print(ray.state.available_resources_per_node())
    return 1
foo.distributed()

For a full example, see this notebook, which launches Ray to train a ResNet18 neural network on multiple A10 GPUs.

Frequently asked questions (FAQ)

Where should the data loading code be placed?

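When you use the serverless GPU API for distributed training, move the data loading code inside the function decorated with @distributed. A dataset can exceed the maximum size that pickle allows, so create the dataset inside the decorated function, as shown below.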

from serverless_gpu import distributed

# this may cause pickle error
dataset = get_dataset(file_path)
@distributed(gpus=8, remote=True)
def run_train():
  # good practice
  dataset = get_dataset(file_path)
  ....
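
To see why this matters, the sketch below compares how many bytes would need to be serialized when a dataset object is captured from the notebook versus when only a file path is passed and the data is loaded inside the function. It assumes a pickle-style serialization of everything the decorated function references; the nested list standing in for a dataset and the path /tmp/train.parquet are placeholders.

```python
import pickle

# Stand-in for an in-memory dataset: 5000 rows of 10 floats
dataset = [[float(i)] * 10 for i in range(5000)]
# Stand-in for a location the workers could read from directly
file_path = "/tmp/train.parquet"

dataset_bytes = len(pickle.dumps(dataset))
path_bytes = len(pickle.dumps(file_path))

print(f"dataset payload: {dataset_bytes} bytes")
print(f"path payload:    {path_bytes} bytes")
```

Even this small stand-in produces a payload several orders of magnitude larger than the path, and the gap grows with the size of the real dataset.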

Can I use a reserved GPU pool?

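If a reserved GPU pool is available in your workspace (check with your administrator) and you set remote to True in the @distributed decorator, the workload launches on the reserved GPU pool by default. To use the on-demand GPU pool instead, set the environment variable DATABRICKS_USE_RESERVED_GPU_POOL to False before calling the distributed function, as shown below.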

import os
from serverless_gpu import distributed

# Set this before calling the distributed function
os.environ['DATABRICKS_USE_RESERVED_GPU_POOL'] = 'False'
@distributed(gpus=8, remote=True)
def run_train():
    ...
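
Because this setting is a process-wide environment variable, it stays in effect for later calls in the same notebook session. One way to scope it, sketched here under the assumption that the variable is read when the distributed function is called, is a small context manager that sets the value and restores the previous state afterwards. The helper name on_demand_gpu_pool is hypothetical and not part of the serverless GPU API.

```python
import os
from contextlib import contextmanager

POOL_VAR = "DATABRICKS_USE_RESERVED_GPU_POOL"


@contextmanager
def on_demand_gpu_pool():
    # Remember the current value so it can be restored on exit
    previous = os.environ.get(POOL_VAR)
    os.environ[POOL_VAR] = "False"  # environment values must be strings
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(POOL_VAR, None)
        else:
            os.environ[POOL_VAR] = previous


with on_demand_gpu_pool():
    # The distributed function would be called here
    inside_value = os.environ[POOL_VAR]

print(inside_value)
```

Inside the with block you would call the decorated function, for example run_train.distributed().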

Learn more

For the API reference, see the serverless GPU Python API documentation.

Last updated on 2025-12-19