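Configure and deploy a Ray cluster on Azure Kubernetes Service (AKS)

In this article, you configure and deploy a Ray cluster on Azure Kubernetes Service (AKS) using KubeRay. You also learn how to use the Ray cluster to train a simple machine learning model and view the results on the Ray Dashboard.

This article provides two ways to deploy the Ray cluster on AKS:

Non-interactive deployment: Use the deploy.sh script in the GitHub repository to deploy the complete Ray sample non-interactively.
Manual deployment: Follow the manual deployment steps to deploy the Ray sample to AKS.

Prerequisites

Review the Ray cluster on AKS overview to understand the components and deployment process.
An Azure subscription. If you don't have an Azure subscription, you can create a free account.
The Azure CLI installed on your local machine. You can install it by following the instructions in How to install the Azure CLI.
The Azure Kubernetes Service preview extension installed.
Helm installed.
Terraform client tools or OpenTofu installed. This article uses Terraform, but the modules used should be compatible with OpenTofu.

Deploy the Ray sample non-interactively

If you want to deploy the complete Ray sample non-interactively, you can use the deploy.sh script in the GitHub repository (https://github.com/Azure-Samples/aks-ray-sample). This script completes the steps outlined in the Ray deployment process section.

Clone the GitHub repository locally and change to the root of the repository using the following commands: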
```
git clone https://github.com/Azure-Samples/aks-ray-sample
cd aks-ray-sample
```
Deploy the complete sample using the following commands:
```
chmod +x deploy.sh
./deploy.sh
```
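Once the deployment completes, review the output of the logs and the resource group in the Azure portal to see the infrastructure that was created.

Manually deploy the Ray sample

Fashion MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. In this guide, you train a simple PyTorch model on this dataset using the Ray cluster.

Deploy the RayJob specification

To train the model, you need to submit a Ray Job specification to the KubeRay operator running on a private AKS cluster. The Ray Job specification is a YAML file that describes the resources required to run the job, including the Docker image, the command to run, and the number of workers to use.

Looking at the Ray Job description, you might need to modify some fields to match your environment:

The replicas field under the workerGroupSpecs section in rayClusterSpec specifies the number of worker pods that KubeRay schedules to the Kubernetes cluster. Each worker pod requires 3 CPUs and 4 GB of memory. The head pod requires 1 CPU and 4 GB of memory. Setting the replicas field to 2 requires 8 vCPUs in the node pool that is used to implement the RayCluster for the job.
The NUM_WORKERS field under runtimeEnvYAML in spec specifies the number of Ray actors to launch. Each Ray actor must be serviced by a worker pod in the Kubernetes cluster, so this field must be less than or equal to the replicas field. In this example, NUM_WORKERS is set to 2, which matches the replicas field.
The CPUS_PER_WORKER field must be less than or equal to the number of CPUs allocated to each worker pod minus 1. In this example, the CPU resource request per worker pod is 3, so CPUS_PER_WORKER is set to 2.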
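
The two sizing rules above can be checked before the job is submitted. The following is a minimal sketch, not part of KubeRay or the sample repository: the function name check_rayjob_sizing and its argument order are made up for illustration, and the values passed at the end are the ones this article uses.

```shell
# Sketch: sanity-check the RayJob sizing rules described above.
# check_rayjob_sizing is a hypothetical helper, not a KubeRay tool.
#
# Usage: check_rayjob_sizing REPLICAS WORKER_CPUS NUM_WORKERS CPUS_PER_WORKER
# Prints "ok" and returns 0 when both rules hold; otherwise prints the
# violated rule and returns 1.
check_rayjob_sizing() {
  local replicas=$1 worker_cpus=$2 num_workers=$3 cpus_per_worker=$4

  # Rule 1: every Ray actor needs a worker pod to run on.
  if [ "$num_workers" -gt "$replicas" ]; then
    echo "NUM_WORKERS ($num_workers) exceeds replicas ($replicas)"
    return 1
  fi

  # Rule 2: CPUS_PER_WORKER must not exceed the worker pod's CPUs minus 1.
  if [ "$cpus_per_worker" -gt $((worker_cpus - 1)) ]; then
    echo "CPUS_PER_WORKER ($cpus_per_worker) exceeds worker CPUs minus 1 ($((worker_cpus - 1)))"
    return 1
  fi

  echo "ok"
}

# Values from this article: 2 replicas, 3 CPUs per worker pod,
# NUM_WORKERS=2, CPUS_PER_WORKER=2.
check_rayjob_sizing 2 3 2 2
```

If you change replicas or the worker CPU request in the specification, rerun the check with the new values before applying the file.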

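To summarize, you need a total of 8 vCPUs in the node pool to run the PyTorch model training job. Because the system node pool is tainted so that user pods can't be scheduled on it, you must create a new node pool with at least 8 vCPUs to host the Ray cluster.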
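
As a rough cross-check of that figure, the pod CPU requests listed earlier can be added up. This is a sketch under the assumption that only the worker and head pod requests matter; it does not model the capacity that each node reserves for system components, and required_cpus is a hypothetical helper name.

```shell
# Sketch: add up the CPU requests of the Ray pods.
# required_cpus is a hypothetical helper used only for illustration.
#
# Usage: required_cpus REPLICAS WORKER_CPUS HEAD_CPUS
required_cpus() {
  local replicas=$1 worker_cpus=$2 head_cpus=$3
  echo $((replicas * worker_cpus + head_cpus))
}

# Values from this article: 2 workers x 3 CPUs + 1 CPU for the head pod.
requested=$(required_cpus 2 3 1)
echo "CPU requests: $requested"   # prints "CPU requests: 7"
```

The requests total 7 CPUs, which fits within the 8 vCPUs that this article calls for.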

Download the Ray Job specification file using the following command:

```
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml
```

Make any necessary modifications to the Ray Job specification file.
Launch the PyTorch model training job using the kubectl apply command.
```
kubectl apply -n kuberay -f ray-job.pytorch-mnist.yaml
```

Verify the RayJob deployment

Verify that two worker pods and one head pod are running in the namespace using the kubectl get pods command.

```
kubectl get pods -n kuberay
```

Your output should look similar to the following example output:

NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-7d7998bcdb-9h8hx                         1/1     Running   0          3d2h
pytorch-mnist-raycluster-s7xd9-worker-small-group-knpgl   1/1     Running   0          6m15s
pytorch-mnist-raycluster-s7xd9-worker-small-group-p74cm   1/1     Running   0          6m15s
rayjob-pytorch-mnist-fc959                                1/1     Running   0          5m35s
rayjob-pytorch-mnist-raycluster-s7xd9-head-l24hn          1/1     Running   0          6m15s

Check the status of the RayJob using the kubectl get command.

```
kubectl get rayjob -n kuberay
```

Your output should look similar to the following example output:

NAME                   JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
rayjob-pytorch-mnist   RUNNING      Running             2024-11-22T03:08:22Z              9m36s
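
Instead of rerunning the command by hand, the JOB STATUS column in the output above can be polled from a script. The sketch below assumes that the column is backed by the .status.jobStatus field of the RayJob resource; wait_for_rayjob, its retry count, and its sleep interval are hypothetical choices, not part of the sample repository.

```shell
# Sketch: poll a RayJob until it reaches a terminal state.
# wait_for_rayjob is a hypothetical helper; the jsonpath assumes the
# JOB STATUS column is read from .status.jobStatus.
#
# Usage: wait_for_rayjob NAME NAMESPACE [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_rayjob() {
  local name=$1 namespace=$2 max_attempts=${3:-60} sleep_seconds=${4:-10}
  local attempt=1 status

  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(kubectl get rayjob "$name" -n "$namespace" \
      -o jsonpath='{.status.jobStatus}')
    case "$status" in
      SUCCEEDED) echo "SUCCEEDED"; return 0 ;;
      FAILED|STOPPED) echo "$status"; return 1 ;;
    esac
    # Any other value (for example RUNNING or empty) means keep waiting.
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done

  echo "TIMEOUT"
  return 1
}
```

For the job in this article, the call would be wait_for_rayjob rayjob-pytorch-mnist kuberay, which checks every 10 seconds for up to 10 minutes with the default arguments.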

Wait until the RayJob completes. This might take a few minutes. Once the JOB STATUS is SUCCEEDED, you can check the training logs. To do this, first get the name of the pod running the RayJob using the kubectl get pods command.

```
kubectl get pods -n kuberay
```

In the output, you should see a pod with a name that starts with rayjob-pytorch-mnist, similar to the following example output:

NAME                                                      READY   STATUS      RESTARTS   AGE
kuberay-operator-7d7998bcdb-9h8hx                         1/1     Running     0          3d2h
pytorch-mnist-raycluster-s7xd9-worker-small-group-knpgl   1/1     Running     0          14m
pytorch-mnist-raycluster-s7xd9-worker-small-group-p74cm   1/1     Running     0          14m
rayjob-pytorch-mnist-fc959                                0/1     Completed   0          13m
rayjob-pytorch-mnist-raycluster-s7xd9-head-l24hn          1/1     Running     0          14m

View the logs of the RayJob using the kubectl logs command. Make sure to replace rayjob-pytorch-mnist-fc959 with the name of the pod running your RayJob.

```
kubectl logs -n kuberay rayjob-pytorch-mnist-fc959
```

In the output, you should see the training logs for the PyTorch model, similar to the following example output:

2024-11-21 19:09:04,986 INFO cli.py:39 -- Job submission server address: http://rayjob-pytorch-mnist-raycluster-s7xd9-head-svc.kuberay.svc.cluster.local:8265
2024-11-21 19:09:05,712 SUCC cli.py:63 -- -------------------------------------------------------
2024-11-21 19:09:05,713 SUCC cli.py:64 -- Job 'rayjob-pytorch-mnist-hndpx' submitted successfully
2024-11-21 19:09:05,713 SUCC cli.py:65 -- -------------------------------------------------------
2024-11-21 19:09:05,713 INFO cli.py:289 -- Next steps
2024-11-21 19:09:05,713 INFO cli.py:290 -- Query the logs of the job:
2024-11-21 19:09:05,713 INFO cli.py:292 -- ray job logs rayjob-pytorch-mnist-hndpx
2024-11-21 19:09:05,713 INFO cli.py:294 -- Query the status of the job:
...

View detailed results here: /home/ray/ray_results/TorchTrainer_2024-11-21_19-11-23
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-11-21_19-08-24_556164_1/artifacts/2024-11-21_19-11-24/TorchTrainer_2024-11-21_19-11-23/driver_artifacts`

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                 │
├─────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker      16 │
│ train_loop_config/epochs                     10 │
│ train_loop_config/lr                      0.001 │
╰─────────────────────────────────────────────────╯
(RayTrainWorker pid=1193, ip=10.244.4.193) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=1138, ip=10.244.4.193) Started distributed worker processes:
(TorchTrainer pid=1138, ip=10.244.4.193) - (node_id=3ea81f12c0f73ebfbd5b46664e29ced00266e69355c699970e1d824b, ip=10.244.4.193, pid=1193) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=1138, ip=10.244.4.193) - (node_id=2b00ea2b369c9d27de9596ce329daad1d24626b149975cf23cd10ea3, ip=10.244.1.42, pid=1341) world_rank=1, local_rank=0, node_rank=1
(RayTrainWorker pid=1341, ip=10.244.1.42) Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
(RayTrainWorker pid=1193, ip=10.244.4.193) Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to /home/ray/data/FashionMNIST/raw/train-images-idx3-ubyte.gz
(RayTrainWorker pid=1193, ip=10.244.4.193)
  0%|          | 0.00/26.4M [00:00<?, ?B/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
  0%|          | 65.5k/26.4M [00:00<01:13, 356kB/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
100%|██████████| 26.4M/26.4M [00:01<00:00, 18.9MB/s]
(RayTrainWorker pid=1193, ip=10.244.4.193) Extracting /home/ray/data/FashionMNIST/raw/train-images-idx3-ubyte.gz to /home/ray/data/FashionMNIST/raw
(RayTrainWorker pid=1341, ip=10.244.1.42)
100%|██████████| 26.4M/26.4M [00:01<00:00, 18.7MB/s]
...
Training finished iteration 1 at 2024-11-21 19:15:46. Total running time: 4min 22s
╭───────────────────────────────╮
│ Training result               │
├───────────────────────────────┤
│ checkpoint_dir_name           │
│ time_this_iter_s        144.9 │
│ time_total_s            144.9 │
│ training_iteration          1 │
│ accuracy                0.805 │
│ loss                  0.52336 │
╰───────────────────────────────╯
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 0:  97%|█████████▋| 303/313 [00:01<00:00, 269.60it/s]
Test Epoch 0: 100%|██████████| 313/313 [00:01<00:00, 267.14it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Train Epoch 1:   0%|          | 0/1875 [00:00<?, ?it/s]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 0: 100%|██████████| 313/313 [00:01<00:00, 270.44it/s]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 0: 100%|█████████▉| 1866/1875 [00:24<00:00, 82.49it/s] [repeated 35x across cluster]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Train Epoch 0: 100%|██████████| 1875/1875 [00:24<00:00, 77.99it/s]
Train Epoch 0: 100%|██████████| 1875/1875 [00:24<00:00, 76.19it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 0:   0%|          | 0/313 [00:00<?, ?it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 0:  88%|████████▊ | 275/313 [00:01<00:00, 265.39it/s] [repeated 19x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 1:  19%|█▉        | 354/1875 [00:04<00:18, 82.66it/s] [repeated 80x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 1:   0%|          | 0/1875 [00:00<?, ?it/s]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 1:  40%|████      | 757/1875 [00:09<00:13, 83.01it/s] [repeated 90x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 1:  62%|██████▏   | 1164/1875 [00:14<00:08, 83.39it/s] [repeated 92x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 1:  82%|████████▏ | 1533/1875 [00:19<00:05, 68.09it/s] [repeated 91x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 1:  91%|█████████▏| 1713/1875 [00:22<00:02, 70.20it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Train Epoch 1:  91%|█████████ | 1707/1875 [00:22<00:02, 70.04it/s] [repeated 47x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 1:   0%|          | 0/313 [00:00<?, ?it/s]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 1:   8%|▊         | 24/313 [00:00<00:01, 237.98it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 1:  96%|█████████▋| 302/313 [00:01<00:00, 250.76it/s]
Test Epoch 1: 100%|██████████| 313/313 [00:01<00:00, 262.94it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Train Epoch 2:   0%|          | 0/1875 [00:00<?, ?it/s]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 1:  92%|█████████▏| 289/313 [00:01<00:00, 222.57it/s]

Training finished iteration 2 at 2024-11-21 19:16:12. Total running time: 4min 48s
╭───────────────────────────────╮
│ Training result               │
├───────────────────────────────┤
│ checkpoint_dir_name           │
│ time_this_iter_s       25.975 │
│ time_total_s          170.875 │
│ training_iteration          2 │
│ accuracy                0.828 │
│ loss                  0.45946 │
╰───────────────────────────────╯
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 1: 100%|██████████| 313/313 [00:01<00:00, 226.04it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Train Epoch 1: 100%|██████████| 1875/1875 [00:24<00:00, 76.24it/s] [repeated 45x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 2:  13%|█▎        | 239/1875 [00:03<00:24, 67.30it/s] [repeated 64x across cluster]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 1:   0%|          | 0/313 [00:00<?, ?it/s]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 1:  85%|████████▍ | 266/313 [00:01<00:00, 222.54it/s] [repeated 20x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
..

Training completed after 10 iterations at 2024-11-21 19:19:47. Total running time: 8min 23s
2024-11-21 19:19:47,596 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/TorchTrainer_2024-11-21_19-11-23' in 0.0029s.

Training result: Result(
  metrics={'loss': 0.35892221605786073, 'accuracy': 0.872},
  path='/home/ray/ray_results/TorchTrainer_2024-11-21_19-11-23/TorchTrainer_74867_00000_0_2024-11-21_19-11-24',
  filesystem='local',
  checkpoint=None
)
(RayTrainWorker pid=1341, ip=10.244.1.42) Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz [repeated 7x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42) Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to /home/ray/data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz [repeated 7x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42) Extracting /home/ray/data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to /home/ray/data/FashionMNIST/raw [repeated 7x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 9:  91%|█████████ | 1708/1875 [00:21<00:01, 83.84it/s] [repeated 23x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Train Epoch 9: 100%|██████████| 1875/1875 [00:23<00:00, 78.52it/s] [repeated 37x across cluster]
(RayTrainWorker pid=1341, ip=10.244.1.42)
Test Epoch 9:   0%|          | 0/313 [00:00<?, ?it/s]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 9:  89%|████████▉ | 278/313 [00:01<00:00, 266.46it/s] [repeated 19x across cluster]
(RayTrainWorker pid=1193, ip=10.244.4.193)
Test Epoch 9:  97%|█████████▋| 305/313 [00:01<00:00, 256.69it/s]
Test Epoch 9: 100%|██████████| 313/313 [00:01<00:00, 267.35it/s]
2024-11-21 19:19:51,728 SUCC cli.py:63 -- ------------------------------------------
2024-11-21 19:19:51,728 SUCC cli.py:64 -- Job 'rayjob-pytorch-mnist-hndpx' succeeded
2024-11-21 19:19:51,728 SUCC cli.py:65 -- ------------------------------------------

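View training results on the Ray Dashboard

When the RayJob successfully completes, you can view the training results on the Ray Dashboard. The Ray Dashboard provides real-time monitoring and visualizations of Ray clusters. You can use the Ray Dashboard to monitor the status of Ray clusters, view logs, and visualize the results of machine learning jobs.

To access the Ray Dashboard, you need to expose the Ray head service to the public internet by creating a service shim that exposes the Ray head service on port 80 instead of port 8265.

Note

The deploy.sh script described in the previous section automatically exposes the Ray head service to the public internet. The following steps are included in the deploy.sh script.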

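Get the name of the Ray head service and save it in a shell variable using the following command. The kuberay_namespace variable is assumed to hold the namespace used throughout this article (kuberay).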

```
rayclusterhead=$(kubectl get service -n $kuberay_namespace | grep 'rayjob-pytorch-mnist-raycluster' | grep 'ClusterIP' | awk '{print $1}')
```

Create the service shim to expose the Ray head service on port 80 using the kubectl expose service command.

```
kubectl expose service $rayclusterhead \
-n $kuberay_namespace \
--port=80 \
--target-port=8265 \
--type=NodePort \
--name=ray-dash
```

Create the ingress to expose the service shim through the ingress controller using the following command:

```
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dash
  namespace: kuberay
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: webapprouting.kubernetes.azure.com
  rules:
  - http:
      paths:
      - backend:
          service:
            name: ray-dash
            port:
              number: 80
        path: /
        pathType: Prefix
EOF
```

Get the public IP address of the ingress controller using the kubectl get service command.
```
kubectl get service -n app-routing-system
```
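In the output, you should see the public IP address of the load balancer attached to the ingress controller. Copy the public IP address and paste it into a web browser. You should see the Ray Dashboard.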
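
To avoid copying the address by hand, the IP can be read directly with a jsonpath query. This is a sketch that assumes the ingress controller service is named nginx in the app-routing-system namespace, which you should confirm against the output of the previous command; ingress_public_ip is a hypothetical helper name.

```shell
# Sketch: print the public IP of the ingress controller's load balancer.
# ingress_public_ip is a hypothetical helper. The default service name
# "nginx" is an assumption; check it against your cluster.
#
# Usage: ingress_public_ip [SERVICE] [NAMESPACE]
ingress_public_ip() {
  local service=${1:-nginx} namespace=${2:-app-routing-system}
  kubectl get service "$service" -n "$namespace" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

For example, echo "http://$(ingress_public_ip)/" prints a URL for the dashboard that you can open in a browser.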

Clean up resources

To clean up the resources created in this guide, you can delete the Azure resource group that contains the AKS cluster.
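
With the Azure CLI, the deletion can be sketched as follows. The function delete_resource_group is a hypothetical wrapper, and the resource group name is a placeholder that you must replace with the group created by your deployment. Deleting a resource group is irreversible.

```shell
# Sketch: delete the resource group that contains the AKS cluster.
# delete_resource_group is a hypothetical wrapper around az group delete.
#
# Usage: delete_resource_group NAME
delete_resource_group() {
  local name=$1
  if [ -z "$name" ]; then
    echo "usage: delete_resource_group <resource-group-name>" >&2
    return 1
  fi
  # --yes skips the confirmation prompt; --no-wait returns without
  # waiting for the deletion to finish.
  az group delete --name "$name" --yes --no-wait
}
```

Call it as delete_resource_group followed by your resource group name. Drop --yes from the function if you prefer to confirm the deletion interactively.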

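Next steps

To learn more about AI and machine learning workloads on AKS, see the following articles:

Contributors

Microsoft maintains this article. The following contributors originally wrote it:

Russell de Pina | Principal TPM
Ken Kilty | Principal TPM
Erin Schaffer | Content Developer 2
Adrian Joian | Principal Customer Engineer
Ryan Graham | Principal Technical Specialist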

Last updated on 2024-12-31