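Spark Submit task deprecation notice and migration guide

Warning

The Spark Submit task is deprecated and pending removal. Using this task type is not allowed for new use cases and is strongly discouraged for existing customers. For the original documentation for this task type, see Spark Submit (legacy). Keep reading for migration instructions.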

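Why is Spark Submit being deprecated?

The Spark Submit task type is being deprecated because of technical limitations and feature gaps that the JAR, Notebook, and Python script tasks do not have. Those task types offer better integration with Databricks features, improved performance, and greater reliability.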

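Deprecation measures

Databricks is implementing the following measures as part of the deprecation:

Restricted creation: Starting in November 2025, only users who used Spark Submit tasks in the preceding month can create new Spark Submit tasks. If you need an exception, contact your account support.
Databricks Runtime version restrictions: Spark Submit usage is restricted to existing Databricks Runtime versions and their maintenance releases. Existing Databricks Runtime versions that support Spark Submit continue to receive security and bug-fix maintenance releases until the feature is shut down completely. Databricks Runtime 17.3 and above, and 18.x and above, do not support this task type.
UI warnings: Warnings appear throughout the Databricks UI wherever Spark Submit tasks are in use, and communications are sent to workspace admins in the accounts of existing users.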
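The runtime restriction above can be checked mechanically when auditing cluster configurations. The following is a minimal sketch, not an official check: the helper name `supports_spark_submit` is hypothetical, and it assumes runtime version strings begin with `<major>.<minor>` (for example `16.4.x-scala2.12`). The 17.3 threshold is taken from the statement above.

```python
import re

# Threshold taken from the notice: Databricks Runtime 17.3 and above
# do not support the Spark Submit task type.
FIRST_UNSUPPORTED = (17, 3)


def supports_spark_submit(spark_version: str) -> bool:
    """Return True if the runtime version predates the 17.3 cutoff.

    Hypothetical helper. Assumes the version string starts with
    "<major>.<minor>", for example "16.4.x-scala2.12".
    """
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"Unrecognized runtime version: {spark_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major version first, then minor version.
    return (major, minor) < FIRST_UNSUPPORTED
```

A version string that does not match the assumed format raises `ValueError` rather than being silently treated as supported.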

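Migrate JVM workloads to JAR tasks

For JVM workloads, migrate your Spark Submit tasks to JAR tasks. JAR tasks provide better feature support and integration with Databricks.

To migrate, follow these steps:

Create a new JAR task in your job.
In the Spark Submit task parameters, identify the first three arguments. They generally follow this pattern: ["--class", "org.apache.spark.mainClassName", "dbfs:/path/to/jar_file.jar"]
Remove the --class parameter.
Set the main class name (for example, org.apache.spark.mainClassName) as the Main class of the JAR task.
Provide the path to the JAR file (for example, dbfs:/path/to/jar_file.jar) in the JAR task configuration.
Copy the remaining arguments from the Spark Submit task to the JAR task parameters.
Run the JAR task and verify that it works as expected.

For more information about configuring JAR tasks, see JAR task.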
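The argument split described in the steps above can be sketched in a few lines of Python. This is an illustrative helper only: the function name `split_spark_submit_parameters` and the keys of the returned dictionary are assumptions, and it handles only the common pattern in which `--class` is the first argument. Parameter lists that begin with other spark-submit options need to be handled by hand.

```python
from typing import Dict, List, Union


def split_spark_submit_parameters(
    parameters: List[str],
) -> Dict[str, Union[str, List[str]]]:
    """Split Spark Submit parameters into the pieces a JAR task needs.

    Hypothetical helper. Only handles the pattern
    ["--class", <main class>, <jar path>, *remaining arguments].
    """
    if len(parameters) < 3 or parameters[0] != "--class":
        raise ValueError(
            "Expected parameters to start with ['--class', <main class>, <jar path>]"
        )
    main_class, jar_path = parameters[1], parameters[2]
    return {
        "main_class_name": main_class,      # becomes the Main class of the JAR task
        "jar_path": jar_path,               # goes into the JAR task configuration
        "parameters": list(parameters[3:]), # copied to the JAR task parameters
    }


if __name__ == "__main__":
    example = [
        "--class",
        "org.apache.spark.mainClassName",
        "dbfs:/path/to/jar_file.jar",
        "--input",      # placeholder application argument
        "some-value",   # placeholder application argument
    ]
    print(split_spark_submit_parameters(example))
```

Raising an error on any other shape is deliberate: a silent guess about which argument is the JAR path could produce a JAR task that starts the wrong class.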

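Migrate R workloads

If you launch an R script directly from a Spark Submit task, several migration paths are available.

Option A: Use Notebook tasks

Migrate your R script to a Databricks notebook. Notebook tasks support the full feature set, including cluster autoscaling, and provide better integration with the Databricks platform.

Option B: Bootstrap R scripts from a Notebook task

Use a Notebook task to bootstrap your R script. Create a notebook with the following code and reference your R file as a job parameter. If needed, modify the notebook to add the parameters that your R script uses.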

dbutils.widgets.text("script_path", "", "Path to script")
script_path <- dbutils.widgets.get("script_path")
source(script_path)
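To show how the `script_path` widget above might receive its value, the sketch below builds a Notebook task definition as a plain Python dictionary. It assumes the task is defined through the Jobs API `notebook_task` object with `base_parameters`; the helper name, task key, and both paths are hypothetical placeholders.

```python
import json
from typing import Any, Dict


def build_r_bootstrap_task(
    task_key: str, notebook_path: str, script_path: str
) -> Dict[str, Any]:
    """Build a Notebook task definition that passes script_path to the notebook.

    Hypothetical helper. The key under base_parameters must match the
    widget name used in the bootstrap notebook ("script_path").
    """
    return {
        "task_key": task_key,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": {"script_path": script_path},
        },
    }


if __name__ == "__main__":
    task = build_r_bootstrap_task(
        task_key="run_r_script",                         # placeholder task key
        notebook_path="/Workspace/Shared/r_bootstrap",   # placeholder notebook path
        script_path="/Workspace/Shared/scripts/main.R",  # placeholder R script path
    )
    print(json.dumps(task, indent=2))
```

Any additional parameters that the R script needs would be added both as widgets in the notebook and as extra keys under `base_parameters`.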

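Find jobs that use Spark Submit tasks

You can use the following Python scripts to identify jobs in your workspace that contain Spark Submit tasks. You need a valid personal access token or other token, and you must use your workspace URL.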
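The scripts in this section mark a host value that carries a `?o=...` query string as incorrect. As a small convenience, the sketch below reduces a URL copied from the browser to the scheme and host only. The helper name `normalize_workspace_url` is hypothetical and is not part of the Databricks SDK.

```python
from urllib.parse import urlsplit


def normalize_workspace_url(url: str) -> str:
    """Strip any path, query string, or fragment from a workspace URL.

    Hypothetical helper. Requires an explicit scheme such as https://.
    """
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not an absolute URL: {url!r}")
    # Keep only the scheme and host; drop "/", "?o=...", and anything else.
    return f"{parts.scheme}://{parts.netloc}"


if __name__ == "__main__":
    print(
        normalize_workspace_url(
            "https://your-workspace.cloud.databricks.com/?o=12345678910"
        )
    )
```

The result can then be exported as `DATABRICKS_HOST` before running either script.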

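Option A: Fast scan (run this first, persistent jobs only)

This script scans only persistent jobs (created via /jobs/create or the web interface) and does not include ephemeral jobs created via /runs/submit. It is the recommended first-line method for identifying Spark Submit usage because it is much faster.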

#!/usr/bin/env python3
"""
Requirements:
    databricks-sdk>=0.20.0

Usage:
    export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
    export DATABRICKS_TOKEN="your-token"
    python3 list_spark_submit_jobs.py

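Output:
    CSV on stdout with columns: job_id, owner_email, job_name
    (progress and summary messages go to stderr)

Incorrect (do not append the ?o= workspace query string to the host):
    export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com/?o=12345678910"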
"""

import csv
import os
import sys
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import PermissionDenied


def main():
    # Get credentials from environment
    workspace_url = os.environ.get("DATABRICKS_HOST")
    token = os.environ.get("DATABRICKS_TOKEN")

    if not workspace_url or not token:
        print(
            "Error: Set DATABRICKS_HOST and DATABRICKS_TOKEN environment variables",
            file=sys.stderr,
        )
        sys.exit(1)

    # Initialize client
    client = WorkspaceClient(host=workspace_url, token=token)

    # Scan workspace for persistent jobs with Spark Submit tasks
    # Using list() to scan only persistent jobs (faster than list_runs())
    print(
        "Scanning workspace for persistent jobs with Spark Submit tasks...",
        file=sys.stderr,
    )
    jobs_with_spark_submit = []
    total_jobs = 0

    # Iterate through all jobs (pagination is handled automatically by the SDK)
    skipped_jobs = 0
    for job in client.jobs.list(expand_tasks=True, limit=25):
        try:
            total_jobs += 1
            if total_jobs % 1000 == 0:
                print(f"Scanned {total_jobs} jobs total", file=sys.stderr)

            # Check if job has any Spark Submit tasks
            if job.settings and job.settings.tasks:
                has_spark_submit = any(
                    task.spark_submit_task is not None for task in job.settings.tasks
                )

                if has_spark_submit:
                    # Extract job information
                    job_id = job.job_id
                    owner_email = job.creator_user_name or "Unknown"
                    job_name = job.settings.name or f"Job {job_id}"

                    jobs_with_spark_submit.append(
                        {"job_id": job_id, "owner_email": owner_email, "job_name": job_name}
                    )
        except PermissionDenied:
            # Skip jobs that the user doesn't have permission to access
            skipped_jobs += 1
            continue

    # Print summary to stderr
    print(f"Scanned {total_jobs} jobs total", file=sys.stderr)
    if skipped_jobs > 0:
        print(
            f"Skipped {skipped_jobs} jobs due to insufficient permissions",
            file=sys.stderr,
        )
    print(
        f"Found {len(jobs_with_spark_submit)} jobs with Spark Submit tasks",
        file=sys.stderr,
    )
    print("", file=sys.stderr)

    # Output CSV to stdout
    if jobs_with_spark_submit:
        writer = csv.DictWriter(
            sys.stdout,
            fieldnames=["job_id", "owner_email", "job_name"],
            quoting=csv.QUOTE_MINIMAL,
        )
        writer.writeheader()
        writer.writerows(jobs_with_spark_submit)
    else:
        print("No jobs with Spark Submit tasks found.", file=sys.stderr)


if __name__ == "__main__":
    main()
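Once the fast scan has produced its CSV, a natural next step is to group the affected jobs by owner so that each owner can be contacted once. The sketch below is one possible way to do that, assuming the CSV was saved with the header row `job_id,owner_email,job_name` written by the script above. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def group_jobs_by_owner(csv_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group (job_id, job_name) pairs by owner_email.

    Hypothetical helper. Expects CSV text with the header
    job_id,owner_email,job_name.
    """
    owners: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    # newline="" leaves line endings untouched so the csv module can parse them.
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        owners[row["owner_email"]].append((row["job_id"], row["job_name"]))
    return dict(owners)
```

For example, after running `python3 list_spark_submit_jobs.py > jobs.csv`, read `jobs.csv` as text and pass its contents to `group_jobs_by_owner`.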

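Option B: Comprehensive scan (slower, includes ephemeral jobs from the last 30 days)

If you need to identify ephemeral jobs created via /runs/submit, use this more exhaustive script. It scans all completed job runs from the last 30 days in your workspace, covering both persistent jobs (created via /jobs/create) and ephemeral jobs. This script can take hours to run in large workspaces.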

#!/usr/bin/env python3
"""
Requirements:
    databricks-sdk>=0.20.0

Usage:
    export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
    export DATABRICKS_TOKEN="your-token"
    python3 list_spark_submit_runs.py

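Output:
    CSV on stdout with columns: job_id, run_id, owner_email, job_name
    (progress and summary messages go to stderr)

Incorrect (do not append the ?o= workspace query string to the host):
    export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com/?o=12345678910"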
"""

import csv
import os
import sys
import time
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import PermissionDenied


def main():
    # Get credentials from environment
    workspace_url = os.environ.get("DATABRICKS_HOST")
    token = os.environ.get("DATABRICKS_TOKEN")

    if not workspace_url or not token:
        print(
            "Error: Set DATABRICKS_HOST and DATABRICKS_TOKEN environment variables",
            file=sys.stderr,
        )
        sys.exit(1)

    # Initialize client
    client = WorkspaceClient(host=workspace_url, token=token)

    thirty_days_ago_ms = int((time.time() - 30 * 24 * 60 * 60) * 1000)

    # Scan workspace for runs with Spark Submit tasks
    # Using list_runs() instead of list() to include ephemeral jobs created via /runs/submit
    print(
        "Scanning workspace for runs with Spark Submit tasks from the last 30 days... (this will take more than an hour in large workspaces)",
        file=sys.stderr,
    )
    runs_with_spark_submit = []
    total_runs = 0
    seen_job_ids = set()

    # Iterate through all runs (pagination is handled automatically by the SDK)
    skipped_runs = 0
    for run in client.jobs.list_runs(
        expand_tasks=True,
        limit=25,
        completed_only=True,
        start_time_from=thirty_days_ago_ms,
    ):
        try:
            total_runs += 1
            if total_runs % 1000 == 0:
                print(f"Scanned {total_runs} runs total", file=sys.stderr)

            # Check if run has any Spark Submit tasks
            if run.tasks:
                has_spark_submit = any(
                    task.spark_submit_task is not None for task in run.tasks
                )

                if has_spark_submit:
                    # Extract job information from the run
                    job_id = run.job_id if run.job_id else "N/A"
                    run_id = run.run_id if run.run_id else "N/A"
                    owner_email = run.creator_user_name or "Unknown"
                    # Use run name if available, otherwise try to construct a name
                    run_name = run.run_name or (
                        f"Run {run_id}" if run_id != "N/A" else "Unnamed Run"
                    )

                    # Deduplicate on (job_id, run_id) so that each run is reported
                    # at most once; several runs of the same job remain separate rows
                    key = (job_id, run_id)
                    if key not in seen_job_ids:
                        seen_job_ids.add(key)
                        runs_with_spark_submit.append(
                            {
                                "job_id": job_id,
                                "run_id": run_id,
                                "owner_email": owner_email,
                                "job_name": run_name,
                            }
                        )
        except PermissionDenied:
            # Skip runs that the user doesn't have permission to access
            skipped_runs += 1
            continue

    # Print summary to stderr
    print(f"Scanned {total_runs} runs total", file=sys.stderr)
    if skipped_runs > 0:
        print(
            f"Skipped {skipped_runs} runs due to insufficient permissions",
            file=sys.stderr,
        )
    print(
        f"Found {len(runs_with_spark_submit)} runs with Spark Submit tasks",
        file=sys.stderr,
    )
    print("", file=sys.stderr)

    # Output CSV to stdout
    if runs_with_spark_submit:
        writer = csv.DictWriter(
            sys.stdout,
            fieldnames=["job_id", "run_id", "owner_email", "job_name"],
            quoting=csv.QUOTE_MINIMAL,
        )
        writer.writeheader()
        writer.writerows(runs_with_spark_submit)
    else:
        print("No runs with Spark Submit tasks found.", file=sys.stderr)


if __name__ == "__main__":
    main()
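Because the comprehensive scan emits one row per run, a job that runs frequently appears many times in the output. The sketch below collapses the CSV into a run count per `job_id`, which may help prioritize migration work. It assumes the header `job_id,run_id,owner_email,job_name` written by the script above, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def count_runs_per_job(csv_text: str) -> Counter:
    """Count how many reported runs belong to each job_id.

    Hypothetical helper. Expects CSV text with the header
    job_id,run_id,owner_email,job_name.
    """
    counts: Counter = Counter()
    # newline="" leaves line endings untouched so the csv module can parse them.
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        counts[row["job_id"]] += 1
    return counts
```

Calling `count_runs_per_job(text).most_common(10)` lists the ten job IDs with the most Spark Submit runs in the scanned window.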

Need help?

If you need additional help, contact your account support.

Last updated on 2025-12-30