
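Model training on serverless compute

Applies to: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

You don't need to create and manage compute to train your model in a scalable way. You can instead submit your job to a compute target type called serverless compute. Serverless compute is the easiest way to run training jobs on Azure Machine Learning. It's fully managed, on-demand compute: Azure Machine Learning creates, scales, and manages the compute for you. When you train models on serverless compute, you can focus on building machine learning models without having to learn about compute infrastructure or set it up.

You can specify the resources the job needs. Azure Machine Learning manages the compute infrastructure and provides managed network isolation, which reduces the burden on you.

Enterprises can also reduce costs by specifying optimal resources for each job. IT admins can still apply control by specifying core quota at the subscription and workspace levels and by applying Azure policies.

You can use serverless compute to fine-tune models in the model catalog. You can use it to run all types of jobs from Azure Machine Learning studio, the Python SDK, and Azure CLI. You can also use serverless compute to build environment images and for responsible AI dashboard scenarios. Serverless jobs consume the same quota as Azure Machine Learning compute quota. You can choose the standard (dedicated) tier or spot (low-priority) VMs. Managed identity and user identity are supported for serverless jobs. The billing model is the same as for Azure Machine Learning compute.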

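Advantages of serverless compute

Azure Machine Learning manages creating, setting up, scaling, deleting, and patching compute infrastructure, which reduces management overhead.
You don't need to learn about compute, the various compute types, or related properties.
You don't need to repeatedly create clusters for each VM size you need, using the same settings and replicating them for each workspace.
You can optimize costs by specifying the exact resources each job needs at runtime: the instance type (VM size) and the instance count. You can also monitor the job's utilization metrics to optimize the resources it needs.
Fewer steps are required to run a job.
To simplify job submission further, you can skip the resources altogether. Azure Machine Learning defaults the instance count and chooses an instance type based on factors like quota, cost, performance, and disk size.
In some scenarios, wait times before jobs start running are reduced.
User identity and workspace user-assigned managed identity are supported for job submission.
With managed network isolation, you can streamline and automate your network isolation configuration. Customer virtual networks are also supported.
Administrative control is available through quota and Azure policies.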

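How to use serverless compute

When you create your own compute cluster, you use its name in the command job, for example, compute="cpu-cluster". With serverless, you can skip creating a compute cluster and omit the compute parameter to use serverless compute instead. If compute isn't specified for a job, the job runs on serverless compute. Omit the compute name in your Azure CLI or Python SDK jobs to use serverless compute in the following job types, and optionally provide the resources the job needs in terms of instance count and instance type:
- Command jobs, including interactive jobs and distributed training
- AutoML jobs
- Sweep jobs
- Parallel jobs
For pipeline jobs via Azure CLI, use default_compute: azureml:serverless for pipeline-level default compute. For pipeline jobs via the Python SDK, use default_compute="serverless". For an example, see Pipeline job.
When you submit a training job in studio, select Serverless as the compute type.
When you use Azure Machine Learning designer, select Serverless as the default compute.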
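
As a rough illustration of the fallback rule described above, the following sketch models how a job with no compute value ends up on serverless compute. The function resolve_compute_target and the dictionary layout are hypothetical and exist only for this example; they aren't part of the Azure Machine Learning SDK or CLI, and the real decision is made by the service.

```python
# Hypothetical helper for illustration; not an Azure Machine Learning SDK function.
# It only models the documented rule: a job that names no compute runs on serverless compute.

SERVERLESS = "serverless"


def resolve_compute_target(job_spec: dict) -> str:
    """Return the compute target a job spec would use under the documented rule."""
    compute = job_spec.get("compute")
    if compute:
        # A named cluster such as "cpu-cluster" is used as-is.
        return compute
    # No compute specified: the job falls back to serverless compute.
    return SERVERLESS


with_cluster = {"command": "echo 'hello world'", "compute": "cpu-cluster"}
without_compute = {"command": "echo 'hello world'"}

print(resolve_compute_target(with_cluster))
print(resolve_compute_target(without_compute))
```

The sketch covers only command-style jobs. For pipeline jobs, the text above says to set the pipeline-level default compute explicitly.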

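Performance considerations

Serverless compute can increase the speed of your training in the following ways:

Avoid insufficient quota failures. When you create your own compute cluster, you're responsible for determining the VM size and node count. When your job runs, it fails if you don't have sufficient quota for the cluster. Serverless compute uses information about your quota to select an appropriate VM size by default.

Scale-down optimization. When a compute cluster is scaling down, a new job has to wait for the cluster to scale down and then scale up again before it can run. With serverless compute, you don't have to wait for scale-down. Your job can start running on another cluster or node (assuming you have quota).

Cluster-busy optimization. When a job is running on a compute cluster and another job is submitted, your job is queued behind the currently running job. With serverless compute, your job can start running on another node or cluster (assuming you have quota).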

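Quota

When you submit a job, you still need sufficient Azure Machine Learning compute quota to proceed (both workspace-level and subscription-level quota). The default VM size for serverless jobs is selected based on this quota. If you specify your own VM size or family:

If you have some quota for your VM size or family but not enough for the number of instances, you see an error. The error recommends that you decrease the number of instances to a valid number based on your quota limit, request a quota increase for the VM family, or change the VM size.
If you don't have quota for your specified VM size, you see an error. The error recommends that you select a different VM size for which you have quota or request quota for the VM family.
If you have sufficient quota for the VM family to run the serverless job but other jobs are using the quota, you get a message that says your job must wait in a queue until quota is available.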
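
The three outcomes above can be sketched as a small decision function. This is a simplified model under stated assumptions: quota is counted in cores per VM family, and the function name quota_outcome and its return labels are hypothetical. The actual checks and messages come from the Azure Machine Learning service, not from client code.

```python
# Hypothetical model of the quota outcomes described above.
# Assumption: quota is tracked as a number of cores per VM family.


def quota_outcome(
    requested_instances: int,
    cores_per_instance: int,
    family_quota_cores: int,
    cores_in_use: int,
) -> str:
    """Classify a request as 'ok', 'queued', 'reduce_instances', or 'no_quota'."""
    needed = requested_instances * cores_per_instance
    if family_quota_cores < cores_per_instance:
        # Not even one instance fits: pick another VM size or request quota.
        return "no_quota"
    if family_quota_cores < needed:
        # Some quota exists, but not enough for the requested instance count.
        return "reduce_instances"
    if family_quota_cores - cores_in_use < needed:
        # Enough quota overall, but other jobs are currently using it.
        return "queued"
    return "ok"


print(quota_outcome(4, 4, 0, 0))    # no quota for the VM size
print(quota_outcome(4, 4, 8, 0))    # quota for 2 instances, 4 requested
print(quota_outcome(2, 4, 16, 12))  # quota exists but is in use
print(quota_outcome(2, 4, 16, 0))   # request fits
```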

When you view your usage and quotas in the Azure portal, you see the name Serverless for all quota consumed by serverless jobs.

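Identity support and credential passthrough

User credential passthrough: Serverless compute fully supports user credential passthrough. The user token of the user who submits the job is used for storage access. These credentials come from Microsoft Entra ID.

Serverless compute doesn't support system-assigned identity.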

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient     # Handle to the workspace.
from azure.identity import DefaultAzureCredential     # Authentication package.
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com.
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription ID>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    identity=UserIdentityConfiguration(),
)
# Submit the command job.
ml_client.create_or_update(job)

Create a file named hello.yaml with the following contents:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
identity:
  type: user_identity

Submit the job with the following command:

az ml job create --file hello.yaml --resource-group my-resource-group --workspace-name my-workspace

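The rest of the Azure CLI examples show variations of the hello.yaml file. Run each of them the same way.

User-assigned managed identity: When your workspace is configured with a user-assigned managed identity, you can use that identity with the serverless job for storage access. For information about accessing secrets, see Use authentication credential secrets in Azure Machine Learning jobs.

Verify your workspace identity configuration.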

Python SDK
Azure CLI

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group>"
workspace = "<your-workspace-name>"

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id,
    resource_group,
    workspace
)

# Get workspace details.
ws = ml_client.workspaces.get(name=workspace)
print(ws)

az ml workspace show --name <workspace-name> --resource-group <resource-group-name>

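Look for the user-assigned identity in the output. If it's missing, create a new workspace with a user-assigned managed identity by following the instructions in Set up authentication between Azure Machine Learning and other services.

Use your user-assigned managed identity in your job.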

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient     # Handle to the workspace.
from azure.identity import DefaultAzureCredential    # Authentication package.
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import ManagedIdentityConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com.
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure-subscription-ID>", 
    resource_group_name="<Azure-resource-group>",
    workspace_name="<Azure-Machine-Learning-workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    identity=ManagedIdentityConfiguration(client_id="<workspace-UAMI-client-ID>"),
)
# Submit the command job.
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
identity:
  type: managed

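Configure properties for command jobs

If no compute target is specified for command, sweep, and AutoML jobs, the compute defaults to serverless compute. Here's an example: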

Python SDK
Azure CLI

from azure.ai.ml import command 
from azure.ai.ml import MLClient # Handle to the workspace.
from azure.identity import DefaultAzureCredential # Authentication package.

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com.
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure-subscription-ID>", 
    resource_group_name="<Azure-resource-group>",
    workspace_name="<Azure-Machine-Learning-workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
)
# Submit the command job.
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest

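The compute defaults to serverless compute with:

A single node for this job. The default number of nodes is based on the type of job. See the following sections for other job types.
A CPU virtual machine. The VM is determined based on quota, performance, cost, and disk size.
Dedicated virtual machines.
The workspace location.

You can override these defaults. If you want to specify the VM type or the number of nodes for serverless compute, add resources to your job:

Use instance_type to choose a specific VM. Use this parameter if you want a specific CPU or GPU VM size.

Use instance_count to specify the number of nodes.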

Python SDK
Azure CLI

from azure.ai.ml import command 
from azure.ai.ml import MLClient # Handle to the workspace.
from azure.identity import DefaultAzureCredential # Authentication package.
from azure.ai.ml.entities import JobResourceConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com.
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure-subscription-ID>", 
    resource_group_name="<Azure-resource-group>",
    workspace_name="<Azure-Machine-Learning-workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    resources=JobResourceConfiguration(instance_type="Standard_NC24", instance_count=4),
)
# Submit the command job.
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
resources:
  instance_count: 4
  instance_type: Standard_NC24
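
To make the override behavior concrete, here is a minimal sketch of how user-supplied resources could be layered over the defaults for a command job. The names DEFAULT_RESOURCES and effective_resources are hypothetical and not part of the Azure Machine Learning SDK. None is used here as a stand-in for "let the service choose the VM size".

```python
# Hypothetical helper for illustration; not an Azure Machine Learning SDK function.
from typing import Optional

# Defaults described in the text for a command job: one node, and a VM size
# picked by the service. None stands for "let Azure Machine Learning choose".
DEFAULT_RESOURCES = {"instance_count": 1, "instance_type": None}


def effective_resources(resources: Optional[dict] = None) -> dict:
    """Merge user-supplied resources over the documented serverless defaults."""
    merged = dict(DEFAULT_RESOURCES)
    for key, value in (resources or {}).items():
        if key not in DEFAULT_RESOURCES:
            raise KeyError(f"unsupported resources field in this sketch: {key}")
        merged[key] = value
    if merged["instance_count"] < 1:
        raise ValueError("instance_count must be at least 1")
    return merged


# No resources given: the defaults apply.
print(effective_resources())
# Same values as the example above: four Standard_NC24 nodes.
print(effective_resources({"instance_type": "Standard_NC24", "instance_count": 4}))
```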

To change the job tier, use queue_settings to choose between dedicated VMs (job_tier: Standard) and low-priority VMs (job_tier: Spot).

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient    # Handle to the workspace.
from azure.identity import DefaultAzureCredential    # Authentication package.
credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com.
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure-subscription-ID>", 
    resource_group_name="<Azure-resource-group>",
    workspace_name="<Azure-Machine-Learning-workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    queue_settings={
      "job_tier": "Spot"  
    }
)
# Submit the command job.
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
component: ./train.yml 
queue_settings:
   job_tier: Standard # Possible values are Standard (dedicated) and Spot (low priority). The default is Standard.
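
If you build queue_settings programmatically, a small guard can catch an unexpected tier before submission. The sketch below is only an illustration: make_queue_settings and JOB_TIERS are hypothetical names, and the check is limited to the two tiers named in the text.

```python
# Hypothetical helper for illustration; not an Azure Machine Learning SDK function.

# Job tiers named in the text and the kind of VM each one maps to.
JOB_TIERS = {"Standard": "dedicated", "Spot": "low priority"}


def make_queue_settings(job_tier: str = "Standard") -> dict:
    """Build a queue_settings mapping, rejecting tiers the text doesn't mention."""
    if job_tier not in JOB_TIERS:
        raise ValueError(f"job_tier must be one of {sorted(JOB_TIERS)}, got {job_tier!r}")
    return {"job_tier": job_tier}


settings = make_queue_settings("Spot")
print(settings, "->", JOB_TIERS[settings["job_tier"]], "VMs")
```

The returned dictionary has the same shape as the queue_settings argument in the Python SDK example above.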

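Example with all fields for command jobs

The following example shows all fields specified, including the identity the job should use. You don't need to specify virtual network settings because workspace-level managed network isolation is used automatically.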

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient      # Handle to the workspace.
from azure.identity import DefaultAzureCredential     # Authentication package.
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com.
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure-subscription-ID>", 
    resource_group_name="<Azure-resource-group>",
    workspace_name="<Azure-Machine-Learning-workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    identity=UserIdentityConfiguration(),
    queue_settings={
      "job_tier": "Standard"  
    }
)
job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3", instance_count=1)
# Submit the command job.
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
queue_settings:
   job_tier: Standard # Possible values are Standard and Spot. The default is Standard.
identity:
  type: user_identity # Possible values are managed and user_identity.
resources:
  instance_count: 1
  instance_type: Standard_E4s_v3

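Here are two more examples of using serverless compute for training:

AutoML job

You don't need to specify compute for AutoML jobs. You can optionally specify resources. If the instance count isn't specified, it defaults based on the max_concurrent_trials and max_nodes parameters. If you submit an AutoML image classification or NLP task without specifying an instance type, the GPU VM size is selected automatically. You can submit AutoML jobs by using the CLI, the Python SDK, or studio.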

Python SDK
Azure CLI

If you want to specify the type or instance count, use the ResourceConfiguration class.

# Create the AutoML classification job with the related factory-function.
from azure.ai.ml.entities import ResourceConfiguration 

classification_job = automl.classification(
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial: -1,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=[ClassificationModels.LOGISTIC_REGRESSION],
    enable_onnx_compatible_models=True,
)

# Serverless compute resources used to run the job
classification_job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3", instance_count=6)

If you want to specify the type or instance count, add a resources section.

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl
experiment_name: dpv2-cli-automl-classifier-experiment
description: A Classification job using bank marketing
# Serverless compute is used to run this AutoML job. 
# Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you.

task: classification
log_verbosity: debug
primary_metric: accuracy

target_column_name: "y"

#validation_data_size: 0.20
#n_cross_validations: 5
#test_data_size: 0.1

training_data:
  path: "./training-mltable-folder"
  type: mltable
validation_data:
  path: "./validation-mltable-folder"
  type: mltable
test_data:
  path: "./test-mltable-folder"
  type: mltable

limits:
  timeout_minutes: 180
  max_trials: 40
  max_concurrent_trials: 5
  trial_timeout_minutes: 20
  enable_early_termination: true
  exit_score: 0.92

featurization:
  mode: custom
  transformer_params:
    imputer:
      - fields: ["job"]
        parameters:
          strategy: most_frequent
  blocked_transformers:
    - WordEmbedding
training:
  enable_model_explainability: true
  allowed_training_algorithms:
    - gradient_boosting
    - logistic_regression
# Resources to run this serverless job
resources:
  instance_type: Standard_E4s_v3
  instance_count: 5

Pipeline job

For pipeline jobs in the Python SDK, specify "serverless" as your default compute type to use serverless compute.

# Construct pipeline
@pipeline()
def pipeline_with_components_from_yaml(
    training_input,
    test_input,
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
):
    """E2E dummy train-score-eval pipeline with components defined via yaml."""
    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=training_input,
        max_epochs=training_max_epochs,
        learning_rate=training_learning_rate,
        learning_rate_schedule=learning_rate_schedule,
    )

    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output, test_data=test_input
    )
    score_with_sample_data.outputs.score_output.mode = "upload"

    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

    # Return: pipeline outputs
    return {
        "trained_model": train_with_sample_data.outputs.model_output,
        "scored_data": score_with_sample_data.outputs.score_output,
        "evaluation_report": eval_with_sample_data.outputs.eval_output,
    }


pipeline_job = pipeline_with_components_from_yaml(
    training_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    test_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
)

# set pipeline to use serverless compute
pipeline_job.settings.default_compute = "serverless"

For pipeline jobs in the Azure CLI, specify azureml:serverless as your default compute type to use serverless compute.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components
# Serverless compute is used to run this pipeline job. 
# Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you.
inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
 default_compute: azureml:serverless

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: ssh
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

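You can also set serverless compute as the default compute in designer.

Configure serverless pipeline jobs with user-assigned managed identity

When you use serverless compute in pipeline jobs, we recommend that you set user identity at the level of the individual step that runs on compute, rather than at the root pipeline level. Although identity settings are supported at both the root pipeline and step levels, the step-level setting takes precedence if both are set. However, for pipelines that contain pipeline components, identity must be set on the individual steps that run. Identity set at the root pipeline or pipeline component level doesn't work. Therefore, for simplicity, we recommend setting identity at the individual step level.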

Python SDK
Azure CLI

from azure.ai.ml import Input
from azure.ai.ml.dsl import pipeline


# train_component is assumed to be a component that was loaded or registered elsewhere.
@pipeline()
def my_pipeline():
    train_job = train_component(
        training_data=Input(type="uri_folder", path="./data")
    )
    # Set managed identity for the job
    train_job.identity = {"type": "managed"}
    return {"train_output": train_job.outputs}

pipeline_job = my_pipeline()
# Configure the pipeline to use serverless compute.
pipeline_job.settings.default_compute = "serverless"

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
description: E2E dummy train-score-eval pipeline with registered components
settings:
  default_compute: azureml:serverless
jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data:
        type: uri_folder
        path: ./data
    identity:
      type: managed
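
The precedence rules for identity in pipelines can be summarized in a few lines of code. The following sketch is a simplified model of the behavior described in this section. The function effective_identity and its parameters are hypothetical and aren't part of the Azure Machine Learning SDK.

```python
# Hypothetical model of the identity precedence rules described in this section.
from typing import Optional


def effective_identity(
    step_identity: Optional[str],
    root_identity: Optional[str],
    has_pipeline_components: bool,
) -> Optional[str]:
    """Return the identity type a step would run with, or None if none applies."""
    if step_identity is not None:
        # A step-level setting always wins over the root pipeline setting.
        return step_identity
    if has_pipeline_components:
        # With pipeline components, identity set above the step level doesn't work.
        return None
    # Otherwise the root pipeline setting applies to the step.
    return root_identity


print(effective_identity("managed", "user_identity", False))
print(effective_identity(None, "user_identity", False))
print(effective_identity(None, "user_identity", True))
```

Setting the identity on every step, as recommended above, makes the first branch the only one that matters.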

See more examples of training with serverless compute:


Last updated on 2025-12-09