Create clusters, notebooks, and jobs with Terraform

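This article shows how to use the Databricks Terraform provider to create a cluster, a notebook, and a job in an existing Azure Databricks workspace.

You can also adapt the Terraform configurations in this article to create custom clusters, notebooks, and jobs in your workspace.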

Step 1: Create and configure the Terraform project

Create a Terraform project by following the instructions in the Requirements section of the Databricks Terraform provider overview article.
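
The notebook.tf and job.tf files later in this step reference data.databricks_current_user.me, so the project needs a Databricks provider and that data source. If the project you created from the Requirements section does not already declare them, a minimal sketch might look like the following. It assumes the provider picks up credentials from its default sources (for example, environment variables or a Databricks configuration profile), and the file name main.tf is hypothetical.

```hcl
# main.tf (hypothetical file name)
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Assumption: authentication comes from the provider's default sources,
# such as environment variables or a Databricks configuration profile.
provider "databricks" {}

# Looks up the user that Terraform authenticates as. notebook.tf uses
# its `home` attribute and job.tf uses its `user_name` attribute.
data "databricks_current_user" "me" {}
```

Do not add these blocks if your project already contains equivalent declarations, because Terraform rejects duplicate provider configurations and data sources with the same name.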

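To create a cluster, create a file named cluster.tf and add the following content to the file. This content creates a cluster with the smallest amount of resources allowed. The cluster uses the latest Databricks Runtime Long Term Support (LTS) version.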

For a cluster that works with Unity Catalog:

variable "cluster_name" {}
variable "cluster_autotermination_minutes" {}
variable "cluster_num_workers" {}
variable "cluster_data_security_mode" {}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
  data_security_mode      = var.cluster_data_security_mode
}

output "cluster_url" {
  value = databricks_cluster.this.url
}

For an all-purpose cluster:

variable "cluster_name" {
  description = "A name for the cluster."
  type        = string
  default     = "My Cluster"
}

variable "cluster_autotermination_minutes" {
  description = "How many minutes before automatically terminating due to inactivity."
  type        = number
  default     = 60
}

variable "cluster_num_workers" {
  description = "The number of workers."
  type        = number
  default     = 1
}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
}

output "cluster_url" {
  value = databricks_cluster.this.url
}
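
As one example of adapting this configuration, the fixed num_workers argument could be swapped for the provider's autoscale block. The following sketch is a variant of the all-purpose databricks_cluster.this resource above, meant to replace it rather than sit alongside it. The worker limits of 1 and 3 are arbitrary assumptions chosen for illustration, not recommendations.

```hcl
resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes

  # Replaces num_workers: the cluster scales between these bounds.
  # The values 1 and 3 are placeholders chosen for illustration.
  autoscale {
    min_workers = 1
    max_workers = 3
  }
}
```

With this variant, the cluster_num_workers variable is no longer referenced by the resource.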

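If you are creating a cluster, create another file named cluster.auto.tfvars and add the following content to the file. This file contains variable values for customizing the cluster. Replace the placeholder values with your own values.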

For a cluster that works with Unity Catalog:

cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
cluster_data_security_mode      = "SINGLE_USER"

For an all-purpose cluster:

cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1

To create a notebook, create another file named notebook.tf and add the following content to the file:

variable "notebook_subdirectory" {
  description = "A name for the subdirectory to store the notebook."
  type        = string
  default     = "Terraform"
}

variable "notebook_filename" {
  description = "The notebook's filename."
  type        = string
}

variable "notebook_language" {
  description = "The language of the notebook."
  type        = string
}

resource "databricks_notebook" "this" {
  path     = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
  language = var.notebook_language
  source   = "./${var.notebook_filename}"
}

output "notebook_url" {
  value = databricks_notebook.this.url
}
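
Because notebook_language is passed straight to the language argument of databricks_notebook, a validation block can reject a mistyped value before any resource is created. This optional sketch would replace the notebook_language declaration above. The list of accepted values (PYTHON, SQL, SCALA, R) is an assumption about what the provider accepts and is worth checking against the provider documentation.

```hcl
variable "notebook_language" {
  description = "The language of the notebook."
  type        = string

  # Assumed set of languages accepted by the databricks_notebook resource.
  validation {
    condition     = contains(["PYTHON", "SQL", "SCALA", "R"], var.notebook_language)
    error_message = "The notebook_language value must be one of PYTHON, SQL, SCALA, or R."
  }
}
```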

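If you are creating a notebook, save the following notebook code to a file in the same directory as the notebook.tf file:

For the Python notebook, use the following code, saved in a file named notebook-getting-started-lakehouse-e2e.py (the file name that notebook.auto.tfvars specifies for this notebook):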

# Databricks notebook source
external_location = "<your_external_location>"
catalog = "<your_catalog>"

dbutils.fs.put(f"{external_location}/foobar.txt", "Hello world!", True)
display(dbutils.fs.head(f"{external_location}/foobar.txt"))
dbutils.fs.rm(f"{external_location}/foobar.txt")

display(spark.sql(f"SHOW SCHEMAS IN {catalog}"))

# COMMAND ----------

from pyspark.sql.functions import col

# Set parameters for isolation in workspace and reset demo
username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
database = f"{catalog}.e2e_lakehouse_{username}_db"
source = f"{external_location}/e2e-lakehouse-source"
table = f"{database}.target_table"
checkpoint_path = f"{external_location}/_checkpoint/e2e-lakehouse-demo"

spark.sql(f"SET c.username='{username}'")
spark.sql(f"SET c.database={database}")
spark.sql(f"SET c.source='{source}'")

spark.sql("DROP DATABASE IF EXISTS ${c.database} CASCADE")
spark.sql("CREATE DATABASE ${c.database}")
spark.sql("USE ${c.database}")

# Clear out data from previous demo execution
dbutils.fs.rm(source, True)
dbutils.fs.rm(checkpoint_path, True)


# Define a class to load batches of data to source
class LoadData:

  def __init__(self, source):
    self.source = source

  def get_date(self):
    try:
      df = spark.read.format("json").load(self.source)
    except Exception:
      # The source could not be read (for example, nothing has landed yet),
      # so start from the first batch date.
      return "2016-01-01"
    batch_date = df.selectExpr("max(distinct(date(tpep_pickup_datetime))) + 1 day").first()[0]
    if batch_date.month == 3:
      raise Exception("Source data exhausted")
    return batch_date

  def get_batch(self, batch_date):
    return (
      spark.table("samples.nyctaxi.trips")
        .filter(col("tpep_pickup_datetime").cast("date") == batch_date)
    )

  def write_batch(self, batch):
    batch.write.format("json").mode("append").save(self.source)

  def land_batch(self):
    batch_date = self.get_date()
    batch = self.get_batch(batch_date)
    self.write_batch(batch)

RawData = LoadData(source)

# COMMAND ----------

RawData.land_batch()

# COMMAND ----------

# Import functions
from pyspark.sql.functions import col, current_timestamp

# Configure Auto Loader to ingest JSON data to a Delta table
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(source)
  .select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .option("mergeSchema", "true")
  .toTable(table))

# COMMAND ----------

df = spark.read.table(table)

# COMMAND ----------

display(df)

For the Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal, use a file named notebook-quickstart-create-databricks-workspace-portal.py with the following contents:

# Databricks notebook source
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Seattle"
blob_sas_token = r""

# COMMAND ----------

wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

# COMMAND ----------

df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')

# COMMAND ----------

print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))

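If you are creating a notebook, create another file named notebook.auto.tfvars and add the following content to the file. This file contains variable values for customizing the notebook configuration.

For the Python notebook: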
```
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-getting-started-lakehouse-e2e.py"
notebook_language     = "PYTHON"
```
For the Python notebook for Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal:
```
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-quickstart-create-databricks-workspace-portal.py"
notebook_language     = "PYTHON"
```

To create a job, create another file named job.tf and add the following content to the file. This content creates a job that runs the notebook.

variable "job_name" {
  description = "A name for the job."
  type        = string
  default     = "My Job"
}

variable "task_key" {
  description = "A name for the task."
  type        = string
  default     = "my_task"
}

resource "databricks_job" "this" {
  name = var.job_name
  task {
    task_key = var.task_key
    existing_cluster_id = databricks_cluster.this.cluster_id
    notebook_task {
      notebook_path = databricks_notebook.this.path
    }
  }
  email_notifications {
    on_success = [ data.databricks_current_user.me.user_name ]
    on_failure = [ data.databricks_current_user.me.user_name ]
  }
}

output "job_url" {
  value = databricks_job.this.url
}

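If you are creating a job, create another file named job.auto.tfvars and add the following content to the file. This file contains variable values for customizing the job configuration.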
```
job_name = "My Job"
task_key = "my_task"
```

Step 2: Run the configurations

In this step, you run the Terraform configurations to deploy the cluster, the notebook, and the job into your Azure Databricks workspace.

Check whether your Terraform configurations are valid by running the terraform validate command. If any errors are reported, fix them, and then run the command again.
```
terraform validate
```
Check what Terraform will do in your workspace, before Terraform actually does it, by running the terraform plan command.
```
terraform plan
```
Deploy the cluster, the notebook, and the job into your workspace by running the terraform apply command. When prompted to deploy, type yes and press Enter.
```
terraform apply
```
Terraform deploys the resources that are specified in your project. Deploying these resources (especially a cluster) can take several minutes.

Step 3: Explore the results

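If you created a cluster, in the output of the terraform apply command, copy the link next to cluster_url, and paste it into your web browser's address bar.

If you created a notebook, in the output of the terraform apply command, copy the link next to notebook_url, and paste it into your web browser's address bar.

Note

Before you use the notebook, you might need to customize its contents. See the related documentation about how to customize the notebook.

If you created a job, in the output of the terraform apply command, copy the link next to job_url, and paste it into your web browser's address bar.

Note

Before you run the notebook, you might need to customize its contents. See the links at the beginning of this article for related documentation about how to customize the notebook.

If you created a job, run the job as follows:
1. Click Run now on the job page.
2. After the job finishes running, to view the job run's results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook's code.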
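
Each of the three links comes from an output block in cluster.tf, notebook.tf, or job.tf. If you deployed all three resources and prefer a single value to copy from, a combined output along these lines could be added. The output name resource_urls and the file name outputs.tf are hypothetical.

```hcl
# outputs.tf (hypothetical file name)
# Assumes the cluster, notebook, and job from this article all exist
# in the same project.
output "resource_urls" {
  description = "Workspace links for the resources created in this article."
  value = {
    cluster  = databricks_cluster.this.url
    notebook = databricks_notebook.this.url
    job      = databricks_job.this.url
  }
}
```

After the next terraform apply, running terraform output resource_urls prints the three links together.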

Step 4: Clean up

In this step, you delete the preceding resources from your workspace.

Check what Terraform will do in your workspace, before Terraform actually does it, by running the terraform plan command.
```
terraform plan
```
Delete the cluster, the notebook, and the job from your workspace by running the terraform destroy command. When prompted to delete, type yes and press Enter.
```
terraform destroy
```
Terraform deletes the resources that are specified in your project.

Last updated on 2025-05-09