Deploy batch inference pipelines

Important

This feature is in Public Preview.

This page shows how to integrate AI Functions with other Databricks data and AI products to build complete batch inference pipelines. These pipelines can run end-to-end workflows that include ingestion, preprocessing, inference, and post-processing. Pipelines can be authored in SQL or Python and deployed as:

Lakeflow Spark Declarative Pipelines
Scheduled workflows using Databricks workflows
Streaming inference workflows using Structured Streaming

Requirements

A workspace in a region supported by Foundation Model APIs.
Databricks Runtime 15.4 LTS or above is required for batch inference workloads that use AI Functions.
Query permission on the Delta table in Unity Catalog that contains the data you want to use.
Set pipelines.channel in the table properties to 'preview' to use ai_query(). See Requirements for a sample query.

Perform incremental batch inference using Lakeflow Spark Declarative Pipelines

The following example performs incremental batch inference using Lakeflow Spark Declarative Pipelines for cases where data is continuously updated.

Step 1: Ingest raw news data from a volume

SQL

CREATE OR REFRESH STREAMING TABLE news_raw
COMMENT "Raw news articles ingested from volume."
AS SELECT *
FROM STREAM(read_files(
  '/Volumes/databricks_news_summarization_benchmarking_data/v01/csv',
  format => 'csv',
  header => true,
  mode => 'PERMISSIVE',
  multiLine => 'true'
));

Python

Import the packages and define the JSON schema of the LLM response as a Python variable.

from pyspark import pipelines as dp
from pyspark.sql.functions import expr, get_json_object, concat

news_extraction_schema = (
    '{"type": "json_schema", "json_schema": {"name": "news_extraction", '
    '"schema": {"type": "object", "properties": {"title": {"type": "string"}, '
    '"category": {"type": "string", "enum": ["Politics", "Sports", "Technology", '
    '"Health", "Entertainment", "Business"]}}}, "strict": true}}'
)
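The hand-written schema string above is easy to break with a stray brace or quote. As a minimal sketch, the same structure can be kept as a Python dictionary and serialized with `json.dumps`. The names `CATEGORIES`, `news_extraction_schema_dict`, and `news_extraction_schema_from_dict` are illustrative and not part of the original example.

```python
import json

# Allowed categories, copied from the schema string in the example above.
CATEGORIES = ["Politics", "Sports", "Technology", "Health", "Entertainment", "Business"]

# Same structure as the hand-written JSON string, expressed as a dictionary.
news_extraction_schema_dict = {
    "type": "json_schema",
    "json_schema": {
        "name": "news_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "category": {"type": "string", "enum": CATEGORIES},
            },
        },
        "strict": True,
    },
}

# Serialize once; json.dumps guarantees balanced braces and valid JSON quoting.
news_extraction_schema_from_dict = json.dumps(news_extraction_schema_dict)
```

With the default separators, the serialized string matches the hand-written literal, so it can be passed to `responseFormat` in the same way.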

Load the data from a Unity Catalog volume.

@dp.table(
  comment="Raw news articles ingested from volume."
)
def news_raw():
  return (
    spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", True)
      .option("mode", "PERMISSIVE")
      .option("multiLine", "true")
      .load("/Volumes/databricks_news_summarization_benchmarking_data/v01/csv")
  )

Step 2: Apply LLM inference to extract title and category

SQL


CREATE OR REFRESH MATERIALIZED VIEW news_categorized
COMMENT "Extract category and title from news articles using LLM inference."
AS
SELECT
  inputs,
  ai_query(
    "databricks-meta-llama-3-3-70b-instruct",
    "Extract the category of the following news article: " || inputs,
    responseFormat => '{
      "type": "json_schema",
      "json_schema": {
        "name": "news_extraction",
        "schema": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "category": {
              "type": "string",
              "enum": ["Politics", "Sports", "Technology", "Health", "Entertainment", "Business"]
            }
          }
        },
        "strict": true
      }
    }'
  ) AS meta_data
FROM news_raw
LIMIT 2;

Python

@dp.materialized_view(
  comment="Extract category and title from news articles using LLM inference."
)
def news_categorized():
  # Limit the number of rows to 2 as in the SQL version
  df_raw = spark.read.table("news_raw").limit(2)
  # Inject the JSON schema variable into the ai_query call using an f-string.
  return df_raw.withColumn(
    "meta_data",
    expr(
      f"ai_query('databricks-meta-llama-3-3-70b-instruct', "
      f"concat('Extract the category of the following news article: ', inputs), "
      f"responseFormat => '{news_extraction_schema}')"
    )
  )
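The f-string above places the schema inside a single-quoted SQL literal, so a single quote or backslash in any embedded value would change the expression. The following is a minimal sketch of a helper that checks for this before building the string. The function name `build_ai_query_expr` and its parameters are hypothetical, and the check is deliberately conservative rather than a full SQL escaper.

```python
import json


def build_ai_query_expr(endpoint: str, instruction: str, column: str, response_format: str) -> str:
    """Build an ai_query(...) SQL expression string for pyspark.sql.functions.expr."""
    # Fail early if the response format is not valid JSON (raises ValueError).
    json.loads(response_format)

    # Values are embedded in single-quoted SQL literals, so reject characters
    # that would terminate or alter the literal.
    for name, value in (
        ("endpoint", endpoint),
        ("instruction", instruction),
        ("response_format", response_format),
    ):
        if "'" in value or "\\" in value:
            raise ValueError(f"{name} contains a quote or backslash and cannot be embedded as-is")

    return (
        f"ai_query('{endpoint}', "
        f"concat('{instruction}', {column}), "
        f"responseFormat => '{response_format}')"
    )
```

The returned string has the same shape as the expression in the example and can be passed to `expr(...)` in its place.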

Step 3: Validate the LLM inference output before summarization

SQL

CREATE OR REFRESH MATERIALIZED VIEW news_validated (
  CONSTRAINT valid_title EXPECT (size(split(get_json_object(meta_data, '$.title'), ' ')) >= 3),
  CONSTRAINT valid_category EXPECT (get_json_object(meta_data, '$.category') IN ('Politics', 'Sports', 'Technology', 'Health', 'Entertainment', 'Business'))
)
COMMENT "Validated news articles ensuring the title has at least 3 words and the category is valid."
AS
SELECT *
FROM news_categorized;

Python

@dp.materialized_view(
  comment="Validated news articles ensuring the title has at least 3 words and the category is valid."
)
@dp.expect("valid_title", "size(split(get_json_object(meta_data, '$.title'), ' ')) >= 3")
@dp.expect_or_fail("valid_category", "get_json_object(meta_data, '$.category') IN ('Politics', 'Sports', 'Technology', 'Health', 'Entertainment', 'Business')")
def news_validated():
  return spark.read.table("news_categorized")
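To try the two rules on sample model responses before running the pipeline, they can be approximated in plain Python. This is a rough sketch, not the pipeline's own evaluation: the function `check_meta_data` is hypothetical, and it treats missing or malformed values as failures, which is an assumption that may differ from how Spark handles NULL in expectations.

```python
import json

# Categories accepted by the valid_category rule in the example above.
VALID_CATEGORIES = ("Politics", "Sports", "Technology", "Health", "Entertainment", "Business")


def check_meta_data(meta_data: str) -> dict:
    """Approximate the valid_title and valid_category rules for one response string."""
    try:
        parsed = json.loads(meta_data)
    except (TypeError, ValueError):
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    title = parsed.get("title")
    category = parsed.get("category")
    return {
        # Mirrors size(split(title, ' ')) >= 3 for string titles.
        "valid_title": isinstance(title, str) and len(title.split(" ")) >= 3,
        # Mirrors category IN (...).
        "valid_category": category in VALID_CATEGORIES,
    }
```

For example, `check_meta_data('{"title": "Short title", "category": "Weather"}')` reports both rules as failed, because the title has two words and the category is not in the list.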

Step 4: Summarize news articles from the validated data

SQL

CREATE OR REFRESH MATERIALIZED VIEW news_summarized
COMMENT "Summarized political news articles after validation."
AS
SELECT
  get_json_object(meta_data, '$.category') as category,
  get_json_object(meta_data, '$.title') as title,
  ai_query(
    "databricks-meta-llama-3-3-70b-instruct",
    "Summarize the following political news article in 2-3 sentences: " || inputs
  ) AS summary
FROM news_validated;

Python

@dp.materialized_view(
  comment="Summarized political news articles after validation."
)
def news_summarized():
  df = spark.read.table("news_validated")
  return df.select(
    get_json_object("meta_data", "$.category").alias("category"),
    get_json_object("meta_data", "$.title").alias("title"),
    expr(
      "ai_query('databricks-meta-llama-3-3-70b-instruct', "
      "concat('Summarize the following political news article in 2-3 sentences: ', inputs))"
    ).alias("summary")
  )

Automate batch inference jobs using Databricks workflows

Schedule batch inference jobs and automate AI pipelines.

SQL

SELECT
   *,
   ai_query('databricks-meta-llama-3-3-70b-instruct', request => concat("You are an opinion mining service. Given a piece of text, output an array of json results that extracts key user opinions, a classification, and a Positive, Negative, Neutral, or Mixed sentiment about that subject.
AVAILABLE CLASSIFICATIONS
Quality, Service, Design, Safety, Efficiency, Usability, Price
Examples below:
DOCUMENT
I got soup. It really did take only 20 minutes to make some pretty good soup. The noises it makes when it's blending are somewhat terrifying, but it gives a little beep to warn you before it does that. It made three or four large servings of soup. It's a single layer of steel, so the outside gets pretty hot. It can be hard to unplug the lid without knocking the blender against the side, which is not a nice sound. The soup was good and the recipes it comes with look delicious, but I'm not sure I'll use it often. 20 minutes of scary noises from the kitchen when I already need comfort food is not ideal for me. But if you aren't sensitive to loud sounds it does exactly what it says it does..
RESULT
[
 {'Classification': 'Efficiency', 'Comment': 'only 20 minutes','Sentiment': 'Positive'},
 {'Classification': 'Quality','Comment': 'pretty good soup','Sentiment': 'Positive'},
 {'Classification': 'Usability', 'Comment': 'noises it makes when it's blending are somewhat terrifying', 'Sentiment': 'Negative'},
 {'Classification': 'Safety','Comment': 'outside gets pretty hot','Sentiment': 'Negative'},
 {'Classification': 'Design','Comment': 'Hard to unplug the lid without knocking the blender against the side, which is not a nice sound', 'Sentiment': 'Negative'}
]
DOCUMENT
", REVIEW_TEXT, '\n\nRESULT\n')) as result
FROM catalog.schema.product_reviews
LIMIT 10

Python


import json
from pyspark.sql.functions import expr

# Define the opinion mining prompt as a multi-line string.
opinion_prompt = """You are an opinion mining service. Given a piece of text, output an array of json results that extracts key user opinions, a classification, and a Positive, Negative, Neutral, or Mixed sentiment about that subject.
AVAILABLE CLASSIFICATIONS
Quality, Service, Design, Safety, Efficiency, Usability, Price
Examples below:
DOCUMENT
I got soup. It really did take only 20 minutes to make some pretty good soup. The noises it makes when it's blending are somewhat terrifying, but it gives a little beep to warn you before it does that. It made three or four large servings of soup. It's a single layer of steel, so the outside gets pretty hot. It can be hard to unplug the lid without knocking the blender against the side, which is not a nice sound. The soup was good and the recipes it comes with look delicious, but I'm not sure I'll use it often. 20 minutes of scary noises from the kitchen when I already need comfort food is not ideal for me. But if you aren't sensitive to loud sounds it does exactly what it says it does.
RESULT
[
 {'Classification': 'Efficiency', 'Comment': 'only 20 minutes','Sentiment': 'Positive'},
 {'Classification': 'Quality','Comment': 'pretty good soup','Sentiment': 'Positive'},
 {'Classification': 'Usability', 'Comment': 'noises it makes when it's blending are somewhat terrifying', 'Sentiment': 'Negative'},
 {'Classification': 'Safety','Comment': 'outside gets pretty hot','Sentiment': 'Negative'},
 {'Classification': 'Design','Comment': 'Hard to unplug the lid without knocking the blender against the side, which is not a nice sound', 'Sentiment': 'Negative'}
]
DOCUMENT
"""

# Escape the prompt so it can be safely embedded in the SQL expression.
escaped_prompt = json.dumps(opinion_prompt)

# Read the source table and limit to 10 rows.
df = spark.table("catalog.schema.product_reviews").limit(10)

# Apply the LLM inference to each row, concatenating the prompt, the review text, and the tail string.
result_df = df.withColumn(
    "result",
    expr(f"ai_query('databricks-meta-llama-3-3-70b-instruct', request => concat({escaped_prompt}, REVIEW_TEXT, '\\n\\nRESULT\\n'))")
)

# Display the result DataFrame.
display(result_df)
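The `json.dumps` step is what lets a prompt with apostrophes and line breaks sit inside the SQL expression. The sketch below shows its effect on a short stand-in prompt. `sample_prompt` is made up for illustration, and the approach assumes Spark's default string-literal parsing, where double-quoted literals with backslash escapes are accepted.

```python
import json

# A short stand-in for the full opinion mining prompt.
sample_prompt = "Classify the review's tone.\nDOCUMENT\n"

# json.dumps wraps the text in double quotes and turns each newline into the
# two characters backslash + n, leaving the apostrophe untouched.
escaped = json.dumps(sample_prompt)
print(escaped)

# The escaped prompt is interpolated without extra quotes, as in the example above.
expression = (
    "ai_query('databricks-meta-llama-3-3-70b-instruct', "
    f"request => concat({escaped}, REVIEW_TEXT, '\\n\\nRESULT\\n'))"
)
print(expression)
```

Because the escaped string contains no raw line breaks, the whole expression stays on one line, and decoding it with `json.loads` returns the original prompt.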

AI Functions using Structured Streaming

Apply AI inference in near real-time or micro-batch scenarios using ai_query and Structured Streaming.

Step 1: Read your static Delta table

Read your static Delta table as if it were a stream.


from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Spark processes all existing rows exactly once in the first micro-batch.
df = spark.table("enterprise.docs")  # Replace with your table name containing enterprise documents
df.repartition(50).write.format("delta").mode("overwrite").saveAsTable("enterprise.docs")
df_stream = spark.readStream.format("delta").option("maxBytesPerTrigger", "50K").table("enterprise.docs")

# Define the prompt outside the SQL expression.
prompt = (
    "You are provided with an enterprise document. Summarize the key points in a concise paragraph. "
    "Do not include extra commentary or suggestions. Document: "
)

Step 2: Apply `ai_query`

Spark processes static data only once unless new rows arrive in the table.


df_transformed = df_stream.select(
    "document_text",
    F.expr(f"""
      ai_query(
        'databricks-meta-llama-3-1-8b-instruct',
        CONCAT('{prompt}', document_text)
      )
    """).alias("summary")
)
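The expression above embeds `prompt` in a single-quoted SQL literal, which works because this prompt contains no single quote. As a sketch for prompts that might contain one, the `json.dumps` quoting used in the workflow example on this page can be applied here too. The helper `summary_expr` is hypothetical, and it assumes default Spark string-literal parsing.

```python
import json


def summary_expr(prompt_text: str, text_column: str = "document_text") -> str:
    """Return an ai_query SQL expression with the prompt as a double-quoted literal."""
    # json.dumps produces a double-quoted literal, so apostrophes in the prompt
    # no longer terminate the SQL string.
    quoted_prompt = json.dumps(prompt_text)
    return (
        "ai_query('databricks-meta-llama-3-1-8b-instruct', "
        f"CONCAT({quoted_prompt}, {text_column}))"
    )


# A prompt with an apostrophe, which the single-quoted form could not hold.
print(summary_expr("Summarize the company's document: "))
```

The resulting string can be passed to `F.expr(...)` in place of the inline f-string.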

Step 3: Write the summarized output

Write the summarized output to another Delta table.


# Time-based triggers apply, but only the first trigger processes all existing static data.
query = df_transformed.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/tmp/checkpoints/_docs_summary") \
    .outputMode("append") \
    .toTable("enterprise.docs_summary")

query.awaitTermination()

Last updated on 2025-11-13

Udostępnij przez

Wdrażanie potoków wnioskowania wsadowego

Wymagania

Wykonywanie wnioskowania wsadowego przyrostowego w potokach deklaratywnych platformy Spark w usłudze Lakeflow

Krok 1: Ładowanie surowych danych wiadomości z woluminu

SQL

Python

Krok 2. Stosowanie wnioskowania LLM w celu wyodrębnienia tytułu i kategorii

SQL

Python

Krok 3. Zweryfikuj dane wyjściowe wnioskowania LLM przed podsumowaniem

SQL

Python

Krok 4. Podsumowanie artykułów z wiadomości na podstawie zweryfikowanych danych

SQL

Python

Automatyzacja zadań wnioskowania wsadowego przy użyciu przepływów pracy platformy Databricks

SQL

Python

Funkcje sztucznej inteligencji korzystające ze przesyłania strumieniowego ze strukturą

Krok 1. Odczytaj statyczną tabelę Delta

Krok 2. Zastosować ai_query

Krok 3. Zapisywanie podsumowanych danych wyjściowych

Sprzężenie zwrotne

Dodatkowe źródła

Krok 2. Zastosować `ai_query`