Use ai.extract with PySpark

The ai.extract function uses generative AI to scan input text and extract specific types of information designated by labels that you choose (for example, locations or names), all with a single line of code.

Overview

The ai.extract function is available for Spark DataFrames. You must specify the name of an existing input column as a parameter, along with a list of entity types to extract from each row of text.

The function returns a new DataFrame with a separate column for each specified entity type, containing the extracted values for each input row.

Syntax

df.ai.extract(labels=["entity1", "entity2", "entity3"], input_col="input")

Parameters

labels (required)
An array of strings that represents the set of entity types to extract from the text values in the input column.

input_col (required)
A string that contains the name of an existing column with input text values to scan for the custom entities.

aifunc.ExtractLabel (optional)
One or more label definitions describing the fields to extract. For more information, refer to the ExtractLabel Parameters table.

error_col (optional)
A string that contains the name of a new column to store any OpenAI errors that result from processing each input text row. If you don't set this parameter, a default name is generated for the error column. If an input row has no errors, the value in this column is null.
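After ai.extract runs, rows that failed can be separated from rows that succeeded by checking whether the error column is null. A minimal pure-Python sketch of that filtering logic, using invented row dictionaries and a hypothetical "ai_error" column name in place of real collected DataFrame rows:

```python
# Illustrative sketch: splitting extraction results by error status.
# The row dictionaries stand in for collected DataFrame rows, and
# "ai_error" is a hypothetical name passed as error_col.
rows = [
    {"descriptions": "MJ Lee lives in Tucson, AZ.",
     "name": ["MJ Lee"], "ai_error": None},
    {"descriptions": "Kris Turner is a nurse at NYU Langone.",
     "name": None, "ai_error": "Rate limit exceeded"},
]

# A null (None) error value marks a successfully processed row.
succeeded = [r for r in rows if r["ai_error"] is None]
failed = [r for r in rows if r["ai_error"] is not None]

print(len(succeeded), len(failed))  # 1 1
```

In Spark itself, the equivalent split is a pair of filters on whether the error column is null.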

ExtractLabel Parameters

label (required)
A string that represents the entity to extract from the input text values.

description (optional)
A string that adds extra context for the AI model. It can include requirements, context, or instructions for the AI to consider while performing the extraction.

max_items (optional)
An int that specifies the maximum number of items to extract for this label.

type (optional)
The JSON schema type for the extracted value. Supported types for this class include string, number, integer, boolean, object, and array.

properties (optional)
Additional JSON schema properties for the type, as a dictionary. It can include supported properties such as "items" for arrays, "properties" for objects, and "enum" for enum types. See the example usage in this article.

raw_col (optional)
A string that sets the column name for the raw LLM response. The raw response provides a list of dictionaries for every entity label, each including "reason" and "extraction_text" keys.
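Once rows are collected, the raw_col output can be post-processed in plain Python. A hedged sketch, assuming the raw response for each label is a list of dictionaries with "reason" and "extraction_text" keys as described above (the sample content itself is invented for illustration):

```python
# Sketch: pulling extraction_text values out of a raw LLM response.
# The structure mirrors the raw_col description: one list of
# {"reason", "extraction_text"} dicts per entity label.
raw_response = {
    "name": [{"reason": "Personal name at the start of the sentence",
              "extraction_text": "MJ Lee"}],
    "city": [{"reason": "Location following 'lives in'",
              "extraction_text": "Tucson"}],
}

def extraction_texts(raw, label):
    """Return the list of extracted strings for one label."""
    return [item["extraction_text"] for item in raw.get(label, [])]

print(extraction_texts(raw_response, "name"))  # ['MJ Lee']
print(extraction_texts(raw_response, "profession"))  # []
```

Keeping the "reason" fields alongside the extractions can be useful when auditing why the model chose a particular span.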

Returns

The function returns a Spark DataFrame with a new column for each specified entity type. The column or columns contain the entities extracted for each row of input text. If the function identifies more than one match for an entity, it returns only one of those matches. If no match is found, the result is null.

By default, the return type for each label is a list of strings. If you specify a different type in the aifunc.ExtractLabel configuration, such as type="integer", the output is a list of Python ints. If you specify max_items=1 in the aifunc.ExtractLabel configuration, only one element of that type is returned for the label.
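Because a label's value is a list by default but collapses to a single element when max_items=1, downstream code may want to normalize the two shapes. A small helper sketch (not part of the library) under that assumption:

```python
def first_or_none(value):
    """Collapse a list-valued extraction to its first element,
    passing single values and nulls through unchanged."""
    if value is None:
        return None
    if isinstance(value, list):
        return value[0] if value else None
    return value

print(first_or_none(["MJ Lee", "Kris Turner"]))  # MJ Lee
print(first_or_none("MJ Lee"))                   # MJ Lee
print(first_or_none(None))                       # None
```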

Example

# This code uses AI. Always review output for mistakes. 

df = spark.createDataFrame([
        ("MJ Lee lives in Tuscon, AZ, and works as a software engineer for Contoso.",),
        ("Kris Turner, a nurse at NYU Langone, is a resident of Jersey City, New Jersey.",)
    ], ["descriptions"])

df_entities = df.ai.extract(labels=["name", "profession", "city"], input_col="descriptions")
display(df_entities)

Running this code cell produces the following output:

Screenshot showing a new DataFrame with the columns 'name', 'profession', and 'city', containing the data extracted from the original DataFrame.