Use ai.similarity with PySpark

The ai.similarity function uses generative AI to compare two string expressions and calculate a semantic similarity score, all in a single line of code. You can compare the text values in one DataFrame column against a single common text value, or pairwise against the text values in another column.

Note

This article covers using ai.similarity with PySpark. To use ai.similarity with pandas, see this article.
See other AI functions in this overview article.
Learn how to customize the configuration of AI functions.

Overview

The ai.similarity function is available for Spark DataFrames. You must specify the name of an existing input column as a parameter. You must also specify a single common text value for comparisons, or the name of another column for pairwise comparisons.

The function returns a new DataFrame with an output column that contains a similarity score for each row of input text.

df.ai.similarity(input_col="col1", other="value", output_col="similarity")

df.ai.similarity(input_col="col1", other_col="col2", output_col="similarity")

Parameters

Name	Description
`input_col` Required	A string that contains the name of an existing column with input text values to use for computing similarity scores.
`other` or `other_col` Required	Only one of these parameters is required. The `other` parameter is a string that contains a single common text value used to compute similarity scores for each row of input. The `other_col` parameter is a string that designates the name of a second existing column, with text values used to compute pairwise similarity scores.
`output_col` Optional	A string that contains the name of a new column to store the calculated similarity score for each input text row. If you don't set this parameter, a default name is generated for the output column.
`error_col` Optional	A string that contains the name of a new column to store any OpenAI errors that result from processing each input text row. If you don't set this parameter, a default name is generated for the error column. If an input row has no errors, this column has a `null` value.
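
The parameter rules in the table can be checked before the call is made. The following is a minimal sketch and not part of the AI functions library: `build_similarity_args` is a hypothetical helper, the column name `similarity_error` is made up for illustration, and treating "both `other` and `other_col` supplied" as an error is an assumption, because the table states only that one of the two is required.

```python
def build_similarity_args(input_col, other=None, other_col=None,
                          output_col=None, error_col=None):
    """Hypothetical helper: assemble keyword arguments for df.ai.similarity.

    Assumption: exactly one of `other` / `other_col` should be supplied.
    """
    if (other is None) == (other_col is None):
        raise ValueError("Pass exactly one of 'other' or 'other_col'.")

    args = {"input_col": input_col}
    if other is not None:
        args["other"] = other          # single common text value
    else:
        args["other_col"] = other_col  # second column for pairwise comparison

    # Optional parameters are left out so that default column names apply.
    if output_col is not None:
        args["output_col"] = output_col
    if error_col is not None:
        args["error_col"] = error_col
    return args


single_value_args = build_similarity_args(
    "names", other="Microsoft", output_col="similarity"
)
pairwise_args = build_similarity_args(
    "names", other_col="industries",
    output_col="similarity", error_col="similarity_error"
)
print(single_value_args)
print(pairwise_args)
```

In an environment where the AI functions are available, the resulting dictionary could be unpacked into the call, for example `df.ai.similarity(**single_value_args)`.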

Returns

The function returns a Spark DataFrame that includes a new column with a generated similarity score for each input text row. The output similarity scores are relative, and they're best used for ranking. Score values can range from -1 (opposites) to 1 (identical). A score of 0 indicates that the values are unrelated in meaning.
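
Because the scores are relative, a typical use is to order rows by score rather than to read meaning into any single value. The sketch below illustrates that idea in plain Python. The row labels and score values are placeholders invented for illustration, not real ai.similarity output, and `rank_by_similarity` is a hypothetical helper.

```python
# Placeholder (label, score) pairs, invented for illustration only.
# Real scores would come from the output column of ai.similarity.
placeholder_rows = [
    ("text A", 0.42),
    ("text B", 0.87),
    ("text C", -0.15),
]


def rank_by_similarity(rows):
    """Return rows ordered from most similar to least similar."""
    for _, score in rows:
        # The documented range for scores is -1 to 1.
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"Score out of range: {score}")
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_by_similarity(placeholder_rows)
for position, (label, score) in enumerate(ranked, start=1):
    print(position, label, score)
```

On a Spark DataFrame, the same ordering can be obtained with the standard `orderBy` method, for example `similarity.orderBy("similarity", ascending=False)`.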

Example

Compare with a single value

# This code uses AI. Always review output for mistakes.

df = spark.createDataFrame([
        ("Bill Gates",),
        ("Satya Nadella",),
        ("Joan of Arc",)
    ], ["names"])

similarity = df.ai.similarity(input_col="names", other="Microsoft", output_col="similarity")
display(similarity)

This example code cell provides the following output:

Compare with pairwise values

# This code uses AI. Always review output for mistakes.

df = spark.createDataFrame([
        ("Bill Gates", "Technology"), 
        ("Satya Nadella", "Healthcare"), 
        ("Joan of Arc", "Agriculture")
    ], ["names", "industries"])

similarity = df.ai.similarity(input_col="names", other_col="industries", output_col="similarity")
display(similarity)

This example code cell provides the following output:

Related content

Use ai.similarity with pandas.
Detect sentiment with ai.analyze_sentiment.
Categorize text with ai.classify.
Generate vector embeddings with ai.embed.
Extract entities with ai.extract.
Fix grammar with ai.fix_grammar.
Answer custom user prompts with ai.generate_response.
Summarize text with ai.summarize.
Translate text with ai.translate.
Learn more about the full set of AI functions.
Customize the configuration of AI functions.
Did we miss a feature you need? Suggest it on the Fabric Ideas forum.

Last updated on 2025-11-21