MS Purview Classifications with Regex
I am working with custom classification rules in Microsoft Purview and would like to clarify how regex-based column and data patterns are evaluated during scans. I have several related questions and observations.
1. Column Pattern Regex Evaluation
When a custom classification rule is applied during a Purview scan, does the column pattern regex evaluate against:
- the column name, or
- the column displayName, or
- both?
This distinction is important because I am working with data sources where the column name and display name differ.
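To illustrate why the evaluation target matters, here is a minimal Python sketch. The column metadata (iss_dob / Date of Birth) and the pattern are hypothetical values made up for this example, and Python's re module is only a stand-in for whatever regex engine Purview uses.

```python
import re

# Hypothetical column metadata; both values are invented for illustration.
column = {"name": "iss_dob", "displayName": "Date of Birth"}

# Hypothetical column pattern that only the display name satisfies.
column_pattern = re.compile(r"(?i)^.*birth.*")

# Evaluate the same pattern against each candidate field.
matches = {field: bool(column_pattern.search(value)) for field, value in column.items()}
print(matches)
```

In this sketch the pattern matches only the displayName, so the column would be classified or skipped depending on which field the scan actually evaluates.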
2. Minimum Match Threshold and Sparse Columns
The minimum Data Pattern Match Threshold appears to be 1%. Is there any supported way to classify a column when the actual data match percentage is below 1%?
For example, if a column contains 999 NULL values and only 1 populated value, the effective match rate would be about 0.1%, well below the 1% minimum. Even with a very permissive data pattern such as .*, it seems the column would not be classified because of the threshold. Is there any workaround or alternative approach for handling very sparse columns? I am asking because I need to classify PII, and even if only 1 value in 1000 is populated, the column should still be classified as PII.
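The arithmetic behind this concern can be sketched in Python as follows. This assumes that NULL values count as non-matching rows in the denominator, which is my reading of the behavior and not something I have confirmed; the helper match_percentage and the sample value are hypothetical.

```python
import re

def match_percentage(values, data_pattern):
    """Percentage of rows matching the pattern; NULLs (None) count as non-matches (assumption)."""
    regex = re.compile(data_pattern)
    hits = sum(1 for v in values if v is not None and regex.fullmatch(v))
    return 100.0 * hits / len(values)

# 999 NULLs and a single populated value, as in the example above.
values = [None] * 999 + ["female"]
rate = match_percentage(values, r".*")

print(f"match rate: {rate:.1f}%")
print("meets 1% threshold:", rate >= 1.0)
```

Under that assumption, even the permissive .* pattern matches only the single populated row, so the column stays below the 1% minimum.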
3. Unexpected Classification Behavior with Column Patterns
I expected a column named iss_gendername, from a Dataverse table, to be classified by a custom rule configured as follows:
- Column pattern: (?i)^.*gender.*
- Data pattern: .*
- Minimum match threshold: 1%
However, this column was not classified during the scan.
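As a sanity check outside Purview, the column pattern can be tested against both column names in Python. This is only a rough check: Python's re module may not behave identically to the regex engine Purview uses, so it shows that the pattern is plausible, not how the scan evaluates it.

```python
import re

# Column pattern copied from the custom classification rule.
column_pattern = re.compile(r"(?i)^.*gender.*")

# The two column names observed in the scan.
column_names = ["iss_gendername", "Gender"]

results = {name: bool(column_pattern.search(name)) for name in column_names}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Both names match in this check, which suggests the pattern itself is not the reason iss_gendername was skipped.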
Notably:
- The rule passes the “Test classification rule” feature when using a separate manually built CSV file containing a column named iss_gendername.
- Another column named Gender was classified by the same rule.
- I observed that both columns share the same Fully Qualified Name (FQN) in Purview.
This raises the question of whether multiple columns sharing the same FQN can prevent one of them from being classified, or whether the behavior is related to how this column is represented in Dataverse.
Any clarification on these behaviors, particularly around regex evaluation targets, minimum match thresholds, and FQN handling, would be greatly appreciated.
Thank you.