Azure Data Explorer: Working with aggregation materialized views and historical data

01725609 105 Reputation points
2025-08-27T07:13:57.3366667+00:00

Hi,

We have the following scenario:

SourceTable => Deduplication Materialized View => Multiple Aggregation Materialized views based on the deduplicated view (Daily, Weekly, Monthly)

In the source table we can backfill historical data with LightIngest so that all data sharding is handled correctly, based on a custom datetime column in our data. This custom datetime column is also a group-by key in our materialized views (CustomDateTime, bin(CustomDateTime, 1d), ...).
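
For reference, a simplified sketch of what that chain looks like (SourceTable, DedupView, DailyView, DeviceId and Value are placeholder names, not our real schema):

    // Deduplication view: keeps one record per business key + CustomDateTime
    .create materialized-view DedupView on table SourceTable
    {
        SourceTable
        | summarize take_any(*) by DeviceId, CustomDateTime
    }

    // Daily aggregation view on top of the deduplication view, keyed on bin(CustomDateTime, 1d)
    .create materialized-view DailyView on materialized-view DedupView
    {
        DedupView
        | summarize RecordCount = count(), TotalValue = sum(Value) by CustomDateTime = bin(CustomDateTime, 1d)
    }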

However, when we create materialized views on top of the source data, that sharding logic is lost: the views' extents are organized by materialization (ingestion) time rather than by the original datetime column.

We can make sure this is handled correctly by using the backfill parameter when creating the materialized view; however, this does not help when we add extra historical data to the source table after the view has already been created (which is a valid scenario in our case).
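
For clarity, this is the creation-time option we mean; the property values are just an example, and (if we read the create-command docs correctly) updateExtentsCreationTime can stamp the backfilled extents based on the datetime group-by key:

    // async is required when backfill = true; effectiveDateTime limits how far back the backfill goes.
    // updateExtentsCreationTime = true stamps the backfilled extents based on the datetime group-by key instead of the ingestion time.
    .create async materialized-view with (backfill = true, effectiveDateTime = datetime(2021-01-01), updateExtentsCreationTime = true) DedupView on table SourceTable
    {
        SourceTable
        | summarize take_any(*) by DeviceId, CustomDateTime
    }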

What would be the best approach here?

  • We are talking about tens of billions of records, and we need split-second query performance.
  • We have a daily ingestion that is always ordered by ingestion time, so that is never an issue, but from time to time we need to add new historical data to the source table, and that is where it becomes a problem due to the volume of data.

Is adding a Partitioning Policy to the materialized views so that data sharding is still handled correctly the best approach?

However, in the documentation I read the following:

https://learn.microsoft.com/en-us/kusto/management/materialized-views/materialized-view-policies?view=microsoft-fabric#partitioning-policy

A partitioning policy can be applied on a materialized view. We recommend configuring a partitioning policy on a materialized view only when most or all of the view queries filter by one of the materialized view's group-by keys. This situation is common in multi-tenant solutions, where one of the materialized view's group-by keys is the tenant's identifier (for example, tenantId, customerId). For more information, see the first use case described in the partitioning policy supported scenarios page. For the commands to alter a materialized view's partitioning policy, see partitioning policy commands. Adding a partitioning policy on a materialized view increases the number of extents in the materialized view, and creates more "work" for the materialization process. For more information on the reason for this behavior, see the extents rebuild process mentioned in how materialized views work.

Is this also a valid scenario when the CustomDateTime column is the one the data sharding needs to be organized on?

The following text also makes me believe this is the right approach:

https://learn.microsoft.com/en-us/kusto/management/partitioning-policy?view=microsoft-fabric#supported-scenarios

Out-of-order data ingestion:

  • Data ingested into a table might not be ordered and partitioned into extents (shards) according to a specific datetime column that represents the data creation time and is commonly used to filter data. This could be due to a backfill from heterogeneous source files that include datetime values over a large time span.
  • In this case, set the uniform range datetime partition key to be the datetime column.
  • If you need retention and caching policies to align with the datetime values in the column, instead of aligning with the time of ingestion, set the OverrideCreationTime property to true.

So I would say that we create a partitioning policy for each of the materialized views,

  • setting CustomDateTime as the uniform range datetime partition key
  • setting OverrideCreationTime to true

However, I also read the following:

Only apply a uniform range datetime partition key on a datetime-typed column in a table when data ingested into the table is unlikely to be ordered according to this column.

Does this mean that after we have backfilled and the partitioning has completed, we delete the partitioning policy again? (because we also have daily ingestion which will be automatically partitioned correctly)

So when an extra backfill process starts:

  • The Partitioning policy kicks in and processes the data correctly according to the CustomDateTime column
  • We update the merge policy with a lookback to hot cache so it does its job and merges the extents, and we make sure the data in the MV is in hot cache (see the sketch below).
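
As a rough sketch, those policy changes could look like this (DailyView is a placeholder name; the exact materialized-view variants of the policy commands are in the materialized view policies doc linked above):

    // Keep the view's data in hot cache while the backfill is being re-partitioned
    .alter materialized-view DailyView policy caching hot = 3650d

    // Let the merge process look beyond its default window and consider everything in the hot cache
    .alter-merge materialized-view DailyView policy merge '{"Lookback": {"Kind": "HotCache"}}'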

The partition key in the partitioning policy would then look like this:

{
  "ColumnName": "CustomDateTime",
  "Kind": "UniformRange",
  "Properties": {
    "Reference": "2021-01-01T00:00:00",
    "RangeSize": "1.00:00:00",
    "OverrideCreationTime": true
  }
}
  • We would then adjust the RangeSize depending on how the data is aggregated per view (1d, 7d, ~30d for monthly), as in the sketch below.
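
Applied to one of the views, the full policy (the key above wrapped in a PartitionKeys array) would be set roughly like this; the view name is a placeholder and the RangeSize would differ per view:

    // RangeSize: 1.00:00:00 for the daily view, 7.00:00:00 for the weekly view, ~30.00:00:00 for the monthly view
    .alter materialized-view DailyView policy partitioning
    ```
    {
      "PartitionKeys": [
        {
          "ColumnName": "CustomDateTime",
          "Kind": "UniformRange",
          "Properties": {
            "Reference": "2021-01-01T00:00:00",
            "RangeSize": "1.00:00:00",
            "OverrideCreationTime": true
          }
        }
      ]
    }
    ```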

However, once we have backfilled the historical data, do we delete the partitioning policy again, just like we delete the merge policy after backfilling (as stated in the documentation)?

I think this is the right approach, but I am trying to see if anybody else has come across this scenario in a real-life situation.

Thanks in advance!


Answer accepted by question author
  1. Swapnesh Panchal 1,380 Reputation points Microsoft External Staff Moderator
    2025-08-29T11:10:54.8266667+00:00

    Hi @01725609


    Since we keep our partitioning policy disabled when we are not backfilling, how can we be 100% sure that our backfilled data has been completely sharded?

    You cannot rely on database-level statistics alone, because they always reflect the ongoing ingestion tail.
    To be 100% confident your backfill is completely sharded:

    1. Apply the partitioning policy with an EffectiveDateTime covering just the backfill window (Partitioning Policy).
    2. Or tag the backfill extents at ingest using Extent tags.
    3. Keep those extents hot using a temporary Caching policy.
    4. Use .show extents (docs) filtered on your tags or time range.
    5. Once no unpartitioned extents remain for that slice, the backfill is fully partitioned — even if .show database extents partitioning statistics (docs) still shows a small live tail.
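
    For illustration, the scoping pieces could look like the following; the view name, dates, URL, and tag value are examples only, and the hot_window shown assumes the standard caching policy syntax:

        // (1) Partitioning policy that only applies to extents created on or after EffectiveDateTime
        .alter materialized-view DailyView policy partitioning
        ```
        {
          "PartitionKeys": [
            {
              "ColumnName": "CustomDateTime",
              "Kind": "UniformRange",
              "Properties": { "Reference": "2021-01-01T00:00:00", "RangeSize": "1.00:00:00", "OverrideCreationTime": true }
            }
          ],
          "EffectiveDateTime": "2025-08-01T00:00:00"
        }
        ```

        // (2) Tag the backfill when ingesting into the source table so its extents are easy to find
        .ingest into table SourceTable ('https://mystorage.blob.core.windows.net/backfill/part1.csv.gz') with (format='csv', tags='["backfill:2025-08"]')

        // (3) Keep the backfill time range hot while partitioning and merging run
        .alter materialized-view DailyView policy caching hot = 31d, hot_window = datetime(2025-07-01) .. datetime(2025-09-01)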

    How do we know that our backfilled data has already been partitioned completely, knowing that we also have a constant stream of data coming in, which will also be subject to that partitioning policy?

    Even with continuous ingestion, you can decisively confirm backfill completion by isolating and tracking only the backfill slice:

    1. Isolate the backfill – tag the backfill extents at ingest, or scope the partitioning policy with EffectiveDateTime, so you can track that slice separately.
    2. Keep extents hot
      • Partitioning runs only on hot extents. Temporarily adjust the Caching policy so the backfill time range stays hot until processing is done.
    3. Query backfill extents only
         .show table MyTable extents
         | where Tags has "backfill:2025-08"
         | project ExtentId, MaxCreatedOn, IsHomogeneous, Tags
      
      (.show extents docs)
      • Done = no nonhomogeneous extents remain for that backfill slice.

    Bottom line: despite the constant stream, you know your backfill is complete once its tagged/time-scoped extents are all homogeneous. The live ingestion tail can be ignored.


    We can indeed perfectly follow the progress on a database level using .show database extents partitioning statistics, but it seems that the continuous scraping will also be included all of the time?

    Correct — the command:

    .show database extents partitioning statistics
    

    (docs)

    • Always includes the continuous ingestion tail.
    • The PartitionedRowPercentage will get close to 100% but never exactly reach it while live data keeps arriving.

    How to use it effectively:

    • Treat this as a roll-up health indicator for the table/cluster.
    • For backfill completion, rely instead on:
        .show table MyTable extents
        | where Tags has "backfill:2025-08"
      
      (.show extents docs)

    When no unpartitioned extents remain in the backfill slice, you can consider the backfill fully partitioned — even though .show database extents partitioning statistics continues to show the live scraping tail.


    Final Takeaway

    • Database-level stats always include the live tail, so they can’t be used as the sole indicator.
    • To be 100% sure about backfill completion:
      1. Scope partitioning with EffectiveDateTime or use extent tags.
      2. Keep backfill extents hot until partitioning finishes.
      3. Verify via .show extents that all backfill extents are homogeneous.

    Once those checks are satisfied, you can safely declare the backfill completely sharded, regardless of the ongoing stream.


    If you have any further questions, feel free to respond in this same thread.

    1 person found this answer helpful.

1 additional answer

  1. Swapnesh Panchal 1,380 Reputation points Microsoft External Staff Moderator
    2025-08-28T15:44:02.25+00:00

    Hi @01725609,

    Welcome to the Microsoft Q&A Platform.

    It looks like you're working through a pretty complex scenario with Azure Data Explorer, especially around materialized views and handling historical data.

    Here’s a breakdown of your situation and some recommendations:

    When you backfill historical data in Azure Data Explorer (ADX), materialized views don’t shard data by your original time column. Instead, they use ingestion time. This means:

    • Queries that filter on CustomDateTime (daily, weekly, monthly) become slower because extents are mixed.
    • Retention and caching get misaligned, since ADX treats all backfilled data as “new.”


    The Fix: Partitioning Policy

    To handle this, you can temporarily apply a partitioning policy during backfill:

    • Use CustomDateTime as the partition key.
    • Pick a UniformRange RangeSize (1 day for daily, 7 days for weekly, ~30 days for monthly).
    • Set OverrideCreationTime=true so extents are stamped with the actual data time.

    This keeps queries fast and ensures retention/caching follow the real data age.
    Partitioning policy supported scenarios


    How to Do It

    1. Before backfill – extend caching so the historical extents stay in hot cache.
    2. Apply the partitioning policy – ADX reorganizes the extents by CustomDateTime.
    3. Adjust the merge policy – set its Lookback to HotCache so even “old” extents can merge.
    4. Monitor progress – use .show database extents partitioning statistics.
    5. Cleanup – after partitioning is complete, remove the partitioning and merge policy changes and reset the cache (see the sketch below).

    Ingest historical data – post ingestion
    Materialized view partitioning policies
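
    A rough sketch of the cleanup step, assuming the materialized-view variants of the policy commands covered in the docs linked above (the view name is an example):

        // Remove the temporary partitioning policy once the backfill extents are fully partitioned
        .delete materialized-view DailyView policy partitioning

        // Put the merge-policy lookback back to its default behavior
        .alter-merge materialized-view DailyView policy merge '{"Lookback": {"Kind": "Default"}}'

        // Restore the regular caching policy
        .alter materialized-view DailyView policy caching hot = 31d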


    Should You Keep It On?

    No. Your daily ingestion is already in time order, so default sharding works fine. Keeping the partitioning policy permanently just adds background overhead.
    Best practice: enable during backfill, let it reorganize, then remove it.


    Bottom line: Your thinking is correct. Use partitioning with OverrideCreationTime for backfills, then clean it up. That way you get split-second queries on billions of records without unnecessary ongoing overhead.


    If you have any further questions, feel free to respond in this same thread.

    1 person found this answer helpful.
