Azure Data Explorer: Working with aggregation materialized views and historical data

01725609 105 Reputation points
2025-08-27T07:13:57.3366667+00:00

Hi,

We have the following scenario:

SourceTable => Deduplication Materialized View => Multiple Aggregation Materialized views based on the deduplicated view (Daily, Weekly, Monthly)

In the source table we can backfill historical data with LightIngest so that all data sharding is handled correctly, based on a custom datetime column in our data. This custom datetime column is also a group-by key in our materialized views (CustomDateTime, bin(CustomDateTime, 1d), ...).
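
For reference, a simplified sketch of what that chain looks like (SourceTable, DedupView, DailyView, DeviceId and Value are placeholder names, not our real schema):

    // Deduplication view: keeps one record per business key + CustomDateTime
    .create materialized-view DedupView on table SourceTable
    {
        SourceTable
        | summarize take_any(*) by DeviceId, CustomDateTime
    }

    // Daily aggregation view on top of the deduplication view, keyed on bin(CustomDateTime, 1d)
    .create materialized-view DailyView on materialized-view DedupView
    {
        DedupView
        | summarize RecordCount = count(), TotalValue = sum(Value) by CustomDateTime = bin(CustomDateTime, 1d)
    }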

However, when we create materialized views on top of the source data, that sharding logic is lost: the views' extents are organized by materialization (ingestion) time rather than by the original datetime column.

We can make sure this is handled correctly by using the backfill parameter when creating the materialized view; however, this does not help when we add extra historical data to the source table after the view has already been created (which is a valid scenario in our case).
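
For clarity, this is the creation-time option we mean; the property values are just an example, and (if we read the create-command docs correctly) updateExtentsCreationTime can stamp the backfilled extents based on the datetime group-by key:

    // async is required when backfill = true; effectiveDateTime limits how far back the backfill goes.
    // updateExtentsCreationTime = true stamps the backfilled extents based on the datetime group-by key instead of the ingestion time.
    .create async materialized-view with (backfill = true, effectiveDateTime = datetime(2021-01-01), updateExtentsCreationTime = true) DedupView on table SourceTable
    {
        SourceTable
        | summarize take_any(*) by DeviceId, CustomDateTime
    }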

What would be the best approach here?

  • We are talking about tens of billions of records, and we need split-second query performance.
  • We have a daily ingestion that is always ordered by ingestion time, so that is never an issue, but from time to time we need to add new historical data to the source table, and that is where it becomes a problem due to the volume of data.

Is adding a Partitioning Policy to the materialized views so that data sharding is still handled correctly the best approach?

However, in the documentation I read the following:

https://learn.microsoft.com/en-us/kusto/management/materialized-views/materialized-view-policies?view=microsoft-fabric#partitioning-policy

A partitioning policy can be applied on a materialized view. We recommend configuring a partitioning policy on a materialized view only when most or all of the view queries filter by one of the materialized view's group-by keys. This situation is common in multi-tenant solutions, where one of the materialized view's group-by keys is the tenant's identifier (for example, tenantId, customerId). For more information, see the first use case described in the partitioning policy supported scenarios page. For the commands to alter a materialized view's partitioning policy, see partitioning policy commands. Adding a partitioning policy on a materialized view increases the number of extents in the materialized view, and creates more "work" for the materialization process. For more information on the reason for this behavior, see the extents rebuild process mentioned in how materialized views work.

Is this also a valid scenario when the CustomDateTime column is the one the data sharding needs to be organized on?

The following text also makes me believe this is the right approach:

https://learn.microsoft.com/en-us/kusto/management/partitioning-policy?view=microsoft-fabric#supported-scenarios

Out-of-order data ingestion:

  • Data ingested into a table might not be ordered and partitioned into extents (shards) according to a specific datetime column that represents the data creation time and is commonly used to filter data. This could be due to a backfill from heterogeneous source files that include datetime values over a large time span.
  • In this case, set the uniform range datetime partition key to be the datetime column.
  • If you need retention and caching policies to align with the datetime values in the column, instead of aligning with the time of ingestion, set the OverrideCreationTime property to true.

So I would say that we create a partitioning policy for each of the materialized views,

  • setting CustomDateTime as the uniform range datetime partition key
  • setting OverrideCreationTime to true

However, I also read the following:

Only apply a uniform range datetime partition key on a datetime-typed column in a table when data ingested into the table is unlikely to be ordered according to this column.

Does this mean that after we have backfilled and the partitioning has completed, we delete the partitioning policy again? (because we also have daily ingestion which will be automatically partitioned correctly)

So when an extra backfill process starts:

  • The Partitioning policy kicks in and processes the data correctly according to the CustomDateTime column
  • We update the merge policy with a lookback to hot cache so it does its job and merges the extents, and we make sure the data in the MV is in hot cache (see the sketch below).
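
As a rough sketch, those policy changes could look like this (DailyView is a placeholder name; the exact materialized-view variants of the policy commands are in the materialized view policies doc linked above):

    // Keep the view's data in hot cache while the backfill is being re-partitioned
    .alter materialized-view DailyView policy caching hot = 3650d

    // Let the merge process look beyond its default window and consider everything in the hot cache
    .alter-merge materialized-view DailyView policy merge '{"Lookback": {"Kind": "HotCache"}}'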

The partition key in the partitioning policy would then look like this:

{
  "ColumnName": "CustomDateTime",
  "Kind": "UniformRange",
  "Properties": {
    "Reference": "2021-01-01T00:00:00",
    "RangeSize": "1.00:00:00",
    "OverrideCreationTime": true
  }
}
  • We would then adjust the RangeSize depending on how the data is aggregated per view (1d, 7d, ~30d for monthly), as in the sketch below.
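
Applied to one of the views, the full policy (the key above wrapped in a PartitionKeys array) would be set roughly like this; the view name is a placeholder and the RangeSize would differ per view:

    // RangeSize: 1.00:00:00 for the daily view, 7.00:00:00 for the weekly view, ~30.00:00:00 for the monthly view
    .alter materialized-view DailyView policy partitioning
    ```
    {
      "PartitionKeys": [
        {
          "ColumnName": "CustomDateTime",
          "Kind": "UniformRange",
          "Properties": {
            "Reference": "2021-01-01T00:00:00",
            "RangeSize": "1.00:00:00",
            "OverrideCreationTime": true
          }
        }
      ]
    }
    ```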

However, once we have backfilled the historical data, do we delete the partitioning policy again, just like we delete the merge policy after backfilling (as stated in the documentation)?

I think this is the right approach, but I am trying to see if anybody else has come across this scenario in a real-life situation.

Thanks in advance!


Answer accepted by question author
  1. Swapnesh Panchal 1,380 Reputation points Microsoft External Staff Moderator
    2025-08-29T11:10:54.8266667+00:00

    Hi @01725609


    Since we keep our partitioning policy disabled when we are not backfilling, how can we be 100% sure that our backfilled data has been completely sharded?

    You cannot rely on database-level statistics alone, because they always reflect the ongoing ingestion tail.
    To be 100% confident your backfill is completely sharded:

    1. Apply the partitioning policy with an EffectiveDateTime covering just the backfill window (Partitioning Policy).
    2. Or tag the backfill extents at ingest using Extent tags.
    3. Keep those extents hot using a temporary Caching policy.
    4. Use .show extents (docs) filtered on your tags or time range.
    5. Once no unpartitioned extents remain for that slice, the backfill is fully partitioned — even if .show database extents partitioning statistics (docs) still shows a small live tail.
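
    For illustration, the scoping pieces could look like the following; the view name, dates, URL, and tag value are examples only, and the hot_window shown assumes the standard caching policy syntax:

        // (1) Partitioning policy that only applies to extents created on or after EffectiveDateTime
        .alter materialized-view DailyView policy partitioning
        ```
        {
          "PartitionKeys": [
            {
              "ColumnName": "CustomDateTime",
              "Kind": "UniformRange",
              "Properties": { "Reference": "2021-01-01T00:00:00", "RangeSize": "1.00:00:00", "OverrideCreationTime": true }
            }
          ],
          "EffectiveDateTime": "2025-08-01T00:00:00"
        }
        ```

        // (2) Tag the backfill when ingesting into the source table so its extents are easy to find
        .ingest into table SourceTable ('https://mystorage.blob.core.windows.net/backfill/part1.csv.gz') with (format='csv', tags='["backfill:2025-08"]')

        // (3) Keep the backfill time range hot while partitioning and merging run
        .alter materialized-view DailyView policy caching hot = 31d, hot_window = datetime(2025-07-01) .. datetime(2025-09-01)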

    How do we know that our backfilled data has already been partitioned completely, knowing that we also have a constant stream of data coming in, which will also be subject to that partitioning policy?

    Even with continuous ingestion, you can decisively confirm backfill completion by isolating and tracking only the backfill slice:

    1. Isolate the backfill – tag the backfill extents at ingest, or scope the partitioning policy with EffectiveDateTime, so you can track that slice separately.
    2. Keep extents hot
      • Partitioning runs only on hot extents. Temporarily adjust the Caching policy so the backfill time range stays hot until processing is done.
    3. Query backfill extents only
         .show table MyTable extents
         | where Tags has "backfill:2025-08"
         | project ExtentId, MaxCreatedOn, IsHomogeneous, Tags
      
      (.show extents docs)
      • Done = no nonhomogeneous extents remain for that backfill slice.

    Bottom line: despite the constant stream, you know your backfill is complete once its tagged/time-scoped extents are all homogeneous. The live ingestion tail can be ignored.


    We can indeed perfectly follow the progress on a database level using .show database extents partitioning statistics, but it seems that the continuous scraping will also be included all of the time?

    Correct — the command:

    .show database extents partitioning statistics
    

    (docs)

    • Always includes the continuous ingestion tail.
    • The PartitionedRowPercentage will get close to 100% but never exactly reach it while live data keeps arriving.

    How to use it effectively:

    • Treat this as a roll-up health indicator for the table/cluster.
    • For backfill completion, rely instead on:
        .show table MyTable extents
        | where Tags has "backfill:2025-08"
      
      (.show extents docs)

    When no unpartitioned extents remain in the backfill slice, you can consider the backfill fully partitioned — even though .show database extents partitioning statistics continues to show the live scraping tail.


    Final Takeaway

    • Database-level stats always include the live tail, so they can’t be used as the sole indicator.
    • To be 100% sure about backfill completion:
      1. Scope partitioning with EffectiveDateTime or use extent tags.
      2. Keep backfill extents hot until partitioning finishes.
      3. Verify via .show extents that all backfill extents are homogeneous.

    Once those checks are satisfied, you can safely declare the backfill completely sharded, regardless of the ongoing stream.


    If you have any further questions, feel free to respond in this same thread.

    1 person found this answer helpful.

1 additional answer

  1. Swapnesh Panchal 1,380 Reputation points Microsoft External Staff Moderator
    2025-08-28T15:44:02.25+00:00

    Hi @01725609,

    Welcome to the Microsoft Q&A Platform.

    It looks like you're working through a pretty complex scenario with Azure Data Explorer, especially around materialized views and handling historical data.

    Here’s a breakdown of your situation and some recommendations:

    When you backfill historical data in Azure Data Explorer (ADX), materialized views don’t shard data by your original time column. Instead, they use ingestion time. This means:

    • Queries that filter on CustomDateTime (daily, weekly, monthly) become slower because extents are mixed.
    • Retention and caching get misaligned, since ADX treats all backfilled data as “new.”


    The Fix: Partitioning Policy

    To handle this, you can temporarily apply a partitioning policy during backfill:

    • Use CustomDateTime as the partition key.
    • Pick a UniformRange RangeSize (1 day for daily, 7 days for weekly, ~30 days for monthly).
    • Set OverrideCreationTime=true so extents are stamped with the actual data time.

    This keeps queries fast and ensures retention/caching follow the real data age.
    Partitioning policy supported scenarios


    How to Do It

    1. Before backfill – extend caching so the historical extents stay in hot cache.
    2. Apply the partitioning policy – ADX reorganizes the extents by CustomDateTime.
    3. Adjust the merge policy – set its Lookback to HotCache so even “old” extents can merge.
    4. Monitor progress – use .show database extents partitioning statistics.
    5. Cleanup – after partitioning is complete, remove the partitioning and merge policy changes and reset the cache (see the sketch below).

    Ingest historical data – post ingestion
    Materialized view partitioning policies
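
    A rough sketch of the cleanup step, assuming the materialized-view variants of the policy commands covered in the docs linked above (the view name is an example):

        // Remove the temporary partitioning policy once the backfill extents are fully partitioned
        .delete materialized-view DailyView policy partitioning

        // Put the merge-policy lookback back to its default behavior
        .alter-merge materialized-view DailyView policy merge '{"Lookback": {"Kind": "Default"}}'

        // Restore the regular caching policy
        .alter materialized-view DailyView policy caching hot = 31d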


    Should You Keep It On?

    No. Your daily ingestion is already in time order, so default sharding works fine. Keeping the partitioning policy permanently just adds background overhead.
    Best practice: enable during backfill, let it reorganize, then remove it.


    Bottom line: Your thinking is correct. Use partitioning with OverrideCreationTime for backfills, then clean it up. That way you get split-second queries on billions of records without unnecessary ongoing overhead.


    If you have any further questions, feel free to respond in this same thread.

    1 person found this answer helpful.
