Challenges with data policy

azure_learner 920 Reputation points
2025-11-18T10:40:31.84+00:00

Hi kind and helpful friends, our scenario is as follows. We currently have a retention policy on our ADLS landing, bronze, and silver layers. The data we ingest every day is appended as a full load; for example, today's data is appended to the lot that was ingested on day one in the landing zone. The daily volume is around 2 GB, so it has been smooth sailing so far. The main reason for the full loads is that we ingest the data from an API endpoint, and the source side has challenges sending incremental data to the business.

We have since implemented Auto Loader, and the file format is being changed from Parquet to Delta (a minimal sketch of that kind of stream follows the list below). However, there is one data source with a volume of more than 12 GB per day, and proceeding with the same full-append setup described above causes throughput and performance issues. With the business's agreement we decided to implement incremental loads for it (this source is a relational DB). We have also decided to do away with the retention policy there, for these reasons:

1. Late-arriving data and correction loss.

2. Breaks time travel and historical queries.

3. Incremental load reconciliation failures.

4. Unavailability of physical Delta tables, which would break downstream analytics.
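
For context, here is a minimal sketch of the kind of Auto Loader stream described above. The paths, table names, and options are illustrative assumptions, not our actual configuration (the `cloudFiles` source is Databricks-specific):

```python
# Hedged sketch: incremental file ingestion from the landing zone into a bronze Delta table.
# All paths and names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")  # source files currently land as Parquet
    .option("cloudFiles.schemaLocation", "abfss://lake@account.dfs.core.windows.net/_schemas/source_a")
    .load("abfss://lake@account.dfs.core.windows.net/landing/source_a/")
    .writeStream.format("delta")
    .option("checkpointLocation", "abfss://lake@account.dfs.core.windows.net/_checkpoints/source_a")
    .trigger(availableNow=True)              # batch-style run: process only files not yet seen
    .toTable("bronze.source_a")
)
```

Because Auto Loader tracks processed files in the checkpoint, each run picks up only files that have not been ingested before.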

Please share your informed input on whether keeping the retention policy on the use case where data is nearly 2 GB daily has any potential lurking dangers. What should be our best approach in this context to ensure scalability and performance with no breakage in data insights?

As always extremely grateful for your support. Thank you.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

3 answers

  1. Q&A Assist
    2025-11-18T10:41:07.7933333+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Implementing a retention policy in your Azure Data Lake Storage (ADLS) setup can have both advantages and potential challenges, especially when dealing with varying data volumes and the need for incremental loads.

    Potential Challenges with Retention Policy:

    1. Late-Arriving Data & Correction Loss: If you enforce a strict retention policy, late-arriving data may be discarded, leading to incomplete datasets.
    2. Breaks Time Travel and Historical Queries: Retention policies can limit your ability to perform time travel queries, which are essential for historical analysis and auditing.
    3. Incremental Load Reconciliation Failure: If your data ingestion process is not fully aligned with the retention policy, you might face reconciliation issues, especially with incremental loads.
    4. Unavailability of Physical Delta Tables: If delta tables are not retained, it could hinder the analytics process, as these tables are often crucial for efficient querying and data transformations.

    Recommendations for Scalability and Performance:

    • Evaluate Data Volume: Since you are currently ingesting around 2 GB daily, ensure that your retention policy is flexible enough to accommodate fluctuations in data volume without risking data loss.
    • Incremental Loading Strategy: As you move towards an incremental load strategy, consider a retention policy with a grace period for late-arriving data, balancing data freshness against storage costs (see the sketch after this list).
    • Monitor Performance: Regularly monitor the performance of your ingestion pipeline and adjust the retention policy as needed. This can help in identifying any bottlenecks or performance issues that arise from the retention settings.
    • Testing and Validation: Before fully implementing the retention policy, conduct tests to validate that your data processing and analytics workflows remain intact and efficient.
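
    As a rough illustration of the grace-period idea above, the Delta table's own retention settings can be kept longer than the expected late-arrival window. The table name and durations below are placeholders, not recommended values:

    ```python
    # Hedged sketch: keep Delta history and tombstones longer than the late-arrival window
    # so corrections and time travel stay possible. Table name and intervals are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        ALTER TABLE silver.events SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 30 days',          -- time travel window
            'delta.deletedFileRetentionDuration' = 'interval 14 days'   -- VACUUM grace period
        )
    """)
    ```

    Storage lifecycle rules on the landing container can then expire raw files on a separate, longer schedule than the Delta tables themselves.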

    In summary, while implementing a retention policy can help in managing storage costs and data lifecycle, it is crucial to tailor it to your specific use case, particularly considering the challenges of late-arriving data and the need for historical insights. Regular monitoring and adjustments will be key to ensuring scalability and performance without breaking data insights.




  2. azure_learner 920 Reputation points
    2025-11-22T13:03:41.3133333+00:00

    Hi experts, could you please help me with the above question? Thank you.


  3. Pilladi Padma Sai Manisha 500 Reputation points Microsoft External Staff Moderator
    2025-11-27T10:54:02.33+00:00

    Hi azure_learner,
    Thank you for reaching out to Microsoft Q&A!
    It looks like you're navigating some complex challenges with your data retention policies and ingestion processes in Azure Data Lake Storage (ADLS). Here’s a breakdown of your concerns and some suggestions on your approach!
    For the daily ingestion of around 2 GB using retention policies on the ADLS landing, bronze, and silver layers, there are some important considerations to keep in mind. While retention policies help manage storage and improve query performance, they can sometimes lead to challenges such as losing late-arriving or corrected data, breaking historical queries due to time travel limitations, and causing reconciliation failures in incremental load processes. These risks are important but can be managed effectively with the right approach.

    When working with larger volumes like 12 GB or more daily, it’s often necessary to switch to an incremental load strategy. This helps address throughput and performance issues more efficiently. Delta Lake’s MERGE operation combined with incremental loads allows handling updates and deduplication without losing data integrity. You can retain the benefits of retention policies by setting a reasonable grace period that captures late-arriving records and cleans up only older data to avoid storage bloat.
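
    A minimal sketch of such an incremental MERGE is below; the table and key names are placeholders for illustration:

    ```python
    # Hedged sketch of an incremental upsert with Delta Lake MERGE.
    # `bronze.orders_increment` is assumed to hold only the new/changed rows
    # extracted from the relational source; all names are placeholders.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    updates_df = spark.read.table("bronze.orders_increment")  # today's incremental extract
    target = DeltaTable.forName(spark, "silver.orders")

    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()      # apply late corrections to existing rows
        .whenNotMatchedInsertAll()   # insert genuinely new rows
        .execute()
    )
    ```

    Because the MERGE is keyed on the business identifier, late-arriving corrections update existing rows instead of creating duplicates.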

    The best approach for your scenario is to continue using a retention policy for the smaller 2 GB daily append use case but ensure the retention settings allow enough time for late data adjustments. For the larger and more demanding source, adopt incremental loading with Delta Lake's capabilities to maintain high throughput and data accuracy. This balanced strategy supports scalability, performance, and uninterrupted data insights without risking data loss or analytics failures.

    By carefully combining retention policies with incremental loading and Delta Lake’s features like time travel, vacuum, and schema enforcement, you can achieve a robust, scalable, and performant pipeline. This preserves historical data for auditing and analytics while optimizing resources and minimizing issues related to late-arriving data or reconciliation failures.
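
    To make that concrete, here is a hedged sketch of how vacuum and time travel interact; the retention window, table name, and version are placeholders:

    ```python
    # Hedged sketch: VACUUM removes only data files older than the retention window,
    # so historical reads within that window keep working. Values are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Physically delete unreferenced files older than 14 days (336 hours).
    spark.sql("VACUUM silver.orders RETAIN 336 HOURS")

    # Historical query within the retained window, e.g. for reconciliation or audits.
    snapshot = spark.sql("SELECT * FROM silver.orders VERSION AS OF 0")
    snapshot.show(5)
    ```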

    I hope this helps you strategize your approach effectively! If you need further assistance, feel free to ask!

