Best practices for designing an ETL process for ingesting large volumes of data into Azure Data Explorer

HimanshuSinha (Microsoft Employee, Moderator) asked on 2025-10-22:

How would you design an ETL process for ingesting large volumes of data from multiple sources into Azure Data Explorer? What are some best practices to keep in mind?

Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.

1 answer

  1. Marcin Policht (MVP, Volunteer Moderator) answered on 2025-10-22:

    In short, land data in ADLS/Blob as Parquet whenever possible, batch into 100MB–1GB uncompressed chunks, push via queued ingestion for throughput, use streaming ingestion only where low latency matters, define ingestion mappings and idempotency, tune batching/caching/retention, and monitor ingestion health.

    More detailed info follows:

    1. High-level architecture (patterns)

    Use a hybrid of queued (batch) ingestion for throughput and streaming ingestion for low-latency needs.

    • Producers / Sources
      • Applications, databases (via ADF), IoT devices (IoT Hub), Kafka/Event Hubs, flat files in ADLS Gen2 / Blob Storage, 3rd-party logs.
    • Ingestion layer
      • Event Hub / Kafka for streaming telemetry.
      • Blob/ADLS landing zone + queued ingestion for bulk/historical and reliable ingestion.
      • Azure Data Factory / Synapse / Databricks for ETL orchestration and large-format writes (Parquet).
      • Azure Functions / Stream Analytics / Databricks Structured Streaming for lightweight or advanced transformations.
    • Ingest into ADX
      • Use ADX queued ingestion (preferred for throughput/reliability) or streaming ingestion when <1s latency is required.
      • Use the ADX ingestion client libraries / REST API / .ingest commands as appropriate.
    2. Concrete pipelines / flows
      1. Real-time telemetry
        • Devices → IoT Hub / Kafka → Event Hub → ADX streaming ingestion (or a microservice that writes via the streaming API).
        • Use streaming ingestion when you need sub-second query visibility; otherwise queue.
      2. High-volume batch / bulk
        • Source DBs / apps → ADF/Databricks → write partitioned Parquet to ADLS Gen2 (landing).
        • Use queued ingestion from the ADLS/Blob container into ADX for reliable, cost-efficient loading; Parquet is recommended for columnar, compressed loads (a minimal sketch follows this list).
      3. Backfill / historical
        • Stage large historical files in ADLS, adjust retention/caching policies before the backfill, then use queued ingestion and, if needed, a repartitioning strategy.
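
    To make the queued-ingestion path concrete, here is a minimal Python sketch using the azure-kusto-ingest package. The cluster URI, database, table, and blob URL are placeholders, and interactive login is used only for brevity (managed identity is covered further down).

    ```python
    # pip install azure-kusto-data azure-kusto-ingest
    from azure.kusto.data import KustoConnectionStringBuilder
    from azure.kusto.data.data_format import DataFormat
    from azure.kusto.ingest import QueuedIngestClient, IngestionProperties, BlobDescriptor

    # Queued ingestion goes through the cluster's "ingest-" endpoint.
    INGEST_URI = "https://ingest-mycluster.westeurope.kusto.windows.net"  # placeholder

    kcsb = KustoConnectionStringBuilder.with_interactive_login(INGEST_URI)  # placeholder auth
    ingest_client = QueuedIngestClient(kcsb)

    props = IngestionProperties(
        database="TelemetryDB",          # placeholder database
        table="Events",                  # placeholder table
        data_format=DataFormat.PARQUET,  # columnar, compressed landing format
    )

    # Parquet file already written to the ADLS/Blob landing zone by ADF/Databricks.
    blob_url = "https://mystorage.blob.core.windows.net/landing/events/part-0001.parquet?<SAS>"
    ingest_client.ingest_from_blob(
        BlobDescriptor(blob_url, 250 * 1024 * 1024),  # approximate raw size helps the service batch
        ingestion_properties=props,
    )
    ```

    The call only enqueues the blob; the service batches and commits it later, so data visibility follows the batching policy rather than being immediate.
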
    3. Data format & file-size guidance
    • Preferred formats: Parquet (columnar) or Avro for analytics; CSV/JSON acceptable but less efficient. Columnar reduces I/O and speeds queries.
    • Compression: Use compressed Parquet (snappy, etc.) — reduces storage and network.
    • Batch / file size: Send data in large chunks — recommended 100 MB – 1 GB (uncompressed) per batch for efficient ingestion; ADX batching policy works on uncompressed size. Avoid lots of tiny files.
    • Avoid extremely large single files (e.g., >100 GB): they limit ingestion parallelism and make retries expensive, so partition them into sensible sizes.
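
    Whether the pre-ingest rewrite happens in Databricks/Spark or in a smaller job, the idea is the same: repartition so each Parquet file lands in the target size range. Below is a minimal pyarrow sketch; the paths and rows-per-file cap are illustrative and would be tuned to your row width.

    ```python
    # pip install pyarrow
    import pyarrow.dataset as ds

    # Raw landing zone, possibly containing many small files (placeholder path).
    source = ds.dataset("raw_landing/", format="parquet")

    # Rewrite as snappy-compressed Parquet, capping rows per output file so each file
    # stays roughly in the 100 MB - 1 GB uncompressed range that ADX ingests efficiently.
    ds.write_dataset(
        source,
        base_dir="curated_landing/",  # placeholder output path
        format="parquet",
        file_options=ds.ParquetFileFormat().make_write_options(compression="snappy"),
        max_rows_per_file=5_000_000,  # illustrative; tune to hit the target file size
        max_rows_per_group=500_000,
    )
    ```
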
    4. Ingestion mode choices & specifics
    • Queued ingestion — preferred for throughput and reliability; supports large volumes and automatic batching/merge of small files into extents. Default batching (time/size/count) is usually 5min/1GB/1000 blobs but can be tuned. Use the Kusto ingest library for retries & batching control.
    • Streaming ingestion — low-latency, useful when you need immediate availability (logs/alerts). More expensive per-record and less efficient for extremely high throughput per table; often used for small per-table streams.
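
    For the low-latency path, the sketch below shows streaming ingestion with the Python SDK. It assumes streaming ingestion is already enabled at the cluster level; the database, table, mapping name, and payload are placeholders.

    ```python
    # pip install azure-kusto-data azure-kusto-ingest
    import io
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
    from azure.kusto.data.data_format import DataFormat
    from azure.kusto.ingest import KustoStreamingIngestClient, IngestionProperties

    ENGINE_URI = "https://mycluster.westeurope.kusto.windows.net"  # engine (not ingest-) endpoint, placeholder
    kcsb = KustoConnectionStringBuilder.with_interactive_login(ENGINE_URI)  # placeholder auth

    # One-time setup: the target table must allow streaming ingestion.
    KustoClient(kcsb).execute_mgmt("TelemetryDB", ".alter table Events policy streamingingestion enable")

    streaming_client = KustoStreamingIngestClient(kcsb)
    props = IngestionProperties(
        database="TelemetryDB",
        table="Events",
        data_format=DataFormat.JSON,
        ingestion_mapping_reference="EventsJsonMapping",  # assumes this mapping exists (see section 5)
    )

    # Small, latency-sensitive payloads; keep bulk loads on the queued path.
    payload = io.BytesIO(b'{"deviceId": "sensor-01", "temperature": 21.7, "ts": "2025-10-22T20:00:00Z"}')
    streaming_client.ingest_from_stream(payload, ingestion_properties=props)
    ```
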
    5. Schema, mapping, and transformation strategies
    • Schema design: Model tables for the queries you expect. Use narrow, columnar-friendly types and avoid wide, catch-all columns when possible.
    • Ingestion mappings: Define JSON/CSV/Parquet mappings for nested structures and to ensure robust parsing. Use explicit mappings for JSON arrays/objects.
    • Update policies: Use ADX update policies to perform server-side transforms during ingestion (e.g., denormalize/enrich). Use cautiously — they run on cluster resources and may add latency/cost.
    • Transform location: Prefer doing heavy transformations in Databricks/ADF (pre-ingest) and keep ADX for fast analytics queries and light enrichments.
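
    To illustrate the mapping and update-policy points, the management commands below are wrapped in the Python SDK; the table, mapping, and function names (RawEvents, CuratedEvents, EventsJsonMapping, ExpandRawEvents) are hypothetical.

    ```python
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_interactive_login(
        "https://mycluster.westeurope.kusto.windows.net")  # placeholder engine endpoint
    client = KustoClient(kcsb)
    db = "TelemetryDB"  # placeholder

    # Explicit JSON ingestion mapping so nested fields parse predictably.
    client.execute_mgmt(db, (
        '.create-or-alter table RawEvents ingestion json mapping "EventsJsonMapping" '
        "'["
        '{"column":"deviceId","Properties":{"Path":"$.deviceId"}},'
        '{"column":"temperature","Properties":{"Path":"$.temperature"}},'
        '{"column":"ts","Properties":{"Path":"$.ts"}}'
        "]'"
    ))

    # Server-side enrichment at ingestion time: assumes a stored function ExpandRawEvents()
    # already exists and that CuratedEvents is the enriched target table. Use sparingly,
    # since update policies consume cluster resources during ingestion.
    client.execute_mgmt(db, (
        ".alter table CuratedEvents policy update "
        '@\'[{"IsEnabled": true, "Source": "RawEvents", "Query": "ExpandRawEvents()", "IsTransactional": false}]\''
    ))
    ```
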
    6. Partitioning, extents & performance tuning
    • ADX stores data in extents; fewer larger extents usually query faster than many tiny ones. Batching and larger files help create larger extents and reduce fragmentation.
    • Tune ingestion batching policy (time, size, count) at database/table level to control when batches are created and committed. For high throughput, increase batch size limits (but test).
    • Use caching (hot cache) policy for recent/hot datasets to prioritize SSD/RAM for faster queries; set retention and caching windows to balance cost vs performance.
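
    The batching, caching, and retention knobs above map to a handful of management commands; a sketch follows (the table name and values are illustrative, and the batching numbers shown happen to match the service defaults).

    ```python
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_interactive_login(
        "https://mycluster.westeurope.kusto.windows.net")  # placeholder engine endpoint
    client = KustoClient(kcsb)
    db = "TelemetryDB"  # placeholder

    # Ingestion batching: seal a batch after 5 minutes, 1 GB of raw data, or 1000 blobs,
    # whichever comes first.
    client.execute_mgmt(db, (
        ".alter table Events policy ingestionbatching "
        '@\'{"MaximumBatchingTimeSpan": "00:05:00", "MaximumNumberOfItems": 1000, "MaximumRawDataSizeMB": 1024}\''
    ))

    # Hot cache: keep the most recent 14 days on local SSD for fast queries.
    client.execute_mgmt(db, ".alter table Events policy caching hot = 14d")

    # Retention: soft-delete data older than 365 days to control storage cost.
    client.execute_mgmt(db, ".alter-merge table Events policy retention softdelete = 365d")
    ```
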
    7. Reliability, ordering, and semantics
    • Exactly-once is not guaranteed: ADX queued ingestion provides at-least-once semantics, so build idempotency if dedup is needed (e.g., ingest-by tags with ingestIfNotExists, a sourceId ingestion property, or application-level de-dup keys); see the sketch after this section.
    • Use ingestion reporting / diagnostics (.show ingestion failures, ingestion status APIs) to detect failed or partial ingestions.
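
    One practical idempotency pattern on the queued path is to stamp each batch with an ingest-by tag and ask the service to skip blobs whose tag is already present. A sketch using the Python SDK's ingestion properties follows; the tag value, names, and URLs are placeholders.

    ```python
    from azure.kusto.data import KustoConnectionStringBuilder
    from azure.kusto.data.data_format import DataFormat
    from azure.kusto.ingest import QueuedIngestClient, IngestionProperties, BlobDescriptor

    kcsb = KustoConnectionStringBuilder.with_interactive_login(
        "https://ingest-mycluster.westeurope.kusto.windows.net")  # placeholder ingest endpoint
    ingest_client = QueuedIngestClient(kcsb)

    # A stable per-batch identifier (e.g. pipeline run id + file name) makes re-sends harmless:
    # blobs whose ingest-by tag already exists in the table are skipped instead of duplicated.
    batch_tag = "adf-run-2025-10-22/part-0001"  # placeholder

    props = IngestionProperties(
        database="TelemetryDB",          # placeholder
        table="Events",                  # placeholder
        data_format=DataFormat.PARQUET,
        ingest_by_tags=[batch_tag],
        ingest_if_not_exists=[batch_tag],
    )

    blob_url = "https://mystorage.blob.core.windows.net/landing/events/part-0001.parquet?<SAS>"
    ingest_client.ingest_from_blob(BlobDescriptor(blob_url, 250 * 1024 * 1024),
                                   ingestion_properties=props)
    ```

    Keep ingest-by tags coarse (one per batch or pipeline run, not per record); large numbers of distinct tags add extent metadata overhead.
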
    8. Monitoring, alerting, and observability
    • Monitor: ingestion throughput, ingestion success/failure, ingestion latency, number of extents, cache hit ratio, CPU/ingest queue metrics.
    • Set alerts on ingestion-failure rates, backlog growth, and throttling. Use Azure Monitor + ADX metrics + logs.
    • Capture lineage: tag ingestions with properties (pipeline id, job run id) so you can trace back problematic batches.
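
    On the programmatic side of monitoring, recent ingestion failures can be pulled with a management command and surfaced in dashboards, alerts, or pipeline run reports; a small sketch follows (the database name is a placeholder).

    ```python
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_interactive_login(
        "https://mycluster.westeurope.kusto.windows.net")  # placeholder engine endpoint
    client = KustoClient(kcsb)

    # Ingestion failures from the last day; wire this into alerting or job summaries.
    response = client.execute_mgmt(
        "TelemetryDB",
        ".show ingestion failures "
        "| where FailedOn > ago(1d) "
        "| project FailedOn, Table, Details, OperationId",
    )
    for row in response.primary_results[0]:
        print(row["FailedOn"], row["Table"], row["Details"])
    ```
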
    9. Security & governance
    • Use Managed Identity for ADX ingestion from ADF/Databricks/Functions; avoid embedding keys.
    • Use Private Endpoints or VNet integration for secure connectivity between storage and ADX.
    • Use RBAC roles for access control, and ADX data masking or column-level protections if needed.
    • Encrypt-at-rest and in-transit are standard; ensure keys (if customer-managed) are stored in Key Vault.
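
    For the managed-identity recommendation, the connection can be built without any embedded secrets; a short sketch follows (the cluster URI is a placeholder, and the exact builder method name may differ slightly between SDK versions).

    ```python
    from azure.kusto.data import KustoConnectionStringBuilder
    from azure.kusto.ingest import QueuedIngestClient

    INGEST_URI = "https://ingest-mycluster.westeurope.kusto.windows.net"  # placeholder

    # Uses the system-assigned managed identity of the host (ADF, Functions, Databricks, VM);
    # pass client_id=... to select a user-assigned identity instead. No keys or secrets involved.
    kcsb = KustoConnectionStringBuilder.with_aad_managed_service_identity_authentication(INGEST_URI)
    ingest_client = QueuedIngestClient(kcsb)
    ```

    The identity still needs an appropriate ADX role (for example, Database Ingestor on the target database) granted through RBAC.
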
    10. Cost & lifecycle management
    • Use retention and caching policies to move older data to colder storage or limit hot cache; ADX retention/caching policies let you trade cost vs query performance.
    • Compress data (Parquet), and prefer columnar storage to reduce compute and storage costs.
    11. Operational checklist (practical)
    • Use Parquet for analytics; avoid many tiny files.
    • Batch ingestions to ~100MB–1GB uncompressed for best throughput; tune ingestion batching policy.
    • Prefer queued ingestion for bulk; streaming only if you need sub-second latency.
    • Create ingestion mappings and validate on sample data.
    • Implement dedup/idempotency strategy (SourceId/SourceKey) if producers can resend.
    • Establish monitoring dashboards & alerts for ingestion failures and backlog.
    • Secure connections with Managed Identities and private endpoints.
    • Test backfill process and repartitioning plan for historical data (adjust retention/caching as needed).

    More at:
    • https://learn.microsoft.com/en-us/azure/data-explorer/ingestion-faq
    • https://docs.azure.cn/en-us/data-explorer/ingest-data-streaming
    • https://docs.azure.cn/en-us/data-explorer/data-lake-query-data
    • https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-historical
    • https://learn.microsoft.com/en-us/kusto/api/netfx/kusto-ingest-best-practices
    • https://docs.azure.cn/en-us/data-explorer/ingestion-faq
    • https://learn.microsoft.com/en-us/kusto/management/alter-table-command
    • https://learn.microsoft.com/en-us/kusto/management/batching-policy
    • https://learn.microsoft.com/en-us/azure/data-explorer/table-retention-policy-wizard
    • https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-overview


    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

