Data Pipeline Using Stream Analytics Job with Event Hubs for Input and Output Dropping Events

Tarik Rashada 25 Reputation points
2025-10-31T21:36:11.58+00:00

I have a data pipeline that includes these services in sequence:

Input Event Hub -> Stream Analytics Job (With a simple WHERE clause Filter) -> Output Event Hub -> Function App (With EventHubTrigger)

I am sending a load test of 1000 events over a period of 5 seconds from my laptop and it appears that some of the events are being lost between the Stream Analytics Job and the output event hub.

I used az cli commands of the following form to determine that all of my events are reaching the input event hub from my laptop and that the stream analytics job is consuming and writing all 1000 events:

```
az monitor metrics list --resource <> --metric IncomingMessages \
  --start-time <start-time> --interval PT1M --aggregation Total \
  --query 'value[0].timeseries[0].data[?total != `null`].total' -o tsv \
  | awk '{s+=$1} END {print "Total:", s+0}'
```

I used similar commands to check that (a sketch of an equivalent check appears after this list):

  1. The Stream Analytics InputEvents was 1000
  2. The Stream Analytics OutputEvents was 1000
  3. The IncomingMessages to the Output Event Hub was < 1000
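For reference, here is a minimal sketch of an equivalent check in Python rather than the az CLI, assuming the azure-identity and azure-mgmt-monitor packages; the subscription ID, resource ID, and time window are placeholders:

```python
# Sum the per-minute Total values of a metric over the test window,
# mirroring the az monitor / awk pipeline above (sketch, not production code).
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"                   # placeholder
RESOURCE_ID = "<stream-analytics-job-resource-id>"      # placeholder
WINDOW = "2025-10-31T21:30:00Z/2025-10-31T21:45:00Z"    # placeholder timespan

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def metric_total(resource_id, metric_name, timespan):
    """Sum a Total-aggregated metric across all returned time series."""
    result = client.metrics.list(
        resource_id,
        timespan=timespan,
        interval="PT1M",
        metricnames=metric_name,
        aggregation="Total",
    )
    return sum(
        point.total or 0
        for metric in result.value
        for series in metric.timeseries
        for point in series.data
    )

print("InputEvents :", metric_total(RESOURCE_ID, "InputEvents", WINDOW))
print("OutputEvents:", metric_total(RESOURCE_ID, "OutputEvents", WINDOW))
```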

In particular, the events lost here tend to be 20-30 out of 1000. More events get lost in an application that inserts into a database after the Function App, but I suspect I know the issue there.

I also monitored the CPU utilization on the Stream Analytics job, and it never appears to go above 12%.

Things I have tried

  1. Increasing Streaming Units on Stream Analytics Job
  2. Increasing throughput units on the Event Hubs. (I initially thought it could be a cold-start issue, or that auto-inflate did not scale the Event Hub throughput fast enough. However, when viewing the sequence numbers I found that although events are lost at the beginning, events are also lost throughout the load test.)
  3. Monitoring Errors in the Stream Analytics Job during the test runs (I found no errors)
  4. Toggling the tolerance windows for events that arrive late and for out-of-order events. The action for handling these events remained Adjust, so even if events arrived late or out of order they should not have been dropped from the pipeline.
  5. Adding a sequence number to my events to track whether there is a pattern to the dropped events (are they mostly at the end of the stream or at the beginning?). One interesting finding is that the dropped events are mostly consecutive runs (see the gap-detection sketch after this list). For example, one run dropped the following sequence numbers:
    ```
    missing_id: 1, 2, 3, 4, 5, 6, 7, 8, 9,
                13, 17, 21,
                92, 93, 96, 97,
                207, 208, 209, 211, 212, 213,
                497, 498, 499, 500, 501, 502, 503, 504,
                783, 784, 785, 787, 788, 789, 790, 794
    ```
  6. Toggling on Auto-Inflate on the Event Hubs
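
To quantify that pattern, here is a minimal gap-detection sketch in Python; the `received_ids` sample is hypothetical and would in practice come from whatever consumer or log collects the events downstream:

```python
# Given the custom sequence numbers that actually arrived, report which IDs
# are missing and collapse them into consecutive (start, end) runs.
def missing_runs(received_ids, expected_total):
    missing = sorted(set(range(1, expected_total + 1)) - set(received_ids))
    runs, start = [], None
    for i, val in enumerate(missing):
        if start is None:
            start = val
        # Close the run when the next missing ID is not consecutive.
        if i + 1 == len(missing) or missing[i + 1] != val + 1:
            runs.append((start, val))
            start = None
    return missing, runs

received_ids = [10, 11, 12, 14, 15, 16, 18, 19, 20]   # hypothetical sample
missing, runs = missing_runs(received_ids, expected_total=20)
print("missing IDs     :", missing)    # [1..9, 13, 17]
print("consecutive runs:", runs)       # [(1, 9), (13, 13), (17, 17)]
```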

One observation is that the rate at which I send events does seem to affect the number of dropped events.

In particular, if I send 1000 records and use 3 streaming units on the Stream Analytics job, with an input Event Hub throughput of 2 units and an output Event Hub throughput of 5 units (both with auto-inflate), I lose around 20 events.

But when I increase the streaming units to 6 streaming units and leave all else unchanged I lose around 12 events.

Finally, an experiment with 6 streaming units and half the sending speed causes me to lose only 6 events.

So

  1. Streaming units and the rate of sending seem to impact reliability, even though the metrics show CPU and SU utilization remain quite low.
  2. This data volume is not very large and we cannot afford to miss events, so it is concerning how many of them are randomly getting lost somewhere.

One last point is that the event structure, taken from the Stream Analytics input, is JSON that looks like this:

```json
{
  "package": "01234567890130",
  "time": 1761943591.187937,
  "data": "ctS9cdZwtpoKIi4Rdf1M0S0zBJNpJoia...==",
  "tag": "<company_name_derived_tag_to_filter_on>",
  "EventProcessedUtcTime": "2025-10-31T21:21:12.4822115Z",
  "PartitionId": 0,
  "EventEnqueuedUtcTime": "2025-10-31T20:46:30.4850000Z"
}
```
Here, `data` is produced by some reversible encryption scheme before the event is sent to the input Event Hub.


Answer accepted by question author
  1. Pratyush Vashistha 5,045 Reputation points Microsoft External Staff Moderator
    2025-11-04T05:02:23.13+00:00

    Hello Tarik Rashada, Thanks for posting your question on Microsoft Q&A!

    You're seeing intermittent event drops between your Stream Analytics job and the output Event Hub, despite metrics showing full input and output event counts at the job level. This suggests the issue may lie in how events are batched, serialized, or acknowledged on the output.

    A few key details would help narrow this down:

    • Are you using capture or custom serialization on the output Event Hub?
    • What partition key (if any) are you setting in your Stream Analytics output configuration?
    • Are you checking sequence numbers or enqueued time on the output Event Hub directly (e.g., via a consumer like EventProcessorHost or az eventhubs eventhub receiver) to confirm the count mismatch isn’t due to Function App consumption issues downstream?

    One known behavior to consider: Stream Analytics batches output writes to Event Hubs for performance. If the job restarts or there’s a transient error during batch flush (even if not logged as a job failure), some events in the last batch might not be persisted—especially under high burst rates. This aligns with your observation that slower send rates reduce drops.

    Also verify if your output Event Hub has sufficient partitions. If all events go to a single partition (e.g., due to a static partition key), you’re limited to 1 MB/s per partition, which could cause throttling or silent drops even if throughput units are high.

    For validation, try capturing output directly from the Event Hub (bypassing the Function App) using a simple receiver and compare counts. Microsoft’s guidance on output consistency and delivery guarantees notes that Event Hubs output is "at-least-once," but batching and retries can occasionally lead to gaps under extreme burst loads if not tuned properly.
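
    As a minimal sketch of such a receiver, assuming the azure-eventhub Python package, with the connection string, hub name, and the custom "seq" field in the event body as placeholders for your setup:

    ```python
    # Read the output Event Hub directly (bypassing the Function App), count
    # records per partition, and collect the custom sequence IDs for comparison.
    import json
    import threading
    import time
    from collections import Counter

    from azure.eventhub import EventHubConsumerClient

    CONN_STR = "<output-event-hub-connection-string>"   # placeholder
    EVENTHUB_NAME = "<output-event-hub-name>"           # placeholder

    counts = Counter()
    seen_ids = set()

    def on_event(partition_context, event):
        # One EventData may carry one or more line-separated JSON records.
        for line in event.body_as_str(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            counts[partition_context.partition_id] += 1
            record = json.loads(line)
            if "seq" in record:                         # hypothetical sequence field
                seen_ids.add(record["seq"])

    client = EventHubConsumerClient.from_connection_string(
        CONN_STR, consumer_group="$Default", eventhub_name=EVENTHUB_NAME
    )

    # receive() blocks, so run it on a worker thread and give it a fixed
    # drain window that covers the test run plus any output-batch flushing.
    worker = threading.Thread(
        target=client.receive,
        kwargs={"on_event": on_event, "starting_position": "-1"},  # "-1" = from start
    )
    worker.start()
    time.sleep(120)
    client.close()
    worker.join()

    print("records per partition:", dict(counts))
    print("total received       :", sum(counts.values()))
    print("distinct sequence IDs:", len(seen_ids))
    ```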

    In short, to confirm where the loss occurs:

    • Bypass your Function App and read directly from the output Event Hub using a simple receiver (like az eventhubs eventhub receive or a basic EventProcessor) to rule out downstream consumption issues.
    • Ensure your test runs long enough (e.g., 15–20 seconds after the last event is sent) to allow Stream Analytics to flush any pending output batches.
    • Verify whether you're using a partition key in your Stream Analytics output configuration. If not, consider adding one (even a random or round-robin key) to distribute load across partitions.

    Let us know the above details so that we can then pinpoint whether this is a batching, partitioning, or downstream consumption issue.

    If this answers your query, do click "Upvote" and "Yes" for "Was this answer helpful?". And, if you have any further query, do let us know.

    Thanks

    Pratyush


1 additional answer

  1. Tarik Rashada 25 Reputation points
    2025-11-04T18:49:40.84+00:00

    Hi Pratyush - thank you for your reply. I am beginning to look into these suggestions.

    I will note that each event is approximately 5 kB, so if I write 1000 events that is about 5 MB. However, note that my script delivers them at a rate of around 5 req/sec, which means I am only writing about 25-30 kB/sec, much smaller than the 1 MB/sec per-partition throughput. Though I suppose the Stream Analytics job could be batching and delivering at a much higher rate than the script is writing to the first Event Hub.
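
    For context, a minimal sketch of the kind of rate-limited load-test sender described here, assuming the azure-eventhub package; the connection string, hub name, payload size, and the embedded "seq" field are placeholders rather than my actual script:

    ```python
    # Send ~5 kB JSON events with an embedded sequence number to the input
    # Event Hub at a fixed rate, with no partition key (round-robin).
    import base64
    import json
    import os
    import time

    from azure.eventhub import EventData, EventHubProducerClient

    CONN_STR = "<input-event-hub-connection-string>"  # placeholder
    EVENTHUB_NAME = "<input-event-hub-name>"          # placeholder
    TOTAL_EVENTS = 1000
    RATE_PER_SEC = 5                                  # events per second

    producer = EventHubProducerClient.from_connection_string(
        CONN_STR, eventhub_name=EVENTHUB_NAME
    )
    payload = base64.b64encode(os.urandom(3600)).decode()  # ~5 kB once serialized

    with producer:
        for seq in range(1, TOTAL_EVENTS + 1):
            record = {
                "package": "01234567890130",
                "time": time.time(),
                "data": payload,
                "tag": "<company_name_derived_tag_to_filter_on>",
                "seq": seq,                           # hypothetical sequence field
            }
            batch = producer.create_batch()           # no partition_key -> round-robin
            batch.add(EventData(json.dumps(record)))
            producer.send_batch(batch)
            time.sleep(1.0 / RATE_PER_SEC)
    ```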

    For your questions:

    A few key details would help narrow this down:

    • Are you using capture or custom serialization on the output Event Hub?
      • I think I am using custom serialization. The only serialization I configured in my Terraform was JSON, line-separated, with UTF-8 encoding for both the input and the output Event Hub.
    • What partition key (if any) are you setting in your Stream Analytics output configuration?
      • I am not explicitly setting a partition key but both of my event hubs have 4 partitions and it appears that the load is evenly distributed as the attached screenshots show. These are screenshots of the input and output events across partitions collected from the stream analytics job resource:
    (Screenshots: input events per partition and output events per partition, 2025-11-04.)

    • Are you checking sequence numbers or enqueued time on the output Event Hub directly (e.g., via a consumer like EventProcessorHost or az eventhubs eventhub receiver) to confirm the count mismatch isn’t due to Function App consumption issues downstream?
      • This I have not done. I will do this next and let you know what I see.

    Though I am not explicitly setting a partition key, it appears that Stream Analytics is using default round-robin behavior since both of my event hubs have 4 partitions and the load - based on those graphs - is evenly distributed.

