Implementing and Measuring Real-Time Data Pipelines with Spark Structured Streaming on the Databricks Platform
Real-time data processing has become an essential capability for modern digital systems, especially in industries where information must be updated and acted upon immediately. Organizations are increasingly required to combine historical datasets with continuously arriving events, often originating from legacy systems or distributed cloud services. This article explores the practical implementation of a real-time data pipeline built with Spark Structured Streaming, Delta tables, and Azure Event Hubs, examining how large stateful workloads can be managed efficiently and what levels of latency and reliability can be achieved in a realistic scenario.
Motivation and objectives
Modern enterprises often need to combine historical data extracts (for example, mainframe table dumps) with streaming updates from operational systems. The thesis studied a common pattern: initialize a stream's state with a “baseline” Delta table snapshot and then apply incremental updates received via Event Hubs. The research questions were practical: how to build such a pipeline with Spark Structured Streaming, how to design stateful consolidation logic that merges partial updates into complete customer records, what latency and throughput are achievable, and whether the solution meets near-real-time requirements for a financial services scenario built on the Databricks platform.
Implementation overview
The implementation is intentionally practical and reproducible:
- Synthetic, mainframe-style test data. A Python test data generator produced representative customer records, writing both to Delta tables (to simulate a baseline snapshot) and to Event Hubs (to simulate incremental updates). No sensitive production data was used, by design.
- Delta tables as baseline store. Baseline snapshots were loaded into Delta tables to give the streaming pipeline a consistent starting state.
- Event Hubs as message queues. Two Event Hubs were used as sources (one per upstream table), and a third Event Hub served as the pipeline's output for consolidated entities. Each message retained its original enqueue timestamp so end-to-end latency could be measured (a source-reading sketch follows this list).
- Scala implementation with a RocksDB state store. After initial Python prototypes showed performance limitations on large batches, the streaming logic was implemented in Scala for closer integration with the Spark runtime. A RocksDB state store was configured to hold nearly one million distinct customer keys efficiently. The main consolidation logic used keyed state to merge fields from multiple upstream sources into a unified customer object (a consolidation sketch follows this list).
- Instrumentation and metrics. Each message carried the original enqueue timestamp; the pipeline added a second enqueue timestamp when the consolidated entity was republished. The difference between these two timestamps gave the end-to-end latency (a measurement sketch follows this list).
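To make the source setup concrete, the sketch below shows how a baseline Delta snapshot and one Event Hubs stream can be read from Scala while keeping the broker-assigned enqueuedTime for later latency measurement. The connection string, paths, and hub names are placeholders, and the options assume the azure-eventhubs-spark connector; this is a minimal sketch of the pattern, not the thesis code.

```scala
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("customer-consolidation").getOrCreate()

// Baseline snapshot: a batch read of the Delta table used to seed the stream state.
val baseline = spark.read.format("delta").load("/mnt/baseline/customers")

// One streaming source per upstream table; connection string and hub name are placeholders.
val ehConf = EventHubsConf(
  ConnectionStringBuilder("Endpoint=sb://<namespace>.servicebus.windows.net/;...")
    .setEventHubName("customer-updates")
    .build)

// The connector exposes the broker-assigned enqueuedTime column, which is kept
// alongside the payload so end-to-end latency can be measured later.
val updates = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()
  .selectExpr("cast(body as string) as body", "enqueuedTime")
```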
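The consolidation step itself keeps one state entry per customer key and merges partial updates into it. The sketch below illustrates that pattern with flatMapGroupsWithState and the RocksDB state store provider available in recent Spark and Databricks runtimes; the case-class fields and merge rules are illustrative rather than the thesis schema, and parsedUpdates stands in for a Dataset decoded from the Event Hubs body column.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._ // `spark` is the SparkSession from the previous sketch

// Use RocksDB as the state store backend (available in Spark 3.2+ and Databricks runtimes).
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

// Illustrative record shapes; the real pipeline merges mainframe-style customer fields.
case class CustomerUpdate(customerId: String, name: Option[String],
                          address: Option[String], enqueuedTime: Timestamp)
case class CustomerRecord(customerId: String, name: Option[String],
                          address: Option[String], firstEnqueued: Timestamp)

// Merge the incoming partial updates for one customer key into its stored record.
def consolidate(customerId: String,
                updates: Iterator[CustomerUpdate],
                state: GroupState[CustomerRecord]): Iterator[CustomerRecord] = {
  val initial = state.getOption.getOrElse(CustomerRecord(customerId, None, None, null))
  val merged = updates.foldLeft(initial) { (acc, u) =>
    CustomerRecord(
      customerId,
      u.name.orElse(acc.name),
      u.address.orElse(acc.address),
      if (acc.firstEnqueued == null) u.enqueuedTime else acc.firstEnqueued)
  }
  state.update(merged) // keep the consolidated record for the next micro-batch
  Iterator(merged)     // and republish it downstream
}

// parsedUpdates: Dataset[CustomerUpdate] decoded from the Event Hubs body column
// (JSON parsing omitted here for brevity).
val consolidated = parsedUpdates
  .groupByKey(_.customerId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(consolidate)
```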
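End-to-end latency can then be computed by reading the consolidated output hub and subtracting the original enqueue timestamp (carried in the payload) from the new enqueuedTime assigned when the entity is republished. The originalEnqueuedTime field name below is illustrative, as is the output hub configuration.

```scala
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.functions.{col, get_json_object}

// Connection settings for the output hub that carries consolidated entities (placeholder).
val outputEhConf = EventHubsConf(
  ConnectionStringBuilder("Endpoint=sb://<namespace>.servicebus.windows.net/;...")
    .setEventHubName("consolidated-customers")
    .build)

// Read the consolidated output back and compute per-event end-to-end latency.
// Casting a timestamp to double yields epoch seconds with a fractional part,
// which preserves sub-second precision for latencies in the 1-2 second range.
val latencies = spark.readStream
  .format("eventhubs")
  .options(outputEhConf.toMap)
  .load()
  .select(
    get_json_object(col("body").cast("string"), "$.originalEnqueuedTime")
      .cast("timestamp").as("originalEnqueuedTime"),
    col("enqueuedTime").as("republishedTime"))
  .withColumn("latencySeconds",
    col("republishedTime").cast("double") - col("originalEnqueuedTime").cast("double"))
```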
Key results
The experiments were run after preloading the pipeline state with roughly 990,500 unique customer records to simulate a large state. The most important findings were:
Warm-up overhead: The first micro-batch after starting the streaming job consistently took much longer than later batches. This was traced to one-time initialization costs (loading RocksDB state, code generation, and connector setup). Once initialization completed, subsequent micro-batches executed much faster (a monitoring sketch for observing per-batch durations follows these findings).
Steady-state latency: After warm-up, the pipeline processed small update batches with end-to-end latency in the range of ~1.2–1.4 seconds, measured from initial enqueue to publication of the consolidated entity. Latency variance was small, and no backlog accumulated during steady runs.
State handling: The RocksDB-backed state store maintained and updated roughly one million customer keys with predictable memory behaviour.
Checkpointing trade-offs: Storing checkpoints in standard blob storage increased recovery time after restarts; switching to a lower-latency tier or managed checkpointing would improve restart performance.
These results show that, when properly initialized and tuned, the architecture can satisfy near-real-time requirements for common financial services scenarios.
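The warm-up effect and steady-state behaviour described above can be observed directly from the per-micro-batch metrics that Structured Streaming exposes. One way to capture them is a StreamingQueryListener that logs each batch's duration breakdown, as sketched below using standard Spark APIs; this is illustrative monitoring code, not the thesis instrumentation.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Log how long each micro-batch took; the expensive first batch (state loading,
// code generation, connector setup) stands out clearly in this output.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} durations(ms)=${p.durationMs} inputRows=${p.numInputRows}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```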
Practical lessons and best practices
From the implementation and experiments, several practical recommendations emerged:
1. Pre-warm the query in production. Initialization overhead can be significant when a large state is restored; schedule warm-up or state-prepopulation tasks before routing live traffic to the pipeline.
2. Use a persistent, low-latency checkpoint store. The checkpoint backend directly affects restart latency, so prefer a premium or otherwise low-latency storage tier for checkpoints (see the configuration sketch after this list).
3. Tune RocksDB for your workload. Adjust write buffers, block cache, and compaction thresholds to find the best balance between lookup latency and memory consumption.
4. Match Event Hub partition counts to Spark parallelism so the pipeline can scale horizontally.
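As a starting point for recommendations 2 and 3, the sketch below shows where the checkpoint location and a couple of RocksDB knobs are set, and how the consolidated stream is published. The storage path and values are placeholders, the exact spark.sql.streaming.stateStore.rocksdb.* keys and sensible settings vary by Spark/Databricks Runtime version and should be checked against the runtime documentation, and consolidated and outputEhConf refer to the earlier sketches.

```scala
import org.apache.spark.sql.functions.{col, struct, to_json}

// RocksDB tuning knobs (key names and defaults vary by runtime version; validate first).
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "256")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.compactOnCommit", "false")

// Publish consolidated entities and checkpoint to a low-latency store.
// The abfss:// path is a placeholder for a premium-tier storage account.
val query = consolidated
  .select(to_json(struct(col("*"))).as("body")) // the Event Hubs sink expects a body column
  .writeStream
  .format("eventhubs")
  .options(outputEhConf.toMap)
  .option("checkpointLocation",
    "abfss://checkpoints@<premium-account>.dfs.core.windows.net/customer-consolidation")
  .outputMode("update")
  .start()
```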
Conclusion
The experiments demonstrate that Spark Structured Streaming, paired with Delta tables and Event Hubs and backed by a RocksDB state store, can be used to implement a robust, low-latency entity consolidation pipeline for financial services use cases. A careful focus on initialization, checkpointing, and state management is necessary to achieve stable sub-two-second processing times at near-production scale. The thesis documents both the implementation details and the measured results that support these conclusions.
Reference
A. Ahonen (2025). Real-time streaming pipelines in Databricks using Spark Structured Streaming. Master’s thesis, Turku University of Applied Sciences.