
In the digital age, business is changing faster and enterprises demand ever-fresher data. From traditional T+1 batch processing to near-real-time and then real-time stream processing, data analysis architecture has undergone a profound evolution. This article explores the technical architecture, key technologies, and implementation paths of real-time data analysis.
Instant Decision-Making: In scenarios like e-commerce promotions, financial transactions, and operations monitoring, data delays of just minutes can mean missed opportunities or losses.
Anomaly Early Warning: Real-time monitoring of business metrics to promptly detect anomalies (system failures, fraud, inventory alerts) for rapid response.
Improved User Experience: Provide users with real-time data feedback like order status, inventory levels, and real-time recommendations.
Operational Efficiency: Real-time visibility into operational data enables rapid strategy adjustments and improved efficiency.
E-commerce Scenarios:
Financial Scenarios:
Logistics Scenarios:
Operations Monitoring Scenarios:
Architecture characteristics:
Technology stack:
Advantages:
Disadvantages:
Applicable scenarios:
Architecture characteristics:
Technology stack:
Advantages:
Disadvantages:
Applicable scenarios:
Architecture characteristics:
Technology stack:
Advantages:
Disadvantages:
Applicable scenarios:
CDC is the foundation of real-time data analysis, used to capture database changes (insert, update, delete) and push in real-time.
Implementation principles:
Log-based CDC: Parse binlog (MySQL), WAL (PostgreSQL) and other logs to capture data changes. This is the most common approach with minimal impact on business databases and low latency.
Trigger-based CDC: Create triggers on database tables that record changes to a change table when data changes. This approach has some impact on database performance.
Query-based CDC: Periodically query the database and compare data changes. This approach has high latency and puts some pressure on the database.
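To make the query-based approach concrete, here is a minimal sketch in plain Java; all class and field names are hypothetical, and an in-memory list stands in for the database table. A real implementation would run a SQL query such as `SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at`. Note a known limitation of this approach: it cannot observe deletes and may miss intermediate updates between polls.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of query-based CDC: poll for rows changed since the last
// high-water mark, then advance the mark. An in-memory list stands in
// for the database table; names are hypothetical.
public class QueryBasedCdc {
    public record Row(long id, long updatedAt) {}

    private long highWaterMark = 0;

    // Returns rows changed since the previous poll and advances the mark.
    public List<Row> poll(List<Row> table) {
        List<Row> changed = new ArrayList<>();
        long newMark = highWaterMark;
        for (Row r : table) {
            if (r.updatedAt() > highWaterMark) {
                changed.add(r);
                newMark = Math.max(newMark, r.updatedAt());
            }
        }
        highWaterMark = newMark;
        return changed;
    }

    public static void main(String[] args) {
        QueryBasedCdc cdc = new QueryBasedCdc();
        List<Row> table = List.of(new Row(1, 100), new Row(2, 200));
        System.out.println(cdc.poll(table).size()); // 2: both rows are new
        System.out.println(cdc.poll(table).size()); // 0: nothing changed since last poll
    }
}
```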
Common tools:
Canal: Alibaba's open-source MySQL binlog parsing tool, widely used within Alibaba and external enterprises.
Debezium: Kafka Connect-based CDC tool supporting MySQL, PostgreSQL, MongoDB and other databases.
Maxwell: Lightweight MySQL binlog parsing tool, easy to deploy and use.
Example architecture:
MySQL (Business Database)
↓ binlog
Canal / Debezium
↓ Change events
Kafka (Message Queue)
↓ Consume
Flink (Stream Processing)
↓ Computation results
ClickHouse (Real-time Database)
↓ Query
Users
Stream processing is the core of real-time data analysis, used for computing and processing real-time data streams.
Core concepts:
Event time vs. Processing time: Event time is when data was generated; processing time is when data is processed. In real-time processing, event time is typically used to ensure result accuracy.
Windows: Divide infinite data streams into finite windows for computation. Common window types include tumbling windows, sliding windows, and session windows.
Watermarks: Used to handle out-of-order data, marking event time progress.
State management: Stream processing requires maintaining state (like accumulators, window data) — state management is key to stream processing.
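To make windowing concrete, here is a framework-free Java sketch (names are hypothetical) that assigns events to one-minute tumbling windows by event time and counts events per window; this is the same computation a stream processor performs, minus state management and watermarks:

```java
import java.util.HashMap;
import java.util.Map;

// Framework-free sketch of tumbling event-time windows: each event falls
// into the window [start, start + size) containing its timestamp, and we
// count events per window start.
public class TumblingWindowCount {
    static final long WINDOW_MS = 60_000; // one-minute windows

    // Window start = event timestamp rounded down to the window size.
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }

    static Map<Long, Long> countPerWindow(long[] eventTimesMs) {
        Map<Long, Long> counts = new HashMap<>();
        for (long t : eventTimesMs) {
            counts.merge(windowStart(t), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three events: two in the first minute, one in the second.
        long[] events = {1_000, 59_000, 61_000};
        Map<Long, Long> counts = countPerWindow(events);
        System.out.println(counts.get(0L));      // 2
        System.out.println(counts.get(60_000L)); // 1
    }
}
```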
Common frameworks:
Apache Flink: The most popular stream processing framework, supporting Exactly-Once semantics, powerful state management, and low latency.
Apache Spark Streaming: Micro-batch based stream processing with relatively higher latency (second-level) but good Spark ecosystem integration.
Apache Storm: Early stream processing framework, now less commonly used.
Example: Real-time PV/UV Statistics
// Flink code sketch: source, watermark, and sink configuration elided.
// Event-time windows require a WatermarkStrategy on the source.
DataStream<Event> events = env.fromSource(kafkaSource, watermarkStrategy, "events");
// PV per page per minute
events
.keyBy(event -> event.getPageId())
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new CountAggregateFunction())
.addSink(new ClickHouseSink<>(...));
// UV (distinct users) per page per minute
events
.keyBy(event -> event.getPageId())
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new DistinctCountAggregateFunction())
.addSink(new ClickHouseSink<>(...));
Real-time data analysis requires high-performance data storage capable of supporting high-concurrency writes and low-latency queries.
Key requirements:
High-throughput writes: Handle tens of thousands to hundreds of thousands of data writes per second.
Low-latency queries: Query latency measured in milliseconds to seconds.
Complex query support: Support complex queries including aggregation, filtering, sorting, and JOINs.
Horizontal scaling: Expand capacity and performance by adding nodes.
Common databases:
ClickHouse: Columnar storage database with extremely high query performance, suitable for OLAP scenarios. Supports SQL, easy to use.
Apache Druid: Database designed specifically for real-time analysis, supporting high-concurrency queries, suitable for real-time dashboards and reports.
Apache Pinot: Real-time analysis database open-sourced by LinkedIn, supporting ultra-low latency queries (millisecond level).
Elasticsearch: Search engine, also commonly used for log analysis and real-time monitoring.
Comparison:
| Database | Write Performance | Query Performance | Complex Queries | Ease of Use | Applicable Scenarios |
|---|---|---|---|---|---|
| ClickHouse | Extremely High | Extremely High | Strong | High (SQL) | Real-time reports, OLAP |
| Druid | High | High | Medium | Medium | Real-time dashboards, monitoring |
| Pinot | Extremely High | Extremely High | Medium | Medium | Ultra-low latency queries |
| Elasticsearch | High | Medium | Weak | High | Log analysis, search |
For frequently queried data, using cache can significantly reduce query latency and database pressure.
Cache strategies:
Query result cache: Cache query results to Redis; return directly from cache for the same query next time.
Pre-computation cache: Pre-compute common metrics (like today's GMV, real-time online users) and cache to Redis; return directly when users query.
Hot data cache: Cache hot data (like popular products, hot users) to Redis to improve query speed.
Cache update strategies:
TTL (Time To Live): Set cache expiration time; automatically invalidates after expiration.
Active update: Actively update cache when data changes.
Delayed double delete: Delete the cache, update the database, then delete the cache again after a short delay, reducing the window for cache inconsistency.
Example architecture:
User query
↓
Check Redis cache
↓ Cache hit
Return result
↓ Cache miss
Query ClickHouse
↓
Write to Redis cache
↓
Return result
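The read path in the diagram above is the classic cache-aside pattern. Here is a minimal Java sketch of it, in which a HashMap with a TTL stands in for Redis and a supplier lambda stands in for the ClickHouse query; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Cache-aside read path: check the cache first; on a miss, run the
// (expensive) query, store the result with a TTL, and return it.
// A HashMap stands in for Redis in this sketch.
public class CacheAside {
    private record Entry(String value, long expiresAtMs) {}

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMs;

    public CacheAside(long ttlMs) { this.ttlMs = ttlMs; }

    public String get(String key, Supplier<String> queryDatabase) {
        Entry e = cache.get(key);
        long now = System.currentTimeMillis();
        if (e != null && e.expiresAtMs() > now) {
            return e.value();               // cache hit: skip the database
        }
        String value = queryDatabase.get(); // cache miss: run the real query
        cache.put(key, new Entry(value, now + ttlMs));
        return value;
    }

    public static void main(String[] args) {
        CacheAside cache = new CacheAside(60_000);
        // First call misses and runs the query; the second call hits the cache.
        System.out.println(cache.get("today_gmv", () -> "123456.78"));
        System.out.println(cache.get("today_gmv",
                () -> { throw new IllegalStateException("should hit cache"); }));
    }
}
```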
For complex aggregation computations, full computation is costly and slow. Incremental computing only computes changed portions, significantly improving performance.
Implementation methods:
Incremental updates: When new data arrives, only update affected aggregation results.
Materialized views: Pre-compute and store aggregation results; incrementally update materialized views when base data changes.
Stream aggregation: Use stream processing frameworks (like Flink) for real-time aggregation, maintaining aggregation state.
Example: Real-time Sales Statistics
Traditional approach (full computation):
SELECT SUM(amount) FROM orders WHERE date = '2026-03-19'
-- Every query scans all orders for that day
Incremental computing approach:
1. Maintain an accumulator: today_total_amount
2. When new order arrives: today_total_amount += new_order.amount
3. Query directly returns: today_total_amount
-- Query latency reduced from seconds to milliseconds
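The accumulator approach above can be sketched in plain Java; the class and method names are hypothetical, and `LongAdder` makes concurrent updates safe:

```java
import java.util.concurrent.atomic.LongAdder;

// Incremental sales total: each new order updates a running accumulator,
// so a "today's total" query is a constant-time read of maintained state
// instead of a table scan. Amounts are in cents to avoid float rounding.
public class RunningSalesTotal {
    private final LongAdder totalCents = new LongAdder();

    public void onNewOrder(long amountCents) {
        totalCents.add(amountCents); // incremental update, O(1) per order
    }

    public long totalCents() {
        return totalCents.sum();     // query returns the maintained state
    }

    public static void main(String[] args) {
        RunningSalesTotal today = new RunningSalesTotal();
        today.onNewOrder(1999);
        today.onNewOrder(4550);
        System.out.println(today.totalCents()); // 6549
    }
}
```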
Lambda architecture is a classic big data architecture supporting both batch and stream processing.
Architecture components:
Batch Layer: Processes all historical data, generating batch views. High latency (hour or day level) but accurate results.
Speed Layer: Processes real-time data streams, generating real-time views. Low latency (second level), but results may be less accurate (e.g., due to out-of-order or late data).
Serving Layer: Merges batch and real-time views, providing query services externally.
Advantages:
Disadvantages:
Applicable scenarios:
Kappa architecture is a simplified version of Lambda architecture, retaining only the stream processing layer.
Architecture components:
Stream Processing Layer: All data goes through stream processing, including historical data (via replay).
Serving Layer: Provides query services externally.
Advantages:
Disadvantages:
Applicable scenarios:
Real-time data warehouse is a real-time version of the traditional data warehouse using a layered architecture.
Architecture layers:
ODS (Operational Data Store): Raw data layer; synchronizes business database data in real-time via CDC.
DWD (Data Warehouse Detail): Detail data layer; cleans, transforms, and correlates ODS data.
DWS (Data Warehouse Summary): Summary data layer; aggregates DWD data to generate various metrics.
ADS (Application Data Service): Application data layer; data for specific application scenarios like real-time dashboards and reports.
Advantages:
Disadvantages:
Applicable scenarios:
In distributed systems, data may arrive out of order due to network latency, system failures, etc.
Solutions:
Watermarks: Use watermarks to mark event time progress, allowing some degree of out-of-order processing.
Allowed lateness: Configure a window's allowed lateness so that data arriving late, within that bound, can still be processed.
Side output streams: Output severely delayed data to side output streams for separate processing.
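A common watermark strategy is "maximum observed event time minus a fixed out-of-orderness bound". Here is a simplified Java sketch of that idea (names are hypothetical; production frameworks such as Flink refine the exact boundary semantics):

```java
// Bounded-out-of-orderness watermark: the watermark trails the maximum
// observed event time by a fixed bound. Events with a timestamp at or
// below the current watermark are treated as late.
public class BoundedWatermark {
    private final long maxOutOfOrdernessMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    public BoundedWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Observe an event; returns true if it arrived late (behind the watermark).
    public boolean observe(long eventTimeMs) {
        boolean late = eventTimeMs <= currentWatermark();
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return late;
    }

    public long currentWatermark() {
        return maxEventTimeMs == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeMs - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        BoundedWatermark wm = new BoundedWatermark(5_000); // tolerate 5s disorder
        wm.observe(10_000);
        System.out.println(wm.currentWatermark()); // 5000
        System.out.println(wm.observe(6_000));     // false: within the bound
        System.out.println(wm.observe(4_000));     // true: behind the watermark
    }
}
```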
In distributed systems, data may duplicate due to retries, fault recovery, etc.
Solutions:
Idempotency: Ensure processing logic is idempotent — processing the same data multiple times doesn't produce incorrect results.
Deduplication: Deduplicate data using unique IDs.
Exactly-Once semantics: Use stream processing frameworks (like Flink) that support Exactly-Once semantics.
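Deduplication by unique ID can be sketched with a seen-set, as below (names hypothetical). In production the set would live in the stream processor's state backend or an external store, and entries would be expired to bound its size:

```java
import java.util.HashSet;
import java.util.Set;

// Deduplicate a stream by event ID: the first occurrence is processed,
// later duplicates (e.g. from producer retries or replays) are dropped.
public class Deduplicator {
    private final Set<String> seenIds = new HashSet<>();

    // Returns true if the event is new and should be processed.
    public boolean accept(String eventId) {
        return seenIds.add(eventId); // Set.add() returns false for duplicates
    }

    public static void main(String[] args) {
        Deduplicator dedup = new Deduplicator();
        System.out.println(dedup.accept("order-1")); // true: first occurrence
        System.out.println(dedup.accept("order-1")); // false: duplicate dropped
        System.out.println(dedup.accept("order-2")); // true: new event
    }
}
```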
Stream processing requires maintaining state (like window data, accumulators), which is a challenge.
Solutions:
State backends: Use high-performance state backends (like RocksDB) to store state.
State snapshots: Periodically snapshot state for fault recovery.
State cleanup: Timely clean up expired state to avoid unlimited state growth.
Some keys have particularly large data volumes, causing processing skew and affecting performance.
Solutions:
Salting: Salt hot keys to distribute data across multiple partitions.
Two-stage aggregation: First perform local aggregation, then global aggregation.
Dynamic load balancing: Dynamically adjust partitioning strategy based on data distribution.
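Salting combined with two-stage aggregation can be sketched as follows (all names hypothetical): append a salt to each key so a hot key's records spread across several partial aggregates, then strip the salt and merge the partials. This sketch salts round-robin; a real job would salt by a hash or a random bucket:

```java
import java.util.HashMap;
import java.util.Map;

// Two-stage aggregation with salting: a hot key is split into N salted
// subkeys so the first (local) stage spreads its load across partitions;
// the second (global) stage strips the salt and merges the partials.
public class SaltedAggregation {
    static final int SALT_BUCKETS = 4;

    // Stage 1: count by salted key, e.g. "hot#0" .. "hot#3".
    static Map<String, Long> localAggregate(String[] keys) {
        Map<String, Long> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            String salted = keys[i] + "#" + (i % SALT_BUCKETS);
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2: strip the salt and merge partial counts per original key.
    static Map<String, Long> globalAggregate(Map<String, Long> partial) {
        Map<String, Long> result = new HashMap<>();
        partial.forEach((salted, count) ->
            result.merge(salted.substring(0, salted.indexOf('#')), count, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"hot", "hot", "hot", "hot", "hot", "cold"};
        Map<String, Long> totals = globalAggregate(localAggregate(keys));
        System.out.println(totals.get("hot"));  // 5
        System.out.println(totals.get("cold")); // 1
    }
}
```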
Clarify real-time requirements: Second-level, minute-level, or hour-level?
Identify key scenarios: Which scenarios need real-time data most?
Assess technical capability: Does the team have technical capability for real-time data analysis?
Assess costs: What are the infrastructure and labor costs for real-time data analysis?
CDC tools: Select appropriate CDC tools based on database type (Canal, Debezium, Maxwell).
Message queues: Kafka is the most common choice.
Stream processing frameworks: Flink is currently the most popular choice; Spark Streaming is also an option.
Real-time databases: Select ClickHouse, Druid, or Pinot based on query patterns.
Select pilot scenarios: Choose relatively simple, clear-value scenarios as pilots.
Quick verification: Quickly build prototypes to verify technical feasibility.
Iterate and optimize: Optimize architecture and implementation based on pilot experience.
Expand scenarios: Gradually expand real-time data analysis to more scenarios.
Improve architecture: Improve real-time data analysis architecture based on business needs.
Establish standards: Establish real-time data development and operations standards.
Background: A mid-sized e-commerce platform handling 100,000 orders per day wanted to monitor key metrics such as order volume, GMV, and inventory in real time during promotions, and to discover and resolve issues promptly.
Challenges:
Solutions:
CDC data collection: Use Canal to capture changes from the MySQL order tables in real time and push them to Kafka.
Stream processing: Use Flink to process the order stream in real time, computing metrics such as order volume, GMV, inventory, and conversion rate.
Real-time storage: Write computation results to ClickHouse, supporting multi-dimensional queries.
Cache optimization: Cache frequently queried metrics (like current GMV, current order volume) to Redis, reducing query latency.
Real-time dashboards: Develop real-time dashboards displaying key metrics for operations team monitoring.
Architecture diagram:
MySQL (Order Table)
↓ binlog
Canal
↓ Order change events
Kafka
↓ Consume
Flink
├─ Real-time order volume statistics
├─ Real-time GMV statistics
├─ Real-time inventory updates
└─ Real-time conversion rate computation
↓ Write
ClickHouse + Redis
↓ Query
Real-time dashboards / API
Results:
Real-time data analysis is an important direction in enterprise digital transformation, helping enterprises achieve instant decision-making, anomaly early warning, and improved user experience. From T+1 batch processing to near-real-time to real-time stream processing, data analysis architecture has undergone profound evolution.
Key technologies for real-time data analysis include CDC, stream processing, real-time data storage, cache optimization, and incremental computing. Enterprises can choose Lambda architecture, Kappa architecture, or real-time data warehouse architecture based on their needs.
Implementing real-time data analysis requires starting with simple scenarios, verifying quickly, and expanding gradually. Also pay attention to challenges such as out-of-order data, data duplication, state management, and data skew, and adopt the corresponding solutions.
Ultimately, the goal of real-time data analysis is to enable enterprises to grasp business conditions in real-time, respond quickly to changes, and maintain advantage in fierce market competition.