
In the digital age, business is changing faster and enterprises demand ever-fresher data. From traditional T+1 batch processing to near-real-time and then real-time stream processing, data analysis architecture has undergone a profound evolution. This article explores the technical architecture, key technologies, and implementation paths of real-time data analysis.
Instant Decision-Making: In scenarios like e-commerce promotions, financial transactions, and operations monitoring, data delays of just minutes can mean missed opportunities or losses.
Anomaly Early Warning: Real-time monitoring of business metrics to promptly detect anomalies (system failures, fraud, inventory alerts) for rapid response.
Improved User Experience: Provide users with real-time data feedback like order status, inventory levels, and real-time recommendations.
Operational Efficiency: Real-time visibility into operational data enables rapid strategy adjustments and improved efficiency.
E-commerce Scenarios:
Financial Scenarios:
Logistics Scenarios:
Operations Monitoring Scenarios:
Architecture characteristics:
Technology stack:
Advantages:
Disadvantages:
Applicable scenarios:
Architecture characteristics:
Technology stack:
Advantages:
Disadvantages:
Applicable scenarios:
Architecture characteristics:
Technology stack:
Advantages:
Disadvantages:
Applicable scenarios:
CDC is the foundation of real-time data analysis, used to capture database changes (insert, update, delete) and push in real-time.
Implementation principles:
Log-based CDC: Parse binlog (MySQL), WAL (PostgreSQL) and other logs to capture data changes. This is the most common approach with minimal impact on business databases and low latency.
Trigger-based CDC: Create triggers on database tables that record changes to a change table when data changes. This approach has some impact on database performance.
Query-based CDC: Periodically query the database and compare data changes. This approach has high latency and puts some pressure on the database.
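To make the query-based approach concrete, here is a minimal sketch in plain Java; all class and field names are hypothetical, and an in-memory list stands in for the database table. A real implementation would run a SQL query such as `SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at`. Note a known limitation of this approach: it cannot observe deletes and may miss intermediate updates between polls.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of query-based CDC: poll for rows changed since the last
// high-water mark, then advance the mark. An in-memory list stands in
// for the database table; names are hypothetical.
public class QueryBasedCdc {
    public record Row(long id, long updatedAt) {}

    private long highWaterMark = 0;

    // Returns rows changed since the previous poll and advances the mark.
    public List<Row> poll(List<Row> table) {
        List<Row> changed = new ArrayList<>();
        long newMark = highWaterMark;
        for (Row r : table) {
            if (r.updatedAt() > highWaterMark) {
                changed.add(r);
                newMark = Math.max(newMark, r.updatedAt());
            }
        }
        highWaterMark = newMark;
        return changed;
    }

    public static void main(String[] args) {
        QueryBasedCdc cdc = new QueryBasedCdc();
        List<Row> table = List.of(new Row(1, 100), new Row(2, 200));
        System.out.println(cdc.poll(table).size()); // 2: both rows are new
        System.out.println(cdc.poll(table).size()); // 0: nothing changed since last poll
    }
}
```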
Common tools:
Canal: Alibaba's open-source MySQL binlog parsing tool, widely used within Alibaba and external enterprises.
Debezium: Kafka Connect-based CDC tool supporting MySQL, PostgreSQL, MongoDB and other databases.
Maxwell: Lightweight MySQL binlog parsing tool, easy to deploy and use.
Example architecture:
MySQL (Business Database)
↓ binlog
Canal / Debezium
↓ Change events
Kafka (Message Queue)
↓ Consume
Flink (Stream Processing)
↓ Computation results
ClickHouse (Real-time Database)
↓ Query
Users
Stream processing is the core of real-time data analysis, used for computing and processing real-time data streams.
Core concepts:
Event time vs. Processing time: Event time is when data was generated; processing time is when data is processed. In real-time processing, event time is typically used to ensure result accuracy.
Windows: Divide infinite data streams into finite windows for computation. Common window types include tumbling windows, sliding windows, and session windows.
Watermarks: Used to handle out-of-order data, marking event time progress.
State management: Stream processing requires maintaining state (like accumulators, window data) — state management is key to stream processing.
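To make windowing concrete, here is a framework-free Java sketch (names are hypothetical) that assigns events to one-minute tumbling windows by event time and counts events per window; this is the same computation a stream processor performs, minus state management and watermarks:

```java
import java.util.HashMap;
import java.util.Map;

// Framework-free sketch of tumbling event-time windows: each event falls
// into the window [start, start + size) containing its timestamp, and we
// count events per window start.
public class TumblingWindowCount {
    static final long WINDOW_MS = 60_000; // one-minute windows

    // Window start = event timestamp rounded down to the window size.
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }

    static Map<Long, Long> countPerWindow(long[] eventTimesMs) {
        Map<Long, Long> counts = new HashMap<>();
        for (long t : eventTimesMs) {
            counts.merge(windowStart(t), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three events: two in the first minute, one in the second.
        long[] events = {1_000, 59_000, 61_000};
        Map<Long, Long> counts = countPerWindow(events);
        System.out.println(counts.get(0L));      // 2
        System.out.println(counts.get(60_000L)); // 1
    }
}
```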
Common frameworks:
Apache Flink: The most popular stream processing framework, supporting Exactly-Once semantics, powerful state management, and low latency.
Apache Spark Streaming: Micro-batch based stream processing with relatively higher latency (second-level) but good Spark ecosystem integration.
Apache Storm: Early stream processing framework, now less commonly used.
Example: Real-time PV/UV Statistics
// Flink code sketch: source, watermark, and sink configuration elided.
// Event-time windows require a WatermarkStrategy on the source.
DataStream<Event> events = env.fromSource(kafkaSource, watermarkStrategy, "events");
// PV per page per minute
events
.keyBy(event -> event.getPageId())
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new CountAggregateFunction())
.addSink(new ClickHouseSink<>(...));
// UV (distinct users) per page per minute
events
.keyBy(event -> event.getPageId())
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new DistinctCountAggregateFunction())
.addSink(new ClickHouseSink<>(...));
Real-time data analysis requires high-performance data storage capable of supporting high-concurrency writes and low-latency queries.
Key requirements:
High-throughput writes: Handle tens of thousands to hundreds of thousands of data writes per second.
Low-latency queries: Query latency measured in milliseconds to seconds.
Complex query support: Support complex queries including aggregation, filtering, sorting, and JOINs.
Horizontal scaling: Expand capacity and performance by adding nodes.
Common databases:
ClickHouse: Columnar storage database with extremely high query performance, suitable for OLAP scenarios. Supports SQL, easy to use.
Apache Druid: Database designed specifically for real-time analysis, supporting high-concurrency queries, suitable for real-time dashboards and reports.
Apache Pinot: Real-time analysis database open-sourced by LinkedIn, supporting ultra-low latency queries (millisecond level).
Elasticsearch: Search engine, also commonly used for log analysis and real-time monitoring.
Comparison:
| Database | Write Performance | Query Performance | Complex Queries | Ease of Use | Applicable Scenarios |
|---|---|---|---|---|---|
| ClickHouse | Extremely High | Extremely High | Strong | High (SQL) | Real-time reports, OLAP |
| Druid | High | High | Medium | Medium | Real-time dashboards, monitoring |
| Pinot | Extremely High | Extremely High | Medium | Medium | Ultra-low latency queries |
| Elasticsearch | High | Medium | Weak | High | Log analysis, search |
For frequently queried data, using cache can significantly reduce query latency and database pressure.
Cache strategies:
Query result cache: Cache query results to Redis; return directly from cache for the same query next time.
Pre-computation cache: Pre-compute common metrics (like today's GMV, real-time online users) and cache to Redis; return directly when users query.
Hot data cache: Cache hot data (like popular products, hot users) to Redis to improve query speed.
Cache update strategies:
TTL (Time To Live): Set cache expiration time; automatically invalidates after expiration.
Active update: Actively update cache when data changes.
Delayed double delete: Delete the cache, update the database, then delete the cache again after a short delay, reducing the window for cache inconsistency.
Example architecture:
User query
↓
Check Redis cache
↓ Cache hit
Return result
↓ Cache miss
Query ClickHouse
↓
Write to Redis cache
↓
Return result
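The read path in the diagram above is the classic cache-aside pattern. Here is a minimal Java sketch of it, in which a HashMap with a TTL stands in for Redis and a supplier lambda stands in for the ClickHouse query; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Cache-aside read path: check the cache first; on a miss, run the
// (expensive) query, store the result with a TTL, and return it.
// A HashMap stands in for Redis in this sketch.
public class CacheAside {
    private record Entry(String value, long expiresAtMs) {}

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMs;

    public CacheAside(long ttlMs) { this.ttlMs = ttlMs; }

    public String get(String key, Supplier<String> queryDatabase) {
        Entry e = cache.get(key);
        long now = System.currentTimeMillis();
        if (e != null && e.expiresAtMs() > now) {
            return e.value();               // cache hit: skip the database
        }
        String value = queryDatabase.get(); // cache miss: run the real query
        cache.put(key, new Entry(value, now + ttlMs));
        return value;
    }

    public static void main(String[] args) {
        CacheAside cache = new CacheAside(60_000);
        // First call misses and runs the query; the second call hits the cache.
        System.out.println(cache.get("today_gmv", () -> "123456.78"));
        System.out.println(cache.get("today_gmv",
                () -> { throw new IllegalStateException("should hit cache"); }));
    }
}
```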
For complex aggregation computations, full computation is costly and slow. Incremental computing only computes changed portions, significantly improving performance.
Implementation methods:
Incremental updates: When new data arrives, only update affected aggregation results.
Materialized views: Pre-compute and store aggregation results; incrementally update materialized views when base data changes.
Stream aggregation: Use stream processing frameworks (like Flink) for real-time aggregation, maintaining aggregation state.
Example: Real-time Sales Statistics
Traditional approach (full computation):
SELECT SUM(amount) FROM orders WHERE date = '2026-03-19'
-- Every query scans all orders for that day
Incremental computing approach:
1. Maintain an accumulator: today_total_amount
2. When new order arrives: today_total_amount += new_order.amount
3. Query directly returns: today_total_amount
-- Query latency reduced from seconds to milliseconds
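The accumulator approach above can be sketched in plain Java; the class and method names are hypothetical, and `LongAdder` makes concurrent updates safe:

```java
import java.util.concurrent.atomic.LongAdder;

// Incremental sales total: each new order updates a running accumulator,
// so a "today's total" query is a constant-time read of maintained state
// instead of a table scan. Amounts are in cents to avoid float rounding.
public class RunningSalesTotal {
    private final LongAdder totalCents = new LongAdder();

    public void onNewOrder(long amountCents) {
        totalCents.add(amountCents); // incremental update, O(1) per order
    }

    public long totalCents() {
        return totalCents.sum();     // query returns the maintained state
    }

    public static void main(String[] args) {
        RunningSalesTotal today = new RunningSalesTotal();
        today.onNewOrder(1999);
        today.onNewOrder(4550);
        System.out.println(today.totalCents()); // 6549
    }
}
```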
Lambda architecture is a classic big data architecture supporting both batch and stream processing.
Architecture components:
Batch Layer: Processes all historical data, generating batch views. High latency (hour or day level) but accurate results.
Speed Layer: Processes real-time data streams, generating real-time views. Low latency (second level), but results may be less accurate (e.g., due to out-of-order or late data).
Serving Layer: Merges batch and real-time views, providing query services externally.
Advantages:
Disadvantages:
Applicable scenarios:
Kappa architecture is a simplified version of Lambda architecture, retaining only the stream processing layer.
Architecture components:
Stream Processing Layer: All data goes through stream processing, including historical data (via replay).
Serving Layer: Provides query services externally.
Advantages:
Disadvantages:
Applicable scenarios:
Real-time data warehouse is a real-time version of the traditional data warehouse using a layered architecture.
Architecture layers:
ODS (Operational Data Store): Raw data layer; synchronizes business database data in real-time via CDC.
DWD (Data Warehouse Detail): Detail data layer; cleans, transforms, and correlates ODS data.
DWS (Data Warehouse Summary): Summary data layer; aggregates DWD data to generate various metrics.
ADS (Application Data Service): Application data layer; data for specific application scenarios like real-time dashboards and reports.
Advantages:
Disadvantages:
Applicable scenarios:
In distributed systems, data may arrive out of order due to network latency, system failures, etc.
Solutions:
Watermarks: Use watermarks to mark event time progress, allowing some degree of out-of-order processing.
Allowed lateness: Configure a window's allowed lateness so that data arriving late, within that bound, can still be processed.
Side output streams: Output severely delayed data to side output streams for separate processing.
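A common watermark strategy is "maximum observed event time minus a fixed out-of-orderness bound". Here is a simplified Java sketch of that idea (names are hypothetical; production frameworks such as Flink refine the exact boundary semantics):

```java
// Bounded-out-of-orderness watermark: the watermark trails the maximum
// observed event time by a fixed bound. Events with a timestamp at or
// below the current watermark are treated as late.
public class BoundedWatermark {
    private final long maxOutOfOrdernessMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    public BoundedWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Observe an event; returns true if it arrived late (behind the watermark).
    public boolean observe(long eventTimeMs) {
        boolean late = eventTimeMs <= currentWatermark();
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return late;
    }

    public long currentWatermark() {
        return maxEventTimeMs == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeMs - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        BoundedWatermark wm = new BoundedWatermark(5_000); // tolerate 5s disorder
        wm.observe(10_000);
        System.out.println(wm.currentWatermark()); // 5000
        System.out.println(wm.observe(6_000));     // false: within the bound
        System.out.println(wm.observe(4_000));     // true: behind the watermark
    }
}
```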
In distributed systems, data may duplicate due to retries, fault recovery, etc.
Solutions:
Idempotency: Ensure processing logic is idempotent — processing the same data multiple times doesn't produce incorrect results.
Deduplication: Deduplicate data using unique IDs.
Exactly-Once semantics: Use stream processing frameworks (like Flink) that support Exactly-Once semantics.
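Deduplication by unique ID can be sketched with a seen-set, as below (names hypothetical). In production the set would live in the stream processor's state backend or an external store, and entries would be expired to bound its size:

```java
import java.util.HashSet;
import java.util.Set;

// Deduplicate a stream by event ID: the first occurrence is processed,
// later duplicates (e.g. from producer retries or replays) are dropped.
public class Deduplicator {
    private final Set<String> seenIds = new HashSet<>();

    // Returns true if the event is new and should be processed.
    public boolean accept(String eventId) {
        return seenIds.add(eventId); // Set.add() returns false for duplicates
    }

    public static void main(String[] args) {
        Deduplicator dedup = new Deduplicator();
        System.out.println(dedup.accept("order-1")); // true: first occurrence
        System.out.println(dedup.accept("order-1")); // false: duplicate dropped
        System.out.println(dedup.accept("order-2")); // true: new event
    }
}
```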
Stream processing requires maintaining state (like window data, accumulators), which is a challenge.
Solutions:
State backends: Use high-performance state backends (like RocksDB) to store state.
State snapshots: Periodically snapshot state for fault recovery.
State cleanup: Timely clean up expired state to avoid unlimited state growth.
Some keys have particularly large data volumes, causing processing skew and affecting performance.
Solutions:
Salting: Salt hot keys to distribute data across multiple partitions.
Two-stage aggregation: First perform local aggregation, then global aggregation.
Dynamic load balancing: Dynamically adjust partitioning strategy based on data distribution.
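Salting combined with two-stage aggregation can be sketched as follows (all names hypothetical): append a salt to each key so a hot key's records spread across several partial aggregates, then strip the salt and merge the partials. This sketch salts round-robin; a real job would salt by a hash or a random bucket:

```java
import java.util.HashMap;
import java.util.Map;

// Two-stage aggregation with salting: a hot key is split into N salted
// subkeys so the first (local) stage spreads its load across partitions;
// the second (global) stage strips the salt and merges the partials.
public class SaltedAggregation {
    static final int SALT_BUCKETS = 4;

    // Stage 1: count by salted key, e.g. "hot#0" .. "hot#3".
    static Map<String, Long> localAggregate(String[] keys) {
        Map<String, Long> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            String salted = keys[i] + "#" + (i % SALT_BUCKETS);
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2: strip the salt and merge partial counts per original key.
    static Map<String, Long> globalAggregate(Map<String, Long> partial) {
        Map<String, Long> result = new HashMap<>();
        partial.forEach((salted, count) ->
            result.merge(salted.substring(0, salted.indexOf('#')), count, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"hot", "hot", "hot", "hot", "hot", "cold"};
        Map<String, Long> totals = globalAggregate(localAggregate(keys));
        System.out.println(totals.get("hot"));  // 5
        System.out.println(totals.get("cold")); // 1
    }
}
```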
Clarify real-time requirements: Second-level, minute-level, or hour-level?
Identify key scenarios: Which scenarios need real-time data most?
Assess technical capability: Does the team have technical capability for real-time data analysis?
Assess costs: What are the infrastructure and labor costs for real-time data analysis?
CDC tools: Select appropriate CDC tools based on database type (Canal, Debezium, Maxwell).
Message queues: Kafka is the most common choice.
Stream processing frameworks: Flink is currently the most popular choice; Spark Streaming is also an option.
Real-time databases: Select ClickHouse, Druid, or Pinot based on query patterns.
Select pilot scenarios: Choose relatively simple, clear-value scenarios as pilots.
Quick verification: Quickly build prototypes to verify technical feasibility.
Iterate and optimize: Optimize architecture and implementation based on pilot experience.
Expand scenarios: Gradually expand real-time data analysis to more scenarios.
Improve architecture: Improve real-time data analysis architecture based on business needs.
Establish standards: Establish real-time data development and operations standards.
Background: A mid-sized e-commerce platform handling 100,000 orders per day wanted to monitor key metrics such as order volume, GMV, and inventory in real time during promotions, and to discover and resolve issues promptly.
Challenges:
Solutions:
CDC data collection: Use Canal to capture changes from the MySQL order tables in real time and push them to Kafka.
Stream processing: Use Flink to process the order stream in real time, computing metrics such as order volume, GMV, inventory, and conversion rate.
Real-time storage: Write computation results to ClickHouse, supporting multi-dimensional queries.
Cache optimization: Cache frequently queried metrics (like current GMV, current order volume) to Redis, reducing query latency.
Real-time dashboards: Develop real-time dashboards displaying key metrics for operations team monitoring.
Architecture diagram:
MySQL (Order Table)
↓ binlog
Canal
↓ Order change events
Kafka
↓ Consume
Flink
├─ Real-time order volume statistics
├─ Real-time GMV statistics
├─ Real-time inventory updates
└─ Real-time conversion rate computation
↓ Write
ClickHouse + Redis
↓ Query
Real-time dashboards / API
Results:
Real-time data analysis is an important direction in enterprise digital transformation, helping enterprises achieve instant decision-making, anomaly early warning, and improved user experience. From T+1 batch processing to near-real-time to real-time stream processing, data analysis architecture has undergone profound evolution.
Key technologies for real-time data analysis include CDC, stream processing, real-time data storage, cache optimization, and incremental computing. Enterprises can choose Lambda architecture, Kappa architecture, or real-time data warehouse architecture based on their needs.
Implementing real-time data analysis requires starting with simple scenarios, verifying quickly, and expanding gradually. Also pay attention to challenges such as out-of-order data, data duplication, state management, and data skew, and adopt the corresponding solutions.
Ultimately, the goal of real-time data analysis is to enable enterprises to grasp business conditions in real-time, respond quickly to changes, and maintain advantage in fierce market competition.