Modern data platforms are no longer built for batch-only workloads. Businesses today require near real-time insights for monitoring, analytics, and automation. This is where a well-designed real-time data pipeline on AWS becomes critical.
In this guide, you will learn how to design and implement a production-grade architecture using AWS DMS, Amazon Kinesis, and Delta Lake, along with the practical considerations that generic tutorials often miss. The goal is not theory: it is a pipeline you can actually build and scale.
End-to-End Architecture of Real-Time Data Pipeline on AWS

A robust real-time data pipeline architecture on AWS typically looks like this:
Source Database → AWS DMS → Kinesis Data Streams → S3 (Bronze Layer) → Spark (EMR/Glue) → Delta Lake (Silver Layer)
Each component plays a specific role:
1. Source Layer (PostgreSQL / OLTP Systems)
Your source system is typically a transactional database such as PostgreSQL.
Key requirements:
- Enable logical replication (for CDC)
- Ensure proper indexing for replication performance
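For self-managed PostgreSQL, logical replication is enabled in postgresql.conf (on Amazon RDS, set the rds.logical_replication parameter instead). A minimal sketch; the slot and sender counts are illustrative and depend on how many DMS tasks you run:

```ini
# postgresql.conf (illustrative values)
wal_level = logical          # required for CDC
max_replication_slots = 10   # at least one per DMS task
max_wal_senders = 10         # at least one per replication connection
```

A restart is required after changing wal_level.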
2. AWS DMS (Change Data Capture Engine)
AWS DMS (Database Migration Service) is used to capture real-time changes (CDC) from the source database. Configuration essentials:
- Use CDC mode instead of full load for streaming use cases
- Enable transaction logs (WAL for PostgreSQL)
- Set appropriate batch apply settings
Important considerations:
- DMS guarantees at-least-once delivery, not exactly-once
- High-throughput workloads require tuning ParallelLoadThreads, MaxFullLoadSubTasks, and buffer sizes
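These knobs live in the DMS task settings JSON. The values below are illustrative starting points, not recommendations; tune them against your own throughput tests:

```json
{
  "TargetMetadata": {
    "ParallelLoadThreads": 8,
    "ParallelLoadBufferSize": 500
  },
  "FullLoadSettings": {
    "MaxFullLoadSubTasks": 16,
    "CommitRate": 10000
  }
}
```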
3. Amazon Kinesis Data Streams (Streaming Backbone)
Amazon Kinesis Data Streams acts as the ingestion buffer and streaming layer.
Key concepts:
- Shards define throughput (each shard supports roughly 1 MB/s and 1,000 records/s on write, 2 MB/s on read)
- Partition key determines distribution
Best practices:
- Avoid hot shards by using a well-distributed partition key
- Use composite keys (e.g., tag_id + timestamp)
- Monitor shard-level metrics (IteratorAge, IncomingBytes)
Why Kinesis:
- Durable, scalable, and integrates natively with AWS ecosystem
- Enables near real-time processing
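Kinesis routes each record by taking the MD5 hash of its partition key and mapping it onto the shards' hash-key ranges. The pure-Python sketch below mimics that routing (assuming evenly split ranges) to show why a composite key spreads load while a low-cardinality key does not; the key formats are hypothetical examples:

```python
import hashlib

NUM_SHARDS = 4

def shard_for_key(partition_key, num_shards=NUM_SHARDS):
    """Mimic Kinesis routing: the MD5 hash of the partition key is
    mapped onto evenly split shard hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // 2 ** 128

# Low-cardinality key: every record lands on the same shard (hot shard).
hot = {shard_for_key("plant-1") for _ in range(1000)}

# Composite key (hypothetical tag_id + timestamp): load spreads out.
spread = {
    shard_for_key(f"plant-1#tag-{i % 50}#2024-01-01T00:{i // 50:02d}")
    for i in range(1000)
}

print(len(hot), len(spread))  # 1 shard used vs. all 4 shards used
```

The same idea explains the "avoid sequential keys" advice: distribution comes from hash spread, not from key ordering.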
4. S3 Bronze Layer (Streaming Landing Zone)
Data from Kinesis is delivered to S3 using:
- Kinesis Data Firehose (simpler)
- Or custom consumer (Lambda / Spark Streaming)
Best practices:
- Store data in a partitioned layout (e.g., year/month/day/hour)
- Use compressed columnar formats (Parquet preferred over JSON)
- Maintain schema consistency
This becomes your raw data lake layer.
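A bronze object key with year/month/day/hour partitioning can be built as below; the bucket layout, table name, and batch-id naming are assumptions for illustration:

```python
from datetime import datetime, timezone

def bronze_key(table, event_time, batch_id):
    """Build an S3 object key with Hive-style year/month/day/hour
    partitions for the bronze landing zone."""
    t = event_time.astimezone(timezone.utc)
    return (
        f"bronze/{table}/"
        f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"
        f"{batch_id}.parquet"
    )

key = bronze_key(
    "sensor_readings",
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
    "batch-0001",
)
print(key)
# bronze/sensor_readings/year=2024/month=03/day=05/hour=14/batch-0001.parquet
```

Partitioning on event time (not arrival time) keeps late-arriving data in the correct partition.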
5. Processing Layer (Spark – EMR / AWS Glue)
Use Spark Structured Streaming or micro-batch processing to transform raw data.
Responsibilities:
- Deduplication (critical due to DMS behavior)
- Schema normalization
- Data enrichment
- Aggregations
Important:
- Handle late-arriving data
- Implement checkpointing (S3-based for fault tolerance)
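In Spark Structured Streaming, checkpointing is just the checkpointLocation option pointed at S3. The pure-Python sketch below illustrates the underlying idea of idempotent, restart-safe processing with a persisted offset; the file-based checkpoint and record shape are assumptions for illustration:

```python
import json
import os
import tempfile

def process_batch(records, checkpoint_path):
    """Process records idempotently: skip anything at or below the last
    checkpointed sequence number, then advance the checkpoint."""
    last = -1
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            last = json.load(f)["last_seq"]
    processed = [r for r in records if r["seq"] > last]
    if processed:
        with open(checkpoint_path, "w") as f:
            json.dump({"last_seq": processed[-1]["seq"]}, f)
    return processed

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
batch = [{"seq": i, "value": i * 10} for i in range(5)]

first = process_batch(batch, ckpt)   # all 5 records processed
replay = process_batch(batch, ckpt)  # redelivered batch is a no-op
print(len(first), len(replay))
# 5 0
```

This is why at-least-once delivery upstream (DMS, Kinesis) is tolerable: replays become no-ops at the processing layer.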
6. Delta Lake (S3 Silver Layer)
Delta Lake provides:
- ACID transactions
- Schema evolution
- Time travel
Why use Delta Lake:
- Eliminates small file problem
- Enables reliable upserts (MERGE INTO)
- Ensures data consistency for analytics
Design:
- Partition by business keys (e.g., plant, region)
- Optimize with compaction (OPTIMIZE, ZORDER)
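In Spark SQL the upsert is a single MERGE INTO statement against the silver table. The plain-Python sketch below only illustrates its matched-update / unmatched-insert semantics, using a dict keyed by a hypothetical primary key:

```python
def merge_into(target, updates, key="id"):
    """Illustrate MERGE INTO semantics: rows whose key matches an
    existing row are updated; unmatched rows are inserted."""
    for row in updates:
        # Merge new column values over the existing row, if any.
        target[row[key]] = {**target.get(row[key], {}), **row}
    return target

silver = {1: {"id": 1, "plant": "A", "reading": 10.0}}
cdc_batch = [
    {"id": 1, "reading": 12.5},               # matched -> update
    {"id": 2, "plant": "B", "reading": 7.3},  # not matched -> insert
]
merge_into(silver, cdc_batch)
print(silver[1]["reading"], len(silver))
# 12.5 2
```

Because MERGE is transactional in Delta Lake, readers never see a half-applied CDC batch.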
Key Design Decisions That Make or Break Your Pipeline
Partition Key Strategy (Kinesis)
Incorrect partition key design leads to:
- Hot shards
- Throughput bottlenecks
Correct approach:
- Use high-cardinality keys
- Avoid sequential keys
Handling Duplicate Records
Since AWS DMS is at-least-once:
- You MUST implement deduplication logic in Spark
Typical approach:
- Use primary key + timestamp
- Apply window-based deduplication
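The primary key + timestamp approach can be sketched in plain Python; in Spark this is typically a window over the key ordered by timestamp, keeping rank 1. Field names here are hypothetical:

```python
def dedupe_latest(records, key="id", ts="updated_at"):
    """Keep only the most recent version of each primary key, mirroring
    a window/rank-based dedup in Spark."""
    latest = {}
    for r in records:
        cur = latest.get(r[key])
        if cur is None or r[ts] > cur[ts]:
            latest[r[key]] = r
    return list(latest.values())

events = [
    {"id": 1, "updated_at": "2024-01-01T00:00:01", "qty": 5},
    {"id": 1, "updated_at": "2024-01-01T00:00:03", "qty": 8},  # newer wins
    {"id": 2, "updated_at": "2024-01-01T00:00:02", "qty": 1},
    {"id": 2, "updated_at": "2024-01-01T00:00:02", "qty": 1},  # exact duplicate dropped
]
deduped = dedupe_latest(events)
print(sorted((r["id"], r["qty"]) for r in deduped))
# [(1, 8), (2, 1)]
```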
Schema Evolution Strategy
Real-world systems change:
- Columns get added/modified
Solution:
- Use Delta Lake schema evolution (mergeSchema)
- Maintain a schema registry if needed
Fault Tolerance & Checkpointing
Critical for production:
- Use Spark checkpointing in S3
- Ensure idempotent processing
Cost Optimization Strategy (Very Important)
A real-time data pipeline on AWS can become expensive if not designed carefully.
Major cost drivers:
- Kinesis shards
- DMS replication instances
- Spark compute (EMR/Glue)
Optimization tips:
- Right-size shard count based on throughput
- Use auto-scaling for EMR
- Prefer slightly longer micro-batch intervals over ultra-low latency when the use case allows
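Right-sizing the shard count follows directly from the published per-shard write limits (1 MB/s and 1,000 records/s); take whichever constraint binds first, plus headroom for spikes:

```python
import math

def required_shards(mb_per_sec, records_per_sec):
    """Minimum Kinesis shard count from the per-shard write limits:
    1 MB/s or 1,000 records/s, whichever binds first."""
    return max(
        math.ceil(mb_per_sec / 1.0),
        math.ceil(records_per_sec / 1000.0),
        1,
    )

print(required_shards(3.5, 1200))  # 4: bound by bytes/s
print(required_shards(0.2, 5000))  # 5: bound by records/s
```

Many small records can make a stream record-bound long before it is byte-bound, which is easy to miss when sizing only by MB/s.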
When Should You Use This Architecture?
This architecture is ideal for:
- SCADA / IoT streaming systems
- Financial transaction monitoring
- Real-time analytics dashboards
- Event-driven applications
Avoid this if:
- You only need daily batch processing
- Data volume is extremely low
Common Mistakes to Avoid
- Using default partition keys → leads to hot shards
- Ignoring deduplication → results in incorrect analytics
- Over-provisioning Kinesis → unnecessary cost
- Not planning CIDR/IP/networking for EMR/Glue connectivity
- Skipping monitoring (CloudWatch metrics are essential)
Step-by-Step Implementation Summary
- Enable CDC on source database
- Configure AWS DMS with Kinesis as target
- Create Kinesis stream with proper shard count
- Deliver data to S3 (Firehose or consumer)
- Process data using Spark (EMR/Glue)
- Write curated data into Delta Lake
- Optimize and monitor
If you follow these steps correctly, you will have a scalable and production-ready real-time data pipeline on AWS.
Why Most Teams Struggle with This Setup
Even though the architecture looks straightforward, real-world challenges include:
- DMS tuning issues
- Kinesis shard imbalance
- Data duplication
- Schema drift
- Cost overruns
These issues are rarely covered in standard documentation, which is why many teams run into trouble when scaling.
Need Help Building This in Production?
If you’re planning to implement a real-time data pipeline on AWS using DMS + Kinesis + Delta Lake, getting the architecture right from day one is critical.
TheCodeWizard team specializes in:
- Production-grade data pipeline design
- Cost optimization strategies
- High-throughput streaming architectures
- SCADA and IoT-scale ingestion systems
If you want to avoid costly mistakes and accelerate your implementation, you can connect with us by filling out the form on our Services Page. We help teams move from idea to fully functional, scalable data platforms.
