Data Ingestion Patterns: Streaming, Zero-ETL, CDC, and Best Practices on AWS


Previous: AWS Auxiliary Services

Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

Chapter 4 covers data ingestion and transformation. This is Part 1, focused on ingestion. Getting data into AWS is the first step of any analytics pipeline. It sounds simple, but the number of services and patterns you need to know is substantial.

Data Ingestion Overview

Data ingestion is the process of importing data from various sources into AWS storage and processing systems. The book breaks it into three patterns:

  • Real-time streaming - high-velocity data that needs immediate ingestion. IoT sensor readings, clickstream data, social media feeds. Continuous flow of small records.
  • Batch - periodic data loads at scheduled intervals. Daily database dumps, weekly report files. Focus is on throughput, not latency.
  • Near real time - semi-frequent updates with slight delay tolerance. Inventory updates, price changes. Typically uses CDC (change data capture) mechanisms.

Most organizations deal with all three. The trick is picking the right AWS service for each pattern.

Real-Time Streaming Data Ingestion

A streaming data pipeline has five components: stream sources, stream ingestion, stream storage, stream consumption/processing, and destination data store.

Stream Ingestion Layer

This layer collects data from sources and pushes it into stream storage. Several options:

  • Kinesis Agent - a standalone Java app that monitors files (like logs) and continuously ingests new data into Kinesis Data Streams. Installed on Linux servers. Each newline is a record by default, but you can configure multiline.
  • DynamoDB Streams - captures item-level changes from DynamoDB tables and can stream them through Kinesis Data Streams.
  • AWS IoT Core - connects IoT devices via MQTT and routes data into KDS or MSK.
  • AWS DMS - connects to transactional databases, does CDC, and loads into KDS or MSK.
  • MSK Connect - continuously ingests data from files, does CDC from databases, loads into Amazon MSK.
  • Custom producers - using the AWS SDK or Kinesis Producer Library (KPL). High implementation complexity and operational overhead. Think carefully before going this route.

Many AWS services like CloudWatch, EventBridge, CloudFront can also stream directly into KDS.
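Custom SDK producers typically push data with the PutRecords API, which caps each call at 500 records and 5 MB total, with 1 MB per record. A minimal sketch of batching within those limits (the helper name and record format are illustrative, not from the book):

```python
# Sketch: group records into PutRecords-sized batches for a custom KDS
# producer. Limits follow the documented PutRecords API quotas:
# 500 records / 5 MB per call, 1 MB per record.
MAX_RECORDS_PER_CALL = 500
MAX_BYTES_PER_CALL = 5 * 1024 * 1024
MAX_BYTES_PER_RECORD = 1024 * 1024

def batch_records(records: list) -> list:
    """Group records into PutRecords-sized batches, preserving order."""
    batches, current, current_bytes = [], [], 0
    for rec in records:
        if len(rec) > MAX_BYTES_PER_RECORD:
            raise ValueError("record exceeds the 1 MB per-record limit")
        if (len(current) == MAX_RECORDS_PER_CALL
                or current_bytes + len(rec) > MAX_BYTES_PER_CALL):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += len(rec)
    if current:
        batches.append(current)
    return batches

# 1,200 small records need 3 API calls instead of 1,200
print([len(b) for b in batch_records([b"x" * 100] * 1200)])  # [500, 500, 200]
```

In a real producer each batch would be passed to `kinesis.put_records`, with retry handling for partially failed batches.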

Stream Storage

Two options:

  • Amazon Kinesis Data Streams (KDS)
  • Amazon Managed Streaming for Apache Kafka (MSK)

Both collect data from producers, store it for a set duration, and distribute to consumers. They decouple producers from consumers.

Stream Consumers

Consumer services read from stream storage and load into target data stores. Options include Amazon Data Firehose, Redshift streaming ingestion, Managed Flink, Glue streaming jobs, Spark/Flink on EMR, Lambda, or custom KCL applications.

Destination Data Stores

Common targets for streaming data analytics:

  • Data lakes - Amazon S3
  • Data warehouses - Amazon Redshift
  • Search and log analytics - Amazon OpenSearch

Kinesis Data Streams vs Amazon MSK

This comparison comes up on the exam. Here’s the summary:

| Attribute | Kinesis Data Streams | Amazon MSK |
| --- | --- | --- |
| Management overhead | Low | Low (Serverless) to medium (Provisioned) |
| Scalability | Seconds, one click | Minutes, one click |
| Throughput | On-demand with limits | Higher of the two |
| Open source | No | Yes |
| Data retention | Up to 365 days | Configurable, longer; tiered storage to S3 |
| Latency | 70 ms with enhanced fan-out, 200-500 ms without | Lowest |

Practical rule: check if your sources and targets have direct integration with the service. KDS has deep integration with AWS services. MSK has a richer open source connector ecosystem. If you use DuckDB or ClickHouse, MSK will be easier to work with.

Most AWS-native architectures lean toward KDS because of the integrations. If you have existing Kafka expertise or need open source compatibility, MSK is the way to go.

Sample Streaming Ingestion Use Cases

IoT Devices to Data Lake

Architecture: IoT devices publish MQTT messages to AWS IoT Core. IoT Core routes to Amazon MSK (chosen for lowest latency). Amazon Data Firehose reads from MSK and delivers to S3.

Firehose can convert JSON to Parquet or ORC on the fly. For non-JSON input (CSV, XML), it invokes a Lambda function to convert to JSON first, then to Parquet. It can also compress with GZIP, ZIP, or SNAPPY before delivery.
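The non-JSON path uses Firehose's documented record-transformation contract: the Lambda receives base64-encoded records and must return each one with a `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the transformed `data`. A sketch of a CSV-to-JSON transform (the three-column schema is an assumption for illustration):

```python
import base64
import json

# Hypothetical CSV schema for the incoming records; adjust to your feed.
CSV_COLUMNS = ["device_id", "metric", "value"]

def lambda_handler(event, context):
    """Firehose transformation Lambda: CSV lines -> newline-delimited JSON."""
    output = []
    for record in event["records"]:
        try:
            line = base64.b64decode(record["data"]).decode("utf-8").strip()
            row = dict(zip(CSV_COLUMNS, line.split(",")))
            payload = (json.dumps(row) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        except Exception:
            # Records marked ProcessingFailed are routed to the error prefix in S3
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

With the records now JSON, Firehose's built-in format conversion can produce Parquet or ORC before delivery.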

Clickstreams to Redshift for Real-Time Reporting

Redshift has a streaming ingestion feature that can ingest directly from KDS, MSK, or self-managed Kafka. Data lands in a materialized view (MV) within Redshift.

The MV refresh is incremental. Only new records since last refresh get loaded. You can set refresh to manual or automatic. Automatic refresh runs as soon as new data arrives, but Redshift prioritizes your workloads over auto-refresh. If you need deterministic refresh timing, use manual.

Semi-structured data (JSON) can be loaded as-is into Redshift’s SUPER data type, or shredded into individual columns. You can define transformations in the MV definition.

DynamoDB CDC to Data Lake

DynamoDB Streams captures changes in real time. These flow to KDS through direct integration, then to S3 via Data Firehose. Straightforward pattern for getting DynamoDB data into your data lake.

AWS Logs to OpenSearch

CloudWatch log groups have subscription filters that send data directly to Amazon Data Firehose (ADF). Firehose delivers to OpenSearch or third-party solutions like Splunk.

ADF delivers in batches, typically with latencies from a few minutes to an hour. It has native integration with S3 (Iceberg or Parquet), Redshift, OpenSearch, and third-party solutions like Splunk and Dynatrace.

ADF also supports optional backup of raw incoming data to S3. Records that fail transformation get routed to a separate S3 location. Useful for debugging.

Zero-ETL Integrations

Zero-ETL is a set of integrations that removes the need to build ETL pipelines for point-to-point data movement. When available, zero-ETL is almost always the best choice. Lowest operational complexity, most cost-effective, reliable.

Important: zero-ETL provides near-real-time latency. If you need sub-second latency, go with streaming ingestion instead.

Here’s what’s available:

| Source | Amazon S3 (Data Lake) | Amazon Redshift (DW) | Amazon OpenSearch |
| --- | --- | --- | --- |
| DynamoDB | Iceberg tables in S3 | Analytics on DynamoDB data | Full-text and vector search |
| DocumentDB | - | - | Fuzzy search, cross-collection search |
| Aurora MySQL | - | Transactional data for analytics | - |
| Aurora Postgres | - | Transactional data for analytics | - |
| RDS MySQL | - | Transactional data for analytics | - |
| Amazon S3 | - | Auto-copy from S3 | Query operational logs |
| SaaS apps | Available | Facebook Ads, Salesforce, SAP, ServiceNow, Zendesk, etc. | - |
| CloudWatch Logs | - | - | Direct querying and visualization |
| Security Lake | - | - | Direct search and analysis |

Not every source-target combination has a zero-ETL integration. When one exists, use it.

CDC with AWS Database Migration Service

When zero-ETL isn’t available, AWS DMS handles CDC-based ingestion. DMS uses a replication instance to connect to the source database, captures changes using transaction logs, and streams them to the target in near real time.

DMS can also do one-time migration on top of ongoing CDC. It supports both homogeneous and heterogeneous database migrations, with multi-AZ deployment for high availability.

Supported Sources

  • On-premises databases: Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB, SAP ASE, IBM Db2
  • Third-party cloud databases: Azure SQL, Azure PostgreSQL, Azure MySQL, Google Cloud MySQL/PostgreSQL, OCI MySQL Heatwave
  • Amazon RDS instances: all major engines including Aurora
  • Amazon S3 data lakes
  • Amazon DocumentDB

Supported Targets

  • Amazon RDS instances (all major engines)
  • Amazon Redshift
  • Amazon S3
  • DynamoDB
  • OpenSearch
  • Apache Kafka clusters
  • Kinesis Data Streams
  • DocumentDB
  • Neptune
  • Databases on EC2 or on premises

DMS Use Cases

Ingesting data into S3 data lake - DMS can replicate from any supported source to S3 in near real time. Important settings: cdcMaxBatchInterval (max time in seconds to accumulate CDC data before writing) and cdcMinFileSize (minimum file size in KB before writing). Tune these to avoid files that are too small or too large (over 1 GB). Set target format to Parquet instead of default CSV for better query performance.
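Those endpoint settings can be expressed as an S3 target settings fragment. A sketch with illustrative values (the setting names follow the DMS S3 endpoint settings; the numbers are assumptions to tune per workload, not defaults):

```python
import json

# Illustrative DMS S3 target endpoint settings for CDC delivery.
# Values are assumptions to tune; only the keys come from DMS.
s3_endpoint_settings = {
    "DataFormat": "parquet",      # default is CSV; Parquet queries faster
    "CdcMaxBatchInterval": 300,   # max seconds to accumulate changes per file
    "CdcMinFileSize": 32000,      # min file size in KB before writing
    "CompressionType": "GZIP",
}
print(json.dumps(s3_endpoint_settings, indent=2))
```

Raising `CdcMinFileSize` and `CdcMaxBatchInterval` together trades latency for fewer, larger files, which is usually the right trade for a data lake.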

Ingesting data into Redshift - for sources without zero-ETL integration. Common pattern is: use AWS Schema Conversion Tool (SCT) for one-time schema conversion, then DMS for initial data migration and ongoing CDC. DMS can create schemas too, but SCT does it better.

DMS Schema Conversion - converts schemas between different database types. Creates an assessment report on migration complexity first. Converts tables, views, stored procedures, functions, data types. Objects that can’t convert automatically get marked for manual conversion.

Ingesting on-premises files - AWS DataSync moves file data between on-premises storage and AWS. Uses parallel transfers, minimizes network overhead, handles encryption and data integrity. Copies only changed files after initial seeding. Pay-as-you-go pricing based on data volume.

Third-party datasets - AWS Data Exchange is a marketplace for third-party data. Providers publish datasets, consumers subscribe and integrate. Supports S&P Global, weather data, FINRA data, and many more. Integrates with existing analytics and ML pipelines.

Best Practices for Streaming Ingestion

Kinesis Data Streams Architecture

A KDS data stream is a set of shards. Each shard has a sequence of data records (up to 1 MB each). Producers push data in, consumers process it in real time.

Two capacity modes:

Provisioned mode - each shard gets 1,000 records/sec write (1 MB/sec max) and 2 MB/sec read. You manage shard count. Good for predictable workloads.

On-demand mode - AWS manages shards automatically. Base write throughput of 4 MB/sec, scales up to 10 GB/sec. Base read of 8 MB/sec, scales to 20 GB/sec. Good for unpredictable traffic.
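For provisioned mode, the per-shard limits above give a back-of-the-envelope shard count: take the maximum of what writes, record rate, and reads each require. A sketch (the function name is illustrative):

```python
import math

# Sketch: estimate provisioned-mode shard count from expected traffic,
# using the per-shard limits above (1 MB/s or 1,000 records/s write,
# 2 MB/s read). Ignores enhanced fan-out and headroom for spikes.
def estimate_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    by_write = math.ceil(write_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    by_read = math.ceil(read_mb_per_sec / 2.0)
    return max(1, by_write, by_records, by_read)

# 6 MB/s in, 4,500 records/s, 10 MB/s out -> writes are the bottleneck
print(estimate_shards(6, 4500, 10))  # 6
```

In practice you would add headroom (say 20-30%) on top of this estimate and monitor `WriteProvisionedThroughputExceeded` to know when to split.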

Best Practices for Sharding

When using provisioned mode:

  • Determine optimal shard count - analyze expected throughput, start with minimum shards, scale up as needed.
  • Distribute data evenly - use a well-distributed partition key. Avoid sequential or predictable keys. Hash or random prefix helps.
  • Handle splits and merges - monitor utilization. Split shards for more capacity, merge to reduce costs. Use UpdateShardCount API.
  • Maintain ordering within shards - process records by sequence number if order matters.
  • Implement shard-level checkpointing - maintain offsets for each shard for reliable reprocessing after failures.
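The "distribute data evenly" advice follows from how KDS routes records: the partition key is MD5-hashed into a 128-bit space, and each shard owns a contiguous hash range. A sketch, assuming the hash space is split evenly across shards (real shard ranges come from DescribeStream):

```python
import hashlib

# Sketch of KDS routing: MD5(partition key) as a 128-bit integer, mapped
# onto evenly split shard hash ranges.
def shard_for_key(partition_key, num_shards):
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)

# Varied keys (e.g. per-user IDs) spread load across shards;
# a constant key sends every record to one shard.
hits = [shard_for_key(f"user-{i}", 4) for i in range(1000)]
print({s: hits.count(s) for s in range(4)})
```

This is why sequential or low-cardinality keys create hot shards: the hash of a constant key always lands in the same range.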

Consuming Data from KDS

Two consumer modes:

Shared throughput - default mode. 2 MB/sec per shard shared across all consumers. More consumers = less throughput each.

Enhanced fan-out - each consumer gets dedicated 2 MB/sec per shard. Consumers scale linearly without contention. Data latency as low as 70ms.

If you need performance, use enhanced fan-out.
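The trade-off between the two modes is simple arithmetic, sketched here:

```python
# Sketch: per-consumer read throughput per shard under each consumer mode.
def per_consumer_mb_per_sec(num_consumers, enhanced_fan_out):
    # Shared mode splits 2 MB/s per shard across all consumers;
    # enhanced fan-out gives each registered consumer a dedicated 2 MB/s.
    return 2.0 if enhanced_fan_out else 2.0 / num_consumers

print(per_consumer_mb_per_sec(4, enhanced_fan_out=False))  # 0.5
print(per_consumer_mb_per_sec(4, enhanced_fan_out=True))   # 2.0
```

With four shared consumers, each gets only 0.5 MB/sec per shard; enhanced fan-out keeps all four at the full 2 MB/sec (at extra cost per consumer-shard hour).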

Best Practices for MSK

Provisioned vs Serverless clusters:

Provisioned clusters have brokers across 3 AZs by default. Two broker types:

  • Standard brokers - flexible sizing, fixed storage per broker. AWS handles hardware maintenance.
  • Express brokers - simpler to manage, elastic storage with pay-as-you-go pricing. Up to 3x throughput per broker, 20x faster scaling, 90% quicker recovery. Come preconfigured with best practices.

Serverless clusters auto-scale without manual sizing. Max 200 MiB/sec write, 400 MiB/sec read, 2400 partitions, unlimited storage.

Starting point: provisioned cluster with express brokers.

General MSK practices:

  • Choose partition keys that distribute data evenly. Avoid keys where one value dominates 90% of traffic.
  • Rightsize your cluster following AWS best practices for your broker type.
  • Use MSK Connect for integration with external sources and targets. Source connectors pull data into MSK, sink connectors push data out. Each connector runs workers (JVM processes) that parallelize the work.
  • Use MSK Replicator for cross-region replication. Set network bandwidth quotas on the Replicator’s execution role to prevent throttling other consumers.
  • For large instance types, tune num.io.threads and num.network.threads.
  • Use 3 AZs, replication factor of at least 3, proper min.insync.replicas.
  • Keep CPU under 60%, disk under 85%, heap memory after GC under 60%.
  • Don’t add non-MSK brokers. Enable in-transit encryption.
  • Rebalance after scaling with the partition reassignment tool.
  • Set log retention to 7 days for Replicator source and target clusters.
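The hot-key check in the first bullet is easy to run against a sample of produced keys. A sketch (the 90% threshold mirrors the guidance above; the function name is illustrative):

```python
from collections import Counter

# Sketch: flag a partition key whose value dominates traffic, which would
# concentrate load on one partition/broker.
def dominant_key(keys, threshold=0.9):
    counts = Counter(keys)
    key, hits = counts.most_common(1)[0]
    return key if hits / len(keys) > threshold else None

print(dominant_key(["a"] * 95 + ["b"] * 5))  # 'a' dominates: 95% of traffic
print(dominant_key(["a", "b", "c", "d"]))    # None: evenly spread
```

Running this over a day's sample of keys before choosing a partitioning scheme is cheaper than rebalancing a skewed cluster later.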

Best Practices for Data Firehose

ADF delivers data to S3, OpenSearch, and Redshift. It can convert JSON to Parquet, compress/decompress, add delimiters, batch data, and create dynamic partitions for S3.

Key practices:

  • Buffer tuning - set 0 second buffering interval for low-latency use cases. Increase buffer size and interval for high-throughput destinations like S3.
  • Dynamic partitioning - partition delivered data into S3 prefixes using keys extracted from the records (for example, a customer ID or event date), so downstream queries scan only the relevant prefixes. Firehose itself scales automatically to absorb spikes and sustained high data rates.
  • Pick direct integrations - if Firehose has native support for your target, it gives you the least operational overhead.
  • Configure delivery settings - buffer size, interval, compression, and file type all affect cost and performance.
  • Use batching and partitioning - group records into single deliveries, partition by time or other attributes.
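The buffer-tuning behavior is "flush on whichever threshold is hit first": a delivery goes out when the buffer reaches its size limit or its interval elapses. A minimal simulation of that rule (class and thresholds are illustrative):

```python
# Sketch of Firehose-style buffering: flush when either the buffer size
# or the buffering interval threshold is reached, whichever comes first.
class Buffer:
    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None

    def add(self, record, now):
        """Add a record; return the flushed batch if a threshold was hit."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            flushed = self.records
            self.records, self.size, self.opened_at = [], 0, None
            return flushed   # one delivery to the destination
        return None          # keep buffering

buf = Buffer(max_bytes=1024, max_seconds=60)
print(buf.add(b"x" * 100, now=0.0))    # None: under both thresholds
print(buf.add(b"x" * 1000, now=1.0))   # flushes: size threshold crossed
```

A small size and 0-second interval minimize latency but create many tiny objects in S3; large values do the opposite, which is the cost/performance trade-off the bullet describes.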

Best Practices for DMS

Replication instance sizing:

  • Use C4 or R4 for CPU-intensive heterogeneous migrations. R4 for memory-heavy workloads.
  • Increase storage for logs and cached data.
  • Use multi-AZ for high availability.

Loading performance:

  • Load multiple tables in parallel (default is 8, adjust based on instance size).
  • Use parallel full load for partitioned tables.
  • Drop secondary indexes, constraints, and triggers before full load. Add indexes before CDC phase. Enable triggers before cutover.
  • Turn off backups and logging on target during migration.

LOB settings:

  • Use Limited LOB mode (32 KB default).
  • Increase limit if needed. Use per-table settings.
  • Use Inline LOB mode for mixed sizes.

DMS with Redshift target specifically:

  • Enable BatchApplyEnabled since Redshift isn’t designed for transactional one-by-one changes.
  • Set BatchSplitSize=0 and increase BatchApplyMemoryLimit (e.g., 1024 MB).
  • Increase maxFileSize (e.g., 250,000 KB) and fileTransferUploadStreams (e.g., 20) for large migrations.
  • Use primary keys on source and target tables.
  • Avoid VARCHAR larger than 64 KB.
  • Use ParallelApplyThreads and ParallelApplyBufferSize for multithreaded CDC.
  • Use ParallelLoadThreads with MaxFullLoadSubTasks for parallel full loads.
  • Create Redshift tables with proper distribution and sort keys.
  • Monitor WLM queue disk spill. Use AutoWLM or customize queues.
  • Check for locks and blocking sessions in Redshift.
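Several of the knobs above live in the DMS task settings JSON. A sketch of how they fit together for a Redshift target (key placement follows the DMS task settings structure; the values are illustrative tuning choices, not defaults):

```python
import json

# Illustrative DMS task-settings fragment for a Redshift target,
# combining the batch-apply and parallelism knobs above.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True,        # batch CDC changes instead of row-by-row
        "ParallelApplyThreads": 8,        # multithreaded CDC apply
        "ParallelApplyBufferSize": 100,
        "ParallelLoadThreads": 8,         # parallel full load
    },
    "ChangeProcessingTuning": {
        "BatchApplyMemoryLimit": 1024,    # MB
        "BatchSplitSize": 0,              # do not split batches
    },
}
print(json.dumps(task_settings, indent=2))
```

Note that `maxFileSize` and `fileTransferUploadStreams` are configured on the Redshift endpoint rather than in the task settings.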

Wrapping Up

Data ingestion is about matching the right pattern to the right service. Streaming goes through KDS or MSK. Near-real-time goes through zero-ETL when available, or DMS CDC when it’s not. Batch files move with DataSync or direct S3 uploads. Third-party data comes from AWS Data Exchange.

The decision tree is simple: if zero-ETL exists for your source-target pair, use it. If you need sub-second latency, use streaming. If you need CDC from databases, use DMS. Everything else, evaluate based on operational overhead and integration support.
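That decision tree can be sketched as a function. The branch order puts the latency check first, since zero-ETL only offers near-real-time delivery; labels are the article's shorthand, not an official AWS taxonomy:

```python
# Sketch of the article's ingestion decision tree.
def pick_ingestion(zero_etl_exists, needs_subsecond, needs_db_cdc):
    if needs_subsecond:
        return "streaming (KDS or MSK)"   # zero-ETL is only near real time
    if zero_etl_exists:
        return "zero-ETL integration"     # lowest operational complexity
    if needs_db_cdc:
        return "DMS CDC"
    return "evaluate: Firehose, DataSync, or batch load"

print(pick_ingestion(zero_etl_exists=True, needs_subsecond=False, needs_db_cdc=False))
```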

Start simple. Zero-ETL or Data Firehose will cover 80% of use cases. Only reach for custom KCL consumers or complex Kafka setups when the simpler options genuinely can’t handle your requirements.

Next part covers data transformation with Glue, EMR, and Redshift.

Next: Data Transformation with Glue, EMR, and Redshift



denis256 at denis256.dev