Hands-On: Building Batch and Streaming Data Pipelines on AWS
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O'Reilly Media
ISBN: 978-1-098-17007-3
This is the chapter I was waiting for. After seven chapters of theory, services, security, and governance, Chapter 8 finally says: “OK, build something.” Two complete pipelines, end to end, with real code and real AWS services.
If you learn by doing, this chapter alone is worth the price of the book.
What Is a Data Processing Pipeline?
Quick refresher before the hands-on work. A data processing pipeline is a sequence of steps that takes raw data and transforms it into something useful for analytics.
Typical reasons:
- Cleansing data and improving quality
- Aggregating with business rules
- Formatting for time series or ML
- Building data models for BI reporting
- Preparing data for downstream systems
High-level architecture always looks the same: data sources, ingestion layer, processing layer, consumption layer. The details change depending on batch or real-time.
Chapter 8 builds both.
Batch Processing Pipeline
Batch processing means you collect a bunch of records and process them all at once. Schedule or on demand.
The Use Case
A sales team uploads CSV files to S3 at the end of every month. Over 20,000 sales opportunity records with fields like date, salesperson name, market segment, and forecasted monthly revenue. Goal: transform this data for BI reporting.
Simple, realistic, common. The kind of pipeline most data engineers build first.
The Architecture
- Input CSV files land in an S3 raw input bucket
- AWS Glue PySpark ETL job reads from S3, transforms the data
- Transformed data gets loaded into Amazon Redshift Serverless
- Amazon QuickSight connects to Redshift and creates visualizations
EventBridge scheduler triggers it. Every month (or whatever schedule you set), the pipeline runs automatically.
Clean, straightforward architecture. S3 as landing zone, Glue for transformation, Redshift for warehousing, QuickSight for dashboards. All managed services. No servers to maintain.
Solid starter architecture. For a small to medium dataset, works perfectly. For production I’d add error handling, maybe Step Functions for orchestration, and definitely monitoring. For learning the concepts though, exactly right.
Step 1: Create S3 Buckets
Create an S3 bucket for input data. The authors name it ch8-ex1-input-data and remind you S3 bucket names must be globally unique. Add random characters at the end.
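The "add random characters" advice is easy to automate. A minimal sketch (the base name comes from the book; the suffix logic and function name are mine):

```python
import re
import uuid

def unique_bucket_name(base: str) -> str:
    """Append a random suffix so the bucket name is globally unique.

    S3 naming rules: 3-63 chars, lowercase letters, digits, and hyphens.
    """
    suffix = uuid.uuid4().hex[:8]  # 8 random hex characters
    name = f"{base}-{suffix}"
    # Sanity-check against the S3 naming rules before trying to create it.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name

print(unique_bucket_name("ch8-ex1-input-data"))
```

Eight hex characters give you roughly four billion possibilities, which is plenty to avoid collisions with other AWS accounts.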
Upload the sales CSV. Done. Simple, as it should be.
Step 2: Create Redshift Serverless Cluster
Set up the destination. Redshift Serverless is the right call for a learning exercise. No cluster sizing decisions, no node type selection.
Key settings:
- Workgroup name: ch8-ex1-redshift-serverless
- Base capacity: 8 RPUs (the minimum)
- Namespace: salesdata
- Custom admin credentials with a password
Important detail: create an IAM role for the Redshift namespace with S3 access. The authors select “Any S3 bucket” for simplicity but note that production should restrict to specific buckets. Least privilege.
Configure the security group too. The Redshift workgroup needs inbound access on port 5439 (the default Redshift port), plus a self-referencing rule so the Glue job (using the same security group) can connect.
The security group setup is where people get stuck most. Glue job can’t connect to Redshift? Check the security groups first. Nine times out of ten, that’s the problem.
Step 3: Create Glue Data Connection
Before writing a Glue job that talks to Redshift, you need a connection object. It tells Glue how to reach your Redshift cluster.
You need:
- The JDBC URL from the Redshift workgroup details page
- VPC, subnet, and security group info from the Data access tab
- Username and password for authentication
Create a JDBC connection in the Glue console, fill in the details, name it ch8-ex1-redshift-connection.
I appreciate the authors walking through exactly where to find each piece of information in the console. When you’re new to AWS, hunting for the JDBC URL across different console pages is frustrating. They save you that pain.
Step 4: Create the Glue PySpark ETL Job
The core of the pipeline. The authors use Glue Studio’s visual editor. Nice touch for a certification book.
Three nodes in the visual ETL pipeline:
- Amazon S3 Source: points to the input bucket, format CSV
- Change Schema Transform: renames columns to replace spaces with underscores (Redshift doesn’t like spaces in column names)
- Redshift Target: writes to the public.sales table using the connection from the previous step
The column renaming is a real-world detail. CSV files from business teams always have messy column names. “opportunity stage” becomes “opportunity_stage”, “weighted revenue” becomes “weighted_revenue”. Small thing, breaks everything if you skip it.
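The renaming itself is just a string transform. A pure-Python sketch of what the Change Schema node does (in the actual Glue job this mapping is configured visually; the function name here is mine):

```python
def normalize_columns(columns):
    """Make CSV headers Redshift-friendly: lowercase, underscores for spaces."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

raw = ["Date", "Salesperson Name", "opportunity stage", "weighted revenue"]
print(normalize_columns(raw))
# ['date', 'salesperson_name', 'opportunity_stage', 'weighted_revenue']
```

In PySpark you would feed the same mapping into a series of `withColumnRenamed` calls or a Glue ApplyMapping transform.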
Configure the IAM role in Job details, hit Run. Job goes through Waiting, Running, Succeeded.
Verify in Redshift Query Editor v2 with a SELECT on the sales table.
The visual editor is good for simple ETL like this. For anything complex with branching logic, error handling, or custom transformations, write PySpark code directly. The visual editor starts fighting you past a certain complexity level.
Step 5: Set Up QuickSight
Most involved part of this pipeline, surprisingly. Not hard, just a lot of steps.
Create an IAM execution role with VPC access permissions, Redshift access, and Redshift Serverless access. Sign up for QuickSight, configure VPC connection, attach the IAM role.
The VPC connection is key. QuickSight needs to be in the same VPC as Redshift. Specify VPC ID, execution role, subnet IDs, security group. Wait for AVAILABLE status.
I’ve seen teams skip the VPC setup and wonder why QuickSight can’t see their Redshift data. It’s always the network. Always.
Step 6: Create the Visualization
Create a dataset in QuickSight pointing to your Redshift table. “Redshift - Manual connect” option, validate, select public.sales. Import into SPICE for faster analytics. Create an area line chart showing forecasted monthly revenue by region and segment.
Publish as “sales-data-forecast” dashboard. Working end-to-end batch pipeline. CSV files in S3, transformed by Glue, warehoused in Redshift, visualized in QuickSight.
Best Practices
Solid optimization tips from the authors:
Security:
- Secrets Manager for Redshift credentials instead of hardcoding
- Restrict security groups to specific clusters
- Least privilege for IAM policies
Performance:
- Partition S3 input data
- SPICE in QuickSight for caching
Cost:
- Autoscaling in Glue jobs
- Base and max RPUs for Redshift Serverless
All things I’d tell a junior engineer on day one.
Real-Time Streaming Pipeline
The fun part. More complex, more services, closer to real production.
The Use Case
Product inventory data streams continuously. Process in real time, land in an Iceberg data lake on S3, queryable via Athena.
The Architecture
- Kinesis Data Generator (KDG) generates sample streaming data
- Kinesis Data Streams buffers the records
- EMR Serverless with Spark Structured Streaming consumes and processes
- Amazon S3 stores output in Apache Iceberg format
- Amazon Athena queries the Iceberg tables
Modern data lakehouse architecture. Kinesis for ingestion, Spark for processing, Iceberg for the table format, S3 for storage.
I like this architecture choice. Iceberg gives you ACID transactions, schema evolution, and time travel on top of S3. Big deal compared to just dumping Parquet files into a bucket and hoping for the best.
Step 1: Create Kinesis Data Stream
Straightforward. ch8-kinesis-stream with default settings. On-demand capacity mode, no shard count worries.
Wait for Active status. Done.
Step 2: Set Up Kinesis Data Generator
KDG is an open source tool for generating test data. It authenticates through Cognito; you deploy a small CloudFormation stack to create the Cognito user.
Log in, configure a template for product inventory data:
{
"ProductID": "{{random.number({\"min\": 1, \"max\": 10000})}}",
"ProductName": "{{random.weightedArrayElement(...)}}",
"ApplicableCountry": "...",
"ListPrice": "...",
"DiscountedPrice": "...",
"MarketingCampaign": "...",
"ShippingType": "...",
"ShippingMode": "...",
"ShippingCarrier": "...",
"LastUpdatedDate": "{{date.now(\"DD/MMM/YYYY\")}}",
"LastUpdatedTime": "{{date.now(\"HH:mm:ss\")}}"
}
Hit “Send data,” check Kinesis monitoring tab. KDG is a great tool for testing. Much faster than writing a Python script that pushes records in a loop.
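If you do want the Python-script route for some reason, a generator matching the template's shape is short. This is a stand-in sketch, not the KDG template engine; I only cover a subset of fields, and the product names are placeholders (the book's template elides them):

```python
import random
from datetime import datetime, timezone

PRODUCTS = ["Widget", "Gadget", "Gizmo"]  # placeholder names, not from the book

def make_record():
    """Generate one product-inventory record shaped like the KDG template."""
    now = datetime.now(timezone.utc)
    return {
        "ProductID": str(random.randint(1, 10000)),   # matches min/max in the template
        "ProductName": random.choice(PRODUCTS),        # weightedArrayElement stand-in
        "LastUpdatedDate": now.strftime("%d/%b/%Y"),   # DD/MMM/YYYY
        "LastUpdatedTime": now.strftime("%H:%M:%S"),   # HH:mm:ss
    }

print(make_record())
```

To actually push these into the stream you would call `put_record` on a boto3 Kinesis client, which is exactly the loop KDG saves you from writing.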
Step 3: Create S3 Buckets
Three this time:
- ch8-ex2-streaming-checkpoint: checkpoint info so the streaming job knows where it left off on restart
- ch8-ex2-iceberg-data-lake: actual data lake storage
- ch8-ex2-scripts: holds the PySpark streaming script
The checkpoint bucket matters. Without it, a failed streaming job either reprocesses everything from scratch or skips data. Checkpointing makes streaming reliable.
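The mechanics are worth internalizing. Here is a toy illustration of the checkpoint idea (this is NOT Spark's checkpoint format, just the concept: persist the last processed position so a restart resumes from it):

```python
import json
import tempfile
from pathlib import Path

class Checkpointer:
    """Persist the last processed sequence number to durable storage.

    Spark Structured Streaming does this (plus write-ahead logs) in the
    checkpoint location; S3 plays the role of the local file used here.
    """
    def __init__(self, path: Path):
        self.path = path

    def load(self) -> int:
        if self.path.exists():
            return json.loads(self.path.read_text())["seq"]
        return 0  # fresh start: begin from the first record

    def save(self, seq: int) -> None:
        self.path.write_text(json.dumps({"seq": seq}))

ckpt = Checkpointer(Path(tempfile.mkdtemp()) / "offsets.json")
records = list(range(1, 11))          # pretend stream positions 1..10
for seq in records[ckpt.load():]:      # skip anything already processed
    ckpt.save(seq)                     # record progress after each batch
print(ckpt.load())                     # a restarted job resumes from here
```

Without that `save` after each batch, a crash means either reprocessing from position 0 (duplicates) or guessing (data loss), which is the exact failure mode the checkpoint bucket prevents.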
Step 4: Create EMR Studio and EMR Serverless Application
EMR Serverless is the compute layer. Create an EMR Studio first (management layer), then a Serverless application.
Key settings:
- Name: ch8-ex2-spark-streaming-app
- Type: Spark
- Release: emr-7.1.0
- Architecture: x86_64
Critical detail: deploy inside a VPC. The authors note the VPC field is marked “optional” in the console, but for streaming applications it’s actually mandatory. Skip this and your streaming job fails to connect to Kinesis.
One of those AWS gotchas that costs hours of debugging. Good that the book calls it out.
Step 5: Create VPC Endpoints
EMR Serverless inside a VPC needs VPC endpoints to reach other services. Three:
- EMR Serverless endpoint: com.amazonaws.us-east-1.emr-serverless
- Kinesis Data Streams endpoint: com.amazonaws.us-east-1.kinesis-streams
- S3 endpoint: com.amazonaws.us-east-1.s3 (interface type)
Each in the same VPC with correct subnets and security groups.
Without these endpoints, the EMR Serverless application sits in a VPC with no way to reach Kinesis or S3. Like putting a server in a room with no doors.
Step 6: Submit the Spark Streaming Job
Create the target Iceberg table using Athena:
CREATE DATABASE icebergdb;
CREATE TABLE icebergdb.productcatalog (
ProductID STRING,
ProductName STRING,
ApplicableCountry STRING,
ListPrice DOUBLE,
DiscountedPrice DOUBLE,
MarketingCampaign STRING,
ShippingType STRING,
ShippingMode STRING,
ShippingCarrier STRING,
LastUpdatedDate TIMESTAMP,
LastUpdatedTime STRING
)
LOCATION 's3://ch8-ex2-iceberg-data-lake/icebergdb/productcatalog/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
);
The PySpark streaming script initializes a Spark session with Iceberg catalog configuration, defines the schema, reads from Kinesis, parses JSON, writes to the Iceberg table with checkpointing.
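The parse-and-cast step is the part most likely to bite you, so here it is in plain Python, independent of Spark. This is a sketch of the logic (in the actual script it's `from_json` plus column casts; the function name and sample values are mine):

```python
import json
from datetime import datetime

def parse_record(payload: bytes) -> dict:
    """Decode one Kinesis record and cast fields to the productcatalog types."""
    raw = json.loads(payload.decode("utf-8"))
    return {
        "ProductID": raw["ProductID"],                  # kept as STRING per the DDL
        "ListPrice": float(raw["ListPrice"]),           # STRING in JSON -> DOUBLE
        "DiscountedPrice": float(raw["DiscountedPrice"]),
        # KDG emits DD/MMM/YYYY; the table column is TIMESTAMP
        "LastUpdatedDate": datetime.strptime(raw["LastUpdatedDate"], "%d/%b/%Y"),
        "LastUpdatedTime": raw["LastUpdatedTime"],
    }

sample = json.dumps({
    "ProductID": "42", "ListPrice": "19.99", "DiscountedPrice": "14.99",
    "LastUpdatedDate": "01/Jan/2025", "LastUpdatedTime": "12:00:00",
}).encode()
print(parse_record(sample))
```

Note that every KDG field arrives as a string, so the casts are not optional; writing an unconverted "19.99" string into a DOUBLE Iceberg column is a schema mismatch.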
Key Spark session configurations:
- Iceberg Spark extensions enabled
- Glue Catalog as Iceberg catalog implementation
- S3 as warehouse location
- Snappy compression for Parquet
Submit via CLI:
aws emr-serverless start-job-run \
--application-id <emr-serverless-application-id> \
--execution-role-arn <iam-role-arn> \
--mode STREAMING \
--retry-policy '{"maxFailedAttemptsPerHour": 5}' \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://ch8-ex2-scripts/stream-processor.py",
"sparkSubmitParameters": "--conf spark.executor.cores=4 ..."
}
}'
--mode STREAMING tells EMR Serverless this is long-running, not one-shot. Retry policy adds resilience.
Check Spark UI for executors working. Navigate to S3 for the Iceberg table structure: data folder with Parquet files, metadata folder with Iceberg JSON.
Query with Athena to confirm. Product inventory data, streaming from KDG through Kinesis, processed by Spark on EMR Serverless, landed in an Iceberg table on S3. Complete real-time data lakehouse pipeline.
My Take on This Chapter
Best chapter in the book. Hands-on beats theory every time.
Things I’d do differently in production:
Batch pipeline:
- Step Functions for orchestration instead of just EventBridge. Retry logic, error handling, notifications.
- Secrets Manager for Redshift credentials from the start. Wouldn’t even show the manual password approach.
- Data quality checks between Glue and Redshift. Simple row count validation saves you from silent data loss.
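The row-count check from that last point costs almost nothing to write. A minimal sketch (function name mine; in practice the source count comes from the CSV and the loaded count from a SELECT COUNT(*) on the target table):

```python
def validate_row_counts(source_rows: int, loaded_rows: int, tolerance: int = 0) -> None:
    """Fail loudly if the warehouse is missing rows; silent loss is the enemy."""
    missing = source_rows - loaded_rows
    if missing > tolerance:
        raise ValueError(f"{missing} rows lost between Glue and Redshift")

# Hardcoded counts for illustration only.
validate_row_counts(source_rows=20000, loaded_rows=20000)  # passes silently
```

Wire a check like this into the end of the Glue job (or a Step Functions state) and a partial load becomes a loud failure instead of a quietly wrong dashboard.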
Streaming pipeline:
- Dead letter queue for records that fail parsing. Current script drops bad records silently.
- CloudWatch alarms on Kinesis iterator age. Consumer falls behind? Know immediately.
- MSK instead of Kinesis if you have Kafka expertise. More control over partitioning and consumer groups.
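The dead-letter idea from the streaming list is a few lines of structure, not a big feature. A hedged sketch (names mine; in production `dead_letters` would be an SQS queue or an S3 error prefix, not a list):

```python
import json

def process_batch(payloads, dead_letters):
    """Parse a batch of raw records; route failures to a dead-letter sink."""
    good = []
    for p in payloads:
        try:
            good.append(json.loads(p))
        except json.JSONDecodeError as err:
            # Instead of dropping the record, keep it with its error for replay.
            dead_letters.append({"payload": p, "error": str(err)})
    return good

dlq = []
ok = process_batch(['{"ProductID": "1"}', "not-json"], dlq)
print(len(ok), len(dlq))  # one good record, one dead-lettered
```

The payoff comes later: when a producer ships a malformed field, you can inspect and replay the dead-lettered records instead of discovering a gap in the table weeks afterward.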
Both:
- Infrastructure as Code. All of this should be in Terraform or CloudFormation. Console steps are fine for learning. Never build production infrastructure by clicking through consoles.
The architectures are sound though. Service choices are practical. Step-by-step format means you end up with working pipelines.
If you’re preparing for the DEA-C01, build both in your own AWS account. Reading about services is one thing. Watching a Glue job go from Waiting to Running to Succeeded, or seeing data appear in your Iceberg table in real time – that’s what makes the concepts stick.
Just remember to clean up. Redshift Serverless and EMR Serverless both charge for compute time. Don’t leave them running overnight.