Hands-On: Building Batch and Streaming Data Pipelines on AWS
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O'Reilly Media
ISBN: 978-1-098-17007-3
This is the chapter I was waiting for. After seven chapters of theory, services, security, and governance, Chapter 8 finally says: “OK, build something.” Two complete pipelines, end to end, with real code and real AWS services.
If you learn by doing, this chapter alone is worth the price of the book.
What Is a Data Processing Pipeline?
Quick refresher before the hands-on work. A data processing pipeline is a sequence of steps that takes raw data and transforms it into something useful for analytics.
Typical reasons:
- Cleansing data and improving quality
- Aggregating with business rules
- Formatting for time series or ML
- Building data models for BI reporting
- Preparing data for downstream systems
High-level architecture always looks the same: data sources, ingestion layer, processing layer, consumption layer. The details change depending on batch or real-time.
Chapter 8 builds both.
Batch Processing Pipeline
Batch processing means you collect a bunch of records and process them all at once. Schedule or on demand.
The Use Case
A sales team uploads CSV files to S3 at the end of every month. Over 20,000 sales opportunity records with fields like date, salesperson name, market segment, and forecasted monthly revenue. Goal: transform this data for BI reporting.
Simple, realistic, common. The kind of pipeline most data engineers build first.
The Architecture
- Input CSV files land in an S3 raw input bucket
- AWS Glue PySpark ETL job reads from S3, transforms the data
- Transformed data gets loaded into Amazon Redshift Serverless
- Amazon QuickSight connects to Redshift and creates visualizations
EventBridge scheduler triggers it. Every month (or whatever schedule you set), the pipeline runs automatically.
Clean, straightforward architecture. S3 as landing zone, Glue for transformation, Redshift for warehousing, QuickSight for dashboards. All managed services. No servers to maintain.
Solid starter architecture. For a small to medium dataset, works perfectly. For production I’d add error handling, maybe Step Functions for orchestration, and definitely monitoring. For learning the concepts though, exactly right.
Step 1: Create S3 Buckets
Create an S3 bucket for input data. The authors name it ch8-ex1-input-data and remind you S3 bucket names must be globally unique. Add random characters at the end.
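The "add random characters" advice is easy to automate. A minimal sketch (the base name comes from the book; the suffix logic and function name are mine):

```python
import re
import uuid

def unique_bucket_name(base: str) -> str:
    """Append a random suffix so the bucket name is globally unique.

    S3 naming rules: 3-63 chars, lowercase letters, digits, and hyphens.
    """
    suffix = uuid.uuid4().hex[:8]  # 8 random hex characters
    name = f"{base}-{suffix}"
    # Sanity-check against the S3 naming rules before trying to create it.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name

print(unique_bucket_name("ch8-ex1-input-data"))
```

Eight hex characters give you roughly four billion possibilities, which is plenty to avoid collisions with other AWS accounts.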
Upload the sales CSV. Done. Simple, as it should be.
Step 2: Create Redshift Serverless Cluster
Set up the destination. Redshift Serverless is the right call for a learning exercise. No cluster sizing decisions, no node type selection.
Key settings:
- Workgroup name: ch8-ex1-redshift-serverless
- Base capacity: 8 RPUs (the minimum)
- Namespace: salesdata
- Custom admin credentials with a password
Important detail: create an IAM role for the Redshift namespace with S3 access. The authors select “Any S3 bucket” for simplicity but note that production should restrict to specific buckets. Least privilege.
Configure the security group too. The Redshift workgroup needs inbound access on port 5439 (the default Redshift port), plus a self-referencing rule so the Glue job (using the same security group) can connect.
The security group setup is where people get stuck most. Glue job can’t connect to Redshift? Check the security groups first. Nine times out of ten, that’s the problem.
Step 3: Create Glue Data Connection
Before writing a Glue job that talks to Redshift, you need a connection object. It tells Glue how to reach your Redshift cluster.
You need:
- The JDBC URL from the Redshift workgroup details page
- VPC, subnet, and security group info from the Data access tab
- Username and password for authentication
Create a JDBC connection in the Glue console, fill in the details, name it ch8-ex1-redshift-connection.
I appreciate the authors walking through exactly where to find each piece of information in the console. When you’re new to AWS, hunting for the JDBC URL across different console pages is frustrating. They save you that pain.
Step 4: Create the Glue PySpark ETL Job
The core of the pipeline. The authors use Glue Studio’s visual editor. Nice touch for a certification book.
Three nodes in the visual ETL pipeline:
- Amazon S3 Source: points to the input bucket, format CSV
- Change Schema Transform: renames columns to replace spaces with underscores (Redshift doesn’t like spaces in column names)
- Redshift Target: writes to the public.sales table using the connection from the previous step
The column renaming is a real-world detail. CSV files from business teams always have messy column names. “opportunity stage” becomes “opportunity_stage”, “weighted revenue” becomes “weighted_revenue”. Small thing, breaks everything if you skip it.
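The renaming itself is just a string transform. A pure-Python sketch of what the Change Schema node does (in the actual Glue job this mapping is configured visually; the function name here is mine):

```python
def normalize_columns(columns):
    """Make CSV headers Redshift-friendly: lowercase, underscores for spaces."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

raw = ["Date", "Salesperson Name", "opportunity stage", "weighted revenue"]
print(normalize_columns(raw))
# ['date', 'salesperson_name', 'opportunity_stage', 'weighted_revenue']
```

In PySpark you would feed the same mapping into a series of `withColumnRenamed` calls or a Glue ApplyMapping transform.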
Configure the IAM role in Job details, hit Run. Job goes through Waiting, Running, Succeeded.
Verify in Redshift Query Editor v2 with a SELECT on the sales table.
The visual editor is good for simple ETL like this. For anything complex with branching logic, error handling, or custom transformations, write PySpark code directly. The visual editor starts fighting you past a certain complexity level.
Step 5: Set Up QuickSight
Most involved part of this pipeline, surprisingly. Not hard, just a lot of steps.
Create an IAM execution role with VPC access permissions, Redshift access, and Redshift Serverless access. Sign up for QuickSight, configure VPC connection, attach the IAM role.
The VPC connection is key. QuickSight needs to be in the same VPC as Redshift. Specify VPC ID, execution role, subnet IDs, security group. Wait for AVAILABLE status.
I’ve seen teams skip the VPC setup and wonder why QuickSight can’t see their Redshift data. It’s always the network. Always.
Step 6: Create the Visualization
Create a dataset in QuickSight pointing to your Redshift table. “Redshift - Manual connect” option, validate, select public.sales. Import into SPICE for faster analytics. Create an area line chart showing forecasted monthly revenue by region and segment.
Publish as “sales-data-forecast” dashboard. Working end-to-end batch pipeline. CSV files in S3, transformed by Glue, warehoused in Redshift, visualized in QuickSight.
Best Practices
Solid optimization tips from the authors:
Security:
- Secrets Manager for Redshift credentials instead of hardcoding
- Restrict security groups to specific clusters
- Least privilege for IAM policies
Performance:
- Partition S3 input data
- SPICE in QuickSight for caching
Cost:
- Autoscaling in Glue jobs
- Base and max RPUs for Redshift Serverless
All things I’d tell a junior engineer on day one.
Real-Time Streaming Pipeline
The fun part. More complex, more services, closer to real production.
The Use Case
Product inventory data streams continuously. Process in real time, land in an Iceberg data lake on S3, queryable via Athena.
The Architecture
- Kinesis Data Generator (KDG) generates sample streaming data
- Kinesis Data Streams buffers the records
- EMR Serverless with Spark Structured Streaming consumes and processes
- Amazon S3 stores output in Apache Iceberg format
- Amazon Athena queries the Iceberg tables
Modern data lakehouse architecture. Kinesis for ingestion, Spark for processing, Iceberg for the table format, S3 for storage.
I like this architecture choice. Iceberg gives you ACID transactions, schema evolution, and time travel on top of S3. Big deal compared to just dumping Parquet files into a bucket and hoping for the best.
Step 1: Create Kinesis Data Stream
Straightforward. ch8-kinesis-stream with default settings. On-demand capacity mode, no shard count worries.
Wait for Active status. Done.
Step 2: Set Up Kinesis Data Generator
KDG is an open source tool for generating test data. It authenticates through Cognito; you deploy a small CloudFormation stack to create the Cognito user.
Log in, configure a template for product inventory data:
{
"ProductID": "{{random.number({\"min\": 1, \"max\": 10000})}}",
"ProductName": "{{random.weightedArrayElement(...)}}",
"ApplicableCountry": "...",
"ListPrice": "...",
"DiscountedPrice": "...",
"MarketingCampaign": "...",
"ShippingType": "...",
"ShippingMode": "...",
"ShippingCarrier": "...",
"LastUpdatedDate": "{{date.now(\"DD/MMM/YYYY\")}}",
"LastUpdatedTime": "{{date.now(\"HH:mm:ss\")}}"
}
Hit “Send data,” check Kinesis monitoring tab. KDG is a great tool for testing. Much faster than writing a Python script that pushes records in a loop.
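If you do want the Python-script route for some reason, a generator matching the template's shape is short. This is a stand-in sketch, not the KDG template engine; I only cover a subset of fields, and the product names are placeholders (the book's template elides them):

```python
import random
from datetime import datetime, timezone

PRODUCTS = ["Widget", "Gadget", "Gizmo"]  # placeholder names, not from the book

def make_record():
    """Generate one product-inventory record shaped like the KDG template."""
    now = datetime.now(timezone.utc)
    return {
        "ProductID": str(random.randint(1, 10000)),   # matches min/max in the template
        "ProductName": random.choice(PRODUCTS),        # weightedArrayElement stand-in
        "LastUpdatedDate": now.strftime("%d/%b/%Y"),   # DD/MMM/YYYY
        "LastUpdatedTime": now.strftime("%H:%M:%S"),   # HH:mm:ss
    }

print(make_record())
```

To actually push these into the stream you would call `put_record` on a boto3 Kinesis client, which is exactly the loop KDG saves you from writing.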
Step 3: Create S3 Buckets
Three this time:
- ch8-ex2-streaming-checkpoint: checkpoint info so the streaming job knows where it left off on restart
- ch8-ex2-iceberg-data-lake: actual data lake storage
- ch8-ex2-scripts: holds the PySpark streaming script
The checkpoint bucket matters. Without it, a failed streaming job either reprocesses everything from scratch or skips data. Checkpointing makes streaming reliable.
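The mechanics are worth internalizing. Here is a toy illustration of the checkpoint idea (this is NOT Spark's checkpoint format, just the concept: persist the last processed position so a restart resumes from it):

```python
import json
import tempfile
from pathlib import Path

class Checkpointer:
    """Persist the last processed sequence number to durable storage.

    Spark Structured Streaming does this (plus write-ahead logs) in the
    checkpoint location; S3 plays the role of the local file used here.
    """
    def __init__(self, path: Path):
        self.path = path

    def load(self) -> int:
        if self.path.exists():
            return json.loads(self.path.read_text())["seq"]
        return 0  # fresh start: begin from the first record

    def save(self, seq: int) -> None:
        self.path.write_text(json.dumps({"seq": seq}))

ckpt = Checkpointer(Path(tempfile.mkdtemp()) / "offsets.json")
records = list(range(1, 11))          # pretend stream positions 1..10
for seq in records[ckpt.load():]:      # skip anything already processed
    ckpt.save(seq)                     # record progress after each batch
print(ckpt.load())                     # a restarted job resumes from here
```

Without that `save` after each batch, a crash means either reprocessing from position 0 (duplicates) or guessing (data loss), which is the exact failure mode the checkpoint bucket prevents.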
Step 4: Create EMR Studio and EMR Serverless Application
EMR Serverless is the compute layer. Create an EMR Studio first (management layer), then a Serverless application.
Key settings:
- Name: ch8-ex2-spark-streaming-app
- Type: Spark
- Release: emr-7.1.0
- Architecture: x86_64
Critical detail: deploy inside a VPC. The authors note the VPC field is marked “optional” in the console, but for streaming applications it’s actually mandatory. Skip this and your streaming job fails to connect to Kinesis.
One of those AWS gotchas that costs hours of debugging. Good that the book calls it out.
Step 5: Create VPC Endpoints
EMR Serverless inside a VPC needs VPC endpoints to reach other services. Three:
- EMR Serverless endpoint: com.amazonaws.us-east-1.emr-serverless
- Kinesis Data Streams endpoint: com.amazonaws.us-east-1.kinesis-streams
- S3 endpoint: com.amazonaws.us-east-1.s3 (interface type)
Each in the same VPC with correct subnets and security groups.
Without these endpoints, the EMR Serverless application sits in a VPC with no way to reach Kinesis or S3. Like putting a server in a room with no doors.
Step 6: Submit the Spark Streaming Job
Create the target Iceberg table using Athena:
CREATE DATABASE icebergdb;
CREATE TABLE icebergdb.productcatalog (
ProductID STRING,
ProductName STRING,
ApplicableCountry STRING,
ListPrice DOUBLE,
DiscountedPrice DOUBLE,
MarketingCampaign STRING,
ShippingType STRING,
ShippingMode STRING,
ShippingCarrier STRING,
LastUpdatedDate TIMESTAMP,
LastUpdatedTime STRING
)
LOCATION 's3://ch8-ex2-iceberg-data-lake/icebergdb/productcatalog/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
);
The PySpark streaming script initializes a Spark session with Iceberg catalog configuration, defines the schema, reads from Kinesis, parses JSON, writes to the Iceberg table with checkpointing.
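The parse-and-cast step is the part most likely to bite you, so here it is in plain Python, independent of Spark. This is a sketch of the logic (in the actual script it's `from_json` plus column casts; the function name and sample values are mine):

```python
import json
from datetime import datetime

def parse_record(payload: bytes) -> dict:
    """Decode one Kinesis record and cast fields to the productcatalog types."""
    raw = json.loads(payload.decode("utf-8"))
    return {
        "ProductID": raw["ProductID"],                  # kept as STRING per the DDL
        "ListPrice": float(raw["ListPrice"]),           # STRING in JSON -> DOUBLE
        "DiscountedPrice": float(raw["DiscountedPrice"]),
        # KDG emits DD/MMM/YYYY; the table column is TIMESTAMP
        "LastUpdatedDate": datetime.strptime(raw["LastUpdatedDate"], "%d/%b/%Y"),
        "LastUpdatedTime": raw["LastUpdatedTime"],
    }

sample = json.dumps({
    "ProductID": "42", "ListPrice": "19.99", "DiscountedPrice": "14.99",
    "LastUpdatedDate": "01/Jan/2025", "LastUpdatedTime": "12:00:00",
}).encode()
print(parse_record(sample))
```

Note that every KDG field arrives as a string, so the casts are not optional; writing an unconverted "19.99" string into a DOUBLE Iceberg column is a schema mismatch.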
Key Spark session configurations:
- Iceberg Spark extensions enabled
- Glue Catalog as Iceberg catalog implementation
- S3 as warehouse location
- Snappy compression for Parquet
Submit via CLI:
aws emr-serverless start-job-run \
--application-id <emr-serverless-application-id> \
--execution-role-arn <iam-role-arn> \
--mode STREAMING \
--retry-policy '{"maxFailedAttemptsPerHour": 5}' \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://ch8-ex2-scripts/stream-processor.py",
"sparkSubmitParameters": "--conf spark.executor.cores=4 ..."
}
}'
--mode STREAMING tells EMR Serverless this is long-running, not one-shot. Retry policy adds resilience.
Check Spark UI for executors working. Navigate to S3 for the Iceberg table structure: data folder with Parquet files, metadata folder with Iceberg JSON.
Query with Athena to confirm. Product inventory data, streaming from KDG through Kinesis, processed by Spark on EMR Serverless, landed in an Iceberg table on S3. Complete real-time data lakehouse pipeline.
My Take on This Chapter
Best chapter in the book. Hands-on beats theory every time.
Things I’d do differently in production:
Batch pipeline:
- Step Functions for orchestration instead of just EventBridge. Retry logic, error handling, notifications.
- Secrets Manager for Redshift credentials from the start. Wouldn’t even show the manual password approach.
- Data quality checks between Glue and Redshift. Simple row count validation saves you from silent data loss.
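The row-count check from that last point costs almost nothing to write. A minimal sketch (function name mine; in practice the source count comes from the CSV and the loaded count from a SELECT COUNT(*) on the target table):

```python
def validate_row_counts(source_rows: int, loaded_rows: int, tolerance: int = 0) -> None:
    """Fail loudly if the warehouse is missing rows; silent loss is the enemy."""
    missing = source_rows - loaded_rows
    if missing > tolerance:
        raise ValueError(f"{missing} rows lost between Glue and Redshift")

# Hardcoded counts for illustration only.
validate_row_counts(source_rows=20000, loaded_rows=20000)  # passes silently
```

Wire a check like this into the end of the Glue job (or a Step Functions state) and a partial load becomes a loud failure instead of a quietly wrong dashboard.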
Streaming pipeline:
- Dead letter queue for records that fail parsing. Current script drops bad records silently.
- CloudWatch alarms on Kinesis iterator age. Consumer falls behind? Know immediately.
- MSK instead of Kinesis if you have Kafka expertise. More control over partitioning and consumer groups.
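The dead-letter idea from the streaming list is a few lines of structure, not a big feature. A hedged sketch (names mine; in production `dead_letters` would be an SQS queue or an S3 error prefix, not a list):

```python
import json

def process_batch(payloads, dead_letters):
    """Parse a batch of raw records; route failures to a dead-letter sink."""
    good = []
    for p in payloads:
        try:
            good.append(json.loads(p))
        except json.JSONDecodeError as err:
            # Instead of dropping the record, keep it with its error for replay.
            dead_letters.append({"payload": p, "error": str(err)})
    return good

dlq = []
ok = process_batch(['{"ProductID": "1"}', "not-json"], dlq)
print(len(ok), len(dlq))  # one good record, one dead-lettered
```

The payoff comes later: when a producer ships a malformed field, you can inspect and replay the dead-lettered records instead of discovering a gap in the table weeks afterward.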
Both:
- Infrastructure as Code. All of this should be in Terraform or CloudFormation. Console steps are fine for learning. Never build production infrastructure by clicking through consoles.
The architectures are sound though. Service choices are practical. Step-by-step format means you end up with working pipelines.
If you’re preparing for the DEA-C01, build both in your own AWS account. Reading about services is one thing. Watching a Glue job go from Waiting to Running to Succeeded, or seeing data appear in your Iceberg table in real time – that’s what makes the concepts stick.
Just remember to clean up. Redshift Serverless and EMR Serverless both charge for compute time. Don’t leave them running overnight.