Data Transformation on AWS: Glue, EMR, Redshift, Flink, and Lambda Compared
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Second part of Chapter 4, covering data transformation. The first part was about ingestion. Now we look at what happens after the data lands. You need to clean it, reshape it, enrich it, and get it into a format that analysts and applications can actually use.
AWS gives you at least six services that can transform data. The trick is knowing which one to pick.
Batch vs Streaming Transformation
Two modes of transformation in AWS:
Batch transformation processes data in large chunks on a schedule. Hourly, daily, weekly. Collect data, then process it all at once. Apache Spark jobs or SQL queries handle this. On AWS, the main services for batch transformation are AWS Glue, Amazon EMR, and Amazon Redshift.
Streaming transformation processes data continuously as it arrives. Two main frameworks:
- Spark Structured Streaming treats the stream as an unbounded table that keeps growing. You write SQL-like operations against it. Familiar if you already know Spark batch processing. Available through AWS Glue and Amazon EMR.
- Apache Flink processes each event individually. Handles both stateless operations (filtering, routing, enrichment) and stateful operations (aggregation, windowing, pattern detection). Flink is better for low-latency use cases like fraud detection. Available through Amazon Managed Service for Apache Flink (MSF) and Amazon EMR.
The key distinction: Spark Streaming works in microbatches. Flink processes event by event. If you need the absolute lowest latency, Flink wins.
Data Transformation Using AWS Glue
AWS Glue is a serverless data transformation service. Runs both batch and streaming workloads. Under the hood, it uses Spark for batch and Spark Structured Streaming for streams. For lightweight work, you can also run plain Python shell jobs.
You create Glue jobs, which are scripts that read from sources, transform data, and write results to targets. Simple concept.
Glue Connectors
Connectors are prebuilt components that let Glue read from various sources: Amazon S3, RDS, Redshift, MongoDB, Snowflake, Salesforce, ServiceNow, and many others. Also connectors for applications like LinkedIn, Facebook Ads, and SAP.
The point: you don’t write custom code to connect to each source. Pick a connector, configure it, move on.
Glue Bookmarks
Bookmarks enable incremental processing. They track what data was already processed in previous runs. For S3 sources, bookmarks remember which files were processed. For JDBC sources, they track primary key ranges.
Your Glue job picks up where it left off instead of reprocessing everything from scratch. Essential for CDC workflows. Without bookmarks, you waste compute reprocessing data you already handled.
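Conceptually, a bookmark is just persisted state mapping a source to what has already been consumed. Glue manages this state internally; here is a minimal local sketch of the idea using a JSON file as the stored bookmark (the file name and helper names are my own, not Glue's):

```python
import json
from pathlib import Path

BOOKMARK_FILE = Path("bookmark.json")  # stand-in for Glue's internal bookmark state

def load_bookmark() -> set:
    """Return the set of already-processed file keys, empty on first run."""
    if BOOKMARK_FILE.exists():
        return set(json.loads(BOOKMARK_FILE.read_text()))
    return set()

def save_bookmark(processed: set) -> None:
    BOOKMARK_FILE.write_text(json.dumps(sorted(processed)))

def incremental_run(available_files: list) -> list:
    """Process only files not seen in previous runs, then advance the bookmark."""
    processed = load_bookmark()
    new_files = [f for f in available_files if f not in processed]
    # ... transform each file in new_files here ...
    save_bookmark(processed | set(new_files))
    return new_files

# First run processes everything; the second run picks up only the new file.
print(incremental_run(["s3://bucket/a.csv", "s3://bucket/b.csv"]))
print(incremental_run(["s3://bucket/a.csv", "s3://bucket/b.csv", "s3://bucket/c.csv"]))
```

The same pattern generalizes to JDBC sources by storing the highest primary key seen instead of a set of file names.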
Data Processing Units (DPUs)
DPUs are how Glue measures and bills compute. One DPU equals 4 vCPUs and 16 GB of memory. You pay per second based on how many DPUs your job uses.
Important for cost planning. More DPUs means faster jobs but higher bills. Find the right balance.
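As a back-of-the-envelope sketch, assuming the us-east-1 list price of $0.44 per DPU-hour and Glue's per-second billing with a one-minute minimum (verify against current pricing before relying on these numbers):

```python
def glue_job_cost(dpus: int, runtime_seconds: int,
                  price_per_dpu_hour: float = 0.44,  # assumed us-east-1 list price
                  min_billed_seconds: int = 60) -> float:
    """Estimate Glue job cost: DPU-hours times price, billed per second
    with a minimum billed duration."""
    billed = max(runtime_seconds, min_billed_seconds)
    dpu_hours = dpus * billed / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)

# 10 DPUs for 15 minutes: 10 * 900 / 3600 = 2.5 DPU-hours -> $1.10
print(glue_job_cost(10, 900))
```

Doubling DPUs only pays off if it more than halves runtime; otherwise the bill goes up for the same work.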
Worker Types
Glue workers come in different sizes:
- G.025X for low-volume streaming jobs (Glue version 3.0+ only)
- G.1X and G.2X for standard workloads, lightweight transforms, joins, and queries
- G.4X and G.8X for heavy transforms, complex aggregations, and demanding queries
With autoscaling enabled (Glue 3.0 and later), Glue scales workers up and down based on job needs. No manual resizing needed.
Glue Job Types
Three types:
Spark jobs run in a fully managed Apache Spark environment. Minimum 2 DPUs. Best for batch processing of large datasets.
Streaming ETL jobs use Spark Structured Streaming. They process data in configurable time windows (default 100 seconds). Support compression formats like GZIP, Snappy, and Bzip2 automatically. Sources include Kinesis Data Streams and Apache Kafka. Also minimum 2 DPUs.
Python shell jobs run plain Python scripts. Minimum DPU is just 1/16. Very cheap. Use these for lightweight ETL that doesn’t need Spark’s distributed processing.
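A Python shell job is essentially a plain script, so stdlib tools are often enough. Here is a sketch of the kind of lightweight filter-and-convert work that fits the 1/16-DPU tier (the column names are invented for illustration):

```python
import csv
import io
import json

def csv_to_jsonl(csv_text: str, min_amount: float) -> str:
    """Filter CSV rows by amount and emit JSON Lines -- the kind of
    lightweight ETL a Python shell job handles without Spark."""
    out = io.StringIO()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["amount"]) >= min_amount:
            out.write(json.dumps(row) + "\n")
    return out.getvalue()

raw = "order_id,amount\n1,9.50\n2,120.00\n3,45.25\n"
print(csv_to_jsonl(raw, 40.0))
```

If the same script ever needs to shuffle gigabytes across workers, that is the signal to move up to a Spark job.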
Job Authoring Options
Three ways to write Glue jobs:
Glue Studio is a visual drag-and-drop interface. Build ETL workflows graphically and it generates Spark code for you. Good for quick pipelines and people who prefer visual tools.
Glue Studio Notebooks give you a Jupyter notebook experience. Write and test PySpark code interactively, then convert the notebook to a Glue job with one click.
Interactive Sessions let you test code against live data in real time. Good for debugging complex transformations.
Best Practices for AWS Glue
The book lists several practical tips:
- Pick the right worker type. G.4X or G.8X for compute-heavy jobs, G.2X for standard workloads.
- Partition your data. Partitioning reduces how much data Glue needs to scan. Partition on columns you frequently filter by.
- Use columnar formats. Parquet and ORC are much more efficient than CSV or JSON for analytical workloads.
- Use Data Catalog partitions. Proper partitioning in the catalog improves query performance through partition pruning.
- Enable bookmarks for incremental processing. Don’t reprocess what you already processed.
- Monitor your jobs. Use Glue job metrics and the Spark UI to find bottlenecks and data skew.
- Use autoscaling. Let Glue scale workers based on actual needs.
- Avoid tiny files. Too many small files hurt performance. Also avoid files larger than 1 GB.
- Use Flex execution class for non-urgent jobs to save money. Flex uses spare capacity at a discount.
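Partitioning on frequently filtered columns shows up on S3 as Hive-style `key=value` prefixes. A quick sketch of how such a path is laid out (bucket and table names are made up):

```python
from datetime import date

def partition_prefix(base: str, table: str, d: date, region: str) -> str:
    """Build a Hive-style partition path so query engines can prune by
    year/month/day and region instead of scanning the whole table."""
    return (f"{base}/{table}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/region={region}/")

print(partition_prefix("s3://my-lake", "sales", date(2024, 3, 7), "eu-west-1"))
# → s3://my-lake/sales/year=2024/month=03/day=07/region=eu-west-1/
```

A query filtering on `region = 'eu-west-1'` and a single day then reads only that prefix, which is what partition pruning means in practice.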
The most common mistake I see with Glue is people using it with CSV files and no partitioning. They get slow jobs and high bills, then blame Glue. Convert to Parquet, partition properly, enable bookmarks. Most Glue performance problems disappear after that.
Data Transformation Using Amazon EMR
Amazon EMR (Elastic MapReduce) is for when Glue isn’t enough. Maybe you need Hadoop, HBase, Hive, Presto, Trino, or Flink alongside Spark. Maybe you need deep customization of the runtime environment. EMR gives you a full big data platform.
With EMR you can do batch processing (Spark, Hive, Presto, Trino), stream processing (Spark Structured Streaming, Flink), or interactive analytics.
Storage Options
Two choices for persistent storage on EMR:
HDFS is the traditional Hadoop distributed filesystem. High throughput, fault tolerant. Adds operational complexity and cost though. Data disappears when the cluster terminates unless you persist it elsewhere.
Amazon S3 is the recommended option. Cheaper, more durable, data persists after the cluster shuts down. You can spin up new clusters and point them at the same S3 data.
The book recommends S3 over HDFS. I agree. Unless you have a very specific HDFS requirement, use S3.
Deployment Options
EMR gives you three deployment models:
EMR on EC2 is the classic approach. Maximum control. Pick EC2 instance types, run bootstrap scripts, use multiple frameworks in one cluster. Reserved Instances or Spot Instances for cost savings. You manage the cluster yourself though.
EMR Serverless is the hands-off option. No cluster management. Currently supports Spark and Hive. Pick x86 or Graviton instances, submit jobs, pay only for runtime. Best for intermittent or unpredictable workloads.
EMR on EKS runs EMR on Kubernetes. Good if your team already uses EKS. Multi-AZ resiliency, and you can run multiple Spark versions on the same cluster. A middle ground between full control and fully managed.
Instance Types
- x86-based (M5, R5, C5) are the standard general-purpose, memory-optimized, and compute-optimized families.
- Graviton instances use ARM processors and can give up to 30% better price-performance for Spark workloads compared to x86.
- Spot Instances offer up to 90% discount but can be interrupted. Use them for task nodes, not core nodes.
Best Practices for Amazon EMR
- Use S3 for storage, not HDFS.
- Compress and convert data to Parquet or ORC.
- Partition and bucket your S3 data.
- Choose the right instance types for your workload (compute-optimized for CPU-heavy, memory-optimized for memory-heavy).
- Use Spot Instances for task nodes, on-demand for core nodes.
- Enable managed scaling.
- Rightsize your containers and resources.
- Use the latest EMR version for performance improvements.
- Use EMR Serverless for sporadic workloads.
AWS Glue vs Amazon EMR
This question comes up often. Here’s the comparison:
| Criteria | AWS Glue | EMR Serverless | EMR on EC2 | EMR on EKS |
|---|---|---|---|---|
| Serverless | Yes | Yes | No | No (unless Fargate) |
| Frameworks | Spark, Python | Spark, Hive | Spark, Hive, Trino, HBase, Flink, and more | Spark |
| Job startup time | ~10 seconds | ~2 minutes (seconds if pre-initialized) | ~5 minutes for cluster creation | ~10 seconds if instances available |
| Scaling | Fully managed | Fully managed | Autoscaling with custom policies | EKS autoscaler or Karpenter |
| Interactive analytics | Yes (Studio, notebooks, interactive sessions) | Yes (EMR Studio) | Yes (Studio, JupyterHub, Zeppelin, Hue) | Yes (EMR Studio) |
| Multi-AZ | No | Yes | No | Yes |
| Cost optimization | Flex execution | N/A | Spot, Reserved, Graviton, managed scaling | Spot, Reserved, Graviton |
| Management overhead | Low | Low | Medium | Medium (needs Kubernetes expertise) |
Start with Glue. It’s simpler, starts faster, and covers most ETL use cases. Move to EMR when you need frameworks beyond Spark and Python, need deep customization of the runtime, or need to run Flink. EMR on EC2 gives maximum control but also maximum overhead. EMR Serverless is a good middle ground if you just need Spark or Hive without cluster management.
The 10-second startup time of Glue is a real advantage for pipelines with many short jobs. EMR on EC2 takes minutes to spin up a cluster unless you keep one running all the time.
SQL-Based Transformation Using Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse. If your team’s strength is SQL and your data fits a relational model, Redshift is a strong choice for transformation.
Compute
Two modes:
Provisioned uses RA3 nodes in four sizes (large, xlplus, 4xlarge, 16xlarge). Pay per second of usage. Pause when not in use. Reserved Instances give 30%-60% savings for 1- or 3-year commitments.
Serverless measures capacity in Redshift Processing Units (RPUs). One RPU provides 16 GB of memory. Only pay when queries or loads are running. No charges for idle time. Easier to manage than provisioned.
Storage
Redshift uses Redshift Managed Storage (RMS), which combines local SSDs for hot data caching with S3 for persistence. The compute nodes can also query data directly from S3 in open formats like Iceberg, Hudi, Delta Lake, Parquet, CSV, and JSON.
Multi-Cluster Architectures
Instead of one giant Redshift cluster, you can build distributed architectures:
Hub and spoke puts each workload (ETL, reporting, data science) on its own cluster. Isolates them so an ETL job doesn’t slow down dashboards.
Data mesh gives each business unit their own cluster. Finance, HR, and operations each control their own data assets and decide what to share.
Both architectures use data sharing, a Redshift feature that lets clusters read each other’s data in place without copying it. Shares are transactionally consistent, and you can even write to shared tables from multiple endpoints.
Materialized Views
Materialized views store precomputed query results. If your dashboard query joins five tables with aggregations, create a materialized view. The dashboard reads from the precomputed result instead of running the full query every time.
Key features:
- Automatic query rewriting. Create the materialized view and Redshift automatically uses it for matching queries. No need to rewrite your existing SQL.
- Incremental refresh. Redshift updates only the changed data, not the whole view.
- Automatic refresh. Redshift detects changes in base tables and refreshes the view during low-load periods.
Example:
```sql
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
    date_trunc('day', sale_timestamp) AS sale_date,
    product_category,
    region,
    COUNT(*)    AS transaction_count,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_sale_amount
FROM sales_transactions
GROUP BY 1, 2, 3;
```
Stored Procedures
Stored procedures in Redshift use PL/pgSQL (PostgreSQL procedural language). They encapsulate multi-step transformation logic: load staging data, merge into fact tables, clean up.
A typical pattern:
- COPY data from S3 into a staging table.
- MERGE staging data into dimension or fact tables.
- Drop the staging table.
Stored procedures also support delegated access control. You can let users run a procedure without giving them direct access to the underlying tables.
Amazon Managed Service for Apache Flink
Amazon MSF is a fully managed Flink service for real-time stream processing. It reads from Kinesis Data Streams or Amazon MSK and performs transformations, aggregations, windowing, and stateful computations.
Key facts:
- Lowest latency and highest throughput for streaming transformations among managed AWS services.
- Can use enhanced fan-out with Kinesis for dedicated read throughput.
- Output goes to S3, Kinesis Data Streams, or MSK for delivery to other targets like Redshift.
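To make "stateful windowing" concrete, here is a toy batch simulation of a tumbling-window aggregation in plain Python. Real Flink computes this incrementally over the stream with event-time watermarks; this only illustrates the grouping logic:

```python
from collections import defaultdict

def tumbling_window_sums(events, window_seconds):
    """Group (event_time_seconds, amount) pairs into fixed, non-overlapping
    windows and sum amounts per window -- the kind of stateful aggregation
    Flink performs continuously on a stream."""
    windows = defaultdict(float)
    for ts, amount in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += amount
    return dict(sorted(windows.items()))

events = [(1, 10.0), (4, 5.0), (61, 2.5), (119, 7.5), (120, 1.0)]
print(tumbling_window_sums(events, 60))
# → {0: 15.0, 60: 10.0, 120: 1.0}
```

The per-window running totals are exactly the state Flink keeps (and checkpoints) between events, which is why stateful streaming needs more machinery than a stateless filter.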
Best Practices for MSF
- Costs are based on Kinesis Processing Units (KPUs). Monitor for overprovisioning.
- Start with 1 KPU per 1 MB/s throughput and adjust from there.
- Enable autoscaling.
- For I/O-bound workloads, increase parallelism per KPU to run more tasks per unit.
- Use higher-level APIs, eliminate data skew, and use async I/O.
- Ask yourself if you actually need Flink. For stateless, latency-tolerant workloads, Lambda might be enough.
That last point matters. Flink is powerful but has a learning curve. Don’t use it just because it sounds impressive. If Lambda or Firehose can handle your use case, use those instead.
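The 1 KPU per 1 MB/s rule of thumb from the list above turns into a quick capacity estimate. The extra KPU per application for orchestration is my recollection of MSF billing, so treat it as an assumption to verify against current pricing docs:

```python
import math

def estimate_kpus(throughput_mb_per_s: float, overhead_kpus: int = 1) -> int:
    """Starting-point sizing: ~1 KPU per 1 MB/s of stream throughput,
    plus an assumed per-application orchestration overhead."""
    return math.ceil(throughput_mb_per_s) + overhead_kpus

print(estimate_kpus(3.2))  # → 5  (4 KPUs for throughput + 1 overhead)
```

Start there, watch the utilization metrics, and let autoscaling correct the estimate rather than overprovisioning up front.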
Amazon Data Firehose for Transformation
Firehose is primarily a delivery service. Moves streaming data to S3, OpenSearch, and Redshift. It can do lightweight transformations along the way:
- Convert JSON to Parquet format
- Handle compression and decompression
- Add delimiters
- Batch records for optimal delivery
- Create dynamic partitions in S3
For slightly more complex transformations, Firehose can invoke Lambda functions. For example: convert CSV to JSON, then to Parquet.
Firehose isn’t a full transformation engine. More like a delivery truck that can do some light processing during transit.
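The contract for a Firehose transformation Lambda is: receive base64-encoded records, and return each one with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and re-encoded `data`. A sketch of the CSV-to-JSON example (the two-column schema is an assumption for illustration):

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation Lambda: decode each base64 record, convert
    a CSV line to a JSON line, and re-encode. Every output record must
    echo the incoming recordId and set a result status."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        order_id, amount = line.split(",")  # assumed two-column CSV schema
        payload = json.dumps({"order_id": order_id, "amount": float(amount)}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

# Local smoke test with a fake Firehose event
event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b"42,19.99").decode()}]}
print(lambda_handler(event, None))
```

Note the trailing newline on each payload: without a delimiter, Firehose concatenates records and you end up with unparseable run-on JSON in S3.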
AWS Lambda for Transformation
Lambda handles simple, event-driven transformations that run for less than 15 minutes and don’t need state between invocations. Use cases:
- Data format conversions
- Basic filtering
- Small-scale aggregations
Serverless, scales automatically, pay per invocation. Cheapest option for lightweight, stateless transformations. Limits: 15-minute timeout, no state management, limited windowing.
Choosing the Right Streaming Transformation Service
Here’s how the book breaks down the streaming options:
| Criteria | Firehose + Lambda | Spark Streaming (Glue / EMR) | Flink (MSF / EMR) |
|---|---|---|---|
| Transformation type | Simple stateless, limited windowing (15 min) | Stateless and stateful | Rich stateless and stateful |
| Schema evolution | Limited | Yes | No |
| Schema registry | No | Yes | Yes |
| Low-latency needs | Flink is faster | Flink is faster | Lowest latency |
| High-throughput needs | Limited | Flink often better | Optimized for high throughput |
| Exactly-once processing | No | With extra configuration | Native support |
| Out-of-order events | Limited | Limited | Handles efficiently |
| Ease of use for Spark users | Familiar Python/Java | Familiar Spark APIs | Steeper learning curve |
The practical decision tree:
- Simple filtering, format conversion, delivery to S3: Firehose + Lambda.
- Need Spark ecosystem, schema evolution, microbatch is OK: Glue Streaming or EMR with Spark Structured Streaming.
- Need lowest latency, exactly-once guarantees, complex stateful processing: Amazon MSF (Flink).
- Need full control over Flink runtime: Flink on EMR.
AWS Glue and Amazon MSF are fully managed. They handle infrastructure, scaling, and maintenance. EMR gives more control but requires more operational expertise.
Choosing the Right Batch Transformation Service
Four options for batch:
| Criteria | AWS Glue | Amazon EMR | AWS Lambda | Amazon Redshift |
|---|---|---|---|---|
| Best for | Spark-based ETL, format conversions, serverless simplicity | Complex large-scale processing with multiple frameworks | Lightweight event-driven transforms under 15 min | SQL-based analytics, data warehouse workloads |
| Complexity | Large-scale Spark or lightweight Python shell | Complex, large-scale with full framework capabilities | Lightweight, short-running functions | Complex SQL on massive structured datasets |
| Customization | Managed, limited customization | High, custom libraries, integrations | Limited beyond function code | Provisioned mode offers more control |
| Infrastructure | Fully managed, serverless | Managed service on EC2, or EMR Serverless | Fully managed, serverless | Fully managed, serverless or provisioned |
| Expertise needed | General ETL experience | Big data framework expertise | Serverless function experience | SQL skills |
My decision framework:
- SQL team, structured data, dashboard queries? Redshift.
- Spark ETL, want simplicity? Glue.
- Need Hadoop, Hive, Presto, Flink, or custom libraries? EMR.
- Small, event-driven, under 15 minutes? Lambda.
Most teams should start with Glue or Redshift depending on whether the work is Spark or SQL. Move to EMR only when you hit the limits of what Glue can do. Lambda is for the small stuff that doesn’t justify a full ETL service.
Summary
Data transformation on AWS comes down to knowing the tradeoffs:
- AWS Glue is the default for Spark-based ETL. Serverless, fast startup, low overhead. Start here.
- Amazon EMR is for when you need more frameworks, more control, or more customization than Glue provides.
- Amazon Redshift is the SQL transformation engine. Materialized views, stored procedures, petabyte-scale analytics.
- Amazon MSF (Flink) is for lowest-latency streaming with exactly-once guarantees and complex stateful processing.
- Amazon Data Firehose handles delivery with lightweight transformation. Not a full processing engine.
- AWS Lambda is for small, stateless, event-driven transformations under 15 minutes.
For the exam, match requirements to services. Pay attention to keywords like “serverless,” “lowest latency,” “exactly-once,” “SQL-based,” and “multiple frameworks.” Those keywords map directly to specific services.