Data Transformation on AWS: Glue, EMR, Redshift, Flink, and Lambda Compared


Previous: Data Ingestion Patterns

Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

This is the second part of Chapter 4, covering data transformation. The first part was about ingestion. Now we look at what happens after the data lands: you need to clean it, reshape it, enrich it, and get it into a format that analysts and applications can actually use.

AWS gives you at least six services that can transform data. The trick is knowing which one to pick.

Batch vs Streaming Transformation

Two modes of transformation in AWS:

Batch transformation processes data in large chunks on a schedule. Hourly, daily, weekly. Collect data, then process it all at once. Apache Spark jobs or SQL queries handle this. On AWS, the main services for batch transformation are AWS Glue, Amazon EMR, and Amazon Redshift.

Streaming transformation processes data continuously as it arrives. Two main frameworks:

  • Spark Structured Streaming treats the stream as an unbounded table that keeps growing. You write SQL-like operations against it. Familiar if you already know Spark batch processing. Available through AWS Glue and Amazon EMR.
  • Apache Flink processes each event individually. Handles both stateless operations (filtering, routing, enrichment) and stateful operations (aggregation, windowing, pattern detection). Flink is better for low-latency use cases like fraud detection. Available through Amazon Managed Service for Apache Flink (MSF) and Amazon EMR.

The key distinction: Spark Streaming works in microbatches. Flink processes event by event. If you need the absolute lowest latency, Flink wins.
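To make the distinction concrete, here is a toy pure-Python sketch (not the actual Spark or Flink APIs) contrasting micro-batch grouping with per-event handling:

```python
def micro_batch(events, batch_size=3):
    """Group events into fixed-size micro-batches (Spark-style):
    each event waits until its batch is cut before being processed."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

def per_event(events, handler):
    """Process each event the moment it arrives (Flink-style)."""
    return [handler(e) for e in events]

events = ["e1", "e2", "e3", "e4", "e5"]
print(micro_batch(events))           # [['e1', 'e2', 'e3'], ['e4', 'e5']]
print(per_event(events, str.upper))  # ['E1', 'E2', 'E3', 'E4', 'E5']
```

The last two events in the micro-batch version sit in a partial batch until the interval closes, which is exactly where Flink's per-event model gains its latency advantage.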

Data Transformation Using AWS Glue

AWS Glue is a serverless data transformation service. Runs both batch and streaming workloads. Under the hood, it uses Spark for batch and Spark Structured Streaming for streams. For lightweight work, you can also run plain Python shell jobs.

You create Glue jobs, which are scripts that read from sources, transform data, and write results to targets. Simple concept.

Glue Connectors

Connectors are prebuilt components that let Glue read from various sources: Amazon S3, RDS, Redshift, MongoDB, Snowflake, Salesforce, ServiceNow, and many others. Also connectors for applications like LinkedIn, Facebook Ads, and SAP.

The point: you don’t write custom code to connect to each source. Pick a connector, configure it, move on.

Glue Bookmarks

Bookmarks enable incremental processing. They track what data was already processed in previous runs. For S3 sources, bookmarks remember which files were processed. For JDBC sources, they track primary key ranges.

Your Glue job picks up where it left off instead of reprocessing everything from scratch. Essential for CDC workflows. Without bookmarks, you waste compute reprocessing data you already handled.
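A minimal sketch of what a bookmark buys you, in plain Python (in Glue the bookmark state is stored by the service between runs, not by your script):

```python
def incremental_run(all_files, bookmark):
    """Process only files not seen in previous runs, like a Glue bookmark.

    `bookmark` is a set of already-processed file names that persists
    across runs; hypothetical stand-in for Glue's managed state.
    """
    new_files = [f for f in all_files if f not in bookmark]
    for f in new_files:
        pass  # transform the file here
    bookmark.update(new_files)
    return new_files

bookmark = set()
print(incremental_run(["a.json", "b.json"], bookmark))            # both files
print(incremental_run(["a.json", "b.json", "c.json"], bookmark))  # only c.json
```

Without the bookmark set, every run would reprocess all three files.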

Data Processing Units (DPUs)

DPUs are how Glue measures and bills compute. One DPU equals 4 vCPUs and 16 GB of memory. You pay per second based on how many DPUs your job uses.

Important for cost planning. More DPUs means faster jobs but higher bills. Find the right balance.
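A quick back-of-the-envelope cost sketch (the per-DPU-hour rate below is an assumption; check current regional Glue pricing):

```python
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44):
    """Estimate a Glue job's cost: billed per second of DPU usage.
    The $0.44/DPU-hour rate is an assumption, not a quoted price."""
    return dpus * (runtime_seconds / 3600) * price_per_dpu_hour

# 10 DPUs for 15 minutes vs 20 DPUs for 8 minutes:
print(round(glue_job_cost(10, 15 * 60), 4))  # 1.1
print(round(glue_job_cost(20, 8 * 60), 4))   # 1.1733
```

Note the second run costs slightly more despite finishing sooner: doubling DPUs only pays off if the job scales close to linearly.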

Worker Types

Glue workers come in different sizes:

  • G.025X for low-volume streaming jobs (Glue version 3.0+ only)
  • G.1X and G.2X for standard workloads, lightweight transforms, joins, and queries
  • G.4X and G.8X for heavy transforms, complex aggregations, and demanding queries

AWS Glue autoscales workers up and down based on job needs. No manual resizing needed.

Glue Job Types

Three types:

Spark jobs run in a fully managed Apache Spark environment. Minimum 2 DPUs. Best for batch processing of large datasets.

Streaming ETL jobs use Spark Structured Streaming. They process data in configurable time windows (default 100 seconds). Support compression formats like GZIP, Snappy, and Bzip2 automatically. Sources include Kinesis Data Streams and Apache Kafka. Also minimum 2 DPUs.

Python shell jobs run plain Python scripts. Minimum DPU is just 1/16. Very cheap. Use these for lightweight ETL that doesn’t need Spark’s distributed processing.

Job Authoring Options

Three ways to write Glue jobs:

Glue Studio is a visual drag-and-drop interface. Build ETL workflows graphically, and it generates Spark code for you. Good for quick pipelines and people who prefer visual tools.

Glue Studio Notebooks give you a Jupyter notebook experience. Write and test PySpark code interactively, then convert the notebook to a Glue job with one click.

Interactive Sessions let you test code against live data in real time. Good for debugging complex transformations.

Best Practices for AWS Glue

The book lists several practical tips:

  • Pick the right worker type. G.4X or G.8X for compute-heavy jobs, G.2X for standard workloads.
  • Partition your data. Partitioning reduces how much data Glue needs to scan. Partition on columns you frequently filter by.
  • Use columnar formats. Parquet and ORC are much more efficient than CSV or JSON for analytical workloads.
  • Use Data Catalog partitions. Proper partitioning in the catalog improves query performance through partition pruning.
  • Enable bookmarks for incremental processing. Don’t reprocess what you already processed.
  • Monitor your jobs. Use Glue job metrics and the Spark UI to find bottlenecks and data skew.
  • Use autoscaling. Let Glue scale workers based on actual needs.
  • Avoid tiny files. Too many small files hurt performance. Also avoid files larger than 1 GB.
  • Use Flex execution class for non-urgent jobs to save money. Flex uses spare capacity at a discount.

The most common mistake I see with Glue is people using it with CSV files and no partitioning. They get slow jobs and high bills, then blame Glue. Convert to Parquet, partition properly, enable bookmarks. Most Glue performance problems disappear after that.
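The "avoid tiny files" advice can be sketched as a simple bin-packing pass; this is an illustrative simplification of what compaction jobs do (the 512 MB target is an arbitrary example):

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group many small files into bins of roughly target_mb each,
    a simplified sketch of compacting small files before a Spark job."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 1000 files of 1 MB each collapse into two ~500 MB groups:
print(len(plan_compaction([1] * 1000)))  # 2
```

A thousand file-open operations become two, which is why compaction (or writing fewer, larger Parquet files in the first place) cuts both runtime and DPU cost.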

Data Transformation Using Amazon EMR

Amazon EMR (Elastic MapReduce) is for when Glue isn’t enough. Maybe you need Hadoop, HBase, Hive, Presto, Trino, or Flink alongside Spark. Maybe you need deep customization of the runtime environment. EMR gives you a full big data platform.

With EMR you can do batch processing (Spark, Hive, Presto, Trino), stream processing (Spark Structured Streaming, Flink), or interactive analytics.

Storage Options

Two choices for persistent storage on EMR:

HDFS is the traditional Hadoop distributed filesystem. High throughput, fault tolerant. Adds operational complexity and cost though. Data disappears when the cluster terminates unless you persist it elsewhere.

Amazon S3 is the recommended option. Cheaper, more durable, data persists after the cluster shuts down. You can spin up new clusters and point them at the same S3 data.

The book recommends S3 over HDFS. I agree. Unless you have a very specific HDFS requirement, use S3.

Deployment Options

EMR gives you three deployment models:

EMR on EC2 is the classic approach. Maximum control. Pick EC2 instance types, run bootstrap scripts, use multiple frameworks in one cluster. Reserved Instances or Spot Instances for cost savings. You manage the cluster yourself though.

EMR Serverless is the hands-off option. No cluster management. Currently supports Spark and Hive. Pick x86 or Graviton instances, submit jobs, pay only for runtime. Best for intermittent or unpredictable workloads.

EMR on EKS runs EMR on Kubernetes. Good if your team already uses EKS. Multi-AZ resiliency, can run multiple Spark versions on the same cluster. Middle ground between full control and full managed.

Instance Types

  • x86-based (M5, R5, C5) are the standard general-purpose, memory-optimized, and compute-optimized families.
  • Graviton instances use ARM processors and can give up to 30% better price-performance for Spark workloads compared to x86.
  • Spot Instances offer up to 90% discount but can be interrupted. Use them for task nodes, not core nodes.

Best Practices for Amazon EMR

  • Use S3 for storage, not HDFS.
  • Compress and convert data to Parquet or ORC.
  • Partition and bucket your S3 data.
  • Choose the right instance types for your workload (compute-optimized for CPU-heavy, memory-optimized for memory-heavy).
  • Use Spot Instances for task nodes, on-demand for core nodes.
  • Enable managed scaling.
  • Rightsize your containers and resources.
  • Use the latest EMR version for performance improvements.
  • Use EMR Serverless for sporadic workloads.

AWS Glue vs Amazon EMR

This question comes up often. Here’s the comparison:

| Criteria | AWS Glue | EMR Serverless | EMR on EC2 | EMR on EKS |
| --- | --- | --- | --- | --- |
| Serverless | Yes | Yes | No | No (unless Fargate) |
| Frameworks | Spark, Python | Spark, Hive | Spark, Hive, Trino, HBase, Flink, and more | Spark |
| Job startup time | ~10 seconds | ~2 minutes (seconds if pre-initialized) | ~5 minutes for cluster creation | ~10 seconds if instances available |
| Scaling | Fully managed | Fully managed | Autoscaling with custom policies | EKS autoscaler or Karpenter |
| Interactive analytics | Yes (Studio, notebooks, interactive sessions) | Yes (EMR Studio) | Yes (Studio, JupyterHub, Zeppelin, Hue) | Yes (EMR Studio) |
| Multi-AZ | No | Yes | No | Yes |
| Cost optimization | Flex execution | N/A | Spot, Reserved, Graviton, managed scaling | Spot, Reserved, Graviton |
| Management overhead | Low | Low | Medium | Medium (needs Kubernetes expertise) |

Start with Glue. It’s simpler, starts faster, and covers most ETL use cases. Move to EMR when you need frameworks beyond Spark and Python, need deep customization of the runtime, or need to run Flink. EMR on EC2 gives maximum control but also maximum overhead. EMR Serverless is a good middle ground if you just need Spark or Hive without cluster management.

The 10-second startup time of Glue is a real advantage for pipelines with many short jobs. EMR on EC2 takes minutes to spin up a cluster unless you keep one running all the time.

SQL-Based Transformation Using Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse. If your team’s strength is SQL and your data fits a relational model, Redshift is a strong choice for transformation.

Compute

Two modes:

Provisioned uses RA3 nodes in four sizes (large, xlplus, 4xlarge, 16xlarge). Pay per second of usage. Pause when not in use. Reserved instances give 30%-60% savings for 1- or 3-year commitments.

Serverless measures capacity in Redshift Processing Units (RPUs). One RPU provides 16 GB of memory. Only pay when queries or loads are running. No charges for idle time. Easier to manage than provisioned.
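As a rough sketch of the serverless billing model (the per-RPU-hour rate below is an assumption; Redshift Serverless pricing varies by region):

```python
def redshift_serverless_cost(rpus, active_seconds, price_per_rpu_hour=0.375):
    """Estimate Redshift Serverless cost: you pay only while queries
    or loads run. The $0.375/RPU-hour rate is an assumed example."""
    return rpus * (active_seconds / 3600) * price_per_rpu_hour

# 32 RPUs active for 2 hours a day; the other 22 idle hours cost nothing:
print(redshift_serverless_cost(32, 2 * 3600))  # 24.0
```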

Storage

Redshift uses Redshift Managed Storage (RMS), which combines local SSDs for hot data caching with S3 for persistence. The compute nodes can also query data directly from S3 in open formats like Iceberg, Hudi, Delta Lake, Parquet, CSV, and JSON.

Multi-Cluster Architectures

Instead of one giant Redshift cluster, you can build distributed architectures:

Hub and spoke puts each workload (ETL, reporting, data science) on its own cluster. Isolates them so an ETL job doesn’t slow down dashboards.

Data mesh gives each business unit their own cluster. Finance, HR, and operations each control their own data assets and decide what to share.

Both architectures use data sharing, a Redshift feature that lets clusters read each other’s data in place without copying it. Data sharing is transactionally consistent, and you can even write to shared tables from multiple endpoints.

Materialized Views

Materialized views store precomputed query results. If your dashboard query joins five tables with aggregations, create a materialized view. The dashboard reads from the precomputed result instead of running the full query every time.

Key features:

  • Automatic query rewriting. Create the materialized view and Redshift automatically uses it for matching queries. No need to rewrite your existing SQL.
  • Incremental refresh. Redshift updates only the changed data, not the whole view.
  • Automatic refresh. Redshift detects changes in base tables and refreshes the view during low-load periods.

Example:

CREATE MATERIALIZED VIEW daily_sales_summary AS
    SELECT
        date_trunc('day', sale_timestamp) as sale_date,
        product_category,
        region,
        COUNT(*) as transaction_count,
        SUM(amount) as total_sales,
        AVG(amount) as avg_sale_amount
    FROM sales_transactions
    GROUP BY 1, 2, 3;

Stored Procedures

Stored procedures in Redshift use PL/pgSQL (PostgreSQL procedural language). They encapsulate multi-step transformation logic: load staging data, merge into fact tables, clean up.

A typical pattern:

  1. COPY data from S3 into a staging table.
  2. MERGE staging data into dimension or fact tables.
  3. Drop the staging table.
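The three-step pattern above might be sketched as a stored procedure like the following. This is an unvalidated sketch: the table names, S3 path, IAM role, and column list are all hypothetical.

```sql
CREATE OR REPLACE PROCEDURE load_daily_sales()
AS $$
BEGIN
    -- 1. Load new data from S3 into a staging table.
    CREATE TEMP TABLE stage_sales (LIKE sales_transactions);
    COPY stage_sales
    FROM 's3://my-bucket/incoming/sales/'  -- hypothetical path
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;

    -- 2. Merge staging rows into the fact table.
    MERGE INTO sales_transactions
    USING stage_sales s
        ON sales_transactions.sale_id = s.sale_id
    WHEN MATCHED THEN UPDATE SET amount = s.amount
    WHEN NOT MATCHED THEN INSERT VALUES (s.sale_id, s.sale_timestamp,
                                         s.product_category, s.region, s.amount);

    -- 3. The temp staging table is dropped automatically at session end.
END;
$$ LANGUAGE plpgsql;
```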

Stored procedures also support delegated access control. You can let users run a procedure without giving them direct access to the underlying tables.

Data Transformation Using Amazon Managed Service for Apache Flink

Amazon MSF is a fully managed Flink service for real-time stream processing. It reads from Kinesis Data Streams or Amazon MSK and performs transformations, aggregations, windowing, and stateful computations.

Key facts:

  • Lowest latency and highest throughput for streaming transformations among managed AWS services.
  • Can use enhanced fan-out with Kinesis for dedicated read throughput.
  • Output goes to S3, Kinesis Data Streams, or MSK for delivery to other targets like Redshift.

Best Practices for MSF

  • Costs are based on Kinesis Processing Units (KPUs). Monitor for overprovisioning.
  • Start with 1 KPU per 1 MB/s throughput and adjust from there.
  • Enable autoscaling.
  • For I/O-bound workloads, increase parallelism per KPU to run more tasks per unit.
  • Use higher-level APIs, eliminate data skew, and use async I/O.
  • Ask yourself if you actually need Flink. For stateless workloads that can tolerate higher latency, Lambda might be enough.

That last point matters. Flink is powerful but has a learning curve. Don’t use it just because it sounds impressive. If Lambda or Firehose can handle your use case, use those instead.
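The sizing rule of thumb above (1 KPU per 1 MB/s) is easy to encode; treat the result as a starting point to refine with real metrics and autoscaling:

```python
import math

def estimate_kpus(throughput_mb_per_s, mb_per_kpu=1.0):
    """Starting-point KPU estimate from the 1-KPU-per-1-MB/s rule of
    thumb. MSF bills per KPU, so this is also a rough cost proxy."""
    return max(1, math.ceil(throughput_mb_per_s / mb_per_kpu))

print(estimate_kpus(0.3))   # 1  (you always pay for at least one KPU)
print(estimate_kpus(12.5))  # 13
```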

Amazon Data Firehose for Transformation

Firehose is primarily a delivery service: it moves streaming data to destinations like S3, OpenSearch, and Redshift. It can do lightweight transformations along the way:

  • Convert JSON to Parquet format
  • Handle compression and decompression
  • Add delimiters
  • Batch records for optimal delivery
  • Create dynamic partitions in S3

For slightly more complex transformations, Firehose can invoke Lambda functions. For example: convert CSV to JSON, then to Parquet.
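A minimal transformation Lambda for Firehose might look like the sketch below, which converts two-column CSV records to JSON. The `id,amount` schema is a hypothetical example; the record/response shape follows Firehose's Lambda transformation contract (base64 `data`, `recordId`, and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`).

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: converts CSV records
    ("id,amount") to newline-delimited JSON."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8").strip()
        rec_id, amount = raw.split(",")
        payload = json.dumps({"id": rec_id, "amount": float(amount)}) + "\n"
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

# Simulate a Firehose invocation locally:
event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b"abc,19.99").decode()}]}
print(handler(event, None))
```

From here Firehose itself can convert the JSON output to Parquet before delivery, which is the CSV → JSON → Parquet chain the text describes.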

Firehose isn’t a full transformation engine. More like a delivery truck that can do some light processing during transit.

AWS Lambda for Transformation

Lambda handles simple, event-driven transformations that run for less than 15 minutes and don’t need state between invocations. Use cases:

  • Data format conversions
  • Basic filtering
  • Small-scale aggregations

Serverless, scales automatically, pay per invocation. Cheapest option for lightweight, stateless transformations. Limits: 15-minute timeout, no state management, limited windowing.

Choosing the Right Streaming Transformation Service

Here’s how the book breaks down the streaming options:

| Criteria | Firehose + Lambda | Spark Streaming (Glue / EMR) | Flink (MSF / EMR) |
| --- | --- | --- | --- |
| Transformation type | Simple stateless, limited windowing (15 min) | Stateless and stateful | Rich stateless and stateful |
| Schema evolution | Limited | Yes | No |
| Schema registry | No | Yes | Yes |
| Low-latency needs | Flink is faster | Flink is faster | Lowest latency |
| High-throughput needs | Limited | Flink often better | Optimized for high throughput |
| Exactly-once processing | No | With extra configuration | Native support |
| Out-of-order events | Limited | Limited | Handles efficiently |
| Ease of use for Spark users | Familiar Python/Java | Familiar Spark APIs | Steeper learning curve |

The practical decision tree:

  • Simple filtering, format conversion, delivery to S3: Firehose + Lambda.
  • Need Spark ecosystem, schema evolution, microbatch is OK: Glue Streaming or EMR with Spark Structured Streaming.
  • Need lowest latency, exactly-once guarantees, complex stateful processing: Amazon MSF (Flink).
  • Need full control over Flink runtime: Flink on EMR.

AWS Glue and Amazon MSF are fully managed. They handle infrastructure, scaling, and maintenance. EMR gives more control but requires more operational expertise.

Choosing the Right Batch Transformation Service

Four options for batch:

| Criteria | AWS Glue | Amazon EMR | AWS Lambda | Amazon Redshift |
| --- | --- | --- | --- | --- |
| Best for | Spark-based ETL, format conversions, serverless simplicity | Complex large-scale processing with multiple frameworks | Lightweight event-driven transforms under 15 min | SQL-based analytics, data warehouse workloads |
| Complexity | Large-scale Spark or lightweight Python shell | Complex, large-scale with full framework capabilities | Lightweight, short-running functions | Complex SQL on massive structured datasets |
| Customization | Managed, limited customization | High, custom libraries, integrations | Limited beyond function code | Provisioned mode offers more control |
| Infrastructure | Fully managed, serverless | Managed service on EC2, or EMR Serverless | Fully managed, serverless | Fully managed, serverless or provisioned |
| Expertise needed | General ETL experience | Big data framework expertise | Serverless function experience | SQL skills |

My decision framework:

  1. SQL team, structured data, dashboard queries? Redshift.
  2. Spark ETL, want simplicity? Glue.
  3. Need Hadoop, Hive, Presto, Flink, or custom libraries? EMR.
  4. Small, event-driven, under 15 minutes? Lambda.

Most teams should start with Glue or Redshift depending on whether the work is Spark or SQL. Move to EMR only when you hit the limits of what Glue can do. Lambda is for the small stuff that doesn’t justify a full ETL service.

Summary

Data transformation on AWS comes down to knowing the tradeoffs:

  • AWS Glue is the default for Spark-based ETL. Serverless, fast startup, low overhead. Start here.
  • Amazon EMR is for when you need more frameworks, more control, or more customization than Glue provides.
  • Amazon Redshift is the SQL transformation engine. Materialized views, stored procedures, petabyte-scale analytics.
  • Amazon MSF (Flink) is for lowest-latency streaming with exactly-once guarantees and complex stateful processing.
  • Amazon Data Firehose handles delivery with lightweight transformation. Not a full processing engine.
  • AWS Lambda is for small, stateless, event-driven transformations under 15 minutes.

For the exam, match requirements to services. Pay attention to keywords like “serverless,” “lowest latency,” “exactly-once,” “SQL-based,” and “multiple frameworks.” Those keywords map directly to specific services.

Next: Data Preparation and Pipeline Orchestration
