Data Preparation and Pipeline Orchestration: Step Functions, Airflow, and Glue Workflows
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Final part of Chapter 4. We move from raw transformation into two topics: data preparation for people who don’t write code, and orchestrating the whole pipeline end to end. Both matter for the exam. Both matter in real life.
Data Preparation for Nontechnical Personas
AWS Glue DataBrew is a low-code, visual tool for data cleaning and preparation. It targets data analysts, data scientists, and business users who need to work with data but don’t want to write PySpark or SQL.
DataBrew has over 250 built-in transformation functions. You work with it through a visual interface, building “recipes” of steps that clean and transform your data. The exam may ask about specific function names, so knowing what DataBrew can do matters.
Low-code data tools have their place. If a business analyst needs to clean a CSV before loading it into a dashboard, DataBrew is fine. If you’re building production ETL pipelines that run daily on terabytes of data, you probably want Glue jobs with proper code. Know when to use each.
Fill Missing Values
Missing data is everywhere. DataBrew gives you several ways to handle it:
- Fill missing values with a constant, the mean/median/mode of the column, or the previous/next nonmissing value
- Drop rows that contain missing values entirely
- Impute missing values using advanced methods like k-nearest neighbors (KNN) or decision tree imputation
For numerical columns, you can fill with average, mode, sum, or median. For text columns, you can use empty string, null, last valid value, most frequent value, or a custom value.
All through a visual menu. No code needed. For the exam, remember the function names and what each option does.
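To make the options concrete, here is what a median fill does behaviorally, sketched in plain Python. This models the DataBrew behavior for illustration; it is not DataBrew code, and the column values are made up:

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the non-missing values,
    mimicking a DataBrew-style median fill on a numeric column."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

ages = [34, None, 29, 41, None, 38]
print(fill_missing_with_median(ages))  # → [34, 36.0, 29, 41, 36.0, 38]
```

Swapping `median` for `mean` or a most-frequent-value count gives the other numeric fill strategies from the list above.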
Identify Duplicate Records
DataBrew has the FLAG_DUPLICATE_ROWS function. It adds a new column to your dataset that marks whether each row is an exact match of an earlier row. The first occurrence stays clean. Only the subsequent matches get flagged.
Simple but effective. In production, duplicate detection is usually more complex (fuzzy matching, record linkage), but for basic deduplication this works.
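The flagging logic is easy to model in plain Python. This is a behavioral illustration, not DataBrew's implementation, and the "Y"/"N" flag values are placeholders:

```python
def flag_duplicate_rows(rows):
    """Append a flag column: 'Y' for rows that exactly match an earlier
    row, 'N' for first occurrences -- modeling FLAG_DUPLICATE_ROWS."""
    seen = set()
    flagged = []
    for row in rows:
        key = tuple(row)
        flagged.append(list(row) + ["Y" if key in seen else "N"])
        seen.add(key)
    return flagged

rows = [["alice", 30], ["bob", 25], ["alice", 30]]
print(flag_duplicate_rows(rows))
# → [['alice', 30, 'N'], ['bob', 25, 'N'], ['alice', 30, 'Y']]
```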
Formatting Functions
DataBrew provides text manipulation functions in three categories:
Formatting: Case modifications like CAPITAL_CASE, LOWER_CASE, UPPER_CASE, SENTENCE_CASE. Plus quote additions and date formatting.
Extraction: Pull text between delimiters, from specific positions, using regex patterns, or from nested structures.
Replacement: REMOVE_COMBINED for removing character combinations, REPLACE_BETWEEN_DELIMITERS and REPLACE_BETWEEN_POSITIONS for targeted replacements, and REPLACE_TEXT for simple substitutions.
These are the kind of functions you use for standardizing messy data. Different date formats, inconsistent capitalization, extra whitespace. DataBrew handles all of it visually.
Integrating Data from Multiple Sources
DataBrew supports two ways to combine data from different sources:
- Union to stack datasets on top of each other (same columns, more rows)
- Join to combine datasets side by side based on matching keys
Same concepts as SQL UNION and JOIN, just in a visual interface.
Nesting and Unnesting Data Structures
Real-world data is messy. Nested JSON, arrays inside columns, maps with key-value pairs. DataBrew has functions to handle all of this.
Nesting functions consolidate columns into compact structures:
| Function | What It Does |
|---|---|
| NEST_TO_ARRAY | Combines multiple columns into a single array column |
| NEST_TO_MAP | Combines multiple columns into a JSON map column |
| NEST_TO_STRUCT | Combines multiple columns into a structured column |
Unnesting functions flatten complex structures back into tabular format:
| Function | What It Does |
|---|---|
| UNNEST_ARRAY | Extracts array values into individual rows |
| UNNEST_MAP | Extracts key-value pairs from a map into rows |
| UNNEST_STRUCT | Extracts struct fields into individual columns |
| UNNEST_STRUCT_N | Extracts a specific field from a struct |
The exam loves these. Know the difference between nesting (many columns into one) and unnesting (one complex column into many). Know which function works with arrays vs maps vs structs.
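The many-columns-into-one vs. one-column-into-many distinction can be sketched in plain Python using dicts as stand-in records. This illustrates the concept only — note that DataBrew's UNNEST_MAP emits key/value rows, while this sketch flattens back into columns to keep it short:

```python
def nest_to_map(record, columns, target):
    """Collapse the named columns into a single map (dict) column,
    in the spirit of NEST_TO_MAP."""
    record[target] = {c: record.pop(c) for c in columns}
    return record

def unnest_map(record, source):
    """Expand a map column back into top-level columns (the reverse
    direction; DataBrew's UNNEST_MAP produces rows instead)."""
    record.update(record.pop(source))
    return record

rec = {"id": 1, "city": "Austin", "state": "TX"}
nested = nest_to_map(rec, ["city", "state"], "location")
# → {'id': 1, 'location': {'city': 'Austin', 'state': 'TX'}}
flat = unnest_map(nested, "location")
# → {'id': 1, 'city': 'Austin', 'state': 'TX'}
```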
Protecting Sensitive Data
When your data has PII (personally identifiable information), DataBrew provides functions to protect it. Important for compliance with privacy regulations.
The main categories:
Hashing: CRYPTOGRAPHIC_HASH applies SHA-256 or MD5 to create a one-way transformation. You can’t reverse it. Good for pseudonymization.
Encryption/Decryption: ENCRYPT, DECRYPT, DETERMINISTIC_ENCRYPT, DETERMINISTIC_DECRYPT. Deterministic encryption always produces the same ciphertext for the same plaintext and key. Regular encryption doesn’t. Know the difference for the exam.
Masking: MASK_CUSTOM for custom patterns, MASK_DATE for dates, MASK_DELIMITER for characters between delimiters, MASK_RANGE for numeric values.
Randomization: REPLACE_WITH_RANDOM_BETWEEN and REPLACE_WITH_RANDOM_DATE_BETWEEN swap values with random ones in a range. SHUFFLE_ROWS randomizes row order to break correlations.
For the exam: hashing is one-way, encryption is two-way, masking replaces characters visually, and randomization swaps values entirely.
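The one-way vs. visual-masking distinction is easy to demonstrate with the standard library. These functions illustrate the concepts, not DataBrew's actual implementations, and the sample card number is fake:

```python
import hashlib

def pseudonymize(value):
    """One-way SHA-256 hash: the same input always yields the same
    digest, but the original value cannot be recovered (hashing)."""
    return hashlib.sha256(value.encode()).hexdigest()

def mask_range(value, start, end, mask="#"):
    """Replace the characters in positions [start, end) with a mask
    character, roughly what a masking function does."""
    return value[:start] + mask * (end - start) + value[end:]

print(pseudonymize("jane@example.com")[:12])  # stable, irreversible token
print(mask_range("4111111111111111", 0, 12))  # → '############1111'
```

Note the parallel to deterministic encryption: hashing is also deterministic (same input, same output), but unlike encryption there is no key that reverses it.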
Orchestrating Data Pipelines
Now to the part I actually enjoy. Orchestration is what holds your data pipeline together. Makes sure things run in the right order, handles failures, gives you visibility into what’s happening.
AWS gives you five orchestration options. Each has its place.
AWS Step Functions
Step Functions is a serverless orchestration service. You define workflows as state machines using Amazon States Language (ASL). Each step is a “state” that can call AWS services like Lambda, Glue, ECS, or any AWS API.
Key use cases: automating ETL processes, orchestrating parallel workloads, coordinating microservices, building approval workflows.
The visual console is nice. You can see your workflow as a flowchart with each step and its status. For debugging, much better than reading log files.
Step Functions is great when your workflow lives entirely within AWS. Serverless, pay-per-use, and tightly integrated with AWS services. It uses ASL though, which is JSON-based and can feel clunky if you’re used to writing Python.
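For a feel of what ASL looks like, here is a minimal state machine that runs a single Glue job. The job name is hypothetical; the `.sync` suffix on the resource ARN tells Step Functions to wait for the job to finish before moving on:

```json
{
  "Comment": "Run one Glue job and wait for it to complete",
  "StartAt": "RunEtlJob",
  "States": {
    "RunEtlJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "my-etl-job" },
      "Retry": [
        { "ErrorEquals": ["States.ALL"], "IntervalSeconds": 60, "MaxAttempts": 2 }
      ],
      "End": true
    }
  }
}
```

This also shows why ASL can feel clunky: even retry logic is declarative JSON rather than code.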
Managed Workflows for Apache Airflow (MWAA)
MWAA is AWS’s managed version of Apache Airflow. You define workflows as Python code using DAGs (directed acyclic graphs).
The key components:
DAGs define the workflow. Python scripts that specify tasks, dependencies, and schedules. Run them hourly, daily, or based on conditions.
Tasks are individual units of work inside a DAG. Each task uses an operator to do something.
Operators define what each task actually does. Built-in ones include PythonOperator, BashOperator, S3FileTransformOperator, RedshiftDataOperator, and many more.
Dependencies set the execution order. In Airflow, you use the >> operator: task_a >> task_b means task_b runs only after task_a succeeds.
The book gives a good example: a movie data pipeline where files arrive in S3, get loaded into Athena tables, joined together, cleaned up, and transferred to Redshift. The DAG defines all these steps and their dependencies in Python.
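A DAG in the spirit of that example might look like the sketch below. This assumes an Airflow 2.x environment (such as MWAA) and is not runnable as-is — the task callables are stubs and all names are illustrative:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Stubbed task logic -- real code would call Athena and Redshift here
def load_to_athena(): ...
def join_tables(): ...
def transfer_to_redshift(): ...

with DAG(
    dag_id="movie_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_to_athena", python_callable=load_to_athena)
    join = PythonOperator(task_id="join_tables", python_callable=join_tables)
    move = PythonOperator(task_id="transfer_to_redshift", python_callable=transfer_to_redshift)

    # >> sets dependencies: each task runs only after the previous one succeeds
    load >> join >> move
```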
Airflow is the most flexible orchestration tool on this list. Handles AWS services, non-AWS services, custom scripts, external APIs. The Python-based DAG definition is powerful. The UI for monitoring and debugging is better than anything else here. Downside: MWAA is expensive. High base cost compared to Step Functions. If all your work is within AWS, Step Functions might be simpler.
AWS Glue Workflows
Glue Workflows coordinate the interactions between Glue crawlers and Glue jobs. Build them using prebuilt blueprints, a visual UI, or the Glue API with JSON.
Workflows use triggers to chain components together. A trigger fires when a previous component finishes, starting the next one. The start trigger can run on a schedule, on-demand, or in response to EventBridge events.
One useful feature: run properties. Name/value pairs that all jobs in the workflow can read and write. Lets jobs share state with each other during a workflow run.
If your pipeline is entirely Glue crawlers and Glue jobs, use Glue Workflows. No extra cost. The workflow orchestration comes built-in with Glue. Don’t try to force non-Glue components into a Glue workflow though. That’s where Step Functions or MWAA makes more sense.
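Reading and writing run properties from inside a Glue job might look like the boto3 sketch below. It assumes the job was started by a Glue workflow, which passes `WORKFLOW_NAME` and `WORKFLOW_RUN_ID` as job arguments; the property name is illustrative:

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["WORKFLOW_NAME", "WORKFLOW_RUN_ID"])
glue = boto3.client("glue")

# Read shared state written by an earlier job in the same workflow run
props = glue.get_workflow_run_properties(
    Name=args["WORKFLOW_NAME"], RunId=args["WORKFLOW_RUN_ID"]
)["RunProperties"]

# Write state for downstream jobs in this run to pick up
props["rows_processed"] = "12345"
glue.put_workflow_run_properties(
    Name=args["WORKFLOW_NAME"], RunId=args["WORKFLOW_RUN_ID"], RunProperties=props
)
```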
Amazon Redshift Scheduler
Simplest option on this list. The Redshift Query Editor v2 has a built-in scheduler that runs SQL statements at specified times. No dependencies, no complex workflows. Just “run this query at 8 PM every day.”
Good for: refreshing materialized views, running recurring analytics queries, performing table maintenance.
Not good for: anything with dependencies or multiple steps. If your needs are simple, keep the solution simple.
Amazon EventBridge
EventBridge is a serverless event bus. Routes events from sources to targets based on rules. Used to be called CloudWatch Events and still uses the same API under the hood.
Core concept: event buses receive events, rules match events to patterns, and targets receive the matched events. One event can match multiple rules. Each rule can have up to five targets.
Two types of rules:
- Event-driven rules respond to changes (like a file arriving in S3)
- Schedule-based rules fire at specified times (like 8 PM on weekdays)
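Schedule expressions come in two forms; the 8 PM weekday example above uses the cron form (EventBridge cron takes six fields, and times are in UTC):

```
rate(1 hour)               # run every hour
cron(0 20 ? * MON-FRI *)   # 8:00 PM UTC, Monday through Friday
```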
EventBridge can trigger Lambda functions, Step Functions, Glue workflows, Redshift queries via the Data API, SageMaker pipelines, and more. Also sends events to Kinesis, SNS, SQS, and Firehose.
The book gives a practical example: Parquet files land in an S3 raw bucket. S3 sends an event to EventBridge. A rule matches the event and triggers a Lambda function that validates the schema and moves the file to the processed bucket. All serverless, all event-driven, near real-time.
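The rule in that example would use an event pattern roughly like the one below. The bucket name is hypothetical, the suffix filter narrows the match to Parquet objects, and this assumes EventBridge notifications are enabled on the S3 bucket:

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": { "name": ["my-raw-bucket"] },
    "object": { "key": [{ "suffix": ".parquet" }] }
  }
}
```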
EventBridge isn’t an orchestration tool in the traditional sense. It’s an event router. Almost every modern AWS data pipeline uses EventBridge somewhere though. Learn it well.
Choosing the Right Orchestration Service
| Scenario | Best Choice |
|---|---|
| Only Glue crawlers and jobs | AWS Glue Workflows (no extra cost) |
| Simple Redshift SQL scheduling, no dependencies | Redshift Scheduler (no extra cost) |
| AWS-only services, serverless | AWS Step Functions |
| External dependencies, open-source preference | MWAA (Apache Airflow) |
| Event-based triggering | Amazon EventBridge |
Other factors:
Coding language: Step Functions uses ASL (JSON). MWAA uses Python. Glue Workflows use visual UI or JSON. Redshift uses cron/at syntax. EventBridge uses cron or rate-based schedules.
Monitoring: MWAA has the best monitoring and debugging through Airflow’s UI. The others have monitoring too, but not as detailed.
Community: Airflow has a large open-source community with tons of plugins and resources. The AWS-native services rely on AWS documentation and support.
Cost: Glue Workflows and Redshift Scheduler add no extra cost beyond the components they run. Step Functions charges per state transition but is cheap for simple workflows. MWAA has a higher base cost because it runs a managed Airflow cluster, but can be more cost-effective for complex workflows with many dependencies.
For the exam, match the use case to the service. If the question mentions “only Glue components,” the answer is Glue Workflows. “External dependencies” means MWAA. “Serverless” and “AWS services only” means Step Functions. “Event-driven” means EventBridge.
Final Thoughts
This chapter covered a lot of ground across three parts: data ingestion, transformation, and now preparation plus orchestration. The orchestration section is especially important because it ties everything together.
In real-world projects, the orchestration layer is where things break most often. Choosing the right tool upfront saves you headaches later. My rule: start with the simplest tool that fits your requirements. Don’t reach for MWAA when a Glue Workflow will do. Don’t build a Step Functions state machine when a Redshift Scheduler query is enough.
For the exam, know DataBrew function names (especially the PII handling ones and the nest/unnest functions), and know the decision criteria for choosing between orchestration services. High-probability topics.
Next: Choosing Data Stores, Formats, and Lifecycle Management