Pipeline Resiliency, Monitoring, DR, and Cost Optimization for AWS Data Engineering


Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

Second half of Chapter 6. It covers the stuff that separates a working pipeline from a production-grade pipeline: monitoring, alerting, data quality checks, disaster recovery, Infrastructure as Code, CI/CD, and cost optimization. If Part 1 was about running analytics, Part 2 is about keeping them running and not going broke doing it.

Having been on the ops side of data pipelines, I can tell you: this section is where the real work lives. Building the pipeline is the easy part. Keeping it alive at 3 AM when something breaks? That’s the hard part.

Data Pipeline Resiliency

Resiliency means your pipeline keeps working even when things go wrong. Error handling, monitoring, alerting, data validation, backups, disaster recovery. The book covers all of these.

Monitoring

Can’t fix what you can’t see. Monitoring is the foundation.

CloudWatch metrics. Every AWS service pushes metrics to CloudWatch. CPU usage, memory, network, query times. You can also push custom metrics using the put-metric-data API. Useful when you need to track something specific to your application, like records processed per minute or data quality scores.
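For example, a Glue job could publish a records-per-minute metric after each batch. A minimal sketch of the request shape, assuming a hypothetical DataPipeline/Custom namespace and metric name (the actual boto3 publish call is shown commented out):

```python
# Sketch: building a PutMetricData request for a custom pipeline metric.
# The namespace, metric name, and dimension here are hypothetical.
def build_metric_payload(records_per_minute, pipeline_name):
    """Return the kwargs for cloudwatch.put_metric_data()."""
    return {
        "Namespace": "DataPipeline/Custom",
        "MetricData": [{
            "MetricName": "RecordsProcessedPerMinute",
            "Dimensions": [{"Name": "Pipeline", "Value": pipeline_name}],
            "Value": records_per_minute,
            "Unit": "Count",
        }],
    }

payload = build_metric_payload(1200, "orders-etl")
# boto3.client("cloudwatch").put_metric_data(**payload)  # the actual publish
```

Once published, the metric shows up in CloudWatch like any service metric, so you can graph it on dashboards and alarm on it.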

CloudWatch dashboards. Put multiple metrics on a single screen. Build a story about your pipeline health. I always create at least one dashboard per pipeline. Saves you from clicking through ten different service consoles when something looks wrong.

CloudTrail. Your audit trail. Records API calls: who created what, who deleted what, who modified what. Tracks actions on Glue jobs, Step Functions, Redshift clusters, and more. By default, 90 days of management events. For a full record, configure a CloudTrail trail. Always enable CloudTrail trails for production accounts. When someone asks “who changed that Glue job last Tuesday?”, you want the answer ready.

Application logs and traces. CloudWatch Logs centralizes logs from Glue ETL jobs, EMR Spark jobs, Lambda functions. Query these with Athena or analyze with OpenSearch. Where you find the actual error messages when a job fails.

Redshift system tables. Deep visibility into your warehouse:

  • STL_QUERY_METRICS: query execution metrics (rows, CPU, disk I/O)
  • STL_ALERT_EVENT_LOG: warnings raised during query execution
  • STL_LOAD_ERRORS: details about failed COPY commands
  • STL_LOAD_INFO: statistics about data load operations
  • SYS_QUERY_HISTORY: all submitted queries with metadata
  • SYS_QUERY_DETAIL: detailed metrics for query troubleshooting
  • STL_PLAN_INFO: query execution plan details
  • STL_USAGE_CONTROL: resource usage and limits

If you run Redshift in production, learn these tables. They’ll save you hours of guessing when queries slow down.
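For example, when a COPY fails, a query like this (columns per the STL_LOAD_ERRORS documentation) shows which file, line, and column caused it:

```sql
-- Sketch: inspect the most recent COPY failures.
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```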

Alerting

Monitoring without alerting is just watching things break in slow motion.

CloudWatch Alarms come in two flavors:

  • Metric alarms watch a single metric. Set a threshold. Metric crosses it, alarm fires. Simple and effective.
  • Composite alarms combine multiple alarms. Fire only when specific conditions are met together. “Alert me only if CPU is high AND memory is high.” Reduces noise.

Configure metric alarms with either a static threshold (fixed number, like CPU > 80%) or anomaly detection (CloudWatch uses ML to learn normal patterns, alerts when things look unusual). Anomaly detection is great for metrics with daily or weekly patterns.
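As a sketch, the parameters for a static-threshold metric alarm on Redshift CPU might look like the following; the alarm name and SNS topic ARN are made up, and the boto3 call is commented out:

```python
# Sketch: parameters for a CloudWatch metric alarm with a static threshold.
# AWS/Redshift and CPUUtilization are real; the names and ARN are illustrative.
def build_alarm_params(cluster_id, threshold=80.0):
    """Return the kwargs for cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"redshift-cpu-high-{cluster_id}",
        "Namespace": "AWS/Redshift",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,             # evaluate 5-minute windows
        "EvaluationPeriods": 3,    # 3 consecutive breaches before ALARM
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
    }

params = build_alarm_params("analytics-prod")
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring several consecutive breaches (EvaluationPeriods) is a cheap way to avoid paging on a single noisy data point.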

Alarm states:

  • OK means fine.
  • ALARM means threshold breached.
  • INSUFFICIENT_DATA means not enough data yet.

Notifications go through SNS. Email, SMS, or trigger a Lambda function. The Lambda option is powerful: auto-remediate issues (like resizing a Redshift cluster) or forward alerts to Slack or Teams. The Lambda-to-Slack pattern is one of the most useful things you can set up for any pipeline.
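A hedged sketch of that Lambda-to-Slack pattern: parse the alarm out of the SNS record, format a message, post it to a webhook. The field names come from the CloudWatch alarm notification format; the webhook call itself is commented out:

```python
import json

# Sketch of a Lambda handler that forwards a CloudWatch alarm (delivered
# via SNS) to a Slack incoming webhook. The webhook URL is a placeholder.
def build_slack_message(sns_record):
    """Turn one SNS record carrying an alarm notification into a Slack payload."""
    alarm = json.loads(sns_record["Sns"]["Message"])
    return {
        "text": f"{alarm['AlarmName']} is {alarm['NewStateValue']}: "
                f"{alarm['NewStateReason']}"
    }

def handler(event, context):
    for record in event["Records"]:
        msg = build_slack_message(record)
        # urllib.request.urlopen(SLACK_WEBHOOK_URL,
        #                        json.dumps(msg).encode())  # the actual post
        print(msg["text"])
```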

Event-Driven Pipeline Maintenance with EventBridge

EventBridge is a serverless event bus. Create rules that react to events across AWS services. For data pipelines, this is gold.

Examples:

  • Glue job fails? EventBridge triggers a Lambda to restart it.
  • New file in S3? EventBridge triggers a data quality check.
  • Lambda function errors? EventBridge sends a notification.

Build self-healing pipelines this way. Instead of manually watching for failures and restarting things, EventBridge handles it. Doesn’t get enough attention but solves a lot of operational pain.
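The first example above can be sketched as an EventBridge event pattern. The `aws.glue` source and `Glue Job State Change` detail type are the values Glue actually emits; the rule name is illustrative:

```python
import json

# Sketch: an EventBridge event pattern matching failed or timed-out Glue jobs.
glue_failure_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}

# A rule created with this pattern can target a Lambda that calls
# glue.start_job_run() to retry the failed job:
# events.put_rule(Name="glue-job-failed",
#                 EventPattern=json.dumps(glue_failure_pattern))
print(json.dumps(glue_failure_pattern))
```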

Data Quality with Deequ and DQDL

Bad data in a pipeline is worse than no data. At least with no data, people know something is wrong. Bad data leads to wrong decisions made with full confidence.

Deequ is an open source library built by Amazon on Apache Spark. Treats data quality checks like unit tests for code. Define assertions about your data, Deequ validates them.

DQDL (Data Quality Definition Language) is a declarative language for defining quality rules. Write rules as configurations instead of code. Even non-developers can manage them.

AWS Glue Data Quality is the managed service built on top of Deequ. Two entry points:

  1. Through the Data Catalog. Glue auto-recommends rules based on your data. Edit them or write custom rules using DQDL. Get a data quality score showing how many rules passed.
  2. Through ETL jobs. Embed data quality checks directly in your Glue jobs. Bad data gets filtered out before reaching your data lake or warehouse.

DQDL syntax:

Rules = [
   IsComplete "order-id",
   IsUnique "order-id"
]

Rule types that matter most for the exam:

  • IsComplete and IsUnique for basic integrity checks
  • ColumnDataType and ColumnExists for schema validation
  • RowCount and RowCountMatch for volume checks
  • DataFreshness for SLA monitoring
  • ReferentialIntegrity for cross-dataset consistency
  • CustomSQL for anything the built-in rules can’t handle

Composite rules combine checks with and / or:

(IsComplete "id") and (IsUnique "id")

The book shows a great example using the New York taxi dataset. The DQDL ruleset validates passenger counts, trip distances, fare totals, and column counts, comparing against historical runs using the last() function:

CustomSql "select vendorid from primary where passenger_count > 0"
    with threshold > 0.9,
Mean "trip_distance" < max(last(3)) * 1.50,
Sum "total_amount" between min(last(3)) * 0.8 and max(last(3)) * 1.2,
RowCount between min(last(3)) * 0.9 and max(last(3)) * 1.2,
ColumnCount = max(last(2))

Comparing current runs against recent historical runs is exactly what you want in production. Static thresholds break when data volumes grow. Historical comparisons adapt.

Using Deequ with EMR gives you more control. Run Deequ directly on an EMR cluster with Spark. Access AnalysisRunner for computing metrics and VerificationSuite for defining checks. Full power of the Deequ library, not just the subset through Glue Data Quality.

Automated Data Quality Checks and Error Handling

Beyond Deequ, Glue DataBrew has data validation rules for detecting missing values, handling sensitive data, deduplicating records. These checks integrate into your ETL workflows.

For error handling, Glue and Step Functions can automatically retry failed tasks, route data to dead-letter queues, or trigger custom remediation. Step Functions is especially good here: define retry logic and error handling as part of your state machine definition.
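A sketch of what that looks like in a state machine definition: a Glue task with Retry (exponential backoff) and a Catch that routes failures to a hypothetical NotifyFailure state. The service integration ARN is real; job, state, and transition names are illustrative:

```python
import json

# Sketch: a Step Functions Task state (Amazon States Language) running a
# Glue job with retry and catch. Names other than the Resource ARN are made up.
glue_task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "orders-etl"},
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 30,   # first wait
        "BackoffRate": 2.0,      # then 60s, 120s, ...
        "MaxAttempts": 3,
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "NotifyFailure",  # e.g. publish to SNS, write to a DLQ
    }],
    "Next": "LoadWarehouse",
}
print(json.dumps(glue_task_state, indent=2))
```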

Troubleshooting and Performance Tuning

When things break in production, identify the error type fast. Use CloudWatch Logs for the actual error message, then match it to these common categories:

Connection timed out. Almost always a network issue. Check in order:

  1. Are the services in the same VPC? If not, VPC peering or VPC endpoint?
  2. Does the security group on the target allow traffic from the source? Glue job connecting to Redshift? Redshift security group must allow Glue.

90% of “connection timed out” errors in AWS are security group misconfigurations. Always check security groups first.

Access denied. IAM permissions are wrong. Check the IAM role on your analytics service, resource policies (S3 bucket policies, Redshift grants), and KMS key permissions if encryption is involved. Use CloudTrail to find the exact denied permission. IAM Policy Simulator helps for testing policies.

Throttling errors. Too many API requests too fast. Implement exponential backoff: wait 1 second, then 2, then 4. Use rate limiting. For S3, avoid millions of small files – use Glue ETL to compact them. For Athena, use workgroups for query concurrency.
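Exponential backoff is simple to sketch. Here it is with jitter added so that many retrying clients don't synchronize; the ThrottlingError class stands in for a real SDK exception:

```python
import random
import time

class ThrottlingError(Exception):
    """Stand-in for a real SDK throttling exception (e.g. botocore's)."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() on throttling: wait ~1s, 2s, 4s, ... plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```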

Resource constraints. Service ran out of memory or CPU. For Lambda, increase memory (CPU scales proportionally). For Glue, increase DPUs. For EMR, scale the cluster.

CI/CD Pipelines

Data pipelines need CI/CD just like application code.

Continuous Integration (CI) means developers merge code frequently and automated tests run on every merge. AWS CodeBuild handles this. Compiles code, runs tests, produces artifacts. No servers to manage. Pay per build.

Continuous Deployment (CD) means every change that passes tests gets deployed automatically. AWS CodePipeline handles this. Models your release process as a pipeline: source, build, test, deploy.

For the exam, know that CodeBuild is CI and CodePipeline is CD.

Version Control and Collaboration

AWS CodeCommit is a managed Git service. Store ETL scripts, transformation code, IaC templates. Supports code reviews, branch permissions, standard Git workflows. If you already use GitHub or GitLab, same concepts.

Infrastructure as Code

IaC is defining infrastructure in code files instead of clicking through the AWS Console. Strong opinion here: if your infrastructure isn’t in code, it doesn’t exist. Anything created manually will eventually be misconfigured, forgotten, or impossible to recreate.

AWS CloudFormation

YAML or JSON templates to define AWS resources. Three main sections:

  • Parameters: Input values that make templates reusable across environments.
  • Resources: The actual AWS resources to create (the core section).
  • Outputs: Values exposed after creation, like endpoint URLs or IP addresses.

Simple example creating a Glue database:

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase
Resources:
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: Database to hold tables for flights data

Write once, deploy everywhere. Same template works in dev, staging, production. Need changes? Modify the template and CloudFormation updates only what changed.

AWS SAM (Serverless Application Model)

CloudFormation extended for serverless. Simplifies syntax for Lambda, API Gateway, DynamoDB. If your pipeline is mostly Lambda functions and event-driven, SAM saves a lot of YAML.

SAM also supports local development and testing. Test Lambda functions locally before deploying. Big deal for developer productivity.
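A minimal SAM sketch, assuming a hypothetical function that processes new S3 objects. The Transform line is what turns this into CloudFormation; resource names and handler are illustrative:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  ProcessNewFiles:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.12
      Timeout: 60
      Events:
        NewObject:
          Type: S3
          Properties:
            Bucket: !Ref RawDataBucket
            Events: s3:ObjectCreated:*
  RawDataBucket:
    Type: AWS::S3::Bucket
```

One AWS::Serverless::Function resource replaces the function, role, permission, and event-source wiring you would otherwise write out in raw CloudFormation.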

AWS CDK (Cloud Development Kit)

Define infrastructure using programming languages: TypeScript, Python, Java, Go. Instead of YAML, actual code. Under the hood, CDK generates CloudFormation templates.

Building blocks are constructs. Each construct represents an AWS resource or group of resources. Compose using code – loops, conditionals, all the benefits of a real programming language.

CDK is the best option for teams with programming experience. YAML templates get painful past 200 lines. CDK also has extensions for Kubernetes and Terraform.

Choosing the Right IaC

  • Best for: SAM for simple serverless apps, CDK for complex infrastructure, CloudFormation for broad AWS resource management.
  • Format: SAM and CloudFormation use YAML/JSON; CDK uses programming languages.
  • Learning curve: SAM low, CloudFormation medium, CDK higher (needs coding).
  • Scope: SAM is serverless only; CDK and CloudFormation cover all AWS services.
  • Testing: SAM offers basic local Lambda testing, CDK comprehensive unit/integration testing, CloudFormation basic validation.
  • Reusability: SAM limited, CloudFormation medium (nested stacks), CDK high (custom constructs).

SAM for simple serverless. CDK for complex apps when you have developers. CloudFormation for everything else.

Disaster Recovery and High Availability

Two key metrics:

  • RPO (Recovery Point Objective): How much data loss is acceptable. RPO of 1 hour means you might lose up to 1 hour of data.
  • RTO (Recovery Time Objective): How fast you need to be back online. RTO of 4 hours means your system must recover within 4 hours.

Define these with your business stakeholders first. Not every report is business critical. Your DR plan should match actual severity.

Three resilience architectures:

  1. Active-active. Both environments run simultaneously. Traffic shifts instantly if one fails. Lowest RTO, highest cost.
  2. Active-passive. Primary handles all work. Standby is ready for failover. Good balance of cost and resilience.
  3. Backup-restore. Regular backups, restore when needed. Cheapest but slowest recovery. Works for non-critical systems.

Serverless services (Athena, Glue, Lambda) have built-in HA. No configuration needed. For provisioned services, you configure HA yourself.

EMR High Availability

Launch your EMR cluster with 3 primary nodes instead of 1. If one fails, the other two keep the cluster running. Use EC2 placement groups to spread primary nodes across different hardware.

Redshift High Availability

Automatic fault detection. Replaces failed nodes, restores frequently accessed data from S3 first.

AZ failure recovery has two options:

  • Active-passive (relocation): Single-AZ clusters, Redshift relocates to another AZ. Recovery takes 10 to 60 minutes.
  • Active-active (multi-AZ): Multiple AZs with failover under 60 seconds. RPO of 0. Provisioned clusters only. SLA goes from 99.9% to 99.99%.

Backups for provisioned clusters:

  • Automated snapshots: Every 8 hours or after 5 GB of changes per node. Retained 1 to 35 days. Deleted when the cluster is deleted.
  • Manual snapshots: Retained indefinitely. Can be shared with other accounts. Storage charges apply.

For Redshift Serverless:

  • Recovery points: Automated every 30 minutes, retained 24 hours. Can be converted to manual snapshots.
  • Manual snapshots: Same as provisioned. Can restore to either serverless or provisioned.

Cross-region recovery: Enable cross-region snapshots to automatically copy snapshots to a backup region. Primary region goes down, restore from backup.

MSK High Availability

Distributed across multiple AZs by default. Single-AZ not even allowed. For maximum resilience, use 3 AZs. Tiered storage separates compute and storage. MSK Replicator copies data between clusters in different regions for cross-region DR.

OpenSearch High Availability

Deploy data nodes across multiple AZs. Use dedicated cluster manager (CM) nodes. Primary and replica shards – when a primary shard fails, replica gets promoted automatically. Shards placed across different nodes and AZs.

Cost Optimization

Building pipelines that work is one thing. Building ones that don’t drain your AWS bill is another.

Serverless Services

Use Athena, Glue, and Lambda when possible. Scale automatically, pay only for what you use. No idle resources eating money. For compute-heavy batch jobs, EC2 Spot Instances cost significantly less than on-demand but can be interrupted. Fine for retryable ETL jobs. Not for real-time processing.

Autoscaling

EMR has Managed Scaling. Glue autoscales ETL and streaming jobs. Redshift has AI-driven autoscaling. Application Auto Scaling works for EMR, MSK, and EC2. Don’t overprovision. Let services scale based on actual demand.

Tiered Storage

Move data to cheaper storage when accessed less frequently:

  • OpenSearch: Hot, UltraWarm, and Cold tiers.
  • MSK: Tiered storage for brokers.
  • S3: Intelligent-Tiering, Glacier, Glacier Deep Archive.

One of the easiest wins for cost savings. Old data nobody queries shouldn’t sit on expensive storage.

Columnar Formats

Parquet or ORC instead of CSV or JSON. Compress your data files. Partition by columns used in WHERE clauses. The difference between querying 1 TB of CSV and 100 GB of compressed Parquet is real money when you pay per scan.
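One way to do that conversion is an Athena CTAS statement. Table names, bucket, and partition column here are illustrative; note that partition columns must come last in the SELECT:

```sql
-- Sketch: rewrite a raw CSV table as partitioned, compressed Parquet.
CREATE TABLE analytics.orders_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    partitioned_by = ARRAY['order_date'],
    external_location = 's3://my-bucket/curated/orders/'
) AS
SELECT order_id, customer_id, total_amount, order_date
FROM analytics.orders_csv;
```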

Data Transfer Costs

AWS charges for data moving between services and regions. These costs sneak up on you:

  • Minimize cross-region transfers.
  • Use VPC peering or Transit Gateway for VPC-to-VPC traffic, and Direct Connect for on-premises links.
  • Compress data before transferring.
  • Use Cost Explorer and Budgets to track and alert on spending.

Data transfer costs are the silent killer of AWS bills. Most teams focus on compute and storage costs but ignore transfer. Set up budget alerts early.

General Best Practices

  • Spot Instances for retryable EMR jobs
  • Flex execution class in Glue
  • Reserved Instances for Redshift provisioned clusters
  • Columnar formats for Athena query performance
  • Athena capacity reservations for predictable compute costs

Know these for the exam. AWS loves asking about cost optimization.

Key Takeaways

  1. Monitoring is not optional. CloudWatch, CloudTrail, application logs, and system tables together give you full visibility.
  2. Alerting needs to be smart. Composite alarms and anomaly detection reduce noise.
  3. EventBridge enables self-healing pipelines. React to events automatically instead of manually.
  4. Data quality is testable. Deequ and DQDL treat data validation like unit tests. Historical comparisons beat static thresholds.
  5. IaC is non-negotiable. CloudFormation for general use, SAM for serverless, CDK for complex apps.
  6. DR planning starts with RPO and RTO. Match your architecture to actual business requirements.
  7. Cost optimization is continuous. Serverless, autoscaling, tiered storage, columnar formats, transfer cost management all add up.

This chapter covers a lot of ground but the pattern is clear: build resilient pipelines by monitoring everything, automating responses, validating data quality, and keeping costs under control. These are the operational skills that turn a data engineer into a reliable one.



