AWS Auxiliary Services for Data Engineering: Compute, Storage, ML, and More



Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3


Chapter 3 Part 1 covered the core analytics services: Kinesis, Glue, Redshift, Athena, and friends. Those services don’t exist in a vacuum though. They need compute to run on, databases to pull data from, storage to land results, networking to keep things secure, and monitoring to know when something breaks.

This is Chapter 3 Part 2. It covers the auxiliary services that support analytics workloads. The supporting cast. Not the headline act, but without them nothing works.

I’ll add my own take on which services actually matter for the DEA-C01 exam and real-world data engineering.

Application Integration

These services connect things together. In data engineering, they’re the glue (not AWS Glue, just regular glue) between your pipeline components.

Amazon EventBridge is a serverless event bus. When new data lands in S3, EventBridge can trigger a Lambda function or Step Functions workflow. It connects your own applications, SaaS products, and AWS services through events. For data pipelines, this is how you build event-driven architectures without polling.
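That S3-to-pipeline trigger boils down to an event pattern attached to a rule. A minimal sketch, with a hypothetical bucket name; in practice you would attach this pattern to a rule via the console or boto3’s events.put_rule:

```python
import json

# EventBridge event pattern matching S3 "Object Created" events for a
# hypothetical landing bucket. The bucket must have EventBridge
# notifications enabled for these events to flow.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["my-landing-bucket"]}},
}

# This JSON string is what the rule's EventPattern field would contain.
pattern_json = json.dumps(event_pattern)
print(pattern_json)
```

A rule with this pattern can then target a Lambda function or a Step Functions state machine, which is the no-polling architecture described above.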

Amazon SQS is a message queue. It decouples producers from consumers. If your data pipeline produces records faster than downstream can process them, SQS buffers the messages. Simple, reliable, and one of the oldest AWS services.

Amazon SNS is pub/sub messaging. One producer, many subscribers. Good for fan-out patterns where one event needs to trigger multiple downstream actions. Pair SNS with SQS for robust notification patterns.

Amazon MWAA is managed Apache Airflow. If you need to orchestrate complex data pipelines with DAGs, MWAA gives you Airflow without the pain of managing the infrastructure. Airflow is very popular in data engineering teams, so this one shows up on the exam.

AWS Step Functions is a serverless workflow orchestrator. Integrates with over 220 AWS services. For simpler ETL pipelines where Airflow is overkill, Step Functions works well. Visual workflow editor, error handling, retries built in.
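A Step Functions workflow is defined in the Amazon States Language (ASL). A sketch of a two-step ETL flow with built-in retries; the ARNs are placeholders, not real resources:

```python
import json

# Minimal ASL definition: run a transform task with retries, then
# publish a completion notification via the SNS service integration.
state_machine = {
    "Comment": "Sketch of a simple ETL workflow",
    "StartAt": "Transform",
    "States": {
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
            "Retry": [
                {"ErrorEquals": ["States.TaskFailed"],
                 "IntervalSeconds": 5,
                 "MaxAttempts": 3}
            ],
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-done",
                "Message": "ETL complete",
            },
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

The retry block is the point: error handling that would be boilerplate in hand-rolled orchestration code is declarative here.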

Amazon AppFlow transfers data between SaaS applications (Salesforce, Slack, SAP) and AWS services. No code needed. If the exam asks about moving data from SaaS to S3, AppFlow is the answer.

EventBridge, SQS, and Step Functions are the most important here. You’ll see them constantly in real data engineering work. MWAA matters if your team already uses Airflow.

Compute and Containers

Where your code actually runs.

Amazon EC2 gives you virtual machines. Full control, any configuration. For data engineering, you might run self-managed Spark or Hadoop clusters on EC2. Most teams are moving away from this approach, but it still exists.

AWS Lambda is serverless compute. Write a function, trigger it with an event, pay only for execution time. For data engineering, Lambda handles lightweight transformations, file processing triggers, and event-level processing. There’s a 15-minute timeout and memory limit, so it’s not for heavy ETL jobs.
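A sketch of the kind of lightweight file-processing trigger Lambda is good at: parsing the standard S3 notification event. The bucket and key below are made up; a real handler would go on to read and transform the object with boto3:

```python
import urllib.parse

def handler(event, context):
    """Extract bucket/key pairs from an S3 put-notification event."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 notifications are URL-encoded (spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append({"bucket": bucket, "key": key})
    return results

# Local invocation with a trimmed-down sample S3 event:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-zone"},
                "object": {"key": "2024/01/file+name.csv"}}}
    ]
}
result = handler(sample_event, None)
print(result)
```

The URL-decoding step is a classic gotcha: keys with spaces or special characters arrive encoded, and pipelines that skip the decode fail on exactly those files.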

Amazon ECS runs Docker containers with tight AWS integration. If your data processing code is containerized, ECS manages the orchestration.

Amazon EKS is managed Kubernetes. Same idea as ECS but with Kubernetes. If your organization already invested in Kubernetes, EKS makes sense. Otherwise, ECS is simpler.

Amazon ECR stores your container images. Simple container registry. Build your image, push to ECR, and ECS or EKS pulls from there.

AWS SAM (Serverless Application Model) helps you define and deploy Lambda-based serverless applications. Framework on top of CloudFormation. Useful for development, not something you interact with at runtime.

AWS Batch dynamically provisions compute resources for batch jobs. It runs workloads on ECS, EKS, or EC2. Good for large-scale data preprocessing or model training where you need lots of compute for a few hours.

Lambda is the star here for data engineering. Know its limits (15 min timeout, 10 GB memory). For heavier workloads, know when to use Batch vs. ECS vs. EKS. The exam tests whether you can pick the right compute for the job.

Databases

AWS has a database for every data model. In analytics, these are your data sources. You pull data from these into your analytics layer.

Amazon RDS is managed relational databases. Supports MySQL, PostgreSQL, Oracle, SQL Server. Handles backups, patching, and scaling. Bread and butter for transactional workloads.

Amazon Aurora is AWS’s own relational database, compatible with MySQL and PostgreSQL. Separates compute and storage, auto-scales, and replicates across Availability Zones. Faster and more available than standard RDS. If the exam mentions high-performance relational database, think Aurora.

Amazon DynamoDB is a key-value and document database. Single-digit millisecond latency at any scale. Perfect for high-velocity data ingestion from web apps, gaming, ad tech, and IoT. DynamoDB Streams can feed data into analytics pipelines. Important for the exam.
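High-throughput ingestion in DynamoDB is mostly about key design. A sketch, with hypothetical attribute names, of an item shaped for a clickstream table keyed on user (partition key) and event time (sort key); this dict is what you would hand to a boto3 Table.put_item(Item=...) call:

```python
import time

def build_clickstream_item(user_id, action):
    """Build an item for a hypothetical clickstream table: partition
    key (pk) on user, sort key (sk) on millisecond event time, the
    usual layout for time-ordered, high-velocity writes."""
    return {
        "pk": f"USER#{user_id}",
        "sk": f"EVT#{int(time.time() * 1000)}",
        "action": action,
    }

item = build_clickstream_item("42", "page_view")
print(item)
```

Spreading writes across many partition key values (one per user here) is what keeps throughput even; a single hot key would throttle no matter how the table is provisioned.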

Amazon DocumentDB is MongoDB-compatible. If your application uses MongoDB and you want a managed version on AWS, this is it.

Amazon Keyspaces is managed Apache Cassandra. Wide-column store for large volumes of data across many servers with no single point of failure. Niche but good to know.

Amazon MemoryDB is a durable in-memory database compatible with Redis. Ultra-fast performance for applications that need sub-millisecond reads.

Amazon Neptune is a graph database. For highly connected datasets like social networks, recommendation engines, or fraud detection. Not common in typical data engineering, but the exam includes it.

Focus on DynamoDB, Aurora, and RDS. These three cover 90% of exam questions about databases as data sources. Know when to use each: DynamoDB for high-throughput key-value access, Aurora for high-performance relational needs, RDS for standard relational workloads.

Storage

Storage is foundational. Every analytics pipeline starts and ends with storage.

Amazon S3 is object storage and the backbone of data lakes on AWS. Store anything, any size, from anywhere. Multiple storage classes (Standard, Infrequent Access, Glacier) let you balance cost and access speed. S3 integrates with basically every AWS analytics service. If you learn one storage service, learn S3.

Amazon S3 Glacier is the cold storage tier within S3. Cheap storage for data you rarely access. Retrieval times range from milliseconds to hours depending on the tier. Good for compliance archives and historical data.
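Moving cold data into Glacier is usually done with a lifecycle rule rather than by hand. A sketch, with hypothetical bucket layout, of the LifecycleConfiguration you would pass to boto3’s s3.put_bucket_lifecycle_configuration:

```python
# Lifecycle rule: transition objects under raw/ to Glacier after 90
# days, then expire them after 5 years. Prefix and numbers are
# illustrative, not recommendations.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-raw-zone",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 1825},
        }
    ]
}
print(lifecycle)
```

Lifecycle rules like this are the answer whenever an exam scenario asks how to cut storage costs for data that must be kept but is rarely read.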

Amazon EBS (Elastic Block Store) provides block storage for EC2 instances. Virtual hard drive. Consistent low-latency performance. In analytics, EBS is the underlying storage for self-managed Hadoop or database workloads running on EC2.

Amazon EFS (Elastic File System) is shared file storage using NFS protocol. Multiple EC2 instances can access the same filesystem concurrently. Useful when multiple compute nodes need access to shared data.

AWS Backup centralizes backup management across EBS, EFS, S3, and other services. Not glamorous, but important for data protection.

S3 is king. Know its storage classes, lifecycle policies, and how it integrates with Athena, Glue, and Redshift Spectrum. EBS and EFS come up less frequently, but understand when block storage vs. file storage vs. object storage is appropriate.

Machine Learning

ML services are growing in importance for data engineers. You may not build the models, but you need to understand the infrastructure.

Amazon SageMaker is the full ML platform. Build, train, and deploy models in production. It integrates with data lakes and warehouses for training data, stores model artifacts in S3, uses IAM for security, and provides monitoring. Data engineers typically set up the data pipelines that feed SageMaker.

Amazon Bedrock gives you access to foundation models (large language models) from providers like Anthropic, Meta, Cohere, and Amazon’s own Titan models. Managed way to build generative AI applications. Bedrock also includes knowledge bases for RAG (retrieval-augmented generation), agents for complex tasks, and Guardrails for content filtering.

Amazon Q is a generative AI assistant built on Bedrock. Integrates into multiple AWS services: QuickSight for natural language visualizations, Glue for code generation, Redshift for SQL generation. Amazon Q Developer helps with code transformation and generation.

SageMaker and Bedrock are the two to know. The exam will test whether you understand how data flows into ML training and how models get deployed. Bedrock is newer and increasingly relevant with the generative AI wave.

Migration and Transfer

Moving data to AWS is a common problem. These services solve different variations of it.

AWS DMS (Database Migration Service) moves data between databases. Works between AWS services, from on-premises to AWS, or from other clouds to AWS. Supports one-time full loads and continuous incremental replication. Probably the most important migration service for the exam.
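The full-load vs. continuous-replication choice surfaces as a single parameter on the replication task. A sketch of the key arguments to a boto3 dms.create_replication_task call, with placeholder ARNs:

```python
# MigrationType is the switch the exam cares about:
#   "full-load"          - one-time copy
#   "cdc"                - ongoing change data capture only
#   "full-load-and-cdc"  - initial copy, then continuous replication
task_params = {
    "ReplicationTaskIdentifier": "orders-to-s3",
    "SourceEndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:src",
    "TargetEndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:tgt",
    "ReplicationInstanceArn": "arn:aws:dms:us-east-1:123456789012:rep:inst",
    "MigrationType": "full-load-and-cdc",
    "TableMappings": '{"rules": []}',  # JSON string selecting which tables to move
}
print(task_params["MigrationType"])
```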

AWS SCT (Schema Conversion Tool) is part of DMS. Converts database schemas between different engines and provides compatibility reports. When migrating from Oracle to PostgreSQL, for example, SCT helps with the schema translation.

AWS DataSync transfers files and objects between storage systems. Have NFS shares or HDFS clusters and need to move data to S3? DataSync handles it. Fast, secure, automated.

AWS Data Exchange is a data marketplace. Buy or sell datasets through subscriptions. Access data via files, APIs, or Redshift queries.

AWS Snow Family handles physical data transfer at scale. Snowball moves petabytes of data using physical devices shipped to your location. Snowcone is a smaller, portable version for edge locations. When your data is too large for network transfer, Snow Family is the answer.

AWS Transfer Family enables file transfers over SFTP, FTPS, FTP, and AS2 protocols. Moves data to and from S3 or EFS. Good for legacy systems that only speak FTP.

DMS is the big one here. Know the difference between full load and CDC (change data capture). DataSync vs. Snow Family is a common exam question: DataSync for network transfers, Snow Family for when the network is too slow or non-existent.

Networking and Content Delivery

Networking is where security starts. If your analytics services can’t reach each other securely, nothing else matters.

Amazon VPC (Virtual Private Cloud) provides network isolation. You create subnets (public for internet-facing, private for internal-only) and control traffic flow. Every serious AWS deployment uses VPC. Your analytics services like Redshift, Glue, and RDS typically sit in private subnets.

AWS PrivateLink lets services in your VPC reach other AWS services without traversing the public internet. If a Lambda function in your VPC needs to read from S3, a VPC endpoint (a gateway endpoint in S3’s case, or a PrivateLink-powered interface endpoint) keeps that traffic on AWS’s internal network. Faster and more secure.

Amazon Route 53 is DNS. Routes domain names to IP addresses. Supports latency-based routing, geo-based routing, and health checks. Important for any web-facing application but less directly relevant to analytics pipelines.

Amazon CloudFront is a CDN (content delivery network). Serves content from edge locations close to users. Integrates with S3 for static content delivery. Includes Lambda@Edge for compute at the edge.

VPC and PrivateLink are critical for the exam. Many questions involve network architecture for analytics workloads. Know the difference between public and private subnets, and understand VPC endpoints (which PrivateLink enables). CloudFront and Route 53 are less likely to appear in data engineering questions.

Security, Identity, and Compliance

AWS calls security its top priority. For the exam, you need to know how to secure data at every layer.

AWS IAM (Identity and Access Management) controls who can do what. Users, roles, groups, policies. Fine-grained access control on every AWS action. You’ll use IAM in every single AWS project. For data engineering, IAM roles define what your Glue jobs, Lambda functions, and Redshift clusters can access.
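To make that concrete, here’s a sketch of a policy document (bucket name and prefixes hypothetical) that a Glue job’s execution role might carry: read from the raw zone, write to the curated zone, nothing else:

```python
import json

# Least-privilege policy for a hypothetical Glue ETL role. Scoping
# Resource to specific prefixes is the habit the exam rewards.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-lake/raw/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-data-lake/curated/*",
        },
    ],
}
print(json.dumps(policy))
```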

AWS KMS (Key Management Service) manages encryption keys. Encrypts data at rest in S3, databases, EBS volumes. Supports multi-region keys and external key stores. Know the difference between AWS-managed keys and customer-managed keys.
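Encrypting at rest with a customer-managed key often comes down to two request parameters. A sketch, with hypothetical bucket and key ARN, of what you would add to a boto3 s3.put_object call:

```python
# With ServerSideEncryption set to "aws:kms" and an explicit key ARN,
# S3 encrypts this object under your customer-managed key instead of
# the bucket's default encryption settings.
put_args = {
    "Bucket": "my-data-lake",
    "Key": "curated/orders.parquet",
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:123456789012:key/"
                   "1234abcd-12ab-34cd-56ef-1234567890ab",
}
print(put_args["ServerSideEncryption"])
```

Omit SSEKMSKeyId and S3 uses the AWS-managed key for S3, which is exactly the AWS-managed vs. customer-managed distinction the exam probes.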

Amazon Macie uses ML to detect sensitive data (names, addresses, credit card numbers) in S3. Flags problems and notifies stakeholders. Good for compliance and data governance.

AWS Secrets Manager stores database credentials, API keys, and other secrets. Applications read secrets at runtime instead of hardcoding them. Integrates with IAM for access control and supports automatic rotation.
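A sketch of reading one at runtime. The response shape is what Secrets Manager’s get_secret_value returns; the field names inside the secret follow the layout the RDS integration uses, but your own secrets can be any JSON you like:

```python
import json

def parse_db_secret(response):
    """Pull connection details out of a get_secret_value response.
    SecretString holds the JSON the secret was stored with."""
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["host"]

# In real code the response comes from
# boto3.client("secretsmanager").get_secret_value(SecretId="prod/db").
# Here we fake it so the sketch runs offline:
fake_response = {
    "SecretString": '{"username": "etl", "password": "x", "host": "db.example.com"}'
}
creds = parse_db_secret(fake_response)
print(creds)
```

The point of the pattern: the credentials never appear in code or config, and rotation swaps them without redeploying anything.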

AWS Shield protects against DDoS attacks. Standard tier is free and automatic. Advanced tier adds detection, mitigation, and response capabilities.

AWS WAF (Web Application Firewall) protects HTTP/HTTPS applications. Controls access to web content. Integrates with CloudFront, API Gateway, and Application Load Balancers.

IAM and KMS are essential. Every exam question about security involves one or both. Secrets Manager comes up in questions about connecting to databases securely. Macie is the answer when the question mentions detecting PII or sensitive data in S3.

Management and Governance

Operations. Monitoring. Auditing. Infrastructure as code. The stuff that keeps production running.

AWS CloudFormation is infrastructure as code. Define your AWS resources in YAML or JSON templates, and CloudFormation creates and manages them. Handles rollbacks on failure. AWS CDK generates CloudFormation templates from programming languages, so they’re related.

AWS CloudTrail tracks every API call made in your AWS account. Who did what, when, from where. Event history is on by default; create a trail to retain logs beyond 90 days. Essential for auditing and compliance. If the exam asks “how do you track who deleted a resource,” the answer is CloudTrail.

Amazon CloudWatch handles logging and monitoring. Collect metrics, set alarms, build dashboards, analyze logs. Every AWS service publishes metrics to CloudWatch. For data pipelines, CloudWatch tells you when something fails or performs poorly.
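Alerting on a failing pipeline is a put_metric_alarm call. A sketch of its parameters for a hypothetical Glue job (Glue publishes its metrics under the “Glue” namespace; the job name and SNS topic are made up):

```python
# Alarm on failed Glue tasks: fire when the failure count in any
# 5-minute window reaches 1, and notify an on-call SNS topic.
alarm_params = {
    "AlarmName": "etl-job-failures",
    "Namespace": "Glue",
    "MetricName": "glue.driver.aggregate.numFailedTasks",
    "Dimensions": [{"Name": "JobName", "Value": "nightly-etl"}],
    "Statistic": "Sum",
    "Period": 300,
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
}
print(alarm_params["AlarmName"])
```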

AWS Config tracks configuration changes to AWS resources over time. Shows relationships between resources. Useful for compliance: “show me every change made to this security group in the last 30 days.”

Amazon Managed Service for Prometheus is managed Prometheus for metrics monitoring. Native integration with containers. Good for teams already using Prometheus.

Amazon Managed Grafana is managed Grafana for visualization. Sits on top of Prometheus, Elastic, and other data sources. Great for observability dashboards.

AWS Systems Manager gives you visibility and control over your AWS infrastructure. Automate patching, manage configurations, run commands across multiple instances. More relevant for ops than pure data engineering, but good to know.

CloudWatch and CloudTrail show up constantly on the exam. CloudWatch for monitoring pipelines, CloudTrail for auditing access. CloudFormation matters if the question is about automating infrastructure deployment. The managed Prometheus and Grafana offerings are nice but unlikely to dominate exam questions.

Developer Tools

CI/CD for your data engineering code.

AWS CLI lets you manage AWS from the terminal. Every data engineer uses it daily. Scripts, automation, quick lookups.

AWS CloudShell is a browser-based terminal in the AWS console. Pre-authenticated, no setup needed. Handy for quick tasks.

AWS CDK (Cloud Development Kit) lets you define infrastructure using TypeScript, Python, Java, Go, or .NET. Generates CloudFormation templates under the hood. More developer-friendly than writing raw CloudFormation YAML.

AWS Code Services is the CI/CD suite: CodeCommit for git repositories, CodeBuild for building and testing, CodeDeploy for deployment automation, and CodePipeline for orchestrating the whole release process. Together they form a complete CI/CD pipeline on AWS.

For the exam, know that CodePipeline orchestrates CI/CD and that CodeCommit can store ETL scripts. In practice, most teams use GitHub or GitLab instead of CodeCommit, but the exam tests AWS services. CDK is worth knowing as an alternative to raw CloudFormation.

Cloud Financial Management

Money. The part everyone forgets until the bill arrives.

AWS Cost Explorer visualizes your AWS spending over time. Default reports plus custom reports. Forecasts future spending and identifies trends. After you build a data lake, Cost Explorer tells you how much it actually costs to run.

AWS Budgets lets you set spending limits and get alerts. Configure email or SNS notifications when costs exceed thresholds. Proactive cost management instead of reactive surprise.

Cost Explorer and Budgets are simple but important. The exam may ask about monitoring costs for analytics workloads. Set up Budgets alerts on every account. Good practice regardless of certification.

AWS Well-Architected Tool

The Well-Architected Tool helps you review your workloads against AWS best practices. Built around the Well-Architected Framework, which has six pillars: Security, Reliability, Sustainability, Operational Excellence, Performance Efficiency, and Cost Optimization.

Well-Architected Lenses extend these best practices to specific domains like machine learning, IoT, and financial services.

It integrates with AWS Trusted Advisor and Service Catalog AppRegistry to help answer review questions.

Know the six pillars. The exam tests whether you can identify which pillar a given recommendation falls under. The tool itself is less important than understanding the framework’s principles.

Wrapping Up

That’s a lot of services. The key insight from this chapter: data engineering on AWS is not just about Glue, Redshift, and Kinesis. It’s about the whole ecosystem working together. Lambda triggers a Glue job when data lands in S3. IAM controls who can access what. CloudWatch tells you when something breaks. VPC keeps your data isolated. KMS encrypts everything at rest.

For the exam, focus on understanding when to use which service. The questions aren’t about memorizing features. They present scenarios and ask you to pick the right combination of services.

The services that come up most in data engineering contexts: Lambda, DynamoDB, S3, IAM, KMS, VPC, CloudWatch, CloudTrail, DMS, and SageMaker. Know these well, and you’ll handle most auxiliary service questions on the DEA-C01.




