Choosing Data Stores, Storage Formats, and Lifecycle Management on AWS
Previous: Data Preparation and Orchestration
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Chapter 5 is where the book gets into data store management. Domain 2 territory on the exam, and a big one. How do you pick the right storage? What file format should you use? How do you keep your S3 bill from growing out of control?
I split Chapter 5 into two parts because there’s a lot of material. This first part covers choosing data stores, storage formats, data cataloging, and lifecycle management. Second part handles data modeling and schema evolution.
Choosing a Data Store
The book starts with a breakdown of AWS storage options into two buckets: core storage services and managed databases.
Core Storage Services
Three types of storage on AWS:
Block storage (Amazon EBS) is like attaching a hard drive to your EC2 instance. Ultra-low latency, which is what databases and ERP systems need. EBS volumes are replicated within an Availability Zone, and you can resize them without detaching.
File storage (Amazon EFS, Amazon FSx) is shared file access over a network. EFS uses NFS and works across multiple AZs. FSx has specialized flavors: Lustre for HPC and ML, Windows File Server for Microsoft shops, NetApp ONTAP for hybrid setups. If your app needs a shared filesystem, this is the category.
Object storage (Amazon S3) is the big one for data engineers. Cheapest per GB, scales to basically unlimited size, handles any data type. S3 is the default storage layer for data lakes on AWS. Every analytics service integrates with it. The tradeoff: S3 is built for throughput, not low-latency random access. If you need high IOPS and small file updates, EBS or FSx is better.
In practice, as a data engineer, you’ll mostly work with S3. EBS and EFS are usually abstracted behind managed services like RDS or handled by a separate infrastructure team.
AWS Cloud Databases
Six database types you should know for the exam:
| Database Type | Data Type | AWS Service | Common Use Cases |
|---|---|---|---|
| Relational | Structured with schemas | Aurora, RDS, Redshift | ERP, CRM, BI |
| Key-value | Key-value pairs | DynamoDB | Gaming, IoT, session management |
| Document | Semi-structured (JSON/BSON) | DocumentDB | Content management, user profiles |
| In-memory | Key-value, semi-structured | ElastiCache, MemoryDB | Caching, leaderboards, real-time analytics |
| Graph | Nodes, edges, properties | Neptune | Social networks, fraud detection |
| Search engine | Semi-structured, free text | OpenSearch | Log analytics, ecommerce search |
The key decision point: OLTP vs OLAP. This trips people up. Aurora and RDS are OLTP – high-concurrency transactions with normalized schemas. Redshift is OLAP – complex analytical queries over large datasets with denormalized schemas. All relational, completely different purposes. On the exam, if the question mentions “analytical queries” or “business intelligence,” the answer is Redshift. “Transactional workload” or “high concurrency” means Aurora or RDS.
Data Storage Formats for Data Lakes
Choosing the right file format for your data lake matters more than most people think. The wrong format can make your queries 10x slower and your storage costs 3x higher.
Row-Based Formats
Row-based formats store all fields of a record together. Good for reading entire rows, bad for analytical queries that only need a few columns.
- CSV is simple, human-readable, widely supported. Good for initial data ingestion and small datasets. Bad for anything at scale – no schema, no compression, no type safety.
- JSON is flexible, handles nested data, used everywhere in APIs. Same problems as CSV at scale: verbose, no compression by default, slow to parse.
- Avro is binary, supports schema evolution, popular in streaming pipelines (especially Kafka). Row-based but much better than CSV/JSON for serialization.
Column-Based Formats
Column-based formats store values of each column together. This is what you want for analytics.
- Parquet is the default choice for most data lake workloads. Handles nested data well, compresses aggressively, every analytics engine supports it. If you’re not sure what format to use, use Parquet. You won’t regret it.
- ORC (Optimized Row Columnar) was built for the Hadoop ecosystem. Performs well, but Parquet has won the popularity contest. ORC is still common in Hive-heavy environments. Starting fresh? Go Parquet.
I’ve seen teams waste weeks arguing about ORC vs Parquet. In 2026, Parquet is the standard. Every major tool supports it natively. ORC is fine if you already have it, but there’s no reason to start new projects with it.
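The row-versus-column distinction is easy to see in plain Python. A toy sketch (not any real file format, just the access-pattern difference between the two layouts):

```python
# Toy illustration of row-based vs column-based layout.
# Not a real file format -- just the access-pattern difference.

records = [
    {"user_id": 1, "country": "DE", "revenue": 10.0},
    {"user_id": 2, "country": "US", "revenue": 25.0},
    {"user_id": 3, "country": "US", "revenue": 5.0},
]

# Row-based (CSV/JSON/Avro style): each record's fields live together.
# To sum one column you still touch every field of every record.
row_sum = sum(r["revenue"] for r in records)

# Column-based (Parquet/ORC style): each column's values live together.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "US"],
    "revenue": [10.0, 25.0, 5.0],
}

# An analytical query reads only the column it needs...
col_sum = sum(columns["revenue"])

# ...and similar values stored next to each other compress far better
# (e.g. run-length encoding the repeated "US" values).
assert row_sum == col_sum == 40.0
```

Same data, same answer; the columnar layout just lets the engine skip everything it doesn't need, which is exactly the analytics access pattern.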
Table Formats
Table formats are the layer on top of file formats that brings database features to data lakes.
- Apache Iceberg brings ACID transactions, schema evolution, time travel, and partition evolution to your data lake. Works with Spark, Trino, Flink, and most AWS services. Iceberg is getting the most momentum right now, and AWS has been investing heavily in it.
- Apache Hudi focuses on incremental processing and upserts. Good for CDC use cases where you need to merge updates into existing data. Hudi was born at Uber for exactly this kind of workload.
- Delta Lake was built on top of Spark by Databricks. Provides ACID transactions and scalable metadata. If you’re in the Databricks ecosystem, Delta Lake is the natural choice.
The Iceberg hype is real, and it’s also justified. Iceberg has the broadest engine support and the most active community. For new data lake projects on AWS, Iceberg is the safe bet. Hudi still has advantages for streaming upsert workloads. Delta Lake makes sense if you’re all-in on Databricks. The exam will test you on knowing what each one does, not on picking favorites.
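To make Iceberg on AWS concrete: Athena can create Iceberg tables with plain DDL. A sketch, shown as a Python string (the database, table, column, and bucket names are made-up examples):

```python
# Hedged sketch: Athena DDL for an Iceberg table.
# Database, table, and bucket names below are made-up examples.
create_iceberg_table = """
CREATE TABLE analytics.page_views (
    event_time  timestamp,
    user_id     bigint,
    page        string
)
PARTITIONED BY (day(event_time))
LOCATION 's3://example-data-lake/page_views/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

# Submitted through the Athena console or StartQueryExecution, the
# table definition lands in the Glue Data Catalog and the data files
# are written as Parquet in S3.
```

Note the partition transform (`day(event_time)`): with Iceberg you partition on an expression over a column, and the partition layout can evolve later without rewriting the table.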
Building a Data Strategy with Multiple Data Stores
Two important architectural patterns:
Lakehouse Architecture
A lakehouse combines the scalability of a data lake (S3) with the query performance of a data warehouse (Redshift). Store everything in S3, use table formats for structure, connect Redshift for high-performance queries. Redshift Spectrum lets you query S3 data directly from Redshift using standard SQL.
The book breaks the lakehouse into five logical layers: ingestion, storage, cataloging, processing, and consumption. On AWS, S3 is the storage layer and Redshift is the warehouse layer. Data moves between them using COPY (load into Redshift) and UNLOAD (export to S3) commands.
Federated Queries
Federated queries let you query data where it lives without moving it. Amazon Athena Federated Query can run SQL across S3, RDS, Redshift, DynamoDB, and third-party sources in a single query. No ETL needed.
The tradeoff: federated queries put compute load on the source systems. Great for ad-hoc analysis and real-time queries, but don’t scale as well as centralized lakehouses for heavy analytical workloads. Use federated queries when you need fresh data from multiple sources. Use a lakehouse when you need repeatable, high-performance analytics.
Data Cataloging
A data catalog is a registry of what data you have, where it lives, and what it looks like. Without it, your data lake becomes a data swamp.
Technical vs Business Metadata
Two types:
Technical metadata includes schemas, column types, partition layouts, data lineage, and source information. What engineers need.
Business metadata includes data ownership, business definitions, usage policies, and quality scores. What business users need.
On AWS, the Glue Data Catalog handles technical metadata. Amazon DataZone handles business metadata cataloging (though it’s not on the exam yet since it launched recently).
Populating the Glue Data Catalog
Four ways to get metadata in:
Glue Crawlers are the most common approach. A crawler connects to a data source (S3, DynamoDB, JDBC databases, MongoDB), reads the data, infers the schema, and writes table definitions into the catalog. Schedule crawlers to run periodically or trigger them via API.
Manual definition gives you full control. Define every table and column yourself. Good for proof-of-concept work or unsupported formats. Bad for anything at scale.
Integration with other AWS services. Create tables using Athena DDL statements (`CREATE TABLE`) and the metadata goes straight to the Glue Data Catalog. The `MSCK REPAIR TABLE` command in Athena loads Hive-style partitions.
Migration from Hive Metastore. Already have a Hive metastore? Migrate it to the Glue Data Catalog using Glue ETL jobs.
Data Catalog Best Practices
The ones that matter most:
- Consistent naming conventions. Use prefixes for environments (dev, prod) and data stages (raw, processed). Sounds obvious but many teams skip it and then spend months cleaning up.
- Security. Use IAM policies for fine-grained access control. Encrypt metadata at rest and in transit. Audit access logs.
- Schema change management. Schedule regular crawler runs or use trigger-based crawls. Glue supports schema versioning, so you can track changes over time and roll back if needed. Use incremental crawls for frequently changing sources to save time and money.
- Column statistics. Glue can compute column-level stats (min, max, null count, distinct count). Athena and Redshift use these stats to build better query plans. Quick win for performance.
- Partition indexes. If your tables have many partitions, create partition indexes in the Glue Data Catalog. Speeds up partition lookups during query planning.
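Partition indexes are also a single API call per table. A sketch of the parameters for Glue's CreatePartitionIndex (the database, table, and partition key names are made-up examples):

```python
# Hedged sketch of Glue CreatePartitionIndex parameters.
# Database, table, and partition key names are made-up examples.
partition_index_params = {
    "DatabaseName": "raw_sales",
    "TableName": "events",
    "PartitionIndex": {
        "IndexName": "year-month-idx",
        # Keys must come from the table's partition keys; queries
        # that filter on them can skip the full partition listing
        # during query planning.
        "Keys": ["year", "month"],
    },
}

# With boto3:
#   boto3.client("glue").create_partition_index(**partition_index_params)
```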
Data Classification
Data classification is about tagging and categorizing your data for better discovery and access control. Three approaches:
- By ownership: which team, business unit, or project owns the data.
- By sensitivity: public, internal, confidential, restricted. Drives access control decisions.
- By stage: raw, cleansed, processed, sandbox, production. Useful for tracking data through your pipeline.
AWS Lake Formation lets you tag data at the database, table, and column level. These tags drive fine-grained access control. Powerful feature that shows up on the exam.
Managing the Lifecycle of Data
Data doesn’t stay hot forever. What gets queried 100 times a day today will be touched once a month next year. Managing this lifecycle is how you keep storage costs under control.
Hot vs Cold Data
Hot data is accessed frequently and needs fast retrieval. Transactional records, streaming data, real-time analytics. Store it in memory (ElastiCache), on block storage (EBS), or in S3 Standard.
Cold data is accessed rarely. Historical logs, compliance records, archived transactions. Store it on S3 Glacier or S3 Glacier Deep Archive.
Key factors for choosing storage:
- What analytics engine are you using?
- What latency can you tolerate?
- How often is the data accessed?
- What’s your budget per GB?
- Can the data be re-created if lost?
The book includes a good example: a petabyte-scale log analytics system on OpenSearch with four storage tiers. Hot storage for the last 2-7 days (fast indexing and queries). UltraWarm for 1 week to 2 months (read-only, cheaper). Cold storage for 2 months to 1 year (very cheap, slower). Direct query to S3 for anything older than 1 year (archive cost, on-demand access).
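The tiering logic from that example boils down to a simple age-to-tier mapping. A sketch as a plain function (the day boundaries mirror the book's example; they're a design choice, not fixed limits):

```python
# Sketch of the log-retention tiering from the example above.
# Boundaries (7, 60, 365 days) mirror the book's example; they are
# design choices, not fixed limits.
def storage_tier(age_days: int) -> str:
    if age_days <= 7:
        return "hot"             # fast indexing and queries
    if age_days <= 60:
        return "ultrawarm"       # read-only, cheaper
    if age_days <= 365:
        return "cold"            # very cheap, slower
    return "s3-direct-query"     # archive cost, on-demand access

assert storage_tier(3) == "hot"
assert storage_tier(400) == "s3-direct-query"
```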
Retention Policies and Archiving
Best practices:
- Start from business and compliance requirements. Legal requirements might say “keep financial records for 7 years.” That dictates your archiving strategy.
- Automate everything. Use S3 lifecycle policies, DynamoDB TTL, and other automation tools. Manual data management doesn’t scale.
- Review regularly. Access patterns change. Regulations change. Review your retention policies periodically.
COPY and UNLOAD: Moving Data Between S3 and Redshift
In a lakehouse architecture, data moves between S3 and Redshift frequently:
- COPY command loads data from S3 into Redshift. Supports multiple formats and can apply transformations during load.
- UNLOAD command exports data from Redshift to S3. Runs in parallel, supports compression, lets you pick output formats.
Three common patterns:
- ETL with Glue/EMR to prepare data in S3, then COPY into Redshift.
- Load raw data directly into Redshift and transform there (ELT pattern).
- UNLOAD old, rarely queried data from Redshift back to S3 for cheaper storage. Still queryable via Athena or Redshift Spectrum.
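For reference, the two commands look roughly like this in Redshift SQL, shown here as Python strings (the IAM role ARNs, table, and bucket names are made-up examples):

```python
# Hedged sketch of Redshift COPY and UNLOAD.
# IAM role ARNs, table, and bucket names are made-up examples.

# Patterns 1 and 2: load prepared (or raw) S3 data into Redshift.
copy_sql = """
COPY analytics.sales
FROM 's3://example-data-lake/processed/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
"""

# Pattern 3: push old, rarely queried rows back to cheap S3 storage,
# still queryable via Athena or Redshift Spectrum.
unload_sql = """
UNLOAD ('SELECT * FROM analytics.sales WHERE sale_date < ''2020-01-01''')
TO 's3://example-data-lake/archive/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARALLEL ON;
"""
```

Note the doubled single quotes inside UNLOAD: the query travels as a string literal, so embedded quotes must be escaped.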
Optimizing with Amazon S3
Dense section with S3 storage class details. If you’re preparing for the exam, memorize this table.
S3 Storage Classes
| Storage Class | Access Pattern | Retrieval Speed | Availability | Notes |
|---|---|---|---|---|
| S3 Standard | Frequent | Milliseconds | 99.99% | Default class |
| S3 Express One Zone | Ultra-frequent, latency-sensitive | Milliseconds (10x faster) | 99.5% | Single AZ, higher storage cost, 80% lower request cost |
| S3 Standard-IA | Monthly access | Milliseconds | 99.9% | Lower storage cost, retrieval fee |
| S3 One Zone-IA | Monthly access, re-creatable data | Milliseconds | 99.5% | Single AZ, cheapest IA option |
| S3 Glacier Instant Retrieval | Rare, needs fast access | Milliseconds | 99.99% | Archival with instant access |
| S3 Glacier Flexible Retrieval | Rare, tolerates delay | Minutes to hours | 99.99% | Cheaper archival |
| S3 Glacier Deep Archive | Very rare | Hours (up to 12) | 99.99% | Cheapest option |
| S3 Intelligent-Tiering | Unknown/changing patterns | Milliseconds to minutes | 99.9% | Automatic tier movement |
All S3 storage classes have 99.999999999% (11 nines) durability. The differences are in availability, retrieval speed, and cost.
Two key factors:
- Retrieval speed. Milliseconds, minutes, or hours?
- Re-creation ability. Can you regenerate this data if lost? One Zone classes save money if yes.
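The two-factor decision can be sketched as a function. The thresholds and class picks below are illustrative judgment calls, not AWS rules:

```python
# Sketch of the storage-class decision above.
# Thresholds and class picks are illustrative, not AWS rules.
def pick_storage_class(accesses_per_month: float,
                       needs_millisecond_retrieval: bool,
                       recreatable: bool) -> str:
    if accesses_per_month >= 30:
        return "STANDARD"
    if accesses_per_month >= 1:
        # Monthly-ish access: One Zone saves money only if the
        # data can be regenerated after an AZ loss.
        return "ONEZONE_IA" if recreatable else "STANDARD_IA"
    # Rarely accessed: pick the archive tier by retrieval tolerance.
    if needs_millisecond_retrieval:
        return "GLACIER_IR"
    return "DEEP_ARCHIVE"

assert pick_storage_class(0.1, False, False) == "DEEP_ARCHIVE"
assert pick_storage_class(2, True, True) == "ONEZONE_IA"
```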
S3 Intelligent-Tiering
If you don’t know your access patterns (and honestly, you often don’t), Intelligent-Tiering is the pragmatic choice. It monitors access patterns and moves objects automatically:
- Frequent Access tier: default starting point.
- Infrequent Access tier: after 30 days without access.
- Archive Instant Access tier: after 90 days without access.
- Archive Access tier (optional): minimum 90 days, configurable up to 730 days.
- Deep Archive Access tier (optional): minimum 180 days, configurable up to 730 days.
You enable the archive tiers with a configuration on the bucket or prefix:
```json
{
  "Id": "ExampleConfig",
  "Status": "Enabled",
  "Filter": {
    "Prefix": "images"
  },
  "Tierings": [
    {
      "Days": 90,
      "AccessTier": "ARCHIVE_ACCESS"
    },
    {
      "Days": 180,
      "AccessTier": "DEEP_ARCHIVE_ACCESS"
    }
  ]
}
```
For data lake prefixes where access patterns vary, Intelligent-Tiering is almost always the right call. The monitoring cost per object is tiny. The savings from automatic tiering can be significant, especially on large datasets where nobody remembers to set lifecycle rules.
S3 Lifecycle Policies
When you know your access patterns, lifecycle policies give you explicit control. Define rules that transition objects between storage classes and delete them on schedule.
Use lifecycle policies when you have well-defined retention periods. Use Intelligent-Tiering when access patterns are unpredictable.
Example lifecycle configuration:
```xml
<LifecycleConfiguration>
  <Rule>
    <ID>example-id</ID>
    <Filter>
      <Prefix>logs/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
```
Moves objects to Standard-IA after 30 days, Glacier after 90 days, deletes after 365 days. Simple and effective.
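The same rule in the dict shape that boto3's put_bucket_lifecycle_configuration expects (a sketch; the bucket name is a placeholder):

```python
# The lifecycle rule above, restated as the dict shape boto3's
# put_bucket_lifecycle_configuration expects (a sketch).
lifecycle_config = {
    "Rules": [
        {
            "ID": "example-id",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="your-bucket-name",
#       LifecycleConfiguration=lifecycle_config,
#   )
```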
Snapshot Expiration for Table Formats
One thing people miss: if you use Iceberg, Hudi, or Delta Lake on S3, regular lifecycle rules aren’t enough. Table formats create snapshots that consist of interdependent files. You can’t just delete individual files without breaking things.
Each format has its own cleanup mechanism:
- Iceberg: `expire_snapshots` procedure
- Hudi: cleaning service
- Delta Lake: `VACUUM` command
Run these regularly or your snapshot storage will grow without limit. Standard S3 lifecycle rules don’t understand snapshot dependencies and will either miss files or break your tables.
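On Spark, for example, these cleanups are one statement each. A sketch as Python strings (the catalog, database, and table names are made-up, and the exact catalog prefix depends on your Spark session configuration):

```python
# Hedged sketch: snapshot cleanup statements for table formats on Spark.
# Catalog, database, and table names are made-up examples; the catalog
# prefix depends on your Spark session configuration.

# Iceberg: drop snapshots older than 7 days, keep at least the last 5.
iceberg_expire = """
CALL glue_catalog.system.expire_snapshots(
    table => 'analytics.page_views',
    older_than => now() - interval '7' day,
    retain_last => 5
)
"""

# Delta Lake: remove files no longer referenced by the table,
# keeping the default 7-day (168-hour) safety window.
delta_vacuum = "VACUUM analytics.page_views RETAIN 168 HOURS"

# With a SparkSession these would run as spark.sql(iceberg_expire), etc.
```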
Monitoring S3 Costs
Three tools:
S3 Storage Lens gives you a dashboard with 60+ metrics across all your buckets. Find your largest buckets, identify cold buckets that should be moved to cheaper storage, spot buckets without lifecycle rules. Free at the basic level, publishes metrics to CloudWatch.
Storage Class Analysis looks at access patterns and recommends which objects should move to a cheaper storage class. Run it periodically to catch data that’s gone cold.
AWS Cost Explorer breaks down your S3 bill by storage, requests, and data transfer. Use cost allocation tags to track spending by project or team. Set up budgets and forecasts to avoid surprises.
Set up Storage Lens on day one. It’s free and it’ll show you exactly where your money is going. Most teams are surprised by what they find.
DynamoDB to S3 Archiving
Common pattern for time-series data: clickstream events or transactions in DynamoDB, need to archive old items to S3 for compliance.
The architecture uses four components:
- DynamoDB TTL automatically deletes items after a timestamp you set. Pick an attribute that holds the expiration time, DynamoDB handles the rest.
- DynamoDB Streams captures the deletion events and sends them to Kinesis Data Streams.
- Lambda function on Amazon Data Firehose filters for TTL-deleted items specifically (they have `userIdentity.principalId: "dynamodb.amazonaws.com"`).
- Amazon Data Firehose delivers the filtered items to S3 in JSON format.
Items expire from DynamoDB automatically, but a copy ends up in S3 for long-term retention. Cost savings from removing old DynamoDB items plus compliance benefit of keeping the data in cheap S3 storage.
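The Lambda filter in that pipeline is only a few lines: TTL deletions are REMOVE events whose userIdentity principal is the DynamoDB service. A sketch of the record-level check (the record shape follows the DynamoDB Streams event format; this is an illustration, not the book's exact code):

```python
# Sketch of the Firehose-transform filter for TTL-deleted items.
# Record shape follows the DynamoDB Streams event format; this is
# an illustration, not the book's exact Lambda code.
def is_ttl_deletion(record: dict) -> bool:
    user_identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and user_identity.get("type") == "Service"
        and user_identity.get("principalId") == "dynamodb.amazonaws.com"
    )

# A TTL expiry event passes the filter...
ttl_event = {
    "eventName": "REMOVE",
    "userIdentity": {"type": "Service",
                     "principalId": "dynamodb.amazonaws.com"},
}
assert is_ttl_deletion(ttl_event)

# ...a user-issued delete (no service identity) does not.
assert not is_ttl_deletion({"eventName": "REMOVE"})
```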
S3 Versioning for Data Resiliency
Versioning keeps every version of every object in your bucket. When you overwrite or delete an object, S3 creates a new version instead of removing the old one. Protects against accidental deletions and overwrites.
Key points:
- Enable versioning with a single command: `aws s3api put-bucket-versioning --bucket your-bucket-name --versioning-configuration Status=Enabled`
- Once enabled, versioning can’t be disabled, only suspended. Suspending stops new versions from being created but keeps existing ones.
- Every PUT, POST, or COPY creates a new version; a DELETE adds a delete marker instead of removing the object.
- MFA Delete adds an extra authentication layer for version deletions and versioning state changes.
- Versioning is important for GDPR and HIPAA compliance, where you need to prove data integrity and maintain audit trails.
Versioning + Lifecycle Policies
Versioning without lifecycle management will blow up your storage costs. Old versions accumulate fast. Combine versioning with lifecycle policies:
```xml
<LifecycleConfiguration>
  <Rule>
    <ID>TransitionNonCurrentVersions</ID>
    <Status>Enabled</Status>
    <NoncurrentVersionTransition>
      <NoncurrentDays>30</NoncurrentDays>
      <StorageClass>GLACIER</StorageClass>
    </NoncurrentVersionTransition>
    <NoncurrentVersionExpiration>
      <NoncurrentDays>365</NoncurrentDays>
    </NoncurrentVersionExpiration>
  </Rule>
</LifecycleConfiguration>
```
Moves old versions to Glacier after 30 days, deletes after 365 days. You keep the safety net of versioning without paying S3 Standard prices for data you’ll probably never access.
Key Takeaways
- S3 is the center of the data lake. Know the storage classes and when to use each one.
- Parquet is the default file format for analytics. Use it unless you have a specific reason not to.
- Table formats (Iceberg, Hudi, Delta Lake) bring database features to data lakes. Know what each one does. Iceberg is the strongest choice for new AWS projects.
- The Glue Data Catalog is your metadata registry. Crawlers are the primary way to populate it. Use incremental crawls for frequently changing data.
- Lifecycle management is about matching storage cost to access patterns. Hot data goes in fast, expensive storage. Cold data goes in cheap, slow storage. Automate the transitions.
- Versioning protects your data. Always combine it with lifecycle policies for noncurrent versions.
The exam will test your ability to pick the right storage class, the right file format, and the right lifecycle policy for a given scenario. Understand the tradeoffs, not just the feature lists.