Choosing Data Stores, Storage Formats, and Lifecycle Management on AWS



Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

Chapter 5 is where the book gets into data store management. Domain 2 territory on the exam, and a big one. How do you pick the right storage? What file format should you use? How do you keep your S3 bill from growing out of control?

I split Chapter 5 into two parts because there’s a lot of material. This first part covers choosing data stores, storage formats, data cataloging, and lifecycle management. Second part handles data modeling and schema evolution.

Choosing a Data Store

The book starts with a breakdown of AWS storage options into two buckets: core storage services and managed databases.

Core Storage Services

Three types of storage on AWS:

Block storage (Amazon EBS) is like attaching a hard drive to your EC2 instance. Ultra-low latency, which is what databases and ERP systems need. EBS volumes are replicated within an Availability Zone, and you can resize them without detaching.

File storage (Amazon EFS, Amazon FSx) is shared file access over a network. EFS uses NFS and works across multiple AZs. FSx has specialized flavors: Lustre for HPC and ML, Windows File Server for Microsoft shops, NetApp ONTAP for hybrid setups. If your app needs a shared filesystem, this is the category.

Object storage (Amazon S3) is the big one for data engineers. Cheapest per GB, scales to basically unlimited size, handles any data type. S3 is the default storage layer for data lakes on AWS. Every analytics service integrates with it. The tradeoff: S3 is built for throughput, not low-latency random access. If you need high IOPS and small file updates, EBS or FSx is better.

In practice, as a data engineer, you’ll mostly work with S3. EBS and EFS are usually abstracted behind managed services like RDS or handled by a separate infrastructure team.

AWS Cloud Databases

Six database types you should know for the exam:

| Database Type | Data Type | AWS Service | Common Use Cases |
| --- | --- | --- | --- |
| Relational | Structured with schemas | Aurora, RDS, Redshift | ERP, CRM, BI |
| Key-value | Key-value pairs | DynamoDB | Gaming, IoT, session management |
| Document | Semi-structured (JSON/BSON) | DocumentDB | Content management, user profiles |
| In-memory | Key-value, semi-structured | ElastiCache, MemoryDB | Caching, leaderboards, real-time analytics |
| Graph | Nodes, edges, properties | Neptune | Social networks, fraud detection |
| Search engine | Semi-structured, free text | OpenSearch | Log analytics, ecommerce search |

The key decision point: OLTP vs OLAP. This trips people up. Aurora and RDS are OLTP – high-concurrency transactions with normalized schemas. Redshift is OLAP – complex analytical queries over large datasets with denormalized schemas. All relational, completely different purposes. On the exam, if the question mentions “analytical queries” or “business intelligence,” the answer is Redshift. “Transactional workload” or “high concurrency” means Aurora or RDS.
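The contrast is easiest to see side by side. A sketch in SQL (table and column names are made up for illustration):

```sql
-- OLTP (Aurora/RDS): point lookup on a normalized table,
-- executed thousands of times per second with high concurrency
SELECT status FROM orders WHERE order_id = 18423;

-- OLAP (Redshift): aggregate scan over millions of rows for a BI dashboard
SELECT region,
       DATE_TRUNC('month', order_date) AS month,
       SUM(amount) AS revenue
FROM sales_fact
GROUP BY region, month
ORDER BY revenue DESC;
```

The first query touches one row by key; the second touches every row in the table. Different access patterns, different engines.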

Data Storage Formats for Data Lakes

Choosing the right file format for your data lake matters more than most people think. Wrong format can make your queries 10x slower and your storage costs 3x higher.

Row-Based Formats

Row-based formats store all fields of a record together. Good for reading entire rows, bad for analytical queries that only need a few columns.

  • CSV is simple, human-readable, widely supported. Good for initial data ingestion and small datasets. Bad for anything at scale – no schema, no compression, no type safety.
  • JSON is flexible, handles nested data, used everywhere in APIs. Same problems as CSV at scale: verbose, no compression by default, slow to parse.
  • Avro is binary, supports schema evolution, popular in streaming pipelines (especially Kafka). Row-based but much better than CSV/JSON for serialization.

Column-Based Formats

Column-based formats store values of each column together. This is what you want for analytics.

  • Parquet is the default choice for most data lake workloads. Handles nested data well, compresses aggressively, every analytics engine supports it. If you’re not sure what format to use, use Parquet. You won’t regret it.
  • ORC (Optimized Row Columnar) was built for the Hadoop ecosystem. Performs well, but Parquet has won the popularity contest. ORC is still common in Hive-heavy environments. Starting fresh? Go Parquet.

I’ve seen teams waste weeks arguing about ORC vs Parquet. In 2026, Parquet is the standard. Every major tool supports it natively. ORC is fine if you already have it, but there’s no reason to start new projects with it.
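The columnar advantage is easy to demonstrate in miniature. This toy Python sketch (not a real file format, just the layout idea) stores the same records row-wise and column-wise; the analytical query only has to touch one list in the columnar layout, while the row layout forces a scan over every field of every record:

```python
# Toy illustration of row vs. columnar layout (not a real file format).
rows = [
    {"user": "a", "country": "DE", "amount": 10.0},
    {"user": "b", "country": "US", "amount": 25.5},
    {"user": "c", "country": "DE", "amount": 7.25},
]

# Columnar layout: one contiguous list per column.
columns = {
    "user": [r["user"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Analytical query: SUM(amount). The row layout walks every record and
# every field; the columnar layout reads exactly one column.
total_row_scan = sum(r["amount"] for r in rows)
total_col_scan = sum(columns["amount"])
print(total_row_scan, total_col_scan)  # same answer, very different I/O
```

At data-lake scale this difference is compounded by per-column compression, which is why Parquet and ORC dominate analytics.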

Table Formats

Table formats are the layer on top of file formats that brings database features to data lakes.

  • Apache Iceberg brings ACID transactions, schema evolution, time travel, and partition evolution to your data lake. Works with Spark, Trino, Flink, and most AWS services. Iceberg is getting the most momentum right now, and AWS has been investing heavily in it.
  • Apache Hudi focuses on incremental processing and upserts. Good for CDC use cases where you need to merge updates into existing data. Hudi was born at Uber for exactly this kind of workload.
  • Delta Lake was built on top of Spark by Databricks. Provides ACID transactions and scalable metadata. If you’re in the Databricks ecosystem, Delta Lake is the natural choice.

The Iceberg hype is real, and it’s also justified. Iceberg has the broadest engine support and the most active community. For new data lake projects on AWS, Iceberg is the safe bet. Hudi still has advantages for streaming upsert workloads. Delta Lake makes sense if you’re all-in on Databricks. The exam will test you on knowing what each one does, not on picking favorites.

Building a Data Strategy with Multiple Data Stores

Two important architectural patterns:

Lakehouse Architecture

A lakehouse combines the scalability of a data lake (S3) with the query performance of a data warehouse (Redshift). Store everything in S3, use table formats for structure, connect Redshift for high-performance queries. Redshift Spectrum lets you query S3 data directly from Redshift using standard SQL.

The book breaks the lakehouse into five logical layers: ingestion, storage, cataloging, processing, and consumption. On AWS, S3 is the storage layer and Redshift is the warehouse layer. Data moves between them using COPY (load into Redshift) and UNLOAD (export to S3) commands.

Federated Queries

Federated queries let you query data where it lives without moving it. Amazon Athena Federated Query can run SQL across S3, RDS, Redshift, DynamoDB, and third-party sources in a single query. No ETL needed.

The tradeoff: federated queries put compute load on the source systems. Great for ad-hoc analysis and real-time queries, but don’t scale as well as centralized lakehouses for heavy analytical workloads. Use federated queries when you need fresh data from multiple sources. Use a lakehouse when you need repeatable, high-performance analytics.

Data Cataloging

A data catalog is a registry of what data you have, where it lives, and what it looks like. Without it, your data lake becomes a data swamp.

Technical vs Business Metadata

Two types:

Technical metadata includes schemas, column types, partition layouts, data lineage, and source information. What engineers need.

Business metadata includes data ownership, business definitions, usage policies, and quality scores. What business users need.

On AWS, the Glue Data Catalog handles technical metadata. Amazon DataZone handles business metadata cataloging (though it’s not on the exam yet since it launched recently).

Populating the Glue Data Catalog

Four ways to get metadata in:

  1. Glue Crawlers are the most common approach. A crawler connects to a data source (S3, DynamoDB, JDBC databases, MongoDB), reads the data, infers the schema, and writes table definitions into the catalog. Schedule crawlers to run periodically or trigger them via API.

  2. Manual definition gives you full control. Define every table and column yourself. Good for proof-of-concept work or unsupported formats. Bad for anything at scale.

  3. Integration with other AWS services. Create tables using Athena DDL statements (CREATE TABLE) and the metadata goes straight to the Glue Data Catalog. The MSCK REPAIR TABLE command in Athena loads Hive-style partitions.

  4. Migration from Hive Metastore. Already have a Hive metastore? Migrate it to the Glue Data Catalog using Glue ETL jobs.
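For option 3, the Athena DDL looks roughly like this (bucket, table, and column names are illustrative placeholders):

```sql
-- Registers the table definition in the Glue Data Catalog
CREATE EXTERNAL TABLE clickstream_events (
    event_id   string,
    user_id    string,
    event_time timestamp
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-data-lake/raw/clickstream/';

-- Loads Hive-style partitions (s3://.../dt=2024-01-01/...) into the catalog
MSCK REPAIR TABLE clickstream_events;
```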

Data Catalog Best Practices

The ones that matter most:

  • Consistent naming conventions. Use prefixes for environments (dev, prod) and data stages (raw, processed). Sounds obvious but many teams skip it and then spend months cleaning up.
  • Security. Use IAM policies for fine-grained access control. Encrypt metadata at rest and in transit. Audit access logs.
  • Schema change management. Schedule regular crawler runs or use trigger-based crawls. Glue supports schema versioning, so you can track changes over time and roll back if needed. Use incremental crawls for frequently changing sources to save time and money.
  • Column statistics. Glue can compute column-level stats (min, max, null count, distinct count). Athena and Redshift use these stats to build better query plans. Quick win for performance.
  • Partition indexes. If your tables have many partitions, create partition indexes in the Glue Data Catalog. Speeds up partition lookups during query planning.

Data Classification

Data classification is about tagging and categorizing your data for better discovery and access control. Three approaches:

  • By ownership: which team, business unit, or project owns the data.
  • By sensitivity: public, internal, confidential, restricted. Drives access control decisions.
  • By stage: raw, cleansed, processed, sandbox, production. Useful for tracking data through your pipeline.

AWS Lake Formation lets you tag data at the database, table, and column level. These tags drive fine-grained access control. Powerful feature that shows up on the exam.

Managing the Lifecycle of Data

Data doesn’t stay hot forever. What gets queried 100 times a day today will be touched once a month next year. Managing this lifecycle is how you keep storage costs under control.

Hot vs Cold Data

Hot data is accessed frequently and needs fast retrieval. Transactional records, streaming data, real-time analytics. Store it in memory (ElastiCache), on block storage (EBS), or in S3 Standard.

Cold data is accessed rarely. Historical logs, compliance records, archived transactions. Store it in S3 Glacier or S3 Glacier Deep Archive.

Key factors for choosing storage:

  • What analytics engine are you using?
  • What latency can you tolerate?
  • How often is the data accessed?
  • What’s your budget per GB?
  • Can the data be re-created if lost?

The book includes a good example: a petabyte-scale log analytics system on OpenSearch with four storage tiers. Hot storage for the last 2-7 days (fast indexing and queries). UltraWarm for 1 week to 2 months (read-only, cheaper). Cold storage for 2 months to 1 year (very cheap, slower). Direct query to S3 for anything older than 1 year (archive cost, on-demand access).
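The tiering logic in that example boils down to an age-based lookup. A minimal sketch (the thresholds follow the example above, rounded to day counts; the function itself is mine, not the book's):

```python
def opensearch_tier(age_days: int) -> str:
    """Map log age to a storage tier, following the four-tier example above."""
    if age_days <= 7:
        return "hot"             # fast indexing and queries (last 2-7 days)
    if age_days <= 60:
        return "ultrawarm"       # read-only, cheaper (1 week to 2 months)
    if age_days <= 365:
        return "cold"            # very cheap, slower (2 months to 1 year)
    return "s3-direct-query"     # archive cost, on-demand access (> 1 year)

print(opensearch_tier(3), opensearch_tier(30), opensearch_tier(200), opensearch_tier(900))
```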

Retention Policies and Archiving

Best practices:

  1. Start from business and compliance requirements. Legal requirements might say “keep financial records for 7 years.” That dictates your archiving strategy.
  2. Automate everything. Use S3 lifecycle policies, DynamoDB TTL, and other automation tools. Manual data management doesn’t scale.
  3. Review regularly. Access patterns change. Regulations change. Review your retention policies periodically.

COPY and UNLOAD: Moving Data Between S3 and Redshift

In a lakehouse architecture, data moves between S3 and Redshift frequently:

  • COPY command loads data from S3 into Redshift. Supports multiple formats and can apply transformations during load.
  • UNLOAD command exports data from Redshift to S3. Runs in parallel, supports compression, lets you pick output formats.

Three common patterns:

  1. ETL with Glue/EMR to prepare data in S3, then COPY into Redshift.
  2. Load raw data directly into Redshift and transform there (ELT pattern).
  3. UNLOAD old, rarely queried data from Redshift back to S3 for cheaper storage. Still queryable via Athena or Redshift Spectrum.
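In Redshift SQL the two commands look roughly like this (bucket paths, table names, and IAM role ARNs are placeholders):

```sql
-- Pattern 1: load prepared Parquet files from S3 into Redshift
COPY sales_fact
FROM 's3://example-lakehouse/processed/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;

-- Pattern 3: export rarely queried history back to S3, in parallel
UNLOAD ('SELECT * FROM sales_fact WHERE sale_date < ''2020-01-01''')
TO 's3://example-lakehouse/archive/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
```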

Optimizing with Amazon S3

Dense section with S3 storage class details. If you’re preparing for the exam, memorize this table.

S3 Storage Classes

| Storage Class | Access Pattern | Retrieval Speed | Availability | Notes |
| --- | --- | --- | --- | --- |
| S3 Standard | Frequent | Milliseconds | 99.99% | Default class |
| S3 Express One Zone | Ultra-frequent, latency-sensitive | Milliseconds (10x faster) | 99.5% | Single AZ, higher storage cost, 80% lower request cost |
| S3 Standard-IA | Monthly access | Milliseconds | 99.9% | Lower storage cost, retrieval fee |
| S3 One Zone-IA | Monthly access, re-creatable data | Milliseconds | 99.5% | Single AZ, cheapest IA option |
| S3 Glacier Instant Retrieval | Rare, needs fast access | Milliseconds | 99.99% | Archival with instant access |
| S3 Glacier Flexible Retrieval | Rare, tolerates delay | Minutes to hours | 99.99% | Cheaper archival |
| S3 Glacier Deep Archive | Very rare | Hours (up to 12) | 99.99% | Cheapest option |
| S3 Intelligent-Tiering | Unknown/changing patterns | Milliseconds to minutes | 99.9% | Automatic tier movement |

All S3 storage classes have 99.999999999% (11 nines) durability. The differences are in availability, retrieval speed, and cost.

Two key factors:

  1. Retrieval speed. Milliseconds, minutes, or hours?
  2. Re-creation ability. Can you regenerate this data if lost? One Zone classes save money if yes.

S3 Intelligent-Tiering

If you don’t know your access patterns (and honestly, you often don’t), Intelligent-Tiering is the pragmatic choice. It monitors access patterns and moves objects automatically:

  • Frequent Access tier: default starting point.
  • Infrequent Access tier: after 30 days without access.
  • Archive Instant Access tier: after 90 days without access.
  • Archive Access tier (optional): minimum 90 days, configurable up to 730 days.
  • Deep Archive Access tier (optional): minimum 180 days, configurable up to 730 days.

You enable the archive tiers with a configuration on the bucket or prefix:

{
    "Id": "ExampleConfig",
    "Status": "Enabled",
    "Filter": {
        "Prefix": "images"
    },
    "Tierings": [
        {
            "Days": 90,
            "AccessTier": "ARCHIVE_ACCESS"
        },
        {
            "Days": 180,
            "AccessTier": "DEEP_ARCHIVE_ACCESS"
        }
    ]
}

For data lake prefixes where access patterns vary, Intelligent-Tiering is almost always the right call. The monitoring cost per object is tiny. The savings from automatic tiering can be significant, especially on large datasets where nobody remembers to set lifecycle rules.
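A back-of-the-envelope check on "the monitoring cost is tiny." The prices below are illustrative approximations of us-east-1 list prices at the time of writing, not authoritative numbers; plug in your own:

```python
# Rough Intelligent-Tiering cost sanity check. Prices are illustrative
# approximations of us-east-1 list prices, not authoritative.
objects = 1_000_000                  # 1M monitored objects
gb_total = 10_000                    # 10 TB stored
monitoring_per_1k_objects = 0.0025   # $/month per 1,000 monitored objects
standard_per_gb = 0.023              # S3 Standard, $/GB-month
ia_tier_per_gb = 0.0125              # Infrequent Access tier, $/GB-month

monitoring_cost = objects / 1_000 * monitoring_per_1k_objects
# Suppose half the data goes cold and drops to the Infrequent Access tier:
savings = gb_total * 0.5 * (standard_per_gb - ia_tier_per_gb)

print(f"monitoring: ${monitoring_cost:.2f}/month, savings: ${savings:.2f}/month")
```

Even with only half the data going cold, the automatic savings dwarf the monitoring fee by an order of magnitude in this scenario.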

S3 Lifecycle Policies

When you know your access patterns, lifecycle policies give you explicit control. Define rules that transition objects between storage classes and delete them on schedule.

Use lifecycle policies when you have well-defined retention periods. Use Intelligent-Tiering when access patterns are unpredictable.

Example lifecycle configuration:

<LifecycleConfiguration>
  <Rule>
    <ID>example-id</ID>
    <Filter>
       <Prefix>logs/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>

Moves objects to Standard-IA after 30 days, Glacier after 90 days, deletes after 365 days. Simple and effective.

Snapshot Expiration for Table Formats

One thing people miss: if you use Iceberg, Hudi, or Delta Lake on S3, regular lifecycle rules aren’t enough. Table formats create snapshots that consist of interdependent files. You can’t just delete individual files without breaking things.

Each format has its own cleanup mechanism:

  • Iceberg: expire_snapshots procedure
  • Hudi: cleaning service
  • Delta Lake: VACUUM command

Run these regularly or your snapshot storage will grow without limit. Standard S3 lifecycle rules don’t understand snapshot dependencies and will either miss files or break your tables.
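For Iceberg and Delta Lake, the cleanup calls look like this in Spark SQL (catalog, table, and retention values are placeholders; Hudi's cleaning service is typically configured through table properties rather than a statement):

```sql
-- Iceberg: expire snapshots older than a cutoff, keeping at least the last 10
CALL my_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 10
);

-- Delta Lake: remove files no longer referenced by the table,
-- older than the retention window (in hours)
VACUUM db.events RETAIN 168 HOURS;
```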

Monitoring S3 Costs

Three tools:

S3 Storage Lens gives you a dashboard with 60+ metrics across all your buckets. Find your largest buckets, identify cold buckets that should be moved to cheaper storage, spot buckets without lifecycle rules. Free at the basic level, publishes metrics to CloudWatch.

Storage Class Analysis looks at access patterns and recommends which objects should move to a cheaper storage class. Run it periodically to catch data that’s gone cold.

AWS Cost Explorer breaks down your S3 bill by storage, requests, and data transfer. Use cost allocation tags to track spending by project or team. Set up budgets and forecasts to avoid surprises.

Set up Storage Lens on day one. It’s free and it’ll show you exactly where your money is going. Most teams are surprised by what they find.

DynamoDB to S3 Archiving

Common pattern for time-series data: clickstream events or transactions in DynamoDB, need to archive old items to S3 for compliance.

The architecture uses four components:

  1. DynamoDB TTL automatically deletes items after a timestamp you set. Pick an attribute that holds the expiration time, DynamoDB handles the rest.
  2. DynamoDB Streams captures the deletion events and sends them to Kinesis Data Streams.
  3. Lambda function on Amazon Data Firehose filters for TTL-deleted items specifically (they have userIdentity.principalId: "dynamodb.amazonaws.com").
  4. Amazon Data Firehose delivers the filtered items to S3 in JSON format.

Items expire from DynamoDB automatically, but a copy ends up in S3 for long-term retention. Cost savings from removing old DynamoDB items plus compliance benefit of keeping the data in cheap S3 storage.
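The filtering Lambda in step 3 can be sketched as a minimal Firehose transformation handler. The TTL check on `userIdentity.principalId` is the one described above; everything else (field handling, record shape) follows the standard Firehose transformation contract, and the event values in any real deployment will differ:

```python
import base64
import json

TTL_PRINCIPAL = "dynamodb.amazonaws.com"

def handler(event, context):
    """Keep only records deleted by DynamoDB TTL; drop everything else.

    Minimal sketch of an Amazon Data Firehose transformation Lambda.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        user_identity = payload.get("userIdentity") or {}
        is_ttl_delete = (
            payload.get("eventName") == "REMOVE"
            and user_identity.get("principalId") == TTL_PRINCIPAL
        )
        output.append({
            "recordId": record["recordId"],
            "result": "Ok" if is_ttl_delete else "Dropped",
            "data": record["data"],  # pass the payload through unchanged
        })
    return {"records": output}
```

Records marked `Dropped` never reach S3, so only TTL-expired items land in the archive.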

S3 Versioning for Data Resiliency

Versioning keeps every version of every object in your bucket. When you overwrite or delete an object, S3 creates a new version instead of removing the old one. Protects against accidental deletions and overwrites.

Key points:

  • Enable versioning with a single command: aws s3api put-bucket-versioning --bucket your-bucket-name --versioning-configuration Status=Enabled
  • Once enabled, versioning can’t be disabled, only suspended. Suspending stops new versions from being created but keeps existing ones.
  • Every PUT, POST, or COPY creates a new version; a DELETE adds a delete marker rather than removing the data.
  • MFA Delete adds an extra authentication layer for version deletions and versioning state changes.
  • Versioning is important for GDPR and HIPAA compliance, where you need to prove data integrity and maintain audit trails.

Versioning + Lifecycle Policies

Versioning without lifecycle management will blow up your storage costs. Old versions accumulate fast. Combine versioning with lifecycle policies:

<LifecycleConfiguration>
  <Rule>
    <ID>TransitionNonCurrentVersions</ID>
    <Status>Enabled</Status>
    <NoncurrentVersionTransition>
      <NoncurrentDays>30</NoncurrentDays>
      <StorageClass>GLACIER</StorageClass>
    </NoncurrentVersionTransition>
    <NoncurrentVersionExpiration>
      <NoncurrentDays>365</NoncurrentDays>
    </NoncurrentVersionExpiration>
  </Rule>
</LifecycleConfiguration>

Moves old versions to Glacier after 30 days, deletes after 365 days. You keep the safety net of versioning without paying S3 Standard prices for data you’ll probably never access.

Key Takeaways

  1. S3 is the center of the data lake. Know the storage classes and when to use each one.
  2. Parquet is the default file format for analytics. Use it unless you have a specific reason not to.
  3. Table formats (Iceberg, Hudi, Delta Lake) bring database features to data lakes. Know what each one does. Iceberg is the strongest choice for new AWS projects.
  4. The Glue Data Catalog is your metadata registry. Crawlers are the primary way to populate it. Use incremental crawls for frequently changing data.
  5. Lifecycle management is about matching storage cost to access patterns. Hot data goes in fast, expensive storage. Cold data goes in cheap, slow storage. Automate the transitions.
  6. Versioning protects your data. Always combine it with lifecycle policies for noncurrent versions.

The exam will test your ability to pick the right storage class, the right file format, and the right lifecycle policy for a given scenario. Understand the tradeoffs, not just the feature lists.




denis256 at denis256.dev