Data Modeling for Redshift, DynamoDB, and Data Lakes on AWS
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Data modeling is one of those topics that sounds academic until you get it wrong in production. Then it becomes very real, very fast. This section of Chapter 5 covers data modeling strategies for three different AWS services: Amazon Redshift, Amazon DynamoDB, and S3 data lakes. Each has its own rules, trade-offs, and gotchas.
Introduction to Data Modeling
The book starts with three levels of data models. Nothing new if you’ve done any database work, but worth a quick recap:
Conceptual models are the high-level view. Entities and relationships. No technical details. What you show to business stakeholders so everyone agrees on what “customer” or “order” means.
Logical models add structure. Attributes, primary keys, foreign keys, relationships between entities. The blueprint that developers and data engineers work with. Still no physical storage decisions.
Physical models translate logical models into actual implementation. Distribution styles, sort keys, indexes, compression, partitioning. Performance lives or dies here.
Key takeaway: always work backward from data access patterns. Figure out how the data will be queried first, then design the model to support those queries. Sounds obvious, but I’ve seen teams design beautiful normalized schemas and then wonder why their analytics queries take 20 minutes.
Data Modeling for Amazon Redshift
Redshift is a columnar data warehouse. The modeling decisions you make here directly affect query speed and cost.
Schema Design Patterns
Three common patterns in relational systems:
Star schema is the go-to for data warehouses. One central fact table (sales, transactions, events) surrounded by dimension tables (dates, products, customers). Minimal joins. Fast reads. Redshift is built for this.
Snowflake schema takes the star schema and normalizes the dimension tables further. Your “product” dimension splits into product, category, and subcategory tables. Saves some storage, adds more joins. In Redshift, those extra joins usually cost more than the storage you save.
Third Normal Form (3NF) is great for transactional databases where you write a lot. Eliminates redundancy. The book explicitly calls 3NF an anti-pattern for Redshift though. Too many joins, too slow for analytical queries.
For Redshift, denormalization wins. The columnar storage format means Redshift only reads the columns you actually query. Combined with compression, a denormalized star schema performs extremely well. Less data read, fewer joins, faster queries.
Logical Modeling: The Kimball Approach
Four steps:
- Identify the business process. What activity are you tracking? Sales? Claims? Shipments? Each process gets its own fact table.
- Declare the grain. What does one row represent? Individual transactions? Daily summaries? Get this wrong and everything downstream is confused.
- Identify dimensions. Time, product, customer, location. The “by what” of your analysis.
- Identify facts. The numeric measurements. Sales amount, unit price, quantity, commission.
Not new stuff. Kimball has been around for decades. Works though. And for the exam, you need to know these steps.
Physical Modeling: Distribution Styles
Where Redshift gets interesting. When you create a table, you choose how data gets distributed across compute nodes. Pick wrong and your queries shuffle data between nodes constantly, killing performance.
Four distribution styles:
AUTO lets Redshift decide. The book recommends this as the default. Only override it if you understand your query patterns well. For most teams starting out, AUTO is the right call. Redshift will start with ALL for small tables and switch to EVEN or KEY as data grows.
EVEN distributes rows round-robin across all nodes. Good when a table doesn’t participate in joins. Data is spread evenly, no skew, but joins require data movement between nodes.
KEY distributes rows based on one column’s values. Rows with the same key land on the same node. Powerful when you join two large tables on the same column. If both tables use the same distribution key, the join happens locally on each node with zero data movement. Catch: pick a column with low cardinality and you get data skew. One node gets overloaded while others sit idle.
ALL copies the entire table to every node. Perfect for small dimension tables that rarely change. Joins with these tables never need data movement because the data is already everywhere. Don’t use ALL for large tables though. You’re multiplying storage by the number of nodes.
Use KEY distribution for your large fact tables, distribute on the column you join on most often. Use ALL for small dimension tables under a few million rows. Use AUTO if you’re not sure. Check SVV_TABLE_INFO to see what Redshift actually chose.
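The trade-off between co-location and skew is easy to see in a toy simulation. This Python sketch uses a hypothetical node count and Python's own hash function, not Redshift's actual internals, but the mechanics are the same: equal key values land on equal nodes.

```python
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size

def node_for(value, num_nodes=NUM_NODES):
    # Toy stand-in for hash distribution: rows with the same
    # distribution-key value always land on the same compute node.
    return hash(str(value)) % num_nodes

# Two large tables distributed on the same join column (customer_id):
orders = [{"customer_id": i % 1000, "order_id": i} for i in range(10_000)]
customers = [{"customer_id": i} for i in range(1000)]

# Co-location: each order sits on the same node as its matching customer,
# so the join runs locally with no data shuffled between nodes.
for o in orders[:100]:
    cust = customers[o["customer_id"]]
    assert node_for(o["customer_id"]) == node_for(cust["customer_id"])

# Skew: distributing on a low-cardinality column (country) piles almost
# all rows onto one or two nodes while the rest sit idle.
rows_by_node = Counter(node_for(c) for c in ["US"] * 9_000 + ["DE"] * 1_000)
print(dict(rows_by_node))
```

With only two distinct country values, at most two nodes ever receive data, which is exactly the skew problem described above.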
Sort Keys
Sort keys control the physical order of data on disk. A good sort key lets Redshift skip entire data blocks during queries.
Best practices:
- Let Redshift choose with SORTKEY AUTO if you’re unsure. It picks based on your actual access patterns.
- Analyze your queries. If you always filter by order_date, that should be your sort key.
- Focus on large tables first. Sort keys on a 100-row lookup table don’t matter. Sort keys on a billion-row fact table matter a lot.
- Use compound sort keys when queries filter on a consistent set of columns. The sort follows the column order you define. Works great when the leading column is always in the WHERE clause. Falls apart when queries only filter on the second or third column.
- Sort on join columns. If you set the join column as both the sort key and the distribution key, Redshift can use a merge join instead of a hash join. Merge joins are faster.
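The block-skipping effect works through zone maps: sorted data is stored in blocks, and each block carries min/max metadata for its columns. A sketch with made-up block sizes and data:

```python
# Toy zone maps: sorted data split into fixed-size blocks, each with
# (min, max) metadata. Sizes and values here are invented for illustration.
BLOCK_ROWS = 1000

dates = sorted(range(10_000))  # stand-in for a sorted order_date column
blocks = [dates[i:i + BLOCK_ROWS] for i in range(0, len(dates), BLOCK_ROWS)]
zone_map = [(b[0], b[-1]) for b in blocks]  # (min, max) per block

def blocks_to_read(lo, hi):
    # Only blocks whose [min, max] range overlaps the filter are scanned;
    # everything else is skipped without touching disk.
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]

print(blocks_to_read(2500, 3200))  # -> [2, 3]: 2 of 10 blocks read, 8 skipped
```

A range filter on the sort column touches 2 of 10 blocks; the same filter on an unsorted column would have matching rows scattered across every block.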
Additional Best Practices
Optimize column size. Use VARCHAR(50) instead of VARCHAR(255) when you know the data fits. Smaller columns mean less data read from disk. Adds up on billion-row tables.
Use compression. Set ENCODE AUTO and let Redshift handle it. Manual compression rarely beats automatic. One exception: leave the first column of a compound sort key uncompressed (RAW). Heavy compression there packs more rows into each block, which weakens the block skipping the sort key exists to enable.
Data Modeling for Amazon DynamoDB
DynamoDB modeling is a completely different mindset from relational databases. If you approach DynamoDB the same way you approach Redshift or PostgreSQL, you’re going to have a bad time. Learned this the hard way.
NoSQL vs Relational: The Mindset Shift
In relational databases, you normalize first and worry about queries later. You can always add an index or write a complex join. The schema is flexible.
In DynamoDB, you must know your access patterns before you design anything. No joins. No flexible query language. If your schema doesn’t support a specific access pattern, adding it later is painful and expensive. Sometimes it means creating a whole new table or rebuilding your Global Secondary Indexes.
The book makes this clear: in NoSQL, design starts with access patterns, not entities. List every query your application will make. Then design the schema to serve those queries efficiently.
Another big difference: in relational databases, you might have dozens of tables. In DynamoDB, fewer tables is better. Single-table design, where multiple entity types live in one table, is common. Improves data locality and simplifies permission management.
E-commerce Use Case
The book walks through an e-commerce example with four entities: Customers, Products, Shopping Carts, and Orders. The access patterns:
- Customer looks up product inventory
- Customer adds products to cart
- Customer checks out
- Customer views recent orders
- Company views unfulfilled orders
Good example because it shows how you work backward from these patterns to choose partition keys, sort keys, and indexes.
Core DynamoDB Concepts
Quick refresher:
Tables, items, and attributes. A table holds items (like rows). Each item has attributes (like columns). DynamoDB is schemaless though. Different items in the same table can have completely different attributes.
Primary keys. Every table needs one. Two flavors:
- Simple primary key: just a partition key. Each item must have a unique partition key value.
- Composite primary key: partition key plus sort key. Items can share a partition key, but the combination must be unique.
Partition key determines which physical partition stores the item. Items with the same partition key are stored together.
Sort key orders items within a partition. Enables range queries: give me all orders for this customer between January and March.
Secondary indexes add query flexibility:
- Global Secondary Index (GSI): completely new partition key and optional sort key. Can be added anytime. Uses its own read/write capacity. Eventually consistent reads only.
- Local Secondary Index (LSI): same partition key as the base table, different sort key. Must be created at table creation time. Can’t be added later. Shares capacity with the base table. Supports strongly consistent reads.
Query vs Scan. Query uses the primary key or indexes to find specific items. Fast and cheap. Scan reads every item in the table and filters afterward. Slow and expensive. Your goal: make sure every critical access pattern can be served by a Query, not a Scan.
Partition Key Selection
For the e-commerce example:
| Entity | Partition Key |
|---|---|
| Customer | Customer UUID |
| Product | Product UUID |
| Shopping Cart | Customer UUID |
| Order | Customer UUID |
Cart and Order use the Customer UUID, not their own IDs. The “owner entity” pattern. Since most queries start from “show me this customer’s carts/orders,” bucketing by customer makes the queries efficient.
Key rules:
- High cardinality. UUID is better than name or email because every value is unique. Low cardinality leads to hot partitions.
- Avoid hot partitions. If one partition key gets way more traffic than others, that partition gets throttled. For write-heavy keys, add a random suffix (0-9) to spread load across partitions.
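The write-sharding trick from the last rule can be sketched in a few lines. The key format and shard count here are illustrative, not a DynamoDB API:

```python
import random

SHARDS = 10  # hypothetical shard count; tune to your write throughput

def sharded_key(base_pk):
    # Spread writes for one hot logical key across 10 physical partitions.
    return f"{base_pk}#{random.randint(0, SHARDS - 1)}"

def all_shards(base_pk):
    # Reads fan out: query every suffix and merge the results client-side.
    return [f"{base_pk}#{i}" for i in range(SHARDS)]

writes = [sharded_key("DAILY_LEADERBOARD") for _ in range(1000)]
print(len(set(writes)))  # up to 10 distinct partition keys instead of 1
```

The cost of this pattern is on the read side: fetching the full logical item set now takes up to 10 queries instead of 1, which is why you only shard keys that are genuinely write-hot.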
Sort Key Selection
Sort keys organize data within a partition:
- Group related data. ORDER#OrderID sorts orders by ID within a customer’s partition.
- Hierarchical relationships. CART#STATUS#ProductUUID lets you query active vs saved cart items.
- Time-ordered data. Timestamps as sort keys to retrieve events in chronological order.
- Version control. Include a version number to store multiple versions of an item.
Full mapping for the e-commerce example:
| Entity | PK | SK |
|---|---|---|
| Customer | Customer UUID | VersionNumber |
| Product | Product UUID | VersionNumber |
| Shopping Cart | Customer UUID | CART#STATUS#ProductUUID |
| Order | Customer UUID | ORDER#OrderID |
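A minimal in-memory sketch of this layout shows how one partition key plus sort-key prefixes serves several access patterns. This is plain Python, not the DynamoDB API, and the item values are invented:

```python
# Single-table layout: carts and orders share the customer's partition key,
# and sort-key prefixes separate the entity types (values are hypothetical).
items = [
    {"PK": "CUST-42", "SK": "ORDER#0001", "total": 30},
    {"PK": "CUST-42", "SK": "ORDER#0002", "total": 75},
    {"PK": "CUST-42", "SK": "CART#ACTIVE#PROD-9", "qty": 2},
    {"PK": "CUST-99", "SK": "ORDER#0001", "total": 12},
]

def query(pk, sk_prefix=""):
    # Mirrors a DynamoDB Query: partition-key equality plus a
    # begins_with condition on the sort key. Never touches other partitions.
    return [it for it in items if it["PK"] == pk and it["SK"].startswith(sk_prefix)]

print([it["SK"] for it in query("CUST-42", "ORDER#")])
# -> ['ORDER#0001', 'ORDER#0002']
```

"Customer views recent orders" and "customer views active cart" both become single-partition queries against the same table, just with different prefixes.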
Global vs Local Secondary Indexes
| Feature | GSI | LSI |
|---|---|---|
| Partition key | Any attribute | Same as base table |
| Creation | Anytime | Table creation only |
| Capacity | Separate | Shared with base table |
| Consistency | Eventually consistent | Strongly consistent |
| Storage | Sparse (only indexed items) | All items in partition |
Use GSI when you need to query across different partitions or when requirements change after launch. Trade-off is extra cost and eventual consistency.
Use LSI when you need an alternative sort order within the same partition. Cheaper since it shares capacity, but you can’t add it after the table exists.
GSIs are used far more often than LSIs in practice. The flexibility of adding them later is worth the extra cost. I’ve seen multiple projects where the team didn’t create LSIs at table creation time, realized they needed them months later, and had to recreate the entire table.
Data Modeling for Data Lakes
Data lakes on S3 follow a layered architecture. The book describes three layers, sometimes called the medallion architecture (Bronze, Silver, Gold).
Raw Data Layer (Bronze)
Where raw data lands from all sources: databases, APIs, files, streams. No transformations. Keep the data exactly as it arrived.
Always keep this layer. Store it in a cheap storage tier. If your processing pipeline has a bug, you can reprocess from raw data instead of re-ingesting from the source. Been saved by this more than once.
Stage Data Layer (Silver)
Cleaned and conformed data. Convert CSV to Parquet, run data quality checks, merge CDC logs into current-state tables, standardize column names and types.
This layer enables self-service analytics. Data scientists and analysts can query it without worrying about data quality issues.
Analytics Data Layer (Gold)
Final, consumption-ready layer. Business rules applied, aggregations computed, data denormalized for fast reads. Powers dashboards, BI tools, and ML models.
The data warehouse layer of your data lake. Read-optimized, low latency, ready for business users.
S3 Data Lake Best Practices
Six best practices critical for both the exam and real-world performance:
Partition Your Data
Partitioning divides data into folders based on column values like date, region, or country. When a query filters on a partition column, the query engine skips all non-matching partitions. Less data scanned = faster queries and lower costs.
Choose partition keys that appear in your WHERE clauses. Date is the most common. Be careful though: partition keys should have low to medium cardinality. Partitioning by user_id when you have millions of users creates millions of tiny partitions, which is worse than no partitioning at all.
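Partition pruning is essentially prefix filtering over Hive-style paths. A sketch with made-up S3 paths shows why a date filter scans so little:

```python
# Hive-style partition layout: year=/month= folders under an S3 prefix.
# Bucket name and path layout are hypothetical.
paths = [
    f"s3://lake/sales/year={y}/month={m:02d}/part-0.parquet"
    for y in (2023, 2024) for m in range(1, 13)
]

def prune(paths, year, months):
    # A filter on partition columns skips non-matching folders entirely;
    # the query engine never lists or reads the objects inside them.
    keep = {f"year={year}/month={m:02d}" for m in months}
    return [p for p in paths if any(k in p for k in keep)]

hits = prune(paths, 2024, [1, 2, 3])
print(len(hits), "of", len(paths))  # -> 3 of 24 objects scanned
```

Engines like Athena and Spark do this automatically once the partition columns are registered, which is why putting your WHERE-clause columns in the path pays off immediately.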
Bucket Your Data
Bucketing splits data within partitions into a fixed number of files based on a column’s hash. Good for high-cardinality columns like user_id or device_id. Helps with join performance by reducing data shuffling between nodes.
Use Compression
Compression reduces storage costs and speeds up queries by reading less data from disk. Two formats worth knowing:
- Snappy: fast compression and decompression, moderate compression ratio. Good default for analytics.
- Zstandard (zstd): better compression ratio than Snappy, still good performance. Newer option gaining popularity.
Optimize File Size
Many small files are worse than fewer large files. Each file has metadata overhead, and too many files slow down listing and query planning. Aim for files around 128 MB or larger.
Common problem with streaming pipelines that write tiny files every few seconds. You need a compaction process to merge small files into larger ones.
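A compaction pass can be as simple as greedy bin-filling toward the 128 MB target. The sizes below are invented, and real pipelines typically delegate this to Spark, Glue, or a table format's built-in compaction job:

```python
# Greedy compaction plan: group small files until each group reaches
# the target size, then merge each group into one output file.
TARGET_MB = 128

def plan_compaction(file_sizes_mb):
    groups, current, total = [], [], 0
    for size in file_sizes_mb:
        current.append(size)
        total += size
        if total >= TARGET_MB:       # close the group once it hits target
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)       # leftover tail stays small until next run
    return groups

small_files = [4] * 100              # 100 x 4 MB files from a streaming sink
plan = plan_compaction(small_files)
print(len(plan), [sum(g) for g in plan])  # -> 4 [128, 128, 128, 16]
```

One hundred tiny objects collapse into four, cutting listing and query-planning overhead by roughly 25x for the same data.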
Use Columnar File Formats
Apache Parquet and Apache ORC store data by column, not by row. Benefits:
- Better compression. Similar values in a column compress well together.
- Column pruning. Only reads the columns your query needs.
- Block-level metadata. Min/max values per column per block let the engine skip irrelevant blocks.
- Splittable. Large files can be processed in parallel by multiple nodes.
For analytics workloads, Parquet and ORC are the standard. CSV and JSON should stay in the raw layer only.
Use Open Table Formats
Apache Iceberg, Apache Hudi, and Delta Lake add database-like features to your data lake:
- ACID transactions
- Schema evolution
- Time-travel queries
- Partition evolution without rewriting data
These formats solve real problems. Without them, concurrent writes can corrupt your data, schema changes require full rewrites, and there’s no way to query historical snapshots. If you’re building a new data lake today, pick one of these formats from the start.
Key Takeaways
- Redshift: denormalize, use star schema, pick distribution styles based on join patterns, sort on filter columns.
- DynamoDB: know your access patterns first, choose high-cardinality partition keys, use sort keys for grouping and ordering, add GSIs for cross-partition queries.
- Data Lakes: layer your data (raw/stage/analytics), partition on query filters, use Parquet/ORC, compress everything, keep files above 128 MB, consider open table formats.
The common thread across all three: design for how you query, not for how the data looks. Access patterns drive everything.