Data Engineering Fundamentals: Databases, Spark, Data Lakes, and AWS Basics



Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

Chapter 2 is called “Prerequisite Knowledge for Aspiring Data Engineers.” It covers the foundation you need before touching any AWS service. Databases, big data concepts, processing frameworks, data lakes, warehouses, ETL patterns, CI/CD, and AWS basics.

If you already work in tech, a lot of this will feel like a refresher. It’s nice to have it all in one place though. Some of these topics honestly deserve more attention than people give them during cert prep.

Databases: The Starting Point

The chapter starts with databases. No surprises there. Every application needs to store data somewhere.

A database is a collection of data stored electronically for easy access. A DBMS (database management system) is the software that handles create, read, update, delete operations plus admin work like caching, indexing, and performance tuning. Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server, Amazon Aurora.

Three Types of Databases

Hierarchical databases were popular around 1970. They store data in tree structures with parent-child relationships. One parent, many children, but a child can only have one parent. Like a file system. These mostly died out because they don’t scale well and get complex fast.

Relational databases took over in the 1980s and never left. Tables, rows, columns. Simple to understand, simple to query with SQL. They’re still the most popular database type for good reason. If your data has a fixed schema and you need ACID transactions, relational databases are your default choice.

SQL commands break down into categories:

  • DDL (Data Definition Language): CREATE, ALTER, DROP tables and objects
  • DML (Data Manipulation Language): INSERT, UPDATE, DELETE data
  • DQL (Data Query Language): SELECT statements to read data
  • DCL (Data Control Language): GRANT/REVOKE permissions
  • TCL (Transaction Control Language): COMMIT, ROLLBACK transactions
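
All five categories show up in day-to-day SQL. Here is a minimal sketch using Python's built-in sqlite3 module (the table and values are invented for illustration; SQLite has no DCL, since GRANT/REVOKE belong to multi-user servers like PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# DML: modify data
cur.execute("INSERT INTO orders (amount) VALUES (19.99)")
cur.execute("UPDATE orders SET amount = 24.99 WHERE id = 1")

# TCL: make the changes durable (or ROLLBACK to discard them)
conn.commit()

# DQL: read data back
total = cur.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 24.99

# DDL again: drop the table when done
cur.execute("DROP TABLE orders")
conn.close()
```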

NoSQL databases fill the gap where relational databases struggle. When your schema is flexible, when each record might have different attributes, when you need specific query patterns that relational databases handle poorly. Types include key-value stores, document stores, graph databases, in-memory databases, and search databases.

The trade-off with NoSQL: no referential integrity between tables, no JOINs. You design your data model around your access patterns, not around normalization rules.

Most real-world systems use both. Relational for transactional data, NoSQL for specific use cases like caching (Redis), search (Elasticsearch), or flexible document storage (MongoDB/DynamoDB).

OLTP vs OLAP

Two different worlds:

OLTP (Online Transaction Processing) handles record-level operations. Your web app inserting orders, updating user profiles, processing payments. Fast response times, high concurrency, row-based operations.

OLAP (Online Analytical Processing) handles analytical queries. Monthly sales reports, year-over-year comparisons, trend analysis. Column-based operations, scanning large volumes of data, lower concurrency.

In practice, you run your application on OLTP databases and feed data into OLAP systems for analytics. Trying to do both on one system is a recipe for trouble. Your analytics queries will kill your application performance.
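
The row-versus-column distinction is easy to see in miniature (a pure-Python sketch; the data is made up):

```python
# Row store (OLTP style): each record kept together -- cheap point writes
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.0},
]
rows.append({"order_id": 3, "customer": "a", "amount": 5.0})  # fast single-row insert

# Column store (OLAP style): each column kept together -- cheap column scans
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.0, 5.0],
}

# An analytical aggregate touches only the one column it needs,
# instead of dragging every full row through memory
total_revenue = sum(columns["amount"])
print(total_revenue)  # 40.0
```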

Big Data: The 5 Vs

Traditional databases can’t handle petabyte-scale data. That’s where big data and distributed processing come in.

Big data originally had 3 Vs, now it has 5:

  • Volume: How much data. Terabytes to petabytes.
  • Velocity: How fast data arrives. Daily batch files or real-time streams every few seconds.
  • Variety: What forms. Structured (CSV, relational), semi-structured (JSON, XML), unstructured (video, audio, PDFs).
  • Veracity: How reliable the data is. Quality and accuracy matter for correct insights.
  • Value: What the data is worth to the business. All that processing should produce something useful.

The core problem: single-server processing can’t scale to handle this volume. You need distributed processing frameworks that split work across a cluster of machines.

Distributed Processing Frameworks

This is where it gets interesting. The chapter covers the major players in the Hadoop ecosystem and beyond.

MapReduce

The original distributed processing framework. Splits data into 64-128 MB chunks and processes them in parallel across the cluster. Three stages: Map (transform key-value pairs), Reduce (aggregate results), and an optional Combiner (mini-reducer on each node before shuffling data).

MapReduce was revolutionary when it came out, but writing Java MapReduce jobs is painful. Nobody does this voluntarily anymore. Understanding the concept matters though because everything that came after builds on these ideas.
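
The concept fits in a few lines of plain Python (a toy word count, with the shuffle done by hand; real MapReduce distributes each stage across machines):

```python
from collections import defaultdict
from functools import reduce

documents = ["spark beats mapreduce", "mapreduce started it all"]

# Map: emit (key, value) pairs -- here, (word, 1) for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key, as the framework does between stages
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each key's values into a final result
counts = {word: reduce(lambda a, b: a + b, values)
          for word, values in groups.items()}
print(counts["mapreduce"])  # 2
```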

Apache Spark

Spark is the big one. If you learn one distributed processing framework, make it Spark.

It does in-memory processing, which makes it way faster than MapReduce for most workloads. Supports batch processing, real-time streaming, machine learning (MLlib), and graph processing. You can write Spark jobs in Java, Scala, Python, or R. Python (PySpark) is the most popular choice today.

Spark works with three data abstractions:

  • RDDs (Resilient Distributed Datasets): the original low-level API
  • DataFrames: structured data with named columns, like a table
  • Datasets: type-safe version of DataFrames (Java/Scala only)

Two types of operations:

  • Transformations: lazy operations that define what to do (map, filter, join). Nothing actually executes until you call an action.
  • Actions: trigger execution and produce results (write to disk, collect to driver, count).
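
The lazy-until-action behavior can be mimicked with Python generators, which makes a handy mental model even without a cluster (comments show rough PySpark analogues; the numbers are made up):

```python
numbers = range(1, 11)

# "Transformations": lazily defined, nothing has executed yet
doubled = (n * 2 for n in numbers)    # roughly like rdd.map(lambda n: n * 2)
big = (n for n in doubled if n > 10)  # roughly like .filter(lambda n: n > 10)

# "Action": forces the whole pipeline to execute in one pass
result = list(big)                    # roughly like .collect()
print(result)  # [12, 14, 16, 18, 20]
```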

Spark runs in standalone mode or cluster mode with a cluster manager (YARN, Mesos, Kubernetes). In cluster mode, the driver program submits jobs to the cluster, which splits work into tasks executed by worker nodes. Each worker has executors with allocated memory and CPU.

Spark shows up everywhere in my daily work. AWS EMR, Glue, Databricks. If you understand Spark, you understand the core of modern data engineering.

Apache Flink

Flink is designed primarily for real-time stream processing. While Spark added streaming capabilities later (Structured Streaming), Flink was built for streaming from day one.

Key features: in-memory processing, stateful processing, out-of-order event handling, exactly-once semantics, backpressure control.

Architecture has two main components:

  • JobManager: schedules tasks, coordinates checkpoints, handles failures. You can have multiple for high availability.
  • TaskManager: executes the actual work. Scales horizontally with more task slots.

Flink is gaining popularity, especially for use cases where low-latency stream processing matters. On AWS, you’ll see it as Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics).

Hive

Apache Hive was created at Facebook around 2008. It lets you query data on Hadoop using SQL (HiveQL) instead of writing Java MapReduce code. Huge deal for data analysts who knew SQL but didn’t want to learn Java.

Hive acts as a data warehouse on top of Hadoop. You can configure it to use MapReduce, Tez, or Spark as the processing engine. The Hive Metastore (HMS) stores schema metadata in a relational database. This lets you define virtual table schemas on top of files in HDFS or S3 and query them with SQL.

The Hive Metastore concept is important. It shows up in AWS Glue Data Catalog, which is essentially a managed HMS.
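
The metastore idea, a catalog mapping table names to file locations and schemas applied at query time, can be sketched in plain Python (the table, path, and data are invented; the real HMS data model is far richer):

```python
import csv
import io

# Toy "metastore": table name -> (location, column schema)
metastore = {
    "orders": {
        "location": "orders.csv",  # would be an HDFS or S3 path in practice
        "schema": [("order_id", int), ("amount", float)],
    }
}

raw_file = io.StringIO("1,10.5\n2,3.25\n")  # stands in for the file at `location`

def read_table(name, fileobj):
    """Apply the registered schema to raw files at query time (schema on read)."""
    schema = metastore[name]["schema"]
    for row in csv.reader(fileobj):
        yield {col: cast(value) for (col, cast), value in zip(schema, row)}

rows = list(read_table("orders", raw_file))
print(rows[0])  # {'order_id': 1, 'amount': 10.5}
```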

Presto and Trino

Presto (PrestoDB) is a distributed SQL query engine open sourced by Facebook around 2013. Faster than Hive because it processes queries in-memory. Good for complex queries, aggregations, joins, window functions. It can query data from many sources: HDFS, S3, relational databases, Cassandra, MongoDB.

Trino is a fork of Presto. In 2019, some of the original Presto engineers left Facebook and forked the project. Originally called PrestoSQL, it was rebranded to Trino in 2020. Today, Trino is often the preferred choice in the open source community.

For the exam, know that Spark is for batch and streaming, Flink is for real-time streaming, and Hive/Presto/Trino are SQL query engines for analysts. In practice, Spark dominates the data engineering landscape.

Data Lake vs Data Warehouse

Data Lake

A centralized repository that stores structured, semi-structured, and unstructured data at any scale. Store raw data as-is, process it later with distributed frameworks.

Data lakes started with HDFS but moved to cloud object stores like S3. S3 is cheaper, more reliable, scales better, and you don’t have to manage cluster storage.

The danger: without proper governance, your data lake becomes a data swamp. Unmanageable, no structure, no metadata, nobody knows what’s in there. I’ve seen this happen. Not fun to clean up.

Data Warehouse

Also a centralized repository for analytics, but built for structured and semi-structured data with low-latency queries. Data warehouses store data in proprietary formats optimized for analytical queries.

Key features: tiered storage (hot data in memory, cold data on SSD), data distribution strategies for efficient joins, SQL-native interfaces.

Data marts are smaller subsets of a data warehouse focused on a specific department like finance or marketing. Aggregated data for specific use cases.

Key Differences

  • Schema approach: Data warehouses use “schema on write” (define schema first, then load data). Data lakes use “schema on read” (store data first, apply schema when querying).
  • Data types: Warehouses handle structured data. Lakes handle everything.
  • Interface: Warehouses are SQL-focused. Lakes work with programming frameworks like Spark, Flink, Python.
  • History: Lakes are good for storing all historical data. Warehouses typically hold the last few years needed for active analytics.
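
The schema-on-write vs schema-on-read split can be shown in a few lines (pure Python, invented records):

```python
import json

records = ['{"id": 1, "amount": "12.50"}', '{"id": 2}']  # messy raw input

# Schema on write (warehouse style): validate and shape before loading
def load_with_schema(raw):
    doc = json.loads(raw)
    return {"id": int(doc["id"]), "amount": float(doc.get("amount", 0.0))}

warehouse = [load_with_schema(r) for r in records]

# Schema on read (lake style): land the raw strings untouched...
lake = list(records)
# ...and apply whatever schema the query needs, at query time
amounts = [float(json.loads(r).get("amount", 0.0)) for r in lake]
print(sum(amounts))  # 12.5
```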

Most organizations need both. Use the data lake as your landing zone and long-term storage. Use the data warehouse for fast analytical queries on curated data.

ETL vs ELT

Both are patterns for moving data from source to target. The difference is when transformation happens.

ETL (Extract, Transform, Load): Extract data from source, transform it, then load to target. Use when data needs cleaning or format conversion before it’s usable.

ELT (Extract, Load, Transform): Extract data, load it raw into the target, then transform it there. Use when your target system (like a data warehouse) has enough compute power for transformations and analysts want access to raw data quickly.

The three steps:

  • Extract: Fetch data from databases, APIs, FTP, object stores. Full or incremental.
  • Transform: Modify data for consumption. SQL, Spark, Python, whatever fits.
  • Load: Save to the target storage layer (database, data lake, data warehouse).

No clear winner here. ELT is popular with modern cloud data warehouses (Redshift, Snowflake, BigQuery) because they have the compute power to transform data after loading. ETL is better when raw data needs significant cleaning before it’s useful.

In practice, most pipelines are a mix. You do some light transformation during extraction, load into a staging area, then do heavy transformations in the warehouse.
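
The two patterns can be contrasted in miniature, with sqlite3 standing in for the warehouse (the source rows are made up):

```python
import sqlite3

raw = [("a", "19.99"), ("b", "5.00"), ("a", "bad")]  # messy source extract

# ETL: transform in the pipeline, load only clean rows into the target
def clean(rows):
    for customer, amount in rows:
        try:
            yield customer, float(amount)
        except ValueError:
            pass  # drop unparseable rows before they reach the target

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean(raw))

# ELT: load raw as-is into a staging table, then transform with the
# warehouse's own SQL engine
conn.execute("CREATE TABLE staging (customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", raw)
conn.execute("""
    CREATE TABLE sales_elt AS
    SELECT customer, CAST(amount AS REAL) AS amount
    FROM staging WHERE amount GLOB '[0-9]*.[0-9]*'
""")

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))  # 24.99
```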

Data Processing Patterns

Three main ways to process data:

Batch Processing

Process accumulated data at regular intervals. Examples: nightly processing of files received during the day, hourly log processing, daily full database extracts.

Batch jobs handle higher volumes and take more time. Bread and butter of most data pipelines.

Real-Time Stream Processing

Process unbounded data continuously as it arrives. Examples: website click tracking, IoT device events, CDC (Change Data Capture) from databases, SaaS data feeds.

Stream processing handles smaller volumes per event but adds complexity: out-of-order records, late arrivals, checkpointing, retries.

Event-Driven Processing

Similar to batch but triggered by events instead of schedules. Process data when it arrives, but don’t keep streaming infrastructure running all the time. Good for irregular data arrivals where you want to save cost.

On AWS, this usually means S3 events triggering Lambda functions or Step Functions. Cost-effective for sporadic workloads.
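
A minimal sketch of such a handler (the event dict mirrors the shape of an S3 notification; real code would fetch and process the object, e.g. with boto3):

```python
# Event-driven processing: S3 drops a file, the notification invokes
# this function, and compute runs only while there is work to do.
def handler(event, context):
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would download and process the object here;
        # this sketch just reports what arrived.
        results.append(f"s3://{bucket}/{key}")
    return results

# Abridged shape of the notification S3 sends to Lambda
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "landing-zone"},
                "object": {"key": "2026/01/data.csv"}}}
    ]
}
print(handler(sample_event, None))  # ['s3://landing-zone/2026/01/data.csv']
```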

Data Pipeline Architecture

The chapter presents a high-level architecture that applies regardless of specific technologies:

  1. Data Sources: databases, APIs, files, streams
  2. Ingestion: batch jobs, Kafka, message queues
  3. Distributed Storage: data lake, data warehouse, databases
  4. Processing: validation, enrichment, transformation
  5. Catalog: metadata for data discovery
  6. Consumption: analytics, BI, ML, downstream systems
  7. Security & Governance: access control, encryption, masking, lineage
  8. Orchestration: workflow management, monitoring, retry logic
  9. Monitoring: infrastructure health, logs, alerts

This pattern repeats everywhere. Every data platform I’ve worked with follows some version of this architecture. The specific tools change, but the layers stay the same.

Code Repositories and CI/CD

Quick summary:

Code repositories (GitHub, GitLab, Bitbucket) provide version control, collaboration, code review, and release management. Standard commands: git clone, git fetch, git merge, git pull, git push. Handle merge conflicts by editing the conflicting files and committing the resolution.

CI/CD has three parts:

  • Continuous Integration (CI): Automatically build and test code when developers merge changes
  • Continuous Delivery (CD): Deploy to pre-production for QA validation
  • Continuous Deployment (CD): Automatically deploy to production

Tools: Jenkins, GitHub Actions, AWS CodePipeline.

If you’re reading this book, you probably already know Git. If not, stop and learn it. Not optional for any engineering role.

Cloud Computing and AWS

The chapter explains why cloud computing exists: traditional on-premises infrastructure has slow procurement, high upfront cost, limited scalability, and operational complexity.

Cloud benefits:

  • Agility: 200+ AWS services, experiment and deploy quickly
  • Elasticity: Scale up and down based on demand, serverless options
  • Cost savings: Pay-as-you-go, no upfront hardware investment
  • Global deployment: Spin up resources in any AWS region in minutes

Cloud service models: IaaS (Infrastructure as a Service), PaaS (Platform as a Service), SaaS (Software as a Service).

AWS Infrastructure

AWS has Regions (geographic areas) and Availability Zones (isolated data centers within a region, connected by low-latency network). Most AWS services are regional. Some, like IAM, are global.

Important cost note: data transfer between AZs within a managed service is free, but cross-AZ traffic for your own applications costs money. Cross-region transfer also costs money.

Getting Started with AWS

Setting up an AWS account:

  1. Create account with root email, account name, address, phone
  2. Configure IAM access
  3. Choose primary region
  4. Provision resources

Best practices:

  • Use corporate email for business accounts, not personal
  • Enable MFA on root account immediately
  • Use AWS Organizations for multi-account setups
  • Use AWS Control Tower for multi-account governance
  • Understand the Shared Responsibility Model
  • Set up AWS Budgets and Cost Explorer from day one
  • Tag everything for cost reporting
  • Set up CloudWatch logging and CloudTrail auditing

AWS IAM

IAM is the global service for managing access. Users, groups, policies, and roles.

IAM Policies combine permissions into a JSON document. Three types:

  • AWS-managed policies (pre-built, read-only)
  • Customer-managed policies (you create)
  • Inline policies (attached directly to a user/role, not reusable)
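
A customer-managed policy is just a JSON document. Here is a least-privilege example granting read-only access to a single bucket (the bucket name is invented for illustration):

```python
import json

# Read-only access to one bucket: ListBucket applies to the bucket ARN,
# GetObject applies to the objects inside it (the /* resource).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadDataLakeOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-data-lake",
                "arn:aws:s3:::example-data-lake/*",
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```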

IAM Roles let you attach policies and have users or services assume them. Benefits: less overhead than per-user permissions, temporary credentials instead of long-term keys, delegation across accounts.

IAM best practices:

  • Enable MFA for all IAM users
  • Use roles instead of attaching policies directly to users
  • Follow least-privilege: grant only what’s needed
  • Use federation with identity providers for human users
  • Use IAM Identity Center for multi-account management

From years of working with AWS, IAM is where most security problems start. Too many teams use overly broad policies because “it works.” Take the time to do IAM right. Least privilege isn’t just a best practice – it’s the difference between a minor incident and a catastrophic breach.

My Take on This Chapter

This chapter covers a lot of ground. For experienced engineers, it's a fast read; for people new to data engineering, it's a solid foundation.

The most important things to really understand for the exam and for real work:

  1. Spark: Know it well. Default tool for data engineering on AWS.
  2. Data lake vs data warehouse: Know when to use each. On AWS, this means S3 + Glue vs Redshift.
  3. ETL vs ELT: Understand the trade-offs. Modern stacks lean toward ELT.
  4. IAM: This shows up in every single AWS exam. Roles, policies, least privilege.
  5. Batch vs streaming vs event-driven: Know the patterns and when each applies.

The Hadoop/MapReduce content is mostly historical context at this point. Nobody writes MapReduce jobs in 2026. The concepts of distributed processing, data partitioning, and parallel execution are still fundamental though.

Next chapter gets into actual AWS analytics services. That’s where things get practical.




denis256 at denis256.dev