AWS Analytics Services: Kinesis, Glue, Athena, Redshift, and More
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Chapter 3 is where the real AWS content starts. This is the overview of analytics services you need to know for the DEA-C01 exam. Even if you’re not taking the exam, it’s a solid map of what AWS offers for data work.
There are a lot of services here. Some overlap. Some feel redundant. That’s just how AWS works.
Amazon Kinesis Data Streams
Kinesis Data Streams is the real-time streaming service. You push data in, consumers read it out, and everything happens continuously.
The core concept is shards. Each shard gives you 1 MB/s (or 1,000 records/s, whichever you hit first) of write throughput and 2 MB/s of read throughput. Shards are basically parallel lanes on a highway. More shards, more throughput.
Two capacity modes:
- Provisioned mode: you pick the number of shards. You pay for what you provision.
- On-demand mode: AWS scales shards automatically. No capacity planning needed. Good for unpredictable workloads, but costs more per GB.
Data retention is flexible. Default is 24 hours, extendable up to 365 days. That long retention is useful for replaying events when something goes wrong or when you need to retrain ML models.
The enhanced fan-out feature is worth knowing. Without it, all consumers share the 2 MB/s per shard. With enhanced fan-out, each consumer gets its own dedicated 2 MB/s. Important for scenarios with multiple consuming applications.
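Those per-shard quotas make provisioned-mode capacity planning simple arithmetic: you need enough shards to satisfy both the bandwidth limit and the record-rate limit. A quick sketch:

```python
import math

# Published per-shard write quotas for Kinesis Data Streams.
SHARD_WRITE_MB_S = 1.0        # 1 MB/s ingest per shard
SHARD_WRITE_RECORDS_S = 1000  # 1,000 records/s ingest per shard

def shards_needed(ingest_mb_s: float, records_per_s: int) -> int:
    """Shards required to satisfy BOTH the bandwidth and record-rate limits."""
    by_bandwidth = math.ceil(ingest_mb_s / SHARD_WRITE_MB_S)
    by_records = math.ceil(records_per_s / SHARD_WRITE_RECORDS_S)
    return max(by_bandwidth, by_records, 1)

# 5 MB/s of small records: the record-rate limit dominates.
print(shards_needed(5, 12_000))  # 12 shards, not 5
```

The `max()` is the part people forget: many small records can force more shards than raw bandwidth alone would suggest.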
Kinesis works well for AWS-native pipelines from what I’ve seen. If your whole stack is AWS, it’s the path of least resistance. If you have teams that already know Kafka, MSK might be a better fit.
Amazon Data Firehose
Firehose used to be called Kinesis Data Firehose. AWS renamed it. It’s a delivery service for streaming data. The key difference from Kinesis Data Streams: Firehose is about getting data from point A to point B with minimal setup. You don’t write consumer applications.
How it works:
- Data comes in from producers (Kinesis Data Streams, MSK, direct API calls)
- Firehose buffers the data (you configure buffer size and time interval)
- Optionally transforms data via Lambda functions
- Delivers to a destination: S3, Redshift, OpenSearch, or third-party tools like Splunk, Datadog, Snowflake
Firehose is serverless. No capacity planning, no shard management. Scales automatically.
The buffering matters. Firehose collects data and writes it in batches, not record by record. Near-real-time, not truly real-time. Typical latency is 60 seconds or more depending on your buffer settings.
I use Firehose a lot for log delivery to S3. Set it up once, forget about it. For anything that needs sub-second processing, use Kinesis Data Streams or MSK instead.
Amazon Managed Service for Apache Flink
Previously called Amazon Kinesis Data Analytics. AWS loves renaming things.
This is managed Apache Flink. If you need to process streaming data with complex logic, stateful computations, windowing, or exactly-once semantics, Flink is the tool.
Key things about Flink:
- Stateful processing: Flink can maintain state across events. Session tracking, running aggregations, anomaly detection.
- Exactly-once semantics: Flink’s checkpointing guarantees each event affects state exactly once, even across failures and restarts. Critical for financial data.
- Multiple languages: Java, Scala, Python, and SQL.
- Studio notebooks: interactive development powered by Apache Zeppelin. Good for prototyping.
The managed service handles infrastructure, scaling, and checkpointing. You focus on writing the Flink application.
Common use cases: real-time analytics, fraud detection, event-driven apps. If you’re doing simple ETL on streams, Lambda or Firehose might be enough. Flink is for when your processing logic is complex.
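To make the windowing idea concrete, here is a plain-Python sketch of a tumbling-window count, the kind of stateful aggregation Flink runs continuously and fault-tolerantly. This illustrates the concept only; real Flink code uses the DataStream or Table API:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Each event belongs to exactly one window, keyed by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Card-swipe events as (epoch seconds, card id); 60-second tumbling windows.
events = [(3, "a"), (45, "b"), (61, "a"), (75, "a"), (130, "c")]
print(tumbling_window_counts(events, 60))  # {0: 2, 60: 2, 120: 1}
```

What Flink adds on top of this ten-line idea is the hard part: event-time handling, late data, checkpointed state, and parallelism.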
Amazon MSK (Managed Streaming for Apache Kafka)
Amazon MSK is managed Kafka. If your team already runs Kafka on-premises or on EC2, MSK removes the operational burden of managing brokers, ZooKeeper, and storage.
Key features:
- Fully compatible with Apache Kafka: no code changes needed when migrating. All Kafka tools and libraries work.
- MSK Serverless: auto-scales based on demand. No cluster sizing. Good for variable workloads.
- Tiered storage: older data moves to cheaper storage automatically. Useful for compliance and long retention.
- MSK Connect: managed Kafka Connect. Deploy connectors without managing infrastructure. Pull data from databases, push to S3, etc.
- MSK Replicator: cross-region replication for disaster recovery.
When to pick MSK over Kinesis? If your team knows Kafka. If you need the Kafka ecosystem (Connect, Streams, KSQL). If you want portability across clouds. Kinesis is simpler for purely AWS workloads, but MSK gives you the full Kafka experience.
Reference Architecture: Streaming Analytics with Flink and MSK
The book shows a fraud detection pipeline:
- Events come in through API Gateway
- Stream into Amazon MSK
- Apache Flink applications process events in real-time, doing anomaly detection and data enrichment (reference data from S3)
- Three outputs:
- Lambda + SNS for fraud notifications
- OpenSearch for real-time search and reporting
- S3 for long-term storage
Solid pattern. MSK as the central event bus, Flink for processing, and multiple consumers for different purposes. I’ve seen similar setups in production. The main challenge is usually getting the Flink application right, not the infrastructure.
AWS Glue
Glue is the Swiss Army knife of AWS data services. It does ETL, data cataloging, data quality, and data integration. All serverless.
The main components:
Glue Data Catalog
A centralized metadata repository. It’s basically a technical catalog that stores table definitions, schemas, and partition information. Other services like Athena, Redshift Spectrum, and EMR use the Glue Data Catalog to understand what data exists and how it’s structured.
Glue Crawlers
Automated programs that scan your data sources (S3, databases, etc.), detect schemas, and populate the Data Catalog. Point a crawler at an S3 bucket, and it figures out the file format, infers column types, and creates table definitions.
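A toy version of the inference step helps build intuition: sample some values, pick the narrowest type that fits them all. Glue's actual classifiers do far more (file formats, compression, partition detection), but the core idea looks like this:

```python
def infer_column_type(values):
    """Pick the narrowest SQL-ish type that fits every sampled value."""
    def type_of(v):
        for cast, name in ((int, "bigint"), (float, "double")):
            try:
                cast(v)
                return name
            except ValueError:
                pass
        return "string"

    types = {type_of(v) for v in values}
    if types == {"bigint"}:
        return "bigint"
    if types <= {"bigint", "double"}:
        return "double"   # ints widen to double when mixed with floats
    return "string"       # anything else falls back to string

rows = [("1", "3.14", "alice"), ("2", "2.72", "bob")]
schema = [infer_column_type(col) for col in zip(*rows)]
print(schema)  # ['bigint', 'double', 'string']
```

The widening rule is also why crawlers sometimes surprise you: one stray non-numeric value in a sampled file and the whole column becomes a string.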
Glue ETL Jobs
The actual data transformation engine. Three flavors:
- Spark-based jobs: for large-scale batch processing. Python or Scala.
- Streaming ETL: Apache Spark Structured Streaming for continuous data flows.
- Python shell jobs: for simple, lightweight transformations.
Glue Studio
A visual interface for building ETL workflows. Drag and drop. Good for people who prefer visual tools over writing code.
Glue Data Quality
Define rules, run them against your data, get reports. “Column X should never be null.” “Values in column Y should be between 0 and 100.” That kind of thing.
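The rule model is straightforward: predicates evaluated per row, rolled up into a report. A minimal sketch of that idea (Glue's real rules are written in its DQDL rule language, not Python):

```python
def check_not_null(rows, column):
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"rule": f"{column} is never null", "passed": not bad, "failing_rows": bad}

def check_between(rows, column, lo, hi):
    bad = [i for i, r in enumerate(rows)
           if r.get(column) is None or not (lo <= r[column] <= hi)]
    return {"rule": f"{column} between {lo} and {hi}", "passed": not bad, "failing_rows": bad}

rows = [{"x": 10, "score": 55}, {"x": None, "score": 140}]
report = [check_not_null(rows, "x"), check_between(rows, "score", 0, 100)]
for result in report:
    print(result)  # both rules fail on row 1
```

The useful part of the report in practice is the failing-row detail, not the pass/fail bit: it tells you which records to quarantine.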
Glue is the service I see most often in AWS data pipelines. Not the fastest or cheapest for every scenario, but it connects to everything and requires zero infrastructure management. For the exam, know Glue well.
AWS Glue DataBrew
DataBrew is the no-code version of Glue’s data transformation capabilities. Visual interface, 250+ prebuilt transformations, data profiling.
Target audience: analysts and data scientists who don’t want to write Spark code. They can clean, normalize, and transform data through a point-and-click interface.
Features worth noting:
- Data profiling: automatically generates statistics, distributions, and data quality reports.
- Recipes: reusable sequences of transformation steps. Apply the same recipe to new datasets.
- No-code transformations: filtering, grouping, joining, pivoting, all without code.
DataBrew is nice but niche. If you’re a developer, you’ll probably just write Glue jobs directly. For mixed teams where analysts need self-service data prep, it fills a gap.
Amazon Athena
Athena is serverless SQL on S3. Point it at your data, write a SQL query, get results. Pay only for the data scanned.
Built on Trino and Presto. Supports CSV, JSON, ORC, Avro, Parquet. The format matters a lot for cost. Parquet and ORC are columnar, so Athena scans less data and you pay less.
Key features:
- Pay-per-query: charged per TB of data scanned. Use columnar formats and partitioning to keep costs down.
- Apache Spark support: run Spark workloads alongside SQL. Notebooks in the Athena console.
- Apache Iceberg support: table format with schema evolution, time travel, and hidden partitioning. Big deal for data lakes.
- Federated queries: query DynamoDB, Redshift, RDS, and other sources directly from Athena without moving data.
Athena is one of my favorite AWS services. For ad-hoc analysis of data in S3, nothing beats it. No cluster to manage, no infrastructure to worry about. Write SQL, get answers.
The main limitation: not great for frequent, high-concurrency workloads. For that, Redshift is better. Athena shines for exploration and occasional queries.
Amazon EMR
EMR is Elastic MapReduce. Managed clusters running open source big data frameworks: Hadoop, Spark, Presto, HBase, Flink.
Deployment options:
- EMR on EC2: traditional cluster. Full control over instances.
- EMR on EKS: run on Kubernetes. Good if you already have EKS.
- EMR Serverless: no cluster management at all. Submit jobs, EMR handles the rest.
EMR supports Spot Instances for significant cost savings on fault-tolerant workloads. Real advantage for batch processing.
The book mentions support for open table formats: Iceberg, Hudi, Delta Lake. If you’re building a data lakehouse, EMR with Spark and Iceberg is a common pattern.
When to use EMR vs. Glue? EMR gives you more control. You can tune Spark configurations, use any Hadoop ecosystem tool, and run long-lived clusters. Glue is simpler but less flexible. For large-scale, complex Spark workloads, EMR is often the better choice. For straightforward ETL, Glue is easier.
Amazon Redshift
Redshift is the AWS data warehouse. Columnar storage, MPP (Massively Parallel Processing), petabyte-scale.
Important concepts:
- RA3 instances: decouple storage and compute. Scale them independently. Big improvement over the older node types.
- Redshift Serverless: auto-scaling, no cluster management. Good for variable workloads.
- Redshift Spectrum: query data in S3 directly from Redshift without loading it. Combines warehouse and data lake.
- Zero-ETL ingestion: pull data from Aurora and DynamoDB without building ETL pipelines. Still relatively new but promising.
- Data sharing: share live data across Redshift clusters, accounts, and regions without copying.
- Redshift ML: create ML models using SQL. Uses SageMaker under the hood.
Redshift is the right choice when you need consistent, fast query performance on structured data. It’s the backbone of many analytics setups. Athena is great for ad-hoc queries, but Redshift handles concurrent users and repeated workloads much better.
For the exam, know the difference between Redshift provisioned, Serverless, and Spectrum. Also understand data sharing and how Redshift fits into a lakehouse architecture.
Amazon QuickSight
QuickSight is the BI and visualization service. Serverless, scalable, and supports embedding dashboards into applications.
Key features:
- SPICE: Super-fast, Parallel, In-memory Calculation Engine. Stores data in-memory for fast dashboards. Basically a cache layer.
- Pay-per-session pricing: you pay only when users actually view dashboards. Cost-effective for organizations with many occasional users.
- QuickSight Q: natural language queries. Ask questions in English, get visualizations. The generative BI feature.
- Embedded analytics: put dashboards inside your own apps.
- Pixel-perfect reports: for compliance and executive reporting.
QuickSight connects to Athena, Redshift, S3, RDS, and many other sources. It’s the last mile of the analytics pipeline. Where the business users actually see the data.
QuickSight has gotten better over the years, but it still competes with Tableau, Looker, and Power BI. If your org is all-in on AWS, QuickSight makes sense. Otherwise, pick whatever BI tool your team already knows.
Reference Architecture: Lakehouse with Glue, Redshift, and Athena
The book describes a lakehouse pattern:
- Raw data lands in S3 (the data lake)
- Lambda triggers Glue jobs to transform raw data into structured formats
- Processed data goes back to S3
- Redshift consumes curated data for SQL analytics
- Athena queries raw and semi-structured data directly in S3
- QuickSight provides dashboards, with Lambda triggering SPICE refreshes
Classic AWS lakehouse. S3 as the foundation, Glue for ETL, Redshift for the warehouse layer, Athena for ad-hoc exploration, QuickSight for visualization. Most AWS data platforms I’ve seen follow some variation of this pattern.
Amazon OpenSearch Service
OpenSearch Service is managed OpenSearch, the open source fork of Elasticsearch. Search, log analytics, and observability.
Two deployment models:
- Provisioned clusters: you pick instance types and count. Predictable workloads.
- OpenSearch Serverless: auto-scales. No cluster management.
Key features:
- OpenSearch Dashboards: visualization tool, like Kibana.
- OpenSearch Ingestion: managed Data Prepper for data ingestion and transformation.
- Tiered storage: hot/warm/cold tiers for cost management.
- Vector search: for semantic search and RAG (Retrieval-Augmented Generation) workflows.
Common use cases: log analytics, full-text search, observability, and now RAG for AI applications.
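Vector search, at its core, is nearest-neighbor lookup over embeddings. A minimal sketch of the similarity ranking involved; OpenSearch's k-NN feature does this at scale with approximate indexes rather than a brute-force scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Tiny 2-d "embeddings" for illustration; real ones have hundreds of dimensions.
docs = {
    "disk full error": [0.9, 0.1],
    "login failed":    [0.1, 0.9],
}
query = [0.8, 0.2]  # pretend embedding of "out of disk space"
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # disk full error
```

Note that the match works with zero keyword overlap between "out of disk space" and "disk full error"; that semantic matching is precisely what makes vector search useful for RAG.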
When to use OpenSearch vs. Athena for log analysis? If you need keyword search, fuzzy matching, and real-time dashboards, OpenSearch. If you want to run SQL queries on log files in S3, Athena. Different tools for different patterns.
Amazon DataZone
DataZone is the data governance and cataloging service. It sits on top of your data lake and warehouse, providing a business-friendly view of your data assets.
Key features:
- Business data catalog: enriched with business context, not just technical metadata. Uses LLMs to generate descriptions.
- Data portal: web-based interface for discovering and requesting access to data. No AWS Console needed.
- Governed data sharing: approval workflows, fine-grained access control.
- Data quality and lineage: automated quality checks, end-to-end view of data movement.
DataZone is for large organizations that need formal data governance. Small team where everyone knows where the data is? You probably don’t need it. At scale with hundreds of datasets across multiple teams, something like DataZone becomes necessary.
AWS Lake Formation
Lake Formation manages permissions for your data lake. It integrates with the Glue Data Catalog and provides fine-grained access control.
Key features:
- Centralized permission management: define who can access what data, down to the column level.
- Fine-Grained Access Control (FGAC): database, table, and column-level permissions.
- Tag-Based Access Control (TBAC): label resources with tags, manage access based on tags. Scales better than per-resource permissions.
- External data sharing: share data via AWS Data Exchange without copying.
Lake Formation works with Athena, Redshift, EMR, and third-party tools like Starburst and Dremio. It’s the security layer for your data lake.
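Tag-based access control scales because one grant covers every resource carrying the tag, present or future. A toy model of the evaluation, with hypothetical names (Lake Formation calls these LF-Tags, and the real policy engine is richer than this):

```python
# Resources labeled with tags, and grants expressed as required tag values.
tables = {
    "sales.orders": {"domain": "sales", "sensitivity": "internal"},
    "hr.salaries":  {"domain": "hr", "sensitivity": "confidential"},
}
grants = {
    "analyst": {"domain": "sales"},
    "hr_lead": {"domain": "hr", "sensitivity": "confidential"},
}

def can_read(principal, table):
    """A principal may read a table whose tags match ALL of its granted key/values."""
    required = grants.get(principal)
    if required is None:
        return False
    tags = tables[table]
    return all(tags.get(k) == v for k, v in required.items())

print(can_read("analyst", "sales.orders"))  # True
print(can_read("analyst", "hr.salaries"))   # False
```

The payoff shows up when a new `sales.refunds` table appears: tag it `domain=sales` and the analyst can read it with no new grant, which is the scaling advantage over per-resource permissions.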
The relationship between Lake Formation and DataZone can be confusing. Lake Formation handles technical permissions (who can query what table). DataZone handles business governance (who can discover, request, and use data assets). They complement each other.
Summary
Chapter 3 Part 1 covers a lot of ground. Here’s the quick mental model:
Streaming: Kinesis Data Streams (real-time), Firehose (delivery), Flink (complex processing), MSK (managed Kafka)
ETL and Cataloging: Glue (ETL + catalog), DataBrew (visual data prep)
Querying: Athena (serverless SQL on S3), EMR (managed Spark/Hadoop clusters)
Warehousing: Redshift (MPP data warehouse)
Visualization: QuickSight (BI dashboards)
Search: OpenSearch (log analytics, full-text search)
Governance: DataZone (business catalog), Lake Formation (permissions)
The exam tests whether you can pick the right service for a given scenario. Most questions come down to understanding the trade-offs between these services. Serverless vs. managed clusters. Real-time vs. near-real-time. Ad-hoc queries vs. high-concurrency workloads.
Next chapter section covers the auxiliary services that support these analytics workloads.