Analytics with QuickSight, Athena, Redshift SQL, and Notebooks on AWS


Previous: Data Modeling

Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

Chapter 6 is titled “Data Operations and Support” and it’s a big one. This post covers the first part: analytics operations. QuickSight for BI dashboards, Athena for SQL on S3, Redshift for heavy SQL analytics, and notebooks for interactive data exploration. These are the tools you use after you’ve built your pipelines and stored your data. Now you actually look at it.

Amazon QuickSight

QuickSight is AWS’s serverless BI service. Connect data sources, build visualizations, group them into analyses, publish dashboards. It competes with Tableau, Power BI, and Looker. The serverless part is its biggest selling point. No servers to manage, no licenses to count per seat in the traditional sense.

Data Sources

QuickSight connects to a wide range of sources:

  • AWS databases: Aurora, RDS (MySQL, PostgreSQL, SQL Server, Oracle)
  • Big data services: Redshift, S3 (CSV, TSV, JSON, XLSX), OpenSearch, Athena, IoT Analytics
  • Third-party databases: Snowflake, Starburst, Trino, Teradata, Microsoft SQL Server, MySQL, MariaDB, Oracle, PostgreSQL
  • SaaS apps: Salesforce, ServiceNow, Jira, GitHub, Adobe Analytics
  • On-premises: databases accessible via VPC or Direct Connect

This flexibility is genuinely useful. Most BI tools struggle with connecting to everything. QuickSight handles it reasonably well out of the box.

Datasets and SPICE

Once connected to a data source, you create a dataset. Two options:

  1. Direct query - queries the live data source every time
  2. SPICE import - loads data into an in-memory cache called SPICE (Super-fast, Parallel, In-memory Calculation Engine)

SPICE is the better option in most cases. It preprocesses and caches data so your dashboard queries are fast. No waiting for slow database queries. Since SPICE caches results, you also avoid repeated query costs on pay-per-query services like Athena. That last point matters. I’ve seen teams rack up serious Athena bills because their QuickSight dashboards were doing direct queries and every dashboard refresh was a full table scan.

The tradeoff: SPICE data isn’t real-time. You need to set up refreshes.

Refreshing SPICE Datasets

Four ways to keep SPICE data fresh:

Manual refresh. Click a button in the console. Good for dev and test. Not for production.

Scheduled refresh. Daily, weekly, or monthly. Full refresh by default, fetching the entire dataset each time.

Incremental refresh. Enterprise edition only. Pick a date column and a lookback window. QuickSight only refreshes data from that window. Can run every 15 minutes, 30 minutes, hourly, daily, weekly, or monthly. What you want for large datasets where full refreshes take too long.

API and automation. Trigger-based. When a new file lands in S3, call the QuickSight API to refresh the dataset. Best for event-driven pipelines where data arrives at irregular intervals.

Visualizations

QuickSight supports the standard chart types:

  • Bar charts (vertical, horizontal, stacked) for comparing categories
  • Line charts for trends over time
  • Pie and donut charts for proportions
  • Scatter plots for identifying correlations and outliers
  • Histograms for distribution analysis
  • Box plots for comparing distributions between groups
  • Treemaps for hierarchical data
  • Pivot tables for multidimensional slicing and dicing
  • KPIs for showing key metrics with trend indicators

Also geospatial maps, funnel charts, gauges, and calendar heatmaps. QuickSight has an AutoGraph feature that suggests the best chart type based on the data fields you select. Surprisingly decent for quick exploration.

Presentation Formats

Visualizations live inside these containers:

  • Analyses - where you build and organize multiple visualizations across sheets
  • Sheets - individual pages within an analysis, each showing a specific set of visuals
  • Dashboards - the published, shareable version of an analysis with interactive filters
  • Data stories - narrative-driven presentations with descriptive text and annotations guiding viewers through the data

QuickSight Q (GenBI)

QuickSight Q is the GenAI layer on top of QuickSight. Ask questions in plain English, get charts and answers back. Three main capabilities:

Generate stories. Give it a prompt and select some visuals. It drafts a data story you can customize. Not a replacement for actual analysis, but a decent starting point.

Executive summaries. Auto-generated highlights from your dashboard. Enable it when publishing by checking “Allow executive summary.” Useful for stakeholders who’ll never click through individual charts.

Dashboard Q&A. Users type questions about the data shown in the dashboard. Enable by checking “Allow data Q&A” when publishing. Select which datasets to include. Lets people explore data without knowing how to build charts.

Quick take on QuickSight: good enough for many use cases, especially if you’re already deep in the AWS ecosystem. SPICE engine is fast. Pricing model scales well. The visualization builder feels clunky compared to Tableau though. Power users will hit limits quickly. For the exam, know the data flow: data source to dataset (direct or SPICE) to visualization to analysis to dashboard.

SQL Analytics with Amazon Athena

Athena is the serverless SQL query engine for S3. No infrastructure to manage. Pay per query based on data scanned. Write SQL, point it at your data lake, get results. Go-to service for ad-hoc data exploration on AWS.

Choice of Querying Engine

Two engines:

Trino SQL. The default. Write SQL in Athena’s query editor or call the StartQueryExecution API. Reads from the Glue Data Catalog and queries data in S3.

Spark SQL/PySpark. Uses the Athena notebook editor. Spark DataFrames, Spark SQL, Spark’s advanced analytics functions. Good for data engineers who need more than SQL can offer.

Workgroups

Workgroups separate and control different query workloads. Each workgroup can use either the Athena SQL engine or PySpark engine. Use them to:

  • Separate automated workloads (report generation) from ad-hoc analyst queries
  • Control access per team via IAM policies
  • Set workgroup-wide settings like query result location and encryption
  • Publish query metrics to CloudWatch and set cost allocation tags

Workgroups are essential in any serious Athena deployment. Without them, one team’s expensive query can eat all the resources. No way to track who’s spending what.

Capacity Reservations

By default, Athena uses on-demand pricing. The problem: no guarantees during peak loads. Capacity reservations let you preallocate compute power measured in DPUs (1 DPU = 4 vCPUs + 16 GB RAM). Assign workgroups to reserved capacity. Queries in those workgroups run on reserved resources. Everyone else stays on on-demand.

Useful for production workloads that must finish on time. If your nightly report queries sometimes get slow because other tenants are hammering Athena, reserved capacity fixes that.

Federated SQL

Athena Federated Query extends querying beyond S3. Uses Lambda-based connectors to query external data sources. The connector handles connecting, authenticating, translating the query, and converting results back to a format Athena understands.

Supported connectors include:

  • Non-AWS: Snowflake, Google BigQuery, Azure Data Lake Storage, Azure Synapse, Cloudera Hive/Impala
  • AWS: DocumentDB, DynamoDB, OpenSearch, CloudWatch, Redshift, MSK (Kafka)

You can join S3 data with Snowflake data in a single SQL statement. Common use cases: multicloud analytics, one-time cross-source analysis, ad-hoc data exploration without moving data first.

Use Cases

Athena handles more than basic SELECT queries:

  • Query S3 data: service logs, data lake tables in open formats
  • Query federated sources via connectors
  • Query Apache Iceberg tables (including time travel) and Apache Hudi datasets
  • Run geospatial analytics with built-in geospatial functions
  • Run ML inference from SageMaker models using the USING EXTERNAL FUNCTION clause
  • Extend SQL with user-defined functions (UDFs) powered by Lambda

The SageMaker integration is worth noting. Define a function in Athena that references a SageMaker endpoint, call it in SQL. The model processes your data and returns results inline. An anomaly detection model can flag outliers directly in your query results, for example.

Lambda UDFs work similarly. Define the function, point at a Lambda ARN, call in SQL. Useful for custom transformations like polygon-to-cell conversion, text translation, or sentiment analysis.

DDL Capabilities

Athena supports a subset of DDL statements:

  • CREATE DATABASE, TABLE, VIEW
  • DROP DATABASE, TABLE, VIEW
  • ALTER TABLE to add/drop/rename partitions and replace columns
  • DESCRIBE for table metadata
  • SHOW statements to list databases, tables, views, columns
  • MSCK REPAIR TABLE to sync partition metadata with Glue Data Catalog

That last one is important. When you add new partitions to S3, Athena doesn’t know about them until the metadata is updated. MSCK REPAIR TABLE scans S3 for new Hive-compatible partitions and adds them to the catalog. Use it when creating a partitioned table from existing data. For frequent updates like daily partitions, prefer ALTER TABLE ADD PARTITION to avoid timeout issues from excessive scanning.

Best Practices

Athena charges by data scanned. Everything here is about scanning less data:

Partition your data. Pick partition keys that match your query patterns. If you always filter by date, partition by year/month/day.

Use partition projection. For highly partitioned tables, partition projection avoids metadata lookups in the Glue Catalog. Calculates partitions on the fly based on configured patterns (dates, integers, enumerations). Faster queries, automatic partition management.

Enable query result reuse. Opt-in feature. Athena caches results and returns them for identical queries within a configurable time window (up to 7 days). Great for recurring queries with stable data. Doesn’t work with DDL, CTAS, INSERT INTO, MERGE, or UNLOAD.

Use columnar formats like Parquet. Columnar storage means Athena reads only the columns you need. If your table has 100 columns and you need 3, Parquet reads just those 3. CSV reads everything.

Select specific columns. Use SELECT column1, column2 instead of SELECT *. Directly reduces the amount of data scanned and your bill.

The single biggest cost and performance improvement with Athena is switching from CSV/JSON to Parquet with proper partitioning. I’ve seen query costs drop by 90% just from that change. If you remember one thing about Athena best practices, that’s the one.
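
The arithmetic behind that 90% is easy to sketch. Athena's on-demand price is $5 per TB scanned (us-east-1 at the time of writing); the table sizes, compression ratio, and column counts below are made-up illustrative assumptions.

```python
# Back-of-the-envelope scan-cost comparison: full-scan CSV vs. a Parquet
# query that reads 3 of 100 columns from a compressed copy of the table.
PRICE_PER_TB = 5.00  # Athena on-demand price, us-east-1, at time of writing

def scan_cost_usd(bytes_scanned):
    return bytes_scanned / 1024**4 * PRICE_PER_TB

csv_table_bytes = 2 * 1024**4  # 2 TB of CSV, read in full on every query

# Assume Parquet compresses this data to ~25% of the CSV size, and the
# query touches 3 of 100 columns, so only those column chunks are read.
parquet_scan_bytes = csv_table_bytes * 0.25 * (3 / 100)

print(scan_cost_usd(csv_table_bytes))            # 10.0
print(round(scan_cost_usd(parquet_scan_bytes), 4))  # 0.075
```

Under these (hypothetical) numbers the per-query cost drops from $10.00 to about $0.075, which is where "90%+ savings" claims come from.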

SQL Analytics with Amazon Redshift

Redshift is the heavy hitter. Fully managed, petabyte-scale data warehouse using massively parallel processing (MPP). While Athena is best for ad-hoc queries on S3, Redshift handles complex analytical workloads at scale with consistent performance.

Query Redshift through the Query Editor v2 in the console, any SQL IDE via JDBC/ODBC (DBeaver, SQLWorkbench), BI tools (QuickSight, Tableau, Power BI), or the Redshift Data API for programmatic access.

SQL Functions

Redshift is built on PostgreSQL but heavily modified for analytics:

  • Standard aggregates: AVG, COUNT, MAX, MIN, SUM
  • Advanced aggregates: APPROXIMATE PERCENTILE_DISC, LISTAGG, MEDIAN
  • Statistical functions: STDDEV_SAMP, STDDEV_POP, VAR_SAMP, VAR_POP
  • Specialized: ANY_VALUE, PERCENTILE_CONT, bit-wise aggregates, HyperLogLog, array aggregation
  • Date/time, text manipulation, JSON, ML, and geospatial functions

Extensive function library. For most analytical SQL work, Redshift has you covered without needing UDFs.

Semi-Structured Data with SUPER Type

Redshift handles semi-structured data (JSON, Avro, Ion) natively using the SUPER data type. Store entire JSON documents in a SUPER column without defining the schema upfront. Data can have nested objects and arrays.

To query SUPER data, Redshift uses PartiQL, an extension of SQL that understands nested structures.

Navigating JSON objects/arrays:

SELECT c_orders[0].o_orderkey FROM customer_orders_lineitem;

Array elements accessed by index, attributes by dot notation. Simple and intuitive.

Unnesting arrays:

SELECT c.c_name, o.o_orderkey, o.o_orderdate
FROM customer_orders_lineitem c, c.c_orders o;

Flattens the orders array. Each order becomes its own row. Customer#1 with two orders produces two rows.

Unpivoting objects:

SELECT attr, val
FROM customer_orders_lineitem c, UNPIVOT c.c_orders[0] AS val AT attr
WHERE c.c_custkey = 100;

Turns object properties into rows with attribute name and value columns.

PartiQL is powerful for semi-structured data. Avoids the pain of shredding JSON into separate columns, which is what you had to do before SUPER type existed. If your data has evolving schemas, SUPER is the right approach.

Geospatial Data Analysis

Redshift supports spatial data types and functions:

  • Store geographic data using the GEOMETRY type (points, polygons, lines)
  • Use spatial functions: ST_Distance, ST_Within, ST_Intersects, ST_DistanceSphere
  • Support for WKT (Well-Known Text) and GeoJSON formats
  • Spatial joins using the && operator

Example: count Airbnb listings within 500 meters of the Brandenburg Gate:

SELECT count(*) 
FROM accommodations 
WHERE ST_DistanceSphere(
    shape, 
    ST_GeomFromText('POINT(13.377704 52.516431)', 4326)
) < 500;

Or find which ZIP code polygon contains a specific property:

SELECT a.price, a.name, z.spatial_name
FROM accommodations a, zipcode z
WHERE price = 9000 AND ST_Within(a.shape, z.wkb_geometry);

Geospatial in Redshift works well for moderate spatial workloads. For heavy GIS work you’d want PostGIS or a dedicated spatial database, but for analytics use cases it’s solid.

Querying Data Lake via Redshift Spectrum

Redshift Spectrum lets you query data in S3 directly from Redshift without loading it first. Supports open data formats (Parquet, ORC, Avro, CSV, JSON) and open table formats (Iceberg, Delta Lake, Hudi).

The workflow:

  1. Define external schemas and tables using CREATE EXTERNAL TABLE/CREATE EXTERNAL SCHEMA, or use the Glue Data Catalog
  2. Partition external tables for query pruning
  3. Join S3 external tables with Redshift internal tables in the same query
  4. Updated S3 files are immediately available for queries

One of Redshift’s strongest features. Keep hot data in Redshift for fast queries. Cold and historical data stays in S3. Spectrum bridges both. Your analysts write one SQL query that joins across both storage tiers. They don’t need to know or care where the data physically lives.

Analyzing Operational Data Stores

Redshift federated queries let you directly query Amazon RDS and Aurora databases from Redshift. No ETL needed. Redshift connects to the remote database, runs the query, retrieves matching rows, processes them locally.

Useful for joining operational data with analytical data. Combining real-time customer records from Aurora with historical order data in Redshift, for example. Compute is split between Redshift and the remote database.

Redshift ML and Generative AI

Redshift ML integrates machine learning directly into SQL. Connects to SageMaker, Forecast, and SageMaker JumpStart. Provide training data, Redshift ML picks the best model via SageMaker Autopilot, make predictions using SQL.

Supported algorithms: XGBoost, neural networks, k-means clustering, linear regression, forecasting, and generative AI model invocation.

Use cases:

  • Prediction: churn prediction, fraud detection, text classification (binary, multiclass, regression)
  • Clustering: grouping customers by behavior, grouping content by attributes
  • Generative AI: sentiment analysis, text summarization, translation

The appeal: SQL users can do ML without leaving Redshift. No Python, no separate ML pipeline, no data movement. For simple prediction and classification tasks, practical shortcut.

User-Defined Functions (UDFs)

Three types of scalar UDFs:

SQL UDFs. Created with a SQL SELECT clause. Simple, reusable transformations and calculations.

Python UDFs. Created with Python code. Can import custom modules. Good for complex transformations easier in Python than SQL.

Lambda UDFs. Call Lambda functions from SQL queries. Written in any Lambda-supported language (Java, Go, Node.js, Python, Ruby, C#). Most flexible option. Use for integrating with external APIs, third-party services, or data sources Redshift doesn’t natively connect to. Tokenization services are a common use case.
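
On the Lambda side, Redshift batches rows into a JSON event and expects one result per input row back. A sketch of the handler, assuming the documented request/response shape (arguments in, success/results out); the masking logic is a stand-in for a real tokenization service.

```python
# Lambda function backing a Redshift Lambda UDF. Redshift sends
# event["arguments"] as a list of rows (each row a list of the UDF's
# arguments) and expects {"success": ..., "results": [...]} in return.
import json

def handler(event, context):
    try:
        results = []
        for row in event["arguments"]:
            value = row[0]                           # single-argument UDF
            results.append("***" + str(value)[-4:])  # naive masking stand-in
        return json.dumps({"success": True, "results": results})
    except Exception as exc:
        return json.dumps({"success": False, "error_msg": str(exc)})
```

In SQL you'd then declare the UDF with CREATE EXTERNAL FUNCTION pointing at this Lambda and call it like any scalar function.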

Practical note on Athena vs Redshift: use Athena for ad-hoc exploration and infrequent queries on S3 data. Use Redshift when you need consistent sub-second performance on complex analytical queries, when you join structured warehouse data with semi-structured data, or when your query patterns are predictable and frequent. The cost models differ. Athena charges per scan. Redshift charges per cluster hour (or per second of compute used for Serverless). For heavy workloads, Redshift is usually cheaper per query. For light, occasional queries, Athena wins.

Analyzing Data Using Notebooks

Notebooks provide interactive environments for data exploration using Apache Spark. Two options on AWS: Glue Interactive Sessions and EMR Notebooks.

AWS Glue Interactive Sessions

Jupyter-compatible notebooks backed by on-demand Apache Spark. No infrastructure to manage.

Key features:

  • Jupyter notebooks for writing code, visualizing data, and documenting work
  • On-demand Spark as the compute backend for processing large datasets
  • Built-in visualization with Matplotlib and Seaborn support
  • Custom libraries by uploading to S3 and referencing via the additional_python_modules magic command

Simple example from the book loads the Iris dataset and creates a pair plot:

import seaborn as sns 
import matplotlib.pyplot as plt 

iris = sns.load_dataset("iris") 
sns.pairplot(iris, hue="species") 
%matplot plt 

The %matplot plt magic command renders the visualization in the notebook. Straightforward for exploratory analysis.

Amazon EMR Notebooks

Managed Jupyter environment designed for Apache Spark on EMR clusters. Supports PySpark, Spark SQL, Spark R, and Scala through Spark Magic kernels.

The workflow:

  1. Create an EMR cluster with required components (Spark, Hive, Presto)
  2. Create an EMR Notebook and connect it to the cluster
  3. Read data from S3, Glue Data Catalog, or other sources
  4. Perform exploratory analysis and visualization
  5. Share notebooks with team members
  6. Productionize workflows as scheduled jobs

Attach multiple notebooks to one cluster, detach and reattach to different clusters. The notebook environment is managed, no software installation needed.

The difference between the two comes down to infrastructure control. Glue is fully serverless. You don’t see or manage any clusters. EMR gives you more control over cluster configuration, instance types, and installed libraries. For quick data exploration, Glue is simpler. For heavy processing or specific Spark configurations, EMR is the better choice.

Exam Tips

For the certification exam, focus on:

  • Know when to use QuickSight SPICE vs direct query (SPICE for performance and cost savings, direct for real-time)
  • Understand SPICE refresh options and when each is appropriate
  • Know Athena workgroups for access control, cost tracking, and resource isolation
  • Understand Athena capacity reservations for guaranteed compute
  • Know federated query patterns for both Athena (Lambda connectors) and Redshift (RDS/Aurora)
  • Understand the SUPER data type and PartiQL for semi-structured data in Redshift
  • Know Redshift Spectrum for querying S3 from Redshift
  • Understand the three types of Redshift UDFs (SQL, Python, Lambda) and when to use each
  • Know the difference between Glue Interactive Sessions (serverless) and EMR Notebooks (cluster-based)

The exam likes to test your ability to pick the right tool. Athena for ad-hoc S3 queries. Redshift for complex warehouse analytics. QuickSight for BI dashboards. Notebooks for interactive exploration. Each has its lane.

Next: Pipeline Resiliency and Cost Optimization


