Analytics with QuickSight, Athena, Redshift SQL, and Notebooks on AWS
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Chapter 6 is titled “Data Operations and Support” and it’s a big one. This post covers the first part: analytics operations. QuickSight for BI dashboards, Athena for SQL on S3, Redshift for heavy SQL analytics, and notebooks for interactive data exploration. These are the tools you use after you’ve built your pipelines and stored your data. Now you actually look at it.
Amazon QuickSight
QuickSight is AWS’s serverless BI service. Connect data sources, build visualizations, group them into analyses, publish dashboards. It competes with Tableau, Power BI, and Looker. The serverless part is its biggest selling point. No servers to manage, no licenses to count per seat in the traditional sense.
Data Sources
QuickSight connects to a wide range of sources:
- AWS databases: Aurora, RDS (MySQL, PostgreSQL, SQL Server, Oracle)
- Big data services: Redshift, S3 (CSV, TSV, JSON, XLSX), OpenSearch, Athena, IoT Analytics
- Third-party databases: Snowflake, Starburst, Trino, Teradata, Microsoft SQL Server, MySQL, MariaDB, Oracle, PostgreSQL
- SaaS apps: Salesforce, ServiceNow, Jira, GitHub, Adobe Analytics
- On-premises: databases accessible via VPC or Direct Connect
This flexibility is genuinely useful. Most BI tools struggle with connecting to everything. QuickSight handles it reasonably well out of the box.
Datasets and SPICE
Once connected to a data source, you create a dataset. Two options:
- Direct query - queries the live data source every time
- SPICE import - loads data into an in-memory cache called SPICE (Super-fast, Parallel, In-memory Calculation Engine)
SPICE is the better option in most cases. It preprocesses and caches data so your dashboard queries are fast. No waiting for slow database queries. Since SPICE caches results, you also avoid repeated query costs on pay-per-query services like Athena. That last point matters. I’ve seen teams rack up serious Athena bills because their QuickSight dashboards were doing direct queries and every dashboard refresh was a full table scan.
The tradeoff: SPICE data isn’t real-time. You need to set up refreshes.
Refreshing SPICE Datasets
Four ways to keep SPICE data fresh:
Manual refresh. Click a button in the console. Good for dev and test. Not for production.
Scheduled refresh. Daily, weekly, or monthly. Full refresh by default, fetching the entire dataset each time.
Incremental refresh. Enterprise edition only. Pick a date column and a lookback window. QuickSight only refreshes data from that window. Can run every 15 minutes, 30 minutes, hourly, daily, weekly, or monthly. What you want for large datasets where full refreshes take too long.
API and automation. Trigger-based. When a new file lands in S3, call the QuickSight API to refresh the dataset. Best for event-driven pipelines where data arrives at irregular intervals.
Visualizations
QuickSight supports the standard chart types:
- Bar charts (vertical, horizontal, stacked) for comparing categories
- Line charts for trends over time
- Pie and donut charts for proportions
- Scatter plots for identifying correlations and outliers
- Histograms for distribution analysis
- Box plots for comparing distributions between groups
- Treemaps for hierarchical data
- Pivot tables for multidimensional slicing and dicing
- KPIs for showing key metrics with trend indicators
Also geospatial maps, funnel charts, gauges, and calendar heatmaps. QuickSight has an AutoGraph feature that suggests the best chart type based on the data fields you select. Surprisingly decent for quick exploration.
Presentation Formats
Visualizations live inside these containers:
- Analyses - where you build and organize multiple visualizations across sheets
- Sheets - individual pages within an analysis, each showing a specific set of visuals
- Dashboards - the published, shareable version of an analysis with interactive filters
- Data stories - narrative-driven presentations with descriptive text and annotations guiding viewers through the data
QuickSight Q (GenBI)
QuickSight Q is the GenAI layer on top of QuickSight. Ask questions in plain English, get charts and answers back. Three main capabilities:
Generate stories. Give it a prompt and select some visuals. It drafts a data story you can customize. Not a replacement for actual analysis, but a decent starting point.
Executive summaries. Auto-generated highlights from your dashboard. Enable it when publishing by checking “Allow executive summary.” Useful for stakeholders who’ll never click through individual charts.
Dashboard Q&A. Users type questions about the data shown in the dashboard. Enable by checking “Allow data Q&A” when publishing. Select which datasets to include. Lets people explore data without knowing how to build charts.
Quick take on QuickSight: good enough for many use cases, especially if you’re already deep in the AWS ecosystem. SPICE engine is fast. Pricing model scales well. The visualization builder feels clunky compared to Tableau though. Power users will hit limits quickly. For the exam, know the data flow: data source to dataset (direct or SPICE) to visualization to analysis to dashboard.
SQL Analytics with Amazon Athena
Athena is the serverless SQL query engine for S3. No infrastructure to manage. Pay per query based on data scanned. Write SQL, point it at your data lake, get results. Go-to service for ad-hoc data exploration on AWS.
Choice of Querying Engine
Two engines:
Trino SQL. The default. Write SQL in Athena’s query editor or call the StartQueryExecution API. Reads from the Glue Data Catalog and queries data in S3.
Spark SQL/PySpark. Uses the Athena notebook editor. Spark DataFrames, Spark SQL, Spark’s advanced analytics functions. Good for data engineers who need more than SQL can offer.
Workgroups
Workgroups separate and control different query workloads. Each workgroup can use either the Athena SQL engine or PySpark engine. Use them to:
- Separate automated workloads (report generation) from ad-hoc analyst queries
- Control access per team via IAM policies
- Set workgroup-wide settings like query result location and encryption
- Publish query metrics to CloudWatch and set cost allocation tags
Workgroups are essential in any serious Athena deployment. Without them, one team’s expensive query can eat all the resources. No way to track who’s spending what.
Capacity Reservations
By default, Athena uses on-demand pricing. The problem: no guarantees during peak loads. Capacity reservations let you preallocate compute power measured in DPUs (1 DPU = 4 vCPUs + 16 GB RAM). Assign workgroups to reserved capacity. Queries in those workgroups run on reserved resources. Everyone else stays on on-demand.
Useful for production workloads that must finish on time. If your nightly report queries sometimes get slow because other tenants are hammering Athena, reserved capacity fixes that.
Federated SQL
Athena Federated Query extends querying beyond S3. Uses Lambda-based connectors to query external data sources. The connector handles connecting, authenticating, translating the query, and converting results back to a format Athena understands.
Supported connectors include:
- Non-AWS: Snowflake, Google BigQuery, Azure Data Lake Storage, Azure Synapse, Cloudera Hive/Impala
- AWS: DocumentDB, DynamoDB, OpenSearch, CloudWatch, Redshift, MSK (Kafka)
You can join S3 data with Snowflake data in a single SQL statement. Common use cases: multicloud analytics, one-time cross-source analysis, ad-hoc data exploration without moving data first.
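As a sketch, such a cross-source join might look like this; the snowflake_src data source name and all table names are hypothetical, and the connector must already be registered as an Athena data source:

```sql
-- Glue-cataloged S3 table joined with a table behind a Snowflake connector
SELECT o.order_id, o.order_total, c.customer_tier
FROM awsdatacatalog.sales_db.orders o
JOIN snowflake_src.analytics.customers c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01';
```

The only syntactic difference from a normal query is the three-part name pointing at the federated catalog.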
Use Cases
Athena handles more than basic SELECT queries:
- Query S3 data: service logs, data lake tables in open formats
- Query federated sources via connectors
- Query Apache Iceberg tables with time-travel, and Hudi datasets
- Run geospatial analytics with built-in geospatial functions
- Run ML inference from SageMaker models using the USING EXTERNAL FUNCTION clause
- Extend SQL with user-defined functions (UDFs) powered by Lambda
The SageMaker integration is worth noting. Define a function in Athena that references a SageMaker endpoint, call it in SQL. The model processes your data and returns results inline. Anomaly detection model flags outliers directly in your query results, for example.
Lambda UDFs work similarly. Define the function, point at a Lambda ARN, call in SQL. Useful for custom transformations like polygon-to-cell conversion, text translation, or sentiment analysis.
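A minimal sketch of the pattern, assuming a Lambda function named my-sentiment-udf has already been deployed with the Athena UDF handler; the table and column names are made up:

```sql
-- Declare the UDF inline, then call it like any scalar function
USING EXTERNAL FUNCTION detect_sentiment(review_text VARCHAR)
    RETURNS VARCHAR
    LAMBDA 'my-sentiment-udf'
SELECT review_id, detect_sentiment(review_text) AS sentiment
FROM product_reviews
LIMIT 10;
```

The function declaration lives in the query itself, so no catalog registration is needed before calling it.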
DDL Capabilities
Athena supports a subset of DDL statements:
- CREATE DATABASE, CREATE TABLE, CREATE VIEW
- DROP DATABASE, DROP TABLE, DROP VIEW
- ALTER TABLE to add/drop/rename partitions and replace columns
- DESCRIBE for table metadata
- SHOW statements to list databases, tables, views, columns
- MSCK REPAIR TABLE to sync partition metadata with the Glue Data Catalog
That last one is important. When you add new partitions to S3, Athena doesn’t know about them until the metadata is updated. MSCK REPAIR TABLE scans S3 for new Hive-compatible partitions and adds them to the catalog. Use it when creating a partitioned table from existing data. For frequent updates like daily partitions, prefer ALTER TABLE ADD PARTITION to avoid timeout issues from excessive scanning.
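The two approaches side by side, using a hypothetical web_logs table partitioned by dt:

```sql
-- One-time: scan S3 for all existing Hive-style partitions
-- (s3://my-bucket/web_logs/dt=2024-01-01/ and so on)
MSCK REPAIR TABLE web_logs;

-- Recurring: register just the new partition, no S3 scan required
ALTER TABLE web_logs ADD IF NOT EXISTS
PARTITION (dt = '2024-01-02')
LOCATION 's3://my-bucket/web_logs/dt=2024-01-02/';
```

The ALTER TABLE form is what you'd put in a daily pipeline step right after the data lands.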
Best Practices
Athena charges by data scanned. Everything here is about scanning less data:
Partition your data. Pick partition keys that match your query patterns. If you always filter by date, partition by year/month/day.
Use partition projection. For highly partitioned tables, partition projection avoids metadata lookups in the Glue Catalog. Calculates partitions on the fly based on configured patterns (dates, integers, enumerations). Faster queries, automatic partition management.
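Projection is configured through table properties. A sketch for a date-partitioned table, with the table name, bucket, and date range as placeholders:

```sql
-- Athena computes valid dt values from the range and format below
-- instead of listing partitions in the Glue Data Catalog
ALTER TABLE web_logs SET TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2023-01-01,NOW',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-bucket/web_logs/dt=${dt}/'
);
```

Once set, new daily partitions need no MSCK REPAIR or ALTER TABLE at all.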
Enable query result reuse. Opt-in feature. Athena caches results and returns them for identical queries within a configurable time window (up to 7 days). Great for recurring queries with stable data. Doesn’t work with DDL, CTAS, INSERT INTO, MERGE, or UNLOAD.
Use columnar formats like Parquet. Columnar storage means Athena reads only the columns you need. If your table has 100 columns and you need 3, Parquet reads just those 3. CSV reads everything.
Select specific columns. Use SELECT column1, column2 instead of SELECT *. Directly reduces the amount of data scanned and your bill.
The single biggest cost and performance improvement with Athena is switching from CSV/JSON to Parquet with proper partitioning. I’ve seen query costs drop by 90% just from that change. If you remember one thing about Athena best practices, that’s the one.
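That conversion doesn't require a separate ETL job; a CTAS statement in Athena itself can do it. A sketch with hypothetical table and bucket names (note the partition column must come last in the SELECT list):

```sql
-- Rewrite a CSV-backed table as partitioned Parquet
CREATE TABLE events_parquet
WITH (
  format            = 'PARQUET',
  external_location = 's3://my-bucket/events_parquet/',
  partitioned_by    = ARRAY['event_date']
) AS
SELECT user_id, event_type, payload, event_date
FROM events_csv;
```

After this, point dashboards and recurring queries at events_parquet and let the CSV table retire.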
SQL Analytics with Amazon Redshift
Redshift is the heavy hitter. Fully managed, petabyte-scale data warehouse using massively parallel processing (MPP). While Athena is best for ad-hoc queries on S3, Redshift handles complex analytical workloads at scale with consistent performance.
Query Redshift through the Query Editor v2 in the console, any SQL IDE via JDBC/ODBC (DBeaver, SQLWorkbench), BI tools (QuickSight, Tableau, Power BI), or the Redshift Data API for programmatic access.
SQL Functions
Redshift is built on PostgreSQL but heavily modified for analytics:
- Standard aggregates: AVG, COUNT, MAX, MIN, SUM
- Advanced aggregates: APPROXIMATE PERCENTILE_DISC, LISTAGG, MEDIAN
- Statistical functions: STDDEV_SAMP, STDDEV_POP, VAR_SAMP, VAR_POP
- Specialized: ANY_VALUE, PERCENTILE_CONT, bit-wise aggregates, HyperLogLog, array aggregation
- Date/time, text manipulation, JSON, ML, and geospatial functions
Extensive function library. For most analytical SQL work, Redshift has you covered without needing UDFs.
Semi-Structured Data with SUPER Type
Redshift handles semi-structured data (JSON, Avro, Ion) natively using the SUPER data type. Store entire JSON documents in a SUPER column without defining the schema upfront. Data can have nested objects and arrays.
To query SUPER data, Redshift uses PartiQL, an extension of SQL that understands nested structures.
Navigating JSON objects/arrays:
SELECT c_orders[0].o_orderkey FROM customer_orders_lineitem;
Array elements accessed by index, attributes by dot notation. Simple and intuitive.
Unnesting arrays:
SELECT c.c_name, o.o_orderkey, o.o_orderdate
FROM customer_orders_lineitem c, c.c_orders o;
Flattens the orders array. Each order becomes its own row. Customer#1 with two orders produces two rows.
Unpivoting objects:
SELECT attr, val
FROM customer_orders_lineitem c, UNPIVOT c.c_orders[0] AS val AT attr
WHERE c.c_custkey = 100;
Turns object properties into rows with attribute name and value columns.
PartiQL is powerful for semi-structured data. Avoids the pain of shredding JSON into separate columns, which is what you had to do before SUPER type existed. If your data has evolving schemas, SUPER is the right approach.
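For context, the sample table behind the queries above could be created along these lines; the DDL and sample values are illustrative, not from the book:

```sql
CREATE TABLE customer_orders_lineitem (
    c_custkey INT,
    c_name    VARCHAR(30),
    c_orders  SUPER  -- nested orders array, no schema declared upfront
);

-- JSON_PARSE converts a JSON string into a SUPER value
INSERT INTO customer_orders_lineitem VALUES (
    100,
    'Customer#000000100',
    JSON_PARSE('[{"o_orderkey": 1001, "o_orderdate": "2024-01-15"},
                 {"o_orderkey": 1002, "o_orderdate": "2024-02-03"}]')
);
```

Note that the JSON goes in as-is. If tomorrow's orders gain a new attribute, no ALTER TABLE is needed.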
Geospatial Data Analysis
Redshift supports spatial data types and functions:
- Store geographic data using the GEOMETRY type (points, polygons, lines)
- Use spatial functions: ST_Distance, ST_Within, ST_Intersects, ST_DistanceSphere
- Support for WKT (Well-Known Text) and GeoJSON formats
- Spatial joins using the && operator
Example: count Airbnb listings within 500 meters of the Brandenburg Gate:
SELECT count(*)
FROM accommodations
WHERE ST_DistanceSphere(
shape,
ST_GeomFromText('POINT(13.377704 52.516431)', 4326)
) < 500;
Or find which ZIP code polygon contains a specific property:
SELECT a.price, a.name, z.spatial_name
FROM accommodations a, zipcode z
WHERE price = 9000 AND ST_Within(a.shape, z.wkb_geometry);
Geospatial in Redshift works well for moderate spatial workloads. For heavy GIS work you’d want PostGIS or a dedicated spatial database, but for analytics use cases it’s solid.
Querying Data Lake via Redshift Spectrum
Redshift Spectrum lets you query data in S3 directly from Redshift without loading it first. Supports open data formats (Parquet, ORC, Avro, CSV, JSON) and open table formats (Iceberg, Delta Lake, Hudi).
The workflow:
- Define external schemas and tables using CREATE EXTERNAL TABLE / CREATE EXTERNAL SCHEMA, or use the Glue Data Catalog
- Partition external tables for query pruning
- Join S3 external tables with Redshift internal tables in the same query
- Updated S3 files are immediately available for queries
One of Redshift’s strongest features. Keep hot data in Redshift for fast queries. Cold and historical data stays in S3. Spectrum bridges both. Your analysts write one SQL query that joins across both storage tiers. They don’t need to know or care where the data physically lives.
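A sketch of that setup, with the Glue database, role ARN, and table names as placeholders:

```sql
-- Map a Glue database into Redshift as an external schema
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Cold S3 data and a hot internal table in one query
SELECT d.region, SUM(a.amount) AS total
FROM spectrum.sales_archive a      -- external table backed by S3
JOIN dim_customer d                -- internal Redshift table
  ON d.customer_id = a.customer_id
GROUP BY d.region;
```

From the analyst's perspective, spectrum.sales_archive is just another schema-qualified table.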
Analyzing Operational Data Stores
Redshift federated queries let you directly query Amazon RDS and Aurora databases from Redshift. No ETL needed. Redshift connects to the remote database, runs the query, retrieves matching rows, processes them locally.
Useful for joining operational data with analytical data. Combining real-time customer records from Aurora with historical order data in Redshift, for example. Compute is split between Redshift and the remote database.
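A sketch of the setup for an Aurora PostgreSQL source; the endpoint, role ARN, secret ARN, and table names are placeholders:

```sql
-- Register the remote database; credentials come from Secrets Manager
CREATE EXTERNAL SCHEMA aurora_pg
FROM POSTGRES
DATABASE 'appdb' SCHEMA 'public'
URI 'aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/MyFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:appdb-creds';

-- Live operational rows joined with warehouse history
SELECT c.customer_id, c.status, SUM(o.order_total) AS lifetime_spend
FROM aurora_pg.customers c
JOIN orders_history o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.status;
```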
Redshift ML and Generative AI
Redshift ML integrates machine learning directly into SQL. Connects to SageMaker, Forecast, and SageMaker JumpStart. Provide training data, Redshift ML picks the best model via SageMaker Autopilot, make predictions using SQL.
Supported algorithms: XGBoost, neural networks, k-means clustering, linear regression, forecasting, and generative AI model invocation.
Use cases:
- Prediction: churn prediction, fraud detection, text classification (binary, multiclass, regression)
- Clustering: grouping customers by behavior, grouping content by attributes
- Generative AI: sentiment analysis, text summarization, translation
The appeal: SQL users can do ML without leaving Redshift. No Python, no separate ML pipeline, no data movement. For simple prediction and classification tasks, it's a practical shortcut.
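A hedged sketch of the workflow for churn prediction; the columns, role ARN, and bucket name are hypothetical:

```sql
-- Autopilot picks the algorithm; artifacts land in the named S3 bucket
CREATE MODEL churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- After training completes, predictions are plain SQL
SELECT customer_id,
       predict_churn(age, tenure_months, monthly_spend) AS will_churn
FROM customer_activity;
```

Training runs asynchronously in SageMaker; the prediction function only becomes callable once the model reaches READY state.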
User-Defined Functions (UDFs)
Three types of scalar UDFs:
SQL UDFs. Created with a SQL SELECT clause. Simple, reusable transformations and calculations.
Python UDFs. Created with Python code. Can import custom modules. Good for complex transformations easier in Python than SQL.
Lambda UDFs. Call Lambda functions from SQL queries. Written in any Lambda-supported language (Java, Go, Node.js, Python, Ruby, C#). Most flexible option. Use for integrating with external APIs, third-party services, or data sources Redshift doesn’t natively connect to. Tokenization services are a common use case.
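A sketch of the tokenization case; the Lambda function name and role ARN are placeholders:

```sql
-- The Lambda receives batches of rows and returns one result per row
CREATE EXTERNAL FUNCTION tokenize_ssn(VARCHAR)
RETURNS VARCHAR
VOLATILE
LAMBDA 'my-tokenization-fn'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyLambdaUDFRole';

-- Called like any scalar function
SELECT customer_id, tokenize_ssn(ssn) AS ssn_token
FROM customers;
```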
Practical note on Athena vs Redshift: use Athena for ad-hoc exploration and infrequent queries on S3 data. Use Redshift when you need consistent sub-second performance on complex analytical queries, when you join structured warehouse data with semi-structured data, or when your query patterns are predictable and frequent. The cost models differ. Athena charges per scan. Redshift charges per cluster hour (or per query for Serverless). For heavy workloads, Redshift is usually cheaper per query. For light, occasional queries, Athena wins.
Analyzing Data Using Notebooks
Notebooks provide interactive environments for data exploration using Apache Spark. Two options on AWS: Glue Interactive Sessions and EMR Notebooks.
AWS Glue Interactive Sessions
Jupyter-compatible notebooks backed by on-demand Apache Spark. No infrastructure to manage.
Key features:
- Jupyter notebooks for writing code, visualizing data, and documenting work
- On-demand Spark as the compute backend for processing large datasets
- Built-in visualization with Matplotlib and Seaborn support
- Custom libraries by uploading to S3 and referencing via the %additional_python_modules magic command
Simple example from the book loads the Iris dataset and creates a pair plot:
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
%matplot plt
The %matplot plt magic command renders the visualization in the notebook. Straightforward for exploratory analysis.
Amazon EMR Notebooks
Managed Jupyter environment designed for Apache Spark on EMR clusters. Supports PySpark, Spark SQL, Spark R, and Scala through Spark Magic kernels.
The workflow:
- Create an EMR cluster with required components (Spark, Hive, Presto)
- Create an EMR Notebook and connect it to the cluster
- Read data from S3, Glue Data Catalog, or other sources
- Perform exploratory analysis and visualization
- Share notebooks with team members
- Productionize workflows as scheduled jobs
Attach multiple notebooks to one cluster, detach and reattach to different clusters. The notebook environment is managed, no software installation needed.
The difference between the two comes down to infrastructure control. Glue is fully serverless. You don’t see or manage any clusters. EMR gives you more control over cluster configuration, instance types, and installed libraries. For quick data exploration, Glue is simpler. For heavy processing or specific Spark configurations, EMR is the better choice.
Exam Tips
For the certification exam, focus on:
- Know when to use QuickSight SPICE vs direct query (SPICE for performance and cost savings, direct for real-time)
- Understand SPICE refresh options and when each is appropriate
- Know Athena workgroups for access control, cost tracking, and resource isolation
- Understand Athena capacity reservations for guaranteed compute
- Know federated query patterns for both Athena (Lambda connectors) and Redshift (RDS/Aurora)
- Understand the SUPER data type and PartiQL for semi-structured data in Redshift
- Know Redshift Spectrum for querying S3 from Redshift
- Understand the three types of Redshift UDFs (SQL, Python, Lambda) and when to use each
- Know the difference between Glue Interactive Sessions (serverless) and EMR Notebooks (cluster-based)
The exam likes to test your ability to pick the right tool. Athena for ad-hoc S3 queries. Redshift for complex warehouse analytics. QuickSight for BI dashboards. Notebooks for interactive exploration. Each has its lane.