What's New in AWS for Data Engineers: SageMaker Lakehouse, S3 Tables, and GenAI
Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3
Chapter 10 is the forward-looking chapter. Everything in Chapters 1 through 9 was about established AWS services. This one covers what AWS announced at re:Invent 2024 and what’s coming next.
Some services are still in preview. Some just reached general availability. They may or may not show up on your DEA-C01 exam today. If you’re building data pipelines on AWS in 2025 and beyond, though, you need to know where the platform is heading.
The big theme: unification. AWS is trying to bring everything under one roof. One IDE, one catalog, one lakehouse. GenAI is getting baked into everything.
Amazon SageMaker Unified Studio
AWS has a lot of analytics services. EMR, Glue, Athena, Redshift, more. They all work, but each has its own interface, its own way of doing things. Connecting them takes effort. Learning each console takes time.
SageMaker Unified Studio is the answer to that fragmentation. A single IDE for all your data and AI tools: data discovery, big data processing, SQL analytics, ML model development, GenAI app development. One place.
Integrates with SageMaker Lakehouse for unified data access and SageMaker Catalog for governance.
At the time the book was written, streaming services (MSK, Kinesis), BI (QuickSight), and search (OpenSearch) weren’t yet in Unified Studio. Planned for future updates.
This is AWS admitting their service sprawl is a problem. Having 15 different consoles for related services was never great UX. Whether Unified Studio delivers on the “single pane of glass” promise remains to be seen. AWS has tried unifying things before with mixed results. Direction is right though.
Amazon SageMaker Catalog
Unified data catalog. Discover, govern, and collaborate on data across your organization.
What makes it interesting:
- GenAI-powered data discovery. Automatically adds business context to table attributes. Search by business glossary terms, not just technical metadata. Genuinely useful. Finding the right table in a data lake with thousands of tables is a real pain point.
- Centralized and decentralized governance. Publish/subscribe workflows for sharing between teams.
- Lake Formation and DataZone integration. Fine-grained access controls.
- Lineage tracking. Data flows, transformations, origin tracking.
- Automated data quality reporting. Less manual work.
Each catalog maps to a storage type: a managed catalog backed by Redshift Managed Storage, or a federated catalog over existing data in Redshift, S3 table buckets, or external sources like Snowflake and MySQL.
The metadata management layer tying everything together. Producers publish, consumers subscribe, governance sits in between.
Amazon SageMaker Lakehouse
Where the lakehouse architecture goes from concept to managed AWS service.
The problem is real. Organizations have S3 data lakes, Redshift warehouses, operational databases. Getting a unified view requires a lot of plumbing. Different access patterns, permissions, tools.
SageMaker Lakehouse combines data lake and warehouse into a single interface. One copy of data. Structured, semi-structured, unstructured. All accessible through a unified catalog.
Key components:
- Flexible storage for diverse workloads
- Unified technical catalog managing all data
- Integrated permission management for securing and sharing
- Apache Iceberg APIs for accessing data from AWS services and open source engines
That last point is important. An Iceberg-compatible API means you can use any Iceberg-compatible tool: Spark, SQL tools, BI tools, ML frameworks. Data sits in S3 or Redshift, but you access it through a standard Iceberg interface.
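As a rough sketch of what that access path looks like from Spark: the helper below builds the standard Apache Iceberg catalog configuration for a Glue-backed catalog. The catalog name and warehouse path are placeholders, and the class names are the stock Iceberg ones, not anything Lakehouse-specific.

```python
# Minimal sketch: Spark config keys for an Iceberg catalog backed by the
# AWS Glue Data Catalog. Catalog name and warehouse path are placeholders;
# the class names are the standard Apache Iceberg ones.
def iceberg_glue_conf(catalog="lakehouse",
                      warehouse="s3://example-bucket/warehouse"):
    return {
        # Enable Iceberg SQL extensions (MERGE, CALL procedures, etc.)
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Register a named Iceberg catalog...
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        # ...backed by the Glue Data Catalog for metadata
        f"spark.sql.catalog.{catalog}.catalog-impl":
            "org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
        # S3 for the actual data files
        f"spark.sql.catalog.{catalog}.io-impl":
            "org.apache.iceberg.aws.s3.S3FileIO",
    }
```

Pass these to `SparkSession.builder.config(...)` along with the Iceberg runtime JARs. The point of the open interface is that any engine speaking Iceberg can point at the same tables with equivalent settings.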
Zero-ETL for operational databases in near real time. Glue connectors for other sources. Federated queries for third-party data.
Permissions defined once, enforced everywhere. No more setting up access controls in five different places.
The lakehouse pattern is where the industry has been heading for years. Databricks popularized it. Now AWS is making it a first-class managed offering. The Apache Iceberg bet is smart: Iceberg has become the dominant open table format, and building around it gives you portability.
Amazon SageMaker AI
Quick naming note. The original SageMaker for ML model training and deployment is now called SageMaker AI. “SageMaker” became the umbrella brand for the whole platform (Unified Studio, Lakehouse, Catalog).
SageMaker AI still does what SageMaker always did:
- Model training and deployment with managed infrastructure
- SageMaker JumpStart for pre-built models. Hundreds of foundation models ready to use.
- MLOps for repeatable ML workflows at scale
- Governance like Model Cards, Model Dashboard, Clarify for bias detection
- Ground Truth for labeling and model customization
Custom ML models or fine-tuning existing ones? SageMaker AI is still the place. The rebranding is organizational.
Amazon S3 Tables
Practical and immediately useful. Iceberg adoption has been growing fast, but Iceberg tables need maintenance. Expire old snapshots, compact files, clean up unreferenced data. That’s operational overhead.
S3 Tables is a new S3 bucket type purpose-built for tabular data. Basically a managed Iceberg table service.
What you get:
- Up to 3x faster query throughput compared to self-managed Iceberg
- 10x higher transactions per second
- Automatic table maintenance. Compaction, snapshot management, file cleanup. All handled.
- Full Iceberg capabilities. Row-level transactions (UPSERT/MERGE), schema evolution, queryable snapshots.
- Seamless integration with query engines
Anyone who’s managed Iceberg tables manually knows the pain of compaction jobs, snapshot expiry schedules, orphan file cleanups. Having AWS handle that automatically is a significant operational win.
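For context, here is the kind of housekeeping S3 Tables takes off your plate. With self-managed Iceberg you would schedule these yourself, say in a nightly Glue job. The procedure names are Apache Iceberg's standard Spark procedures; catalog and table names are placeholders.

```python
# The Iceberg maintenance procedures that S3 Tables effectively runs for
# you. With self-managed tables, you'd schedule these Spark SQL calls
# yourself. Catalog/table names are placeholders.
def maintenance_statements(catalog, table):
    return [
        # Compaction: rewrite many small files into fewer larger ones
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Snapshot management: drop old snapshots so metadata stays bounded
        f"CALL {catalog}.system.expire_snapshots(table => '{table}')",
        # File cleanup: delete files no snapshot references anymore
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]
```

Each statement would be run through `spark.sql(...)` on a session configured for the catalog. That scheduling, sequencing, and tuning is exactly the operational overhead the managed service removes.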
S3 Tables is the foundation other new features build on. SageMaker Lakehouse uses it. S3 Metadata stores data in it. Becoming a core primitive.
Amazon S3 Metadata
Nice quality-of-life feature. Upload objects to S3, metadata gets automatically captured and made queryable in near real time (within minutes).
Two types:
- System-defined: size, object source, content type, standard attributes
- Custom: tags you define like SKU, transaction ID, content rating
Metadata stored in S3 Tables (Iceberg format), queryable with standard SQL tools.
Solves a common pattern. Before this, wanting to know “show me all objects uploaded in the last hour larger than 100 MB with tag X” meant building your own metadata tracking. Lambda triggers, DynamoDB tables, custom code. Now it’s built in.
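A hedged sketch of what that query could look like against the generated Iceberg table. The table and column names below are illustrative, not the exact S3 Metadata schema; check the metadata table in your own account before relying on them.

```python
# Builds an Athena-style SQL query over an S3 Metadata table.
# Table and column names (size, last_modified_date, object_tags) are
# illustrative -- verify against the actual generated table schema.
def recent_large_objects(table, min_bytes=100 * 1024 * 1024,
                         tag_key="sku", tag_value="ABC-123"):
    return (
        f"SELECT key, size, last_modified_date\n"
        f"FROM {table}\n"
        f"WHERE size > {min_bytes}\n"
        f"  AND last_modified_date > current_timestamp - INTERVAL '1' HOUR\n"
        f"  AND element_at(object_tags, '{tag_key}') = '{tag_value}'"
    )
```

The same query previously meant a Lambda-plus-DynamoDB pipeline you had to build and keep in sync yourself.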
Simple. Useful. Should have existed years ago.
Improving Developer Experience with GenAI
Three GenAI integrations in AWS data services.
Amazon Q Developer for Code Generation
AI coding assistant integrated into Glue ETL jobs, SageMaker Unified Studio notebooks, and Redshift Query Editor v2.
In Glue, it supports Python and Scala for Spark ETL scripts. Code generation currently works only with the PySpark kernel.
Context-awareness is limited. It carries context from the previous query only within the same conversation; it doesn’t remember three prompts ago. DataFrame support works in Q Developer Chat and SageMaker Unified Studio notebooks, but not yet in Glue Studio notebooks.
AI code generation for data pipelines is useful but not magical. Good for boilerplate Spark code. Good for “how do I read a Parquet file from S3 in PySpark” questions. For complex business logic and performance tuning, you still need an engineer who understands the data and the system. Productivity tool, not a replacement for thinking.
Automated Script Upgrade in AWS Glue
More immediately practical than general code generation. Glue 2.0 jobs (Spark 2.4.3, Python 3.7) need upgrading to Glue 4.0 (Spark 3.3.0, Python 3.10)? This feature automates the migration.
Upgrading Spark versions is painful. API changes, configuration defaults, deprecated functions, Python syntax differences. The tool analyzes four areas:
- Spark SQL API methods and functions
- Spark DataFrame API methods and operations
- Python language updates (deprecations, syntax)
- Spark SQL and Core configuration settings
Generates upgrade plan, runs automated validation jobs.
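One concrete instance of the "Python language updates" category: alias imports like `collections.Mapping` worked on Python 3.7 (Glue 2.0) but were removed in Python 3.10 (Glue 4.0), so scripts using them break outright. A minimal sketch of the rewritten form (the `flatten_config` helper is hypothetical, just there to exercise the import):

```python
# Python 3.7 (Glue 2.0) accepted `from collections import Mapping`;
# Python 3.10 (Glue 4.0) removed that alias, so the import must move to
# collections.abc -- the kind of mechanical rewrite the upgrade targets.
from collections.abc import Mapping  # was: from collections import Mapping

def flatten_config(cfg, prefix=""):
    # Flatten nested dict-like configs into dotted keys,
    # e.g. {"a": {"b": 1}} -> {"a.b": 1}
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}{key}"
        if isinstance(value, Mapping):
            flat.update(flatten_config(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat
```

Mechanical changes like this are exactly where automated analysis shines; behavioral differences in Spark defaults are where you still review carefully.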
Limitations: only PySpark jobs without additional dependencies, and a maximum of 10 concurrent jobs per account.
Genuinely useful. Spark version upgrades are tedious and error-prone. Any automation helps. Review the output carefully before production though.
GenAI-Powered Troubleshooting for Spark in AWS Glue
Debugging failed Spark jobs is painful. Distributed processing, lazy evaluation, cryptic error messages. Hours to find root cause.
The new feature analyzes job metadata, metrics, and logs for automated root cause analysis and actionable recommendations. Access from Glue console job list, job details, or monitoring page.
This kind of GenAI application makes sense. Analyzing logs and correlating with known error patterns is what LLMs are good at. Pattern matching at scale. More reliable than asking AI to write your business logic.
Still, review suggested changes before implementing in production. Good advice for any AI-generated recommendation.
Where AWS Data Engineering Is Heading
Stepping back from individual services, the trends are clear:
Unification is the priority. AWS built a lot of point solutions. Now they’re connecting them under one platform. SageMaker is becoming the umbrella for data + analytics + ML + AI.
Apache Iceberg is the standard. S3 Tables, SageMaker Lakehouse, S3 Metadata. Everything built on Iceberg. If you’re not learning Iceberg yet, start now.
GenAI is a feature, not a product. AWS isn’t selling GenAI as standalone for data engineers. They’re embedding it into existing tools. Code generation in Glue, troubleshooting in Spark, data discovery in Catalog. Right approach. GenAI works best augmenting existing workflows, not replacing them.
The lakehouse pattern won. Separate data lakes and warehouses are legacy thinking. The industry is converging on unified access with open table formats. AWS is catching up to where Databricks has been for a couple of years.
For the exam, know these services at a high level. For your career, pay attention to Iceberg and the lakehouse pattern. That’s where everything is going.