Data Governance: Metadata, Data Sharing, Lineage, and Auditing on AWS

Previous: Security and Authentication

Book: AWS Certified Data Engineer Associate Study Guide
Authors: Sakti Mishra, Dylan Qu, Anusha Challa
Publisher: O’Reilly Media
ISBN: 978-1-098-17007-3

Data governance is that thing every team says they care about but nobody wants to own. Security gets attention because breaches make headlines. Governance? It just slowly rots your data platform from the inside when you ignore it. This part of Chapter 7 covers the governance pillars beyond security and privacy.

Seven pillars of data governance:

  • Data Catalog for metadata management
  • Data sharing
  • Data quality
  • Data profiling
  • Data lifecycle management
  • Data lineage
  • Logging and auditing

Each maps to specific AWS services.

Metadata Management and Technical Catalog

Centralized metadata management is the foundation. You’ve got databases, data lakes, data warehouses spread across your organization. Without a common metadata layer, you can’t build access controls, auditing, or reporting on top of them.

AWS Glue Data Catalog

The centralized technical catalog in AWS. Define virtual table schemas on top of datasets in S3, relational databases, Amazon Redshift, DynamoDB, and other data stores through Glue connections.

Key features:

  • Highly available and scalable
  • Hive Metastore compatible
  • Serverless
  • Integrates with IAM for security
  • Logs to CloudWatch and CloudTrail

Other AWS services connect to it: Athena, Redshift Spectrum, EMR, Glue ETL jobs, Lake Formation, DataZone, QuickSight. They all fetch metadata from Glue Data Catalog to query data from source tables.
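
Registering a table yourself makes the "virtual schema" idea concrete. A minimal boto3 sketch, with a hypothetical database, table, and S3 path; the commented call needs real credentials and an existing catalog database:

```python
# Sketch: a Glue Data Catalog TableInput for a Parquet dataset in S3.
# Database name, table name, columns, and the S3 location are hypothetical.
table_input = {
    "Name": "orders",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "txn_amt_usd", "Type": "double"},
            {"Name": "order_date", "Type": "date"},
        ],
        "Location": "s3://example-bucket/orders/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
    "TableType": "EXTERNAL_TABLE",
}

# In a real account:
# import boto3
# boto3.client("glue").create_table(DatabaseName="sales", TableInput=table_input)
```

Once the table exists, Athena, Redshift Spectrum, and EMR can all query it without ever touching the schema again.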

Glue Data Catalog is one of those services that quietly becomes the center of your entire analytics stack. Set it up for one use case, suddenly everything depends on it.

AWS Glue Crawler

Crawlers auto-detect schemas by scanning a subset of data from your data lake and creating metadata tables in Glue Data Catalog. New datasets arrive? Trigger crawlers to create metadata tables automatically, then kick off ETL jobs. Built-in or custom classifiers parse the schema.
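
A crawler definition is mostly a role, a target, and a schedule. A sketch with hypothetical names and ARN; the commented calls require real credentials:

```python
# Sketch: a nightly crawler over an S3 prefix. The role ARN, database,
# and path are hypothetical placeholders.
crawler_config = {
    "Name": "orders-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/orders/"}]},
    # cron(): run at 02:00 UTC daily; new partitions and tables land in
    # the Data Catalog automatically.
    "Schedule": "cron(0 2 * * ? *)",
}

# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])  # or wait for the schedule
```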

Amazon DataZone Business Glossary

The problem: Glue Data Catalog is technical. Business users look at column names like txn_amt_usd and have no idea what they mean. They need business terms, not technical jargon.

Amazon DataZone solves this with a business glossary, terms, and metadata forms. Business users map business terminology to technical attributes and search the catalog using words they actually understand. DataZone also supports publisher-subscriber models for data sharing.

Data Sharing

Sounds simple until you realize there are at least five different patterns depending on who you’re sharing with.

Share Within a Single AWS Account

Simplest case. One account, one data lake or warehouse. Producers and consumers in the same account. Use AWS Lake Formation fine-grained access control for individual IAM users, roles, or groups. Done.
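
"Fine-grained" here goes down to the column. A hedged sketch of a Lake Formation grant (role ARN, database, table, and columns are hypothetical):

```python
# Sketch: column-level SELECT for one role in the same account.
# The analyst sees order_id and order_date; txn_amt_usd stays hidden.
grant = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],
        }
    },
    "Permissions": ["SELECT"],
}

# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```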

Multiaccount Hub-and-Spoke Model

Multiple business units, each with its own AWS account. Centralized data lake or warehouse. Other accounts are consumers. Use Lake Formation cross-account data sharing to control access to databases or tables from the central account.

Amazon Redshift data sharing lets you share live data across Redshift clusters, workgroups, other AWS accounts, or even other Regions. No copying. Producers create read-only outbound shares, consumers receive inbound shares. Integrates with AWS Data Exchange and Lake Formation.
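
The producer/consumer split is easiest to see in SQL. A sketch with hypothetical share name, table, account IDs, and namespace GUID, held as strings:

```python
# On the producer cluster: create an outbound, read-only share and grant
# it to a hypothetical consumer account.
producer_sql = """
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.orders;
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '210987654321';
"""

# On the consumer cluster: the inbound share surfaces as a read-only
# database, referencing the producer's account and namespace GUID.
consumer_sql = """
CREATE DATABASE sales_remote
FROM DATASHARE sales_share
OF ACCOUNT '123456789012' NAMESPACE 'producer-namespace-guid';
"""
```

No data moves; the consumer queries live producer data through the share.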

Important detail: hub-and-spoke has no centralized governance. Each producer controls access to its own consumers. Fine for small setups, messy at scale.

Data Mesh with Centralized Governance

Pattern for larger organizations. Multiple data lakes or warehouses managed by different business units or accounts. Centralized governance across all of them.

Each account integrates AWS Lake Formation for fine-grained access control. Amazon DataZone manages the publisher-subscriber workflow. Maps to the four core data mesh principles:

  1. Domain-driven ownership. Each team owns their data.
  2. Data as a product. Quality, documentation, SLAs.
  3. Federated governance. Centralized rules, decentralized execution.
  4. Self-serve data platform. Teams discover and access data without going through a central team.

Data mesh is a great concept on paper. In practice, it requires serious organizational maturity. If your teams can’t even agree on naming conventions, don’t jump straight to data mesh. Start with hub-and-spoke and grow into it.

Cross-Organization B2B Data Sharing

Share data with external organizations. Business partners, collaborators, joint ventures. AWS Clean Rooms lets you share a subset of your data with partners to aggregate datasets and derive cross-organization insights without exposing raw data.

Data Marketplace

Produce valuable data? Sell it. AWS Data Exchange enables building a data marketplace. Publish datasets or subscribe to third-party datasets. Integrates with Redshift, S3, and Lake Formation.

Data Quality

Low-quality data leads to wrong business decisions. Simple as that. Validate incoming data before making it available.

For structured tabular data, standard validation rules:

  • Expected schema?
  • Minimum row count present?
  • Columns populated with correct values?
  • Values within defined range?

AWS Glue Data Quality handles this. Serverless, cost-effective, scales to petabytes. Integrate with data at rest (Glue Data Catalog table on top of existing data) or data in transit (rules inside Glue ETL jobs).

The workflow: Glue Data Quality recommends rules after analyzing your data. Refine as a data steward, create final ruleset. Glue evaluates rules against the dataset, produces quality check results.
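
The rules above are written in DQDL. A ruleset sketch for the hypothetical orders table; the commented call attaches it to a catalog table for at-rest evaluation:

```python
# Sketch: a DQDL ruleset covering schema, row count, completeness,
# and value-range checks. Table and column names are hypothetical.
ruleset = """
Rules = [
    RowCount > 1000,
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "txn_amt_usd" between 0 and 100000,
    ColumnValues "country_code" in ["US", "CA", "GB"]
]
"""

# import boto3
# boto3.client("glue").create_data_quality_ruleset(
#     Name="orders-ruleset",
#     Ruleset=ruleset,
#     TargetTable={"DatabaseName": "sales", "TableName": "orders"},
# )
```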

For unstructured data or media files, use AI services like Amazon Rekognition, Comprehend, Textract, or custom models in Bedrock or SageMaker to extract metadata. Apply quality rules on the extracted metadata.

Data Profiling

Checking that data from source systems or processed through your pipeline meets expectations. Common checks:

  • Row counts received or processed
  • Expected number of columns
  • Data types in specific fields (number, string, date)
  • Value ranges (valid country codes, month values 1-12)
  • Numeric values following expected mean or median

AWS tools for profiling:

Glue Data Quality rules. DQDL (Data Quality Definition Language) with custom logic.

Deequ framework. The open source Spark-based library behind Glue Data Quality. More flexible, more customizable.

AWS Glue DataBrew recipes. Predefined functions to scan column values or count rows without custom code.

AWS Samples Data Profiler utility. Profiles tables in Glue Data Catalog using Glue or EMR.

Third-party tools. Plenty of non-AWS options with periodic reports and anomaly alerts.

Data Lifecycle Management

Every dataset has a lifespan. Some data needs to be available for years. Some is accessed rarely. Some should be archived or deleted.

Core idea: move data to cheaper storage tiers as it ages, export old data from databases to object storage, use snapshot expiration in open table formats like Iceberg.

You can write batch scripts to export older database data to S3 and delete from the database. Not glamorous, works, saves money.
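
The "cheaper tiers as data ages" idea maps directly to an S3 lifecycle configuration. A sketch with a hypothetical bucket and prefix:

```python
# Sketch: logs move to Infrequent Access at 30 days, Glacier at 90,
# and are deleted after a year. Bucket and prefix are hypothetical.
lifecycle = {
    "Rules": [
        {
            "ID": "age-out-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle)
```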

Data Lineage

Data lineage tracks the journey of data from source through transformations to final destination. A map of data flows, transformations, and storage locations. For governance and compliance, not optional.

Lineage requires four types of metadata:

  • Technical metadata. Data sources, schemas, transformations, consumers.
  • Business metadata. Data ownership, business definitions, classifications.
  • Operational metadata. Transformation schedules, execution timestamps, data flow info.
  • Quality metrics. Accuracy, completeness, job status.

Different people benefit differently:

  • Data engineers pinpoint where errors occur, understand dependencies before making changes.
  • Platform administrators handle data integration management, regulatory compliance (GDPR, CCPA audit trails).
  • Data analysts and scientists understand dataset meaning and context.
  • Business consumers rely on it for trustworthy decision-making data.

Amazon DataZone for Lineage

API-driven lineage compatible with OpenLineage. Visualize data provenance, trace changes, root cause analysis. Captures transformations at both asset and column level. Graphical interface for navigating data relationships.
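
"Compatible with OpenLineage" means you can emit standard OpenLineage run events. A sketch of such an event; the job, dataset names, producer URL, and domain ID are all hypothetical, and the exact DataZone ingestion call is an assumption worth verifying against the API docs:

```python
# Sketch: a minimal OpenLineage-style run event for a hypothetical ETL job
# that reads raw orders and writes a curated copy.
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "sales-etl", "name": "orders_daily_load"},
    "inputs": [{"namespace": "s3://example-bucket", "name": "raw/orders"}],
    "outputs": [{"namespace": "s3://example-bucket", "name": "curated/orders"}],
    "producer": "https://example.com/my-etl-agent",
}

# Assumed ingestion path (hypothetical domain ID):
# import json, boto3
# boto3.client("datazone").post_lineage_event(
#     domainIdentifier="dzd_example123", event=json.dumps(event))
```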

AWS Glue + Amazon Neptune + Spline

Custom lineage solution. Glue runs ETL. Spline agent captures runtime lineage from Spark jobs in Glue. Neptune (graph database) stores and models lineage data. Neptune notebooks visualize results. More work to set up, full control.

Amazon SageMaker ML Lineage Tracking

For ML workflows specifically. Tracks every step from data preparation to model deployment. Integrates with SageMaker Pipelines. Detailed lineage for reproducibility and model governance. Essential for auditing model performance and understanding how data changes affect model outcomes.

Open source alternatives exist too: DataHub, Collibra, Amundsen.

Logging and Auditing

Keeps everything accountable. Store logs, analyze them, audit user actions for security and compliance.

Amazon CloudWatch

Default logging service. Stores, analyzes logs, triggers alarms, creates visualizations. Natively integrates with all AWS services, logs flow in without custom coding.

Amazon OpenSearch Service

Distributed search and analytics engine on Apache Lucene. Log analytics, security intelligence, operational analytics, full-text search.

Ingest through Logstash, APIs, or bulk loaders. OpenSearch Dashboards for visualizations. Fixed-node provisioned clusters or OpenSearch Serverless. Choose OpenSearch when you want custom indexes and a managed log schema for faster search.

Amazon S3

Cheapest option. Store log files, query with Athena or process with EMR. Good when you don’t need real-time log analysis and want low costs.

Redshift Audit Logging

Integrates natively with CloudWatch and CloudTrail. Publishes metrics for cluster health, memory, CPU, IOPS, and user activity.

Three log types: connection logs, user logs, user activity logs. Log groups follow /aws/redshift/cluster/<cluster_name>/<log_type>.

Publishes audit logs to CloudTrail for tracking who made what request, from which IP, when.

Important: audit logging for Redshift is not enabled by default. You must explicitly enable it by specifying a log export to CloudWatch or to an S3 prefix. Don’t assume it’s on.
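
Enabling it is one API call. A sketch with a hypothetical cluster name; treat the exact parameters as something to confirm against the Redshift API reference:

```python
# Sketch: turn on all three Redshift audit log types, exported to CloudWatch.
logging_request = {
    "ClusterIdentifier": "analytics-cluster",  # hypothetical
    "LogDestinationType": "cloudwatch",
    "LogExports": ["connectionlog", "userlog", "useractivitylog"],
}

# import boto3
# boto3.client("redshift").enable_logging(**logging_request)
```

One more gotcha: user activity logs also require the `enable_user_activity_logging` parameter set to true in the cluster's parameter group, not just the log export.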

Amazon Managed Service for Prometheus and Grafana

Application-level metrics monitoring and visualization. Prometheus stores metrics, Grafana builds dashboards. Industry-standard open source tools. AWS managed versions are convenient, you pay a premium though. If you already run them on Kubernetes, managed services might not add much.

AWS CloudTrail

Tracks user activities and AWS API actions. Hybrid and multicloud environments. Logs stored immutably for compliance audits. Your audit trail for “who did what and when” across your entire AWS account.

CloudTrail Lake

Easier analysis of CloudTrail logs. Managed data lake for user activity data. SQL queries, visualizations through CloudTrail Lake Dashboards, natural language prompts for SQL generation. Connect Athena, QuickSight, or Grafana to it.
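
A typical CloudTrail Lake question, sketched as SQL held in a string. The event data store ID is a placeholder you must replace with a real one:

```python
# Sketch: "who deleted Glue tables, from which IP, and when."
# CloudTrail Lake queries use the event data store ID as the table name.
lake_query = """
SELECT userIdentity.arn, eventName, sourceIPAddress, eventTime
FROM event_data_store_id_here
WHERE eventName = 'DeleteTable'
ORDER BY eventTime DESC
"""

# import boto3
# boto3.client("cloudtrail").start_query(QueryStatement=lake_query)
```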

Analyzing Logs Using AWS Services

Storing logs is half the job. You need to analyze them.

Amazon Athena

Query logs from many AWS services: CloudTrail, VPC flow logs, ALB logs, NLB logs, Route53 logs, S3 access logs, web server logs. Define an external table pointing to an S3 prefix, specify input format with the right SerDe (serializer/deserializer), query with SQL.

SerDe is worth understanding: serialization converts in-memory or structured data into a storage or transmission format (for example, text to binary); deserialization converts it back. In Athena, the SerDe tells the engine how to parse each record of the underlying files. Shows up on the exam.
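
A DDL sketch for space-delimited VPC flow logs, held as a string (bucket and prefix are hypothetical, and column names must match the flow log's field order). ROW FORMAT DELIMITED maps to the default LazySimpleSerDe:

```python
# Sketch: an Athena external table over VPC flow logs in S3.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
    version int,
    account_id string,
    interface_id string,
    srcaddr string,
    dstaddr string,
    srcport int,
    dstport int,
    protocol int,
    packets bigint,
    bytes bigint,
    start_time bigint,
    end_time bigint,
    action string,
    log_status string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://example-bucket/vpc-flow-logs/'
TBLPROPERTIES ('skip.header.line.count' = '1');
"""

# import boto3
# boto3.client("athena").start_query_execution(
#     QueryString=ddl,
#     ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"})
```

After that, rejected connections are a `SELECT ... WHERE action = 'REJECT'` away.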

CloudWatch Log Insights

Interactive querying of CloudWatch logs. Natural language query generation, auto-detects fields, visualizes results as graphs. Save queries and results.
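
A Logs Insights query sketch, counting error lines in 5-minute bins; the log group name in the commented call is hypothetical:

```python
# Sketch: error counts per 5-minute window across a log group.
insights_query = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)
"""

# import time, boto3
# boto3.client("logs").start_query(
#     logGroupName="/aws/lambda/example-fn",
#     startTime=int(time.time()) - 3600,
#     endTime=int(time.time()),
#     queryString=insights_query)
```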

AWS CloudTrail Insights

Continuously analyzes management events, baselines API call volumes and error rates. Generates insights on anomalies, such as API call volumes or error rates exceeding the baseline. Only analyzes management events in a single Region.

Amazon OpenSearch Dashboards

Visualization tool for OpenSearch clusters. Installed with every OpenSearch domain. Works only with hot data on the cluster.

Processing Logs with EMR or Glue

Custom or poorly structured log formats? You need a processing framework with custom logic. Glue or EMR with Spark for terabyte-scale log processing. Transform, write to S3, query with Athena.

AWS Config

Managed service continuously assessing AWS service configurations and auditing changes over time. Change management. Keeps history, monitoring dashboard, delivers change history to S3. Also works for third-party resources, on-premises servers, SaaS tools.

My Take

Data governance isn’t exciting. Nobody puts “set up data lineage tracking” on their conference talk proposal. It’s the difference between a data platform that scales and one that becomes a liability though.

Key takeaways:

  1. Glue Data Catalog is the center of everything. Every analytics service connects to it. Invest time setting it up properly.
  2. Pick the right data sharing pattern. Don’t over-engineer. Single account? Lake Formation permissions. Multi-account? Hub-and-spoke. At scale? Data mesh. External partners? Clean Rooms.
  3. Data quality is not optional. Glue Data Quality is surprisingly capable for a serverless offering. Use it.
  4. Lineage is governance insurance. When something breaks (and it will), lineage tells you what’s affected. DataZone for managed, Glue + Neptune + Spline for custom.
  5. CloudTrail is your audit backbone. Enable it, keep the logs, use CloudTrail Lake for analysis. Turn on Redshift audit logging explicitly.
  6. AWS Config tracks infrastructure changes. Pair with CloudTrail for complete picture of who changed what and when.

For the exam, know which service fits which use case. AWS loves questions where two options seem correct but one is more operationally efficient. In governance, “operationally efficient” usually means “managed and serverless.”

Next: Hands-On Batch and Streaming Pipelines



denis256 at denis256.dev