Network Security, Authentication, and Data Protection on AWS
Previous: Pipeline Resiliency and Cost Optimization
Chapter 7 covers data security and governance. You can build the most elegant data pipeline in the world, but if security is an afterthought, you’re one misconfigured S3 bucket away from a headline nobody wants.
I'm splitting this chapter into two posts. This first part covers network security, authentication, encryption, and access control; the second part covers data governance.
Network Security
VPC Basics
Amazon VPC (Virtual Private Cloud) is a logical network boundary. Your own private data center inside AWS. Every AWS account comes with a default VPC per region, but for production workloads you should create your own.
A VPC contains subnets, each living in a single Availability Zone. Subnets can be public or private, depending on whether their route table points to an internet gateway. Classic example: web servers in a public subnet, database in a private subnet. Users hit the web servers, web servers talk to the database. The database never touches the internet. Simple and correct.
Security Groups
Security groups are virtual firewalls at the instance level. They control inbound and outbound traffic to your resources. In the web/database example, configure the database security group to accept connections only from the web server security group. Nothing else gets in.
Best practices worth remembering:
Never use 0.0.0.0/0 for inbound rules. Opens your resource to the entire internet. Always specify exact IPs or reference another security group. For outbound, 0.0.0.0/0 is sometimes acceptable, like when your instance needs to pull code from GitHub or call third-party APIs.
Group related security groups. Ten Lambda functions all accessing one RDS database? Don’t create ten separate security groups. Group them. One or two security groups for related functions. Less to manage, fewer mistakes.
Don’t use default VPC or default security groups for production. Default configurations exist for convenience, not security. Too permissive. Create new VPCs and security groups with exact permissions your workload needs.
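These rules are easy to audit programmatically. A minimal sketch in Python, assuming security group data shaped like the output of EC2's describe_security_groups call (the group IDs here are made up):

```python
def open_ingress_rules(security_group):
    """Return (from_port, to_port) pairs for ingress rules open to the whole internet."""
    findings = []
    for rule in security_group.get("IpPermissions", []):
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                findings.append((rule.get("FromPort"), rule.get("ToPort")))
    return findings

sg = {
    "GroupId": "sg-0123456789abcdef0",  # hypothetical ID
    "IpPermissions": [
        # Bad: SSH open to the world.
        {"FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        # Good: database port reachable only from a referenced security group.
        {"FromPort": 5432, "ToPort": 5432, "IpRanges": [],
         "UserIdGroupPairs": [{"GroupId": "sg-webservers"}]},  # hypothetical SG
    ],
}
print(open_ingress_rules(sg))  # only the SSH rule is flagged
```

The security-group-referenced rule passes because it carries no `0.0.0.0/0` CIDR; that is exactly the web/database pattern described above.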
EMR Cluster VPC Configuration
Amazon EMR on EC2 can be deployed in public or private subnets. For production, private subnet. The cluster connects to S3 through VPC endpoints, staying off the public internet.
Keep in mind:
- Once deployed in a private subnet, you can’t move it to a public subnet (or vice versa).
- Not all AWS services have VPC endpoints. For those that don’t, you need a NAT gateway or internet gateway.
- EMRFS uses DynamoDB under the hood. Private subnet EMR? Make sure routing to DynamoDB is configured.
Managed vs Unmanaged Services
AWS uses a shared responsibility model. AWS manages infrastructure. You manage your applications, network config, and security settings. Managed services mean AWS takes on more security burden. Less room for human error, better scalability. Also less flexibility and higher cost.
For the exam, understand the line between what AWS handles and what you handle. It shifts depending on the service.
VPC Endpoints
Scenario: EC2 instance needs to upload images to S3. Without a VPC endpoint, traffic goes over the public internet. Slow and insecure.
VPC endpoints create a private connection between your VPC and supported AWS services. Traffic stays within the AWS network. Two types: interface endpoints and gateway endpoints.
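Worth memorizing: only S3 and DynamoDB use gateway endpoints; every other supported service goes through an interface endpoint backed by AWS PrivateLink. A throwaway helper to make the split concrete:

```python
# Gateway endpoints exist only for S3 and DynamoDB; everything else
# that supports VPC endpoints uses an interface endpoint (PrivateLink).
GATEWAY_SERVICES = {"s3", "dynamodb"}

def endpoint_type(service):
    return "Gateway" if service.lower() in GATEWAY_SERVICES else "Interface"

print(endpoint_type("s3"))        # Gateway
print(endpoint_type("redshift"))  # Interface
```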
Redshift-managed VPC endpoints connect to a Redshift cluster in a different VPC (even a different account) through a private connection. Requirements: RA3 node type with a subnet group, cluster relocation or multi-AZ enabled. Default port is 5439, allow port ranges 5431-5455 and 8191-8215 in security groups. Not internet-accessible, which is the point.
OpenSearch Service-managed VPC endpoints work similarly through AWS PrivateLink. Private connection within the AWS network. Rules: only works with VPC-launched domains (not public access ones), same region only, HTTPS required (no HTTP), can’t create through CloudFormation – console or API only.
User Authentication and Authorization
IAM Credentials
Simplest way to authenticate: create an IAM user with access key and secret key. Attach policies for permissions. Group users into IAM groups.
Embedding IAM credentials directly in application code is a bad idea though. Credentials leak, rotate poorly, create operational headaches. Only use this for external non-AWS tools that must call AWS APIs.
IAM Role-Based Authentication
The recommended approach. Create an IAM role with specific permissions. Users or services assume the role to perform actions. Follow least-privilege: grant only the exact actions needed, restricted to specific resource ARNs.
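A least-privilege policy is just a JSON document with explicit actions and explicit resource ARNs. A sketch that builds one and refuses blanket resource access (the bucket name is hypothetical):

```python
import json

def least_privilege_policy(actions, resource_arns):
    """Build an IAM policy document granting only the listed actions
    on the listed resource ARNs. Refuses a bare '*' resource."""
    assert "*" not in resource_arns, "refuse blanket resource access"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(actions),
            "Resource": sorted(resource_arns),
        }],
    }

# Hypothetical bucket and prefix for illustration.
policy = least_privilege_policy(
    ["s3:GetObject", "s3:PutObject"],
    ["arn:aws:s3:::my-pipeline-bucket/raw/*"],
)
print(json.dumps(policy, indent=2))
```

Generating policies from code like this also makes "we'll tighten it later" unnecessary: the tight version is the easy version.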
From real production experience: overly permissive roles are one of the most common security issues. “Just give it admin access, we’ll tighten it later.” Later never comes.
Service-Linked Roles
A service-linked role is owned by an AWS service. Contains all permissions that service needs to call other services on your behalf. Can’t modify or attach managed policies to it.
Important distinction: a service role is an IAM role that a service assumes (you create and manage it). A service-linked role is created and managed by the service itself.
Managed vs Custom Policies
Three types:
- Managed policies: AWS-provided, pre-packaged permissions. Can’t edit. Good for quick setup. AWS keeps them updated.
- Inline policies: Embedded directly in a role. Not reusable. Avoid unless truly specific to one role.
- Custom policies: You define exact actions and resource ARNs. Reusable across multiple roles. Recommended for production because it follows least-privilege.
SSO with IAM Identity Center
IAM Identity Center sits on top of IAM. Centralizes access across multiple AWS accounts and SAML-enabled apps (Salesforce, Microsoft 365, etc.). Integrates with Active Directory. One login, access to everything the user is authorized for.
Lake Formation integration: connect Identity Center to Lake Formation. SSO-authenticated users get fine-grained data lake permissions managed by Lake Formation. For auditing, CloudTrail logs the IAM role by default. To track individual SSO users, opt in and enable S3-level CloudTrail event logging.
DataZone integration: SSO users can log into the DataZone data portal. Two assignment modes:
- Implicit: all Identity Center users can access the DataZone domain.
- Explicit: only selected users or groups get access.
Important: once you set the assignment mode on a DataZone domain, you can’t change it later. Choose carefully.
Data Security and Privacy
Securing S3
Control access through IAM roles, groups, and users. Use bucket policies for cross-account access or external customers. Resource-based policies define who can do what on specific resources.
For every S3 bucket: disable public access unless you are specifically serving public website content. This should be the default rule in every organization.
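S3's public access block is four boolean flags, and a bucket is only locked down when all four are on. A quick check, using the real field names from S3's PublicAccessBlockConfiguration:

```python
# The four flags accepted by S3's put_public_access_block call,
# shown as a plain dict; in practice you'd pass this via an SDK.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

def is_locked_down(config):
    """A bucket is fully locked down only when all four flags are on."""
    return all(config.get(flag) for flag in PUBLIC_ACCESS_BLOCK)

print(is_locked_down(PUBLIC_ACCESS_BLOCK))  # True
```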
Database Credential Management
Never hardcode database credentials in your application. Never pass them as environment variables either. Use AWS Secrets Manager.
Secrets Manager stores credentials and API keys with encryption. Supports auto-rotation and retrieval through API calls. Application code references a Secrets Manager key, not a password string. Integrates with CloudWatch for monitoring, CloudTrail for auditing.
The number of production systems I’ve seen with database passwords in environment variables or config files is disturbing. Secrets Manager fixes this cleanly.
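The application-side change is small. Assuming an RDS-style secret (a JSON SecretString with username, password, host, and port fields), the code parses the payload instead of reading a password from the environment. The SecretId in the comment is hypothetical:

```python
import json

# In production you'd fetch the payload with an SDK call such as
#   boto3.client("secretsmanager").get_secret_value(SecretId="prod/db")["SecretString"]
# Here we only show the parsing step, on a sample payload.
def db_connection_params(secret_string):
    """Turn a Secrets Manager SecretString into connection parameters."""
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret["port"]),
        "user": secret["username"],
        "password": secret["password"],
    }

sample = '{"username": "app", "password": "s3cret", "host": "db.internal", "port": "5432"}'
print(db_connection_params(sample)["host"])  # db.internal
```

The password never appears in code, config files, or environment variables; rotation becomes a Secrets Manager setting rather than a deployment.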
Encryption and Decryption
Even with proper authentication and authorization, data must be encrypted. Two dimensions: at rest and in transit.
Encryption at rest has two approaches:
- Server-side: encryption happens on the server infrastructure.
- Client-side: you encrypt data before sending it to the server.
Encryption in transit: use SSL/TLS certificates. AWS services that move data (DMS, DataSync, Backup, VPN) support encryption in transit by default.
AWS KMS Key Management
AWS Key Management Service (KMS) is where you create and manage cryptographic keys. Integrates natively with many AWS services.
S3 encryption options:
- SSE-S3: default encryption. Key managed by S3.
- SSE-KMS: you manage keys through KMS. Create, rotate, disable, delete.
- DSSE-KMS: dual-layer server-side encryption. For compliance standards requiring multi-layer encryption.
- SSE-C: you provide a custom key. S3 uses it for encryption.
- Client-side encryption: objects encrypted with AES-256 before upload. You manage the key entirely.
Important rule: data and the KMS key must be in the same region.
KMS best practices:
- Share KMS keys cross-account instead of creating separate keys in each account.
- Enable MFA for sensitive KMS actions like PutKeyPolicy and ScheduleKeyDeletion.
- Use key aliases instead of key ARNs or IDs. Aliases abstract the key identity and let the same name work across regions.
- Enable key rotation. AWS supports automatic rotation. Configure frequency from 90 days to 7 years.
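The last two practices are cheap to enforce in config validation. A sketch using the actual KMS limits: a rotation period of 90 to 2,560 days (roughly 7 years), and alias names that start with alias/ but never the reserved alias/aws/ prefix:

```python
def valid_rotation_period(days):
    """KMS automatic rotation accepts a period between 90 and 2560 days."""
    return 90 <= days <= 2560

def valid_key_alias(alias):
    """Aliases must start with 'alias/'; 'alias/aws/' is reserved for AWS managed keys."""
    return alias.startswith("alias/") and not alias.startswith("alias/aws/")

print(valid_rotation_period(365))           # True
print(valid_rotation_period(30))            # False
print(valid_key_alias("alias/data-lake"))   # True
print(valid_key_alias("alias/aws/s3"))      # False
```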
Enabling Encryption in Analytics Services
AWS Glue: encrypt the Glue Data Catalog (metadata and connection credentials) and Glue ETL jobs separately. For ETL jobs you can encrypt S3 data with SSE-S3 or SSE-KMS, CloudWatch logs, and Job Bookmarks metadata.
Amazon EMR: supports KMS for at-rest and in-transit encryption. Also SSE-S3, SSE-KMS for S3, encryption for HDFS (AES-256), NVMe encryption for instance stores, EBS volume encryption. Transit encryption depends on the open source application running on EMR.
Amazon Redshift: at-rest and in-transit encryption. At rest: KMS or a hardware security module (HSM). Enabling encryption on an existing cluster? Redshift migrates data automatically to a new encrypted cluster. In transit: HTTPS endpoint with ACM-issued SSL certificates for S3 and DynamoDB load/unload operations.
Sensitive Data Detection and Redaction
When data enters your lake or warehouse, privacy regulations may require you to detect PII and redact it. Names, addresses, credit card numbers. Handle both data at rest and in transit.
Amazon Macie
Uses machine learning and pattern matching to scan S3 for sensitive data. Detects names, addresses, phone numbers, credit cards, more. Pipe Macie events to EventBridge, set up SNS notifications to alert stakeholders.
Your automated PII scanner for historical data already sitting in S3.
Glue Sensitive Data Detection
Define rules to detect sensitive data and apply redaction: remove a column, mask values, or store masked data in a new column. Scan full dataset or just a sample.
Supported categories include universal PII (email, credit card), HIPAA fields (driver’s license, HCPCS codes), networking elements (IP addresses, MAC addresses), and country-specific PII. Custom detection rules via regex too.
Works for both data at rest (Glue Data Catalog tables) and data in transit (inside Glue ETL jobs). Very practical for building PII-compliant pipelines.
Fine-Grained Access Control with Lake Formation
Lake Formation is where AWS gets serious about data access control. Integrates with Glue, EMR, Athena, QuickSight, SageMaker, Redshift, and third-party tools like Collibra and Privacera.
Data Lake Registration
First step: register your S3 prefix as a data lake location in the Lake Formation console. The bucket can be in the same account as Lake Formation or a different one.
Permissions and Access Control
Multiple levels:
Name-based access control: select a database, then specific tables or all tables, define permissions through console, APIs, or CloudFormation.
Tag-based access control (LF-TBAC): assign LF-Tags to Glue Data Catalog resources (databases, tables, columns). When an IAM principal’s tag values match the resource tag values, access is granted. Scales much better than name-based control with hundreds of tables across multiple domains.
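The matching logic is simple to model: a grant maps tag keys to allowed values, and access requires every key in the grant to match the resource's tag value. A toy version (the tag names are invented for illustration):

```python
def tags_grant_access(grant, resource_tags):
    """Simplified LF-TBAC check: every key in the grant must match the
    resource's tag value for that key. Missing keys deny access."""
    return all(
        resource_tags.get(key) in allowed_values
        for key, allowed_values in grant.items()
    )

# A grant for principals allowed to see non-restricted sales data.
grant = {"domain": {"sales"}, "sensitivity": {"public", "internal"}}

sales_table = {"domain": "sales", "sensitivity": "internal"}
finance_table = {"domain": "finance", "sensitivity": "internal"}

print(tags_grant_access(grant, sales_table))    # True
print(tags_grant_access(grant, finance_table))  # False
```

Note how adding the hundredth table costs nothing: tag it and the existing grants apply. That is the scaling argument for LF-TBAC over name-based control.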
Row and Column Filtering
Column-level security: hide specific columns from certain users. Table has 10 columns, 3 contain sensitive data. Define which columns each principal can see.
Row-level security: filter rows based on conditions. Common: table with data from multiple business units. Each BU sees only their rows using a filter like business_unit=BU1. Uses PartiQL filter expressions.
Cell-level security: combine row and column filters. BU1 users see only BU1 rows and only non-PII columns. Requires additional IAM permissions.
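Cell-level security is conceptually a row predicate plus a column allowlist. A toy model of the BU1 example (table and column names invented):

```python
def filter_cells(rows, row_predicate, allowed_columns):
    """Cell-level filter: keep only rows matching the predicate,
    then project each row onto the allowed columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]

orders = [
    {"business_unit": "BU1", "customer_email": "a@x.com", "amount": 120},
    {"business_unit": "BU2", "customer_email": "b@y.com", "amount": 80},
]

# BU1 analysts: only BU1 rows, and no PII column.
visible = filter_cells(
    orders,
    lambda r: r["business_unit"] == "BU1",
    ["business_unit", "amount"],
)
print(visible)  # [{'business_unit': 'BU1', 'amount': 120}]
```

Lake Formation evaluates the real thing server-side with PartiQL filter expressions; the point here is only that row and column rules compose independently.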
Lake Formation Best Practices
Don’t use bucket policies on S3 locations registered with Lake Formation. Lake Formation manages access. Adding bucket policies creates conflicts.
Don’t use the root AWS user as data lake admin. Create a separate IAM user. Least-privilege, always.
Don’t use the service-linked role in production. Too permissive. EMR on EC2 doesn’t support SLR-registered locations for data access. Encrypted catalogs don’t support SLR for cross-account sharing. Create a dedicated IAM role for registering data locations.
Cross-Account Sharing Best Practices
Lake Formation uses AWS Resource Access Manager for cross-account grants:
- Use AWS Organizations to structure accounts. Makes granting permissions easier.
- Instead of per-table permissions, combine tables into a database and use the All Tables permission. One grant instead of many.
- Create a placeholder database and grant CREATE_TABLE to ALLIAMPrincipal. All IAM principals in the recipient account can create resource links and query shared tables.
Tag-Based Access Control Best Practices
- Define tags before assigning them. Designate a team responsible for tag management.
- Tags are stored in lowercase. Plan accordingly.
- Wildcards not supported. To tag all tables in a database, tag the database. Tables inherit.
- Glue ETL jobs need full table access. Without it, jobs fail.
- Keep tagging simple. Too many LF-Tags become unmanageable. Document your tagging ontology.
Database Security in Redshift
GRANT and REVOKE
Standard SQL permission model. Create users or groups, then GRANT or REVOKE permissions: SELECT, INSERT, UPDATE, DELETE, REFERENCES, CREATE, TEMPORARY, USAGE. Object owners get implicit GRANT, REVOKE, and DROP that can’t be revoked.
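If you generate these grants from code, keep the composer dumb and auditable. A sketch (no identifier quoting or escaping, so not production-safe as written):

```python
def grant_stmt(privileges, table, grantee, is_group=False):
    """Compose a Redshift-style GRANT statement. Identifiers are not
    escaped here; treat this as an illustration, not a SQL builder."""
    target = f"GROUP {grantee}" if is_group else grantee
    return f"GRANT {', '.join(privileges)} ON {table} TO {target};"

print(grant_stmt(["SELECT", "INSERT"], "sales.orders", "analysts", is_group=True))
# GRANT SELECT, INSERT ON sales.orders TO GROUP analysts;
```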
Role-Based Access Control (RBAC)
Create roles, assign users to roles, define permissions at the role level. Permission changes propagate to all users in the role. Supports nested roles: role 1 assigned to role 2 means users in role 2 get permissions from both.
Need CREATE ROLE permission, or a superuser grants it.
Row-Level Security (RLS)
Define which rows each user or role can access. Combine with column-based filters for fine-grained control. Users querying a table with RLS get results automatically filtered.
Best practices: keep RLS policies simple. Avoid complex statements and excessive table joins in policy definitions.
Dynamic Data Masking
Mask sensitive column data at query time. Define masking policies with custom obfuscation rules for specific users or roles. Conditional dynamic data masking goes further: apply masking at the cell level based on column values in the row.
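A masking rule boils down to a function of the value and the caller's role. A sketch of the common show-last-four pattern (the privileged role name is invented):

```python
import re

def mask_card(value, role):
    """Show the full card number only to privileged roles;
    everyone else sees the last four digits."""
    if role in {"fraud_analyst"}:  # hypothetical privileged role
        return value
    # Replace every digit that still has at least four digits after it.
    return re.sub(r"\d(?=\d{4})", "*", value)

print(mask_card("4111111111111111", "analyst"))        # ************1111
print(mask_card("4111111111111111", "fraud_analyst"))  # 4111111111111111
```

Redshift evaluates its masking policies at query time, so the table stores the real value and different roles see different results from the same SELECT.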
QuickSight Access Control
Two approaches:
IAM policies: control which users can create dashboards, datasets, which visualizations they access. Also need to grant access to underlying data sources like S3 prefixes.
Lake Formation integration: QuickSight dataset built through Athena on S3 data managed by Lake Formation? Lake Formation’s column, row, and tag-based permissions apply automatically. The query respects whatever access control Lake Formation defines for that IAM user.
Key Takeaways
Security in AWS is layered. Network security with VPCs and security groups is the outer wall. IAM authentication and authorization is the gate. Encryption with KMS protects the data itself. Lake Formation and Redshift security features provide fine-grained control over who sees what.
Recurring theme: least-privilege. Give only what’s needed, nothing more. Custom policies over managed ones. Dedicated roles over service-linked roles. Private subnets and VPC endpoints over public internet.
From real-world experience, security problems almost never come from sophisticated attacks. They come from lazy defaults, overly permissive roles, and credentials stored where they shouldn’t be. Get the basics right and you prevent most issues.