Network Security, Authentication, and Data Protection on AWS
Previous: Pipeline Resiliency and Cost Optimization
Chapter 7 covers data security and governance. You can build the most elegant data pipeline in the world, but if security is an afterthought, you’re one misconfigured S3 bucket away from a headline nobody wants.
I'm splitting this chapter into two posts. This first part covers network security, authentication, encryption, and access control; the second part covers data governance.
Network Security
VPC Basics
Amazon VPC (Virtual Private Cloud) is a logical network boundary. Your own private data center inside AWS. Every AWS account comes with a default VPC per region, but for production workloads you should create your own.
A VPC contains subnets, each living in a single Availability Zone. Subnets can be public or private, depending on whether their route table points to an internet gateway. Classic example: web servers in a public subnet, database in a private subnet. Users hit the web servers, web servers talk to the database. The database never touches the internet. Simple and correct.
Security Groups
Security groups are virtual firewalls at the instance level. They control inbound and outbound traffic to your resources. In the web/database example, configure the database security group to accept connections only from the web server security group. Nothing else gets in.
Best practices worth remembering:
Never use 0.0.0.0/0 for inbound rules. Opens your resource to the entire internet. Always specify exact IPs or reference another security group. For outbound, 0.0.0.0/0 is sometimes acceptable, like when your instance needs to pull code from GitHub or call third-party APIs.
Group related security groups. Ten Lambda functions all accessing one RDS database? Don’t create ten separate security groups. Group them. One or two security groups for related functions. Less to manage, fewer mistakes.
Don’t use default VPC or default security groups for production. Default configurations exist for convenience, not security. Too permissive. Create new VPCs and security groups with exact permissions your workload needs.
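These rules are easy to audit programmatically. A minimal sketch in Python, assuming security group data shaped like the output of EC2's describe_security_groups call (the group IDs here are made up):

```python
def open_ingress_rules(security_group):
    """Return (from_port, to_port) pairs for ingress rules open to the whole internet."""
    findings = []
    for rule in security_group.get("IpPermissions", []):
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                findings.append((rule.get("FromPort"), rule.get("ToPort")))
    return findings

sg = {
    "GroupId": "sg-0123456789abcdef0",  # hypothetical ID
    "IpPermissions": [
        # Bad: SSH open to the world.
        {"FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        # Good: database port reachable only from a referenced security group.
        {"FromPort": 5432, "ToPort": 5432, "IpRanges": [],
         "UserIdGroupPairs": [{"GroupId": "sg-webservers"}]},  # hypothetical SG
    ],
}
print(open_ingress_rules(sg))  # only the SSH rule is flagged
```

The security-group-referenced rule passes because it carries no `0.0.0.0/0` CIDR; that is exactly the web/database pattern described above.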
EMR Cluster VPC Configuration
Amazon EMR on EC2 can be deployed in public or private subnets. For production, private subnet. The cluster connects to S3 through VPC endpoints, staying off the public internet.
Keep in mind:
- Once deployed in a private subnet, you can’t move it to a public subnet (or vice versa).
- Not all AWS services have VPC endpoints. For those that don’t, you need a NAT gateway or internet gateway.
- EMRFS uses DynamoDB under the hood. Private subnet EMR? Make sure routing to DynamoDB is configured.
Managed vs Unmanaged Services
AWS uses a shared responsibility model. AWS manages infrastructure. You manage your applications, network config, and security settings. Managed services mean AWS takes on more security burden. Less room for human error, better scalability. Also less flexibility and higher cost.
For the exam, understand the line between what AWS handles and what you handle. It shifts depending on the service.
VPC Endpoints
Scenario: EC2 instance needs to upload images to S3. Without a VPC endpoint, traffic goes over the public internet. Slow and insecure.
VPC endpoints create a private connection between your VPC and supported AWS services. Traffic stays within the AWS network. Two types: interface endpoints and gateway endpoints.
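Worth memorizing: only S3 and DynamoDB use gateway endpoints; every other supported service goes through an interface endpoint backed by AWS PrivateLink. A throwaway helper to make the split concrete:

```python
# Gateway endpoints exist only for S3 and DynamoDB; everything else
# that supports VPC endpoints uses an interface endpoint (PrivateLink).
GATEWAY_SERVICES = {"s3", "dynamodb"}

def endpoint_type(service):
    return "Gateway" if service.lower() in GATEWAY_SERVICES else "Interface"

print(endpoint_type("s3"))        # Gateway
print(endpoint_type("redshift"))  # Interface
```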
Redshift-managed VPC endpoints connect to a Redshift cluster in a different VPC (even a different account) through a private connection. Requirements: RA3 node type with a subnet group, cluster relocation or multi-AZ enabled. Default port is 5439, allow port ranges 5431-5455 and 8191-8215 in security groups. Not internet-accessible, which is the point.
OpenSearch Service-managed VPC endpoints work similarly through AWS PrivateLink. Private connection within the AWS network. Rules: only works with VPC-launched domains (not public access ones), same region only, HTTPS required (no HTTP), can’t create through CloudFormation – console or API only.
User Authentication and Authorization
IAM Credentials
Simplest way to authenticate: create an IAM user with access key and secret key. Attach policies for permissions. Group users into IAM groups.
Embedding IAM credentials directly in application code is a bad idea though. Credentials leak, rotate poorly, create operational headaches. Only use this for external non-AWS tools that must call AWS APIs.
IAM Role-Based Authentication
The recommended approach. Create an IAM role with specific permissions. Users or services assume the role to perform actions. Follow least-privilege: grant only the exact actions needed, restricted to specific resource ARNs.
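A least-privilege policy is just a JSON document with explicit actions and explicit resource ARNs. A sketch that builds one and refuses blanket resource access (the bucket name is hypothetical):

```python
import json

def least_privilege_policy(actions, resource_arns):
    """Build an IAM policy document granting only the listed actions
    on the listed resource ARNs. Refuses a bare '*' resource."""
    assert "*" not in resource_arns, "refuse blanket resource access"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(actions),
            "Resource": sorted(resource_arns),
        }],
    }

# Hypothetical bucket and prefix for illustration.
policy = least_privilege_policy(
    ["s3:GetObject", "s3:PutObject"],
    ["arn:aws:s3:::my-pipeline-bucket/raw/*"],
)
print(json.dumps(policy, indent=2))
```

Generating policies from code like this also makes "we'll tighten it later" unnecessary: the tight version is the easy version.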
From real production experience: overly permissive roles are one of the most common security issues. “Just give it admin access, we’ll tighten it later.” Later never comes.
Service-Linked Roles
A service-linked role is owned by an AWS service. Contains all permissions that service needs to call other services on your behalf. Can’t modify or attach managed policies to it.
Important distinction: a service role is an IAM role that a service assumes (you create and manage it). A service-linked role is created and managed by the service itself.
Managed vs Custom Policies
Three types:
- Managed policies: AWS-provided, pre-packaged permissions. Can’t edit. Good for quick setup. AWS keeps them updated.
- Inline policies: Embedded directly in a role. Not reusable. Avoid unless truly specific to one role.
- Custom policies: You define exact actions and resource ARNs. Reusable across multiple roles. Recommended for production because it follows least-privilege.
SSO with IAM Identity Center
IAM Identity Center sits on top of IAM. Centralizes access across multiple AWS accounts and SAML-enabled apps (Salesforce, Microsoft 365, etc.). Integrates with Active Directory. One login, access to everything the user is authorized for.
Lake Formation integration: connect Identity Center to Lake Formation. SSO-authenticated users get fine-grained data lake permissions managed by Lake Formation. For auditing, CloudTrail logs the IAM role by default. To track individual SSO users, opt in and enable S3-level CloudTrail event logging.
DataZone integration: SSO users can log into the DataZone data portal. Two assignment modes:
- Implicit: all Identity Center users can access the DataZone domain.
- Explicit: only selected users or groups get access.
Important: once you set the assignment mode on a DataZone domain, you can’t change it later. Choose carefully.
Data Security and Privacy
Securing S3
Control access through IAM roles, groups, and users. Use bucket policies for cross-account access or external customers. Resource-based policies define who can do what on specific resources.
For every S3 bucket: disable public access unless you are specifically serving public website content. This should be the default rule in every organization.
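S3's public access block is four boolean flags, and a bucket is only locked down when all four are on. A quick check, using the real field names from S3's PublicAccessBlockConfiguration:

```python
# The four flags accepted by S3's put_public_access_block call,
# shown as a plain dict; in practice you'd pass this via an SDK.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

def is_locked_down(config):
    """A bucket is fully locked down only when all four flags are on."""
    return all(config.get(flag) for flag in PUBLIC_ACCESS_BLOCK)

print(is_locked_down(PUBLIC_ACCESS_BLOCK))  # True
```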
Database Credential Management
Never hardcode database credentials in your application. Never pass them as environment variables either. Use AWS Secrets Manager.
Secrets Manager stores credentials and API keys with encryption. Supports auto-rotation and retrieval through API calls. Application code references a Secrets Manager key, not a password string. Integrates with CloudWatch for monitoring, CloudTrail for auditing.
The number of production systems I’ve seen with database passwords in environment variables or config files is disturbing. Secrets Manager fixes this cleanly.
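The application-side change is small. Assuming an RDS-style secret (a JSON SecretString with username, password, host, and port fields), the code parses the payload instead of reading a password from the environment. The SecretId in the comment is hypothetical:

```python
import json

# In production you'd fetch the payload with an SDK call such as
#   boto3.client("secretsmanager").get_secret_value(SecretId="prod/db")["SecretString"]
# Here we only show the parsing step, on a sample payload.
def db_connection_params(secret_string):
    """Turn a Secrets Manager SecretString into connection parameters."""
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret["port"]),
        "user": secret["username"],
        "password": secret["password"],
    }

sample = '{"username": "app", "password": "s3cret", "host": "db.internal", "port": "5432"}'
print(db_connection_params(sample)["host"])  # db.internal
```

The password never appears in code, config files, or environment variables; rotation becomes a Secrets Manager setting rather than a deployment.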
Encryption and Decryption
Even with proper authentication and authorization, data must be encrypted. Two dimensions: at rest and in transit.
Encryption at rest has two approaches:
- Server-side: encryption happens on the server infrastructure.
- Client-side: you encrypt data before sending it to the server.
Encryption in transit: use SSL/TLS certificates. AWS services that move data (DMS, DataSync, Backup, VPN) support encryption in transit by default.
AWS KMS Key Management
AWS Key Management Service (KMS) is where you create and manage cryptographic keys. Integrates natively with many AWS services.
S3 encryption options:
- SSE-S3: default encryption. Key managed by S3.
- SSE-KMS: you manage keys through KMS. Create, rotate, disable, delete.
- DSSE-KMS: dual-layer server-side encryption. For compliance standards requiring multi-layer encryption.
- SSE-C: you provide a custom key. S3 uses it for encryption.
- Client-side encryption: objects encrypted with AES-256 before upload. You manage the key entirely.
Important rule: data and the KMS key must be in the same region.
KMS best practices:
- Share KMS keys cross-account instead of creating separate keys in each account.
- Enable MFA for sensitive KMS actions like PutKeyPolicy and ScheduleKeyDeletion.
- Use key aliases instead of key ARNs or IDs. Aliases abstract the key identity and let the same name work across regions.
- Enable key rotation. AWS supports automatic rotation. Configure frequency from 90 days to 7 years.
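The last two practices are cheap to enforce in config validation. A sketch using the actual KMS limits: a rotation period of 90 to 2,560 days (roughly 7 years), and alias names that start with alias/ but never the reserved alias/aws/ prefix:

```python
def valid_rotation_period(days):
    """KMS automatic rotation accepts a period between 90 and 2560 days."""
    return 90 <= days <= 2560

def valid_key_alias(alias):
    """Aliases must start with 'alias/'; 'alias/aws/' is reserved for AWS managed keys."""
    return alias.startswith("alias/") and not alias.startswith("alias/aws/")

print(valid_rotation_period(365))           # True
print(valid_rotation_period(30))            # False
print(valid_key_alias("alias/data-lake"))   # True
print(valid_key_alias("alias/aws/s3"))      # False
```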
Enabling Encryption in Analytics Services
AWS Glue: encrypt the Glue Data Catalog (metadata and connection credentials) and Glue ETL jobs separately. For ETL jobs you can encrypt S3 data with SSE-S3 or SSE-KMS, CloudWatch logs, and Job Bookmarks metadata.
Amazon EMR: supports KMS for at-rest and in-transit encryption. Also SSE-S3, SSE-KMS for S3, encryption for HDFS (AES-256), NVMe encryption for instance stores, EBS volume encryption. Transit encryption depends on the open source application running on EMR.
Amazon Redshift: at-rest and in-transit encryption. At rest: KMS or a hardware security module (HSM). Enabling encryption on an existing cluster? Redshift migrates data automatically to a new encrypted cluster. In transit: HTTPS endpoint with ACM-issued SSL certificates for S3 and DynamoDB load/unload operations.
Sensitive Data Detection and Redaction
When data enters your lake or warehouse, privacy regulations may require you to detect PII and redact it. Names, addresses, credit card numbers. Handle both data at rest and in transit.
Amazon Macie
Uses machine learning and pattern matching to scan S3 for sensitive data. Detects names, addresses, phone numbers, credit cards, more. Pipe Macie events to EventBridge, set up SNS notifications to alert stakeholders.
Your automated PII scanner for historical data already sitting in S3.
Glue Sensitive Data Detection
Define rules to detect sensitive data and apply redaction: remove a column, mask values, or store masked data in a new column. Scan full dataset or just a sample.
Supported categories include universal PII (email, credit card), HIPAA fields (driver’s license, HCPCS codes), networking elements (IP addresses, MAC addresses), and country-specific PII. Custom detection rules via regex too.
Works for both data at rest (Glue Data Catalog tables) and data in transit (inside Glue ETL jobs). Very practical for building PII-compliant pipelines.
Fine-Grained Access Control with Lake Formation
Lake Formation is where AWS gets serious about data access control. Integrates with Glue, EMR, Athena, QuickSight, SageMaker, Redshift, and third-party tools like Collibra and Privacera.
Data Lake Registration
First step: register your S3 prefix as a data lake location in the Lake Formation console. The bucket can be in the same account as Lake Formation or a different one.
Permissions and Access Control
Multiple levels:
Name-based access control: select a database, then specific tables or all tables, define permissions through console, APIs, or CloudFormation.
Tag-based access control (LF-TBAC): assign LF-Tags to Glue Data Catalog resources (databases, tables, columns). When an IAM principal’s tag values match the resource tag values, access is granted. Scales much better than name-based control with hundreds of tables across multiple domains.
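The matching logic is simple to model: a grant maps tag keys to allowed values, and access requires every key in the grant to match the resource's tag value. A toy version (the tag names are invented for illustration):

```python
def tags_grant_access(grant, resource_tags):
    """Simplified LF-TBAC check: every key in the grant must match the
    resource's tag value for that key. Missing keys deny access."""
    return all(
        resource_tags.get(key) in allowed_values
        for key, allowed_values in grant.items()
    )

# A grant for principals allowed to see non-restricted sales data.
grant = {"domain": {"sales"}, "sensitivity": {"public", "internal"}}

sales_table = {"domain": "sales", "sensitivity": "internal"}
finance_table = {"domain": "finance", "sensitivity": "internal"}

print(tags_grant_access(grant, sales_table))    # True
print(tags_grant_access(grant, finance_table))  # False
```

Note how adding the hundredth table costs nothing: tag it and the existing grants apply. That is the scaling argument for LF-TBAC over name-based control.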
Row and Column Filtering
Column-level security: hide specific columns from certain users. Table has 10 columns, 3 contain sensitive data. Define which columns each principal can see.
Row-level security: filter rows based on conditions. Common: table with data from multiple business units. Each BU sees only their rows using a filter like business_unit=BU1. Uses PartiQL filter expressions.
Cell-level security: combine row and column filters. BU1 users see only BU1 rows and only non-PII columns. Requires additional IAM permissions.
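Cell-level security is conceptually a row predicate plus a column allowlist. A toy model of the BU1 example (table and column names invented):

```python
def filter_cells(rows, row_predicate, allowed_columns):
    """Cell-level filter: keep only rows matching the predicate,
    then project each row onto the allowed columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]

orders = [
    {"business_unit": "BU1", "customer_email": "a@x.com", "amount": 120},
    {"business_unit": "BU2", "customer_email": "b@y.com", "amount": 80},
]

# BU1 analysts: only BU1 rows, and no PII column.
visible = filter_cells(
    orders,
    lambda r: r["business_unit"] == "BU1",
    ["business_unit", "amount"],
)
print(visible)  # [{'business_unit': 'BU1', 'amount': 120}]
```

Lake Formation evaluates the real thing server-side with PartiQL filter expressions; the point here is only that row and column rules compose independently.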
Lake Formation Best Practices
Don’t use bucket policies on S3 locations registered with Lake Formation. Lake Formation manages access. Adding bucket policies creates conflicts.
Don’t use the root AWS user as data lake admin. Create a separate IAM user. Least-privilege, always.
Don’t use the service-linked role in production. Too permissive. EMR on EC2 doesn’t support SLR-registered locations for data access. Encrypted catalogs don’t support SLR for cross-account sharing. Create a dedicated IAM role for registering data locations.
Cross-Account Sharing Best Practices
Lake Formation uses AWS Resource Access Manager for cross-account grants:
- Use AWS Organizations to structure accounts. Makes granting permissions easier.
- Instead of per-table permissions, combine tables into a database and use the All Tables permission. One grant instead of many.
- Create a placeholder database and grant CREATE_TABLE to ALLIAMPrincipal. All IAM principals in the recipient account can create resource links and query shared tables.
Tag-Based Access Control Best Practices
- Define tags before assigning them. Designate a team responsible for tag management.
- Tags are stored in lowercase. Plan accordingly.
- Wildcards not supported. To tag all tables in a database, tag the database. Tables inherit.
- Glue ETL jobs need full table access. Without it, jobs fail.
- Keep tagging simple. Too many LF-Tags become unmanageable. Document your tagging ontology.
Database Security in Redshift
GRANT and REVOKE
Standard SQL permission model. Create users or groups, then GRANT or REVOKE permissions: SELECT, INSERT, UPDATE, DELETE, REFERENCES, CREATE, TEMPORARY, USAGE. Object owners get implicit GRANT, REVOKE, and DROP that can’t be revoked.
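If you generate these grants from code, keep the composer dumb and auditable. A sketch (no identifier quoting or escaping, so not production-safe as written):

```python
def grant_stmt(privileges, table, grantee, is_group=False):
    """Compose a Redshift-style GRANT statement. Identifiers are not
    escaped here; treat this as an illustration, not a SQL builder."""
    target = f"GROUP {grantee}" if is_group else grantee
    return f"GRANT {', '.join(privileges)} ON {table} TO {target};"

print(grant_stmt(["SELECT", "INSERT"], "sales.orders", "analysts", is_group=True))
# GRANT SELECT, INSERT ON sales.orders TO GROUP analysts;
```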
Role-Based Access Control (RBAC)
Create roles, assign users to roles, define permissions at the role level. Permission changes propagate to all users in the role. Supports nested roles: role 1 assigned to role 2 means users in role 2 get permissions from both.
Need CREATE ROLE permission, or a superuser grants it.
Row-Level Security (RLS)
Define which rows each user or role can access. Combine with column-based filters for fine-grained control. Users querying a table with RLS get results automatically filtered.
Best practices: keep RLS policies simple. Avoid complex statements and excessive table joins in policy definitions.
Dynamic Data Masking
Mask sensitive column data at query time. Define masking policies with custom obfuscation rules for specific users or roles. Conditional dynamic data masking goes further: apply masking at the cell level based on column values in the row.
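A masking rule boils down to a function of the value and the caller's role. A sketch of the common show-last-four pattern (the privileged role name is invented):

```python
import re

def mask_card(value, role):
    """Show the full card number only to privileged roles;
    everyone else sees the last four digits."""
    if role in {"fraud_analyst"}:  # hypothetical privileged role
        return value
    # Replace every digit that still has at least four digits after it.
    return re.sub(r"\d(?=\d{4})", "*", value)

print(mask_card("4111111111111111", "analyst"))        # ************1111
print(mask_card("4111111111111111", "fraud_analyst"))  # 4111111111111111
```

Redshift evaluates its masking policies at query time, so the table stores the real value and different roles see different results from the same SELECT.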
QuickSight Access Control
Two approaches:
IAM policies: control which users can create dashboards, datasets, which visualizations they access. Also need to grant access to underlying data sources like S3 prefixes.
Lake Formation integration: QuickSight dataset built through Athena on S3 data managed by Lake Formation? Lake Formation’s column, row, and tag-based permissions apply automatically. The query respects whatever access control Lake Formation defines for that IAM user.
Key Takeaways
Security in AWS is layered. Network security with VPCs and security groups is the outer wall. IAM authentication and authorization is the gate. Encryption with KMS protects the data itself. Lake Formation and Redshift security features provide fine-grained control over who sees what.
Recurring theme: least-privilege. Give only what’s needed, nothing more. Custom policies over managed ones. Dedicated roles over service-linked roles. Private subnets and VPC endpoints over public internet.
From real-world experience, security problems almost never come from sophisticated attacks. They come from lazy defaults, overly permissive roles, and credentials stored where they shouldn’t be. Get the basics right and you prevent most issues.