AWS DevOps Interview Questions 2026: The Complete Guide

AWS interviews in 2026 have evolved. Companies aren't just testing tool knowledge — they're testing how you think under production pressure. Here are the questions that matter most, with the frameworks that score 8+/10 consistently.

Section 1: Core AWS Services

1. What is the difference between a Security Group and a NACL?

Why they ask this: This is a filtering question — they want to see if you understand stateful vs stateless, and whether you know when to use each.

Ideal answer framework:

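Security Groups are stateful: if you allow inbound traffic on port 80, the return traffic is automatically allowed. They attach to an instance's network interface, so they operate at the instance level, and they support allow rules only.

NACLs (Network Access Control Lists) are stateless: you must explicitly allow both inbound AND outbound traffic, including the ephemeral ports used by return traffic. They operate at the subnet level, support both allow and deny rules, and evaluate rules in number order (lowest first), stopping at the first match.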

When to use each:

Security Groups: instance-level control, allow rules by protocol, port, and source (IP range or another security group)
NACLs: subnet-level control, explicitly denying specific IPs or ranges, compliance requirements

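Follow-up they often ask: "If you wanted to block a specific IP from reaching your entire VPC, which would you use?" → NACL. Security Groups only support allow rules, so they cannot express "deny this IP"; a NACL deny rule with a low rule number drops the traffic at the subnet boundary before it reaches any instance. Because NACLs are per subnet, the rule has to be present on every subnet's NACL to cover the whole VPC.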
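To make the follow-up concrete, here is a minimal sketch of the two evaluation models. It is a toy model, not AWS's implementation: the class and function names are made up for illustration, only source IP and port are matched, and statefulness is not modelled. It shows why an ordered deny rule is possible in a NACL and not in a Security Group.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    cidr: str     # source CIDR the rule matches
    port: int     # destination port the rule matches
    allow: bool   # NACL rules can allow OR deny


def nacl_allows(rules, src_ip, port):
    """Return the verdict of the first matching rule, in ascending rule-number order."""
    addr = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port and addr in ipaddress.ip_network(rule.cidr):
            return rule.allow
    return False  # nothing matched: the implicit final deny applies


def security_group_allows(allowed, src_ip, port):
    """Security groups hold allow rules only, so any matching rule admits the traffic."""
    addr = ipaddress.ip_address(src_ip)
    return any(p == port and addr in ipaddress.ip_network(cidr) for cidr, p in allowed)


# Hypothetical rule sets (documentation-range IPs).
rules = [
    NaclRule(100, "203.0.113.7/32", 80, allow=False),  # block one bad IP first
    NaclRule(200, "0.0.0.0/0", 80, allow=True),        # then allow everyone else
]
sg_allowed = [("0.0.0.0/0", 80)]

print(nacl_allows(rules, "203.0.113.7", 80))                # blocked by rule 100
print(security_group_allows(sg_allowed, "203.0.113.7", 80)) # still admitted
```

Swapping the two rule numbers would let the blocked IP through, which is the rule-ordering point interviewers usually probe next.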

2. Explain VPC peering vs Transit Gateway

Why they ask this: Architects need to understand network topology at scale. This tests whether you've worked on multi-account or multi-region setups.

Ideal answer:

VPC Peering is a direct 1:1 connection between two VPCs. It's non-transitive — if VPC A peers with VPC B, and VPC B peers with VPC C, VPC A cannot talk to VPC C through B. Works across accounts and regions.

Transit Gateway is a hub-and-spoke model. All VPCs connect to the Transit Gateway, which routes traffic between them. Supports thousands of VPCs, on-premises connections, and is transitive by design.

Decision framework:

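Up to about 5 VPCs with simple connectivity → VPC Peering (no hourly charge for the connection, simpler)
More than 5 VPCs, multi-account, on-premises hybrid → Transit Gateway (per-attachment and per-GB charges, but one place to manage routing)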
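The scaling argument behind this framework can be sketched with a little arithmetic. The helper names below are hypothetical, and the Transit Gateway check assumes a single default route table that lets every attachment reach every other one, which real deployments may restrict.

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other VPC directly."""
    return vpc_count * (vpc_count - 1) // 2


def reachable_via_peering(peerings, src, dst):
    """Peering is non-transitive: only a direct link between src and dst counts."""
    return (src, dst) in peerings or (dst, src) in peerings


def reachable_via_transit_gateway(attached, src, dst):
    """With one shared route table, any two attached VPCs can reach each other."""
    return src in attached and dst in attached


peerings = {("A", "B"), ("B", "C")}

print(full_mesh_peerings(5))                        # connections for 5 VPCs
print(full_mesh_peerings(20))                       # connections for 20 VPCs
print(reachable_via_peering(peerings, "A", "C"))    # no direct A-C link
print(reachable_via_transit_gateway({"A", "B", "C"}, "A", "C"))
```

The mesh grows quadratically while Transit Gateway attachments grow linearly, one per VPC, which is why the crossover sits at a fairly small VPC count.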

3. Your EC2 instance is showing high CPU but your application load is normal. What do you investigate?

Why they ask this: This is a debugging scenario. They want to see your systematic troubleshooting approach.

Step-by-step investigation:

1. Check the processes: SSH in, run top or htop. Identify which process is consuming CPU.

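2. Check for noisy neighbors: On shared-tenancy hardware, other tenants' workloads can show up as steal time. Check %st in top. On T-series instances, high steal can also mean the instance is being throttled after running out of CPU credits.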

3. Check CloudWatch metrics: CPU credit balance (for T-series), network I/O, disk I/O — sometimes CPU spikes are I/O wait masquerading.

4. Check for crypto mining: An unusual process with a generic name consuming high CPU is a red flag.

5. Check auto scaling triggers: Was a scaling event supposed to happen but didn't?

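Common gotcha: T2/T3 instances earn and spend CPU credits. In standard mode (the T2 default), an instance that runs out of credits is throttled to its baseline. T3 launches in unlimited mode by default, so it keeps bursting and bills for the surplus credits instead. Check the CPUCreditBalance metric, and CPUSurplusCreditBalance for unlimited instances.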
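Steps 2 and 3 come down to asking where the CPU time actually went. As a rough sketch, the breakdown that top reports can be reproduced from two readings of the aggregate cpu line in /proc/stat on Linux. The sample lines below are made-up numbers, and the function names are illustrative.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn 'cpu 1000 0 500 ...' into a dict of cumulative tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: round(100.0 * v / total, 1) for k, v in deltas.items()}


# Hypothetical samples taken a few seconds apart.
sample_1 = "cpu 1000 0 500 8000 300 0 0 200 0 0"
sample_2 = "cpu 1100 0 550 8400 350 0 0 600 0 0"

breakdown = cpu_percentages(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
print(breakdown["user"], breakdown["iowait"], breakdown["steal"])
```

In this invented sample most of the non-idle time is steal rather than user time, which would point at the hypervisor or credit throttling rather than the application.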

4. What is S3 Transfer Acceleration and when would you use it?

Ideal answer: Transfer Acceleration uses CloudFront's edge network to speed up uploads to S3. Instead of uploading directly to your S3 bucket, you upload to the nearest CloudFront edge, which then transfers via AWS's optimized backbone network.

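Use when: You have users globally uploading large files (videos, backups) to a centralized S3 bucket. AWS cites improvements in the range of 50-500% for large files over long distances; the Transfer Acceleration Speed Comparison tool shows the gain from your own location.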

Don't use when: Users are in the same region as the S3 bucket — the overhead of routing through edge negates any benefit.
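On the client side, using Transfer Acceleration mostly means pointing uploads at the accelerate endpoint once the feature is enabled on the bucket. The helper below is a small illustrative sketch (the function name and bucket names are hypothetical) that builds that endpoint and enforces the rule that accelerated bucket names cannot contain periods.

```python
def accelerate_endpoint(bucket, dualstack=False):
    """Build the Transfer Acceleration endpoint URL for a bucket.

    Acceleration requires a DNS-compliant bucket name without periods.
    """
    if "." in bucket:
        raise ValueError("Transfer Acceleration does not support bucket names containing periods")
    host = "s3-accelerate.dualstack" if dualstack else "s3-accelerate"
    return f"https://{bucket}.{host}.amazonaws.com"


print(accelerate_endpoint("example-uploads"))
print(accelerate_endpoint("example-uploads", dualstack=True))
```

With boto3, the same effect is usually achieved by creating the S3 client with `Config(s3={"use_accelerate_endpoint": True})` rather than building URLs by hand.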

Section 2: CI/CD & Automation

5. Walk me through your CI/CD pipeline for a containerized application

Strong answer structure:

Developer pushes code →
GitHub Actions/Jenkins triggers →
  1. Unit tests + linting
  2. Docker image build
  3. Image scan (Trivy/Snyk)
  4. Push to ECR
  5. Update Helm chart / Kubernetes manifest
  6. Deploy to staging (ArgoCD/Flux)
  7. Integration tests
  8. Manual approval gate
  9. Deploy to production
  10. Smoke tests + rollback trigger

What makes this answer strong: mentioning image scanning, approval gates, and automatic rollback. Most candidates skip these.
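The control flow that matters in this answer (fail fast, stop at the approval gate, roll back on failed smoke tests) can be sketched in a few lines. This is a toy model with hypothetical stage names, not a real GitHub Actions or Jenkins configuration.

```python
def run_pipeline(stages, approved):
    """Run (name, step) pairs in order; each step() returns True on success."""
    log = []
    for name, step in stages:
        if name == "manual-approval" and not approved:
            log.append("halted: awaiting approval")
            return log
        if not step():
            log.append(f"failed: {name}")
            if name == "smoke-tests":
                # Production is already updated at this point, so undo the release.
                log.append("rollback: previous release restored")
            return log
        log.append(f"ok: {name}")
    return log


def make_stages(scan_ok=True, smoke_ok=True):
    """Abbreviated, hypothetical version of the ten steps above."""
    return [
        ("unit-tests", lambda: True),
        ("image-scan", lambda: scan_ok),
        ("manual-approval", lambda: True),
        ("deploy-production", lambda: True),
        ("smoke-tests", lambda: smoke_ok),
    ]


print(run_pipeline(make_stages(scan_ok=False), approved=True))
print(run_pipeline(make_stages(smoke_ok=False), approved=True))
```

A failed image scan stops the run before anything is deployed, while a failed smoke test triggers the rollback path because production has already changed.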

6. How do you handle secrets in a CI/CD pipeline?

What NOT to say: that you hardcode secrets as plain-text environment variables in the pipeline config, store them in code, or commit .env files to git.

Correct approaches:

AWS Secrets Manager with IAM role-based access — application retrieves secrets at runtime
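SSM Parameter Store: String parameters for non-sensitive config, SecureString (KMS-encrypted) parameters for secrets
HashiCorp Vault for multi-cloud or on-premises
Kubernetes Secrets with encryption at rest enabled (they are only base64-encoded by default), synced from an external store by external-secrets-operator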

Rotation: Mention that Secrets Manager supports automatic rotation with Lambda functions.
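Runtime retrieval can be sketched as follows. The function takes a Secrets Manager client as a parameter; in a real service that would be `boto3.client("secretsmanager")` running under an IAM role. The fake client, the secret name, and the JSON shape of the secret are assumptions so the example runs offline.

```python
import json


def load_db_credentials(client, secret_id):
    """Fetch a JSON-formatted secret at runtime instead of baking it into the image."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Offline stand-in that mimics the shape of the get_secret_value response."""

    def get_secret_value(self, SecretId):
        payload = {"username": "app", "password": "example-only"}
        return {"Name": SecretId, "SecretString": json.dumps(payload)}


# "prod/app/db" is a hypothetical secret name.
creds = load_db_credentials(FakeSecretsClient(), "prod/app/db")
print(creds["username"])
```

Because the secret is read on each start (or on a refresh interval), a rotated value is picked up without rebuilding or redeploying the pipeline artifacts.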

Section 3: Kubernetes on AWS (EKS)

7. What is the difference between a Deployment and a StatefulSet?

Deployment manages stateless pods. Pods are interchangeable — each gets a random name suffix.

StatefulSet manages stateful pods. Each pod has a stable, predictable identity (mysql-0, mysql-1). Pods are created and deleted in order. Each pod can have its own persistent volume.

Use Deployments for: Web servers, API services, workers.

Use StatefulSets for: Databases, Kafka, Elasticsearch — anything needing stable identity or persistent storage.
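The "stable identity" of a StatefulSet is concrete enough to compute. The sketch below assumes the default ordered pod management policy and the default cluster domain `cluster.local`; the helper names and the `mysql` example are illustrative.

```python
def statefulset_pod_names(name, replicas):
    """Pods are named <statefulset>-<ordinal>, starting at 0, in creation order."""
    return [f"{name}-{i}" for i in range(replicas)]


def statefulset_pod_dns(name, ordinal, service, namespace="default"):
    """Stable DNS name each pod gets through the governing headless Service."""
    return f"{name}-{ordinal}.{service}.{namespace}.svc.cluster.local"


def scale_down_order(name, replicas):
    """With ordered pod management, pods terminate from the highest ordinal down."""
    return list(reversed(statefulset_pod_names(name, replicas)))


print(statefulset_pod_names("mysql", 3))
print(statefulset_pod_dns("mysql", 0, service="mysql", namespace="db"))
print(scale_down_order("mysql", 3))
```

A Deployment offers none of these guarantees: a replacement pod gets a new random suffix, so clients must go through a Service rather than address a specific replica.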

8. Your pods are in CrashLoopBackOff. Walk me through your debugging process.

Step 1 — Describe the pod:

kubectl describe pod <pod-name> -n <namespace>

Step 2 — Check logs:

kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -n <namespace>

Step 3 — Common causes:

Root Cause	Signal	Fix
App crash	Exit code 1, stack trace in logs	Fix application bug
OOMKilled	Reason OOMKilled, exit code 137	Raise memory limits or fix the memory leak
Config error	"Cannot find config file"	Check ConfigMap/Secret mounts
Liveness probe failing	Probe failures in events	Fix probe path, raise initialDelaySeconds or failureThreshold, or add a startup probe
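The exit codes in the table follow a simple convention: values above 128 mean the container was killed by a signal numbered code minus 128. A small decoder, with a deliberately short signal table and hypothetical function name, might look like this.

```python
# Only the signals that commonly show up in crash loops.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code):
    """Translate a container exit code into a first debugging hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        sig = code - 128
        return f"killed by {SIGNAL_NAMES.get(sig, f'signal {sig}')}"
    return "application error"


for code in (1, 137, 143):
    print(code, explain_exit_code(code))
```

Exit code 137 only says the process received SIGKILL; it is the `OOMKilled` reason in the `kubectl describe pod` output that confirms memory was the cause.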

Section 4: Cost & Architecture

9. Your S3 costs tripled last month. How do you investigate?

1. AWS Cost Explorer: Filter by S3, break down by usage type

2. S3 Storage Lens: Check which buckets grew

3. Access logs: Look for unexpected GET patterns

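4. Lifecycle policies: Was a rule that moved data to Glacier or expired old objects deleted or disabled? Also check for noncurrent object versions and incomplete multipart uploads piling up.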

5. Data transfer costs: S3 → internet is expensive. Check cross-region transfers.
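Step 1 is essentially a month-over-month diff by usage type. The sketch below works on plain dictionaries with invented dollar figures; in practice the numbers would come from Cost Explorer grouped by usage type, and the function name is hypothetical.

```python
def biggest_increases(previous, current, top=3):
    """Rank usage types by how much their cost grew between two months."""
    usage_types = set(previous) | set(current)
    deltas = {u: current.get(u, 0.0) - previous.get(u, 0.0) for u in usage_types}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical monthly S3 costs in USD, keyed by usage type.
previous = {"TimedStorage-ByteHrs": 400.0, "DataTransfer-Out-Bytes": 100.0, "Requests-Tier1": 20.0}
current = {"TimedStorage-ByteHrs": 450.0, "DataTransfer-Out-Bytes": 1000.0, "Requests-Tier1": 110.0}

for usage_type, delta in biggest_increases(previous, current):
    print(f"{usage_type}: +${delta:.2f}")
```

In this made-up case the bill tripled while storage barely moved, so the investigation would go straight to data transfer and request patterns rather than bucket growth.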

10. What is the difference between Spot, On-Demand, and Reserved instances?

Type	Use Case	Savings	Risk
On-Demand	Unpredictable workloads	Baseline	None
Reserved (1 or 3 yr)	Predictable baseline load	Up to 72%	Locked in
Spot	Batch jobs, CI/CD, stateless workers	Up to 90%	2-min termination notice
Savings Plans	Flexible commitment	Up to 72%	Usage commitment

Interview tip: Always mention Spot interruption handling — use interruption notices, design for graceful shutdown, use mixed instance types in Auto Scaling Groups.
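Handling the interruption notice usually means polling the instance metadata path `spot/instance-action` and starting a graceful shutdown once it returns a document. The sketch below only covers parsing that document and computing the time left; the sample JSON and timestamps are made up, and the actual HTTP call to the metadata service is left out.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json, now):
    """Return (action, seconds remaining) from a Spot instance-action document."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


# Hypothetical notice and clock reading.
sample_notice = '{"action": "terminate", "time": "2026-01-15T08:22:00Z"}'
now = datetime(2026, 1, 15, 8, 20, 0, tzinfo=timezone.utc)

action, remaining = seconds_until_interruption(sample_notice, now)
print(action, remaining)
```

A worker would use the remaining seconds as its budget to stop accepting work, checkpoint, and deregister from the load balancer.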

How to Practice These Questions

Reading is not the same as answering under pressure.

InterviewDrill.io lets you practice exactly this. Paste a real AWS job description, and Joshua — your AI interviewer — asks these exact questions, references your resume, and scores every answer live.

First session is completely free → interviewdrill.io