System Design Interview Guide for Cloud Engineers (2026)
System design interviews are where cloud and DevOps engineers either shine or struggle. There's no single correct answer — but there are clear patterns that distinguish a 4/10 from a 9/10.
The Framework: SCALE
S — Scope the requirements (5 minutes)
- How many users?
- Read-heavy or write-heavy?
- Consistency requirement? (strong vs eventual)
- Latency SLA?
- Geographic distribution?
C — Calculate capacity (3 minutes)
- QPS (queries per second)
- Storage requirements
- Bandwidth
- Number of servers
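The capacity step is mostly arithmetic, so it is worth rehearsing. Below is a minimal sketch of the four estimates above. Every input in the example call (10 million daily users, 20 requests per user, 2 KB per request, 10% writes, a 3x peak factor, 2,000 QPS per server) is a hypothetical assumption chosen for illustration, not a figure from a real system.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_capacity(daily_users, requests_per_user, bytes_per_request,
                      write_fraction, peak_factor, qps_per_server):
    """Back-of-envelope capacity estimate; every input is an assumption."""
    daily_requests = daily_users * requests_per_user
    avg_qps = daily_requests / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    # Only writes add to storage; reads are served from what is already stored.
    storage_gb_per_day = daily_requests * write_fraction * bytes_per_request / 1e9
    return {
        "avg_qps": round(avg_qps),
        "peak_qps": round(peak_qps),
        "storage_gb_per_day": round(storage_gb_per_day, 1),
        "peak_bandwidth_mb_s": round(peak_qps * bytes_per_request / 1e6, 1),
        "servers_at_peak": math.ceil(peak_qps / qps_per_server),
    }


if __name__ == "__main__":
    print(estimate_capacity(
        daily_users=10_000_000, requests_per_user=20, bytes_per_request=2_000,
        write_fraction=0.1, peak_factor=3, qps_per_server=2_000,
    ))
```

In an interview, state the assumptions out loud and round aggressively; the order of magnitude matters more than the exact number.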
A — API design (3 minutes)
Define key endpoints before drawing boxes.
L — Layout the high-level design (10 minutes)
Draw major components. Get the full picture first.
E — Evolve and deep dive (15 minutes)
Add caching, CDN, database sharding, queues — pick the hardest part and go deep.
Common Cloud System Design Questions
1. Design a scalable CI/CD pipeline
High-level architecture:
GitHub/GitLab
↓ webhook
Build Server (GitHub Actions / Jenkins)
↓
Artifact Store (ECR / S3)
↓
Deployment Engine (ArgoCD / Flux)
↓
Kubernetes Cluster (EKS)
↓
Monitoring (Prometheus + Grafana)
Deep dive points that impress:
- Build caching: Docker layer caching can cut build time sharply when dependencies rarely change (for example, from about 10 minutes to about 2)
- Parallel testing: Split test suite across multiple runners
- Progressive delivery: Canary deployments — 5% traffic to new version, auto-rollback on SLO breach
- Security gates: SAST, dependency scanning, container scanning before any deployment
- Feature flags: Separate deployment from release
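The canary bullet above can be made concrete with a small decision function. This is a sketch under stated assumptions, not the logic of ArgoCD, Flux, or any specific rollout tool: the thresholds (1% error rate, 500 ms p99, 100 requests minimum) and the traffic schedule (5, 25, 50, 100) are hypothetical SLO values.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Metrics observed for the canary over one evaluation window."""
    requests: int
    errors: int
    p99_latency_ms: float


def evaluate_canary(window, max_error_rate=0.01, max_p99_ms=500.0,
                    min_requests=100):
    """Return 'wait', 'rollback', or 'promote' for one window.

    Thresholds are hypothetical SLO values, not defaults from any tool.
    """
    if window.requests < min_requests:
        return "wait"  # too little traffic to judge either way
    error_rate = window.errors / window.requests
    if error_rate > max_error_rate or window.p99_latency_ms > max_p99_ms:
        return "rollback"
    return "promote"


def next_traffic_weight(current_percent, decision, steps=(5, 25, 50, 100)):
    """Advance along a fixed schedule, hold on 'wait', drop to 0 on rollback."""
    if decision == "rollback":
        return 0
    if decision == "wait":
        return current_percent
    for step in steps:
        if step > current_percent:
            return step
    return 100
```

The minimum-request guard is the detail worth mentioning: without it, a single failed request in a quiet window can trigger a rollback.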
2. Design a multi-region active-active setup on AWS
Database replication:
- Aurora Global Database: primary in us-east-1, replicas in eu-west-1 and ap-south-1
- Replication lag: typically less than 1 second
- Failover: promote replica to primary in ~1 minute
Traffic routing:
- Route 53 latency-based routing: user goes to nearest healthy region
- Health checks: automatic failover if region becomes unhealthy
- Geolocation routing for compliance (GDPR restricts transfers of EU personal data outside the EU, so many teams pin EU users to EU regions)
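The routing rules above combine three inputs: latency, health, and a residency restriction. The sketch below models that decision only. It is not how Route 53 is implemented, and the function name, the latency figures, and the `allowed` parameter are hypothetical.

```python
def pick_region(latency_ms, healthy, allowed=None):
    """Pick the lowest-latency healthy region, optionally restricted.

    latency_ms: dict mapping region -> latency observed for this user
    healthy:    set of regions currently passing health checks
    allowed:    optional set of regions this user may be served from
                (for example, EU-only for data residency)
    Returns None when no region satisfies every constraint.
    """
    candidates = [
        region for region in latency_ms
        if region in healthy and (allowed is None or region in allowed)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


if __name__ == "__main__":
    latency = {"us-east-1": 120, "eu-west-1": 30, "ap-south-1": 200}
    print(pick_region(latency, {"us-east-1", "eu-west-1", "ap-south-1"}))
    # With eu-west-1 unhealthy, the user fails over to the next-nearest region.
    print(pick_region(latency, {"us-east-1", "ap-south-1"}))
```

The `None` case is a useful discussion point: if an EU-restricted user's only permitted region is down, the design must choose between violating residency and failing the request.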
State management:
- DynamoDB Global Tables (multi-master, eventually consistent)
- ElastiCache Global Datastore for sessions
- S3 with Cross-Region Replication
The hard problem — split brain:
When writes go to multiple regions simultaneously, you risk conflicts. Most production systems are active-passive for writes, active-active for reads. Be honest about this tradeoff.
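To show why concurrent multi-region writes are risky, here is a minimal sketch of last-writer-wins reconciliation, the strategy DynamoDB Global Tables apply to conflicting writes. The `Version` type and the tie-break on region name are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One region's write to the same key."""
    value: str
    timestamp_ms: int
    region: str


def resolve_lww(a, b):
    """Last-writer-wins: the higher timestamp survives.

    The region name breaks exact timestamp ties so every replica picks
    the same winner (a hypothetical tie-break for this sketch).
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    us = Version("cart=[book]", 1000, "us-east-1")
    eu = Version("cart=[pen]", 1001, "eu-west-1")
    winner = resolve_lww(us, eu)
    # The US write is silently discarded: the book never reaches the cart.
    print(winner.value)
```

Both writes were acknowledged to their users, yet only one survives. That silent loss is why many teams route all writes for a given key to a single region.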
3. Design a log aggregation system for 1000 microservices
Volume calculation:
1000 services × 1000 req/s × 1KB per log = 1GB/s of logs
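The same arithmetic extends to daily volume and hot-storage size. The sketch below assumes one log line per request and 1 KB = 1,000 bytes, as the line above implies. The 5x compression ratio is a hypothetical figure; measure your own.

```python
SECONDS_PER_DAY = 86_400


def log_volume(services=1000, req_per_s=1000, bytes_per_log=1_000,
               hot_days=30, compression_ratio=5.0):
    """Estimate log throughput and hot-tier storage (decimal units)."""
    bytes_per_s = services * req_per_s * bytes_per_log
    tb_per_day = bytes_per_s * SECONDS_PER_DAY / 1e12
    hot_tb_raw = tb_per_day * hot_days
    return {
        "gb_per_s": bytes_per_s / 1e9,
        "tb_per_day": tb_per_day,
        "hot_storage_tb_raw": hot_tb_raw,
        # compression_ratio is an assumption, not a measured value
        "hot_storage_tb_compressed": hot_tb_raw / compression_ratio,
    }


if __name__ == "__main__":
    for name, value in log_volume().items():
        print(f"{name}: {value:,.1f}")
```

At 1 GB/s the raw volume is roughly 86 TB per day, so 30 days of hot storage is in the petabyte range before compression. That number is what justifies the tiering and selective indexing discussed below.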
Architecture:
Services → Fluent Bit (sidecar/DaemonSet)
→ Kafka (buffer + replay)
→ Logstash/Flink (processing + enrichment)
→ OpenSearch (hot storage, 30 days)
→ S3 + Athena (cold storage, 1 year)
Why Kafka in the middle:
- Decouples producers from consumers
- Handles traffic spikes
- Allows replay if downstream fails
- Multiple consumers (alerting, storage, analytics)
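The replay and multi-consumer properties both come from one idea: the log is append-only and each consumer tracks its own offset. The toy model below illustrates that idea in memory. It is not the Kafka client API, and the class names are hypothetical.

```python
class AppendOnlyLog:
    """Toy stand-in for a single partition: an append-only list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        """Reading never removes records, so any consumer can re-read."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer owns its offset, so a slow one never blocks the others."""

    def __init__(self, log, name):
        self.log = log
        self.name = name
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        """Replay: go back to an earlier offset after a downstream failure."""
        self.offset = offset


if __name__ == "__main__":
    log = AppendOnlyLog()
    for i in range(5):
        log.append(f"line-{i}")
    storage, alerting = Consumer(log, "storage"), Consumer(log, "alerting")
    print(storage.poll(3), alerting.poll())
    storage.rewind(1)  # pretend the OpenSearch write for offsets 1-2 failed
    print(storage.poll(2))
```

A traditional queue deletes a message once it is consumed, which is why replay and independent consumers are harder to get without a log in the middle.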
Cost optimization:
- Compress at the agent level (Fluent Bit supports zstd)
- Index only fields you query in OpenSearch
- Move to S3 after 30 days, query with Athena
4. How would you reduce AWS costs by 40%?
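1. Rightsizing: Use AWS Compute Optimizer. Typical saving: 10-15% on the resized resources
2. Savings Plans: Commit to baseline usage. Typical saving: 40-60% on the committed compute spend
3. Spot instances: Move stateless, interruption-tolerant workloads to Spot. Typical saving: 70-90% versus On-Demand for those workloads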
4. S3 Intelligent Tiering: Automatic storage class management
5. Data transfer: Check if NAT Gateway is routing traffic that could go directly
6. Idle resources: Unattached EBS volumes, unused Elastic IPs, idle load balancers
7. RDS: Aurora Serverless v2 for variable workloads. Remove Multi-AZ from dev/staging.
Prioritize by impact: Savings Plans first (biggest bang), then Spot, then micro-optimizations.
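Each percentage above applies only to the slice of the bill that lever touches, so reaching 40% overall means adding up several slices. The sketch below does that sum. The bill size, the share of spend per lever, and the discount rates in `EXAMPLE_LEVERS` are all hypothetical numbers picked to illustrate the method.

```python
def projected_savings(monthly_bill, levers):
    """Sum savings across non-overlapping levers.

    levers: list of (name, share_of_bill_affected, discount_rate).
    Shares must not overlap: a dollar already covered by a Savings Plan
    cannot also be moved to Spot.
    Returns (dollars_saved, fraction_of_bill_saved).
    """
    total_share = sum(share for _, share, _ in levers)
    if total_share > 1.0:
        raise ValueError("lever shares overlap or exceed the bill")
    saved = sum(monthly_bill * share * rate for _, share, rate in levers)
    return saved, saved / monthly_bill


# Hypothetical breakdown of a 100,000 USD monthly bill.
EXAMPLE_LEVERS = [
    ("savings_plans", 0.40, 0.45),   # steady-state compute
    ("spot", 0.15, 0.70),            # stateless workloads
    ("rightsizing", 0.10, 0.12),     # remaining On-Demand compute
    ("s3_tiering", 0.10, 0.30),
    ("data_transfer", 0.05, 0.50),
    ("idle_resources", 0.05, 1.00),  # deleted outright
]

if __name__ == "__main__":
    saved, fraction = projected_savings(100_000, EXAMPLE_LEVERS)
    print(f"saved {saved:,.0f} USD per month ({fraction:.1%})")
```

With these assumed inputs the total lands at roughly 40%, and Savings Plans plus Spot account for about 70% of the saving, which matches the prioritization above.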
Infrastructure as Code Design
How would you structure Terraform for a large organisation?
The three-layer model:
Layer 1: Foundation
- VPC, subnets, Transit Gateway
- IAM roles, SCPs, security baselines
Layer 2: Platform (shared services)
- EKS clusters, RDS, shared load balancers
Layer 3: Application (per-team)
- Application-specific resources
- Depends on Layer 1 & 2 via remote state
Key principles:
- Remote state in S3 with state locking (a DynamoDB table, or the S3 backend's native lockfile in recent Terraform versions)
- Separate state files per environment
- Module versioning — teams pin to specific module versions
- Policy as code — Sentinel or OPA prevents non-compliant resources
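Two rules from the model above can be written down precisely: one state file per layer and environment, and dependencies that only point downward. The sketch below expresses both in Python for illustration. The key layout and function names are a hypothetical convention, not something Terraform requires or generates.

```python
LAYERS = {"foundation": 1, "platform": 2, "application": 3}


def state_key(layer, environment, stack):
    """Build one state-file key per environment, layer, and stack.

    The path layout is an assumed naming convention for this sketch.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{environment}/{layer}/{stack}/terraform.tfstate"


def may_read_state(consumer_layer, producer_layer):
    """A layer may read remote state only from layers below it."""
    return LAYERS[producer_layer] < LAYERS[consumer_layer]


if __name__ == "__main__":
    print(state_key("application", "prod", "checkout"))
    print(may_read_state("application", "foundation"))  # allowed
    print(may_read_state("foundation", "platform"))     # upward dependency
```

Keeping dependencies one-directional means a change in an application stack can never force a re-plan of the network layer, which limits the blast radius of any single apply.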
Practice System Design Out Loud
Saying "use a message queue" is one thing. Explaining SQS dead letter queues, and why the visibility timeout should be at least 6x the Lambda function timeout, is another. That gap is the difference between being passed over and getting hired.