System Design Interview Guide for Cloud Engineers (2026)

System design interviews are where cloud and DevOps engineers either shine or struggle. There's no single correct answer — but there are clear patterns that distinguish a 4/10 from a 9/10.

The Framework: SCALE

S — Scope the requirements (5 minutes)

How many users?
Read-heavy or write-heavy?
Consistency requirement? (strong vs eventual)
Latency SLA?
Geographic distribution?

C — Calculate capacity (3 minutes)

QPS (queries per second)
Storage requirements
Bandwidth
Number of servers
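
The arithmetic behind these four numbers can be sketched in a few lines. Every input below (daily active users, requests per user, payload size, write share, peak factor, per-server throughput) is a made-up assumption for illustration; in an interview you would state your own and let the interviewer correct them.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    daily_active_users: int
    requests_per_user_per_day: int
    avg_payload_bytes: int      # average size of one request/response
    write_fraction: float       # share of requests that persist new data
    peak_factor: float          # peak traffic relative to the daily average
    qps_per_server: int         # assumed sustainable throughput per server


def estimate(w: Workload) -> dict:
    """Return rough QPS, storage, bandwidth, and server-count estimates."""
    requests_per_day = w.daily_active_users * w.requests_per_user_per_day
    avg_qps = requests_per_day / 86_400          # seconds in a day
    peak_qps = avg_qps * w.peak_factor
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        # Only writes add to stored data.
        "storage_bytes_per_day": requests_per_day * w.write_fraction * w.avg_payload_bytes,
        "peak_bandwidth_bytes_per_s": peak_qps * w.avg_payload_bytes,
        # Size the fleet for peak traffic, not the average.
        "servers": math.ceil(peak_qps / w.qps_per_server),
    }


# Hypothetical workload: 10M DAU, 20 requests each, 2 KB payloads,
# 10% writes, 3x peak, 1000 QPS per server.
example = Workload(10_000_000, 20, 2_000, 0.10, 3.0, 1_000)
result = estimate(example)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

With these assumed inputs the estimate comes to roughly 2,300 average QPS, about 6,900 peak QPS, 40 GB of new data per day, and 7 servers before adding headroom or redundancy.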

A — API design (3 minutes)

Define key endpoints before drawing boxes.

L — Lay out the high-level design (10 minutes)

Draw major components. Get the full picture first.

E — Evolve and deep dive (15 minutes)

Add caching, CDN, database sharding, queues — pick the hardest part and go deep.

Common Cloud System Design Questions

1. Design a scalable CI/CD pipeline

High-level architecture:

GitHub/GitLab
    ↓ webhook
Build Server (GitHub Actions / Jenkins)
    ↓
Artifact Store (ECR / S3)
    ↓
Deployment Engine (ArgoCD / Flux)
    ↓
Kubernetes Cluster (EKS)
    ↓
Monitoring (Prometheus + Grafana)

Deep dive points that impress:

Build caching: Docker layer caching can cut build times sharply when only the later layers change (for example, from 10 min to 2 min)
Parallel testing: Split test suite across multiple runners
Progressive delivery: Canary deployments — 5% traffic to new version, auto-rollback on SLO breach
Security gates: SAST, dependency scanning, container scanning before any deployment
Feature flags: Separate deployment from release
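
The canary point is the one most worth being able to spell out. Below is a minimal sketch of the promote-or-rollback decision, assuming you already have request and error counts for the baseline and the canary over the same time window. The thresholds and the names `WindowStats` and `canary_decision` are hypothetical; in practice a tool such as Argo Rollouts or Flagger would typically run this kind of analysis against Prometheus metrics.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    slo_error_rate: float = 0.01,        # assumed SLO: at most 1% errors
    min_requests: int = 500,             # avoid deciding on too little traffic
    max_relative_increase: float = 2.0,  # canary may be at most 2x the baseline
) -> str:
    """Return 'wait', 'rollback', or 'promote' for one analysis window."""
    if canary.requests < min_requests:
        return "wait"
    # Hard gate: the canary itself breaches the SLO.
    if canary.error_rate > slo_error_rate:
        return "rollback"
    # Soft gate: within SLO, but clearly worse than the current version.
    if baseline.error_rate > 0 and canary.error_rate > baseline.error_rate * max_relative_increase:
        return "rollback"
    return "promote"


# 5% of traffic on the canary, 2% of its requests failing: roll back.
print(canary_decision(WindowStats(9_500, 19), WindowStats(500, 10)))
```

The minimum-request guard matters in an interview answer: at 5% traffic a quiet service may need several minutes before the canary has enough data to judge.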

2. Design a multi-region active-active setup on AWS

Database replication:

Aurora Global Database: primary in us-east-1, replicas in eu-west-1 and ap-south-1
Replication lag: typically less than 1 second
Failover: promote replica to primary in ~1 minute

Traffic routing:

Route 53 latency-based routing: user goes to the healthy region with the lowest latency (usually, but not always, the nearest one)
Health checks: automatic failover if region becomes unhealthy
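Geolocation routing for data residency (for example, serving EU users from EU regions; GDPR restricts transfers of personal data outside the EU rather than banning them outright)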
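
How these three rules combine can be modelled in a few lines. This is only a sketch of the decision, not the Route 53 API: Route 53 makes the choice inside DNS using its own latency measurements, and the function name, latency figures, and `allowed` set below are hypothetical.

```python
from typing import Optional


def pick_region(
    latency_ms: dict[str, float],
    healthy: dict[str, bool],
    allowed: Optional[set[str]] = None,
) -> Optional[str]:
    """Pick the lowest-latency region that is healthy and permitted.

    `allowed` models a residency constraint, e.g. EU users limited to EU regions.
    Returns None when no region qualifies.
    """
    candidates = [
        region
        for region in latency_ms
        if healthy.get(region, False) and (allowed is None or region in allowed)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


latency = {"us-east-1": 120.0, "eu-west-1": 25.0, "ap-south-1": 180.0}
all_up = {"us-east-1": True, "eu-west-1": True, "ap-south-1": True}
eu_down = {**all_up, "eu-west-1": False}

print(pick_region(latency, all_up))                          # lowest latency wins
print(pick_region(latency, eu_down))                         # fails over to the next best
print(pick_region(latency, eu_down, allowed={"eu-west-1"}))  # residency leaves no option
```

The last call is the interesting one to raise in an interview: a strict residency rule and automatic cross-region failover pull in opposite directions, so an EU-only design needs a second EU region to fail over to.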

State management:

DynamoDB Global Tables (writable in every region, eventually consistent across regions by default, last-writer-wins conflict resolution)
ElastiCache Global Datastore for sessions
S3 with Cross-Region Replication

The hard problem — split brain:

When writes go to multiple regions simultaneously, you risk conflicts. Most production systems are active-passive for writes, active-active for reads. Be honest about this tradeoff.
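
A concrete way to show the conflict risk is last-writer-wins resolution, which is the strategy DynamoDB Global Tables applies to concurrent writes. The sketch below is a simplified model with hypothetical names and timestamps, not the service's implementation; it shows that the losing write is silently discarded.

```python
from dataclasses import dataclass


@dataclass
class Write:
    region: str
    timestamp_ms: int
    value: dict


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later timestamp.

    Ties are broken on region name so every replica picks the same winner.
    """
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))


# Two regions update the same user record 5 ms apart.
us = Write("us-east-1", 1_000, {"email": "new@example.com", "phone": "111"})
eu = Write("eu-west-1", 1_005, {"email": "old@example.com", "phone": "222"})

winner = last_writer_wins(us, eu)
print(winner.region, winner.value)
```

The EU write wins because it is 5 ms newer, so the email change made in us-east-1 is lost even though the two users edited different fields. That lost update is why many teams route all writes for a given record to one region.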

3. Design a log aggregation system for 1000 microservices

Volume calculation:

1000 services × 1000 req/s per service × 1 KB per log line (one line per request) = 1 GB/s of logs
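
Carrying that figure through to storage is a useful next step. The sketch below assumes 1 KB means 1000 bytes, one log line per request, and the 30-day hot and 1-year cold retention periods used in the architecture that follows. The results are raw volumes before compression, replication, or index overhead.

```python
def log_bytes_per_second(services: int, req_per_s_per_service: int,
                         bytes_per_log: int, logs_per_request: int = 1) -> int:
    """Raw log volume produced per second across all services."""
    return services * req_per_s_per_service * logs_per_request * bytes_per_log


SECONDS_PER_DAY = 86_400
TB = 10**12
PB = 10**15

per_second = log_bytes_per_second(1_000, 1_000, 1_000)
per_day = per_second * SECONDS_PER_DAY
hot_30_days = per_day * 30
cold_1_year = per_day * 365

print(f"per second: {per_second / 10**9:.1f} GB")
print(f"per day:    {per_day / TB:.1f} TB")
print(f"hot (30d):  {hot_30_days / PB:.2f} PB")
print(f"cold (1y):  {cold_1_year / PB:.1f} PB")
```

That is 86.4 TB per day and about 2.6 PB of raw data in the hot tier, which is why the cost section below focuses on compression and on indexing fewer fields.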

Architecture:

Services → Fluent Bit (sidecar/DaemonSet)
         → Kafka (buffer + replay)
         → Logstash/Flink (processing + enrichment)
         → OpenSearch (hot storage, 30 days)
         → S3 + Athena (cold storage, 1 year)

Why Kafka in the middle:

Decouples producers from consumers
Handles traffic spikes
Allows replay if downstream fails
Multiple consumers (alerting, storage, analytics)
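
The replay and multiple-consumer properties both come from one idea: the broker keeps an append-only log, and each consumer group tracks its own offset into it. The toy class below is an in-memory stand-in for a single partition, written only to illustrate that idea. It is not the Kafka client API, and the class and method names are hypothetical.

```python
class TopicLog:
    """In-memory stand-in for one partition of an append-only log."""

    def __init__(self) -> None:
        self._records: list[str] = []
        self._offsets: dict[str, int] = {}  # committed offset per consumer group

    def append(self, record: str) -> int:
        """Producers only append; they never wait for consumers."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 100) -> list[str]:
        """Read from the group's committed offset without advancing it."""
        start = self._offsets.get(group, 0)
        return self._records[start:start + max_records]

    def commit(self, group: str, count: int) -> None:
        """Advance the group's offset after records were processed."""
        self._offsets[group] = self._offsets.get(group, 0) + count

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip) a group, e.g. to replay after a downstream failure."""
        self._offsets[group] = offset


log = TopicLog()
for line in ["svc-a started", "svc-b error 500", "svc-a stopped"]:
    log.append(line)

batch = log.poll("storage")
log.commit("storage", len(batch))   # storage has consumed everything
print(log.poll("alerting"))         # alerting still sees all three records

log.seek("storage", 1)              # replay from offset 1 after a failed write
print(log.poll("storage"))
```

Because records stay in the log until retention expires, a downstream outage costs only consumer lag rather than lost logs, provided retention is longer than the outage.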

Cost optimization:

Compress at the agent level (Fluent Bit outputs support gzip; zstd availability depends on the output plugin and version)
Index only fields you query in OpenSearch
Move to S3 after 30 days, query with Athena

4. How would you reduce AWS costs by 40%?

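1. Rightsizing: Use AWS Compute Optimizer. Typical saving: 10-15% of the rightsized spend

2. Savings Plans: Commit to baseline usage. Typical saving: 40-60% on the covered compute, depending on term and payment option

3. Spot instances: Move stateless, interruption-tolerant workloads to Spot. Typical saving: 70-90% versus On-Demand for the workloads moved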

4. S3 Intelligent Tiering: Automatic storage class management

5. Data transfer: Check if NAT Gateway is routing traffic that could go directly

6. Idle resources: Unattached EBS volumes, unused Elastic IPs, idle load balancers

7. RDS: Aurora Serverless v2 for variable workloads. Remove Multi-AZ from dev/staging.

Prioritize by impact: Savings Plans first (biggest bang), then Spot, then micro-optimizations.
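
Each percentage above applies only to the slice of the bill that lever touches, so it helps to show how they add up to an overall figure. The worked example below uses an entirely hypothetical 100,000 USD monthly bill and made-up slices and rates; substitute real numbers from Cost Explorer.

```python
def blended_saving(total_bill: float, levers: list[tuple[str, float, int]]) -> float:
    """Overall saving as a fraction of the total bill.

    Each lever is (name, monthly spend it applies to, saving in percent).
    Slices are assumed not to overlap.
    """
    covered = sum(spend for _, spend, _ in levers)
    if covered > total_bill:
        raise ValueError("lever slices add up to more than the total bill")
    saved = sum(spend * percent / 100 for _, spend, percent in levers)
    return saved / total_bill


# Hypothetical monthly bill of 100,000 USD.
levers = [
    ("Savings Plans on steady compute",  45_000, 40),
    ("Spot for stateless workloads",     15_000, 70),
    ("Rightsizing remaining compute",    20_000, 10),
    ("Deleting idle resources",           5_000, 100),
    ("NAT Gateway / data transfer fixes", 8_000, 50),
]

overall = blended_saving(100_000, levers)
print(f"overall saving: {overall:.1%}")
```

In this made-up case the levers combine to just under 40%, with Savings Plans and Spot contributing about 70% of the total saving. No single lever reaches the target alone, which is the point to make when asked this question.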

Infrastructure as Code Design

How would you structure Terraform for a large organisation?

The three-layer model:

Layer 1: Foundation
  - VPC, subnets, Transit Gateway
  - IAM roles, SCPs, security baselines

Layer 2: Platform (shared services)
  - EKS clusters, RDS, shared load balancers

Layer 3: Application (per-team)
  - Application-specific resources
  - Depends on Layer 1 & 2 via remote state

Key principles:

Remote state in S3 with state locking (native S3 lockfile on Terraform 1.10+, or a DynamoDB table on older versions)
Separate state files per environment
Module versioning — teams pin to specific module versions
Policy as code — Sentinel or OPA prevents non-compliant resources
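
To make the policy-as-code principle concrete, here is a minimal sketch of the kind of check such a gate performs. Sentinel and OPA have their own policy languages, so this Python version only illustrates the logic. It reads the JSON form of a plan (as produced by `terraform show -json`), and the required tags and the list of resource types checked are hypothetical organisation rules.

```python
REQUIRED_TAGS = {"owner", "environment"}            # hypothetical org policy
TAGGED_TYPES = {"aws_s3_bucket", "aws_instance"}    # hypothetical scope of the rule


def violations(plan: dict) -> list[str]:
    """Return one message per created/updated resource missing required tags."""
    problems = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        # Ignore deletes and no-ops; only new or changed resources are checked.
        if "create" not in actions and "update" not in actions:
            continue
        if rc.get("type") not in TAGGED_TYPES:
            continue
        after = change.get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            problems.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return problems


# Trimmed-down example of the plan JSON structure.
plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"tags": {"owner": "platform"}}}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["create"],
                    "after": {"tags": {"owner": "web", "environment": "prod"}}}},
        {"address": "aws_instance.old", "type": "aws_instance",
         "change": {"actions": ["delete"], "after": None}},
    ]
}

for problem in violations(plan):
    print(problem)
```

In a pipeline, a non-empty result would fail the job before `terraform apply` runs, so non-compliant resources are blocked at review time rather than found in an audit.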

Practice System Design Out Loud

The gap between saying "use a message queue" and explaining SQS dead-letter queues with the visibility timeout set to at least 6x the Lambda timeout is the gap between a pass and a hire.

InterviewDrill.io has a dedicated System Design track. First session is free → interviewdrill.io