AWS Solutions Architect Interview Questions 2026
Solutions Architect interviews test your ability to design systems that are reliable, scalable, and cost-effective — and to defend your decisions under questioning. Here's what you need to know.
What Interviewers Actually Test
Three things separate great Architect interview answers:
1. Tradeoff reasoning — Never just name a service. Explain why that service and what you gave up to choose it.
2. Failure mode thinking — Every design question has a "what if X fails?" follow-up. Address it proactively.
3. Cost awareness — Senior architects think about money. Mention cost implications of your choices.
High Availability & Fault Tolerance
1. What is the difference between High Availability and Fault Tolerance?
High Availability (HA): The system minimises downtime. It may briefly be unavailable during failover but recovers quickly. Target: 99.9% to 99.99% uptime.
Fault Tolerance: The system continues operating without any interruption even when components fail. No downtime at all. More expensive, because redundant capacity must be running and kept in sync at all times.
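To make the uptime targets concrete, this short Python sketch converts an availability percentage into a downtime budget. It assumes a 365-day year and a 30-day month. The function name `downtime_budget` is illustrative only.

```python
# Convert an availability target into allowed downtime.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 (ignores leap years)
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 (assumes a 30-day month)

def downtime_budget(availability_pct: float) -> tuple[float, float]:
    """Return (minutes/year, minutes/month) of downtime allowed by the target."""
    unavailable_fraction = 1 - availability_pct / 100
    return MINUTES_PER_YEAR * unavailable_fraction, MINUTES_PER_MONTH * unavailable_fraction

for target in (99.9, 99.95, 99.99, 99.999):
    per_year, per_month = downtime_budget(target)
    print(f"{target}% -> {per_year:,.1f} min/year, {per_month:,.1f} min/month")
```

At 99.99% the monthly budget is about 4.3 minutes, so a few one- to two-minute failovers in the same month would use it up. At 99.9% (about 43 minutes a month), the same failovers fit comfortably.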
Examples:
- Multi-AZ RDS: HA. If the primary fails, the standby is promoted, typically within 60-120 seconds (brief downtime)
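- Active-active multi-region: Fault tolerant. If one region fails, traffic shifts to the surviving region(s), typically via Route 53 health checks, with little or no user-visible interruption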
When to use which: Most applications need HA. True fault tolerance is reserved for payment processing, life-critical systems, and applications where even 60 seconds of downtime is unacceptable.
2. Design a highly available three-tier web application on AWS
Presentation tier (web/CDN):
- CloudFront in front of everything — caches static assets globally, absorbs DDoS
- ALB across two AZs minimum, targets in multiple AZs
- EC2 Auto Scaling Group or ECS/EKS for the web tier
Application tier:
- Internal ALB + Auto Scaling Group
- Multi-AZ deployment (minimum 2 AZs, ideally 3)
- EC2 instances or containers in private subnets
Data tier:
- Aurora MySQL/PostgreSQL with an Aurora Replica in a second AZ for automatic failover (the storage layer already keeps six copies across three AZs)
- ElastiCache (Redis) for session storage and caching — also Multi-AZ
- S3 for static files: 11 nines of durability, with S3 Standard storing data across at least three AZs
Key architectural decisions to mention:
- Private subnets for the app and data tiers; only the internet-facing ALB and the NAT Gateways sit in public subnets
- NAT Gateway per AZ — single NAT Gateway is a single point of failure
- Read replicas for Aurora to offload read traffic
- Separate security groups per tier with minimal access
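A minimal sketch of the subnet layout this design implies: one subnet per tier per AZ, carved from a hypothetical `10.0.0.0/16` VPC with Python's standard `ipaddress` module. The AZ labels, CIDR sizes, and helper names are assumptions. The only AWS-specific fact used is that every subnet reserves five addresses (the first four and the last).

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")  # hypothetical VPC range
AZS = ["az-a", "az-b", "az-c"]                   # hypothetical labels; use your Region's AZ names
TIERS = ["public", "app", "data"]                # public holds the ALB and one NAT Gateway per AZ

def plan_subnets(vpc, azs, tiers, new_prefix=24):
    """Assign one subnet per (tier, AZ) pair, taken in address order from the VPC CIDR."""
    blocks = vpc.subnets(new_prefix=new_prefix)  # iterator: 10.0.0.0/24, 10.0.1.0/24, ...
    return {(tier, az): next(blocks) for tier in tiers for az in azs}

def usable_hosts(subnet):
    """AWS reserves 5 addresses in every subnet (the first four and the last)."""
    return subnet.num_addresses - 5

plan = plan_subnets(VPC_CIDR, AZS, TIERS)

# Each app subnet's default route points at the NAT Gateway in the public subnet of the
# same AZ, so losing one AZ does not cut outbound access for the other AZs.
nat_route = {plan[("app", az)]: plan[("public", az)] for az in AZS}

for (tier, az), cidr in plan.items():
    print(f"{tier:<6} {az}  {cidr}  ({usable_hosts(cidr)} usable)")
```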
3. What is the difference between RTO and RPO?
RTO (Recovery Time Objective): How long can you be down? The maximum acceptable time to restore service after a failure.
RPO (Recovery Point Objective): How much data can you lose? The maximum acceptable data loss, measured as the time between the last recoverable copy of the data and the failure.
Examples:
- RTO = 4 hours, RPO = 24 hours: A daily backup restored to a new server is acceptable
- RTO = 5 minutes, RPO = 0: Need continuous (synchronous) replication and automatic failover
Disaster recovery strategies by RTO and RPO:
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | Lowest |
| Pilot Light | 10-30 min | Minutes | Low |
| Warm Standby | 2-10 min | Seconds | Medium |
| Multi-Site Active-Active | Seconds | Zero | Highest |
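The table can be read as a lookup: pick the cheapest strategy whose worst case still meets both objectives. The Python sketch below does exactly that. The minute values are illustrative upper bounds chosen to fit the table's ranges ("Hours", "Minutes", "Seconds"), not AWS-published figures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str
    rto_min: float  # illustrative worst-case RTO in minutes (assumption)
    rpo_min: float  # illustrative worst-case RPO in minutes (assumption)

# Ordered cheapest first, mirroring the table above.
STRATEGIES = [
    Strategy("Backup & Restore", rto_min=4 * 60, rpo_min=24 * 60),
    Strategy("Pilot Light", rto_min=30, rpo_min=15),
    Strategy("Warm Standby", rto_min=10, rpo_min=1),
    Strategy("Multi-Site Active-Active", rto_min=1, rpo_min=0),
]

def cheapest_strategy(rto_min: float, rpo_min: float) -> str:
    """Return the cheapest strategy whose worst case meets both objectives."""
    for s in STRATEGIES:
        if s.rto_min <= rto_min and s.rpo_min <= rpo_min:
            return s.name
    raise ValueError("No strategy in the table meets these objectives")

print(cheapest_strategy(rto_min=4 * 60, rpo_min=24 * 60))  # Backup & Restore
print(cheapest_strategy(rto_min=5, rpo_min=0))             # Multi-Site Active-Active
```

The two calls correspond to the two examples above. Note that the RPO alone can force the most expensive tier: an RPO of zero rules out every row except active-active, however relaxed the RTO is.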
Storage & Database Design
4. When would you use each AWS storage service?
| Service | Use Case | Key Property |
|---|---|---|
| S3 | Objects, backups, static assets, data lake | Unlimited scale, 11 nines durability |
| EBS | Block storage for EC2: boot volumes, self-managed databases | Low latency; attaches to one instance in one AZ (io1/io2 Multi-Attach aside) |
| EFS | Shared filesystem for multiple EC2 | NFS, scales automatically |
| S3 Glacier | Long-term archival | Very cheap; retrieval from milliseconds (Instant Retrieval) to hours (Flexible Retrieval, Deep Archive) |
| FSx for Lustre | HPC, ML training | Very high throughput |
Common architecture question: "Your application needs shared storage accessible by 50 EC2 instances simultaneously." → EFS, not EBS (EBS attaches to one instance) or S3 (latency too high for frequent small file access).
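"11 nines" is easier to defend in an interview with the arithmetic behind it. AWS states S3's durability as 99.999999999% per object over a given year and illustrates it as follows: store 10 million objects and expect, on average, to lose one object every 10,000 years. A tiny Python sketch of that calculation (the function name is illustrative):

```python
def expected_annual_loss(num_objects: int, nines: int) -> float:
    """Expected number of objects lost per year at the given durability."""
    annual_loss_probability = 10 ** -nines  # 11 nines -> 1e-11 per object per year
    return num_objects * annual_loss_probability

loss = expected_annual_loss(10_000_000, nines=11)
print(f"{loss:.4f} objects/year, i.e. one object every {1 / loss:,.0f} years")
```

Durability is not availability. S3 Standard is designed for 99.99% availability, so requests can fail briefly even though the data is safe.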
5. When would you use DynamoDB vs RDS?
DynamoDB:
- Single-digit millisecond latency at any scale
- Serverless; in on-demand capacity mode it scales with no capacity planning
- Limited query patterns — must design around access patterns upfront
- No joins
- Use for: user sessions, leaderboards, IoT telemetry, shopping carts, real-time gaming
RDS (Aurora/PostgreSQL/MySQL):
- Rich query language, complex joins, transactions
- Familiar SQL
- Scale by instance size + read replicas
- Use for: financial transactions, CRM data, anything needing complex queries or reporting
The trick question: "Can DynamoDB replace a relational database?" → Sometimes. If your access patterns are known and simple, DynamoDB is superior. If you need ad-hoc querying or complex relationships, RDS is better. Many modern architectures use both.
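The "design around access patterns upfront" point is easiest to see with a toy model. The Python sketch below is an in-memory stand-in for a DynamoDB table, not the DynamoDB API. Items are grouped by partition key and kept sorted by sort key, and `query` takes an exact partition key plus an optional sort-key prefix, mirroring a Query with a `begins_with` key condition. The key formats (`USER#<id>`, `ORDER#<date>`) are a hypothetical single-table convention.

```python
from bisect import bisect_left
from collections import defaultdict

class ToyTable:
    """Items grouped by partition key, sorted by sort key. Illustration only."""

    def __init__(self):
        self._partitions = defaultdict(list)  # pk -> sorted list of (sk, item)

    def put(self, pk: str, sk: str, item: dict) -> None:
        rows = self._partitions[pk]
        keys = [k for k, _ in rows]
        i = bisect_left(keys, sk)
        if i < len(rows) and rows[i][0] == sk:
            rows[i] = (sk, item)  # same (pk, sk) -> overwrite, like PutItem
        else:
            rows.insert(i, (sk, item))

    def query(self, pk: str, sk_prefix: str = "") -> list[dict]:
        """Exact partition key plus optional sort-key prefix; no other filters."""
        rows = self._partitions.get(pk, [])
        keys = [k for k, _ in rows]
        i = bisect_left(keys, sk_prefix)  # matching sort keys are contiguous from here
        out = []
        while i < len(rows) and rows[i][0].startswith(sk_prefix):
            out.append(rows[i][1])
            i += 1
        return out

table = ToyTable()
# Access pattern chosen up front: "a user's orders, by month" -> PK=user, SK=ORDER#<date>
table.put("USER#42", "PROFILE", {"name": "Ana"})
table.put("USER#42", "ORDER#2026-01-03", {"total": 30})
table.put("USER#42", "ORDER#2026-02-11", {"total": 12})
table.put("USER#7", "ORDER#2026-01-09", {"total": 99})

print(table.query("USER#42", "ORDER#2026-01"))  # [{'total': 30}]
print(len(table.query("USER#42", "ORDER#")))    # 2
```

Anything that cannot be expressed as "this partition, this sort-key range" (for example, "all orders over $50 across every user") has no efficient path here. In DynamoDB that means a full Scan or a new secondary index; in SQL it is just another WHERE clause.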
Networking & Security
6. A web application needs to be accessible from the internet, but the database must not be.
Standard VPC design:
Internet Gateway
↓
Public Subnet (10.0.1.0/24)
- ALB
- NAT Gateway
↓
Private Subnet - App (10.0.2.0/24)
- EC2 / ECS tasks
↓
Private Subnet - Data (10.0.3.0/24)
- RDS
- ElastiCache
Security Groups:
- ALB SG: Allow 80/443 from 0.0.0.0/0
- App SG: Allow 8080 from ALB SG only
- DB SG: Allow 5432 from App SG only
Key point: Security Groups reference each other by SG ID, not IP. This is more secure and doesn't break when instances scale.
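A small Python model of the rule evaluation above shows why SG-to-SG references survive scaling. It simplifies real behaviour: single ports instead of ranges, TCP only, and no modelling of statefulness. The SG IDs (`sg-alb`, `sg-app`, `sg-db`) and IP addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field

@dataclass
class SecurityGroup:
    sg_id: str
    inbound: list[tuple[int, str]] = field(default_factory=list)  # (port, CIDR or SG ID)

@dataclass
class Instance:
    ip: str
    sg_ids: set[str]

def is_allowed(src: Instance, dst_sg: SecurityGroup, port: int) -> bool:
    """True if any inbound rule on dst_sg admits traffic from src on this port."""
    for rule_port, source in dst_sg.inbound:
        if rule_port != port:
            continue
        if source.startswith("sg-"):
            if source in src.sg_ids:  # SG reference: group membership decides, not IP
                return True
        elif ipaddress.ip_address(src.ip) in ipaddress.ip_network(source):
            return True
    return False

alb_sg = SecurityGroup("sg-alb", [(80, "0.0.0.0/0"), (443, "0.0.0.0/0")])
app_sg = SecurityGroup("sg-app", [(8080, "sg-alb")])
db_sg = SecurityGroup("sg-db", [(5432, "sg-app")])

alb = Instance("10.0.1.10", {"sg-alb"})
app_1 = Instance("10.0.2.15", {"sg-app"})
app_2 = Instance("10.0.2.99", {"sg-app"})  # launched later by Auto Scaling: new IP, no rule change

print(is_allowed(app_2, db_sg, 5432))  # True
print(is_allowed(alb, db_sg, 5432))    # False: the ALB is not in sg-app
```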
7. What is AWS WAF and when do you use it?
AWS WAF (Web Application Firewall) filters HTTP/HTTPS traffic based on rules you define — or managed rule sets from AWS and AWS Marketplace.
Use when:
- SQL injection and XSS protection (OWASP rules)
- Rate limiting specific paths (login endpoints)
- Geo-blocking — block all traffic from specific countries
- Bot management
- Virtual patching while a code fix is being developed
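Rate limiting is the item on this list most often probed in follow-ups. AWS WAF expresses it as a rate-based rule: it counts requests per key (such as client IP) over a trailing time window, and it can be narrowed to a path like the login endpoint with a scope-down statement. The Python sketch below shows the counting idea only. It is not how AWS WAF is implemented, and the limit and window values are hypothetical.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-key request counter over a trailing window; blocks keys over the limit."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key (e.g. client IP) -> timestamps of allowed requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window_s:  # forget requests outside the window
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: block (this sketch does not count blocked requests)
        hits.append(now)
        return True

login_limiter = SlidingWindowLimiter(limit=5, window_s=60)  # hypothetical: 5 attempts/min per IP
print([login_limiter.allow("203.0.113.7", now=t) for t in range(7)])
# [True, True, True, True, True, False, False]
print(login_limiter.allow("203.0.113.7", now=61))  # True: the oldest attempts have aged out
```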
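Integration: Attach to ALB, CloudFront, or API Gateway (also AppSync GraphQL APIs and Cognito user pools, among others).
Cost consideration: WAF charges a monthly fee per web ACL and per rule, plus a fee per million requests inspected; some managed rule groups (such as Bot Control) add their own fees. For high-traffic applications, evaluate whether Cloudflare or a dedicated WAF solution might be more cost-effective.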
Cost Optimisation Architecture
8. Design a cost-optimised batch processing system on AWS
Requirement: Process 10,000 files nightly. Each file takes 2 minutes. Files arrive by midnight.
Naive approach: 24/7 EC2 fleet → pays for idle time 23 hours/day.
Optimised approach:
S3 (file uploads)
→ EventBridge scheduled rule at 00:05
→ Lambda (trigger)
→ SQS queue (10,000 messages)
→ ECS Fargate tasks (auto-scaled by SQS depth)
→ Process files → write results to S3/RDS
→ Tasks terminate when the queue is empty
Why this is optimal:
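- Fargate: Pay only for task execution time (~20,000 task-minutes/night, roughly $3-5 with small 0.25 vCPU / 0.5 GB tasks; larger tasks cost proportionally more)
- SQS: Decouples arrival from processing and handles failures with a dead-letter queue (DLQ)
- Auto-scaling: About 1,000 concurrent tasks (20,000 task-minutes / 20 minutes) clear the backlog in roughly 20 minutes, subject to the account's Fargate vCPU quota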
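Alternative: AWS Batch on EC2 Spot Instances, which can cost up to 90% less than On-Demand EC2 (Fargate Spot offers up to 70% off Fargate prices). Spot capacity can be reclaimed at short notice, so jobs must be safe to retry.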
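The sizing and cost figures for this design can be reproduced with a few lines of Python. Several inputs are assumptions: the Fargate rates (Linux/x86 on-demand prices in us-east-1 at the time of writing; check the current pricing page), the default task size (0.25 vCPU, 0.5 GB), and the premise that each file still takes 2 minutes at that size. The same arithmetic gives a desired task count when scaling on SQS queue depth.

```python
import math

# Assumed Fargate on-demand rates (Linux/x86, us-east-1); verify against current pricing.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445

def plan_batch(files: int, minutes_per_file: float, target_minutes: float,
               vcpu: float = 0.25, memory_gb: float = 0.5) -> tuple[int, float]:
    """Return (concurrent tasks needed to finish in target_minutes, Fargate compute cost in USD)."""
    task_minutes = files * minutes_per_file                # total work, independent of parallelism
    tasks = math.ceil(task_minutes / target_minutes)       # parallelism needed to hit the deadline
    hourly_rate = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return tasks, (task_minutes / 60) * hourly_rate

tasks, cost = plan_batch(files=10_000, minutes_per_file=2, target_minutes=20)
print(tasks, round(cost, 2))  # 1000 4.11
```

The estimate covers Fargate compute only; SQS, Lambda, S3 requests, and task start-up time are extra. At 0.25 vCPU, 1,000 tasks is 250 vCPUs, so check the account's Fargate vCPU quota before relying on that level of parallelism.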
Practice Architecture Questions Live
The only way to get good at design questions is to practice designing systems out loud, with someone who pushes back on your choices.
InterviewDrill.io has an AWS Solutions Architect track with real HLD scenarios. First session free → interviewdrill.io
