Site Reliability Engineering (SRE) Interview Questions 2026
SRE is now a distinct career track at top tech companies. Google, Netflix, Spotify, and hundreds of product companies hire explicitly for SRE roles. Whether you're interviewing for an SRE title or a senior DevOps role with reliability responsibilities, these are the 20 questions that matter most.
Section 1: SRE Principles & Philosophy
1. What is the difference between SRE and DevOps?
Why they ask this: This is the most fundamental SRE interview question. A vague answer signals you've read a blog post; a clear answer signals you understand the practice.
Ideal answer:
DevOps is a cultural and organizational philosophy — breaking down silos between development and operations, improving collaboration, and automating the software delivery process. It's a broad movement more than a prescriptive set of practices.
SRE is Google's specific, prescriptive implementation of DevOps principles. Ben Treynor (Google, 2003) defined it: "What happens when you ask a software engineer to design an operations function."
Key SRE-specific constructs that DevOps doesn't prescribe:
- SLIs, SLOs, error budgets — formal reliability measurement
- Toil elimination — explicit target to reduce operational work
- Postmortem culture — blameless, structured incident review
- Production readiness reviews — formal criteria before launch
- Engineering headcount cap on toil (~50% max)
Practical difference: A DevOps engineer might focus heavily on CI/CD pipelines and cloud automation. An SRE focuses on the production reliability of services at scale — measuring it, setting targets, and systematically eliminating the causes of unreliability.
2. Define SLI, SLO, and SLA. How are they different?
Why they ask this: This is the most tested SRE concept. Every SRE interview will touch this.
Ideal answer:
SLI (Service Level Indicator): A quantitative measure of service behavior. A specific metric that matters to users.
- Request success rate: successful_requests / total_requests
- Latency: percentage of requests served under 200ms
- Availability: fraction of time the service is up
SLO (Service Level Objective): A target value or range for an SLI. An internal commitment to reliability.
- "99.9% of requests return a successful response"
- "95% of requests are served under 200ms"
- SLOs are set internally by the product and SRE teams
SLA (Service Level Agreement): A contractual commitment to external customers, typically with financial penalties for violation.
- "We guarantee 99.9% uptime. If we miss it, customers get service credits."
- SLAs are typically looser than SLOs: you want to miss the SLO before you ever miss the SLA
Why SLOs > SLAs for operations: SLOs give you an internal warning signal before you violate the customer-facing SLA. The gap between your SLO (99.95%) and SLA (99.9%) is your buffer.
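The relationship can be sketched in a few lines. The request counts and targets below are illustrative; the SLO/SLA gap is the buffer described above:

```python
# Sketch: one SLI measured against an internal SLO and a contractual
# SLA. All numbers are illustrative.

def success_rate_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    return successful / total

SLO = 0.9995  # internal objective (99.95%)
SLA = 0.999   # contractual commitment (99.9%)

sli = success_rate_sli(successful=999_300, total=1_000_000)  # 0.9993

slo_met = sli >= SLO  # False: the internal warning fires first
sla_met = sli >= SLA  # True: still within the customer contract
```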
3. What is an error budget and how do you use it?
Why they ask this: Error budgets are the most operationally powerful SRE concept. It's what separates SRE from traditional ops.
Ideal answer:
An error budget is the acceptable amount of unreliability derived from your SLO.
If your SLO is 99.9% availability:
- Error budget = 0.1% = 43.8 minutes per month of allowed downtime
How error budgets change behavior:
Without an error budget, developers and operations teams are in constant conflict: developers want to ship faster, ops wants stability.
With an error budget, both teams share the same goal:
When you have budget remaining: Deploy freely, run experiments, take risks with new features. You have room to fail.
When budget is exhausted (or nearly): Freeze feature releases. Focus engineering effort on reliability improvements. No new deployments until budget recovers.
This changes the conversation: Instead of "ops is blocking my release," it becomes "we burned our budget, which means we need to fix reliability before we can ship new features."
Burn rate alerting: Alert when you're consuming the error budget significantly faster than sustainable. If your budget would run out in 2 days at the current failure rate, that's a critical alert — even if you're technically within SLO so far.
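The budget and burn-rate math can be sketched as follows, assuming an average 30.44-day month; the 1.4% failure rate is an invented example:

```python
# Sketch: error budget derived from an SLO, plus a burn-rate check.

MINUTES_PER_MONTH = 30.44 * 24 * 60

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime (or failed-request time) per month."""
    return (1 - slo) * MINUTES_PER_MONTH

def burn_rate(failure_rate: float, slo: float) -> float:
    """1.0 means the budget lasts exactly one month; higher burns faster."""
    return failure_rate / (1 - slo)

budget_min = error_budget_minutes(0.999)         # ~43.8 minutes
rate = burn_rate(failure_rate=0.014, slo=0.999)  # 14x sustainable
days_left = 30.44 / rate                         # ~2.2 days: page now
```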
4. What is toil and how do SREs eliminate it?
Why they ask this: Toil is a core SRE concept. Knowing how to identify and measure it is essential.
Ideal answer:
Toil (per Google SRE Book) is operational work that is:
- Manual
- Repetitive
- Automatable
- Tactical (reactive, not strategic)
- Grows linearly with service growth
Examples of toil:
- Manually restarting crashed processes
- Manually scaling servers during traffic spikes
- Handling the same class of tickets repeatedly
- Manually rotating certificates
- Copy-pasting commands from runbooks
Why it's harmful: Toil doesn't make the service more reliable — it just keeps it running. SREs should spend no more than 50% of their time on toil. Time above 50% is a problem to escalate.
Eliminating toil:
1. Measure it first — track toil sources in tickets, calculate hours per week
2. Automate the common path — write the runbook as code
3. Self-healing systems — auto-restart, auto-scale, auto-remediate
4. Fix root causes — if you restart a service manually weekly, fix the OOM bug
5. Reduce ticket volume — improve user tooling so they don't need SRE help
5. How do you conduct a blameless postmortem?
Why they ask this: Postmortem culture is a key differentiator of high-performing engineering organizations.
Ideal answer:
A blameless postmortem analyzes what happened during an incident without assigning personal blame — focusing on systemic fixes rather than individual failures.
Why blameless: If engineers fear punishment, they hide mistakes, incidents get under-reported, and root causes never surface. Psychological safety is a prerequisite for learning.
Postmortem structure:
1. Incident summary: One paragraph — what happened, impact, duration
2. Timeline: Minute-by-minute reconstruction of events (what happened, who noticed, what actions were taken)
3. Root cause analysis: Not "X made a mistake" — "The system allowed X to happen without preventing or detecting it"
4. Impact: Quantified — N users affected, $M revenue lost, X minutes of downtime
5. Action items: Specific, assigned, with due dates. Fixes go into the backlog with priority.
6. What went well: Acknowledge effective responses — this is learning, not punishment
Blameless framing: Change "Bob deployed bad code" → "Our deployment process didn't catch this class of error. Action: add integration test for Y."
Section 2: Reliability & Incident Management
6. How do you define and measure availability?
Ideal answer:
Availability = uptime / (uptime + downtime)
But this simple formula can be misleading for modern systems. More precise:
Request success rate SLI:
availability = good_requests / total_requests
Where "good" means: returned a correct response within the latency threshold. This is better than uptime because:
- A service can be "up" but returning 500s → bad availability
- A service can be "up" but extremely slow → bad availability
Multi-dimension availability:
- API availability (requests succeeding)
- Latency availability (requests succeeding within SLO)
- Correctness (responses are accurate, not just 200 OK)
Measurement: Use your load balancer metrics, APM tool, or synthetic monitoring. NOT internal health checks (which can show green when users are failing).
9s table: 99.9% = 43 min/month, 99.95% = 22 min/month, 99.99% = 4.3 min/month, 99.999% = 26 sec/month. Most production services target 99.9% to 99.99%.
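The request-success-rate SLI above can be sketched like this; the status codes and the 200 ms threshold are illustrative:

```python
# Sketch: a request counts as "good" only if it returned a success
# status within the latency threshold.

def availability(requests, latency_threshold_ms=200):
    """requests: iterable of (status_code, latency_ms) tuples."""
    total = good = 0
    for status, latency_ms in requests:
        total += 1
        if status < 500 and latency_ms <= latency_threshold_ms:
            good += 1
    return good / total

sample = [(200, 50), (200, 180), (500, 30), (200, 900)]
# The 500 and the 900 ms response both count against availability.
print(availability(sample))  # 0.5
```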
7. Walk me through a major incident response process
Ideal answer:
Immediate response (0-5 min):
- Acknowledge the alert — prevents duplicate responders
- Assess severity: How many users affected? Is revenue impacted?
- Declare incident if warranted — spin up a war room (Slack channel or video call)
- Page on-call engineer for the affected service if not already alerted
Mitigation first (5-30 min):
- Prioritize restoring service over finding root cause
- Rollback, restart, failover, scale — whatever gets users unblocked
- Communicate status updates every 10-15 minutes to stakeholders
Investigation (parallel or after mitigation):
- Form hypotheses: What changed recently? What does the data show?
- Check deployment history, config changes, traffic spikes
- Isolate — is it one region? One service? One dependency?
Resolution and follow-up:
- Confirm service is fully restored and stable
- Publish a preliminary timeline within 2 hours
- Full postmortem within 48-72 hours
- Track action items to completion
8. What are the four golden signals of monitoring?
Why they ask this: This is the Google SRE Book concept every SRE should know cold.
Ideal answer:
The four golden signals (from Google's SRE Book) are the minimum set of metrics to monitor for any service:
1. Latency: Time to service a request. Track the distribution — p50, p95, p99. Distinguish successful vs error latency (fast-failing errors can make your latency numbers look better than what users actually experience).
2. Traffic: How much demand is placed on your system — requests per second, messages per second, transactions per second. Understand your baseline to detect anomalies.
3. Errors: Rate of requests that fail — explicitly (5xx), implicitly (returning wrong data), or by policy (response over 500ms is a failure). Both rate and absolute count matter.
4. Saturation: How "full" your service is — CPU utilization, memory pressure, queue depth. Saturation is often a leading indicator — latency degrades before you hit hard limits.
Using them: Build a dashboard with all four. An alert on any of these not returning to baseline within N minutes should page on-call. Start here before building complex custom metrics.
9. What is a production readiness review and what does it cover?
Ideal answer:
A Production Readiness Review (PRR) is a structured assessment before a service receives SRE support and goes into production. It ensures reliability is built in, not bolted on.
Typical PRR checklist:
Architecture & reliability:
- Single points of failure identified and mitigated
- Load testing completed (system tested at 2-3x expected peak)
- Graceful degradation under partial failure
- Data backup and restore tested
Observability:
- The four golden signals are instrumented
- Alerts defined for user-impacting failures (not just system metrics)
- Runbooks exist for the top 5 expected failure scenarios
- Distributed tracing enabled
Deployment & rollback:
- Can deploy and rollback in under 15 minutes
- Canary or gradual rollout strategy exists
- Feature flags allow disabling problematic features without a deploy
Oncall & operational:
- Oncall rotation established
- Escalation path documented
- Known operational tasks automated (not manual)
- Toil < 10% of weekly engineering hours
10. What is capacity planning in the context of SRE?
Ideal answer:
Capacity planning ensures your system has enough resources to handle current and future load without unnecessary over-provisioning.
SRE capacity planning approach:
Step 1: Establish baseline load metrics
Current requests/second, CPU/memory utilization, storage growth rate. Segment by component (API, database, cache, queue).
Step 2: Model growth
Project traffic growth based on business metrics (user signups, feature launches). Conservative estimate + safety margin.
Step 3: Load test to find limits
Gradually ramp load until the system degrades. Identify the bottleneck: CPU? Memory? Database connections? Network?
Step 4: Set headroom targets
Typical targets: run at no more than 70% of maximum capacity in steady state. This gives you room to handle traffic spikes and absorb a zone/region failure.
Step 5: Plan for n+1 redundancy
For a 99.99% SLO, losing one AZ should be survivable. With two AZs, that means running each below 50% of capacity, since losing one doubles the load on the other; with N AZs, each must stay below (N-1)/N of capacity.
Auto-scaling vs capacity planning: Auto-scaling handles short-term spikes. Capacity planning handles structural growth that requires new reserved instances, database upgrades, or architectural changes.
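The headroom arithmetic from steps 4 and 5 can be sketched as follows, with illustrative numbers:

```python
# Sketch: highest safe steady-state utilization per AZ so that losing
# one AZ keeps every survivor under a headroom ceiling.

def max_safe_utilization(num_azs: int, headroom_ceiling: float = 0.7) -> float:
    """After one AZ fails, its traffic spreads over the remaining
    num_azs - 1 AZs; steady-state utilization must leave room for that."""
    return headroom_ceiling * (num_azs - 1) / num_azs

two_az = max_safe_utilization(2, headroom_ceiling=1.0)  # 0.5: the "< 50% per AZ" rule
three_az = max_safe_utilization(3)                      # ~0.47 with a 70% ceiling
```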
Section 3: SRE Tools & Advanced Concepts
11. How do you reduce MTTR (Mean Time to Recover)?
Why they ask this: MTTR is the primary operational metric SREs own. Knowing how to systematically reduce it is core.
Ideal answer:
MTTR = time to detect + time to diagnose + time to mitigate + time to verify
Reduce time to detect:
- Better alerting — alert on symptoms (user-visible), not just causes
- Synthetic monitoring (proactive checks from outside)
- Set tighter alert windows — detect in 2 minutes, not 15
Reduce time to diagnose:
- Good dashboards linked from alerts — oncall shouldn't have to go hunting
- Runbooks for known failure classes
- Distributed tracing to correlate errors across services
- Logs with request IDs for end-to-end tracing
Reduce time to mitigate:
- Canary deployments + automated rollback
- Feature flags for instant disable without deploy
- Automated remediation for known issues (auto-restart on OOMKill)
- Practiced game days — don't run a complex rollback for the first time during an incident
Reduce time to verify:
- Define "service is healthy" clearly — don't just check the health endpoint
- Monitor error rates post-mitigation for 10-15 minutes before closing
12. What is chaos engineering and how does it relate to SRE?
Ideal answer:
Chaos engineering is the practice of deliberately injecting failures into production (or staging) systems to uncover weaknesses before they cause real incidents.
Origin: Netflix's Chaos Monkey (2011) randomly terminated EC2 instances in production to force engineers to build fault-tolerant systems.
Principles (from Principles of Chaos Engineering):
1. Build a hypothesis about steady-state behavior
2. Vary real-world events (latency, instance failure, zone outage)
3. Run experiments in production (or a production-like environment)
4. Minimize blast radius
Modern chaos tools:
- AWS Fault Injection Simulator (FIS): Native AWS chaos experiments
- Chaos Mesh: Kubernetes-native chaos engineering
- Gremlin: Commercial platform with a library of attack types
- Litmus: CNCF chaos engineering framework
SRE connection: Chaos engineering validates your error budgets and SLOs under real failure conditions. If your SLO is 99.9% but a single AZ failure causes an hour of downtime, you've learned something important — before a real AZ failure does.
13. How do you implement a tiered reliability model?
Ideal answer:
Not all services require the same reliability level. Over-investing in reliability for non-critical internal tools wastes engineering capacity.
Tiered model example:
Tier 0 (Critical — 99.99% SLO):
Authentication, payments, core API. Outage directly impacts users and revenue. Full SRE support, on-call rotation, frequent game days.
Tier 1 (Important — 99.9% SLO):
Most user-facing features. 43 minutes/month downtime budget. SRE support with fewer on-call requirements.
Tier 2 (Best-effort — 99.5% SLO):
Internal tools, analytics dashboards, non-real-time services. Handled by the owning team's on-call.
Tier 3 (Non-critical — no formal SLO):
Dev tools, staging environments, batch jobs. No on-call, best effort.
How to tier: Ask "If this service is unavailable for 1 hour, what is the impact?" Quantify in terms of revenue, user experience, and regulatory risk.
14. How do you handle oncall effectively?
Ideal answer:
Good oncall design prevents burnout and improves reliability by ensuring responses are effective, not just frantic.
Sustainable oncall design:
- Limit shifts to one week per person per rotation cycle
- No more than 2-3 pages per shift on average (Google's guideline)
- Follow-the-sun rotation for global teams — no middle-of-night pages for non-critical alerts
On receiving a page:
1. Acknowledge quickly (stops escalation, prevents duplicate responders)
2. Assess severity — is this P1 (service down for many users) or P3 (one user affected)?
3. Mitigate first, diagnose second
4. Loop in help early if stuck for > 15 minutes
5. Write up a brief incident report even for minor pages
Improving oncall quality:
- After each on-call week: review all pages. Were they actionable? Were they valid?
- Delete or improve alerts that fired without being actionable
- Automate recurring manual tasks from the oncall runbook
- Measure: pages per week, % of pages that required human action
15. What is the difference between availability and reliability?
Ideal answer:
These terms are often used interchangeably but have a technical distinction:
Availability: The fraction of time a system is operational and accessible.
- Measured as uptime percentage: 99.9%, 99.99%
- A system that crashes every hour and recovers in 1 second has roughly 99.97% availability but poor reliability
Reliability: The probability that a system performs its required function without failure over a specified period and under specified conditions.
- Reliability is about mean time between failures (MTBF) — how long it runs correctly
- A system can be available (up) but unreliable (returning wrong results)
MTBF, MTTR, MTTF:
- MTTF (Mean Time to Failure): Average time from system start until its first failure (used for non-repairable systems)
- MTBF (Mean Time Between Failures): Average time between failures for repairable systems = MTTF + MTTR
- MTTR (Mean Time to Recover): Average time to restore service after failure
In SRE practice: Optimize both. High availability means fast recovery (low MTTR). High reliability means infrequent failures (high MTBF). The best systems have both.
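These metrics connect through a common approximation, availability ≈ MTBF / (MTBF + MTTR). A quick sketch with illustrative numbers:

```python
# Sketch: steady-state availability from MTBF and MTTR, showing why
# both infrequent failures and fast recovery matter.

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Crashes every hour, recovers in 1 second: available, not reliable.
crashy = availability_from_mtbf(1.0, 1 / 3600)       # ~0.9997
# Fails once a month (~730 h) but takes 4 hours to recover.
slow_recovery = availability_from_mtbf(730.0, 4.0)   # ~0.9946
```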
16. How do you set effective SLOs?
Ideal answer:
Setting SLOs poorly is worse than having no SLOs — they either create alert fatigue (too tight) or false confidence (too loose).
Process:
Step 1: Identify user journeys, not services
"User can log in" is more meaningful than "auth service is up."
Step 2: Choose meaningful SLIs
For each user journey, what metric reflects user experience? Availability (request success rate), latency (p95 under threshold), or correctness.
Step 3: Look at historical data
Start with what you're currently achieving — set the SLO at or slightly above your current baseline. Don't set aspirational SLOs you immediately miss.
Step 4: Ask: "If an alert on this SLO fired, would we want to be woken up?"
If the answer is no, the SLO is too tight. If the answer is "probably yes but we've been missing it for weeks," it's too loose.
Step 5: Review quarterly
SLOs should evolve with the product. A new feature might require a tighter SLO. A mature service with stable load might relax its SLO.
Common mistake: Setting SLOs based on "what sounds like a good number" (99.99%) without knowing whether you can achieve it or what it would cost to meet it.
17. How do you measure and reduce operational toil?
Ideal answer:
Measuring toil:
Track every operational task in your ticket system with a toil tag. Calculate hours per week. Google targets < 50% of SRE time on toil.
- Toil tickets per week: 12
- Avg time per ticket: 30 min
- Weekly toil: 6 hours
- % of SRE capacity: 6/40 = 15% (within target)
Top toil sources to eliminate first:
1. Tickets that require copy-pasting the same commands from a runbook → automate via a script or AWX/Rundeck
2. Manual database cleanups on a schedule → scheduled job or Lambda
3. Certificate rotation → cert-manager in Kubernetes or ACM in AWS
4. Capacity adjustments → auto-scaling policies
The toil quadrant: Plot toil items by frequency × time-cost. High-frequency, high-cost items are automation priorities. Low-frequency, low-cost items can wait.
Tracking improvement: Measure toil hours per quarter. If toil grows despite automation efforts, your service is growing faster than you're automating — a capacity issue.
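The toil quadrant can be approximated as a simple weekly-cost ranking; the items and numbers below are invented:

```python
# Sketch: rank toil items by weekly cost (frequency x time per
# occurrence) to pick automation priorities.

toil_items = [
    # (name, occurrences_per_week, minutes_each)
    ("restart crashed worker", 10, 15),
    ("manual cert rotation", 0.25, 120),
    ("scale up for traffic spike", 2, 30),
    ("ticket: reset user quota", 12, 10),
]

def weekly_cost_min(item):
    _, freq, minutes = item
    return freq * minutes

by_priority = sorted(toil_items, key=weekly_cost_min, reverse=True)
total_hours = sum(weekly_cost_min(i) for i in toil_items) / 60
toil_pct = total_hours / 40  # share of one 40-hour engineer-week
```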
18. What is a service dependency analysis and why does it matter for SRE?
Ideal answer:
A service dependency analysis maps the critical path of dependencies for each tier-0 or tier-1 service — identifying what other services it depends on and what would happen if each dependency failed.
Why it matters:
- Your SLO is limited by your weakest dependency's availability
- If you depend on 10 services each at 99.9% availability, your combined availability is 99.9%^10 ≈ 99% — lower than any individual SLO
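The compounding math can be checked directly:

```python
# Sketch: availability across hard synchronous dependencies compounds
# multiplicatively.

def combined_availability(dep_availabilities):
    result = 1.0
    for a in dep_availabilities:
        result *= a
    return result

combined = combined_availability([0.999] * 10)  # ~0.990, below every individual SLO
```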
Process:
1. Map all service dependencies (use a service mesh, distributed tracing, or manual audit)
2. Identify synchronous vs asynchronous dependencies — synchronous deps are in your critical path
3. For each critical dep: What happens if it goes down? Does your service fail? Degrade gracefully? Queue?
4. Add circuit breakers, fallbacks, or caching for critical synchronous deps
Circuit breaker pattern: If a downstream service starts failing, stop calling it immediately and return a cached or degraded response. Prevents cascading failures.
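A minimal circuit-breaker sketch, with hypothetical thresholds; in production you would normally reach for a library such as pybreaker (Python) or resilience4j (Java) rather than hand-rolling this:

```python
# Sketch: after too many consecutive failures, stop calling the
# dependency and serve a fallback until a cooldown elapses.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()          # open: fail fast, no downstream call
            self.opened_at = None          # cooldown over: try again (half-open)
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()
        self.failures = 0
        return result
```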
19. How do you approach a new service you're being asked to support as an SRE?
Ideal answer:
This is a process question — they want to see structure, not improvisation.
Phase 1: Understand the service (week 1)
- Read the architecture documentation and runbooks
- Talk to the development team: what are the common failure modes? What's the current on-call burden?
- Review incident history: what broke in the last 6 months?
- Review current alerting: are the alerts actionable? Are they firing correctly?
Phase 2: Baseline reliability (week 2)
- Instrument the four golden signals if not already done
- Calculate current availability against business expectations
- Identify toil sources by reviewing the last 20 oncall tickets
- Run a production readiness review against your org's checklist
Phase 3: Set SLOs and prioritize (week 3-4)
- Propose SLOs based on historical data and business requirements
- Draft error budget policy
- Identify top 3 reliability improvements with highest ROI
- Propose automation targets for top toil items
20. What would you do if a service is consistently missing its SLO?
Why they ask this: This is the most practical SRE scenario question. They want to see structured thinking, not panic.
Structured response:
Step 1: Understand the scope
Is it missing by 0.1% or 5%? For one week or three months? Is it one endpoint, all endpoints, or one region?
Step 2: Identify root causes
- Is this a known, recurring failure class (e.g., memory leak causing weekly restarts)?
- Is this new (correlated with a recent deployment)?
- Is this an external dependency causing failures?
- Is the SLO itself wrong — too tight for the service's design?
Step 3: Prioritize fixes by impact
Use the error budget burn rate to prioritize. The failures consuming the most budget get fixed first.
Step 4: Escalate and negotiate
If fixing the SLO requires significant engineering work (refactoring, infrastructure changes), present the business case to leadership. Unreliability has a cost — quantify it.
Step 5: Freeze new feature development if needed
Per the error budget policy: if error budget is exhausted, new features pause until reliability improves. This is not punitive — it's the designed incentive to invest in reliability.
Step 6: Review the SLO itself
Sometimes the SLO is aspirational rather than achievable given the service's current architecture. It may need adjustment — with clear documentation of why.
