Site Reliability Engineering (SRE) Interview Questions 2026
SRE is now a distinct career track at top tech companies. Google, Netflix, Spotify, and hundreds of product companies hire explicitly for SRE roles. Whether you're interviewing for an SRE title or a senior DevOps role with reliability responsibilities, these are the 20 questions that matter most.
Section 1: SRE Principles & Philosophy
1. What is the difference between SRE and DevOps?
Why they ask this: This is the most fundamental SRE interview question. A vague answer signals you've read a blog post; a clear answer signals you understand the practice.
Ideal answer:
DevOps is a cultural and organizational philosophy — breaking down silos between development and operations, improving collaboration, and automating the software delivery process. It's a broad movement more than a prescriptive set of practices.
SRE is Google's specific, prescriptive implementation of DevOps principles. Ben Treynor (Google, 2003) defined it: "What happens when you ask a software engineer to design an operations function."
Key SRE-specific constructs that DevOps doesn't prescribe:
- SLIs, SLOs, error budgets — formal reliability measurement
- Toil elimination — explicit target to reduce operational work
- Postmortem culture — blameless, structured incident review
- Production readiness reviews — formal criteria before launch
- Engineering headcount cap on toil (~50% max)
Practical difference: A DevOps engineer might focus heavily on CI/CD pipelines and cloud automation. An SRE focuses on the production reliability of services at scale — measuring it, setting targets, and systematically eliminating the causes of unreliability.
2. Define SLI, SLO, and SLA. How are they different?
Why they ask this: This is the most tested SRE concept. Every SRE interview will touch this.
Ideal answer:
SLI (Service Level Indicator): A quantitative measure of service behavior. A specific metric that matters to users.
- Request success rate: successful_requests / total_requests
- Latency: percentage of requests served under 200ms
- Availability: fraction of time the service is up
SLO (Service Level Objective): A target value or range for an SLI. An internal commitment to reliability.
- "99.9% of requests return a successful response"
- "95% of requests are served under 200ms"
- SLOs are set internally by the product and SRE teams
SLA (Service Level Agreement): A contractual commitment to external customers, typically with financial penalties for violation.
- "We guarantee 99.9% uptime. If we miss it, customers get service credits."
- SLAs are typically looser than SLOs: you want to miss the SLO before you ever miss the SLA
Why SLOs > SLAs for operations: SLOs give you an internal warning signal before you violate the customer-facing SLA. The gap between your SLO (99.95%) and SLA (99.9%) is your buffer.
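The relationship can be sketched in a few lines. The request counts and targets below are illustrative; the SLO/SLA gap is the buffer described above:

```python
# Sketch: one SLI measured against an internal SLO and a contractual
# SLA. All numbers are illustrative.

def success_rate_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    return successful / total

SLO = 0.9995  # internal objective (99.95%)
SLA = 0.999   # contractual commitment (99.9%)

sli = success_rate_sli(successful=999_300, total=1_000_000)  # 0.9993

slo_met = sli >= SLO  # False: the internal warning fires first
sla_met = sli >= SLA  # True: still within the customer contract
```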
3. What is an error budget and how do you use it?
Why they ask this: Error budgets are the most operationally powerful SRE concept. It's what separates SRE from traditional ops.
Ideal answer:
An error budget is the acceptable amount of unreliability derived from your SLO.
If your SLO is 99.9% availability:
- Error budget = 0.1% = 43.8 minutes per month of allowed downtime
How error budgets change behavior:
Without an error budget, developers and operations teams are in constant conflict: developers want to ship faster, ops wants stability.
With an error budget, both teams share the same goal:
When you have budget remaining: Deploy freely, run experiments, take risks with new features. You have room to fail.
When budget is exhausted (or nearly): Freeze feature releases. Focus engineering effort on reliability improvements. No new deployments until budget recovers.
This changes the conversation: Instead of "ops is blocking my release," it becomes "we burned our budget, which means we need to fix reliability before we can ship new features."
Burn rate alerting: Alert when you're consuming the error budget significantly faster than sustainable. If your budget would run out in 2 days at the current failure rate, that's a critical alert — even if you're technically within SLO so far.
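The budget and burn-rate math can be sketched as follows, assuming an average 30.44-day month; the 1.4% failure rate is an invented example:

```python
# Sketch: error budget derived from an SLO, plus a burn-rate check.

MINUTES_PER_MONTH = 30.44 * 24 * 60

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime (or failed-request time) per month."""
    return (1 - slo) * MINUTES_PER_MONTH

def burn_rate(failure_rate: float, slo: float) -> float:
    """1.0 means the budget lasts exactly one month; higher burns faster."""
    return failure_rate / (1 - slo)

budget_min = error_budget_minutes(0.999)         # ~43.8 minutes
rate = burn_rate(failure_rate=0.014, slo=0.999)  # 14x sustainable
days_left = 30.44 / rate                         # ~2.2 days: page now
```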
4. What is toil and how do SREs eliminate it?
Why they ask this: Toil is a core SRE concept. Knowing how to identify and measure it is essential.
Ideal answer:
Toil (per Google SRE Book) is operational work that is:
- Manual
- Repetitive
- Automatable
- Tactical (reactive, not strategic)
- Grows linearly with service growth
Examples of toil:
- Manually restarting crashed processes
- Manually scaling servers during traffic spikes
- Handling the same class of tickets repeatedly
- Manually rotating certificates
- Copy-pasting commands from runbooks
Why it's harmful: Toil doesn't make the service more reliable — it just keeps it running. SREs should spend no more than 50% of their time on toil. Time above 50% is a problem to escalate.
Eliminating toil:
1. Measure it first — track toil sources in tickets, calculate hours per week
2. Automate the common path — write the runbook as code
3. Self-healing systems — auto-restart, auto-scale, auto-remediate
4. Fix root causes — if you restart a service manually weekly, fix the OOM bug
5. Reduce ticket volume — improve user tooling so they don't need SRE help
5. How do you conduct a blameless postmortem?
Why they ask this: Postmortem culture is a key differentiator of high-performing engineering organizations.
Ideal answer:
A blameless postmortem analyzes what happened during an incident without assigning personal blame — focusing on systemic fixes rather than individual failures.
Why blameless: If engineers fear punishment, they hide mistakes, incidents get under-reported, and root causes never surface. Psychological safety is a prerequisite for learning.
Postmortem structure:
1. Incident summary: One paragraph — what happened, impact, duration
2. Timeline: Minute-by-minute reconstruction of events (what happened, who noticed, what actions were taken)
3. Root cause analysis: Not "X made a mistake" — "The system allowed X to happen without preventing or detecting it"
4. Impact: Quantified — N users affected, $M revenue lost, X minutes of downtime
5. Action items: Specific, assigned, with due dates. Fixes go into the backlog with priority.
6. What went well: Acknowledge effective responses — this is learning, not punishment
Blameless framing: Change "Bob deployed bad code" → "Our deployment process didn't catch this class of error. Action: add integration test for Y."
Section 2: Reliability & Incident Management
6. How do you define and measure availability?
Ideal answer:
Availability = uptime / (uptime + downtime)
But this simple formula can be misleading for modern systems. More precise:
Request success rate SLI:
availability = good_requests / total_requests
Where "good" means: returned a correct response within the latency threshold. This is better than uptime because:
- A service can be "up" but returning 500s → bad availability
- A service can be "up" but extremely slow → bad availability
Multi-dimension availability:
- API availability (requests succeeding)
- Latency availability (requests succeeding within SLO)
- Correctness (responses are accurate, not just 200 OK)
Measurement: Use your load balancer metrics, APM tool, or synthetic monitoring. NOT internal health checks (which can show green when users are failing).
9s table: 99.9% = 43 min/month, 99.95% = 22 min/month, 99.99% = 4.3 min/month, 99.999% = 26 sec/month. Most production services target 99.9% to 99.99%.
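The request-success-rate SLI above can be sketched like this; the status codes and the 200 ms threshold are illustrative:

```python
# Sketch: a request counts as "good" only if it returned a success
# status within the latency threshold.

def availability(requests, latency_threshold_ms=200):
    """requests: iterable of (status_code, latency_ms) tuples."""
    total = good = 0
    for status, latency_ms in requests:
        total += 1
        if status < 500 and latency_ms <= latency_threshold_ms:
            good += 1
    return good / total

sample = [(200, 50), (200, 180), (500, 30), (200, 900)]
# The 500 and the 900 ms response both count against availability.
print(availability(sample))  # 0.5
```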
7. Walk me through a major incident response process
Ideal answer:
Immediate response (0-5 min):
- Acknowledge the alert — prevents duplicate responders
- Assess severity: How many users affected? Is revenue impacted?
- Declare incident if warranted — spin up a war room (Slack channel or video call)
- Page on-call engineer for the affected service if not already alerted
Mitigation first (5-30 min):
- Prioritize restoring service over finding root cause
- Rollback, restart, failover, scale — whatever gets users unblocked
- Communicate status updates every 10-15 minutes to stakeholders
Investigation (parallel or after mitigation):
- Form hypotheses: What changed recently? What does the data show?
- Check deployment history, config changes, traffic spikes
- Isolate — is it one region? One service? One dependency?
Resolution and follow-up:
- Confirm service is fully restored and stable
- Publish a preliminary timeline within 2 hours
- Full postmortem within 48-72 hours
- Track action items to completion
8. What are the four golden signals of monitoring?
Why they ask this: This is the Google SRE Book concept every SRE should know cold.
Ideal answer:
The four golden signals (from Google's SRE Book) are the minimum set of metrics to monitor for any service:
1. Latency: Time to service a request. Track the distribution — p50, p95, p99. Distinguish successful vs error latency (fast-failing errors can make your latency numbers look better than what users actually experience).
2. Traffic: How much demand is placed on your system — requests per second, messages per second, transactions per second. Understand your baseline to detect anomalies.
3. Errors: Rate of requests that fail — explicitly (5xx), implicitly (returning wrong data), or by policy (response over 500ms is a failure). Both rate and absolute count matter.
4. Saturation: How "full" your service is — CPU utilization, memory pressure, queue depth. Saturation is often a leading indicator — latency degrades before you hit hard limits.
Using them: Build a dashboard with all four. An alert on any of these not returning to baseline within N minutes should page on-call. Start here before building complex custom metrics.
9. What is a production readiness review and what does it cover?
Ideal answer:
A Production Readiness Review (PRR) is a structured assessment before a service receives SRE support and goes into production. It ensures reliability is built in, not bolted on.
Typical PRR checklist:
Architecture & reliability:
- Single points of failure identified and mitigated
- Load testing completed (system tested at 2-3x expected peak)
- Graceful degradation under partial failure
- Data backup and restore tested
Observability:
- The four golden signals are instrumented
- Alerts defined for user-impacting failures (not just system metrics)
- Runbooks exist for the top 5 expected failure scenarios
- Distributed tracing enabled
Deployment & rollback:
- Can deploy and rollback in under 15 minutes
- Canary or gradual rollout strategy exists
- Feature flags allow disabling problematic features without a deploy
Oncall & operational:
- Oncall rotation established
- Escalation path documented
- Known operational tasks automated (not manual)
- Toil < 10% of weekly engineering hours
10. What is capacity planning in the context of SRE?
Ideal answer:
Capacity planning ensures your system has enough resources to handle current and future load without unnecessary over-provisioning.
SRE capacity planning approach:
Step 1: Establish baseline load metrics
Current requests/second, CPU/memory utilization, storage growth rate. Segment by component (API, database, cache, queue).
Step 2: Model growth
Project traffic growth based on business metrics (user signups, feature launches). Conservative estimate + safety margin.
Step 3: Load test to find limits
Gradually ramp load until the system degrades. Identify the bottleneck: CPU? Memory? Database connections? Network?
Step 4: Set headroom targets
Typical targets: run at no more than 70% of maximum capacity in steady state. This gives you room to handle traffic spikes and absorb a zone/region failure.
Step 5: Plan for n+1 redundancy
For a 99.99% SLO, losing one AZ should be survivable. With two AZs, that means running each below 50% of capacity, since losing one doubles the load on the other; with N AZs, each must stay below (N-1)/N of capacity.
Auto-scaling vs capacity planning: Auto-scaling handles short-term spikes. Capacity planning handles structural growth that requires new reserved instances, database upgrades, or architectural changes.
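The headroom arithmetic from steps 4 and 5 can be sketched as follows, with illustrative numbers:

```python
# Sketch: highest safe steady-state utilization per AZ so that losing
# one AZ keeps every survivor under a headroom ceiling.

def max_safe_utilization(num_azs: int, headroom_ceiling: float = 0.7) -> float:
    """After one AZ fails, its traffic spreads over the remaining
    num_azs - 1 AZs; steady-state utilization must leave room for that."""
    return headroom_ceiling * (num_azs - 1) / num_azs

two_az = max_safe_utilization(2, headroom_ceiling=1.0)  # 0.5: the "< 50% per AZ" rule
three_az = max_safe_utilization(3)                      # ~0.47 with a 70% ceiling
```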
Section 3: SRE Tools & Advanced Concepts
11. How do you reduce MTTR (Mean Time to Recover)?
Why they ask this: MTTR is the primary operational metric SREs own. Knowing how to systematically reduce it is core.
Ideal answer:
MTTR = time to detect + time to diagnose + time to mitigate + time to verify
Reduce time to detect:
- Better alerting — alert on symptoms (user-visible), not just causes
- Synthetic monitoring (proactive checks from outside)
- Set tighter alert windows — detect in 2 minutes, not 15
Reduce time to diagnose:
- Good dashboards linked from alerts — oncall shouldn't have to go hunting
- Runbooks for known failure classes
- Distributed tracing to correlate errors across services
- Logs with request IDs for end-to-end tracing
Reduce time to mitigate:
- Canary deployments + automated rollback
- Feature flags for instant disable without deploy
- Automated remediation for known issues (auto-restart on OOMKill)
- Practiced game days — don't run a complex rollback for the first time during an incident
Reduce time to verify:
- Define "service is healthy" clearly — don't just check the health endpoint
- Monitor error rates post-mitigation for 10-15 minutes before closing
12. What is chaos engineering and how does it relate to SRE?
Ideal answer:
Chaos engineering is the practice of deliberately injecting failures into production (or staging) systems to uncover weaknesses before they cause real incidents.
Origin: Netflix's Chaos Monkey (2011) randomly terminated EC2 instances in production to force engineers to build fault-tolerant systems.
Principles (from Principles of Chaos Engineering):
1. Build a hypothesis about steady-state behavior
2. Vary real-world events (latency, instance failure, zone outage)
3. Run experiments in production (or a production-like environment)
4. Minimize blast radius
Modern chaos tools:
- AWS Fault Injection Simulator (FIS): Native AWS chaos experiments
- Chaos Mesh: Kubernetes-native chaos engineering
- Gremlin: Commercial platform with a library of attack types
- Litmus: CNCF chaos engineering framework
SRE connection: Chaos engineering validates your error budgets and SLOs under real failure conditions. If your SLO is 99.9% but a single AZ failure causes an hour of downtime, you've learned something important — before a real AZ failure does.
13. How do you implement a tiered reliability model?
Ideal answer:
Not all services require the same reliability level. Over-investing in reliability for non-critical internal tools wastes engineering capacity.
Tiered model example:
Tier 0 (Critical — 99.99% SLO):
Authentication, payments, core API. Outage directly impacts users and revenue. Full SRE support, on-call rotation, frequent game days.
Tier 1 (Important — 99.9% SLO):
Most user-facing features. 43 minutes/month downtime budget. SRE support with fewer on-call requirements.
Tier 2 (Best-effort — 99.5% SLO):
Internal tools, analytics dashboards, non-real-time services. Handled by the owning team's on-call.
Tier 3 (Non-critical — no formal SLO):
Dev tools, staging environments, batch jobs. No on-call, best effort.
How to tier: Ask "If this service is unavailable for 1 hour, what is the impact?" Quantify in terms of revenue, user experience, and regulatory risk.
14. How do you handle oncall effectively?
Ideal answer:
Good oncall design prevents burnout and improves reliability by ensuring responses are effective, not just frantic.
Sustainable oncall design:
- Limit shifts to one week per person per rotation cycle
- No more than 2-3 pages per shift on average (Google's guideline)
- Follow-the-sun rotation for global teams — no middle-of-night pages for non-critical alerts
On receiving a page:
1. Acknowledge quickly (stops escalation, prevents duplicate responders)
2. Assess severity — is this P1 (service down for many users) or P3 (one user affected)?
3. Mitigate first, diagnose second
4. Loop in help early if stuck for > 15 minutes
5. Write up a brief incident report even for minor pages
Improving oncall quality:
- After each on-call week: review all pages. Were they actionable? Were they valid?
- Delete or improve alerts that fired without being actionable
- Automate recurring manual tasks from the oncall runbook
- Measure: pages per week, % of pages that required human action
15. What is the difference between availability and reliability?
Ideal answer:
These terms are often used interchangeably but have a technical distinction:
Availability: The fraction of time a system is operational and accessible.
- Measured as uptime percentage: 99.9%, 99.99%
- A system that crashes every hour and recovers in 1 second has roughly 99.97% availability but poor reliability
Reliability: The probability that a system performs its required function without failure over a specified period and under specified conditions.
- Reliability is about mean time between failures (MTBF) — how long it runs correctly
- A system can be available (up) but unreliable (returning wrong results)
MTBF, MTTR, MTTF:
- MTTF (Mean Time to Failure): Average time from system start until its first failure (used for non-repairable systems)
- MTBF (Mean Time Between Failures): Average time between failures for repairable systems = MTTF + MTTR
- MTTR (Mean Time to Recover): Average time to restore service after failure
In SRE practice: Optimize both. High availability means fast recovery (low MTTR). High reliability means infrequent failures (high MTBF). The best systems have both.
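These metrics connect through a common approximation, availability ≈ MTBF / (MTBF + MTTR). A quick sketch with illustrative numbers:

```python
# Sketch: steady-state availability from MTBF and MTTR, showing why
# both infrequent failures and fast recovery matter.

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Crashes every hour, recovers in 1 second: available, not reliable.
crashy = availability_from_mtbf(1.0, 1 / 3600)       # ~0.9997
# Fails once a month (~730 h) but takes 4 hours to recover.
slow_recovery = availability_from_mtbf(730.0, 4.0)   # ~0.9946
```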
16. How do you set effective SLOs?
Ideal answer:
Setting SLOs poorly is worse than having no SLOs — they either create alert fatigue (too tight) or false confidence (too loose).
Process:
Step 1: Identify user journeys, not services
"User can log in" is more meaningful than "auth service is up."
Step 2: Choose meaningful SLIs
For each user journey, what metric reflects user experience? Availability (request success rate), latency (p95 under threshold), or correctness.
Step 3: Look at historical data
Start with what you're currently achieving — set the SLO at or slightly above your current baseline. Don't set aspirational SLOs you immediately miss.
Step 4: Ask: "If an alert on this SLO fired, would we want to be woken up?"
If the answer is no, the SLO is too tight. If the answer is "probably yes but we've been missing it for weeks," it's too loose.
Step 5: Review quarterly
SLOs should evolve with the product. A new feature might require a tighter SLO. A mature service with stable load might relax its SLO.
Common mistake: Setting SLOs based on "what sounds like a good number" (99.99%) without knowing whether you can achieve it or what it would cost to meet it.
17. How do you measure and reduce operational toil?
Ideal answer:
Measuring toil:
Track every operational task in your ticket system with a toil tag. Calculate hours per week. Google targets < 50% of SRE time on toil.
- Toil tickets per week: 12
- Avg time per ticket: 30 min
- Weekly toil: 6 hours
- % of SRE capacity: 6/40 = 15% (within target)
Top toil sources to eliminate first:
1. Tickets that require copy-pasting the same commands from a runbook → automate via a script or AWX/Rundeck
2. Manual database cleanups on a schedule → scheduled job or Lambda
3. Certificate rotation → cert-manager in Kubernetes or ACM in AWS
4. Capacity adjustments → auto-scaling policies
The toil quadrant: Plot toil items by frequency × time-cost. High-frequency, high-cost items are automation priorities. Low-frequency, low-cost items can wait.
Tracking improvement: Measure toil hours per quarter. If toil grows despite automation efforts, your service is growing faster than you're automating — a capacity issue.
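The toil quadrant can be approximated as a simple weekly-cost ranking; the items and numbers below are invented:

```python
# Sketch: rank toil items by weekly cost (frequency x time per
# occurrence) to pick automation priorities.

toil_items = [
    # (name, occurrences_per_week, minutes_each)
    ("restart crashed worker", 10, 15),
    ("manual cert rotation", 0.25, 120),
    ("scale up for traffic spike", 2, 30),
    ("ticket: reset user quota", 12, 10),
]

def weekly_cost_min(item):
    _, freq, minutes = item
    return freq * minutes

by_priority = sorted(toil_items, key=weekly_cost_min, reverse=True)
total_hours = sum(weekly_cost_min(i) for i in toil_items) / 60
toil_pct = total_hours / 40  # share of one 40-hour engineer-week
```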
18. What is a service dependency analysis and why does it matter for SRE?
Ideal answer:
A service dependency analysis maps the critical path of dependencies for each tier-0 or tier-1 service — identifying what other services it depends on and what would happen if each dependency failed.
Why it matters:
- Your SLO is limited by your weakest dependency's availability
- If you depend on 10 services each at 99.9% availability, your combined availability is 99.9%^10 ≈ 99% — lower than any individual SLO
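The compounding math can be checked directly:

```python
# Sketch: availability across hard synchronous dependencies compounds
# multiplicatively.

def combined_availability(dep_availabilities):
    result = 1.0
    for a in dep_availabilities:
        result *= a
    return result

combined = combined_availability([0.999] * 10)  # ~0.990, below every individual SLO
```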
Process:
1. Map all service dependencies (use a service mesh, distributed tracing, or manual audit)
2. Identify synchronous vs asynchronous dependencies — synchronous deps are in your critical path
3. For each critical dep: What happens if it goes down? Does your service fail? Degrade gracefully? Queue?
4. Add circuit breakers, fallbacks, or caching for critical synchronous deps
Circuit breaker pattern: If a downstream service starts failing, stop calling it immediately and return a cached or degraded response. Prevents cascading failures.
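A minimal circuit-breaker sketch, with hypothetical thresholds; in production you would normally reach for a library such as pybreaker (Python) or resilience4j (Java) rather than hand-rolling this:

```python
# Sketch: after too many consecutive failures, stop calling the
# dependency and serve a fallback until a cooldown elapses.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()          # open: fail fast, no downstream call
            self.opened_at = None          # cooldown over: try again (half-open)
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()
        self.failures = 0
        return result
```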
19. How do you approach a new service you're being asked to support as an SRE?
Ideal answer:
This is a process question — they want to see structure, not improvisation.
Phase 1: Understand the service (week 1)
- Read the architecture documentation and runbooks
- Talk to the development team: what are the common failure modes? What's the current on-call burden?
- Review incident history: what broke in the last 6 months?
- Review current alerting: are the alerts actionable? Are they firing correctly?
Phase 2: Baseline reliability (week 2)
- Instrument the four golden signals if not already done
- Calculate current availability against business expectations
- Identify toil sources by reviewing the last 20 oncall tickets
- Run a production readiness review against your org's checklist
Phase 3: Set SLOs and prioritize (week 3-4)
- Propose SLOs based on historical data and business requirements
- Draft error budget policy
- Identify top 3 reliability improvements with highest ROI
- Propose automation targets for top toil items
20. What would you do if a service is consistently missing its SLO?
Why they ask this: This is the most practical SRE scenario question. They want to see structured thinking, not panic.
Structured response:
Step 1: Understand the scope
Is it missing by 0.1% or 5%? For one week or three months? Is it one endpoint, all endpoints, or one region?
Step 2: Identify root causes
- Is this a known, recurring failure class (e.g., memory leak causing weekly restarts)?
- Is this new (correlated with a recent deployment)?
- Is this an external dependency causing failures?
- Is the SLO itself wrong — too tight for the service's design?
Step 3: Prioritize fixes by impact
Use the error budget burn rate to prioritize. The failures consuming the most budget get fixed first.
Step 4: Escalate and negotiate
If fixing the SLO requires significant engineering work (refactoring, infrastructure changes), present the business case to leadership. Unreliability has a cost — quantify it.
Step 5: Freeze new feature development if needed
Per the error budget policy: if error budget is exhausted, new features pause until reliability improves. This is not punitive — it's the designed incentive to invest in reliability.
Step 6: Review the SLO itself
Sometimes the SLO is aspirational rather than achievable given the service's current architecture. It may need adjustment — with clear documentation of why.
