Site Reliability Engineering (SRE) Interview Questions 2026
SRE · 18 min read · Apr 20, 2026
By InterviewDrill Team


SRE is now a distinct career track at top tech companies. Google, Netflix, Spotify, and hundreds of product companies hire explicitly for SRE roles. Whether you're interviewing for an SRE title or a senior DevOps role with reliability responsibilities, these are the 20 questions that matter most.


Section 1: SRE Principles & Philosophy

1. What is the difference between SRE and DevOps?

Why they ask this: This is the most fundamental SRE interview question. A vague answer signals you've read a blog post; a clear answer signals you understand the practice.

Ideal answer:

DevOps is a cultural and organizational philosophy — breaking down silos between development and operations, improving collaboration, and automating the software delivery process. It's a broad movement more than a prescriptive set of practices.

SRE is Google's specific, prescriptive implementation of DevOps principles. Ben Treynor Sloss, who founded Google's SRE team in 2003, describes it as "what happens when you ask a software engineer to design an operations team."

Key SRE-specific constructs that DevOps doesn't prescribe: SLIs and SLOs, error budgets, a 50% cap on toil, and blameless postmortems.

Practical difference: A DevOps engineer might focus heavily on CI/CD pipelines and cloud automation. An SRE focuses on the production reliability of services at scale — measuring it, setting targets, and systematically eliminating the causes of unreliability.


2. Define SLI, SLO, and SLA. How are they different?

Why they ask this: This is the most tested SRE concept. Every SRE interview will touch this.

Ideal answer:

SLI (Service Level Indicator): A quantitative measure of service behavior. A specific metric that matters to users.

SLO (Service Level Objective): A target value or range for an SLI. An internal commitment to reliability.

SLA (Service Level Agreement): A contractual commitment to external customers, typically with financial penalties for violation.

Why SLOs > SLAs for operations: SLOs give you an internal warning signal before you violate the customer-facing SLA. The gap between your SLO (99.95%) and SLA (99.9%) is your buffer.
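To make the three terms concrete, here is a minimal Python sketch. The request records, the 300 ms latency threshold, and the function names are illustrative assumptions; the 99.95% SLO and 99.9% SLA come from the example above.

```python
from dataclasses import dataclass

SLO_TARGET = 0.9995  # internal objective (from the example above)
SLA_TARGET = 0.999   # contractual commitment (from the example above)


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests, latency_threshold_ms=300.0):
    """SLI: fraction of requests that succeeded within the latency threshold.

    The 300 ms default is a hypothetical threshold, not a recommendation.
    """
    if not requests:
        return 1.0  # no traffic means nothing failed
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


def classify(sli):
    """Compare a measured SLI against the SLO and the SLA."""
    if sli >= SLO_TARGET:
        return "meeting SLO"
    if sli >= SLA_TARGET:
        return "SLO missed, SLA still met (inside the buffer)"
    return "SLA violated"


# Synthetic data: 10,000 requests, 8 of which are server errors.
sample = [Request(200, 120.0)] * 9992 + [Request(503, 40.0)] * 8
sli = availability_sli(sample)
print(f"SLI = {sli:.4%} -> {classify(sli)}")
```

The middle branch of `classify` is the buffer in action: the internal target has been missed, so the team gets a warning, but no customer-facing commitment has been broken yet.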


3. What is an error budget and how do you use it?

Why they ask this: Error budgets are the most operationally powerful SRE concept. They're what separates SRE from traditional ops.

Ideal answer:

An error budget is the acceptable amount of unreliability derived from your SLO.

If your SLO is 99.9% availability:
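The arithmetic for this example can be sketched in Python. The 30-day window, the 10 million request volume, and the function name are assumptions made for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


budget = error_budget_minutes(0.999)
print(f"99.9% SLO over 30 days: {budget:.1f} minutes of error budget")

# Request-based view: with 10 million requests in the window (hypothetical),
# a 99.9% SLO allows this many failed requests.
allowed_failures = round((1.0 - 0.999) * 10_000_000)
print(f"Allowed failed requests: {allowed_failures}")
```

Either view works in an interview answer: time-based budgets are easier to explain, while request-based budgets match how most availability SLIs are actually measured.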

How error budgets change behavior:

Without an error budget, developers and operations teams are in constant conflict: developers want to ship faster, ops wants stability.

With an error budget, both teams share the same goal:

When you have budget remaining: Deploy freely, run experiments, take risks with new features. You have room to fail.

When budget is exhausted (or nearly): Freeze feature releases. Focus engineering effort on reliability improvements. No new feature deployments until the budget recovers; reliability and security fixes still ship.

This changes the conversation: Instead of "ops is blocking my release," it becomes "we burned our budget, which means we need to fix reliability before we can ship new features."

Burn rate alerting: Alert when you're consuming the error budget significantly faster than sustainable. If your budget would run out in 2 days at the current failure rate, that's a critical alert — even if you're technically within SLO so far.
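A burn-rate check along these lines might look like the following Python sketch. The function names, the 30-day window, and the 2-day paging threshold are assumptions chosen to match the example above, not a standard.

```python
def burn_rate(error_rate, slo):
    """How many times faster than sustainable the budget is being consumed.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    budget_fraction = 1.0 - slo
    return error_rate / budget_fraction


def days_until_exhausted(error_rate, slo, window_days=30, budget_remaining=1.0):
    """Days until the remaining budget is gone at the current error rate."""
    rate = burn_rate(error_rate, slo)
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_days / rate


def should_page(error_rate, slo, critical_days=2.0):
    """Page if the budget would run out within critical_days (assumed threshold)."""
    return days_until_exhausted(error_rate, slo) <= critical_days


# 2% of requests failing against a 99.9% SLO burns budget about 20x too fast.
rate = burn_rate(0.02, 0.999)
days = days_until_exhausted(0.02, 0.999)
print(f"burn rate: {rate:.1f}x, budget exhausted in {days:.1f} days")
print("page on-call:", should_page(0.02, 0.999))
```

In practice the error rate would be measured over a short lookback window (for example the last hour), so the alert reacts to the current failure rate rather than the month-to-date average.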


4. What is toil and how do SREs eliminate it?

Why they ask this: Toil is a core SRE concept. Knowing how to identify and measure it is essential.

Ideal answer:

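Toil (per the Google SRE Book) is operational work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.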

Examples of toil:

Why it's harmful: Toil doesn't make the service more reliable — it just keeps it running. SREs should spend no more than 50% of their time on toil. Time above 50% is a problem to escalate.

Eliminating toil:

1. Measure it first — track toil sources in tickets, calculate hours per week

2. Automate the common path — write the runbook as code

3. Self-healing systems — auto-restart, auto-scale, auto-remediate

4. Fix root causes — if you restart a service manually weekly, fix the OOM bug

5. Reduce ticket volume — improve user tooling so they don't need SRE help


5. How do you conduct a blameless postmortem?

Why they ask this: Postmortem culture is a key differentiator of high-performing engineering organizations.

Ideal answer:

A blameless postmortem analyzes what happened during an incident without assigning personal blame — focusing on systemic fixes rather than individual failures.

Why blameless: If engineers fear punishment, they hide mistakes, incidents get under-reported, and root causes never surface. Psychological safety is a prerequisite for learning.

Postmortem structure:

1. Incident summary: One paragraph — what happened, impact, duration

2. Timeline: Minute-by-minute reconstruction of events (what happened, who noticed, what actions were taken)

3. Root cause analysis: Not "X made a mistake" — "The system allowed X to happen without preventing or detecting it"

4. Impact: Quantified — N users affected, $M revenue lost, X minutes of downtime

5. Action items: Specific, assigned, with due dates. Fixes go into the backlog with priority.

6. What went well: Acknowledge effective responses — this is learning, not punishment

Blameless framing: Change "Bob deployed bad code" → "Our deployment process didn't catch this class of error. Action: add integration test for Y."


Section 2: Reliability & Incident Management

6. How do you define and measure availability?

Ideal answer:

Availability = uptime / (uptime + downtime)

But this simple formula can be misleading for modern systems. More precise:

Request success rate SLI:

availability = good_requests / total_requests

Where "good" means: returned a correct response within the latency threshold. This is better than uptime because:

Multi-dimension availability:

Measurement: Use your load balancer metrics, APM tool, or synthetic monitoring. NOT internal health checks (which can show green when users are failing).

9s table: 99.9% = 43 min/month, 99.95% = 22 min/month, 99.99% = 4.3 min/month, 99.999% = 26 sec/month. Most production services target 99.9% to 99.99%.
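The figures in the 9s table follow directly from the availability percentage. A short Python sketch, assuming a 30-day month (the function name is illustrative):

```python
def allowed_downtime_seconds(availability_pct, window_days=30):
    """Seconds of downtime permitted in the window at a given availability."""
    window_seconds = window_days * 24 * 60 * 60
    return (100.0 - availability_pct) / 100.0 * window_seconds


for pct in (99.9, 99.95, 99.99, 99.999):
    seconds = allowed_downtime_seconds(pct)
    if seconds >= 60:
        print(f"{pct}% -> {seconds / 60:.1f} min/month")
    else:
        print(f"{pct}% -> {seconds:.0f} sec/month")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the cost of reliability rises so steeply past 99.99%.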


7. Walk me through a major incident response process

Ideal answer:

Immediate response (0-5 min):

Mitigation first (5-30 min):

Investigation (parallel or after mitigation):

Resolution and follow-up:


8. What are the four golden signals of monitoring?

Why they ask this: This is the Google SRE Book concept every SRE should know cold.

Ideal answer:

The four golden signals (from Google's SRE Book) are the minimum set of metrics to monitor for any service:

1. Latency: Time to service a request. Track the distribution (p50, p95, p99), not just the average. Measure successful and failed requests separately: a request that fails in 1 ms pulls the overall numbers down and makes latency look better than it is.

2. Traffic: How much demand is placed on your system — requests per second, messages per second, transactions per second. Understand your baseline to detect anomalies.

3. Errors: Rate of requests that fail — explicitly (5xx), implicitly (returning wrong data), or by policy (response over 500ms is a failure). Both rate and absolute count matter.

4. Saturation: How "full" your service is — CPU utilization, memory pressure, queue depth. Saturation is often a leading indicator — latency degrades before you hit hard limits.

Using them: Build a dashboard with all four. If any signal stays outside its normal baseline for more than N minutes, page on-call. Start here before building complex custom metrics.
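As a rough illustration, three of the four signals can be derived from request logs alone. The sketch below assumes each request is a `(status, latency_ms)` tuple and uses a simple nearest-rank percentile; both choices, and all names, are assumptions for the example rather than how any particular monitoring tool works.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds):
    """Summarise latency, traffic, and errors for (status, latency_ms) tuples.

    Saturation comes from resource metrics (CPU, memory, queue depth),
    so it cannot be derived from request logs and is left out here.
    """
    ok = [lat for status, lat in requests if status < 500]
    failed = [lat for status, lat in requests if status >= 500]
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Successful requests only, so fast failures do not flatter latency.
        "latency_ms": {p: percentile(ok, p) for p in (50, 95, 99)} if ok else {},
        "error_latency_p50_ms": percentile(failed, 50) if failed else None,
    }


# Synthetic minute of traffic: 95 successes (10..104 ms) and 5 fast failures.
sample = [(200, float(10 + i)) for i in range(95)] + [(500, 1.0)] * 5
summary = golden_signals(sample, window_seconds=60)
print(summary)
```

Keeping error latency in its own field shows the point made under Latency: the five 1 ms failures never touch the success percentiles.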


9. What is a production readiness review and what does it cover?

Ideal answer:

A Production Readiness Review (PRR) is a structured assessment before a service receives SRE support and goes into production. It ensures reliability is built in, not bolted on.

Typical PRR checklist:

Architecture & reliability:

Observability:

Deployment & rollback:

Oncall & operational:


10. What is capacity planning in the context of SRE?

Ideal answer:

Capacity planning ensures your system has enough resources to handle current and future load without unnecessary over-provisioning.

SRE capacity planning approach:

Step 1: Establish baseline load metrics

Current requests/second, CPU/memory utilization, storage growth rate. Segment by component (API, database, cache, queue).

Step 2: Model growth

Project traffic growth based on business metrics (user signups, feature launches). Conservative estimate + safety margin.

Step 3: Load test to find limits

Gradually ramp load until the system degrades. Identify the bottleneck: CPU? Memory? Database connections? Network?

Step 4: Set headroom targets

Typical targets: run at no more than 70% of maximum capacity in steady state. This gives you room to handle traffic spikes and absorb a zone/region failure.

Step 5: Plan for n+1 redundancy

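For a 99.99% SLO, losing one AZ should be survivable. With two AZs, that means running each below 50% of capacity, because losing one doubles the load on the other. More generally, with N AZs each must stay below (N-1)/N of capacity: about 67% for three AZs, 75% for four.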
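The headroom needed to survive a zone failure can be worked out with a few lines of Python. This sketch assumes load is spread evenly across zones and redistributes evenly to the survivors; the function names are illustrative.

```python
def max_utilization_per_zone(zone_count):
    """Highest steady-state utilization per zone that survives losing one zone.

    Assumes load is spread evenly and redistributes evenly to the survivors.
    """
    if zone_count < 2:
        raise ValueError("need at least two zones to survive losing one")
    return (zone_count - 1) / zone_count


def utilization_after_zone_loss(current_utilization, zone_count):
    """Per-zone utilization on the surviving zones after one zone fails."""
    return current_utilization * zone_count / (zone_count - 1)


for zones in (2, 3, 4):
    print(f"{zones} zones: keep each below {max_utilization_per_zone(zones):.0%}")

# Running three zones at 70% pushes the survivors past 100% when one fails.
print(f"{utilization_after_zone_loss(0.70, 3):.0%}")
```

The last line is worth calling out in an interview: a 70% steady-state target only absorbs a zone failure when there are four or more zones, so the headroom target and the zone count have to be chosen together.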

Auto-scaling vs capacity planning: Auto-scaling handles short-term spikes. Capacity planning handles structural growth that requires new reserved instances, database upgrades, or architectural changes.


Section 3: SRE Tools & Advanced Concepts

11. How do you reduce MTTR (Mean Time to Recover)?

Why they ask this: MTTR is the primary operational metric SREs own. Knowing how to systematically reduce it is core.

Ideal answer:

MTTR = time to detect + time to diagnose + time to mitigate + time to verify

Reduce time to detect:

Reduce time to diagnose:

Reduce time to mitigate:

Reduce time to verify:


12. What is chaos engineering and how does it relate to SRE?

Ideal answer:

Chaos engineering is the practice of deliberately injecting failures into production (or staging) systems to uncover weaknesses before they cause real incidents.

Origin: Netflix's Chaos Monkey (2011) randomly terminated EC2 instances in production to force engineers to build fault-tolerant systems.

Principles (from Principles of Chaos Engineering):

1. Build a hypothesis about steady-state behavior

2. Vary real-world events (latency, instance failure, zone outage)

3. Run experiments in production (or a production-like environment)

4. Minimize blast radius

Modern chaos tools:

SRE connection: Chaos engineering validates your error budgets and SLOs under real failure conditions. If your SLO is 99.9% but a single AZ failure causes an hour of downtime, you've learned something important — before a real AZ failure does.


13. How do you implement a tiered reliability model?

Ideal answer:

Not all services require the same reliability level. Over-investing in reliability for non-critical internal tools wastes engineering capacity.

Tiered model example:

Tier 0 (Critical — 99.99% SLO):

Authentication, payments, core API. Outage directly impacts users and revenue. Full SRE support, on-call rotation, frequent game days.

Tier 1 (Important — 99.9% SLO):

Most user-facing features. 43 minutes/month downtime budget. SRE support with fewer on-call requirements.

Tier 2 (Best-effort — 99.5% SLO):

Internal tools, analytics dashboards, non-real-time services. Handled by the owning team's on-call.

Tier 3 (Non-critical — no formal SLO):

Dev tools, staging environments, batch jobs. No on-call, best effort.

How to tier: Ask "If this service is unavailable for 1 hour, what is the impact?" Quantify in terms of revenue, user experience, and regulatory risk.


14. How do you handle oncall effectively?

Ideal answer:

Good oncall design prevents burnout and improves reliability by ensuring responses are effective, not just frantic.

Sustainable oncall design:

On receiving a page:

1. Acknowledge quickly (stops escalation, prevents duplicate responders)

2. Assess severity — is this P1 (service down for many users) or P3 (one user affected)?

3. Mitigate first, diagnose second

4. Loop in help early if stuck for > 15 minutes

5. Write up a brief incident report even for minor pages

Improving oncall quality:


15. What is the difference between availability and reliability?

Ideal answer:

These terms are often used interchangeably but have a technical distinction:

Availability: The fraction of time a system is operational and accessible.

Reliability: The probability that a system performs its required function without failure over a specified period and under specified conditions.

MTBF, MTTR, MTTF:

In SRE practice: Optimize both. Fast recovery (low MTTR) raises availability; infrequent failures (high MTBF) raise reliability, and availability along with it. The best systems have both.
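The relationship between the two can be shown with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), taking MTBF as the mean uptime between failures. The two services in this Python sketch are hypothetical.

```python
def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Two hypothetical services with the same availability but different character:
# one fails rarely and recovers slowly, the other fails often and recovers fast.
rare_but_slow = availability_from_mtbf(mtbf_hours=1000.0, mttr_hours=1.0)
frequent_but_fast = availability_from_mtbf(mtbf_hours=100.0, mttr_hours=0.1)
print(f"{rare_but_slow:.4%} vs {frequent_but_fast:.4%}")
```

Both services report the same availability, yet the second fails ten times as often. That is the distinction the question is probing: availability alone does not tell you how reliable a system feels to its users.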


16. How do you set effective SLOs?

Ideal answer:

Setting SLOs poorly is worse than having no SLOs — they either create alert fatigue (too tight) or false confidence (too loose).

Process:

Step 1: Identify user journeys, not services

"User can log in" is more meaningful than "auth service is up."

Step 2: Choose meaningful SLIs

For each user journey, what metric reflects user experience? Availability (request success rate), latency (p95 under threshold), or correctness.

Step 3: Look at historical data

Start with what you're currently achieving and set the SLO at or slightly below that baseline, so you have a small amount of budget from day one. Don't set aspirational SLOs you immediately miss.

Step 4: Ask: "If this SLO fires, would we want to be woken up?"

If the answer is no, the SLO is too tight. If the answer is "probably yes but we've been missing it for weeks," it's too loose.

Step 5: Review quarterly

SLOs should evolve with the product. A new feature might require a tighter SLO. A mature service with stable load might relax its SLO.

Common mistake: Setting SLOs based on "what sounds like a good number" (99.99%) without knowing whether you can achieve it or what it would cost to meet it.


17. How do you measure and reduce operational toil?

Ideal answer:

Measuring toil:

Track every operational task in your ticket system with a toil tag. Calculate hours per week. Google targets < 50% of SRE time on toil.

Toil tickets per week: 12
Avg time per ticket: 30 min
Weekly toil: 6 hours
% of SRE capacity: 6/40 = 15% (within target)

Top toil sources to eliminate first:

1. Tickets that require copy-pasting the same commands from a runbook → automate via a script or AWX/Rundeck

2. Manual database cleanups on a schedule → scheduled job or Lambda

3. Certificate rotation → cert-manager in Kubernetes or ACM in AWS

4. Capacity adjustments → auto-scaling policies

The toil quadrant: Plot toil items by frequency × time-cost. High-frequency, high-cost items are automation priorities. Low-frequency, low-cost items can wait.
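The measurement and prioritization above can be sketched in Python. The per-item breakdown is hypothetical; it is chosen so the totals match the 12 tickets and 6 hours per week in the example.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    tickets_per_week: int
    minutes_per_ticket: float

    @property
    def hours_per_week(self):
        return self.tickets_per_week * self.minutes_per_ticket / 60.0


def toil_percentage(items, capacity_hours=40.0):
    """Share of weekly capacity spent on toil, as a percentage."""
    return sum(i.hours_per_week for i in items) / capacity_hours * 100.0


def automation_priorities(items):
    """Order items by weekly time cost (frequency x time per ticket), highest first."""
    return sorted(items, key=lambda i: i.hours_per_week, reverse=True)


# Hypothetical breakdown of the 12 weekly tickets from the example above.
items = [
    ToilItem("runbook copy-paste", 6, 30),
    ToilItem("manual database cleanup", 2, 45),
    ToilItem("certificate rotation", 1, 60),
    ToilItem("capacity adjustment", 3, 10),
]

print(f"toil: {toil_percentage(items):.0f}% of capacity")
for item in automation_priorities(items):
    print(f"{item.name}: {item.hours_per_week:.1f} h/week")
```

Sorting by hours per week is the simplest version of the quadrant: it collapses frequency and time-cost into one number, which is usually enough to pick the first automation target.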

Tracking improvement: Measure toil hours per quarter. If toil grows despite automation efforts, your service is growing faster than you're automating — a capacity issue.


18. What is a service dependency analysis and why does it matter for SRE?

Ideal answer:

A service dependency analysis maps the critical path of dependencies for each tier-0 or tier-1 service — identifying what other services it depends on and what would happen if each dependency failed.

Why it matters:

Process:

1. Map all service dependencies (use a service mesh, distributed tracing, or manual audit)

2. Identify synchronous vs asynchronous dependencies — synchronous deps are in your critical path

3. For each critical dep: What happens if it goes down? Does your service fail? Degrade gracefully? Queue?

4. Add circuit breakers, fallbacks, or caching for critical synchronous deps

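Circuit breaker pattern: Once failures from a downstream service cross a threshold, stop calling it for a cooldown period and return a cached or degraded response instead. After the cooldown, let a trial request through to check whether it has recovered. This prevents cascading failures.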
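A minimal circuit breaker can be sketched in Python as follows. This is an illustration under simplifying assumptions (single-threaded, consecutive-failure counting, one trial call after the timeout); the class name, parameters, and the failing dependency are all hypothetical, and production code would normally use an established library or a service mesh feature.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock      # injectable so the breaker is easy to test
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        """Run func(); on failure, or while the circuit is open, return fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the downstream call entirely
            # Timeout elapsed: half-open, let one trial call through below.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


# Usage with a fake clock and a hypothetical always-failing dependency.
now = [0.0]
attempts = []


def flaky_dependency():
    attempts.append(now[0])
    raise ConnectionError("downstream unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
results = [breaker.call(flaky_dependency, lambda: "cached") for _ in range(3)]
now[0] = 11.0  # advance past the reset timeout
recovered = breaker.call(lambda: "fresh", lambda: "cached")
```

In the usage example the third call never reaches the dependency: the breaker is already open, so the caller gets the cached response immediately instead of waiting on a failing service.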


19. How do you approach a new service you're being asked to support as an SRE?

Ideal answer:

This is a process question — they want to see structure, not improvisation.

Phase 1: Understand the service (week 1)

Phase 2: Baseline reliability (week 2)

Phase 3: Set SLOs and prioritize (week 3-4)


20. What would you do if a service is consistently missing its SLO?

Why they ask this: This is the most practical SRE scenario question. They want to see structured thinking, not panic.

Structured response:

Step 1: Understand the scope

Is it missing by 0.1% or 5%? For one week or three months? Is it one endpoint, all endpoints, or one region?

Step 2: Identify root causes

Step 3: Prioritize fixes by impact

Use the error budget burn rate to prioritize. The failures consuming the most budget get fixed first.
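One way to see which failures consume the most budget is to group failed requests by cause and rank them. The Python sketch below assumes each failed request has already been labelled with a cause; the labels and counts are invented for illustration.

```python
from collections import Counter


def budget_consumption_by_cause(failed_requests):
    """Rank failure causes by their share of all failed requests.

    failed_requests is an iterable of cause labels, one per failed request.
    """
    counts = Counter(failed_requests)
    total = sum(counts.values())
    return [(cause, n / total) for cause, n in counts.most_common()]


# Hypothetical failure log for one SLO window; the cause labels are invented.
failures = (
    ["db connection pool exhausted"] * 620
    + ["bad deploy"] * 250
    + ["upstream timeout"] * 130
)
ranking = budget_consumption_by_cause(failures)
for cause, share in ranking:
    print(f"{cause}: {share:.0%} of budget burned")
```

A ranking like this turns "the service is unreliable" into a short, ordered list of fixes, which is the structured thinking the question is looking for.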

Step 4: Escalate and negotiate

If fixing the SLO requires significant engineering work (refactoring, infrastructure changes), present the business case to leadership. Unreliability has a cost — quantify it.

Step 5: Freeze new feature development if needed

Per the error budget policy: if error budget is exhausted, new features pause until reliability improves. This is not punitive — it's the designed incentive to invest in reliability.

Step 6: Review the SLO itself

Sometimes the SLO is aspirational rather than achievable given the service's current architecture. It may need adjustment — with clear documentation of why.
