Prometheus & Grafana Interview Questions 2026: The Complete Guide
Monitoring · 17 min read · Apr 20, 2026
By InterviewDrill Team


Observability is now a core senior DevOps and SRE competency. Prometheus + Grafana is the dominant open-source monitoring stack in 2026. Here are the 20 questions that consistently come up in monitoring-focused interviews.


Section 1: Prometheus Architecture

1. Explain the Prometheus architecture

Why they ask this: They want to know if you understand how Prometheus works under the hood, not just how to run it.

Ideal answer:

Prometheus follows a pull-based architecture. The Prometheus server:

1. Maintains a list of scrape targets (from static config or service discovery)

2. Scrapes HTTP endpoints (usually /metrics) at configured intervals

3. Stores time-series data locally in its TSDB (time series database)

4. Evaluates alerting and recording rules against stored metrics

5. Sends alerts to AlertManager, which handles deduplication, grouping, and routing

Key components:

  • Prometheus server: Scraper, rule evaluator, TSDB
  • Exporters: Expose metrics from systems that don't natively speak Prometheus (Node Exporter, Blackbox Exporter)
  • Pushgateway: For short-lived jobs that can't be scraped (batch jobs)
  • AlertManager: Receives alerts, deduplicates, groups, and routes to receivers (PagerDuty, Slack, email)
  • Grafana: Queries Prometheus via PromQL, renders dashboards
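The wiring between these components can be sketched in a minimal prometheus.yml (hostnames, ports, and file names here are illustrative, not from any specific deployment):

```yaml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated

rule_files:
  - alerts.yml                # alerting and recording rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
```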

2. Why does Prometheus use a pull model? What are the trade-offs?

Why they ask this: This is a design philosophy question — they want depth, not just "because that's how it works."

Pull model advantages:

  • Centralized health check: If a scrape fails, Prometheus knows the target is down — you get free health checking
  • No agent credentials in targets: Targets expose metrics passively; Prometheus pulls them
  • Rate control: The monitoring system controls how often it scrapes — targets can't flood you
  • Simplicity: No push client libraries needed for most services

Pull model disadvantages:

  • Short-lived jobs: Batch jobs that run and exit in under one scrape interval can't be scraped. Solution: Pushgateway
  • Firewall/NAT traversal: Prometheus needs network access to every target, which is hard across firewalls, NAT, or multiple datacenters. Solution: run a Prometheus per network zone and centralize with remote_write or federation

When to push instead: Use the Pushgateway for batch jobs. Use remote_write to ship metrics to a central aggregation system (Thanos, Cortex, Grafana Mimir).
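The Pushgateway is itself scraped like any other target; the one non-obvious setting is `honor_labels: true`, which preserves the job/instance labels the batch jobs pushed instead of overwriting them with the gateway's own (hostname is illustrative):

```yaml
scrape_configs:
  - job_name: pushgateway
    honor_labels: true    # keep labels pushed by the batch jobs
    static_configs:
      - targets: ["pushgateway:9091"]
```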


3. What are the four Prometheus metric types?

Ideal answer:

Counter: A monotonically increasing value. Never goes down (except on restart). Use for counting events — requests served, errors, bytes sent.

  • Example: http_requests_total
  • With PromQL: rate(http_requests_total[5m]) gives requests per second

Gauge: A value that can go up or down. Use for current state — memory usage, queue depth, active connections.

  • Example: node_memory_MemFree_bytes

Histogram: Samples observations and counts them in configurable buckets. Allows calculating quantiles (p50, p95, p99). Use for request latency, response sizes.

  • Example: http_request_duration_seconds_bucket
  • Quantiles are calculated server-side at query time with histogram_quantile()

Summary: Similar to Histogram but calculates quantiles client-side (in the instrumented application). Quantiles are cheap to query because they're precomputed, but the client does more work and quantiles can't be aggregated across instances. Histograms are generally preferred for new instrumentation.

Key interview point: The biggest difference between Histogram and Summary is where quantile calculation happens — server-side (Histogram) vs client-side (Summary). Histograms can be aggregated; Summaries cannot.
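The aggregation point can be made concrete with a toy cumulative-bucket histogram in plain Python (bucket boundaries are illustrative; real client libraries handle this for you):

```python
import bisect

# Cumulative buckets like Prometheus exposes: le = "less than or equal".
BUCKETS = [0.1, 0.25, 0.5, 1.0, float("inf")]

def observe_all(latencies):
    """Return cumulative bucket counts for a list of observed latencies."""
    counts = [0] * len(BUCKETS)
    for v in latencies:
        i = bisect.bisect_left(BUCKETS, v)  # first bucket with le >= v
        for j in range(i, len(BUCKETS)):    # cumulative: count in all larger buckets too
            counts[j] += 1
    return counts

# Two instances of the same service:
a = observe_all([0.05, 0.2, 0.3])
b = observe_all([0.4, 0.9, 2.0])

# Histograms aggregate: summing bucket counts gives a valid fleet-wide histogram.
# Precomputed Summary quantiles have no such property — you can't average p99s.
fleet = [x + y for x, y in zip(a, b)]
print(fleet)  # → [1, 2, 4, 5, 6]
```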


4. Explain PromQL with examples

Ideal answer:

PromQL is Prometheus's query language for selecting and aggregating time-series data.

Instant vector selector: Returns current values for a metric

http_requests_total{job="api", status="500"}

Range vector selector: Returns values over a time range (needed for rate/increase functions)

http_requests_total[5m]

rate(): Per-second rate of a counter over a window

rate(http_requests_total[5m])

Aggregation with sum:

sum(rate(http_requests_total[5m])) by (job)

Error rate percentage:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

P99 latency from histogram:

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Common mistakes: Using rate() on a Gauge (only valid on Counters). Forgetting that rate() requires a range vector, not an instant vector.
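The semantics of rate() on a counter, including reset handling, can be sketched in a few lines of Python (simplified: real PromQL also extrapolates to the window edges):

```python
def prom_rate(samples, window_seconds):
    """Simplified PromQL rate(): per-second increase of a counter over a window."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter reset to ~0,
        # so the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    return increase / window_seconds

# Counter scraped every 15s over one minute, with a restart before the 4th sample:
samples = [100, 130, 160, 10, 40]
print(prom_rate(samples, 60))  # ≈ 1.67 requests/sec (100 total increase / 60s)
```

This is also why rate() is meaningless on a Gauge: a Gauge going down is a real value change, not a counter reset.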


5. What is high cardinality and why is it dangerous in Prometheus?

Why they ask this: Cardinality issues are one of the most common production Prometheus problems. This separates experienced operators from beginners.

Ideal answer:

Cardinality is the number of unique time series — determined by the number of unique label value combinations.

High cardinality problem: Adding a high-cardinality label (like user_id, request_id, or IP address) to a metric multiplies the number of time series, since each unique label value combination is stored as a separate series.

Example:

  • http_requests_total{method, status} → 4 methods × 5 statuses = 20 series
  • http_requests_total{method, status, user_id} → 20 × 10,000 users = 200,000 series

Impact: High memory usage, slow queries, Prometheus OOM crashes, slow scrape intervals.

Solutions:

  • Never use unbounded values (request IDs, user IDs, timestamps) as labels
  • Use recording rules to pre-aggregate high-cardinality metrics
  • Consider Grafana Tempo or Jaeger for request-level tracing (the right tool for per-request data)
  • Use Thanos or Grafana Mimir if you need label values that are high-cardinality at scale
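The multiplication is easy to check directly (the label counts here are the hypothetical ones from the example above):

```python
from math import prod

# Distinct values per label; user_id is the unbounded one that explodes the count.
bounded = {"method": 4, "status": 5}
with_user_id = {**bounded, "user_id": 10_000}

print(prod(bounded.values()))       # → 20 series
print(prod(with_user_id.values()))  # → 200000 series
```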

Section 2: AlertManager & Alerting

6. How does AlertManager work?

Ideal answer:

AlertManager receives alert notifications from Prometheus (and other systems), then handles deduplication, grouping, inhibition, silencing, and routing to notification receivers.

Alert lifecycle:

1. Prometheus evaluates an alerting rule — if the condition stays true for the rule's for: duration (e.g. 5m), the alert fires

2. Prometheus sends the alert to AlertManager

3. AlertManager deduplicates (same alert from multiple Prometheus instances counts once)

4. Alerts are grouped by configured labels (e.g., by cluster, by job)

5. After group_wait, AlertManager sends the notification

6. group_interval controls how often subsequent notifications fire for the same group

7. repeat_interval controls how often to resend if the alert is still firing

Routing tree: Routes match alerts to receivers (PagerDuty for critical, Slack for warnings, email for info) based on label matchers.

Inhibition: Suppress child alerts when a parent alert is firing (e.g., suppress disk full alerts if the node is already down).
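A sketch of a routing tree and an inhibition rule as described above (receiver names and matchers are illustrative):

```yaml
route:
  receiver: slack-warnings            # default receiver
  group_by: ["cluster", "alertname"]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty

inhibit_rules:
  - source_matchers: ['alertname="NodeDown"']
    target_matchers: ['alertname="DiskFull"']
    equal: ["instance"]               # only inhibit alerts for the same node

receivers:
  - name: pagerduty
  - name: slack-warnings
```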


7. What is the difference between `for`, `group_wait`, and `repeat_interval` in Prometheus and AlertManager?

Ideal answer:

These three settings control alert timing but at different stages:

for (in Prometheus alerting rule): How long the condition must be continuously true before the alert fires. Prevents flapping from transient spikes. Set to 5m for most production alerts.

group_wait (in AlertManager): How long AlertManager waits after receiving the first alert in a new group before sending the notification. Allows other alerts that would join the group to arrive first. Default: 30s.

group_interval: How long to wait before notifying about new alerts added to a group that has already been notified. Default: 5m.

repeat_interval: How long to wait before resending a notification for an alert that is still firing and hasn't changed. Prevents alert fatigue. Set to 4h or longer for non-critical alerts.
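A sketch showing where each knob lives — `for` in the Prometheus rule file, the rest in the AlertManager route (the alert expression is illustrative):

```yaml
# Prometheus rule file
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 1
        for: 5m              # condition must hold 5m before the alert fires

# AlertManager config
route:
  group_wait: 30s            # wait for other alerts to join a new group
  group_interval: 5m         # notify about new alerts in an existing group
  repeat_interval: 4h        # resend unchanged, still-firing alerts
```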


8. What are recording rules and when do you use them?

Ideal answer:

Recording rules pre-compute expensive PromQL expressions and store the result as a new time series. This makes dashboards faster and reduces Prometheus query load.

Syntax:

groups:
  - name: request_rates
    interval: 1m
    rules:
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)

When to use:

  • Queries used on high-traffic dashboards that are slow to compute
  • Aggregations over high-cardinality labels — pre-aggregate to reduce cardinality
  • Alerting rules that require complex expressions — pre-compute intermediate metrics

Naming convention: level:metric:operations (e.g., job:http_requests:rate5m)

Important note: Recording rules run on the Prometheus server's schedule, not at query time — so the recorded metric always reflects the state at the last evaluation, not real-time.


Section 3: Grafana

9. How do you structure Grafana dashboards for a production environment?

Ideal answer:

Dashboard hierarchy:

1. Overview / Fleet dashboard: High-level RED metrics (Rate, Errors, Duration) for all services at once. Useful for on-call first look.

2. Service dashboard: Deep-dive into a single service — latency percentiles, error rates by endpoint, dependency health.

3. Infrastructure dashboard: Node-level metrics — CPU, memory, disk, network.

4. Business metrics dashboard: User signups, transactions, revenue — for non-engineers.

Best practices:

  • Use variables for environment, cluster, namespace, service — one dashboard serves all environments
  • Use templating with $datasource variable to switch between Prometheus instances
  • Use links to cross-reference dashboards (click on a service to drill into its dashboard)
  • Keep row names consistent with alerting rule names — makes correlation during incidents faster
  • Store dashboards as JSON in Git (use Grafana provisioning) — avoid manual UI-only dashboards
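The dashboards-in-Git practice can be wired up with Grafana's file provisioning (paths and names here are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: services
    folder: Services
    type: file
    options:
      path: /var/lib/grafana/dashboards   # dashboard JSON files tracked in Git
```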

10. Explain Grafana Loki and how it complements Prometheus

Ideal answer:

Grafana Loki is a horizontally scalable log aggregation system. Unlike Elasticsearch, Loki indexes only metadata labels (not the full log content), making it much cheaper to run.

How it works:

  • Log agents (Promtail, Grafana Agent, Vector) tail logs from containers/files and push to Loki with labels
  • Labels mirror your Prometheus labels (job, namespace, pod) — making correlation easy
  • Grafana queries Loki with LogQL, which is intentionally similar to PromQL

LogQL example:

{namespace="production", app="api"} |= "ERROR" | json | line_format "{{.message}}"

Prometheus + Loki integration in Grafana:

  • Display metrics and related logs side by side in the same panel
  • Click on a metric spike → jump directly to logs from the same time window
  • Correlate latency spikes with error log messages without switching tools

Trade-off vs Elasticsearch: Loki is cheaper and faster to deploy but full-text search is slower without pre-indexing. For DevOps log aggregation, Loki is usually sufficient.


11. How do you set up Prometheus monitoring for a Kubernetes cluster?

Ideal answer:

kube-prometheus-stack (Helm chart) is the standard approach. It deploys:

  • Prometheus Operator (manages Prometheus instances via CRDs)
  • Prometheus with pre-configured Kubernetes scrape configs
  • AlertManager
  • kube-state-metrics (exposes Kubernetes object state as metrics)
  • node-exporter (per-node OS metrics)
  • Grafana with pre-built dashboards

Service discovery: Prometheus discovers Kubernetes targets via the Kubernetes API using ServiceMonitor and PodMonitor CRDs. You label your Service so it matches the ServiceMonitor's selector, and Prometheus automatically picks it up.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s

Common gotchas: RBAC permissions for Prometheus to scrape the Kubernetes API. Network policies allowing scrape traffic. Storage sizing for TSDB retention.


Section 4: Advanced Monitoring

12. What is Thanos and when do you need it?

Ideal answer:

Thanos extends Prometheus to provide:

  • Long-term storage: Uploads metrics blocks to object storage (S3, GCS) for unlimited retention
  • Global query view: Single query endpoint across multiple Prometheus instances (multi-cluster, multi-datacenter)
  • High availability: Multiple Prometheus replicas with deduplication
  • Downsampling: Auto-downsample old data for faster long-range queries

Architecture:

  • Sidecar: Runs next to each Prometheus instance, uploads blocks to object storage
  • Store gateway: Serves queries from object storage
  • Querier: Merges results from Sidecars and Store gateways, deduplicates
  • Compactor: Compacts and downsamples stored blocks

When you need Thanos (or Grafana Mimir):

  • Retention beyond 15 days (Prometheus default)
  • Multiple Kubernetes clusters to monitor with a single query interface
  • HA Prometheus setup
  • More than 1-2M active time series per Prometheus instance

13. How do you alert on SLOs with Prometheus?

Ideal answer:

SLO alerting requires a different approach than threshold alerting — you alert on error budget burn rate, not just current error rate.

Burn rate alerting (Google SRE Workbook approach):

If your SLO is 99.9% availability (0.1% error budget), a 1x burn rate means you'd exhaust the budget in 30 days. Alert when burn rate is significantly higher.

# Alert when error budget is being consumed 14x faster than sustainable
- alert: HighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) /
      sum(rate(http_requests_total[1h]))
    ) > 14 * 0.001  # 14x burn rate for 99.9% SLO
  for: 2m
  annotations:
    summary: "Error budget burning fast — check immediately"

Multi-window alerting: Use both a short window (1h) and a long window (6h) — short catches sudden spikes, long catches slow burns. Alert only when both windows show elevated burn rate (reduces false positives).
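The burn-rate arithmetic and the multi-window AND condition can be sketched in plain Python (the error ratios are made-up inputs):

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the error budget is burning.
    1.0 means the budget lasts exactly the SLO window (e.g. 30 days)."""
    return error_ratio / (1 - slo)

SLO = 0.999  # 99.9% availability -> 0.1% error budget

# Page only when BOTH windows burn fast: short catches spikes, long filters blips.
short_window = burn_rate(0.020, SLO)   # 2.0% errors over the last 1h
long_window = burn_rate(0.016, SLO)    # 1.6% errors over the last 6h
page = short_window > 14 and long_window > 14
print(page)  # → True: sustained fast burn, not a transient spike
```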


14. What is the Blackbox Exporter and when do you use it?

Ideal answer:

The Blackbox Exporter probes external endpoints from Prometheus's perspective — it performs active monitoring of your services from the outside.

Probe types:

  • HTTP: Check status codes, response body content, SSL certificate expiry
  • TCP: Check port connectivity
  • ICMP: Ping checks
  • DNS: Check DNS resolution

Use cases:

  • Synthetic monitoring — verify your API returns 200 and the correct response
  • SSL certificate expiry alerts — alert when cert expires in less than 30 days
  • Multi-region probing — deploy Blackbox Exporter in multiple regions to test latency from different locations

Example config:

- job_name: blackbox_http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
  - targets:
    - https://api.yourapp.com/health
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target      # probe target becomes a URL parameter
  - source_labels: [__param_target]
    target_label: instance            # keep the probed URL as the instance label
  - target_label: __address__
    replacement: blackbox-exporter:9115  # scrape the exporter itself (illustrative address)

15. How do you handle Prometheus storage sizing?

Ideal answer:

Prometheus TSDB storage depends on:

  • Number of active time series
  • Scrape interval
  • Sample retention period

Rough formula: needed_bytes ≈ (series_count / scrape_interval_seconds) × bytes_per_sample × retention_seconds

Average: ~1-2 bytes per sample. A setup with 1M series scraped every 15s with 15-day retention:

1,000,000 × (1/15) × 1.5 × (15 × 86400) ≈ 130 GB
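That arithmetic as a small helper, useful for sanity-checking capacity plans (the 1.5 bytes/sample figure is a rule of thumb, not a guarantee):

```python
def tsdb_bytes(series, scrape_interval_s, bytes_per_sample, retention_s):
    """Rough TSDB size: ingest rate (series / interval) × bytes/sample × retention."""
    samples_per_second = series / scrape_interval_s
    return samples_per_second * bytes_per_sample * retention_s

# 1M series, 15s scrape interval, 1.5 bytes/sample, 15-day retention:
gb = tsdb_bytes(1_000_000, 15, 1.5, 15 * 86400) / 1e9
print(round(gb))  # → 130 (GB)
```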

Production recommendations:

  • Use fast SSD storage — Prometheus is write-heavy
  • Set --storage.tsdb.retention.time or --storage.tsdb.retention.size
  • Monitor prometheus_tsdb_head_series to track active series growth
  • Use remote_write + Thanos/Grafana Mimir for retention beyond 15-30 days
  • Enable WAL compression (--storage.tsdb.wal-compression)

16. Explain Grafana alerting vs AlertManager

Ideal answer:

AlertManager handles alerts generated by Prometheus's alerting rules. It's a separate service focused on routing, deduplication, grouping, and silencing. Rich routing trees, inhibition rules, multi-receiver support.

Grafana Alerting (Unified Alerting) manages alerts directly within Grafana, against any datasource — not just Prometheus. Supports Loki, Elasticsearch, InfluxDB, and others.

Which to use:

  • AlertManager: When you need sophisticated routing trees, inhibition, or cross-Prometheus HA deduplication. The standard for Prometheus-native alerting.
  • Grafana Alerting: When you need alerts on non-Prometheus datasources, or want a single pane for all alert management. Can use AlertManager as a contact point.

In practice: Most large setups use Prometheus → AlertManager for infra alerts, and Grafana Alerting for business/application metrics that come from multiple datasources.


17. What are Prometheus federation and remote_write used for?

Ideal answer:

Federation: One Prometheus scrapes a subset of metrics from another Prometheus instance. Used for hierarchical setups — cluster-level Prometheus servers feed a global Prometheus that stores aggregated metrics only.

Limitation: Federation adds scrape latency, can silently miss data when a federation scrape fails, and doesn't handle high availability well. For serious multi-cluster setups, Thanos or Grafana Mimir is better.

remote_write: Prometheus streams metric samples to a remote storage backend as they're collected. Enables:

  • Long-term storage (Thanos Receiver, Grafana Mimir, Cortex)
  • Multi-cluster aggregation
  • Backup to object storage
  • Forwarding to cloud monitoring services (Grafana Cloud, AWS AMP)

Example config:

remote_write:
  - url: "https://prometheus-remote-write-endpoint/api/v1/push"
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 5s

18. How do you reduce alert fatigue?

Ideal answer:

Alert fatigue — too many noisy, unactionable alerts — is a real operational problem. Solutions:

Improve alert quality:

  • Only alert on symptoms (user-visible impact), not causes (high CPU that doesn't affect latency)
  • Use for: 5m to require sustained conditions — eliminates transient spikes
  • Use multi-window burn rate alerting for SLOs instead of threshold alerts

Reduce noise:

  • Inhibition rules: Suppress child alerts when a parent fires (disk full alert shouldn't fire if node is already down)
  • Silences: During known maintenance windows
  • Group alerts: AlertManager groups related alerts into one notification

Operational practices:

  • Treat every alert as a potential code change — if an alert fires and no action is taken, either fix it or delete it
  • Track alert volume per team; set a goal of < 5 actionable pages per on-call shift
  • Review alert rules quarterly with the team

19. How do you monitor a microservices architecture with Prometheus?

Ideal answer:

Instrumentation strategy:

  • Every service exposes a /metrics endpoint via the Prometheus client library (Go, Python, Java, Node.js)
  • Standard RED metrics at a minimum: Rate (requests per second), Errors (4xx/5xx rate), Duration (p50/p95/p99 latency)
  • Add business-level metrics for critical flows (order_created_total, payment_processed_total)

Service mesh integration: If using Istio or Linkerd, the sidecar proxies automatically emit RED metrics for all inter-service communication — no code changes needed.

Distributed tracing complement: Prometheus handles aggregate metrics well. For tracing individual requests across services, complement with Tempo (Grafana) or Jaeger. Grafana can correlate a Prometheus metric spike with a Tempo trace from the same time window via exemplars.

Kubernetes labels: Use consistent label sets across all services (app, version, namespace) so you can aggregate across the entire cluster or drill into individual services without remapping.


20. Walk me through debugging a Prometheus alert that isn't firing when it should be

Why they ask this: Operational debugging under pressure. They want to see systematic troubleshooting.

Step-by-step:

1. Verify the alert rule syntax: Go to Prometheus UI → Rules → check if the rule shows any errors.

2. Check if the metric exists: Query the metric directly in Prometheus UI. Is it returning data? If not — the scrape is failing or the metric isn't being exported.

3. Check scrape health: Prometheus UI → Targets. Is the target UP or DOWN? Check scrape duration.

4. Evaluate the expression manually: Copy the PromQL expression from the alert rule, paste it in the Graph tab. Does it return values above the threshold?

5. Check the for clause: Alert might be in "pending" state (condition is true, but for duration hasn't elapsed yet). Check Prometheus UI → Alerts — look for "PENDING" state.

6. Check AlertManager connectivity: Prometheus → Status → Runtime and Build. Verify AlertManager is configured and reachable.

7. Check AlertManager routing: Even if the alert fires, a misconfigured route might drop it silently. Check AlertManager UI → Alerts.

8. Verify time alignment: Scrape interval vs evaluation interval mismatch can cause race conditions where the metric hasn't updated when the rule evaluates.

Reading helps. Practicing wins interviews.

Practice these exact questions with an AI interviewer that pushes back. First session completely free.

Start Practicing Free →