Platform & observability · 22 May 2026 · Guide · By Babulal Tamang

PromQL
Prometheus
Grafana
Observability
SRE

PromQL in Depth: The Prometheus Query Language

PromQL is how you ask questions of time-series data in Prometheus—and, by extension, in Grafana, Alertmanager, and autoscalers that speak the Prometheus API. It is not SQL: there are no tables or joins in the relational sense. You select metric time series by name and labels, then transform them with operators, aggregations, and functions until the answer matches the operational question you care about.

In short

Learn the data model (metric + labels + sample type), the difference between instant and range queries, then build dashboards and alerts from rate(), histogram_quantile(), and careful sum by () aggregations. Guard cardinality, never rate() a gauge, and test queries in the Prometheus UI before you paste them into production rules.

Where PromQL fits in the observability stack

Prometheus scrapes targets (or receives remote write), stores samples in a time-series database, and evaluates recording rules and alerting rules written in PromQL. The same expressions power:

Grafana panels — query Prometheus (or Mimir, Thanos, Cortex, Amazon Managed Prometheus) with PromQL.
Alertmanager — receives the alerts that Prometheus fires once a rule expression has been true for the configured for duration, then handles grouping, routing, and notification.
Kubernetes HPA / KEDA — external metrics often originate from a PromQL query (see KEDA in depth for the Prometheus scaler).
Ad-hoc debugging — the Prometheus “Graph” UI and tools like promtool query instant.

PromQL is the lingua franca once metrics land in Prometheus format. If you run workloads on Kubernetes, metrics usually arrive via kube-prometheus-stack, Prometheus Operator, or a managed agent—see Kubernetes architecture for where scraping and cAdvisor/kubelet metrics sit in the platform picture.

The Prometheus data model

Everything PromQL touches is built from four ideas:

Metric name — e.g. http_requests_total, container_cpu_usage_seconds_total.
Labels — key/value dimensions: method="POST", status="500", pod="api-7f2k9".
Timestamp + value — a float64 sample at a point in time (Unix seconds with millisecond resolution in storage).
Sample type — how the value should be interpreted (counter, gauge, histogram, summary).

A time series is the combination of metric name and a unique label set. High-cardinality labels (user IDs, unbounded URLs) explode the number of series and hurt performance—design metrics and relabeling to keep labels bounded.
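
To make the cardinality point concrete, here is a minimal Python sketch. It assumes a series identity can be modeled as the metric name plus the sorted label pairs, and that the worst case for one metric is the product of distinct values per label. The helper names and the label counts are hypothetical, chosen only for illustration.

```python
from math import prod

def series_id(metric, labels):
    """A series is identified by its metric name plus its complete label set."""
    return (metric, tuple(sorted(labels.items())))

def worst_case_series(distinct_values_per_label):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())

# Label order does not matter; any differing label value is a new series.
a = series_id("http_requests_total", {"method": "POST", "status": "500"})
b = series_id("http_requests_total", {"status": "500", "method": "POST"})
c = series_id("http_requests_total", {"method": "POST", "status": "200"})

# Hypothetical label counts: adding an unbounded label multiplies everything.
bounded = worst_case_series({"method": 5, "status": 10, "pod": 20})
with_user_id = worst_case_series(
    {"method": 5, "status": 10, "pod": 20, "user_id": 50_000}
)

print(a == b, a == c)
print(bounded, with_user_id)
```

Real workloads rarely hit the full product, but the multiplication shows why a single unbounded label dominates the series count.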

Metric types (client libraries → storage)

Type	Meaning	PromQL habits
Counter	Monotonically non-decreasing (resets on process restart)	Use `rate()`, `increase()`, or `irate()` over a range—never graph raw counters for “per second” views
Gauge	Can go up or down (memory, queue depth, temperature)	Average, min, max, thresholds—do not apply `rate()`
Histogram	Observations bucketed by le= boundaries + `_sum` / `_count`	`histogram_quantile()` on `rate()` of buckets; respect bucket layout
Summary	Pre-computed quantiles at scrape time (client-side)	Less flexible than histograms in Prometheus; quantiles are not aggregatable across instances the same way

Exporters and instrumentation libraries expose these types. When you write custom metrics, pick the type that matches how the number behaves—not what looks convenient in Grafana.

Instant vectors vs range vectors

PromQL expressions evaluate to one of four value types. The two you use daily are instant vectors and range vectors.

Instant vector — one sample per series at a single evaluation time (a “snapshot”). Returned by the HTTP API query endpoint and used in alerting rules at eval time.
Range vector — multiple samples per series over a window [duration]. Required as input to functions like rate(). Selected with suffix syntax: http_requests_total[5m].

# Instant: current value of each matching series
up

# Range: all samples in the last 5 minutes per series (input for rate/increase)
http_requests_total[5m]

# rate() consumes that range vector and returns an instant vector
rate(http_requests_total[5m])

# Grafana "Range" queries still use PromQL; the UI sends a start/end/step
# and Prometheus evaluates the expression at each step as an instant query.

Scalar (single number) and string (rare, for certain functions) appear in advanced cases. Most dashboard panels consume instant-vector results plotted over time by re-evaluating at each step.
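
The "re-evaluate at each step" behaviour can be sketched in Python. This is a simplified model, not Prometheus's implementation: it assumes an instant lookup returns the newest sample inside a 5-minute lookback window and it ignores stale markers. The function names and the sample data are made up for the example.

```python
LOOKBACK = 300  # seconds; assumes the default 5m lookback

def instant_value(samples, t, lookback=LOOKBACK):
    """Return the newest sample value with t - lookback < ts <= t, else None."""
    candidates = [(ts, v) for ts, v in samples if t - lookback < ts <= t]
    return max(candidates)[1] if candidates else None

def range_query(samples, start, end, step):
    """Evaluate the instant lookup at start, start + step, ..., up to end."""
    return [(t, instant_value(samples, t)) for t in range(start, end + 1, step)]

# Hypothetical `up` samples as (timestamp_seconds, value); scraping stops after 60s.
samples = [(0, 1.0), (30, 1.0), (60, 0.0)]
points = range_query(samples, start=10, end=610, step=150)

for t, value in points:
    print(t, value)
```

Once the newest sample falls out of the lookback window, the series simply has no point at that step, which is what a gap in a graph represents.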

Selectors: finding the right time series

The simplest selector is a metric name:

http_requests_total

Add label matchers in curly braces:

# Equality
http_requests_total{job="api", method="GET"}

# Negation
http_requests_total{status!="200"}

# Regex (RE2 syntax)
http_requests_total{status=~"5.."}
http_requests_total{handler!~"/debug.*"}

# Multiple matchers (AND)
process_resident_memory_bytes{job="api", instance=~"10\\.0\\..*"}

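Matchers apply before functions: you narrow the series set, then transform. Regex matchers are fully anchored, so status=~"5.." matches 500 but not 5000. A selector that matches no series returns an empty result rather than an error, so check for typos in label names (job vs jobs) and whether relabeling dropped labels you expected.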
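
A rough Python model of matcher semantics follows. It assumes two behaviours: regex matchers are anchored at both ends, and a missing label is treated like an empty value. Python's re module stands in for RE2 here, so exotic patterns may behave differently; matches and select are hypothetical helper names.

```python
import re

def matches(labels, name, op, value):
    """Apply one label matcher to a label set."""
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # anchored match
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")

def select(series, matchers):
    """Keep the label sets that satisfy every matcher (logical AND)."""
    return [s for s in series if all(matches(s, *m) for m in matchers)]

series = [
    {"job": "api", "status": "500"},
    {"job": "api", "status": "200"},
    {"job": "api", "status": "5000"},
    {"job": "web"},
]

five_xx = select(series, [("status", "=~", "5..")])
not_ok = select(series, [("status", "!=", "200")])
print(five_xx)
print(not_ok)
```

Note that the negative matcher also returns the series with no status label at all, which is a common source of surprising results.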

Arithmetic and comparison operators

Binary operators work on instant vectors (and scalars) with vector matching when both sides are vectors:

Operators	Notes
`+ - * / % ^`	Standard math; `%` is modulo; `^` is power
`== != < > <= >=`	Comparisons filter by default: series where the comparison is false are dropped. Add the `bool` modifier to return 0/1 sample values instead
`and or unless`	Set operations on label sets—essential for alert expressions

# CPU cores used vs requested (kube-state-metrics style)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

# Drop series where denominator is missing
...
and on(pod) sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod) > 0

# Alert-style: error share over 5m > 1%
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.01

By default, binary ops match series with identical label sets on the left and right. When labels differ, use on(), ignoring(), group_left, or group_right, documented in Prometheus’s vector matching section. Getting this wrong produces “no data” when nothing matches, or a matching error when several series on one side share the same match key.
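
One-to-one matching with on() can be sketched as a keyed join. The Python below is an illustrative model under simplifying assumptions: vectors are lists of (labels, value) pairs, the output keeps only the on() labels, and denominators are non-zero (PromQL would yield Inf or NaN where Python raises). divide_on is a hypothetical helper, not a Prometheus API.

```python
def divide_on(left, right, on):
    """One-to-one division of two instant vectors, matched only on `on` labels."""
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    def index(vector, side):
        by_key = {}
        for labels, value in vector:
            if key(labels) in by_key:
                # PromQL would require group_left/group_right here.
                raise ValueError(f"{side} side has several series per match key")
            by_key[key(labels)] = value
        return by_key

    left_by_key, right_by_key = index(left, "left"), index(right, "right")
    # Series without a partner on the other side are dropped silently.
    return [
        (dict(zip(on, k)), value / right_by_key[k])
        for k, value in left_by_key.items()
        if k in right_by_key
    ]

usage = [
    ({"pod": "api-1", "container": "app"}, 0.5),
    ({"pod": "api-2", "container": "app"}, 0.2),
]
requests = [({"pod": "api-1", "resource": "cpu"}, 1.0)]

ratio = divide_on(usage, requests, on=["pod"])
print(ratio)
```

The pod without a matching request disappears from the result, which is the "no data" case; a second container in the same pod would trigger the duplicate-key error.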

Aggregation operators

Aggregations collapse many series into fewer—usually what you want for cluster-wide dashboards:

sum(rate(http_requests_total[5m])) by (job)
avg(process_resident_memory_bytes) by (instance)
max(node_cpu_seconds_total) without (cpu)
count(up == 0)
topk(5, sum(rate(http_requests_total[5m])) by (handler))
bottomk(3, container_memory_working_set_bytes)
quantile(0.95, rate(http_requests_total[5m]))  # 95th percentile across series, not latency; use histogram_quantile for latency

Available aggregators include sum, min, max, avg, group, stddev, stdvar, count, count_values, bottomk, topk, and quantile.

by (label1, label2) — keep only listed labels in the output; aggregate away the rest.
without (label) — aggregate away listed labels; keep all others.

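Rule of thumb: take rate() per series first, then aggregate: sum(rate(...)), not rate(sum(...)). rate() detects counter resets per series; summing first hides an individual reset inside the total, which then looks like a large drop followed by a spurious jump. rate() also needs a range vector, so rate(sum(...)) is not even valid without a subquery.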
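
A small Python sketch shows why the order matters when one counter resets. It assumes a simplified reset rule (any decrease means the counter restarted from zero) and uses invented sample values; increase here is a stand-in for the idea, not Prometheus's extrapolating implementation.

```python
def increase(values):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total

# Two hypothetical counters scraped three times; the second one restarts.
counter_a = [100.0, 110.0, 120.0]
counter_b = [50.0, 5.0, 15.0]

# Correct order: handle resets per series, then add the results.
sum_of_increases = increase(counter_a) + increase(counter_b)

# Wrong order: add raw counters first, then look for resets in the total.
summed = [a + b for a, b in zip(counter_a, counter_b)]
increase_of_sum = increase(summed)

print(sum_of_increases, increase_of_sum)
```

The per-series result reflects the real growth, while the summed series treats the partial drop as a full reset and reports a far larger number.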

Essential functions

Counters: rate, increase, irate

# Per-second average rate over 5m (most common)
rate(http_requests_total[5m])

# Total increase over window (useful for "errors in last hour")
increase(http_requests_total[1h])

# Instant rate from the last two points (spiky; suits zoomed-in graphs of fast-moving counters, not alerts)
irate(http_requests_total[5m])

Choose a range window at least four times your scrape interval (often 5m with 30s scrape). Too short a window → noisy graphs; too long → slow to reflect incidents.
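
The window-size advice follows from how many samples land inside the range. The Python sketch below is a simplified rate: it needs at least two samples, handles resets, and divides by the time between the first and last sample. Real rate() also extrapolates toward the window edges, which is omitted here; the samples assume a 30s scrape interval with one missed scrape.

```python
def simple_rate(samples, t, window):
    """Per-second rate from samples with t - window < ts <= t, or None."""
    points = [(ts, v) for ts, v in samples if t - window < ts <= t]
    if len(points) < 2:
        return None  # a single sample cannot produce a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(points, points[1:]):
        increase += cur - prev if cur >= prev else cur  # drop means reset
    return increase / (points[-1][0] - points[0][0])

# Counter growing by 1 per second; the scrape at t=60 was missed.
samples = [(0, 0.0), (30, 30.0), (90, 90.0), (120, 120.0)]

narrow = simple_rate(samples, t=100, window=60)   # only one sample in range
wide = simple_rate(samples, t=100, window=120)    # survives the missed scrape

print(narrow, wide)
```

With a window of only two scrape intervals, one missed scrape leaves a gap; the wider window still returns a value.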

Histograms: histogram_quantile

# p99 latency by job (histogram buckets required)
histogram_quantile(
  0.99,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# p50
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Buckets must be cumulative (le labels). Mixing histograms with incompatible bucket boundaries across jobs breaks quantiles—standardize instrumentation or aggregate at the service level with recording rules.
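
To see why bucket layout matters, here is a hedged Python sketch of the linear interpolation idea behind histogram_quantile. It assumes classic cumulative buckets with positive upper bounds, a lower bound of 0 for the first bucket, and 0 < q < 1. The bucket counts are invented, and this is an approximation of the approach, not the Prometheus source.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# Hypothetical latency buckets in seconds (le, cumulative count).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
p995 = histogram_quantile(0.995, buckets)
print(p50, p95, p995)
```

The estimate can never be more precise than the bucket it falls into, and anything above the highest finite boundary is reported as that boundary, so boundaries should bracket the latencies you alert on.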

Gauges and time

avg_over_time(process_resident_memory_bytes[1h])
max_over_time(node_memory_MemAvailable_bytes[24h])
predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600)  # predict 4h ahead
time() - process_start_time_seconds  # uptime-style
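
The idea behind predict_linear can be illustrated with an ordinary least-squares fit in Python. This sketch assumes a straight-line fit over the samples in the window, with time measured relative to the evaluation time; the function mirrors the PromQL name for readability only, and the disk numbers are hypothetical.

```python
def predict_linear(samples, t_eval, seconds_ahead):
    """Fit value = intercept + slope * (ts - t_eval) and extrapolate."""
    n = len(samples)
    xs = [ts - t_eval for ts, _ in samples]
    ys = [value for _, value in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x  # fitted value at evaluation time
    return intercept + slope * seconds_ahead

# Hypothetical free disk space, losing 10 GB per hour.
hour = 3600
samples = [(0, 100e9), (1 * hour, 90e9), (2 * hour, 80e9), (3 * hour, 70e9)]

in_4h = predict_linear(samples, t_eval=3 * hour, seconds_ahead=4 * hour)
in_8h = predict_linear(samples, t_eval=3 * hour, seconds_ahead=8 * hour)
print(in_4h, in_8h)
```

A negative prediction is the signal a disk-full alert keys on: at the current trend the volume runs out before that horizon.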

Sorting, limits, and labels

sort_desc(sum(rate(http_requests_total[5m])) by (instance))
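limitk(10, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))  # experimental aggregator in newer releases; may need --enable-feature=promql-experimental-functions, otherwise use topk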
label_replace(up, "env", "$1", "kubernetes_namespace", "(.*)")
abs(delta(some_gauge[1h]))

Function groups (reference)

Category	Examples	Typical use
Rate / delta	`rate`, `irate`, `increase`, `delta`, `idelta`	Counters and occasional gauge changes
Aggregations over time	`avg_over_time`, `max_over_time`, `sum_over_time`, …	SLI windows, burn rates
Math / rounding	`abs`, `ceil`, `floor`, `round`, `clamp_min`, `clamp_max`	Display hygiene, thresholds
Histogram / summary	`histogram_quantile`	Latency percentiles
Prediction	`predict_linear`, `deriv`	Disk-full warnings
Label manipulation	`label_replace`, `label_join`	Align labels for joins
Trigonometry / logs	`sin`, `ln`, …	Rare in ops; exists for completeness

Subqueries

A subquery evaluates an instant-vector expression repeatedly over a range at a fixed resolution and returns the results as a range vector, which is useful for rolling statistics without recording rules:

# Max 5m rate over the last hour, evaluated every 1m
max_over_time(
  rate(http_requests_total[5m])[1h:1m]
)

# Syntax: <instant_expr>[range:resolution]

Subqueries are powerful and expensive. Prefer recording rules for hot expressions reused in many dashboards and alerts.

Recording rules and alerting rules

Rule files (listed under rule_files in prometheus.yml) hold two rule types that share PromQL syntax but serve different goals:

Recording rules — precompute a new time series (e.g. job:http_requests:rate5m) to speed queries and standardize naming.
Alerting rules — when an expression returns at least one series (optionally sustained for a duration such as for: 5m), Prometheus sends alerts to Alertmanager with labels and annotations.

groups:
  - name: api-recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
          > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High 5xx rate on {{ $labels.job }}"

Annotations support {{ $labels.x }} and {{ $value }} templating. Keep alert expressions readable—compose recording rules for numerators and denominators. Pair alerts with runbooks and the calm-incident mindset in incident and disaster response.
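
How the for clause delays firing can be modeled as a small state machine. The Python below is a sketch under one assumption: an alert stays pending until the expression has been true for at least the for duration since the first true evaluation, and any false evaluation resets it. Time is counted in evaluation intervals, and alert_states is a hypothetical helper.

```python
def alert_states(condition_by_eval, for_evals):
    """Map per-evaluation results to inactive / pending / firing.

    for_evals is the `for` duration divided by the evaluation interval.
    """
    states, streak = [], 0
    for active in condition_by_eval:
        streak = streak + 1 if active else 0  # a false evaluation resets the alert
        if streak == 0:
            states.append("inactive")
        elif streak > for_evals:
            # `for_evals` intervals have passed since the first true evaluation.
            states.append("firing")
        else:
            states.append("pending")
    return states

# Expression true for three evaluations, false once, then true again.
history = [True, True, True, False, True]
states = alert_states(history, for_evals=2)
print(states)
```

A brief dip below the threshold restarts the pending timer, which is why flapping expressions may never page and why smoothing the expression is often better than shortening for.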

Patterns every SRE should recognize

RED method (requests)

Rate — sum(rate(http_requests_total[5m])) by (job)
Errors — 5xx or business-error labels over total rate
Duration — histogram quantiles as above

USE method (resources)

Utilization — CPU/memory/disk busy %
Saturation — run queue length, throttling, OOM pressure
Errors — device/interface error counters

Availability and burn rate (sketch)

# SLI: successful requests / all requests (5m window)
sum(rate(http_requests_total{status!~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

# Multi-window burn alerts (Google SRE Workbook style) combine short+long windows
# — implement via recording rules + two alert thresholds, not one magic query.
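
The arithmetic behind burn-rate thresholds is short enough to sketch in Python. It assumes a 30-day SLO period and defines burn rate as the observed error ratio divided by the error budget (1 minus the SLO). The 2%-in-1h and 5%-in-6h budget figures are commonly quoted examples rather than values taken from this guide, and all function names are illustrative.

```python
PERIOD_HOURS = 30 * 24  # assumes a 30-day SLO period

def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)

def threshold(budget_fraction, window_hours, period_hours=PERIOD_HOURS):
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours

fast = threshold(0.02, window_hours=1)   # spend 2% of the budget in 1 hour
slow = threshold(0.05, window_hours=6)   # spend 5% of the budget in 6 hours

# Hypothetical observation: 1.44% of requests failing against a 99.9% SLO.
observed = burn_rate(0.0144, slo=0.999)

print(fast, slow, observed)
```

Each threshold then becomes the right-hand side of an alert on the matching error-ratio recording rule, with a shorter companion window to confirm the burn is still happening.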

Kubernetes snippets

# Pods not ready
kube_pod_status_ready{condition="true"} == 0

# Container restarts (counter)
increase(kube_pod_container_status_restarts_total[1h]) > 3

# CPU throttling (CFS)
rate(container_cpu_cfs_throttled_seconds_total[5m])

Exact metric names depend on your kube-prometheus version and whether you use cAdvisor vs kubelet cadvisor endpoints—verify in Status → Targets and the metric browser before copying dashboards from the internet.

Grafana and the query API

Grafana’s Prometheus data source sends PromQL to the server. Practical tips:

Use Min step aligned to scrape interval to avoid redundant points.
Prefer $__rate_interval (or Grafana’s auto range) over hard-coded [5m] when panels span different time ranges.
Legend templates such as {{job}} — {{handler}} keep series readable, but hide high-cardinality labels in legends for cluster views.
Use instant queries for “Table” panels and range queries for “Time series” panels.

Raw HTTP API (for automation and debugging):

# Instant
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up'

# Range
curl -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(http_requests_total[5m])' \
  --data-urlencode 'start=2026-05-22T00:00:00Z' \
  --data-urlencode 'end=2026-05-22T01:00:00Z' \
  --data-urlencode 'step=60'
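
For automation, the same instant query can be prepared and decoded in Python with only the standard library. This sketch builds the request URL and parses a response body without making a network call; the base URL and the sample body are assumptions for illustration, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

def instant_query_url(base_url, query, time=None):
    """Build the URL for /api/v1/query with a URL-encoded PromQL expression."""
    params = {"query": query}
    if time is not None:
        params["time"] = time
    return f"{base_url}/api/v1/query?{urlencode(params)}"

def parse_vector(body):
    """Turn an instant-vector response into (labels, float value) pairs."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    # Each value is [timestamp, "string-encoded float"].
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]

url = instant_query_url("http://localhost:9090", "up")
encoded = instant_query_url("http://localhost:9090", "rate(http_requests_total[5m])")

# Hand-written sample body in the shape of an instant-vector response.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "prometheus"},
             "value": [1779408000.0, "1"]},
        ],
    },
})
series = parse_vector(sample_body)
print(url)
print(series)
```

Sample values arrive as strings so that NaN and infinities survive JSON, which is why the sketch converts them with float().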

Cardinality, performance, and correctness

High cardinality — unbounded label values (URL paths, user IDs) slow queries and increase memory. Use relabeling to drop or aggregate labels at scrape time.
rate() on gauges — meaningless spikes; use deriv(), delta(), or the *_over_time functions instead.
Counter resets — rate() handles resets; still watch deployments that reset often in short windows.
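Stale markers — when a scrape fails or a series vanishes from a target, Prometheus marks it stale and it drops out of instant results at the next evaluation; without a marker (for example, samples exposed with explicit timestamps), the last sample keeps being returned for up to the 5-minute default lookback. PromQL does not interpolate between samples.
Lookback vs scrape — an instant selector needs a sample within the lookback window, and rate() needs at least two samples inside its range; missed scrapes create gaps.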
Duplicate labels — same metric/labels from two targets double-count unless you sum intentionally.

Validate with promtool check rules and promtool query instant in CI for rule files—same discipline as testing Terraform plans in Terraform IaC for everyone.

Troubleshooting “no data” and wrong numbers

Metric exists? — Prometheus UI → Graph → start typing the metric name; check /api/v1/label/__name__/values.
Labels correct? — Inspect one series in the UI; compare to your selector.
Scrape healthy? — Status → Targets; fix TLS, auth, or network policy first.
Range window — widen [5m] to [15m] for sparse scrapes.
Vector matching — simplify: compute left and right sides as separate recording rules, then divide.
Timezone / step — Grafana UTC vs local; enormous step skips spikes.

For cluster-level “something is wrong but which Pod?” workflows, combine PromQL with kubectl steps in Kubernetes troubleshooting playbook.

Production checklist

Standardize metric names and label conventions per service (job, instance, service, team).
Recording rules for any expression appearing in more than one dashboard or alert.
Alert labels route to the right on-call rotation; annotations link to runbooks.
Dashboards show RED/USE (or your framework) per tier-1 service—not only infrastructure CPU.
Review cardinality after each new exporter or auto-instrumentation library.
Document “golden queries” in repo docs next to Helm values—GitOps for observability like GitOps principles for apps.

Hands-on: minimal lab

# Docker Prometheus with self-scrape
docker run -d --name prom -p 9090:9090 prom/prometheus:v2.52.0

# Open http://localhost:9090
# Try in the Graph tab:
up
rate(prometheus_http_requests_total[5m])
histogram_quantile(0.9, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket[5m])))

Install node_exporter or sample-app metrics next, then practice sum by () and alert rules on disk-full predictions with predict_linear.
