PromQL in Depth: The Prometheus Query Language
PromQL is how you ask questions of time-series data in Prometheus and, by extension, in Grafana, in the alerting rules that feed Alertmanager, and in autoscalers that speak the Prometheus API. It is not SQL: there are no tables or joins in the relational sense. You select time series by metric name and labels, then transform them with operators, aggregations, and functions until the answer matches the operational question you care about.
In short
Learn the data model (metric + labels + sample type), the difference between instant and range queries, then build dashboards and alerts from rate(), histogram_quantile(), and careful sum by () aggregations. Guard cardinality, never rate() a gauge, and test queries in the Prometheus UI before you paste them into production rules.
Where PromQL fits in the observability stack
Prometheus scrapes targets (or receives remote write), stores samples in a time-series database, and evaluates recording rules and alerting rules written in PromQL. The same expressions power:
- Grafana panels: query Prometheus (or Mimir, Thanos, Cortex, Amazon Managed Prometheus) with PromQL.
- Alertmanager: receives the alerts that Prometheus fires once a rule expression has been true for the configured for duration.
- Kubernetes HPA / KEDA: external metrics often originate from a PromQL query (see KEDA in depth for the Prometheus scaler).
- Ad-hoc debugging: the Prometheus “Graph” UI and tools like promtool query instant.
PromQL is the lingua franca once metrics land in Prometheus format. If you run workloads on Kubernetes, metrics usually arrive via kube-prometheus-stack, Prometheus Operator, or a managed agent—see Kubernetes architecture for where scraping and cAdvisor/kubelet metrics sit in the platform picture.
The Prometheus data model
Everything PromQL touches is built from four ideas:
- Metric name: e.g. http_requests_total, container_cpu_usage_seconds_total.
- Labels: key/value dimensions such as method="POST", status="500", pod="api-7f2k9".
- Timestamp + value: a float64 sample at a point in time (Unix time, stored with millisecond resolution).
- Sample type: how the value should be interpreted (counter, gauge, histogram, summary).
A time series is the combination of metric name and a unique label set. High-cardinality labels (user IDs, unbounded URLs) explode the number of series and hurt performance—design metrics and relabeling to keep labels bounded.
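To make the cardinality point concrete, here is a small Python sketch (not Prometheus code) that models a series identity as name plus label set and computes a worst-case series count. The user_id label and its 10,000 values are hypothetical, chosen only to show how one unbounded label multiplies everything else.

```python
from math import prod

def series_identity(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, frozenset(labels.items()))

def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

# Label order does not matter; any differing label value is a new series.
a = series_identity("http_requests_total", {"method": "GET", "status": "200"})
b = series_identity("http_requests_total", {"status": "200", "method": "GET"})
c = series_identity("http_requests_total", {"method": "GET", "status": "500"})

bounded = {"method": ["GET", "POST"], "status": ["200", "404", "500"]}
# Hypothetical unbounded label added on top of the bounded ones.
with_user_id = dict(bounded, user_id=[str(i) for i in range(10_000)])

print(worst_case_series(bounded), worst_case_series(with_user_id))
```

The real count is usually below this bound because not every combination occurs, but the multiplication is why a single unbounded label dominates.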
Metric types (client libraries → storage)
| Type | Meaning | PromQL habits |
|---|---|---|
| Counter | Monotonically non-decreasing (resets on process restart) | Use rate(), increase(), or irate() over a range—never graph raw counters for “per second” views |
| Gauge | Can go up or down (memory, queue depth, temperature) | Average, min, max, thresholds—do not apply rate() |
| Histogram | Observations bucketed by le= boundaries + _sum / _count | histogram_quantile() on rate() of buckets; respect bucket layout |
| Summary | Pre-computed quantiles at scrape time (client-side) | Less flexible than histograms in Prometheus; quantiles are not aggregatable across instances the same way |
Exporters and instrumentation libraries expose these types. When you write custom metrics, pick the type that matches how the number behaves—not what looks convenient in Grafana.
Instant vectors vs range vectors
PromQL expressions evaluate to one of four value types. The two you use daily are instant vectors and range vectors.
- Instant vector: one sample per series at a single evaluation time (a “snapshot”). Returned by the HTTP API query endpoint and used in alerting rules at eval time.
- Range vector: multiple samples per series over a window [duration]. Required as input to functions like rate(). Selected with suffix syntax: http_requests_total[5m].
# Instant: current value of each matching series
up
# Range: http_requests_total[5m] selects the last 5 minutes of samples per series;
# rate() then turns that range vector back into an instant vector
rate(http_requests_total[5m])
# Grafana "Range" queries still use PromQL; the UI sends a start/end/step
# and Prometheus evaluates the expression at each step as an instant query.
Scalar (single number) and string (rare, for certain functions) appear in advanced cases. Most dashboard panels consume instant-vector results plotted over time by re-evaluating at each step.
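The step-by-step evaluation described above can be sketched in Python. This is a simplified model, not Prometheus internals: it assumes samples are (timestamp, value) pairs in seconds, uses the default 5-minute lookback, and treats the lookback window as excluding its left edge, which is an assumption since boundary handling has differed between Prometheus versions.

```python
def eval_times(start, end, step):
    """Timestamps at which a range query evaluates the expression."""
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += step
    return times

def instant_value(samples, t, lookback=300):
    """Newest sample at or before t that is still inside the lookback window.

    samples: list of (timestamp, value), oldest first.
    Returns None when the series has no recent enough sample.
    """
    candidates = [value for ts, value in samples if t - lookback < ts <= t]
    return candidates[-1] if candidates else None

samples = [(0, 1.0), (30, 2.0), (60, 3.0)]
# One instant lookup per step is what a graph panel plots.
points = [(t, instant_value(samples, t)) for t in eval_times(0, 120, 60)]
print(points)
```

A large step means fewer lookups, which is why short spikes between two evaluation times never show up on the graph.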
Selectors: finding the right time series
The simplest selector is a metric name:
http_requests_total
Add label matchers in curly braces:
# Equality
http_requests_total{job="api", method="GET"}
# Negation
http_requests_total{status!="200"}
# Regex (RE2 syntax)
http_requests_total{status=~"5.."}
http_requests_total{handler!~"/debug.*"}
# Multiple matchers (AND)
process_resident_memory_bytes{job="api", instance=~"10\\.0\\..*"}
Matchers apply before functions: you narrow the series set, then transform. A selector that matches no series returns an empty result rather than an error, so check for typos in label names (job vs jobs) and whether relabeling dropped labels you expected.
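The matcher semantics can be approximated in a few lines of Python. This is an illustrative sketch only: it uses Python's re module rather than RE2, which behaves the same for simple patterns like the ones above. Two details it mirrors are that regex matchers are fully anchored and that a label missing from a series behaves like an empty string.

```python
import re

def matches(labels, matchers):
    """Return True if a label set satisfies every matcher (logical AND).

    matchers: list of (label, op, value) with op in "=", "!=", "=~", "!~".
    """
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label compares as ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # anchored match
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True

series = {"job": "api", "status": "503", "handler": "/debug/pprof"}
print(matches(series, [("status", "=~", "5..")]))  # 5xx pattern
print(matches(series, [("jobs", "=", "api")]))     # typo in the label name
```

The anchoring is why status=~"5" does not match 503: the pattern has to cover the whole label value.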
Arithmetic and comparison operators
Binary operators work on instant vectors (and scalars) with vector matching when both sides are vectors:
| Operators | Notes |
|---|---|
| + - * / % ^ | Standard math; % is modulo; ^ is power |
| == != < > <= >= | Comparisons filter by default, keeping only the series where the comparison holds; add the bool modifier to get 0/1 sample values instead |
| and or unless | Set operations on label sets; essential for alert expressions |
# CPU cores used vs requested (kube-state-metrics style)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)
# Drop series where denominator is missing
...
and on(pod) sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod) > 0
# Alert-style: error share over 5m > 1%
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.01
By default, binary ops match series with identical label sets on the left and right. When labels differ, use on(), ignoring(), group_left, or group_right, all documented in Prometheus’s vector matching section. Getting this wrong produces “no data”, a many-to-many matching error, or unexpected extra series.
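Here is a rough Python model of one-to-one matching with on(), to show why mismatched labels yield “no data”. It is a sketch under simplifying assumptions: denominators are nonzero, each side has at most one series per key (Prometheus raises an error otherwise), and group_left/group_right are not modelled.

```python
def divide_on(left, right, on):
    """One-to-one `left / on(...) right` for two instant vectors.

    left, right: lists of (labels_dict, value). Series pair up when their
    values for the `on` labels are equal; unpaired series are dropped.
    Assumes nonzero denominators and unique keys on each side.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    right_by_key = {key(labels): value for labels, value in right}
    result = []
    for labels, value in left:
        k = key(labels)
        if k in right_by_key:
            # The result keeps only the labels named in on().
            result.append((dict(zip(on, k)), value / right_by_key[k]))
    return result

usage = [({"pod": "a", "container": "app"}, 0.5),
         ({"pod": "b", "container": "app"}, 0.2)]
requests = [({"pod": "a", "resource": "cpu"}, 1.0)]

print(divide_on(usage, requests, ["pod"]))
# Matching on every label, as the default does, pairs nothing here.
print(divide_on(usage, requests, ["pod", "container", "resource"]))
```

Pod b disappears from the first result because it has no partner on the right, which is the usual cause of a ratio panel silently missing series.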
Aggregation operators
Aggregations collapse many series into fewer—usually what you want for cluster-wide dashboards:
sum(rate(http_requests_total[5m])) by (job)
avg(process_resident_memory_bytes) by (instance)
max(node_cpu_seconds_total) without (cpu)
count(up == 0)
topk(5, sum(rate(http_requests_total[5m])) by (handler))
bottomk(3, container_memory_working_set_bytes)
quantile(0.95, rate(http_requests_total[5m])) # 95th percentile across series, rare; for latency percentiles use histogram_quantile
Available aggregators include sum, min, max, avg, group, stddev, stdvar, count, count_values, bottomk, topk, and quantile.
- by (label1, label2): keep only the listed labels in the output; aggregate away the rest.
- without (label): aggregate away the listed labels; keep all others.
Rule of thumb: apply rate() to each counter series first, then aggregate: sum(rate(...)), not rate(sum(...)). Summing raw counters first hides the reset of any single series, so one restarting instance distorts the combined rate. In addition, rate() needs a range vector, which sum() does not return unless you wrap it in a subquery.
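The by and without clauses differ only in how the output key is built. The Python sketch below illustrates that with sum; the input vector and its values are made up for the example, and the tuple-of-pairs key is just a convenient stand-in for a label set.

```python
from collections import defaultdict

def sum_by(vector, by):
    """sum by (<by>) (...): keep only the listed labels, add up the rest."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)

def sum_without(vector, without):
    """sum without (<without>) (...): drop the listed labels, keep all others."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        totals[key] += value
    return dict(totals)

# Hypothetical per-instance request rates, already passed through rate().
rates = [({"job": "api", "instance": "a"}, 1.5),
         ({"job": "api", "instance": "b"}, 2.5),
         ({"job": "web", "instance": "c"}, 4.0)]

print(sum_by(rates, ["job"]))
print(sum_without(rates, ["instance"]))
```

With only job and instance present the two calls agree; they diverge as soon as a series carries an extra label, which without keeps and by drops.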
Essential functions
Counters: rate, increase, irate
# Per-second average rate over 5m (most common)
rate(http_requests_total[5m])
# Total increase over window (useful for "errors in last hour")
increase(http_requests_total[1h])
# Instant rate from the last two samples in the window (spiky; suits fast-moving counters on zoomed-in graphs)
irate(http_requests_total[5m])
Choose a range window at least four times your scrape interval: with a 30s scrape that means 2m or more, and 5m is a common default. Too short a window gives noisy graphs; too long a window is slow to reflect incidents.
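To see what reset handling means for these three functions, here is a simplified Python model. It is a sketch, not the Prometheus implementation: the real rate() and increase() also extrapolate to the edges of the window, which is omitted here, so the numbers differ slightly from what Prometheus returns.

```python
def increase(samples):
    """Reset-aware increase between the first and last sample.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset to zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

def rate(samples):
    """Average per-second increase over the time the samples cover."""
    if len(samples) < 2:
        return None  # two samples are the minimum for a rate
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

def irate(samples):
    """Per-second rate from the last two samples only."""
    if len(samples) < 2:
        return None
    return rate(samples[-2:])

# 30s scrapes; the process restarts between t=60 and t=90.
window = [(0, 100), (30, 130), (60, 160), (90, 10), (120, 40)]
print(increase(window), rate(window), irate(window))
```

The restart does not produce a negative rate because the post-reset value is counted as new growth from zero, and irate reacts only to the final pair of samples.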
Histograms: histogram_quantile
# p99 latency by job (histogram buckets required)
histogram_quantile(
0.99,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# p50
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
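Classic histogram buckets are cumulative: each le bucket counts every observation at or below its bound, so the le label has to survive the sum by () clause. Mixing histograms with incompatible bucket boundaries across jobs breaks quantiles; standardize instrumentation or aggregate at the service level with recording rules.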
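The estimate behind histogram_quantile() is linear interpolation inside the bucket that contains the target rank. The Python sketch below shows that idea for classic buckets under stated assumptions: bucket bounds are positive (as with latencies), q is in (0, 1], and edge cases the real function handles, such as negative bounds or native histograms, are left out.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), including math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan  # the +Inf bucket is required
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # Rank falls past the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, latency))
```

Because the answer is interpolated, its accuracy is bounded by the bucket width around the quantile, which is why bucket layout matters more than the choice of q.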
Gauges and time
avg_over_time(process_resident_memory_bytes[1h])
max_over_time(node_memory_MemAvailable_bytes[24h])
predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) # predict 4h ahead
time() - process_start_time_seconds # uptime-style
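predict_linear() fits a straight line through the samples in the window and extends it forward. A minimal Python sketch of that least-squares fit follows; the eval_time parameter and the free-space numbers are assumptions for illustration, and the sketch expects at least two samples with distinct timestamps.

```python
def predict_linear(samples, seconds_ahead, eval_time):
    """Least-squares line through (timestamp, value) samples, extended forward.

    Returns the fitted value at eval_time + seconds_ahead.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (eval_time + seconds_ahead) + intercept

# Hypothetical disk losing 1 GiB of free space per hour over a 6h window.
free_gib = [(hour * 3600, 100.0 - hour) for hour in range(7)]
in_four_hours = predict_linear(free_gib, 4 * 3600, eval_time=6 * 3600)
print(round(in_four_hours, 3))
```

A disk-full alert then compares the predicted value with zero, so a longer window smooths out short bursts at the cost of reacting later to a real trend change.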
Sorting, limits, and labels
sort_desc(sum(rate(http_requests_total[5m])) by (instance))
limitk(10, sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))) # experimental; needs --enable-feature=promql-experimental-functions on versions that ship it
label_replace(up, "env", "$1", "kubernetes_namespace", "(.*)")
abs(delta(some_gauge[1h]))
Function groups (reference)
| Category | Examples | Typical use |
|---|---|---|
| Rate / delta | rate, irate, increase, delta, idelta | Counters and occasional gauge changes |
| Aggregations over time | avg_over_time, max_over_time, sum_over_time, … | SLI windows, burn rates |
| Math / rounding | abs, ceil, floor, round, clamp_min, clamp_max | Display hygiene, thresholds |
| Histogram / summary | histogram_quantile | Latency percentiles |
| Prediction | predict_linear, deriv | Disk-full warnings |
| Label manipulation | label_replace, label_join | Align labels for joins |
| Trigonometry / logs | sin, ln, … | Rare in ops; exists for completeness |
Subqueries
A subquery evaluates an instant-vector expression repeatedly, at a fixed resolution over a range, and hands the results to an outer function as a range vector. That makes rolling statistics possible without recording rules:
# Max 5m rate over the last hour, evaluated every 1m
max_over_time(
rate(http_requests_total[5m])[1h:1m]
)
# Syntax: <instant_expr>[range:resolution]
Subqueries are powerful and expensive. Prefer recording rules for hot expressions reused in many dashboards and alerts.
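The cost of a subquery is easier to see once you count the inner evaluations. The Python sketch below models max_over_time(<inner>[range:step]) with an arbitrary inner function of time. Two things are assumptions here: inner evaluation times are aligned to multiples of the step, and the window excludes its left boundary, which has differed between Prometheus versions.

```python
def subquery_times(eval_time, range_s, step_s):
    """Inner evaluation timestamps for <expr>[range:step] at eval_time."""
    window_start = eval_time - range_s
    first = window_start // step_s * step_s  # align to multiples of step
    if first <= window_start:
        first += step_s  # assumption: left boundary is excluded
    return list(range(first, eval_time + 1, step_s))

def subquery_max(inner, eval_time, range_s, step_s):
    """max_over_time(<inner>[range:step]) for an inner function of time."""
    values = [inner(t) for t in subquery_times(eval_time, range_s, step_s)]
    values = [v for v in values if v is not None]
    return max(values) if values else None

# Hypothetical inner results, e.g. a 5m rate evaluated once per minute.
inner_results = {60: 1.0, 120: 5.0, 180: 2.0}
print(subquery_times(190, 180, 60))
print(subquery_max(inner_results.get, 190, 180, 60))
```

A [1h:1m] subquery therefore runs the inner expression about 60 times per outer evaluation, and a range query repeats that at every step, which is the expense a recording rule avoids.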
Recording rules and alerting rules
Rule files, loaded through the rule_files setting in prometheus.yml, hold two rule types that share PromQL syntax but serve different goals:
- Recording rules: precompute a new time series (e.g. job:http_requests:rate5m) to speed up queries and standardize naming.
- Alerting rules: when an expression is true (optionally held for a for: 5m period), send alerts to Alertmanager with labels and annotations.
groups:
  - name: api-recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
          > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High 5xx rate on {{ $labels.job }}"
Annotations support {{ $labels.x }} and {{ $value }} templating. Keep alert expressions readable—compose recording rules for numerators and denominators. Pair alerts with runbooks and the calm-incident mindset in incident and disaster response.
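The effect of the for clause is a small state machine: an alert is pending while its expression has been true for less than the hold period, and any false evaluation resets it. The Python sketch below models that for a single alert series; it is a simplification that ignores options such as keep_firing_for and treats the evaluation timestamps as given.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, is_true) evaluations to inactive/pending/firing."""
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None  # one false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Evaluated every 60s with for: 2m.
history = [(0, True), (60, True), (120, True), (180, False), (240, True)]
print(alert_states(history, for_seconds=120))
```

This is why a flapping expression with a long for period may never page: each gap sends it back to the start of the pending phase.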
Patterns every SRE should recognize
RED method (requests)
- Rate: sum(rate(http_requests_total[5m])) by (job)
- Errors: 5xx or business-error labels over total rate
- Duration: histogram quantiles as above
USE method (resources)
- Utilization — CPU/memory/disk busy %
- Saturation — run queue length, throttling, OOM pressure
- Errors — device/interface error counters
Availability and burn rate (sketch)
# SLI: successful requests / all requests (5m window)
sum(rate(http_requests_total{status!~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
# Multi-window burn alerts (Google SRE book style) combine short+long windows
# — implement via recording rules + two alert thresholds, not one magic query.
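The arithmetic behind a multi-window burn alert is short enough to sketch in Python. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO), and the alert requires both windows to exceed the threshold. The 99.9% SLO and the 14.4 threshold below are illustrative values, not something this article prescribes; in practice the two ratios would come from recording rules over a short and a long window.

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo)

def should_page(short_ratio, long_ratio, slo, threshold):
    """Fire only if both the short and the long window exceed the threshold."""
    return (burn_rate(short_ratio, slo) > threshold
            and burn_rate(long_ratio, slo) > threshold)

SLO = 0.999        # illustrative availability target
THRESHOLD = 14.4   # illustrative fast-burn threshold

# 2% errors in both windows: budget burning roughly 20x too fast.
print(should_page(0.02, 0.02, SLO, THRESHOLD))
# Short spike only: the long window is still near 5x, so no page.
print(should_page(0.02, 0.005, SLO, THRESHOLD))
```

Requiring both windows keeps the alert quick to fire on a sustained problem and quick to clear once the short window recovers.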
Kubernetes snippets
# Pods not ready
kube_pod_status_ready{condition="true"} == 0
# Container restarts (counter)
increase(kube_pod_container_status_restarts_total[1h]) > 3
# CPU throttling (CFS)
rate(container_cpu_cfs_throttled_seconds_total[5m])
Exact metric names depend on your kube-prometheus version and whether you use cAdvisor vs kubelet cadvisor endpoints—verify in Status → Targets and the metric browser before copying dashboards from the internet.
Grafana and the query API
Grafana’s Prometheus data source sends PromQL to the server. Practical tips:
- Use Min step aligned to scrape interval to avoid redundant points.
- Prefer $__rate_interval (or Grafana’s auto range) over hard-coded [5m] when panels span different time ranges.
- Legend templates: {{job}} - {{handler}}, but hide high-cardinality labels in legends for cluster views.
- Use instant queries for “Table” panels and range queries for “Time series” panels.
Raw HTTP API (for automation and debugging):
# Instant
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=up'
# Range
curl -G 'http://localhost:9090/api/v1/query_range' \
--data-urlencode 'query=rate(http_requests_total[5m])' \
--data-urlencode 'start=2026-05-22T00:00:00Z' \
--data-urlencode 'end=2026-05-22T01:00:00Z' \
--data-urlencode 'step=60'
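For automation, the same range request can be assembled and its response unpacked with the Python standard library. This sketch only builds the URL and parses an already decoded response, so it runs without a Prometheus server; the sample response is hand-written for illustration and follows the matrix layout the API returns, where each value is a [timestamp, "string"] pair.

```python
from urllib.parse import urlencode

def query_range_url(base, query, start, end, step):
    """Build a query_range URL with the PromQL expression URL-encoded."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"

def extract_series(response):
    """Pull (labels, [(timestamp, value), ...]) pairs out of a decoded response."""
    if response.get("status") != "success":
        raise ValueError(response.get("error", "query failed"))
    return [
        (item["metric"], [(float(ts), float(v)) for ts, v in item["values"]])
        for item in response["data"]["result"]
    ]

url = query_range_url(
    "http://localhost:9090",
    "rate(http_requests_total[5m])",
    "2026-05-22T00:00:00Z",
    "2026-05-22T01:00:00Z",
    60,
)

# Hand-written example payload; sample values arrive as strings.
sample_response = {
    "status": "success",
    "data": {"resultType": "matrix", "result": [
        {"metric": {"job": "api"}, "values": [[1779408000, "1.5"], [1779408060, "2"]]},
    ]},
}
print(url)
print(extract_series(sample_response))
```

Fetching the URL with urllib.request.urlopen and decoding the body with json.loads would yield a dictionary of the same shape as sample_response.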
Cardinality, performance, and correctness
- High cardinality: unbounded label values (URL paths, user IDs) slow queries and increase memory. Use relabeling to drop or aggregate labels at scrape time.
- rate() on gauges: meaningless spikes; use deriv() or the _over_time functions instead.
- Counter resets: rate() handles resets; still watch deployments that reset often in short windows.
- Stale markers: when a target or series goes away, Prometheus writes a stale marker and the series drops out of instant results right away; without a marker, the last sample keeps being returned for up to the lookback delta (5 minutes by default). Range selectors return only the raw samples inside the window and do not interpolate.
- Lookback vs scrape: evaluation needs data within the lookback window; missed scrapes create gaps in rate().
- Duplicate labels: the same metric and labels exposed by two targets are double-counted in aggregations unless you sum them intentionally.
Validate with promtool check rules and promtool query instant in CI for rule files—same discipline as testing Terraform plans in Terraform IaC for everyone.
Troubleshooting “no data” and wrong numbers
- Metric exists? Prometheus UI → Graph → start typing the metric name; check /api/v1/label/__name__/values.
- Labels correct? Inspect one series in the UI; compare to your selector.
- Scrape healthy? Status → Targets; fix TLS, auth, or network policy first.
- Range window: widen [5m] to [15m] for sparse scrapes.
- Vector matching: simplify by computing the left and right sides as separate recording rules, then divide.
- Timezone / step: check whether Grafana is showing UTC or local time; a very large step can skip over short spikes.
For cluster-level “something is wrong but which Pod?” workflows, combine PromQL with kubectl steps in Kubernetes troubleshooting playbook.
Production checklist
- Standardize metric names and label conventions per service (job, instance, service, team).
- Recording rules for any expression appearing in more than one dashboard or alert.
- Alert labels route to the right on-call rotation; annotations link to runbooks.
- Dashboards show RED/USE (or your framework) per tier-1 service, not only infrastructure CPU.
- Review cardinality after each new exporter or auto-instrumentation library.
- Document “golden queries” in repo docs next to Helm values: GitOps for observability, like GitOps principles for apps.
Hands-on: minimal lab
# Docker Prometheus with self-scrape
docker run -d --name prom -p 9090:9090 prom/prometheus:v2.52.0
# Open http://localhost:9090
# Try in the Graph tab:
up
rate(prometheus_http_requests_total[5m])
histogram_quantile(0.9, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket[5m])))
Install node_exporter or sample-app metrics next, then practice sum by () and alert rules on disk-full predictions with predict_linear.
Further reading
- Prometheus querying basics
- Operators — vector matching, aggregations
- Functions — complete reference
- Metric and label naming
- Grafana Prometheus data source
- Recording rules