Observability for autonomous systems

Your agent already knows what's wrong.
Let it tell you.

Sermon is a lightweight observability daemon built for operators who already work with AI agents. Your agent drives; Sermon shows it what's happening.

quick_start.sh
# Install the daemon on each server
curl -fsSL https://sermon.fyi/install.sh | bash -s -- "$TOKEN"

# Or start it manually after enrollment
sermon agent start

System architecture

Three components: a lightweight collector on each box, a central aggregator, and a protocol that any agent can call.

01 / Ingestion [daemon]
[SERVER] ──(logs)──> +────────+
[POD-1]  ──(otel)──> | SERMON |
[SYS]    ──(proc)──> +────────+

Omnivorous daemon

Runs as a ~2 MB Zig binary on every server. It reads /proc, tails journalctl, and pushes metric samples over a single authenticated endpoint. No runtime. No dependencies. Just the binary.
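The shape of that push can be sketched as follows. The payload fields, endpoint name, and bearer-token header are assumptions for illustration, not Sermon's documented wire format:

```python
import json
import time

# Hypothetical sample shape; field names are illustrative only.
def make_sample(host: str, cpu_percent: float, mem_percent: float) -> dict:
    return {
        "host": host,
        "ts": int(time.time()),
        "cpu_percent": cpu_percent,
        "mem_percent": mem_percent,
    }

# One authenticated endpoint: assumed to be a JSON POST with a bearer token.
def ingest_request(token: str, samples: list) -> tuple:
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"samples": samples}).encode()
    return headers, body

headers, body = ingest_request("sermon-token", [make_sample("aniara-1", 42.0, 63.5)])
print(headers["Authorization"])  # Bearer sermon-token
```

Because the daemon batches samples into one authenticated request, the server side only needs to validate a single token per box.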

02 / Storage [duckdb + postgres]
+────────+ ←── Vectorized
| DUCKDB |     execution
+────────+ ←── Local storage

Embedded analytics

Each daemon stores full-resolution telemetry locally in an embedded DuckDB with 7-day retention. Central Postgres aggregates across the fleet for the dashboard and long-term queries.
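The 7-day retention amounts to a windowed delete over the local store. A minimal sketch, using sqlite3 as a runnable stand-in for the embedded DuckDB database and a hypothetical `samples` schema:

```python
import sqlite3
import time

# Sermon embeds DuckDB; sqlite3 stands in here so the sketch runs anywhere.
# The schema and retention SQL are assumptions, not Sermon's actual layout.
SEVEN_DAYS = 7 * 24 * 3600

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (host TEXT, ts INTEGER, cpu REAL)")

now = int(time.time())
con.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("aniara-1", now - 8 * 24 * 3600, 41.0),  # older than retention
        ("aniara-1", now - 3600, 83.2),           # within retention
    ],
)

# Prune everything outside the 7-day window.
con.execute("DELETE FROM samples WHERE ts < ?", (now - SEVEN_DAYS,))
kept = con.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(kept)  # 1
```

Running the delete periodically keeps each daemon's footprint bounded while the central Postgres holds the long-term aggregates.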

03 / Access [mcp / api]
+──────+   [MCP]  +───────+
|SERMON| <======> | AGENT |
+──────+  (CLI)   +───────+
[HUMAN]

Machine-first protocol

An MCP server exposes the fleet to any agent that speaks the protocol. Your Claude Code, Cursor, or custom client queries metrics, tails logs, and dispatches alerts without touching the UI.

design_philosophy.txt

Designed for agents.
Not humans.

Traditional observability tools assume a human is staring at a dashboard, trying to correlate colored lines on a graph. Sermon assumes an AI agent is polling an API.

We deliver structured, high-density context. When something breaks, your agent receives exactly the relevant logs and correlated metrics — ready to act, not just alert.

  • Investigates root causes: agents pull the last 15 minutes of metrics, correlate them with log lines, and surface likely causes.
  • Proposes remediations: Sermon never executes. Your agent executes, under your trust, with the evidence it needs.
  • Runs continuous watches: long-running agents register queries as alert rules, e.g. "cpu > 80 for 5m, fire with context."
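The watch rule in the last bullet can be sketched as a sustained-threshold check. The helper below is hypothetical, not Sermon's API; it only shows the evaluation logic:

```python
# Hypothetical evaluator for a "cpu > 80 for 5m" watch rule.
def sustained(samples, threshold=80.0, hold_s=300, step_s=1):
    """Return True once values stay above `threshold` for `hold_s` seconds,
    given one sample every `step_s` seconds."""
    run = 0
    for value in samples:
        run = run + step_s if value > threshold else 0
        if run >= hold_s:
            return True
    return False

# 300 one-second samples above the threshold fire the watch;
# a single dip below it resets the window.
print(sustained([85.0] * 300))           # True
print(sustained([85.0] * 299 + [50.0]))  # False
```

A reset-on-dip window like this is what distinguishes "fire after 5 sustained minutes" from a simple point-in-time threshold alert.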
agent_session_id: 9x4-aql

agent: Evaluating cluster state…

agent: Anomalous latency detected on aniara-1. Pulling correlated logs.

  mcp.tool: query_metrics
payload: { "server": "aniara-1", "since": "-15m", "fields": "cpu,mem" }

system: Returned 600 samples. cpu_percent crossed 80% at 13:57:06; sustained 5m.

agent: Correlating with nginx logs... found worker crash at 14:02:03. Proposing fix.

action proposed
Restart nginx Dismiss

Give your agents eyes.

Stop forcing LLMs to scrape dashboards. Give them the structured telemetry they need to act autonomously.