Observability for LLM systems isn't APM.

Prelude

Application performance monitoring tells you whether your code ran. It tells you almost nothing about whether your AI system worked. A request that completes successfully (within SLO, no errors) can still be a quiet, expensive failure. The observability stack for an LLM system has to answer a different set of questions.

Token economics, on the dashboard.

You should be able to answer, at any moment, three questions: what did we spend in the last hour, which tenant drove the spend, and which task category drove the spend. If your observability stack doesn't surface those three views, you don't have token economics. You have a finance ticket waiting to happen.

Tokens in / tokens out / cost per request, tagged with tenant, task, model class, and concrete model.
Rolling cost per tenant with a per-tenant ceiling, alerted on at 80% of budget.
Cost by task category: synthesis vs extraction vs tool-use. The cost shape per category is information.
Cost regression alerts when a deploy or prompt change drives per-request cost up by more than a threshold.
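
As a rough illustration of the first two items, here is a minimal Python sketch of tagged per-request cost and a per-tenant ceiling alert. The price table, model names, and the Usage and SpendTracker types are all hypothetical, and the sketch keeps running totals rather than a true rolling window; it assumes the totals would be reset or windowed by whatever metrics backend you already use.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICES_PER_1K = {
    "small-model": {"in": 0.0005, "out": 0.0015},
    "large-model": {"in": 0.01, "out": 0.03},
}


@dataclass(frozen=True)
class Usage:
    tenant: str
    task: str         # e.g. "synthesis", "extraction", "tool-use"
    model_class: str  # e.g. "frontier", "small"
    model: str        # concrete model name, a key of PRICES_PER_1K
    tokens_in: int
    tokens_out: int


def request_cost(u: Usage) -> float:
    """Cost of one request in USD, from token counts and the price table."""
    price = PRICES_PER_1K[u.model]
    return u.tokens_in / 1000 * price["in"] + u.tokens_out / 1000 * price["out"]


class SpendTracker:
    """Running spend per tenant and per task (no time window in this sketch)."""

    def __init__(self, budgets, alert_fraction=0.8):
        self.budgets = budgets              # tenant -> budget in USD
        self.alert_fraction = alert_fraction
        self.by_tenant = defaultdict(float)
        self.by_task = defaultdict(float)

    def record(self, u: Usage) -> float:
        cost = request_cost(u)
        self.by_tenant[u.tenant] += cost
        self.by_task[u.task] += cost
        return cost

    def tenants_over_alert(self):
        """Tenants whose spend has reached alert_fraction of their budget."""
        return sorted(
            tenant
            for tenant, spent in self.by_tenant.items()
            if tenant in self.budgets
            and spent >= self.alert_fraction * self.budgets[tenant]
        )
```

Tagging every record with tenant, task, model class, and model at write time is the point: the three dashboard views are then just different group-bys over the same records.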

Quality signals you actually trust.

Quality is what APM can't tell you. The model returned a response. Was it the right one? You answer that through eval harnesses running continuously in production, not just in CI.

Shadow evals on production traffic. A sampled subset of real requests runs against the eval harness in the background. Scores trended over time. Regressions visible before customers notice.
Per-task quality scores tracked over time, alerted on for sustained drops.
User-signal correlations: thumbs down, retries, abandoned chats, joined back to the requests that produced them.
Per-model quality. If you route across models, you need to know which model is producing which quality, not blended numbers.
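
One way to sketch the sampling and per-model bookkeeping in Python is shown below. The function and class names are hypothetical, and the scoring itself is left out: the sketch assumes some eval harness produces a numeric score per sampled request. Hashing the request ID makes the sampling decision deterministic, so a replayed request lands in the same bucket.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def in_shadow_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


class ShadowScores:
    """Collects eval scores keyed by (task, model) so numbers are never blended."""

    def __init__(self):
        self._scores = defaultdict(list)

    def add(self, task: str, model: str, score: float) -> None:
        self._scores[(task, model)].append(score)

    def means(self):
        """Mean score per (task, model) pair."""
        return {key: mean(values) for key, values in self._scores.items()}
```

Keying on the (task, model) pair rather than on task alone is what keeps a weak model from hiding behind a strong one in the average.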

The most useful metric we've added

The most useful single metric we've added to LLM dashboards is the answer-faithfulness score on RAG output, sampled in production. It catches retrieval regressions days before user complaints arrive.

Tool-use failures, surfaced explicitly.

Agentic systems fail in shapes APM can't see. A model called a tool that doesn't exist. A model called a tool with malformed arguments. A model decided not to call a tool when it should have. None of these throw an exception. All of them matter.

We track:

Tool-call invalid argument rate, by tool. Trended.
Tool-not-found rate. Should be zero. If it isn't, your prompt or your tools are drifting.
Time-in-tool per call. Long durations indicate either slow integrations or runaway loops.
Loop detection: repeated identical tool calls in a single conversation. The agent equivalent of a stack overflow.
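
A minimal Python sketch of how the not-found, invalid-argument, and loop signals could be derived from logged tool calls. The TOOLS registry, the tool names, and the threshold of three repeats are assumptions for illustration; a real system would validate against its own tool schemas. Arguments are compared as raw strings here, so two calls with the same JSON in a different key order would not count as identical.

```python
import json
from collections import Counter

# Hypothetical registry: tool name -> required argument names.
TOOLS = {"search": {"query"}, "get_invoice": {"invoice_id"}}


def classify_call(name: str, raw_args: str) -> str:
    """Return 'ok', 'tool_not_found', or 'invalid_arguments' for one call."""
    if name not in TOOLS:
        return "tool_not_found"
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return "invalid_arguments"
    # Arguments must be a JSON object containing every required key.
    if not isinstance(args, dict) or not TOOLS[name].issubset(args):
        return "invalid_arguments"
    return "ok"


def repeated_calls(calls, threshold=3):
    """Calls that occur at least `threshold` times in one conversation.

    `calls` is a list of (name, raw_args) tuples in conversation order.
    """
    counts = Counter(calls)
    return {call: n for call, n in counts.items() if n >= threshold}
```

None of these paths raises an exception in the serving code, which is why they have to be counted explicitly rather than picked up from error rates.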

Drift detection.

Prompt drift, model drift, data drift. None of them announces itself, and that silence is the slowest, most expensive failure mode of an AI system. Quality slides by 5% a week and nobody notices until customer complaints accumulate.

The instruments we ship:

Weekly eval baseline comparisons. Production eval scores from this week vs last. Drops flagged in a Friday digest.
Output distribution diffs: token length, response shape, refusal rate. These are surprisingly good leading indicators of quality regression.
Retrieval recall trend for RAG systems. If your retriever is getting worse, your model can't compensate forever.
Tenant cohort comparison. Are new tenants performing worse than legacy ones? That's usually a corpus, not a model, problem.
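
The first two instruments can be sketched in a few lines of Python. The function names, the record fields, and the 5% relative-drop threshold are assumptions for illustration, not a description of any particular pipeline; the inputs are assumed to be per-task mean eval scores and logged responses that already carry a token count and a refusal flag.

```python
from statistics import mean


def flag_drops(last_week, this_week, max_relative_drop=0.05):
    """Return {task: relative_drop} for tasks whose score fell past the threshold.

    Both arguments map task name -> mean eval score for that week.
    """
    flagged = {}
    for task, baseline in last_week.items():
        current = this_week.get(task)
        if current is None or baseline <= 0:
            continue  # no comparable data for this task
        drop = (baseline - current) / baseline
        if drop > max_relative_drop:
            flagged[task] = drop
    return flagged


def output_profile(responses):
    """Summarize output shape for a non-empty batch of responses.

    Each response is a dict with 'tokens_out' (int) and 'refused' (bool).
    """
    return {
        "mean_tokens_out": mean(r["tokens_out"] for r in responses),
        "refusal_rate": sum(r["refused"] for r in responses) / len(responses),
    }
```

Comparing this week's output_profile against last week's gives a cheap diff that needs no eval scores at all, which is part of why it tends to move first.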

We caught a 9% quality regression a week before the customer would have noticed it. The shadow eval on production traffic paid for itself within the first month.

Closing.

Observability for AI systems is its own discipline. It overlaps with APM but isn't replaced by it. If you ship LLM systems and your dashboard is mostly about HTTP, you're not seeing what's actually happening. You're seeing what the framework can tell you.