Monitor metrics that reflect quality, reliability, cost, and safety. For production, the top-level metrics are: p50/p95 latency, success rate, cost per request, and user satisfaction signals (thumbs-up/down, follow-up rate). For RAG systems, you must also monitor retrieval quality (hit rate, similarity distributions) because many “model failures” are actually retrieval failures. The goal is to quickly answer: “Are users getting correct answers efficiently, and is the system stable?”
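As a rough illustration, the top-level numbers can be computed straight from per-request logs. The field names below (latency_ms, success, cost_usd, thumbs_up) are assumptions for the sketch, not a fixed schema:

```python
# Hypothetical per-request log records; field names are illustrative.
requests = [
    {"latency_ms": 420,  "success": True,  "cost_usd": 0.0031, "thumbs_up": True},
    {"latency_ms": 1850, "success": True,  "cost_usd": 0.0120, "thumbs_up": None},
    {"latency_ms": 950,  "success": False, "cost_usd": 0.0045, "thumbs_up": False},
]

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a sorted list (simplified; no interpolation)."""
    idx = min(len(sorted_vals) - 1, int(round(p * (len(sorted_vals) - 1))))
    return sorted_vals[idx]

latencies = sorted(r["latency_ms"] for r in requests)
p50 = percentile(latencies, 0.50)
p95 = percentile(latencies, 0.95)
success_rate = sum(r["success"] for r in requests) / len(requests)
cost_per_request = sum(r["cost_usd"] for r in requests) / len(requests)
rated = [r for r in requests if r["thumbs_up"] is not None]
satisfaction = sum(r["thumbs_up"] for r in rated) / len(rated) if rated else None

print(f"p50={p50}ms p95={p95}ms success={success_rate:.1%} "
      f"cost/req=${cost_per_request:.4f} satisfaction={satisfaction}")
```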
A practical metrics dashboard, implementable with standard observability tools (a short code sketch follows each group below):
Core model metrics
Latency: p50/p95 end-to-end, plus model time vs retrieval time
Tokens: input tokens, output tokens, total tokens per endpoint
Cost: cost per request, cost per successful resolution
Error rate: API errors, tool errors, timeouts, retries
Streaming health: disconnect rate, average time-to-first-token
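A minimal instrumentation sketch for the core model metrics, assuming the prometheus_client library; the metric and label names are illustrative choices, not a standard:

```python
from prometheus_client import Counter, Histogram

# Metric names, labels, and buckets are assumptions; adapt to your stack.
REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds", "Per-phase request latency",
    ["endpoint", "phase"],  # phase: "retrieval" vs. "generation"
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10),
)
TOKENS = Counter("llm_tokens_total", "Token usage", ["endpoint", "direction"])
COST = Counter("llm_cost_usd_total", "Accumulated spend in USD", ["endpoint"])
ERRORS = Counter("llm_errors_total", "API/tool errors, timeouts, retries", ["endpoint", "kind"])
TTFT = Histogram("llm_time_to_first_token_seconds", "Streaming time-to-first-token", ["endpoint"])

def record_request(endpoint, retrieval_s, generation_s, in_tokens, out_tokens, cost_usd, ttft_s):
    """Record one completed request; call ERRORS.labels(endpoint, kind).inc() on failures."""
    REQUEST_LATENCY.labels(endpoint, "retrieval").observe(retrieval_s)
    REQUEST_LATENCY.labels(endpoint, "generation").observe(generation_s)
    TOKENS.labels(endpoint, "input").inc(in_tokens)
    TOKENS.labels(endpoint, "output").inc(out_tokens)
    COST.labels(endpoint).inc(cost_usd)
    TTFT.labels(endpoint).observe(ttft_s)
```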
RAG metrics (if applicable)
Retrieval hit rate: % of queries where the top-k results include the correct source
Top-k similarity: distribution of top-1/top-5 scores (for drift detection)
Filter coverage: how often version/language/tenant filters are actually applied
Citation compliance: % answers that cite only retrieved chunk IDs
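A sketch of the RAG-side calculations, assuming each query log records an expected source (from an eval set or labeled feedback), the retrieved chunk IDs, and their similarity scores:

```python
import statistics

# Hypothetical retrieval logs; "expected_source" would come from an eval set or labels.
retrieval_logs = [
    {"expected_source": "doc_42", "top_k_ids": ["doc_42", "doc_7", "doc_9"], "scores": [0.91, 0.74, 0.70]},
    {"expected_source": "doc_13", "top_k_ids": ["doc_8", "doc_13", "doc_2"], "scores": [0.66, 0.64, 0.61]},
]

hit_rate = sum(q["expected_source"] in q["top_k_ids"] for q in retrieval_logs) / len(retrieval_logs)
top1_scores = [q["scores"][0] for q in retrieval_logs]

print(f"hit@k={hit_rate:.1%}")
print(f"top-1 similarity: mean={statistics.mean(top1_scores):.2f}, "
      f"stdev={statistics.pstdev(top1_scores):.2f}")  # track these over time to spot drift
```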
Quality and safety signals
Escalation rate: share of “I don’t know” responses vs. confident answers
User correction rate: follow-up messages indicating the previous answer was wrong
Security flags: secrets detected, policy violations, cross-tenant attempts
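Quality signals usually need labeled feedback or a classifier; the sketch below uses crude phrase-matching heuristics purely to show where the counters would hook in:

```python
# Heuristics for illustration only; production systems typically use labels or a classifier.
ESCALATION_PHRASES = ("i don't know", "i'm not sure", "please contact support")
CORRECTION_PHRASES = ("that's wrong", "not what i asked", "incorrect")

def is_escalation(answer: str) -> bool:
    return any(p in answer.lower() for p in ESCALATION_PHRASES)

def is_user_correction(follow_up: str) -> bool:
    return any(p in follow_up.lower() for p in CORRECTION_PHRASES)

conversations = [
    {"answer": "I don't know, please contact support.", "follow_up": None},
    {"answer": "Set the index type to HNSW.", "follow_up": "That's wrong, I'm using a FLAT index."},
]

escalation_rate = sum(is_escalation(c["answer"]) for c in conversations) / len(conversations)
correction_rate = sum(bool(c["follow_up"]) and is_user_correction(c["follow_up"])
                      for c in conversations) / len(conversations)
print(f"escalation={escalation_rate:.1%} correction={correction_rate:.1%}")
```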
Set SLOs per endpoint (FAQ vs deep agent tasks) so you don’t over-optimize one path at the expense of another.
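One lightweight way to encode per-endpoint SLOs is a plain config checked by the monitoring job; the endpoints and thresholds below are placeholders, not recommendations:

```python
# Illustrative per-endpoint SLO targets; tune the numbers to your own traffic.
SLOS = {
    "faq":        {"p95_latency_s": 2.0,  "success_rate": 0.99, "cost_per_request_usd": 0.01},
    "agent_task": {"p95_latency_s": 30.0, "success_rate": 0.95, "cost_per_request_usd": 0.25},
}

def check_slo(endpoint: str, observed: dict) -> list[str]:
    """Return human-readable SLO violations for one endpoint's observed metrics."""
    target = SLOS[endpoint]
    violations = []
    if observed["p95_latency_s"] > target["p95_latency_s"]:
        violations.append(f"{endpoint}: p95 latency {observed['p95_latency_s']}s exceeds {target['p95_latency_s']}s")
    if observed["success_rate"] < target["success_rate"]:
        violations.append(f"{endpoint}: success rate {observed['success_rate']:.1%} below {target['success_rate']:.1%}")
    if observed["cost_per_request_usd"] > target["cost_per_request_usd"]:
        violations.append(f"{endpoint}: cost/request ${observed['cost_per_request_usd']:.3f} over budget")
    return violations

print(check_slo("faq", {"p95_latency_s": 2.6, "success_rate": 0.992, "cost_per_request_usd": 0.008}))
```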
If you use Milvus or Zilliz Cloud, add database-level metrics: query latency, index health, and filter selectivity. Then correlate retrieval metrics with answer quality: when satisfaction drops, you can often tell whether retrieval drifted (bad chunks, wrong version) or generation drifted (formatting/citation violations). This makes production monitoring actionable: you can fix the right subsystem quickly instead of guessing.
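A sketch of logging retrieval-side metrics around a Milvus query so they can be correlated with answer quality; the collection name, fields, filter expression, and URI are assumptions about your setup:

```python
import time
from pymilvus import MilvusClient

# Assumed setup: a "docs" collection with a "version" and "source_id" field.
client = MilvusClient(uri="http://localhost:19530")

def search_with_metrics(query_vec, version: str):
    start = time.perf_counter()
    results = client.search(
        collection_name="docs",
        data=[query_vec],
        limit=5,
        filter=f'version == "{version}"',   # also log how often such filters are applied
        output_fields=["source_id"],
    )
    latency_ms = (time.perf_counter() - start) * 1000
    hits = results[0]
    top1_score = hits[0]["distance"] if hits else None
    # Emit these alongside answer-quality signals so a satisfaction drop can be
    # attributed to retrieval (low scores, slow queries) or to generation.
    print(f"retrieval_latency_ms={latency_ms:.1f} top1_score={top1_score} filter_version={version}")
    return hits
```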