Agent Reliability Scorecard
A public, self-serve rubric for assessing whether an agentic AI system is ready for production. We score across 8 dimensions, each 0–5, with weighted overall. Use it yourself; or have us run it on your stack as a paid 2-week audit.
This is the same scorecard Quabyt engineers use on Reliability Audit engagements. It is intentionally open so any technical leader can self-assess.
Scoring scale
| Score | Meaning |
|---|---|
| 0 | Absent. No artifact, no practice. |
| 1 | Ad-hoc. Tried once; not maintained. |
| 2 | Inconsistent. Exists for some flows, not others. |
| 3 | Adequate. Covers the happy path and the obvious edge cases. |
| 4 | Mature. Versioned, automated, owned, audited. |
| 5 | Exemplary. Continuously improved; failure modes anticipated, not discovered. |
A score of 3 across all dimensions is the production bar. Below that, the system has unaddressed failure modes that will surface under real traffic.
The 8 dimensions
1. Eval coverage (weight: 20%)
How comprehensively is agent behavior measured?
- Versioned golden eval set (regression suite) in source control
- Adversarial set targeting known failure modes (jailbreaks, PII leakage, tool misuse)
- Per-step evals for multi-step agents, not just end-to-end
- LLM-judge scorers with a human-calibrated agreement rate
- Production sampling feeds back into the dataset
2. Regression detection (weight: 15%)
Will a bad prompt or model change be caught before users see it?
- Evals run in CI on every PR touching prompts, tools, or models
- Merge blocked on regression beyond a defined threshold
- Eval results trended over time, not just pass/fail
- Model version pinned; upgrades go through eval comparison
3. Guardrails (weight: 15%)
Are inputs and outputs validated, and are tool calls bounded?
- Input: PII detection, prompt-injection detection, schema validation
- Output: schema validation, hallucination check (where verifiable), toxicity/safety
- Tool-use policy: allowed tools per agent role, parameter validation, rate limits
- Human-in-the-loop checkpoints on high-stakes actions
- Every guardrail trip is logged and reviewable
4. Observability (weight: 15%)
Can you debug a failed conversation from a user report?
- Every model call, tool call, and guardrail event traced
- OpenTelemetry
gen_ai.*semantic conventions used - Traces correlate to user/session/tenant IDs
- Retention long enough to investigate (≥ 30 days production traffic)
- Single pane of glass for AI traces + cloud telemetry
5. Cost controls (weight: 10%)
Will the bill be predictable, and can you find leaks?
- Per-request token + dollar accounting
- Cost attributed to tenant / feature / model
- Per-request budget enforced (soft or hard)
- Model routing: cheap model first, escalate on confidence
- Prompt caching where applicable; semantic cache where it pays off
6. Failure-mode mapping (weight: 10%)
Has the team identified what can go wrong, in writing?
- Documented failure-mode list (hallucination, tool failure, infinite loop, cost runaway, region outage, etc.)
- Each failure mode has a detection mechanism
- Each failure mode has a mitigation or playbook
- Post-incident reviews update the list
7. Security & data handling (weight: 10%)
Would this pass a security review tomorrow?
- No PII or secrets in prompts unless redacted
- Least-privilege IAM for tools; no agent has broader access than needed
- Audit log of every action an agent takes on user data
- Prompt injection treated as a real threat (input filtering, output sandboxing)
- Model provider data-handling configured (no-training flags, region pinning)
8. Deployment & operations (weight: 5%)
Can the team ship and operate safely?
- Infrastructure as code; prompts and tool configs version-controlled
- Feature flags or staged rollouts for new agent behavior
- On-call runbook for AI-specific incidents (hallucination spike, latency, cost)
- Rollback plan that doesn’t require redeploying the world
Overall scoring
overall = sum(dimension_score * weight)| Overall | Verdict |
|---|---|
| < 2.0 | Not production-ready. High risk of incident; do not scale traffic. |
| 2.0 – 2.9 | Pilot-quality only. Visible problems for real users imminent. |
| 3.0 – 3.9 | Production-acceptable for non-critical workloads. Known gaps; have a remediation plan. |
| 4.0 – 4.5 | Mature. Suitable for critical workloads with monitoring. |
| > 4.5 | Industry leading. Rare. |
Common findings (what we usually score low)
In ~80% of Quabyt audits, the lowest two dimensions are regression detection and failure-mode mapping. Teams have some evals (so eval coverage scores 2–3) but no CI gate (regression scores 0–1) and no written failure-mode list (mapping scores 0–1). The fix order on hardening engagements is almost always:
- Lock the prompt/tool/model versions
- Wire evals into CI
- Write the failure-mode doc
- Layer in guardrails
- Instrument tracing
- Bring cost controls in
- Document the runbook
This sequence works because each step makes the next safer to do.
Want this run on your system?
A Quabyt Reliability Audit takes 2 weeks, fixed price, and produces this scorecard plus a prioritized remediation roadmap. Contact us or read the full Agent Reliability Platform page.