Agent Reliability Scorecard

A public, self-serve rubric for assessing whether an agentic AI system is ready for production. We score 8 dimensions from 0 to 5 each and combine them into a weighted overall score. Use it yourself, or have us run it on your stack as a paid 2-week audit.

This is the same scorecard Quabyt engineers use on Reliability Audit engagements. It is intentionally open so any technical leader can self-assess.

Scoring scale

Score	Meaning
0	Absent. No artifact, no practice.
1	Ad-hoc. Tried once; not maintained.
2	Inconsistent. Exists for some flows, not others.
3	Adequate. Covers the happy path and the obvious edge cases.
4	Mature. Versioned, automated, owned, audited.
5	Exemplary. Continuously improved; failure modes anticipated, not discovered.

A score of at least 3 in every dimension is the production bar. Below that, the system has unaddressed failure modes that will surface under real traffic.

The 8 dimensions

1. Eval coverage (weight: 20%)

How comprehensively is agent behavior measured?

Versioned golden eval set (regression suite) in source control
Adversarial set targeting known failure modes (jailbreaks, PII leakage, tool misuse)
Per-step evals for multi-step agents, not just end-to-end
LLM-judge scorers with a human-calibrated agreement rate
Production sampling feeds back into the dataset
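
One way to put a number on the "human-calibrated agreement rate" is to have humans label a sample that the LLM judge has also scored, then compare the two. The sketch below is illustrative only: the function names and the pass/fail label format are assumptions, not part of the scorecard. It reports raw agreement and Cohen's kappa, which corrects for agreement that would occur by chance.

```python
from collections import Counter

def agreement_rate(judge, human):
    """Fraction of samples where the LLM judge and the human label match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for eight sampled conversations.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

print(agreement_rate(judge_labels, human_labels))
print(round(cohens_kappa(judge_labels, human_labels), 3))
```

Raw agreement alone can look high when almost every sample passes, which is why a chance-corrected figure is worth tracking next to it.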

2. Regression detection (weight: 15%)

Will a bad prompt or model change be caught before users see it?

Evals run in CI on every PR touching prompts, tools, or models
Merge blocked on regression beyond a defined threshold
Eval results trended over time, not just pass/fail
Model version pinned; upgrades go through eval comparison
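
A merge gate of the kind described above can be quite small. The following is a minimal sketch, assuming eval results are available as a pass rate per suite; the suite names, the 0.02 threshold, and the function names are all hypothetical. A CI job would call gate() and use its return value as the process exit code.

```python
# Hypothetical threshold: maximum tolerated drop in pass rate (absolute).
MAX_DROP = 0.02

def find_regressions(baseline, candidate, max_drop=MAX_DROP):
    """Return {suite: (baseline_rate, candidate_rate)} for regressed suites."""
    regressions = {}
    for suite, base_rate in baseline.items():
        cand_rate = candidate.get(suite)
        # A suite that is missing from the candidate run counts as a regression.
        if cand_rate is None or base_rate - cand_rate > max_drop:
            regressions[suite] = (base_rate, cand_rate)
    return regressions

def gate(baseline, candidate):
    """Exit code for CI: 0 lets the merge proceed, 1 blocks it."""
    regressions = find_regressions(baseline, candidate)
    for suite, (base, cand) in regressions.items():
        print(f"REGRESSION {suite}: baseline={base} candidate={cand}")
    return 1 if regressions else 0

# Hypothetical pass rates from the main branch and from the PR under review.
baseline_rates = {"golden": 0.94, "adversarial": 0.88}
candidate_rates = {"golden": 0.95, "adversarial": 0.81}
exit_code = gate(baseline_rates, candidate_rates)
```

Storing the per-suite rates from every run, not only the exit code, is what makes the trend line in the third item possible.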

3. Guardrails (weight: 15%)

Are inputs and outputs validated, and are tool calls bounded?

Input: PII detection, prompt-injection detection, schema validation
Output: schema validation, hallucination check (where verifiable), toxicity/safety
Tool-use policy: allowed tools per agent role, parameter validation, rate limits
Human-in-the-loop checkpoints on high-stakes actions
Every guardrail trip is logged and reviewable
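
As an illustration of the tool-use policy and logging items, here is a minimal sketch of a policy check that runs before any tool call. The role names, tool names, the refund limit, and the ToolPolicy class itself are hypothetical; a real system would load these from versioned configuration.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    allowed: dict    # role -> set of tool names that role may call
    max_calls: dict  # tool name -> maximum calls per conversation
    calls_made: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def check(self, role, tool, params):
        """Return True if the call may proceed; log every denial for review."""
        reason = None
        if tool not in self.allowed.get(role, set()):
            reason = "tool not allowed for role"
        elif self.calls_made.get(tool, 0) >= self.max_calls.get(tool, 0):
            reason = "rate limit reached"
        elif tool == "refund" and not (0 < params.get("amount", 0) <= 100):
            # Hypothetical parameter rule: refunds are capped at 100.
            reason = "amount outside permitted range"
        if reason:
            self.audit_log.append({"role": role, "tool": tool, "reason": reason})
            return False
        self.calls_made[tool] = self.calls_made.get(tool, 0) + 1
        return True

policy = ToolPolicy(
    allowed={"support": {"lookup_order", "refund"}},
    max_calls={"lookup_order": 5, "refund": 1},
)
print(policy.check("support", "refund", {"amount": 500}))  # denied and logged
print(policy.check("support", "refund", {"amount": 50}))   # allowed
```

A denied high-stakes call is also a natural place to hand off to a human reviewer instead of simply failing.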

4. Observability (weight: 15%)

Can you debug a failed conversation from a user report?

Every model call, tool call, and guardrail event traced
OpenTelemetry gen_ai.* semantic conventions used
Traces correlate to user/session/tenant IDs
Retention long enough to investigate (≥ 30 days of production traffic)
Single pane of glass for AI traces + cloud telemetry
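
To make the tracing items concrete, the sketch below builds the attribute set for one model-call span using only the standard library. The gen_ai.* names reflect our understanding of the OpenTelemetry generative AI semantic conventions, which are still evolving, so check them against the current specification before relying on them. The function name and the "app.tenant.id" key are hypothetical.

```python
def model_call_attributes(model, input_tokens, output_tokens,
                          user_id, session_id, tenant_id):
    """Build span attributes for a single model call."""
    return {
        # gen_ai.* names as we understand the OpenTelemetry conventions.
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Correlation IDs so a user report can be traced back to a span.
        "user.id": user_id,
        "session.id": session_id,
        "app.tenant.id": tenant_id,  # hypothetical custom attribute
    }

attrs = model_call_attributes(
    model="example-model", input_tokens=120, output_tokens=45,
    user_id="u-1", session_id="s-1", tenant_id="t-1",
)
for key, value in sorted(attrs.items()):
    print(f"{key}={value}")
```

With a tracing SDK in place, a dictionary like this would be attached to the span that wraps the model call; tool calls and guardrail events would get their own spans under the same trace.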

5. Cost controls (weight: 10%)

Will the bill be predictable, and can you find leaks?

Per-request token + dollar accounting
Cost attributed to tenant / feature / model
Per-request budget enforced (soft or hard)
Model routing: cheap model first, escalate on low confidence
Prompt caching where applicable; semantic cache where it pays off
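
Per-request accounting and budget enforcement can be sketched in a few lines. Everything named here is an assumption for illustration: the model names, the prices (in US dollars per 1M tokens), and the RequestBudget class. Substitute your provider's actual rates.

```python
# Hypothetical prices in US dollars per 1M tokens.
PRICES = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}

class BudgetExceeded(Exception):
    """Raised when a hard per-request budget is exceeded."""

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single model call."""
    price = PRICES[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000

class RequestBudget:
    """Tracks spend for one request and attributes it to tenant and feature."""

    def __init__(self, limit_usd, tenant, feature, hard=True):
        self.limit_usd = limit_usd
        self.tenant = tenant
        self.feature = feature
        self.hard = hard
        self.spent_usd = 0.0
        self.ledger = []  # one entry per model call

    def charge(self, model, input_tokens, output_tokens):
        """Record a completed call; raise if a hard budget is now exceeded."""
        cost = call_cost(model, input_tokens, output_tokens)
        self.spent_usd += cost
        self.ledger.append({"tenant": self.tenant, "feature": self.feature,
                            "model": model, "cost_usd": cost})
        if self.hard and self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} of {self.limit_usd:.4f} USD")
        return cost

    @property
    def over_budget(self):
        """For soft budgets: callers check this and degrade gracefully."""
        return self.spent_usd > self.limit_usd
```

Because the charge is recorded after each call returns, a request can overshoot its limit by at most one call. The ledger entries are what later roll up into per-tenant, per-feature, and per-model cost reports.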

6. Failure-mode mapping (weight: 10%)

Has the team identified what can go wrong, in writing?

Documented failure-mode list (hallucination, tool failure, infinite loop, cost runaway, region outage, etc.)
Each failure mode has a detection mechanism
Each failure mode has a mitigation or playbook
Post-incident reviews update the list
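
If the failure-mode list is kept as structured data instead of free text, the second and third items can be checked automatically. This is a minimal sketch under that assumption; the FailureMode fields, the find_gaps helper, and the sample entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    detection: str = ""   # alert, eval, or guardrail that notices it
    mitigation: str = ""  # playbook or automatic fallback

def find_gaps(register):
    """List failure modes that lack a detection mechanism or a mitigation."""
    return {
        "no_detection": [m.name for m in register if not m.detection.strip()],
        "no_mitigation": [m.name for m in register if not m.mitigation.strip()],
    }

# Hypothetical register entries.
register = [
    FailureMode("infinite loop", detection="step-count alert",
                mitigation="hard cap on agent steps"),
    FailureMode("cost runaway", detection="per-request budget alarm"),
    FailureMode("region outage"),
]
print(find_gaps(register))
```

Running a check like this in CI keeps the document from drifting as new failure modes are added after incidents.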

7. Security & data handling (weight: 10%)

Would this pass a security review tomorrow?

No PII or secrets in prompts unless redacted
Least-privilege IAM for tools; no agent has broader access than needed
Audit log of every action an agent takes on user data
Prompt injection treated as a real threat (input filtering, output sandboxing)
Model provider data-handling configured (no-training flags, region pinning)

8. Deployment & operations (weight: 5%)

Can the team ship and operate safely?

Infrastructure as code; prompts and tool configs version-controlled
Feature flags or staged rollouts for new agent behavior
On-call runbook for AI-specific incidents (hallucination spike, latency, cost)
Rollback plan that doesn’t require redeploying the world
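
A staged rollout needs a stable way to decide which users see the new agent behavior. One common approach, sketched here with hypothetical flag and user names, is to hash the flag and user ID into a bucket from 0 to 99 and compare it with the rollout percentage. Rolling back is then a configuration change (set the percentage to 0), not a redeploy.

```python
import hashlib

def in_rollout(user_id, flag, percent):
    """Deterministically place a user in or out of a staged rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return bucket < percent

# Hypothetical flag name; the same user always gets the same answer.
users = [f"user-{i}" for i in range(1000)]
enabled = [u for u in users if in_rollout(u, "new-planner", 10)]
print(len(enabled))
```

Because the bucket depends only on the flag and the user ID, raising the percentage only ever adds users; nobody who already has the new behavior loses it mid-rollout.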

Overall scoring

overall = sum(dimension_score * weight)

Overall	Verdict
< 2.0	Not production-ready. High risk of incident; do not scale traffic.
2.0 to < 3.0	Pilot-quality only. Real users will hit visible problems soon.
3.0 to < 4.0	Production-acceptable for non-critical workloads. Known gaps; have a remediation plan.
4.0 – 4.5	Mature. Suitable for critical workloads with monitoring.
> 4.5	Industry-leading. Rare.
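
The arithmetic can be sketched as follows. The weights are the ones listed for each dimension; the dictionary keys, the function names, rounding to two decimals, and placing a fractional score such as 2.95 in the lower band are assumptions made for this example. The sample scores are hypothetical.

```python
# Weights in percent, as listed for each dimension (they sum to 100).
WEIGHTS = {
    "eval_coverage": 20,
    "regression_detection": 15,
    "guardrails": 15,
    "observability": 15,
    "cost_controls": 10,
    "failure_mode_mapping": 10,
    "security_data_handling": 10,
    "deployment_operations": 5,
}

def overall_score(scores):
    """Weighted overall score on the 0-5 scale, rounded to two decimals."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the 8 dimensions")
    if any(not 0 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension score must be between 0 and 5")
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()) / 100, 2)

def verdict(overall):
    """Map an overall score to its verdict band."""
    if overall < 2.0:
        return "Not production-ready"
    if overall < 3.0:
        return "Pilot-quality only"
    if overall < 4.0:
        return "Production-acceptable for non-critical workloads"
    if overall <= 4.5:
        return "Mature"
    return "Industry-leading"

def meets_production_bar(scores):
    """The production bar is per dimension: every score must be at least 3."""
    return all(s >= 3 for s in scores.values())

# Hypothetical self-assessment.
example = {
    "eval_coverage": 3, "regression_detection": 1, "guardrails": 2,
    "observability": 2, "cost_controls": 2, "failure_mode_mapping": 1,
    "security_data_handling": 3, "deployment_operations": 3,
}
print(overall_score(example), verdict(overall_score(example)))
```

Note that the overall score and the per-dimension bar answer different questions: a system can average above 3.0 while one dimension sits at 1, so it is worth reporting both.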

Common findings (what we usually score low)

In ~80% of Quabyt audits, the two lowest-scoring dimensions are regression detection and failure-mode mapping. Teams have some evals (so eval coverage scores 2–3) but no CI gate (regression detection scores 0–1) and no written failure-mode list (failure-mode mapping scores 0–1). The fix order on hardening engagements is almost always:

Lock the prompt/tool/model versions
Wire evals into CI
Write the failure-mode doc
Layer in guardrails
Instrument tracing
Bring cost controls in
Document the runbook

This sequence works because each step makes the next safer to do.

Want this run on your system?

A Quabyt Reliability Audit takes 2 weeks at a fixed price and produces this scorecard plus a prioritized remediation roadmap. Contact us or read the full Agent Reliability Platform page.