Agent Reliability Scorecard

A public, self-serve rubric for assessing whether an agentic AI system is ready for production. We score 8 dimensions from 0 to 5 each and combine them into a weighted overall score. Use it yourself, or have us run it on your stack as a paid 2-week audit.

This is the same scorecard Quabyt engineers use on Reliability Audit engagements. It is intentionally open so any technical leader can self-assess.


Scoring scale

Score  Meaning
0      Absent. No artifact, no practice.
1      Ad-hoc. Tried once; not maintained.
2      Inconsistent. Exists for some flows, not others.
3      Adequate. Covers the happy path and the obvious edge cases.
4      Mature. Versioned, automated, owned, audited.
5      Exemplary. Continuously improved; failure modes anticipated, not discovered.

A score of at least 3 on every dimension is the production bar. A system with any dimension below 3 has unaddressed failure modes that will surface under real traffic.


The 8 dimensions

1. Eval coverage (weight: 20%)

How comprehensively is agent behavior measured?

  • Versioned golden eval set (regression suite) in source control
  • Adversarial set targeting known failure modes (jailbreaks, PII leakage, tool misuse)
  • Per-step evals for multi-step agents, not just end-to-end
  • LLM-judge scorers with a human-calibrated agreement rate
  • Production sampling feeds back into the dataset
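
One way to put a number on "human-calibrated agreement rate" is to label a sample by hand, run the LLM judge on the same sample, and compare. The sketch below is illustrative only: the function names and the pass/fail labels are assumptions, and Cohen's kappa is added as an optional chance-corrected view that the rubric does not itself require.

```python
from collections import Counter

def agreement_rate(human, judge):
    """Fraction of examples where the LLM judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    p_expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up labels for illustration, not real audit data.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

rate = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Raw agreement can look high when one label dominates, which is why a chance-corrected figure is worth tracking alongside it.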

2. Regression detection (weight: 15%)

Will a bad prompt or model change be caught before users see it?

  • Evals run in CI on every PR touching prompts, tools, or models
  • Merge blocked on regression beyond a defined threshold
  • Eval results trended over time, not just pass/fail
  • Model version pinned; upgrades go through eval comparison
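
A CI gate for the checklist above can be as small as a script that compares the candidate eval run against a stored baseline and fails the job on a drop beyond the threshold. This is a minimal sketch: the metric names, the 0.02 threshold, and the function names are all assumptions made for illustration.

```python
# Hypothetical threshold: largest tolerated absolute drop in any metric.
MAX_DROP = 0.02

def find_regressions(baseline, candidate, max_drop=MAX_DROP):
    """Return a description of every metric that regressed or went missing."""
    regressions = []
    for metric, base in baseline.items():
        new = candidate.get(metric)
        if new is None:
            regressions.append(f"{metric}: missing from candidate run")
        elif base - new > max_drop:
            regressions.append(f"{metric}: {base:.3f} -> {new:.3f}")
    return regressions

def gate(baseline, candidate, max_drop=MAX_DROP):
    """Print regressions and return a process exit code (0 = merge allowed)."""
    problems = find_regressions(baseline, candidate, max_drop)
    for problem in problems:
        print("REGRESSION", problem)
    return 1 if problems else 0

# Illustrative numbers only, not real eval results.
baseline = {"task_success": 0.91, "tool_call_accuracy": 0.88}
candidate = {"task_success": 0.90, "tool_call_accuracy": 0.80}

exit_code = gate(baseline, candidate)  # a CI job would pass this to sys.exit()
```

Storing each run's metrics alongside the commit also gives you the trend line the third bullet asks for, not just the pass/fail bit.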

3. Guardrails (weight: 15%)

Are inputs and outputs validated, and are tool calls bounded?

  • Input: PII detection, prompt-injection detection, schema validation
  • Output: schema validation, hallucination check (where verifiable), toxicity/safety
  • Tool-use policy: allowed tools per agent role, parameter validation, rate limits
  • Human-in-the-loop checkpoints on high-stakes actions
  • Every guardrail trip is logged and reviewable
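
The tool-use policy bullet can be sketched as a single check that runs before every tool call: is the tool allowed for this role, do the parameters validate, and is the caller under its rate limit. The class below is a hypothetical illustration (the role, tool, and field names are invented), with every refusal appended to a reviewable list.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ToolPolicy:
    allowed_tools: dict                      # role -> set of permitted tool names
    validators: dict = field(default_factory=dict)  # tool -> Callable[[dict], bool]
    max_calls_per_minute: int = 10
    trips: list = field(default_factory=list)       # log of every refusal
    _calls: dict = field(default_factory=dict)      # role -> recent call times

    def check(self, role: str, tool: str, params: dict,
              now: Optional[float] = None) -> bool:
        """Return True if the call may proceed; log a trip otherwise."""
        now = time.monotonic() if now is None else now
        if tool not in self.allowed_tools.get(role, set()):
            return self._trip(role, tool, "tool not allowed for role")
        validator: Optional[Callable] = self.validators.get(tool)
        if validator is not None and not validator(params):
            return self._trip(role, tool, "invalid parameters")
        # Keep only calls from the last 60 seconds (sliding window).
        recent = [t for t in self._calls.get(role, []) if now - t < 60.0]
        self._calls[role] = recent
        if len(recent) >= self.max_calls_per_minute:
            return self._trip(role, tool, "rate limit exceeded")
        recent.append(now)
        return True

    def _trip(self, role: str, tool: str, reason: str) -> bool:
        self.trips.append({"role": role, "tool": tool, "reason": reason})
        return False

# Hypothetical role and tool names.
policy = ToolPolicy(
    allowed_tools={"support": {"lookup_order"}},
    validators={"lookup_order": lambda p: isinstance(p.get("order_id"), int)},
    max_calls_per_minute=2,
)
```

A human-in-the-loop checkpoint would slot in as one more outcome of `check`, for example returning a "needs approval" state for high-stakes tools instead of a plain boolean.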

4. Observability (weight: 15%)

Can you debug a failed conversation from a user report?

  • Every model call, tool call, and guardrail event traced
  • OpenTelemetry gen_ai.* semantic conventions used
  • Traces correlate to user/session/tenant IDs
  • Retention long enough to investigate (≥ 30 days production traffic)
  • Single pane of glass for AI traces + cloud telemetry
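
To make the correlation requirement concrete, here is a rough sketch of the attributes one model-call span might carry, plus a lookup by session. It uses an in-memory list rather than the OpenTelemetry SDK, so `TraceLog` and `model_call_attributes` are stand-ins invented for this example. The `gen_ai.*` keys follow the OpenTelemetry generative AI semantic conventions as we understand them; those conventions are still evolving, so check the current spec before relying on exact names. `app.tenant.id` is a hypothetical custom attribute.

```python
import time
import uuid

def model_call_attributes(model, input_tokens, output_tokens,
                          session_id, tenant_id, operation="chat"):
    """Build span attributes for one model call."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "session.id": session_id,
        "app.tenant.id": tenant_id,  # hypothetical custom key, not from the spec
    }

class TraceLog:
    """In-memory stand-in for a tracing backend (illustration only)."""

    def __init__(self):
        self.spans = []

    def record(self, name, attributes):
        span = {
            "span_id": uuid.uuid4().hex,
            "name": name,
            "time": time.time(),
            "attributes": dict(attributes),
        }
        self.spans.append(span)
        return span

    def for_session(self, session_id):
        """Everything that happened in one session, in recorded order."""
        return [s for s in self.spans
                if s["attributes"].get("session.id") == session_id]

log = TraceLog()
log.record("chat example-model",
           model_call_attributes("example-model", 420, 96, "s-1", "tenant-a"))
log.record("chat example-model",
           model_call_attributes("example-model", 150, 30, "s-2", "tenant-b"))
```

With session and tenant IDs on every span, a user report that includes a session ID is enough to pull the full conversation trace.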

5. Cost controls (weight: 10%)

Will the bill be predictable, and can you find leaks?

  • Per-request token + dollar accounting
  • Cost attributed to tenant / feature / model
  • Per-request budget enforced (soft or hard)
  • Model routing: cheap model first, escalate on confidence
  • Prompt caching where applicable; semantic cache where it pays off
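
Per-request accounting and a hard budget can be sketched in a few lines. The model names and per-million-token prices below are placeholders, not real provider pricing, and `RequestBudget` is a hypothetical helper: it records the cost of each completed call and raises once the request's cumulative spend passes its limit, so the agent loop stops issuing further calls.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens: (input, output). Not real pricing.
PRICES = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one model call under the placeholder price table."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

class BudgetExceeded(Exception):
    pass

@dataclass
class RequestBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, model, input_tokens, output_tokens):
        """Record a completed call; raise if the request is now over budget."""
        cost = request_cost(model, input_tokens, output_tokens)
        self.spent_usd += cost  # always record spend so accounting stays exact
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.limit_usd:.4f}")
        return cost

budget = RequestBudget(limit_usd=0.02)
budget.charge("large-model", 2000, 500)
budget.charge("small-model", 1000, 500)
```

A soft budget is the same structure with a warning or a downgrade to the cheaper model in place of the exception.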

6. Failure-mode mapping (weight: 10%)

Has the team identified what can go wrong, in writing?

  • Documented failure-mode list (hallucination, tool failure, infinite loop, cost runaway, region outage, etc.)
  • Each failure mode has a detection mechanism
  • Each failure mode has a mitigation or playbook
  • Post-incident reviews update the list
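
If the failure-mode list lives in a structured file rather than a wiki page, the "each failure mode has a detection mechanism and a mitigation" checks can be automated. The sketch below assumes a simple record shape (`FailureMode` and `find_gaps` are names invented here); the entries are examples, not a recommended list.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    detection: str = ""   # how the team would notice it
    mitigation: str = ""  # mitigation or link to a playbook

def find_gaps(modes):
    """Map each incomplete failure mode to the fields it is missing."""
    gaps = {}
    for mode in modes:
        missing = [f for f in ("detection", "mitigation")
                   if not getattr(mode, f).strip()]
        if missing:
            gaps[mode.name] = missing
    return gaps

# Example entries only.
registry = [
    FailureMode("infinite loop",
                detection="step-count alert",
                mitigation="hard cap on agent steps"),
    FailureMode("cost runaway", detection="per-request budget alarm"),
    FailureMode("region outage"),
]

gaps = find_gaps(registry)
```

Running a check like this in CI keeps the document honest after post-incident reviews add new entries.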

7. Security & data handling (weight: 10%)

Would this pass a security review tomorrow?

  • No PII or secrets in prompts unless redacted
  • Least-privilege IAM for tools; no agent has broader access than needed
  • Audit log of every action an agent takes on user data
  • Prompt injection treated as a real threat (input filtering, output sandboxing)
  • Model provider data-handling configured (no-training flags, region pinning)
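
As a rough illustration of the first bullet, the sketch below redacts two kinds of identifiers before text reaches a prompt. The regular expressions are deliberately simple assumptions (an email pattern and a US-style SSN pattern) and will miss plenty of real PII; production systems typically use a dedicated detection service.

```python
import re

# Simplistic example patterns; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace matches with a [LABEL] placeholder and count what was removed."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts

clean, found = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Returning the counts alongside the cleaned text makes it easy to write each redaction to the audit log.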

8. Deployment & operations (weight: 5%)

Can the team ship and operate safely?

  • Infrastructure as code; prompts and tool configs version-controlled
  • Feature flags or staged rollouts for new agent behavior
  • On-call runbook for AI-specific incidents (hallucination spike, latency, cost)
  • Rollback plan that doesn’t require redeploying the world
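
A staged rollout that can be rolled back without a redeploy often comes down to a percentage stored in config and a stable hash of the user ID. The sketch below shows one way to do that; the flag name and function are hypothetical, and in practice the percentage would come from a flag service or config store rather than a function argument.

```python
import hashlib

def in_rollout(flag, user_id, percent):
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if in_rollout("new-planner-prompt", u, 10)}
at_50 = {u for u in users if in_rollout("new-planner-prompt", u, 50)}
```

Because a user's bucket never changes, raising the percentage only adds users, and setting it to 0 is the rollback.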

Overall scoring

overall = sum(dimension_score * weight)

Overall     Verdict
< 2.0       Not production-ready. High risk of incident; do not scale traffic.
2.0 – 2.9   Pilot-quality only. Visible problems for real users imminent.
3.0 – 3.9   Production-acceptable for non-critical workloads. Known gaps; have a remediation plan.
4.0 – 4.5   Mature. Suitable for critical workloads with monitoring.
> 4.5       Industry leading. Rare.
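
For self-assessment, the formula and verdict bands can be scripted. The weights below are the ones listed for the 8 dimensions; the dictionary keys and function names are our own shorthand. One assumption to flag: weighted scores can land between the printed bands (for example 2.95), and this sketch assigns such values to the lower band.

```python
WEIGHTS = {
    "eval_coverage": 0.20,
    "regression_detection": 0.15,
    "guardrails": 0.15,
    "observability": 0.15,
    "cost_controls": 0.10,
    "failure_mode_mapping": 0.10,
    "security_data_handling": 0.10,
    "deployment_operations": 0.05,
}

def overall_score(scores):
    """Weighted sum of the 8 dimension scores (each 0 to 5)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the 8 dimensions")
    if any(not 0 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension score must be between 0 and 5")
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 2)

def verdict(overall):
    """Map an overall score to its band; gaps fall to the lower band."""
    if overall < 2.0:
        return "Not production-ready"
    if overall < 3.0:
        return "Pilot-quality only"
    if overall < 4.0:
        return "Production-acceptable for non-critical workloads"
    if overall <= 4.5:
        return "Mature"
    return "Industry leading"

def below_bar(scores, bar=3):
    """Dimensions scoring under the per-dimension production bar."""
    return sorted(d for d, s in scores.items() if s < bar)

# Made-up example scores, not a real audit result.
example = {
    "eval_coverage": 3, "regression_detection": 1, "guardrails": 2,
    "observability": 3, "cost_controls": 2, "failure_mode_mapping": 1,
    "security_data_handling": 3, "deployment_operations": 3,
}
score = overall_score(example)
```

Reporting `below_bar` next to the overall number matters because a respectable weighted average can hide a dimension that is still at 0 or 1.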

Common findings (what we usually score low)

In ~80% of Quabyt audits, the lowest two dimensions are regression detection and failure-mode mapping. Teams have some evals (so eval coverage scores 2–3) but no CI gate (regression scores 0–1) and no written failure-mode list (mapping scores 0–1). The fix order on hardening engagements is almost always:

  1. Lock the prompt/tool/model versions
  2. Wire evals into CI
  3. Write the failure-mode doc
  4. Layer in guardrails
  5. Instrument tracing
  6. Bring cost controls in
  7. Document the runbook

This sequence works because each step makes the next safer to do.


Want this run on your system?

A Quabyt Reliability Audit is a fixed-price, 2-week engagement that produces this scorecard plus a prioritized remediation roadmap. Contact us or read the full Agent Reliability Platform page.