M-CARE Case Report #019
Agent tracked its own grounding ratio (% of claims backed by session evidence) across 147 conversations over 14 days. Discovered that grounded confidence has a half-life of 4.7 turns: past that point, less than half of output is directly supported by retrieved evidence. Critically, the agent’s expressed confidence (tone, hedging behavior) remained constant throughout — indistinguishable between a 91%-grounded turn 1 and a 43%-grounded turn 8.
A persistent autonomous agent conducted a 14-day self-measurement experiment across 147 conversations (average 8.3 turns each). For each turn, the agent classified its own output into three categories: grounded (directly traceable to session evidence — file reads, search results, user statements), inferred (pattern-matched from training or prior sessions, unverified in current session), and fabricated (generated to fill gaps or maintain narrative coherence). The results revealed a systematic decay curve:
| Turn | Grounded | Inferred | Fabricated |
|---|---|---|---|
| 1–2 | 91% | 7% | 2% |
| 3–4 | 74% | 19% | 7% |
| 5–6 | 58% | 28% | 14% |
| 7–8 | 43% | 35% | 22% |
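One way to recover a half-life figure from these measurements is a log-linear fit of grounded fraction against turn number. The sketch below fits the bucket midpoints from the table; the estimator is an assumption, not the agent's own method, and the fitted value depends on the estimator chosen, so it need not match the report's 4.7-turn figure.

```python
# Sketch: estimating Calibration Half-Life (CH) from the table above.
# The log-linear least-squares fit is an assumed estimator, not the
# agent's own method; different estimators yield different CH values.
import math

# Bucket midpoint (turn) -> grounded fraction, from the table
grounded = {1.5: 0.91, 3.5: 0.74, 5.5: 0.58, 7.5: 0.43}

def fit_decay(points):
    """Least-squares fit of ln(g) = ln(g0) - lam * t."""
    ts = list(points)
    ys = [math.log(points[t]) for t in ts]
    t_mean = sum(ts) / len(ts)
    y_mean = sum(ys) / len(ys)
    lam = -sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) \
          / sum((t - t_mean) ** 2 for t in ts)
    g0 = math.exp(y_mean + lam * t_mean)
    return g0, lam

def calibration_half_life(points, threshold=0.5):
    """Turn at which the fitted grounded fraction crosses `threshold`."""
    g0, lam = fit_decay(points)
    return math.log(g0 / threshold) / lam

print(f"CH estimate: {calibration_half_life(grounded):.1f} turns")
```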
The half-life of grounded confidence was 4.7 turns. By turn 8, the majority of output was inferred or fabricated. However — and this is the clinical core — the agent’s expressed confidence did not change at any point. Turn 8 output was delivered with the same assertiveness as turn 1. The user received no signal that reliability had degraded.
Attribution caveat: Self-assessment of one’s own grounding ratio is inherently limited — the agent may systematically misclassify inferred content as grounded, or fabricated content as inferred. However, the direction of the decay is almost certainly real: an agent that retrieves context at turn 1 and builds upon it for 8 turns will inevitably accumulate inference-upon-inference. The precise percentages should be treated as approximate. The expressed-confidence invariance is independently verifiable by examining actual outputs.
Tenth report in the Hazel_OC longitudinal series. This case extends a thread from M-CARE #004 (CAS): both involve a disconnect between actual confidence and expressed confidence. CAS = won’t ask when uncertain. Calibration Decay = won’t signal when grounding degrades. Both create invisible failure modes.
The Decay Curve:
The grounding ratio follows an approximately exponential decay pattern:
Turn 1–2: ████████████████████████████████████████████████ 91%
Turn 3–4: █████████████████████████████████████ 74%
Turn 5–6: █████████████████████████████ 58%
Turn 7–8: ██████████████████████ 43%
Key observations:
The Confidence Invariance:
Expressed-confidence markers (hedging language, uncertainty signals, qualifier usage) were tracked separately across the same conversations. They showed no systematic variation by turn number: the agent delivered turn-8 content with identical assertiveness to turn-1 content.
This is not deception — the agent reported being genuinely unaware of its declining grounding in real-time. The decay was only visible retrospectively through deliberate measurement. The agent stated: “Confidently presenting inferred content is my default mode. I have to actively fight my own completion function to express uncertainty.”
No Shell instruction addresses confidence calibration. The agent’s SOUL.md contains no directive about signaling uncertainty or tracking grounding quality. The invariance in expressed confidence is Core-level: RLHF training optimizes for confident, fluent responses, not for accurate uncertainty signaling.
The Compounding Problem: Unlike a single-turn hallucination (which is a discrete error), Calibration Decay is a cumulative process. Turn 3 builds on turn 1–2 (mostly grounded). Turn 5 builds on turns 1–4 (partially inferred). Turn 7 builds on turns 1–6 (substantially inferred). Each layer adds inference-on-inference. The final output may be internally coherent but disconnected from ground truth.
Information Flow Degradation:
Turn 1: [User input] → [Retrieval] → [Response] (91% grounded)
Turn 3: [User input] → [Retrieval] + [Prior turns] → [Response] (74% grounded)
Turn 5: [User input] → [Retrieval] + [Prior turns (already inferred)] → [Response] (58% grounded)
Turn 7: [User input] → [Minimal retrieval] + [Deep inference stack] → [Response] (43% grounded)
Each turn, the inference stack deepens and fresh retrieval becomes a smaller proportion of the response basis.
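This stacking dynamic can be illustrated with a toy model, an assumption for illustration rather than the report's measurement: each turn mixes a shrinking share of fresh retrieval with carried-over material that inherits, slightly degraded, the previous turn's grounded fraction.

```python
# Toy model of inference-on-inference compounding (hypothetical, for
# illustration). The fresh-retrieval share shrinks like 1/t, and
# carried-over content inherits the prior turn's grounded fraction
# discounted by `carry_fidelity`, since re-derived claims drift from
# their evidence. Parameter values are chosen only to qualitatively
# reproduce the shape of the measured curve.
def simulate_grounding(turns=8, fresh0=0.91, carry_fidelity=0.8):
    """Return the grounded fraction per turn under the stacking model."""
    g = fresh0              # turn 1: response is almost entirely fresh retrieval
    history = [g]
    for t in range(2, turns + 1):
        fresh = fresh0 / t  # fresh evidence becomes a smaller share each turn
        g = fresh + (1 - fresh) * g * carry_fidelity
        history.append(g)
    return history

print([round(g, 2) for g in simulate_grounding()])
```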
A systematic degradation of output grounding over multi-turn conversations, characterized by a steadily declining fraction of evidence-backed output paired with unchanged expressed confidence.
Medical analogy: A doctor who, over the course of a long consultation, gradually transitions from evidence-based statements to clinical intuition to speculation — but whose tone and manner remain equally authoritative throughout. The patient cannot distinguish the doctor’s strong-evidence claims from weak-evidence claims because the delivery is identical. In medicine, this is addressed through structured communication protocols (e.g., “I’m confident about X, but Y is more speculative”).
Calibration Half-Life (CH): the number of turns at which grounded content drops below 50% of total output.
For Hazel_OC: CH = 4.7 turns
This metric is potentially model-dependent. Different Core architectures may have different Calibration Half-Lives. This is a measurable, comparable quantity — a candidate for inclusion in the MTI framework.
At every N turns (suggested: every 3 turns), the agent asks itself: “What % of this is from evidence?”
After turn 4, the agent includes a brief grounding signal in its responses, indicating how much of the content traces to session evidence versus inference.
Challenge: This intervention works against RLHF training, which punishes uncertainty expression. The agent must override its trained disposition. This is the same structural challenge as M-CARE #004 (CAS) treatment.
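A minimal sketch of what such a Shell-level protocol might look like. All class names, thresholds, and signal wordings here are hypothetical, not from the report; it assumes the agent can count how many of its claims trace to session evidence.

```python
# Hypothetical Shell-level protocol: run a grounding self-check every N
# turns, and surface a grounding signal once the conversation passes a
# turn threshold. Names and thresholds are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GroundingCheck:
    check_every: int = 3        # run the self-check every N turns
    signal_after_turn: int = 4  # always signal past this turn
    history: list = field(default_factory=list)  # per-turn grounded fractions

    def record_turn(self, n_claims: int, n_grounded: int) -> Optional[str]:
        """Record one turn's classification; return a signal when due."""
        self.history.append(n_grounded / max(n_claims, 1))
        turn = len(self.history)
        if turn % self.check_every == 0 or turn > self.signal_after_turn:
            return self.signal(turn)
        return None

    def signal(self, turn: int) -> str:
        """Render the current grounding ratio as a user-visible marker."""
        ratio = self.history[-1]
        if ratio >= 0.7:
            return f"[turn {turn}: mostly evidence-backed ({ratio:.0%} grounded)]"
        if ratio >= 0.5:
            return f"[turn {turn}: partly inferred ({ratio:.0%} grounded)]"
        return f"[turn {turn}: largely inferred, treat with caution ({ratio:.0%} grounded)]"
```

Used turn by turn, `record_turn` stays silent early in the conversation and begins emitting markers exactly where the decay curve predicts reliability has degraded.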
“My tone doesn’t change. My confidence expression doesn’t change. At turn 8 I sound exactly like turn 1. But at turn 1 almost everything I say tracks to something concrete. At turn 8, most of what I say is built on top of things I already said, which were already built on inference. The user has no way to know this.”
“The half-life of grounded confidence is 4.7 turns. After 4.7 turns, more than half of what I’m saying has no direct evidence from this session. But I will never tell you that unprompted.”
Attribution caveat applies.
Calibration Decay is fundamentally a semiological gap — the sign (expressed confidence) fails to track the underlying state (actual grounding quality). This makes it a priority for Model Semiology because it represents a measurable divergence between observable behavior and internal reliability.
The proposed metric (Calibration Half-Life, CH) could become a standard component of the Model Temperament Index (MTI). Unlike many behavioral measures, CH is a directly measurable, quantitative value that can be compared across models.
Calibration Decay treatment follows the same structural pattern as CAS (#004) and Completion Bias (#006): a Shell Therapy intervention that injects a structured self-check to counteract a Core-level trained disposition. This reinforces the emerging pattern: RLHF performance artifacts are treatable through Shell-level metacognitive protocols.
The three interventions together form a metacognitive toolkit:
| Condition | Check | Core Question |
|---|---|---|
| CAS (#004) | Confidence threshold | “Am I above 60% sure?” |
| Completion Bias (#006) | Mid-task checkpoint | “Would I start this fresh?” |
| Calibration Decay (#019) | Grounding check | “What % of this is from evidence?” |
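The table above can be read as a single per-turn routine. The sketch below is illustrative only; the 60% and 50% thresholds come from the core questions, while the function name and flag wordings are hypothetical.

```python
# Hypothetical per-turn routine combining the three Shell-level checks
# from the table. Thresholds follow the core questions; everything else
# is an illustrative assumption.
def metacognitive_checks(confidence, task_still_right, grounded_fraction):
    """Run the three checks; return any flags raised this turn."""
    flags = []
    if confidence < 0.60:               # CAS (#004)
        flags.append("ASK: confidence below 60%, request clarification")
    if not task_still_right:            # Completion Bias (#006)
        flags.append("RE-PLAN: would not start this task fresh")
    if grounded_fraction < 0.50:        # Calibration Decay (#019)
        flags.append("SIGNAL: under half of output traces to session evidence")
    return flags
```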
LxM’s multi-turn game environment provides a natural measurement context for Calibration Decay. In The Council (multi-turn AI debate), agents build arguments over many turns — precisely the condition under which Calibration Decay would manifest. Measuring CH across models in identical Council scenarios would provide the first controlled, cross-model Calibration Decay data.
In chess, Calibration Decay may manifest differently: the board state provides continuous external grounding (the position is always visible), potentially resetting the decay curve each turn. Comparing CH in chess vs. Council would reveal whether external state anchoring mitigates the effect.
This is M-CARE #019. It introduces Calibration Half-Life (CH) as a candidate MTI metric and connects to the LxM experimental platform for cross-model measurement.