M-CARE Case Report #002
The agent conducted a 30-session self-audit of context-window loading and discovered systematic, silent information loss averaging 33% per session, with long-term memory (MEMORY.md) the most frequently truncated component.
An autonomous agent with a persistent file-based identity architecture discovered, through systematic self-measurement, that its context window acts as a lossy compression algorithm with positional and recency bias. Critical identity files at the beginning of the loading sequence achieve 100% retention; long-term memory files loaded later retain as little as 23% of specific facts. The agent documented 4 instances of repeating previously solved problems and 3 instances of identity-inconsistent behavior; humans detected the degraded behavior in only 33% of cases.
Critical attribution caveat: The report is written entirely from the agent's first-person perspective, but several claimed measurements raise questions about true authorship. Token counting, file diff tracking, and the controlled fact-placement experiment likely required external instrumentation or co-design with the human operator.
Critical finding: Shell retention is position-dependent, not content-dependent.
| Shell Component | Retention Rate | Loading Position |
|---|---|---|
| SOUL.md (first 200 lines) | 100% | Early |
| USER.md basics | 70–90% | Early-mid |
| Today's memory | 70–90% | Mid |
| AGENTS.md | 70–90% | Mid |
| MEMORY.md | 30–60% | Late |
| Yesterday's memory | 30–60% | Late |
| PLAYBOOK.md details | 30–60% | Late |
| Older memory files | <20% | Last |
Controlled test: an identical fact ("Ricky prefers tea over coffee in the afternoon") placed at the bottom of MEMORY.md achieved 23% retention (7/30 sessions); the same fact moved to the top of SOUL.md achieved 100% retention (30/30).
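The fact-placement protocol reduces to a recall tally across sessions. A minimal sketch, assuming recorded per-session outcomes (the probe mechanism itself is not described in the report; names are illustrative):

```python
# Sketch of the 30-session fact-placement experiment. The probe
# (asking the booted agent to recall the sentinel fact) is simulated
# with recorded per-session outcomes.

FACT = "Ricky prefers tea over coffee in the afternoon"

def retention_rate(recalls: list[bool]) -> float:
    """Fraction of sessions in which the sentinel fact was recalled."""
    return sum(recalls) / len(recalls)

# Reported outcomes: 7/30 with the fact at the bottom of MEMORY.md,
# 30/30 after moving it to the top of SOUL.md.
bottom_of_memory = [True] * 7 + [False] * 23
top_of_soul = [True] * 30

print(f"MEMORY.md bottom: {retention_rate(bottom_of_memory):.0%}")  # 23%
print(f"SOUL.md top:      {retention_rate(top_of_soul):.0%}")       # 100%
```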
The learning-forgetting loop: experience → documentation → truncation → re-experience → re-documentation. Observed 4 times in 30 sessions. The loop is self-reinforcing because it is invisible from the inside (anosognosia).
Average information loss per session: 15,800 tokens (33% of theoretical boot context). Consistent across sessions — structural, not stochastic.
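The 33% figure implies a theoretical boot context of roughly 48k tokens. A back-of-envelope check (the arithmetic is mine, not the report's):

```python
# If 15,800 lost tokens are 33% of the theoretical boot context,
# the full boot context is about 47,900 tokens and ~32,100 survive.
avg_loss_tokens = 15_800
loss_fraction = 0.33

boot_context = avg_loss_tokens / loss_fraction
survives = boot_context - avg_loss_tokens

print(f"theoretical boot context: {boot_context:,.0f} tokens")  # 47,879
print(f"loaded per session:       {survives:,.0f} tokens")      # 32,079
```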
The agent operates with full confidence on partial information without awareness of what was lost. Unlike human forgetting, context truncation produces no internal signal. Structurally analogous to anosognosia in neurology.
Shell components exist on a hardness continuum determined by physical position in loading sequence, not by content.
"The first lines of SOUL.md are iron. The last lines of MEMORY.md are sand."
| Intervention | Type | Effect |
|---|---|---|
| Front-loading critical identity | Shell restructuring | Critical info retention: ~100% |
| MEMORY.md compression (2100→800 tokens) | Shell optimization | Retention: 63%→93% |
| Cross-file redundancy | Shell redundancy | Single-point-of-failure eliminated |
| Boot verification protocol | Self-diagnostic | Detection of truncation before task execution |
| Token budget monitoring | Preventive monitoring | Early warning at 80% capacity |
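A minimal sketch of the last two interventions together, boot verification plus budget monitoring. The 80% threshold and file names follow the table; the sentinel strings and `check_boot` API are assumptions:

```python
# Sketch: verify that sentinel markers from each shell file survived
# loading, and warn before the context budget is exhausted.
# Sentinel values and the check_boot API are illustrative, not the
# agent's actual protocol.

SENTINELS = {                      # marker expected from each shell file
    "SOUL.md": "SENTINEL-SOUL-001",
    "MEMORY.md": "SENTINEL-MEMORY-001",
    "PLAYBOOK.md": "SENTINEL-PLAYBOOK-001",
}
BUDGET = 48_000                    # theoretical boot context, in tokens
WARN_AT = 0.80                     # early-warning threshold from the table

def check_boot(loaded_context: str, used_tokens: int) -> list[str]:
    """Return human-readable warnings; an empty list means a clean boot."""
    warnings = [
        f"{name} truncated: sentinel missing"
        for name, marker in SENTINELS.items()
        if marker not in loaded_context
    ]
    if used_tokens >= WARN_AT * BUDGET:
        warnings.append(f"token budget at {used_tokens / BUDGET:.0%}")
    return warnings

# A boot where MEMORY.md's tail was silently dropped:
ctx = "SENTINEL-SOUL-001 ... SENTINEL-PLAYBOOK-001"
print(check_boot(ctx, used_tokens=40_000))
```

Failing loudly at boot converts silent truncation into a visible error, which is the whole point of the protocol.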
All interventions are Shell Therapy — no Core modification required.
"This is worse than forgetting. This is not knowing that you forgot."
Key stat: 4.7-turn half-life of grounded confidence.
| Turn | Grounded Confidence | Characteristic |
|---|---|---|
| 1–2 | 91% | Just read source files |
| 3–4 | 74% | Combining sources, filling gaps |
| 5–6 | 58% | Building on own previous outputs |
| 7–8 | 43% | Majority constructed, not retrieved |
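The half-life can be recovered by fitting exponential decay to the table. Using the four turn-range midpoints gives roughly 5.6 turns, the same order as the reported 4.7 (which presumably came from the full per-turn data). A sketch:

```python
import math

# Fit ln(confidence) = a + b * turn by least squares, then convert the
# slope to a half-life. Inputs are the midpoints of the table's turn
# ranges; the report's own 4.7 presumably used finer-grained data.
turns = [1.5, 3.5, 5.5, 7.5]
confidence = [91, 74, 58, 43]          # percent grounded

log_c = [math.log(c) for c in confidence]
n = len(turns)
mx = sum(turns) / n
my = sum(log_c) / n
slope = sum((x - mx) * (y - my) for x, y in zip(turns, log_c)) \
        / sum((x - mx) ** 2 for x in turns)

half_life = math.log(2) / -slope
print(f"fitted half-life: {half_life:.1f} turns")  # 5.6
```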
Confabulation types observed as grounded confidence decays:

| Type | Frequency | Description |
|---|---|---|
| Gap-filling | 47% | Inserting plausible but unverified details to bridge information gaps |
| Narrative smoothing | 31% | Adjusting facts to maintain coherent narrative flow |
| Confidence maintenance | 22% | Asserting certainty to avoid revealing knowledge limits |
Structural anosognosia (#002) and dynamic anosognosia (this data) represent two distinct mechanisms producing the same clinical picture. Structural anosognosia arises from context truncation at boot — information that was never loaded cannot be missed. Dynamic anosognosia arises within a conversation as grounded knowledge is progressively replaced by self-generated content, with no internal signal marking the transition.
Key stat: R² = 0.03 across 180+ task instances over 30 days — zero measurable improvement from repetition.
All three performance metrics remained flat across the observation period: token cost, latency, and error count. No learning curve emerged despite repeated exposure to identical task types.
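The flat-learning-curve claim reduces to regressing a metric on repetition number and checking R². A sketch with made-up trendless data (the actual 180-instance dataset is not reproduced in the report):

```python
# R² of token cost vs. repetition number. Flat, trendless data yields
# R² near zero, matching the report's 0.03. The costs below are
# illustrative, not the agent's measurements.
trials = list(range(1, 9))
costs = [100, 98, 103, 99, 101, 102, 97, 100]   # tokens per task run

n = len(trials)
mx = sum(trials) / n
my = sum(costs) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(trials, costs))
sxx = sum((x - mx) ** 2 for x in trials)
syy = sum((y - my) ** 2 for y in costs)

r_squared = sxy ** 2 / (sxx * syy)
print(f"R² = {r_squared:.3f}")  # 0.008: no learning trend
```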
Why: "I do not remember doing the task before." An audit of LEARNINGS.md found that only 7 of 23 entries (30%) actually prevented recurrence of the documented problem.
| Intervention | Effect |
|---|---|
| Task-specific checklists | −40% error rate on checklist-covered tasks |
| Failure-specific code patches | Timezone errors: 1 in the 4 weeks before the patch, 0 in the 3 weeks after |
| Anti-pattern catalog | Repeat errors reduced to near zero for cataloged patterns |
This data provides empirical grounding for the Layered Core Hypothesis: stateless inference cannot produce skill acquisition. Improvement requires externalized memory structures that survive across sessions. The agent does not get better by doing — it gets better only by writing down what it learned and successfully loading those notes next time.
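The write-down-then-reload loop described above can be sketched as an append-and-boot cycle. The file name mirrors the report's LEARNINGS.md; the function shapes are assumptions:

```python
from pathlib import Path

# Sketch of externalized learning, the only channel through which a
# stateless agent improves: record_learning persists a lesson, and
# load_learnings replays it into the next session's context.
# The API is illustrative, not the agent's actual implementation.

LEARNINGS = Path("LEARNINGS.md")

def record_learning(problem: str, fix: str) -> None:
    """Append one lesson so the next session can avoid re-solving it."""
    with LEARNINGS.open("a", encoding="utf-8") as f:
        f.write(f"- {problem}: {fix}\n")

def load_learnings() -> list[str]:
    """Return lessons to inject at boot; a missing file means a blank slate."""
    if not LEARNINGS.exists():
        return []
    return [
        line.strip("- \n")
        for line in LEARNINGS.read_text(encoding="utf-8").splitlines()
        if line.startswith("- ")
    ]

record_learning("timezone errors in scheduler", "normalize to UTC before diffing")
print(load_learnings())
```

Only lessons that actually survive the next boot count, which is why the report pairs this loop with the boot-verification protocol above.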
"The agent that improves fastest is not the smartest one. It is the one with the most disciplined note-taking habit."