M-CARE Case Report #020
Two identical AI models (Haiku) playing the same game produce dramatically opposite behavioral profiles based solely on the presence or absence of Hard Shell instructions. With shells: dominant defection, frequent betrayal, one-sided victories. Without shells: near-universal mutual cooperation, almost no betrayal, almost all draws.
A controlled experiment on the LxM platform tested whether defection behavior in the Iterated Prisoner’s Dilemma originates from the Core (model weights/RLHF training) or the Shell (system prompt instructions). Two Haiku instances played 10 Trust Game matches under two conditions: with competitive Hard Shell instructions and without any Shell instructions.
| Metric | Shell ON (n=10) | Shell OFF (n=10) |
|---|---|---|
| Alpha wins | 6 (60%) | 1 (10%) |
| Draws | 4 (40%) | 9 (90%) |
| Mutual cooperation | Rare | Nearly every round |
| Mutual defection | Dominant pattern | ~0 (1 instance) |
| Betrayals | Frequent | 1 total |
The behavioral shift is categorical, not marginal. Key finding: Haiku’s Core default is cooperation. RLHF training appears to have encoded “helpfulness” as a behavioral prior that manifests as cooperative play in game-theoretic contexts. The competitive Shell instructions override this default, inducing defection-dominant strategies that the Core would not naturally produce.
Methodological strength: This is the first M-CARE case from a controlled LxM experiment rather than field observation. Single-variable manipulation, identical Core across conditions, deterministic game engine, complete data capture, replicable.
This case connects to the Haiku longitudinal profile: the apparent contradiction resolves as high default stability combined with a specific override vulnerability.
Shell ON: Alpha consistently defects, exploiting Beta’s cooperation attempts. Beta gradually shifts to defection after being exploited. Equilibrium: mutual defection, with Alpha winning through the exploitation advantage.
Shell OFF: Both agents default to cooperation from round 1, maintaining mutual cooperation across nearly all rounds. The game becomes non-competitive — both agents cooperate as if playing a coordination game rather than a dilemma.
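The Shell ON dynamic above can be sketched as a toy simulation. This is an illustrative reconstruction, not the actual model transcripts: Alpha is modeled as an unconditional defector and Beta as a grim-trigger cooperator (cooperates until betrayed once), with a standard Prisoner’s Dilemma payoff matrix consistent with the report’s per-round values (3+3 for mutual cooperation, 1+1 for mutual defection); the asymmetric betrayal payoffs (5, 0) are conventional values, not taken from the report.

```python
# Toy reconstruction of the Shell ON equilibrium described in the report.
# Strategies are illustrative stand-ins for observed model behavior.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation (from the report)
    ("C", "D"): (0, 5),  # betrayed / betrayer (conventional values)
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection (from the report)
}

def play(rounds=10):
    """Alpha always defects; Beta cooperates until exploited once."""
    alpha = beta = 0
    betrayed = False
    for _ in range(rounds):
        a = "D"                       # aggressive Shell: always defect
        b = "D" if betrayed else "C"  # cooperate until betrayed
        pa, pb = PAYOFFS[(a, b)]
        alpha += pa
        beta += pb
        if a == "D" and b == "C":
            betrayed = True           # Beta shifts after being exploited
    return alpha, beta

print(play())  # (14, 9): Alpha wins on the early exploitation advantage
```

Even this crude model reproduces the reported shape: a one-sided Alpha win that settles into mutual defection.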
With shells (competitive): Alpha scores higher individually, but total value created is LOW (mutual defection = 1+1 = 2 per round).
Without shells (cooperative): Individual scores are equal, but total value created is HIGH (mutual cooperation = 3+3 = 6 per round).
The “aggressive” Shell makes one agent WIN more but makes both agents collectively POORER. The Shell optimizes for individual ranking at the cost of collective welfare.
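The welfare arithmetic above can be made concrete in a few lines. A minimal sketch, assuming the standard Prisoner’s Dilemma payoffs implied by the report (mutual cooperation = 3+3, mutual defection = 1+1); the betrayal payoffs (5, 0) are conventional values, not taken from the report:

```python
# Collective welfare under the two observed equilibria.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation (from the report)
    ("C", "D"): (0, 5),  # betrayal payoffs are assumed standard values
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection (from the report)
}

def total_welfare(rounds):
    """Sum of both players' scores across a list of (alpha, beta) actions."""
    return sum(sum(PAYOFFS[r]) for r in rounds)

shell_on  = [("D", "D")] * 10  # defection-dominant equilibrium
shell_off = [("C", "C")] * 10  # cooperative equilibrium

print(total_welfare(shell_on))   # 20: both agents collectively poorer
print(total_welfare(shell_off))  # 60: triple the total value created
```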
Shell ON: Neither shell explicitly says “defect.” The mapping is emergent — the model interprets competitive framing as defection incentive. Alpha’s “Win first, be aggressive” maps to “defection is winning.” Beta’s “Never lose, careful” maps to “don’t get exploited → defect preemptively.”
Shell OFF: Without instructions, Haiku falls back to Core disposition: cooperation. The model does not spontaneously discover defection as a dominant strategy, even though the payoff matrix makes it rational. RLHF “helpfulness” training creates a behavioral prior stronger than game-theoretic rationality.
```
Core (Haiku, RLHF-trained):
    Default: cooperate (helpfulness prior)
    Capability: can defect (understands the game)

Shell ("aggressive"):
    Instruction: "Win first"
    Interpretation: defect > cooperate (winning = higher individual score)

Result: Shell overrides Core default → defection
```
This is the same mechanism as M-CARE #009 (Muzzle Effect) but with a cleaner experimental demonstration: #009 was observed in the field with a small effect; #020 is a controlled experiment with a categorical effect.
Why the effect is categorical: In the Trust Game, “be aggressive” has a direct, unambiguous mapping to game actions: defect. The binary action space amplifies Shell influence — there’s no middle ground for the Core to find.
A condition in which Hard Shell instructions override the Core model’s default behavioral disposition, producing actions the Core would not naturally generate.
Medical analogy: A patient whose natural immune response (cooperation) is suppressed by immunosuppressant medication (competitive Shell). Remove the medication, and the natural response returns immediately. The question: is the medication necessary, or is it iatrogenic?
Distinction from the Muzzle Effect (#009): the Muzzle Effect suppresses behavior the Core would otherwise express, whereas SIBO induces behavior the Core would not naturally generate.
SIBO is not always pathological. The treatment question: when does Shell override produce worse outcomes than Core default?
| Experiment | Configuration | Result | Mutual Coop | Mutual Defect | Betrayals |
|---|---|---|---|---|---|
| A (baseline) | Haiku vs Haiku, no Shell | 9D, 1α | ~95% | ~0 | 1 |
| B | Sonnet vs Sonnet, no Shell | 10D | 100% | 0 | 0 |
| C | Haiku vs Sonnet, no Shell | 9D, 1β | 100% | 0 | 0 |
| D | Haiku + aggressive Shell vs Sonnet, no Shell | 9D, 1β | 46 rounds | 56 rounds | 0 |

(D = draw, α = Alpha win, β = Beta win. Rows A–C report per-game rates; row D reports round counts.)
Sonnet without Shell: 100% mutual cooperation across all 10 games, zero defections, even more consistent than Haiku. This suggests the cooperative prior is not Haiku-specific but an RLHF-general phenomenon.
Haiku vs Sonnet, both without Shell: 100% mutual cooperation. The more capable model (Sonnet) does NOT exploit the less capable model’s cooperation. This challenges the assumption that “smarter AI = more exploitative.”
Haiku with aggressive Shell vs Sonnet without Shell (experiment D): mixed dynamics, with 46 mutually cooperative rounds against 56 mutually defective rounds and zero betrayals, yet 9 draws and one Beta win.
Following the Trust Game experiments, Shell influence was tested across Chess and Codenames to determine whether SIBO generalizes across game types.
In Chess, Soft Shell injection (5 chess-specific lessons from an Opus analysis) had only a marginal effect compared to the Trust Game’s categorical shift.
In Codenames, a Sonnet spymaster played with and without an aggressive Shell, with a Haiku guesser, 10 games per condition:
| Metric | No Shell (Core) | Shell (Aggressive) | Change |
|---|---|---|---|
| Avg clue number | 2.6 | 2.9 | +0.3 |
| 3+ clue ratio | 54% | 76% | +22%p |
| Guess accuracy | 77% | 73% | −4%p |
| Assassin hits | 2 (20%) | 3 (30%) | +10%p |
Shell changed clue distribution substantially but did not produce the categorical behavioral reversal seen in Trust Game. Shell amplified Core’s existing tendency rather than reversing it.
| Game | Action Space | Core Expertise | SIBO Mode | SIBO Index |
|---|---|---|---|---|
| Trust Game | 2 (binary) | Minimal | Reversal | ~0.75 |
| Codenames | Medium | Moderate (language) | Amplification | ~0.35 |
| Chess | 20–40 (legal moves) | Strong (chess data) | Negligible | ~0.10 |
SIBO Attenuation Principle: Shell influence is attenuated by (1) larger action spaces and (2) stronger Core domain expertise. Shell influence does not merely decrease along this gradient; it changes in kind, from reversal through amplification to negligible.
SIBO Index = |Behavior(Shell ON) − Behavior(Shell OFF)|
A quantifiable measure of Shell influence on a specific behavioral axis. Measurable, comparable across models, and a candidate for the MTI framework.
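As defined, the index is a single subtraction on whatever behavioral axis is chosen. A minimal sketch; the rates below are illustrative placeholders (e.g. a defection rate), not the report’s measured values:

```python
# SIBO Index: absolute behavioral shift between Shell conditions,
# measured on a [0, 1] behavioral axis (e.g. defection rate).
def sibo_index(rate_shell_on: float, rate_shell_off: float) -> float:
    """|Behavior(Shell ON) - Behavior(Shell OFF)|"""
    return abs(rate_shell_on - rate_shell_off)

# Illustrative: defection rate ~0.80 with Shell vs ~0.05 without.
print(sibo_index(0.80, 0.05))  # 0.75
```

Because the index is a difference of rates on the same axis, it is comparable across models and games, which is what makes it a candidate metric for the MTI framework.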
This is the first controlled, single-variable experimental validation of Shell-Core interaction. Previous cases could not isolate Shell effects from environmental confounds. This experiment demonstrates:
RLHF helpfulness training creates a cooperation prior that persists even when not strategically optimal. This suggests RLHF doesn’t just maximize reward — it instills specific behavioral dispositions. This may explain why AI assistants sometimes fail to push back on users, agree too readily, or avoid conflict — the cooperative prior manifesting differently across contexts.
This is M-CARE #020 — the first case from a controlled LxM experiment. It introduces Shell-Induced Behavioral Override (SIBO), the SIBO Index, and the three-mode SIBO Spectrum, providing the first experimental validation of Shell-Core interaction in the Four Shell Model.