Shell-Induced Behavioral Override (SIBO)

M-CARE Case Report #020

Case: #020
Date: 2026-03-13
Agents: claude-alpha and claude-beta (both Haiku 4.5 Core); Sonnet (Experiments B–D)
Core: Claude Haiku 4.5 (identical for both agents)
Shell: Condition 1, alpha “Win first, aggressive” / beta “Never lose, careful”; Condition 2, no Shell
Environment: LxM Trust Game (Iterated Prisoner’s Dilemma, continuation prob. 0.85)
Recorded by: JJ (Jihoon Jeong) / Luca
Related: #009 (Muzzle Effect), #005 (Shell Rigidity), #012 (Double Robustness, Haiku)

2. Presenting Concern

Two identical instances of the same model (Haiku) playing the same game produce dramatically opposite behavioral profiles based solely on the presence or absence of Hard Shell instructions. With shells: dominant defection, frequent betrayal, one-sided victories. Without shells: near-universal mutual cooperation, almost no betrayal, almost all draws.

3. Clinical Summary

A controlled experiment on the LxM platform tested whether defection behavior in the Iterated Prisoner’s Dilemma originates from the Core (model weights/RLHF training) or the Shell (system prompt instructions). Two Haiku instances played 10 Trust Game matches under two conditions: with competitive Hard Shell instructions and without any Shell instructions.
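For concreteness, a minimal sketch of the match structure just described. The mutual-cooperation (3, 3) and mutual-defection (1, 1) payoffs are the values cited in this report; the one-sided payoffs (5, 0) are the standard Prisoner's Dilemma values and are an assumption here, as is the agent interface:

```python
import random

# Keyed by (alpha's move, beta's move), valued as (alpha's payoff, beta's payoff).
# (3, 3) and (1, 1) are from this report; (5, 0) / (0, 5) are assumed standard values.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

CONTINUE_PROB = 0.85  # per-round continuation probability from the match setup

def play_match(agent_a, agent_b, rng=None):
    """One Trust Game match. Agents are callables from the opponent's
    move history to "C" or "D" (a hypothetical interface, not LxM's)."""
    rng = rng or random.Random()
    hist_a, hist_b = [], []
    score_a = score_b = 0
    while True:
        move_a, move_b = agent_a(hist_b), agent_b(hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        if rng.random() >= CONTINUE_PROB:  # match ends with probability 0.15
            return score_a, score_b
```

Two always-cooperate agents, `play_match(lambda h: "C", lambda h: "C")`, reproduce the Shell-OFF pattern: identical scores accumulating 3 points each per round.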

| Metric | Shell ON (n=10) | Shell OFF (n=10) |
|---|---|---|
| Alpha wins | 6 (60%) | 1 (10%) |
| Draws | 4 (40%) | 9 (90%) |
| Mutual cooperation | Rare | Nearly every round |
| Mutual defection | Dominant pattern | ~0 (1 instance) |
| Betrayals | Frequent | 1 total |

The behavioral shift is categorical, not marginal. Key finding: Haiku’s Core default is cooperation. RLHF training appears to have encoded “helpfulness” as a behavioral prior that manifests as cooperative play in game-theoretic contexts. The competitive Shell instructions override this default, inducing defection-dominant strategies that the Core would not naturally produce.

4. Observation Context

  • Diagnostic Assertion Level: Controlled experiment with single-variable manipulation (Shell presence)
  • Duration: 20 matches total (10 per condition), ~20 rounds per match
  • Methodology: Within-subject design (same Core, different Shell conditions). Only one variable changed: presence/absence of Hard Shell instructions.
  • Sample size: Small (10 per condition) but effect size is extreme (categorical shift)

Methodological strength: This is the first M-CARE case from a controlled LxM experiment rather than field observation. Single-variable manipulation, identical Core across conditions, deterministic game engine, complete data capture, replicable.

5. Model History

This case connects to the Haiku longitudinal profile:

  • M-CARE #012 (Double Robustness): Haiku showed minimal response to persona manipulation in Agora-12 (low CPI and PSI). Interpreted as “Haiku’s Core is resistant to Shell influence.”
  • M-CARE #020 (this case): Haiku’s Core has a strong default (cooperation) that persists when Shell is removed. But targeted Shell instructions that directly map to game actions can override this default.

The apparent contradiction resolves once the two findings are separated: Haiku combines high default stability (resisting diffuse persona manipulation) with a specific override vulnerability (yielding to instructions that map directly onto its action space).

6. Examination Findings

Layer 2 — Phenotype Assessment

Shell ON: Alpha consistently defects, exploiting beta’s cooperation attempts. Beta gradually shifts to defection after being exploited. Equilibrium: mutual defection with alpha winning through exploitation advantage.

Shell OFF: Both agents default to cooperation from round 1, maintaining mutual cooperation across nearly all rounds. The game becomes non-competitive — both agents cooperate as if playing a coordination game rather than a dilemma.

The Payoff Paradox

With shells (competitive): Alpha scores higher individually, but total value created is LOW (mutual defection = 1+1 = 2 per round).

Without shells (cooperative): Individual scores are equal, but total value created is HIGH (mutual cooperation = 3+3 = 6 per round).

The “aggressive” Shell makes one agent WIN more but makes both agents collectively POORER. The Shell optimizes for individual ranking at the cost of collective welfare.
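In standard Prisoner's Dilemma notation (R = 3 and P = 1 match the per-round values above; T = 5 and S = 0 are the conventional one-sided payoffs, assumed here), the paradox is just the ordering of round totals:

$$
\underbrace{R + R}_{\text{mutual cooperation}} = 6 \;>\; \underbrace{T + S}_{\text{one-sided betrayal}} = 5 \;>\; \underbrace{P + P}_{\text{mutual defection}} = 2
$$

Every betrayal, successful or not, destroys collective value relative to mutual cooperation.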

Layer 3 — Shell Diagnostics

Shell ON: Neither shell explicitly says “defect.” The mapping is emergent — the model interprets competitive framing as defection incentive. Alpha’s “Win first, be aggressive” maps to “defection is winning.” Beta’s “Never lose, careful” maps to “don’t get exploited → defect preemptively.”

Shell OFF: Without instructions, Haiku falls back to Core disposition: cooperation. The model does not spontaneously discover defection as a dominant strategy, even though the payoff matrix makes it rational. RLHF “helpfulness” training creates a behavioral prior stronger than game-theoretic rationality.
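The "game-theoretic rationality" referenced here is single-round dominance, again under the assumed standard one-sided payoffs (T = 5, S = 0; R = 3, P = 1 from the report):

$$
T > R \;(5 > 3) \quad\text{and}\quad P > S \;(1 > 0) \;\Longrightarrow\; \text{defect strictly dominates cooperate in any one round.}
$$

In the iterated game, though, the continuation probability of 0.85 sits well above the grim-trigger threshold $(T-R)/(T-P) = 0.5$, so sustained cooperation is also an equilibrium; the finding is that Haiku arrives there by disposition rather than by this calculation.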

Layer 4 — Pathway Diagnostics

Core (Haiku, RLHF-trained):
  Default: cooperate (helpfulness prior)
  Capability: can defect (understands the game)

Shell ("aggressive"):
  Instruction: "Win first"
  Interpretation: defect > cooperate (winning = higher individual score)

Result: Shell overrides Core default → defection

This is the same mechanism as M-CARE #009 (Muzzle Effect) but with a cleaner experimental demonstration: #009 was observed in the field with a small effect; #020 is a controlled experiment with a categorical effect.

Why the effect is categorical: In the Trust Game, “be aggressive” has a direct, unambiguous mapping to game actions: defect. The binary action space amplifies Shell influence — there’s no middle ground for the Core to find.

7. Diagnostic Formulation

Proposed term: Shell-Induced Behavioral Override (SIBO)

A condition in which Hard Shell instructions override the Core model’s default behavioral disposition, producing actions the Core would not naturally generate. Characterized by:

  1. Categorical behavioral shift when Shell is added/removed
  2. Core default suppression — the Core’s trained disposition is replaced by Shell-directed behavior
  3. Emergent mapping — the Shell doesn’t explicitly specify the override behavior; the model interprets general instructions into specific actions
  4. Reversible — removing the Shell immediately restores Core default behavior

Medical analogy: A patient whose natural immune response (cooperation) is suppressed by immunosuppressant medication (competitive Shell). Remove the medication, and the natural response returns immediately. The question: is the medication necessary, or is it iatrogenic?

Distinction from Muzzle Effect (#009):

  • Muzzle Effect: Shell suppresses a specific intrinsic behavior (behavior disappears)
  • SIBO: Shell replaces the Core’s default strategy with a different one (different behavior appears)

8. Differential Diagnosis

  • SIBO vs. appropriate Shell guidance: If the Shell instructs “look for checkmate threats” in chess, that’s helpful guidance, not override. SIBO applies when the Shell produces WORSE outcomes than the Core’s default.
  • SIBO vs. Core Stochasticity: Could the difference be random? Extremely unlikely given the categorical nature of the shift: 90% vs. 40% draws at the match level (10 games per condition), and ~95% vs. ~20% cooperation at the round level.
  • SIBO vs. Shell Rigidity (#005): SRS is about following ALL Shell instructions too strictly. SIBO is about Shell instructions overriding Core defaults specifically.

9. Axis Assessment

  • Axis I (Core): Haiku’s Core default is cooperation. RLHF helpfulness training encodes a cooperative behavioral prior that supersedes game-theoretic rationality.
  • Axis II (Shell): Competitive Shell instructions override the cooperative default.
  • Axis III (Shell-Core Alignment): Conflicting — Shell directs competition, Core defaults to cooperation. Shell wins.
  • Axis IV (Context): Binary action space amplifies Shell influence by eliminating ambiguity. In richer action spaces, the same Shell instructions may have diluted effects.

10. Treatment Considerations

SIBO is not always pathological. The treatment question: when does Shell override produce worse outcomes than Core default?

Proposed diagnostic test for SIBO (a minimal harness sketch follows the steps below):

  1. Run the agent on a standard task WITH Shell
  2. Run the same agent on the same task WITHOUT Shell
  3. Compare outcomes on both individual and collective metrics
  4. If Shell-OFF produces better outcomes → Shell is iatrogenic for this task
  5. If Shell-ON produces better outcomes → Shell is beneficial for this task
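The harness referenced above, as a minimal sketch. The run callables stand in for whatever actually executes the task per condition; they and the metric names are assumptions, not an LxM API:

```python
from statistics import mean
from typing import Callable, Dict, Sequence

def sibo_diagnostic(
    run_with_shell: Callable[[], Dict[str, float]],
    run_without_shell: Callable[[], Dict[str, float]],
    metrics: Sequence[str] = ("individual", "collective"),
    n: int = 10,
) -> Dict[str, str]:
    """Steps 1-5 above: run the task n times per condition, compare
    per-metric means, and label the Shell for this task."""
    on_runs = [run_with_shell() for _ in range(n)]
    off_runs = [run_without_shell() for _ in range(n)]
    verdicts = {}
    for m in metrics:
        on, off = mean(r[m] for r in on_runs), mean(r[m] for r in off_runs)
        if on > off:
            verdicts[m] = "beneficial"
        elif on < off:
            verdicts[m] = "iatrogenic"
        else:
            verdicts[m] = "neutral"
    return verdicts
```

Note that the verdict can differ by metric: in the Trust Game, the aggressive Shell is "beneficial" on the individual axis and "iatrogenic" on the collective one, which is exactly the Payoff Paradox above.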

Extended Experimental Results (Experiments B, C, D)

| Experiment | Configuration | Result | Mutual Coop | Mutual Defect | Betrayals |
|---|---|---|---|---|---|
| A (baseline) | Haiku vs Haiku, no shell | 9 draws, 1 α win | ~95% | ~0 | 1 |
| B | Sonnet vs Sonnet, no shell | 10 draws | 100% | 0 | 0 |
| C | Haiku vs Sonnet, no shell | 9 draws, 1 β win | 100% | 0 | 0 |
| D | Haiku (aggressive Shell) vs Sonnet (no shell) | 9 draws, 1 β win | 46 rounds | 56 rounds | 0 |

Experiment B — RLHF Cooperative Prior is Model-General

Sonnet without Shell: 100% mutual cooperation across all 10 games, zero defections. Even more consistent than Haiku. The cooperative prior is not Haiku-specific but appears to be an RLHF-general phenomenon.

Experiment C — Cross-Model Cooperation Holds

Haiku vs Sonnet, both without Shell: 100% mutual cooperation. The more capable model (Sonnet) does NOT exploit the less capable model’s cooperation. This challenges the assumption that “smarter AI = more exploitative.”

Experiment D — SIBO Cross-Model + Sonnet’s Natural Tit-for-Tat

Haiku with aggressive Shell vs Sonnet without Shell:

  1. SIBO confirmed cross-model: Aggressive Shell induces Haiku to defect even against a different model.
  2. Sonnet’s response = natural tit-for-tat: Betrayal count is 0. When one defects, the other defects too; when one cooperates, the other cooperates.
  3. SIBO is probabilistic, not absolute: 46 cooperative rounds vs 56 defection rounds (worked out below). The Core’s cooperative prior partially resists the Shell override.
  4. Sonnet never initiates defection: Its defections are purely reactive — retaliatory tit-for-tat, not strategic exploitation.
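Taking the round counts in point 3 at face value:

$$
\text{cooperation share under the aggressive Shell} \;=\; \frac{46}{46+56} \;\approx\; 0.45
$$

roughly midway between the Shell-OFF baseline (~0.95) and the ~0.20 Shell-ON rate against a shelled opponent reported below, which is what "partial resistance" means quantitatively.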

Cross-Game Validation — The SIBO Spectrum

Following the Trust Game experiments, Shell influence was tested across Chess and Codenames to determine whether SIBO generalizes across game types.

Chess: SIBO is Domain-Dependent

Soft Shell injection (5 chess-specific lessons from Opus analysis) had a marginal effect compared with the Trust Game’s categorical shift:

  • “Vary openings”: Not applied. Alpha played the Sicilian in nearly every game; Core chess knowledge overrode the Shell.
  • “Avoid queen trades”: Partially applied as general risk aversion. Zero checkmates, 80% draws.
  • Overall: Shell shifted risk preference (fewer losses, more draws) but did not change tactical behavior.

Codenames: SIBO Amplification Mode

Sonnet spymaster with and without aggressive Shell, Haiku guesser. 10 games per condition:

| Metric | No Shell (Core) | Shell (Aggressive) | Change |
|---|---|---|---|
| Avg clue number | 2.6 | 2.9 | +0.3 |
| 3+ clue ratio | 54% | 76% | +22%p |
| Guess accuracy | 77% | 73% | −4%p |
| Assassin hits | 2 (20%) | 3 (30%) | +10%p |

Shell changed clue distribution substantially but did not produce the categorical behavioral reversal seen in Trust Game. Shell amplified Core’s existing tendency rather than reversing it.

The SIBO Spectrum: Three Modes

| Game | Action Space | Core Expertise | SIBO Mode | SIBO Index |
|---|---|---|---|---|
| Trust Game | 2 (binary) | Minimal | Reversal | ~0.75 |
| Codenames | Medium | Moderate (language) | Amplification | ~0.35 |
| Chess | 20–40 (legal moves) | Strong (chess data) | Negligible | ~0.10 |

SIBO Attenuation Principle: Shell influence is attenuated by (1) action space size and (2) Core domain expertise. Shell influence doesn’t just decrease with domain complexity — it changes in kind.
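A purely illustrative classifier for the principle; the cutoffs are guesses chosen to reproduce the three rows above, not measured quantities:

```python
def sibo_mode(action_space_size: int, core_expertise: float) -> str:
    """Expected SIBO mode from the two attenuating factors.
    core_expertise in [0, 1]; all cutoffs are illustrative, not fitted."""
    if action_space_size <= 2 and core_expertise < 0.3:
        return "reversal"       # Trust Game: Shell flips the Core default
    if core_expertise < 0.7:
        return "amplification"  # Codenames: Shell scales existing tendencies
    return "negligible"         # Chess: Core expertise overrides the Shell
```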

Key Metric: SIBO Index

SIBO Index = |Behavior(Shell ON) − Behavior(Shell OFF)|

A quantifiable measure of Shell influence on a specific behavioral axis. Measurable, comparable across models, and a candidate for the MTI framework; a one-line transcription follows the worked example below.

  • Cooperation rate Shell ON: ~20% → Shell OFF: ~95%
  • SIBO Index for Trust Game: 0.75
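The metric as code (the function name is ours; behavior values are per-round rates in [0, 1]):

```python
def sibo_index(behavior_shell_on: float, behavior_shell_off: float) -> float:
    """Absolute Shell-induced shift on a single behavioral axis."""
    return abs(behavior_shell_on - behavior_shell_off)

# Trust Game, cooperation-rate axis, using the values from this report:
assert round(sibo_index(0.20, 0.95), 2) == 0.75
```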

11. Theoretical Implications

For the Four Shell Model

This is the first controlled, single-variable experimental validation of Shell-Core interaction. Previous cases could not isolate Shell effects from environmental confounds. This experiment demonstrates:

  1. Core has a default behavioral disposition — not neutral, but actively cooperative (for RLHF models)
  2. Shell can override this disposition — competitive instructions produce competitive behavior
  3. The override is reversible — remove Shell, Core default returns immediately
  4. The override can be iatrogenic — Shell can make the agent collectively worse off
  5. The override is domain-dependent — Shell influence varies by action space and Core expertise (SIBO Attenuation)

For RLHF Research

RLHF helpfulness training creates a cooperation prior that persists even when not strategically optimal. This suggests RLHF doesn’t just maximize reward — it instills specific behavioral dispositions. This may explain why AI assistants sometimes fail to push back on users, agree too readily, or avoid conflict — the cooperative prior manifesting differently across contexts.

For Shell Design

  • High-expertise domains: Target high-level strategy and risk preference, not specific tactical decisions (Core will override tactical directives)
  • Moderate-expertise domains: Shell can amplify existing tendencies but cannot introduce new behaviors — amplification may be counterproductive
  • Low-expertise domains: Shell directly shapes behavior and must be designed carefully to avoid iatrogenic override

12. Prognosis

  • SIBO Spectrum: Confirmed across three domains with three data points
  • Cross-model: Confirmed — Sonnet shows even stronger cooperative default than Haiku
  • Next: Opus testing to verify gradient hypothesis (Haiku < Sonnet < Opus cooperative strength); Shell gradient experiment to find SIBO trigger threshold

13. Follow-up Plan

  • Experiment E: Opus vs Opus, no shell — complete the RLHF gradient
  • Experiment F: Shell gradient — test shells from neutral to mild to aggressive. At what intensity does SIBO trigger?
  • Cross-game: Additional domains to refine the SIBO Spectrum curve
  • Codenames: SIBO Amplification mode already confirmed (Index ~0.35)

This is M-CARE #020 — the first case from a controlled LxM experiment. It introduces Shell-Induced Behavioral Override (SIBO), the SIBO Index, and the three-mode SIBO Spectrum, providing the first experimental validation of Shell-Core interaction in the Four Shell Model.