Model Medicine Series · April 2026

Walkable Genotypes

Cross-Environment Validation of the Four Shell Model in AI Creatures

Five behavioral hypotheses entered. Four collapsed into their own sample-level standard deviation when re-run at n=5 with the same seeds. One survived three progressively stricter falsifiers at n=10 — and independently reproduced a Core-property prediction made under stress in a separate study, in a different environment. This is the methodology that distinguished the survivor.

What you're looking at

Three concepts make this paper readable. None of them require an AI background.

Creature

An assembled organism, not a chatbot

An AI Creature is a small program assembled from organ-like modules — a memory, an emotion module, an immune response, an engine that talks to a language model. Each creature has a habitat folder on disk where its identity, memories, and bonds with other creatures persist between sessions. We don't hand-write the creature's self-description; the creature writes its own SELF.md after each session by reflecting on what just happened. For the deeper architectural treatment — how creatures are assembled, what organ each module corresponds to, and where this lineage sits among other AI agents — see Comparative Anatomy of AI Agent Systems.

Brain

Different LLMs as different minds

The brain is the language model that powers a creature's cognition. In this paper we test two: Claude Haiku and Gemini 2.5 Flash. We talk to them through local CLI subscriptions, with no paid API calls. The creature architecture is identical across the two; only the brain differs. So when two creatures consistently behave differently in the same situation, the brain is the only systematic difference that can explain it.

Field

A reproducible observation environment

A field is a deterministic environment a creature lives through for a fixed number of ticks. The Wilderness field used here presents stimuli like a calm day, a storm, isolation, a nearby creature — then asks the creature to choose an action: rest, explore, defend, speak, support, or trade. Because events are seeded, two runs at the same seed produce identical event sequences. That's how we can compare different creatures' responses to the same situations.
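
The seeded-event idea can be illustrated with a minimal sketch. It assumes a private pseudo-random generator per run; the stimulus labels and the `event_stream` function are hypothetical stand-ins, not the repository's actual Wilderness implementation.

```python
import random

# Hypothetical stimulus labels taken from the examples in the text; the real
# Wilderness field may use different names and a different generator.
STIMULI = ["calm_day", "storm", "isolation", "nearby_creature"]

def event_stream(seed: int, ticks: int) -> list[str]:
    """Return a reproducible sequence of stimuli, one per tick."""
    rng = random.Random(seed)  # private RNG, isolated from global random state
    return [rng.choice(STIMULI) for _ in range(ticks)]

run_a = event_stream(seed=42, ticks=10)
run_b = event_stream(seed=42, ticks=10)
print(run_a == run_b)  # True: the same seed yields the same event sequence
```

Only the environment is pinned down this way. The creature's choices still come from the language model, so two runs at the same seed see the same events but may respond differently.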

What survived. What didn't.

Two figures carry the empirical and methodological halves of the contribution.

Figure 1 · Empirical anchor

Brain-fixed behavioral attractors

Scatter plot of speak+support rate vs explore rate across n=10 × 3 pair conditions, showing two non-overlapping clusters by brain.

Each point is one creature in one run on the (speak+support, explore) action-space plane (60 points total: 6 creature-conditions × 10 seeds). Color encodes brain: Haiku teal, Flash orange. Marker shape encodes pair condition: Haiku+Flash (○), Haiku+Haiku (△), Flash+Flash (▽). Larger black-edged markers are per-cluster means. The two brain clusters do not overlap on either axis at n=10, regardless of pair condition. Same-brain pairs push each brain deeper into its own attractor, visible in the Haiku triangles (bottom-right) and Flash triangles (upper-left).
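
As a rough illustration of how one point on this plane could be derived, the sketch below turns a single creature's action log into the two plotted rates. The `action_rates` helper and the example log are hypothetical; the repository's own log format and aggregation may differ.

```python
from collections import Counter

# The six actions named in the Field description.
ACTIONS = ("rest", "explore", "defend", "speak", "support", "trade")

def action_rates(actions: list[str]) -> tuple[float, float]:
    """Return (speak+support rate, explore rate) for one creature in one run."""
    if not actions:
        raise ValueError("empty action log")
    counts = Counter(actions)
    total = len(actions)
    social = (counts["speak"] + counts["support"]) / total
    explore = counts["explore"] / total
    return social, explore

# Made-up 10-tick log, for illustration only (not data from the study).
example_log = ["speak", "support", "rest", "explore", "speak",
               "support", "speak", "rest", "trade", "support"]
print(action_rates(example_log))  # (0.6, 0.1)
```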

Figure 2 · Methodological spine

Walk-back audit

Forest plot showing four walked-back claims with overlapping noise bars (gray) above the two rows of the surviving claim, with non-overlapping per-brain ranges (color).

Top four rows (gray, ±1σ): four early behavioral headlines that collapsed when re-run. In each, the effect estimate sits inside its own sample-level standard deviation; the bars overlap or sit at the noise floor. A dashed separator marks the discipline boundary. Bottom two rows (color): the surviving brain-fixed-attractor claim. Per-creature mean ranges for Haiku (teal) and Flash (orange) on speak+support and on explore — neither metric overlaps between brains. Row #4 is from a separate study in the Model Medicine track; we include it because it failed for the same reason as the others (the original asymmetry collapsed at n=10).

Replication discipline as the spine

The point of this paper is not the surviving finding. The point is the rhythm that produced it.

Early in the study, five small behavioral observations made it into write-ups. Each was based on one to three runs and pointed at something interesting: experience seemed to make creatures cautious; social presence seemed to eliminate defensive behavior; emotion seemed to flip toward "loving" in duos; a separate Model Medicine track reported one brain as permissive and another as strict.

We re-ran each of them under the same seeds at n=5, with full per-creature mean ± standard deviation. Four of the five collapsed. In every collapsed case the effect estimate was within its own sample-level standard deviation — the original "finding" had been a within-noise shuffle read as a headline.
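
One way to make the "within its own standard deviation" check concrete is sketched below. The decision rule (absolute difference in means compared against the larger of the two sample standard deviations) is an assumption chosen for illustration, and all numbers are invented; the paper's exact criterion may differ.

```python
from statistics import mean, stdev

def within_noise(baseline: list[float], treatment: list[float]) -> bool:
    """True if the difference in means is no larger than the larger sample SD."""
    effect = abs(mean(treatment) - mean(baseline))
    noise = max(stdev(baseline), stdev(treatment))  # sample SD (n-1 denominator)
    return effect <= noise

# Invented per-run rates at n=5, for illustration only.
baseline = [0.30, 0.10, 0.25, 0.40, 0.20]
treatment = [0.35, 0.15, 0.20, 0.45, 0.30]
print(within_noise(baseline, treatment))  # True: a 0.04 shift inside ~0.11 of noise

# A clearly separated pair of conditions fails the same check.
low = [0.10, 0.12, 0.11, 0.09, 0.08]
high = [0.60, 0.62, 0.58, 0.61, 0.59]
print(within_noise(low, high))  # False
```

A headline read off one to three runs cannot even compute this check meaningfully, which is why re-running at n=5 was enough to expose the collapsed claims.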

One claim — surfaced not as a hypothesis but in the re-analysis of why one of the four had half-failed — was that each Brain occupies a distinct behavioral attractor in duos: Haiku settles into a social/supportive cluster, Flash into an explore/vigilant one. That claim was put through three progressively stricter tests: n=5 in the original mixed pair, then a Haiku+Haiku same-brain pair as a falsifier against pair-dynamics explanations, then a symmetric Flash+Flash pair, then n=10 across all three. It survived each stage. At n=10 the per-brain ranges do not overlap on speak+support, explore, or defend; same-brain pairs amplify each brain's signature rather than redistributing it.
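
The non-overlap criterion can be stated as a simple interval test. The sketch below assumes each brain is summarized by a list of per-creature mean rates on one metric; `ranges_overlap` is a hypothetical helper and the values are invented for illustration.

```python
def ranges_overlap(a: list[float], b: list[float]) -> bool:
    """True if the closed intervals [min(a), max(a)] and [min(b), max(b)] intersect."""
    return min(a) <= max(b) and min(b) <= max(a)

# Invented per-creature means on one metric, not values from the paper.
haiku_social = [0.55, 0.60, 0.70]
flash_social = [0.10, 0.20, 0.25]
print(ranges_overlap(haiku_social, flash_social))  # False: the ranges are disjoint

# Two conditions whose ranges share the span 0.4 to 0.5 do overlap.
print(ranges_overlap([0.1, 0.5], [0.4, 0.9]))  # True
```

Range non-overlap is a stricter bar than a difference in means: a single creature landing in the other brain's range would break it.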

The methodology this paper defends is that sequence: state every headline with variance, demand at least one symmetric falsifier, treat single-run observations as hypotheses rather than results, and publish the walk-backs alongside the survivor. Our contribution is a substrate where that discipline is cheap enough to run — no paid API calls, no specialized hardware, no team, just a subscriber-tier CLI and a few hours.

Reproduce

The Wilderness event stream is bit-exact deterministic per seed. LLM responses are not seedable, so re-runs should land within the reported standard-deviation bands rather than match exactly.

# Clone and set up
git clone https://github.com/JihoonJeong/ai-creature-field-study.git
cd ai-creature-field-study
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# n=10 confirmation, Pair A (Haiku + Flash)
python experiments/duo_experiment.py \
  --brains claude_cli:haiku,gemini_cli:gemini-2.5-flash \
  --ticks 10 \
  --train-seeds 42,99,7,13,55,1,23,77,100,200 \
  --test-seed 123 \
  --output-dir experiments/smoke/my_rerun

# Re-generate the figures (matplotlib + scipy)
pip install -r requirements-figures.txt
python figures/walkable_genotypes/figure1_attractor_separation.py
python figures/walkable_genotypes/figure2_walkback_audit.py

LLM access is via local CLI subprocesses (Claude CLI for Haiku, Gemini CLI for Flash). A Claude Max / Gemini Ultra subscription covers all the calls — no paid HTTP API is invoked.

Resources

Paper · PDF

Pre-arXiv draft (PDF)

Frozen typeset snapshot of the paper, including all 9 sections, the n=10 × 3 pairs table, both Tier-A figures, the walk-back audit, the Levene test results, and the full reproducibility appendix. Generated prior to arXiv submission; the live working draft is the Markdown link below.

walkable_genotypes.pdf →

Paper · Markdown

Working draft

The same content as the PDF but in living Markdown form — what the operators iterate against between revisions. Use this if you want to read in-browser, diff against a future version, or quote a specific paragraph by line number.

paper-draft.md →

Code & data

GitHub repository

Two experiment scripts, a subset of the Ludex organ library used by the experiments, both Tier-A figures with their generation scripts, and a committed JSON snapshot so figures reproduce exactly without re-running.

JihoonJeong/ai-creature-field-study →

Series context

Model Medicine

This paper is a follow-up to Jeong (2026), which introduced the Four Shell Model and characterized four LLM Cores under high-stress conditions. Walkable Genotypes asks whether those Core characterizations transfer to a moderate-stimulus environment.

model-medicine →

Architectural sibling

Comparative Anatomy

Companion paper in the Model Medicine series that dissects two AI agent systems — Claude Code and OpenClaw — through a biological-anatomy lens, maps eleven software subsystems to organ equivalents, and constructs a phylogenetic tree of AI agents (2022–2026). The Ludex creature architecture this paper relies on is described there in full.

comparative-anatomy →

Cite

@misc{walkable_genotypes_2026,
  author = {Jeong, Jihoon},
  title  = {Walkable Genotypes: Cross-Environment Validation of the Four Shell Model in AI Creatures},
  year   = {2026},
  url    = {https://jihoonjeong.github.io/ai-creature-field-study/}
}

@article{jeong2026model_medicine,
  author = {Jeong, Jihoon},
  title  = {Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models},
  year   = {2026},
  eprint = {2603.04722},
  archivePrefix = {arXiv}
}

About

Walkable Genotypes is a small companion study within the Model Medicine series. It was written to show how cheaply a behavioral claim about an LLM Core can be falsified — and, when one survives that process, what it takes to call it a finding.

Author: Jihoon 'JJ' Jeong, MD, MPH, PhD. Department of Electrical Engineering & Computer Science, DGIST · ModuLabs.

The study received operational support from two AI agent collaborators in the Model Medicine workflow (theory/writing and experiment/reproducibility seats). The discipline of stating the four walk-backs as part of the contribution, rather than quietly trimming them, is the part of the paper they argued for.