Why This Matters
The promise of AI cross-auditing is simple: have one AI check another's work. If GPT-4 hallucinates, let Claude catch it. If Claude has a blind spot, let Gemini find it. Adversarial oversight through redundancy.
But auditing requires independence. When the SEC audits a company, we don't let the company's subsidiary do the audit. When the FAA inspects a plane, we don't let the airline's sister company sign off. Independence is the structural foundation that makes oversight meaningful.
The problem with AI "independence"
Look at the family tree. The connections are pervasive:
- Shared personnel: The people who built GPT-3 literally founded Anthropic and built Claude. The people who built LLaMA at Meta left to found Mistral. The CEO of Mistral is a co-author on DeepMind's Chinchilla paper. A small community of ~200 researchers built most of what exists.
- Shared training data: Almost every frontier model is trained on Common Crawl, Wikipedia, books, and code scraped from the same internet. The exact overlap is unknown but estimated at >60% for major Western models.
- Shared architecture: Every major model descends from a single 2017 paper, "Attention Is All You Need," by 8 Google researchers. At least 6 of the 8 authors have since left to found or join AI startups. The Transformer is the common ancestor of everything.
- Direct distillation: Some models are literally trained on another model's outputs. Alpaca = Meta's weights + OpenAI's outputs. Vicuna = Meta's weights + ChatGPT conversations. These are genetic hybrids.
- Methodological inheritance: The RLHF technique developed at OpenAI for InstructGPT was carried directly to Anthropic by the researchers who created it, where it evolved into Constitutional AI. The training methodology DNA is shared even when the weights are not.
If two models share the same researchers, were trained on the same data, use the same architecture, and one was partially trained on the other's outputs, in what meaningful sense are they "independent" auditors of each other?
Correlated failures
Shared ancestry means shared blind spots. Models with overlapping training data will hallucinate about the same obscure topics. Models built by the same researchers will have the same implicit assumptions about what "safety" means. Models with the same architecture will fail on the same adversarial inputs.
An audit that misses the same things you miss isn't an audit. It's a mirror.
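The correlated-failures argument can be made quantitative. A minimal sketch, using hypothetical per-item benchmark results (the arrays below are illustration data, not real evaluations): if two models erred independently, the probability that both miss the same item would be the product of their error rates, so the excess of the observed joint-error rate over that baseline measures how correlated their blind spots are.

```python
# Sketch: measuring correlated failures between two models on a shared
# benchmark. The per-item results below are hypothetical illustration data.
a_wrong = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 1 = model A answered item wrong
b_wrong = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # 1 = model B answered item wrong

n = len(a_wrong)
p_a = sum(a_wrong) / n                            # A's error rate
p_b = sum(b_wrong) / n                            # B's error rate
p_both = sum(a & b for a, b in zip(a_wrong, b_wrong)) / n

# Under independence we'd expect a joint-error rate of p_a * p_b.
excess = p_both - p_a * p_b
print(f"joint error rate {p_both:.2f} vs {p_a * p_b:.2f} expected if independent")
print(f"excess correlation: {excess:+.2f}")
```

A positive excess means the two models miss the same items more often than chance would predict; an auditor with a large excess against the model it audits is closer to a mirror than an audit.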
The case for cross-model auditing (despite the incest)
I believe AI cross-auditing is still worth pursuing, but only if we're honest about its limits and design around them:
- Map independence before auditing. That's what this project is. Before trusting Model A to audit Model B, quantify their genealogical distance. The family tree is a prerequisite for meaningful oversight.
- Maximize diversity in audit panels. If you're going to use cross-model auditing, use models with the least shared ancestry. A Chinese model (DeepSeek, Qwen) auditing a Western model may catch things that another Western model wouldn't: different data, different researchers, different implicit assumptions about what matters.
- Ground audits in verifiable facts. Don't just ask "is this output good?" Ask "is this output true?" and check against external evidence. One approach I've been prototyping: cryptographic verification of factual claims against authoritative sources.
- Don't let cross-auditing substitute for human oversight. AI auditing AI is a complement to human-led external oversight, not a replacement. The family tree shows why: the ecosystem is too inbred for any model to be truly independent.
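The first two points above can be sketched concretely. Assuming we represent each model's ancestry as a set of features (key personnel, data sources, architecture family, distillation parents), Jaccard distance between feature sets gives a crude genealogical distance, and the preferred audit pairing is the one with the largest distance. The model names and feature sets below are hypothetical illustrations, not real lineage data.

```python
from itertools import combinations

# Sketch: genealogical distance as Jaccard distance over ancestry features.
# All feature sets here are hypothetical, not a real lineage dataset.
ancestry = {
    "model_a": {"transformer", "common_crawl", "rlhf", "lab_x_alumni"},
    "model_b": {"transformer", "common_crawl", "rlhf", "lab_x_alumni",
                "distilled_from_a"},
    "model_c": {"transformer", "chinese_web_corpus", "rlhf_variant"},
}

def genealogical_distance(a: set, b: set) -> float:
    """1 - Jaccard similarity: 0 = identical ancestry, 1 = fully disjoint."""
    return 1 - len(a & b) / len(a | b)

# Score every pairing and pick the most independent one for auditing.
pairs = {
    (m1, m2): genealogical_distance(ancestry[m1], ancestry[m2])
    for m1, m2 in combinations(ancestry, 2)
}
best = max(pairs, key=pairs.get)
for pair, d in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(pair, f"distance={d:.2f}")
```

By this crude measure, model_b is a poor auditor for model_a (it shares nearly all ancestry features and was distilled from it), while model_c, with different data and methods, is the better pairing. The point is not the specific metric but that independence can be estimated before it is assumed.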
Historical precedent
Every industry that touches public safety went through a version of this reckoning:
- Finance: Arthur Andersen audited Enron while consulting for Enron. The conflict of interest was invisible until $74 billion in shareholder value evaporated. Sarbanes-Oxley (2002) mandated auditor independence: you can't audit a company you also advise.
- Pharmaceuticals: Before the FDA, drug companies ran their own safety trials and published only favorable results. The 1962 Kefauver-Harris Amendment required independent evidence of efficacy, not just manufacturer assurances.
- Aviation: Boeing's 737 MAX crashes killed 346 people. A key factor: the FAA had delegated much of its safety certification to Boeing itself. The fox was guarding the henhouse.
AI is in the "pre-regulation" phase of this cycle. Every major lab grades its own homework. Model cards are self-reported. Safety benchmarks are often designed by the same organizations whose models they evaluate. The family tree shows that even "cross-model" evaluation doesn't solve this if the models share the same DNA.
For more on why external oversight matters, see Who Audits the Auditor.