llm-wiki: Fleming-tier scientific paper wiki

Goal

A persistent LLM-maintained wiki of paradigm-shifting scientific papers (Fleming-tier: Nobel-grade, field-founders, papers that enabled subsequent breakthroughs). Each paper is presented with:

Why the discovery mattered
What it enabled (lineage of follow-on work)
How it relates to other paradigm-shifting papers (chunk-level graph)
LLM-generated explanations on every relationship edge

The wiki is git-tracked Markdown, statically deployed to GitHub Pages, conformant to the Karpathy LLM Wiki pattern, and uses standard formats (DOI, CSL-JSON, OpenAlex Concepts/MeSH, JSON-LD, Markdown+YAML, Parquet, GraphML) so the data interoperates with any other system.

Non-goals (v0)

Real-time updates (manual ingest only)
Multi-user editing
Mobile-optimised UI
Languages other than English
Exhaustive Fleming-tier judgments — LLM filter + manual spot-check is the v0 quality bar

Architecture (Karpathy LLM Wiki pattern, strict)

Three layers, mirrored to disk:

~/code/llm-wiki/
├── data/raw/                Layer 1 (immutable)
│   ├── seed/                9 authoritative source dumps (.json)
│   ├── candidates.parquet   ~1500 deduped candidates
│   ├── core.parquet         ~600 LLM-confirmed Fleming-tier
│   ├── papers/<doi>/        PDF + full.md + tree.json (PageIndex output)
│   └── metadata.parquet     OpenAlex full enrichment
├── docs/                    Layer 2 (LLM-generated wiki, served by MkDocs)
│   ├── paper/<doi>.md
│   ├── topic/<slug>.md
│   ├── lineage/<chain>.md
│   ├── superpowers/specs/   This spec lives here
│   └── viz/                 UMAP map.html, graph.html
├── data/graph/
│   ├── nodes.parquet        chunk-level nodes (PageIndex leaves)
│   ├── edges.parquet        5-typed edges + LLM "why related" labels
│   └── graph.gexf           Cytoscape/Gephi export
├── data.duckdb              Single analytics DB (queryable from anywhere)
├── data/vectors/            ChromaDB (per-leaf embeddings)
├── CLAUDE.md                Layer 3 — Schema (ingest/query/lint commands)
├── mkdocs.yml               Static site config
└── src/llm_wiki/            Implementation (10 components)
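
The layout uses `<doi>` as a path component, but DOIs contain `/` and are case-insensitive, so some mapping to a filesystem-safe name is needed. The spec does not say which one; the sketch below assumes percent-encoding and a hypothetical `doi_to_slug` helper, and the DOI in the usage note is a placeholder, not a real paper.

```python
from pathlib import Path
from urllib.parse import quote, unquote

# Directory from the layout above; the helpers below are hypothetical.
RAW_PAPERS = Path("data/raw/papers")


def doi_to_slug(doi: str) -> str:
    """Lower-case the DOI, drop a resolver prefix, percent-encode the rest."""
    normalized = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):]
            break
    # safe="" forces "/" to be encoded so the slug is a single path segment.
    return quote(normalized, safe="")


def slug_to_doi(slug: str) -> str:
    """Inverse of doi_to_slug (returns the lower-cased DOI)."""
    return unquote(slug)


def paper_dir(doi: str) -> Path:
    """Directory that would hold the PDF, full.md and tree.json for one paper."""
    return RAW_PAPERS / doi_to_slug(doi)
```

For example, `paper_dir("https://doi.org/10.1000/Example.1")` resolves to a single directory under `data/raw/papers/` rather than a nested `10.1000/example.1` path. A different scheme (for instance replacing `/` with `_`) would work equally well as long as it is reversible.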

Components (10, each independently testable)

| # | Module | In | Out | External deps |
|---|--------|----|-----|---------------|
| 1 | `seed_assembler` | 9 source URLs | `candidates.parquet` (~1500) | requests, BeautifulSoup |
| 2 | `tier_filter` | candidates | `core.parquet` (~600) | LLM (Claude session key) |
| 3 | `metadata_enricher` | core | `metadata.parquet` | OpenAlex API |
| 4 | `paper_fetcher` | core DOIs | `papers/<doi>/pdf` | OA → TAMU proxy → manual |
| 5 | `chunker` | PDFs/HTML | `papers/<doi>/tree.json` | PageIndex library |
| 6 | `concept_tagger` | leaves + abstract | tag table | OpenAlex Concepts + MeSH |
| 7 | `graph_builder` | leaves + tags + refs | `edges.parquet` (5 types) | embeddings + heuristics |
| 8 | `edge_labeler` | edges | LLM-written "why related" labels | LLM |
| 9 | `wiki_generator` | all of the above | `docs/paper\|topic\|lineage/*.md` | jinja2 |
| 10 | `viz_renderer` | graph + UMAP | `docs/viz/*.html` | Cytoscape.js, UMAP, plotly |

9 authoritative seed sources

Nobel Prize references (nobelprize.org, 1901–2024 lectures + key citations)
Van Noorden "Top 100 papers" (Nature 2014, 514:550–553)
Wikipedia "Year in science" 1900–2024 (each year's discoveries + source DOIs)
NIH Landmark Publications
APS Centennial Papers (physics, 1899–1998)
Karpathy AI reading list + Awesome ML Papers (GitHub)
Garfield Citation Classics (when accessible)
OpenAlex query: cited_by_count > 10000 AND publication_year < 2010 (an automatic filter for papers that have sustained ultra-high citation counts for 14+ years)
Wikipedia "List of Nobel laureates" — each laureate's "key publications"
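
The OpenAlex source in the list above is the only one defined by a query rather than a page to scrape. A minimal sketch of how that query might be expressed as a request URL is shown here; it assumes the OpenAlex `/works` endpoint with its comma-separated `filter` syntax, `per-page` and `cursor` paging parameters as I understand them, and the function name is hypothetical. It only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_seed_query_url(
    min_citations: int = 10000,
    before_year: int = 2010,
    per_page: int = 200,
    cursor: str = "*",
) -> str:
    """Build the URL for: cited_by_count > min_citations AND publication_year < before_year."""
    params = {
        # Assumed filter syntax: conditions joined by "," are ANDed together.
        "filter": f"cited_by_count:>{min_citations},publication_year:<{before_year}",
        "per-page": per_page,
        # "*" is assumed to start cursor paging; later pages would pass the
        # next cursor value returned in the previous response.
        "cursor": cursor,
    }
    return f"{OPENALEX_WORKS}?{urlencode(params)}"
```

The thresholds default to the values stated in the list; the exact parameter names should be checked against the OpenAlex documentation before `seed_assembler` relies on them.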

After dedup (DOI primary, fuzzy title+year secondary): ~1500 candidates.
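
The two-level dedup rule can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the record keys (`doi`, `title`, `year`), the `0.92` similarity threshold, and the choice to apply the fuzzy fallback only when at least one record lacks a DOI are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lower-case and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedupe(candidates: list[dict], threshold: float = 0.92) -> list[dict]:
    """Keep the first occurrence of each paper: DOI match first, fuzzy title+year second."""
    kept: list[dict] = []
    seen_dois: set[str] = set()
    for cand in candidates:
        doi = (cand.get("doi") or "").strip().lower()
        if doi and doi in seen_dois:
            continue  # primary key: same DOI (case-insensitive)
        title = normalize_title(cand["title"])
        duplicate = False
        for other in kept:
            other_doi = (other.get("doi") or "").strip()
            if doi and other_doi:
                continue  # both have DOIs and they differ: treat as distinct
            if cand.get("year") != other.get("year"):
                continue  # secondary key requires the same year
            ratio = SequenceMatcher(None, title, normalize_title(other["title"])).ratio()
            if ratio >= threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(cand)
            if doi:
                seen_dois.add(doi)
    return kept
```

The pairwise comparison is quadratic, which should be acceptable at the scale of a few thousand raw records; a larger corpus would want blocking by year or by title prefix first.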

tier_filter prompt (component 2)

For each candidate, single-shot LLM judgment:

"Is this paper Fleming-tier? Criteria: (a) Founded a new field, OR (b) Caused paradigm shift, OR (c) Enabled subsequent breakthroughs, OR (d) Is universally taught as foundational.

Reject if: high-citation but incremental; review article; methods-only without conceptual contribution; popular but not transformative.

Output JSON: {tier: 'fleming' | 'reject', reason: }"

Candidates are judged in batches of 20 to reduce cost: ~1500 candidates / 20 = ~75 batches, costing ~$3 and yielding ~600 confirmed papers.
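
A minimal sketch of the batching and of validating one verdict returned by the model. The helper names are hypothetical, the LLM call itself is omitted, and because the type of `reason` is not given in the prompt above, the sketch assumes it is a free-text string.

```python
import json
from typing import Iterator, Sequence

BATCH_SIZE = 20
VALID_TIERS = {"fleming", "reject"}


def batched(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_verdict(raw: str) -> dict:
    """Parse one JSON verdict and reject anything outside the prompt's schema."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    if verdict.get("tier") not in VALID_TIERS:
        raise ValueError(f"unexpected tier: {verdict.get('tier')!r}")
    if not isinstance(verdict.get("reason"), str):  # assumption: reason is a string
        raise ValueError("reason must be a string")
    return verdict
```

Validating each verdict before writing `core.parquet` keeps a malformed model response from silently counting as either a confirmation or a rejection.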

Data flow

9 sources (web/API)
  ──► seed_assembler ──► candidates.parquet (1500)
       └─► tier_filter (LLM) ──► core.parquet (600)
            └─► metadata_enricher (OpenAlex) ──► metadata.parquet
                 └─► paper_fetcher (parallel, 10 workers) ──► papers/<doi>/pdf
                      └─► html2md/pdf2md ──► full.md
                           └─► chunker (PageIndex) ──► tree.json
                                └─► concept_tagger ──► tags
                                     └─► graph_builder ──► edges.parquet
                                          └─► edge_labeler (LLM) ──► labeled edges
                                               └─► wiki_generator ──► docs/*.md
                                                    └─► viz_renderer ──► docs/viz/*.html
                                                         └─► mkdocs build
                                                              └─► gh-pages deploy

Every step caches to parquet/SQLite and is idempotent. Re-runs only do the work that's stale.
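
One way to get this "skip unless stale" behaviour is to fingerprint each stage's inputs and store the fingerprint next to the cached output. The sketch below is an assumption about how that could look, not the project's code: `run_stage` is a hypothetical name, and it caches to JSON files only to stay within the standard library, where the spec calls for parquet/SQLite.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def fingerprint(payload: Any) -> str:
    """Stable SHA-256 of any JSON-serialisable input."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_stage(
    name: str,
    inputs: Any,
    compute: Callable[[Any], Any],
    cache_dir: Path,
) -> tuple[Any, bool]:
    """Return (result, ran). `compute` is only called when the inputs changed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    out_path = cache_dir / f"{name}.json"
    key_path = cache_dir / f"{name}.sha256"
    key = fingerprint(inputs)
    if out_path.exists() and key_path.exists() and key_path.read_text() == key:
        return json.loads(out_path.read_text()), False  # cache hit: nothing is stale
    result = compute(inputs)
    out_path.write_text(json.dumps(result))
    key_path.write_text(key)  # written last, so a crash mid-stage forces a re-run
    return result, True
```

Calling `run_stage` twice with the same inputs runs `compute` once; changing the inputs invalidates only that stage, and downstream stages re-run because their own inputs (the upstream output) change in turn.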

5 edge types

| Type | How detected | Example label |
|------|--------------|---------------|
| `cite` | OpenAlex references graph | "AlphaFold uses ESM protein model as baseline" |
| `builds-on` | LLM over (cite + semantic) | "Transformer generalises the attention mechanism from Bahdanau et al." |
| `enables` | LLM follow-up trace | "PCR enabled the Human Genome Project" |
| `same-method` | shared concept tag + LLM check | "Both use X-ray crystallography" |
| `contradicts` | explicit LLM detection | "A claims effect X under condition C; B finds none under same C" |
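
The five types above map naturally onto a small closed vocabulary. The sketch below shows one possible in-memory shape for a row of `edges.parquet`; the field names (`source`, `target`, `type`, `label`) are assumptions, since the spec does not list the column names.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeType(str, Enum):
    CITE = "cite"
    BUILDS_ON = "builds-on"
    ENABLES = "enables"
    SAME_METHOD = "same-method"
    CONTRADICTS = "contradicts"


@dataclass(frozen=True)
class Edge:
    source: str   # chunk-level node id (hypothetical field name)
    target: str   # chunk-level node id (hypothetical field name)
    type: EdgeType
    label: str    # the LLM-written "why related" explanation

    def __post_init__(self) -> None:
        # Accept plain strings such as "enables" and coerce them to the enum;
        # an unknown type raises ValueError.
        object.__setattr__(self, "type", EdgeType(self.type))
        if not self.label.strip():
            raise ValueError("every edge needs a non-empty label")
```

Rejecting empty labels at construction time reflects the goal that every relationship edge carries an explanation.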

Standards (interop)

Identifiers: DOI primary, OpenAlex Work ID secondary, arXiv ID where applicable
Bibliographic: CSL-JSON (Zotero/Mendeley/Pandoc compatible)
Concepts: OpenAlex Concepts (primary), MeSH (biomed overlay), CSO (CS overlay)
Graph: JSON-LD with schema.org/CreativeWork + custom predicates
Wiki content: Markdown + YAML frontmatter (Obsidian/Logseq/MkDocs portable)
Embeddings: Parquet (DuckDB/polars/HF datasets)
Storage: SQLite/DuckDB (analytics) + ChromaDB (vector) + plain files (content)
Graph export: GraphML / GEXF (Cytoscape/Gephi)

No project-private formats are introduced.
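
As an illustration of the bibliographic standard, the sketch below builds a CSL-JSON item for a journal article. The function name is hypothetical; the field names (`container-title`, `issued.date-parts`, `DOI`, `author.family`) follow the CSL-JSON data model as I understand it and should be checked against the schema.

```python
import json
from typing import Optional, Sequence


def to_csl_json(
    *,
    item_id: str,
    title: str,
    container_title: str,
    year: int,
    volume: str,
    page: str,
    doi: Optional[str] = None,
    authors: Sequence[dict] = (),
) -> dict:
    """Return one CSL-JSON item; optional fields are omitted rather than left empty."""
    item = {
        "id": item_id,
        "type": "article-journal",
        "title": title,
        "container-title": container_title,
        "issued": {"date-parts": [[year]]},
        "volume": str(volume),
        "page": page,
    }
    if doi:
        item["DOI"] = doi
    if authors:
        item["author"] = [dict(a) for a in authors]
    return item


# Example using the seed source cited earlier (Nature 2014, 514:550-553).
# The DOI is deliberately left out rather than guessed.
item = to_csl_json(
    item_id="van-noorden-2014",
    title="Top 100 papers",
    container_title="Nature",
    year=2014,
    volume="514",
    page="550-553",
    authors=[{"family": "Van Noorden"}],
)
csl_document = json.dumps([item], indent=2)
```

A CSL-JSON file is a JSON array of such items, which is why `csl_document` wraps the item in a list.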

909 Fleming-AI corpus: supporting layer

The existing 909 papers in TAMU GDrive Fleming-AI/papers/ are not Fleming-tier: they are highly cited, multi-disciplinary papers from 2015–2026. They serve as a supporting context layer: each Fleming-tier core paper page references modern follow-up work from the 909 when overlap is detected.

Deploy

MkDocs Material as static site generator
GitHub Pages from gh-pages branch
GitHub Actions workflow: on push to main → run mkdocs gh-deploy
Public site at https://xodn348.github.io/llm-wiki/

Cost & runtime

| Stage | LLM calls | Cost (USD) | Wall time (parallelised) |
|-------|-----------|------------|--------------------------|
| Seed assembly | 0 | 0 | 30 min |
| Tier filter (1500 → 600) | ~75 batches | $3 | 1 h |
| Metadata enrichment | 0 | 0 | 5 min |
| Paper fetch (600) | 0 | 0 | 2 h |
| PageIndex (600) | ~6 000 | $15 | 3 h |
| Concept tagging | minor | $2 | 30 min |
| Edge gen + labeling | ~3 000 | $30 | 2 h |
| Wiki gen | ~1 000 | $5 | 1 h |
| Viz | 0 | 0 | immediate |
| Deploy | 0 | 0 | 5 min |
| Total v0 | | ~$55 | ~10 h serial / ~4 h parallel |
| Monthly maintenance | | ~$5 | 1 h |

Note: Using CLAUDE_SESSION_KEY makes most of this free (Claude.ai session, no API billing). The estimates above assume API usage; with a session key the actual cost will be ~$0.
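
The v0 totals can be re-derived from the per-stage rows. The snippet below simply re-adds the API-billing figures from the table, treating the "immediate" Viz stage as 0 minutes.

```python
# (stage, cost in USD, wall time in minutes), copied from the table above.
STAGES = [
    ("Seed assembly", 0, 30),
    ("Tier filter", 3, 60),
    ("Metadata enrichment", 0, 5),
    ("Paper fetch", 0, 120),
    ("PageIndex", 15, 180),
    ("Concept tagging", 2, 30),
    ("Edge gen + labeling", 30, 120),
    ("Wiki gen", 5, 60),
    ("Viz", 0, 0),
    ("Deploy", 0, 5),
]

total_usd = sum(cost for _, cost, _ in STAGES)           # 55
total_minutes = sum(minutes for _, _, minutes in STAGES)  # 610
total_hours_serial = total_minutes / 60                   # about 10.2 h
```

The sums agree with the "~$55" and "~10 h serial" figures. The "~4 h parallel" figure depends on which stages overlap and is not derivable from the rows alone.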

Test plan

Unit: each component on mocked inputs
Integration: smoke test on 5–10 papers covering different publishers
Manual review: user reads 5 generated paper pages and judges "Fleming-tier appropriate"
Acceptance: 600 papers ingested, graph rendered, and the AlphaFold → Transformer → Attention → backprop chain can be navigated via labeled edges
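
The chain-navigation criterion could be automated with a check like the one below. It is a sketch under assumptions: edges are taken as already rolled up from chunk level to paper level, the keys `source`, `target` and `label` are hypothetical, and a hop counts as navigable in either direction.

```python
from typing import Iterable, Sequence


def chain_is_navigable(edges: Iterable[dict], chain: Sequence[str]) -> bool:
    """True if every consecutive pair in `chain` is joined by an edge with a non-empty label."""
    labeled_pairs = {
        frozenset((e["source"], e["target"]))
        for e in edges
        if e.get("label", "").strip()
    }
    return all(
        frozenset(hop) in labeled_pairs
        for hop in zip(chain, chain[1:])
    )


ACCEPTANCE_CHAIN = ["AlphaFold", "Transformer", "Attention", "backprop"]
```

In a real test the edges would be read from `edges.parquet` and the chain entries replaced by the corresponding paper identifiers.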

Out of scope (v0)

Real-time auto-update
Collaborative editing
Mobile UI
Non-English sources
Exhaustive handling of corner-case Fleming-tier judgments (LLM + spot-check is enough for v0)

Future (v1+)

Auto-discovery: surface new papers that may be Fleming-tier as they're cited by existing ones
Multi-language sources
Dedicated paper detail pages with embedded viz
API surface for queries (/api/papers/<doi>, /api/graph/<concept>)
Optional integration with Zotero / Obsidian / Logseq via the standard formats already produced