llm-wiki: Fleming-tier scientific paper wiki
Goal
A persistent LLM-maintained wiki of paradigm-shifting scientific papers (Fleming-tier: Nobel-grade, field-founders, papers that enabled subsequent breakthroughs). Each paper is presented with:
- Why the discovery mattered
- What it enabled (lineage of follow-on work)
- How it relates to other paradigm-shifting papers (chunk-level graph)
- LLM-generated explanations on every relationship edge
The wiki is git-tracked Markdown, statically deployed to GitHub Pages, conformant to the Karpathy LLM Wiki pattern, and uses standard formats (DOI, CSL-JSON, OpenAlex Concepts/MeSH, JSON-LD, Markdown+YAML, Parquet, GraphML) so the data interoperates with any other system.
Non-goals (v0)
- Real-time updates (manual ingest only)
- Multi-user editing
- Mobile-optimised UI
- Languages other than English
- Exhaustive Fleming-tier judgments — LLM filter + manual spot-check is the v0 quality bar
Architecture (Karpathy LLM Wiki pattern, strict)
Three layers, mirrored to disk:
~/code/llm-wiki/
├── data/raw/ Layer 1 (immutable)
│ ├── seed/ 9 authoritative source dumps (.json)
│ ├── candidates.parquet ~1500 deduped candidates
│ ├── core.parquet ~600 LLM-confirmed Fleming-tier
│ ├── papers/<doi>/ PDF + full.md + tree.json (PageIndex output)
│ └── metadata.parquet OpenAlex full enrichment
├── docs/ Layer 2 (LLM-generated wiki, served by MkDocs)
│ ├── paper/<doi>.md
│ ├── topic/<slug>.md
│ ├── lineage/<chain>.md
│ ├── superpowers/specs/ This spec lives here
│ └── viz/ UMAP map.html, graph.html
├── data/graph/
│ ├── nodes.parquet chunk-level nodes (PageIndex leaves)
│ ├── edges.parquet 5-typed edges + LLM "why related" labels
│ └── graph.gexf Cytoscape/Gephi export
├── data.duckdb Single analytics DB (queryable from anywhere)
├── data/vectors/ ChromaDB (per-leaf embeddings)
├── CLAUDE.md Layer 3 — Schema (ingest/query/lint commands)
├── mkdocs.yml Static site config
└── src/llm_wiki/ Implementation (10 components)
Components (10, each independently testable)
| # | Module | In | Out | External deps |
|---|---|---|---|---|
| 1 | seed_assembler | 9 source URLs | candidates.parquet (~1500) | requests, BeautifulSoup |
| 2 | tier_filter | candidates | core.parquet (~600) | LLM (Claude session key) |
| 3 | metadata_enricher | core | metadata.parquet | OpenAlex API |
| 4 | paper_fetcher | core DOIs | papers/<doi>/pdf | OA → TAMU proxy → manual |
| 5 | chunker | PDFs/HTML | papers/<doi>/tree.json | PageIndex library |
| 6 | concept_tagger | leaves + abstract | tag table | OpenAlex Concepts + MeSH |
| 7 | graph_builder | leaves + tags + refs | edges.parquet (5 types) | embeddings + heuristics |
| 8 | edge_labeler | edges | LLM-written "why related" labels | LLM |
| 9 | wiki_generator | all of the above | docs/{paper,topic,lineage}/*.md | jinja2 |
| 10 | viz_renderer | graph + UMAP | docs/viz/*.html | Cytoscape.js, UMAP, plotly |
A cli.py (typer-based) wraps every component:
llm-wiki seed | filter | enrich | fetch | chunk | tag | graph | label | wiki | viz | all
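How the all subcommand might expand into the individual stages can be sketched as follows. This is only an illustration, not the project's cli.py: it uses the standard-library argparse in place of typer, and STAGES, resolve_stages, and the print placeholder are hypothetical names. Only the subcommand names come from the listing above.

```python
from __future__ import annotations

import argparse

# Subcommand names come from the spec; the order follows the data flow.
STAGES = ["seed", "filter", "enrich", "fetch", "chunk",
          "tag", "graph", "label", "wiki", "viz"]


def resolve_stages(command: str) -> list[str]:
    """Expand 'all' into every stage in order; otherwise run one stage."""
    if command == "all":
        return list(STAGES)
    if command not in STAGES:
        raise ValueError(f"unknown command: {command}")
    return [command]


def main(argv: list[str] | None = None) -> list[str]:
    parser = argparse.ArgumentParser(prog="llm-wiki")
    parser.add_argument("command", choices=STAGES + ["all"])
    args = parser.parse_args(argv)
    stages = resolve_stages(args.command)
    for stage in stages:
        # Placeholder: the real component for this stage would be called here.
        print(f"running {stage}")
    return stages
```

Keeping the stage order in one list means a single-stage run and a full run share the same definition of what comes after what.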
9 authoritative seed sources
- Nobel Prize references (nobelprize.org, 1901–2024 lectures + key citations)
- Van Noorden "Top 100 papers" (Nature 2014, 514:550–553)
- Wikipedia "Year in science" 1900–2024 (each year's discoveries + source DOIs)
- NIH Landmark Publications
- APS Centennial Papers (physics, 1899–1998)
- Karpathy AI reading list + Awesome ML Papers (GitHub)
- Garfield Citation Classics (when accessible)
- OpenAlex query: cited_by_count > 10000 AND publication_year < 2010 (auto-filter for "14+ years lasting ultra-high citation")
- Wikipedia "List of Nobel laureates": each laureate's "key publications"
After dedup (DOI primary, fuzzy title+year secondary): ~1500 candidates.
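The two-level dedup rule could look roughly like the sketch below. Several details are assumptions rather than part of the spec: candidates are plain dicts with doi, title, and year keys, the fuzzy comparison uses difflib.SequenceMatcher, and the 0.9 similarity threshold is an arbitrary illustrative value.

```python
from difflib import SequenceMatcher


def normalise_doi(doi):
    """Lower-case a DOI and strip a resolver prefix, or return None."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def same_title(a, b, threshold=0.9):
    """Fuzzy title comparison; the 0.9 threshold is an assumed value."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def dedupe(candidates):
    """Keep the first record per DOI; fall back to fuzzy title + year."""
    kept, seen_dois = [], set()
    for cand in candidates:
        doi = normalise_doi(cand.get("doi"))
        if doi and doi in seen_dois:
            continue  # primary key: exact DOI match
        duplicate = any(
            cand["year"] == k["year"] and same_title(cand["title"], k["title"])
            for k in kept
            # The fuzzy check only applies when at least one side lacks a DOI.
            if not (doi and normalise_doi(k.get("doi")))
        )
        if duplicate:
            continue
        if doi:
            seen_dois.add(doi)
        kept.append(cand)
    return kept
```

Two records that both carry DOIs are treated as distinct unless the DOIs match, so the fuzzy pass cannot merge different papers that happen to share a title and year.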
tier_filter prompt (component 2)
For each candidate, single-shot LLM judgment:
"Is this paper Fleming-tier? Criteria: (a) Founded a new field, OR (b) Caused paradigm shift, OR (c) Enabled subsequent breakthroughs, OR (d) Is universally taught as foundational.
Reject if: high-citation but incremental; review article; methods-only without conceptual contribution; popular but not transformative.
Output JSON: {tier: 'fleming' | 'reject', reason:
}"
Candidates are sent in batches of 20 to keep cost down: ~1500 candidates → ~75 batches → ~$3, yielding ~600 confirmed papers.
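The batching arithmetic and the reply check can be sketched as below. The sketch assumes the model returns strict JSON with double-quoted keys and that reason is a free-text string; how replies for a whole batch are packaged is not specified here, so parse_verdict (a hypothetical helper) validates one verdict at a time.

```python
import json


def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_verdict(raw):
    """Validate one LLM reply against the {tier, reason} shape."""
    verdict = json.loads(raw)
    if verdict.get("tier") not in ("fleming", "reject"):
        raise ValueError(f"unexpected tier: {verdict.get('tier')!r}")
    if not isinstance(verdict.get("reason"), str):
        raise ValueError("reason must be a string")
    return verdict


candidates = [f"candidate-{i}" for i in range(1500)]  # stand-in records
print(sum(1 for _ in batches(candidates)))  # 75
```

Rejecting malformed replies early keeps unexpected tier values out of core.parquet.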
Data flow
9 sources (web/API)
──► seed_assembler ──► candidates.parquet (1500)
└─► tier_filter (LLM) ──► core.parquet (600)
└─► metadata_enricher (OpenAlex) ──► metadata.parquet
└─► paper_fetcher (parallel, 10 workers) ──► papers/<doi>/pdf
└─► html2md/pdf2md ──► full.md
└─► chunker (PageIndex) ──► tree.json
└─► concept_tagger ──► tags
└─► graph_builder ──► edges.parquet
└─► edge_labeler (LLM) ──► labeled edges
└─► wiki_generator ──► docs/*.md
└─► viz_renderer ──► docs/viz/*.html
└─► mkdocs build
└─► gh-pages deploy
Every step caches to parquet/SQLite and is idempotent. Re-runs only do the work that's stale.
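One way to decide whether a step is stale is to fingerprint its inputs and compare against the fingerprint recorded on the last successful run. The sketch below is an assumption about how that could work, not the project's mechanism: it stores state in a JSON sidecar file rather than parquet/SQLite for brevity, and fingerprint, is_stale, and mark_done are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(paths):
    """Hash the name and bytes of every input file, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def _load_state(state_file):
    state_file = Path(state_file)
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def is_stale(inputs, output, state_file):
    """True when the output is missing or its inputs changed since last run."""
    state = _load_state(state_file)
    return not Path(output).exists() or state.get(str(output)) != fingerprint(inputs)


def mark_done(inputs, output, state_file):
    """Record the input fingerprint after a stage has written its output."""
    state = _load_state(state_file)
    state[str(output)] = fingerprint(inputs)
    Path(state_file).write_text(json.dumps(state, indent=2))
```

A stage would call is_stale before doing any work and mark_done only after its output has been written, so an interrupted run is retried on the next invocation.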
5 edge types
| Type | How detected | Example label |
|---|---|---|
| cite | OpenAlex references graph | "AlphaFold uses ESM protein model as baseline" |
| builds-on | LLM over (cite + semantic) | "Transformer generalises the attention mechanism from Bahdanau et al." |
| enables | LLM follow-up trace | "PCR enabled the Human Genome Project" |
| same-method | shared concept tag + LLM check | "Both use X-ray crystallography" |
| contradicts | explicit LLM detection | "A claims effect X under condition C; B finds none under same C" |
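A typed edge record makes the five-way vocabulary explicit and rejects anything outside it. This is a minimal sketch: the field names source, target, and label and the chunk ids in the example are hypothetical, and only the five type names come from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeType(str, Enum):
    CITE = "cite"
    BUILDS_ON = "builds-on"
    ENABLES = "enables"
    SAME_METHOD = "same-method"
    CONTRADICTS = "contradicts"


@dataclass(frozen=True)
class Edge:
    source: str      # id of the source chunk (hypothetical field name)
    target: str      # id of the target chunk (hypothetical field name)
    type: EdgeType
    label: str = ""  # LLM-written "why related" text, filled in later

    def __post_init__(self):
        # Coerce plain strings such as "enables" and reject unknown types.
        object.__setattr__(self, "type", EdgeType(self.type))


# The chunk ids below are made up for illustration.
edge = Edge("pcr#intro", "hgp#methods", "enables",
            "PCR enabled the Human Genome Project")
```

Because EdgeType subclasses str, its members compare equal to their plain string values, which keeps filtering on the type simple.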
Standards (interop)
- Identifiers: DOI primary, OpenAlex Work ID secondary, arXiv ID where applicable
- Bibliographic: CSL-JSON (Zotero/Mendeley/Pandoc compatible)
- Concepts: OpenAlex Concepts (primary), MeSH (biomed overlay), CSO (CS overlay)
- Graph: JSON-LD with schema.org/CreativeWork + custom predicates
- Wiki content: Markdown + YAML frontmatter (Obsidian/Logseq/MkDocs portable)
- Embeddings: Parquet (DuckDB/polars/HF datasets)
- Storage: SQLite/DuckDB (analytics) + ChromaDB (vector) + plain files (content)
- Graph export: GraphML / GEXF (Cytoscape/Gephi)
No project-private formats are introduced.
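As an illustration of what interoperable output might look like, the sketch below maps one paper record to a CSL-JSON item and to a schema.org JSON-LD node. The record layout, the function names, and the placeholder DOI are assumptions, and the custom predicates mentioned above are left out because the spec does not define them. ScholarlyArticle is used as the node type because it is a subtype of schema.org CreativeWork.

```python
import json


def to_csl_json(paper):
    """Map a paper record to a CSL-JSON item."""
    return {
        "id": paper["doi"],
        "type": "article-journal",
        "title": paper["title"],
        "author": [{"family": f, "given": g} for f, g in paper["authors"]],
        "issued": {"date-parts": [[paper["year"]]]},
        "container-title": paper["journal"],
        "DOI": paper["doi"],
    }


def to_json_ld(paper):
    """Map the same record to a schema.org JSON-LD node."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",  # a subtype of CreativeWork
        "@id": f"https://doi.org/{paper['doi']}",
        "name": paper["title"],
        "datePublished": str(paper["year"]),
        "author": [{"@type": "Person", "familyName": f, "givenName": g}
                   for f, g in paper["authors"]],
    }


# Hypothetical record; the DOI is a placeholder, not a real paper.
paper = {
    "doi": "10.1234/example",
    "title": "An Example Paper",
    "authors": [("Doe", "Jane")],
    "year": 1950,
    "journal": "Journal of Examples",
}
print(json.dumps(to_csl_json(paper), indent=2))
```

Deriving both views from a single record keeps the DOI as the shared key across the bibliographic and graph outputs.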
909 Fleming-AI corpus: supporting layer
The existing 909 papers in TAMU GDrive Fleming-AI/papers/ are not
Fleming-tier (they are highly cited, multi-disciplinary papers from
2015–2026). They become a supporting context layer: each Fleming-tier core
paper page references modern follow-up work from the 909 when overlap is
detected.
Deploy
- MkDocs Material as static site generator
- GitHub Pages from gh-pages branch
- GitHub Actions workflow: on push to main → run mkdocs gh-deploy
- Public site at https://xodn348.github.io/llm-wiki/
Cost & runtime
| Stage | LLM calls | $ | wall time (parallelised) |
|---|---|---|---|
| Seed assembly | 0 | 0 | 30 min |
| Tier filter (1500 → 600) | ~75 batch | $3 | 1 h |
| Metadata enrichment | 0 | 0 | 5 min |
| Paper fetch (600) | 0 | 0 | 2 h |
| PageIndex (600) | ~6 000 | $15 | 3 h |
| Concept tagging | minor | $2 | 30 min |
| Edge gen + labeling | ~3 000 | $30 | 2 h |
| Wiki gen | ~1 000 | $5 | 1 h |
| Viz | 0 | 0 | immediate |
| Deploy | 0 | 0 | 5 min |
| Total v0 | | ~$55 | ~10 h serial / ~4 h parallel |
| Monthly maintenance | | ~$5 | 1 h |
Note: Running through CLAUDE_SESSION_KEY (a Claude.ai session, no API
billing) makes most of this free. The estimates above assume API usage;
with the session key the actual cost will be ~$0.
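The v0 totals can be checked by summing the per-stage figures from the table. In this short sketch the figures are copied from the table, "immediate" is counted as 0 h, and STAGE_COSTS is a hypothetical name.

```python
# (stage, dollars, hours) copied from the cost table; "immediate" counts as 0 h.
STAGE_COSTS = [
    ("Seed assembly", 0, 0.5),
    ("Tier filter", 3, 1.0),
    ("Metadata enrichment", 0, 5 / 60),
    ("Paper fetch", 0, 2.0),
    ("PageIndex", 15, 3.0),
    ("Concept tagging", 2, 0.5),
    ("Edge gen + labeling", 30, 2.0),
    ("Wiki gen", 5, 1.0),
    ("Viz", 0, 0.0),
    ("Deploy", 0, 5 / 60),
]

total_cost = sum(cost for _, cost, _ in STAGE_COSTS)
serial_hours = sum(hours for _, _, hours in STAGE_COSTS)
print(f"${total_cost}, {serial_hours:.1f} h serial")  # $55, 10.2 h serial
```

The sum matches the ~$55 and ~10 h serial figures in the totals row. The ~4 h parallel figure depends on which stages overlap, so this sum does not check it.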
Test plan
- Unit: each component on mocked inputs
- Integration: smoke test on 5–10 papers covering different publishers
- Manual review: user reads 5 generated paper pages and judges "Fleming-tier appropriate"
- Acceptance: 600 papers ingested, graph rendered, can navigate AlphaFold → Transformer → Attention → backprop chain via labeled edges
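The acceptance criterion amounts to finding a path through labeled edges. A breadth-first search over a toy edge list is sketched below. The paper-level node names mirror the chain named above, while find_chain, the tuple layout, and the placeholder labels are assumptions (real edges are chunk-level rows in edges.parquet).

```python
from collections import deque


def find_chain(edges, start, goal):
    """Breadth-first search returning the labeled hops from start to goal."""
    adjacency = {}
    for source, target, label in edges:
        adjacency.setdefault(source, []).append((target, label))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for target, label in adjacency.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [(node, target, label)]))
    return None  # no chain of edges connects the two papers


# Toy paper-level edges; the labels are placeholders, not generated output.
edges = [
    ("AlphaFold", "Transformer", "label 1"),
    ("Transformer", "Attention", "label 2"),
    ("Attention", "backprop", "label 3"),
]
chain = find_chain(edges, "AlphaFold", "backprop")
```

Each hop in the returned chain carries its label, so an automated acceptance check can assert both that the path exists and that every edge along it has an explanation.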
Out of scope (v0)
- Real-time auto-update
- Collaborative editing
- Mobile UI
- Non-English sources
- All-corner-case Fleming-tier judgements (LLM + spot-check is enough for v0)
Future (v1+)
- Auto-discovery: surface new papers that may be Fleming-tier as they're cited by existing ones
- Multi-language sources
- Dedicated paper detail pages with embedded viz
- API surface for queries (/api/papers/<doi>, /api/graph/<concept>)
- Optional integration with Zotero / Obsidian / Logseq via the standard formats already produced