# hedge gauge mlns

public: true
tags: mlns
time: 2026-02-21 09:31:16

Got it — you’re presenting the **datasets + baselines** slide(s). I’ll do two things:

1. **Explain each dataset + baseline in depth** (what it is, what signal it tests, and *how it maps to our “cohomology-gauged pair representation” claim*).
2. Give you a **clean speech script for each slide** (straight to the point, but complete).

---

## Part 1 — Datasets (what each is, and why it matters for our method)

### 1) OpenProteinSet (Training, “AF2-scale”)

**What it is**

* OpenProteinSet is an open, large-scale corpus of **protein MSAs** plus associated structural links (PDB homologs / AF2 predictions, depending on subset). It was created specifically to enable training AF2-style models at scale in the open. ([arXiv][1])
* OpenFold’s Nat Methods paper states they trained OpenFold from scratch using **OpenProteinSet**, their open reproduction of the AlphaFold2 training set. ([Nature][2])

**What’s inside (practical view)**

* A massive number of MSAs (millions-scale), plus **PDB chains** and template hits. ([arXiv][1])
* OpenFold documentation explicitly points to OpenProteinSet MSAs + mmCIFs as the training ingredients. ([openfold.readthedocs.io][3])

**How we use it for our problem**

* **Role:** Train / fine-tune an AF2-like backbone (OpenFold) with our gauge layer.
* **Why it matches our claim:** Our method modifies **pair/triangle logic**, so the most defensible training source is the one known to reproduce AF2 behavior in an open stack (OpenFold + OpenProteinSet). ([Nature][2])
* **What we measure during training (extra diagnostics):**
  * Track “triangle inconsistency” signals (curl magnitude / integrability residual) on train vs validation to see whether our gauge layer is actually shaping the geometry, not just overfitting.
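The “curl magnitude / integrability residual” diagnostic above can be made concrete with a toy check: project the pair representation to an antisymmetric scalar edge signal (that projection is a hypothetical stand-in here, not our actual gauge layer) and sum it around every residue triangle. A gradient-of-a-potential field sums to exactly zero; a generic antisymmetric field does not. A minimal numpy sketch:

```python
import numpy as np

def triangle_curl_residual(f: np.ndarray) -> float:
    """Mean |f_ij + f_jk + f_ki| over all residue triangles (i < j < k).

    f is an antisymmetric (n, n) edge signal, e.g. a scalar projection of
    the pair representation. Zero residual means f is integrable: it is
    the discrete gradient of some per-residue potential.
    """
    n = f.shape[0]
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                total += abs(f[i, j] + f[j, k] + f[k, i])
                count += 1
    return total / max(count, 1)

rng = np.random.default_rng(0)
n = 12
# Integrable field: discrete gradient of a per-residue potential -> zero curl.
p = rng.normal(size=n)
grad = p[None, :] - p[:, None]
# Generic antisymmetric field -> nonzero curl.
raw = rng.normal(size=(n, n))
noisy = raw - raw.T

print(triangle_curl_residual(grad))   # ~0 (up to float error)
print(triangle_curl_residual(noisy))  # clearly > 0
```

Tracking this residual on train vs validation is cheap and gives the “is the gauge layer shaping geometry or just overfitting?” signal without touching the structure module.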
--- ### 2) ProteinNet (Controlled splits, “CASP7–12”, seq-identity splits) **What it is** * ProteinNet is a standardized dataset for ML on protein structure that bundles **sequence, structure, MSA, PSSM**, plus standardized **train/val/test splits** built around CASP rounds. ([Springer][4]) * The key feature for evaluation is that ProteinNet provides **sequence-identity–controlled splits**, so you can test generalization under reduced homology. ([Springer][4]) **How we use it** * **Role:** “Low-homology generalization” benchmark. * **Why it matters for our method:** Our pitch is *local triangle logic doesn’t guarantee global consistency; enforcing integrability should help when signals are weak/noisy.* Low-homology targets are exactly where MSA/template signal is weaker and models are more likely to produce inconsistent internal geometry. * **What we report:** * Standard structure metrics (lDDT/TM/GDT where applicable), plus our **integrability diagnostics** vs error (does lower curl correlate with better accuracy / calibration?). --- ### 3) CAMEO (Blind-ish eval, weekly newly released PDB) **What it is** * CAMEO is a **continuous automated evaluation** platform that runs **weekly** using **pre-release PDB targets**, designed as an ongoing complement to CASP. ([cameo3d.org][5]) * Targets come from **PDB pre-release**; CAMEO filters and clusters targets and benchmarks prediction servers in a blind, automated way. ([cameo3d.org][5]) **How we use it** * **Role:** “Realistic generalization” — not a static curated set you can overfit to. * **Why it matters for our method:** If our gauge constraint truly improves “global consistency”, it should show up on continuously changing targets where the model can’t memorize quirks. * **Best alignment to slide’s evaluation strategy:** **Monomer evaluation = CAMEO + CASP15** (CAMEO for continuous blind-ish; CASP15 for canonical hard targets). 
([cameo3d.org][5]) --- ### 4) CASP15 (Hard targets benchmark, 127 targets) **What it is** * CASP is the community blind assessment. The CASP15 home page reports **127 modeling targets** and tens of thousands of submitted models. ([predictioncenter.org][6]) **How we use it** * **Role:** “Hard targets / canonical comparison point.” * **Why it matters for our method:** Reviewers trust CASP because it’s the standard “stress test”. If you claim improvement, CASP is where people believe it. --- ### 5) Docking Benchmark 5.5 (DB5.5) (Complexes, bound + unbound, 230 entries) **What it is** * DB5.5 is a widely used protein–protein docking benchmark containing **non-redundant, high-quality bound complexes plus corresponding unbound subunits**. ([ScienceDirect][7]) * The 2015 update describes **230 docking benchmark entries** (and a related affinity benchmark with 179 entries). ([ScienceDirect][7]) * There are clean processed distributions used by the docking community (e.g., matched bound/unbound chain numbering), which makes evaluation reproducible. ([GitHub][8]) **How we use it** * **Role:** “Complex/interface correctness” benchmark. * **Why it matters for our method:** Complexes amplify exactly the failure mode we’re targeting: you can satisfy local constraints within chains but still get **global/interface inconsistency** across chains. If we reduce non-integrable triangle logic in pair space, we expect better interface geometry and fewer contradictory constraints at binding sites. --- ### 6) DockQ (Complex evaluation metric) **What it is** * DockQ is a continuous docking quality score combining **Fnat, LRMSD, iRMSD** into a single number in **[0, 1]**. ([PLOS][9]) **How we use it** * **Role:** Primary metric for DB5.5 interface evaluation. * **Why it matches our story:** DockQ reflects both **contact correctness (Fnat)** and **geometric agreement (RMSDs)**. 
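For reference, the DockQ combination of the three interface terms can be sketched as below. The 1.5 Å and 8.5 Å scaling constants follow the published DockQ formula ([PLOS][9]); treat this as an illustration and use the authors’ DockQ tool for any reported numbers.

```python
def dockq(fnat: float, lrmsd: float, irmsd: float) -> float:
    """DockQ-style score in [0, 1] from Fnat, ligand RMSD, interface RMSD.

    Each RMSD is squashed into (0, 1] via 1 / (1 + (rmsd / d0)**2), with
    d0 = 8.5 A for LRMSD and d0 = 1.5 A for iRMSD, then averaged with Fnat.
    """
    scaled_l = 1.0 / (1.0 + (lrmsd / 8.5) ** 2)
    scaled_i = 1.0 / (1.0 + (irmsd / 1.5) ** 2)
    return (fnat + scaled_l + scaled_i) / 3.0

print(dockq(1.0, 0.0, 0.0))    # perfect interface -> 1.0
print(dockq(0.0, 30.0, 10.0))  # incorrect model -> close to 0
```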
Our method is explicitly about improving geometric consistency, so DockQ is the right “headline number.”

---

### 7) SAbDab (Antibodies, curated topology-rich interfaces)

**What it is**

* SAbDab is a **weekly updated** structural antibody database with annotations like **heavy/light pairing**, nomenclature, and other curated metadata. ([opig.stats.ox.ac.uk][10])
* It supports creating and downloading datasets for analysis and provides standardized annotation (including chain-pairing info). ([opig.stats.ox.ac.uk][10])

**How we use it**

* **Role:** “Topology-rich interfaces / hard geometry domain.”
* **Why it matters for our method:** Antibody–antigen interfaces (and antibody CDR regions) are geometrically subtle; models can produce locally plausible but globally inconsistent constraints around loops and interfaces. If “integrability enforcement” is real, antibodies are a good place to expose it.

---

## Part 2 — Baseline models (what each baseline proves)

Your slide splits baselines into three buckets. That’s perfect — keep it.

---

### A) Direct Architecture Baselines (closest competitors to our change)

#### 1) OpenFold (our implementation base)

* OpenFold is a **trainable, memory-efficient PyTorch reproduction of AlphaFold2**. ([GitHub][11])
* OpenFold was trained from scratch on OpenProteinSet and matched AF2-level quality (per their Nat Methods paper). ([Nature][2])

**Why it’s essential as a baseline**

* It ensures **fairness + reproducibility**: any gain can be attributed to our gauge layer rather than private training tricks.

#### 2) AlphaFold2 (reference architecture)

* AlphaFold2 is the canonical Evoformer + triangle-updates architecture that set the standard at CASP14. ([Nature][12])

**Why it matters**

* We’re explicitly modifying the “triangle logic → global consistency” pathway that AF2 relies on, so AF2 is the conceptual ground-truth baseline.
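To make the “triangle logic” pathway concrete, here is a stripped-down sketch of AF2’s triangle multiplicative update (“outgoing edges” variant): edge (i, j) is updated from the other two edges of every triangle (i, j, k). This is a simplified illustration; gating, LayerNorm, the final linear projection, and the attention variants are all omitted.

```python
import numpy as np

def triangle_multiplicative_update_outgoing(z, Wa, Wb):
    """Simplified AF2-style triangle update ("outgoing edges" variant).

    z: (n, n, c) pair representation; Wa, Wb: (c, c) projections.
    Each edge is updated from the two other edges of its triangles:
        z_new[i, j] = sum_k a[i, k] * b[j, k]
    Gating, LayerNorm, and the output projection are omitted for clarity.
    """
    a = z @ Wa  # (n, n, c): "left" edge projection
    b = z @ Wb  # (n, n, c): "right" edge projection
    return np.einsum("ikc,jkc->ijc", a, b)

rng = np.random.default_rng(0)
n, c = 8, 4
z = rng.normal(size=(n, n, c))
out = triangle_multiplicative_update_outgoing(
    z, rng.normal(size=(c, c)), rng.normal(size=(c, c))
)
print(out.shape)  # (8, 8, 4)
```

This is the locality gap in miniature: each update only couples one triangle at a time, so nothing in the operation forces the resulting constraints to glue consistently across all triangles — which is exactly where our gauge layer intervenes.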
#### 3) RoseTTAFold (three-track alternative) * RoseTTAFold uses a **three-track network** integrating 1D sequence, 2D distances, and 3D coordinates. ([Science][13]) **Why it matters** * It’s a strong independent architecture; if improvements are only within AF2-like stacks, reviewers will ask why. Comparing against RoseTTAFold shows whether the benefit is specific to triangle/pair style reasoning. #### 4) AlphaFold-Multimer (complex prediction baseline) * AlphaFold-Multimer is an AF2 variant trained specifically for multimeric inputs. ([bioRxiv][14]) **Why it matters** * If we claim improvements on complexes, AF-Multimer is the direct “SOTA-ish” baseline in the AF family. --- ### B) Sequence-Only Control (to rule out “it’s just MSA quality”) #### ESMFold * ESMFold predicts structure **directly from single sequence**, leveraging a protein language model (ESM) — **no MSA, no templates**. ([bioRxiv][15]) **Why it matters** * Our method touches pair/triangle consistency; critics may argue improvements come from better MSA usage or template signal. ESMFold helps separate “MSA pipeline strength” from “geometry reasoning strength.” --- ### C) Topology / Hodge Baselines (to position the math contribution) These aren’t “protein folding competitors” in the same sense — they justify that we’re not hand-waving topology language. #### 1) HodgeFormer * A transformer architecture explicitly inspired by **Discrete Exterior Calculus / Hodge operators** on simplicial complexes. ([arXiv][16]) #### 2) Hodge-aware convolution * Explicitly uses the Hodge decomposition bias (gradient/curl/harmonic components) for learning on simplicial complexes. ([OpenReview][17]) #### 3) Sheaf cohomology * Sheaf cohomology is used to quantify **local-to-global inconsistency**; modern ML work uses sheaf Laplacians/cohomology to characterize irreducible inconsistency patterns. 
([arXiv][18]) **How to present these without over-claiming** * Say: “These are *positioning baselines* — they show the mathematical machinery exists, but we’re the first (in this pitch) to **embed it inside an AF2-like triangle/pair pipeline** to fix integrability of triangle logic.” --- # Part 3 — Speech scripts (what you say on each slide) Below are **two tight scripts** you can basically read. I’ll keep it “direct, no fluff,” but still complete. --- ## Slide 1 Speech — “Dataset Details + Implementation Platform + Evaluation Strategy” **Opening (1 sentence)** “On this slide I’m explaining the dataset stack: what we train on, what we evaluate on, and why each dataset is the right stress test for our ‘global integrability’ idea.” **Dataset table (go row by row)** 1. “For training, we anchor on **OpenProteinSet**, which is the open AF2-scale MSA corpus — millions of MSAs plus PDB-chain coverage — and it’s explicitly what OpenFold used to retrain an AlphaFold2-class model in the open.” ([arXiv][1]) 2. “For controlled generalization, we use **ProteinNet**. The important thing about ProteinNet is it standardizes sequence, structures, MSAs, and crucially gives **CASP-based splits** with **sequence-identity control**, so we can test low-homology performance cleanly.” ([Springer][4]) 3. “For blind-ish evaluation, we use **CAMEO**, which runs continuously every week on **pre-release PDB targets** — it’s basically a rolling, automated complement to CASP. This matters because it’s hard to overfit and it reflects real deployment conditions.” ([cameo3d.org][5]) 4. “For hard targets, we use **CASP15**. CASP15 has **127 modeling targets**, and it’s the most trusted ‘hard benchmark’ setting for structure prediction comparisons.” ([predictioncenter.org][6]) 5. “For complexes, we use **Docking Benchmark 5.5**. 
DB5.5 is designed exactly for docking-style evaluation: it provides **bound complexes plus unbound subunits**, and the updated benchmark contains **230 docking entries**.” ([ScienceDirect][7]) 6. “For antibodies, we include **SAbDab**, which is a curated structural antibody database with **heavy/light pairing annotations** and weekly updates. We use this as a topology-rich interface domain where subtle geometric inconsistency shows up.” ([opig.stats.ox.ac.uk][10]) **Implementation platform box** “Implementation-wise we build on **OpenFold**, because it’s a trainable PyTorch reproduction of AlphaFold2 and is explicitly meant for fair reproducible research — it’s the right place to insert a new operator into the triangle/pair pipeline and measure causality.” ([GitHub][11]) **Evaluation strategy box (tie directly to our hypothesis)** * “For **monomers**, we report **CAMEO + CASP15**: CAMEO gives continuous blind-ish reality checks, CASP gives canonical hard targets.” ([cameo3d.org][5]) * “For **low-homology**, we report **ProteinNet hard splits**, because they isolate the case where evolutionary signal is weak and models are more prone to inconsistent internal constraints.” ([Springer][4]) * “For **complexes**, we evaluate **DB5.5** and score with **DockQ**, which compresses interface correctness and geometry into a clean 0–1 metric.” ([ScienceDirect][7]) **Close (1 sentence)** “So overall: OpenProteinSet/OpenFold gives us reproducible training, ProteinNet tests controlled generalization, CAMEO+CASP15 tests real and hard monomers, and DB5.5+DockQ tests whether enforcing integrability actually improves interfaces.” --- ## Slide 2 Speech — “Baseline Models Being Explored” **Opening** “This slide is about baselines — we want comparisons that isolate whether our gains come from the gauge idea, not from a different backbone or data pipeline.” ### Section 1 — Direct Architecture Baselines 1. 
“First baseline is **OpenFold**, which is our implementation base and the most apples-to-apples comparison: same AF2-class architecture, same training philosophy, just with our gauge layer added.” ([GitHub][11]) 2. “Second is **AlphaFold2**, the reference architecture where triangle updates are the core mechanism — our method is literally targeting the gap between triangle-local consistency and global integrability.” ([Nature][12]) 3. “Third is **RoseTTAFold**, a strong alternative with a **three-track design** that mixes 1D, 2D, and 3D reasoning. This tells us whether improvements are specific to AF2-style triangle/pair logic or more general.” ([Science][13]) 4. “And for complex evaluation, we include **AlphaFold-Multimer**, because it’s the direct AF-family baseline trained for multimeric inputs.” ([bioRxiv][19]) ### Section 2 — Sequence-only Control “Then we include **ESMFold** as a control because it predicts structure from **single sequence only — no MSA**. If our method improves geometric reasoning, we want to show it’s not simply ‘we used MSAs better’.” ([bioRxiv][15]) ### Section 3 — Topology / Hodge Baselines (positioning) “Finally, we list topology-aware baselines to position the math contribution: * **HodgeFormer** shows transformers can explicitly learn operators grounded in DEC/Hodge ideas.” ([arXiv][16]) * “**Hodge-aware convolution** is directly about separating gradient/curl/harmonic components — that’s the same decomposition we’re exploiting, but we’re applying it inside protein pair representations.” ([OpenReview][17]) * “And **sheaf cohomology** is the formal ‘local-to-global consistency’ lens — it quantifies when local constraints fail to glue globally, which is exactly the failure mode we claim to fix.” ([arXiv][18]) **Close** “So the baseline logic is: OpenFold/AF2/RoseTTAFold/AF-Multimer measure direct structure prediction competitiveness, ESMFold rules out MSA confounds, and the topology baselines justify that our constraint is grounded in 
established math — we’re just injecting it into the AF2 triangle/pair pipeline where it actually matters.”

---

If you want, paste the **next slide(s)** after these and I’ll produce the same “teach + speech” breakdown for them too.

[1]: https://arxiv.org/abs/2308.05326 "OpenProteinSet: Training data for structural biology at scale"
[2]: https://www.nature.com/articles/s41592-024-02272-z.pdf "OpenFold: retraining AlphaFold2 yields new insights into its ... - Nature"
[3]: https://openfold.readthedocs.io/en/latest/OpenFold_Training_Setup.html "Setting up the OpenFold PDB training set from RODA"
[4]: https://link.springer.com/article/10.1186/s12859-019-2932-0 "ProteinNet: a standardized data set for machine learning of protein ..."
[5]: https://cameo3d.org/help "Help | CAMEO"
[6]: https://predictioncenter.org/casp15/ "Home - CASP15 - Prediction Center"
[7]: https://www.sciencedirect.com/science/article/abs/pii/S0022283615004180 "Updates to the Integrated Protein–Protein Interaction Benchmarks ..."
[8]: https://github.com/haddocking/BM5-clean "Docking benchmark 5 - cleaned and ready to use for HADDOCK"
[9]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0161879 "DockQ: A Quality Measure for Protein-Protein Docking Models"
[10]: https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/about "SAbDab: The Structural Antibody Database - University of Oxford"
[11]: https://github.com/aqlaboratory/openfold "GitHub - aqlaboratory/openfold: Trainable, memory-efficient, and GPU ..."
[12]: https://www.nature.com/articles/s41586-021-03819-2.pdf "Highly accurate protein structure prediction with AlphaFold"
[13]: https://www.science.org/doi/pdf/10.1126/science.abj8754?download=true "Accurate prediction of protein structures and interactions ... - Science"
[14]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1 "Protein complex prediction with AlphaFold-Multimer - bioRxiv"
[15]: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1.full.pdf "Language models of protein sequences at the scale of ... - bioRxiv"
[16]: https://arxiv.org/html/2509.01839v3 "HodgeFormer: Transformers for Learnable Operators on Triangular Meshes ..."
[17]: https://openreview.net/pdf?id=Nm5sp09Q25 "Hodge-Aware Convolutional Learning on Simplicial Complexes"
[18]: https://arxiv.org/pdf/2511.11092 "Sheaf Cohomology of Linear Predictive Coding Networks - arXiv.org"
[19]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1.full.pdf "Protein complex prediction with AlphaFold-Multimer - bioRxiv"