
PUBLIC 2026-02-21 09:31:16

Got it — you’re presenting the datasets + baselines slide(s). I’ll do two things:

1. Make you understand each dataset + baseline deeply (what it is, what signal it tests, and how it maps to our “cohomology-gauged pair representation” claim).

2. Give you a clean speech script for each slide (straight to the point, but complete).

---

Part 1 — Datasets (what each is, and why it matters for our method)

### 1) OpenProteinSet (Training, “AF2-scale”)

What it is: the open, AF2-scale training corpus of precomputed MSAs that made retraining an AlphaFold2-class model possible in the open. ([arXiv][1])

What's inside (practical view): millions of MSAs plus coverage of PDB chains, all precomputed, so training runs don't have to rebuild alignments from scratch.

How we use it for our problem: as our training corpus, mirroring OpenFold's retraining setup ([Nature][2]) so that any difference versus the baseline is attributable to the gauge layer, not the data.

---

### 2) ProteinNet (Controlled splits, “CASP7–12”, seq-identity splits)

What it is: a standardized dataset of sequences, structures, and MSAs with CASP7–12-based splits that control sequence identity between train and test. ([Springer][4])

How we use it: to measure generalization to low-homology targets under controlled splits.

Our pitch: local triangle logic doesn't guarantee global consistency, so enforcing integrability should help most when signals are weak or noisy. Low-homology targets are exactly where MSA/template signal is weaker and models are more likely to produce inconsistent internal geometry.
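To make "inconsistent internal geometry" concrete: a predicted pairwise distance map can look locally plausible yet still violate the triangle inequality globally. A toy check (plain NumPy, brute-force O(n³), illustrative only and not part of our pipeline):

```python
import numpy as np

def triangle_violations(D, tol=1e-6):
    """Count triangles (i, j, k) in a distance matrix D that violate
    the triangle inequality. A matrix derived from real 3D coordinates
    has zero violations; a model that is only locally consistent can
    still produce many."""
    n = D.shape[0]
    violations = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                a, b, c = D[i, j], D[j, k], D[i, k]
                # each side must not exceed the sum of the other two
                if a > b + c + tol or b > a + c + tol or c > a + b + tol:
                    violations += 1
    return violations
```

On distances computed from actual coordinates this returns 0; corrupting a single entry of the matrix immediately creates violating triangles, which is the kind of global inconsistency the integrability constraint is meant to suppress.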

---

### 3) CAMEO (Blind-ish eval, weekly newly released PDB)

What it is: a continuous, automated evaluation that runs weekly on pre-release PDB targets, effectively a rolling complement to CASP. ([cameo3d.org][5])

How we use it: as a blind-ish monomer test bed that is hard to overfit and close to deployment conditions.

If our gauge constraint truly improves “global consistency”, it should show up on continuously changing targets where the model can’t memorize quirks.

---

### 4) CASP15 (Hard targets benchmark, 127 targets)

What it is: the community-wide blind assessment round with 127 modeling targets, the field's most trusted hard benchmark. ([predictioncenter.org][6])

How we use it: as the hard-target benchmark where claimed improvements carry the most weight.

Reviewers trust CASP because it’s the standard “stress test”. If you claim improvement, CASP is where people believe it.

---

### 5) Docking Benchmark 5.5 (DB5.5) (Complexes, bound + unbound, 230 entries)

What it is: the standard protein–protein docking benchmark, providing bound complexes plus the corresponding unbound subunits; the updated release contains 230 entries. ([ScienceDirect][7])

How we use it: to test complex prediction, where cross-chain and interface consistency matter most.

Complexes amplify exactly the failure mode we’re targeting: you can satisfy local constraints within chains but still get global/interface inconsistency across chains. If we reduce non-integrable triangle logic in pair space, we expect better interface geometry and fewer contradictory constraints at binding sites.

---

### 6) DockQ (Complex evaluation metric)

What it is: the standard quality score for protein–protein complex models, combining native-contact recovery (Fnat) with interface and ligand RMSDs into a single score between 0 and 1. ([journals.plos.org][9])

How we use it: as the headline metric for every complex benchmark (DB5.5, antibodies).

DockQ reflects both contact correctness (Fnat) and geometric agreement (RMSDs). Our method is explicitly about improving geometric consistency, so DockQ is the right “headline number.”
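For intuition on the contact half of the score, here is a minimal Fnat-style computation, assuming one coordinate per residue and a single distance cutoff. This is an illustration only, not the reference DockQ implementation (which also computes interface and ligand RMSDs):

```python
import numpy as np

def contacts(coords_a, coords_b, cutoff=5.0):
    """Inter-chain contacts: residue pairs (i, j) closer than cutoff.
    coords_a: (n, 3) array for chain A; coords_b: (m, 3) for chain B."""
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    return set(zip(*np.where(d < cutoff)))

def fnat(model_a, model_b, native_a, native_b, cutoff=5.0):
    """Fraction of native inter-chain contacts reproduced by the model."""
    native = contacts(native_a, native_b, cutoff)
    model = contacts(model_a, model_b, cutoff)
    return len(native & model) / max(len(native), 1)
```

A model that reproduces the native interface scores 1.0; shifting one chain away from the interface drives the score to 0, which is exactly the failure mode a chain-consistent pair representation should avoid.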

---

### 7) SAbDab (Antibodies, curated topology-rich interfaces)

What it is: the Structural Antibody Database, a curated, weekly-updated collection of antibody structures with heavy/light pairing and antigen annotations. ([opig.stats.ox.ac.uk][10])

How we use it: as a topology-rich interface domain (CDR loops, antibody–antigen interfaces) for targeted complex evaluation.

Antibody–antigen interfaces (and antibody CDR regions) are geometrically subtle; models can produce locally plausible but globally inconsistent constraints around loops and interfaces. If “integrability enforcement” is real, antibodies are a good place to expose it.

---

Part 2 — Baseline models (what each baseline proves)

Your slide splits baselines into three buckets. That’s perfect — keep it.

---

### A) Direct Architecture Baselines (closest competitors to our change)

#### 1) OpenFold (our implementation base)

Why it's essential as a baseline: it's our implementation base, so the comparison is maximally apples-to-apples: same AF2-class architecture and training pipeline, with and without the gauge layer. ([GitHub][11])

#### 2) AlphaFold2 (reference architecture)

Why it matters: it's the reference architecture whose triangle updates are the exact mechanism we target; our claim lives in the gap between triangle-local consistency and global integrability. ([Nature][12])

#### 3) RoseTTAFold (three-track alternative)

Why it matters: its three-track design mixes 1D, 2D, and 3D reasoning, so it tests whether our improvements are specific to AF2-style triangle/pair logic or more general. ([Science][13])

#### 4) AlphaFold-Multimer (complex prediction baseline)

Why it matters: it's the direct AF-family baseline trained for multimeric inputs, and thus the natural comparison on DB5.5 and SAbDab. ([bioRxiv][19])

---

### B) Sequence-Only Control (to rule out “it’s just MSA quality”)

#### ESMFold

Why it matters: it predicts structure from a single sequence with no MSA, so it controls for the possibility that our gains come merely from better MSA usage. ([bioRxiv][15])

---

### C) Topology / Hodge Baselines (to position the math contribution)

These aren’t “protein folding competitors” in the same sense — they justify that we’re not hand-waving topology language.

#### 1) HodgeFormer

Transformers with learnable Hodge-style operators on triangular meshes; evidence that Hodge-theoretic operators can be trained inside attention architectures. ([arXiv][16])

#### 2) Hodge-aware convolution

Convolutional learning on simplicial complexes that respects the Hodge decomposition; precedent for building that decomposition into a learning layer. ([openreview.net][17])

#### 3) Sheaf cohomology

Cohomological consistency analysis of networks (e.g., linear predictive coding); the same mathematical language in which we phrase our integrability condition. ([arXiv][18])

How to present these without over-claiming: frame them as evidence that the machinery is established in machine learning, not as head-to-head protein-folding competitors; our contribution is injecting that machinery into the AF2 pair pipeline.

---

Part 3 — Speech scripts (what you say on each slide)

Below are two tight scripts you can basically read. I’ll keep it “direct, no fluff,” but still complete.

---

Slide 1 Speech — “Dataset Details + Implementation Platform + Evaluation Strategy”

Opening (1 sentence)

“On this slide I’m explaining the dataset stack: what we train on, what we evaluate on, and why each dataset is the right stress test for our ‘global integrability’ idea.”

Dataset table (go row by row)

1. “For training, we anchor on OpenProteinSet, which is the open AF2-scale MSA corpus — millions of MSAs plus PDB-chain coverage — and it’s explicitly what OpenFold used to retrain an AlphaFold2-class model in the open.” ([arXiv][1])

2. “For controlled generalization, we use ProteinNet. The important thing about ProteinNet is it standardizes sequence, structures, MSAs, and crucially gives CASP-based splits with sequence-identity control, so we can test low-homology performance cleanly.” ([Springer][4])

3. “For blind-ish evaluation, we use CAMEO, which runs continuously every week on pre-release PDB targets — it’s basically a rolling, automated complement to CASP. This matters because it’s hard to overfit and it reflects real deployment conditions.” ([cameo3d.org][5])

4. “For hard targets, we use CASP15. CASP15 has 127 modeling targets, and it’s the most trusted ‘hard benchmark’ setting for structure prediction comparisons.” ([predictioncenter.org][6])

5. “For complexes, we use Docking Benchmark 5.5. DB5.5 is designed exactly for docking-style evaluation: it provides bound complexes plus unbound subunits, and the updated benchmark contains 230 docking entries.” ([ScienceDirect][7])

6. “For antibodies, we include SAbDab, which is a curated structural antibody database with heavy/light pairing annotations and weekly updates. We use this as a topology-rich interface domain where subtle geometric inconsistency shows up.” ([opig.stats.ox.ac.uk][10])

Implementation platform box

“Implementation-wise we build on OpenFold, because it’s a trainable PyTorch reproduction of AlphaFold2 and is explicitly meant for fair reproducible research — it’s the right place to insert a new operator into the triangle/pair pipeline and measure causality.” ([GitHub][11])
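As a structural sketch of what "inserting a new operator into the pair pipeline" could look like: the layer below acts residually on an AF2-style pair tensor. The class name, the gating, and the soft-symmetrization stand-in are all hypothetical; this is neither our gauge layer nor OpenFold's API, just the shape of the insertion point.

```python
import torch
import torch.nn as nn

class PairConsistencyLayer(nn.Module):
    """Illustrative residual layer on an AF2-style pair representation
    z of shape (batch, n_res, n_res, c_z).

    As a stand-in for a real integrability constraint, it softly pulls
    z_ij toward z_ji via a gated linear update; an actual gauge layer
    would enforce a richer, triangle-level condition."""
    def __init__(self, c_z: int):
        super().__init__()
        self.norm = nn.LayerNorm(c_z)
        self.proj = nn.Linear(c_z, c_z)
        self.gate = nn.Linear(c_z, c_z)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm(z)
        sym = 0.5 * (h + h.transpose(-2, -3))  # swap the two residue axes
        update = torch.sigmoid(self.gate(h)) * self.proj(sym)
        return z + update  # residual, so it slots between existing blocks
```

The residual form is the key design choice: it lets the layer be dropped between existing triangle-update blocks without disturbing the pretrained signal path, so ablating it gives a clean causal comparison.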

Evaluation strategy box (tie directly to our hypothesis)

“Our hypothesis predicts the largest gains where local signal is weak and global consistency matters most: low-homology ProteinNet splits, hard CASP15 targets, and multi-chain interfaces scored with DockQ. That is the pattern we will look for.”

Close (1 sentence)

“So overall: OpenProteinSet/OpenFold gives us reproducible training, ProteinNet tests controlled generalization, CAMEO+CASP15 tests real and hard monomers, and DB5.5+DockQ tests whether enforcing integrability actually improves interfaces.”

---

Slide 2 Speech — “Baseline Models Being Explored”

Opening

“This slide is about baselines — we want comparisons that isolate whether our gains come from the gauge idea, not from a different backbone or data pipeline.”

### Section 1 — Direct Architecture Baselines

1. “First baseline is OpenFold, which is our implementation base and the most apples-to-apples comparison: same AF2-class architecture, same training philosophy, just with our gauge layer added.” ([GitHub][11])

2. “Second is AlphaFold2, the reference architecture where triangle updates are the core mechanism — our method is literally targeting the gap between triangle-local consistency and global integrability.” ([Nature][12])

3. “Third is RoseTTAFold, a strong alternative with a three-track design that mixes 1D, 2D, and 3D reasoning. This tells us whether improvements are specific to AF2-style triangle/pair logic or more general.” ([Science][13])

4. “And for complex evaluation, we include AlphaFold-Multimer, because it’s the direct AF-family baseline trained for multimeric inputs.” ([bioRxiv][19])

### Section 2 — Sequence-only Control

“Then we include ESMFold as a control because it predicts structure from single sequence only — no MSA. If our method improves geometric reasoning, we want to show it’s not simply ‘we used MSAs better’.” ([bioRxiv][15])

### Section 3 — Topology / Hodge Baselines (positioning)

“Finally, we list topology-aware baselines to position the math contribution: HodgeFormer shows Hodge-style operators can live inside transformers, Hodge-aware convolution builds the Hodge decomposition into learning on simplicial complexes, and sheaf cohomology gives the exact language for our global-consistency condition. None of these are folding competitors; they show the math we rely on is established.” ([arXiv][16], [openreview.net][17], [arXiv][18])

Close

“So the baseline logic is: OpenFold/AF2/RoseTTAFold/AF-Multimer measure direct structure prediction competitiveness, ESMFold rules out MSA confounds, and the topology baselines justify that our constraint is grounded in established math — we’re just injecting it into the AF2 triangle/pair pipeline where it actually matters.”

---

If you want, paste the next slide(s) after these and I’ll produce the same “teach + speech” breakdown for them too.

[1]: https://arxiv.org/abs/2308.05326?utm_source=chatgpt.com "OpenProteinSet: Training data for structural biology at scale"

[2]: https://www.nature.com/articles/s41592-024-02272-z.pdf?utm_source=chatgpt.com "OpenFold: retraining AlphaFold2 yields new insights into its ... - Nature"

[3]: https://openfold.readthedocs.io/en/latest/OpenFold_Training_Setup.html?utm_source=chatgpt.com "Setting up the OpenFold PDB training set from RODA"

[4]: https://link.springer.com/article/10.1186/s12859-019-2932-0?utm_source=chatgpt.com "ProteinNet: a standardized data set for machine learning of protein ..."

[5]: https://cameo3d.org/help?utm_source=chatgpt.com "Help | CAMEO"

[6]: https://predictioncenter.org/casp15/?utm_source=chatgpt.com "Home - CASP15 - Prediction Center"

[7]: https://www.sciencedirect.com/science/article/abs/pii/S0022283615004180?utm_source=chatgpt.com "Updates to the Integrated Protein–Protein Interaction Benchmarks ..."

[8]: https://github.com/haddocking/BM5-clean?utm_source=chatgpt.com "Docking benchmark 5 - cleaned and ready to use for HADDOCK"

[9]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0161879&utm_source=chatgpt.com "DockQ: A Quality Measure for Protein-Protein Docking Models"

[10]: https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/about?utm_source=chatgpt.com "SAbDab: The Structural Antibody Database - University of Oxford"

[11]: https://github.com/aqlaboratory/openfold?utm_source=chatgpt.com "GitHub - aqlaboratory/openfold: Trainable, memory-efficient, and GPU ..."

[12]: https://www.nature.com/articles/s41586-021-03819-2.pdf?utm_source=chatgpt.com "Highly accurate protein structure prediction with AlphaFold"

[13]: https://www.science.org/doi/pdf/10.1126/science.abj8754?download=true&utm_source=chatgpt.com "Accurate prediction of protein structures and interactions ... - Science"

[14]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1?utm_source=chatgpt.com "Protein complex prediction with AlphaFold-Multimer - bioRxiv"

[15]: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1.full.pdf?utm_source=chatgpt.com "Language models of protein sequences at the scale of ... - bioRxiv"

[16]: https://arxiv.org/html/2509.01839v3?utm_source=chatgpt.com "HodgeFormer: Transformers for Learnable Operators on Triangular Meshes ..."

[17]: https://openreview.net/pdf?id=Nm5sp09Q25&utm_source=chatgpt.com "Hodge-AwareConvolutionalLearning onSimplicialComplexes"

[18]: https://arxiv.org/pdf/2511.11092?utm_source=chatgpt.com "Sheaf Cohomology of Linear Predictive Coding Networks - arXiv.org"

[19]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1.full.pdf?utm_source=chatgpt.com "Protein complex prediction with AlphaFold-Multimer - bioRxiv"