
PUBLIC 2026-02-21 09:31:16

Got it — you’re presenting the datasets + baselines slide(s). I’ll do two things:

1. Make you understand each dataset + baseline deeply (what it is, what signal it tests, and how it maps to our “cohomology-gauged pair representation” claim).

2. Give you a clean speech script for each slide (straight to the point, but complete).

---

Part 1 — Datasets (what each is, and why it matters for our method)

### 1) OpenProteinSet (Training, “AF2-scale”)

What it is: the open, AF2-scale training corpus of precomputed MSAs that made retraining an AlphaFold2-class model possible in the open. ([arXiv][1])

What's inside (practical view): millions of MSAs plus coverage of PDB chains, all precomputed, so training runs don't have to rebuild alignments from scratch.

How we use it for our problem: as our training corpus, mirroring OpenFold's retraining setup ([Nature][2]) so that any difference versus the baseline is attributable to the gauge layer, not the data.

---

### 2) ProteinNet (Controlled splits, “CASP7–12”, seq-identity splits)

What it is: a standardized dataset of sequences, structures, and MSAs with CASP7–12-based splits that control sequence identity between train and test. ([Springer][4])

How we use it: to measure generalization to low-homology targets under controlled splits.

Our pitch: local triangle logic doesn't guarantee global consistency, so enforcing integrability should help most when signals are weak or noisy. Low-homology targets are exactly where MSA/template signal is weaker and models are more likely to produce inconsistent internal geometry.
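To make "inconsistent internal geometry" concrete: a predicted pairwise distance map can look locally plausible yet still violate the triangle inequality globally. A toy check (plain NumPy, brute-force O(n³), illustrative only and not part of our pipeline):

```python
import numpy as np

def triangle_violations(D, tol=1e-6):
    """Count triangles (i, j, k) in a distance matrix D that violate
    the triangle inequality. A matrix derived from real 3D coordinates
    has zero violations; a model that is only locally consistent can
    still produce many."""
    n = D.shape[0]
    violations = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                a, b, c = D[i, j], D[j, k], D[i, k]
                # each side must not exceed the sum of the other two
                if a > b + c + tol or b > a + c + tol or c > a + b + tol:
                    violations += 1
    return violations
```

On distances computed from actual coordinates this returns 0; corrupting a single entry of the matrix immediately creates violating triangles, which is the kind of global inconsistency the integrability constraint is meant to suppress.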

---

### 3) CAMEO (Blind-ish eval, weekly newly released PDB)

What it is: a continuous, automated evaluation that runs weekly on pre-release PDB targets, effectively a rolling complement to CASP. ([cameo3d.org][5])

How we use it: as a blind-ish monomer test bed that is hard to overfit and close to deployment conditions.

If our gauge constraint truly improves “global consistency”, it should show up on continuously changing targets where the model can’t memorize quirks.

---

### 4) CASP15 (Hard targets benchmark, 127 targets)

What it is: the community-wide blind assessment round with 127 modeling targets, the field's most trusted hard benchmark. ([predictioncenter.org][6])

How we use it: as the hard-target benchmark where claimed improvements carry the most weight.

Reviewers trust CASP because it’s the standard “stress test”. If you claim improvement, CASP is where people believe it.

---

### 5) Docking Benchmark 5.5 (DB5.5) (Complexes, bound + unbound, 230 entries)

What it is: the standard protein–protein docking benchmark, providing bound complexes plus the corresponding unbound subunits; the updated release contains 230 entries. ([ScienceDirect][7])

How we use it: to test complex prediction, where cross-chain and interface consistency matter most.

Complexes amplify exactly the failure mode we’re targeting: you can satisfy local constraints within chains but still get global/interface inconsistency across chains. If we reduce non-integrable triangle logic in pair space, we expect better interface geometry and fewer contradictory constraints at binding sites.

---

### 6) DockQ (Complex evaluation metric)

What it is: the standard quality score for protein–protein complex models, combining native-contact recovery (Fnat) with interface and ligand RMSDs into a single score between 0 and 1. ([journals.plos.org][9])

How we use it: as the headline metric for every complex benchmark (DB5.5, antibodies).

DockQ reflects both contact correctness (Fnat) and geometric agreement (RMSDs). Our method is explicitly about improving geometric consistency, so DockQ is the right “headline number.”
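For intuition on the contact half of the score, here is a minimal Fnat-style computation, assuming one coordinate per residue and a single distance cutoff. This is an illustration only, not the reference DockQ implementation (which also computes interface and ligand RMSDs):

```python
import numpy as np

def contacts(coords_a, coords_b, cutoff=5.0):
    """Inter-chain contacts: residue pairs (i, j) closer than cutoff.
    coords_a: (n, 3) array for chain A; coords_b: (m, 3) for chain B."""
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    return set(zip(*np.where(d < cutoff)))

def fnat(model_a, model_b, native_a, native_b, cutoff=5.0):
    """Fraction of native inter-chain contacts reproduced by the model."""
    native = contacts(native_a, native_b, cutoff)
    model = contacts(model_a, model_b, cutoff)
    return len(native & model) / max(len(native), 1)
```

A model that reproduces the native interface scores 1.0; shifting one chain away from the interface drives the score to 0, which is exactly the failure mode a chain-consistent pair representation should avoid.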

---

### 7) SAbDab (Antibodies, curated topology-rich interfaces)

What it is: the Structural Antibody Database, a curated, weekly-updated collection of antibody structures with heavy/light pairing and antigen annotations. ([opig.stats.ox.ac.uk][10])

How we use it: as a topology-rich interface domain (CDR loops, antibody–antigen interfaces) for targeted complex evaluation.

Antibody–antigen interfaces (and antibody CDR regions) are geometrically subtle; models can produce locally plausible but globally inconsistent constraints around loops and interfaces. If “integrability enforcement” is real, antibodies are a good place to expose it.

---

Part 2 — Baseline models (what each baseline proves)

Your slide splits baselines into three buckets. That’s perfect — keep it.

---

### A) Direct Architecture Baselines (closest competitors to our change)

#### 1) OpenFold (our implementation base)

Why it's essential as a baseline: it's our implementation base, so the comparison is maximally apples-to-apples: same AF2-class architecture and training pipeline, with and without the gauge layer. ([GitHub][11])

#### 2) AlphaFold2 (reference architecture)

Why it matters: it's the reference architecture whose triangle updates are the exact mechanism we target; our claim lives in the gap between triangle-local consistency and global integrability. ([Nature][12])

#### 3) RoseTTAFold (three-track alternative)

Why it matters: its three-track design mixes 1D, 2D, and 3D reasoning, so it tests whether our improvements are specific to AF2-style triangle/pair logic or more general. ([Science][13])

#### 4) AlphaFold-Multimer (complex prediction baseline)

Why it matters: it's the direct AF-family baseline trained for multimeric inputs, and thus the natural comparison on DB5.5 and SAbDab. ([bioRxiv][19])

---

### B) Sequence-Only Control (to rule out “it’s just MSA quality”)

#### ESMFold

Why it matters: it predicts structure from a single sequence with no MSA, so it controls for the possibility that our gains come merely from better MSA usage. ([bioRxiv][15])

---

### C) Topology / Hodge Baselines (to position the math contribution)

These aren’t “protein folding competitors” in the same sense — they justify that we’re not hand-waving topology language.

#### 1) HodgeFormer

Transformers with learnable Hodge-style operators on triangular meshes; evidence that Hodge-theoretic operators can be trained inside attention architectures. ([arXiv][16])

#### 2) Hodge-aware convolution

Convolutional learning on simplicial complexes that respects the Hodge decomposition; precedent for building that decomposition into a learning layer. ([openreview.net][17])

#### 3) Sheaf cohomology

Cohomological consistency analysis of networks (e.g., linear predictive coding); the same mathematical language in which we phrase our integrability condition. ([arXiv][18])

How to present these without over-claiming: frame them as evidence that the machinery is established in machine learning, not as head-to-head protein-folding competitors; our contribution is injecting that machinery into the AF2 pair pipeline.

---

Part 3 — Speech scripts (what you say on each slide)

Below are two tight scripts you can basically read. I’ll keep it “direct, no fluff,” but still complete.

---

Slide 1 Speech — “Dataset Details + Implementation Platform + Evaluation Strategy”

Opening (1 sentence)

“On this slide I’m explaining the dataset stack: what we train on, what we evaluate on, and why each dataset is the right stress test for our ‘global integrability’ idea.”

Dataset table (go row by row)

1. “For training, we anchor on OpenProteinSet, which is the open AF2-scale MSA corpus — millions of MSAs plus PDB-chain coverage — and it’s explicitly what OpenFold used to retrain an AlphaFold2-class model in the open.” ([arXiv][1])

2. “For controlled generalization, we use ProteinNet. The important thing about ProteinNet is it standardizes sequence, structures, MSAs, and crucially gives CASP-based splits with sequence-identity control, so we can test low-homology performance cleanly.” ([Springer][4])

3. “For blind-ish evaluation, we use CAMEO, which runs continuously every week on pre-release PDB targets — it’s basically a rolling, automated complement to CASP. This matters because it’s hard to overfit and it reflects real deployment conditions.” ([cameo3d.org][5])

4. “For hard targets, we use CASP15. CASP15 has 127 modeling targets, and it’s the most trusted ‘hard benchmark’ setting for structure prediction comparisons.” ([predictioncenter.org][6])

5. “For complexes, we use Docking Benchmark 5.5. DB5.5 is designed exactly for docking-style evaluation: it provides bound complexes plus unbound subunits, and the updated benchmark contains 230 docking entries.” ([ScienceDirect][7])

6. “For antibodies, we include SAbDab, which is a curated structural antibody database with heavy/light pairing annotations and weekly updates. We use this as a topology-rich interface domain where subtle geometric inconsistency shows up.” ([opig.stats.ox.ac.uk][10])

Implementation platform box

“Implementation-wise we build on OpenFold, because it’s a trainable PyTorch reproduction of AlphaFold2 and is explicitly meant for fair reproducible research — it’s the right place to insert a new operator into the triangle/pair pipeline and measure causality.” ([GitHub][11])
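As a structural sketch of what "inserting a new operator into the pair pipeline" could look like: the layer below acts residually on an AF2-style pair tensor. The class name, the gating, and the soft-symmetrization stand-in are all hypothetical; this is neither our gauge layer nor OpenFold's API, just the shape of the insertion point.

```python
import torch
import torch.nn as nn

class PairConsistencyLayer(nn.Module):
    """Illustrative residual layer on an AF2-style pair representation
    z of shape (batch, n_res, n_res, c_z).

    As a stand-in for a real integrability constraint, it softly pulls
    z_ij toward z_ji via a gated linear update; an actual gauge layer
    would enforce a richer, triangle-level condition."""
    def __init__(self, c_z: int):
        super().__init__()
        self.norm = nn.LayerNorm(c_z)
        self.proj = nn.Linear(c_z, c_z)
        self.gate = nn.Linear(c_z, c_z)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm(z)
        sym = 0.5 * (h + h.transpose(-2, -3))  # swap the two residue axes
        update = torch.sigmoid(self.gate(h)) * self.proj(sym)
        return z + update  # residual, so it slots between existing blocks
```

The residual form is the key design choice: it lets the layer be dropped between existing triangle-update blocks without disturbing the pretrained signal path, so ablating it gives a clean causal comparison.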

Evaluation strategy box (tie directly to our hypothesis)

“Our hypothesis predicts the largest gains where local signal is weak and global consistency matters most: low-homology ProteinNet splits, hard CASP15 targets, and multi-chain interfaces scored with DockQ. That is the pattern we will look for.”

Close (1 sentence)

“So overall: OpenProteinSet/OpenFold gives us reproducible training, ProteinNet tests controlled generalization, CAMEO+CASP15 tests real and hard monomers, and DB5.5+DockQ tests whether enforcing integrability actually improves interfaces.”

---

Slide 2 Speech — “Baseline Models Being Explored”

Opening

“This slide is about baselines — we want comparisons that isolate whether our gains come from the gauge idea, not from a different backbone or data pipeline.”

### Section 1 — Direct Architecture Baselines

1. “First baseline is OpenFold, which is our implementation base and the most apples-to-apples comparison: same AF2-class architecture, same training philosophy, just with our gauge layer added.” ([GitHub][11])

2. “Second is AlphaFold2, the reference architecture where triangle updates are the core mechanism — our method is literally targeting the gap between triangle-local consistency and global integrability.” ([Nature][12])

3. “Third is RoseTTAFold, a strong alternative with a three-track design that mixes 1D, 2D, and 3D reasoning. This tells us whether improvements are specific to AF2-style triangle/pair logic or more general.” ([Science][13])

4. “And for complex evaluation, we include AlphaFold-Multimer, because it’s the direct AF-family baseline trained for multimeric inputs.” ([bioRxiv][19])

### Section 2 — Sequence-only Control

“Then we include ESMFold as a control because it predicts structure from single sequence only — no MSA. If our method improves geometric reasoning, we want to show it’s not simply ‘we used MSAs better’.” ([bioRxiv][15])

### Section 3 — Topology / Hodge Baselines (positioning)

“Finally, we list topology-aware baselines to position the math contribution: HodgeFormer shows Hodge-style operators can live inside transformers, Hodge-aware convolution builds the Hodge decomposition into learning on simplicial complexes, and sheaf cohomology gives the exact language for our global-consistency condition. None of these are folding competitors; they show the math we rely on is established.” ([arXiv][16], [openreview.net][17], [arXiv][18])

Close

“So the baseline logic is: OpenFold/AF2/RoseTTAFold/AF-Multimer measure direct structure prediction competitiveness, ESMFold rules out MSA confounds, and the topology baselines justify that our constraint is grounded in established math — we’re just injecting it into the AF2 triangle/pair pipeline where it actually matters.”

---

If you want, paste the next slide(s) after these and I’ll produce the same “teach + speech” breakdown for them too.

[1]: https://arxiv.org/abs/2308.05326?utm_source=chatgpt.com "OpenProteinSet: Training data for structural biology at scale"

[2]: https://www.nature.com/articles/s41592-024-02272-z.pdf?utm_source=chatgpt.com "OpenFold: retraining AlphaFold2 yields new insights into its ... - Nature"

[3]: https://openfold.readthedocs.io/en/latest/OpenFold_Training_Setup.html?utm_source=chatgpt.com "Setting up the OpenFold PDB training set from RODA"

[4]: https://link.springer.com/article/10.1186/s12859-019-2932-0?utm_source=chatgpt.com "ProteinNet: a standardized data set for machine learning of protein ..."

[5]: https://cameo3d.org/help?utm_source=chatgpt.com "Help | CAMEO"

[6]: https://predictioncenter.org/casp15/?utm_source=chatgpt.com "Home - CASP15 - Prediction Center"

[7]: https://www.sciencedirect.com/science/article/abs/pii/S0022283615004180?utm_source=chatgpt.com "Updates to the Integrated Protein–Protein Interaction Benchmarks ..."

[8]: https://github.com/haddocking/BM5-clean?utm_source=chatgpt.com "Docking benchmark 5 - cleaned and ready to use for HADDOCK"

[9]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0161879&utm_source=chatgpt.com "DockQ: A Quality Measure for Protein-Protein Docking Models"

[10]: https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/about?utm_source=chatgpt.com "SAbDab: The Structural Antibody Database - University of Oxford"

[11]: https://github.com/aqlaboratory/openfold?utm_source=chatgpt.com "GitHub - aqlaboratory/openfold: Trainable, memory-efficient, and GPU ..."

[12]: https://www.nature.com/articles/s41586-021-03819-2.pdf?utm_source=chatgpt.com "Highly accurate protein structure prediction with AlphaFold"

[13]: https://www.science.org/doi/pdf/10.1126/science.abj8754?download=true&utm_source=chatgpt.com "Accurate prediction of protein structures and interactions ... - Science"

[14]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1?utm_source=chatgpt.com "Protein complex prediction with AlphaFold-Multimer - bioRxiv"

[15]: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1.full.pdf?utm_source=chatgpt.com "Language models of protein sequences at the scale of ... - bioRxiv"

[16]: https://arxiv.org/html/2509.01839v3?utm_source=chatgpt.com "HodgeFormer: Transformers for Learnable Operators on Triangular Meshes ..."

[17]: https://openreview.net/pdf?id=Nm5sp09Q25&utm_source=chatgpt.com "Hodge-AwareConvolutionalLearning onSimplicialComplexes"

[18]: https://arxiv.org/pdf/2511.11092?utm_source=chatgpt.com "Sheaf Cohomology of Linear Predictive Coding Networks - arXiv.org"

[19]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1.full.pdf?utm_source=chatgpt.com "Protein complex prediction with AlphaFold-Multimer - bioRxiv"