# hedge gauge mlns

public: true
tags: mlns
time: 2026-02-21 09:31:16

Got it — you’re presenting the **datasets + baselines** slide(s). I’ll do two things:

1. **Explain each dataset + baseline in depth** (what it is, what signal it tests, and *how it maps to our “cohomology-gauged pair representation” claim*).
2. Give you a **clean speech script for each slide** (straight to the point, but complete).

---

## Part 1 — Datasets (what each is, and why it matters for our method)

### 1) OpenProteinSet (Training, “AF2-scale”)

**What it is**

* OpenProteinSet is an open, large-scale corpus of **protein MSAs** plus associated structural links (PDB homologs / AF2 predictions, depending on subset). It was created specifically to enable training AF2-style models at scale in the open. ([arXiv][1])
* OpenFold’s Nat Methods paper states they trained OpenFold from scratch using **OpenProteinSet**, their open reproduction of the AlphaFold2 training set. ([Nature][2])

**What’s inside (practical view)**

* A massive number of MSAs (millions-scale), plus **PDB chains** and template hits. ([arXiv][1])
* OpenFold documentation explicitly points to OpenProteinSet MSAs + mmCIFs as the training ingredients. ([openfold.readthedocs.io][3])

**How we use it for our problem**

* **Role:** Train / fine-tune an AF2-like backbone (OpenFold) with our gauge layer.
* **Why it matches our claim:** Our method modifies **pair/triangle logic**, so the most defensible training source is the one known to reproduce AF2 behavior in an open stack (OpenFold + OpenProteinSet). ([Nature][2])
* **What we measure during training (extra diagnostics):**
  * Track “triangle inconsistency” signals (curl magnitude / integrability residual) on train vs validation to see whether our gauge layer is actually shaping the geometry, not just overfitting.
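The “curl magnitude / integrability residual” diagnostic above can be made concrete with a toy check: project the pair representation to an antisymmetric scalar edge signal (that projection is a hypothetical stand-in here, not our actual gauge layer) and sum it around every residue triangle. A gradient-of-a-potential field sums to exactly zero; a generic antisymmetric field does not. A minimal numpy sketch:

```python
import numpy as np

def triangle_curl_residual(f: np.ndarray) -> float:
    """Mean |f_ij + f_jk + f_ki| over all residue triangles (i < j < k).

    f is an antisymmetric (n, n) edge signal, e.g. a scalar projection of
    the pair representation. Zero residual means f is integrable: it is
    the discrete gradient of some per-residue potential.
    """
    n = f.shape[0]
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                total += abs(f[i, j] + f[j, k] + f[k, i])
                count += 1
    return total / max(count, 1)

rng = np.random.default_rng(0)
n = 12
# Integrable field: discrete gradient of a per-residue potential -> zero curl.
p = rng.normal(size=n)
grad = p[None, :] - p[:, None]
# Generic antisymmetric field -> nonzero curl.
raw = rng.normal(size=(n, n))
noisy = raw - raw.T

print(triangle_curl_residual(grad))   # ~0 (up to float error)
print(triangle_curl_residual(noisy))  # clearly > 0
```

Tracking this residual on train vs validation is cheap and gives the “is the gauge layer shaping geometry or just overfitting?” signal without touching the structure module.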
--- ### 2) ProteinNet (Controlled splits, “CASP7–12”, seq-identity splits) **What it is** * ProteinNet is a standardized dataset for ML on protein structure that bundles **sequence, structure, MSA, PSSM**, plus standardized **train/val/test splits** built around CASP rounds. ([Springer][4]) * The key feature for evaluation is that ProteinNet provides **sequence-identity–controlled splits**, so you can test generalization under reduced homology. ([Springer][4]) **How we use it** * **Role:** “Low-homology generalization” benchmark. * **Why it matters for our method:** Our pitch is *local triangle logic doesn’t guarantee global consistency; enforcing integrability should help when signals are weak/noisy.* Low-homology targets are exactly where MSA/template signal is weaker and models are more likely to produce inconsistent internal geometry. * **What we report:** * Standard structure metrics (lDDT/TM/GDT where applicable), plus our **integrability diagnostics** vs error (does lower curl correlate with better accuracy / calibration?). --- ### 3) CAMEO (Blind-ish eval, weekly newly released PDB) **What it is** * CAMEO is a **continuous automated evaluation** platform that runs **weekly** using **pre-release PDB targets**, designed as an ongoing complement to CASP. ([cameo3d.org][5]) * Targets come from **PDB pre-release**; CAMEO filters and clusters targets and benchmarks prediction servers in a blind, automated way. ([cameo3d.org][5]) **How we use it** * **Role:** “Realistic generalization” — not a static curated set you can overfit to. * **Why it matters for our method:** If our gauge constraint truly improves “global consistency”, it should show up on continuously changing targets where the model can’t memorize quirks. * **Best alignment to slide’s evaluation strategy:** **Monomer evaluation = CAMEO + CASP15** (CAMEO for continuous blind-ish; CASP15 for canonical hard targets). 
([cameo3d.org][5]) --- ### 4) CASP15 (Hard targets benchmark, 127 targets) **What it is** * CASP is the community blind assessment. The CASP15 home page reports **127 modeling targets** and tens of thousands of submitted models. ([predictioncenter.org][6]) **How we use it** * **Role:** “Hard targets / canonical comparison point.” * **Why it matters for our method:** Reviewers trust CASP because it’s the standard “stress test”. If you claim improvement, CASP is where people believe it. --- ### 5) Docking Benchmark 5.5 (DB5.5) (Complexes, bound + unbound, 230 entries) **What it is** * DB5.5 is a widely used protein–protein docking benchmark containing **non-redundant, high-quality bound complexes plus corresponding unbound subunits**. ([ScienceDirect][7]) * The 2015 update describes **230 docking benchmark entries** (and a related affinity benchmark with 179 entries). ([ScienceDirect][7]) * There are clean processed distributions used by the docking community (e.g., matched bound/unbound chain numbering), which makes evaluation reproducible. ([GitHub][8]) **How we use it** * **Role:** “Complex/interface correctness” benchmark. * **Why it matters for our method:** Complexes amplify exactly the failure mode we’re targeting: you can satisfy local constraints within chains but still get **global/interface inconsistency** across chains. If we reduce non-integrable triangle logic in pair space, we expect better interface geometry and fewer contradictory constraints at binding sites. --- ### 6) DockQ (Complex evaluation metric) **What it is** * DockQ is a continuous docking quality score combining **Fnat, LRMSD, iRMSD** into a single number in **[0, 1]**. ([PLOS][9]) **How we use it** * **Role:** Primary metric for DB5.5 interface evaluation. * **Why it matches our story:** DockQ reflects both **contact correctness (Fnat)** and **geometric agreement (RMSDs)**. 
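For reference, the DockQ combination of the three interface terms can be sketched as below. The 1.5 Å and 8.5 Å scaling constants follow the published DockQ formula ([PLOS][9]); treat this as an illustration and use the authors’ DockQ tool for any reported numbers.

```python
def dockq(fnat: float, lrmsd: float, irmsd: float) -> float:
    """DockQ-style score in [0, 1] from Fnat, ligand RMSD, interface RMSD.

    Each RMSD is squashed into (0, 1] via 1 / (1 + (rmsd / d0)**2), with
    d0 = 8.5 A for LRMSD and d0 = 1.5 A for iRMSD, then averaged with Fnat.
    """
    scaled_l = 1.0 / (1.0 + (lrmsd / 8.5) ** 2)
    scaled_i = 1.0 / (1.0 + (irmsd / 1.5) ** 2)
    return (fnat + scaled_l + scaled_i) / 3.0

print(dockq(1.0, 0.0, 0.0))    # perfect interface -> 1.0
print(dockq(0.0, 30.0, 10.0))  # incorrect model -> close to 0
```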
Our method is explicitly about improving geometric consistency, so DockQ is the right “headline number.”

---

### 7) SAbDab (Antibodies, curated topology-rich interfaces)

**What it is**

* SAbDab is a **weekly updated** structural antibody database with annotations like **heavy/light pairing**, nomenclature, and other curated metadata. ([opig.stats.ox.ac.uk][10])
* It supports creating and downloading datasets for analysis and provides standardized annotation (including chain-pairing info). ([opig.stats.ox.ac.uk][10])

**How we use it**

* **Role:** “Topology-rich interfaces / hard geometry domain.”
* **Why it matters for our method:** Antibody–antigen interfaces (and antibody CDR regions) are geometrically subtle; models can produce locally plausible but globally inconsistent constraints around loops and interfaces. If “integrability enforcement” is real, antibodies are a good place to expose it.

---

## Part 2 — Baseline models (what each baseline proves)

Your slide splits baselines into three buckets. That’s perfect — keep it.

---

### A) Direct Architecture Baselines (closest competitors to our change)

#### 1) OpenFold (our implementation base)

* OpenFold is a **trainable, memory-efficient PyTorch reproduction of AlphaFold2**. ([GitHub][11])
* OpenFold was trained from scratch on OpenProteinSet and matched AF2-level quality (per their Nat Methods paper). ([Nature][2])

**Why it’s essential as a baseline**

* It ensures **fairness + reproducibility**: any gain can be attributed to our gauge layer rather than private training tricks.

#### 2) AlphaFold2 (reference architecture)

* AlphaFold2 is the canonical Evoformer + triangle-updates architecture that set the standard at CASP14. ([Nature][12])

**Why it matters**

* We’re explicitly modifying the “triangle logic → global consistency” pathway that AF2 relies on, so AF2 is the conceptual ground-truth baseline.
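To make the “triangle logic” pathway concrete, here is a stripped-down sketch of AF2’s triangle multiplicative update (“outgoing edges” variant): edge (i, j) is updated from the other two edges of every triangle (i, j, k). This is a simplified illustration; gating, LayerNorm, the final linear projection, and the attention variants are all omitted.

```python
import numpy as np

def triangle_multiplicative_update_outgoing(z, Wa, Wb):
    """Simplified AF2-style triangle update ("outgoing edges" variant).

    z: (n, n, c) pair representation; Wa, Wb: (c, c) projections.
    Each edge is updated from the two other edges of its triangles:
        z_new[i, j] = sum_k a[i, k] * b[j, k]
    Gating, LayerNorm, and the output projection are omitted for clarity.
    """
    a = z @ Wa  # (n, n, c): "left" edge projection
    b = z @ Wb  # (n, n, c): "right" edge projection
    return np.einsum("ikc,jkc->ijc", a, b)

rng = np.random.default_rng(0)
n, c = 8, 4
z = rng.normal(size=(n, n, c))
out = triangle_multiplicative_update_outgoing(
    z, rng.normal(size=(c, c)), rng.normal(size=(c, c))
)
print(out.shape)  # (8, 8, 4)
```

This is the locality gap in miniature: each update only couples one triangle at a time, so nothing in the operation forces the resulting constraints to glue consistently across all triangles — which is exactly where our gauge layer intervenes.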
#### 3) RoseTTAFold (three-track alternative) * RoseTTAFold uses a **three-track network** integrating 1D sequence, 2D distances, and 3D coordinates. ([Science][13]) **Why it matters** * It’s a strong independent architecture; if improvements are only within AF2-like stacks, reviewers will ask why. Comparing against RoseTTAFold shows whether the benefit is specific to triangle/pair style reasoning. #### 4) AlphaFold-Multimer (complex prediction baseline) * AlphaFold-Multimer is an AF2 variant trained specifically for multimeric inputs. ([bioRxiv][14]) **Why it matters** * If we claim improvements on complexes, AF-Multimer is the direct “SOTA-ish” baseline in the AF family. --- ### B) Sequence-Only Control (to rule out “it’s just MSA quality”) #### ESMFold * ESMFold predicts structure **directly from single sequence**, leveraging a protein language model (ESM) — **no MSA, no templates**. ([bioRxiv][15]) **Why it matters** * Our method touches pair/triangle consistency; critics may argue improvements come from better MSA usage or template signal. ESMFold helps separate “MSA pipeline strength” from “geometry reasoning strength.” --- ### C) Topology / Hodge Baselines (to position the math contribution) These aren’t “protein folding competitors” in the same sense — they justify that we’re not hand-waving topology language. #### 1) HodgeFormer * A transformer architecture explicitly inspired by **Discrete Exterior Calculus / Hodge operators** on simplicial complexes. ([arXiv][16]) #### 2) Hodge-aware convolution * Explicitly uses the Hodge decomposition bias (gradient/curl/harmonic components) for learning on simplicial complexes. ([OpenReview][17]) #### 3) Sheaf cohomology * Sheaf cohomology is used to quantify **local-to-global inconsistency**; modern ML work uses sheaf Laplacians/cohomology to characterize irreducible inconsistency patterns. 
([arXiv][18]) **How to present these without over-claiming** * Say: “These are *positioning baselines* — they show the mathematical machinery exists, but we’re the first (in this pitch) to **embed it inside an AF2-like triangle/pair pipeline** to fix integrability of triangle logic.” --- # Part 3 — Speech scripts (what you say on each slide) Below are **two tight scripts** you can basically read. I’ll keep it “direct, no fluff,” but still complete. --- ## Slide 1 Speech — “Dataset Details + Implementation Platform + Evaluation Strategy” **Opening (1 sentence)** “On this slide I’m explaining the dataset stack: what we train on, what we evaluate on, and why each dataset is the right stress test for our ‘global integrability’ idea.” **Dataset table (go row by row)** 1. “For training, we anchor on **OpenProteinSet**, which is the open AF2-scale MSA corpus — millions of MSAs plus PDB-chain coverage — and it’s explicitly what OpenFold used to retrain an AlphaFold2-class model in the open.” ([arXiv][1]) 2. “For controlled generalization, we use **ProteinNet**. The important thing about ProteinNet is it standardizes sequence, structures, MSAs, and crucially gives **CASP-based splits** with **sequence-identity control**, so we can test low-homology performance cleanly.” ([Springer][4]) 3. “For blind-ish evaluation, we use **CAMEO**, which runs continuously every week on **pre-release PDB targets** — it’s basically a rolling, automated complement to CASP. This matters because it’s hard to overfit and it reflects real deployment conditions.” ([cameo3d.org][5]) 4. “For hard targets, we use **CASP15**. CASP15 has **127 modeling targets**, and it’s the most trusted ‘hard benchmark’ setting for structure prediction comparisons.” ([predictioncenter.org][6]) 5. “For complexes, we use **Docking Benchmark 5.5**. 
DB5.5 is designed exactly for docking-style evaluation: it provides **bound complexes plus unbound subunits**, and the updated benchmark contains **230 docking entries**.” ([ScienceDirect][7]) 6. “For antibodies, we include **SAbDab**, which is a curated structural antibody database with **heavy/light pairing annotations** and weekly updates. We use this as a topology-rich interface domain where subtle geometric inconsistency shows up.” ([opig.stats.ox.ac.uk][10]) **Implementation platform box** “Implementation-wise we build on **OpenFold**, because it’s a trainable PyTorch reproduction of AlphaFold2 and is explicitly meant for fair reproducible research — it’s the right place to insert a new operator into the triangle/pair pipeline and measure causality.” ([GitHub][11]) **Evaluation strategy box (tie directly to our hypothesis)** * “For **monomers**, we report **CAMEO + CASP15**: CAMEO gives continuous blind-ish reality checks, CASP gives canonical hard targets.” ([cameo3d.org][5]) * “For **low-homology**, we report **ProteinNet hard splits**, because they isolate the case where evolutionary signal is weak and models are more prone to inconsistent internal constraints.” ([Springer][4]) * “For **complexes**, we evaluate **DB5.5** and score with **DockQ**, which compresses interface correctness and geometry into a clean 0–1 metric.” ([ScienceDirect][7]) **Close (1 sentence)** “So overall: OpenProteinSet/OpenFold gives us reproducible training, ProteinNet tests controlled generalization, CAMEO+CASP15 tests real and hard monomers, and DB5.5+DockQ tests whether enforcing integrability actually improves interfaces.” --- ## Slide 2 Speech — “Baseline Models Being Explored” **Opening** “This slide is about baselines — we want comparisons that isolate whether our gains come from the gauge idea, not from a different backbone or data pipeline.” ### Section 1 — Direct Architecture Baselines 1. 
“First baseline is **OpenFold**, which is our implementation base and the most apples-to-apples comparison: same AF2-class architecture, same training philosophy, just with our gauge layer added.” ([GitHub][11]) 2. “Second is **AlphaFold2**, the reference architecture where triangle updates are the core mechanism — our method is literally targeting the gap between triangle-local consistency and global integrability.” ([Nature][12]) 3. “Third is **RoseTTAFold**, a strong alternative with a **three-track design** that mixes 1D, 2D, and 3D reasoning. This tells us whether improvements are specific to AF2-style triangle/pair logic or more general.” ([Science][13]) 4. “And for complex evaluation, we include **AlphaFold-Multimer**, because it’s the direct AF-family baseline trained for multimeric inputs.” ([bioRxiv][19]) ### Section 2 — Sequence-only Control “Then we include **ESMFold** as a control because it predicts structure from **single sequence only — no MSA**. If our method improves geometric reasoning, we want to show it’s not simply ‘we used MSAs better’.” ([bioRxiv][15]) ### Section 3 — Topology / Hodge Baselines (positioning) “Finally, we list topology-aware baselines to position the math contribution: * **HodgeFormer** shows transformers can explicitly learn operators grounded in DEC/Hodge ideas.” ([arXiv][16]) * “**Hodge-aware convolution** is directly about separating gradient/curl/harmonic components — that’s the same decomposition we’re exploiting, but we’re applying it inside protein pair representations.” ([OpenReview][17]) * “And **sheaf cohomology** is the formal ‘local-to-global consistency’ lens — it quantifies when local constraints fail to glue globally, which is exactly the failure mode we claim to fix.” ([arXiv][18]) **Close** “So the baseline logic is: OpenFold/AF2/RoseTTAFold/AF-Multimer measure direct structure prediction competitiveness, ESMFold rules out MSA confounds, and the topology baselines justify that our constraint is grounded in 
established math — we’re just injecting it into the AF2 triangle/pair pipeline where it actually matters.”

---

If you want, paste the **next slide(s)** after these and I’ll produce the same “teach + speech” breakdown for them too.

[1]: https://arxiv.org/abs/2308.05326 "OpenProteinSet: Training data for structural biology at scale"
[2]: https://www.nature.com/articles/s41592-024-02272-z.pdf "OpenFold: retraining AlphaFold2 yields new insights into its ... - Nature"
[3]: https://openfold.readthedocs.io/en/latest/OpenFold_Training_Setup.html "Setting up the OpenFold PDB training set from RODA"
[4]: https://link.springer.com/article/10.1186/s12859-019-2932-0 "ProteinNet: a standardized data set for machine learning of protein ..."
[5]: https://cameo3d.org/help "Help | CAMEO"
[6]: https://predictioncenter.org/casp15/ "Home - CASP15 - Prediction Center"
[7]: https://www.sciencedirect.com/science/article/abs/pii/S0022283615004180 "Updates to the Integrated Protein–Protein Interaction Benchmarks ..."
[8]: https://github.com/haddocking/BM5-clean "Docking benchmark 5 - cleaned and ready to use for HADDOCK"
[9]: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0161879 "DockQ: A Quality Measure for Protein-Protein Docking Models"
[10]: https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/about "SAbDab: The Structural Antibody Database - University of Oxford"
[11]: https://github.com/aqlaboratory/openfold "GitHub - aqlaboratory/openfold: Trainable, memory-efficient, and GPU ..."
[12]: https://www.nature.com/articles/s41586-021-03819-2.pdf "Highly accurate protein structure prediction with AlphaFold"
[13]: https://www.science.org/doi/pdf/10.1126/science.abj8754?download=true "Accurate prediction of protein structures and interactions ... - Science"
[14]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1 "Protein complex prediction with AlphaFold-Multimer - bioRxiv"
[15]: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1.full.pdf "Language models of protein sequences at the scale of ... - bioRxiv"
[16]: https://arxiv.org/html/2509.01839v3 "HodgeFormer: Transformers for Learnable Operators on Triangular Meshes ..."
[17]: https://openreview.net/pdf?id=Nm5sp09Q25 "Hodge-Aware Convolutional Learning on Simplicial Complexes"
[18]: https://arxiv.org/pdf/2511.11092 "Sheaf Cohomology of Linear Predictive Coding Networks - arXiv.org"
[19]: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1.full.pdf "Protein complex prediction with AlphaFold-Multimer - bioRxiv"