Report PDB-like percentile sliders #13
Introduces a transparent, rank-style metascore that converts each of the 10 selected AlphaJudge interface features (LIS, ipSAE, pDockQ2, ipTM, confidence, average interface PAE, pDockQ/mpDockQ, shape complementarity, interface area, solvation energy) to its percentile against the frozen benchmark_26 reference distribution and averages them. PAE and solvation energy are sign-flipped so that higher percentile always means stronger interaction evidence, and missing or non-finite inputs are ignored. Reference deciles are baked into BENCHMARK_QUANTILES so the score is reproducible and independent of any per-call benchmark file. Helper calibrated_feature_percentile() can also be called directly to obtain the percentile for an individual feature. The scoring runner now writes interface_meta_score alongside the existing columns in interfaces.csv, so every per-run output has a single ranking number ready for downstream sorting or reporting. Tests in test/test_meta_score.py cover bounded output, NaN handling, direction of inverted features, and clamping at the unit interval.
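The percentile-and-average idea above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names, the linear interpolation between decile anchors, and the `interface_pae` key with its anchor values are assumptions made for the example. Only the `interface_hb` anchors are the ones quoted later in this PR.

```python
import math
from bisect import bisect_right

# Decile anchors (p0, p10, ..., p100), stored after any sign flip.
# interface_hb uses the values quoted in this PR; interface_pae is a
# made-up placeholder row for illustration only.
QUANTILES = {
    "interface_hb": [0, 2, 4, 6, 8, 10, 12, 15, 20, 28, 129],
    "interface_pae": [-30.0, -27.0, -24.0, -21.0, -18.0, -15.0,
                      -12.0, -9.0, -6.0, -3.0, 0.0],
}
# +1: higher raw value is better; -1: lower is better (flipped first).
DIRECTIONS = {"interface_hb": 1, "interface_pae": -1}


def feature_percentile(name, value):
    """Return a percentile in [0, 1], or None for missing/non-finite input."""
    if value is None or not math.isfinite(value):
        return None
    x = DIRECTIONS[name] * value
    anchors = QUANTILES[name]
    if x <= anchors[0]:   # clamp at the bottom of the unit interval
        return 0.0
    if x >= anchors[-1]:  # clamp at the top of the unit interval
        return 1.0
    i = bisect_right(anchors, x) - 1  # anchors[i] <= x < anchors[i + 1]
    lo, hi = anchors[i], anchors[i + 1]
    return (i + (x - lo) / (hi - lo)) / (len(anchors) - 1)


def meta_score(features):
    """Average the available percentiles; ignore features that are missing."""
    pcts = [feature_percentile(k, v) for k, v in features.items() if k in QUANTILES]
    pcts = [p for p in pcts if p is not None]
    return sum(pcts) / len(pcts) if pcts else None
```

Because the anchors are baked in, the same input row always yields the same score, and a NaN feature simply drops out of the average instead of poisoning it.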
Generates PDF reports that visualise each AlphaJudge interface metric as a
percentile slider against the frozen benchmark_26 reference deciles,
mirroring the wwPDB "Overall quality at a glance" layout: smooth red ->
yellow -> green gradient bars with a single black marker, serif
typography, page header rule with title/entry id, and a "Continued on
next page" footer.
Public API:
- generate_per_run_report(run_dir) writes report.pdf next to
interfaces.csv with a cover, an overall slider panel, a per-interface
table, the PAE heatmap, and one per-model appendix page per non-best
sample.
- generate_aggregate_report(summary_csv) writes a multi-page PDF with a
cohort cover (meta-score histogram, summary statistics, top-N table)
plus one slider page per complex, ranked by interface_meta_score.
- main_aggregate() exposes both modes through the alphajudge-report CLI
(wired in a follow-up commit).
Tests cover per-run + aggregate generation, the missing-CSV fallback,
and the fallback that recomputes the meta score when an input row is
missing the precomputed interface_meta_score column.
Surfaces the report module so reports can be generated as part of a
scoring run without an explicit second step.
CLI (alphajudge):
- Mutually-exclusive --report / --no-report flags. Default is on when a
single run is scored and off when --summary is requested, so
benchmark aggregations stay fast.
- --aggregate_report PATH writes a cohort PDF from the --summary CSV
after scoring finishes (errors out if --summary is missing).
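A minimal argparse sketch of the flag behaviour described above. The positional `paths` argument and the `resolve_report_flag` helper are hypothetical names for this example, and "no --summary" is used as a stand-in for "a single run is scored"; the real CLI may decide the default differently.

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(prog="alphajudge")
    p.add_argument("paths", nargs="*")  # hypothetical positional argument
    p.add_argument("--summary", default=None)
    p.add_argument("--aggregate_report", default=None)
    # --report and --no-report share one destination; None means "not given".
    group = p.add_mutually_exclusive_group()
    group.add_argument("--report", dest="report", action="store_true", default=None)
    group.add_argument("--no-report", dest="report", action="store_false", default=None)
    return p


def resolve_report_flag(args):
    """An explicit flag wins; otherwise reports are on unless --summary is set."""
    if args.report is not None:
        return args.report
    return args.summary is None
```

Keeping the default as `None` lets the resolver distinguish "user said nothing" from "user said --no-report", which a plain `store_true` flag cannot do.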
Runner:
- process_many / _process_one_run gain a write_per_run_report kwarg
that invokes the report module via a defensive
_safe_write_per_run_report helper after each per-run CSV is
materialised (including the reuse paths). The helper imports the
report module lazily and swallows any import or runtime error so a
matplotlib hiccup never blocks scoring.
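The defensive pattern described above might look roughly like this. It is a sketch only: the module path `alphajudge.report`, the return value, and the logging call are assumptions, not the actual `_safe_write_per_run_report` implementation.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def safe_write_per_run_report(run_dir, module_name="alphajudge.report"):
    """Try to write the per-run PDF; never raise into the scoring path."""
    try:
        # Lazy import: a missing or broken plotting stack only surfaces here.
        report = importlib.import_module(module_name)
        report.generate_per_run_report(run_dir)
        return True
    except Exception as exc:  # deliberately broad: reports are best-effort
        logger.warning("report generation skipped for %s: %s", run_dir, exc)
        return False
```

Logging the swallowed error, rather than discarding it silently, keeps a failed report diagnosable without letting it abort the run.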
Packaging:
- pyproject bumps version to 1.0.1 and exposes the alphajudge-report
console entry, which dispatches to per-run or aggregate mode based
on whether the input is a directory or a CSV.
When a run has more than one chain-pair interface (any multimer prediction), include a "Per-interface raw scores" page between the overall quality slider panel and the PAE heatmap. Rows are sorted by interface_meta_score descending so the strongest interfaces appear first, making it easy to read off which subcomplex pairs are well predicted and which are not. For single-interface dimers nothing changes -- the page is skipped so the report stays compact. Also tightens the per-interface table layout: column widths give the Interface and Residues headers enough room, the intro text is split across two short lines instead of one wide one that clipped on A4, and the table sits at y_top=0.78 with row_height=0.024 so 15+ rows fit without crowding the footer.
Per-interface pages: generate_per_run_report now produces one "Overall quality at a glance" page per detected interface, sorted by interface_meta_score descending, instead of only showing the best chain pair. For a 15-interface multimer that means 15 slider pages numbered 2.1 .. 2.15, preceded by the cohort overview table. Dimers with a single interface still get exactly one quality page numbered "1" (no sub-index) so their reports stay compact. Each page's pre-header now shows the model name + chain pair + residue count for that specific interface, not just the best one.
Cover info box: Replace "frozen benchmark deciles" with the more accurate "frozen benchmark distribution"; expand the explanation onto a fourth line so the pink box no longer crops the closing words.
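The page-numbering rule above can be illustrated with a small helper. The function name and the dict input are hypothetical; only the "1" versus "2.1 .. 2.N" labelling and the descending sort come from the description.

```python
def quality_page_labels(meta_scores):
    """Map {interface: meta score} to ordered (page label, interface) pairs."""
    ranked = sorted(meta_scores, key=meta_scores.get, reverse=True)
    if len(ranked) == 1:
        # Single-interface dimer: one quality page, no sub-index.
        return [("1", ranked[0])]
    # Multimer: one page per interface, strongest first.
    return [(f"2.{i}", name) for i, name in enumerate(ranked, start=1)]
```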
The cohort cover and per-page summaries now treat each chain-pair
interface as the unit of analysis, not the predicted complex. Concretely,
generate_aggregate_report no longer groups rows by complex to keep a
single "best per complex" entry; it ranks every scorable interface row
directly. Consequences:
- The histogram, min/median/mean/max, ≥0.5 and ≥0.7 counters all run
over interfaces. A 9-chain multimer with 15 detected interfaces
contributes 15 data points; a dimer contributes 1. Mixing the two
in one cohort no longer under-represents multimers.
- The "Top N" table is now "Top N interfaces by meta score" and uses
a "complex · pair" label so multimer rows show which chain pair
they refer to.
- Each per-page slider panel is now an interface page, not a complex
page (renamed _complex_summary_page -> _interface_summary_page).
The header subtitle leads with the interface label and shows
"Rank K of N" against the global interface ranking.
- Backend counts in the cover meta block stay per-complex so they are
not double-counted across a multimer's interfaces.
Cover sub-title says "N interfaces across M complexes" so users see
both axes at a glance. test/test_report.py is updated to match the new
page count (cover + one page per scorable interface row).
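A rough sketch of the interface-level cohort statistics described above. The row keys `complex` and `pair` and the output dict layout are assumptions for illustration; the real report works from the summary CSV columns.

```python
from statistics import mean, median


def cohort_summary(rows):
    """rows: dicts with 'complex', 'pair' and 'interface_meta_score' keys."""
    scored = [r for r in rows if r.get("interface_meta_score") is not None]
    if not scored:
        raise ValueError("no scorable interface rows")
    ranked = sorted(scored, key=lambda r: r["interface_meta_score"], reverse=True)
    scores = [r["interface_meta_score"] for r in ranked]
    n = len(ranked)
    n_complexes = len({r["complex"] for r in ranked})
    return {
        "subtitle": f"{n} interfaces across {n_complexes} complexes",
        "labels": [f"{r['complex']} · {r['pair']}" for r in ranked],
        "ranks": [f"Rank {k} of {n}" for k in range(1, n + 1)],
        "min": min(scores), "median": median(scores),
        "mean": mean(scores), "max": max(scores),
        "n_ge_0.5": sum(s >= 0.5 for s in scores),  # every interface counts once
        "n_ge_0.7": sum(s >= 0.7 for s in scores),
    }
```

Every interface row contributes one data point, so a multimer with 15 interfaces adds 15 entries to the histogram and counters while still counting as one complex in the subtitle.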
Two pieces of repeated text added unnecessary clutter to every page:
- The "Percentile ranks are computed against the AlphaJudge
benchmark distribution; higher is better for every metric
(sign-flipped where needed)." note at the bottom of every quality
page. The slider's "Worse <- -> Better" axis already conveys this.
- The "AlphaJudge" left-side label and the "Continued on next page..."
middle text in the page footer. The thin footer rule plus the page
number on the right are enough.
After this change each page has a header rule (Page N / title / entry
id) and a footer with just the rule and "N / total" page counter.
The page header carried "Page N" on the left, "Title" in the middle, and "entry id" on the right. The footer already shows "N / total" at the bottom right, so the header repeat was redundant. Header now reads "Title ... entry id" only. Module docstring updated.
Adds the new report flags to the CLI synopsis, the option list, the output description, and an example invocation; documents the alphajudge-report console entry that exposes the per-run and aggregate modes outside the scoring pipeline.
After rebasing report_percentiles onto the Boltz-2 work, note explicitly that --report and --aggregate_report flow through the same scoring path for AF2, AF3, and Boltz-2 runs, and broaden the cohort-report example to cover a mixed-backend root.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: edd466c437
    if args.aggregate_report:
        if not args.summary:
            p.error("--aggregate_report requires --summary")
        generate_aggregate_report(args.summary, out_pdf=args.aggregate_report)
Do not build aggregate reports from stale summaries
When --aggregate_report is requested, process_many can return None without writing the requested summary (for example, no runnable directories or all processed runs produce no rows), but this call still reads args.summary afterward. If that path already contains a CSV from a previous run, the command will silently produce an aggregate PDF for stale data instead of failing, which can misreport the current cohort. Capture the returned summary path and only generate the report when this invocation actually produced/reused rows for the requested summary.
process_many returns None whenever it does not actually write the summary (no paths, no runnable dirs, all workers produced no rows). The CLI previously ignored that return value, so if --aggregate_report pointed at a path that already had a CSV from a previous invocation, generate_aggregate_report would silently consume that stale file and emit a PDF for the wrong cohort. Capture the returned summary path and fail loudly via p.error when it is None, instead of building a report from data this invocation did not produce.
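The fix can be sketched as a small guard. The helper name `finish_aggregate` and its signature are hypothetical; the point is only that the path returned by process_many, not args.summary, decides whether a report is built.

```python
import argparse


def finish_aggregate(parser: argparse.ArgumentParser, args: argparse.Namespace,
                     summary_path, generate_aggregate_report):
    """Build the cohort PDF only from a summary this invocation produced.

    summary_path is whatever process_many returned (None if nothing was written).
    """
    if not args.aggregate_report:
        return None
    if not args.summary:
        parser.error("--aggregate_report requires --summary")
    if summary_path is None:
        # A CSV may already exist at args.summary from an earlier run; refuse it.
        parser.error("no summary was written by this invocation; "
                     "refusing to build an aggregate report from stale data")
    generate_aggregate_report(summary_path, out_pdf=args.aggregate_report)
    return args.aggregate_report
```

`parser.error` prints the message and exits with status 2, so a stale CSV on disk can no longer turn into a silently wrong PDF.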
The previous green RdYlGn gradient with boxed % and Value columns made
the page look like a dashboard. Switch to the wwPDB look:
- Red -> pale center -> blue percentile bars (LinearSegmentedColormap),
deliberately thin (bar_height ~0.011 of page).
- Drop the explicit % column; percentile is conveyed by marker
position. Three columns only: Metric | Percentile Ranks | Value.
- Single small black marker per row, plus a thin black polyline
connecting valid percentiles down the chart, exactly as on the
wwPDB "Overall quality at a glance" page.
- "Worse" / "Better" italic labels directly under the bars and a
marker-glyph legend, in the wwPDB position.
Page chrome:
- Cover has no running header. It leads with a small text PDB
wordmark, the report title, and a blue circled-i icon, matching
the RCSB cover.
- Pages 2+ get a single-rule running header: "Page N | Title | entry".
- Footer is just a small wordmark, no page-number rule (the running
header already carries the page label).
- Pink info box is square-cornered and uses wwPDB red.
- Section headings render number + title at 17pt with a tight gap;
the optional info icon is now off by default so it never overlaps
chain-pair labels like "Interface B_C".
Typography:
- rcParams switched to a Computer-Modern-style serif stack
(CMU Serif / Latin Modern / STIXGeneral / DejaVu Serif fallback),
mathtext "cm", base font 10pt, PDF font type 42 so text stays
searchable in Acrobat.
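For reference, the typography settings above map onto standard matplotlib rcParams keys roughly as below. The dict name, the helper, and the exact font family strings (for example "Latin Modern Roman") are assumptions; the report module may spell them differently.

```python
# Serif stack and PDF settings as described above. The family names are
# assumed spellings; matplotlib falls through the list until one is installed.
REPORT_RC = {
    "font.family": "serif",
    "font.serif": ["CMU Serif", "Latin Modern Roman", "STIXGeneral", "DejaVu Serif"],
    "mathtext.fontset": "cm",
    "font.size": 10,
    "pdf.fonttype": 42,  # embed TrueType fonts so text stays selectable
}


def apply_report_style():
    """Apply the settings globally (imports matplotlib only when called)."""
    import matplotlib
    matplotlib.rcParams.update(REPORT_RC)
```

The same dict could instead be passed to `matplotlib.rc_context(REPORT_RC)` to scope the style to report rendering only.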
The slider primitive now exposes a draw_marker switch so the per-row
bar is clean and the marker polyline is overlaid in a single axes,
which avoids axis clipping for off-bar marker rectangles.
Branding (no external logos):
- Remove the generated PDB-style wordmark and the vector chain-pair
logo. The cover page no longer carries any third-party mark; only
the report title remains. The footer carries a plain "AlphaJudge
report" text mark.
- Title is now "AlphaJudge Interface validation Report".
PAE page + standalone PAE PNG (shared rendering):
- report.py exposes render_pae_png(out_path, pae, ...) which produces
an AlphaFold-DB-like standalone PNG: green square heatmap, horizontal
"Expected position error (Ångströms)" colour bar, Scored / Aligned
residue axes, black inter-chain separator lines.
- runner._save_pae_heatmap now delegates to render_pae_png so the
pae_<model>.png files written during scoring match the in-report
PAE page exactly. No if/else fallback path: the report just embeds
the standalone PNG.
Slider panel:
- Show the overall meta score as a separate "Meta score" row at the
top, visually offset from the per-feature rows by an inter-group
gap, in the same typography as the other rows (PDB-validation-style
uniform treatment). The marker polyline never crosses Meta score.
- Split the polyline into two groups that connect related metrics:
the AlphaFold-derived confidence features (LIS, ipSAE, pDockQ2,
ipTM, confidence, avg interface PAE, pDockQ/mpDockQ) and the
biophysical features (shape complementarity, interface area,
solvation energy). Each group has its own connecting line.
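The two-group polyline can be sketched like this. The display labels are taken from the description above, while the function name, the row indexing, and the (percentile, row) point format are assumptions for illustration.

```python
import math

CONFIDENCE_GROUP = ["LIS", "ipSAE", "pDockQ2", "ipTM", "confidence",
                    "avg interface PAE", "pDockQ/mpDockQ"]
BIOPHYSICAL_GROUP = ["shape complementarity", "interface area", "solvation energy"]


def polyline_points(percentiles, groups=(CONFIDENCE_GROUP, BIOPHYSICAL_GROUP)):
    """Return one list of (percentile, row_index) points per group.

    Row 0 is reserved for the Meta score row, which no polyline touches.
    Rows with a missing or non-finite percentile keep their slot but are
    skipped by the connecting line.
    """
    lines, row = [], 1
    for group in groups:
        pts = []
        for label in group:
            p = percentiles.get(label)
            if p is not None and math.isfinite(p):
                pts.append((p, row))
            row += 1
        lines.append(pts)
    return lines
```

Each returned list would be drawn as its own line, so the connector never bridges the gap between the confidence features and the biophysical ones.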
The cover page also drops the older "Overall meta score" label and
restores the canonical "AlphaJudge Interface validation Report" title.
Inspecting the canonical benchmark_26 table (n=7,756 balanced) shows
that interface_area and interface_solv_en are near-redundant
(Pearson r = -0.80), so the previous biophysical trio
(sc / area / solv_en) was effectively (sc / size / size). interface_sc
is the only biophysical feature that is genuinely orthogonal to the
rest, and its AUROC (0.746) is the highest among biophysicals.
Among the size-cluster features, interface_hb (0.703) is the most
interpretable (count of polar contacts), is slightly less correlated
with solv_en than area is, and adds a different physical concept
(directional polar interactions) on top of geometry and hydrophobic
burial. interface_sb (0.681), interface_ss (chance, disulfides too
sparse) and interface_contact_pairs (0.694) are all weaker than
interface_hb and more redundant with area.
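A redundancy check of the kind described above could be run along these lines. The helper names and the 0.8 threshold are assumptions, and the toy numbers are invented for illustration; they are not benchmark_26 data.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def redundant_pairs(table, threshold=0.8):
    """Return (a, b, r) for column pairs whose |r| meets the threshold."""
    out = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            out.append((a, b, round(r, 3)))
    return out


# Toy columns, invented for this example only.
toy = {
    "interface_area": [400.0, 800.0, 1200.0, 1600.0, 2000.0],
    "interface_solv_en": [-4.0, -9.0, -11.0, -17.0, -19.0],
    "interface_sc": [0.6, 0.3, 0.7, 0.4, 0.5],
}
```

On the toy table only the area / solv_en pair is flagged, mirroring the kind of size-versus-size redundancy that motivated the swap.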
Changes:
- META_SCORE_FEATURES swaps interface_area for interface_hb.
- FEATURE_DIRECTIONS keeps interface_area for backward compat but
adds interface_hb (direction +1).
- BENCHMARK_QUANTILES gains an interface_hb entry computed from the
final_sync_20260523 benchmark (deciles 0, 2, 4, 6, 8, 10, 12, 15,
20, 28, 129).
- report.py's _BIOPHYSICAL_FEATURES shows
(sc, hb, solv_en) and the display label "Hydrogen bonds".
- Drop the small "AlphaJudge report" wordmark from the per-page
footer; the running header at the top already identifies the
report on every page.
Existing interfaces.csv files have a stale interface_meta_score
(computed with area instead of hb) until re-scored with
--force_recompute. Tests updated to confirm the new metascore
feature set behaves correctly.
Re-computes the 11-anchor decile table for every metascore feature
on the final synchronized benchmark_26 best-interface CSV
(benchmark_best.final_sync_20260523_225722_force_recompute_nointerfacefix.csv,
n=7,756). The previous values were derived from an earlier n=7,345
April 22 snapshot before the pair-matched predictions were back-filled.
Notable shifts (after sign-flip where applicable):
- interface_LIS: p50 0.060 -> 0.041, p90 0.516 -> 0.510 (slightly tighter)
- interface_pDockQ2: floor lowered from 7.5e-3 -> 0 (more dynamic range)
- interface_sc: p10 -0.210 -> -0.091 (the new dataset has fewer
extreme low-Sc cases at the tail)
- interface_area: p100 23,847 -> 19,027 (one extreme outlier
excluded by the back-fill)
- interface_solv_en: p100 400.7 -> 233.0
- confidence_score: p0 0.127 -> -99.73 (sentinel for failed
predictions now contributes to the tail)
- interface_hb: unchanged (computed on this same table already)
Header comment in BENCHMARK_QUANTILES updated to point at the new
source. interface_meta_score values in existing interfaces.csv files
will move by a few percentage points after re-scoring with
--force_recompute; the 8hhy demo report was re-generated.
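Recomputing an 11-anchor decile table for one feature could look like the sketch below. The function name and the linear interpolation between order statistics (the convention numpy.percentile uses by default) are assumptions; the PR does not state which quantile convention was used.

```python
import math


def decile_anchors(values, direction=1):
    """Return [p0, p10, ..., p100] of the sign-adjusted, finite values."""
    xs = sorted(direction * v for v in values
                if v is not None and math.isfinite(v))
    if not xs:
        raise ValueError("no finite values to summarise")
    n = len(xs)
    anchors = []
    for k in range(11):
        # Position k/10 * (n - 1), split with integer arithmetic to avoid
        # floating-point drift in the index.
        lo, rem = divmod(k * (n - 1), 10)
        hi = min(lo + 1, n - 1)
        anchors.append(xs[lo] + (rem / 10) * (xs[hi] - xs[lo]))
    return anchors
```

Passing `direction=-1` for PAE and solvation energy stores the anchors after the sign flip, so lookups against them need no special casing.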