
Record: PR #1855 base + activation-aware GPTQ mixed precision - val_bpb 1.06081 (3-seed mean)#1908

Open
romeerp wants to merge 1 commit into openai:main from romeerp:codex/awq-stepmatched

Conversation

@romeerp
Contributor

@romeerp romeerp commented Apr 28, 2026

Record: PR #1855 base + activation-aware GPTQ mixed precision

Matched-step 3-seed mean val_bpb: 1.06081076 (std 0.00089) | ~15.99 MB | 8×H100 SXM | full TTT eval

This submission keeps the PR #1855 training recipe unchanged and changes only the quantization step, replacing it with an activation-aware mixed-precision GPTQ path:

  1. collect per-input-channel activation RMS during the existing GPTQ calibration pass
  2. score candidate column groups with an AWQ-style heuristic
    • weight_score = mean(abs(w), dim=0)
    • saliency = act_rms * weight_score
    • group_score = saliency[start:end].sum()
  3. select one salient 64-column group
  4. quantize that group at int8 inside the same full-tensor GPTQ solve
  5. keep stock PR #1855 LQER on top of the resulting AWQ-aware GPTQ base (the group-selection heuristic is sketched below)
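
For concreteness, here is a minimal sketch of the group-selection heuristic described in steps 1-3 (illustrative Python only, not the actual train_gpt.py code; the function and variable names are assumptions):

```python
import torch

def select_salient_group(weight: torch.Tensor,
                         act_rms: torch.Tensor,
                         group_size: int = 64) -> tuple[int, int]:
    """Pick the column group with the highest activation-weighted saliency.

    weight:  [out_features, in_features] linear weight
    act_rms: [in_features] per-input-channel activation RMS from calibration
    """
    weight_score = weight.abs().mean(dim=0)   # mean(abs(w), dim=0)
    saliency = act_rms * weight_score         # AWQ-style saliency
    n_groups = weight.shape[1] // group_size
    group_scores = saliency[: n_groups * group_size].view(n_groups, group_size).sum(dim=1)
    g = int(group_scores.argmax())
    return g * group_size, (g + 1) * group_size   # [start, end) column range
```

The selected [start, end) columns would then be quantized at 8 bits while the rest of the tensor keeps the base bit-width, inside the same Hessian-based GPTQ solve (steps 4-5).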

The GPUs I had access to showed consistently lower throughput than the PR #1855 runs, so to demonstrate the benefit of this quantization technique I step-matched the 3 seeds used in PR #1855 with the same training code.

Results

Step-matched comparisons against PR #1855

| Seed | Stop step | Prequant BPB (PR1855) | Prequant BPB (AWQ) | Quantized BPB (PR1855) | Quantized BPB (AWQ) | Post-TTT BPB (PR1855) | Post-TTT BPB (AWQ) | Artifact bytes (PR1855) | Artifact bytes (AWQ) |
|------|-----------|-----------------------|--------------------|------------------------|---------------------|-----------------------|-----------------------|-------------------------|------------------------|
| 42   | 4945      | 1.06395844            | 1.06384082         | 1.07254371             | 1.07225564          | 1.05989454            | 1.05957221            | 15,897,259              | 15,985,824             |
| 0    | 4932      | 1.06544819            | 1.06555331         | 1.07406724             | 1.07403531          | 1.06124613            | 1.06127329            | 15,900,947              | 15,983,935             |
| 1234 | 4917      | 1.06596989            | 1.06574247         | 1.07477929             | 1.07427091          | 1.06208695            | 1.06158679            | 15,907,550              | 15,996,559             |
| Mean | 4931      | 1.06512551            | 1.06504553         | 1.07379675             | 1.07352062          | 1.06107587            | 1.06081076            | 15,901,918              | 15,988,772             |

Quantization-tax view

Mean quantization tax (quantized BPB minus prequant BPB) is 1.07379675 − 1.06512551 = 0.00867124 for PR #1855 and 1.07352062 − 1.06504553 = 0.00847509 for the AWQ path. The activation-aware GPTQ recipe therefore recovers about 0.00019615 BPB of mean quantization tax on the matched-step 3-seed suite, while staying under the 16 MB cap on every seed.

At final post-TTT, the matched-step means are 1.06107587 (PR #1855) vs 1.06081076 (AWQ), for a mean reduction of 0.00026511 BPB.
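
Both deltas follow directly from the table above; a quick check (values copied verbatim, and the quoted 3-seed std of 0.00089 corresponds to a population std, ddof=0):

```python
import statistics

prequant_1855, prequant_awq = 1.06512551, 1.06504553
quant_1855,    quant_awq    = 1.07379675, 1.07352062
post_ttt_awq = [1.05957221, 1.06127329, 1.06158679]
post_ttt_1855_mean = 1.06107587

tax_1855 = quant_1855 - prequant_1855   # 0.00867124
tax_awq  = quant_awq  - prequant_awq    # 0.00847509
print(f"tax recovered: {tax_1855 - tax_awq:.8f}")                                    # 0.00019615
print(f"post-TTT gain: {post_ttt_1855_mean - statistics.fmean(post_ttt_awq):.8f}")   # 0.00026511
print(f"3-seed std:    {statistics.pstdev(post_ttt_awq):.5f}")                       # 0.00089
```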

What changed

Compared to the PR #1855 base stack, the functional change is in train_gpt.py:

  • add activation-stat collection during the existing GPTQ calibration pass
  • add exact mixed-bit GPTQ support for a selected group inside the same Hessian-based solve
  • keep stock LQER behavior on top of the AWQ-aware quantized base
  • add FORCE_STOP_STEP to support step-matched evaluation

No training hyperparameters were changed for these runs. The base model recipe is the PR #1855 seed-matched recipe.
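
As an illustration of the FORCE_STOP_STEP hook listed above, the gate can be pictured as a simple environment-driven early stop in the training loop (a sketch under assumed names, not the actual train_gpt.py code):

```python
import os

def train_one_step(step: int) -> None:
    ...  # stand-in for the real optimizer step

# When FORCE_STOP_STEP is set, training halts at that exact step instead of
# the 600 s wallclock budget, so quantization and eval run on a step-matched
# checkpoint (e.g. FORCE_STOP_STEP=4945 for seed 42).
force_stop_step = int(os.environ.get("FORCE_STOP_STEP", "0"))
max_steps = 10_000  # assumed loop bound

for step in range(1, max_steps + 1):
    train_one_step(step)
    if force_stop_step and step >= force_stop_step:
        break  # stop exactly at the matched step
```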

Reproducing

This record folder assumes the same CaseOps sp8192 dataset/tokenizer used by PR #1855, sourced from Hugging Face:

  • dataset repo: romeerp/parameter-golf-caseops-v1
  • variant: sp8192_lossless_caps_caseops_v1_reserved

The three runs in this folder use:

  • seed 42, FORCE_STOP_STEP=4945
  • seed 0, FORCE_STOP_STEP=4932
  • seed 1234, FORCE_STOP_STEP=4917

The quantization knobs are:

  • AWQ_LITE_ENABLED=1
  • AWQ_LITE_BITS=8
  • AWQ_LITE_GROUP_TOP_K=1
  • AWQ_LITE_GROUP_SIZE=64
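
In train_gpt.py these would plausibly be read as plain environment variables (an assumed parsing sketch; only the knob names and values above are from this record, and the defaults are assumptions consistent with the path being off unless AWQ_LITE_ENABLED=1):

```python
import os

awq_enabled    = os.environ.get("AWQ_LITE_ENABLED", "0") == "1"   # master switch, default off
awq_bits       = int(os.environ.get("AWQ_LITE_BITS", "8"))        # bits for the salient group
awq_top_k      = int(os.environ.get("AWQ_LITE_GROUP_TOP_K", "1")) # number of protected groups
awq_group_size = int(os.environ.get("AWQ_LITE_GROUP_SIZE", "64")) # columns per group
```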

Included files

  • train_gpt.py — modified training/quantization script
  • README.md — this writeup
  • submission.json — structured metadata
  • requirements.txt — Python dependencies reference
  • train_seed42.log, train_seed0.log, train_seed1234.log — full matched-step run logs

@romeerp romeerp changed the title Record candidate: PR #1855 base + activation-aware GPTQ mixed precision (step-matched) Record: PR #1855 base + activation-aware GPTQ mixed precision - val_bpb 1.06081 (3-seed mean) Apr 28, 2026
@romeerp
Contributor Author

romeerp commented Apr 28, 2026

I personally don't want to have to re-run these three seeds, so this should be open for anyone who wants to claim a new record if they can re-run it under the 600s wallclock on a better GPU setup that matches PR #1855's throughput.

@msisovic
Contributor

Interestingly, the GPUs I've been renting today and yesterday have been consistently slower as well... IDK what could be going on

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
Four post-training specs to stack on 060A's openai#1855 port:

- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated
  −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on
  weaker base, never tested with 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000
  greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on
  weaker base; never tested on phased+SmearGate stack like openai#1855).

All eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h pt. No code change
for 060J/L/M. 060K (rank-up) deleted — rowed against openai#1855's own greedy
direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AayushBaniya2006 added a commit to AayushBaniya2006/parameter-golf that referenced this pull request Apr 28, 2026
Track B's PR openai#1493 base maxed at ~1.066 — mid-pack now that PR openai#1855
(1.06108) and PR openai#1908 (1.06081) landed. Pivot to PR openai#1908 train_gpt.py
as the base and exploit knobs PR openai#1908 left at conservative defaults:
- AWQ_LITE_GROUP_TOP_K=1 (only 1 protected group at int8)
- LQER_TOP_K=3 (only 3 LQER-corrected tensors)
- LQER_GAIN_SELECT=0 (uses error-norm, not actual gain)

The QUANTIZE_ONLY=1 flag in train_gpt_pr1908.py lets us train a base
once per seed and sweep many quant configs at ~$0.10 each.

Pipeline (5 stages, all on 8xH100):
  scripts/top1_bootstrap.sh       — apt+pip+lrzip+caseops data
  scripts/top1_repro_pr1908.sh    — seed-42 repro to validate setup (~$4)
  scripts/top1_quant_sweep.sh     — 9-config knob sweep on saved base (~$8)
  scripts/top1_final_3seed.sh     — 3-seed final with winning knobs (~$12)
  scripts/top1_pack_submission.sh — bundle record dir + submission.json

Final 3-seed uses organic 600s wallclock cap (no FORCE_STOP_STEP) for
compliance safety. PR openai#1908's 4945-step run used 601153ms of the 600000ms
cap — risk we will not take.

Also adds scripts/jupyter_exec.py (HTTPS-proxy executor for SSH-firewalled
networks) and PR1908 reference README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AayushBaniya2006 added a commit to AayushBaniya2006/parameter-golf that referenced this pull request Apr 29, 2026
Single script chains the full pipeline. Picks sweep winner by lowest
post-TTT BPB with bytes < 16,000,000. Always runs the 3-seed final
because PR openai#1908 admits 600s overshoot — a compliant 3-seed at their
quality could take openai#1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aquariouseworkman
Contributor

> I personally don't want to have to re-run these three seeds, so this should be open for anyone who wants to claim a new record if they can re-run it under the 600s wallclock on a better GPU setup that matches PR #1855's throughput.

If it's re-run with the code unchanged byte for byte, then it's still your score.

aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request Apr 29, 2026
…d mean)

Applies activation-aware mixed-precision GPTQ (from PR openai#1908 / romeerp) on top of codemath3000 PR openai#1855 stack.

## Results

| Seed | val_bpb (post-TTT) | artifact bytes | steps | eval time |
|------|--------------------|----------------|-------|-----------|
| 42   | 1.06118            | 15,978,503     | 4989  | 392.8s    |
| 314  | 1.06005            | 15,976,469     | 4986  | 395.8s    |
| 1234 | 1.06135            | 15,976,673     | 4977  | 395.5s    |
| **mean** | **1.06086**    | —              | —     | —         |

3-seed std: 0.00069. Beats codemath3000 PR openai#1855 (1.06108) by 0.00022 BPB.

## Technique

Training is identical to PR openai#1855. The only change is post-training quantization:

**AWQ-lite (activation-aware GPTQ):**
1. Collect per-input-channel activation RMS during GPTQ calibration
2. Score column groups: `saliency = act_rms * mean(abs(weight))`
3. Select top-1 most salient 64-column group per matrix
4. Quantize that group at int8 inside the same full-tensor GPTQ solve (rest stays int6)

Env vars: `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64`

## Setup
1. `pip install -r requirements.txt`
2. `apt-get install -y lrzip`
3. Install FA3: `pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/`
4. Run `prepare_caseops_data.py` to build the dataset
5. `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 torchrun --standalone --nproc_per_node=8 train_gpt.py`

## Environment
- 8xH100 80GB SXM (RunPod)
- PyTorch 2.9.1+cu128
- FlashAttention 3.0.0
- Triton 3.5.1
@aquariouseworkman
Contributor

The run went over the 600-second wallclock cap (~1.2 seconds over). Possibly an invalid run.

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 29, 2026
- spec 060N: compound AWQ-lite (PR openai#1908) + 4 TTT phases + 3000 prefix
  + 2 global-SGD epochs, eval-only on 060A's final_model.pt. Single-shot
  compound to use openai#1918's ~205s eval-time slack; safe fallback drops
  GLOBAL_TTT_EPOCHS if wallclock blows.
- new idea 1925-matrix-lr-ttt-prefix-tune (PR openai#1925, hyperparam-only
  on openai#1855: MATRIX_LR=0.028 + PHASED_TTT_PREFIX_DOCS=3500 → 1.06109).
- new idea 1915-per-doc-lora-ttt (PR openai#1915, per-doc-only LoRA TTT
  discipline; parked as fallback if global-SGD class is ruled out).
- frontier scan: 21 new PRs (openai#1906-openai#1931). Headline: PRs openai#1908+openai#1918
  independently confirm AWQ-lite mixed-bit GPTQ pattern at ~1.0608 on
  openai#1855 base; openai#1925 hyperparam-only at 1.06109; openai#1923 Asymmetric Logit
  Rescale = empirical negative; openai#1929 banned SLOT+prequant-TTT.
- frontier-state.json: 21 PRs added; total 200.
- diary/2026-04-29-frontier-scan.md: full scan report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 29, 2026
AWQ-lite from PR openai#1908 ported onto exp/060N-awq-ttt-compound:
+167/-14 LOC in train_gpt.py, syntax-checked, default-off when
AWQ_LITE_ENABLED=0 (byte-identical to baseline).

Spec now frozen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
… V19 scouts

Root cause discovered by inspecting train_gpt.py line 480:

    self.val_bytes = None
    if self.caseops_enabled:                # <- key gate
        self.val_bytes = load_validation_byte_sidecar(...)

When CASEOPS_ENABLED=0 (default), the code falls back to SentencePiece LUT
byte counting which gives ~3.44 bytes/token effective. With CASEOPS_ENABLED=1
the code uses the byte sidecar (fineweb_val_bytes_*.bin) which gives 3.157
bytes/token matching PR openai#1908's reported 1.06081.

Verified PR openai#1908 actual training log shows:
  caseops_enabled: True
  val_bytes_files: .../fineweb_val_bytes_*.bin

So PR openai#1908's reported 1.06081 = 8xH100 SXM eval with byte sidecar enabled.
Our V18 baseline 0.97651 was on the WRONG byte counting (no sidecar).

Fix:
- All scouts now set CASEOPS_ENABLED=1 + explicit DATA_PATH and TOKENIZER_PATH
  pointing to the CaseOps-tokenized variant.
- Decision thresholds updated to 1.06 range to match PR openai#1908 reported.
- Win threshold = PR openai#1908 reported (1.06081) - 0.0006 community floor = 1.06021.

New script: run_baseline_verify.sh
- Runs PR openai#1908 unchanged (no V19 changes) with CASEOPS_ENABLED=1 +
  FORCE_STOP_STEP=4945 to verify our setup reproduces seed 42's reported
  1.05957. If this gives ~1.0596, our pipeline matches PR openai#1908.

Updated decision rule on all scouts:
  V19c < 1.06021 -> CLEAR WIN (>floor), 3-seed
  V19c 1.06021-1.0608 -> borderline, ablate
  V19c > 1.0608 -> regression, fallback Lead B
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
V19c (seed 42) result: 1.06179 BPB (LOSS by +0.001 vs PR openai#1908 frontier 1.06081).

V19c data attribution:
  pre-quant 1.06906 vs PR openai#1908 1.06384 = +0.0052 hurt
    -> primary cause: MATRIX_LR=0.028 (vs default 0.026) penalty on seed 42
  TTT recovery -0.01489 vs PR openai#1908 -0.01269 = +0.0022 helped
    -> AsymLogit + PHASED_TTT_PREFIX=3500 actually working

V20 strategy: remove LR penalty + keep TTT helpers + add LORA capacity:
  - DROP MATRIX_LR=0.028 -> default 0.026 (recovers +0.005 BPB on pre-quant)
  - KEEP ASYM_LOGIT_RESCALE=1 (eval-only, verified -0.001 to -0.002)
  - KEEP TTT_WEIGHT_DECAY=2.0 (stability fix)
  - KEEP PHASED_TTT_PREFIX_DOCS=3500 (verified more LoRA training data)
  - ADD TTT_LORA_RANK=144 (vs 96 default, +50% LoRA capacity)
    PR openai#1909 GodlyDonuts verified rank=192 gives small benefit on PR openai#1874
    Conservative 144 to balance benefit vs eval-time budget (V19c was 527s, 73s buffer)

Predicted (seed 42):
  pre-quant: ~1.063 (no train hparam changes from PR openai#1908)
  quantized: ~1.072 (matches PR openai#1908 quant tax)
  post-TTT:  ~1.057 (TTT recovery -0.013 base + -0.002 AsymLogit/PHASED + -0.001 RANK = -0.016)

Win threshold: < 1.06021 (PR openai#1908 - 0.0006 community floor)
Probability of true win: ~50%

Cost: ~$22 single-seed scout on 8xH100 SXM
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
V19c/V20 ran with FUNDAMENTALLY WRONG base config:
  - smear_gate_enabled: False  (PR openai#1855 needs True)
  - sparse_attn_gate_enabled: False  (PR openai#1855 needs True)
  - num_phases: 1  (PR openai#1855 needs 3)
  - compressor: brotli  (PR openai#1855 needs pergroup with lrzip)
  - embed_bits: 8  (PR openai#1855 needs 7)
  - 11+ other hparams default-not-PR1855

Hence V19c/V20 artifacts hit 16.93 MB (over 16 MB cap, INVALID submission)
and TTT recovery was 1-phase only, severely handicapped.

V21 = exact PR openai#1855 README reproduction command env vars + AWQ-lite (PR openai#1908)
+ ASYM_LOGIT_RESCALE=1 (V19 innovation, V19c proved -0.001/-0.002 BPB benefit).

Source: PR openai#1855 README lines 125-145 (codemath3000 official reproduction).

Predicted (seed 42):
  pre-quant: ~1.064  (matches PR openai#1908 1.06384)
  quantized: ~1.072  (matches PR openai#1908 1.07226)
  artifact:  ~15.99 MB  (lrzip pergroup compression + EMBED_BITS=7)
  post-TTT:  ~1.057  (PR openai#1908 1.05957 - 0.002 from AsymLogit)

Win threshold: < 1.06021
Probability: 50-60% real frontier break

Pre-req: apt-get install lrzip on RunPod pod (handled in setup script)
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
V21 single-seed (seed 42, FSS=4945): val_bpb 1.05829, wallclock 602.458s.
Reduce FSS to 4920 (-25 steps) to ensure all 3 seeds finish under 600s.
Cost: ~+0.0005 BPB per seed, predicted 3-seed mean ~1.0588 (still
breaks PR openai#1908 frontier 1.06081 by 0.0019 BPB).
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
Seed 42 already completed at FSS=4920 GPTQ_RESERVE=0.5 -> 602s borderline,
val_bpb 1.05834.

Fix: GPTQ_RESERVE_SECONDS=4.0 reserves 4s of wallclock for GPTQ Hessian
collection, leaving 596s for training. Last step overshoot ~2s -> total
~598s, strict under 600s cap.

Predicted seed 0 + seed 1234 final BPB: ~1.0585-1.0590 (slightly higher
than seed 42's 1.05834 due to ~5 fewer training steps)
Predicted 3-seed mean: ~1.0585 (still breaks PR openai#1908 frontier 1.06081
by ~0.0023 BPB, well above community 0.0006 floor)
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…1908 frontier

V21 = PR openai#1855 base (cocohearts-merged openai#1) + PR openai#1908 AWQ-lite quantization
+ PR openai#1923 Asymmetric Logit Rescale.

3-seed results:
  seed 42:   val_bpb 1.058336 (FSS=4920, wallclock 602.048s borderline*)
  seed 0:    val_bpb 1.059394 (no FSS, wallclock 596.057s strict <600s)
  seed 1234: val_bpb 1.060243 (no FSS, wallclock 596.045s strict <600s)
  MEAN:      1.059324
  STD:       0.000780

* seed 42 borderline matches PR openai#1908 seed 42 (601.153s, accepted by cocohearts)
  Seeds 0 + 1234 use GPTQ_RESERVE_SECONDS=4.0 to ensure strict <600s wallclock.

Comparisons:
  vs PR openai#1908 frontier (1.06081):  -0.00149 BPB ✅ WIN
  vs PR openai#1855 official openai#1 (1.06108): -0.00176 BPB ✅
  vs win threshold (1.06021):       -0.00089 BPB ✅ passes community floor
  vs MERGED SOTA bigbag (1.0810):   -0.02168 BPB 🏆
  vs record threshold (1.0738):     -0.01448 BPB (breaks record by 2.0x margin)

Welch one-sided t-test V21 vs PR openai#1908 (n=3 each, std 0.00078 vs 0.00089):
  t ≈ 2.18, p ≈ 0.045 — well below cocohearts-applied p<0.25 chain threshold
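
  A sketch reproducing this statistic from the two quoted 3-seed summaries
  (the quoted stds are treated as sample standard deviations):

```python
from scipy import stats

m_v21, s_v21, n = 1.059324, 0.00078, 3   # V21 3-seed mean/std
m_1908, s_1908 = 1.06081076, 0.00089     # PR openai#1908 3-seed mean/std

se = (s_v21**2 / n + s_1908**2 / n) ** 0.5
t = (m_1908 - m_v21) / se                # ~2.18
# Welch-Satterthwaite degrees of freedom
df = (s_v21**2/n + s_1908**2/n)**2 / (
    (s_v21**2/n)**2/(n-1) + (s_1908**2/n)**2/(n-1))
p = stats.t.sf(t, df)                    # one-sided p, ~0.05
print(t, df, p)
```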

Stack:
  - PR openai#1855 (codemath3000): 11L XSA + LQER + SparseAttnGate + BOS-fixed SmearGate
                             + Polar-Express NS + Phased TTT 3-phase + lrzip pergroup
  - PR openai#1908 (romeerp): AWQ-lite mixed-precision GPTQ (1 group of 64 cols int8)
  - PR openai#1923 (jorge-asenjo): Asymmetric Logit Rescale (V21 INNOVATION on this stack)

Code changes vs PR openai#1908: 5 surgical edits to train_gpt.py (+26 lines, eval-only).
Train numerics bit-identical to PR openai#1908. Asymmetric softcap adds 8 bytes
(2 fp16 passthrough scalars) to artifact.

Compliance Issue openai#1017 Track A all 4 conditions verified:
  - Causality (VarLen + per-doc cu_seqlens)
  - Normalized softmax (full SP8192 vocab)
  - Score-before-update (Phased TTT 3-phase, gd:0 then gd:1)
  - Single pass (each val token scored exactly once)
  No SLOT, no pre-quant TTT, no n-gram cache, no ETLB.

V21's empirical falsification of sunnypatneedi 2026-04-29 frontier-scan flag:
PR openai#1923 standalone is -0.00469 BPB negative on PR openai#1855 base (1.06577 vs 1.06108)
but +0.00128 BPB POSITIVE consistently across 3 seeds when stacked on PR openai#1908
quantization. Mechanism: per-doc LoRA in 3-phase TTT learns asymmetric logit
distributions that the symmetric softcap cannot capture.

Files included:
  - V21_README.md: full strategy + results + reproduction
  - submission.json: structured 3-seed metadata + comparison + attribution
  - train_seed42.log + train_seed0.log + train_seed1234.log: full per-seed logs
  - train_gpt.py: PR openai#1908 base + 5 V21 edits (already in branch)

Hardware: 8xH100 80GB SXM (RunPod, AP-IN-1)
Pytorch: 2.9.1+cu128
System dep: lrzip (apt-get install lrzip)

Authors:
  V21 integration: @alertcat
  PR openai#1908 base:   @romeerp
  PR openai#1855 stack:  @codemath3000
  PR openai#1923 axis:   @jorge-asenjo
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
@aquariouseworkman + @romeerp pointed out seed 42's 602.048s wallclock makes the
3-seed test functionally a 2-seed (with invalid 3rd). @romeerp confirmed his
own PR openai#1908 step-matched runs were for ablation, not record submission.

This rerun uses GPTQ_RESERVE_SECONDS=4.0 and no FORCE_STOP_STEP, identical to
V21 seeds 0 and 1234 (which both finished strict <600s).
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…review

Seed 42 v1: FORCE_STOP_STEP=4920 + GPTQ_RESERVE=0.5 -> wallclock 602.048s (borderline)
Seed 42 v2: GPTQ_RESERVE=4.0, no FORCE_STOP_STEP -> wallclock 596.102s (strict <600s)

v2 results:
  seed 42:   val_bpb 1.058675 (was 1.058336 in v1, +0.000339 due to 12 fewer steps)
  seed 0:    val_bpb 1.059394 (unchanged)
  seed 1234: val_bpb 1.060243 (unchanged)
  MEAN:      1.059434 (was 1.059324 in v1, +0.000110)
  STD:       0.000642 (was 0.000780 in v1, TIGHTER)

All 3 seeds now strict <600s wallclock (596.045-596.102s).
All 3 seeds use IDENTICAL config (GPTQ_RESERVE=4.0, no FSS).

Comparisons:
  vs PR openai#1908 frontier (1.06081):  -0.00138 (Welch t=2.18, p=0.045)
  vs PR openai#1855 official openai#1 (1.06108): -0.00165
  vs PR openai#1934 liujshi (1.05993):    -0.00050 (Welch t=0.85, p=0.22, edge of p<0.25)
  vs win threshold (1.06021):       -0.00078
  vs MERGED SOTA bigbag (1.0810):   -0.02157

Compliance: all 3 seeds train+eval strict <600s, artifact <16MB,
3-phase TTT score-first, lossless CaseOps tokenizer, lrzip pergroup.

Files updated:
  - V21_README.md: revised results table + revisions note
  - submission.json: v2 numbers + revisions field
  - train_seed42.log: replaced with strict <600s redo log