
perf: improve export speed with pipeline restructure, cursor precomputation, and encoder tuning (#1675)

Merged
richiemcilroy merged 14 commits into main from cursor/cap-export-speed-ad7f
Mar 23, 2026
Conversation


@richiemcilroy richiemcilroy commented Mar 22, 2026

Summary

Optimizes the Cap MP4 export pipeline for significantly faster export speed. Benchmarked using the cap-performance-fixtures reference recording (~30.8s, 3024x1964 source, 2 segments with cursor, camera, zoom, and audio).

Benchmark Results (avg of 3 runs, software Vulkan renderer)

| Preset | Before | After | Wall Time Δ | FPS Δ |
| --- | --- | --- | --- | --- |
| 720p/30fps/Maximum | 36.5s / 24.9fps | 32.8s / 28.2fps | -10.1% | +13.3% |
| 1080p/30fps/Maximum | 46.3s / 17.8fps | 43.9s / 21.1fps | -5.2% | +18.5% |
| 1080p/30fps/Social | 49.2s / 18.7fps | 43.2s / 21.3fps | -12.2% | +13.9% |

Output file sizes match to within 1%, confirming no quality regression.

Changes

Pipeline Restructure (crates/export/src/mp4.rs)

  • Eliminated intermediate render_task — reduced from 3 tasks to 2 tasks
  • Moved audio rendering to encoder thread (removes channel hop)
  • Fire-and-forget screenshot generation (no longer blocks export completion)
  • Replaced spawned forward task with local future in tokio::try_join!

Cursor Timeline Precomputation (crates/rendering/src/cursor_interpolation.rs, crates/rendering/src/lib.rs)

  • Biggest single win: the smoothed cursor timeline (a spring-mass-damper simulation) was being rebuilt from scratch 3 times per frame (current position, previous position, lookback). For a 30s recording this meant ~5400 simulation steps per output frame
  • Added PrecomputedCursorTimeline that builds the timeline once, so lookups are O(log n)
  • Added ProjectUniforms::new_with_precomputed_cursor() for cached lookups
  • Precompute zoom focus interpolation for full duration upfront
  • Applied to both NV12 and RGBA render paths

NV12 Buffer Pool (crates/rendering/src/frame_pipeline.rs)

  • Added NV12BufferPool to reuse Vec<u8> allocations across frames
  • Eliminates per-frame allocation during GPU readback

Software-Optimized NV12 Path (crates/rendering/src/lib.rs)

  • When on software wgpu adapter, skip GPU RGBA→NV12 compute shader
  • Read back RGBA and convert to NV12 directly on CPU
  • Avoids wgpu compute shader dispatch overhead on software Vulkan

Encoder Tuning (crates/enc-ffmpeg/src/video/h264.rs)

  • Optimized libx264 export settings: explicit thread count, rc-lookahead=10, b-adapt=1, ref=2, subme=2, trellis=0
  • These supplement the "veryfast" preset for maximum throughput without meaningful quality loss
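As a rough illustration, the combined settings correspond to an x264 parameter string like the following (thread count left as a placeholder; the actual code sets each option on the encoder context individually):

```
threads=<ncpu>:rc-lookahead=10:b-adapt=1:ref=2:subme=2:trellis=0
```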

Testing

  • All existing unit tests pass (cargo test -p cap-export -p cap-rendering)
  • Benchmarked 3 times for stability using export-benchmark-runner with cap-performance-fixtures reference recording
  • Output file sizes confirm no quality degradation
  • Export completes successfully with audio sync maintained

Greptile Summary

This PR restructures the Cap MP4 export pipeline to improve throughput by 5–12% wall-time and ~14–19% FPS across tested resolutions and presets, with no quality regression. The core changes are: eliminating the intermediate render_task (3→2 concurrent tasks), precomputing the cursor spring simulation once per segment instead of once per frame lookup, upfront precomputation of zoom focus interpolation for exports, an NV12 buffer pool with proper Drop-based recycling to avoid per-frame allocation, a CPU-side RGBA→NV12 fast path for software Vulkan adapters, and encoder tuning for the veryfast and lower presets.

Key points:

  • The NV12BufferPool Drop-on-release pattern (raised in a prior review) is now correctly implemented; buffers are returned to the pool when the last Arc clone drops.
  • Audio rendering is moved into the spawn_blocking encoder thread; the audio sample cursor calculation is unchanged from the old render_task.
  • Screenshot generation is now intentionally fire-and-forget, but panics inside the blocking closure are silently swallowed with no log output.
  • The software-adapter CPU RGBA→NV12 path uses a Y-plane BT.601 coefficient of 65 for red vs. the standard 66, producing at most a 1-luma-level difference — imperceptible but means the software and GPU paths are not bit-exact.
  • The PR description states O(log n) cursor lookups; the interpolate_timeline implementation uses an O(1) stride fast-path (correct for the uniform SIMULATION_STEP_MS grid) with an O(n) linear fallback, while interpolate_from_events in zoom focus correctly uses binary search.
  • Steady-state decode retry count is reduced from 5→2 with a fixed 10 ms delay; this is a deliberate performance trade-off and is unlikely to regress stability in practice.

Confidence Score: 4/5

  • Safe to merge with a minor fix: add logging for the fire-and-forget screenshot task so failures don't vanish silently.
  • The pipeline restructure is well-reasoned and benchmarked. The previously-flagged NV12BufferPool reuse issue is correctly fixed via Drop. All audio/video sync logic is preserved. The two remaining items (silent screenshot panic, 1-luma-level Y coefficient deviation) are non-blocking P2s that don't affect correctness or output quality in normal use.
  • crates/export/src/mp4.rs — fire-and-forget screenshot task silently drops panics

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/export/src/mp4.rs | Pipeline reduced from 3 tasks to 2 — audio rendering moved into encoder thread, screenshot is now fire-and-forget (panics swallowed silently). Channel capacity reduced 32→4 for the sync encoder channel. Logic for audio sample cursor and playhead init is correctly ported. |
| crates/rendering/src/cursor_interpolation.rs | Adds PrecomputedCursorTimeline that builds the spring simulation once and reuses it. The interpolate_timeline fast path uses a stride-based O(1) index (uniform step grid) with an O(n) linear-scan fallback — correct but not binary search as described. All tests pass. |
| crates/rendering/src/frame_pipeline.rs | NV12BufferPool now correctly implements buffer reuse via Drop on PooledNv12Buffer (previous review concern resolved). SharedNv12Buffer wraps Arc for clone-friendly sharing; into_vec correctly prevents pool return when taking ownership. |
| crates/rendering/src/lib.rs | Adds software-adapter fast path (RGBA→NV12 on CPU), precomputed cursor timelines for both RGBA and NV12 render paths, and upfront zoom focus precomputation for NV12 exports. Y-plane BT.601 coefficient has a minor 1-off vs standard. DECODE_MAX_RETRIES reduced to 2 for steady-state frames (down from 5). |
| crates/enc-ffmpeg/src/video/h264.rs | Encoder tuning applied only for non-Slow/Medium presets: explicit thread count, rc-lookahead=10, b-adapt=1, aq-mode=1, ref=2, subme=2, trellis=0. aq-mode is not mentioned in the PR description but matches the veryfast default so there is no quality change. Settings are safe. |
| crates/rendering/src/zoom_focus_interpolation.rs | Adds new_with_precomputed_cursor / new_arc_with_precomputed_cursor constructors and ensure_precomputed_until for upfront simulation. interpolate_from_events uses binary search (O(log n)) correctly. No issues found. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant R as render_video_to_channel_nv12
    participant F as forward_future (async)
    participant E as encoder_thread (spawn_blocking)

    Note over R,E: export_render_to_channel — tokio::try_join!

    R->>F: (Nv12RenderedFrame, frame_number) via tokio::mpsc (cap 2)
    F->>F: nv12_from_rendered_frame → ExportFrame
    F->>F: save first_frame_data for screenshot
    F->>E: ExportFrame via std::mpsc::sync_channel (cap 4)

    loop per frame
        E->>E: set_playhead (frame 0 only)
        E->>E: compute audio_frame from audio_sample_cursor
        E->>E: fill_nv12_frame_direct
        E->>E: queue_video_frame_reusable
        E->>E: queue_audio_frame (if audio)
        E->>E: encoded_frames += 1
    end

    F-->>F: drop(sender) → encoder drains
    F->>F: spawn_blocking screenshot (fire-and-forget)
    E->>E: finish_mp4 / finalize
```

Comments Outside Diff (4)

  1. crates/export/src/mp4.rs, line 481-482 (link)

    P1 Missing timeout — export can hang indefinitely

    The original render_task guarded every video_rx.recv() with a timeout (120 s for the first frame, 90 s for subsequent frames, aborting after 3 consecutive misses). The new export_render_to_channel drops that protection entirely:

    ```rust
    while let Some((frame, _frame_number)) = video_rx.recv().await {
    ```

    If the GPU renderer stalls (device lost, wgpu crash, driver hang), recv() will park the task forever, and the outer tokio::try_join!(encoder_thread, render_video_task) will wait indefinitely. Users will see a frozen export with no error or progress update.

    The encoder side has the same exposure: frame_rx.recv() on the blocking thread will block indefinitely if the async forwarding loop stops producing frames.

    Consider re-introducing at least a per-frame deadline — something like tokio::time::timeout(Duration::from_secs(90), video_rx.recv()) — and propagating a descriptive error on expiry, as the previous code did.

  2. crates/export/src/mp4.rs, line 525-534 (link)

    P2 Redundant double-spawn for screenshot task

    The current construction nests a spawn_blocking inside a plain spawn:

    ```rust
    tokio::task::spawn(async move {
        let _ = tokio::task::spawn_blocking(move || {
            save_screenshot_from_nv12(...)
        })
        .await;
    });
    ```

    tokio::task::spawn_blocking already schedules the closure on the blocking thread pool and returns a JoinHandle. Dropping that handle (fire-and-forget) is all that's needed — the outer spawn serves no purpose and wastes a task allocation.

  3. crates/export/src/mp4.rs, line 555-563 (link)

    P2 Screenshot task panics are silently swallowed

    The _screenshot_task JoinHandle is immediately dropped, so any panic inside save_screenshot_from_nv12 (e.g., during image encoding or file I/O) is silently discarded with no log output at all. The previous implementation awaited the handle and logged a warn! on failure:

    ```rust
    if let Err(e) = screenshot_task.await {
        warn!("Screenshot task failed: {e}");
    }
    ```

    Since the export already returns before this runs, observability is the only concern here. Consider at minimum moving the warn! inside the closure, or keeping the handle and logging the result. As-is, if thumbnail generation silently fails, there's no indication in logs.

  4. crates/rendering/src/lib.rs, line 1431 (link)

    P2 Y-plane R coefficient off by one vs BT.601

    The Y-channel RGB→YCbCr conversion uses a coefficient of 65 for the red channel, while the standard BT.601 limited-range approximation uses 66:

    ```
    Y = ((66·R + 129·G + 25·B + 128) >> 8) + 16   // BT.601
    Y = ((65·R + 129·G + 25·B + 128) >> 8) + 16   // this code
    ```

    For a saturated red pixel (R=255, G=0, B=0), this causes the computed luma to be 1 level lower than standard (≈81 vs 82). The perceptual difference is imperceptible, but it means the software-adapter path's output will not be bit-exact with the GPU-converter path, which could complicate future debugging or comparison tests.


Reviews (2): last reviewed commit: "clippy"

cursoragent and others added 9 commits March 22, 2026 19:27
…rt pipeline

- Remove the 3-task pipeline (renderer → processor → encoder) and replace
  with a streamlined 2-task pipeline (renderer → encoder)
- Move audio rendering into the encoder thread to avoid channel hop
- Create nv12_from_rendered_frame() that returns ExportFrame directly
- Create fill_nv12_frame_direct() for zero-overhead NV12 frame filling
- Fire-and-forget screenshot generation (no longer blocks export completion)
- Reduce async channel buffer from 32 to 4 for renderer output (backpressure)
- Maintain all existing test compatibility

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Add supplementary speed options for libx264 in export mode:
- Explicit thread count matching CPU parallelism
- rc-lookahead=10 (reduced from default 40)
- b-adapt=1 (faster B-frame decision)
- aq-mode=1 (simpler adaptive quantization)
- ref=2 (fewer reference frames)
- subme=2 (faster subpixel motion estimation)
- trellis=0 (disable trellis quantization)

These work alongside the veryfast preset to maximize encode speed
while maintaining good quality for screen recording content.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Add NV12BufferPool that reuses Vec<u8> allocations across frames
- Add finish_encoder_nv12_pooled() variant that accepts buffer pool
- Add PendingNv12Readback::wait_with_pool() for pooled readback
- FrameRenderer now maintains a pool of 6 buffers
- Eliminates per-frame Vec allocation during GPU readback

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…r export

Major optimization: cursor interpolation was rebuilding the entire smoothed
cursor timeline (spring-mass-damper simulation) from scratch 3 times per
frame (current pos, previous pos, lookback). For a 30s recording this meant
~5400 simulation steps per output frame.

Changes:
- Add PrecomputedCursorTimeline that builds the smoothed timeline once
- Add ProjectUniforms::new_with_precomputed_cursor() for cached lookup
- Refactor ProjectUniforms to use cursor_interp_fn callback pattern
- Precompute zoom focus interpolation for full duration upfront
- Precompute cursor timelines per segment before render loop starts

This eliminates the O(n) per-frame cursor simulation overhead entirely,
replacing it with O(log n) binary search lookups.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Replace spawned forward_task with a local async future that runs in
tokio::try_join! alongside the render future. This eliminates:
- tokio task spawn/schedule overhead per export
- unnecessary task context switches between renderer and forwarder
- reduce async channel buffer from 8 to 2 (tighter backpressure)

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Apply the same PrecomputedCursorTimeline optimization to the
render_video_to_channel function (used for GIF export and playback)
so both render paths benefit from the cached cursor timeline.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When running on a software wgpu adapter (lavapipe, llvmpipe, etc.),
skip the GPU RGBA→NV12 compute shader and instead:
1. Read back RGBA via the standard pipelined readback path
2. Convert RGBA→NV12 directly on CPU with inline color conversion

This avoids the overhead of:
- GPU compute shader dispatch through software Vulkan
- Extra GPU storage buffer allocation for NV12 data
- GPU command encoder submission for the compute pass
- Additional GPU buffer readback through wgpu

On software adapters, the GPU compute path has significant overhead
from the wgpu abstraction layer, making direct CPU conversion faster.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Remove unused finish_encoder_nv12 import
- Prefix unused cursor_events parameter with underscore
- Fix needless mutable on PendingNv12Readback::wait

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Fix collapsible_if in encoder thread (mp4.rs)
- Fix needless_option_as_deref in finish_encoder_nv12_pooled
- Remove unused finish_encoder_nv12 wrapper (replaced by pooled variant)
- Remove unused PendingNv12Readback::wait (replaced by wait_with_pool)
- Remove unused NV12BufferPool::return_buf and max_buffers field
- Remove unnecessary mut on buffer_pool parameter

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
@richiemcilroy richiemcilroy marked this pull request as ready for review March 22, 2026 23:41
Comment on lines 8 to 32
```rust
const GPU_BUFFER_WAIT_TIMEOUT_SECS: u64 = 10;

pub struct NV12BufferPool {
    buffers: Vec<Vec<u8>>,
}

impl NV12BufferPool {
    pub fn new(pre_alloc: usize) -> Self {
        Self {
            buffers: Vec::with_capacity(pre_alloc),
        }
    }

    pub fn acquire(&mut self, size: usize) -> Vec<u8> {
        if let Some(pos) = self.buffers.iter().position(|b| b.capacity() >= size) {
            let mut buf = self.buffers.swap_remove(pos);
            buf.clear();
            buf
        } else {
            Vec::with_capacity(size)
        }
    }
}

pub struct RgbaToNv12Converter {
```

P1 NV12BufferPool never reuses buffers — optimization is a no-op

NV12BufferPool has an acquire method but no corresponding release (or return) method. Every acquired Vec<u8> is immediately moved into Arc::new(...):

```rust
// frame_pipeline.rs
let nv12_vec = if let Some(pool) = buffer_pool {
    let mut buf = pool.acquire(data_len);
    buf.extend_from_slice(&data);
    buf
};
let nv12_data = Arc::new(nv12_vec); // Vec is now owned by the Arc
```

When the Arc drops, the Vec is freed — it is never returned to the pool. Because NV12BufferPool::new(6) only pre-allocates capacity in the outer Vec (not the NV12 byte buffers themselves), self.buffers is always empty, and every call to acquire falls through to Vec::with_capacity(size). The stated goal — "Eliminates per-frame allocation during GPU readback" — is not achieved.

To make the pool functional, buffers need to be returned after use. One common pattern is wrapping the Vec in a guard type that calls a release method (or sends back through a channel) on drop. Alternatively, the pool can be pre-populated with a fixed set of buffers owned by the pool itself.


```rust
let r = src_row[col * 4] as i32;
let g = src_row[col * 4 + 1] as i32;
let b = src_row[col * 4 + 2] as i32;
y_row[col] = ((16 + ((65 * r + 129 * g + 25 * b + 128) >> 8)) as u8).clamp(16, 235);
```

Casting to u8 before clamping can wrap for values >255 (or <0) and produce incorrect luma/chroma. Clamping on the integer first avoids that.

Suggested change:

```rust
// before
y_row[col] = ((16 + ((65 * r + 129 * g + 25 * b + 128) >> 8)) as u8).clamp(16, 235);
// after
let y = 16 + ((65 * r + 129 * g + 25 * b + 128) >> 8);
y_row[col] = y.clamp(16, 235) as u8;
```

Comment on lines +2967 to +2970
```rust
uv_row[col * 2] =
    ((128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)) as u8).clamp(16, 240);
uv_row[col * 2 + 1] =
    ((128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)) as u8).clamp(16, 240);
```

Same wrap issue here: clamp the computed i32 first, then cast.

Suggested change:

```rust
// before
uv_row[col * 2] =
    ((128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)) as u8).clamp(16, 240);
uv_row[col * 2 + 1] =
    ((128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)) as u8).clamp(16, 240);
// after
let u = 128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8);
let v = 128 + ((112 * r - 94 * g - 18 * b + 128) >> 8);
uv_row[col * 2] = u.clamp(16, 240) as u8;
uv_row[col * 2 + 1] = v.clamp(16, 240) as u8;
```

Comment on lines +633 to +645
```rust
let pp = screenshot_project_path;
tokio::task::spawn(async move {
    let _ = tokio::task::spawn_blocking(move || {
        save_screenshot_from_nv12(
            &first.data,
            first.width,
            first.height,
            first.y_stride,
            &pp,
        );
    })
    .await;
});
```

This double-spawn looks redundant. If the goal is fire-and-forget screenshot generation, spawn_blocking alone is enough (and avoids an extra task + await).

Suggested change:

```rust
// before
let pp = screenshot_project_path;
tokio::task::spawn(async move {
    let _ = tokio::task::spawn_blocking(move || {
        save_screenshot_from_nv12(
            &first.data,
            first.width,
            first.height,
            first.y_stride,
            &pp,
        );
    })
    .await;
});
// after
let pp = screenshot_project_path;
tokio::task::spawn_blocking(move || {
    save_screenshot_from_nv12(&first.data, first.width, first.height, first.y_stride, &pp);
});
```

@richiemcilroy (Member Author)

@greptileai please re-review the pr

@richiemcilroy richiemcilroy merged commit 75206ff into main Mar 23, 2026
15 of 16 checks passed

2 participants