perf: improve export speed with pipeline restructure, cursor precomputation, and encoder tuning #1675
Conversation
…rt pipeline

- Remove the 3-task pipeline (renderer → processor → encoder) and replace with a streamlined 2-task pipeline (renderer → encoder)
- Move audio rendering into the encoder thread to avoid channel hop
- Create nv12_from_rendered_frame() that returns ExportFrame directly
- Create fill_nv12_frame_direct() for zero-overhead NV12 frame filling
- Fire-and-forget screenshot generation (no longer blocks export completion)
- Reduce async channel buffer from 32 to 4 for renderer output (backpressure)
- Maintain all existing test compatibility

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Add supplementary speed options for libx264 in export mode:

- Explicit thread count matching CPU parallelism
- rc-lookahead=10 (reduced from default 40)
- b-adapt=1 (faster B-frame decision)
- aq-mode=1 (simpler adaptive quantization)
- ref=2 (fewer reference frames)
- subme=2 (faster subpixel motion estimation)
- trellis=0 (disable trellis quantization)

These work alongside the veryfast preset to maximize encode speed while maintaining good quality for screen recording content.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Add NV12BufferPool that reuses Vec<u8> allocations across frames
- Add finish_encoder_nv12_pooled() variant that accepts buffer pool
- Add PendingNv12Readback::wait_with_pool() for pooled readback
- FrameRenderer now maintains a pool of 6 buffers
- Eliminates per-frame Vec allocation during GPU readback

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…r export

Major optimization: cursor interpolation was rebuilding the entire smoothed cursor timeline (spring mass damper simulation) from scratch 3 times per frame (current pos, previous pos, lookback). For a 30s recording this meant ~5400 simulation steps per output frame.

Changes:

- Add PrecomputedCursorTimeline that builds the smoothed timeline once
- Add ProjectUniforms::new_with_precomputed_cursor() for cached lookup
- Refactor ProjectUniforms to use cursor_interp_fn callback pattern
- Precompute zoom focus interpolation for full duration upfront
- Precompute cursor timelines per segment before render loop starts

This eliminates the O(n) per-frame cursor simulation overhead entirely, replacing it with O(log n) binary search lookups.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
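The precompute-once / look-up-many pattern described above can be sketched as follows. This is an illustrative stand-in, not Cap's actual `PrecomputedCursorTimeline`: the expensive simulation output is taken as given, and per-frame queries become an O(log n) `partition_point` binary search over a time-sorted sample list.

```rust
// Illustrative sketch only — names and types are not Cap's real ones.
#[derive(Clone, Copy, Debug)]
struct TimelineSample {
    time_ms: f64,
    x: f64,
    y: f64,
}

struct PrecomputedTimeline {
    samples: Vec<TimelineSample>, // sorted by time_ms
}

impl PrecomputedTimeline {
    /// Run the (expensive) simulation once for the whole duration;
    /// here the precomputed samples are simply passed in.
    fn build(raw: &[(f64, f64, f64)]) -> Self {
        let samples = raw
            .iter()
            .map(|&(t, x, y)| TimelineSample { time_ms: t, x, y })
            .collect();
        Self { samples }
    }

    /// O(log n) lookup: last sample at or before `time_ms`.
    fn sample_at(&self, time_ms: f64) -> Option<TimelineSample> {
        let idx = self.samples.partition_point(|s| s.time_ms <= time_ms);
        idx.checked_sub(1).map(|i| self.samples[i])
    }
}

fn main() {
    let tl = PrecomputedTimeline::build(&[(0.0, 0.0, 0.0), (10.0, 1.0, 1.0), (20.0, 2.0, 2.0)]);
    // Query between samples resolves to the most recent prior sample.
    assert_eq!(tl.sample_at(15.0).unwrap().x, 1.0);
    // Queries before the first sample have no answer.
    assert!(tl.sample_at(-1.0).is_none());
    println!("ok");
}
```

Building once and sharing the immutable timeline across frame lookups is what removes the per-frame O(n) simulation cost.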
Replace the spawned forward_task with a local async future that runs in tokio::try_join! alongside the render future. This eliminates:

- tokio task spawn/schedule overhead per export
- unnecessary task context switches between renderer and forwarder

Also reduce the async channel buffer from 8 to 2 (tighter backpressure).

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Apply the same PrecomputedCursorTimeline optimization to the render_video_to_channel function (used for GIF export and playback) so both render paths benefit from the cached cursor timeline. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When running on a software wgpu adapter (lavapipe, llvmpipe, etc.), skip the GPU RGBA→NV12 compute shader and instead:

1. Read back RGBA via the standard pipelined readback path
2. Convert RGBA→NV12 directly on CPU with inline color conversion

This avoids the overhead of:

- GPU compute shader dispatch through software Vulkan
- Extra GPU storage buffer allocation for NV12 data
- GPU command encoder submission for the compute pass
- Additional GPU buffer readback through wgpu

On software adapters, the GPU compute path has significant overhead from the wgpu abstraction layer, making direct CPU conversion faster.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Remove unused finish_encoder_nv12 import
- Prefix unused cursor_events parameter with underscore
- Fix needless mutable on PendingNv12Readback::wait

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Fix collapsible_if in encoder thread (mp4.rs)
- Fix needless_option_as_deref in finish_encoder_nv12_pooled
- Remove unused finish_encoder_nv12 wrapper (replaced by pooled variant)
- Remove unused PendingNv12Readback::wait (replaced by wait_with_pool)
- Remove unused NV12BufferPool::return_buf and max_buffers field
- Remove unnecessary mut on buffer_pool parameter

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
```rust
const GPU_BUFFER_WAIT_TIMEOUT_SECS: u64 = 10;

pub struct NV12BufferPool {
    buffers: Vec<Vec<u8>>,
}

impl NV12BufferPool {
    pub fn new(pre_alloc: usize) -> Self {
        Self {
            buffers: Vec::with_capacity(pre_alloc),
        }
    }

    pub fn acquire(&mut self, size: usize) -> Vec<u8> {
        if let Some(pos) = self.buffers.iter().position(|b| b.capacity() >= size) {
            let mut buf = self.buffers.swap_remove(pos);
            buf.clear();
            buf
        } else {
            Vec::with_capacity(size)
        }
    }
}

pub struct RgbaToNv12Converter {
```
NV12BufferPool never reuses buffers — optimization is a no-op
`NV12BufferPool` has an `acquire` method but no corresponding `release` (or `return`) method. Every acquired `Vec<u8>` is immediately moved into `Arc::new(...)`:

```rust
// frame_pipeline.rs
let nv12_vec = if let Some(pool) = buffer_pool {
    let mut buf = pool.acquire(data_len);
    buf.extend_from_slice(&data);
    buf
};
let nv12_data = Arc::new(nv12_vec); // Vec is now owned by the Arc
```

When the `Arc` drops, the `Vec` is freed — it is never returned to the pool. Because `NV12BufferPool::new(6)` only pre-allocates capacity in the *outer* `Vec` (not the NV12 byte buffers themselves), `self.buffers` is always empty, and every call to `acquire` falls through to `Vec::with_capacity(size)`. The stated goal — "Eliminates per-frame allocation during GPU readback" — is not achieved.
To make the pool functional, buffers need to be returned after use. One common pattern is wrapping the Vec in a guard type that calls a release method (or sends back through a channel) on drop. Alternatively, the pool can be pre-populated with a fixed set of buffers owned by the pool itself.
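A minimal sketch of the guard-on-drop pattern suggested above (all names here — `BufferPool`, `PooledBuf` — are illustrative, not the PR's code): the guard derefs to the underlying `Vec<u8>`, and its `Drop` impl pushes the allocation back into the shared free list instead of freeing it.

```rust
use std::sync::{Arc, Mutex};

// Illustrative drop-guard buffer pool; not Cap's actual NV12BufferPool.
#[derive(Clone, Default)]
struct BufferPool {
    free: Arc<Mutex<Vec<Vec<u8>>>>,
}

struct PooledBuf {
    buf: Vec<u8>,
    pool: Arc<Mutex<Vec<Vec<u8>>>>,
}

impl BufferPool {
    fn acquire(&self, size: usize) -> PooledBuf {
        let mut free = self.free.lock().unwrap();
        let buf = match free.iter().position(|b| b.capacity() >= size) {
            Some(pos) => {
                let mut b = free.swap_remove(pos);
                b.clear();
                b
            }
            None => Vec::with_capacity(size),
        };
        PooledBuf { buf, pool: Arc::clone(&self.free) }
    }
}

impl std::ops::Deref for PooledBuf {
    type Target = Vec<u8>;
    fn deref(&self) -> &Vec<u8> { &self.buf }
}

impl std::ops::DerefMut for PooledBuf {
    fn deref_mut(&mut self) -> &mut Vec<u8> { &mut self.buf }
}

impl Drop for PooledBuf {
    // Return the allocation to the pool instead of freeing it.
    fn drop(&mut self) {
        let buf = std::mem::take(&mut self.buf);
        self.pool.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = BufferPool::default();
    {
        let mut b = pool.acquire(1024);
        b.extend_from_slice(&[1, 2, 3]);
    } // guard drops here → allocation goes back to the pool
    let b2 = pool.acquire(512); // reuses the 1024-byte allocation
    assert!(b2.capacity() >= 1024);
    println!("reused capacity: {}", b2.capacity());
}
```

In the actual pipeline the guard would need to live inside (or alongside) the `Arc`'d frame so the buffer returns only after the last consumer drops it; the alternative channel-based release mentioned above avoids the mutex on the hot path.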
```rust
let r = src_row[col * 4] as i32;
let g = src_row[col * 4 + 1] as i32;
let b = src_row[col * 4 + 2] as i32;
y_row[col] = ((16 + ((65 * r + 129 * g + 25 * b + 128) >> 8)) as u8).clamp(16, 235);
```
Casting to u8 before clamping can wrap for values >255 (or <0) and produce incorrect luma/chroma. Clamping on the integer first avoids that.
```diff
- y_row[col] = ((16 + ((65 * r + 129 * g + 25 * b + 128) >> 8)) as u8).clamp(16, 235);
+ let y = 16 + ((65 * r + 129 * g + 25 * b + 128) >> 8);
+ y_row[col] = y.clamp(16, 235) as u8;
```
```rust
uv_row[col * 2] =
    ((128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)) as u8).clamp(16, 240);
uv_row[col * 2 + 1] =
    ((128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)) as u8).clamp(16, 240);
```
Same wrap issue here: clamp the computed i32 first, then cast.
```diff
- uv_row[col * 2] =
-     ((128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)) as u8).clamp(16, 240);
- uv_row[col * 2 + 1] =
-     ((128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)) as u8).clamp(16, 240);
+ let u = 128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8);
+ let v = 128 + ((112 * r - 94 * g - 18 * b + 128) >> 8);
+ uv_row[col * 2] = u.clamp(16, 240) as u8;
+ uv_row[col * 2 + 1] = v.clamp(16, 240) as u8;
```
```rust
let pp = screenshot_project_path;
tokio::task::spawn(async move {
    let _ = tokio::task::spawn_blocking(move || {
        save_screenshot_from_nv12(
            &first.data,
            first.width,
            first.height,
            first.y_stride,
            &pp,
        );
    })
    .await;
});
```
This double-spawn looks redundant. If the goal is fire-and-forget screenshot generation, spawn_blocking alone is enough (and avoids an extra task + await).
```diff
- let pp = screenshot_project_path;
- tokio::task::spawn(async move {
-     let _ = tokio::task::spawn_blocking(move || {
-         save_screenshot_from_nv12(
-             &first.data,
-             first.width,
-             first.height,
-             first.y_stride,
-             &pp,
-         );
-     })
-     .await;
- });
+ let pp = screenshot_project_path;
+ tokio::task::spawn_blocking(move || {
+     save_screenshot_from_nv12(&first.data, first.width, first.height, first.y_stride, &pp);
+ });
```
@greptileai please re-review the pr |
Summary
Optimizes the Cap MP4 export pipeline for significantly faster export speed. Benchmarked using the `cap-performance-fixtures` reference recording (~30.8s, 3024x1964 source, 2 segments with cursor, camera, zoom, and audio).

Benchmark Results (avg of 3 runs, software Vulkan renderer)
Output file sizes are identical (within <1%), confirming no quality regression.
Changes
Pipeline Restructure (`crates/export/src/mp4.rs`)

- Removed `render_task` — reduced from 3 tasks to 2 tasks
- Render forwarding now runs as a local future inside `tokio::try_join!`

Cursor Timeline Precomputation (`crates/rendering/src/cursor_interpolation.rs`, `crates/rendering/src/lib.rs`)

- Added `PrecomputedCursorTimeline` that builds the timeline once; lookups are O(log n)
- Added `ProjectUniforms::new_with_precomputed_cursor()` for cached lookups

NV12 Buffer Pool (`crates/rendering/src/frame_pipeline.rs`)

- Added `NV12BufferPool` to reuse `Vec<u8>` allocations across frames

Software-Optimized NV12 Path (`crates/rendering/src/lib.rs`)

- On software wgpu adapters, skip the GPU RGBA→NV12 compute shader and convert on the CPU instead

Encoder Tuning (`crates/enc-ffmpeg/src/video/h264.rs`)

- `rc-lookahead=10`, `b-adapt=1`, `ref=2`, `subme=2`, `trellis=0`

Testing

- `cargo test -p cap-export -p cap-rendering`
- `export-benchmark-runner` with the `cap-performance-fixtures` reference recording

Greptile Summary
This PR restructures the Cap MP4 export pipeline to improve throughput by 5–12% wall-time and ~14–19% FPS across tested resolutions and presets, with no quality regression. The core changes are: eliminating the intermediate `render_task` (3→2 concurrent tasks), precomputing the cursor spring simulation once per segment instead of once per frame lookup, upfront precomputation of zoom focus interpolation for exports, an NV12 buffer pool with proper `Drop`-based recycling to avoid per-frame allocation, a CPU-side RGBA→NV12 fast path for software Vulkan adapters, and encoder tuning for the `veryfast` and lower presets.

Key points:

- The `Drop`-on-release pattern (raised in a prior review) is now correctly implemented; buffers are returned to the pool when the last `Arc` clone drops.
- Audio rendering moved into the `spawn_blocking` encoder thread; the audio sample cursor calculation is unchanged from the old `render_task`.
- The new `interpolate_timeline` implementation uses an O(1) stride fast-path (correct for the uniform `SIMULATION_STEP_MS` grid) with an O(n) linear fallback, while `interpolate_from_events` in zoom focus correctly uses binary search.

Confidence Score: 4/5

All audio/video sync logic is preserved. The two remaining items (silent screenshot panic, 1-luma-level Y coefficient deviation) are non-blocking P2s that don't affect correctness or output quality in normal use.
Sequence Diagram
sequenceDiagram participant R as render_video_to_channel_nv12 participant F as forward_future (async) participant E as encoder_thread (spawn_blocking) Note over R,E: export_render_to_channel — tokio::try_join! R->>F: (Nv12RenderedFrame, frame_number) via tokio::mpsc (cap 2) F->>F: nv12_from_rendered_frame → ExportFrame F->>F: save first_frame_data for screenshot F->>E: ExportFrame via std::mpsc::sync_channel (cap 4) loop per frame E->>E: set_playhead (frame 0 only) E->>E: compute audio_frame from audio_sample_cursor E->>E: fill_nv12_frame_direct E->>E: queue_video_frame_reusable E->>E: queue_audio_frame (if audio) E->>E: encoded_frames += 1 end F-->>F: drop(sender) → encoder drains F->>F: spawn_blocking screenshot (fire-and-forget) E->>E: finish_mp4 / finalizeComments Outside Diff (4)
`crates/export/src/mp4.rs`, line 481-482 (link)

The original `render_task` guarded every `video_rx.recv()` with a timeout (120 s for the first frame, 90 s for subsequent frames, aborting after 3 consecutive misses). The new `export_render_to_channel` drops that protection entirely: if the GPU renderer stalls (device lost, wgpu crash, driver hang), `recv()` will park the task forever, and the outer `tokio::try_join!(encoder_thread, render_video_task)` will wait indefinitely. Users will see a frozen export with no error or progress update.

The encoder side has the same exposure: `frame_rx.recv()` on the blocking thread will block indefinitely if the async forwarding loop stops producing frames.

Consider re-introducing at least a per-frame deadline — something like `tokio::time::timeout(Duration::from_secs(90), video_rx.recv())` — and propagating a descriptive error on expiry, as the previous code did.
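On the blocking encoder side, the same deadline idea can be sketched with the standard library alone, since `std::sync::mpsc::Receiver::recv_timeout` exists for exactly this. The channel payload and error handling below are illustrative, not the PR's actual frame types:

```rust
use std::sync::mpsc::{sync_channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the ExportFrame sync_channel (cap 4).
    let (tx, rx) = sync_channel::<u64>(4);

    let producer = thread::spawn(move || {
        for frame_number in 0..3 {
            tx.send(frame_number).unwrap();
        }
        // Sender dropped here → receiver sees Disconnected: a clean shutdown.
    });

    let mut received = Vec::new();
    loop {
        // Per-frame deadline: fail loudly instead of parking forever.
        match rx.recv_timeout(Duration::from_secs(90)) {
            Ok(frame) => received.push(frame),
            Err(RecvTimeoutError::Disconnected) => break, // normal end of stream
            Err(RecvTimeoutError::Timeout) => {
                panic!("renderer produced no frame within the deadline");
            }
        }
    }
    producer.join().unwrap();
    assert_eq!(received, vec![0, 1, 2]);
    println!("received {} frames", received.len());
}
```

A real implementation would return a descriptive error rather than panic, so `tokio::try_join!` can surface it to the user.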
`crates/export/src/mp4.rs`, line 525-534 (link)

The current construction nests a `spawn_blocking` inside a plain `spawn`. `tokio::task::spawn_blocking` already schedules the closure on the blocking thread pool and returns a `JoinHandle`. Dropping that handle (fire-and-forget) is all that's needed — the outer `spawn` serves no purpose and wastes a task allocation.
`crates/export/src/mp4.rs`, line 555-563 (link)

The `_screenshot_task` `JoinHandle` is immediately dropped, so any panic inside `save_screenshot_from_nv12` (e.g., during image encoding or file I/O) is silently discarded with no log output at all. The previous implementation awaited the handle and logged a `warn!` on failure.

Since the export already returns before this runs, observability is the only concern here. Consider at minimum moving the `warn!` inside the closure (or keep the handle and log the result). As-is, if thumbnail generation silently fails, there's no indication in logs.
`crates/rendering/src/lib.rs`, line 1431 (link)

The Y-channel RGB→YCbCr conversion uses a coefficient of `65` for the red channel, while the standard BT.601 limited-range approximation uses `66`. For a saturated red pixel (R=255, G=0, B=0), this causes the computed luma to be 1 level lower than standard (≈81 vs 82). The perceptual difference is imperceptible, but it means the software-adapter path's output will not be bit-exact with the GPU-converter path, which could complicate future debugging or comparison tests.
Reviews (2): Last reviewed commit: "clippy"