perf: improve export speed with pipeline restructure, cursor precomputation, and encoder tuning #1675
Conversation
…rt pipeline

- Remove the 3-task pipeline (renderer → processor → encoder) and replace with a streamlined 2-task pipeline (renderer → encoder)
- Move audio rendering into the encoder thread to avoid channel hop
- Create nv12_from_rendered_frame() that returns ExportFrame directly
- Create fill_nv12_frame_direct() for zero-overhead NV12 frame filling
- Fire-and-forget screenshot generation (no longer blocks export completion)
- Reduce async channel buffer from 32 to 4 for renderer output (backpressure)
- Maintain all existing test compatibility

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Add supplementary speed options for libx264 in export mode:

- Explicit thread count matching CPU parallelism
- rc-lookahead=10 (reduced from default 40)
- b-adapt=1 (faster B-frame decision)
- aq-mode=1 (simpler adaptive quantization)
- ref=2 (fewer reference frames)
- subme=2 (faster subpixel motion estimation)
- trellis=0 (disable trellis quantization)

These work alongside the veryfast preset to maximize encode speed while maintaining good quality for screen recording content.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Add NV12BufferPool that reuses Vec<u8> allocations across frames
- Add finish_encoder_nv12_pooled() variant that accepts buffer pool
- Add PendingNv12Readback::wait_with_pool() for pooled readback
- FrameRenderer now maintains a pool of 6 buffers
- Eliminates per-frame Vec allocation during GPU readback

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…r export

Major optimization: cursor interpolation was rebuilding the entire smoothed cursor timeline (spring mass damper simulation) from scratch 3 times per frame (current pos, previous pos, lookback). For a 30s recording this meant ~5400 simulation steps per output frame.

Changes:

- Add PrecomputedCursorTimeline that builds the smoothed timeline once
- Add ProjectUniforms::new_with_precomputed_cursor() for cached lookup
- Refactor ProjectUniforms to use cursor_interp_fn callback pattern
- Precompute zoom focus interpolation for full duration upfront
- Precompute cursor timelines per segment before render loop starts

This eliminates the O(n) per-frame cursor simulation overhead entirely, replacing it with O(log n) binary search lookups.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
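The precompute-once / look-up-many pattern described above can be sketched as follows. This is an illustrative stand-in, not Cap's actual `PrecomputedCursorTimeline`: the expensive simulation output is taken as given, and per-frame queries become an O(log n) `partition_point` binary search over a time-sorted sample list.

```rust
// Illustrative sketch only — names and types are not Cap's real ones.
#[derive(Clone, Copy, Debug)]
struct TimelineSample {
    time_ms: f64,
    x: f64,
    y: f64,
}

struct PrecomputedTimeline {
    samples: Vec<TimelineSample>, // sorted by time_ms
}

impl PrecomputedTimeline {
    /// Run the (expensive) simulation once for the whole duration;
    /// here the precomputed samples are simply passed in.
    fn build(raw: &[(f64, f64, f64)]) -> Self {
        let samples = raw
            .iter()
            .map(|&(t, x, y)| TimelineSample { time_ms: t, x, y })
            .collect();
        Self { samples }
    }

    /// O(log n) lookup: last sample at or before `time_ms`.
    fn sample_at(&self, time_ms: f64) -> Option<TimelineSample> {
        let idx = self.samples.partition_point(|s| s.time_ms <= time_ms);
        idx.checked_sub(1).map(|i| self.samples[i])
    }
}

fn main() {
    let tl = PrecomputedTimeline::build(&[(0.0, 0.0, 0.0), (10.0, 1.0, 1.0), (20.0, 2.0, 2.0)]);
    // Query between samples resolves to the most recent prior sample.
    assert_eq!(tl.sample_at(15.0).unwrap().x, 1.0);
    // Queries before the first sample have no answer.
    assert!(tl.sample_at(-1.0).is_none());
    println!("ok");
}
```

Building once and sharing the immutable timeline across frame lookups is what removes the per-frame O(n) simulation cost.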
Replace the spawned forward_task with a local async future that runs in tokio::try_join! alongside the render future. This eliminates:

- tokio task spawn/schedule overhead per export
- unnecessary task context switches between renderer and forwarder

Also reduce the async channel buffer from 8 to 2 (tighter backpressure).

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Apply the same PrecomputedCursorTimeline optimization to the render_video_to_channel function (used for GIF export and playback) so both render paths benefit from the cached cursor timeline. Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When running on a software wgpu adapter (lavapipe, llvmpipe, etc.), skip the GPU RGBA→NV12 compute shader and instead:

1. Read back RGBA via the standard pipelined readback path
2. Convert RGBA→NV12 directly on CPU with inline color conversion

This avoids the overhead of:

- GPU compute shader dispatch through software Vulkan
- Extra GPU storage buffer allocation for NV12 data
- GPU command encoder submission for the compute pass
- Additional GPU buffer readback through wgpu

On software adapters, the GPU compute path has significant overhead from the wgpu abstraction layer, making direct CPU conversion faster.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Remove unused finish_encoder_nv12 import
- Prefix unused cursor_events parameter with underscore
- Fix needless mutable on PendingNv12Readback::wait

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Fix collapsible_if in encoder thread (mp4.rs)
- Fix needless_option_as_deref in finish_encoder_nv12_pooled
- Remove unused finish_encoder_nv12 wrapper (replaced by pooled variant)
- Remove unused PendingNv12Readback::wait (replaced by wait_with_pool)
- Remove unused NV12BufferPool::return_buf and max_buffers field
- Remove unnecessary mut on buffer_pool parameter

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
```rust
const GPU_BUFFER_WAIT_TIMEOUT_SECS: u64 = 10;

pub struct NV12BufferPool {
    buffers: Vec<Vec<u8>>,
}

impl NV12BufferPool {
    pub fn new(pre_alloc: usize) -> Self {
        Self {
            buffers: Vec::with_capacity(pre_alloc),
        }
    }

    pub fn acquire(&mut self, size: usize) -> Vec<u8> {
        if let Some(pos) = self.buffers.iter().position(|b| b.capacity() >= size) {
            let mut buf = self.buffers.swap_remove(pos);
            buf.clear();
            buf
        } else {
            Vec::with_capacity(size)
        }
    }
}

pub struct RgbaToNv12Converter {
```
NV12BufferPool never reuses buffers — optimization is a no-op
`NV12BufferPool` has an `acquire` method but no corresponding `release` (or `return`) method. Every acquired `Vec<u8>` is immediately moved into `Arc::new(...)`:

```rust
// frame_pipeline.rs
let nv12_vec = if let Some(pool) = buffer_pool {
    let mut buf = pool.acquire(data_len);
    buf.extend_from_slice(&data);
    buf
};
let nv12_data = Arc::new(nv12_vec); // Vec is now owned by the Arc
```

When the `Arc` drops, the `Vec` is freed — it is never returned to the pool. Because `NV12BufferPool::new(6)` only pre-allocates capacity in the *outer* `Vec` (not the NV12 byte buffers themselves), `self.buffers` is always empty, and every call to `acquire` falls through to `Vec::with_capacity(size)`. The stated goal — "Eliminates per-frame allocation during GPU readback" — is not achieved.
To make the pool functional, buffers need to be returned after use. One common pattern is wrapping the Vec in a guard type that calls a release method (or sends back through a channel) on drop. Alternatively, the pool can be pre-populated with a fixed set of buffers owned by the pool itself.
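A minimal sketch of the guard-on-drop pattern suggested above (all names here — `BufferPool`, `PooledBuf` — are illustrative, not the PR's code): the guard derefs to the underlying `Vec<u8>`, and its `Drop` impl pushes the allocation back into the shared free list instead of freeing it.

```rust
use std::sync::{Arc, Mutex};

// Illustrative drop-guard buffer pool; not Cap's actual NV12BufferPool.
#[derive(Clone, Default)]
struct BufferPool {
    free: Arc<Mutex<Vec<Vec<u8>>>>,
}

struct PooledBuf {
    buf: Vec<u8>,
    pool: Arc<Mutex<Vec<Vec<u8>>>>,
}

impl BufferPool {
    fn acquire(&self, size: usize) -> PooledBuf {
        let mut free = self.free.lock().unwrap();
        let buf = match free.iter().position(|b| b.capacity() >= size) {
            Some(pos) => {
                let mut b = free.swap_remove(pos);
                b.clear();
                b
            }
            None => Vec::with_capacity(size),
        };
        PooledBuf { buf, pool: Arc::clone(&self.free) }
    }
}

impl std::ops::Deref for PooledBuf {
    type Target = Vec<u8>;
    fn deref(&self) -> &Vec<u8> { &self.buf }
}

impl std::ops::DerefMut for PooledBuf {
    fn deref_mut(&mut self) -> &mut Vec<u8> { &mut self.buf }
}

impl Drop for PooledBuf {
    // Return the allocation to the pool instead of freeing it.
    fn drop(&mut self) {
        let buf = std::mem::take(&mut self.buf);
        self.pool.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = BufferPool::default();
    {
        let mut b = pool.acquire(1024);
        b.extend_from_slice(&[1, 2, 3]);
    } // guard drops here → allocation goes back to the pool
    let b2 = pool.acquire(512); // reuses the 1024-byte allocation
    assert!(b2.capacity() >= 1024);
    println!("reused capacity: {}", b2.capacity());
}
```

In the actual pipeline the guard would need to live inside (or alongside) the `Arc`'d frame so the buffer returns only after the last consumer drops it; the alternative channel-based release mentioned above avoids the mutex on the hot path.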
```rust
let r = src_row[col * 4] as i32;
let g = src_row[col * 4 + 1] as i32;
let b = src_row[col * 4 + 2] as i32;
y_row[col] = ((16 + ((65 * r + 129 * g + 25 * b + 128) >> 8)) as u8).clamp(16, 235);
```
Casting to u8 before clamping can wrap for values >255 (or <0) and produce incorrect luma/chroma. Clamping on the integer first avoids that.
```diff
- y_row[col] = ((16 + ((65 * r + 129 * g + 25 * b + 128) >> 8)) as u8).clamp(16, 235);
+ let y = 16 + ((65 * r + 129 * g + 25 * b + 128) >> 8);
+ y_row[col] = y.clamp(16, 235) as u8;
```
```rust
uv_row[col * 2] =
    ((128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)) as u8).clamp(16, 240);
uv_row[col * 2 + 1] =
    ((128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)) as u8).clamp(16, 240);
```
Same wrap issue here: clamp the computed i32 first, then cast.
```diff
- uv_row[col * 2] =
-     ((128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8)) as u8).clamp(16, 240);
- uv_row[col * 2 + 1] =
-     ((128 + ((112 * r - 94 * g - 18 * b + 128) >> 8)) as u8).clamp(16, 240);
+ let u = 128 + ((-38 * r - 74 * g + 112 * b + 128) >> 8);
+ let v = 128 + ((112 * r - 94 * g - 18 * b + 128) >> 8);
+ uv_row[col * 2] = u.clamp(16, 240) as u8;
+ uv_row[col * 2 + 1] = v.clamp(16, 240) as u8;
```
```rust
let pp = screenshot_project_path;
tokio::task::spawn(async move {
    let _ = tokio::task::spawn_blocking(move || {
        save_screenshot_from_nv12(
            &first.data,
            first.width,
            first.height,
            first.y_stride,
            &pp,
        );
    })
    .await;
});
```
This double-spawn looks redundant. If the goal is fire-and-forget screenshot generation, spawn_blocking alone is enough (and avoids an extra task + await).
```diff
- let pp = screenshot_project_path;
- tokio::task::spawn(async move {
-     let _ = tokio::task::spawn_blocking(move || {
-         save_screenshot_from_nv12(
-             &first.data,
-             first.width,
-             first.height,
-             first.y_stride,
-             &pp,
-         );
-     })
-     .await;
- });
+ let pp = screenshot_project_path;
+ tokio::task::spawn_blocking(move || {
+     save_screenshot_from_nv12(&first.data, first.width, first.height, first.y_stride, &pp);
+ });
```
@greptileai please re-review the pr |
Summary
Optimizes the Cap MP4 export pipeline for significantly faster export speed. Benchmarked using the `cap-performance-fixtures` reference recording (~30.8s, 3024x1964 source, 2 segments with cursor, camera, zoom, and audio).

Benchmark Results (avg of 3 runs, software Vulkan renderer)
Output file sizes are identical (within <1%), confirming no quality regression.
Changes
Pipeline Restructure (`crates/export/src/mp4.rs`)

- Removed `render_task` — reduced from 3 tasks to 2 tasks
- Render forwarding now runs as a local future inside `tokio::try_join!`

Cursor Timeline Precomputation (`crates/rendering/src/cursor_interpolation.rs`, `crates/rendering/src/lib.rs`)

- Added `PrecomputedCursorTimeline` that builds the timeline once; lookups are O(log n)
- Added `ProjectUniforms::new_with_precomputed_cursor()` for cached lookups

NV12 Buffer Pool (`crates/rendering/src/frame_pipeline.rs`)

- Added `NV12BufferPool` to reuse `Vec<u8>` allocations across frames

Software-Optimized NV12 Path (`crates/rendering/src/lib.rs`)

- On software wgpu adapters, skip the GPU RGBA→NV12 compute shader and convert on the CPU instead

Encoder Tuning (`crates/enc-ffmpeg/src/video/h264.rs`)

- `rc-lookahead=10`, `b-adapt=1`, `ref=2`, `subme=2`, `trellis=0`

Testing

- `cargo test -p cap-export -p cap-rendering`
- `export-benchmark-runner` with the `cap-performance-fixtures` reference recording

Greptile Summary
This PR restructures the Cap MP4 export pipeline to improve throughput by 5–12% wall-time and ~14–19% FPS across tested resolutions and presets, with no quality regression. The core changes are: eliminating the intermediate `render_task` (3→2 concurrent tasks), precomputing the cursor spring simulation once per segment instead of once per frame lookup, upfront precomputation of zoom focus interpolation for exports, an NV12 buffer pool with proper `Drop`-based recycling to avoid per-frame allocation, a CPU-side RGBA→NV12 fast path for software Vulkan adapters, and encoder tuning for the `veryfast` and lower presets.

Key points:

- The `Drop`-on-release pattern (raised in a prior review) is now correctly implemented; buffers are returned to the pool when the last `Arc` clone drops.
- Audio rendering moved into the `spawn_blocking` encoder thread; the audio sample cursor calculation is unchanged from the old `render_task`.
- The new `interpolate_timeline` implementation uses an O(1) stride fast-path (correct for the uniform `SIMULATION_STEP_MS` grid) with an O(n) linear fallback, while `interpolate_from_events` in zoom focus correctly uses binary search.

Confidence Score: 4/5

All audio/video sync logic is preserved. The two remaining items (silent screenshot panic, 1-luma-level Y coefficient deviation) are non-blocking P2s that don't affect correctness or output quality in normal use.
Sequence Diagram
sequenceDiagram participant R as render_video_to_channel_nv12 participant F as forward_future (async) participant E as encoder_thread (spawn_blocking) Note over R,E: export_render_to_channel — tokio::try_join! R->>F: (Nv12RenderedFrame, frame_number) via tokio::mpsc (cap 2) F->>F: nv12_from_rendered_frame → ExportFrame F->>F: save first_frame_data for screenshot F->>E: ExportFrame via std::mpsc::sync_channel (cap 4) loop per frame E->>E: set_playhead (frame 0 only) E->>E: compute audio_frame from audio_sample_cursor E->>E: fill_nv12_frame_direct E->>E: queue_video_frame_reusable E->>E: queue_audio_frame (if audio) E->>E: encoded_frames += 1 end F-->>F: drop(sender) → encoder drains F->>F: spawn_blocking screenshot (fire-and-forget) E->>E: finish_mp4 / finalizeComments Outside Diff (4)
`crates/export/src/mp4.rs`, line 481-482 (link)

The original `render_task` guarded every `video_rx.recv()` with a timeout (120 s for the first frame, 90 s for subsequent frames, aborting after 3 consecutive misses). The new `export_render_to_channel` drops that protection entirely: if the GPU renderer stalls (device lost, wgpu crash, driver hang), `recv()` will park the task forever, and the outer `tokio::try_join!(encoder_thread, render_video_task)` will wait indefinitely. Users will see a frozen export with no error or progress update.

The encoder side has the same exposure: `frame_rx.recv()` on the blocking thread will block indefinitely if the async forwarding loop stops producing frames.

Consider re-introducing at least a per-frame deadline — something like `tokio::time::timeout(Duration::from_secs(90), video_rx.recv())` — and propagating a descriptive error on expiry, as the previous code did.
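On the blocking encoder side, the same deadline idea can be sketched with the standard library alone, since `std::sync::mpsc::Receiver::recv_timeout` exists for exactly this. The channel payload and error handling below are illustrative, not the PR's actual frame types:

```rust
use std::sync::mpsc::{sync_channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the ExportFrame sync_channel (cap 4).
    let (tx, rx) = sync_channel::<u64>(4);

    let producer = thread::spawn(move || {
        for frame_number in 0..3 {
            tx.send(frame_number).unwrap();
        }
        // Sender dropped here → receiver sees Disconnected: a clean shutdown.
    });

    let mut received = Vec::new();
    loop {
        // Per-frame deadline: fail loudly instead of parking forever.
        match rx.recv_timeout(Duration::from_secs(90)) {
            Ok(frame) => received.push(frame),
            Err(RecvTimeoutError::Disconnected) => break, // normal end of stream
            Err(RecvTimeoutError::Timeout) => {
                panic!("renderer produced no frame within the deadline");
            }
        }
    }
    producer.join().unwrap();
    assert_eq!(received, vec![0, 1, 2]);
    println!("received {} frames", received.len());
}
```

A real implementation would return a descriptive error rather than panic, so `tokio::try_join!` can surface it to the user.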
`crates/export/src/mp4.rs`, line 525-534 (link)

The current construction nests a `spawn_blocking` inside a plain `spawn`. `tokio::task::spawn_blocking` already schedules the closure on the blocking thread pool and returns a `JoinHandle`. Dropping that handle (fire-and-forget) is all that's needed — the outer `spawn` serves no purpose and wastes a task allocation.
`crates/export/src/mp4.rs`, line 555-563 (link)

The `_screenshot_task` `JoinHandle` is immediately dropped, so any panic inside `save_screenshot_from_nv12` (e.g., during image encoding or file I/O) is silently discarded with no log output at all. The previous implementation awaited the handle and logged a `warn!` on failure.

Since the export already returns before this runs, observability is the only concern here. Consider at minimum moving the `warn!` inside the closure (or keep the handle and log the result). As-is, if thumbnail generation silently fails, there's no indication in logs.
`crates/rendering/src/lib.rs`, line 1431 (link)

The Y-channel RGB→YCbCr conversion uses a coefficient of `65` for the red channel, while the standard BT.601 limited-range approximation uses `66`. For a saturated red pixel (R=255, G=0, B=0), this causes the computed luma to be 1 level lower than standard (≈81 vs 82). The perceptual difference is imperceptible, but it means the software-adapter path's output will not be bit-exact with the GPU-converter path, which could complicate future debugging or comparison tests.
Reviews (2): Last reviewed commit: "clippy"