Skip to content

virtio-net: fix UDP TX stall and improve TCP throughput#102

Open
agicy wants to merge 11 commits into
syswonder:mainfrom
agicy:perf/virtio-net
Open

virtio-net: fix UDP TX stall and improve TCP throughput#102
agicy wants to merge 11 commits into
syswonder:mainfrom
agicy:perf/virtio-net

Conversation

@agicy

@agicy agicy commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Two virtio-net TX correctness bugs fixed:

  • Re-check avail ring after virtqueue_enable_notify: the guest may add TX descriptors during the VRING_USED_F_NO_NOTIFY window. Without a re-check those descriptors are lost.
  • Inject IRQ after TX used-ring update: UDP has no implicit polling (unlike TCP which polls via ACK processing). Without an IRQ the guest never recycles completed TX descriptors, deadlocking sendto().

Also included are per-packet CPU optimizations (process_descriptor_chain_buf replaces per-packet malloc/free, update_used_ring_batch amortises DMB overhead) and six standalone bug fixes.

Test results

RK3588, zone1 (192.168.255.3/24) <-> root-linux (192.168.255.2/24), iperf3 60s:

iperf3 -i 10 -c 192.168.255.2 -t 60 --bidir           # TCP
iperf3 -i 10 -c 192.168.255.2 -t 60 -u -b 10G         # UDP TX
iperf3 -i 10 -c 192.168.255.2 -t 60 -u -b 10G -R      # UDP RX
Test Metric Before After
TCP bidir TX Gbps 0.51 1.11
TCP bidir RX Gbps 0.13 0.21
UDP TX Gbps 0.0008 (broken) 2.28
UDP -R RX Gbps 0.24 0.27

UDP TX and TCP throughput is improved.

References

Related issue #101.

Partially-fixes hvisor #240.

agicy added 11 commits June 20, 2026 05:23
…leak

Replace static char pad[64] with stack-local char pad[64] = {0}.
A static variable is shared across all net devices; when per-vq
threads are introduced, concurrent TX would race on the shared
buffer.  Zero-initialization prevents leaking previous stack
contents into the TAP device on the short-packet padding path.
…thmetic

A malicious or buggy guest may provide iov[0].iov_len < header_len.
Without this check, iov_len underflows (unsigned wrap to a huge value)
and writev reads beyond the guest buffer — leaking host memory.

Add a bounds check that returns the descriptor via update_used_ring
with len=0 when the header doesn't fit, matching the existing pattern
in rm_iov_header on the RX path.
When rm_iov_header fails (iov[0].iov_len < header_len), the descriptor
has already been consumed from the avail ring by process_descriptor_chain
but was never posted to the used ring.  This leaks the descriptor
permanently from the guest's perspective.

Fix: call update_used_ring(vq, idx, 0) to return the descriptor before
freeing the iov.
When readv returns a non-EAGAIN error (EIO, EFAULT, etc.), the
descriptor was consumed from the avail ring but never returned.
The code fell through to the success path which would call
update_used_ring with a bogus length.

Fix: return the descriptor via update_used_ring(vq, idx, 0)
on any readv error that is not EWOULDBLOCK.
When the TAP device is closed or the bridge goes down, readv returns 0
(EOF).  Without handling this case, the code falls through to the
success path and calls update_used_ring with header_len bytes, plus
EPOLLIN fires continuously on an EOF fd, creating a tight loop that
drains guest RX buffers.

Fix: detect len==0, return the descriptor via update_used_ring(vq,idx,0),
set rx_ready=0 to stop further RX attempts, and break out.
The file-scope static uint8_t trashbuf[1600] is shared across all
net device instances.  When per-vq threads are introduced, concurrent
RX drops from multiple devices would race on this single buffer.

Move the buffer to a stack-local variable inside virtio_net_event_handler.
UDP streams lose descriptors because the guest may add new TX
descriptors between the last empty check and enable_notify, and the
suppressed notification never re-fires.  Wrap the drain+enable in a
loop with a post-enable re-check.
…criptor traversal

Single-pass traversal that counts and fills in one loop.  INDIRECT
descriptors are handled inline: the indirect table size is known from
vdesc->len/16, so no separate counting pass is needed.

Single pass halves the cross-VM desc_table accesses for the common
non-INDIRECT path (net), saving ~150-400 cycles per packet on A55.

The original process_descriptor_chain() (malloc version) is unchanged.
…ket malloc/free

Replace process_descriptor_chain() + free(iov) in rx_event_handler and
virtq_tx_handle_one_request with process_descriptor_chain_buf() using
stack-allocated iov arrays (NET_IOV_MAX=8 entries).

Eliminates ~80-230 cycles of malloc/free overhead per packet on A55.
update_used_ring() emits two write_barrier() (DMB ishst) calls for
every single used-ring entry.  On in-order cores like A55 each barrier
stalls the store buffer for ~30-50 cycles, so N entries cost 2N barriers.

The new update_used_ring_batch() writes all N entries in a tight loop
and emits only two barriers total — one before the idx update and one
after — regardless of N.  This reduces the per-packet barrier cost from
2N to 2.

Original update_used_ring() is kept for callers that process one
request at a time (blk, console, gpu).
Switch to batch used-ring updates and inject an IRQ after every
batch flush so the guest can recycle completed TX descriptors.
Without the IRQ, UDP sendto() blocks permanently after filling
the virtqueue.
@agicy agicy requested review from dallasxy, li041 and liulog June 23, 2026 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

the packet loss in virtio-net

1 participant