7851f34
BACKPORT: net: page_pool: rename page_pool_alloc_netmem to *_netmems
pran005 Feb 28, 2025
69db6a9
BACKPORT: net: page_pool: create page_pool_alloc_netmem
pran005 Feb 28, 2025
3b7e186
BACKPORT: page_pool: Set `dma_sync` to false for devmem memory provider
pran005 Feb 28, 2025
893dfbf
BACKPORT: page_pool: disable sync for cpu for dmabuf memory provider
pran005 Feb 28, 2025
63881ea
BACKPORT: net: Document netmem driver support
pran005 Feb 28, 2025
098641f
BACKPORT: net: add get_netmem/put_netmem support
pran005 Mar 3, 2025
64302e6
BACKPORT: net: devmem: TCP tx netlink api
pran005 Mar 3, 2025
84f16e1
BACKPORT: net: devmem: Implement TX path
pran005 Mar 4, 2025
597150e
BACKPORT: net: add devmem TCP TX documentation
pran005 Mar 4, 2025
f7cf06f
BACKPORT: net: enable driver support for netmem TX
pran005 Mar 5, 2025
9dea192
BACKPORT: gve: add netmem TX support to GVE DQO-RDA mode
pran005 Mar 5, 2025
26de610
BACKPORT: net: check for driver support in netmem TX
pran005 Mar 5, 2025
2a3c953
BACKPORT: selftests: ncdevmem: Redirect all non-payload output to stderr
pran005 Mar 5, 2025
e2fe2cb
BACKPORT: selftests: ncdevmem: Separate out dmabuf provider
pran005 Mar 5, 2025
0736ab7
BACKPORT: selftests: ncdevmem: Unify error handling
pran005 Mar 5, 2025
67a1f62
BACKPORT: selftests: ncdevmem: Make client_ip optional
pran005 Mar 5, 2025
379da3e
BACKPORT: selftests: ncdevmem: Remove default arguments
pran005 Mar 5, 2025
e6619f0
BACKPORT: selftests: ncdevmem: Switch to AF_INET6
pran005 Mar 5, 2025
c0e9efc
BACKPORT: selftests: ncdevmem: Properly reset flow steering
pran005 Mar 5, 2025
9a93c88
BACKPORT: selftests: ncdevmem: Use YNL to enable TCP header split
pran005 Mar 5, 2025
37d6882
BACKPORT: selftests: ncdevmem: Remove hard-coded queue numbers
pran005 Mar 5, 2025
6f19d80
BACKPORT: selftests: ncdevmem: Run selftest when none of the -s or -c…
pran005 Mar 5, 2025
29a372b
BACKPORT: selftests: ncdevmem: Move ncdevmem under drivers/net/hw
pran005 Mar 5, 2025
79da72d
BACKPORT: selftests: ncdevmem: Add automated test
pran005 Mar 5, 2025
6bc9789
BACKPORT: ncdevmem doesn't need libmnl, remove the unnecessary include
pran005 Mar 5, 2025
9abf145
BACKPORT: selftests: ncdevmem: Implement devmem TCP TX
pran005 Mar 5, 2025
d81e966
BACKPORT: gve: Add RSS cache for non RSS device option scenario
pran005 Mar 5, 2025
57d3ecd
BACKPORT: gve: move DQO rx buffer management related code to a new file
pran005 Mar 5, 2025
95f5af6
BACKPORT: gve: adopt page pool for DQ RDA mode
pran005 Mar 5, 2025
04ba396
BACKPORT: gve: add support for basic queue stats
pran005 Mar 5, 2025
b54d4a3
BACKPORT: gve: change to use page_pool_put_full_page when recycling p…
pran005 Mar 5, 2025
ff6b5a2
BACKPORT: gve: unlink old napi when stopping a queue using queue API
pran005 Mar 5, 2025
d8a5e1d
BACKPORT: gve: convert to use netmem for DQO RDA mode
pran005 Mar 5, 2025
55337fd
gve: tcp devmem implementation
pran005 Mar 6, 2025
12 changes: 12 additions & 0 deletions Documentation/netlink/specs/netdev.yaml
@@ -676,6 +676,18 @@ operations:
        reply:
          attributes:
            - id
    -
      name: bind-tx
      doc: Bind dmabuf to netdev for TX
      attribute-set: dmabuf
      do:
        request:
          attributes:
            - ifindex
            - fd
        reply:
          attributes:
            - id

kernel-family:
  headers: [ "linux/list.h"]
143 changes: 142 additions & 1 deletion Documentation/networking/devmem.rst
@@ -62,7 +62,7 @@ More Info
https://lore.kernel.org/netdev/[email protected]/


Interface
RX Interface
============


@@ -235,6 +235,147 @@ can be less than the tokens provided by the user in case of:
(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.

TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

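	struct netdev_bind_tx_req *req = NULL;
	struct netdev_bind_tx_rsp *rsp = NULL;
	struct ynl_error yerr;

	/* Open a YNL socket on the netdev netlink family. */
	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	/* Ask the kernel to bind dmabuf_fd to the interface ifindex for TX. */
	req = netdev_bind_tx_req_alloc();
	netdev_bind_tx_req_set_ifindex(req, ifindex);
	netdev_bind_tx_req_set_fd(req, dmabuf_fd);

	rsp = netdev_bind_tx(*ys, req);

	/* The reply carries the ID that names this binding when sending. */
	tx_dmabuf_id = rsp->id;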


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.

Socket Setup
------------

The user application must pass the MSG_ZEROCOPY flag when sending devmem TCP
data. Devmem cannot be copied by the kernel, so the semantics of devmem TX are
similar to the semantics of MSG_ZEROCOPY. As with MSG_ZEROCOPY, SO_ZEROCOPY
must first be enabled on the socket::

	int opt = 1;

	setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user bind the TX socket, via SO_BINDTODEVICE,
to the same interface the dma-buf has been bound to::

	setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the ID of the dmabuf to send from via
dmabuf_tx_cmsg.dmabuf_id.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
bytes from offset 2000 into the dmabuf. The dmabuf to send from is
tx_dmabuf_id::

	char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
	struct dmabuf_tx_cmsg ddmabuf;
	struct msghdr msg = {};
	struct cmsghdr *cmsg;
	struct iovec iov[2];

	/* iov_base holds an offset into the dmabuf, not a user pointer. */
	iov[0].iov_base = (void*)100;
	iov[0].iov_len = 1024;
	iov[1].iov_base = (void*)2000;
	iov[1].iov_len = 2048;

	msg.msg_iov = iov;
	msg.msg_iovlen = 2;

	msg.msg_control = ctrl_data;
	msg.msg_controllen = sizeof(ctrl_data);

	/* A single SCM_DEVMEM_DMABUF cmsg names the dmabuf to send from. */
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));

	ddmabuf.dmabuf_id = tx_dmabuf_id;

	*((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;

	sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
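
The message layout described above can be factored into a reusable routine.
The sketch below is illustrative only and makes several assumptions:
``struct dmabuf_tx_cmsg_sketch`` is a local stand-in that carries just the
``dmabuf_id`` field (the real ``struct dmabuf_tx_cmsg`` comes from the uapi
headers of a kernel carrying this series), ``struct dmabuf_range`` and
``devmem_tx_fill_msg`` are hypothetical names, and the cmsg type is passed in
by the caller so that the block builds without those headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Assumption: stand-in for the uapi struct, holding only the dmabuf ID. */
struct dmabuf_tx_cmsg_sketch {
	uint32_t dmabuf_id;
};

/* Hypothetical description of one region of the dmabuf to send. */
struct dmabuf_range {
	size_t offset;	/* byte offset into the dmabuf */
	size_t len;	/* number of bytes to send from that offset */
};

/*
 * Fill msg, iov and ctrl for a devmem TX send. iov must have room for
 * nranges entries and ctrl must be suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
static int devmem_tx_fill_msg(struct msghdr *msg, struct iovec *iov,
			      const struct dmabuf_range *ranges, size_t nranges,
			      void *ctrl, size_t ctrl_len,
			      int cmsg_type, uint32_t dmabuf_id)
{
	struct dmabuf_tx_cmsg_sketch payload = { .dmabuf_id = dmabuf_id };
	struct cmsghdr *cmsg;
	size_t i;

	if (ctrl_len < CMSG_SPACE(sizeof(payload)))
		return -1;

	/* iov_base carries an offset into the dmabuf, not a user pointer. */
	for (i = 0; i < nranges; i++) {
		iov[i].iov_base = (void *)(uintptr_t)ranges[i].offset;
		iov[i].iov_len = ranges[i].len;
	}

	memset(msg, 0, sizeof(*msg));
	memset(ctrl, 0, ctrl_len);
	msg->msg_iov = iov;
	msg->msg_iovlen = nranges;
	msg->msg_control = ctrl;
	msg->msg_controllen = CMSG_SPACE(sizeof(payload));

	/* One control message identifies the dmabuf to send from. */
	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = cmsg_type;
	cmsg->cmsg_len = CMSG_LEN(sizeof(payload));
	memcpy(CMSG_DATA(cmsg), &payload, sizeof(payload));

	return 0;
}
```

With kernel headers that define it, a caller would pass ``SCM_DEVMEM_DMABUF``
as ``cmsg_type`` and then call ``sendmsg(socket_fd, &msg, MSG_ZEROCOPY)``.
Copying the payload with ``memcpy`` avoids relying on the alignment of
``CMSG_DATA()`` for the struct type.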


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

	int64_t tstop = gettimeofday_ms() + waittime_ms;
	char control[CMSG_SPACE(100)] = {};
	struct sock_extended_err *serr;
	struct msghdr msg = {};
	struct cmsghdr *cm;
	__u32 hi, lo;
	int ret;

	/* gettimeofday_ms() and do_poll() are application helpers, not libc. */
	while (gettimeofday_ms() < tstop) {
		if (!do_poll(fd))
			continue;

		/* recvmsg() overwrites msg_controllen, so reset it each time. */
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
		if (ret < 0)
			continue;

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			serr = (void *)CMSG_DATA(cm);

			/* Completed sends are reported as the range [lo, hi]. */
			hi = serr->ee_data;
			lo = serr->ee_info;

			fprintf(stdout, "tx complete [%u,%u]\n", lo, hi);
		}
	}

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.

Implementation & Caveats
========================

1 change: 1 addition & 0 deletions Documentation/networking/index.rst
@@ -85,6 +85,7 @@ Contents:
netdevices
netfilter-sysctl
netif-msg
netmem
nexthop-group-resilient
nf_conntrack-sysctl
nf_flowtable
1 change: 1 addition & 0 deletions Documentation/networking/net_cachelines/net_device.rst
@@ -9,6 +9,7 @@ Type Name fastpath_tx_access
..struct ..net_device
unsigned_long:32 priv_flags read_mostly - __dev_queue_xmit(tx)
unsigned_long:1 lltx read_mostly - HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
unsigned_long:1 netmem_tx read_mostly
char name[16] - -
struct_netdev_name_node* name_node
struct_dev_ifalias* ifalias
5 changes: 5 additions & 0 deletions Documentation/networking/netdev-features.rst
@@ -188,3 +188,8 @@ Redundancy) frames from one port to another in hardware.
This should be set for devices which duplicate outgoing HSR (High-availability
Seamless Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically
frames in hardware.

* netmem-tx

This should be set for devices which support netmem TX. See
Documentation/networking/netmem.rst.
98 changes: 98 additions & 0 deletions Documentation/networking/netmem.rst
@@ -0,0 +1,98 @@
.. SPDX-License-Identifier: GPL-2.0

==================================
Netmem Support for Network Drivers
==================================

This document outlines the requirements for network drivers to support netmem,
an abstract memory type that enables features like device memory TCP. By
supporting netmem, drivers can work with various underlying memory types
with little to no modification.

Benefits of Netmem:

* Flexibility: Netmem can be backed by different memory types (e.g., struct
  page, DMA-buf), allowing drivers to support various use cases such as device
  memory TCP.
* Future-proof: Drivers with netmem support are ready for upcoming
  features that rely on it.
* Simplified Development: Drivers interact with a consistent API,
  regardless of the underlying memory implementation.

Driver Rx Requirements
======================

1. The driver must support page_pool.

2. The driver must support the tcp-data-split ethtool option.

3. The driver must use the page_pool netmem APIs for payload memory. The netmem
   APIs currently correspond 1-to-1 with the page APIs. Conversion to netmem
   should be achievable by switching the page APIs to netmem APIs and tracking
   memory via netmem_refs in the driver rather than struct page * :

   - page_pool_alloc -> page_pool_alloc_netmem
   - page_pool_get_dma_addr -> page_pool_get_dma_addr_netmem
   - page_pool_put_page -> page_pool_put_netmem

   Not all page APIs have netmem equivalents at the moment. If your driver
   relies on a missing netmem API, feel free to add it and propose it to
   netdev@, or reach out to the maintainers and/or [email protected] for
   help adding the netmem API.

4. The driver must use the following PP_FLAGS:

   - PP_FLAG_DMA_MAP: netmem is not dma-mappable by the driver. The driver
     must delegate the dma mapping to the page_pool, which knows when
     dma-mapping is (or is not) appropriate.
   - PP_FLAG_DMA_SYNC_DEV: netmem dma addr is not necessarily dma-syncable
     by the driver. The driver must delegate the dma syncing to the page_pool,
     which knows when dma-syncing is (or is not) appropriate.
   - PP_FLAG_ALLOW_UNREADABLE_NETMEM: the driver must specify this flag iff
     tcp-data-split is enabled.

5. The driver must not assume the netmem is readable and/or backed by pages.
   The netmem returned by the page_pool may be unreadable, in which case
   netmem_address() will return NULL. The driver must correctly handle
   unreadable netmem, i.e. it must not attempt to access the contents when
   netmem_address() is NULL.

   Ideally, drivers should not have to check the underlying netmem type via
   helpers like netmem_is_net_iov() or convert the netmem to any of its
   underlying types via netmem_to_page() or netmem_to_net_iov(). In most cases,
   netmem or page_pool helpers that abstract this complexity are provided
   (and more can be added).

6. The driver must use page_pool_dma_sync_netmem_for_cpu() in lieu of
   dma_sync_single_range_for_cpu(). For some memory providers, dma syncing for
   CPU will be done by the page_pool; for others (particularly the dmabuf
   memory provider), dma syncing for CPU is the responsibility of the userspace
   using dmabuf APIs. The driver must delegate the entire dma-syncing operation
   to the page_pool, which will do it correctly.

7. Avoid implementing driver-specific recycling on top of the page_pool. Drivers
   cannot hold onto a struct page to do their own recycling as the netmem may
   not be backed by a struct page. However, you may hold onto a page_pool
   reference with page_pool_fragment_netmem() or page_pool_ref_netmem() for
   that purpose, but be mindful that some netmem types might have longer
   circulation times, such as when userspace holds a reference in zerocopy
   scenarios.

Driver TX Requirements
======================

1. The driver must not pass the netmem dma_addr to any of the dma-mapping APIs
   directly. This is because netmem dma_addrs may come from a source like
   dma-buf that is not compatible with the dma-mapping APIs.

   Helpers like netmem_dma_unmap_page_attrs() & netmem_dma_unmap_addr_set()
   should be used in lieu of dma_unmap_page[_attrs](), dma_unmap_addr_set().
   The netmem variants will handle netmem dma_addrs correctly regardless of the
   source, delegating to the dma-mapping APIs when appropriate.

   Not all dma-mapping APIs have netmem equivalents at the moment. If your
   driver relies on a missing netmem API, feel free to add it and propose it to
   netdev@, or reach out to the maintainers and/or [email protected] for
   help adding the netmem API.

2. The driver should declare support by setting `netdev->netmem_tx = true`.
1 change: 1 addition & 0 deletions drivers/net/ethernet/google/Kconfig
@@ -18,6 +18,7 @@ if NET_VENDOR_GOOGLE
config GVE
tristate "Google Virtual NIC (gVNIC) support"
depends on (PCI_MSI && (X86 || CPU_LITTLE_ENDIAN))
select PAGE_POOL
help
This driver supports Google Virtual NIC (gVNIC)"

3 changes: 2 additions & 1 deletion drivers/net/ethernet/google/gve/Makefile
@@ -1,4 +1,5 @@
# Makefile for the Google virtual Ethernet (gve) driver

obj-$(CONFIG_GVE) += gve.o
gve-objs := gve_main.o gve_tx.o gve_tx_dqo.o gve_rx.o gve_rx_dqo.o gve_ethtool.o gve_adminq.o gve_utils.o gve_flow_rule.o
gve-objs := gve_main.o gve_tx.o gve_tx_dqo.o gve_rx.o gve_rx_dqo.o gve_ethtool.o gve_adminq.o gve_utils.o gve_flow_rule.o \
gve_buffer_mgmt_dqo.o