Message ID: 20230622215824.2173343-1-i.maximets@ovn.org
State: New
Series: net: add initial support for AF_XDP network backend
On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> AF_XDP is a network socket family that allows communication directly with the network device driver in the kernel, bypassing most or all of the kernel networking stack. In essence, the technology is pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native and works with any network interface without driver modifications. Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't require access to character devices or unix sockets. Only access to the network interface itself is necessary.
>
> This patch implements a network backend that communicates with the kernel by creating an AF_XDP socket. A chunk of userspace memory is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, Fill and Completion) are placed in that memory along with a pool of memory buffers for the packet data. Data transmission is done by allocating one of the buffers, copying packet data into it and placing the pointer into the Tx ring. After transmission, the device will return the buffer via the Completion ring. On Rx, the device will take a buffer from a pre-populated Fill ring, write the packet data into it and place the buffer into the Rx ring.
>
> The AF_XDP network backend takes on the communication with the host kernel and the network interface and forwards packets to/from the peer device in QEMU.
>
> Usage example:
>
>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>
> An XDP program bridges the socket with a network interface. It can be attached to the interface in 2 different modes:
>
> 1. skb - this mode should work for any interface and doesn't require driver support, with the caveat of lower performance.
>
> 2. native - this mode requires support from the driver and allows bypassing skb allocation in the kernel and potentially using zero-copy while getting packets in/out of userspace.
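[Editor's aside: the buffer life cycle described in the quoted text — Fill → Rx on receive, allocate → Tx → Completion on transmit — can be modeled with plain queues. This is a simplified illustration of the ring choreography, not the real XSK ring ABI or QEMU code.]

```python
from collections import deque

class XskModel:
    """Toy model of one AF_XDP socket: a pool of UMEM chunks plus the
    four rings (Fill, Rx, Tx, Completion) described in the patch text."""

    def __init__(self, num_chunks):
        # UMEM chunk identifiers; in reality these are offsets into
        # the shared userspace memory region.
        self.free = deque(range(num_chunks))
        self.fill = deque()        # userspace -> kernel: empty Rx buffers
        self.rx = deque()          # kernel -> userspace: received packets
        self.tx = deque()          # userspace -> kernel: packets to send
        self.completion = deque()  # kernel -> userspace: sent, reusable
        self.data = {}

    # --- userspace (QEMU backend) side ---------------------------------
    def refill(self):
        """Pre-populate the Fill ring so the device has Rx buffers."""
        while self.free:
            self.fill.append(self.free.popleft())

    def send(self, payload):
        """Tx: allocate a chunk, copy the packet in, post to the Tx ring."""
        addr = self.free.popleft()
        self.data[addr] = payload
        self.tx.append(addr)

    def reap(self):
        """Recycle chunks the device finished transmitting."""
        while self.completion:
            self.free.append(self.completion.popleft())

    def receive(self):
        """Rx: drain the Rx ring, returning payloads and recycling chunks."""
        out = []
        while self.rx:
            addr = self.rx.popleft()
            out.append(self.data.pop(addr))
            self.free.append(addr)
        return out

    # --- "device" side (what the kernel/NIC does) -----------------------
    def device_rx(self, payload):
        """Take a buffer from the Fill ring, write the packet into it,
        place the buffer on the Rx ring."""
        addr = self.fill.popleft()
        self.data[addr] = payload
        self.rx.append(addr)

    def device_tx(self):
        """'Transmit' Tx descriptors and return them via Completion."""
        while self.tx:
            addr = self.tx.popleft()
            self.data.pop(addr)
            self.completion.append(addr)
```

For example, `m = XskModel(4); m.refill(); m.device_rx(b"pkt"); m.receive()` walks one packet through the Fill/Rx pair, and `send()`/`device_tx()`/`reap()` walks the Tx/Completion pair.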
> By default, QEMU will try to use native mode and fall back to skb. Mode can be forced via the 'mode' option. To force 'copy' even in native mode, use the 'force-copy=on' option. This might be useful if there is some issue with the driver.
>
> Option 'queues=N' allows specifying how many device queues should be open. Note that all the queues that are not open are still functional and can receive traffic, but it will not be delivered to QEMU. So, the number of device queues should generally match the QEMU configuration, unless the device is shared with something else and the traffic re-direction to the appropriate queues is correctly configured on the device level (e.g. with ethtool -N). The 'start-queue=M' option can be used to specify from which queue id QEMU should start configuring 'N' queues. It might also be necessary to use this option with certain NICs, e.g. MLX5 NICs. See the docs for examples.
>
> In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN capabilities in order to load default XSK/XDP programs to the network interface and configure BTF maps.

I think you mean "BPF" actually?

> It is possible, however, to run only with CAP_NET_RAW.

QEMU often runs without any privileges, so we need to fix it.

I think adding support for SCM_RIGHTS via the monitor would be the way to go.

> For that to work, an external process with admin capabilities will need to pre-load the default XSK program and pass an open file descriptor for this program's 'xsks_map' to the QEMU process on startup. The network backend will need to be configured with 'inhibit=on' to avoid loading of the programs. The file descriptor for 'xsks_map' can be passed via the 'xsks-map-fd=N' option.
>
> There are a few performance challenges with the current network backends.
>
> First is that they do not support IO threads.

The current networking code needs some major refactoring to support IO threads, which I'm not sure is worthwhile.
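[Editor's aside: both approaches mentioned above — passing an 'xsks_map' fd at startup, or passing fds via the monitor — rest on the same kernel mechanism: SCM_RIGHTS ancillary data over a Unix-domain socket. A minimal self-contained sketch of that mechanism; the helper names are illustrative, not QEMU code.]

```python
import array
import os
import socket

def send_fd(sock, fd):
    """Send one open file descriptor over a Unix-domain socket."""
    sock.sendmsg([b"fd"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                            array.array("i", [fd]))])

def recv_fd(sock):
    """Receive one descriptor; the kernel installs it as a fresh fd."""
    fd_size = array.array("i").itemsize
    msg, ancdata, flags, addr = sock.recvmsg(16, socket.CMSG_LEN(fd_size))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            return array.array("i", cdata)[0]
    raise RuntimeError("no fd received")

# Demo: a "privileged" side hands an open fd to an "unprivileged" side.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
r, w = os.pipe()
send_fd(parent, w)          # privileged side sends the write end
received = recv_fd(child)   # unprivileged side gets its own fd for it
os.write(received, b"hello")
os.close(received)
os.close(w)
data = os.read(r, 5)
print(data)  # b'hello'
```

In the QEMU case the descriptor passed this way would be the BPF map (or, per the later discussion, a pre-created AF_XDP socket) instead of a pipe end.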
> This means that the data path is handled by the main thread in QEMU and may slow down other work or may be slowed down by some other work. This also means that taking advantage of multi-queue is generally not possible today.
>
> Another thing is that the data path goes through the device emulation code, which is not really optimized for performance. The fastest "frontend" device is virtio-net. But it's not optimized for heavy traffic either, because it expects such use-cases to be handled via some implementation of vhost (user, kernel, vdpa). In practice, we have virtio notifications and rcu lock/unlock on a per-packet basis and not very efficient accesses to the guest memory. Communication channels between backend and frontend devices do not allow passing more than one packet at a time as well.
>
> Some of these challenges can be avoided in the future by adding better batching into device emulation or by implementing a vhost-af-xdp variant.

It might require you to register (pin) the whole guest memory to XSK, or there could be a copy. Both of them are sub-optimal.

A really interesting project is to do AF_XDP passthrough; then we don't need to care about pinning and copying and we will get ultra speed in the guest. (But again, it might need BPF support in virtio-net.)

> There are also a few kernel limitations. AF_XDP sockets do not support any kinds of checksum or segmentation offloading. Buffers are limited to a page size (4K), i.e. MTU is limited. Multi-buffer support is not implemented for AF_XDP today. Also, transmission in all non-zero-copy modes is synchronous, i.e. done in a syscall. That doesn't allow high packet rates on virtual interfaces.
>
> However, keeping in mind all of these challenges, the current implementation of the AF_XDP backend shows decent performance while running on top of a physical NIC with zero-copy support.
>
> Test setup:
>
> 2 VMs running on 2 physical hosts connected via a ConnectX6-Dx card.
> The network backend is configured to open the NIC directly in native mode. The driver supports zero-copy. The NIC is configured to use 1 queue.
>
> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd for PPS testing.
>
> iperf3 result:
>   TCP stream      : 19.1 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>   Tx only         : 3.4 Mpps
>   Rx only         : 2.0 Mpps
>   L2 FWD Loopback : 1.5 Mpps

I don't object to merging this backend (considering we've already merged netmap) once the code is fine, but the numbers are not amazing, so I wonder what the use case for this backend is?

Thanks
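[Editor's aside: the "Buffers are limited to a page size (4K), i.e. MTU is limited" point above has a simple back-of-envelope form. Without multi-buffer support, one packet must fit in one UMEM chunk. The constants below are the common defaults (4 KiB chunk, 256 B of XDP headroom); the exact reserved space varies by driver, so treat this as an estimate, not ABI.]

```python
# Why a single-page UMEM chunk caps the MTU for AF_XDP (no multi-buffer).
CHUNK_SIZE = 4096            # default UMEM frame/chunk size (one page)
XDP_PACKET_HEADROOM = 256    # headroom the kernel reserves before the frame
ETH_HLEN = 14                # Ethernet header, not counted in the MTU

max_frame = CHUNK_SIZE - XDP_PACKET_HEADROOM   # largest on-wire frame
max_mtu = max_frame - ETH_HLEN                 # largest L3 payload
print(max_frame, max_mtu)  # 3840 3826
```

So even before driver-specific reservations, the MTU ceiling sits well below jumbo-frame sizes (9000 B), which is the limitation the patch text refers to.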
On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > [...]
> >
> > dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >   Tx only         : 3.4 Mpps
> >   Rx only         : 2.0 Mpps
> >   L2 FWD Loopback : 1.5 Mpps
>
> I don't object to merging this backend (considering we've already merged netmap) once the code is fine, but the numbers are not amazing, so I wonder what the use case for this backend is?

A more ambitious method is to reuse DPDK via dedicated threads; then we can reuse any of its PMDs, like AF_XDP.

Thanks

> Thanks
On 6/26/23 08:32, Jason Wang wrote:
> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>
>>> [...]
>>>
>>> In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN capabilities in order to load default XSK/XDP programs to the network interface and configure BTF maps.
>>
>> I think you mean "BPF" actually?

"BPF Type Format maps" kind of makes some sense, but yes. :)

>>> It is possible, however, to run only with CAP_NET_RAW.
>>
>> QEMU often runs without any privileges, so we need to fix it.
>>
>> I think adding support for SCM_RIGHTS via the monitor would be the way to go.

I looked through the code, and it seems like we can run completely non-privileged as far as the kernel is concerned. We'll need an API modification in libxdp though.

The thing is, IIUC, the only syscall that requires CAP_NET_RAW is the base socket creation. Binding and other configuration doesn't require any privileges. So, we could create a socket externally and pass it to QEMU. Should work, unless it's an oversight on the kernel side that needs to be patched. :) libxdp doesn't have a way to specify an externally created socket today, so we'll need to change that. Should be easy to do though. I can explore.

In case the bind syscall actually needs CAP_NET_RAW for some reason, we could change the kernel and allow non-privileged bind by utilizing, e.g., SO_BINDTODEVICE. I.e., let the privileged process bind the socket to a particular device, so QEMU can't bind it to a random one. Might be a good use case to allow even if not strictly necessary.

>>> [...]
>>>
>>> Some of these challenges can be avoided in the future by adding better batching into device emulation or by implementing a vhost-af-xdp variant.
>>
>> It might require you to register (pin) the whole guest memory to XSK, or there could be a copy. Both of them are sub-optimal.

A single copy by itself shouldn't be a huge problem, right? vhost-user and -kernel do copy packets.

>> A really interesting project is to do AF_XDP passthrough; then we don't need to care about pinning and copying and we will get ultra speed in the guest. (But again, it might need BPF support in virtio-net.)

I suppose, if we're doing pass-through, we need a new device type and a driver in the kernel/dpdk. There is no point pretending it's a virtio-net and translating between different ring layouts. Or is there?

>> I don't object to merging this backend (considering we've already merged netmap) once the code is fine, but the numbers are not amazing, so I wonder what the use case for this backend is?

I don't think there is a use case right now that would significantly benefit from the current implementation, so I'm fine if the merge is postponed.

It is noticeably more performant than a tap with vhost=on in terms of PPS. So, that might be one case. Taking into account that just the rcu lock and unlock in the virtio-net code take more time than a packet copy, some batching on the QEMU side should improve performance significantly. And it shouldn't be too hard to implement.

Performance over virtual interfaces may potentially be improved by creating a kernel thread for async Tx, similarly to what io_uring allows. Currently, Tx on non-zero-copy interfaces is synchronous, and that doesn't allow scaling well.

So, I do think that there is potential in this backend.

The main benefit, assuming we can reach performance comparable with other high-performance backends (vhost-user), I think, is the fact that it's Linux-native and doesn't require talking with any other devices (like chardevs/sockets), except for the network interface itself. I.e., it could be easier to manage in complex environments.

> A more ambitious method is to reuse DPDK via dedicated threads; then we can reuse any of its PMDs, like AF_XDP.

Linking with DPDK will make configuration much more complex. I don't think it makes sense to bring it in for AF_XDP specifically. Might be a separate project though, sure.

Best regards, Ilya Maximets.
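[Editor's aside: the batching argument above — per-packet rcu lock/unlock costing more than the copy, so batching should pay off — comes down to amortizing a fixed per-call cost over many packets. A toy model counting synchronization round-trips, not QEMU code.]

```python
def deliver(num_packets, batch_size):
    """Count lock/notification round-trips needed to deliver packets
    when the fixed cost is paid once per batch instead of once per packet."""
    lock_ops = 0
    delivered = 0
    while delivered < num_packets:
        batch = min(batch_size, num_packets - delivered)
        lock_ops += 1          # one rcu/lock/notify round-trip per batch
        delivered += batch
    return lock_ops

# Per-packet delivery pays the fixed cost 10000 times; batches of 64
# pay it ceil(10000/64) = 157 times for the same traffic.
print(deliver(10_000, 1), deliver(10_000, 64))
```

If the fixed cost dominates the per-packet copy, as the text claims for virtio-net's rcu lock/unlock, the ~64x reduction in round-trips translates almost directly into throughput.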
On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/26/23 08:32, Jason Wang wrote: > > On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > >> > >> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>> > >>> AF_XDP is a network socket family that allows communication directly > >>> with the network device driver in the kernel, bypassing most or all > >>> of the kernel networking stack. In the essence, the technology is > >>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native > >>> and works with any network interfaces without driver modifications. > >>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't > >>> require access to character devices or unix sockets. Only access to > >>> the network interface itself is necessary. > >>> > >>> This patch implements a network backend that communicates with the > >>> kernel by creating an AF_XDP socket. A chunk of userspace memory > >>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, > >>> Fill and Completion) are placed in that memory along with a pool of > >>> memory buffers for the packet data. Data transmission is done by > >>> allocating one of the buffers, copying packet data into it and > >>> placing the pointer into Tx ring. After transmission, device will > >>> return the buffer via Completion ring. On Rx, device will take > >>> a buffer form a pre-populated Fill ring, write the packet data into > >>> it and place the buffer into Rx ring. > >>> > >>> AF_XDP network backend takes on the communication with the host > >>> kernel and the network interface and forwards packets to/from the > >>> peer device in QEMU. > >>> > >>> Usage example: > >>> > >>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C > >>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 > >>> > >>> XDP program bridges the socket with a network interface. 
It can be > >>> attached to the interface in 2 different modes: > >>> > >>> 1. skb - this mode should work for any interface and doesn't require > >>> driver support. With a caveat of lower performance. > >>> > >>> 2. native - this does require support from the driver and allows to > >>> bypass skb allocation in the kernel and potentially use > >>> zero-copy while getting packets in/out userspace. > >>> > >>> By default, QEMU will try to use native mode and fall back to skb. > >>> Mode can be forced via 'mode' option. To force 'copy' even in native > >>> mode, use 'force-copy=on' option. This might be useful if there is > >>> some issue with the driver. > >>> > >>> Option 'queues=N' allows to specify how many device queues should > >>> be open. Note that all the queues that are not open are still > >>> functional and can receive traffic, but it will not be delivered to > >>> QEMU. So, the number of device queues should generally match the > >>> QEMU configuration, unless the device is shared with something > >>> else and the traffic re-direction to appropriate queues is correctly > >>> configured on a device level (e.g. with ethtool -N). > >>> 'start-queue=M' option can be used to specify from which queue id > >>> QEMU should start configuring 'N' queues. It might also be necessary > >>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs > >>> for examples. > >>> > >>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN > >>> capabilities in order to load default XSK/XDP programs to the > >>> network interface and configure BTF maps. > >> > >> I think you mean "BPF" actually? > > "BPF Type Format maps" kind of makes some sense, but yes. :) > > >> > >>> It is possible, however, > >>> to run only with CAP_NET_RAW. > >> > >> Qemu often runs without any privileges, so we need to fix it. > >> > >> I think adding support for SCM_RIGHTS via monitor would be a way to go. 
> > I looked through the code and it seems like we can run completely > non-privileged as far as kernel concerned. We'll need an API > modification in libxdp though. > > The thing is, IIUC, the only syscall that requires CAP_NET_RAW is > a base socket creation. Binding and other configuration doesn't > require any privileges. So, we could create a socket externally > and pass it to QEMU. That's the way TAP works for example. > Should work, unless it's an oversight from > the kernel side that needs to be patched. :) libxdp doesn't have > a way to specify externally created socket today, so we'll need > to change that. Should be easy to do though. I can explore. Please do that. > > In case the bind syscall will actually need CAP_NET_RAW for some > reason, we could change the kernel and allow non-privileged bind > by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged > process bind the socket to a particular device, so QEMU can't > bind it to a random one. Might be a good use case to allow even > if not strictly necessary. Yes. > > >> > >> > >>> For that to work, an external process > >>> with admin capabilities will need to pre-load default XSK program > >>> and pass an open file descriptor for this program's 'xsks_map' to > >>> QEMU process on startup. Network backend will need to be configured > >>> with 'inhibit=on' to avoid loading of the programs. The file > >>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option. > >>> > >>> There are few performance challenges with the current network backends. > >>> > >>> First is that they do not support IO threads. > >> > >> The current networking codes needs some major recatoring to support IO > >> threads which I'm not sure is worthwhile. > >> > >>> This means that data > >>> path is handled by the main thread in QEMU and may slow down other > >>> work or may be slowed down by some other work. This also means that > >>> taking advantage of multi-queue is generally not possible today. 
> >>> > >>> Another thing is that data path is going through the device emulation > >>> code, which is not really optimized for performance. The fastest > >>> "frontend" device is virtio-net. But it's not optimized for heavy > >>> traffic either, because it expects such use-cases to be handled via > >>> some implementation of vhost (user, kernel, vdpa). In practice, we > >>> have virtio notifications and rcu lock/unlock on a per-packet basis > >>> and not very efficient accesses to the guest memory. Communication > >>> channels between backend and frontend devices do not allow passing > >>> more than one packet at a time as well. > >>> > >>> Some of these challenges can be avoided in the future by adding better > >>> batching into device emulation or by implementing vhost-af-xdp variant. > >> > >> It might require you to register(pin) the whole guest memory to XSK or > >> there could be a copy. Both of them are sub-optimal. > > A single copy by itself shouldn't be a huge problem, right? Probably. > vhost-user and -kernel do copy packets. > > >> > >> A really interesting project is to do AF_XDP passthrough, then we > >> don't need to care about pin and copy and we will get ultra speed in > >> the guest. (But again, it might needs BPF support in virtio-net). > > I suppose, if we're doing pass-through we need a new device type and a > driver in the kernel/dpdk. There is no point pretending it's a > virtio-net and translating between different ring layouts. Yes. > Or is there? > > >> > >>> > >>> There are also a few kernel limitations. AF_XDP sockets do not > >>> support any kinds of checksum or segmentation offloading. Buffers > >>> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer > >>> support is not implemented for AF_XDP today. Also, transmission in > >>> all non-zero-copy modes is synchronous, i.e. done in a syscall. > >>> That doesn't allow high packet rates on virtual interfaces. 
> >>> > >>> However, keeping in mind all of these challenges, current implementation > >>> of the AF_XDP backend shows a decent performance while running on top > >>> of a physical NIC with zero-copy support. > >>> > >>> Test setup: > >>> > >>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card. > >>> Network backend is configured to open the NIC directly in native mode. > >>> The driver supports zero-copy. NIC is configured to use 1 queue. > >>> > >>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd > >>> for PPS testing. > >>> > >>> iperf3 result: > >>> TCP stream : 19.1 Gbps > >>> > >>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: > >>> Tx only : 3.4 Mpps > >>> Rx only : 2.0 Mpps > >>> L2 FWD Loopback : 1.5 Mpps > >> > >> I don't object to merging this backend (considering we've already > >> merged netmap) once the code is fine, but the number is not amazing so > >> I wonder what is the use case for this backend? > > I don't think there is a use case right now that would significantly benefit > from the current implementation, so I'm fine if the merge is postponed. Just to be clear, I don't want to postpone this if we decide to invest/enhance it. I will go through the codes and get back. > It is noticeably more performant than a tap with vhost=on in terms of PPS. > So, that might be one case. Taking into account that just rcu lock and > unlock in virtio-net code takes more time than a packet copy, some batching > on QEMU side should improve performance significantly. And it shouldn't be > too hard to implement. > > Performance over virtual interfaces may potentially be improved by creating > a kernel thread for async Tx. Similarly to what io_uring allows. Currently > Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > scale well. 
Interestingly, there is actually a lot of "duplication" between io_uring and AF_XDP: 1) both have a similar memory model (user-registered memory) 2) both use rings for communication. I wonder if we can let io_uring talk directly to AF_XDP. > > So, I do think that there is a potential in this backend. > > The main benefit, assuming we can reach performance comparable with other > high-performance backends (vhost-user), I think, is the fact that it's > Linux-native and doesn't require talking with any other devices > (like chardevs/sockets), except for a network interface itself. i.e. it > could be easier to manage in complex environments. Yes. > > > A more ambitious method is to reuse DPDK via dedicated threads, then > > we can reuse any of its PMD like AF_XDP. > > Linking with DPDK will make configuration much more complex. I don't > think it makes sense to bring it in for AF_XDP specifically. Might be > a separate project though, sure. Right. Thanks > > Best regards, Ilya Maximets. >
Can multiple VMs share a host netdev by filtering incoming traffic based on each VM's MAC address and directing it to the appropriate XSK? If yes, then I think AF_XDP is interesting when SR-IOV or similar hardware features are not available. The idea of an AF_XDP passthrough device seems interesting because it would minimize the overhead and avoid some of the existing software limitations (mostly in QEMU's networking subsystem) that you described. I don't know whether the AF_XDP API is suitable or can be extended to build a hardware emulation interface, but it seems plausible. When Stefano Garzarella played with io_uring passthrough into the guest, one of the issues was guest memory translation (since the guest doesn't use host userspace virtual addresses). I guess AF_XDP would need an API for adding/removing memory translations or operate in a mode where addresses are relative offsets from the start of the umem regions (but this may be impractical if it limits where the guest can allocate packet payload buffers). Whether you pursue the passthrough approach or not, making -netdev af-xdp work in an environment where QEMU runs unprivileged seems like the most important practical issue to solve. Stefan
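One plausible way to realize the MAC-based sharing asked about above is n-tuple flow steering plus per-VM queue ranges. This is an illustrative sketch only: the interface name, MAC addresses, and queue numbers are made up, and whether `ethtool -N` supports `flow-type ether` filters depends on the NIC and driver:

```shell
# Hypothetical setup: NIC ens6f1np1 with 4 queues, shared by two VMs.
# Steer each VM's destination MAC to its own queue (driver-dependent).
ethtool -N ens6f1np1 flow-type ether dst 00:16:35:af:aa:5c action 2
ethtool -N ens6f1np1 flow-type ether dst 00:16:35:af:aa:5d action 3

# Each QEMU instance then opens only its own queue via start-queue:
qemu-system-x86_64 ... \
    -netdev af-xdp,id=net0,ifname=ens6f1np1,queues=1,start-queue=2
qemu-system-x86_64 ... \
    -netdev af-xdp,id=net0,ifname=ens6f1np1,queues=1,start-queue=3
```

A custom XDP program could implement the same demultiplexing in software where the NIC lacks filter support.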
On 6/27/23 04:54, Jason Wang wrote: > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 6/26/23 08:32, Jason Wang wrote: >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: >>>> >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>> >>>>> AF_XDP is a network socket family that allows communication directly >>>>> with the network device driver in the kernel, bypassing most or all >>>>> of the kernel networking stack. In the essence, the technology is >>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native >>>>> and works with any network interfaces without driver modifications. >>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't >>>>> require access to character devices or unix sockets. Only access to >>>>> the network interface itself is necessary. >>>>> >>>>> This patch implements a network backend that communicates with the >>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory >>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, >>>>> Fill and Completion) are placed in that memory along with a pool of >>>>> memory buffers for the packet data. Data transmission is done by >>>>> allocating one of the buffers, copying packet data into it and >>>>> placing the pointer into Tx ring. After transmission, device will >>>>> return the buffer via Completion ring. On Rx, device will take >>>>> a buffer form a pre-populated Fill ring, write the packet data into >>>>> it and place the buffer into Rx ring. >>>>> >>>>> AF_XDP network backend takes on the communication with the host >>>>> kernel and the network interface and forwards packets to/from the >>>>> peer device in QEMU. >>>>> >>>>> Usage example: >>>>> >>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C >>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 >>>>> >>>>> XDP program bridges the socket with a network interface. 
It can be >>>>> attached to the interface in 2 different modes: >>>>> >>>>> 1. skb - this mode should work for any interface and doesn't require >>>>> driver support. With a caveat of lower performance. >>>>> >>>>> 2. native - this does require support from the driver and allows to >>>>> bypass skb allocation in the kernel and potentially use >>>>> zero-copy while getting packets in/out userspace. >>>>> >>>>> By default, QEMU will try to use native mode and fall back to skb. >>>>> Mode can be forced via 'mode' option. To force 'copy' even in native >>>>> mode, use 'force-copy=on' option. This might be useful if there is >>>>> some issue with the driver. >>>>> >>>>> Option 'queues=N' allows to specify how many device queues should >>>>> be open. Note that all the queues that are not open are still >>>>> functional and can receive traffic, but it will not be delivered to >>>>> QEMU. So, the number of device queues should generally match the >>>>> QEMU configuration, unless the device is shared with something >>>>> else and the traffic re-direction to appropriate queues is correctly >>>>> configured on a device level (e.g. with ethtool -N). >>>>> 'start-queue=M' option can be used to specify from which queue id >>>>> QEMU should start configuring 'N' queues. It might also be necessary >>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs >>>>> for examples. >>>>> >>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN >>>>> capabilities in order to load default XSK/XDP programs to the >>>>> network interface and configure BTF maps. >>>> >>>> I think you mean "BPF" actually? >> >> "BPF Type Format maps" kind of makes some sense, but yes. :) >> >>>> >>>>> It is possible, however, >>>>> to run only with CAP_NET_RAW. >>>> >>>> Qemu often runs without any privileges, so we need to fix it. >>>> >>>> I think adding support for SCM_RIGHTS via monitor would be a way to go. 
>> >> I looked through the code and it seems like we can run completely >> non-privileged as far as kernel concerned. We'll need an API >> modification in libxdp though. >> >> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is >> a base socket creation. Binding and other configuration doesn't >> require any privileges. So, we could create a socket externally >> and pass it to QEMU. > > That's the way TAP works for example. > >> Should work, unless it's an oversight from >> the kernel side that needs to be patched. :) libxdp doesn't have >> a way to specify externally created socket today, so we'll need >> to change that. Should be easy to do though. I can explore. > > Please do that. I have a prototype: https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3 Need to test it out and then submit a PR to the xdp-tools project. > >> >> In case the bind syscall will actually need CAP_NET_RAW for some >> reason, we could change the kernel and allow non-privileged bind >> by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged >> process bind the socket to a particular device, so QEMU can't >> bind it to a random one. Might be a good use case to allow even >> if not strictly necessary. > > Yes. Will propose something for the kernel as well. We might want something more granular though, e.g. binding to a queue instead of a device, in case we want better control in the device-sharing scenario. > >> >>>> >>>> >>>>> For that to work, an external process >>>>> with admin capabilities will need to pre-load default XSK program >>>>> and pass an open file descriptor for this program's 'xsks_map' to >>>>> QEMU process on startup. Network backend will need to be configured >>>>> with 'inhibit=on' to avoid loading of the programs. The file >>>>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option. >>>>> >>>>> There are few performance challenges with the current network backends. 
>>>>> >>>>> First is that they do not support IO threads. >>>> >>>> The current networking codes needs some major recatoring to support IO >>>> threads which I'm not sure is worthwhile. >>>> >>>>> This means that data >>>>> path is handled by the main thread in QEMU and may slow down other >>>>> work or may be slowed down by some other work. This also means that >>>>> taking advantage of multi-queue is generally not possible today. >>>>> >>>>> Another thing is that data path is going through the device emulation >>>>> code, which is not really optimized for performance. The fastest >>>>> "frontend" device is virtio-net. But it's not optimized for heavy >>>>> traffic either, because it expects such use-cases to be handled via >>>>> some implementation of vhost (user, kernel, vdpa). In practice, we >>>>> have virtio notifications and rcu lock/unlock on a per-packet basis >>>>> and not very efficient accesses to the guest memory. Communication >>>>> channels between backend and frontend devices do not allow passing >>>>> more than one packet at a time as well. >>>>> >>>>> Some of these challenges can be avoided in the future by adding better >>>>> batching into device emulation or by implementing vhost-af-xdp variant. >>>> >>>> It might require you to register(pin) the whole guest memory to XSK or >>>> there could be a copy. Both of them are sub-optimal. >> >> A single copy by itself shouldn't be a huge problem, right? > > Probably. > >> vhost-user and -kernel do copy packets. >> >>>> >>>> A really interesting project is to do AF_XDP passthrough, then we >>>> don't need to care about pin and copy and we will get ultra speed in >>>> the guest. (But again, it might needs BPF support in virtio-net). >> >> I suppose, if we're doing pass-through we need a new device type and a >> driver in the kernel/dpdk. There is no point pretending it's a >> virtio-net and translating between different ring layouts. > > Yes. > >> Or is there? 
>> >>>> >>>>> >>>>> There are also a few kernel limitations. AF_XDP sockets do not >>>>> support any kinds of checksum or segmentation offloading. Buffers >>>>> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer >>>>> support is not implemented for AF_XDP today. Also, transmission in >>>>> all non-zero-copy modes is synchronous, i.e. done in a syscall. >>>>> That doesn't allow high packet rates on virtual interfaces. >>>>> >>>>> However, keeping in mind all of these challenges, current implementation >>>>> of the AF_XDP backend shows a decent performance while running on top >>>>> of a physical NIC with zero-copy support. >>>>> >>>>> Test setup: >>>>> >>>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card. >>>>> Network backend is configured to open the NIC directly in native mode. >>>>> The driver supports zero-copy. NIC is configured to use 1 queue. >>>>> >>>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd >>>>> for PPS testing. >>>>> >>>>> iperf3 result: >>>>> TCP stream : 19.1 Gbps >>>>> >>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: >>>>> Tx only : 3.4 Mpps >>>>> Rx only : 2.0 Mpps >>>>> L2 FWD Loopback : 1.5 Mpps >>>> >>>> I don't object to merging this backend (considering we've already >>>> merged netmap) once the code is fine, but the number is not amazing so >>>> I wonder what is the use case for this backend? >> >> I don't think there is a use case right now that would significantly benefit >> from the current implementation, so I'm fine if the merge is postponed. > > Just to be clear, I don't want to postpone this if we decide to > invest/enhance it. I will go through the codes and get back. Ack. Thanks. > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. >> So, that might be one case. 
Taking into account that just rcu lock and >> unlock in virtio-net code takes more time than a packet copy, some batching >> on QEMU side should improve performance significantly. And it shouldn't be >> too hard to implement. >> >> Performance over virtual interfaces may potentially be improved by creating >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to >> scale well. > > Interestingly, actually, there are a lot of "duplication" between > io_uring and AF_XDP: > > 1) both have similar memory model (user register) > 2) both use ring for communication > > I wonder if we can let io_uring talks directly to AF_XDP. Well, if we submit poll() in the QEMU main loop via io_uring, then we can avoid the cost of the synchronous Tx for non-zero-copy modes, i.e. for virtual interfaces. The io_uring thread in the kernel will be able to perform the transmission for us. But yeah, there are way too many way too similar ring buffer interfaces in the kernel. > >> >> So, I do think that there is a potential in this backend. >> >> The main benefit, assuming we can reach performance comparable with other >> high-performance backends (vhost-user), I think, is the fact that it's >> Linux-native and doesn't require talking with any other devices >> (like chardevs/sockets), except for a network interface itself. i.e. it >> could be easier to manage in complex environments. > > Yes. > >> >>> A more ambitious method is to reuse DPDK via dedicated threads, then >>> we can reuse any of its PMD like AF_XDP. >> >> Linking with DPDK will make configuration much more complex. I don't >> think it makes sense to bring it in for AF_XDP specifically. Might be >> a separate project though, sure. > > Right. > > Thanks > >> >> Best regards, Ilya Maximets. >> >
On 6/27/23 10:56, Stefan Hajnoczi wrote: > Can multiple VMs share a host netdev by filtering incoming traffic > based on each VM's MAC address and directing it to the appropriate > XSK? If yes, then I think AF_XDP is interesting when SR-IOV or similar > hardware features are not available. Good point. Thanks! Yes, they can. Traffic can be re-directed via 'ethtool -N' similarly to the example in the patch. Or, potentially, via a custom XDP program. Then different QEMU instances may use different start-queue arguments and use their own range of queues this way. > > The idea of an AF_XDP passthrough device seems interesting because it > would minimize the overhead and avoid some of the existing software > limitations (mostly in QEMU's networking subsystem) that you > described. I don't know whether the AF_XDP API is suitable or can be > extended to build a hardware emulation interface, but it seems > plausible. When Stefano Garzarella played with io_uring passthrough > into the guest, one of the issues was guest memory translation (since > the guest doesn't use host userspace virtual addresses). I guess > AF_XDP would need an API for adding/removing memory translations or > operate in a mode where addresses are relative offsets from the start > of the umem regions Actually, addresses in AF_XDP rings are already offsets from the start of the umem region. For example, xsk_umem__get_data is implemented as &((char *)umem_area)[addr]; inside libxdp. So, that should not be an issue. > (but this may be impractical if it limits where > the guest can allocate packet payload buffers). Yeah, we will either need to: a. register the whole guest memory as umem and offset buffer pointers in the guest driver by the start of guest physical memory. (I'm not very familiar with the QEMU memory subsystem. Does guest physical memory always start at 0? I know that it's not always true for real hardware.) b. 
or require the guest driver to allocate a chunk of aligned contiguous memory and copy all the packets there on Tx. And populate the Fill ring only with buffers from that area. Assuming guest pages align with the host pages. Again, a single copy might not be that bad, but it's hard to tell what the actual impact will be without testing. > > Whether you pursue the passthrough approach or not, making -netdev > af-xdp work in an environment where QEMU runs unprivileged seems like > the most important practical issue to solve. Yes, working on it. Doesn't seem to be hard to do, but I need to test. Best regards, Ilya Maximets.
On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/27/23 04:54, Jason Wang wrote: > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >> > >> On 6/26/23 08:32, Jason Wang wrote: > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > >>>> > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>> > >>>>> AF_XDP is a network socket family that allows communication directly > >>>>> with the network device driver in the kernel, bypassing most or all > >>>>> of the kernel networking stack. In the essence, the technology is > >>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native > >>>>> and works with any network interfaces without driver modifications. > >>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't > >>>>> require access to character devices or unix sockets. Only access to > >>>>> the network interface itself is necessary. > >>>>> > >>>>> This patch implements a network backend that communicates with the > >>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory > >>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, > >>>>> Fill and Completion) are placed in that memory along with a pool of > >>>>> memory buffers for the packet data. Data transmission is done by > >>>>> allocating one of the buffers, copying packet data into it and > >>>>> placing the pointer into Tx ring. After transmission, device will > >>>>> return the buffer via Completion ring. On Rx, device will take > >>>>> a buffer form a pre-populated Fill ring, write the packet data into > >>>>> it and place the buffer into Rx ring. > >>>>> > >>>>> AF_XDP network backend takes on the communication with the host > >>>>> kernel and the network interface and forwards packets to/from the > >>>>> peer device in QEMU. 
> >>>>> > >>>>> Usage example: > >>>>> > >>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C > >>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 > >>>>> > >>>>> XDP program bridges the socket with a network interface. It can be > >>>>> attached to the interface in 2 different modes: > >>>>> > >>>>> 1. skb - this mode should work for any interface and doesn't require > >>>>> driver support. With a caveat of lower performance. > >>>>> > >>>>> 2. native - this does require support from the driver and allows to > >>>>> bypass skb allocation in the kernel and potentially use > >>>>> zero-copy while getting packets in/out userspace. > >>>>> > >>>>> By default, QEMU will try to use native mode and fall back to skb. > >>>>> Mode can be forced via 'mode' option. To force 'copy' even in native > >>>>> mode, use 'force-copy=on' option. This might be useful if there is > >>>>> some issue with the driver. > >>>>> > >>>>> Option 'queues=N' allows to specify how many device queues should > >>>>> be open. Note that all the queues that are not open are still > >>>>> functional and can receive traffic, but it will not be delivered to > >>>>> QEMU. So, the number of device queues should generally match the > >>>>> QEMU configuration, unless the device is shared with something > >>>>> else and the traffic re-direction to appropriate queues is correctly > >>>>> configured on a device level (e.g. with ethtool -N). > >>>>> 'start-queue=M' option can be used to specify from which queue id > >>>>> QEMU should start configuring 'N' queues. It might also be necessary > >>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs > >>>>> for examples. > >>>>> > >>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN > >>>>> capabilities in order to load default XSK/XDP programs to the > >>>>> network interface and configure BTF maps. > >>>> > >>>> I think you mean "BPF" actually? 
> >> > >> "BPF Type Format maps" kind of makes some sense, but yes. :) > >> > >>>> > >>>>> It is possible, however, > >>>>> to run only with CAP_NET_RAW. > >>>> > >>>> Qemu often runs without any privileges, so we need to fix it. > >>>> > >>>> I think adding support for SCM_RIGHTS via monitor would be a way to go. > >> > >> I looked through the code and it seems like we can run completely > >> non-privileged as far as kernel concerned. We'll need an API > >> modification in libxdp though. > >> > >> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is > >> a base socket creation. Binding and other configuration doesn't > >> require any privileges. So, we could create a socket externally > >> and pass it to QEMU. > > > > That's the way TAP works for example. > > > >> Should work, unless it's an oversight from > >> the kernel side that needs to be patched. :) libxdp doesn't have > >> a way to specify externally created socket today, so we'll need > >> to change that. Should be easy to do though. I can explore. > > > > Please do that. > > I have a prototype: > https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3 > > Need to test it out and then submit PR to xdp-tools project. > > > > > >> > >> In case the bind syscall will actually need CAP_NET_RAW for some > >> reason, we could change the kernel and allow non-privileged bind > >> by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged > >> process bind the socket to a particular device, so QEMU can't > >> bind it to a random one. Might be a good use case to allow even > >> if not strictly necessary. > > > > Yes. > > Will propose something for a kernel as well. We might want something > more granular though, e.g. bind to a queue instead of a device. In > case we want better control in the device sharing scenario. I may be missing something, but the bind is already done at dev-plus-queue granularity right now, isn't it? 
> > > > >> > >>>> > >>>> > >>>>> For that to work, an external process > >>>>> with admin capabilities will need to pre-load default XSK program > >>>>> and pass an open file descriptor for this program's 'xsks_map' to > >>>>> QEMU process on startup. Network backend will need to be configured > >>>>> with 'inhibit=on' to avoid loading of the programs. The file > >>>>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option. > >>>>> > >>>>> There are few performance challenges with the current network backends. > >>>>> > >>>>> First is that they do not support IO threads. > >>>> > >>>> The current networking codes needs some major recatoring to support IO > >>>> threads which I'm not sure is worthwhile. > >>>> > >>>>> This means that data > >>>>> path is handled by the main thread in QEMU and may slow down other > >>>>> work or may be slowed down by some other work. This also means that > >>>>> taking advantage of multi-queue is generally not possible today. > >>>>> > >>>>> Another thing is that data path is going through the device emulation > >>>>> code, which is not really optimized for performance. The fastest > >>>>> "frontend" device is virtio-net. But it's not optimized for heavy > >>>>> traffic either, because it expects such use-cases to be handled via > >>>>> some implementation of vhost (user, kernel, vdpa). In practice, we > >>>>> have virtio notifications and rcu lock/unlock on a per-packet basis > >>>>> and not very efficient accesses to the guest memory. Communication > >>>>> channels between backend and frontend devices do not allow passing > >>>>> more than one packet at a time as well. > >>>>> > >>>>> Some of these challenges can be avoided in the future by adding better > >>>>> batching into device emulation or by implementing vhost-af-xdp variant. > >>>> > >>>> It might require you to register(pin) the whole guest memory to XSK or > >>>> there could be a copy. Both of them are sub-optimal. 
> >> > >> A single copy by itself shouldn't be a huge problem, right? > > > > Probably. > > > >> vhost-user and -kernel do copy packets. > >> > >>>> > >>>> A really interesting project is to do AF_XDP passthrough, then we > >>>> don't need to care about pin and copy and we will get ultra speed in > >>>> the guest. (But again, it might needs BPF support in virtio-net). > >> > >> I suppose, if we're doing pass-through we need a new device type and a > >> driver in the kernel/dpdk. There is no point pretending it's a > >> virtio-net and translating between different ring layouts. > > > > Yes. > > > >> Or is there? > >> > >>>> > >>>>> > >>>>> There are also a few kernel limitations. AF_XDP sockets do not > >>>>> support any kinds of checksum or segmentation offloading. Buffers > >>>>> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer > >>>>> support is not implemented for AF_XDP today. Also, transmission in > >>>>> all non-zero-copy modes is synchronous, i.e. done in a syscall. > >>>>> That doesn't allow high packet rates on virtual interfaces. > >>>>> > >>>>> However, keeping in mind all of these challenges, current implementation > >>>>> of the AF_XDP backend shows a decent performance while running on top > >>>>> of a physical NIC with zero-copy support. > >>>>> > >>>>> Test setup: > >>>>> > >>>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card. > >>>>> Network backend is configured to open the NIC directly in native mode. > >>>>> The driver supports zero-copy. NIC is configured to use 1 queue. > >>>>> > >>>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd > >>>>> for PPS testing. 
> >>>>> > >>>>> iperf3 result: > >>>>> TCP stream : 19.1 Gbps > >>>>> > >>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: > >>>>> Tx only : 3.4 Mpps > >>>>> Rx only : 2.0 Mpps > >>>>> L2 FWD Loopback : 1.5 Mpps > >>>> > >>>> I don't object to merging this backend (considering we've already > >>>> merged netmap) once the code is fine, but the number is not amazing so > >>>> I wonder what is the use case for this backend? > >> > >> I don't think there is a use case right now that would significantly benefit > >> from the current implementation, so I'm fine if the merge is postponed. > > > > Just to be clear, I don't want to postpone this if we decide to > > invest/enhance it. I will go through the codes and get back. > > Ack. Thanks. > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > >> So, that might be one case. Taking into account that just rcu lock and > >> unlock in virtio-net code takes more time than a packet copy, some batching > >> on QEMU side should improve performance significantly. And it shouldn't be > >> too hard to implement. > >> > >> Performance over virtual interfaces may potentially be improved by creating > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > >> scale well. > > > > Interestingly, actually, there are a lot of "duplication" between > > io_uring and AF_XDP: > > > > 1) both have similar memory model (user register) > > 2) both use ring for communication > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > virtual interfaces. io_uring thread in the kernel will be able to > perform transmission for us. It would be nice if we can use iothread/vhost other than the main loop even if io_uring can use kthreads. 
We can avoid the memory translation cost. Thanks > > But yeah, there are way too many way too similar ring buffer interfaces > in the kernel. > > > > >> > >> So, I do think that there is a potential in this backend. > >> > >> The main benefit, assuming we can reach performance comparable with other > >> high-performance backends (vhost-user), I think, is the fact that it's > >> Linux-native and doesn't require talking with any other devices > >> (like chardevs/sockets), except for a network interface itself. i.e. it > >> could be easier to manage in complex environments. > > > > Yes. > > > >> > >>> A more ambitious method is to reuse DPDK via dedicated threads, then > >>> we can reuse any of its PMD like AF_XDP. > >> > >> Linking with DPDK will make configuration much more complex. I don't > >> think it makes sense to bring it in for AF_XDP specifically. Might be > >> a separate project though, sure. > > > > Right. > > > > Thanks > > > >> > >> Best regards, Ilya Maximets. > >> > > >
On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > On 6/27/23 04:54, Jason Wang wrote: > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > >> > > >> On 6/26/23 08:32, Jason Wang wrote: > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > >>>> > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > >> So, that might be one case. Taking into account that just rcu lock and > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > >> on QEMU side should improve performance significantly. And it shouldn't be > > >> too hard to implement. > > >> > > >> Performance over virtual interfaces may potentially be improved by creating > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > >> scale well. > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > io_uring and AF_XDP: > > > > > > 1) both have similar memory model (user register) > > > 2) both use ring for communication > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > virtual interfaces. io_uring thread in the kernel will be able to > > perform transmission for us. > > It would be nice if we can use iothread/vhost other than the main loop > even if io_uring can use kthreads. We can avoid the memory translation > cost. The QEMU event loop (AioContext) has io_uring code (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working on patches to re-enable it and will probably send them in July. 
The patches also add an API to submit arbitrary io_uring operations so that you can do stuff besides file descriptor monitoring. Both the main loop and IOThreads will be able to use io_uring on Linux hosts. Stefan
On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > >> > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > >>>> > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > >> So, that might be one case. Taking into account that just rcu lock and > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > >> too hard to implement. > > > >> > > > >> Performance over virtual interfaces may potentially be improved by creating > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > >> scale well. > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > io_uring and AF_XDP: > > > > > > > > 1) both have similar memory model (user register) > > > > 2) both use ring for communication > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > virtual interfaces. io_uring thread in the kernel will be able to > > > perform transmission for us. > > > > It would be nice if we can use iothread/vhost other than the main loop > > even if io_uring can use kthreads. We can avoid the memory translation > > cost. 
> > The QEMU event loop (AioContext) has io_uring code > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > on patches to re-enable it and will probably send them in July. The > patches also add an API to submit arbitrary io_uring operations so > that you can do stuff besides file descriptor monitoring. Both the > main loop and IOThreads will be able to use io_uring on Linux hosts. Just to make sure I understand. If we still need a copy from guest to io_uring buffer, we still need to go via memory API for GPA which seems expensive. Vhost seems to be a shortcut for this. Thanks > > Stefan >
On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > >> > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > >>>> > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > >> too hard to implement. > > > > >> > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > >> scale well. > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > io_uring and AF_XDP: > > > > > > > > > > 1) both have similar memory model (user register) > > > > > 2) both use ring for communication > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > perform transmission for us. 
> > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > even if io_uring can use kthreads. We can avoid the memory translation > > > cost. > > > > The QEMU event loop (AioContext) has io_uring code > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > on patches to re-enable it and will probably send them in July. The > > patches also add an API to submit arbitrary io_uring operations so > > that you can do stuff besides file descriptor monitoring. Both the > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > Just to make sure I understand. If we still need a copy from guest to > io_uring buffer, we still need to go via memory API for GPA which > seems expensive. > > Vhost seems to be a shortcut for this. I'm not sure how exactly you're thinking of using io_uring. Simply using io_uring for the event loop (file descriptor monitoring) doesn't involve an extra buffer, but the packet payload still needs to reside in AF_XDP umem, so there is a copy between guest memory and umem. If umem encompasses guest memory, it may be possible to avoid copying the packet payload. Stefan
On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > >> > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > >>>> > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > >> too hard to implement. > > > > > >> > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > >> scale well. > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > 2) both use ring for communication > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. 
for > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > perform transmission for us. > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > cost. > > > > > > The QEMU event loop (AioContext) has io_uring code > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > on patches to re-enable it and will probably send them in July. The > > > patches also add an API to submit arbitrary io_uring operations so > > > that you can do stuff besides file descriptor monitoring. Both the > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > Just to make sure I understand. If we still need a copy from guest to > > io_uring buffer, we still need to go via memory API for GPA which > > seems expensive. > > > > Vhost seems to be a shortcut for this. > > I'm not sure how exactly you're thinking of using io_uring. > > Simply using io_uring for the event loop (file descriptor monitoring) > doesn't involve an extra buffer, but the packet payload still needs to > reside in AF_XDP umem, so there is a copy between guest memory and > umem. So there would be a translation from GPA to HVA (unless io_uring support 2 stages) which needs to go via qemu memory core. And this part seems to be very expensive according to my test in the past. > If umem encompasses guest memory, It requires you to pin the whole guest memory and a GPA to HVA translation is still required. Thanks >it may be possible to avoid > copying the packet payload. > > Stefan >
On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > >> > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > >>>> > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > >> too hard to implement. > > > > > > >> > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > >> scale well. > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. 
> > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > perform transmission for us. > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > cost. > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > on patches to re-enable it and will probably send them in July. The > > > > patches also add an API to submit arbitrary io_uring operations so > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > io_uring buffer, we still need to go via memory API for GPA which > > > seems expensive. > > > > > > Vhost seems to be a shortcut for this. > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > doesn't involve an extra buffer, but the packet payload still needs to > > reside in AF_XDP umem, so there is a copy between guest memory and > > umem. > > So there would be a translation from GPA to HVA (unless io_uring > support 2 stages) which needs to go via qemu memory core. And this > part seems to be very expensive according to my test in the past. Yes, but in the current approach where AF_XDP is implemented as a QEMU netdev, there is already QEMU device emulation (e.g. virtio-net) happening. So the GPA to HVA translation will happen anyway in device emulation. Are you thinking about AF_XDP passthrough where the guest directly interacts with AF_XDP? 
> > If umem encompasses guest memory, > > It requires you to pin the whole guest memory and a GPA to HVA > translation is still required. Ilya mentioned that umem uses relative offsets instead of absolute memory addresses. In the AF_XDP passthrough case this means no address translation needs to be added to AF_XDP. Regarding pinning - I wonder if that's something that can be refined in the kernel by adding an AF_XDP flag that enables on-demand pinning of umem. That way only rx and tx buffers that are currently in use will be pinned. The disadvantage is the runtime overhead to pin/unpin pages. I'm not sure whether it's possible to implement this, I haven't checked the kernel code. Stefan
On 6/28/23 05:27, Jason Wang wrote: > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 6/27/23 04:54, Jason Wang wrote: >>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>> >>>> On 6/26/23 08:32, Jason Wang wrote: >>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: >>>>>> >>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>> >>>>>>> AF_XDP is a network socket family that allows communication directly >>>>>>> with the network device driver in the kernel, bypassing most or all >>>>>>> of the kernel networking stack. In the essence, the technology is >>>>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native >>>>>>> and works with any network interfaces without driver modifications. >>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't >>>>>>> require access to character devices or unix sockets. Only access to >>>>>>> the network interface itself is necessary. >>>>>>> >>>>>>> This patch implements a network backend that communicates with the >>>>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory >>>>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, >>>>>>> Fill and Completion) are placed in that memory along with a pool of >>>>>>> memory buffers for the packet data. Data transmission is done by >>>>>>> allocating one of the buffers, copying packet data into it and >>>>>>> placing the pointer into Tx ring. After transmission, device will >>>>>>> return the buffer via Completion ring. On Rx, device will take >>>>>>> a buffer form a pre-populated Fill ring, write the packet data into >>>>>>> it and place the buffer into Rx ring. >>>>>>> >>>>>>> AF_XDP network backend takes on the communication with the host >>>>>>> kernel and the network interface and forwards packets to/from the >>>>>>> peer device in QEMU. 
>>>>>>> >>>>>>> Usage example: >>>>>>> >>>>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C >>>>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 >>>>>>> >>>>>>> XDP program bridges the socket with a network interface. It can be >>>>>>> attached to the interface in 2 different modes: >>>>>>> >>>>>>> 1. skb - this mode should work for any interface and doesn't require >>>>>>> driver support. With a caveat of lower performance. >>>>>>> >>>>>>> 2. native - this does require support from the driver and allows to >>>>>>> bypass skb allocation in the kernel and potentially use >>>>>>> zero-copy while getting packets in/out userspace. >>>>>>> >>>>>>> By default, QEMU will try to use native mode and fall back to skb. >>>>>>> Mode can be forced via 'mode' option. To force 'copy' even in native >>>>>>> mode, use 'force-copy=on' option. This might be useful if there is >>>>>>> some issue with the driver. >>>>>>> >>>>>>> Option 'queues=N' allows to specify how many device queues should >>>>>>> be open. Note that all the queues that are not open are still >>>>>>> functional and can receive traffic, but it will not be delivered to >>>>>>> QEMU. So, the number of device queues should generally match the >>>>>>> QEMU configuration, unless the device is shared with something >>>>>>> else and the traffic re-direction to appropriate queues is correctly >>>>>>> configured on a device level (e.g. with ethtool -N). >>>>>>> 'start-queue=M' option can be used to specify from which queue id >>>>>>> QEMU should start configuring 'N' queues. It might also be necessary >>>>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs >>>>>>> for examples. >>>>>>> >>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN >>>>>>> capabilities in order to load default XSK/XDP programs to the >>>>>>> network interface and configure BTF maps. >>>>>> >>>>>> I think you mean "BPF" actually? 
>>>> >>>> "BPF Type Format maps" kind of makes some sense, but yes. :) >>>> >>>>>> >>>>>>> It is possible, however, >>>>>>> to run only with CAP_NET_RAW. >>>>>> >>>>>> Qemu often runs without any privileges, so we need to fix it. >>>>>> >>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go. >>>> >>>> I looked through the code and it seems like we can run completely >>>> non-privileged as far as kernel concerned. We'll need an API >>>> modification in libxdp though. >>>> >>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is >>>> a base socket creation. Binding and other configuration doesn't >>>> require any privileges. So, we could create a socket externally >>>> and pass it to QEMU. >>> >>> That's the way TAP works for example. >>> >>>> Should work, unless it's an oversight from >>>> the kernel side that needs to be patched. :) libxdp doesn't have >>>> a way to specify externally created socket today, so we'll need >>>> to change that. Should be easy to do though. I can explore. >>> >>> Please do that. >> >> I have a prototype: >> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3 >> >> Need to test it out and then submit PR to xdp-tools project. >> >>> >>>> >>>> In case the bind syscall will actually need CAP_NET_RAW for some >>>> reason, we could change the kernel and allow non-privileged bind >>>> by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged >>>> process bind the socket to a particular device, so QEMU can't >>>> bind it to a random one. Might be a good use case to allow even >>>> if not strictly necessary. >>> >>> Yes. >> >> Will propose something for a kernel as well. We might want something >> more granular though, e.g. bind to a queue instead of a device. In >> case we want better control in the device sharing scenario. > > I may miss something but the bind is already done at dev plus queue > right now, isn't it? 
Yes, the bind() syscall will bind the socket to the dev+queue. I was talking about SO_BINDTODEVICE that only ties the socket to a particular device, but not a queue.

Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and assuming a privileged process does:

  fd = socket(AF_XDP, ...);
  setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);

And sends fd to a non-privileged process. That non-privileged process will be able to call:

  bind(fd, <device>, <random queue>);

It will have to use the same device, but can choose any queue, if that queue is not already busy with another socket.

So, I was thinking maybe implementing something like an XDP_BINDTOQID option. This way the privileged process may call:

  setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);

And later the kernel will be able to refuse bind() for any other queue for this particular socket. Not sure if that is necessary though.

Since we're allocating the socket in the privileged process, that process may add the socket to the BPF map on the correct queue id. This way the non-privileged process will not be able to receive any packets from any other queue on this socket, even if bound to it. And no other AF_XDP socket will be able to be bound to that other queue as well. So, the rogue QEMU will be able to hog one extra queue, but it will not be able to intercept any traffic from it, AFAICT. May not be a huge problem after all.

SO_BINDTODEVICE would still be nice to have. Especially for cases where we give the whole device to one VM.

Best regards, Ilya Maximets.
On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > >> > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > >>>> > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > >> too hard to implement. > > > > > > > >> > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > >> scale well. 
> > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > perform transmission for us. > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > cost. > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > on patches to re-enable it and will probably send them in July. The > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > seems expensive. > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > doesn't involve an extra buffer, but the packet payload still needs to > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > umem. 
> > > > So there would be a translation from GPA to HVA (unless io_uring > > support 2 stages) which needs to go via qemu memory core. And this > > part seems to be very expensive according to my test in the past. > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > netdev, there is already QEMU device emulation (e.g. virtio-net) > happening. So the GPA to HVA translation will happen anyway in device > emulation. Just to make sure we're on the same page. I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the QEMU netdev, it would be very hard to achieve that if we stick to using the Qemu memory core translations which need to take care about too much extra stuff. That's why I suggest using vhost in io threads which only cares about ram so the translation could be very fast. > > Are you thinking about AF_XDP passthrough where the guest directly > interacts with AF_XDP? This could be another way to solve, since it won't use Qemu's memory core to do the translation. > > > > If umem encompasses guest memory, > > > > It requires you to pin the whole guest memory and a GPA to HVA > > translation is still required. > > Ilya mentioned that umem uses relative offsets instead of absolute > memory addresses. In the AF_XDP passthrough case this means no address > translation needs to be added to AF_XDP. I don't see how it can avoid the translations as it works at the level of HVA. But what guests submit is PA or even IOVA. What's more, guest memory could be backed by different memory backends, this means a single umem may not even work. > > Regarding pinning - I wonder if that's something that can be refined > in the kernel by adding an AF_XDP flag that enables on-demand pinning > of umem. That way only rx and tx buffers that are currently in use > will be pinned. The disadvantage is the runtime overhead to pin/unpin > pages. I'm not sure whether it's possible to implement this, I haven't > checked the kernel code. 
It requires the device to do page faults which is not commonly supported nowadays. Thanks > > Stefan >
On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > >> > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > >>>> > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > >> too hard to implement. > > > > > > > > >> > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > >> scale well. 
> > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > perform transmission for us. > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > cost. > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > > on patches to re-enable it and will probably send them in July. The > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > seems expensive. > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > umem. 
> > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > support 2 stages) which needs to go via qemu memory core. And this > > > part seems to be very expensive according to my test in the past. > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > happening. So the GPA to HVA translation will happen anyway in device > > emulation. > > Just to make sure we're on the same page. > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > QEMU netdev, it would be very hard to achieve that if we stick to > using the Qemu memory core translations which need to take care about > too much extra stuff. That's why I suggest using vhost in io threads > which only cares about ram so the translation could be very fast. What does using "vhost in io threads" mean? Is that a vhost kernel approach where userspace dedicates threads (the stuff that Mike Christie has been working on)? I haven't looked at how Mike's recent patches work, but I wouldn't call that approach QEMU IOThreads, because the threads probably don't run the AioContext event loop and instead execute vhost kernel code the entire time. But despite these questions, I think I'm beginning to understand that you're proposing a vhost_net.ko AF_XDP implementation instead of a userspace QEMU AF_XDP netdev implementation. I wonder if any optimizations can be made when the AF_XDP user is kernel code instead of userspace code. > > > > Are you thinking about AF_XDP passthrough where the guest directly > > interacts with AF_XDP? > > This could be another way to solve, since it won't use Qemu's memory > core to do the translation. > > > > > > > If umem encompasses guest memory, > > > > > > It requires you to pin the whole guest memory and a GPA to HVA > > > translation is still required. > > > > Ilya mentioned that umem uses relative offsets instead of absolute > > memory addresses. 
In the AF_XDP passthrough case this means no address > > translation needs to be added to AF_XDP.

> I don't see how it can avoid the translations as it works at the level > of HVA. But what guests submit is PA or even IOVA.

In a passthrough scenario the guest is doing AF_XDP, so it writes relative umem offsets, thereby eliminating address translation concerns (the addresses are not PAs or IOVAs). However, this approach probably won't work well with memory hotplug - or at least it will end up becoming a memory translation mechanism in order to support memory hotplug.

> What's more, guest memory could be backed by different memory > backends, this means a single umem may not even work.

Maybe. I don't know the nature of umem. If there can be multiple vmas in the umem range, then there should be no issue mixing different memory backends.

> > Regarding pinning - I wonder if that's something that can be refined > > in the kernel by adding an AF_XDP flag that enables on-demand pinning > > of umem. That way only rx and tx buffers that are currently in use > > will be pinned. The disadvantage is the runtime overhead to pin/unpin > > pages. I'm not sure whether it's possible to implement this, I haven't > > checked the kernel code.

> It requires the device to do page faults which is not commonly > supported nowadays.

I don't understand this comment. AF_XDP processes each rx/tx descriptor. At that point it can call get_user_pages() or similar in order to pin the page. When the memory is no longer needed, it can put those pages. No fault mechanism is needed. What am I missing?

Stefan
On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > >> > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > >>>> > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > > >> too hard to implement. > > > > > > > > > >> > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. 
Currently > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > > >> scale well. > > > > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > > perform transmission for us. > > > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > > cost. > > > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > > > on patches to re-enable it and will probably send them in July. The > > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > > seems expensive. > > > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. 
> > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > > umem. > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > > support 2 stages) which needs to go via qemu memory core. And this > > > > part seems to be very expensive according to my test in the past. > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > > happening. So the GPA to HVA translation will happen anyway in device > > > emulation. > > > > Just to make sure we're on the same page. > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > > QEMU netdev, it would be very hard to achieve that if we stick to > > using the Qemu memory core translations which need to take care about > > too much extra stuff. That's why I suggest using vhost in io threads > > which only cares about ram so the translation could be very fast. > > What does using "vhost in io threads" mean? It means a vhost userspace dataplane that is implemented via io threads. > Is that a vhost kernel > approach where userspace dedicates threads (the stuff that Mike > Christie has been working on)? I haven't looked at how Mike's recent > patches work, but I wouldn't call that approach QEMU IOThreads, > because the threads probably don't run the AioContext event loop and > instead execute vhost kernel code the entire time. > > But despite these questions, I think I'm beginning to understand that > you're proposing a vhost_net.ko AF_XDP implementation instead of a > userspace QEMU AF_XDP netdev implementation. Sorry for being unclear, but I'm not proposing that. > I wonder if any > optimizations can be made when the AF_XDP user is kernel code instead > of userspace code. 
The only possible way would be to adapt the AF_XDP umem memory model to vhost, and I'm not sure there is anything we can gain from that. > > > > > > > Are you thinking about AF_XDP passthrough where the guest directly > > > interacts with AF_XDP? > > > > This could be another way to solve, since it won't use Qemu's memory > > core to do the translation. > > > > > > > > > > If umem encompasses guest memory, > > > > > > > > It requires you to pin the whole guest memory and a GPA to HVA > > > > translation is still required. > > > > > > Ilya mentioned that umem uses relative offsets instead of absolute > > > memory addresses. In the AF_XDP passthrough case this means no address > > > translation needs to be added to AF_XDP. > > > > I don't see how it can avoid the translations as it works at the level > > of HVA. But what guests submit is PA or even IOVA. > > In a passthrough scenario the guest is doing AF_XDP, so it writes > relative umem offsets, thereby eliminating address translation > concerns (the addresses are not PAs or IOVAs). However, this approach > probably won't work well with memory hotplug - or at least it will end > up becoming a memory translation mechanism in order to support memory > hotplug. Ok. > > > > > What's more, guest memory could be backed by different memory > > backends, this means a single umem may not even work. > > Maybe. I don't know the nature of umem. If there can be multiple vmas > in the umem range, then there should be no issue mixing different > memory backends. If I understand correctly, a single umem requires contiguous VA at least. > > > > > > > > > Regarding pinning - I wonder if that's something that can be refined > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning > > > of umem. That way only rx and tx buffers that are currently in use > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin > > > pages. I'm not sure whether it's possible to implement this, I haven't > > > checked the kernel code.
> > > > It requires the device to do page faults which is not commonly > > supported nowadays. > > I don't understand this comment. AF_XDP processes each rx/tx > descriptor. At that point it can getuserpages() or similar in order to > pin the page. When the memory is no longer needed, it can put those > pages. No fault mechanism is needed. What am I missing? Ok, I think I kind of get you, you mean doing pinning while processing rx/tx buffers? It's not easy since GUP itself is not very fast, it may hit PPS for sure. Thanks > > Stefan >
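The "single umem requires contiguous VA" point above, together with the buffer pool described in the commit message, roughly amounts to the following. This is a hypothetical sketch of the model, not QEMU's actual code: one contiguous area carved into fixed-size frames, with a free-list stack of frame offsets that Tx pops from and Completion-ring processing pushes back to.

```c
/*
 * Hypothetical sketch (not QEMU's actual code) of the umem buffer pool
 * model from the commit message: one contiguous VA range carved into
 * fixed-size frames, managed as a stack of free frame offsets.
 */
#include <stdint.h>
#include <stdlib.h>

#define POOL_FRAME_SIZE 2048ULL

struct frame_pool {
    uint64_t *free;      /* stack of free frame offsets */
    unsigned int nfree;
    unsigned int nframes;
};

static int pool_init(struct frame_pool *p, unsigned int nframes)
{
    p->free = malloc(nframes * sizeof(*p->free));
    if (!p->free) {
        return -1;
    }
    for (unsigned int i = 0; i < nframes; i++) {
        p->free[i] = (uint64_t)i * POOL_FRAME_SIZE;
    }
    p->nfree = p->nframes = nframes;
    return 0;
}

/* Take a frame for Tx; returns UINT64_MAX if the pool is exhausted. */
static uint64_t pool_get(struct frame_pool *p)
{
    return p->nfree ? p->free[--p->nfree] : UINT64_MAX;
}

/* Return a frame after the Completion ring hands it back. */
static void pool_put(struct frame_pool *p, uint64_t off)
{
    p->free[p->nfree++] = off;
}
```

Since every frame is an offset within one mapping, the whole pool registers as a single umem, which is why mixing memory backends behind one umem is the open question above.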
On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 6/28/23 05:27, Jason Wang wrote: > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >> > >> On 6/27/23 04:54, Jason Wang wrote: > >>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>> > >>>> On 6/26/23 08:32, Jason Wang wrote: > >>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > >>>>>> > >>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>> > >>>>>>> AF_XDP is a network socket family that allows communication directly > >>>>>>> with the network device driver in the kernel, bypassing most or all > >>>>>>> of the kernel networking stack. In the essence, the technology is > >>>>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native > >>>>>>> and works with any network interfaces without driver modifications. > >>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't > >>>>>>> require access to character devices or unix sockets. Only access to > >>>>>>> the network interface itself is necessary. > >>>>>>> > >>>>>>> This patch implements a network backend that communicates with the > >>>>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory > >>>>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, > >>>>>>> Fill and Completion) are placed in that memory along with a pool of > >>>>>>> memory buffers for the packet data. Data transmission is done by > >>>>>>> allocating one of the buffers, copying packet data into it and > >>>>>>> placing the pointer into Tx ring. After transmission, device will > >>>>>>> return the buffer via Completion ring. On Rx, device will take > >>>>>>> a buffer form a pre-populated Fill ring, write the packet data into > >>>>>>> it and place the buffer into Rx ring. 
> >>>>>>> > >>>>>>> AF_XDP network backend takes on the communication with the host > >>>>>>> kernel and the network interface and forwards packets to/from the > >>>>>>> peer device in QEMU. > >>>>>>> > >>>>>>> Usage example: > >>>>>>> > >>>>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C > >>>>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 > >>>>>>> > >>>>>>> XDP program bridges the socket with a network interface. It can be > >>>>>>> attached to the interface in 2 different modes: > >>>>>>> > >>>>>>> 1. skb - this mode should work for any interface and doesn't require > >>>>>>> driver support. With a caveat of lower performance. > >>>>>>> > >>>>>>> 2. native - this does require support from the driver and allows to > >>>>>>> bypass skb allocation in the kernel and potentially use > >>>>>>> zero-copy while getting packets in/out userspace. > >>>>>>> > >>>>>>> By default, QEMU will try to use native mode and fall back to skb. > >>>>>>> Mode can be forced via 'mode' option. To force 'copy' even in native > >>>>>>> mode, use 'force-copy=on' option. This might be useful if there is > >>>>>>> some issue with the driver. > >>>>>>> > >>>>>>> Option 'queues=N' allows to specify how many device queues should > >>>>>>> be open. Note that all the queues that are not open are still > >>>>>>> functional and can receive traffic, but it will not be delivered to > >>>>>>> QEMU. So, the number of device queues should generally match the > >>>>>>> QEMU configuration, unless the device is shared with something > >>>>>>> else and the traffic re-direction to appropriate queues is correctly > >>>>>>> configured on a device level (e.g. with ethtool -N). > >>>>>>> 'start-queue=M' option can be used to specify from which queue id > >>>>>>> QEMU should start configuring 'N' queues. It might also be necessary > >>>>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs > >>>>>>> for examples. 
> >>>>>>> > >>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN > >>>>>>> capabilities in order to load default XSK/XDP programs to the > >>>>>>> network interface and configure BTF maps. > >>>>>> > >>>>>> I think you mean "BPF" actually? > >>>> > >>>> "BPF Type Format maps" kind of makes some sense, but yes. :) > >>>> > >>>>>> > >>>>>>> It is possible, however, > >>>>>>> to run only with CAP_NET_RAW. > >>>>>> > >>>>>> Qemu often runs without any privileges, so we need to fix it. > >>>>>> > >>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go. > >>>> > >>>> I looked through the code and it seems like we can run completely > >>>> non-privileged as far as kernel concerned. We'll need an API > >>>> modification in libxdp though. > >>>> > >>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is > >>>> a base socket creation. Binding and other configuration doesn't > >>>> require any privileges. So, we could create a socket externally > >>>> and pass it to QEMU. > >>> > >>> That's the way TAP works for example. > >>> > >>>> Should work, unless it's an oversight from > >>>> the kernel side that needs to be patched. :) libxdp doesn't have > >>>> a way to specify externally created socket today, so we'll need > >>>> to change that. Should be easy to do though. I can explore. > >>> > >>> Please do that. > >> > >> I have a prototype: > >> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3 > >> > >> Need to test it out and then submit PR to xdp-tools project. > >> > >>> > >>>> > >>>> In case the bind syscall will actually need CAP_NET_RAW for some > >>>> reason, we could change the kernel and allow non-privileged bind > >>>> by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged > >>>> process bind the socket to a particular device, so QEMU can't > >>>> bind it to a random one. Might be a good use case to allow even > >>>> if not strictly necessary. > >>> > >>> Yes. 
> >> > >> Will propose something for a kernel as well. We might want something > >> more granular though, e.g. bind to a queue instead of a device. In > >> case we want better control in the device sharing scenario. > > > > I may miss something but the bind is already done at dev plus queue > > right now, isn't it? > > > Yes, the bind() syscall will bind socket to the dev+queue. I was talking > about SO_BINDTODEVICE that only ties the socket to a particular device, > but not a queue. > > Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and > assuming a privileged process does: > > fd = socket(AF_XDP, ...); > setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>); > > And sends fd to a non-privileged process. That non-privileged process > will be able to call: > > bind(fd, <device>, <random queue>); > > It will have to use the same device, but can choose any queue, if that > queue is not already busy with another socket. > > So, I was thinking maybe implementing something like XDP_BINDTOQID option. > This way the privileged process may call: > > setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>); > > And later kernel will be able to refuse bind() for any other queue for > this particular socket. Not sure; if file descriptor passing works, we probably don't need another way. > > Not sure if that is necessary though. > Since we're allocating the socket in the privileged process, that process > may add the socket to the BPF map on the correct queue id. This way the > non-privileged process will not be able to receive any packets from any > other queue on this socket, even if bound to it. And no other AF_XDP > socket will be able to be bound to that other queue as well. I think that's by design, or is anything wrong with this model? > So, the > rogue QEMU will be able to hog one extra queue, but it will not be able > to intercept any traffic from it, AFAICT. May not be a huge problem > after all. > > SO_BINDTODEVICE would still be nice to have.
Especially for cases where > we give the whole device to one VM. Then we need to use AF_XDP in the guest which seems to be a different topic. Alibaba is working on the AF_XDP support for virtio-net. Thanks > > Best regards, Ilya Maximets. >
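The "create the socket externally and pass it to QEMU" flow discussed in this email is standard SCM_RIGHTS file descriptor passing. A minimal sketch follows; since creating a real AF_XDP socket needs CAP_NET_RAW, the sketch passes an ordinary fd as a stand-in, but the mechanism is identical.

```c
/*
 * Sketch of the SCM_RIGHTS handoff discussed above: a privileged
 * helper creates the socket and sends its fd over a unix domain
 * socket, so QEMU itself can run unprivileged. Any fd can stand in
 * for the AF_XDP socket here; the mechanism is the same.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;        /* ensure cmsg alignment */
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;    /* ancillary data carries the fd */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received fd refers to the same open socket description, so a bind() done by the unprivileged receiver is constrained by whatever the privileged sender already configured, which is exactly the property SO_BINDTODEVICE/XDP_BINDTOQID would enforce.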
On 6/30/23 09:44, Jason Wang wrote: > On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 6/28/23 05:27, Jason Wang wrote: >>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>> >>>> On 6/27/23 04:54, Jason Wang wrote: >>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>> >>>>>> On 6/26/23 08:32, Jason Wang wrote: >>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>> >>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>> >>>>>>>>> AF_XDP is a network socket family that allows communication directly >>>>>>>>> with the network device driver in the kernel, bypassing most or all >>>>>>>>> of the kernel networking stack. In the essence, the technology is >>>>>>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native >>>>>>>>> and works with any network interfaces without driver modifications. >>>>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't >>>>>>>>> require access to character devices or unix sockets. Only access to >>>>>>>>> the network interface itself is necessary. >>>>>>>>> >>>>>>>>> This patch implements a network backend that communicates with the >>>>>>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory >>>>>>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, >>>>>>>>> Fill and Completion) are placed in that memory along with a pool of >>>>>>>>> memory buffers for the packet data. Data transmission is done by >>>>>>>>> allocating one of the buffers, copying packet data into it and >>>>>>>>> placing the pointer into Tx ring. After transmission, device will >>>>>>>>> return the buffer via Completion ring. On Rx, device will take >>>>>>>>> a buffer form a pre-populated Fill ring, write the packet data into >>>>>>>>> it and place the buffer into Rx ring. 
>>>>>>>>> >>>>>>>>> AF_XDP network backend takes on the communication with the host >>>>>>>>> kernel and the network interface and forwards packets to/from the >>>>>>>>> peer device in QEMU. >>>>>>>>> >>>>>>>>> Usage example: >>>>>>>>> >>>>>>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C >>>>>>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 >>>>>>>>> >>>>>>>>> XDP program bridges the socket with a network interface. It can be >>>>>>>>> attached to the interface in 2 different modes: >>>>>>>>> >>>>>>>>> 1. skb - this mode should work for any interface and doesn't require >>>>>>>>> driver support. With a caveat of lower performance. >>>>>>>>> >>>>>>>>> 2. native - this does require support from the driver and allows to >>>>>>>>> bypass skb allocation in the kernel and potentially use >>>>>>>>> zero-copy while getting packets in/out userspace. >>>>>>>>> >>>>>>>>> By default, QEMU will try to use native mode and fall back to skb. >>>>>>>>> Mode can be forced via 'mode' option. To force 'copy' even in native >>>>>>>>> mode, use 'force-copy=on' option. This might be useful if there is >>>>>>>>> some issue with the driver. >>>>>>>>> >>>>>>>>> Option 'queues=N' allows to specify how many device queues should >>>>>>>>> be open. Note that all the queues that are not open are still >>>>>>>>> functional and can receive traffic, but it will not be delivered to >>>>>>>>> QEMU. So, the number of device queues should generally match the >>>>>>>>> QEMU configuration, unless the device is shared with something >>>>>>>>> else and the traffic re-direction to appropriate queues is correctly >>>>>>>>> configured on a device level (e.g. with ethtool -N). >>>>>>>>> 'start-queue=M' option can be used to specify from which queue id >>>>>>>>> QEMU should start configuring 'N' queues. It might also be necessary >>>>>>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs >>>>>>>>> for examples. 
>>>>>>>>> >>>>>>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN >>>>>>>>> capabilities in order to load default XSK/XDP programs to the >>>>>>>>> network interface and configure BTF maps. >>>>>>>> >>>>>>>> I think you mean "BPF" actually? >>>>>> >>>>>> "BPF Type Format maps" kind of makes some sense, but yes. :) >>>>>> >>>>>>>> >>>>>>>>> It is possible, however, >>>>>>>>> to run only with CAP_NET_RAW. >>>>>>>> >>>>>>>> Qemu often runs without any privileges, so we need to fix it. >>>>>>>> >>>>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go. >>>>>> >>>>>> I looked through the code and it seems like we can run completely >>>>>> non-privileged as far as kernel concerned. We'll need an API >>>>>> modification in libxdp though. >>>>>> >>>>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is >>>>>> a base socket creation. Binding and other configuration doesn't >>>>>> require any privileges. So, we could create a socket externally >>>>>> and pass it to QEMU. >>>>> >>>>> That's the way TAP works for example. >>>>> >>>>>> Should work, unless it's an oversight from >>>>>> the kernel side that needs to be patched. :) libxdp doesn't have >>>>>> a way to specify externally created socket today, so we'll need >>>>>> to change that. Should be easy to do though. I can explore. >>>>> >>>>> Please do that. >>>> >>>> I have a prototype: >>>> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3 >>>> >>>> Need to test it out and then submit PR to xdp-tools project. The change is now accepted: https://github.com/xdp-project/xdp-tools/commit/740c839806a02517da5bce7bd0ccaba908b3f675 I can update the QEMU patch with support for passing socket fds. It may look like this: -netdev af-xdp,eth0,queues=2,inhibit=on,sock-fds=fd1,fd2 We'll need an fd per queue. And we may require these fds to be already added to the xsks map, so QEMU doesn't need xsks-map-fd.
I'd say we'll need to compile support for that conditionally based on availability of xsk_umem__create_with_fd() as it may not be available in distributions for some time. Alternative is to require libxdp >= 1.4.0, which is not released yet. The last restriction will be that QEMU will need 32 MB of RLIMIT_MEMLOCK per queue for umem registration, but that should not be a huge deal, right? Alternative is to have CAP_IPC_LOCK. And I'd keep the xsks-map-fd parameter for setups that do not have latest libxdp and can allow CAP_NET_RAW. So, they could still do: -netdev af-xdp,eth0,queues=2,inhibit=on,xsks-map-fd=fd What do you think? >>>> >>>>> >>>>>> >>>>>> In case the bind syscall will actually need CAP_NET_RAW for some >>>>>> reason, we could change the kernel and allow non-privileged bind >>>>>> by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged >>>>>> process bind the socket to a particular device, so QEMU can't >>>>>> bind it to a random one. Might be a good use case to allow even >>>>>> if not strictly necessary. >>>>> >>>>> Yes. >>>> >>>> Will propose something for a kernel as well. We might want something >>>> more granular though, e.g. bind to a queue instead of a device. In >>>> case we want better control in the device sharing scenario. >>> >>> I may miss something but the bind is already done at dev plus queue >>> right now, isn't it? >> >> >> Yes, the bind() syscall will bind socket to the dev+queue. I was talking >> about SO_BINDTODEVICE that only ties the socket to a particular device, >> but not a queue. >> >> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and >> assuming a privileged process does: >> >> fd = socket(AF_XDP, ...); >> setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>); >> >> And sends fd to a non-privileged process. 
That non-privileged process >> will be able to call: >> >> bind(fd, <device>, <random queue>); >> >> It will have to use the same device, but can choose any queue, if that >> queue is not already busy with another socket. >> >> So, I was thinking maybe implementing something like XDP_BINDTOQID option. >> This way the privileged process may call: >> >> setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>); >> >> And later kernel will be able to refuse bind() for any other queue for >> this particular socket. > > Not sure; if file descriptor passing works, we probably don't need another way. > >> >> Not sure if that is necessary though. >> Since we're allocating the socket in the privileged process, that process >> may add the socket to the BPF map on the correct queue id. This way the >> non-privileged process will not be able to receive any packets from any >> other queue on this socket, even if bound to it. And no other AF_XDP >> socket will be able to be bound to that other queue as well. > > I think that's by design, or is anything wrong with this model? No, should be fine. I posted a simple SO_BINDTODEVICE change to bpf-next as an RFC for now since the tree is closed: https://lore.kernel.org/netdev/20230630145831.2988845-1-i.maximets@ovn.org/ Will re-send a non-RFC once it is open (after 10th of July, IIRC). > >> So, the >> rogue QEMU will be able to hog one extra queue, but it will not be able >> to intercept any traffic from it, AFAICT. May not be a huge problem >> after all. >> >> SO_BINDTODEVICE would still be nice to have. Especially for cases where >> we give the whole device to one VM. > > Then we need to use AF_XDP in the guest which seems to be a different > topic. Alibaba is working on the AF_XDP support for virtio-net. > > Thanks > >> >> Best regards, Ilya Maximets. >> >
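The RLIMIT_MEMLOCK requirement mentioned earlier in this email (roughly 32 MB per queue for umem registration) is something a management tool could sanity-check before launching QEMU. A sketch follows; the 32 MB figure is taken from this thread, not from a kernel header, and with CAP_IPC_LOCK the limit does not apply at all.

```c
/*
 * Sketch of a pre-flight check for the per-queue RLIMIT_MEMLOCK
 * requirement mentioned in the thread. The 32 MB figure comes from
 * the discussion, not from a kernel header, and may change.
 */
#include <stdint.h>
#include <sys/resource.h>

#define UMEM_BYTES_PER_QUEUE (32ULL << 20)   /* 32 MB of locked memory */

/* Returns 1 if the memlock limit covers 'queues' queues, 0 if it does
 * not, and -1 on getrlimit() failure. */
static int memlock_limit_ok(unsigned int queues)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return 1;
    }
    return (uint64_t)rl.rlim_cur >= queues * UMEM_BYTES_PER_QUEUE;
}
```

Many distributions default RLIMIT_MEMLOCK to 64 KB for unprivileged users, so a check like this (or raising the limit in the systemd unit) would likely be needed in practice.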
On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > >> > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > >>>> > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > > > >> too hard to implement. > > > > > > > > > > >> > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > > > >> a kernel thread for async Tx. 
Similarly to what io_uring allows. Currently > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > > > >> scale well. > > > > > > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > > > perform transmission for us. > > > > > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > > > cost. > > > > > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > > > > on patches to re-enable it and will probably send them in July. The > > > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > > > seems expensive. > > > > > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. 
> > > > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > > > umem. > > > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > > > support 2 stages) which needs to go via qemu memory core. And this > > > > > part seems to be very expensive according to my test in the past. > > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > > > happening. So the GPA to HVA translation will happen anyway in device > > > > emulation. > > > > > > Just to make sure we're on the same page. > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > > > QEMU netdev, it would be very hard to achieve that if we stick to > > > using the Qemu memory core translations which need to take care about > > > too much extra stuff. That's why I suggest using vhost in io threads > > > which only cares about ram so the translation could be very fast. > > > > What does using "vhost in io threads" mean? > > It means a vhost userspace dataplane that is implemented via io threads. AFAIK this does not exist today. QEMU's built-in devices that use IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, vhost-user, or vDPA but not built-in devices that use IOThreads. The built-in devices implement VirtioDeviceClass callbacks directly and use AioContext APIs to run in IOThreads. Do you have an idea for using vhost code for built-in devices? Maybe it's fastest if you explain your idea and its advantages instead of me guessing. > > > > Regarding pinning - I wonder if that's something that can be refined > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning > > > > of umem. 
That way only rx and tx buffers that are currently in use > > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin > > > > pages. I'm not sure whether it's possible to implement this, I haven't > > > > checked the kernel code. > > > > > > It requires the device to do page faults which is not commonly > > > supported nowadays. > > > > I don't understand this comment. AF_XDP processes each rx/tx > > descriptor. At that point it can getuserpages() or similar in order to > > pin the page. When the memory is no longer needed, it can put those > > pages. No fault mechanism is needed. What am I missing? > > Ok, I think I kind of get you, you mean doing pinning while processing > rx/tx buffers? It's not easy since GUP itself is not very fast, it may > hit PPS for sure. Yes. It's not as fast as permanently pinning rx/tx buffers, but it supports unpinned guest RAM. There are variations on this approach, like keeping a certain amount of pages pinned after they have been used so the cost of pinning/unpinning can be avoided when the same pages are reused in the future, but I don't know how effective that is in practice. Is there a more efficient approach without relying on hardware page fault support? My understanding is that hardware page fault support is not yet deployed. We'd be left with pinning guest RAM permanently or using a runtime pinning/unpinning approach like I've described. Stefan
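The "keep a certain amount of pages pinned after use" variation described above can be modeled as a small direct-mapped cache of pinned pages. This toy sketch only counts how often a real get_user_pages() would be needed; actual kernel code would hold struct page references and unpin on eviction, and all names and sizes here are made up for illustration.

```c
/*
 * Toy model of the pin-caching idea discussed above: keep recently
 * used pages pinned so that reusing the same rx/tx buffers avoids
 * repeated get_user_pages() calls. Pinning is simulated by a counter;
 * real kernel code would hold struct page references and unpin the
 * evicted page. Names and sizes are illustrative only.
 */
#include <stdint.h>

#define PIN_CACHE_SLOTS 8
#define PAGE_SHIFT_4K   12

struct pin_cache {
    uint64_t pfn[PIN_CACHE_SLOTS];   /* cached page frame numbers */
    int used[PIN_CACHE_SLOTS];
    unsigned long pin_calls;         /* how often a real pin was needed */
};

/* "Pin" the page containing addr; a cache hit skips the expensive call. */
static void pin_page(struct pin_cache *c, uint64_t addr)
{
    uint64_t pfn = addr >> PAGE_SHIFT_4K;
    unsigned int slot = pfn % PIN_CACHE_SLOTS;

    if (c->used[slot] && c->pfn[slot] == pfn) {
        return;                      /* still pinned from a previous use */
    }
    /* Miss: stands in for unpinning the evicted page plus GUP of the new one. */
    c->pfn[slot] = pfn;
    c->used[slot] = 1;
    c->pin_calls++;
}
```

How much such a cache helps depends entirely on how often the rx/tx rings recycle the same buffers, which is the practical question left open in the exchange above.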
On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > >> > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > >>>> > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > > > > >> too hard to implement. 
> > > > > > > > > > > >> > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > > > > >> scale well. > > > > > > > > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > > > > perform transmission for us. > > > > > > > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > > > > cost. > > > > > > > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > > > > > on patches to re-enable it and will probably send them in July. The > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > > > > > > > Just to make sure I understand. 
If we still need a copy from guest to > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > > > > seems expensive. > > > > > > > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > > > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > > > > umem. > > > > > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > > > > support 2 stages) which needs to go via qemu memory core. And this > > > > > > part seems to be very expensive according to my test in the past. > > > > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > > > > happening. So the GPA to HVA translation will happen anyway in device > > > > > emulation. > > > > > > > > Just to make sure we're on the same page. > > > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > > > > QEMU netdev, it would be very hard to achieve that if we stick to > > > > using the Qemu memory core translations which need to take care about > > > > too much extra stuff. That's why I suggest using vhost in io threads > > > > which only cares about ram so the translation could be very fast. > > > > > > What does using "vhost in io threads" mean? > > > > It means a vhost userspace dataplane that is implemented via io threads. > > AFAIK this does not exist today. QEMU's built-in devices that use > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, > vhost-user, or vDPA but not built-in devices that use IOThreads. 
The > built-in devices implement VirtioDeviceClass callbacks directly and > use AioContext APIs to run in IOThreads. Yes. > > Do you have an idea for using vhost code for built-in devices? Maybe > it's fastest if you explain your idea and its advantages instead of me > guessing. It's something like I'd proposed in [1]: 1) a vhost that is implemented via IOThreads 2) memory translation is done via vhost memory table/IOTLB The advantages are: 1) No 3rd application like DPDK application 2) Attack surface were reduced 3) Better understanding/interactions with device model for things like RSS and IOMMU There could be some dis-advantages but it's not obvious to me :) It's something like linking SPDK/DPDK to Qemu. > > > > > > Regarding pinning - I wonder if that's something that can be refined > > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning > > > > > of umem. That way only rx and tx buffers that are currently in use > > > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin > > > > > pages. I'm not sure whether it's possible to implement this, I haven't > > > > > checked the kernel code. > > > > > > > > It requires the device to do page faults which is not commonly > > > > supported nowadays. > > > > > > I don't understand this comment. AF_XDP processes each rx/tx > > > descriptor. At that point it can getuserpages() or similar in order to > > > pin the page. When the memory is no longer needed, it can put those > > > pages. No fault mechanism is needed. What am I missing? > > > > Ok, I think I kind of get you, you mean doing pinning while processing > > rx/tx buffers? It's not easy since GUP itself is not very fast, it may > > hit PPS for sure. > > Yes. It's not as fast as permanently pinning rx/tx buffers, but it > supports unpinned guest RAM. Right, it's a balance between pin and PPS. PPS seems to be more important in this case. 
> > There are variations on this approach, like keeping a certain amount > of pages pinned after they have been used so the cost of > pinning/unpinning can be avoided when the same pages are reused in the > future, but I don't know how effective that is in practice. > > Is there a more efficient approach without relying on hardware page > fault support? I guess so, I see some slides that say device page fault is very slow. > > My understanding is that hardware page fault support is not yet > deployed. We'd be left with pinning guest RAM permanently or using a > runtime pinning/unpinning approach like I've described. Probably. Thanks > > Stefan >
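[Editor's note: Jason's point about translation via a vhost memory table can be illustrated with a sketch. Because vhost only tracks RAM, a GPA-to-HVA lookup reduces to a short search over a flat array of regions, with no I/O or ROM sections to consider. The region layout below is invented for the example and does not reflect any real machine:]

```c
#include <stdint.h>
#include <stddef.h>

/* One entry of a vhost-style memory table: a flat GPA -> HVA mapping
 * covering guest RAM only. */
struct mem_region {
    uint64_t gpa;       /* guest physical start */
    uint64_t size;
    uint64_t hva;       /* host virtual start */
};

/* Translate a guest physical address, or return 0 if it is not RAM. */
static uint64_t gpa_to_hva(const struct mem_region *tbl, size_t n,
                           uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= tbl[i].gpa && gpa - tbl[i].gpa < tbl[i].size) {
            return tbl[i].hva + (gpa - tbl[i].gpa);
        }
    }
    return 0;
}
```

A handful of array entries versus a walk of the full memory topology is the difference being argued about here; whether QEMU's memory core can be made comparably cheap on this path is exactly the open question in the thread.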
On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > >> > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > >>>> > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > > > > > >> So, that might be one case. 
Taking into account that just rcu lock and > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > > > > > >> too hard to implement. > > > > > > > > > > > > >> > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > > > > > >> scale well. > > > > > > > > > > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > > > > > perform transmission for us. > > > > > > > > > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > > > > > cost. > > > > > > > > > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > > > > > > on patches to re-enable it and will probably send them in July. 
The > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > > > > > seems expensive. > > > > > > > > > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > > > > > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > > > > > umem. > > > > > > > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > > > > > support 2 stages) which needs to go via qemu memory core. And this > > > > > > > part seems to be very expensive according to my test in the past. > > > > > > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > > > > > happening. So the GPA to HVA translation will happen anyway in device > > > > > > emulation. > > > > > > > > > > Just to make sure we're on the same page. > > > > > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > > > > > QEMU netdev, it would be very hard to achieve that if we stick to > > > > > using the Qemu memory core translations which need to take care about > > > > > too much extra stuff. That's why I suggest using vhost in io threads > > > > > which only cares about ram so the translation could be very fast. 
> > > > > > > > What does using "vhost in io threads" mean? > > > > > > It means a vhost userspace dataplane that is implemented via io threads. > > > > AFAIK this does not exist today. QEMU's built-in devices that use > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, > > vhost-user, or vDPA but not built-in devices that use IOThreads. The > > built-in devices implement VirtioDeviceClass callbacks directly and > > use AioContext APIs to run in IOThreads. > > Yes. > > > > > Do you have an idea for using vhost code for built-in devices? Maybe > > it's fastest if you explain your idea and its advantages instead of me > > guessing. > > It's something like I'd proposed in [1]: > > 1) a vhost that is implemented via IOThreads > 2) memory translation is done via vhost memory table/IOTLB > > The advantages are: > > 1) No 3rd application like DPDK application > 2) Attack surface were reduced > 3) Better understanding/interactions with device model for things like > RSS and IOMMU > > There could be some dis-advantages but it's not obvious to me :) Why is QEMU's native device emulation API not the natural choice for writing built-in devices? I don't understand why the vhost interface is desirable for built-in devices. > > It's something like linking SPDK/DPDK to Qemu. Sergio Lopez tried loading vhost-user devices as shared libraries that run in the QEMU process. It worked as an experiment but wasn't pursued further. I think that might make sense in specific cases where there is an existing vhost-user codebase that needs to run as part of QEMU. In this case the AF_XDP code is new, so it's not a case of moving existing code into QEMU. > > > > > > > > > Regarding pinning - I wonder if that's something that can be refined > > > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning > > > > > > of umem. That way only rx and tx buffers that are currently in use > > > > > > will be pinned. 
The disadvantage is the runtime overhead to pin/unpin > > > > > > pages. I'm not sure whether it's possible to implement this, I haven't > > > > > > checked the kernel code. > > > > > > > > > > It requires the device to do page faults which is not commonly > > > > > supported nowadays. > > > > > > > > I don't understand this comment. AF_XDP processes each rx/tx > > > > descriptor. At that point it can getuserpages() or similar in order to > > > > pin the page. When the memory is no longer needed, it can put those > > > > pages. No fault mechanism is needed. What am I missing? > > > > > > Ok, I think I kind of get you, you mean doing pinning while processing > > > rx/tx buffers? It's not easy since GUP itself is not very fast, it may > > > hit PPS for sure. > > > > Yes. It's not as fast as permanently pinning rx/tx buffers, but it > > supports unpinned guest RAM. > > Right, it's a balance between pin and PPS. PPS seems to be more > important in this case. > > > > > There are variations on this approach, like keeping a certain amount > > of pages pinned after they have been used so the cost of > > pinning/unpinning can be avoided when the same pages are reused in the > > future, but I don't know how effective that is in practice. > > > > Is there a more efficient approach without relying on hardware page > > fault support? > > I guess so, I see some slides that say device page fault is very slow. > > > > > My understanding is that hardware page fault support is not yet > > deployed. We'd be left with pinning guest RAM permanently or using a > > runtime pinning/unpinning approach like I've described. > > Probably. > > Thanks > > > > > Stefan > > >
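[Editor's note: for readers following the "copy between guest memory and umem" point quoted repeatedly above, the umem frame lifecycle from the cover letter (device fills frames taken from the Fill ring, QEMU copies the payload out and recycles the frame) can be modeled as below. The rings and helpers are simplified stand-ins, not the real XSK API:]

```c
#include <stdint.h>
#include <string.h>

#define NUM_FRAMES 4
#define FRAME_SIZE 2048

/* Toy stand-in for the shared umem area. */
static uint8_t umem[NUM_FRAMES * FRAME_SIZE];

/* Minimal single-producer/single-consumer ring of frame addresses. */
struct ring {
    uint64_t addr[NUM_FRAMES];
    unsigned head, tail;        /* head: produce, tail: consume */
};

static void ring_put(struct ring *r, uint64_t addr)
{
    r->addr[r->head++ % NUM_FRAMES] = addr;
}

static int ring_get(struct ring *r, uint64_t *addr)
{
    if (r->tail == r->head) {
        return 0;               /* ring empty */
    }
    *addr = r->addr[r->tail++ % NUM_FRAMES];
    return 1;
}

/* "Kernel" side of rx: take a frame from the fill ring, write the packet
 * payload into umem, publish the frame on the rx ring. */
static int rx_packet(struct ring *fill, struct ring *rx,
                     const void *pkt, size_t len)
{
    uint64_t addr;
    if (!ring_get(fill, &addr)) {
        return 0;               /* no free frame: packet dropped */
    }
    memcpy(&umem[addr], pkt, len);
    ring_put(rx, addr);
    return 1;
}

/* "QEMU" side: consume from the rx ring, copy the payload out toward
 * guest memory, recycle the frame back onto the fill ring. */
static int recv_packet(struct ring *fill, struct ring *rx,
                       void *guest_buf, size_t len)
{
    uint64_t addr;
    if (!ring_get(rx, &addr)) {
        return 0;
    }
    memcpy(guest_buf, &umem[addr], len);   /* the guest <-> umem copy */
    ring_put(fill, addr);
    return 1;
}
```

The second memcpy() is the copy under discussion: it is unavoidable as long as the payload must live in umem while guest buffers live elsewhere, independent of how the event loop (io_uring or otherwise) is driven.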
On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: > > > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > >>>> > > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > > > > > > >> So, that might be one case. 
Taking into account that just rcu lock and > > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > > > > > > >> too hard to implement. > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > > > > > > >> scale well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > > > > > > perform transmission for us. > > > > > > > > > > > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > > > > > > cost. > > > > > > > > > > > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. 
I'm working > > > > > > > > > > > on patches to re-enable it and will probably send them in July. The > > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > > > > > > seems expensive. > > > > > > > > > > > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > > > > > > > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > > > > > > umem. > > > > > > > > > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > > > > > > support 2 stages) which needs to go via qemu memory core. And this > > > > > > > > part seems to be very expensive according to my test in the past. > > > > > > > > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > > > > > > happening. So the GPA to HVA translation will happen anyway in device > > > > > > > emulation. > > > > > > > > > > > > Just to make sure we're on the same page. > > > > > > > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > > > > > > QEMU netdev, it would be very hard to achieve that if we stick to > > > > > > using the Qemu memory core translations which need to take care about > > > > > > too much extra stuff. 
That's why I suggest using vhost in io threads > > > > > > which only cares about ram so the translation could be very fast. > > > > > > > > > > What does using "vhost in io threads" mean? > > > > > > > > It means a vhost userspace dataplane that is implemented via io threads. > > > > > > AFAIK this does not exist today. QEMU's built-in devices that use > > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, > > > vhost-user, or vDPA but not built-in devices that use IOThreads. The > > > built-in devices implement VirtioDeviceClass callbacks directly and > > > use AioContext APIs to run in IOThreads. > > > > Yes. > > > > > > > > Do you have an idea for using vhost code for built-in devices? Maybe > > > it's fastest if you explain your idea and its advantages instead of me > > > guessing. > > > > It's something like I'd proposed in [1]: > > > > 1) a vhost that is implemented via IOThreads > > 2) memory translation is done via vhost memory table/IOTLB > > > > The advantages are: > > > > 1) No 3rd application like DPDK application > > 2) Attack surface were reduced > > 3) Better understanding/interactions with device model for things like > > RSS and IOMMU > > > > There could be some dis-advantages but it's not obvious to me :) > > Why is QEMU's native device emulation API not the natural choice for > writing built-in devices? I don't understand why the vhost interface > is desirable for built-in devices. Unless the memory helpers (like address translations) were optimized fully to satisfy this 10M+ PPS. Not sure if this is too hard, but last time I benchmark, perf told me most of the time spent in the translation. Using a vhost is a workaround since its memory model is much more simpler so it can skip lots of memory sections like I/O and ROM etc. Thanks > > > > > It's something like linking SPDK/DPDK to Qemu. > > Sergio Lopez tried loading vhost-user devices as shared libraries that > run in the QEMU process. 
It worked as an experiment but wasn't pursued > further. > > I think that might make sense in specific cases where there is an > existing vhost-user codebase that needs to run as part of QEMU. > > In this case the AF_XDP code is new, so it's not a case of moving > existing code into QEMU. > > > > > > > > > > > > > Regarding pinning - I wonder if that's something that can be refined > > > > > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning > > > > > > > of umem. That way only rx and tx buffers that are currently in use > > > > > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin > > > > > > > pages. I'm not sure whether it's possible to implement this, I haven't > > > > > > > checked the kernel code. > > > > > > > > > > > > It requires the device to do page faults which is not commonly > > > > > > supported nowadays. > > > > > > > > > > I don't understand this comment. AF_XDP processes each rx/tx > > > > > descriptor. At that point it can getuserpages() or similar in order to > > > > > pin the page. When the memory is no longer needed, it can put those > > > > > pages. No fault mechanism is needed. What am I missing? > > > > > > > > Ok, I think I kind of get you, you mean doing pinning while processing > > > > rx/tx buffers? It's not easy since GUP itself is not very fast, it may > > > > hit PPS for sure. > > > > > > Yes. It's not as fast as permanently pinning rx/tx buffers, but it > > > supports unpinned guest RAM. > > > > Right, it's a balance between pin and PPS. PPS seems to be more > > important in this case. > > > > > > > > There are variations on this approach, like keeping a certain amount > > > of pages pinned after they have been used so the cost of > > > pinning/unpinning can be avoided when the same pages are reused in the > > > future, but I don't know how effective that is in practice. > > > > > > Is there a more efficient approach without relying on hardware page > > > fault support? 
> > > > I guess so, I see some slides that say device page fault is very slow. > > > > > > > > My understanding is that hardware page fault support is not yet > > > deployed. We'd be left with pinning guest RAM permanently or using a > > > runtime pinning/unpinning approach like I've described. > > > > Probably. > > > > Thanks > > > > > > > > Stefan > > > > > >
On 7/7/23 03:43, Jason Wang wrote: > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: >> >> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: >>> >>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> >>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: >>>>> >>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>> >>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: >>>>>>> >>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>> >>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>> >>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote: >>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote: >>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS. >>>>>>>>>>>>>>>> So, that might be one case. Taking into account that just rcu lock and >>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching >>>>>>>>>>>>>>>> on QEMU side should improve performance significantly. 
And it shouldn't be >>>>>>>>>>>>>>>> too hard to implement. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating >>>>>>>>>>>>>>>> a kernel thread for async Tx. Similarly to what io_uring allows. Currently >>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to >>>>>>>>>>>>>>>> scale well. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between >>>>>>>>>>>>>>> io_uring and AF_XDP: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1) both have similar memory model (user register) >>>>>>>>>>>>>>> 2) both use ring for communication >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can >>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for >>>>>>>>>>>>>> virtual interfaces. io_uring thread in the kernel will be able to >>>>>>>>>>>>>> perform transmission for us. >>>>>>>>>>>>> >>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop >>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation >>>>>>>>>>>>> cost. >>>>>>>>>>>> >>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code >>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working >>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The >>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so >>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the >>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts. >>>>>>>>>>> >>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to >>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which >>>>>>>>>>> seems expensive. >>>>>>>>>>> >>>>>>>>>>> Vhost seems to be a shortcut for this. 
>>>>>>>>>> >>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring. >>>>>>>>>> >>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring) >>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to >>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and >>>>>>>>>> umem. >>>>>>>>> >>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring >>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this >>>>>>>>> part seems to be very expensive according to my test in the past. >>>>>>>> >>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU >>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net) >>>>>>>> happening. So the GPA to HVA translation will happen anyway in device >>>>>>>> emulation. >>>>>>> >>>>>>> Just to make sure we're on the same page. >>>>>>> >>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the >>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to >>>>>>> using the Qemu memory core translations which need to take care about >>>>>>> too much extra stuff. That's why I suggest using vhost in io threads >>>>>>> which only cares about ram so the translation could be very fast. >>>>>> >>>>>> What does using "vhost in io threads" mean? >>>>> >>>>> It means a vhost userspace dataplane that is implemented via io threads. >>>> >>>> AFAIK this does not exist today. QEMU's built-in devices that use >>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, >>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The >>>> built-in devices implement VirtioDeviceClass callbacks directly and >>>> use AioContext APIs to run in IOThreads. >>> >>> Yes. >>> >>>> >>>> Do you have an idea for using vhost code for built-in devices? Maybe >>>> it's fastest if you explain your idea and its advantages instead of me >>>> guessing. 
>>> >>> It's something like I'd proposed in [1]: >>> >>> 1) a vhost that is implemented via IOThreads >>> 2) memory translation is done via vhost memory table/IOTLB >>> >>> The advantages are: >>> >>> 1) No 3rd application like DPDK application >>> 2) Attack surface were reduced >>> 3) Better understanding/interactions with device model for things like >>> RSS and IOMMU >>> >>> There could be some dis-advantages but it's not obvious to me :) >> >> Why is QEMU's native device emulation API not the natural choice for >> writing built-in devices? I don't understand why the vhost interface >> is desirable for built-in devices. > > Unless the memory helpers (like address translations) were optimized > fully to satisfy this 10M+ PPS. > > Not sure if this is too hard, but last time I benchmark, perf told me > most of the time spent in the translation. > > Using a vhost is a workaround since its memory model is much more > simpler so it can skip lots of memory sections like I/O and ROM etc. So, we can have a thread running as part of QEMU process that implements vhost functionality for a virtio-net device. And this thread has an optimized way to access memory. What prevents current virtio-net emulation code accessing memory in the same optimized way? i.e. we likely don't actually need to implement the whole vhost-virtio communication protocol in order to have faster memory access from the device emulation code. I mean, if vhost can access device memory faster, why device itself can't? With that we could probably split the "datapath" part of the virtio-net emulation into a separate thread driven by iothread loop. Then add batch API for communication with a network backend (af-xdp) to avoid per-packet calls. These are 3 more or less independent tasks that should allow the similar performance to a full fledged vhost control and dataplane implementation inside QEMU. Or am I missing something? (Probably) > > Thanks > >> >>> >>> It's something like linking SPDK/DPDK to Qemu. 
>> >> Sergio Lopez tried loading vhost-user devices as shared libraries that >> run in the QEMU process. It worked as an experiment but wasn't pursued >> further. >> >> I think that might make sense in specific cases where there is an >> existing vhost-user codebase that needs to run as part of QEMU. >> >> In this case the AF_XDP code is new, so it's not a case of moving >> existing code into QEMU. >> >>> >>>> >>>>>>>> Regarding pinning - I wonder if that's something that can be refined >>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning >>>>>>>> of umem. That way only rx and tx buffers that are currently in use >>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin >>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't >>>>>>>> checked the kernel code. >>>>>>> >>>>>>> It requires the device to do page faults which is not commonly >>>>>>> supported nowadays. >>>>>> >>>>>> I don't understand this comment. AF_XDP processes each rx/tx >>>>>> descriptor. At that point it can getuserpages() or similar in order to >>>>>> pin the page. When the memory is no longer needed, it can put those >>>>>> pages. No fault mechanism is needed. What am I missing? >>>>> >>>>> Ok, I think I kind of get you, you mean doing pinning while processing >>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may >>>>> hit PPS for sure. >>>> >>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it >>>> supports unpinned guest RAM. >>> >>> Right, it's a balance between pin and PPS. PPS seems to be more >>> important in this case. >>> >>>> >>>> There are variations on this approach, like keeping a certain amount >>>> of pages pinned after they have been used so the cost of >>>> pinning/unpinning can be avoided when the same pages are reused in the >>>> future, but I don't know how effective that is in practice. 
>>>> >>>> Is there a more efficient approach without relying on hardware page >>>> fault support? >>> >>> I guess so, I see some slides that say device page fault is very slow. >>> >>>> >>>> My understanding is that hardware page fault support is not yet >>>> deployed. We'd be left with pinning guest RAM permanently or using a >>>> runtime pinning/unpinning approach like I've described. >>> >>> Probably. >>> >>> Thanks >>> >>>> >>>> Stefan >>>> >>> >> >
On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 7/7/23 03:43, Jason Wang wrote: > > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> > >> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: > >>> > >>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>> > >>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > >>>>> > >>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>>> > >>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > >>>>>>> > >>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>>>>> > >>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>> > >>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote: > >>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote: > >>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS. > >>>>>>>>>>>>>>>> So, that might be one case. 
Taking into account that just rcu lock and > >>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching > >>>>>>>>>>>>>>>> on QEMU side should improve performance significantly. And it shouldn't be > >>>>>>>>>>>>>>>> too hard to implement. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating > >>>>>>>>>>>>>>>> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > >>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > >>>>>>>>>>>>>>>> scale well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between > >>>>>>>>>>>>>>> io_uring and AF_XDP: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 1) both have similar memory model (user register) > >>>>>>>>>>>>>>> 2) both use ring for communication > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can > >>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > >>>>>>>>>>>>>> virtual interfaces. io_uring thread in the kernel will be able to > >>>>>>>>>>>>>> perform transmission for us. > >>>>>>>>>>>>> > >>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop > >>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation > >>>>>>>>>>>>> cost. > >>>>>>>>>>>> > >>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code > >>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > >>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The > >>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so > >>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. 
Both the > >>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts. > >>>>>>>>>>> > >>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to > >>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which > >>>>>>>>>>> seems expensive. > >>>>>>>>>>> > >>>>>>>>>>> Vhost seems to be a shortcut for this. > >>>>>>>>>> > >>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring. > >>>>>>>>>> > >>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring) > >>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to > >>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and > >>>>>>>>>> umem. > >>>>>>>>> > >>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring > >>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this > >>>>>>>>> part seems to be very expensive according to my test in the past. > >>>>>>>> > >>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU > >>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net) > >>>>>>>> happening. So the GPA to HVA translation will happen anyway in device > >>>>>>>> emulation. > >>>>>>> > >>>>>>> Just to make sure we're on the same page. > >>>>>>> > >>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > >>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to > >>>>>>> using the Qemu memory core translations which need to take care about > >>>>>>> too much extra stuff. That's why I suggest using vhost in io threads > >>>>>>> which only cares about ram so the translation could be very fast. > >>>>>> > >>>>>> What does using "vhost in io threads" mean? > >>>>> > >>>>> It means a vhost userspace dataplane that is implemented via io threads. > >>>> > >>>> AFAIK this does not exist today. QEMU's built-in devices that use > >>>> IOThreads don't use vhost code. 
QEMU vhost code is for vhost kernel, > >>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The > >>>> built-in devices implement VirtioDeviceClass callbacks directly and > >>>> use AioContext APIs to run in IOThreads. > >>> > >>> Yes. > >>> > >>>> > >>>> Do you have an idea for using vhost code for built-in devices? Maybe > >>>> it's fastest if you explain your idea and its advantages instead of me > >>>> guessing. > >>> > >>> It's something like I'd proposed in [1]: > >>> > >>> 1) a vhost that is implemented via IOThreads > >>> 2) memory translation is done via vhost memory table/IOTLB > >>> > >>> The advantages are: > >>> > >>> 1) No 3rd application like DPDK application > >>> 2) Attack surface were reduced > >>> 3) Better understanding/interactions with device model for things like > >>> RSS and IOMMU > >>> > >>> There could be some dis-advantages but it's not obvious to me :) > >> > >> Why is QEMU's native device emulation API not the natural choice for > >> writing built-in devices? I don't understand why the vhost interface > >> is desirable for built-in devices. > > > > Unless the memory helpers (like address translations) were optimized > > fully to satisfy this 10M+ PPS. > > > > Not sure if this is too hard, but last time I benchmark, perf told me > > most of the time spent in the translation. > > > > Using a vhost is a workaround since its memory model is much more > > simpler so it can skip lots of memory sections like I/O and ROM etc. > > So, we can have a thread running as part of QEMU process that implements > vhost functionality for a virtio-net device. And this thread has an > optimized way to access memory. What prevents current virtio-net emulation > code accessing memory in the same optimized way? Current emulation using memory core accessors which needs to take care of a lot of stuff like MMIO or even P2P. Such kind of stuff is not considered since day0 of vhost. 
You can do some experiments on this, e.g. just drop packets after fetching them from the TX ring. > i.e. we likely don't > actually need to implement the whole vhost-virtio communication protocol > in order to have faster memory access from the device emulation code. > I mean, if vhost can access device memory faster, why device itself can't? I'm not saying it can't but it would end up with something similar to vhost. And that's why I'm saying using vhost is a shortcut (at least for a POC). Thanks > > With that we could probably split the "datapath" part of the virtio-net > emulation into a separate thread driven by iothread loop. > > Then add batch API for communication with a network backend (af-xdp) to > avoid per-packet calls. > > These are 3 more or less independent tasks that should allow the similar > performance to a full fledged vhost control and dataplane implementation > inside QEMU. > > Or am I missing something? (Probably) > > > > > Thanks > > > >> > >>> > >>> It's something like linking SPDK/DPDK to Qemu. > >> > >> Sergio Lopez tried loading vhost-user devices as shared libraries that > >> run in the QEMU process. It worked as an experiment but wasn't pursued > >> further. > >> > >> I think that might make sense in specific cases where there is an > >> existing vhost-user codebase that needs to run as part of QEMU. > >> > >> In this case the AF_XDP code is new, so it's not a case of moving > >> existing code into QEMU. > >> > >>> > >>>> > >>>>>>>> Regarding pinning - I wonder if that's something that can be refined > >>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning > >>>>>>>> of umem. That way only rx and tx buffers that are currently in use > >>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin > >>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't > >>>>>>>> checked the kernel code. 
> >>>>>>> > >>>>>>> It requires the device to do page faults which is not commonly > >>>>>>> supported nowadays. > >>>>>> > >>>>>> I don't understand this comment. AF_XDP processes each rx/tx > >>>>>> descriptor. At that point it can getuserpages() or similar in order to > >>>>>> pin the page. When the memory is no longer needed, it can put those > >>>>>> pages. No fault mechanism is needed. What am I missing? > >>>>> > >>>>> Ok, I think I kind of get you, you mean doing pinning while processing > >>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may > >>>>> hit PPS for sure. > >>>> > >>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it > >>>> supports unpinned guest RAM. > >>> > >>> Right, it's a balance between pin and PPS. PPS seems to be more > >>> important in this case. > >>> > >>>> > >>>> There are variations on this approach, like keeping a certain amount > >>>> of pages pinned after they have been used so the cost of > >>>> pinning/unpinning can be avoided when the same pages are reused in the > >>>> future, but I don't know how effective that is in practice. > >>>> > >>>> Is there a more efficient approach without relying on hardware page > >>>> fault support? > >>> > >>> I guess so, I see some slides that say device page fault is very slow. > >>> > >>>> > >>>> My understanding is that hardware page fault support is not yet > >>>> deployed. We'd be left with pinning guest RAM permanently or using a > >>>> runtime pinning/unpinning approach like I've described. > >>> > >>> Probably. > >>> > >>> Thanks > >>> > >>>> > >>>> Stefan > >>>> > >>> > >> > > >
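Jason's point that the vhost memory model is "much more simpler" can be sketched roughly: the table a VHOST_SET_MEM_TABLE-style call installs is just a short, sorted list of RAM-only regions, so GPA-to-HVA translation is a binary search with no MMIO/ROM/P2P cases to consider. Illustrative Python with made-up addresses, not QEMU or kernel code:

```python
import bisect

# Hypothetical vhost-style memory table: a few sorted RAM-only regions,
# each (gpa_start, size, hva_start). MMIO and ROM simply do not exist here,
# which is exactly why translation stays cheap.

class MemTable:
    def __init__(self, regions):
        self.regions = sorted(regions)
        self.starts = [gpa for gpa, _, _ in self.regions]

    def gpa_to_hva(self, gpa):
        i = bisect.bisect_right(self.starts, gpa) - 1
        if i >= 0:
            start, size, hva = self.regions[i]
            if gpa < start + size:
                return hva + (gpa - start)
        return None  # not guest RAM; the full memory core would dispatch here


# Two RAM regions with a PCI hole between them (addresses are invented).
table = MemTable([
    (0x0000_0000, 0x8000_0000, 0x7f00_0000_0000),
    (0x1_0000_0000, 0x8000_0000, 0x7f80_0000_0000),
])
```

A real vhost memory table typically has well under a dozen entries, so even a linear scan is a handful of compares per packet; QEMU's general-purpose memory core has to consider far more section types on every access.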
On 7/10/23 05:51, Jason Wang wrote: > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 7/7/23 03:43, Jason Wang wrote: >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: >>>>> >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>> >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: >>>>>>> >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>> >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>> >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote: >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote: >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS. >>>>>>>>>>>>>>>>>> So, that might be one case. 
Taking into account that just rcu lock and >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly. And it shouldn't be >>>>>>>>>>>>>>>>>> too hard to implement. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating >>>>>>>>>>>>>>>>>> a kernel thread for async Tx. Similarly to what io_uring allows. Currently >>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to >>>>>>>>>>>>>>>>>> scale well. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between >>>>>>>>>>>>>>>>> io_uring and AF_XDP: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1) both have similar memory model (user register) >>>>>>>>>>>>>>>>> 2) both use ring for communication >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can >>>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for >>>>>>>>>>>>>>>> virtual interfaces. io_uring thread in the kernel will be able to >>>>>>>>>>>>>>>> perform transmission for us. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop >>>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation >>>>>>>>>>>>>>> cost. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code >>>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. 
Both the >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts. >>>>>>>>>>>>> >>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to >>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which >>>>>>>>>>>>> seems expensive. >>>>>>>>>>>>> >>>>>>>>>>>>> Vhost seems to be a shortcut for this. >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring. >>>>>>>>>>>> >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring) >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and >>>>>>>>>>>> umem. >>>>>>>>>>> >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring >>>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this >>>>>>>>>>> part seems to be very expensive according to my test in the past. >>>>>>>>>> >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net) >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device >>>>>>>>>> emulation. >>>>>>>>> >>>>>>>>> Just to make sure we're on the same page. >>>>>>>>> >>>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the >>>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to >>>>>>>>> using the Qemu memory core translations which need to take care about >>>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads >>>>>>>>> which only cares about ram so the translation could be very fast. >>>>>>>> >>>>>>>> What does using "vhost in io threads" mean? >>>>>>> >>>>>>> It means a vhost userspace dataplane that is implemented via io threads. >>>>>> >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use >>>>>> IOThreads don't use vhost code. 
QEMU vhost code is for vhost kernel, >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and >>>>>> use AioContext APIs to run in IOThreads. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe >>>>>> it's fastest if you explain your idea and its advantages instead of me >>>>>> guessing. >>>>> >>>>> It's something like I'd proposed in [1]: >>>>> >>>>> 1) a vhost that is implemented via IOThreads >>>>> 2) memory translation is done via vhost memory table/IOTLB >>>>> >>>>> The advantages are: >>>>> >>>>> 1) No 3rd application like DPDK application >>>>> 2) Attack surface were reduced >>>>> 3) Better understanding/interactions with device model for things like >>>>> RSS and IOMMU >>>>> >>>>> There could be some dis-advantages but it's not obvious to me :) >>>> >>>> Why is QEMU's native device emulation API not the natural choice for >>>> writing built-in devices? I don't understand why the vhost interface >>>> is desirable for built-in devices. >>> >>> Unless the memory helpers (like address translations) were optimized >>> fully to satisfy this 10M+ PPS. >>> >>> Not sure if this is too hard, but last time I benchmark, perf told me >>> most of the time spent in the translation. >>> >>> Using a vhost is a workaround since its memory model is much more >>> simpler so it can skip lots of memory sections like I/O and ROM etc. >> >> So, we can have a thread running as part of QEMU process that implements >> vhost functionality for a virtio-net device. And this thread has an >> optimized way to access memory. What prevents current virtio-net emulation >> code accessing memory in the same optimized way? > > Current emulation using memory core accessors which needs to take care > of a lot of stuff like MMIO or even P2P. Such kind of stuff is not > considered since day0 of vhost. 
You can do some experiment on this e.g > just dropping packets after fetching it from the TX ring. If I'm reading that right, the virtio implementation uses address space caching: a memory listener pre-translates the addresses of interesting memory regions, and address_space_read_cached then bypasses all of the memory address translation logic on a cache hit. That sounds pretty similar to how the memory table is prepared for vhost. > >> i.e. we likely don't >> actually need to implement the whole vhost-virtio communication protocol >> in order to have faster memory access from the device emulation code. >> I mean, if vhost can access device memory faster, why device itself can't? > > I'm not saying it can't but it would end up with something similar to > vhost. And that's why I'm saying using vhost is a shortcut (at least > for a POC). > > Thanks > >> >> With that we could probably split the "datapath" part of the virtio-net >> emulation into a separate thread driven by iothread loop. >> >> Then add batch API for communication with a network backend (af-xdp) to >> avoid per-packet calls. >> >> These are 3 more or less independent tasks that should allow the similar >> performance to a full fledged vhost control and dataplane implementation >> inside QEMU. >> >> Or am I missing something? (Probably) >> >>> >>> Thanks >>> >>>> >>>>> >>>>> It's something like linking SPDK/DPDK to Qemu. >>>> >>>> Sergio Lopez tried loading vhost-user devices as shared libraries that >>>> run in the QEMU process. It worked as an experiment but wasn't pursued >>>> further. >>>> >>>> I think that might make sense in specific cases where there is an >>>> existing vhost-user codebase that needs to run as part of QEMU. >>>> >>>> In this case the AF_XDP code is new, so it's not a case of moving >>>> existing code into QEMU. 
>>>> >>>>> >>>>>> >>>>>>>>>> Regarding pinning - I wonder if that's something that can be refined >>>>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning >>>>>>>>>> of umem. That way only rx and tx buffers that are currently in use >>>>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin >>>>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't >>>>>>>>>> checked the kernel code. >>>>>>>>> >>>>>>>>> It requires the device to do page faults which is not commonly >>>>>>>>> supported nowadays. >>>>>>>> >>>>>>>> I don't understand this comment. AF_XDP processes each rx/tx >>>>>>>> descriptor. At that point it can getuserpages() or similar in order to >>>>>>>> pin the page. When the memory is no longer needed, it can put those >>>>>>>> pages. No fault mechanism is needed. What am I missing? >>>>>>> >>>>>>> Ok, I think I kind of get you, you mean doing pinning while processing >>>>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may >>>>>>> hit PPS for sure. >>>>>> >>>>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it >>>>>> supports unpinned guest RAM. >>>>> >>>>> Right, it's a balance between pin and PPS. PPS seems to be more >>>>> important in this case. >>>>> >>>>>> >>>>>> There are variations on this approach, like keeping a certain amount >>>>>> of pages pinned after they have been used so the cost of >>>>>> pinning/unpinning can be avoided when the same pages are reused in the >>>>>> future, but I don't know how effective that is in practice. >>>>>> >>>>>> Is there a more efficient approach without relying on hardware page >>>>>> fault support? >>>>> >>>>> I guess so, I see some slides that say device page fault is very slow. >>>>> >>>>>> >>>>>> My understanding is that hardware page fault support is not yet >>>>>> deployed. We'd be left with pinning guest RAM permanently or using a >>>>>> runtime pinning/unpinning approach like I've described. 
>>>>> >>>>> Probably. >>>>> >>>>> Thanks >>>>> >>>>>> >>>>>> Stefan >>>>>> >>>>> >>>> >>> >> >
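The caching idea referenced above (pre-translate once, then reuse the result on a hit) can be modeled as remembering the last region a translation matched, so back-to-back accesses to the same rings skip the full lookup. This is a toy Python model with invented names, not QEMU's MemoryRegionCache/address_space_read_cached API:

```python
# Toy model of translation caching: remember the last region that matched
# so repeated accesses to the same area (e.g. a vring) skip the slow lookup.
# Names and structure are illustrative only.

class CachedTranslator:
    def __init__(self, regions):
        self.regions = regions  # list of (gpa_start, size, hva_start)
        self.cache = None       # last region that matched
        self.hits = 0
        self.misses = 0

    def translate(self, gpa):
        if self.cache is not None:
            start, size, hva = self.cache
            if start <= gpa < start + size:
                self.hits += 1
                return hva + (gpa - start)
        self.misses += 1  # slow path: a linear scan stands in for dispatch
        for start, size, hva in self.regions:
            if start <= gpa < start + size:
                self.cache = (start, size, hva)
                return hva + (gpa - start)
        return None


t = CachedTranslator([(0x0000, 0x1000, 0x10000), (0x1000, 0x1000, 0x20000)])
for off in range(0, 0x1000, 0x100):  # vring-like accesses within one region
    t.translate(off)
```

Vring access patterns are exactly the favorable case: the rings live inside one contiguous region, so after the first miss every subsequent access hits the cache.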
On Thu, 6 Jul 2023 at 21:43, Jason Wang <jasowang@redhat.com> wrote: > > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > >>>> > > > > > > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > >> It is noticeably more performant than a tap with 
vhost=on in terms of PPS.
> > > > > > > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and
> > > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be
> > > > > > > > > > > > > > >> too hard to implement.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently
> > > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > > > > > >> scale well.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to
> > > > > > > > > > > > > > perform transmission for us.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > > > > > cost.
> > > > > > > > > > > >
> > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > > > > > >
> > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > > > > > seems expensive.
> > > > > > > > > > >
> > > > > > > > > > > Vhost seems to be a shortcut for this.
> > > > > > > > > >
> > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > > > > > >
> > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > > > > > umem.
> > > > > > > > >
> > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > > > > > part seems to be very expensive according to my test in the past.
> > > > > > > >
> > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > > > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > > > > > emulation.
> > > > > > >
> > > > > > > Just to make sure we're on the same page.
> > > > > > >
> > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the
> > > > > > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > > > > > using the Qemu memory core translations which need to take care about
> > > > > > > too much extra stuff. That's why I suggest using vhost in io threads
> > > > > > > which only cares about ram so the translation could be very fast.
> > > > > >
> > > > > > What does using "vhost in io threads" mean?
> > > > >
> > > > > It means a vhost userspace dataplane that is implemented via io threads.
> > > >
> > > > AFAIK this does not exist today. QEMU's built-in devices that use
> > > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> > > > vhost-user, or vDPA but not built-in devices that use IOThreads. The
> > > > built-in devices implement VirtioDeviceClass callbacks directly and
> > > > use AioContext APIs to run in IOThreads.
> > >
> > > Yes.
> > >
> > > > Do you have an idea for using vhost code for built-in devices? Maybe
> > > > it's fastest if you explain your idea and its advantages instead of me
> > > > guessing.
> > >
> > > It's something like I'd proposed in [1]:
> > >
> > > 1) a vhost that is implemented via IOThreads
> > > 2) memory translation is done via vhost memory table/IOTLB
> > >
> > > The advantages are:
> > >
> > > 1) No 3rd application like DPDK application
> > > 2) Attack surface were reduced
> > > 3) Better understanding/interactions with device model for things like
> > > RSS and IOMMU
> > >
> > > There could be some dis-advantages but it's not obvious to me :)
> >
> > Why is QEMU's native device emulation API not the natural choice for
> > writing built-in devices? I don't understand why the vhost interface
> > is desirable for built-in devices.
>
> Unless the memory helpers (like address translations) were optimized
> fully to satisfy this 10M+ PPS.
>
> Not sure if this is too hard, but last time I benchmark, perf told me
> most of the time spent in the translation.
>
> Using a vhost is a workaround since its memory model is much more
> simpler so it can skip lots of memory sections like I/O and ROM etc.

I see, that sounds like a question of optimization. Most DMA transfers
will be to/from guest RAM and it seems like QEMU's memory API could be
optimized for that case. PIO/MMIO dispatch could use a different API
from DMA transfers, if necessary.

I don't think there is a fundamental reason why QEMU's own device
emulation code cannot translate memory as fast as vhost devices can.

Stefan
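The translation cost being debated here can be sketched with a toy model (Python, purely illustrative; the section layout, sizes, and addresses are invented and this is not QEMU's actual memory API): a generic lookup must classify a guest physical address against every kind of memory section (RAM, MMIO, ROM, ...), while a vhost-style memory table pre-resolves the common guest-RAM case down to one range check plus an add.

```python
# Illustrative model of GPA -> HVA translation cost (not QEMU code).
# All section names, sizes, and base addresses below are made up.

RAM_GPA_BASE = 0x0000_0000
RAM_SIZE     = 0x4000_0000          # 1 GiB of guest RAM
RAM_HVA_BASE = 0x7f00_0000_0000     # where it sits in the emulator's address space

# A "full" translation walks every memory section, like a generic
# memory core has to (RAM, ROM, MMIO, ...).
SECTIONS = [
    ("mmio", 0xfee0_0000, 0x1000,    None),             # no direct HVA
    ("rom",  0xfffc_0000, 0x4_0000,  0x7f80_0000_0000),
    ("ram",  RAM_GPA_BASE, RAM_SIZE, RAM_HVA_BASE),
]

def translate_full(gpa):
    for name, base, size, hva_base in SECTIONS:
        if base <= gpa < base + size:
            if hva_base is None:
                raise ValueError("MMIO needs dispatch, not a pointer")
            return hva_base + (gpa - base)
    raise ValueError("unassigned address")

# A vhost-style memory table only knows about guest RAM, so the hot
# path is a single range check plus an add.
def translate_cached(gpa):
    assert RAM_GPA_BASE <= gpa < RAM_GPA_BASE + RAM_SIZE
    return RAM_HVA_BASE + (gpa - RAM_GPA_BASE)

print(hex(translate_full(0x1000)))
print(hex(translate_cached(0x1000)))
```

Both paths return the same host address for a RAM hit; the argument in the thread is only about how much work it takes to get there.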
On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/10/23 05:51, Jason Wang wrote:
> > Current emulation using memory core accessors which needs to take care
> > of a lot of stuff like MMIO or even P2P. Such kind of stuff is not
> > considered since day0 of vhost. You can do some experiment on this e.g
> > just dropping packets after fetching it from the TX ring.
>
> If I'm reading that right, virtio implementation is using address space
> caching by utilizing a memory listener and pre-translated addresses of
> interesting memory regions. Then it's performing address_space_read_cached,
> which is bypassing all the memory address translation logic on a cache hit.
> That sounds pretty similar to how memory table is prepared for vhost.

Exactly, but only for the vring memory structures (avail, used, and
descriptor rings in the Split Virtqueue Layout).

The packet headers and payloads are still translated using the
uncached virtqueue_pop() -> dma_memory_map() -> address_space_map()
API.

Running a tx packet drop benchmark as Jason suggested and checking if
memory translation is a bottleneck seems worthwhile. Improving
dma_memory_map() performance would speed up all built-in QEMU devices.

Jason: When you noticed this bottleneck, were you using a normal
virtio-net-pci device without vIOMMU?

Stefan
On Mon, Jul 10, 2023 at 6:55 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> If I'm reading that right, virtio implementation is using address space
> caching by utilizing a memory listener and pre-translated addresses of
> interesting memory regions. Then it's performing address_space_read_cached,
> which is bypassing all the memory address translation logic on a cache hit.
> That sounds pretty similar to how memory table is prepared for vhost.

It's only done for virtqueue metadata (desc, driver and device area),
we still need to do dma map for the packet buffer itself.

Thanks

>
> >> i.e. we likely don't
> >> actually need to implement the whole vhost-virtio communication protocol
> >> in order to have faster memory access from the device emulation code.
> >> I mean, if vhost can access device memory faster, why device itself can't?
> >
> > I'm not saying it can't but it would end up with something similar to
> > vhost. And that's why I'm saying using vhost is a shortcut (at least
> > for a POC).
> >
> > Thanks
> >
> >> With that we could probably split the "datapath" part of the virtio-net
> >> emulation into a separate thread driven by iothread loop.
> >>
> >> Then add batch API for communication with a network backend (af-xdp) to
> >> avoid per-packet calls.
> >>
> >> These are 3 more or less independent tasks that should allow the similar
> >> performance to a full fledged vhost control and dataplane implementation
> >> inside QEMU.
> >>
> >> Or am I missing something? (Probably)
> >>
> >>> Thanks
> >>>
> >>>>> It's something like linking SPDK/DPDK to Qemu.
> >>>>
> >>>> Sergio Lopez tried loading vhost-user devices as shared libraries that
> >>>> run in the QEMU process. It worked as an experiment but wasn't pursued
> >>>> further.
> >>>>
> >>>> I think that might make sense in specific cases where there is an
> >>>> existing vhost-user codebase that needs to run as part of QEMU.
> >>>>
> >>>> In this case the AF_XDP code is new, so it's not a case of moving
> >>>> existing code into QEMU.
> >>>>
> >>>>>>>>>> Regarding pinning - I wonder if that's something that can be refined
> >>>>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
> >>>>>>>>>> of umem. That way only rx and tx buffers that are currently in use
> >>>>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
> >>>>>>>>>> pages. I'm not sure whether it's possible to implement this, I haven't
> >>>>>>>>>> checked the kernel code.
> >>>>>>>>>
> >>>>>>>>> It requires the device to do page faults which is not commonly
> >>>>>>>>> supported nowadays.
> >>>>>>>>
> >>>>>>>> I don't understand this comment. AF_XDP processes each rx/tx
> >>>>>>>> descriptor. At that point it can getuserpages() or similar in order to
> >>>>>>>> pin the page. When the memory is no longer needed, it can put those
> >>>>>>>> pages. No fault mechanism is needed. What am I missing?
> >>>>>>>
> >>>>>>> Ok, I think I kind of get you, you mean doing pinning while processing
> >>>>>>> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> >>>>>>> hit PPS for sure.
> >>>>>>
> >>>>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
> >>>>>> supports unpinned guest RAM.
> >>>>>
> >>>>> Right, it's a balance between pin and PPS. PPS seems to be more
> >>>>> important in this case.
> >>>>>
> >>>>>> There are variations on this approach, like keeping a certain amount
> >>>>>> of pages pinned after they have been used so the cost of
> >>>>>> pinning/unpinning can be avoided when the same pages are reused in the
> >>>>>> future, but I don't know how effective that is in practice.
> >>>>>>
> >>>>>> Is there a more efficient approach without relying on hardware page
> >>>>>> fault support?
> >>>>>
> >>>>> I guess so, I see some slides that say device page fault is very slow.
> >>>>>
> >>>>>> My understanding is that hardware page fault support is not yet
> >>>>>> deployed. We'd be left with pinning guest RAM permanently or using a
> >>>>>> runtime pinning/unpinning approach like I've described.
> >>>>>
> >>>>> Probably.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>> Stefan
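The "keep a certain amount of pages pinned after they have been used" variation quoted above can be sketched as an LRU pin cache (illustrative Python; real pinning would be get_user_pages()/put_page() in the kernel, and the access pattern here is made up). Reuse of recently touched pages turns into cache hits, so pin/unpin churn drops compared to pinning on every access:

```python
# Sketch of an LRU "pin cache": pages stay pinned after use and are
# only unpinned when evicted to make room. Purely a userspace model.

from collections import OrderedDict

class PinCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = OrderedDict()    # page number -> pinned marker
        self.pin_calls = 0             # GUP-like operations performed
        self.unpin_calls = 0           # put-page-like operations performed

    def access(self, page):
        if page in self.pinned:
            self.pinned.move_to_end(page)    # LRU hit: no pin/unpin needed
            return
        if len(self.pinned) >= self.capacity:
            self.pinned.popitem(last=False)  # evict least recently used page...
            self.unpin_calls += 1            # ...which means unpinning it
        self.pinned[page] = True
        self.pin_calls += 1                  # pinning cost paid here

cache = PinCache(capacity=2)
for page in [1, 2, 1, 1, 2, 3, 1]:          # made-up rx/tx buffer reuse pattern
    cache.access(page)
print(cache.pin_calls, cache.unpin_calls)    # vs 7 pins with no cache at all
```

How effective this is in practice depends entirely on buffer reuse, which is exactly the open question in the thread.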
On Mon, Jul 10, 2023 at 11:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maximets@ovn.org> wrote: > > > > On 7/10/23 05:51, Jason Wang wrote: > > > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > >> > > >> On 7/7/23 03:43, Jason Wang wrote: > > >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >>>> > > >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: > > >>>>> > > >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >>>>>> > > >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > >>>>>>> > > >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >>>>>>>> > > >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > >>>>>>>>> > > >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >>>>>>>>>> > > >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote: > > >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote: > > >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > 
>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > >>>>>>>>>>>>>>>>>> So, that might be one case. Taking into account that just rcu lock and > > >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet copy, some batching > > >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly. And it shouldn't be > > >>>>>>>>>>>>>>>>>> too hard to implement. > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be improved by creating > > >>>>>>>>>>>>>>>>>> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > >>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > >>>>>>>>>>>>>>>>>> scale well. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Interestingly, actually, there are a lot of "duplication" between > > >>>>>>>>>>>>>>>>> io_uring and AF_XDP: > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> 1) both have similar memory model (user register) > > >>>>>>>>>>>>>>>>> 2) both use ring for communication > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> I wonder if we can let io_uring talks directly to AF_XDP. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Well, if we submit poll() in QEMU main loop via io_uring, then we can > > >>>>>>>>>>>>>>>> avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > >>>>>>>>>>>>>>>> virtual interfaces. io_uring thread in the kernel will be able to > > >>>>>>>>>>>>>>>> perform transmission for us. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> It would be nice if we can use iothread/vhost other than the main loop > > >>>>>>>>>>>>>>> even if io_uring can use kthreads. We can avoid the memory translation > > >>>>>>>>>>>>>>> cost. 
> > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code > > >>>>>>>>>>>>>> (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. The > > >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations so > > >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both the > > >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux hosts. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from guest to > > >>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which > > >>>>>>>>>>>>> seems expensive. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Vhost seems to be a shortcut for this. > > >>>>>>>>>>>> > > >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor monitoring) > > >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still needs to > > >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and > > >>>>>>>>>>>> umem. > > >>>>>>>>>>> > > >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring > > >>>>>>>>>>> support 2 stages) which needs to go via qemu memory core. And this > > >>>>>>>>>>> part seems to be very expensive according to my test in the past. > > >>>>>>>>>> > > >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a QEMU > > >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net) > > >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device > > >>>>>>>>>> emulation. > > >>>>>>>>> > > >>>>>>>>> Just to make sure we're on the same page. > > >>>>>>>>> > > >>>>>>>>> I meant, AF_XDP can do more than e.g 10Mpps. 
So if we still use the > > >>>>>>>>> QEMU netdev, it would be very hard to achieve that if we stick to > > >>>>>>>>> using the Qemu memory core translations which need to take care about > > >>>>>>>>> too much extra stuff. That's why I suggest using vhost in io threads > > >>>>>>>>> which only cares about ram so the translation could be very fast. > > >>>>>>>> > > >>>>>>>> What does using "vhost in io threads" mean? > > >>>>>>> > > >>>>>>> It means a vhost userspace dataplane that is implemented via io threads. > > >>>>>> > > >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use > > >>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, > > >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The > > >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and > > >>>>>> use AioContext APIs to run in IOThreads. > > >>>>> > > >>>>> Yes. > > >>>>> > > >>>>>> > > >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe > > >>>>>> it's fastest if you explain your idea and its advantages instead of me > > >>>>>> guessing. > > >>>>> > > >>>>> It's something like I'd proposed in [1]: > > >>>>> > > >>>>> 1) a vhost that is implemented via IOThreads > > >>>>> 2) memory translation is done via vhost memory table/IOTLB > > >>>>> > > >>>>> The advantages are: > > >>>>> > > >>>>> 1) No 3rd application like DPDK application > > >>>>> 2) Attack surface were reduced > > >>>>> 3) Better understanding/interactions with device model for things like > > >>>>> RSS and IOMMU > > >>>>> > > >>>>> There could be some dis-advantages but it's not obvious to me :) > > >>>> > > >>>> Why is QEMU's native device emulation API not the natural choice for > > >>>> writing built-in devices? I don't understand why the vhost interface > > >>>> is desirable for built-in devices. > > >>> > > >>> Unless the memory helpers (like address translations) were optimized > > >>> fully to satisfy this 10M+ PPS. 
> > >>> > > >>> Not sure if this is too hard, but last time I benchmark, perf told me > > >>> most of the time spent in the translation. > > >>> > > >>> Using a vhost is a workaround since its memory model is much more > > >>> simpler so it can skip lots of memory sections like I/O and ROM etc. > > >> > > >> So, we can have a thread running as part of QEMU process that implements > > >> vhost functionality for a virtio-net device. And this thread has an > > >> optimized way to access memory. What prevents current virtio-net emulation > > >> code accessing memory in the same optimized way? > > > > > > Current emulation using memory core accessors which needs to take care > > > of a lot of stuff like MMIO or even P2P. Such kind of stuff is not > > > considered since day0 of vhost. You can do some experiment on this e.g > > > just dropping packets after fetching it from the TX ring. > > > > If I'm reading that right, virtio implementation is using address space > > caching by utilizing a memory listener and pre-translated addresses of > > interesting memory regions. Then it's performing address_space_read_cached, > > which is bypassing all the memory address translation logic on a cache hit. > > That sounds pretty similar to how memory table is prepared for vhost. > > Exactly, but only for the vring memory structures (avail, used, and > descriptor rings in the Split Virtqueue Layout). Yes. It should speed up somehow. > > The packet headers and payloads are still translated using the > uncached virtqueue_pop() -> dma_memory_map() -> address_space_map() > API. > > Running a tx packet drop benchmark as Jason suggested and checking if > memory translation is a bottleneck seems worthwhile. Improving > dma_memory_map() performance would speed up all built-in QEMU devices. +1 > > Jason: When you noticed this bottleneck, were you using a normal > virtio-net-pci device without vIOMMU? Normal virtio-net-pci device without vIOMMU. Thanks > > Stefan >
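To make the cached vs. uncached split discussed above concrete: the vring cache amounts to pre-translating one RAM window per ring via a memory listener, while payload addresses still go through a full per-access lookup. The following is a self-contained toy model of that idea only — all names are hypothetical and this is not QEMU's memory API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat view of guest memory: host-backed sections plus MMIO. */
typedef struct {
    uint64_t gpa_start;   /* guest physical address of the section */
    uint64_t size;
    uint8_t *hva_base;    /* host virtual base; NULL for MMIO */
} MemSection;

/* Cold path: walk every section (the real dispatch does much more work). */
static uint8_t *gpa_to_hva_slow(const MemSection *secs, size_t n,
                                uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= secs[i].gpa_start &&
            gpa + len <= secs[i].gpa_start + secs[i].size) {
            return secs[i].hva_base
                   ? secs[i].hva_base + (gpa - secs[i].gpa_start)
                   : NULL;   /* MMIO: no direct host pointer */
        }
    }
    return NULL;            /* unmapped */
}

/* Hot path: one pre-translated RAM window, like a per-vring cache entry
 * or a vhost memory table slot. */
typedef struct {
    uint64_t gpa_start;
    uint64_t size;
    uint8_t *hva_base;
} XlatCache;

static uint8_t *gpa_to_hva_cached(const XlatCache *c,
                                  uint64_t gpa, uint64_t len)
{
    if (gpa >= c->gpa_start && gpa + len <= c->gpa_start + c->size) {
        return c->hva_base + (gpa - c->gpa_start);   /* hit: arithmetic only */
    }
    return NULL;            /* miss: would fall back to the slow path */
}
```

A vhost memory table is essentially a short array of such pre-translated RAM-only windows, which is why its lookups stay cheap compared to a dispatch that must also consider MMIO, ROM and P2P regions.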
On Mon, Jul 10, 2023 at 11:14 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Thu, 6 Jul 2023 at 21:43, Jason Wang <jasowang@redhat.com> wrote: > > > > On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote: > > > > > > > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote: > > > > > > > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > > > > > > > > >>>> > > > > > > > > > > > > > > > >>>> On 
Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > > > > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS. > > > > > > > > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and > > > > > > > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching > > > > > > > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be > > > > > > > > > > > > > > > >> too hard to implement. > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating > > > > > > > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently > > > > > > > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to > > > > > > > > > > > > > > > >> scale well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between > > > > > > > > > > > > > > > > io_uring and AF_XDP: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) both have similar memory model (user register) > > > > > > > > > > > > > > > > 2) both use ring for communication > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can > > > > > > > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for > > > > > > > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to > > > > > > > > > > > > > > > perform transmission for us. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop > > > > > > > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation > > > > > > > > > > > > > > cost. > > > > > > > > > > > > > > > > > > > > > > > > > > The QEMU event loop (AioContext) has io_uring code > > > > > > > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working > > > > > > > > > > > > > on patches to re-enable it and will probably send them in July. The > > > > > > > > > > > > > patches also add an API to submit arbitrary io_uring operations so > > > > > > > > > > > > > that you can do stuff besides file descriptor monitoring. Both the > > > > > > > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts. > > > > > > > > > > > > > > > > > > > > > > > > Just to make sure I understand. If we still need a copy from guest to > > > > > > > > > > > > io_uring buffer, we still need to go via memory API for GPA which > > > > > > > > > > > > seems expensive. > > > > > > > > > > > > > > > > > > > > > > > > Vhost seems to be a shortcut for this. > > > > > > > > > > > > > > > > > > > > > > I'm not sure how exactly you're thinking of using io_uring. > > > > > > > > > > > > > > > > > > > > > > Simply using io_uring for the event loop (file descriptor monitoring) > > > > > > > > > > > doesn't involve an extra buffer, but the packet payload still needs to > > > > > > > > > > > reside in AF_XDP umem, so there is a copy between guest memory and > > > > > > > > > > > umem. > > > > > > > > > > > > > > > > > > > > So there would be a translation from GPA to HVA (unless io_uring > > > > > > > > > > support 2 stages) which needs to go via qemu memory core. And this > > > > > > > > > > part seems to be very expensive according to my test in the past. 
> > > > > > > > > > > > > > > > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU > > > > > > > > > netdev, there is already QEMU device emulation (e.g. virtio-net) > > > > > > > > > happening. So the GPA to HVA translation will happen anyway in device > > > > > > > > > emulation. > > > > > > > > > > > > > > > > Just to make sure we're on the same page. > > > > > > > > > > > > > > > > I meant, AF_XDP can do more than e.g 10Mpps. So if we still use the > > > > > > > > QEMU netdev, it would be very hard to achieve that if we stick to > > > > > > > > using the Qemu memory core translations which need to take care about > > > > > > > > too much extra stuff. That's why I suggest using vhost in io threads > > > > > > > > which only cares about ram so the translation could be very fast. > > > > > > > > > > > > > > What does using "vhost in io threads" mean? > > > > > > > > > > > > It means a vhost userspace dataplane that is implemented via io threads. > > > > > > > > > > AFAIK this does not exist today. QEMU's built-in devices that use > > > > > IOThreads don't use vhost code. QEMU vhost code is for vhost kernel, > > > > > vhost-user, or vDPA but not built-in devices that use IOThreads. The > > > > > built-in devices implement VirtioDeviceClass callbacks directly and > > > > > use AioContext APIs to run in IOThreads. > > > > > > > > Yes. > > > > > > > > > > > > > > Do you have an idea for using vhost code for built-in devices? Maybe > > > > > it's fastest if you explain your idea and its advantages instead of me > > > > > guessing. 
> > > > > > > > It's something like I'd proposed in [1]: > > > > > > > > 1) a vhost that is implemented via IOThreads > > > > 2) memory translation is done via vhost memory table/IOTLB > > > > > > > > The advantages are: > > > > > > > > 1) No 3rd application like DPDK application > > > > 2) Attack surface were reduced > > > > 3) Better understanding/interactions with device model for things like > > > > RSS and IOMMU > > > > > > > > There could be some dis-advantages but it's not obvious to me :) > > > > > > Why is QEMU's native device emulation API not the natural choice for > > > writing built-in devices? I don't understand why the vhost interface > > > is desirable for built-in devices. > > > > Unless the memory helpers (like address translations) were optimized > > fully to satisfy this 10M+ PPS. > > > > Not sure if this is too hard, but last time I benchmark, perf told me > > most of the time spent in the translation. > > > > Using a vhost is a workaround since its memory model is much more > > simpler so it can skip lots of memory sections like I/O and ROM etc. > > I see, that sounds like a question of optimization. Most DMA transfers > will be to/from guest RAM and it seems like QEMU's memory API could be > optimized for that case. PIO/MMIO dispatch could use a different API > from DMA transfers, if necessary. Probably. > > I don't think there is a fundamental reason why QEMU's own device > emulation code cannot translate memory as fast as vhost devices can. Yes, it can do what vhost can do. Starting from a vhost may help us to know where we could go for the optimization of the memory core. Thanks > > Stefan >
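For reference, the buffer lifecycle described in the commit message — a shared umem carved into frames, with Tx, Rx, Fill and Completion rings moving frame addresses between QEMU and the kernel — can be sketched as a self-contained toy model. This is not the libxdp xsk_ring_* API; the single ring type and all names below are simplifications for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_DESC     8                  /* frames in the toy umem */
#define FRAME_SIZE 2048

/* Toy single-producer/single-consumer ring of umem frame addresses. */
typedef struct {
    uint64_t addr[N_DESC];
    unsigned head, tail;              /* tail: produce, head: consume */
} Ring;

static int ring_put(Ring *r, uint64_t a)
{
    if (r->tail - r->head == N_DESC) {
        return 0;                     /* full */
    }
    r->addr[r->tail++ % N_DESC] = a;
    return 1;
}

static int ring_get(Ring *r, uint64_t *a)
{
    if (r->tail == r->head) {
        return 0;                     /* empty */
    }
    *a = r->addr[r->head++ % N_DESC];
    return 1;
}

/* One socket: shared buffer, free-frame pool, and the four rings. */
typedef struct {
    uint64_t pool[N_DESC];            /* LIFO stack of free frame addrs */
    unsigned n_pool;
    Ring tx, cq;                      /* Tx side: Tx + Completion */
    Ring fq, rx;                      /* Rx side: Fill + Rx */
    uint8_t umem[N_DESC * FRAME_SIZE];
} Xsk;

static void xsk_init(Xsk *s)
{
    memset(s, 0, sizeof(*s));
    for (int i = N_DESC - 1; i >= 0; i--) {
        s->pool[s->n_pool++] = (uint64_t)i * FRAME_SIZE;
    }
}

/* Tx: take a free frame, copy the packet in, queue it on the Tx ring. */
static int xsk_tx(Xsk *s, const void *buf, size_t len)
{
    uint64_t a;

    if (!s->n_pool || len > FRAME_SIZE) {
        return 0;
    }
    a = s->pool[--s->n_pool];
    memcpy(s->umem + a, buf, len);
    return ring_put(&s->tx, a);
}

/* Completion: frames the "kernel" finished sending come back via cq. */
static void xsk_complete(Xsk *s)
{
    uint64_t a;

    while (ring_get(&s->cq, &a)) {
        s->pool[s->n_pool++] = a;
    }
}

/* Rx: hand free frames to the "kernel" through the Fill ring... */
static void xsk_fq_refill(Xsk *s, unsigned n)
{
    while (n-- && s->n_pool) {
        if (!ring_put(&s->fq, s->pool[--s->n_pool])) {
            s->n_pool++;              /* fq full, put the frame back */
            break;
        }
    }
}

/* ...and consume filled frames from the Rx ring, recycling them. */
static int xsk_rx(Xsk *s, void *buf, size_t len)
{
    uint64_t a;

    if (!ring_get(&s->rx, &a)) {
        return 0;
    }
    memcpy(buf, s->umem + a, len);
    s->pool[s->n_pool++] = a;
    return 1;
}
```

The patch below keeps free frames in a similar LIFO array (s->pool[--s->n_pool] to take a frame, s->pool[s->n_pool++] to return one); the model mirrors that handling, only with simplified rings in place of the real xsk ring accessors.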
diff --git a/MAINTAINERS b/MAINTAINERS index 7f323cd2eb..ca85422676 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2925,6 +2925,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/ S: Maintained F: net/netmap.c +AF_XDP network backend +R: Ilya Maximets <i.maximets@ovn.org> +F: net/af-xdp.c + Host Memory Backends M: David Hildenbrand <david@redhat.com> M: Igor Mammedov <imammedo@redhat.com> diff --git a/hmp-commands.hx b/hmp-commands.hx index 2cbd0f77a0..af9ffe4681 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -1295,7 +1295,7 @@ ERST { .name = "netdev_add", .args_type = "netdev:O", - .params = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user" + .params = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user" #ifdef CONFIG_VMNET "|vmnet-host|vmnet-shared|vmnet-bridged" #endif diff --git a/meson.build b/meson.build index 6ef78ea278..d0abb658c5 100644 --- a/meson.build +++ b/meson.build @@ -1883,6 +1883,18 @@ if libbpf.found() and not cc.links(''' endif endif +# libxdp +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config') +if libxdp.found() and \ + not (libbpf.found() and libbpf.version().version_compare('>=0.7')) + libxdp = not_found + if get_option('af_xdp').enabled() + error('af-xdp support requires libbpf version >= 0.7') + else + warning('af-xdp support requires libbpf version >= 0.7, disabling') + endif +endif + # libdw libdw = not_found if not get_option('libdw').auto() or \ @@ -2106,6 +2118,7 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars config_host_data.set('CONFIG_LIBATTR', have_old_libattr) config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found()) config_host_data.set('CONFIG_EBPF', libbpf.found()) +config_host_data.set('CONFIG_AF_XDP', libxdp.found()) config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found()) config_host_data.set('CONFIG_LIBISCSI', libiscsi.found()) config_host_data.set('CONFIG_LIBNFS', libnfs.found()) @@ -4279,6 +4292,7 @@ 
summary_info += {'PVRDMA support': have_pvrdma} summary_info += {'fdt support': fdt_opt == 'disabled' ? false : fdt_opt} summary_info += {'libcap-ng support': libcap_ng} summary_info += {'bpf support': libbpf} +summary_info += {'AF_XDP support': libxdp} summary_info += {'rbd support': rbd} summary_info += {'smartcard support': cacard} summary_info += {'U2F support': u2f} diff --git a/meson_options.txt b/meson_options.txt index 90237389e2..31596d59f1 100644 --- a/meson_options.txt +++ b/meson_options.txt @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto', option('keyring', type: 'feature', value: 'auto', description: 'Linux keyring support') +option('af_xdp', type : 'feature', value : 'auto', + description: 'AF_XDP network backend support') option('attr', type : 'feature', value : 'auto', description: 'attr/xattr support') option('auth_pam', type : 'feature', value : 'auto', diff --git a/net/af-xdp.c b/net/af-xdp.c new file mode 100644 index 0000000000..f78e7c9f96 --- /dev/null +++ b/net/af-xdp.c @@ -0,0 +1,501 @@ +/* + * AF_XDP network backend. + * + * Copyright (c) 2023 Red Hat, Inc. + * + * Authors: + * Ilya Maximets <i.maximets@ovn.org> + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ */ + + +#include "qemu/osdep.h" +#include <bpf/bpf.h> +#include <linux/if_link.h> +#include <linux/if_xdp.h> +#include <net/if.h> +#include <xdp/xsk.h> + +#include "clients.h" +#include "monitor/monitor.h" +#include "net/net.h" +#include "qapi/error.h" +#include "qemu/cutils.h" +#include "qemu/error-report.h" +#include "qemu/iov.h" +#include "qemu/main-loop.h" +#include "qemu/memalign.h" + + +typedef struct AFXDPState { + NetClientState nc; + + struct xsk_socket *xsk; + struct xsk_ring_cons rx; + struct xsk_ring_prod tx; + struct xsk_ring_cons cq; + struct xsk_ring_prod fq; + + char ifname[IFNAMSIZ]; + int ifindex; + bool read_poll; + bool write_poll; + uint32_t outstanding_tx; + + uint64_t *pool; + uint32_t n_pool; + char *buffer; + struct xsk_umem *umem; + + uint32_t n_queues; + uint32_t xdp_flags; + bool inhibit; +} AFXDPState; + +#define AF_XDP_BATCH_SIZE 64 + +static void af_xdp_send(void *opaque); +static void af_xdp_writable(void *opaque); + +/* Set the event-loop handlers for the af-xdp backend. */ +static void af_xdp_update_fd_handler(AFXDPState *s) +{ + qemu_set_fd_handler(xsk_socket__fd(s->xsk), + s->read_poll ? af_xdp_send : NULL, + s->write_poll ? af_xdp_writable : NULL, + s); +} + +/* Update the read handler. */ +static void af_xdp_read_poll(AFXDPState *s, bool enable) +{ + if (s->read_poll != enable) { + s->read_poll = enable; + af_xdp_update_fd_handler(s); + } +} + +/* Update the write handler. 
*/ +static void af_xdp_write_poll(AFXDPState *s, bool enable) +{ + if (s->write_poll != enable) { + s->write_poll = enable; + af_xdp_update_fd_handler(s); + } +} + +static void af_xdp_poll(NetClientState *nc, bool enable) +{ + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); + + if (s->read_poll != enable || s->write_poll != enable) { + s->write_poll = enable; + s->read_poll = enable; + af_xdp_update_fd_handler(s); + } +} + +static void af_xdp_complete_tx(AFXDPState *s) +{ + uint32_t idx = 0; + uint32_t done, i; + uint64_t *addr; + + done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx); + + for (i = 0; i < done; i++) { + addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++); + s->pool[s->n_pool++] = *addr; + s->outstanding_tx--; + } + + if (done) { + xsk_ring_cons__release(&s->cq, done); + } +} + +/* + * The fd_write() callback, invoked if the fd is marked as writable + * after a poll. + */ +static void af_xdp_writable(void *opaque) +{ + AFXDPState *s = opaque; + + /* Try to recover buffers that are already sent. */ + af_xdp_complete_tx(s); + + /* + * Unregister the handler, unless we still have packets to transmit + * and kernel needs a wake up. + */ + if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) { + af_xdp_write_poll(s, false); + } + + /* Flush any buffered packets. */ + qemu_flush_queued_packets(&s->nc); +} + +static ssize_t af_xdp_receive(NetClientState *nc, + const uint8_t *buf, size_t size) +{ + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); + struct xdp_desc *desc; + uint32_t idx; + void *data; + + /* Try to recover buffers that are already sent. */ + af_xdp_complete_tx(s); + + if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) { + /* We can't transmit packet this size... */ + return size; + } + + if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) { + /* + * Out of buffers or space in tx ring. Poll until we can write. + * This will also kick the Tx, if it was waiting on CQ. 
+ */ + af_xdp_write_poll(s, true); + return 0; + } + + desc = xsk_ring_prod__tx_desc(&s->tx, idx); + desc->addr = s->pool[--s->n_pool]; + desc->len = size; + + data = xsk_umem__get_data(s->buffer, desc->addr); + memcpy(data, buf, size); + + xsk_ring_prod__submit(&s->tx, 1); + s->outstanding_tx++; + + if (xsk_ring_prod__needs_wakeup(&s->tx)) { + af_xdp_write_poll(s, true); + } + + return size; +} + +/* + * Complete a previous send (backend --> guest) and enable the + * fd_read callback. + */ +static void af_xdp_send_completed(NetClientState *nc, ssize_t len) +{ + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); + + af_xdp_read_poll(s, true); +} + +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n) +{ + uint32_t i, idx = 0; + + /* Leave one packet for Tx, just in case. */ + if (s->n_pool < n + 1) { + n = s->n_pool; + } + + if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) { + return; + } + + for (i = 0; i < n; i++) { + *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool]; + } + xsk_ring_prod__submit(&s->fq, n); + + if (xsk_ring_prod__needs_wakeup(&s->fq)) { + /* Receive was blocked by not having enough buffers. Wake it up. */ + af_xdp_read_poll(s, true); + } +} + +static void af_xdp_send(void *opaque) +{ + uint32_t i, n_rx, idx = 0; + AFXDPState *s = opaque; + + n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx); + if (!n_rx) { + return; + } + + for (i = 0; i < n_rx; i++) { + const struct xdp_desc *desc; + struct iovec iov; + + desc = xsk_ring_cons__rx_desc(&s->rx, idx++); + + iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr); + iov.iov_len = desc->len; + + s->pool[s->n_pool++] = desc->addr; + + if (!qemu_sendv_packet_async(&s->nc, &iov, 1, + af_xdp_send_completed)) { + /* + * The peer does not receive anymore. Packet is queued, stop + * reading from the backend until af_xdp_send_completed(). + */ + af_xdp_read_poll(s, false); + + /* Re-peek the descriptors to not break the ring cache. 
*/ + xsk_ring_cons__cancel(&s->rx, n_rx); + n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx); + g_assert(n_rx == i + 1); + break; + } + } + + /* Release actually sent descriptors and try to re-fill. */ + xsk_ring_cons__release(&s->rx, n_rx); + af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE); +} + +/* Flush and close. */ +static void af_xdp_cleanup(NetClientState *nc) +{ + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); + + qemu_purge_queued_packets(nc); + + af_xdp_poll(nc, false); + + xsk_socket__delete(s->xsk); + s->xsk = NULL; + g_free(s->pool); + s->pool = NULL; + xsk_umem__delete(s->umem); + s->umem = NULL; + qemu_vfree(s->buffer); + s->buffer = NULL; + + /* Remove the program if it's the last open queue. */ + if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags + && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) { + fprintf(stderr, + "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n", + s->ifname, s->ifindex); + } +} + +static int af_xdp_umem_create(AFXDPState *s, Error **errp) +{ + struct xsk_umem_config config = { + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, + .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, + .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE, + .frame_headroom = 0, + }; + uint64_t n_descs; + uint64_t size; + int64_t i; + + /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */ + n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS + + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2; + size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE; + + s->buffer = qemu_memalign(qemu_real_host_page_size(), size); + memset(s->buffer, 0, size); + + if (xsk_umem__create(&s->umem, s->buffer, size, &s->fq, &s->cq, &config)) { + qemu_vfree(s->buffer); + error_setg_errno(errp, errno, + "failed to create umem for %s queue_index: %d", + s->ifname, s->nc.queue_index); + return -1; + } + + s->pool = g_new(uint64_t, n_descs); + /* Fill the pool in the opposite order, because it's a LIFO queue. 
*/ + for (i = n_descs - 1; i >= 0; i--) { + s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE; + } + s->n_pool = n_descs; + + af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS); + + return 0; +} + +static int af_xdp_socket_create(AFXDPState *s, + const NetdevAFXDPOptions *opts, + int xsks_map_fd, Error **errp) +{ + struct xsk_socket_config cfg = { + .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, + .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, + .libxdp_flags = 0, + .bind_flags = XDP_USE_NEED_WAKEUP, + .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST, + }; + int queue_id, error = 0; + + s->inhibit = opts->has_inhibit && opts->inhibit; + if (s->inhibit) { + cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD; + } + + if (opts->has_force_copy && opts->force_copy) { + cfg.bind_flags |= XDP_COPY; + } + + queue_id = s->nc.queue_index; + if (opts->has_start_queue && opts->start_queue > 0) { + queue_id += opts->start_queue; + } + + if (opts->has_mode) { + /* Specific mode requested. */ + cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE) + ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE; + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, + s->umem, &s->rx, &s->tx, &cfg)) { + error = errno; + } + } else { + /* No mode requested, try native first. */ + cfg.xdp_flags |= XDP_FLAGS_DRV_MODE; + + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, + s->umem, &s->rx, &s->tx, &cfg)) { + /* Can't use native mode, try skb. */ + cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE; + cfg.xdp_flags |= XDP_FLAGS_SKB_MODE; + + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, + s->umem, &s->rx, &s->tx, &cfg)) { + error = errno; + } + } + } + + if (error) { + error_setg_errno(errp, error, + "failed to create AF_XDP socket for %s queue_id: %d", + s->ifname, queue_id); + return -1; + } + + if (s->inhibit) { + int xsk_fd = xsk_socket__fd(s->xsk); + + /* Need to update the map manually, libxdp skipped that step. 
*/ + error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0); + if (error) { + error_setg_errno(errp, error, + "failed to update xsks map for %s queue_id: %d", + s->ifname, queue_id); + return -1; + } + } + + s->xdp_flags = cfg.xdp_flags; + + return 0; +} + +/* NetClientInfo methods. */ +static NetClientInfo net_af_xdp_info = { + .type = NET_CLIENT_DRIVER_AF_XDP, + .size = sizeof(AFXDPState), + .receive = af_xdp_receive, + .poll = af_xdp_poll, + .cleanup = af_xdp_cleanup, +}; + +/* + * The exported init function. + * + * ... -net af-xdp,ifname="..." + */ +int net_init_af_xdp(const Netdev *netdev, + const char *name, NetClientState *peer, Error **errp) +{ + const NetdevAFXDPOptions *opts = &netdev->u.af_xdp; + NetClientState *nc, *nc0 = NULL; + unsigned int ifindex; + uint32_t prog_id = 0; + int xsks_map_fd = -1; + int64_t i, queues; + Error *err = NULL; + AFXDPState *s; + + ifindex = if_nametoindex(opts->ifname); + if (!ifindex) { + error_setg_errno(errp, errno, "failed to get ifindex for '%s'", + opts->ifname); + return -1; + } + + queues = opts->has_queues ? 
opts->queues : 1; + if (queues < 1) { + error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'", + queues, opts->ifname); + return -1; + } + + if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) { + error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together"); + return -1; + } + + if (opts->xsks_map_fd) { + xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp); + if (xsks_map_fd < 0) { + return -1; + } + } + + for (i = 0; i < queues; i++) { + nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name); + qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname); + nc->queue_index = i; + + if (!nc0) { + nc0 = nc; + } + + s = DO_UPCAST(AFXDPState, nc, nc); + + pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname); + s->ifindex = ifindex; + s->n_queues = queues; + + if (af_xdp_umem_create(s, errp) + || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) { + /* Make sure the XDP program will be removed. */ + s->n_queues = i; + error_propagate(errp, err); + goto err; + } + } + + if (nc0) { + s = DO_UPCAST(AFXDPState, nc, nc0); + if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) { + error_setg_errno(errp, errno, + "no XDP program loaded on '%s', ifindex: %d", + s->ifname, s->ifindex); + goto err; + } + } + + af_xdp_read_poll(s, true); /* Initially only poll for reads. 
*/ + + return 0; + +err: + if (nc0) { + qemu_del_net_client(nc0); + } + + return -1; +} diff --git a/net/clients.h b/net/clients.h index ed8bdfff1e..be53794582 100644 --- a/net/clients.h +++ b/net/clients.h @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name, NetClientState *peer, Error **errp); #endif +#ifdef CONFIG_AF_XDP +int net_init_af_xdp(const Netdev *netdev, const char *name, + NetClientState *peer, Error **errp); +#endif + int net_init_vhost_user(const Netdev *netdev, const char *name, NetClientState *peer, Error **errp); diff --git a/net/meson.build b/net/meson.build index bdf564a57b..61628d4684 100644 --- a/net/meson.build +++ b/net/meson.build @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c')) if have_netmap system_ss.add(files('netmap.c')) endif + +system_ss.add(when: libxdp, if_true: files('af-xdp.c')) + if have_vhost_net_user system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c')) system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c')) diff --git a/net/net.c b/net/net.c index 6492ad530e..127f70932b 100644 --- a/net/net.c +++ b/net/net.c @@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])( #ifdef CONFIG_NETMAP [NET_CLIENT_DRIVER_NETMAP] = net_init_netmap, #endif +#ifdef CONFIG_AF_XDP + [NET_CLIENT_DRIVER_AF_XDP] = net_init_af_xdp, +#endif #ifdef CONFIG_NET_BRIDGE [NET_CLIENT_DRIVER_BRIDGE] = net_init_bridge, #endif @@ -1186,6 +1189,9 @@ void show_netdevs(void) #ifdef CONFIG_NETMAP "netmap", #endif +#ifdef CONFIG_AF_XDP + "af-xdp", +#endif #ifdef CONFIG_POSIX "vhost-user", #endif diff --git a/qapi/net.json b/qapi/net.json index db67501308..bb30a0d3c6 100644 --- a/qapi/net.json +++ b/qapi/net.json @@ -408,6 +408,56 @@ 'ifname': 'str', '*devname': 'str' } } +## +# @AFXDPMode: +# +# Attach mode for a default XDP program +# +# @skb: generic mode, no driver support necessary +# +# @native: DRV mode, program is 
attached to a driver, packets are passed to +# the socket without allocation of skb. +# +# Since: 8.1 +## +{ 'enum': 'AFXDPMode', + 'data': [ 'native', 'skb' ] } + +## +# @NetdevAFXDPOptions: +# +# AF_XDP network backend +# +# @ifname: The name of an existing network interface. +# +# @mode: Attach mode for a default XDP program. If not specified, then +# 'native' will be tried first, then 'skb'. +# +# @inhibit: Don't load a default XDP program, use one already loaded to +# the interface (default: false). Requires @xsks-map-fd. +# +# @xsks-map-fd: A file descriptor for an already open XDP socket map in +# the already loaded XDP program. Requires @inhibit. +# +# @force-copy: Force XDP copy mode even if device supports zero-copy. +# (default: false) +# +# @queues: number of queues to be used for multiqueue interfaces (default: 1). +# +# @start-queue: Use @queues starting from this queue number (default: 0). +# +# Since: 8.1 +## +{ 'struct': 'NetdevAFXDPOptions', + 'data': { + 'ifname': 'str', + '*mode': 'AFXDPMode', + '*inhibit': 'bool', + '*xsks-map-fd': 'str', + '*force-copy': 'bool', + '*queues': 'int', + '*start-queue': 'int' } } + ## # @NetdevVhostUserOptions: # @@ -642,13 +692,14 @@ # @vmnet-bridged: since 7.1 # @stream: since 7.2 # @dgram: since 7.2 +# @af-xdp: since 8.1 # # Since: 2.7 ## { 'enum': 'NetClientDriver', 'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream', 'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user', - 'vhost-vdpa', + 'vhost-vdpa', 'af-xdp', { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' }, { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' }, { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] } @@ -680,6 +731,7 @@ 'bridge': 'NetdevBridgeOptions', 'hubport': 'NetdevHubPortOptions', 'netmap': 'NetdevNetmapOptions', + 'af-xdp': 'NetdevAFXDPOptions', 'vhost-user': 'NetdevVhostUserOptions', 'vhost-vdpa': 'NetdevVhostVDPAOptions', 'vmnet-host': { 'type': 'NetdevVmnetHostOptions', diff --git a/qemu-options.hx b/qemu-options.hx 
index b57489d7ca..7d0844b2be 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2856,6 +2856,17 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev, " VALE port (created on the fly) called 'name' ('nmname' is name of the \n" " netmap device, defaults to '/dev/netmap')\n" #endif +#ifdef CONFIG_AF_XDP + "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n" + " [,inhibit=on|off][,xsks-map-fd=k][,queues=n][,start-queue=m]\n" + " attach to the existing network interface 'name' with AF_XDP socket\n" + " use 'mode=MODE' to specify an XDP program attach mode\n" + " use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n" + " use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n" + " use 'xsks-map-fd=k' to provide a file descriptor for xsks map with inhibit=on\n" + " use 'queues=n' to specify how many queues of a multiqueue interface should be used\n" + " use 'start-queue=m' to specify the first queue that should be used\n" +#endif #ifdef CONFIG_POSIX "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n" " configure a vhost-user network, backed by a chardev 'dev'\n" @@ -2901,6 +2912,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic, #ifdef CONFIG_NETMAP "netmap|" #endif +#ifdef CONFIG_AF_XDP + "af-xdp|" +#endif #ifdef CONFIG_POSIX "vhost-user|" #endif @@ -2929,6 +2943,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, #ifdef CONFIG_NETMAP "netmap|" #endif +#ifdef CONFIG_AF_XDP + "af-xdp|" +#endif #ifdef CONFIG_VMNET "vmnet-host|vmnet-shared|vmnet-bridged|" #endif @@ -2936,7 +2953,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, " old way to initialize a host network interface\n" " (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL) SRST -``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]`` +``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]`` This option is a shortcut for configuring both the on-board 
(default) guest NIC hardware and the host network backend in one go. The host backend options are the same as with the corresponding @@ -3350,6 +3367,48 @@ SRST # launch QEMU instance |qemu_system| linux.img -nic vde,sock=/tmp/myswitch +``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,inhibit=on|off][,xsks-map-fd=k][,queues=n][,start-queue=m]`` + Configure AF_XDP backend to connect to a network interface 'name' + using an AF_XDP socket. A specific attach mode for the default + XDP program can be forced with 'mode'; it defaults to best-effort, + where the likely most performant mode is used. Alternatively, the + load can be inhibited. In this case the XDP program should be pre-loaded + externally and 'xsks-map-fd' provided with a file descriptor for an + open XDP socket map of that program. The number of queues 'n' should + generally match the number of queues in the interface, defaults to 1. + Traffic arriving on non-configured device queues will not be delivered + to the network backend. + + .. parsed-literal:: + + # set number of queues to 4 + ethtool -L eth0 combined 4 + # launch QEMU instance + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ + -netdev af-xdp,id=n1,ifname=eth0,queues=4 + + The 'start-queue' option can be specified if a particular range of queues + [m, m + n) should be in use. For example, this is necessary in order + to use MLX NICs in native mode. The driver will create a separate set + of queues on top of regular ones, and only these queues can be used + for AF_XDP sockets. MLX NICs will also require an additional traffic + redirection with ethtool to these queues. E.g.: + + .. parsed-literal:: + + # set number of queues to 1 + ethtool -L eth0 combined 1 + # redirect all the traffic to the second queue (id: 1) + # note: mlx5 driver requires non-empty key/mask pair. 
+        ethtool -N eth0 flow-type ether \\
+            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
+        ethtool -N eth0 flow-type ether \\
+            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
+        # launch QEMU instance
+        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
+            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
+
+
 ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
     Establish a vhost-user netdev, backed by a chardev id. The chardev
     should be a unix domain socket backed one. The vhost-user uses a
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index d02b09a4b9..7585c4c4ed 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -35,6 +35,7 @@
 --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
 --with-coroutine=ucontext \
 --tls-priority=@QEMU,SYSTEM \
+--disable-af-xdp \
 --disable-attr \
 --disable-auth-pam \
 --disable-avx2 \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 5714fd93d9..e1490fd4fe 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -75,6 +75,7 @@ meson_options_help() {
   printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
   printf "%s\n" '(unless built with --without-default-features):'
   printf "%s\n" ''
+  printf "%s\n" '  af-xdp          AF_XDP network backend support'
   printf "%s\n" '  alsa            ALSA sound support'
   printf "%s\n" '  attr            attr/xattr support'
   printf "%s\n" '  auth-pam        PAM access control'
@@ -208,6 +209,8 @@ meson_options_help() {
 }
 _meson_option_parse() {
   case $1 in
+    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
+    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
     --enable-alsa) printf "%s" -Dalsa=enabled ;;
     --disable-alsa) printf "%s" -Dalsa=disabled ;;
     --enable-attr) printf "%s" -Dattr=enabled ;;
diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
index e39871c7bb..207f7adfb9 100644
--- a/tests/docker/dockerfiles/debian-amd64.docker
+++ b/tests/docker/dockerfiles/debian-amd64.docker
@@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
       libvirglrenderer-dev \
       libvte-2.91-dev \
       libxen-dev \
+      libxdp-dev \
       libzstd-dev \
       llvm \
       locales \
AF_XDP is a network socket family that allows communication directly
with the network device driver in the kernel, bypassing most or all
of the kernel networking stack.  In essence, the technology is
pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
and works with any network interface without driver modifications.
Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
require access to character devices or unix sockets.  Only access to
the network interface itself is necessary.

This patch implements a network backend that communicates with the
kernel by creating an AF_XDP socket.  A chunk of userspace memory
is shared between QEMU and the host kernel.  Four ring buffers (Tx,
Rx, Fill and Completion) are placed in that memory along with a pool
of memory buffers for the packet data.  Data transmission is done by
allocating one of the buffers, copying packet data into it and
placing the pointer into the Tx ring.  After transmission, the device
returns the buffer via the Completion ring.  On Rx, the device takes
a buffer from the pre-populated Fill ring, writes the packet data
into it and places the buffer into the Rx ring.

The AF_XDP network backend takes on the communication with the host
kernel and the network interface and forwards packets to/from the
peer device in QEMU.

Usage example:

  -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1

An XDP program bridges the socket with a network interface.  It can
be attached to the interface in two different modes:

1. skb - this mode should work for any interface and doesn't require
   driver support, with the caveat of lower performance.

2. native - this mode requires support from the driver and allows
   bypassing skb allocation in the kernel and potentially using
   zero-copy while getting packets in and out of userspace.

By default, QEMU will try to use native mode and fall back to skb.
The mode can be forced via the 'mode' option.
To force copy mode even with a native attachment, use the
'force-copy=on' option.  This might be useful if there is some issue
with the driver.

The 'queues=N' option allows specifying how many device queues should
be open.  Note that all the queues that are not open are still
functional and can receive traffic, but it will not be delivered to
QEMU.  So, the number of device queues should generally match the
QEMU configuration, unless the device is shared with something else
and traffic redirection to the appropriate queues is correctly
configured on the device level (e.g. with ethtool -N).

The 'start-queue=M' option can be used to specify from which queue id
QEMU should start configuring 'N' queues.  It might also be necessary
to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
for examples.

In the general case, QEMU will need the CAP_NET_ADMIN and
CAP_SYS_ADMIN capabilities in order to load default XSK/XDP programs
to the network interface and configure BTF maps.  It is possible,
however, to run only with CAP_NET_RAW.  For that to work, an external
process with admin capabilities will need to pre-load the default XSK
program and pass an open file descriptor for this program's
'xsks_map' to the QEMU process on startup.  The network backend will
need to be configured with 'inhibit=on' to avoid loading of the
programs.  The file descriptor for 'xsks_map' can be passed via the
'xsks-map-fd=N' option.

There are a few performance challenges with the current network
backends.  First, they do not support IO threads.  This means that
the data path is handled by the main thread in QEMU and may slow down
other work, or be slowed down by other work.  This also means that
taking advantage of multi-queue is generally not possible today.

Another issue is that the data path goes through the device emulation
code, which is not really optimized for performance.  The fastest
"frontend" device is virtio-net.
But it's not optimized for heavy traffic either, because it expects
such use cases to be handled via some implementation of vhost (user,
kernel, vdpa).  In practice, we have virtio notifications and RCU
lock/unlock on a per-packet basis and not very efficient accesses to
the guest memory.  Communication channels between backend and
frontend devices do not allow passing more than one packet at a time
either.

Some of these challenges can be avoided in the future by adding
better batching into device emulation or by implementing a
vhost-af-xdp variant.

There are also a few kernel limitations.  AF_XDP sockets do not
support any kind of checksum or segmentation offloading.  Buffers are
limited to a page size (4K), i.e. the MTU is limited.  Multi-buffer
support is not implemented for AF_XDP today.  Also, transmission in
all non-zero-copy modes is synchronous, i.e. done in a syscall.  That
doesn't allow high packet rates on virtual interfaces.

However, keeping in mind all of these challenges, the current
implementation of the AF_XDP backend shows decent performance while
running on top of a physical NIC with zero-copy support.

Test setup:

2 VMs running on 2 physical hosts connected via a ConnectX6-Dx card.
The network backend is configured to open the NIC directly in native
mode.  The driver supports zero-copy.  The NIC is configured to use a
single queue.  Inside a VM - iperf3 for basic TCP performance testing
and dpdk-testpmd for PPS testing.
iperf3 result:
  TCP stream      : 19.1 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
  Tx only         : 3.4 Mpps
  Rx only         : 2.0 Mpps
  L2 FWD Loopback : 1.5 Mpps

In skb mode the same setup shows much lower performance, similar to
the setup where the pair of physical NICs is replaced with a veth
pair:

iperf3 result:
  TCP stream      : 9 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
  Tx only         : 1.2 Mpps
  Rx only         : 1.0 Mpps
  L2 FWD Loopback : 0.7 Mpps

Results in skb mode or over the veth pair are close to the results of
a tap backend with vhost=on and disabled segmentation offloading
bridged with a NIC.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 MAINTAINERS                                  |   4 +
 hmp-commands.hx                              |   2 +-
 meson.build                                  |  14 +
 meson_options.txt                            |   2 +
 net/af-xdp.c                                 | 501 ++++++++++++++++++
 net/clients.h                                |   5 +
 net/meson.build                              |   3 +
 net/net.c                                    |   6 +
 qapi/net.json                                |  54 +-
 qemu-options.hx                              |  61 ++-
 .../ci/org.centos/stream/8/x86_64/configure  |   1 +
 scripts/meson-buildoptions.sh                |   3 +
 tests/docker/dockerfiles/debian-amd64.docker |   1 +
 13 files changed, 654 insertions(+), 3 deletions(-)
 create mode 100644 net/af-xdp.c