netmap — a framework for fast packet I/O
device netmap
netmap is a framework for extremely fast and
efficient packet I/O for userspace and kernel clients, and for Virtual
Machines. It runs on
FreeBSD, Linux, and some versions
of Windows, and supports a variety of
netmap ports, including:
- physical NIC ports, to access individual queues of network interfaces;
- host ports, to inject packets into the host stack;
- VALE ports, implementing a very fast and modular in-kernel software
switch/dataplane;
- netmap pipes, a shared memory packet transport channel;
- netmap monitors, a mechanism similar to bpf(4) to capture traffic.
All these
netmap ports are accessed interchangeably
with the same API, and are at least one order of magnitude faster than
standard OS mechanisms (sockets, bpf, tun/tap interfaces, native switches,
pipes). With suitably fast hardware (NICs, PCIe buses, CPUs), packet I/O using
netmap on supported NICs reaches 14.88 million
packets per second (Mpps) with much less than one core on 10 Gbit/s NICs;
35-40 Mpps on 40 Gbit/s NICs (limited by the hardware); about 20 Mpps per core
for VALE ports; and over 100 Mpps for
netmap
pipes. NICs without native
netmap support can
still use the API in emulated mode, which uses unmodified device drivers and
is 3-5 times faster than
bpf(4) or raw sockets.
Userspace clients can dynamically switch NICs into
netmap mode and send and receive raw packets
through memory mapped buffers. Similarly,
VALE
switch instances and ports,
netmap pipes and
netmap monitors can be created dynamically,
providing high speed packet I/O between processes, virtual machines, NICs and
the host stack.
netmap supports both non-blocking I/O through
ioctl(2) calls, and synchronization and blocking I/O
through a file descriptor and standard OS mechanisms such as
select(2),
poll(2),
kqueue(2) and
epoll(7). All types of
netmap ports and the
VALE switch are implemented by a single kernel
module, which also emulates the
netmap API over
standard drivers. For best performance,
netmap
requires native support in device drivers. A list of such devices is at the
end of this document.
In the rest of this (long) manual page we document various aspects of the
netmap and
VALE
architecture, features and usage.
netmap supports raw packet I/O through a
port, which can be connected to a physical
interface (
NIC), to the host stack, or to a
VALE switch. Ports use preallocated circular
queues of buffers (
rings) residing in an mmapped
region. There is one ring for each transmit/receive queue of a NIC or virtual
port. An additional ring pair connects to the host stack.
After binding a file descriptor to a port, a
netmap
client can send or receive packets in batches through the rings, and possibly
implement zero-copy forwarding between ports.
All NICs operating in
netmap mode use the same
memory region, accessible to all processes that own
/dev/netmap file descriptors bound to NICs.
Independent
VALE and
netmap pipe ports by default use separate memory
regions, but can be independently configured to share memory.
The following section describes the system calls to create and control
netmap ports (including
VALE and
netmap pipe
ports). Simpler, higher level functions are described in the
LIBRARIES section.
Ports and rings are created and controlled through a file descriptor, created by
opening a special device
fd = open("/dev/netmap", O_RDWR);
and then bound to a specific port with an
ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
netmap has multiple modes of operation controlled
by the
struct nmreq argument.
arg.nr_name specifies the netmap port name,
as follows:
- OS network interface name (e.g., 'em0', 'eth1', ...)
- the data path of the NIC is disconnected from the host stack, and the file
descriptor is bound to the NIC (one or all queues), or to the host stack;
- valeSSS:PPP
- the file descriptor is bound to port PPP of VALE switch
SSS. Switch instances and ports are dynamically created if necessary.
Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string cannot exceed
IFNAMSIZ characters, and PPP cannot be the name of any existing OS network
interface.
On return,
arg indicates the size of the shared
memory region, and the number, size and location of all the
netmap data structures, which can be accessed by
mmapping the memory
char *mem = mmap(NULL, arg.nr_memsize, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
Non-blocking I/O is done with special
ioctl(2) calls;
select(2) and
poll(2) on the file descriptor permit blocking
I/O.
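Putting these calls together, a minimal sketch of bringing a port into netmap mode could look as follows (no error handling; bind_port() is a hypothetical helper, not part of the netmap API):

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <net/if.h>
#include <net/netmap_user.h>

/* open /dev/netmap, bind it to the named port and map the shared region */
static struct netmap_if *
bind_port(const char *name, int *pfd)
{
    struct nmreq req;
    void *mem;
    int fd = open("/dev/netmap", O_RDWR);

    memset(&req, 0, sizeof(req));
    strncpy(req.nr_name, name, sizeof(req.nr_name) - 1);
    req.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &req);       /* the port enters netmap mode here */
    mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);           /* map the shared memory region */
    *pfd = fd;
    return NETMAP_IF(mem, req.nr_offset);
}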
While a NIC is in
netmap mode, the OS will still
believe the interface is up and running. OS-generated packets for that NIC end
up into a
netmap ring, and another ring is used
to send packets into the OS network stack. A
close(2) on the file descriptor removes the
binding, and returns the NIC to normal mode (reconnecting the data path to the
host stack), or destroys the virtual port.
The data structures in the mmapped memory region are detailed in
<sys/net/netmap.h>,
which is the ultimate reference for the
netmap
API. The main structures and fields are indicated below:
- struct netmap_if (one per interface)
- Indicates the number of available rings (struct netmap_rings) and their
position in the mmapped region. The number of tx and rx rings
(ni_tx_rings, ni_rx_rings) normally depends on the hardware. NICs also
have an extra tx/rx ring pair connected to the host stack.
NIOCREGIF can also request additional unbound buffers in the same memory
space, to be used as temporary storage for packets. The number of extra
buffers is specified in the arg.nr_arg3 field. On success, the kernel
writes back to arg.nr_arg3 the number of extra buffers actually allocated
(this may be less than the amount requested if the memory space ran out of
buffers). ni_bufs_head contains the index of the first of these extra
buffers, which are connected in a list (the first uint32_t of each buffer
being the index of the next buffer in the list). A 0 indicates the end of
the list. The application is free to modify this list and use the buffers
(i.e., binding them to the slots of a netmap ring); a minimal sketch of
walking the list is shown after this list. When closing the netmap file
descriptor, the kernel frees the buffers contained in the list pointed to
by ni_bufs_head, irrespective of whether they are the buffers originally
provided by the kernel on NIOCREGIF.
- struct netmap_ring (one per ring)
- Implements transmit and receive rings, with read/write pointers, metadata
and an array of slots describing the buffers.
- struct netmap_slot (one per buffer)
- Describes a packet buffer, which normally is identified by an index and
resides in the mmapped region.
- packet buffers
- Fixed size (normally 2 KB) packet buffers allocated by the kernel.
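As referenced above, a minimal sketch of walking the extra-buffer list, assuming a descriptor already bound with arg.nr_arg3 > 0 and the region mapped at mem as in the USAGE section:

struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
struct netmap_ring *any = NETMAP_TXRING(nifp, 0);  /* only used by NETMAP_BUF() */
uint32_t scan = nifp->ni_bufs_head;   /* index of the first extra buffer, 0 if none */

while (scan != 0) {
    char *buf = NETMAP_BUF(any, scan);
    /* ... use buf as temporary packet storage ... */
    scan = *(uint32_t *)buf;          /* first uint32_t links to the next buffer */
}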
The offset of the
struct netmap_if in the mmapped
region is indicated by the
nr_offset field in the
structure returned by
NIOCREGIF. From
there, all other objects are reachable through relative references (offsets or
indexes). Macros and functions in
<net/netmap_user.h>
help convert them into actual pointers:
struct netmap_if *nifp = NETMAP_IF(mem,
arg.nr_offset);
struct netmap_ring *txr = NETMAP_TXRING(nifp,
ring_index);
struct netmap_ring *rxr = NETMAP_RXRING(nifp,
ring_index);
char *buf = NETMAP_BUF(ring,
buffer_index);
Rings are circular queues of packets with three
indexes/pointers (
head,
cur,
tail);
one slot is always kept empty. The ring size
(
num_slots) should not be assumed to be a
power of two.
head is the first slot available to userspace;
cur is the wakeup point: select/poll will
unblock when
tail passes
cur;
tail is the first slot reserved to the kernel.
Slot indexes
must only move forward; for
convenience, the function
nm_ring_next(ring, index)
returns the next index modulo the ring size.
head and
cur
are only modified by the user program;
tail
is only modified by the kernel. The kernel only reads/writes the
struct netmap_ring slots and buffers during
the execution of a netmap-related system call. The only exceptions are slots
(and buffers) in the range
tail ...
head-1, that are explicitly assigned to the
kernel.
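The number of slots currently available to userspace follows directly from these pointers; the sketch below is equivalent to the nm_ring_space() helper provided by <net/netmap_user.h>:

/* slots owned by userspace, i.e., those in the range head .. tail-1 */
static inline uint32_t
ring_space(struct netmap_ring *ring)
{
    int n = ring->tail - ring->head;

    if (n < 0)                   /* indexes wrap around the ring */
        n += ring->num_slots;    /* num_slots need not be a power of two */
    return (uint32_t)n;
}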
On transmit rings, after a
netmap system call,
slots in the range
head ...
tail-1 are available for transmission. User
code should fill the slots sequentially and advance
head and
cur
past slots ready to transmit.
cur may be
moved further ahead if the user code needs more slots before further
transmissions (see
SCATTER
GATHER I/O).
At the next NIOCTXSYNC/select()/poll(), slots up to
head-1 are pushed to the port, and
tail may advance if further slots have become
available. Below is an example of the evolution of a TX ring:
after the syscall, slots between cur and tail are (a)vailable

     head=cur     tail
         |          |
         v          v
TX [.....aaaaaaaaaaa.............]

user creates new packets to (T)ransmit

       head=cur   tail
              |     |
              v     v
TX [.....TTTTTaaaaaa.............]

NIOCTXSYNC/poll()/select() sends packets and reports new slots

          head=cur     tail
              |          |
              v          v
TX [..........aaaaaaaaaaa........]
select() and
poll()
will block if there is no space in the ring, i.e.,
ring->cur == ring->tail
and return when new slots have become available.
High speed applications may want to amortize the cost of system calls by
preparing as many packets as possible before issuing them.
A transmit ring with pending transmissions has
ring->head != ring->tail + 1 (modulo the
ring size).
The function
int nm_tx_pending(ring) implements
this test.
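As a sketch of the batching just described, a sender can fill every available slot and then issue a single system call; it assumes the nifp and fd variables from the USAGE section, and build_packet() is a hypothetical helper that writes one frame into buf and returns its length:

struct netmap_ring *txr = NETMAP_TXRING(nifp, 0);
uint32_t head = txr->head;

while (head != txr->tail) {               /* slots head .. tail-1 are free */
    struct netmap_slot *slot = &txr->slot[head];
    char *buf = NETMAP_BUF(txr, slot->buf_idx);

    slot->len = build_packet(buf, txr->nr_buf_size);  /* hypothetical helper */
    head = nm_ring_next(txr, head);
}
txr->head = txr->cur = head;              /* hand the new packets to the kernel */
ioctl(fd, NIOCTXSYNC, NULL);              /* one syscall for the whole batch */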
On receive rings, after a
netmap system call, the
slots in the range
head...
tail-1 contain received packets. User code
should process them and advance
head and
cur past slots it wants to return to the
kernel.
cur may be moved further ahead if the
user code wants to wait for more packets without returning all the previous
slots to the kernel.
At the next NIOCRXSYNC/select()/poll(), slots up to
head-1 are returned to the kernel for further
receives, and
tail may advance to report new
incoming packets.
Below is an example of the evolution of an RX ring:
after the syscall, there are some (h)eld and some (R)eceived slots

    head   cur    tail
      |     |       |
      v     v       v
RX [..hhhhhhRRRRRRRR..........]

user advances head and cur, releasing some slots and holding others

        head cur  tail
           |  |     |
           v  v     v
RX [..*****hhhRRRRRR...........]

NIOCRXSYNC/poll()/select() recovers slots and reports new packets

        head cur        tail
           |  |           |
           v  v           v
RX [.......hhhRRRRRRRRRRRR....]
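A corresponding receive-side sketch, draining every packet reported by the last system call before returning the slots (again assuming nifp and fd from the USAGE section; handle_packet() is a hypothetical consumer):

struct netmap_ring *rxr = NETMAP_RXRING(nifp, 0);
uint32_t head = rxr->head;

while (head != rxr->tail) {               /* slots head .. tail-1 hold packets */
    struct netmap_slot *slot = &rxr->slot[head];

    handle_packet(NETMAP_BUF(rxr, slot->buf_idx), slot->len);
    head = nm_ring_next(rxr, head);
}
rxr->head = rxr->cur = head;              /* return the slots to the kernel */
ioctl(fd, NIOCRXSYNC, NULL);              /* ask for newly received packets */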
Normally, packets should be stored in the netmap-allocated buffers assigned to
slots when ports are bound to a file descriptor. One packet is fully contained
in a single buffer.
The following flags affect slot and buffer processing:
- NS_BUF_CHANGED
- must be used when the buf_idx in the slot is changed. This can be used to
implement zero-copy forwarding, see ZERO-COPY FORWARDING.
- NS_REPORT
- reports when this buffer has been transmitted. Normally,
netmap notifies transmit completions in
batches, hence signals can be delayed indefinitely. This flag helps detect
when packets have been sent and a file descriptor can be closed.
- NS_FORWARD
- When a ring is in 'transparent' mode, packets marked with
this flag by the user application are forwarded to the other endpoint at
the next system call, thus restoring (in a selective way) the connection
between a NIC and the host stack.
- NS_NO_LEARN
- tells the forwarding code that the source MAC address for
this packet must not be used in the learning bridge code.
- NS_INDIRECT
- indicates that the packet's payload is in a user-supplied
buffer whose user virtual address is in the 'ptr' field of the slot. The
size can reach 65535 bytes.
This is only supported on the transmit ring of
VALE ports, and it helps reduce data copies
in the interconnection of virtual machines.
- NS_MOREFRAG
- indicates that the packet continues with subsequent
buffers; the last buffer in a packet must have the flag clear.
Packets can span multiple slots if the
NS_MOREFRAG flag is set in all but the last
slot. The maximum length of a chain is 64 buffers. This is normally used with
VALE ports when connecting virtual machines, as
they generate large TSO segments that are not split unless they reach a
physical device.
NOTE: The length field always refers to the individual fragment; no field
holds the total length of the packet.
On receive rings the macro
NS_RFRAGS(slot)
indicates the remaining number of slots for this packet, including the current
one. Slots with a value greater than 1 also have NS_MOREFRAG set.
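As a sketch of handling multi-slot packets on a receive ring (pkt_total_len() is a hypothetical helper, not part of the netmap API):

/* total length of the packet whose first slot is at index i */
static uint32_t
pkt_total_len(struct netmap_ring *ring, uint32_t i)
{
    uint32_t left = NS_RFRAGS(&ring->slot[i]);  /* slots left, including this one */
    uint32_t len = 0;

    if (left == 0)          /* assume single-fragment packets may report 0 */
        left = 1;
    while (left-- > 0) {
        len += ring->slot[i].len;   /* len covers one fragment only */
        i = nm_ring_next(ring, i);
    }
    return len;
}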
netmap uses two ioctls (NIOCTXSYNC, NIOCRXSYNC) for
non-blocking I/O. They take no argument. Two more ioctls (NIOCGINFO,
NIOCREGIF) are used to query and configure ports, with the following argument:
struct nmreq {
char nr_name[IFNAMSIZ]; /* (i) port name */
uint32_t nr_version; /* (i) API version */
uint32_t nr_offset; /* (o) nifp offset in mmap region */
uint32_t nr_memsize; /* (o) size of the mmap region */
uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
uint16_t nr_tx_rings; /* (i/o) number of tx rings */
uint16_t nr_rx_rings; /* (i/o) number of rx rings */
uint16_t nr_ringid; /* (i/o) ring(s) we care about */
uint16_t nr_cmd; /* (i) special command */
uint16_t nr_arg1; /* (i/o) extra arguments */
uint16_t nr_arg2; /* (i/o) extra arguments */
uint32_t nr_arg3; /* (i/o) extra arguments */
uint32_t nr_flags;   /* (i/o) open mode */
...
};
A file descriptor obtained through
/dev/netmap also
supports the ioctls supported by network devices, see
netintro(4).
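A minimal sketch of querying a port with NIOCGINFO (headers and error handling omitted; the port name "em0" is only an example):

struct nmreq req;
int fd = open("/dev/netmap", O_RDWR);

memset(&req, 0, sizeof(req));
strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
req.nr_version = NETMAP_API;
ioctl(fd, NIOCGINFO, &req);   /* query only: the port is not switched to netmap mode */
printf("%u/%u tx/rx rings, %u/%u slots, %u bytes of shared memory\n",
    req.nr_tx_rings, req.nr_rx_rings, req.nr_tx_slots, req.nr_rx_slots,
    req.nr_memsize);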
NIOCGINFO
- returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information about the port. Note
that all the information below can change before the interface is actually
put in netmap mode.
- nr_memsize
- indicates the size of the
netmap memory region. NICs in
netmap mode all share the same memory
region, whereas VALE ports have
independent regions for each port.
- nr_tx_slots, nr_rx_slots
- indicate the size of transmit and receive rings.
- nr_tx_rings, nr_rx_rings
- indicate the number of transmit and receive rings. Both
ring number and sizes may be configured at runtime using
interface-specific functions (e.g.,
ethtool(8) ).
NIOCREGIF
- binds the port named in
nr_name to the file descriptor. For a
physical device this also switches it into
netmap mode, disconnecting it from the host
stack. Multiple file descriptors can be bound to the same port, with
proper synchronization left to the user.
The recommended way to bind a file descriptor to a port is to use function
nm_open(..) (see
LIBRARIES) which parses
names to access specific port types and enable features. In the following
we document the main features.
NIOCREGIF can also bind a file descriptor to one
endpoint of a
netmap pipe, consisting
of two netmap ports with a crossover connection. A netmap pipe shares the
same memory space as the parent port, and is meant to enable configurations
where a master process acts as a dispatcher towards slave processes.
To enable this function, the nr_arg1 field of
the structure can be used as a hint to the kernel, indicating how many
pipes we expect to use, so that extra space is reserved in the memory region.
On return, it gives the same info as NIOCGINFO, with
nr_ringid and
nr_flags indicating the identity of the rings
controlled through the file descriptor.
The nr_flags and nr_ringid fields select which rings are
controlled through this file descriptor. Possible values of
nr_flags are indicated below, together with
the naming schemes that application libraries (such as the
nm_open indicated below) can use to indicate
the specific set of rings; a short registration sketch is shown at the end
of this section. In the examples below, "netmap:foo" is
any valid netmap port name.
- NR_REG_ALL_NIC
netmap:foo
- (default) all hardware ring pairs
- NR_REG_SW
netmap:foo^
- the ``host rings'', connecting to the host stack.
- NR_REG_NIC_SW
netmap:foo+
- all hardware rings and the host rings
- NR_REG_ONE_NIC
netmap:foo-i
- only the i-th hardware ring pair, where the number is
in nr_ringid;
- NR_REG_PIPE_MASTER
netmap:foo{i
- the master side of the netmap pipe whose identifier (i)
is in nr_ringid;
- NR_REG_PIPE_SLAVE
netmap:foo}i
- the slave side of the netmap pipe whose identifier (i)
is in nr_ringid.
The identifier of a pipe must be thought as part of the pipe name, and
does not need to be sequential. On return the pipe will only have a
single ring pair with index 0, irrespective of the value of
i.
By default, a poll(2) or
select(2) call pushes out any pending packets
on the transmit ring, even if no write events are specified. The feature
can be disabled by or-ing
NETMAP_NO_TX_POLL to the value written to
nr_ringid. When this flag is used,
packets are transmitted only on
ioctl(NIOCTXSYNC), or when
select() / poll() are called with a write event
(POLLOUT/wfdset) or with a full ring.
When registering a virtual interface that is dynamically created on a
VALE switch, we can specify the desired
number of rings (1 by default, and currently up to 16) using the
nr_tx_rings and nr_rx_rings fields.
NIOCTXSYNC
- tells the hardware of new packets to transmit, and updates
the number of slots available for transmission.
NIOCRXSYNC
- tells the hardware of consumed packets, and asks for newly
available packets.
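As referenced above, a short registration sketch binding the file descriptor to a single hardware ring pair while disabling implicit transmissions on poll() (the port name and ring index are only examples; error handling omitted):

struct nmreq req;
int fd = open("/dev/netmap", O_RDWR);

memset(&req, 0, sizeof(req));
strncpy(req.nr_name, "em0", sizeof(req.nr_name) - 1);
req.nr_version = NETMAP_API;
req.nr_flags = NR_REG_ONE_NIC;              /* only one hardware ring pair */
req.nr_ringid = 2 | NETMAP_NO_TX_POLL;      /* ring pair 2; no TX on generic poll() */
ioctl(fd, NIOCREGIF, &req);

With the library helpers described in the LIBRARIES section, the same request can be expressed as nm_open("netmap:em0-2", NULL, NETMAP_NO_TX_POLL, NULL).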
select(2) and
poll(2)
on a
netmap file descriptor process rings as
indicated in
TRANSMIT
RINGS and
RECEIVE
RINGS, respectively when write (POLLOUT) and read (POLLIN) events are
requested. Both block if no slots are available in the ring
(
ring->cur == ring->tail). Depending on
the platform,
epoll(7) and
kqueue(2) are supported too.
Packets in transmit rings are normally pushed out (and buffers reclaimed) even
without requesting write events. Passing the
NETMAP_NO_TX_POLL
flag to
NIOCREGIF disables this feature. By default,
receive rings are processed only if read events are requested. Passing the
NETMAP_DO_RX_POLL
flag to
NIOCREGIF updates receive rings even without read
events. Note that on
epoll(7) and
kqueue(2),
NETMAP_NO_TX_POLL
and
NETMAP_DO_RX_POLL
only have an effect when
some event is posted for the file descriptor.
The
netmap API is supposed to be used directly,
both because of its simplicity and for efficient integration with
applications.
For convenience, the
<net/netmap_user.h>
header provides a few macros and functions to ease creating a file descriptor
and doing I/O with a
netmap port. These are
loosely modeled after the
pcap(3) API, to ease
porting of libpcap-based applications to
netmap.
To use these extra functions, programs should
#define NETMAP_WITH_LIBS
before
#include <net/netmap_user.h>
The following functions are available:
- struct nm_desc * nm_open(const char *ifname, const struct nmreq *req,
uint64_t flags, const struct nm_desc *arg)
- similar to pcap_open_live(3),
binds a file descriptor to a port.
- ifname
- is a port name, in the form "netmap:PPP" for
a NIC and "valeSSS:PPP" for a
VALE port.
- req
- provides the initial values for the argument to the
NIOCREGIF ioctl. The nm_flags and nm_ringid values are overwritten by
parsing ifname and flags, and other fields can be overridden through
the other two arguments.
- arg
- points to a struct nm_desc containing arguments (e.g.,
from a previously open file descriptor) that should override the
defaults. The fields are used as described below.
- flags
- can be set to a combination of the following flags:
NETMAP_NO_TX_POLL,
NETMAP_DO_RX_POLL (copied into
nr_ringid); NM_OPEN_NO_MMAP (if arg
points to the same memory region, avoids the mmap and uses the values
from it); NM_OPEN_IFNAME (ignores
ifname and uses the values in arg);
NM_OPEN_ARG1,
NM_OPEN_ARG2,
NM_OPEN_ARG3 (uses the fields from
arg); NM_OPEN_RING_CFG (uses the ring
number and sizes from arg).
- int nm_close(struct nm_desc *d)
- closes the file descriptor, unmaps memory, frees resources.
- int nm_inject(struct nm_desc *d, const void *buf, size_t size)
- similar to pcap_inject(), pushes a packet to a ring, returns the size of the
packet if successful, or 0 on error;
- int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
- similar to pcap_dispatch(), applies a callback to incoming packets;
- u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
- similar to pcap_next(), fetches the next packet.
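As a small usage sketch of these helpers, a sender counterpart of the receiver shown in the EXAMPLES section (the port name and the frame contents are illustrative):

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <sys/ioctl.h>

void tiny_sender(void)
{
    struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
    struct pollfd fds = { .fd = NETMAP_FD(d), .events = POLLOUT };
    char frame[60] = { 0 };         /* ... fill in a valid Ethernet frame ... */
    int i;

    for (i = 0; i < 1000; i++) {
        while (nm_inject(d, frame, sizeof(frame)) == 0)
            poll(&fds, 1, -1);      /* ring full: wait for space, then retry */
    }
    ioctl(NETMAP_FD(d), NIOCTXSYNC, NULL);  /* push out any pending packets */
    nm_close(d);
}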
netmap natively supports the following devices:
On
FreeBSD:
cxgbe(4),
em(4),
iflib(4)
(providing igb, em and lem),
ixgbe(4),
ixl(4),
re(4),
vtnet(4).
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
NICs without native support can still be used in
netmap mode through emulation. Performance is
inferior to native netmap mode but still significantly higher than various raw
socket types (bpf, PF_PACKET, etc.). Note that for slow devices (such as 1
Gbit/s and slower NICs, or several 10 Gbit/s NICs whose hardware is unable to
sustain line rate), emulated and native mode will likely achieve similar or
identical throughput.
When emulation is in use, packet sniffer programs such as tcpdump could see
received packets before they are diverted by netmap. This behaviour is not
intentional, being just an artifact of the implementation of emulation. Note
that if the netmap application subsequently moves packets received from
the emulated adapter onto the host RX ring, the sniffer will intercept those
packets again, since the packets are injected into the host stack as if they
had been received by the network interface.
Emulation is also available for devices with native netmap support, which can be
used for testing or performance comparison. The sysctl variable
dev.netmap.admode globally controls how
netmap mode is implemented.
Some aspects of the operation of
netmap and
VALE are controlled through sysctl variables on
FreeBSD (
dev.netmap.*)
and module parameters on Linux
(
/sys/module/netmap/parameters/*):
- dev.netmap.admode: 0
- Controls the use of native or emulated adapter mode.
0 uses the best available option;
1 forces native mode and fails if not available;
2 forces emulated hence never fails.
- dev.netmap.generic_rings:
1
- Number of rings used for emulated netmap mode
- dev.netmap.generic_ringsize:
1024
- Ring size used for emulated netmap mode
- dev.netmap.generic_mit:
100000
- Controls interrupt moderation for emulated mode
- dev.netmap.fwd: 0
- Forces NS_FORWARD mode
- dev.netmap.txsync_retry:
2
- Number of txsync loops in the
VALE flush function
- dev.netmap.no_pendintr:
1
- Forces recovery of transmit buffers on system calls
- dev.netmap.no_timestamp:
0
- Disables the update of the timestamp in the netmap
ring
- dev.netmap.verbose: 0
- Verbose kernel messages
- dev.netmap.buf_num: 163840
- dev.netmap.buf_size: 2048
- dev.netmap.ring_num: 200
- dev.netmap.ring_size: 36864
- dev.netmap.if_num: 100
- dev.netmap.if_size: 1024
- Sizes and number of objects (netmap_if, netmap_ring, buffers) for the
global memory region. The only parameter worth modifying is
dev.netmap.buf_num as it impacts the total amount of memory used by netmap.
- dev.netmap.buf_curr_num: 0
- dev.netmap.buf_curr_size: 0
- dev.netmap.ring_curr_num: 0
- dev.netmap.ring_curr_size: 0
- dev.netmap.if_curr_num: 0
- dev.netmap.if_curr_size: 0
- Actual values in use.
- dev.netmap.priv_buf_num: 4098
- dev.netmap.priv_buf_size: 2048
- dev.netmap.priv_ring_num: 4
- dev.netmap.priv_ring_size: 20480
- dev.netmap.priv_if_num: 2
- dev.netmap.priv_if_size: 1024
- Sizes and number of objects (netmap_if, netmap_ring, buffers) for private
memory regions. A separate memory region is used for each VALE port and each
pair of netmap pipes.
- dev.netmap.bridge_batch:
1024
- Batch size used when moving packets across a
VALE switch. Values above 64 generally
guarantee good performance.
- dev.netmap.ptnet_vnet_hdr:
1
- Allow ptnet devices to use virtio-net headers
netmap uses
select(2),
poll(2),
epoll(7)
and
kqueue(2) to wake up processes when
significant events occur, and
mmap(2) to map
memory.
ioctl(2) is used to configure ports and
VALE switches.
Applications may need to create threads and bind them to specific cores to
improve performance, using standard OS primitives, see
pthread(3). In particular,
pthread_setaffinity_np(3) may be of use.
netmap comes with a few programs that can be used
for testing or simple applications. See the
examples/ directory in
netmap distributions, or
tools/tools/netmap/ directory in
FreeBSD distributions.
pkt-gen(8) is a general purpose traffic
source/sink.
As an example
pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
pkt-gen -i ix0 -f rx
is a traffic sink. Both print traffic statistics, to help monitor how the system
performs.
pkt-gen(8) has many options that can be used to set
packet sizes, addresses, rates, and to use multiple send/receive threads and
cores.
bridge(8) is another test program which
interconnects two
netmap ports. It can be used
for transparent forwarding between interfaces, as in
bridge -i netmap:ix0 -i netmap:ix1
or even connect the NIC to the host stack using netmap
bridge -i netmap:ix0
The following code implements a traffic generator:

#include <net/netmap_user.h>
...
void sender(void)
{
    struct netmap_if *nifp;
    struct netmap_ring *ring;
    struct nmreq nmr;
    struct pollfd fds;
    void *p;
    char *buf;
    uint32_t i;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    bzero(&nmr, sizeof(nmr));
    strcpy(nmr.nr_name, "ix0");
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCREGIF, &nmr);
    p = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(p, nmr.nr_offset);
    ring = NETMAP_TXRING(nifp, 0);
    fds.fd = fd;
    fds.events = POLLOUT;
    for (;;) {
        poll(&fds, 1, -1);
        while (!nm_ring_empty(ring)) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
            ... prepare packet in buf ...
            ring->slot[i].len = ... packet length ...
            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
}
A simple receiver can be implemented using the helper functions:

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
...
void receiver(void)
{
    struct nm_desc *d;
    struct pollfd fds;
    u_char *buf;
    struct nm_pkthdr h;
    ...
    d = nm_open("netmap:ix0", NULL, 0, 0);
    fds.fd = NETMAP_FD(d);
    fds.events = POLLIN;
    for (;;) {
        poll(&fds, 1, -1);
        while ( (buf = nm_nextpkt(d, &h)) )
            consume_pkt(buf, h.len);
    }
    nm_close(d);
}
Since physical interfaces share the same memory region, it is possible to do
packet forwarding between ports by swapping buffers. The buffer from the
transmit ring is used to replenish the receive ring:

uint32_t tmp;
struct netmap_slot *src, *dst;
...
src = &rxr->slot[rxr->cur];
dst = &txr->slot[txr->cur];
tmp = dst->buf_idx;
dst->buf_idx = src->buf_idx;
dst->len = src->len;
dst->flags = NS_BUF_CHANGED;
src->buf_idx = tmp;
src->flags = NS_BUF_CHANGED;
rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
txr->head = txr->cur = nm_ring_next(txr, txr->cur);
...
The host stack is for all practical purposes just a regular ring pair, which
you can access with the netmap API (e.g., with
nm_open("netmap:eth0^", ...);).
All packets that the host would send to an interface in
netmap mode end up in the RX ring, whereas all
packets queued to the TX ring are sent up to the host stack.
A simple way to test the performance of a
VALE
switch is to attach a sender and a receiver to it, e.g., running the following
in two different terminals:
pkt-gen -i vale1:a -f rx    # receiver
pkt-gen -i vale1:b -f tx    # sender
The same example can be used to test netmap pipes, by simply changing port
names, e.g.,
pkt-gen -i vale2:x{3 -f rx    # receiver on the master side
pkt-gen -i vale2:x}3 -f tx    # sender on the slave side
The following command attaches an interface and the host stack to a switch:
valectl -h vale2:em0
Other
netmap clients attached to the same switch
can now communicate with the network card or the host.
vale(4),
valectl(8),
bridge(8),
lb(8),
nmreplay(8),
pkt-gen(8)
http://info.iet.unipi.it/~luigi/netmap/
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework, Communications
of the ACM, 55 (3), pp.45-51, March 2012
Luigi Rizzo, netmap: a novel framework for fast packet I/O, Usenix ATC'12, June
2012, Boston
Luigi Rizzo, Giuseppe Lettieri, VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione, Speeding up packet I/O in
virtual machines, ACM/IEEE ANCS'13, October 2013, San Jose
The
netmap framework has been originally designed
and implemented at the Università di Pisa in 2011 by
Luigi Rizzo, and further extended with help
from
Matteo Landi,
Gaetano Catalli,
Giuseppe Lettieri, and
Vincenzo Maffione.
netmap and
VALE have
been funded by the European Commission within FP7 Projects CHANGE (257422) and
OPENLAB (287581).
No matter how fast the CPU and OS are, achieving line rate on 10G and faster
interfaces requires hardware with sufficient performance. Several NICs are
unable to sustain line rate with small packet sizes. Insufficient PCIe or
memory bandwidth can also cause reduced performance.
Another frequent reason for low performance is the use of flow control on the
link: a slow receiver can limit the transmit speed. Be sure to disable flow
control when running high speed experiments.
netmap is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
Multiple transmit and receive rings are supported natively and can be configured
with ordinary OS tools, such as
ethtool(8) or
device-specific sysctl variables. The same goes for Receive Packet Steering
(RPS) and filtering of incoming traffic.
netmap does not use
features such as
checksum offloading,
TCP segmentation offloading,
encryption,
VLAN
encapsulation/decapsulation, etc. When using netmap to exchange packets
with the host stack, make sure to disable these features.