Commit | Line | Data |
---|---|---|
e6e37f63 OS |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ===================================== | |
56c07271 | 4 | Scaling in the Linux Networking Stack |
e6e37f63 | 5 | ===================================== |
56c07271 WB |
6 | |
7 | ||
8 | Introduction | |
9 | ============ | |
10 | ||
11 | This document describes a set of complementary techniques in the Linux | |
12 | networking stack to increase parallelism and improve performance for | |
13 | multi-processor systems. | |
14 | ||
15 | The following technologies are described: | |
16 | ||
e6e37f63 OS |
17 | - RSS: Receive Side Scaling |
18 | - RPS: Receive Packet Steering | |
19 | - RFS: Receive Flow Steering | |
20 | - Accelerated Receive Flow Steering | |
21 | - XPS: Transmit Packet Steering | |
56c07271 WB |
22 | |
23 | ||
24 | RSS: Receive Side Scaling | |
25 | ========================= | |
26 | ||
27 | Contemporary NICs support multiple receive and transmit descriptor queues | |
28 | (multi-queue). On reception, a NIC can send different packets to different | |
29 | queues to distribute processing among CPUs. The NIC distributes packets by | |
30 | applying a filter to each packet that assigns it to one of a small number | |
31 | of logical flows. Packets for each flow are steered to a separate receive | |
32 | queue, which in turn can be processed by separate CPUs. This mechanism is | |
33 | generally known as “Receive-side Scaling” (RSS). The goal of RSS and | |
186c6bbc | 34 | the other scaling techniques is to increase performance uniformly. |
56c07271 WB |
35 | Multi-queue distribution can also be used for traffic prioritization, but |
36 | that is not the focus of these techniques. | |
37 | ||
38 | The filter used in RSS is typically a hash function over the network | |
39 | and/or transport layer headers-- for example, a 4-tuple hash over | |
40 | IP addresses and TCP ports of a packet. The most common hardware | |
41 | implementation of RSS uses a 128-entry indirection table where each entry | |
42 | stores a queue number. The receive queue for a packet is determined | |
43 | by taking the low-order seven bits of the computed hash for the | |
44 | packet (usually a Toeplitz hash), using this number as an index into the | |
45 | indirection table and reading the corresponding value. | |
46 | ||
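The lookup described above can be sketched in a few lines of Python. This is an illustrative model only, not driver code: the function names are hypothetical, and the round-robin fill is merely one assumed way of spreading queues evenly over the table.

```python
# Illustrative model of RSS queue selection; names are hypothetical.
INDIRECTION_TABLE_SIZE = 128  # table size cited in the text


def build_default_table(num_queues, size=INDIRECTION_TABLE_SIZE):
    """Spread queue numbers evenly over the table (assumed round-robin)."""
    return [i % num_queues for i in range(size)]


def select_rx_queue(flow_hash, table):
    """Index the table with the low-order bits of the hash.

    The mask only works when the table size is a power of two;
    for 128 entries it keeps the low-order seven bits.
    """
    index = flow_hash & (len(table) - 1)
    return table[index]


table = build_default_table(num_queues=8)
queue = select_rx_queue(0x1234ABCD, table)
```

Giving one queue more entries in the table than another is how different relative weights would be expressed in this model.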
47 | Some advanced NICs allow steering packets to queues based on | |
48 | programmable filters. For example, webserver bound TCP port 80 packets | |
49 | can be directed to their own receive queue. Such “n-tuple” filters can | |
50 | be configured from ethtool (--config-ntuple). | |
51 | ||
e6e37f63 OS |
52 | |
53 | RSS Configuration | |
54 | ----------------- | |
56c07271 WB |
55 | |
56 | The driver for a multi-queue capable NIC typically provides a kernel | |
57 | module parameter for specifying the number of hardware queues to | |
58 | configure. In the bnx2x driver, for instance, this parameter is called | |
59 | num_queues. A typical RSS configuration would be to have one receive queue | |
60 | for each CPU if the device supports enough queues, or otherwise at least | |
320f24e4 WB |
61 | one for each memory domain, where a memory domain is a set of CPUs that |
62 | share a particular memory level (L1, L2, NUMA node, etc.). | |
56c07271 WB |
63 | |
64 | The indirection table of an RSS device, which resolves a queue by masked | |
65 | hash, is usually programmed by the driver at initialization. The | |
66 | default mapping is to distribute the queues evenly in the table, but the | |
67 | indirection table can be retrieved and modified at runtime using ethtool | |
68 | commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the | |
69 | indirection table could be done to give different queues different | |
70 | relative weights. | |
71 | ||
e6e37f63 OS |
72 | |
73 | RSS IRQ Configuration | |
74 | ~~~~~~~~~~~~~~~~~~~~~ | |
56c07271 WB |
75 | |
76 | Each receive queue has a separate IRQ associated with it. The NIC triggers | |
77 | this to notify a CPU when new packets arrive on the given queue. The | |
78 | signaling path for PCIe devices uses message signaled interrupts (MSI-X), | |
79 | which can route each interrupt to a particular CPU. The active mapping | |
80 | of queues to IRQs can be determined from /proc/interrupts. By default, | |
81 | an IRQ may be handled on any CPU. Because a non-negligible part of packet | |
82 | processing takes place in receive interrupt handling, it is advantageous | |
83 | to spread receive interrupts between CPUs. To manually adjust the IRQ | |
395cf969 | 84 | affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems |
56c07271 WB |
85 | will be running irqbalance, a daemon that dynamically optimizes IRQ |
86 | assignments and as a result may override any manual settings. | |
87 | ||
e6e37f63 OS |
88 | |
89 | Suggested Configuration | |
90 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
56c07271 WB |
91 | |
92 | RSS should be enabled when latency is a concern or whenever receive | |
93 | interrupt processing forms a bottleneck. Spreading load between CPUs | |
94 | decreases queue length. For low latency networking, the optimal setting | |
95 | is to allocate as many queues as there are CPUs in the system (or the | |
320f24e4 | 96 | NIC maximum, if lower). The most efficient high-rate configuration |
56c07271 | 97 | is likely the one with the smallest number of receive queues where no |
320f24e4 WB |
98 | receive queue overflows due to a saturated CPU, because in default |
99 | mode with interrupt coalescing enabled, the aggregate number of | |
100 | interrupts (and thus work) grows with each additional queue. | |
101 | ||
102 | Per-cpu load can be observed using the mpstat utility, but note that on | |
103 | processors with hyperthreading (HT), each hyperthread is represented as | |
104 | a separate CPU. For interrupt handling, HT has shown no benefit in | |
105 | initial tests, so limit the number of queues to the number of CPU cores | |
106 | in the system. | |
56c07271 WB |
107 | |
108 | ||
109 | RPS: Receive Packet Steering | |
110 | ============================ | |
111 | ||
112 | Receive Packet Steering (RPS) is logically a software implementation of | |
113 | RSS. Being in software, it is necessarily called later in the datapath. | |
114 | Whereas RSS selects the queue and hence CPU that will run the hardware | |
115 | interrupt handler, RPS selects the CPU to perform protocol processing | |
116 | above the interrupt handler. This is accomplished by placing the packet | |
117 | on the desired CPU’s backlog queue and waking up the CPU for processing. | |
e6e37f63 OS |
118 | RPS has some advantages over RSS: |
119 | ||
120 | 1) it can be used with any NIC | |
121 | 2) software filters can easily be added to hash over new protocols | |
56c07271 | 122 | 3) it does not increase hardware device interrupt rate (although it does |
e6e37f63 | 123 | introduce inter-processor interrupts (IPIs)) |
56c07271 WB |
124 | |
125 | RPS is called during the bottom half of the receive interrupt handler, when | |
126 | a driver sends a packet up the network stack with netif_rx() or | |
127 | netif_receive_skb(). These call the get_rps_cpu() function, which | |
128 | selects the queue that should process a packet. | |
129 | ||
130 | The first step in determining the target CPU for RPS is to calculate a | |
131 | flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash | |
132 | depending on the protocol). This serves as a consistent hash of the | |
133 | associated flow of the packet. The hash is either provided by hardware | |
134 | or will be computed in the stack. Capable hardware can pass the hash in | |
135 | the receive descriptor for the packet; this would usually be the same | |
136 | hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in | |
e4061d57 | 137 | skb->hash and can be used elsewhere in the stack as a hash of the |
56c07271 WB |
138 | packet’s flow. |
139 | ||
140 | Each receive hardware queue has an associated list of CPUs to which | |
141 | RPS may enqueue packets for processing. For each received packet, | |
142 | an index into the list is computed from the flow hash modulo the size | |
143 | of the list. The indexed CPU is the target for processing the packet, | |
144 | and the packet is queued to the tail of that CPU’s backlog queue. At | |
145 | the end of the bottom half routine, IPIs are sent to any CPUs for which | |
146 | packets have been queued to their backlog queue. The IPI wakes backlog | |
147 | processing on the remote CPU, and any queued packets are then processed | |
148 | up the networking stack. | |
149 | ||
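The CPU selection step described above can be modeled as follows. This is a minimal sketch that follows the wording of the text (flow hash modulo the size of the list); the function name is hypothetical, and the kernel's actual arithmetic may differ.

```python
def rps_target_cpu(flow_hash, rps_cpu_list):
    """Return the CPU that should process a packet under plain RPS.

    rps_cpu_list models the list of CPUs associated with the receive
    queue. An empty list models RPS being disabled, in which case the
    packet stays on the interrupting CPU (represented here by None).
    """
    if not rps_cpu_list:
        return None
    return rps_cpu_list[flow_hash % len(rps_cpu_list)]
```

Because the index depends only on the flow hash and the list, every packet of a flow maps to the same CPU as long as the list is unchanged.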
e6e37f63 OS |
150 | |
151 | RPS Configuration | |
152 | ----------------- | |
56c07271 WB |
153 | |
154 | RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on | |
155 | by default for SMP). Even when compiled in, RPS remains disabled until | |
156 | explicitly configured. The list of CPUs to which RPS may forward traffic | |
e6e37f63 | 157 | can be configured for each receive queue using a sysfs file entry:: |
56c07271 | 158 | |
e6e37f63 | 159 | /sys/class/net/<dev>/queues/rx-<n>/rps_cpus |
56c07271 WB |
160 | |
161 | This file implements a bitmap of CPUs. RPS is disabled when it is zero | |
162 | (the default), in which case packets are processed on the interrupting | |
163 | CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to | |
164 | the bitmap. | |
165 | ||
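As an illustration of the bitmap format, the following Python sketch converts between a list of CPU numbers and a hexadecimal mask of the kind written to rps_cpus. The helper names are hypothetical. The sketch emits a single hexadecimal number and does not produce the comma-separated groups that masks for large CPU counts typically use.

```python
def cpus_to_mask(cpus):
    """Return a hex string with bit N set for every CPU N in cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Return the sorted list of CPU numbers whose bits are set."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]
```

For example, CPUs 0 to 3 give the mask f, and a mask of 0 corresponds to RPS being disabled for that queue.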
e6e37f63 OS |
166 | |
167 | Suggested Configuration | |
168 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
56c07271 WB |
169 | |
170 | For a single queue device, a typical RPS configuration would be to set | |
320f24e4 | 171 | the rps_cpus to the CPUs in the same memory domain of the interrupting |
56c07271 WB |
172 | CPU. If NUMA locality is not an issue, this could also be all CPUs in |
173 | the system. At high interrupt rate, it might be wise to exclude the | |
174 | interrupting CPU from the map since that already performs much work. | |
175 | ||
176 | For a multi-queue system, if RSS is configured so that a hardware | |
177 | receive queue is mapped to each CPU, then RPS is probably redundant | |
178 | and unnecessary. If there are fewer hardware queues than CPUs, then | |
179 | RPS might be beneficial if the rps_cpus for each queue are the ones that | |
320f24e4 | 180 | share the same memory domain as the interrupting CPU for that queue. |
56c07271 | 181 | |
e6e37f63 OS |
182 | |
183 | RPS Flow Limit | |
184 | -------------- | |
191cb1f2 WB |
185 | |
186 | RPS scales kernel receive processing across CPUs without introducing | |
187 | reordering. The trade-off to sending all packets from the same flow | |
188 | to the same CPU is CPU load imbalance if flows vary in packet rate. | |
189 | In the extreme case a single flow dominates traffic. Especially on | |
190 | common server workloads with many concurrent connections, such | |
191 | behavior indicates a problem such as a misconfiguration or spoofed | |
192 | source Denial of Service attack. | |
193 | ||
194 | Flow Limit is an optional RPS feature that prioritizes small flows | |
195 | during CPU contention by dropping packets from large flows slightly | |
196 | ahead of those from small flows. It is active only when an RPS or RFS | |
197 | destination CPU approaches saturation. Once a CPU's input packet | |
198 | queue exceeds half the maximum queue length (as set by sysctl | |
199 | net.core.netdev_max_backlog), the kernel starts a per-flow packet | |
200 | count over the last 256 packets. If a flow exceeds a set ratio (by | |
201 | default, half) of these packets when a new packet arrives, then the | |
202 | new packet is dropped. Packets from other flows are still only | |
203 | dropped once the input packet queue reaches netdev_max_backlog. | |
204 | No packets are dropped when the input packet queue length is below | |
205 | the threshold, so flow limit does not sever connections outright: | |
206 | even large flows maintain connectivity. | |
207 | ||
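The drop decision described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the kernel implementation: the class and method names are hypothetical, the bucket is chosen with a plain modulo, and the constants are the defaults quoted in the text.

```python
from collections import deque

HISTORY_LEN = 256   # packets of history, per the text
NUM_BUCKETS = 4096  # default table size, per the text


class FlowLimit:
    """Per-CPU model of flow limit accounting."""

    def __init__(self, max_backlog):
        self.max_backlog = max_backlog   # models netdev_max_backlog
        self.history = deque()           # buckets of the most recent packets
        self.buckets = [0] * NUM_BUCKETS

    def should_drop(self, flow_hash, queue_len):
        # Inactive while the queue is below half the maximum length.
        if queue_len < self.max_backlog // 2:
            return False
        bucket = flow_hash % NUM_BUCKETS
        # Forget the oldest packet once the history window is full.
        if len(self.history) == HISTORY_LEN:
            self.buckets[self.history.popleft()] -= 1
        self.history.append(bucket)
        self.buckets[bucket] += 1
        # Drop when this flow holds more than half of the window.
        return self.buckets[bucket] > HISTORY_LEN // 2
```

In this model a flow that dominates the window starts losing packets, while packets from other flows are still accepted by should_drop().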
e6e37f63 OS |
208 | |
209 | Interface | |
210 | ~~~~~~~~~ | |
191cb1f2 WB |
211 | |
212 | Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not | |
213 | turned on. It is implemented for each CPU independently (to avoid lock | |
214 | and cache contention) and toggled per CPU by setting the relevant bit | |
215 | in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU | |
e6e37f63 | 216 | bitmap interface as rps_cpus (see above) when called from procfs:: |
191cb1f2 | 217 | |
e6e37f63 | 218 | /proc/sys/net/core/flow_limit_cpu_bitmap |
191cb1f2 WB |
219 | |
220 | Per-flow rate is calculated by hashing each packet into a hashtable | |
221 | bucket and incrementing a per-bucket counter. The hash function is | |
222 | the same that selects a CPU in RPS, but as the number of buckets can | |
223 | be much larger than the number of CPUs, flow limit has finer-grained | |
224 | identification of large flows and fewer false positives. The default | |
e6e37f63 | 225 | table has 4096 buckets. This value can be modified through sysctl:: |
191cb1f2 | 226 | |
e6e37f63 | 227 | net.core.flow_limit_table_len |
191cb1f2 WB |
228 | |
229 | The value is only consulted when a new table is allocated. Modifying | |
230 | it does not update active tables. | |
231 | ||
e6e37f63 OS |
232 | |
233 | Suggested Configuration | |
234 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
191cb1f2 WB |
235 | |
236 | Flow limit is useful on systems with many concurrent connections, | |
237 | where a single connection taking up 50% of a CPU indicates a problem. | |
238 | In such environments, enable the feature on all CPUs that handle | |
239 | network rx interrupts (as set in /proc/irq/N/smp_affinity). | |
240 | ||
241 | The feature requires the input packet queue length to exceed | |
242 | the flow limit threshold (50%) + the flow history length (256). | |
243 | Setting net.core.netdev_max_backlog to either 1000 or 10000 | |
244 | performed well in experiments. | |
245 | ||
56c07271 WB |
246 | |
247 | RFS: Receive Flow Steering | |
248 | ========================== | |
249 | ||
250 | While RPS steers packets solely based on hash, and thus generally | |
251 | provides good load distribution, it does not take into account | |
252 | application locality. This is accomplished by Receive Flow Steering | |
253 | (RFS). The goal of RFS is to increase data cache hit rate by steering | |
254 | kernel processing of packets to the CPU where the application thread | |
255 | consuming the packet is running. RFS relies on the same RPS mechanisms | |
256 | to enqueue packets onto the backlog of another CPU and to wake up that | |
257 | CPU. | |
258 | ||
259 | In RFS, packets are not forwarded directly by the value of their hash, | |
260 | but the hash is used as index into a flow lookup table. This table maps | |
261 | flows to the CPUs where those flows are being processed. The flow hash | |
262 | (see RPS section above) is used to calculate the index into this table. | |
263 | The CPU recorded in each entry is the one which last processed the flow. | |
264 | If an entry does not hold a valid CPU, then packets mapped to that entry | |
265 | are steered using plain RPS. Multiple table entries may point to the | |
266 | same CPU. Indeed, with many flows and few CPUs, it is very likely that | |
267 | a single application thread handles flows with many different flow hashes. | |
268 | ||
186c6bbc BP |
269 | rps_sock_flow_table is a global flow table that contains the *desired* CPU |
270 | for flows: the CPU that is currently processing the flow in userspace. | |
271 | Each table value is a CPU index that is updated during calls to recvmsg | |
272 | and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() | |
56c07271 WB |
273 | and tcp_splice_read()). |
274 | ||
275 | When the scheduler moves a thread to a new CPU while it has outstanding | |
276 | receive packets on the old CPU, packets may arrive out of order. To | |
277 | avoid this, RFS uses a second flow table to track outstanding packets | |
278 | for each flow: rps_dev_flow_table is a table specific to each hardware | |
279 | receive queue of each device. Each table value stores a CPU index and a | |
280 | counter. The CPU index represents the *current* CPU onto which packets | |
281 | for this flow are enqueued for further kernel processing. Ideally, kernel | |
282 | and userspace processing occur on the same CPU, and hence the CPU index | |
283 | in both tables is identical. This is likely false if the scheduler has | |
284 | recently migrated a userspace thread while the kernel still has packets | |
285 | enqueued for kernel processing on the old CPU. | |
286 | ||
287 | The counter in rps_dev_flow_table values records the length of the current | |
288 | CPU's backlog when a packet in this flow was last enqueued. Each backlog | |
289 | queue has a head counter that is incremented on dequeue. A tail counter | |
290 | is computed as head counter + queue length. In other words, the counter | |
08f4fc9d | 291 | in rps_dev_flow[i] records the last element in flow i that has |
56c07271 WB |
292 | been enqueued onto the currently designated CPU for flow i (of course, |
293 | entry i is actually selected by hash and multiple flows may hash to the | |
294 | same entry i). | |
295 | ||
296 | And now the trick for avoiding out of order packets: when selecting the | |
297 | CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table | |
298 | and the rps_dev_flow table of the queue that the packet was received on | |
299 | are compared. If the desired CPU for the flow (found in the | |
300 | rps_sock_flow table) matches the current CPU (found in the rps_dev_flow | |
301 | table), the packet is enqueued onto that CPU’s backlog. If they differ, | |
302 | the current CPU is updated to match the desired CPU if one of the | |
303 | following is true: | |
304 | ||
e6e37f63 OS |
305 | - The current CPU's queue head counter >= the recorded tail counter |
306 | value in rps_dev_flow[i] | |
307 | - The current CPU is unset (>= nr_cpu_ids) | |
308 | - The current CPU is offline | |
56c07271 WB |
309 | |
310 | After this check, the packet is sent to the (possibly updated) current | |
311 | CPU. These rules aim to ensure that a flow only moves to a new CPU when | |
312 | there are no packets outstanding on the old CPU, as the outstanding | |
313 | packets could arrive later than those about to be processed on the new | |
314 | CPU. | |
315 | ||
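The rules above can be condensed into a small decision function. This is a hedged sketch with hypothetical parameter names, not the code of get_rps_cpu(); in particular it compares the counters directly and ignores counter wraparound.

```python
def rfs_pick_cpu(desired_cpu, current_cpu, head_counter, last_qtail,
                 nr_cpu_ids, online_cpus):
    """Return the CPU to enqueue the packet onto.

    desired_cpu  : CPU from the global (socket) flow table
    current_cpu  : CPU from the per-queue (device) flow table
    head_counter : head counter of the current CPU's backlog queue
    last_qtail   : tail counter recorded for this flow entry
    """
    if desired_cpu != current_cpu:
        may_move = (
            current_cpu >= nr_cpu_ids           # current CPU is unset
            or current_cpu not in online_cpus   # current CPU is offline
            or head_counter >= last_qtail       # nothing outstanding
        )
        if may_move:
            current_cpu = desired_cpu
    return current_cpu
```

When none of the three conditions holds, the flow stays on the old CPU until its outstanding packets have been dequeued.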
e6e37f63 OS |
316 | |
317 | RFS Configuration | |
318 | ----------------- | |
56c07271 | 319 | |
08f4fc9d | 320 | RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on |
56c07271 | 321 | by default for SMP). The functionality remains disabled until explicitly |
e6e37f63 OS |
322 | configured. The number of entries in the global flow table is set through:: |
323 | ||
324 | /proc/sys/net/core/rps_sock_flow_entries | |
56c07271 | 325 | |
e6e37f63 | 326 | The number of entries in the per-queue flow table is set through:: |
56c07271 | 327 | |
e6e37f63 | 328 | /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt |
56c07271 | 329 | |
56c07271 | 330 | |
e6e37f63 OS |
331 | Suggested Configuration |
332 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
56c07271 WB |
333 | |
334 | Both of these need to be set before RFS is enabled for a receive queue. | |
335 | Values for both are rounded up to the nearest power of two. The | |
336 | suggested flow count depends on the expected number of active connections | |
337 | at any given time, which may be significantly less than the number of open | |
338 | connections. We have found that a value of 32768 for rps_sock_flow_entries | |
339 | works fairly well on a moderately loaded server. | |
340 | ||
341 | For a single queue device, the rps_flow_cnt value for the single queue | |
342 | would normally be configured to the same value as rps_sock_flow_entries. | |
343 | For a multi-queue device, the rps_flow_cnt for each queue might be | |
344 | configured as rps_sock_flow_entries / N, where N is the number of | |
08f4fc9d | 345 | queues. So for instance, if rps_sock_flow_entries is set to 32768 and there |
56c07271 WB |
346 | are 16 configured receive queues, rps_flow_cnt for each queue might be |
347 | configured as 2048. | |
348 | ||
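The sizing arithmetic above can be written out as a short Python sketch. The function names are hypothetical; the rounding models the statement that values are rounded up to the nearest power of two.

```python
def round_up_pow2(n):
    """Round n up to the nearest power of two (minimum 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def suggested_rps_flow_cnt(rps_sock_flow_entries, num_rx_queues):
    """Per-queue flow count: global entries divided by the queue count."""
    return round_up_pow2(rps_sock_flow_entries // num_rx_queues)
```

With 32768 global entries and 16 receive queues this yields 2048 per queue, matching the example in the text.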
349 | ||
350 | Accelerated RFS | |
351 | =============== | |
352 | ||
353 | Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load | |
354 | balancing mechanism that uses soft state to steer flows based on where | |
355 | the application thread consuming the packets of each flow is running. | |
356 | Accelerated RFS should perform better than RFS since packets are sent | |
357 | directly to a CPU local to the thread consuming the data. The target CPU | |
358 | will either be the same CPU where the application runs, or at least a CPU | |
359 | which is local to the application thread’s CPU in the cache hierarchy. | |
360 | ||
361 | To enable accelerated RFS, the networking stack calls the | |
362 | ndo_rx_flow_steer driver function to communicate the desired hardware | |
363 | queue for packets matching a particular flow. The network stack | |
364 | automatically calls this function every time a flow entry in | |
365 | rps_dev_flow_table is updated. The driver in turn uses a device specific | |
366 | method to program the NIC to steer the packets. | |
367 | ||
368 | The hardware queue for a flow is derived from the CPU recorded in | |
369 | rps_dev_flow_table. The stack consults a CPU to hardware queue map which | |
370 | is maintained by the NIC driver. This is an auto-generated reverse map of | |
371 | the IRQ affinity table shown by /proc/interrupts. Drivers can use | |
372 | functions in the cpu_rmap (“CPU affinity reverse map”) kernel library | |
373 | to populate the map. For each CPU, the corresponding queue in the map is | |
374 | set to be one whose processing CPU is closest in cache locality. | |
375 | ||
e6e37f63 OS |
376 | |
377 | Accelerated RFS Configuration | |
378 | ----------------------------- | |
56c07271 WB |
379 | |
380 | Accelerated RFS is only available if the kernel is compiled with | |
381 | CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. | |
382 | It also requires that ntuple filtering is enabled via ethtool. The map | |
383 | of CPU to queues is automatically deduced from the IRQ affinities | |
384 | configured for each receive queue by the driver, so no additional | |
385 | configuration should be necessary. | |
386 | ||
e6e37f63 OS |
387 | |
388 | Suggested Configuration | |
389 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
56c07271 WB |
390 | |
391 | This technique should be enabled whenever one wants to use RFS and the | |
392 | NIC supports hardware acceleration. | |
393 | ||
e6e37f63 | 394 | |
56c07271 WB |
395 | XPS: Transmit Packet Steering |
396 | ============================= | |
397 | ||
398 | Transmit Packet Steering is a mechanism for intelligently selecting | |
399 | which transmit queue to use when transmitting a packet on a multi-queue | |
a4fd1f4b AN |
400 | device. This can be accomplished by recording two kinds of maps, either |
401 | a mapping of CPU to hardware queue(s) or a mapping of receive queue(s) | |
402 | to hardware transmit queue(s). | |
403 | ||
404 | 1. XPS using CPUs map | |
405 | ||
406 | The goal of this mapping is usually to assign queues | |
56c07271 WB |
407 | exclusively to a subset of CPUs, where the transmit completions for |
408 | these queues are processed on a CPU within this set. This choice | |
409 | provides two benefits. First, contention on the device queue lock is | |
410 | significantly reduced since fewer CPUs contend for the same queue | |
411 | (contention can be eliminated completely if each CPU has its own | |
412 | transmit queue). Secondly, cache miss rate on transmit completion is | |
413 | reduced, in particular for data cache lines that hold the sk_buff | |
414 | structures. | |
415 | ||
a4fd1f4b AN |
416 | 2. XPS using receive queues map |
417 | ||
418 | This mapping is used to pick a transmit queue based on the receive | |
419 | queue(s) map configuration set by the administrator. A set of receive | |
420 | queues can be mapped to a set of transmit queues (many:many), although | |
421 | the common use case is a 1:1 mapping. This will enable sending packets | |
422 | on the same queue associations for transmit and receive. This is useful for | |
423 | busy polling multi-threaded workloads where there are challenges in | |
424 | associating a given CPU to a given application thread. The application | |
425 | threads are not pinned to CPUs and each thread handles packets | |
426 | received on a single queue. The receive queue number is cached in the | |
427 | socket for the connection. In this model, sending the packets on the same | |
428 | transmit queue corresponding to the associated receive queue has benefits | |
429 | in keeping the CPU overhead low. Transmit completion work is locked into | |
430 | the same queue-association that a given application is polling on. This | |
431 | avoids the overhead of triggering an interrupt on another CPU. When the | |
432 | application cleans up the packets during the busy poll, transmit completion | |
433 | may be processed along with it in the same thread context and so result in | |
434 | reduced latency. | |
435 | ||
436 | XPS is configured per transmit queue by setting a bitmap of | |
437 | CPUs/receive-queues that may use that queue to transmit. The reverse | |
438 | mapping, from CPUs to transmit queues or from receive-queues to transmit | |
439 | queues, is computed and maintained for each network device. When | |
440 | transmitting the first packet in a flow, the function get_xps_queue() is | |
441 | called to select a queue. This function uses the ID of the receive queue | |
442 | for the socket connection for a match in the receive queue-to-transmit queue | |
443 | lookup table. Alternatively, this function can also use the ID of the | |
444 | running CPU as a key into the CPU-to-queue lookup table. If the | |
56c07271 WB |
445 | ID matches a single queue, that is used for transmission. If multiple |
446 | queues match, one is selected by using the flow hash to compute an index | |
a4fd1f4b AN |
447 | into the set. When selecting the transmit queue based on receive queue(s) |
448 | map, the transmit device is not validated against the receive device, as that | |
449 | would require an expensive lookup operation in the datapath. | |
56c07271 WB |
450 | |
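The queue selection described above can be modeled as follows. This is a sketch under stated assumptions rather than the kernel's get_xps_queue(): the function names and the dictionary-based maps are hypothetical, the index into a multi-queue set is computed with a plain modulo, and the receive-queue map is tried before the CPUs map.

```python
def pick_from_set(queues, flow_hash):
    """Pick one transmit queue from the set that matched a lookup."""
    if not queues:
        return None
    if len(queues) == 1:
        return queues[0]
    # Several queues match: use the flow hash to index into the set.
    return queues[flow_hash % len(queues)]


def xps_select_queue(cpu, rx_queue, flow_hash, rxq_map, cpu_map):
    """Return a transmit queue, or None when no XPS mapping applies.

    rxq_map : receive queue ID -> list of transmit queues
    cpu_map : CPU ID -> list of transmit queues
    """
    queue = None
    if rx_queue is not None:
        queue = pick_from_set(rxq_map.get(rx_queue, []), flow_hash)
    if queue is None:
        queue = pick_from_set(cpu_map.get(cpu, []), flow_hash)
    return queue
```

A result of None in this model stands for the case where XPS makes no choice and queue selection is left to the rest of the stack.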
451 | The queue chosen for transmitting a particular flow is saved in the | |
452 | corresponding socket structure for the flow (e.g. a TCP connection). | |
453 | This transmit queue is used for subsequent packets sent on the flow to | |
454 | prevent out of order (ooo) packets. The choice also amortizes the cost | |
320f24e4 | 455 | of calling get_xps_queue() over all packets in the flow. To avoid |
56c07271 WB |
456 | ooo packets, the queue for a flow can subsequently only be changed if |
457 | skb->ooo_okay is set for a packet in the flow. This flag indicates that | |
458 | there are no outstanding packets in the flow, so the transmit queue can | |
459 | change without the risk of generating out of order packets. The | |
460 | transport layer is responsible for setting ooo_okay appropriately. TCP, | |
461 | for instance, sets the flag when all data for a connection has been | |
462 | acknowledged. | |
463 | ||
e6e37f63 OS |
464 | XPS Configuration |
465 | ----------------- | |
56c07271 WB |
466 | |
467 | XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by | |
468 | default for SMP). The functionality remains disabled until explicitly | |
a4fd1f4b AN |
469 | configured. To enable XPS, the bitmap of CPUs/receive-queues that may |
470 | use a transmit queue is configured using the sysfs file entry: | |
56c07271 | 471 | |
e6e37f63 OS |
472 | For selection based on CPUs map:: |
473 | ||
474 | /sys/class/net/<dev>/queues/tx-<n>/xps_cpus | |
475 | ||
476 | For selection based on receive-queues map:: | |
477 | ||
478 | /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs | |
56c07271 | 479 | |
a4fd1f4b | 480 | |
e6e37f63 OS |
481 | Suggested Configuration |
482 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
56c07271 WB |
483 | |
484 | For a network device with a single transmission queue, XPS configuration | |
485 | has no effect, since there is no choice in this case. In a multi-queue | |
486 | system, XPS is preferably configured so that each CPU maps onto one queue. | |
487 | If there are as many queues as there are CPUs in the system, then each | |
488 | queue can also map onto one CPU, resulting in exclusive pairings that | |
489 | experience no contention. If there are fewer queues than CPUs, then the | |
490 | best CPUs to share a given queue are probably those that share the cache | |
491 | with the CPU that processes transmit completions for that queue | |
492 | (transmit interrupts). | |
493 | ||
a4fd1f4b AN |
494 | For transmit queue selection based on receive queue(s), XPS has to be |
495 | explicitly configured mapping receive-queue(s) to transmit queue(s). If the | |
496 | user configuration for receive-queue map does not apply, then the transmit | |
497 | queue is selected based on the CPUs map. | |
498 | ||
e6e37f63 OS |
499 | |
500 | Per TX Queue rate limitation | |
501 | ============================ | |
822b3b2e JF |
502 | |
503 | These are rate-limitation mechanisms implemented by HW, where currently | |
e6e37f63 | 504 | a max-rate attribute is supported, by setting a Mbps value to:: |
822b3b2e | 505 | |
e6e37f63 | 506 | /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate |
822b3b2e JF |
507 | |
508 | A value of zero means disabled, and this is the default. | |
56c07271 | 509 | |
e6e37f63 | 510 | |
56c07271 WB |
511 | Further Information |
512 | =================== | |
513 | RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into | |
514 | 2.6.38. Original patches were submitted by Tom Herbert | |
515 | (therbert@google.com) | |
516 | ||
517 | Accelerated RFS was introduced in 2.6.39. Original patches were | |
c06cbcb6 | 518 | submitted by Ben Hutchings (bwh@kernel.org) |
56c07271 WB |
519 | |
520 | Authors: | |
e6e37f63 OS |
521 | |
522 | - Tom Herbert (therbert@google.com) | |
523 | - Willem de Bruijn (willemb@google.com) |