.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

.. _napi:

====
NAPI
====

NAPI is the event handling mechanism used by the Linux networking stack.
The name NAPI no longer stands for anything in particular [#]_.

In basic operation the device notifies the host about new events
via an interrupt.
The host then schedules a NAPI instance to process the events.
The device may also be polled for events via NAPI without receiving
interrupts first (:ref:`busy polling<poll>`).

NAPI processing usually happens in the software interrupt context,
but there is an option to use :ref:`separate kernel threads<threaded>`
for NAPI processing.

All in all NAPI abstracts away from the drivers the context and configuration
of event (packet Rx and Tx) processing.

Driver API
==========

The two most important elements of NAPI are the struct napi_struct
and the associated poll method. struct napi_struct holds the state
of the NAPI instance while the method is the driver-specific event
handler. The method will typically free Tx packets that have been
transmitted and process newly received packets.

.. _drv_ctrl:

Control API
-----------

netif_napi_add() and netif_napi_del() add/remove a NAPI instance
from the system. The instances are attached to the netdevice passed
as argument (and will be deleted automatically when the netdevice is
unregistered). Instances are added in a disabled state.

napi_enable() and napi_disable() manage the disabled state.
A disabled NAPI can't be scheduled and its poll method is guaranteed
to not be invoked. napi_disable() waits for ownership of the NAPI
instance to be released.
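
For illustration, the add/enable/disable lifecycle might look as follows
(a sketch only - the ``mydrv`` driver, its structures and helpers are
hypothetical, in the style of the examples later in this document):

.. code-block:: c

  /* Attach the instance to the netdevice; it starts disabled. */
  netif_napi_add(priv->netdev, &v->napi, mydrv_poll);

  /* In ndo_open - make the instance schedulable. */
  napi_enable(&v->napi);

  /* In ndo_stop - waits for ownership of the instance
   * to be released before returning.
   */
  napi_disable(&v->napi);

Note that each call must be made exactly once per state transition;
as described below, repeating a call such as napi_disable() will deadlock.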

The control APIs are not idempotent. Control API calls are safe against
concurrent use of datapath APIs but an incorrect sequence of control API
calls may result in crashes, deadlocks, or race conditions. For example,
calling napi_disable() multiple times in a row will deadlock.

Datapath API
------------

napi_schedule() is the basic method of scheduling a NAPI poll.
Drivers should call this function in their interrupt handler
(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
will take ownership of the NAPI instance.

Later, after NAPI is scheduled, the driver's poll method will be
called to process the events/packets. The method takes a ``budget``
argument - drivers can process completions for any number of Tx
packets but should only process up to ``budget`` number of
Rx packets. Rx processing is usually much more expensive.

In other words for Rx processing the ``budget`` argument limits how many
packets the driver can process in a single poll. Rx specific APIs like page
pool or XDP cannot be used at all when ``budget`` is 0.
skb Tx processing should happen regardless of the ``budget``, but if
the argument is 0 the driver cannot call any XDP (or page pool) APIs.

.. warning::

  The ``budget`` argument may be 0 if core tries to only process
  skb Tx completions and no Rx or XDP packets.

The poll method returns the amount of work done. If the driver still
has outstanding work to do (e.g. ``budget`` was exhausted)
the poll method should return exactly ``budget``. In that case,
the NAPI instance will be serviced/polled again (without the
need to be scheduled).

If event processing has been completed (all outstanding packets
processed) the poll method should call napi_complete_done()
before returning. napi_complete_done() releases the ownership
of the instance.

.. warning::

  The case of finishing all events and using exactly ``budget``
  must be handled carefully. There is no way to report this
  (rare) condition to the stack, so the driver must either
  not call napi_complete_done() and wait to be called again,
  or return ``budget - 1``.

  If the ``budget`` is 0 napi_complete_done() should never be called.
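
Putting these rules together, a poll method often has roughly the
following shape (a sketch only - the ``mydrv_*`` helpers and the
vector structure are hypothetical):

.. code-block:: c

  static int mydrv_poll(struct napi_struct *napi, int budget)
  {
      struct mydrv_vector *v = container_of(napi, struct mydrv_vector,
                                            napi);
      int work_done = 0;

      /* skb Tx completions are processed regardless of the budget... */
      mydrv_clean_tx_skbs(v);

      /* ...but Rx (and any XDP or page pool work) only if budget > 0 */
      if (budget)
          work_done = mydrv_rx_poll(v, budget);

      /* Budget exhausted (or 0) - return without completing, NAPI
       * will be polled again. Also covers the rare case of finishing
       * all events with exactly ``budget`` packets.
       */
      if (work_done >= budget)
          return budget;

      /* All events processed - release ownership of the instance */
      if (napi_complete_done(napi, work_done))
          mydrv_unmask_rxtx_irq(v->idx);

      return work_done;
  }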

Call sequence
-------------

Drivers should not make assumptions about the exact sequencing
of calls. The poll method may be called without the driver scheduling
the instance (unless the instance is disabled). Similarly,
it's not guaranteed that the poll method will be called, even
if napi_schedule() succeeded (e.g. if the instance gets disabled).

As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
calls to the poll method only wait for the ownership of the instance
to be released, not for the poll method to exit. This means that
drivers should avoid accessing any data structures after calling
napi_complete_done().

.. _drv_sched:

Scheduling and IRQ masking
--------------------------

Drivers should keep the interrupts masked after scheduling
the NAPI instance - until NAPI polling finishes any further
interrupts are unnecessary.

Drivers which have to mask the interrupts explicitly (as opposed
to the IRQ being auto-masked by the device) should use the
napi_schedule_prep() and __napi_schedule() calls:

.. code-block:: c

  if (napi_schedule_prep(&v->napi)) {
      mydrv_mask_rxtx_irq(v->idx);
      /* schedule after masking to avoid races */
      __napi_schedule(&v->napi);
  }

The IRQ should only be unmasked after a successful call to napi_complete_done():

.. code-block:: c

  if (budget && napi_complete_done(&v->napi, work_done)) {
      mydrv_unmask_rxtx_irq(v->idx);
      return min(work_done, budget - 1);
  }

napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
of guarantees given by being invoked in IRQ context (no need to
mask interrupts). Note that PREEMPT_RT forces all interrupts
to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
to avoid issues on real-time kernel configurations.

Instance to queue mapping
-------------------------

Modern devices have multiple NAPI instances (struct napi_struct) per
interface. There is no strong requirement on how the instances are
mapped to queues and interrupts. NAPI is primarily a polling/processing
abstraction without specific user-facing semantics. That said, most networking
devices end up using NAPI in fairly similar ways.

NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
(queue pair is a set of a single Rx and single Tx queue).

In less common cases a NAPI instance may be used for multiple queues
or Rx and Tx queues can be serviced by separate NAPI instances on a single
core. Regardless of the queue assignment, however, there is usually still
a 1:1 mapping between NAPI instances and interrupts.

It's worth noting that the ethtool API uses a "channel" terminology where
each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
what constitutes a channel; the recommended interpretation is to understand
a channel as an IRQ/NAPI which services queues of a given type. For example,
a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
to utilize 3 interrupts, 2 Rx and 2 Tx queues.

User API
========

User interactions with NAPI depend on NAPI instance ID. The instance IDs
are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.
It's not currently possible to query IDs used by a given device.
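
For example, an application can read the NAPI ID recorded on a socket
with getsockopt() (a sketch; the reported ID is 0 if no packet has been
received on the socket yet, and older libc headers may not define the
constant):

.. code-block:: c

  #include <stdio.h>
  #include <sys/socket.h>

  #ifndef SO_INCOMING_NAPI_ID
  #define SO_INCOMING_NAPI_ID 56
  #endif

  static void print_napi_id(int fd)
  {
      unsigned int napi_id = 0;
      socklen_t len = sizeof(napi_id);

      /* ID of the NAPI instance which delivered the last packet */
      if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                     &napi_id, &len) == 0)
          printf("NAPI ID: %u\n", napi_id);
  }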

Software IRQ coalescing
-----------------------

NAPI does not perform any explicit event coalescing by default.
In most scenarios batching happens due to IRQ coalescing which is done
by the device. There are cases where software coalescing is helpful.

NAPI can be configured to arm a repoll timer instead of unmasking
the hardware interrupts as soon as all packets are processed.
The ``gro_flush_timeout`` sysfs configuration of the netdevice
is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.

.. _poll:

Busy polling
------------

Busy polling allows a user process to check for incoming packets before
the device interrupt fires. As is the case with any busy polling it trades
off CPU cycles for lower latency (production uses of NAPI busy polling
are not well known).

Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists.

IRQ mitigation
--------------

While busy polling is supposed to be used by low latency applications,
a similar mechanism can be used for IRQ mitigation.

Very high request-per-second applications (especially routing/forwarding
applications and especially applications using AF_XDP sockets) may not
want to be interrupted until they finish processing a request or a batch
of packets.

Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
if ``gro_flush_timeout`` passes without any busy poll call.

The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
with the ``SO_BUSY_POLL_BUDGET`` socket option.
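
A user space sketch of opting into this mode (the budget value is
illustrative, and older libc headers may not define the constants):

.. code-block:: c

  #include <sys/socket.h>

  #ifndef SO_PREFER_BUSY_POLL
  #define SO_PREFER_BUSY_POLL 69
  #endif
  #ifndef SO_BUSY_POLL_BUDGET
  #define SO_BUSY_POLL_BUDGET 70
  #endif

  static int enable_irq_mitigation(int fd)
  {
      int on = 1;
      int budget = 64;	/* illustrative value */

      /* pledge to busy poll periodically, keep device IRQs masked */
      if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                     &on, sizeof(on)))
          return -1;
      /* raise the busy poll budget above the low default */
      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                        &budget, sizeof(budget));
  }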

.. _threaded:

Threaded NAPI
-------------

Threaded NAPI is an operating mode that uses dedicated kernel
threads rather than software IRQ context for NAPI processing.
The configuration is per netdevice and will affect all
NAPI instances of that device. Each NAPI instance will spawn a separate
thread (called ``napi/${ifc-name}-${napi-id}``).

It is recommended to pin each kernel thread to a single CPU, the same
CPU as the one which services the interrupt. Note that the mapping
between IRQs and NAPI instances may not be trivial (and is driver
dependent). The NAPI instance IDs will be assigned in the opposite
order than the process IDs of the kernel threads.

Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
netdev's sysfs directory.

.. rubric:: Footnotes

.. [#] NAPI was originally referred to as New API in 2.4 Linux.