Commit | Line | Data |
---|---|---|
f42c104f JK |
1 | .. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) |
2 | ||
3 | ================== | |
4 | Kernel TLS offload | |
5 | ================== | |
6 | ||
7 | Kernel TLS operation | |
8 | ==================== | |
9 | ||
10 | Linux kernel provides TLS connection offload infrastructure. Once a TCP | |
11 | connection is in ``ESTABLISHED`` state user space can enable the TLS Upper | |
12 | Layer Protocol (ULP) and install the cryptographic connection state. | |
13 | For details regarding the user-facing interface refer to the TLS | |
14 | documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`. | |
15 | ||
16 | ``ktls`` can operate in three modes: | |
17 | ||
18 | * Software crypto mode (``TLS_SW``) - CPU handles the cryptography. | |
19 | In the most basic cases only crypto operations synchronous with the CPU | |
20 | can be used, but depending on calling context the CPU may utilize | |
21 | asynchronous crypto accelerators. The use of accelerators introduces extra | |
22 | latency on socket reads (decryption only starts when a read syscall | |
23 | is made) and additional I/O load on the system. | |
24 | * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto | |
25 | on a packet by packet basis, provided the packets arrive in order. | |
26 | This mode integrates best with the kernel stack and is described in detail | |
27 | in the remaining part of this document | |
28 | (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``). | |
29 | * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where | |
30 | the NIC driver and firmware replace the kernel networking stack | |
31 | with their own TCP handling. It is not usable in production environments | |
32 | that rely on the Linux networking stack, for example for firewalling, | |
33 | QoS or packet scheduling (``ethtool`` flag ``tls-hw-record``). | |
34 | ||
35 | The operation mode is selected automatically based on device configuration; | |
36 | offload opt-in or opt-out on a per-connection basis is not currently supported. | |
37 | ||
38 | TX | |
39 | -- | |
40 | ||
41 | At a high level user write requests are turned into a scatter list, the TLS ULP | |
42 | intercepts them, inserts record framing, performs encryption (in ``TLS_SW`` | |
43 | mode) and then hands the modified scatter list to the TCP layer. From this | |
44 | point on the TCP stack proceeds as normal. | |
45 | ||
46 | In ``TLS_HW`` mode the encryption is not performed in the TLS ULP. | |
47 | Instead packets reach a device driver, the driver will mark the packets | |
48 | for crypto offload based on the socket the packet is attached to, | |
49 | and send them to the device for encryption and transmission. | |
50 | ||
51 | RX | |
52 | -- | |
53 | ||
54 | On the receive side if the device handled decryption and authentication | |
55 | successfully, the driver will set the decrypted bit in the associated | |
56 | :c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and | |
57 | are handled normally. ``ktls`` is informed when data is queued to the socket | |
58 | and the ``strparser`` mechanism is used to delineate the records. Upon read | |
59 | request, records are retrieved from the socket and passed to the decryption routine. | |
60 | If the device decrypted all the segments of the record the decryption is skipped, | |
61 | otherwise software path handles decryption. | |
62 | ||
63 | .. kernel-figure:: tls-offload-layers.svg | |
64 | :alt: TLS offload layers | |
65 | :align: center | |
66 | :figwidth: 28em | |
67 | ||
68 | Layers of Kernel TLS stack | |
69 | ||
70 | Device configuration | |
71 | ==================== | |
72 | ||
73 | During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and | |
74 | ``NETIF_F_HW_TLS_TX`` features and installs its | |
75 | :c:type:`struct tlsdev_ops <tlsdev_ops>` | |
76 | pointer in the :c:member:`tlsdev_ops` member of the | |
77 | :c:type:`struct net_device <net_device>`. | |
78 | ||
79 | When TLS cryptographic connection state is installed on a ``ktls`` socket | |
80 | (note that it is done twice, once for RX and once for TX direction, | |
81 | and the two are completely independent), the kernel checks if the underlying | |
82 | network device is offload-capable and attempts the offload. In case offload | |
83 | fails the connection is handled entirely in software using the same mechanism | |
84 | as if the offload was never tried. | |
85 | ||
86 | Offload request is performed via the :c:member:`tls_dev_add` callback of | |
87 | :c:type:`struct tlsdev_ops <tlsdev_ops>`: | |
88 | ||
89 | .. code-block:: c | |
90 | ||
91 | int (*tls_dev_add)(struct net_device *netdev, struct sock *sk, | |
92 | enum tls_offload_ctx_dir direction, | |
93 | struct tls_crypto_info *crypto_info, | |
94 | u32 start_offload_tcp_sn); | |
95 | ||
96 | ``direction`` indicates whether the cryptographic information is for | |
97 | the received or transmitted packets. Driver uses the ``sk`` parameter | |
98 | to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6). | |
99 | Cryptographic information in ``crypto_info`` includes the key, iv, salt | |
100 | as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates | |
101 | which TCP sequence number corresponds to the beginning of the record with | |
102 | sequence number from ``crypto_info``. The driver can add its state | |
103 | at the end of kernel structures (see :c:member:`driver_state` members | |
104 | in ``include/net/tls.h``) to avoid additional allocations and pointer | |
105 | dereferences. | |
106 | ||
107 | TX | |
108 | -- | |
109 | ||
110 | After TX state is installed, the stack guarantees that the first segment | |
111 | of the stream will start exactly at the ``start_offload_tcp_sn`` sequence | |
112 | number, simplifying TCP sequence number matching. | |
113 | ||
114 | TX offload being fully initialized does not imply that all segments passing | |
115 | through the driver and which belong to the offloaded socket will be after | |
116 | the expected sequence number and will have kernel record information. | |
117 | In particular, already encrypted data may have been queued to the socket | |
118 | before installing the connection state in the kernel. | |
119 | ||
120 | RX | |
121 | -- | |
122 | ||
123 | In the RX direction the local networking stack has little control over segmentation, | |
124 | so the initial record's TCP sequence number may be anywhere inside the segment. | |
125 | ||
126 | Normal operation | |
127 | ================ | |
128 | ||
129 | At a minimum the device maintains the following state for each connection, in | |
130 | each direction: | |
131 | ||
132 | * crypto secrets (key, iv, salt) | |
133 | * crypto processing state (partial blocks, partial authentication tag, etc.) | |
134 | * record metadata (sequence number, processing offset and length) | |
135 | * expected TCP sequence number | |
136 | ||
137 | There are no guarantees on record length or record segmentation. In particular | |
138 | segments may start at any point of a record and contain any number of records. | |
139 | Assuming segments are received in order, the device should be able to perform | |
140 | crypto operations and authentication regardless of segmentation. For this | |
141 | to be possible the device has to keep a small amount of segment-to-segment state. | |
142 | This includes at least: | |
143 | ||
144 | * partial headers (if a segment carried only a part of the TLS header) | |
145 | * partial data block | |
146 | * partial authentication tag (all data had been seen but part of the | |
147 | authentication tag has to be written or read from the subsequent segment) | |
148 | ||
149 | Record reassembly is not necessary for TLS offload. If the packets arrive | |
150 | in order the device should be able to handle them separately and make | |
151 | forward progress. | |
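
The per-direction state listed above can be pictured with a small user-space
model. This is only an illustrative sketch: ``struct conn_state``, its field
sizes and ``conn_advance()`` are hypothetical names chosen for this example,
not kernel or driver interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the minimum per-connection, per-direction state. */
struct conn_state {
	uint8_t  key[32];            /* crypto secrets (sizes are assumptions) */
	uint8_t  iv[8];
	uint8_t  salt[4];
	uint64_t record_seq;         /* TLS record sequence number */
	uint32_t record_off;         /* processing offset inside the record */
	uint32_t record_len;         /* record length taken from the header */
	uint32_t expected_tcp_seq;   /* next in-order TCP sequence number */
};

/*
 * Accept a segment only if it starts at the expected sequence number.
 * Unsigned 32-bit arithmetic keeps the update correct across wrap-around.
 */
static bool conn_advance(struct conn_state *st, uint32_t seg_seq,
			 uint32_t seg_len)
{
	if (seg_seq != st->expected_tcp_seq)
		return false;        /* out of order: leave state untouched */
	st->expected_tcp_seq += seg_len;
	return true;
}
```

A segment rejected by ``conn_advance()`` in this model would be the point at
which a real device passes the packet through untouched (RX) or needs extra
context from the driver (TX).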
152 | ||
153 | TX | |
154 | -- | |
155 | ||
156 | The kernel stack performs record framing reserving space for the authentication | |
157 | tag and populating all other TLS header and trailer fields. | |
158 | ||
159 | Both the device and the driver maintain expected TCP sequence numbers | |
160 | due to the possibility of retransmissions and the lack of software fallback | |
161 | once the packet reaches the device. | |
162 | For segments passed in order, the driver marks the packets with | |
163 | a connection identifier (note that a 5-tuple lookup is insufficient to identify | |
164 | packets requiring HW offload, see the :ref:`5tuple_problems` section) | |
165 | and hands them to the device. The device identifies the packet as requiring | |
166 | TLS handling and confirms the sequence number matches its expectation. | |
167 | The device performs encryption and authentication of the record data. | |
168 | It replaces the authentication tag and TCP checksum with correct values. | |
169 | ||
170 | RX | |
171 | -- | |
172 | ||
173 | Before a packet is DMAed to the host (but after NIC's embedded switching | |
174 | and packet transformation functions) the device validates the Layer 4 | |
175 | checksum and performs a 5-tuple lookup to find any TLS connection the packet | |
176 | may belong to (technically a 4-tuple | |
177 | lookup is sufficient - IP addresses and TCP port numbers, as the protocol | |
178 | is always TCP). If a connection is matched the device checks whether the TCP sequence | |
179 | number is the expected one and proceeds to TLS handling (record delineation, | |
180 | decryption, authentication for each record in the packet). The device leaves | |
181 | the record framing unmodified; the stack takes care of record decapsulation. | |
182 | The device indicates successful handling of TLS offload in the per-packet context | |
183 | (descriptor) passed to the host. | |
184 | ||
185 | Upon reception of a TLS offloaded packet, the driver sets | |
186 | the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>` | |
187 | corresponding to the segment. The networking stack makes sure decrypted | |
188 | and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer) | |
189 | and takes care of partial decryption. | |
190 | ||
191 | Resync handling | |
192 | =============== | |
193 | ||
194 | In the presence of packet drops or network packet reordering, the device may lose | |
195 | synchronization with the TLS stream, and require a resync with the kernel's | |
196 | TCP stack. | |
197 | ||
198 | Note that resync is only attempted for connections which were successfully | |
199 | added to the device table and are in ``TLS_HW`` mode. For example, | |
200 | if the table was full when cryptographic state was installed in the kernel, | |
201 | such connection will never get offloaded. Therefore the resync request | |
202 | does not carry any cryptographic connection state. | |
203 | ||
204 | TX | |
205 | -- | |
206 | ||
207 | Segments transmitted from an offloaded socket can get out of sync | |
208 | in similar ways to the receive side: retransmissions and local drops | |
209 | are possible, though network reorders are not. | |
210 | ||
211 | Whenever an out of order segment is transmitted the driver provides | |
212 | the device with enough information to perform cryptographic operations. | |
213 | This means most likely that the part of the record preceding the current | |
214 | segment has to be passed to the device as part of the packet context, | |
215 | together with its TCP sequence number and TLS record number. The device | |
216 | can then initialize its crypto state, process and discard the preceding | |
217 | data (to be able to insert the authentication tag) and move onto handling | |
218 | the actual packet. | |
219 | ||
220 | In this mode depending on the implementation the driver can either ask | |
221 | for a continuation with the crypto state and the new sequence number | |
222 | (next expected segment is the one after the out of order one), or continue | |
223 | with the previous stream state - assuming that the out of order segment | |
224 | was just a retransmission. The former is simpler, and does not require | |
225 | retransmission detection therefore it is the recommended method until | |
226 | such time as it is proven inefficient. | |
227 | ||
228 | RX | |
229 | -- | |
230 | ||
231 | A small amount of RX reorder events may not require a full resynchronization. | |
232 | In particular the device should not lose synchronization | |
233 | when record boundary can be recovered: | |
234 | ||
235 | .. kernel-figure:: tls-offload-reorder-good.svg | |
236 | :alt: reorder of non-header segment | |
237 | :align: center | |
238 | ||
239 | Reorder of non-header segment | |
240 | ||
241 | Green segments are successfully decrypted, blue ones are passed | |
242 | as received on wire, red stripes mark start of new records. | |
243 | ||
244 | In the above case segment 1 is received and decrypted successfully. | |
245 | Segment 2 was dropped so 3 arrives out of order. The device knows | |
246 | the next record starts inside 3, based on record length in segment 1. | |
247 | Segment 3 is passed untouched, because due to lack of data from segment 2 | |
248 | the remainder of the previous record inside segment 3 cannot be handled. | |
249 | The device can, however, collect the authentication algorithm's state | |
250 | and partial block from the new record in segment 3 and when 4 and 5 | |
251 | arrive continue decryption. Finally when 2 arrives it's completely outside | |
252 | of expected window of the device so it's passed as is without special | |
253 | handling. ``ktls`` software fallback handles the decryption of record | |
254 | spanning segments 1, 2 and 3. The device did not get out of sync, | |
255 | even though two segments did not get decrypted. | |
256 | ||
257 | Kernel synchronization may be necessary if the lost segment contained | |
258 | a record header and arrived after the next record header has already passed: | |
259 | ||
260 | .. kernel-figure:: tls-offload-reorder-bad.svg | |
261 | :alt: reorder of header segment | |
262 | :align: center | |
263 | ||
264 | Reorder of segment with a TLS header | |
265 | ||
266 | In this example segment 2 gets dropped, and it contains a record header. | |
267 | The device can only detect that segment 4 also contains a TLS header | |
268 | if it knows the length of the previous record from segment 2. In this case | |
269 | the device will lose synchronization with the stream. | |
270 | ||
271 | When the device gets out of sync and the stream reaches TCP sequence | |
272 | numbers more than a max size record past the expected TCP sequence number, | |
273 | the device starts scanning for a known header pattern. For example | |
274 | for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur | |
275 | in the SSL/TLS version field of the header. Once the pattern is matched | |
276 | the device continues attempting to parse headers at expected locations | |
277 | (based on the length fields at guessed locations). | |
278 | Whenever the expected location does not contain a valid header the scan | |
279 | is restarted. | |
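
The scan described above can be sketched in plain C. The sketch assumes the
standard 5-byte TLS record header layout (1-byte content type, 2-byte version,
2-byte big-endian length) and checks only the ``0x03 0x03`` version bytes
mentioned in the text. The function names are hypothetical, and a real device
would work on a byte stream spanning segments rather than on one flat buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define TLS_HDR_LEN 5   /* type (1) + version (2) + length (2) */

/* A header candidate is plausible if its version field is 0x03 0x03. */
static int hdr_plausible(const uint8_t *p)
{
	return p[1] == 0x03 && p[2] == 0x03;
}

/* Return the offset of the first plausible header at or after 'from',
 * or -1 if the buffer holds none. */
static long scan_for_header(const uint8_t *buf, size_t len, size_t from)
{
	for (size_t off = from; off + TLS_HDR_LEN <= len; off++)
		if (hdr_plausible(buf + off))
			return (long)off;
	return -1;
}

/* Follow the length fields from a guessed header and count how many
 * consecutive plausible headers are found at the expected locations. */
static size_t follow_headers(const uint8_t *buf, size_t len, size_t off)
{
	size_t n = 0;

	while (off + TLS_HDR_LEN <= len && hdr_plausible(buf + off)) {
		size_t rec_len = ((size_t)buf[off + 3] << 8) | buf[off + 4];

		n++;
		off += TLS_HDR_LEN + rec_len;
	}
	return n;
}
```

In this model a caller would restart ``scan_for_header()`` one byte past the
guessed offset whenever ``follow_headers()`` stops before the end of the
available data.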
280 | ||
281 | When the header is matched the device sends a confirmation request | |
282 | to the kernel, asking if the guessed location is correct (if a TLS record | |
283 | really starts there), and which record sequence number the given header had. | |
284 | The kernel confirms the guessed location was correct and tells the device | |
285 | the record sequence number. Meanwhile, the device has been parsing | |
286 | and counting all records since the just-confirmed one; it adds the number | |
287 | of records it has seen to the record number provided by the kernel. | |
288 | At this point the device is in sync and can resume decryption at the next | |
289 | segment boundary. | |
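
The bookkeeping for the confirmation can be modelled as below. This is a
hedged sketch only: ``struct resync_req`` and ``resync_confirm()`` are
hypothetical names, and the model assumes a single outstanding request that is
identified by the TCP sequence number of the guessed header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side bookkeeping for one outstanding request. */
struct resync_req {
	bool     pending;        /* a confirmation request is in flight */
	uint32_t hdr_tcp_seq;    /* TCP sequence number of the guessed header */
	uint64_t records_since;  /* record headers seen after the guessed one */
};

/*
 * Apply the kernel's answer. On a match, store the record number of the
 * most recent header the device has parsed (kernel-provided number plus
 * the records counted since) and return true.
 */
static bool resync_confirm(struct resync_req *req, uint32_t tcp_seq,
			   uint64_t rec_seq, uint64_t *cur_rec_seq)
{
	if (!req->pending || tcp_seq != req->hdr_tcp_seq)
		return false;    /* stale or unrelated answer: ignore it */
	*cur_rec_seq = rec_seq + req->records_since;
	req->pending = false;
	return true;
}
```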
290 | ||
291 | In a pathological case the device may latch onto a sequence of matching | |
292 | headers and never hear back from the kernel (there is no negative | |
293 | confirmation from the kernel). The implementation may choose to periodically | |
294 | restart the scan. Given how unlikely a falsely-matching stream is, however, | |
295 | periodic restart is not deemed necessary. | |
296 | ||
297 | Special care has to be taken if the confirmation request is passed | |
298 | asynchronously to the packet stream, because the record may then get processed | |
299 | by the kernel before the confirmation request. | |
300 | ||
301 | Error handling | |
302 | ============== | |
303 | ||
304 | TX | |
305 | -- | |
306 | ||
307 | Packets may be redirected or rerouted by the stack to a different | |
308 | device than the selected TLS offload device. The stack will handle | |
309 | such a condition using the :c:func:`sk_validate_xmit_skb` helper | |
310 | (TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook). | |
311 | Offload maintains information about all records until the data is | |
312 | fully acknowledged, so if skbs reach the wrong device they can be handled | |
313 | by software fallback. | |
314 | ||
315 | Any device TLS offload handling error on the transmission side must result | |
316 | in the packet being dropped. For example if a packet got out of order | |
317 | due to a bug in the stack or the device, reached the device and cannot | |
318 | be encrypted, such a packet must be dropped. | |
319 | ||
320 | RX | |
321 | -- | |
322 | ||
323 | If the device encounters any problems with TLS offload on the receive | |
324 | side it should pass the packet to the host's networking stack as it was | |
325 | received on the wire. | |
326 | ||
327 | For example authentication failure for any record in the segment should | |
328 | result in passing the unmodified packet to the software fallback. This means | |
329 | packets should not be modified "in place". Splitting segments to handle partial | |
330 | decryption is not advised. In other words either all records in the packet | |
331 | have been handled successfully and authenticated or the packet has to be passed | |
332 | to the host's stack as it was on the wire (recovering the original packet in the | |
333 | driver is sufficient if the device provides a precise error). | |
334 | ||
335 | The Linux networking stack does not provide a way of reporting per-packet | |
336 | decryption and authentication errors; packets with errors must simply not | |
337 | have the :c:member:`decrypted` mark set. | |
338 | ||
339 | A packet should also not be handled by the TLS offload if it contains | |
340 | incorrect checksums. | |
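
The all-or-nothing rule for the receive side can be expressed as a single
predicate. The sketch below is a simplified model: ``struct rx_result`` stands
in for whatever per-packet status a device reports in its descriptor, and its
fields and the function name are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-packet RX status reported by the device. */
struct rx_result {
	bool         checksum_ok;    /* packet checksums validated */
	unsigned int records_total;  /* records touched by this packet */
	unsigned int records_ok;     /* of those, handled without error */
};

/*
 * Decide whether the driver may set the decrypted mark. Any checksum or
 * TLS handling error means the packet is passed up as received on the wire.
 */
static bool rx_may_mark_decrypted(const struct rx_result *r)
{
	return r->checksum_ok && r->records_total > 0 &&
	       r->records_ok == r->records_total;
}
```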
341 | ||
342 | Performance metrics | |
343 | =================== | |
344 | ||
345 | TLS offload can be characterized by the following basic metrics: | |
346 | ||
347 | * max connection count | |
348 | * connection installation rate | |
349 | * connection installation latency | |
350 | * total cryptographic performance | |
351 | ||
352 | Note that each TCP connection requires a TLS session in both directions, | |
353 | so the performance may be reported treating each direction separately. | |
354 | ||
355 | Max connection count | |
356 | -------------------- | |
357 | ||
358 | The number of connections a device can support can be exposed via the | |
359 | ``devlink resource`` API. | |
360 | ||
361 | Total cryptographic performance | |
362 | ------------------------------- | |
363 | ||
364 | Offload performance may depend on segment and record size. | |
365 | ||
366 | Overload of the cryptographic subsystem of the device should not have | |
367 | significant performance impact on non-offloaded streams. | |
368 | ||
369 | Statistics | |
370 | ========== | |
371 | ||
372 | The following minimum set of TLS-related statistics should be reported | |
373 | by the driver: | |
374 | ||
375 | * ``rx_tls_decrypted`` - number of successfully decrypted TLS segments | |
376 | * ``tx_tls_encrypted`` - number of in-order TLS segments passed to device | |
377 | for encryption | |
378 | * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream | |
379 | but did not arrive in the expected order | |
380 | * ``tx_tls_drop_no_sync_data`` - number of TX packets dropped because | |
381 | they arrived out of order and associated record could not be found | |
382 | |
383 | Notable corner cases, exceptions and additional requirements | |
384 | ============================================================ | |
385 | ||
386 | .. _5tuple_problems: | |
387 | ||
388 | 5-tuple matching limitations | |
389 | ---------------------------- | |
390 | ||
391 | The device can only recognize received packets based on the 5-tuple | |
392 | of the socket. Current ``ktls`` implementation will not offload sockets | |
393 | routed through software interfaces such as those used for tunneling | |
394 | or virtual networking. However, many packet transformations performed | |
395 | by the networking stack (most notably any BPF logic) do not require | |
396 | any intermediate software device, therefore a 5-tuple match may | |
397 | consistently miss at the device level. In such cases the device | |
398 | should still be able to perform TX offload (encryption) and should | |
399 | fall back cleanly to software decryption (RX). | |
400 | ||
401 | Out of order | |
402 | ------------ | |
403 | ||
404 | Introducing extra processing in NICs should not cause packets to be | |
405 | transmitted or received out of order, for example pure ACK packets | |
406 | should not be reordered with respect to data segments. | |
407 | ||
408 | Ingress reorder | |
409 | --------------- | |
410 | ||
411 | A device is permitted to perform packet reordering for consecutive | |
412 | TCP segments (i.e. placing packets in the correct order) but any form | |
413 | of additional buffering is disallowed. | |
414 | ||
415 | Coexistence with standard networking offload features | |
416 | ----------------------------------------------------- | |
417 | ||
418 | Offloaded ``ktls`` sockets should support standard TCP stack features | |
419 | transparently. Enabling device TLS offload should not cause any difference | |
420 | in packets as seen on the wire. | |
421 | ||
422 | Transport layer transparency | |
423 | ---------------------------- | |
424 | ||
425 | The device should not modify any packet headers for the purpose | |
426 | of simplifying TLS offload. | |
427 | ||
428 | The device should not depend on any packet headers beyond what is strictly | |
429 | necessary for TLS offload. | |
430 | ||
431 | Segment drops | |
432 | ------------- | |
433 | ||
434 | Dropping packets is acceptable only in the event of catastrophic | |
435 | system errors and should never be used as an error handling mechanism | |
436 | in cases arising from normal operation. In other words, reliance | |
437 | on TCP retransmissions to handle corner cases is not acceptable. | |
438 | ||
439 | TLS device features | |
440 | ------------------- | |
441 | ||
442 | Drivers should ignore changes to the TLS device feature flags. | |
443 | These flags will be acted upon accordingly by the core ``ktls`` code. | |
444 | TLS device feature flags only control adding of new TLS connection | |
445 | offloads; old connections will remain active after flags are cleared. | |
446 | ||
447 | Known bugs | |
448 | ========== | |
449 | ||
450 | skb_orphan() leaks clear text | |
451 | ----------------------------- | |
452 | ||
453 | Currently drivers depend on the :c:member:`sk` member of | |
454 | :c:type:`struct sk_buff <sk_buff>` to identify segments requiring | |
455 | encryption. Any operation which removes or does not preserve the socket | |
456 | association such as :c:func:`skb_orphan` or :c:func:`skb_clone` | |
457 | will cause the driver to miss the packets and lead to clear text leaks. | |
458 | ||
459 | Redirects leak clear text | |
460 | ------------------------- | |
461 | ||
462 | In the RX direction, if a segment has already been decrypted by the device | |
463 | and it gets redirected or mirrored, clear text will be transmitted out. |