.. _kernel_tls:

==========
Kernel TLS
==========

Overview
========

Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs over
TCP. TLS provides end-to-end data integrity and confidentiality.

User interface
==============

Creating a TLS connection
-------------------------

First create a new TCP socket and set the TLS ULP.

.. code-block:: c

  sock = socket(AF_INET, SOCK_STREAM, 0);
  setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

Setting the TLS ULP allows us to set/get TLS socket options. Currently
only the symmetric encryption is handled in the kernel. After the TLS
handshake is complete, we have all the parameters required to move the
data-path to the kernel. Separate socket options move the transmit path
and the receive path into the kernel.

.. code-block:: c

  /* From linux/tls.h */
  struct tls_crypto_info {
          unsigned short version;
          unsigned short cipher_type;
  };

  struct tls12_crypto_info_aes_gcm_128 {
          struct tls_crypto_info info;
          unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
          unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
          unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
          unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
  };


  struct tls12_crypto_info_aes_gcm_128 crypto_info;

  crypto_info.info.version = TLS_1_2_VERSION;
  crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
  memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE);
  memcpy(crypto_info.rec_seq, seq_number_write,
         TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
  memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
  memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE);

  setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));

Transmit and receive are set separately, but the setup is the same, using either
TLS_TX or TLS_RX.

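Since the same setup applies to both directions, it can be factored into two
small helpers. The following is only a sketch: ``ktls_fill_aes_gcm_128()`` and
``ktls_enable()`` are hypothetical names that do not exist in any header, the
``SOL_TLS`` fallback value is an assumption for C libraries that do not define
it, and the key material is assumed to come from the userspace handshake.

.. code-block:: c

  #include <errno.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* assumed fallback for older C libraries */
  #endif

  /* Hypothetical helper: fill the parameters for one direction. */
  static void ktls_fill_aes_gcm_128(struct tls12_crypto_info_aes_gcm_128 *ci,
                                    const unsigned char *key,
                                    const unsigned char *salt,
                                    const unsigned char *iv,
                                    const unsigned char *rec_seq)
  {
          memset(ci, 0, sizeof(*ci));
          ci->info.version = TLS_1_2_VERSION;
          ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
          memcpy(ci->key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
          memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
          memcpy(ci->iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
          memcpy(ci->rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
  }

  /* Hypothetical helper: direction is TLS_TX or TLS_RX.
   * Returns 0 on success or a negative errno value. */
  static int ktls_enable(int sock, int direction,
                         const struct tls12_crypto_info_aes_gcm_128 *ci)
  {
          if (setsockopt(sock, SOL_TLS, direction, ci, sizeof(*ci)) < 0)
                  return -errno;
          return 0;
  }

A caller would fill one structure from the write keys and pass it with TLS_TX,
then fill another from the read keys and pass it with TLS_RX.
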
Sending TLS application data
----------------------------

After setting the TLS_TX socket option all application data sent over this
socket is encrypted using TLS and the parameters provided in the socket option.
For example, we can send an encrypted hello world record as follows:

.. code-block:: c

  const char *msg = "hello world\n";
  send(sock, msg, strlen(msg), 0);

send() data is encrypted directly from the userspace buffer provided
into the encrypted kernel send buffer if possible.

The sendfile system call will send the file's data in TLS records of maximum
length (2^14 bytes).

.. code-block:: c

  struct stat stat;
  off_t offset = 0;

  file = open(filename, O_RDONLY);
  fstat(file, &stat);
  sendfile(sock, file, &offset, stat.st_size);

TLS records are created and sent after each send() call, unless
MSG_MORE is passed. MSG_MORE will delay creation of a record until
MSG_MORE is not passed, or the maximum record size is reached.

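To make the record size limit concrete, the arithmetic below estimates how many
records a payload is split into. It is a sketch that assumes every record
except the last is filled to the 2^14 byte maximum, as described above for
sendfile. The macro and function names are hypothetical.

.. code-block:: c

  #include <stddef.h>

  #define TLS_MAX_PLAINTEXT_LEN 16384 /* 2^14 bytes, hypothetical name */

  /* Hypothetical helper: number of records needed for len bytes. */
  static size_t tls_record_count(size_t len)
  {
          return len / TLS_MAX_PLAINTEXT_LEN +
                 (len % TLS_MAX_PLAINTEXT_LEN != 0);
  }

  /* Hypothetical helper: plaintext length of the final record. */
  static size_t tls_last_record_len(size_t len)
  {
          size_t rem = len % TLS_MAX_PLAINTEXT_LEN;

          if (len == 0)
                  return 0;
          return rem ? rem : TLS_MAX_PLAINTEXT_LEN;
  }

For a 40000 byte file this gives three records, the last one carrying 7232
bytes of plaintext.
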
The kernel will need to allocate a buffer for the encrypted data.
This buffer is allocated at the time send() is called, such that
either the entire send() call will return -ENOMEM (or block waiting
for memory), or the encryption will always succeed. If send() returns
-ENOMEM and some data was left on the socket buffer from a previous
call using MSG_MORE, the MSG_MORE data is left on the socket buffer.

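The MSG_MORE behaviour can be used to build one record from two buffers, for
example a protocol header followed by a body. The sketch below assumes a
blocking socket; ``send_two_parts()`` is a hypothetical helper, not an existing
API.

.. code-block:: c

  #include <sys/types.h>
  #include <sys/socket.h>

  /* Hypothetical helper: queue hdr and body back to back.
   * Returns the total number of bytes accepted, or -1 with errno set. */
  static ssize_t send_two_parts(int sock, const void *hdr, size_t hdr_len,
                                const void *body, size_t body_len)
  {
          ssize_t a, b;

          /* MSG_MORE: more data follows, do not close the record yet. */
          a = send(sock, hdr, hdr_len, MSG_MORE);
          if (a < 0)
                  return -1;

          /* No MSG_MORE: the pending data may now be sealed and sent. */
          b = send(sock, body, body_len, 0);
          if (b < 0)
                  return -1;

          return a + b;
  }

If the second send() fails, the header stays queued on the socket, as
described above, so the caller has to decide whether to retry or close.
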
Receiving TLS application data
------------------------------

After setting the TLS_RX socket option, data returned by all recv family
socket calls is decrypted using the TLS parameters provided. A full TLS
record must be received before decryption can happen.

.. code-block:: c

  char buffer[16384];
  recv(sock, buffer, sizeof(buffer), 0);

Received data is decrypted directly into the user buffer if it is
large enough, and no additional allocations occur. If the userspace
buffer is too small, data is decrypted in the kernel and copied to
userspace.

``EINVAL`` is returned if the TLS version in the received message does not
match the version passed in setsockopt.

``EMSGSIZE`` is returned if the received message is too big.

``EBADMSG`` is returned if decryption failed for any other reason.

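One way an application might act on these error codes is to sort ``errno``
into a few classes after a failed recv(). This is a sketch of one possible
policy, not something the kernel mandates. The enum and function names are
hypothetical, and ``EINVAL`` can also be reported for reasons unrelated to
TLS.

.. code-block:: c

  #include <errno.h>

  enum ktls_rx_action {
          KTLS_RX_RETRY,      /* transient: call recv() again */
          KTLS_RX_BAD_RECORD, /* one of the three errors listed above */
          KTLS_RX_OTHER,      /* anything else: handle as for plain TCP */
  };

  /* Hypothetical helper: classify errno after recv() returned -1. */
  static enum ktls_rx_action ktls_rx_classify(int err)
  {
          switch (err) {
          case EINTR:
          case EAGAIN:
                  return KTLS_RX_RETRY;
          case EINVAL:   /* TLS version mismatch */
          case EMSGSIZE: /* record too big */
          case EBADMSG:  /* decryption failed */
                  return KTLS_RX_BAD_RECORD;
          default:
                  return KTLS_RX_OTHER;
          }
  }
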
Sending TLS control messages
----------------------------

Other than application data, TLS has control messages such as alert
messages (record type 21) and handshake messages (record type 22), etc.
These messages can be sent over the socket by providing the TLS record type
via a CMSG. For example the following function sends @data of @length bytes
using a record of type @record_type.

.. code-block:: c

  /* send TLS control message using record_type */
  static int ktls_send_ctrl_message(int sock, unsigned char record_type,
                                    void *data, size_t length)
  {
          struct msghdr msg = {0};
          int cmsg_len = sizeof(record_type);
          struct cmsghdr *cmsg;
          char buf[CMSG_SPACE(cmsg_len)];
          struct iovec msg_iov;   /* Vector of data to send/receive into.  */

          msg.msg_control = buf;
          msg.msg_controllen = sizeof(buf);
          cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_TLS;
          cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
          cmsg->cmsg_len = CMSG_LEN(cmsg_len);
          *CMSG_DATA(cmsg) = record_type;
          msg.msg_controllen = cmsg->cmsg_len;

          msg_iov.iov_base = data;
          msg_iov.iov_len = length;
          msg.msg_iov = &msg_iov;
          msg.msg_iovlen = 1;

          return sendmsg(sock, &msg, 0);
  }

Control message data should be provided unencrypted, and will be
encrypted by the kernel.

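The cmsg setup can also be separated from the sendmsg() call, which makes it
easy to inspect without a connected socket. The sketch below assumes the
caller supplies a suitably aligned control buffer; ``ktls_set_record_type()``
is a hypothetical helper and the ``SOL_TLS`` fallback value is an assumption
for C libraries that do not define it.

.. code-block:: c

  #include <string.h>
  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* assumed fallback for older C libraries */
  #endif

  /* Hypothetical helper: attach a TLS_SET_RECORD_TYPE cmsg to msg.
   * Returns 0 on success, -1 if buf is too small. */
  static int ktls_set_record_type(struct msghdr *msg, void *buf,
                                  size_t buf_len, unsigned char record_type)
  {
          struct cmsghdr *cmsg;

          if (buf_len < CMSG_SPACE(sizeof(record_type)))
                  return -1;

          memset(buf, 0, buf_len);
          msg->msg_control = buf;
          msg->msg_controllen = CMSG_SPACE(sizeof(record_type));

          cmsg = CMSG_FIRSTHDR(msg);
          cmsg->cmsg_level = SOL_TLS;
          cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
          cmsg->cmsg_len = CMSG_LEN(sizeof(record_type));
          *CMSG_DATA(cmsg) = record_type;
          return 0;
  }

As a usage example, a close_notify alert would be the two plaintext bytes
``{1, 0}`` (warning level, description 0) placed in the iovec, with the record
type set to 21.
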
Receiving TLS control messages
------------------------------

TLS control messages are passed in the userspace buffer, with message
type passed via cmsg. If no cmsg buffer is provided, an error is
returned if a control message is received. Data messages may be
received without a cmsg buffer set.

.. code-block:: c

  char buffer[16384];
  char cmsg_buf[CMSG_SPACE(sizeof(unsigned char))];
  struct msghdr msg = {0};
  msg.msg_control = cmsg_buf;
  msg.msg_controllen = sizeof(cmsg_buf);

  struct iovec msg_iov;
  msg_iov.iov_base = buffer;
  msg_iov.iov_len = sizeof(buffer);

  msg.msg_iov = &msg_iov;
  msg.msg_iovlen = 1;

  int ret = recvmsg(sock, &msg, 0 /* flags */);

  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg && cmsg->cmsg_level == SOL_TLS &&
      cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
          int record_type = *((unsigned char *)CMSG_DATA(cmsg));
          // Do something with record_type, and control message data in
          // buffer.
          //
          // Note that record_type may be == to application data (23).
  } else {
          // Buffer contains application data.
  }

recv will never return data from mixed types of TLS records.

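The record type lookup can be wrapped in a small function that walks all
control messages rather than only the first one. This is a sketch;
``ktls_get_record_type()`` is a hypothetical name and the ``SOL_TLS`` fallback
value is an assumption for C libraries that do not define it.

.. code-block:: c

  #include <stddef.h>
  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* assumed fallback for older C libraries */
  #endif

  /* Hypothetical helper: return the record type found in a msghdr
   * filled in by recvmsg(), or -1 if no TLS cmsg is present. */
  static int ktls_get_record_type(struct msghdr *msg)
  {
          struct cmsghdr *cmsg;

          for (cmsg = CMSG_FIRSTHDR(msg); cmsg;
               cmsg = CMSG_NXTHDR(msg, cmsg)) {
                  if (cmsg->cmsg_level == SOL_TLS &&
                      cmsg->cmsg_type == TLS_GET_RECORD_TYPE)
                          return *(unsigned char *)CMSG_DATA(cmsg);
          }
          return -1;
  }

A return value of 23 means the buffer holds application data, 21 an alert and
22 a handshake message, matching the record types mentioned earlier.
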
Integrating into a userspace TLS library
----------------------------------------

At a high level, the kernel TLS ULP is a replacement for the record
layer of a userspace TLS library.

A patchset to OpenSSL to use ktls as the record layer is
`here <https://github.com/Mellanox/openssl/commits/tls_rx2>`_.

`An example <https://github.com/ktls/af_ktls-tool/commits/RX>`_
of calling send directly after a handshake using gnutls.
Since it doesn't implement a full record layer, control
messages are not supported.

Optional optimizations
----------------------

There are certain condition-specific optimizations the TLS ULP can make,
if requested. Those optimizations are either not universally beneficial
or may impact correctness, hence they require an opt-in.
All options are set per-socket using setsockopt(), and their
state can be checked using getsockopt() and via socket diag (``ss``).

TLS_TX_ZEROCOPY_RO
~~~~~~~~~~~~~~~~~~

For device offload only. Allow sendfile() data to be transmitted directly
to the NIC without making an in-kernel copy. This allows true zero-copy
behavior when device offload is enabled.

The application must make sure that the data is not modified between being
submitted and transmission completing. In other words this is mostly
applicable if the data sent on a socket via sendfile() is read-only.

Modifying the data may result in different versions of the data being used
for the original TCP transmission and TCP retransmissions. To the receiver
this will look like TLS records had been tampered with and will result
in record authentication failures.

TLS_RX_EXPECT_NO_PAD
~~~~~~~~~~~~~~~~~~~~

TLS 1.3 only. Expect the sender to not pad records. This allows the data
to be decrypted directly into user space buffers with TLS 1.3.

This optimization is safe to enable only if the remote end is trusted,
otherwise it gives an attacker a way to double the TLS processing cost.

If the decrypted record turns out to have been padded or is not a data
record, it will be decrypted again into a kernel buffer without zero copy.
Such events are counted in the ``TlsDecryptRetry`` statistic.

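Both options are plain per-socket switches, so enabling and querying them might
look like the sketch below. It assumes the option value is an int-sized
boolean and that the socket already has the TLS ULP attached. The helper names
are hypothetical, and the fallback ``#define`` values are assumptions for
headers that predate these options.

.. code-block:: c

  #include <errno.h>
  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282 /* assumed fallback for older C libraries */
  #endif
  #ifndef TLS_TX_ZEROCOPY_RO
  #define TLS_TX_ZEROCOPY_RO 3 /* assumed fallback for older headers */
  #endif
  #ifndef TLS_RX_EXPECT_NO_PAD
  #define TLS_RX_EXPECT_NO_PAD 4 /* assumed fallback for older headers */
  #endif

  /* Hypothetical helper: returns 0 on success or a negative errno. */
  static int ktls_set_opt(int sock, int opt, unsigned int enable)
  {
          if (setsockopt(sock, SOL_TLS, opt, &enable, sizeof(enable)) < 0)
                  return -errno;
          return 0;
  }

  /* Hypothetical helper: read back the current state of an option. */
  static int ktls_get_opt(int sock, int opt, unsigned int *enabled)
  {
          socklen_t len = sizeof(*enabled);

          if (getsockopt(sock, SOL_TLS, opt, enabled, &len) < 0)
                  return -errno;
          return 0;
  }

For instance, ``ktls_set_opt(sock, TLS_RX_EXPECT_NO_PAD, 1)`` would opt in on
the receive side once TLS_RX has been configured.
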
Statistics
==========

TLS implementation exposes the following per-namespace statistics
(``/proc/net/tls_stat``):

- ``TlsCurrTxSw``, ``TlsCurrRxSw`` -
  number of TX and RX sessions currently installed where host handles
  cryptography

- ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` -
  number of TX and RX sessions currently installed where NIC handles
  cryptography

- ``TlsTxSw``, ``TlsRxSw`` -
  number of TX and RX sessions opened with host cryptography

- ``TlsTxDevice``, ``TlsRxDevice`` -
  number of TX and RX sessions opened with NIC cryptography

- ``TlsDecryptError`` -
  record decryption failed (e.g. due to incorrect authentication tag)

- ``TlsDeviceRxResync`` -
  number of RX resyncs sent to NICs handling cryptography

- ``TlsDecryptRetry`` -
  number of RX records which had to be re-decrypted due to
  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
  also increment for non-data records.

- ``TlsRxNoPadViolation`` -
  number of data RX records which had to be re-decrypted due to
  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.
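
A monitoring tool could read a single counter from the statistics file with a
few lines of C. The sketch below assumes each line holds a counter name
followed by whitespace and an unsigned decimal value, which should be checked
against the running kernel. ``tls_stat_lookup()`` is a hypothetical helper.

.. code-block:: c

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical helper: find one counter by name in a stream of
   * "Name value" lines. Returns 0 on success, -1 if not found. */
  static int tls_stat_lookup(FILE *f, const char *name, unsigned long *value)
  {
          char key[64];
          unsigned long v;

          rewind(f);
          while (fscanf(f, "%63s %lu", key, &v) == 2) {
                  if (strcmp(key, name) == 0) {
                          *value = v;
                          return 0;
                  }
          }
          return -1;
  }

The stream would normally come from ``fopen("/proc/net/tls_stat", "r")``.
Comparing ``TlsRxNoPadViolation`` over time is one way to judge whether
``TLS_RX_EXPECT_NO_PAD`` is paying off for a given peer.
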