Commit | Line | Data |
---|---|---|
b9dd2bea MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============================= | |
bb38ccce | 4 | Kernel Connection Multiplexor |
b9dd2bea | 5 | ============================= |
10016594 TH |
6 | |
7 | Kernel Connection Multiplexor (KCM) is a mechanism that provides a message-based | |
8 | interface over TCP for generic application protocols. With KCM, an application | |
9 | can efficiently send and receive application protocol messages over TCP using | |
10 | datagram sockets. | |
11 | ||
b9dd2bea MCC |
12 | KCM implements an NxM multiplexor in the kernel as diagrammed below:: |
13 | ||
14 | +------------+ +------------+ +------------+ +------------+ | |
15 | | KCM socket | | KCM socket | | KCM socket | | KCM socket | | |
16 | +------------+ +------------+ +------------+ +------------+ | |
17 | | | | | | |
18 | +-----------+ | | +----------+ | |
19 | | | | | | |
20 | +----------------------------------+ | |
21 | | Multiplexor | | |
22 | +----------------------------------+ | |
23 | | | | | | | |
24 | +---------+ | | | ------------+ | |
25 | | | | | | | |
26 | +----------+ +----------+ +----------+ +----------+ +----------+ | |
27 | | Psock | | Psock | | Psock | | Psock | | Psock | | |
28 | +----------+ +----------+ +----------+ +----------+ +----------+ | |
29 | | | | | | | |
30 | +----------+ +----------+ +----------+ +----------+ +----------+ | |
31 | | TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | | |
32 | +----------+ +----------+ +----------+ +----------+ +----------+ | |
10016594 TH |
33 | |
34 | KCM sockets | |
b9dd2bea | 35 | =========== |
10016594 | 36 | |
bb38ccce | 37 | The KCM sockets provide the user interface to the multiplexor. All the KCM sockets |
10016594 TH |
38 | bound to a multiplexor are considered to have equivalent function, and I/O |
39 | operations in different sockets may be done in parallel without the need for | |
40 | synchronization between threads in userspace. | |
41 | ||
42 | Multiplexor | |
b9dd2bea | 43 | =========== |
10016594 TH |
44 | |
45 | The multiplexor provides the message steering. In the transmit path, messages | |
46 | written on a KCM socket are sent atomically on an appropriate TCP socket. | |
47 | Similarly, in the receive path, messages are constructed on each TCP socket | |
48 | (Psock) and complete messages are steered to a KCM socket. | |
49 | ||
50 | TCP sockets & Psocks | |
b9dd2bea | 51 | ==================== |
10016594 TH |
52 | |
53 | TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated | |
54 | for each bound TCP socket; this structure holds the state for constructing | |
55 | messages on receive as well as other connection-specific information for KCM. | |
56 | ||
57 | Connected mode semantics | |
b9dd2bea | 58 | ======================== |
10016594 TH |
59 | |
60 | Each multiplexor assumes that all attached TCP connections are to the same | |
61 | destination and can use the different connections for load balancing when | |
62 | transmitting. The normal send and recv calls (including sendmmsg and recvmmsg) | |
63 | can be used to send and receive messages on the KCM socket. | |
64 | ||
65 | Socket types | |
b9dd2bea | 66 | ============ |
10016594 TH |
67 | |
68 | KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. | |
69 | ||
70 | Message delineation | |
71 | ------------------- | |
72 | ||
73 | Messages are sent over a TCP stream with some application protocol message | |
74 | format that typically includes a header which frames the messages. The length | |
75 | of a received message can be deduced from the application protocol header | |
76 | (often just a simple length field). | |
77 | ||
78 | A TCP stream must be parsed to determine message boundaries. Berkeley Packet | |
79 | Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a | |
80 | BPF program must be specified. The program is called at the start of receiving | |
81 | a new message and is given an skbuff that contains the bytes received so far. | |
82 | It parses the message header and returns the length of the message. Given this | |
83 | information, KCM will construct the message of the stated length and deliver it | |
84 | to a KCM socket. | |
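
The length computation performed by such a parser can be sketched in userspace C. This is only an illustration under stated assumptions: the protocol is a hypothetical one whose messages start with a 4-byte big-endian length field counting the payload only, and the name `frame_total_len` and the "return 0 when the header is incomplete" convention belong to this sketch, not to KCM. In a real deployment the equivalent logic lives in the BPF program attached with the TCP socket.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: 4-byte big-endian payload length, then the payload. */
#define HDR_LEN 4

/*
 * Return the total message length (header plus payload), or 0 if fewer
 * than HDR_LEN bytes are available and the length cannot be known yet.
 * A 64-bit result avoids wrap-around for very large length fields.
 */
static uint64_t frame_total_len(const unsigned char *buf, size_t avail)
{
	uint32_t payload;

	if (avail < HDR_LEN)
		return 0;

	/* Assemble the big-endian length field byte by byte. */
	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	return (uint64_t)payload + HDR_LEN;
}
```

A parser like this only needs the first few bytes of a message, which is why it can run as soon as a new message starts arriving.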
85 | ||
86 | TCP socket management | |
87 | --------------------- | |
88 | ||
89 | When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and | |
90 | write space available (POLLOUT) events are handled by the multiplexor. If there | |
91 | is a state change (disconnection) or other error on a TCP socket, an error is | |
92 | posted on the TCP socket so that a POLLERR event happens and KCM discontinues | |
93 | using the socket. When the application gets the error notification for a | |
94 | TCP socket, it should unattach the socket from KCM and then handle the error | |
95 | condition (the typical response is to close the socket and create a new | |
96 | connection if necessary). | |
97 | ||
98 | KCM limits the maximum receive message size to be the size of the receive | |
99 | socket buffer on the attached TCP socket (the socket buffer size can be set by | |
100 | SO_RCVBUF). If the length of a new message reported by the BPF program is | |
101 | greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP | |
102 | socket. The BPF program may also enforce a maximum message size and report an | |
103 | error when it is exceeded. | |
104 | ||
105 | A timeout may be set for assembling messages on a receive socket. The timeout | |
106 | value is taken from the receive timeout of the attached TCP socket (this is set | |
107 | by SO_RCVTIMEO). If the timer expires before assembly is complete, an error | |
108 | (ETIMEDOUT) is posted on the socket. | |
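
Since both limits are taken from ordinary socket options on the TCP socket, an application might set them with a small helper such as the one below. This is a sketch, not part of the KCM API: the name `tcp_set_kcm_limits` and its parameters are hypothetical, and calling it before the socket is attached is an assumption of the example. Note that Linux doubles the value passed to SO_RCVBUF to allow for bookkeeping overhead (see socket(7)).

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Set the receive buffer size and the receive timeout on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
static int tcp_set_kcm_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	/* Bounds the maximum receive message size. */
	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
		       &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
		return -1;

	/* Bounds the time allowed for assembling one message. */
	return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```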
109 | ||
110 | User interface | |
111 | ============== | |
112 | ||
113 | Creating a multiplexor | |
114 | ---------------------- | |
115 | ||
b9dd2bea | 116 | A new multiplexor and initial KCM socket are created by a socket call:: |
10016594 TH |
117 | |
118 | socket(AF_KCM, type, protocol) | |
119 | ||
b9dd2bea MCC |
120 | - type is either SOCK_DGRAM or SOCK_SEQPACKET |
121 | - protocol is KCMPROTO_CONNECTED | |
10016594 TH |
122 | |
123 | Cloning KCM sockets | |
124 | ------------------- | |
125 | ||
126 | After the first KCM socket is created using the socket call as described | |
127 | above, additional sockets for the multiplexor can be created by cloning | |
b9dd2bea | 128 | a KCM socket. This is accomplished by an ioctl on a KCM socket:: |
10016594 TH |
129 | |
130 | /* From linux/kcm.h */ | |
131 | struct kcm_clone { | |
b9dd2bea | 132 | int fd; |
10016594 TH |
133 | }; |
134 | ||
135 | struct kcm_clone info; | |
136 | ||
137 | memset(&info, 0, sizeof(info)); | |
138 | ||
139 | err = ioctl(kcmfd, SIOCKCMCLONE, &info); | |
140 | ||
141 | if (!err) | |
142 | newkcmfd = info.fd; | |
143 | ||
144 | Attach transport sockets | |
145 | ------------------------ | |
146 | ||
147 | Attaching of transport sockets to a multiplexor is performed by calling an | |
b9dd2bea | 148 | ioctl on a KCM socket for the multiplexor, e.g.:: |
10016594 TH |
149 | |
150 | /* From linux/kcm.h */ | |
151 | struct kcm_attach { | |
b9dd2bea | 152 | int fd; |
10016594 TH |
153 | int bpf_fd; |
154 | }; | |
155 | ||
156 | struct kcm_attach info; | |
157 | ||
158 | memset(&info, 0, sizeof(info)); | |
159 | ||
160 | info.fd = tcpfd; | |
161 | info.bpf_fd = bpf_prog_fd; | |
162 | ||
163 | ioctl(kcmfd, SIOCKCMATTACH, &info); | |
164 | ||
165 | The kcm_attach structure contains: | |
b9dd2bea MCC |
166 | |
167 | - fd: file descriptor for TCP socket being attached | |
168 | - bpf_fd: file descriptor for compiled BPF program downloaded | |
10016594 TH |
169 | |
170 | Unattach transport sockets | |
171 | -------------------------- | |
172 | ||
173 | Unattaching a transport socket from a multiplexor is straightforward. An | |
b9dd2bea | 174 | "unattach" ioctl is done with the kcm_unattach structure as the argument:: |
10016594 TH |
175 | |
176 | /* From linux/kcm.h */ | |
177 | struct kcm_unattach { | |
b9dd2bea | 178 | int fd; |
10016594 TH |
179 | }; |
180 | ||
181 | struct kcm_unattach info; | |
182 | ||
183 | memset(&info, 0, sizeof(info)); | |
184 | ||
185 | info.fd = tcpfd; | |
186 | ||
187 | ioctl(kcmfd, SIOCKCMUNATTACH, &info); | |
188 | ||
189 | Disabling receive on KCM socket | |
190 | ------------------------------- | |
191 | ||
192 | A setsockopt is used to disable or enable receiving on a KCM socket. | |
193 | When receive is disabled, any pending messages in the socket's | |
194 | receive buffer are moved to other sockets. This feature is useful | |
195 | if an application thread knows that it will be doing a lot of | |
196 | work on a request and won't be able to service new messages for a | |
b9dd2bea | 197 | while. Example use:: |
10016594 TH |
198 | |
199 | int val = 1; | |
200 | ||
201 | setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)); | |
202 | ||
203 | BPF programs for message delineation | |
204 | ------------------------------------ | |
205 | ||
bb38ccce | 206 | BPF programs can be compiled using the BPF LLVM backend. For example, |
b9dd2bea | 207 | the BPF program for parsing Thrift is:: |
10016594 TH |
208 | |
209 | #include "bpf.h" /* for __sk_buff */ | |
210 | #include "bpf_helpers.h" /* for load_word intrinsic */ | |
211 | ||
212 | SEC("socket_kcm") | |
213 | int bpf_prog1(struct __sk_buff *skb) | |
214 | { | |
215 | return load_word(skb, 0) + 4; | |
216 | } | |
217 | ||
218 | char _license[] SEC("license") = "GPL"; | |
219 | ||
220 | Use in applications | |
221 | =================== | |
222 | ||
223 | KCM accelerates application layer protocols. Specifically, it allows | |
224 | applications to use a message-based interface for sending and receiving | |
225 | messages. The kernel provides necessary assurances that messages are sent | |
226 | and received atomically. This relieves much of the burden applications have | |
227 | in mapping a message-based protocol onto the TCP stream. KCM also makes | |
228 | application layer messages a unit of work in the kernel for the purposes of | |
bb38ccce | 229 | steering and scheduling, which in turn allows a simpler networking model in |
10016594 TH |
230 | multithreaded applications. |
231 | ||
232 | Configurations | |
233 | -------------- | |
234 | ||
235 | In an Nx1 configuration, KCM logically provides multiple socket handles | |
236 | to the same TCP connection. This allows parallelism between I/O | |
237 | operations on the TCP socket (for instance copyin and copyout of data is | |
238 | parallelized). In an application, a KCM socket can be opened for each | |
239 | processing thread and added to an epoll set (similar to how SO_REUSEPORT | |
240 | is used to allow multiple listener sockets on the same port). | |
241 | ||
242 | In an MxN configuration, multiple connections are established to the | |
243 | same destination. These are used for simple load balancing. | |
244 | ||
245 | Message batching | |
246 | ---------------- | |
247 | ||
248 | The primary purpose of KCM is load balancing between KCM sockets and hence | |
249 | threads in a nominal use case. Perfect load balancing, that is steering | |
250 | each received message to a different KCM socket or steering each sent | |
251 | message to a different TCP socket, can negatively impact performance | |
252 | since this doesn't allow for affinities to be established. Balancing | |
253 | based on groups, or batches of messages, can be beneficial for performance. | |
254 | ||
255 | On transmit, there are three ways an application can batch (pipeline) | |
256 | messages on a KCM socket. | |
b9dd2bea | 257 | |
10016594 TH |
258 | 1) Send multiple messages in a single sendmmsg. |
259 | 2) Send a group of messages each with a sendmsg call, where all messages | |
260 | except the last have MSG_BATCH in the flags of the sendmsg call. | |
261 | 3) Create a "super message" composed of multiple messages and send this | |
262 | with a single sendmsg. | |
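
The second method above can be sketched as a loop over sendmsg calls. The helper names `batch_flags` and `kcm_send_batch` are hypothetical, each message is assumed to fit in a single iovec, and the fallback definition of MSG_BATCH is only there for older C library headers. Treat this as an illustration of the flag pattern rather than a reference implementation.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000	/* value used by the Linux socket headers */
#endif

/* Flags for message i of n: MSG_BATCH on every message except the last. */
static int batch_flags(size_t i, size_t n)
{
	return (i + 1 < n) ? MSG_BATCH : 0;
}

/*
 * Send n messages (one iovec each) on a KCM socket as one batch.
 * Returns 0 on success, -1 on the first sendmsg failure (errno is set).
 */
static int kcm_send_batch(int kcmfd, struct iovec *msgs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct msghdr mh = { .msg_iov = &msgs[i], .msg_iovlen = 1 };

		if (sendmsg(kcmfd, &mh, batch_flags(i, n)) < 0)
			return -1;
	}
	return 0;
}
```

Leaving MSG_BATCH off the final message is what marks the end of the batch, so a batch of one is sent with no extra flags.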
263 | ||
264 | On receive, the KCM module attempts to queue messages received on the | |
265 | same KCM socket during each TCP ready callback. The targeted KCM socket | |
266 | changes at each receive ready callback on the KCM socket. The application | |
267 | does not need to configure this. | |
268 | ||
269 | Error handling | |
270 | -------------- | |
271 | ||
272 | An application should include a thread to monitor errors raised on | |
273 | the TCP connection. Normally, this will be done by placing each | |
274 | TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR | |
275 | event. If an error occurs on an attached TCP socket, KCM sets EPIPE | |
276 | on the socket, thus waking up the application thread. When the application | |
277 | sees the error (which may just be a disconnect) it should unattach the | |
278 | socket from KCM and then close it. It is assumed that once an error is | |
279 | posted on the TCP socket the data stream is unrecoverable (i.e. an error | |
bb38ccce | 280 | may have occurred in the middle of receiving a message). |
10016594 TH |
281 | |
282 | TCP connection monitoring | |
283 | ------------------------- | |
284 | ||
285 | In KCM there is no means to correlate a message to the TCP socket that | |
286 | was used to send or receive the message (except in the case there is | |
287 | only one attached TCP socket). However, the application does retain | |
288 | an open file descriptor to the socket so it will be able to get statistics | |
289 | from the socket which can be used in detecting issues (such as high | |
290 | retransmissions on the socket). |
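
One way to read such statistics is the TCP_INFO socket option, sketched below. The helper name `tcp_total_retrans` is hypothetical, and using the `tcpi_total_retrans` counter as the health signal is an assumption of this example; an application may prefer other fields of `struct tcp_info`.

```c
#define _GNU_SOURCE		/* make struct tcp_info visible in strict modes */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Read the total retransmission count of a TCP socket via TCP_INFO.
 * Returns 0 and stores the count in *out on success, -1 on error.
 */
static int tcp_total_retrans(int tcpfd, unsigned int *out)
{
	struct tcp_info ti = { 0 };
	socklen_t len = sizeof(ti);

	if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
		return -1;

	*out = ti.tcpi_total_retrans;
	return 0;
}
```

Polling this counter periodically for each attached TCP socket lets the application notice a degraded connection even though individual messages cannot be traced to it.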