.. SPDX-License-Identifier: GPL-2.0

===
RDS
===

Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are

* Addressing

  RDS uses IPv4 addresses and 16-bit port numbers to identify
  the end point of a connection. All socket operations that involve
  passing addresses between kernel and user space generally
  use a struct sockaddr_in.

  The fact that IPv4 addresses are used does not mean the underlying
  transport has to be IP-based. In fact, RDS over IB uses a
  reliable IB connection; the IP address is used exclusively to
  locate the remote node's GID (by ARPing for the given IP).

  The port space is entirely independent of UDP, TCP or any other
  protocol.

* Socket interface

  RDS sockets work *mostly* as you would expect from a BSD
  socket. The next section will cover the details. At any rate,
  all I/O is performed through the standard BSD socket API.
  Some additions like zerocopy support are implemented through
  control messages, while other extensions use the
  getsockopt/setsockopt calls.

  Sockets must be bound before you can send or receive data.
  This is needed because binding also selects a transport and
  attaches it to the socket. Once bound, the transport assignment
  does not change. RDS will tolerate IPs moving around (e.g. in
  an active-active HA scenario), but only as long as the address
  doesn't move to a different transport.

* sysctls

  RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

AF_RDS, PF_RDS, SOL_RDS
    AF_RDS and PF_RDS are the domain type to be used with socket(2)
    to create RDS sockets. SOL_RDS is the socket-level to be used
    with setsockopt(2) and getsockopt(2) for RDS specific socket
    options.

fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
    This creates a new, unbound RDS socket.

setsockopt(SOL_SOCKET): send and receive buffer size
    RDS honors the send and receive buffer size socket options.
    You are not allowed to queue more than SO_SNDSIZE bytes to
    a socket. A message is queued when sendmsg is called, and
    it leaves the queue when the remote system acknowledges
    its arrival.

    The SO_RCVSIZE option controls the maximum receive queue length.
    This is a soft limit rather than a hard limit - RDS will
    continue to accept and queue incoming messages, even if that
    takes the queue length over the limit. However, it will also
    mark the port as "congested" and send a congestion update to
    the source node. The source node is supposed to throttle any
    processes sending to this congested port.

bind(fd, &sockaddr_in, ...)
    This binds the socket to a local IP address and port, and a
    transport, if one has not already been selected via the
    SO_RDS_TRANSPORT socket option.

sendmsg(fd, ...)
    Sends a message to the indicated recipient. The kernel will
    transparently establish the underlying reliable connection
    if it isn't up yet.

    An attempt to send a message that exceeds SO_SNDSIZE will
    return with -EMSGSIZE.

    An attempt to send a message that would take the total number
    of queued bytes over the SO_SNDSIZE threshold will return
    EAGAIN.

    An attempt to send a message to a destination that is marked
    as "congested" will return ENOBUFS.

recvmsg(fd, ...)
    Receives a message that was queued to this socket. The socket's
    recv queue accounting is adjusted, and if the queue length
    drops below SO_RCVSIZE, the port is marked uncongested, and
    a congestion update is sent to all peers.

    Applications can ask the RDS kernel module to receive
    notifications via control messages (for instance, there is a
    notification when a congestion update arrived, or when an RDMA
    operation completes). These notifications are received through
    the msg.msg_control buffer of struct msghdr. The format of the
    messages is described in manpages.

poll(fd)
    RDS supports the poll interface to allow the application
    to implement async I/O.

    POLLIN handling is pretty straightforward. When there's an
    incoming message queued to the socket, or a pending notification,
    we signal POLLIN.

    POLLOUT is a little harder. Since you can essentially send
    to any destination, RDS will always signal POLLOUT as long as
    there's room on the send queue (i.e. the number of bytes queued
    is less than the sendbuf size).

    However, the kernel will refuse to accept messages to
    a destination marked congested - in this case you will loop
    forever if you rely on poll to tell you what to do.
    This isn't a trivial problem, but applications can deal with
    this - by using congestion notifications, and by checking for
    ENOBUFS errors returned by sendmsg.

setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
    This allows the application to discard all messages queued to a
    specific destination on this particular socket.

    This makes it possible to cancel outstanding messages if
    the application detects a timeout. For instance, if it tried to
    send a message, and the remote host is unreachable, RDS will keep
    trying forever. The application may decide it's not worth it, and
    cancel the operation. In this case, it would use
    RDS_CANCEL_SENT_TO to nuke any pending messages.

``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
    Set or read an integer defining the underlying
    encapsulating transport to be used for RDS packets on the
    socket. When setting the option, the integer argument may be
    one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
    value, RDS_TRANS_NONE will be returned on an unbound socket.
    This socket option may only be set exactly once on the socket,
    prior to binding it via the bind(2) system call. Attempts to
    set SO_RDS_TRANSPORT on a socket for which the transport has
    been previously attached explicitly (by SO_RDS_TRANSPORT) or
    implicitly (via bind(2)) will return an error of EOPNOTSUPP.
    An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
    always return EINVAL.

RDMA for RDS
============

see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

see rds(7) manpage


RDS Protocol
============

Message header

  The message header is a 'struct rds_header' (see rds.h):

  Fields:

  h_sequence:
      per-packet sequence number
  h_ack:
      piggybacked acknowledgment of last packet received
  h_len:
      length of data, not including header
  h_sport:
      source port
  h_dport:
      destination port
  h_flags:
      Can be:

      ============= ==================================
      CONG_BITMAP   this is a congestion update bitmap
      ACK_REQUIRED  receiver must ack this packet
      RETRANSMITTED packet has previously been sent
      ============= ==================================

  h_credit:
      indicates to the other end of the connection that
      it has more credits available (i.e. there is
      more send room)
  h_padding[4]:
      unused, reserved for future use
  h_csum:
      header checksum
  h_exthdr:
      optional data can be passed here. This is currently used for
      passing RDMA-related information.

ACK and retransmit handling

  One might think that with reliable IB connections you wouldn't need
  to ack messages that have been received. The problem is that IB
  hardware generates an ack message before it has DMAed the message
  into memory. This creates a potential message loss if the HCA is
  disabled for any reason after it sends the ack but before the
  message is DMAed and processed. This is only a potential issue
  if another HCA is available for fail-over.

  Sending an ack immediately would allow the sender to free the sent
  message from its send queue quickly, but could cause excessive
  traffic to be used for acks. RDS piggybacks acks on sent data
  packets. Ack-only packets are reduced by only allowing one to be
  in flight at a time, and by the sender only asking for acks when
  its send buffers start to fill up. All retransmissions are also
  acked.

Flow Control

  RDS's IB transport uses a credit-based mechanism to verify that
  there is space in the peer's receive buffers for more data. This
  eliminates the need for hardware retries on the connection.

Congestion

  Messages waiting in the receive queue on the receiving socket
  are accounted against the socket's SO_RCVBUF option value. Only
  the payload bytes in the message are accounted for. If the
  number of bytes queued equals or exceeds rcvbuf then the socket
  is congested. All sends attempted to this socket's address
  should block or return -EWOULDBLOCK.

  Applications are expected to be reasonably tuned such that this
  situation very rarely occurs. An application encountering this
  "back-pressure" is considered buggy.

  This is implemented by having each node maintain bitmaps which
  indicate which ports on bound addresses are congested. As the
  bitmap changes it is sent through all the connections which
  terminate in the local address of the bitmap which changed.

  The bitmaps are allocated as connections are brought up. This
  avoids allocation in the interrupt handling path which queues
  messages on sockets. The dense bitmaps let transports send the
  entire bitmap on any bitmap change reasonably efficiently. This
  is much easier to implement than some finer-grained
  communication of per-port congestion. The sender does a very
  inexpensive bit test to test if the port it's about to send to
  is congested or not.


RDS Transport Layer
===================

As mentioned above, RDS is not IB-specific. Its code is divided
into a general RDS layer and a transport layer.

The general layer handles the socket API, congestion handling,
loopback, stats, usermem pinning, and the connection state machine.

The transport layer handles the details of the transport. The IB
transport, for example, handles all the queue pairs, work requests,
CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.

struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.

struct rds_socket
    per-socket information

struct rds_connection
    per-connection information

struct rds_transport
    pointers to transport-specific functions

struct rds_statistics
    non-transport-specific statistics

struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
ERROR states.

The first time an attempt is made by an RDS socket to send data to
a node, a connection is allocated and connected. That connection is
then maintained forever -- if there are transport errors, the
connection will be dropped and re-established.

Dropping a connection while packets are queued will cause queued or
partially-sent datagrams to be retransmitted when the connection is
re-established.


The send path
=============

rds_sendmsg()
    - struct rds_message built from incoming data
    - CMSGs parsed (e.g. RDMA ops)
    - transport connection allocated and connected if not already
    - rds_message placed on send queue
    - send worker awoken

rds_send_worker()
    - calls rds_send_xmit() until queue is empty

rds_send_xmit()
    - transmits congestion map if one is pending
    - may set ACK_REQUIRED
    - calls transport to send either non-RDMA or RDMA message
      (RDMA ops never retransmitted)

rds_ib_xmit()
    - allocates work requests from send ring
    - adds any new send credits available to peer (h_credits)
    - maps the rds_message's sg list
    - piggybacks ack
    - populates work requests
    - posts send to connection's queue pair
362 | |
363 | The recv path | |
364 | ============= | |
365 | ||
366 | rds_ib_recv_cq_comp_handler() | |
bad5b6e2 MCC |
367 | - looks at write completions |
368 | - unmaps recv buffer from device | |
369 | - no errors, call rds_ib_process_recv() | |
370 | - refill recv ring | |
371 | ||
0c5f9b88 | 372 | rds_ib_process_recv() |
bad5b6e2 MCC |
373 | - validate header checksum |
374 | - copy header to rds_ib_incoming struct if start of a new datagram | |
375 | - add to ibinc's fraglist | |
376 | - if competed datagram: | |
377 | - update cong map if datagram was cong update | |
378 | - call rds_recv_incoming() otherwise | |
379 | - note if ack is required | |
380 | ||
0c5f9b88 | 381 | rds_recv_incoming() |
bad5b6e2 MCC |
382 | - drop duplicate packets |
383 | - respond to pings | |
384 | - find the sock associated with this datagram | |
385 | - add to sock queue | |
386 | - wake up sock | |
387 | - do some congestion calculations | |
0c5f9b88 | 388 | rds_recvmsg |
bad5b6e2 MCC |
389 | - copy data into user iovec |
390 | - handle CMSGs | |
391 | - return to application | |
0c5f9b88 | 392 | |
Multipath RDS (mprds)
=====================

Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
(though the concept can be extended to other transports). The classical
implementation of RDS-over-TCP is implemented by demultiplexing multiple
PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
port]) over a single TCP socket between the 2 IP addresses involved. This
has the limitation that it ends up funneling multiple RDS flows over a
single TCP flow. Thus it (a) is upper-bounded by the single-flow
bandwidth, and (b) suffers from head-of-line blocking for all the RDS
sockets.

Better throughput (for a fixed small packet size, MTU) can be achieved
by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
connection. RDS sockets will be attached to a path based on some hash
(e.g., of local address and RDS port number) and packets for that RDS
socket will be sent over the attached path using TCP to segment/reassemble
RDS datagrams on that path.

Multipathed RDS is implemented by splitting the struct rds_connection into
a common (to all paths) part, and a per-path struct rds_conn_path. All
I/O workqs and reconnect threads are driven from the rds_conn_path.
Transports such as TCP that are multipath capable may then set up a
TCP socket per rds_conn_path, and this is managed by the transport via
the transport-private cp_transport_data pointer.

Transports announce themselves as multipath capable by setting the
t_mp_capable bit during registration with the rds core module. When the
transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
across multiple paths. The outgoing hash is computed based on the
local address and port that the PF_RDS socket is bound to.

Additionally, even if the transport is MP capable, we may be
peering with some node that does not support mprds, or supports
a different number of paths. As a result, the peering nodes need
to agree on the number of paths to be used for the connection.
This is done by sending out a control packet exchange before the
first data packet. The control packet exchange must have completed
prior to outgoing hash completion in rds_sendmsg() when the transport
is multipath capable.

The control packet is an RDS ping packet (i.e., a packet to RDS dest
port 0) with the ping packet having an RDS extension header option of
type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
number of paths supported by the sender. The "probe" ping packet will
get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
be able to compute the min(sender_paths, rcvr_paths). The pong
sent in response to a probe-ping should contain the rcvr's npaths
when the rcvr is mprds-capable.

If the rcvr is not mprds-capable, the exthdr in the ping will be
ignored. In this case the pong will not have any exthdrs, so the sender
of the probe-ping can default to single-path mprds.