Commit | Line | Data |
---|---|---|
33155bac MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============= | |
98069ff4 | 4 | DCCP protocol |
4886fcad | 5 | ============= |
98069ff4 | 6 | |
98069ff4 | 7 | |
33155bac MCC |
8 | .. Contents |
9 | - Introduction | |
10 | - Missing features | |
11 | - Socket options | |
12 | - Sysctl variables | |
13 | - IOCTLs | |
14 | - Other tunables | |
15 | - Notes | |
98069ff4 | 16 | |
4886fcad | 17 | |
98069ff4 IM |
18 | Introduction |
19 | ============ | |
98069ff4 | 20 | Datagram Congestion Control Protocol (DCCP) is an unreliable, connection |
e333b3ed GR |
21 | oriented protocol designed to solve issues present in UDP and TCP, particularly |
22 | for real-time and multimedia (streaming) traffic. | |
c17cb8b5 MI |
23 | It divides into a base protocol (RFC 4340) and pluggable congestion control |
24 | modules called CCIDs. Like pluggable TCP congestion control, at least one CCID | |
e333b3ed GR |
25 | needs to be enabled in order for the protocol to function properly. In the Linux |
26 | implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as | |
27 | the TCP-friendly CCID3 (RFC 4342), are optional. | |
28 | For a brief introduction to CCIDs and suggestions for choosing a CCID to match | |
29 | given applications, see section 10 of RFC 4340. | |
98069ff4 IM |
30 | |
31 | It has a base protocol and pluggable congestion control IDs (CCIDs). | |
32 | ||
ebe6f7e7 GR |
33 | DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol |
34 | is at http://www.ietf.org/html.charters/dccp-charter.html | |
98069ff4 | 35 | |
4886fcad | 36 | |
98069ff4 IM |
37 | Missing features |
38 | ================ | |
ebe6f7e7 GR |
39 | The Linux DCCP implementation does not currently support all the features that are |
40 | specified in RFCs 4340 to 4342. | |
98069ff4 | 41 | |
ddfe10b8 | 42 | The known bugs are at: |
33155bac | 43 | |
c996d8b9 | 44 | http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP |
98069ff4 | 45 | |
ebe6f7e7 GR |
46 | For more up-to-date versions of the DCCP implementation, please consider using |
47 | the experimental DCCP test tree; instructions for checking this out are on: | |
c996d8b9 | 48 | http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree |
ebe6f7e7 GR |
49 | |
50 | ||
98069ff4 IM |
51 | Socket options |
52 | ============== | |
871a2c16 TG |
53 | DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes |
54 | a policy ID as argument and can only be set before the connection is established (i.e. changes | |
55 | during an established connection are not supported). Currently, two policies are | |
56 | defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, | |
57 | and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows passing a | |
58 | u32 priority value as ancillary data to sendmsg(), where higher numbers indicate | |
59 | a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to | |
33155bac MCC |
60 | be formatted using a cmsg(3) message header filled in as follows:: |
61 | ||
871a2c16 TG |
62 | cmsg->cmsg_level = SOL_DCCP; |
63 | cmsg->cmsg_type = DCCP_SCM_PRIORITY; | |
64 | cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ | |
65 | ||
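As an illustration of the header layout above, the following minimal sketch prepares a complete ``struct msghdr`` carrying one priority value. The helper ``prepare_prio_msg``, the union ``prio_ctl`` and the constant ``EXAMPLE_DCCP_SCM_PRIORITY`` are illustrative names only; the value 1 is assumed to mirror DCCP_SCM_PRIORITY in <linux/dccp.h>, which real code would include instead.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_DCCP
#define SOL_DCCP 269    /* assumption: Linux value, used only if libc lacks it */
#endif
/* Assumption: same value as DCCP_SCM_PRIORITY in <linux/dccp.h>. */
#define EXAMPLE_DCCP_SCM_PRIORITY 1

/* Control buffer sized for exactly one uint32_t priority value. */
union prio_ctl {
    char buf[CMSG_SPACE(sizeof(uint32_t))];
    struct cmsghdr align;   /* forces suitable alignment of buf */
};

/*
 * Fill msg/iov/ctl so that a later sendmsg() would send payload together
 * with the given priority as ancillary data. Nothing is sent here.
 */
static void prepare_prio_msg(struct msghdr *msg, struct iovec *iov,
                             union prio_ctl *ctl, void *payload,
                             size_t len, uint32_t prio)
{
    struct cmsghdr *cmsg;

    memset(msg, 0, sizeof(*msg));
    memset(ctl, 0, sizeof(*ctl));

    iov->iov_base = payload;
    iov->iov_len = len;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = ctl->buf;
    msg->msg_controllen = sizeof(ctl->buf);

    /* Non-NULL here because the control buffer holds a full header. */
    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_DCCP;
    cmsg->cmsg_type = EXAMPLE_DCCP_SCM_PRIORITY;
    cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t));
    memcpy(CMSG_DATA(cmsg), &prio, sizeof(prio));
}
```

The prepared message would then be handed to sendmsg() on a socket whose dequeuing policy has been set to DCCPQ_POLICY_PRIO.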
66 | DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero | |
67 | value is always interpreted as unbounded queue length. If different from zero, | |
68 | the interpretation of this parameter depends on the current dequeuing policy | |
69 | (see above): the "simple" policy will enforce a fixed queue size by returning | |
70 | EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the | |
71 | lowest-priority packet first. The default value for this parameter is | |
72 | initialised from /proc/sys/net/dccp/default/tx_qlen. | |
73 | ||
00e4d116 GR |
74 | DCCP_SOCKOPT_SERVICE sets the service code. The specification mandates the use of
75 | service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, | |
76 | the socket will fall back to 0 (which means that no meaningful service code | |
126acd5b GR |
77 | is present). On active sockets this is set before connect(); specifying more |
78 | than one code has no effect (all subsequent service codes are ignored). The | |
79 | case is different for passive sockets, where multiple service codes (up to 32) | |
80 | can be set before calling bind(). | |
98069ff4 | 81 | |
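A passive socket can register several service codes in one call. The sketch below only prepares the option buffer; it assumes (this is not stated above) that the kernel expects each 32-bit code in network byte order, and the names ``pack_service_codes`` and ``EXAMPLE_MAX_SERVICE_CODES`` are illustrative.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_SERVICE_CODES 32    /* limit quoted in the text above */

/*
 * Convert n host-order service codes into out[], in network byte order
 * (assumed wire format). Returns the option length in bytes to pass to
 * setsockopt(), or 0 if n is outside 1..32.
 */
static size_t pack_service_codes(const uint32_t *codes, size_t n,
                                 uint32_t out[EXAMPLE_MAX_SERVICE_CODES])
{
    if (n == 0 || n > EXAMPLE_MAX_SERVICE_CODES)
        return 0;
    for (size_t i = 0; i < n; i++)
        out[i] = htonl(codes[i]);
    return n * sizeof(uint32_t);
}
```

The returned length and the ``out`` buffer would be passed as optlen and optval to setsockopt() with DCCP_SOCKOPT_SERVICE.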
7c559a9e GR |
82 | DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet |
83 | size (application payload size) in bytes, see RFC 4340, section 14. | |
84 | ||
d90ebcbf | 85 | DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs |
69a6a0b3 GR |
86 | supported by the endpoint. The option value is an array of type uint8_t whose |
87 | size is passed as option length. The minimum array size is 4 elements; the | |
88 | value returned in the optlen argument always reflects the true number of | |
89 | built-in CCIDs. | |
d90ebcbf | 90 | |
b20a9c24 GR |
91 | DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same |
92 | time, combining the operation of the next two socket options. This option is | |
c98be0c9 | 93 | preferable to the latter two, since applications will often use the same
b20a9c24 GR |
94 | type of CCID for both directions; and mixed use of CCIDs is not currently well |
95 | understood. This socket option takes as argument at least one uint8_t value, or | |
96 | an array of uint8_t values, which must match available CCIDs (see above). CCIDs | |
97 | must be registered on the socket before calling connect() or listen(). | |
98 | ||
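Because the values given to DCCP_SOCKOPT_CCID must match available CCIDs, an application may want to filter its preference list first. The following sketch does this on plain arrays; ``ccid_available`` and ``filter_ccids`` are hypothetical helpers, and ``avail`` stands for an array as it might be returned by DCCP_SOCKOPT_AVAILABLE_CCIDS.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if ccid appears in avail[0..n-1], otherwise 0. */
static int ccid_available(const uint8_t *avail, size_t n, uint8_t ccid)
{
    for (size_t i = 0; i < n; i++)
        if (avail[i] == ccid)
            return 1;
    return 0;
}

/*
 * Copy those entries of wanted[] that are available into out[], keeping
 * the preference order. out must have room for nwanted entries.
 * Returns the number of entries written.
 */
static size_t filter_ccids(const uint8_t *wanted, size_t nwanted,
                           const uint8_t *avail, size_t navail,
                           uint8_t *out)
{
    size_t k = 0;

    for (size_t i = 0; i < nwanted; i++)
        if (ccid_available(avail, navail, wanted[i]))
            out[k++] = wanted[i];
    return k;
}
```

If the result is non-empty, ``out`` and its length can serve as the uint8_t array argument described above.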
99 | DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets | |
100 | the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. | |
33155bac | 101 | Please note that the getsockopt argument type here is ``int``, not uint8_t. |
b20a9c24 GR |
102 | |
103 | DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. | |
104 | ||
b8599d20 GR |
105 | DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold |
106 | timewait state when closing the connection (RFC 4340, 8.3). The usual case is | |
107 | that the closing server sends a CloseReq, whereupon the client holds timewait | |
108 | state. When this boolean socket option is on, the server sends a Close instead | |
109 | and will enter TIMEWAIT. This option must be set after accept() returns. | |
110 | ||
6f4e5fff GR |
111 | DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the |
112 | partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums | |
113 | always cover the entire packet and that only fully covered application data is | |
114 | accepted by the receiver. Hence, when using this feature on the sender, it must | |
115 | be enabled at the receiver too, with a suitable choice of CsCov. | |
116 | ||
117 | DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the | |
118 | range 0..15 are acceptable. The default setting is 0 (full coverage), | |
119 | values between 1..15 indicate partial coverage. | |
33155bac | 120 | |
2bfd754d | 121 | DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it |
6f4e5fff GR |
122 | sets a threshold, where again values 0..15 are acceptable. The default |
123 | of 0 means that all packets with a partial coverage will be discarded. | |
124 | Values in the range 1..15 indicate that packets with at least this | |
125 | coverage value are also acceptable. The higher the number, the more | |
2bfd754d GR |
126 | restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage |
127 | settings are inherited by the child socket after accept(). | |
6f4e5fff | 128 | |
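The meaning of the two coverage values can be made concrete with a small sketch. It follows the rule in RFC 4340, sec. 9.2, that a CsCov of 1 covers no application data and values 2..15 cover the first (CsCov - 1) * 4 bytes; the function names are illustrative, and the acceptance check is a reading of the threshold semantics described above rather than the kernel's own code.

```c
/*
 * Bytes of application data covered by the checksum for a given sender
 * CsCov value. Returns -1 for CsCov 0, which means "entire packet".
 */
static int cscov_covered_bytes(unsigned int cscov)
{
    if (cscov == 0)
        return -1;
    return (int)(cscov - 1) * 4;
}

/*
 * Receiver-side check: min_cscov is the DCCP_SOCKOPT_RECV_CSCOV
 * threshold, pkt_cscov the CsCov value carried by an incoming packet.
 * Returns 1 if the packet would be acceptable, 0 otherwise.
 */
static int cscov_acceptable(unsigned int min_cscov, unsigned int pkt_cscov)
{
    if (pkt_cscov == 0)     /* fully covered packets always pass */
        return 1;
    if (min_cscov == 0)     /* default: partial coverage is discarded */
        return 0;
    return pkt_cscov >= min_cscov;
}
```

For example, a sender value of 15 covers the first 56 bytes of application data, and a receiver threshold of 4 rejects packets sent with a CsCov of 3.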
f2645101 GR |
129 | The following two options apply to CCID 3 exclusively and are getsockopt()-only. |
130 | In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned. | |
33155bac | 131 | |
f2645101 | 132 | DCCP_SOCKOPT_CCID_RX_INFO |
33155bac | 133 | Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and |
f2645101 | 134 | optlen must be set to at least sizeof(struct tfrc_rx_info). |
33155bac | 135 | |
f2645101 | 136 | DCCP_SOCKOPT_CCID_TX_INFO |
33155bac | 137 | Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and |
f2645101 GR |
138 | optlen must be set to at least sizeof(struct tfrc_tx_info). |
139 | ||
8e8c71f1 GR |
140 | On unidirectional connections it is useful to close the unused half-connection |
141 | via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. | |
f2645101 | 142 | |
4886fcad | 143 | |
2e2e9e92 GR |
144 | Sysctl variables |
145 | ================ | |
146 | Several DCCP default parameters can be managed by the following sysctls | |
147 | (sysctl net.dccp.default or /proc/sys/net/dccp/default): | |
148 | ||
149 | request_retries | |
150 | The number of active connection initiation retries (the number of | |
151 | Requests minus one) before timing out. It also governs | |
152 | the behaviour of the other, passive side: it sets | |
153 | the number of times DCCP repeats sending a Response when the initial | |
154 | handshake does not progress from RESPOND to OPEN (i.e. when no Ack | |
155 | is received after the initial Request). This value should be greater | |
156 | than 0; a value below 10 is suggested. Analogue of tcp_syn_retries. | |
157 | ||
158 | retries1 | |
159 | How often a DCCP Response is retransmitted until the listening DCCP | |
160 | side considers its connecting peer dead. Analogue of tcp_retries1. | |
161 | ||
162 | retries2 | |
163 | The number of times a general DCCP packet is retransmitted. This has | |
164 | importance for retransmitted acknowledgments and feature negotiation, | |
165 | data packets are never retransmitted. Analogue of tcp_retries2. | |
166 | ||
2e2e9e92 | 167 | tx_ccid = 2 |
0049bab5 GR |
168 | Default CCID for the sender-receiver half-connection. Depending on the |
169 | choice of CCID, the Send Ack Vector feature is enabled automatically. | |
2e2e9e92 GR |
170 | |
171 | rx_ccid = 2 | |
0049bab5 | 172 | Default CCID for the receiver-sender half-connection; see tx_ccid. |
2e2e9e92 GR |
173 | |
174 | seq_window = 100 | |
792b4878 GR |
175 | The initial sequence window (sec. 7.5.2) of the sender. This influences |
176 | the local ackno validity and the remote seqno validity windows (7.5.1). | |
bfbb2346 | 177 | Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. |
2e2e9e92 | 178 | |
82e3ab9d IM |
179 | tx_qlen = 5 |
180 | The size of the transmit buffer in packets. A value of 0 corresponds | |
181 | to an unbounded transmit buffer. | |
182 | ||
a94f0f97 GR |
183 | sync_ratelimit = 125 ms |
184 | The timeout between subsequent DCCP-Sync packets sent in response to | |
185 | sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit | |
186 | of this parameter is milliseconds; a value of 0 disables rate-limiting. | |
187 | ||
4886fcad | 188 | |
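An application can inspect these defaults by reading the corresponding file under /proc/sys/net/dccp/default. The helper below is a minimal sketch with an illustrative name; it assumes the file contains a single decimal integer and simply reports failure if the file is missing (for instance when DCCP support is not loaded).

```c
#include <stdio.h>

/*
 * Read one decimal integer from a sysctl file into *out.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int read_sysctl_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A call such as ``read_sysctl_long("/proc/sys/net/dccp/default/tx_qlen", &v)`` would then yield the default transmit queue length.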
c2814901 GR |
189 | IOCTLs
190 | ====== | |
191 | FIONREAD | |
33155bac | 192 | Works as in udp(7): returns in the ``int`` argument pointer the size of |
c2814901 GR |
193 | the next pending datagram in bytes, or 0 when no datagram is pending. |
194 | ||
749c08f8 RS |
195 | SIOCOUTQ |
196 | Returns the number of unsent data bytes in the socket send queue as ``int`` | |
197 | into the buffer specified by the argument pointer. | |
4886fcad GR |
198 | |
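Both ioctls take a pointer to ``int``, so thin wrappers are enough to use them. The following sketch uses illustrative function names and assumes ``fd`` is an open socket; SIOCOUTQ is taken from <linux/sockios.h>.

```c
#include <sys/ioctl.h>
#include <linux/sockios.h>      /* SIOCOUTQ */

/* Size in bytes of the next pending datagram, 0 if none, -1 on error. */
static int next_datagram_size(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}

/* Number of unsent data bytes in the send queue, or -1 on error. */
static int unsent_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, SIOCOUTQ, &n) < 0)
        return -1;
    return n;
}
```

A receiver could call ``next_datagram_size()`` to size its buffer before recv(), and a sender could poll ``unsent_bytes()`` to see whether queued data has drained.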
199 | Other tunables | |
200 | ============== | |
201 | Per-route rto_min support | |
202 | CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value | |
203 | of the RTO timer. This setting can be modified via the 'rto_min' option | |
33155bac MCC |
204 | of iproute2; for example:: |
205 | ||
4886fcad GR |
206 | > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 |
207 | > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 | |
208 | > ip route show dev wlan0 | |
33155bac | 209 | |
89858ad1 GR |
210 | CCID-3 also supports the rto_min setting: it is used to define the lower |
211 | bound for the expiry of the nofeedback timer. This can be useful on LANs | |
212 | with very low RTTs (e.g., loopback, Gbit ethernet). | |
4886fcad GR |
213 | |
214 | ||
98069ff4 IM |
215 | Notes |
216 | ===== | |
ddfe10b8 | 217 | At present, DCCP does not traverse many NAT devices successfully. This is
126acd5b | 218 | because the checksum covers the pseudo-header, as in TCP and UDP. Linux NAT
ddfe10b8 | 219 | support for DCCP has been added. |