Commit | Line | Data |
---|---|---|
06bfa47e MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============ | |
4 | Timestamping | |
5 | ============ | |
6 | ||
8fe2f761 WB |
7 | |
8 | 1. Control Interfaces | |
06bfa47e | 9 | ===================== |
8fe2f761 WB |
10 | |
11 | The interfaces for receiving network packet timestamps are: | |
cb9eff09 | 12 | |
06bfa47e | 13 | SO_TIMESTAMP |
8fe2f761 WB |
14 | Generates a timestamp for each incoming packet in (not necessarily |
15 | monotonic) system time. Reports the timestamp via recvmsg() in a | |
9dd49211 DD |
16 | control message in usec resolution. |
17 | SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD | |
18 | based on the architecture type and time_t representation of libc. | |
19 | Control message format is in struct __kernel_old_timeval for | |
20 | SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for | |
21 | SO_TIMESTAMP_NEW options respectively. | |
cb9eff09 | 22 | |
06bfa47e | 23 | SO_TIMESTAMPNS |
8fe2f761 | 24 | Same timestamping mechanism as SO_TIMESTAMP, but reports the |
9dd49211 DD |
25 | timestamp as struct timespec in nsec resolution. |
26 | SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD | |
27 | based on the architecture type and time_t representation of libc. | |
28 | Control message format is in struct timespec for SO_TIMESTAMPNS_OLD | |
29 | and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options | |
30 | respectively. | |
cb9eff09 | 31 | |
06bfa47e | 32 | IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] |
8fe2f761 WB |
33 | Only for multicast: approximate transmit timestamp obtained by |
34 | reading the looped packet receive timestamp. | |
cb9eff09 | 35 | |
06bfa47e | 36 | SO_TIMESTAMPING |
8fe2f761 WB |
37 | Generates timestamps on reception, transmission or both. Supports |
38 | multiple timestamp sources, including hardware. Supports generating | |
39 | timestamps for stream sockets. | |
cb9eff09 | 40 | |
cb9eff09 | 41 | |
06bfa47e MCC |
42 | 1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW) |
43 | ------------------------------------------------------------- | |
adca4767 | 44 | |
8fe2f761 WB |
45 | This socket option enables timestamping of datagrams on the reception |
46 | path. Because the destination socket, if any, is not known early in | |
47 | the network stack, the feature has to be enabled for all packets. The | |
48 | same is true for all early receive timestamp options. | |
adca4767 | 49 | |
8fe2f761 WB |
50 | For interface details, see `man 7 socket`. |
51 | ||
9dd49211 DD |
52 | Always use SO_TIMESTAMP_NEW to get timestamps in |
53 | struct __kernel_sock_timeval format. | |
8fe2f761 | 54 | |
9dd49211 DD |
55 | SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038 |
56 | on 32-bit machines. | |
57 | ||
5daf8384 JL |
58 | 1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW) |
59 | ------------------------------------------------------------------- | |
8fe2f761 WB |
60 | |
61 | This option is identical to SO_TIMESTAMP except for the returned data type. | |
62 | Its struct timespec allows for higher resolution (ns) timestamps than the | |
63 | timeval of SO_TIMESTAMP (us). | |
64 | ||
9dd49211 DD |
65 | Always use SO_TIMESTAMPNS_NEW to get timestamps in |
66 | struct __kernel_timespec format. | |
67 | ||
68 | SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038 | |
69 | on 32-bit machines. | |
8fe2f761 | 70 | |
06bfa47e MCC |
71 | 1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW) |
72 | ---------------------------------------------------------------------- | |
8fe2f761 WB |
73 | |
74 | Supports multiple types of timestamp requests. As a result, this | |
06bfa47e | 75 | socket option takes a bitmap of flags, not a boolean. In:: |
8fe2f761 | 76 | |
5e34fa23 | 77 | err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); |
8fe2f761 WB |
78 | |
79 | val is an integer with any of the following bits set. Setting any other | |
80 | bit returns EINVAL and does not change the current state. | |
adca4767 | 81 | |
fd91e12f SHY |
82 | The socket option configures timestamp generation for individual |
83 | sk_buffs (1.3.1), timestamp reporting to the socket's error | |
84 | queue (1.3.2) and options (1.3.3). Timestamp generation can also | |
85 | be enabled for individual sendmsg calls using cmsg (1.3.4). | |
86 | ||
adca4767 | 87 | |
8fe2f761 | 88 | 1.3.1 Timestamp Generation |
06bfa47e | 89 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
adca4767 | 90 | |
8fe2f761 WB |
91 | Some bits are requests to the stack to try to generate timestamps. Any |
92 | combination of them is valid. Changes to these bits apply to newly | |
93 | created packets, not to packets already in the stack. As a result, it | |
94 | is possible to selectively request timestamps for a subset of packets | |
95 | (e.g., for sampling) by embedding a send() call between two setsockopt | |
96 | calls, one to enable timestamp generation and one to disable it. | |
97 | Timestamps may also be generated for reasons other than being | |
98 | requested by a particular socket, such as when receive timestamping is | |
99 | enabled system wide, as explained earlier. | |
adca4767 | 100 | |
8fe2f761 WB |
101 | SOF_TIMESTAMPING_RX_HARDWARE: |
102 | Request rx timestamps generated by the network adapter. | |
103 | ||
104 | SOF_TIMESTAMPING_RX_SOFTWARE: | |
105 | Request rx timestamps when data enters the kernel. These timestamps | |
106 | are generated just after a device driver hands a packet to the | |
107 | kernel receive stack. | |
108 | ||
109 | SOF_TIMESTAMPING_TX_HARDWARE: | |
fd91e12f SHY |
110 | Request tx timestamps generated by the network adapter. This flag |
111 | can be enabled via both socket options and control messages. | |
8fe2f761 WB |
112 | |
113 | SOF_TIMESTAMPING_TX_SOFTWARE: | |
114 | Request tx timestamps when data leaves the kernel. These timestamps | |
115 | are generated in the device driver as close as possible, but always | |
116 | prior to, passing the packet to the network interface. Hence, they | |
117 | require driver support and may not be available for all devices. | |
fd91e12f SHY |
118 | This flag can be enabled via both socket options and control messages. |
119 | ||
8fe2f761 WB |
120 | SOF_TIMESTAMPING_TX_SCHED: |
121 | Request tx timestamps prior to entering the packet scheduler. Kernel | |
122 | transmit latency is, if long, often dominated by queuing delay. The | |
123 | difference between this timestamp and one taken at | |
124 | SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent | |
125 | of protocol processing. The latency incurred in protocol | |
126 | processing, if any, can be computed by subtracting a userspace | |
127 | timestamp taken immediately before send() from this timestamp. On | |
128 | machines with virtual devices where a transmitted packet travels | |
129 | through multiple devices and, hence, multiple packet schedulers, | |
130 | a timestamp is generated at each layer. This allows for fine | |
fd91e12f SHY |
131 | grained measurement of queuing delay. This flag can be enabled |
132 | via both socket options and control messages. | |
8fe2f761 WB |
133 | |
134 | SOF_TIMESTAMPING_TX_ACK: | |
135 | Request tx timestamps when all data in the send buffer has been | |
136 | acknowledged. This only makes sense for reliable protocols. It is | |
137 | currently only implemented for TCP. For that protocol, it may | |
138 | over-report measurement, because the timestamp is generated when all | |
139 | data up to and including the buffer at send() was acknowledged: the | |
140 | cumulative acknowledgment. The mechanism ignores SACK and FACK. | |
fd91e12f | 141 | This flag can be enabled via both socket options and control messages. |
8fe2f761 WB |
142 | |
143 | ||
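The sampling pattern described above (a send() call between two setsockopt calls) might be sketched as follows. The helper names are hypothetical. The sketch keeps SOF_TIMESTAMPING_SOFTWARE set in both calls, on the assumption that the application wants the timestamp to remain reportable after generation is switched off again.

```c
#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Set the SO_TIMESTAMPING bitmap on fd; returns 0 or -1 (errno set). */
static int set_timestamping(int fd, int flags)
{
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                      &flags, sizeof(flags));
}

/*
 * Send one buffer with software tx timestamp generation requested,
 * then disable generation so that later sends are not timestamped.
 * Returns the result of send(), or -1 if a setsockopt call failed.
 */
static ssize_t send_sampled(int fd, const void *buf, size_t len)
{
    const int report = SOF_TIMESTAMPING_SOFTWARE;
    ssize_t ret;

    if (set_timestamping(fd, report | SOF_TIMESTAMPING_TX_SOFTWARE) < 0)
        return -1;
    ret = send(fd, buf, len, 0);
    /* Generation off again; the reporting bit stays enabled. */
    if (set_timestamping(fd, report) < 0)
        return -1;
    return ret;
}
```

This costs two extra system calls per sampled send, which is the overhead that the cmsg interface in section 1.3.4 avoids.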
144 | 1.3.2 Timestamp Reporting | |
06bfa47e | 145 | ^^^^^^^^^^^^^^^^^^^^^^^^^ |
8fe2f761 WB |
146 | |
147 | The other three bits control which timestamps will be reported in a | |
148 | generated control message. Changes to the bits take immediate | |
149 | effect at the timestamp reporting locations in the stack. Timestamps | |
150 | are only reported for packets that also have the relevant timestamp | |
151 | generation request set. | |
152 | ||
153 | SOF_TIMESTAMPING_SOFTWARE: | |
154 | Report any software timestamps when available. | |
155 | ||
156 | SOF_TIMESTAMPING_SYS_HARDWARE: | |
157 | This option is deprecated and ignored. | |
158 | ||
159 | SOF_TIMESTAMPING_RAW_HARDWARE: | |
160 | Report hardware timestamps as generated by | |
161 | SOF_TIMESTAMPING_TX_HARDWARE or SOF_TIMESTAMPING_RX_HARDWARE when available. | |
162 | ||
163 | ||
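Because generation and reporting are requested with separate bits, a common mistake is to set one without the other. The hypothetical checker below illustrates the pairing. It assumes that software-generated timestamps (RX_SOFTWARE, TX_SOFTWARE, TX_SCHED, TX_ACK) are reported through SOF_TIMESTAMPING_SOFTWARE and that hardware-generated ones (RX_HARDWARE, TX_HARDWARE) are reported through SOF_TIMESTAMPING_RAW_HARDWARE.

```c
#include <linux/net_tstamp.h>

/* Generation bits assumed to be reported via SOF_TIMESTAMPING_SOFTWARE. */
#define TSTAMP_GEN_SW (SOF_TIMESTAMPING_RX_SOFTWARE | \
                       SOF_TIMESTAMPING_TX_SOFTWARE | \
                       SOF_TIMESTAMPING_TX_SCHED |    \
                       SOF_TIMESTAMPING_TX_ACK)

/* Generation bits assumed to be reported via SOF_TIMESTAMPING_RAW_HARDWARE. */
#define TSTAMP_GEN_HW (SOF_TIMESTAMPING_RX_HARDWARE | \
                       SOF_TIMESTAMPING_TX_HARDWARE)

/*
 * Return 1 if every class of timestamp requested in flags also has its
 * reporting bit set, 0 if some requested timestamps would never be
 * delivered to the socket.
 */
static int tstamp_flags_reportable(unsigned int flags)
{
    if ((flags & TSTAMP_GEN_SW) && !(flags & SOF_TIMESTAMPING_SOFTWARE))
        return 0;
    if ((flags & TSTAMP_GEN_HW) && !(flags & SOF_TIMESTAMPING_RAW_HARDWARE))
        return 0;
    return 1;
}
```

A value with only reporting bits set passes the check, which matches the case where generation is requested per call through control messages.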
164 | 1.3.3 Timestamp Options | |
06bfa47e | 165 | ^^^^^^^^^^^^^^^^^^^^^^^ |
8fe2f761 | 166 | |
829ae9d6 | 167 | The interface supports the following options: |
8fe2f761 WB |
168 | |
169 | SOF_TIMESTAMPING_OPT_ID: | |
8fe2f761 WB |
170 | Generate a unique identifier along with each packet. A process can |
171 | have multiple concurrent timestamping requests outstanding. Packets | |
172 | can be reordered in the transmit path, for instance in the packet | |
173 | scheduler. In that case timestamps will be queued onto the error | |
cbd3aad5 WB |
174 | queue out of order from the original send() calls. It is then not always |
175 | possible to uniquely match timestamps to the original send() calls | |
176 | based on timestamp order or payload inspection alone. | |
177 | ||
178 | This option associates each packet at send() with a unique | |
179 | identifier and returns that along with the timestamp. The identifier | |
180 | is derived from a per-socket u32 counter (that wraps). For datagram | |
181 | sockets, the counter increments with each sent packet. For stream | |
b534dc46 WB |
182 | sockets, it increments with every byte. For stream sockets, also set |
183 | SOF_TIMESTAMPING_OPT_ID_TCP, see the section below. | |
cbd3aad5 WB |
184 | |
185 | The counter starts at zero. It is initialized the first time that | |
186 | the socket option is enabled. It is reset each time the option is | |
187 | enabled after having been disabled. Resetting the counter does not | |
188 | change the identifiers of existing packets in the system. | |
8fe2f761 WB |
189 | |
190 | This option is implemented only for transmit timestamps. There, the | |
191 | timestamp is always looped along with a struct sock_extended_err. | |
138a7f49 | 192 | The option modifies field ee_data to pass an id that is unique |
8fe2f761 | 193 | among all possibly concurrently outstanding timestamp requests for |
cbd3aad5 | 194 | that socket. |
8fe2f761 | 195 | |
b534dc46 WB |
196 | SOF_TIMESTAMPING_OPT_ID_TCP: |
197 | Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP | |
198 | timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the | |
199 | counter increments for stream sockets, but its starting point is | |
200 | not entirely trivial. This option fixes that. | |
201 | ||
202 | For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should | |
203 | always be set too. On datagram sockets the option has no effect. | |
204 | ||
205 | A reasonable expectation is that the counter is reset to zero with | |
206 | the system call, so that a subsequent write() of N bytes generates | |
207 | a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP | |
208 | implements this behavior under all conditions. | |
209 | ||
210 | SOF_TIMESTAMPING_OPT_ID without modifier often reports the same, | |
211 | especially when the socket option is set when no data is in | |
212 | transmission. If data is being transmitted, it may be off by the | |
213 | length of the output queue (SIOCOUTQ). | |
214 | ||
215 | The difference is due to being based on snd_una versus write_seq. | |
216 | snd_una is the offset in the stream acknowledged by the peer. This | |
217 | depends on factors outside of process control, such as network RTT. | |
218 | write_seq is the last byte written by the process. This offset is | |
219 | not affected by external inputs. | |
220 | ||
221 | The difference is subtle and unlikely to be noticed when configured | |
222 | at initial socket creation, when no data is queued or sent. But | |
223 | SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of | |
224 | when the socket option is set. | |
8fe2f761 | 225 | |
829ae9d6 | 226 | SOF_TIMESTAMPING_OPT_CMSG: |
829ae9d6 WB |
227 | Support recv() cmsg for all timestamped packets. Control messages |
228 | are already supported unconditionally on all packets with receive | |
229 | timestamps and on IPv6 packets with transmit timestamp. This option | |
230 | extends them to IPv4 packets with transmit timestamp. One use case | |
231 | is to correlate packets with their egress device, by enabling socket | |
232 | option IP_PKTINFO simultaneously. | |
233 | ||
234 | ||
49ca0d8b | 235 | SOF_TIMESTAMPING_OPT_TSONLY: |
49ca0d8b WB |
236 | Applies to transmit timestamps only. Makes the kernel return the |
237 | timestamp as a cmsg alongside an empty packet, as opposed to | |
238 | alongside the original packet. This reduces the amount of memory | |
239 | charged to the socket's receive budget (SO_RCVBUF) and delivers | |
240 | the timestamp even if sysctl net.core.tstamp_allow_data is 0. | |
241 | This option disables SOF_TIMESTAMPING_OPT_CMSG. | |
242 | ||
1c885808 | 243 | SOF_TIMESTAMPING_OPT_STATS: |
1c885808 FY |
244 | Optional stats that are obtained along with the transmit timestamps. |
245 | It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the | |
246 | transmit timestamp is available, the stats are available in a | |
247 | separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a | |
248 | list of TLVs (struct nlattr). These stats allow the | |
249 | application to associate various transport layer stats with | |
250 | the transmit timestamps, such as how long a certain block of | |
251 | data was limited by peer's receiver window. | |
49ca0d8b | 252 | |
aad9c8c4 | 253 | SOF_TIMESTAMPING_OPT_PKTINFO: |
aad9c8c4 ML |
254 | Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming |
255 | packets with hardware timestamps. The message contains struct | |
256 | scm_ts_pktinfo, which supplies the index of the real interface which | |
257 | received the packet and its length at layer 2. A valid (non-zero) | |
258 | interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is | |
259 | enabled and the driver is using NAPI. The struct also contains two | |
260 | other fields, but they are reserved and undefined. | |
261 | ||
b50a5c70 | 262 | SOF_TIMESTAMPING_OPT_TX_SWHW: |
b50a5c70 ML |
263 | Request both hardware and software timestamps for outgoing packets |
264 | when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE | |
265 | are enabled at the same time. If both timestamps are generated, | |
266 | two separate messages will be looped to the socket's error queue, | |
267 | each containing just one timestamp. | |
268 | ||
49ca0d8b WB |
269 | New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to |
270 | disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate | |
271 | regardless of the setting of sysctl net.core.tstamp_allow_data. | |
272 | ||
273 | An exception is when a process needs additional cmsg data, for | |
274 | instance SOL_IP/IP_PKTINFO to detect the egress network interface. | |
275 | Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on | |
276 | having access to the contents of the original packet, so cannot be | |
277 | combined with SOF_TIMESTAMPING_OPT_TSONLY. | |
278 | ||
279 | ||
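To match a returned identifier to a send call, an application can predict the value that SOF_TIMESTAMPING_OPT_ID will report. The hypothetical helpers below sketch that arithmetic. The stream variant assumes the SOF_TIMESTAMPING_OPT_ID_TCP semantics described above (counter zero when the option is set, one increment per byte); the datagram variant assumes the first packet sent after enabling the option is reported with identifier 0. The caller is assumed to track how many bytes or packets it has sent since enabling the option.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Expected ee_data for a stream write of len bytes (len > 0) issued
 * after bytes_before bytes were already written. The key identifies
 * the last byte of the write and wraps as a u32.
 */
static uint32_t expected_stream_tskey(uint64_t bytes_before, size_t len)
{
    return (uint32_t)(bytes_before + len - 1);
}

/*
 * Expected ee_data for a datagram sent after packets_before earlier
 * datagrams. The counter advances once per packet and wraps as a u32.
 */
static uint32_t expected_dgram_tskey(uint64_t packets_before)
{
    return (uint32_t)packets_before;
}
```

For example, the first write() of 100 bytes on a stream socket is expected to be reported with identifier 99.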
fd91e12f | 280 | 1.3.4. Enabling timestamps via control messages |
06bfa47e | 281 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
fd91e12f SHY |
282 | |
283 | In addition to socket options, timestamp generation can be requested | |
284 | per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). | |
285 | Using this feature, applications can sample timestamps per sendmsg() | |
286 | without paying the overhead of enabling and disabling timestamps via | |
06bfa47e | 287 | setsockopt:: |
fd91e12f SHY |
288 | |
289 | struct msghdr *msg; | |
290 | ... | |
291 | cmsg = CMSG_FIRSTHDR(msg); | |
292 | cmsg->cmsg_level = SOL_SOCKET; | |
293 | cmsg->cmsg_type = SO_TIMESTAMPING; | |
294 | cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); | |
295 | *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | | |
296 | SOF_TIMESTAMPING_TX_SOFTWARE | | |
297 | SOF_TIMESTAMPING_TX_ACK; | |
298 | err = sendmsg(fd, msg, 0); | |
299 | ||
300 | The SOF_TIMESTAMPING_TX_* flags set via cmsg will override | |
301 | the SOF_TIMESTAMPING_TX_* flags set via setsockopt. | |
302 | ||
303 | Moreover, applications must still enable timestamp reporting via | |
06bfa47e | 304 | setsockopt to receive timestamps:: |
fd91e12f SHY |
305 | |
306 | __u32 val = SOF_TIMESTAMPING_SOFTWARE | | |
307 | SOF_TIMESTAMPING_OPT_ID /* or any other flag */; | |
5e34fa23 | 308 | err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); |
fd91e12f SHY |
309 | |
310 | ||
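The fragment above leaves the allocation of the control buffer to the reader. A more complete sketch, with hypothetical helper names, could look like this. It uses a union to obtain a correctly aligned buffer for a single SO_TIMESTAMPING control message.

```c
#include <linux/net_tstamp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for one SO_TIMESTAMPING cmsg. */
union tstamp_ctrl {
    char buf[CMSG_SPACE(sizeof(uint32_t))];
    struct cmsghdr align;
};

/* Attach ctrl to msg and store tx_flags in a SO_TIMESTAMPING cmsg. */
static void set_tstamp_cmsg(struct msghdr *msg, union tstamp_ctrl *ctrl,
                            uint32_t tx_flags)
{
    struct cmsghdr *cmsg;

    memset(ctrl, 0, sizeof(*ctrl));
    msg->msg_control = ctrl->buf;
    msg->msg_controllen = sizeof(ctrl->buf);

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = CMSG_LEN(sizeof(tx_flags));
    memcpy(CMSG_DATA(cmsg), &tx_flags, sizeof(tx_flags));
}

/* Send buf on fd, requesting the given SOF_TIMESTAMPING_TX_* flags. */
static ssize_t send_with_tx_tstamp(int fd, const void *buf, size_t len,
                                   uint32_t tx_flags)
{
    union tstamp_ctrl ctrl;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    set_tstamp_cmsg(&msg, &ctrl, tx_flags);
    return sendmsg(fd, &msg, 0);
}
```

The socket still needs a reporting flag such as SOF_TIMESTAMPING_SOFTWARE set through setsockopt before the resulting timestamps are delivered.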
8fe2f761 | 311 | 1.4 Bytestream Timestamps |
06bfa47e | 312 | ------------------------- |
8fe2f761 WB |
313 | |
314 | The SO_TIMESTAMPING interface supports timestamping of bytes in a | |
315 | bytestream. Each request is interpreted as a request for when the | |
316 | entire contents of the buffer has passed a timestamping point. That | |
317 | is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record | |
318 | when all bytes have reached the device driver, regardless of how | |
319 | many packets the data has been converted into. | |
320 | ||
321 | In general, bytestreams have no natural delimiters and therefore | |
322 | correlating a timestamp with data is non-trivial. A range of bytes | |
323 | may be split across segments, any segments may be merged (possibly | |
324 | coalescing sections of previously segmented buffers associated with | |
325 | independent send() calls). Segments can be reordered and the same | |
326 | byte range can coexist in multiple segments for protocols that | |
327 | implement retransmissions. | |
328 | ||
329 | It is essential that all timestamps implement the same semantics, | |
330 | regardless of these possible transformations, as otherwise they are | |
331 | incomparable. Handling "rare" corner cases differently from the | |
332 | simple case (a 1:1 mapping from buffer to skb) is insufficient | |
333 | because performance debugging often needs to focus on such outliers. | |
334 | ||
335 | In practice, timestamps can be correlated with segments of a | |
336 | bytestream consistently, if both semantics of the timestamp and the | |
337 | timing of measurement are chosen correctly. This challenge is no | |
338 | different from deciding on a strategy for IP fragmentation. There, the | |
339 | definition is that only the first fragment is timestamped. For | |
340 | bytestreams, we chose that a timestamp is generated only when all | |
341 | bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to | |
342 | implement and reason about. An implementation that has to take into | |
343 | account SACK would be more complex due to possible transmission holes | |
344 | and out of order arrival. | |
345 | ||
346 | On the host, TCP can also break the simple 1:1 mapping from buffer to | |
347 | skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The | |
348 | implementation ensures correctness in all cases by tracking the | |
349 | individual last byte passed to send(), even if it is no longer the | |
350 | last byte after an skbuff extend or merge operation. It stores the | |
351 | relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff | |
352 | has only one such field, only one timestamp can be generated. | |
353 | ||
354 | In rare cases, a timestamp request can be missed if two requests are | |
355 | collapsed onto the same skb. A process can detect this situation by | |
356 | enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at | |
357 | send time with the value returned for each timestamp. It can prevent | |
358 | the situation by always flushing the TCP stack in between requests, | |
359 | for instance by enabling TCP_NODELAY and disabling TCP_CORK and | |
360 | autocork. | |
361 | ||
362 | These precautions ensure that the timestamp is generated only when all | |
363 | bytes have passed a timestamp point, assuming that the network stack | |
364 | itself does not reorder the segments. The stack indeed tries to avoid | |
365 | reordering. The one exception is under administrator control: it is | |
366 | possible to construct a packet scheduler configuration that delays | |
367 | segments from the same stream differently. Such a setup would be | |
368 | unusual. | |
369 | ||
370 | ||
371 | 2 Data Interfaces | |
06bfa47e | 372 | ================== |
8fe2f761 WB |
373 | |
374 | Timestamps are read using the ancillary data feature of recvmsg(). | |
375 | See `man 3 cmsg` for details of this interface. The socket manual | |
376 | page (`man 7 socket`) describes how timestamps generated with | |
377 | SO_TIMESTAMP and SO_TIMESTAMPNS can be retrieved. | |
378 | ||
379 | ||
380 | 2.1 SCM_TIMESTAMPING records | |
06bfa47e | 381 | ---------------------------- |
8fe2f761 WB |
382 | |
383 | These timestamps are returned in a control message with cmsg_level | |
384 | SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type | |
69298698 | 385 | |
06bfa47e | 386 | For SO_TIMESTAMPING_OLD:: |
9dd49211 | 387 | |
06bfa47e MCC |
388 | struct scm_timestamping { |
389 | struct timespec ts[3]; | |
390 | }; | |
cb9eff09 | 391 | |
06bfa47e | 392 | For SO_TIMESTAMPING_NEW:: |
9dd49211 | 393 | |
06bfa47e MCC |
394 | struct scm_timestamping64 { |
395 | struct __kernel_timespec ts[3]; | |
| }; | |
9dd49211 DD |
396 | |
397 | Always use SO_TIMESTAMPING_NEW to get timestamps in | |
398 | struct scm_timestamping64 format. | |
399 | ||
400 | SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038 | |
401 | on 32-bit machines. | |
402 | ||
8fe2f761 | 403 | The structure can return up to three timestamps. This is a legacy |
67953d47 | 404 | feature. At least one field is non-zero at any time. Most timestamps |
8fe2f761 WB |
405 | are passed in ts[0]. Hardware timestamps are passed in ts[2]. |
406 | ||
407 | ts[1] used to hold hardware timestamps converted to system time. | |
408 | Instead, expose the hardware clock device on the NIC directly as | |
409 | a HW PTP clock source, to allow time conversion in userspace and | |
410 | optionally synchronize system time with a userspace PTP stack such | |
329f0041 | 411 | as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst. |
8fe2f761 | 412 | |
67953d47 ML |
413 | Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled |
414 | together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false | |
415 | software timestamp will be generated in the recvmsg() call and passed | |
416 | in ts[0] when a real software timestamp is missing. This also happens | |
417 | on hardware transmit timestamps. | |
418 | ||
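Given the field layout above, an application typically has to decide which of the three entries to use. The hypothetical helper below is one possible policy, not the only one: it prefers the hardware timestamp in ts[2] and falls back to ts[0], which also sidesteps the false software timestamp case when a hardware timestamp is present.

```c
#include <stddef.h>
#include <time.h>
#include <linux/errqueue.h>

/* A timestamp slot is considered unset when both fields are zero. */
static int ts_is_set(const struct timespec *ts)
{
    return ts->tv_sec != 0 || ts->tv_nsec != 0;
}

/*
 * Pick the timestamp to use from an SCM_TIMESTAMPING payload.
 * Returns ts[2] (hardware) if set, else ts[0] (software) if set,
 * else NULL. *is_hw reports which one was chosen.
 */
static const struct timespec *
pick_timestamp(const struct scm_timestamping *tss, int *is_hw)
{
    *is_hw = 0;
    if (ts_is_set(&tss->ts[2])) {
        *is_hw = 1;
        return &tss->ts[2];
    }
    if (ts_is_set(&tss->ts[0]))
        return &tss->ts[0];
    return NULL;
}
```

The sketch uses the legacy struct scm_timestamping; code built for SO_TIMESTAMPING_NEW would use struct scm_timestamping64 in the same way.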
8fe2f761 | 419 | 2.1.1 Transmit timestamps with MSG_ERRQUEUE |
06bfa47e | 420 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
8fe2f761 WB |
421 | |
422 | For transmit timestamps the outgoing packet is looped back to the | |
423 | socket's error queue with the send timestamp(s) attached. A process | |
424 | receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE | |
425 | set and with a msg_control buffer sufficiently large to receive the | |
426 | relevant metadata structures. The recvmsg call returns the original | |
427 | outgoing data packet with two ancillary messages attached. | |
428 | ||
429 | A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR | |
430 | embeds a struct sock_extended_err. This defines the error type. For | |
431 | timestamps, the ee_errno field is ENOMSG. The other ancillary message | |
432 | will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This | |
433 | embeds the struct scm_timestamping. | |
434 | ||
435 | ||
436 | 2.1.1.2 Timestamp types | |
06bfa47e | 437 | ~~~~~~~~~~~~~~~~~~~~~~~ |
8fe2f761 WB |
438 | |
439 | The semantics of the three struct timespec are defined by field | |
440 | ee_info in the extended error structure. It contains a value of | |
441 | type SCM_TSTAMP_* to define the actual timestamp passed in | |
442 | scm_timestamping. | |
443 | ||
444 | The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_* | |
445 | control fields discussed previously, with one exception. For legacy | |
446 | reasons, SCM_TSTAMP_SND is equal to zero and can be set for both | |
447 | SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It | |
448 | is the first if ts[2] is non-zero, the second otherwise, in which | |
449 | case the timestamp is stored in ts[0]. | |
450 | ||
451 | ||
452 | 2.1.1.3 Fragmentation | |
06bfa47e | 453 | ~~~~~~~~~~~~~~~~~~~~~ |
8fe2f761 WB |
454 | |
455 | Fragmentation of outgoing datagrams is rare, but is possible, e.g., by | |
456 | explicitly disabling PMTU discovery. If an outgoing packet is fragmented, | |
457 | then only the first fragment is timestamped and returned to the sending | |
458 | socket. | |
459 | ||
460 | ||
461 | 2.1.1.4 Packet Payload | |
06bfa47e | 462 | ~~~~~~~~~~~~~~~~~~~~~~ |
8fe2f761 WB |
463 | |
464 | The calling application is often not interested in receiving the whole | |
465 | packet payload that it passed to the stack originally: the socket | |
466 | error queue mechanism is just a method to piggyback the timestamp on. | |
467 | In this case, the application can choose to read datagrams with a | |
468 | smaller buffer, possibly even of length 0. The payload is truncated | |
469 | accordingly. Until the process calls recvmsg() on the error queue, | |
470 | however, the full packet is queued, taking up budget from SO_RCVBUF. | |
471 | ||
472 | ||
473 | 2.1.1.5 Blocking Read | |
06bfa47e | 474 | ~~~~~~~~~~~~~~~~~~~~~ |
8fe2f761 WB |
475 | |
476 | Reading from the error queue is always a non-blocking operation. To | |
477 | block waiting on a timestamp, use poll or select. poll() will return | |
478 | POLLERR in pollfd.revents if any data is ready on the error queue. | |
479 | There is no need to pass this flag in pollfd.events. This flag is | |
480 | ignored on request. See also `man 2 poll`. | |
481 | ||
482 | ||
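Blocking until a timestamp is queued can be sketched with poll() as follows. The helper name is hypothetical. Note that events is left at zero, since POLLERR is reported in revents regardless.

```c
#include <poll.h>

/*
 * Wait up to timeout_ms for data on the error queue of fd.
 * Returns 1 if POLLERR was reported, 0 if not (for instance on
 * timeout), and -1 if poll() itself failed.
 */
static int wait_for_errqueue(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = 0, .revents = 0 };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & POLLERR)) ? 1 : 0;
}
```

When the helper returns 1, the application would follow up with recvmsg() and MSG_ERRQUEUE to read the queued timestamp.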
483 | 2.1.2 Receive timestamps | |
06bfa47e | 484 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
8fe2f761 WB |
485 | |
486 | On reception, there is no reason to read from the socket error queue. | |
487 | The SCM_TIMESTAMPING ancillary data is sent along with the packet data | |
488 | on a normal recvmsg(). Since this is not a socket error, it is not | |
489 | accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case, | |
490 | the meaning of the three fields in struct scm_timestamping is | |
491 | implicitly defined. ts[0] holds a software timestamp if set, ts[1] | |
492 | is again deprecated and ts[2] holds a hardware timestamp if set. | |
493 | ||
494 | ||
495 | 3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP | |
06bfa47e | 496 | ======================================================================= |
cb9eff09 PO |
497 | |
498 | Hardware time stamping must also be initialized for each device driver | |
69298698 | 499 | that is expected to do hardware time stamping. The parameter is defined in |
06bfa47e | 500 | include/uapi/linux/net_tstamp.h as:: |
cb9eff09 | 501 | |
06bfa47e MCC |
502 | struct hwtstamp_config { |
503 | int flags; /* no flags defined right now, must be zero */ | |
504 | int tx_type; /* HWTSTAMP_TX_* */ | |
505 | int rx_filter; /* HWTSTAMP_FILTER_* */ | |
506 | }; | |
cb9eff09 PO |
507 | |
508 | Desired behavior is passed into the kernel and to a specific device by | |
509 | calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose | |
510 | ifr_data points to a struct hwtstamp_config. The tx_type and | |
511 | rx_filter are hints to the driver about what it is expected to do. If | |
512 | the requested fine-grained filtering for incoming packets is not | |
513 | supported, the driver may time stamp more than just the requested types | |
514 | of packets. | |
515 | ||
eff3cddc JK |
516 | Drivers are free to use a more permissive configuration than the requested |
517 | configuration. It is expected that drivers should only implement directly the | |
518 | most generic mode that can be supported. For example if the hardware can | |
cbb91dcb JK |
519 | support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale |
520 | HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT | |
eff3cddc JK |
521 | is more generic (and more useful to applications). |
522 | ||
cb9eff09 PO |
523 | A driver which supports hardware time stamping shall update the struct |
524 | with the actual, possibly more permissive configuration. If the | |
525 | requested packets cannot be time stamped, then nothing should be | |
526 | changed and ERANGE shall be returned (in contrast to EINVAL, which | |
527 | indicates that SIOCSHWTSTAMP is not supported at all). | |
528 | ||
529 | Only a process with admin rights may change the configuration. User | |
530 | space is responsible for ensuring that multiple processes don't interfere | |
531 | with each other and that the settings are reset. | |
532 | ||
fd468c74 BH |
533 | Any process can read the actual configuration by passing this |
534 | structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has | |
535 | not been implemented in all drivers. | |
536 | ||
06bfa47e MCC |
537 | :: |
538 | ||
539 | /* possible values for hwtstamp_config->tx_type */ | |
540 | enum { | |
541 | /* | |
542 | * no outgoing packet will need hardware time stamping; | |
543 | * should a packet arrive which asks for it, no hardware | |
544 | * time stamping will be done | |
545 | */ | |
546 | HWTSTAMP_TX_OFF, | |
547 | ||
548 | /* | |
549 | * enables hardware time stamping for outgoing packets; | |
550 | * the sender of the packet decides which are to be | |
551 | * time stamped by setting SOF_TIMESTAMPING_TX_HARDWARE | |
552 | * before sending the packet | |
553 | */ | |
554 | HWTSTAMP_TX_ON, | |
555 | }; | |
556 | ||
557 | /* possible values for hwtstamp_config->rx_filter */ | |
558 | enum { | |
559 | /* time stamp no incoming packet at all */ | |
560 | HWTSTAMP_FILTER_NONE, | |
561 | ||
562 | /* time stamp any incoming packet */ | |
563 | HWTSTAMP_FILTER_ALL, | |
564 | ||
565 | /* return value: time stamp all packets requested plus some others */ | |
566 | HWTSTAMP_FILTER_SOME, | |
567 | ||
568 | /* PTP v1, UDP, any kind of event packet */ | |
569 | HWTSTAMP_FILTER_PTP_V1_L4_EVENT, | |
570 | ||
571 | /* for the complete list of values, please check | |
572 | * the include file include/uapi/linux/net_tstamp.h | |
573 | */ | |
574 | }; | |
cb9eff09 | 575 | |
8fe2f761 | 576 | 3.1 Hardware Timestamping Implementation: Device Drivers |
06bfa47e | 577 | -------------------------------------------------------- |
cb9eff09 PO |
578 | |
579 | A driver which supports hardware time stamping must support the | |
69298698 | 580 | SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with |
fd468c74 BH |
581 | the actual values as described in the section on SIOCSHWTSTAMP. It |
582 | should also support SIOCGHWTSTAMP. | |
69298698 PL |
583 | |
584 | Time stamps for received packets must be stored in the skb. To get a pointer | |
585 | to the shared time stamp structure of the skb call skb_hwtstamps(). Then | |
06bfa47e | 586 | set the time stamps in the structure:: |
69298698 | 587 | |
06bfa47e MCC |
588 | struct skb_shared_hwtstamps { |
589 | /* hardware time stamp transformed into duration | |
590 | * since arbitrary point in time | |
591 | */ | |
592 | ktime_t hwtstamp; | |
593 | }; | |
cb9eff09 PO |
594 | |
595 | Time stamps for outgoing packets are to be generated as follows: | |
06bfa47e | 596 | |
2244d07b OH |
597 | - In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) |
598 | is non-zero. If yes, then the driver is expected to do hardware time | |
599 | stamping. | |
cb9eff09 | 600 | - If this is possible for the skb and requested, then declare |
2244d07b | 601 | that the driver is doing the time stamping by setting the flag |
06bfa47e | 602 | SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags, e.g. with:: |
2244d07b OH |
603 | |
604 | skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; | |
605 | ||
606 | You might want to keep a pointer to the associated skb for the next step | |
607 | and not free the skb. A driver not supporting hardware time stamping doesn't | |
608 | do that. A driver must never touch sk_buff::tstamp! It is used to store | |
609 | software generated time stamps by the network subsystem. | |
59cb89e6 JK |
610 | - Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware |
611 | as possible. skb_tx_timestamp() provides a software time stamp if requested | |
612 | and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). | |
cb9eff09 PO |
613 | - As soon as the driver has sent the packet and/or obtained a |
614 | hardware time stamp for it, it passes the time stamp back by | |
a9725e1d WB |
615 | calling skb_tstamp_tx() with the original skb and the raw | |
616 | hardware time stamp. skb_tstamp_tx() clones the original skb and | |
69298698 PL |
617 | adds the timestamps, therefore the original skb has to be freed now. |
618 | If obtaining the hardware time stamp somehow fails, then the driver | |
619 | should not fall back to software time stamping. The rationale is that | |
620 | this would occur at a later time in the processing pipeline than other | |
621 | software time stamping and therefore could lead to unexpected deltas | |
622 | between time stamps. | |
94d9f78f VO |
623 | |
624 | 3.2 Special considerations for stacked PTP Hardware Clocks | |
625 | ---------------------------------------------------------- | |
626 | ||
627 | There are situations when there may be more than one PHC (PTP Hardware Clock) | |
628 | in the data path of a packet. The kernel has no explicit mechanism to allow the | |
629 | user to select which PHC to use for timestamping Ethernet frames. Instead, the | |
630 | assumption is that the outermost PHC is always the most preferable, and that | |
631 | kernel drivers collaborate towards achieving that goal. Currently there are 3 | |
632 | cases of stacked PHCs, detailed below: | |
633 | ||
634 | 3.2.1 DSA (Distributed Switch Architecture) switches | |
635 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
636 | ||
637 | These are Ethernet switches which have one of their ports connected to an | |
638 | (otherwise completely unaware) host Ethernet interface, and perform the role of | |
639 | a port multiplier with optional forwarding acceleration features. Each DSA | |
640 | switch port is visible to the user as a standalone (virtual) network interface, | |
641 | and its network I/O is performed, under the hood, indirectly through the host | |
642 | interface (redirecting to the host port on TX, and intercepting frames on RX). | |
643 | ||
644 | When a DSA switch is attached to a host port, PTP synchronization suffers, | |
645 | since the switch's variable queuing delay introduces a path delay | |
646 | jitter between the host port and its PTP partner. For this reason, some DSA | |
647 | switches include a timestamping clock of their own, and have the ability to | |
648 | perform network timestamping on their own MAC, such that path delays only | |
649 | measure wire and PHY propagation latencies. Timestamping DSA switches are | |
650 | supported in Linux and expose the same ABI as any other network interface | |
651 | (although the DSA interfaces are virtual in terms of network I/O, | |
652 | they do have their own PHC). It is typical, but not mandatory, for all | |
653 | interfaces of a DSA switch to share the same PHC. | |
654 | ||
655 | By design, PTP timestamping with a DSA switch does not need any special | |
656 | handling in the driver for the host port it is attached to. However, when the | |
657 | host port also supports PTP timestamping, DSA will take care of intercepting | |
a7605370 | 658 | the ``.ndo_eth_ioctl`` calls towards the host port, and block attempts to enable |
94d9f78f VO |
659 | hardware timestamping on it. This is because the SO_TIMESTAMPING API does not |
660 | allow the delivery of multiple hardware timestamps for the same packet, so | |
661 | anything other than the DSA switch port must be prevented from doing so. | |
662 | ||
d150946e YL |
663 | In the generic layer, DSA provides the following infrastructure for PTP |
664 | timestamping: | |
665 | ||
666 | - ``.port_txtstamp()``: a hook called prior to the transmission of | |
667 | packets with a hardware TX timestamping request from user space. | |
668 | This is required for two-step timestamping, since the hardware | |
669 | timestamp becomes available after the actual MAC transmission, so the | |
670 | driver must be prepared to correlate the timestamp with the original | |
671 | packet so that it can re-enqueue the packet back into the socket's | |
672 | error queue. To save the packet for when the timestamp becomes | |
673 | available, the driver can call ``skb_clone_sk()``, save the clone pointer | |
674 | in skb->cb and enqueue it on a TX skb queue. Typically, a switch will have a | |
675 | PTP TX timestamp register (or sometimes a FIFO) where the timestamp | |
676 | becomes available. In case of a FIFO, the hardware might store | |
677 | key-value pairs of PTP sequence ID/message type/domain number and the | |
678 | actual timestamp. To perform the correlation correctly between the | |
679 | packets in a queue waiting for timestamping and the actual timestamps, | |
680 | drivers can use a BPF classifier (``ptp_classify_raw``) to identify | |
681 | the PTP transport type, and ``ptp_parse_header`` to interpret the PTP | |
682 | header fields. There may be an IRQ that is raised upon this | |
683 | timestamp's availability, or the driver might have to poll after | |
684 | invoking ``dev_queue_xmit()`` towards the host interface. | |
685 | One-step TX timestamping does not require packet cloning, since there is | |
686 | no follow-up message required by the PTP protocol (because the | |
687 | TX timestamp is embedded into the packet by the MAC), and therefore | |
688 | user space does not expect the packet annotated with the TX timestamp | |
689 | to be re-enqueued into its socket's error queue. | |
690 | ||
691 | - ``.port_rxtstamp()``: On RX, the BPF classifier is run by DSA to | |
692 | identify PTP event messages (any other packets, including PTP general | |
693 | messages, are not timestamped). The original (and only) timestampable | |
694 | skb is provided to the driver, for it to annotate it with a timestamp, | |
695 | if that is immediately available, or defer to later. On reception, | |
696 | timestamps might either be available in-band (through metadata in the | |
697 | DSA header, or attached in other ways to the packet), or out-of-band | |
698 | (through another RX timestamping FIFO). Deferral on RX is typically | |
699 | necessary when retrieving the timestamp needs a sleepable context. In | |
700 | that case, it is the responsibility of the DSA driver to call | |
21f95a88 | 701 | ``netif_rx()`` on the freshly timestamped skb. |
94d9f78f VO |
702 | |
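The correlation step described for ``.port_txtstamp()`` can be sketched as a small user-space model. This is not driver code: ``struct ptp_key``, ``struct pending_tx``, ``struct fifo_entry`` and ``complete_tx_timestamp()`` are hypothetical names invented for this illustration. In a real driver the key fields would come from ``ptp_parse_header``, and the waiting entries would be skb clones held on a queue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key that a switch FIFO might report next to a TX timestamp. */
struct ptp_key {
	uint16_t seqid;		/* PTP sequence ID */
	uint8_t msgtype;	/* PTP message type */
	uint8_t domain;		/* PTP domain number */
};

/* Stand-in for a cloned packet waiting for its TX timestamp. */
struct pending_tx {
	struct ptp_key key;
	bool waiting;
	uint64_t tstamp_ns;
};

/* Stand-in for one key/timestamp pair read from the hardware FIFO. */
struct fifo_entry {
	struct ptp_key key;
	uint64_t tstamp_ns;
};

static bool ptp_key_equal(const struct ptp_key *a, const struct ptp_key *b)
{
	return a->seqid == b->seqid && a->msgtype == b->msgtype &&
	       a->domain == b->domain;
}

/*
 * Attach the FIFO timestamp to the first waiting packet with the same key.
 * Returns the index of the completed packet, or -1 if nothing matches.
 */
static int complete_tx_timestamp(struct pending_tx *q, size_t n,
				 const struct fifo_entry *e)
{
	for (size_t i = 0; i < n; i++) {
		if (q[i].waiting && ptp_key_equal(&q[i].key, &e->key)) {
			q[i].tstamp_ns = e->tstamp_ns;
			q[i].waiting = false;
			return (int)i;
		}
	}
	return -1;
}
```

Matching on the full key rather than on queue order matters because FIFO entries and queued packets are not guaranteed to line up one-to-one, for example when a timestamp is lost.
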
703 | 3.2.2 Ethernet PHYs | |
704 | ^^^^^^^^^^^^^^^^^^^ | |
705 | ||
706 | These are devices that typically fulfill a Layer 1 role in the network stack, | |
707 | hence they do not have a representation in terms of a network interface as DSA | |
708 | switches do. However, PHYs may be able to detect and timestamp PTP packets, for | |
709 | performance reasons: timestamps taken as close as possible to the wire have the | |
710 | potential to yield a more stable and precise synchronization. | |
711 | ||
712 | A PHY driver that supports PTP timestamping must create a ``struct | |
713 | mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence | |
714 | of this pointer will be checked by the networking stack. | |
715 | ||
716 | Since PHYs do not have network interface representations, the timestamping and | |
717 | ethtool ioctl operations for them need to be mediated by their respective MAC | |
718 | driver. Therefore, as opposed to DSA switches, modifications need to be done | |
719 | to each individual MAC driver for PHY timestamping support. This entails: | |
720 | ||
a7605370 | 721 | - Checking, in ``.ndo_eth_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)`` |
94d9f78f VO |
722 | is true or not. If it is, then the MAC driver should not process this request |
723 | but instead pass it on to the PHY using ``phy_mii_ioctl()``. | |
724 | ||
725 | - On RX, special intervention may or may not be needed, depending on the | |
726 | function used to deliver skb's up the network stack. In the case of plain | |
727 | ``netif_rx()`` and similar, MAC drivers must check whether | |
728 | ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't | |
729 | call ``netif_rx()`` at all. If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is | |
730 | enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook | |
731 | will be called now, to determine, using logic very similar to DSA, whether | |
732 | deferral for RX timestamping is necessary. Again like DSA, it becomes the | |
733 | responsibility of the PHY driver to send the packet up the stack when the | |
734 | timestamp is available. | |
735 | ||
736 | For other skb receive functions, such as ``napi_gro_receive`` and | |
737 | ``netif_receive_skb``, the stack automatically checks whether | |
738 | ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside | |
739 | the driver. | |
740 | ||
741 | - On TX, again, special intervention might or might not be needed. The | |
742 | function that calls the ``mii_ts->txtstamp()`` hook is named | |
743 | ``skb_clone_tx_timestamp()``. This function can be called directly (in | |
744 | which case explicit MAC driver support is indeed needed), but it also | |
745 | piggybacks on the ``skb_tx_timestamp()`` call, which many MAC | |
746 | drivers already perform for software timestamping purposes. Therefore, if a | |
747 | MAC supports software timestamping, it does not need to do anything further | |
748 | at this stage. | |
749 | ||
750 | 3.2.3 MII bus snooping devices | |
751 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
752 | ||
753 | These perform the same role as timestamping Ethernet PHYs, save for the fact | |
754 | that they are discrete devices and can therefore be used in conjunction with | |
755 | any PHY even if it doesn't support timestamping. In Linux, they are | |
756 | discoverable and attachable to a ``struct phy_device`` through Device Tree, and | |
757 | otherwise use the same mii_ts infrastructure as timestamping PHYs. See | |
758 | Documentation/devicetree/bindings/ptp/timestamper.txt for more details. | |
759 | ||
760 | 3.2.4 Other caveats for MAC drivers | |
761 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
762 | ||
763 | The problem with stacked PHCs is that they uncover bugs which were impossible | |
764 | to trigger before the existence of stacked PTP clocks. This applies especially | |
765 | (but not only) to DSA: since it requires no modification to MAC drivers, | |
766 | correctness of all code paths is harder to ensure. One example has to do with | |
767 | this line of code, already presented earlier:: | |
768 | ||
769 | skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; | |
770 | ||
771 | Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY | |
772 | driver or a MII bus snooping device driver, should set this flag. | |
773 | But a MAC driver that is unaware of PHC stacking might get tripped up by | |
774 | somebody other than itself setting this flag, and deliver a duplicate | |
775 | timestamp. | |
776 | For example, a typical driver design for TX timestamping might be to split the | |
777 | transmission part into 2 portions: | |
778 | ||
779 | 1. "TX": checks whether PTP timestamping has been previously enabled through | |
a7605370 | 780 | the ``.ndo_eth_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the |
94d9f78f VO |
781 | current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags & |
782 | SKBTX_HW_TSTAMP``"). If this is true, it sets the | |
783 | "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as | |
784 | described above, in the case of a stacked PHC system, this condition should | |
785 | never trigger, as this MAC is certainly not the outermost PHC. But this is | |
786 | not where the typical issue is. Transmission proceeds with this packet. | |
787 | ||
788 | 2. "TX confirmation": Transmission has finished. The driver checks whether it | |
789 | is necessary to collect any TX timestamp for it. Here is where the typical | |
790 | issues are: the MAC driver takes a shortcut and only checks whether | |
791 | "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked | |
792 | PHC system, this is incorrect because this MAC driver is not the only entity | |
793 | in the TX data path that could have set SKBTX_IN_PROGRESS in the first | |
794 | place. | |
795 | ||
796 | The correct solution for this problem is for MAC drivers to have a compound | |
797 | check in their "TX confirmation" portion, not only for | |
798 | "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for | |
799 | "``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures | |
800 | that PTP timestamping is not enabled for anything other than the outermost PHC, | |
801 | this enhanced check will avoid delivering a duplicated TX timestamp to user | |
802 | space. |
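
The difference between the shortcut and the compound check can be illustrated with a small user-space model. This is a sketch, not driver code: the flag values are illustrative stand-ins for the kernel definitions, and ``struct mac_priv``, ``mac_tx()``, ``mac_tx_confirm_naive()`` and ``mac_tx_confirm()`` are hypothetical names mirroring the "TX" and "TX confirmation" portions described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; the real flags are defined by the kernel. */
#define SKBTX_HW_TSTAMP		(1u << 0)
#define SKBTX_IN_PROGRESS	(1u << 2)

/* Hypothetical MAC driver private data. */
struct mac_priv {
	bool hwtstamp_tx_enabled;	/* set through the hwtstamp ioctl */
};

/* "TX" portion: claim the timestamp only if this MAC was asked to provide it. */
static void mac_tx(const struct mac_priv *priv, uint8_t *tx_flags)
{
	if (priv->hwtstamp_tx_enabled && (*tx_flags & SKBTX_HW_TSTAMP))
		*tx_flags |= SKBTX_IN_PROGRESS;
}

/* Shortcut: wrong when another entity in the TX path set the flag. */
static bool mac_tx_confirm_naive(uint8_t tx_flags)
{
	return (tx_flags & SKBTX_IN_PROGRESS) != 0;
}

/* Compound check: collect a timestamp only if this MAC has timestamping on. */
static bool mac_tx_confirm(const struct mac_priv *priv, uint8_t tx_flags)
{
	return priv->hwtstamp_tx_enabled &&
	       (tx_flags & SKBTX_IN_PROGRESS) != 0;
}
```

With timestamping disabled on the MAC and SKBTX_IN_PROGRESS already set by, say, a DSA switch driver, ``mac_tx_confirm_naive()`` reports that a timestamp should be collected while ``mac_tx_confirm()`` does not, which is the duplicate-timestamp case the compound check avoids.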