===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
    :local:
    :depth: 2

Motivation
----------

There are two distinct motivations for this work, neither of which is
satisfied by the existing perf buffer, and which prompted the creation of
a new ring buffer implementation:

- more efficient memory utilization by sharing ring buffer across CPUs;
- preserving ordering of events that happen sequentially in time, even across
  multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer fails to satisfy both.
Both are a result of the choice to have a per-CPU perf ring buffer, and both
can be solved by a multi-producer, single-consumer (MPSC) ring buffer
implementation. The ordering problem could technically be solved for perf
buffer with some in-kernel counting, but given that the first problem
requires an MPSC buffer anyway, the same solution solves the second problem
automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.

One way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but not enforce
the "same CPU only" rule. This would be a more familiar interface, compatible
with existing perf buffer use in BPF, but it would fail if an application
needed more advanced logic to look up a ring buffer by an arbitrary key.
``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with the current approach.
Additionally, given the performance of BPF ringbuf, many use cases would just
opt into a simple single ring buffer shared among all CPUs, for which such an
array-based approach would be overkill.

Another approach could introduce a new concept, alongside BPF map, to
represent a generic "container" object, which doesn't necessarily have
a key/value interface with lookup/update/delete operations. This approach
would require a lot of extra infrastructure to be built for observability and
verifier support. It would also add another concept that BPF developers would
have to familiarize themselves with, new syntax in libbpf, etc., yet would
provide no real additional benefits over the approach of using a map.
``BPF_MAP_TYPE_RINGBUF`` doesn't support lookup/update/delete operations, but
neither do a few other map types (e.g., queue and stack; array doesn't
support delete, etc.).

The chosen approach has the advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc.), being
a familiar concept (no need to teach users a new type of object in a BPF
program), and utilizing existing tooling (bpftool). For the common scenario
of using a single ring buffer for all CPUs, it's as simple and
straightforward as a dedicated "container" object would be. On the other
hand, by being a map, it can be combined with ``ARRAY_OF_MAPS`` and
``HASH_OF_MAPS`` map-in-maps to implement a wide variety of topologies, from
one ring buffer for each CPU (e.g., as a replacement for perf buffer use
cases), to complicated application-level hashing/sharding of ring buffers
(e.g., a small pool of ring buffers with the hash of a task's tgid as the
lookup key, to preserve order while reducing contention).

Key and value sizes are enforced to be zero. ``max_entries`` is used to
specify the size of the ring buffer in bytes and has to be a power of 2.

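For illustration, such a map could be declared on the BPF side using libbpf's
BTF-style map definition syntax; the map name ``events`` and the 256 KB size
here are arbitrary, not prescribed:

```c
/* BPF-side map declaration (libbpf BTF-style syntax); the name "events"
 * and the 256 KB size are illustrative.
 */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); /* data area size in bytes, power of 2 */
} events SEC(".maps");
```
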
There are a number of similarities between perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and the new BPF ring buffer semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails, no
  blocking;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to busy-poll for new data to achieve the lowest
  latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows to *copy* data from one place to a ring
  buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()``
  APIs split the whole process into two steps. First, a fixed amount of space
  is reserved. If successful, a pointer to data inside the ring buffer data
  area is returned, which BPF programs can use similarly to data inside
  array/hash maps. Once ready, this piece of memory is either committed or
  discarded. Discard is similar to commit, but makes the consumer ignore the
  record.

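As a hedged sketch (not taken verbatim from the kernel sources), the two-step
flow might look as follows in a BPF program. Note that the commit operation is
exposed to programs as the ``bpf_ringbuf_submit()`` helper; the ``events``
map, ``struct event``, and the ``filter_event()`` check are hypothetical:

```c
struct event *e;

e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e)
    return 0; /* reservation failed: no space left (no blocking) */

/* e points directly into the ring buffer data area */
e->pid = bpf_get_current_pid_tgid() >> 32;

if (!filter_event(e)) {
    /* consumer will skip this record entirely */
    bpf_ringbuf_discard(e, 0);
    return 0;
}

/* commit: the record becomes visible to the consumer */
bpf_ringbuf_submit(e, 0);
return 0;
```
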
``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared in some other place first. But it
allows submitting records of a length that's not known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, so it will
simplify migration significantly.

``bpf_ringbuf_reserve()`` avoids the extra copy by providing a pointer
directly into ring buffer memory. In a lot of cases records are larger than
BPF stack space allows, so many programs have to use an extra per-CPU array
as a temporary heap for preparing a sample. ``bpf_ringbuf_reserve()`` avoids
this need completely. But in exchange, it only allows reserving a known,
constant size of memory, so that the verifier can verify that the BPF program
can't access memory outside its reserved record space.
``bpf_ringbuf_output()``, while slightly slower due to the extra memory copy,
covers some use cases that are not suitable for ``bpf_ringbuf_reserve()``.

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by
consumer code. Discard is useful for some advanced use-cases, such as
ensuring all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within a single BPF program invocation.

Each reserved record is tracked by the verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record but forget to submit or discard it.

The ``bpf_ringbuf_query()`` helper allows querying various properties of
a ring buffer. Currently 4 are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.

The returned values are momentary snapshots of the ring buffer state and
could be off by the time the helper returns, so this should be used only for
debugging/reporting purposes or for implementing various heuristics that take
into account the highly-changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with
the ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the
output/commit/discard helpers, it allows a BPF program a high degree of
control and, e.g., more efficient batched notifications. The default
self-balancing strategy, though, should be adequate for most applications and
already works reliably and efficiently.

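One possible shape of such a heuristic is sketched below; the 64 KB threshold,
the ``events`` map, and the reserved record ``e`` are all illustrative, not
taken from the document. The idea is to suppress per-record wakeups and force
one only once enough unconsumed data has accumulated:

```c
__u64 avail = bpf_ringbuf_query(&events, BPF_RB_AVAIL_DATA);

if (avail >= 64 * 1024)
    /* enough batched data: wake up the consumer now */
    bpf_ringbuf_submit(e, BPF_RB_FORCE_WAKEUP);
else
    /* keep batching: no notification for this record */
    bpf_ringbuf_submit(e, BPF_RB_NO_WAKEUP);
```
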
Design and Implementation
-------------------------

This reserve/commit schema allows a natural way for multiple producers,
either on different CPUs or even on the same CPU/in the same BPF program, to
reserve independent records and work with them without blocking other
producers. This means that if a BPF program is interrupted by another BPF
program sharing the same ring buffer, both will get a record reserved
(provided there is enough space left) and can work with it and submit it
independently. This applies to NMI context as well, except that, due to the
spinlock taken during reservation, ``bpf_ringbuf_reserve()`` might fail to
get the lock in NMI context, in which case the reservation fails even if the
ring buffer is not full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures; that's not a problem):

- the consumer counter shows up to which logical position the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, the data is not yet
ready to be consumed, though. Each record has an 8-byte header, which
contains the length of the reserved record, as well as two extra bits: the
busy bit, denoting that the record is still being worked on, and the discard
bit, which might be set at commit time if the record is discarded. In the
latter case, the consumer is supposed to skip the record and move on to the
next one. The record header also encodes the record's relative offset from
the beginning of the ring buffer data area (in pages). This allows
``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the pointer
to the record itself, without also requiring a pointer to the ring buffer:
the ring buffer memory location is restored from the record metadata header.
This significantly simplifies the verifier and improves API usability.

Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
have been committed. It is thus possible for slow producers to temporarily
hold off submitted records that were reserved later.

One interesting implementation detail, which significantly simplifies (and
thus speeds up) the implementation of both producers and consumers, is that
the data area is mapped twice, contiguously back-to-back, in virtual memory.
This makes it unnecessary to take any special measures for samples that wrap
around at the end of the circular buffer data area, because the next page
after the last data page is the first data page again, so the sample still
appears completely contiguous in virtual memory. See the comment and
a simple ASCII diagram showing this visually in
``bpf_ringbuf_area_alloc()``.

Another feature that distinguishes BPF ringbuf from perf ring buffer is the
self-pacing notification of new data availability. The
``bpf_ringbuf_commit()`` implementation sends a notification of a new record
being available after commit only if the consumer has already caught up right
up to the record being committed. If not, the consumer still has to catch up
and thus will see the new data anyway, without needing an extra poll
notification. Benchmarks (see
``tools/testing/selftests/bpf/benchs/bench_ringbufs.c``) show that this
allows achieving very high throughput without having to resort to tricks
like "notify only every Nth sample", which are necessary with perf buffer.
For extreme cases, when a BPF program wants more manual control of
notifications, the commit/discard/output helpers accept the
``BPF_RB_NO_WAKEUP`` and ``BPF_RB_FORCE_WAKEUP`` flags, which give full
control over notifications of data availability, but require extra caution
and diligence in using this API.
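
On the user-space side, these notifications are typically consumed through
libbpf's ``ring_buffer`` API. A minimal sketch, assuming ``map_fd`` is the
file descriptor of a ``BPF_MAP_TYPE_RINGBUF`` map obtained elsewhere:

```c
#include <bpf/libbpf.h>
#include <stddef.h>

/* invoked by libbpf for every committed record; discarded records are
 * skipped internally and never reach this callback
 */
static int handle_event(void *ctx, void *data, size_t size)
{
    /* process one sample of `size` bytes at `data` */
    return 0;
}

/* map_fd is assumed to be the fd of a BPF_MAP_TYPE_RINGBUF map */
static void consume(int map_fd)
{
    struct ring_buffer *rb;

    rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
    if (!rb)
        return;
    /* epoll-based wait; returns the number of records consumed, or error */
    while (ring_buffer__poll(rb, 100 /* timeout, ms */) >= 0)
        ;
    ring_buffer__free(rb);
}
```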