Commit | Line | Data |
---|---|---|
c31315c3 DM |
1 | ========================= |
2 | BPF Graph Data Structures | |
3 | ========================= | |
4 | ||
5 | This document describes implementation details of new-style "graph" data | |
6 | structures (linked_list, rbtree), with particular focus on the verifier's | |
7 | implementation of semantics specific to those data structures. | |
8 | ||
9 | Although no specific verifier code is referred to in this document, the document | |
10 | assumes that the reader has general knowledge of BPF verifier internals, BPF | |
11 | maps, and BPF program writing. | |
12 | ||
13 | Note that the intent of this document is to describe the current state of | |
14 | these graph data structures. **No guarantees** of stability for either | |
15 | semantics or APIs are made or implied here. | |
16 | ||
17 | .. contents:: | |
18 | :local: | |
19 | :depth: 2 | |
20 | ||
21 | Introduction | |
22 | ------------ | |
23 | ||
24 | The BPF map API has historically been the main way to expose data structures | |
25 | of various types for use within BPF programs. Some data structures fit naturally | |
26 | with the map API (HASH, ARRAY), others less so. Consequentially, programs | |
27 | interacting with the latter group of data structures can be hard to parse | |
28 | for kernel programmers without previous BPF experience. | |
29 | ||
30 | Luckily, some restrictions which necessitated the use of BPF map semantics are | |
31 | no longer relevant. With the introduction of kfuncs, kptrs, and the any-context | |
32 | BPF allocator, it is now possible to implement BPF data structures whose API | |
33 | and semantics more closely match those exposed to the rest of the kernel. | |
34 | ||
35 | Two such data structures - linked_list and rbtree - have many verification | |
36 | details in common. Because both have "root"s ("head" for linked_list) and | |
37 | "node"s, the verifier code and this document refer to common functionality | |
38 | as "graph_api", "graph_root", "graph_node", etc. | |
39 | ||
40 | Unless otherwise stated, examples and semantics below apply to both graph data | |
41 | structures. | |
42 | ||
43 | Unstable API | |
44 | ------------ | |
45 | ||
46 | Data structures implemented using the BPF map API have historically used BPF | |
47 | helper functions - either standard map API helpers like ``bpf_map_update_elem`` | |
48 | or map-specific helpers. The new-style graph data structures instead use kfuncs | |
49 | to define their manipulation helpers. Because there are no stability guarantees | |
50 | for kfuncs, the API and semantics for these data structures can be evolved in | |
51 | a way that breaks backwards compatibility if necessary. | |
52 | ||
53 | Root and node types for the new data structures are opaquely defined in the | |
54 | ``uapi/linux/bpf.h`` header. | |
55 | ||
56 | Locking | |
57 | ------- | |
58 | ||
59 | The new-style data structures are intrusive and are defined similarly to their | |
60 | vanilla kernel counterparts: | |
61 | ||
62 | .. code-block:: c | |
e2d323a1 | 63 | |
c31315c3 DM |
64 | struct node_data { |
65 | long key; | |
66 | long data; | |
67 | struct bpf_rb_node node; | |
68 | }; | |
69 | ||
70 | struct bpf_spin_lock glock; | |
71 | struct bpf_rb_root groot __contains(node_data, node); | |
72 | ||
73 | The "root" type for both linked_list and rbtree expects to be in a map_value | |
74 | which also contains a ``bpf_spin_lock`` - in the above example both global | |
75 | variables are placed in a single-value arraymap. The verifier considers this | |
76 | spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in | |
77 | the same map_value and will enforce that the correct lock is held when | |
78 | verifying BPF programs that manipulate the tree. Since this lock checking | |
79 | happens at verification time, there is no runtime penalty. | |
80 | ||
81 | Non-owning references | |
82 | --------------------- | |
83 | ||
84 | **Motivation** | |
85 | ||
86 | Consider the following BPF code: | |
87 | ||
88 | .. code-block:: c | |
89 | ||
90 | struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ | |
91 | ||
92 | bpf_spin_lock(&lock); | |
93 | ||
94 | bpf_rbtree_add(&tree, n); /* PASSED */ | |
95 | ||
96 | bpf_spin_unlock(&lock); | |
97 | ||
98 | From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` | |
99 | has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of | |
100 | ``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the | |
101 | program has ownership of the pointee's (object pointed to by ``n``) lifetime. | |
102 | The BPF program must pass off ownership before exiting - either via | |
103 | ``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with | |
104 | ``bpf_rbtree_add``. | |
105 | ||
106 | (``ACQUIRED`` and ``PASSED`` comments in the example denote statements where | |
107 | "ownership is acquired" and "ownership is passed", respectively) | |
108 | ||
109 | What should the verifier do with ``n`` after ownership is passed off? If the | |
110 | object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier | |
111 | should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as | |
112 | the object is no longer valid. The underlying memory may have been reused for | |
113 | some other allocation, unmapped, etc. | |
114 | ||
115 | When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less | |
116 | obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, | |
117 | but that would result in programs with useful, common coding patterns being | |
118 | rejected, e.g.: | |
119 | ||
120 | .. code-block:: c | |
121 | ||
122 | int x; | |
123 | struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ | |
124 | ||
125 | bpf_spin_lock(&lock); | |
126 | ||
127 | bpf_rbtree_add(&tree, n); /* PASSED */ | |
128 | x = n->data; | |
129 | n->data = 42; | |
130 | ||
131 | bpf_spin_unlock(&lock); | |
132 | ||
133 | Both the read from and write to ``n->data`` would be rejected. The verifier | |
134 | can do better, though, by taking advantage of two details: | |
135 | ||
136 | * Graph data structure APIs can only be used when the ``bpf_spin_lock`` | |
137 | associated with the graph root is held | |
138 | ||
139 | * Both graph data structures have pointer stability | |
140 | ||
141 | * Because graph nodes are allocated with ``bpf_obj_new`` and | |
142 | adding / removing from the root involves fiddling with the | |
143 | ``bpf_{list,rb}_node`` field of the node struct, a graph node will | |
144 | remain at the same address after either operation. | |
145 | ||
146 | Because the associated ``bpf_spin_lock`` must be held by any program adding | |
147 | or removing, if we're in the critical section bounded by that lock, we know | |
148 | that no other program can add or remove until the end of the critical section. | |
149 | This combined with pointer stability means that, until the critical section | |
150 | ends, we can safely access the graph node through ``n`` even after it was used | |
151 | to pass ownership. | |
152 | ||
153 | The verifier considers such a reference a *non-owning reference*. The ref | |
154 | returned by ``bpf_obj_new`` is accordingly considered an *owning reference*. | |
155 | Both terms currently only have meaning in the context of graph nodes and API. | |
156 | ||
157 | **Details** | |
158 | ||
159 | Let's enumerate the properties of both types of references. | |
160 | ||
161 | *owning reference* | |
162 | ||
163 | * This reference controls the lifetime of the pointee | |
164 | ||
165 | * Ownership of pointee must be 'released' by passing it to some graph API | |
166 | kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee | |
167 | ||
168 | * If not released before program ends, verifier considers program invalid | |
169 | ||
170 | * Access to the pointee's memory will not page fault | |
171 | ||
172 | *non-owning reference* | |
173 | ||
174 | * This reference does not own the pointee | |
175 | ||
176 | * It cannot be used to add the graph node to a graph root, nor ``free``'d via | |
177 | ``bpf_obj_drop`` | |
178 | ||
179 | * No explicit control of lifetime, but can infer valid lifetime based on | |
180 | non-owning ref existence (see explanation below) | |
181 | ||
182 | * Access to the pointee's memory will not page fault | |
183 | ||
184 | From verifier's perspective non-owning references can only exist | |
185 | between spin_lock and spin_unlock. Why? After spin_unlock another program | |
186 | can do arbitrary operations on the data structure like removing and ``free``-ing | |
187 | via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, | |
188 | ``free``'d, and reused via bpf_obj_new would point to an entirely different thing. | |
189 | Or the memory could go away. | |
190 | ||
191 | To prevent this logic violation all non-owning references are invalidated by the | |
192 | verifier after a critical section ends. This is necessary to ensure the "will | |
193 | not page fault" property of non-owning references. So if the verifier hasn't | |
194 | invalidated a non-owning ref, accessing it will not page fault. | |
195 | ||
196 | Currently ``bpf_obj_drop`` is not allowed in the critical section, so | |
197 | if there's a valid non-owning ref, we must be in a critical section, and can | |
198 | conclude that the ref's memory hasn't been dropped-and- ``free``'d or | |
199 | dropped-and-reused. | |
200 | ||
201 | Any reference to a node that is in an rbtree _must_ be non-owning, since | |
202 | the tree has control of the pointee's lifetime. Similarly, any ref to a node | |
203 | that isn't in rbtree _must_ be owning. This results in a nice property: | |
204 | graph API add / remove implementations don't need to check if a node | |
205 | has already been added (or already removed), as the ownership model | |
206 | allows the verifier to prevent such a state from being valid by simply checking | |
207 | types. | |
208 | ||
209 | However, pointer aliasing poses an issue for the above "nice property". | |
210 | Consider the following example: | |
211 | ||
212 | .. code-block:: c | |
213 | ||
214 | struct node_data *n, *m, *o, *p; | |
215 | n = bpf_obj_new(typeof(*n)); /* 1 */ | |
216 | ||
217 | bpf_spin_lock(&lock); | |
218 | ||
219 | bpf_rbtree_add(&tree, n); /* 2 */ | |
220 | m = bpf_rbtree_first(&tree); /* 3 */ | |
221 | ||
222 | o = bpf_rbtree_remove(&tree, n); /* 4 */ | |
223 | p = bpf_rbtree_remove(&tree, m); /* 5 */ | |
224 | ||
225 | bpf_spin_unlock(&lock); | |
226 | ||
227 | bpf_obj_drop(o); | |
228 | bpf_obj_drop(p); /* 6 */ | |
229 | ||
230 | Assume the tree is empty before this program runs. If we track verifier state | |
231 | changes here using numbers in above comments: | |
232 | ||
233 | 1) n is an owning reference | |
234 | ||
235 | 2) n is a non-owning reference, it's been added to the tree | |
236 | ||
237 | 3) n and m are non-owning references, they both point to the same node | |
238 | ||
239 | 4) o is an owning reference, n and m non-owning, all point to same node | |
240 | ||
241 | 5) o and p are owning, n and m non-owning, all point to the same node | |
242 | ||
243 | 6) a double-free has occurred, since o and p point to same node and o was | |
244 | ``free``'d in previous statement | |
245 | ||
246 | States 4 and 5 violate our "nice property", as there are non-owning refs to | |
247 | a node which is not in an rbtree. Statement 5 will try to remove a node which | |
248 | has already been removed as a result of this violation. State 6 is a dangerous | |
249 | double-free. | |
250 | ||
251 | At a minimum we should prevent state 6 from being possible. If we can't also | |
252 | prevent state 5 then we must abandon our "nice property" and check whether a | |
253 | node has already been removed at runtime. | |
254 | ||
255 | We prevent both by generalizing the "invalidate non-owning references" behavior | |
256 | of ``bpf_spin_unlock`` and doing similar invalidation after | |
257 | ``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: | |
258 | ||
259 | * takes an arbitrary node argument | |
260 | ||
261 | * removes it from the data structure | |
262 | ||
263 | * returns an owning reference to the removed node | |
264 | ||
265 | May result in a state where some other non-owning reference points to the same | |
266 | node. So ``remove``-type kfuncs must be considered a non-owning reference | |
267 | invalidation point as well. |