Documentation/bpf/graph_ds_impl.rst

=========================
BPF Graph Data Structures
=========================

This document describes implementation details of new-style "graph" data
structures (linked_list, rbtree), with particular focus on the verifier's
implementation of semantics specific to those data structures.

Although no specific verifier code is referred to in this document, the document
assumes that the reader has general knowledge of BPF verifier internals, BPF
maps, and BPF program writing.

Note that the intent of this document is to describe the current state of
these graph data structures. **No guarantees** of stability for either
semantics or APIs are made or implied here.

.. contents::
    :local:
    :depth: 2

Introduction
------------

The BPF map API has historically been the main way to expose data structures
of various types for use within BPF programs. Some data structures fit naturally
with the map API (HASH, ARRAY), others less so. Consequently, programs
interacting with the latter group of data structures can be hard to parse
for kernel programmers without previous BPF experience.

Luckily, some restrictions which necessitated the use of BPF map semantics are
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
BPF allocator, it is now possible to implement BPF data structures whose API
and semantics more closely match those exposed to the rest of the kernel.

Two such data structures - linked_list and rbtree - have many verification
details in common. Because both have "root"s ("head" for linked_list) and
"node"s, the verifier code and this document refer to common functionality
as "graph_api", "graph_root", "graph_node", etc.

Unless otherwise stated, examples and semantics below apply to both graph data
structures.

Unstable API
------------

Data structures implemented using the BPF map API have historically used BPF
helper functions - either standard map API helpers like ``bpf_map_update_elem``
or map-specific helpers. The new-style graph data structures instead use kfuncs
to define their manipulation helpers. Because there are no stability guarantees
for kfuncs, the API and semantics for these data structures can be evolved in
a way that breaks backwards compatibility if necessary.

Root and node types for the new data structures are opaquely defined in the
``uapi/linux/bpf.h`` header.

Locking
-------

The new-style data structures are intrusive and are defined similarly to their
vanilla kernel counterparts:

.. code-block:: c

        struct node_data {
          long key;
          long data;
          struct bpf_rb_node node;
        };

        struct bpf_spin_lock glock;
        struct bpf_rb_root groot __contains(node_data, node);

The "root" type for both linked_list and rbtree expects to be in a map_value
which also contains a ``bpf_spin_lock`` - in the above example both global
variables are placed in a single-value arraymap. The verifier considers this
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
the same map_value and will enforce that the correct lock is held when
verifying BPF programs that manipulate the tree. Since this lock checking
happens at verification time, there is no runtime penalty.

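The association rule above can be pictured with a small userspace model. This
is only a sketch under stated assumptions: ``toy_map_value``,
``toy_verifier_state`` and ``toy_graph_call_allowed`` are hypothetical names
invented for illustration and do not correspond to actual verifier code. The
model accepts a graph API call only when the held lock lives in the same
map_value as the root being manipulated.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        /* Hypothetical summary of what is known about one map_value. */
        struct toy_map_value {
          int id;        /* identity of the map_value; 0 is reserved for "none" */
          bool has_lock; /* the map_value contains a bpf_spin_lock */
          bool has_root; /* the map_value contains a graph root */
        };

        /* Hypothetical verifier state: id of the map_value whose lock is held. */
        struct toy_verifier_state {
          int held_lock_id; /* 0 means no lock is held */
        };

        /* Returns true if a graph API call on root_owner's root would be accepted. */
        static bool toy_graph_call_allowed(const struct toy_verifier_state *st,
                                           const struct toy_map_value *root_owner)
        {
          if (!root_owner->has_root || !root_owner->has_lock)
            return false;
          /* The held lock must be the one sharing a map_value with the root. */
          return st->held_lock_id == root_owner->id;
        }

Because the check is a comparison over static state, it models a decision made
once at verification time rather than a test executed on every call.
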
Non-owning references
---------------------

**Motivation**

Consider the following BPF code:

.. code-block:: c

        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n); /* PASSED */

        bpf_spin_unlock(&lock);

From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
program owns the lifetime of the pointee (the object pointed to by ``n``).
The BPF program must pass off ownership before exiting - either via
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
``bpf_rbtree_add``.

(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
"ownership is acquired" and "ownership is passed", respectively.)

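The "must pass off ownership before exiting" rule can be sketched as simple
bookkeeping over outstanding reference ids. The following is a userspace toy,
not verifier code: ``toy_refs``, ``toy_acquire``, ``toy_release`` and
``toy_exit_allowed`` are hypothetical names, and the fixed-size table is an
assumption made to keep the example short.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        #define TOY_MAX_REFS 8

        /* Hypothetical table of outstanding owning references. */
        struct toy_refs {
          int ids[TOY_MAX_REFS];
          int count;
          int next_id; /* starts at 1 so that 0 can mean "no reference" */
        };

        static void toy_refs_init(struct toy_refs *r)
        {
          r->count = 0;
          r->next_id = 1;
        }

        /* Models bpf_obj_new: returns a nonzero id, or 0 if the table is full. */
        static int toy_acquire(struct toy_refs *r)
        {
          if (r->count == TOY_MAX_REFS)
            return 0;
          r->ids[r->count++] = r->next_id;
          return r->next_id++;
        }

        /* Models bpf_obj_drop or handing the object to a graph API kfunc. */
        static bool toy_release(struct toy_refs *r, int id)
        {
          for (int i = 0; i < r->count; i++) {
            if (r->ids[i] == id) {
              r->ids[i] = r->ids[--r->count]; /* swap-remove the entry */
              return true;
            }
          }
          return false; /* unknown or already released id */
        }

        /* The program may exit only when nothing is left outstanding. */
        static bool toy_exit_allowed(const struct toy_refs *r)
        {
          return r->count == 0;
        }

In this model, acquiring a reference and exiting without releasing it is
rejected, and releasing the same id twice fails.
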
What should the verifier do with ``n`` after ownership is passed off? If the
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
the object is no longer valid. The underlying memory may have been reused for
some other allocation, unmapped, etc.

When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
but that would result in programs with useful, common coding patterns being
rejected, e.g.:

.. code-block:: c

        int x;
        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n); /* PASSED */
        x = n->data;
        n->data = 42;

        bpf_spin_unlock(&lock);

Both the read from and write to ``n->data`` would be rejected. The verifier
can do better, though, by taking advantage of two details:

  * Graph data structure APIs can only be used when the ``bpf_spin_lock``
    associated with the graph root is held

  * Both graph data structures have pointer stability

    * Because graph nodes are allocated with ``bpf_obj_new`` and
      adding / removing from the root only modifies the
      ``bpf_{list,rb}_node`` field of the node struct, a graph node will
      remain at the same address after either operation.

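Pointer stability follows from the intrusive layout and can be demonstrated in
plain userspace C. The sketch below uses a hypothetical ``toy_list_node`` as a
stand-in for ``bpf_list_node``; it is not the kernel's implementation. Adding
and removing only rewrite the link fields embedded in the object, so the
address of the enclosing ``toy_node_data`` never changes.

.. code-block:: c

        #include <assert.h>
        #include <stddef.h>

        /* Hypothetical intrusive node, standing in for bpf_list_node. */
        struct toy_list_node {
          struct toy_list_node *prev, *next;
        };

        struct toy_node_data {
          long key;
          long data;
          struct toy_list_node node; /* embedded in the object, not pointed to */
        };

        /* Recover the enclosing struct from a pointer to its embedded node. */
        #define toy_container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

        static void toy_list_init(struct toy_list_node *head)
        {
          head->prev = head->next = head;
        }

        /* Insert n right after head; only link fields are written. */
        static void toy_list_add(struct toy_list_node *head, struct toy_list_node *n)
        {
          n->next = head->next;
          n->prev = head;
          head->next->prev = n;
          head->next = n;
        }

        /* Unlink n; the object that embeds n is neither moved nor copied. */
        static void toy_list_del(struct toy_list_node *n)
        {
          n->prev->next = n->next;
          n->next->prev = n->prev;
          n->prev = n->next = n;
        }

A pointer taken to the object before ``toy_list_add`` compares equal to the
one recovered from the list with ``toy_container_of`` afterwards, which is the
property the verifier relies on.
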
Because the associated ``bpf_spin_lock`` must be held by any program adding
or removing, if we're in the critical section bounded by that lock, we know
that no other program can add or remove until the end of the critical section.
This combined with pointer stability means that, until the critical section
ends, we can safely access the graph node through ``n`` even after it was used
to pass ownership.

The verifier considers such a reference a *non-owning reference*. The ref
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
Both terms currently only have meaning in the context of graph nodes and API.

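The difference between dropping an object and adding it to a tree can be
summarized as a state change on the pointer. The model below is a userspace
sketch with hypothetical names (``toy_ref``, ``toy_rbtree_add``,
``toy_obj_drop``, ``toy_access_ok``); it only mirrors the rules described in
this section and is not taken from the verifier.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        /* Hypothetical per-pointer state; the names are illustrative only. */
        enum toy_ref_kind { TOY_INVALID, TOY_OWNING, TOY_NON_OWNING };

        struct toy_ref {
          enum toy_ref_kind kind;
        };

        /* Passing ownership to the tree leaves a non-owning reference behind. */
        static bool toy_rbtree_add(struct toy_ref *r, bool lock_held)
        {
          if (!lock_held || r->kind != TOY_OWNING)
            return false;
          r->kind = TOY_NON_OWNING;
          return true;
        }

        /* Dropping frees the object, so the pointer becomes unusable. */
        static bool toy_obj_drop(struct toy_ref *r)
        {
          if (r->kind != TOY_OWNING)
            return false;
          r->kind = TOY_INVALID;
          return true;
        }

        /* Loads and stores are accepted unless the pointer was invalidated. */
        static bool toy_access_ok(const struct toy_ref *r)
        {
          return r->kind != TOY_INVALID;
        }

Under this model the ``x = n->data`` and ``n->data = 42`` statements from the
earlier example are accepted after the add, while the same accesses after a
drop are rejected.
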
**Details**

Let's enumerate the properties of both types of references.

*owning reference*

  * This reference controls the lifetime of the pointee

  * Ownership of pointee must be 'released' by passing it to some graph API
    kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee

    * If not released before program ends, verifier considers program invalid

  * Access to the pointee's memory will not page fault

*non-owning reference*

  * This reference does not own the pointee

    * It cannot be used to add the graph node to a graph root, nor ``free``'d via
      ``bpf_obj_drop``

  * No explicit control of lifetime, but can infer valid lifetime based on
    non-owning ref existence (see explanation below)

  * Access to the pointee's memory will not page fault

From the verifier's perspective, non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the data structure, like removing a node and
``free``-ing it via ``bpf_obj_drop``. A non-owning ref to some chunk of memory
that was removed, ``free``'d, and reused via ``bpf_obj_new`` would point to an
entirely different thing. Or the memory could go away.

To prevent this logic violation, all non-owning references are invalidated by
the verifier after a critical section ends. This is necessary to ensure the
"will not page fault" property of non-owning references. So if the verifier
hasn't invalidated a non-owning ref, accessing it will not page fault.

Currently ``bpf_obj_drop`` is not allowed in the critical section, so
if there's a valid non-owning ref, we must be in a critical section, and can
conclude that the ref's memory hasn't been dropped-and-``free``'d or
dropped-and-reused.

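Invalidation at the end of a critical section can be modeled as a sweep over
every tracked pointer. This is a minimal userspace sketch: ``toy_state`` and
the ``toy_spin_lock``, ``toy_spin_unlock`` and ``toy_obj_drop`` functions are
hypothetical, and the fixed array of four tracked pointers is an assumption
made for brevity.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        enum toy_ref_kind { TOY_INVALID, TOY_OWNING, TOY_NON_OWNING };

        #define TOY_NREGS 4

        /* Hypothetical verifier state: lock flag plus the kind of each pointer. */
        struct toy_state {
          bool in_critical_section;
          enum toy_ref_kind regs[TOY_NREGS];
        };

        static void toy_spin_lock(struct toy_state *st)
        {
          st->in_critical_section = true;
        }

        /* Ending the critical section invalidates every non-owning reference. */
        static bool toy_spin_unlock(struct toy_state *st)
        {
          if (!st->in_critical_section)
            return false;
          for (int i = 0; i < TOY_NREGS; i++)
            if (st->regs[i] == TOY_NON_OWNING)
              st->regs[i] = TOY_INVALID;
          st->in_critical_section = false;
          return true;
        }

        /* Modeled after the rule above: no drop while the lock is held. */
        static bool toy_obj_drop(struct toy_state *st, int reg)
        {
          if (st->in_critical_section || st->regs[reg] != TOY_OWNING)
            return false;
          st->regs[reg] = TOY_INVALID;
          return true;
        }

Owning references survive the unlock untouched, which is why they can still be
dropped afterwards.
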
Any reference to a node that is in an rbtree *must* be non-owning, since
the tree has control of the pointee's lifetime. Similarly, any ref to a node
that isn't in an rbtree *must* be owning. This results in a nice property:
graph API add / remove implementations don't need to check if a node
has already been added (or already removed), as the ownership model
allows the verifier to prevent such a state from being valid by simply checking
types.

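Put differently, "already in the tree" and "not in the tree" are encoded in
the reference kind, so membership checks reduce to argument type checks. A
tiny sketch, with the hypothetical names ``toy_graph_op``,
``toy_required_kind`` and ``toy_arg_type_ok`` (none of which exist in the
kernel):

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        enum toy_ref_kind { TOY_INVALID, TOY_OWNING, TOY_NON_OWNING };
        enum toy_graph_op { TOY_OP_ADD, TOY_OP_REMOVE };

        /* Kind of reference each operation requires for its node argument. */
        static enum toy_ref_kind toy_required_kind(enum toy_graph_op op)
        {
          /* add takes a node outside the tree, remove takes one inside it */
          return op == TOY_OP_ADD ? TOY_OWNING : TOY_NON_OWNING;
        }

        /* A call is accepted only if the argument kind matches exactly. */
        static bool toy_arg_type_ok(enum toy_graph_op op, enum toy_ref_kind arg)
        {
          return arg == toy_required_kind(op);
        }

A double add shows up as an add with a non-owning argument, and removing a
node that is not in the tree shows up as a remove with an owning argument;
both fail the type check in this model.
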
However, pointer aliasing poses an issue for the above "nice property".
Consider the following example:

.. code-block:: c

        struct node_data *n, *m, *o, *p;
        n = bpf_obj_new(typeof(*n));     /* 1 */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n);        /* 2 */
        m = bpf_rbtree_first(&tree);     /* 3 */

        o = bpf_rbtree_remove(&tree, n); /* 4 */
        p = bpf_rbtree_remove(&tree, m); /* 5 */

        bpf_spin_unlock(&lock);

        bpf_obj_drop(o);
        bpf_obj_drop(p); /* 6 */

Assume the tree is empty before this program runs. If we track verifier state
changes here using numbers in above comments:

  1) n is an owning reference

  2) n is a non-owning reference, it's been added to the tree

  3) n and m are non-owning references, they both point to the same node

  4) o is an owning reference, n and m non-owning, all point to same node

  5) o and p are owning, n and m non-owning, all point to the same node

  6) a double-free has occurred, since o and p point to same node and o was
     ``free``'d in previous statement

States 4 and 5 violate our "nice property", as there are non-owning refs to
a node which is not in an rbtree. Statement 5 will try to remove a node which
has already been removed as a result of this violation. State 6 is a dangerous
double-free.

At a minimum we should prevent state 6 from being possible. If we can't also
prevent state 5 then we must abandon our "nice property" and check whether a
node has already been removed at runtime.

We prevent both by generalizing the "invalidate non-owning references" behavior
of ``bpf_spin_unlock`` and doing similar invalidation after
``bpf_rbtree_remove``. The logic here is that any graph API kfunc which:

  * takes an arbitrary node argument

  * removes it from the data structure

  * returns an owning reference to the removed node

may result in a state where some other non-owning reference points to the same
node. So ``remove``-type kfuncs must be considered a non-owning reference
invalidation point as well.
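
The effect of this rule on the aliasing example can be traced with a small
userspace model. It is a sketch only: ``toy_state``, ``toy_add``,
``toy_first`` and ``toy_remove`` are hypothetical, each pointer in the example
is represented by one slot, and the possibility of ``bpf_rbtree_first``
returning NULL is ignored for simplicity.

.. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        enum toy_ref_kind { TOY_INVALID, TOY_OWNING, TOY_NON_OWNING };

        /* One slot per pointer variable in the aliasing example. */
        enum { REG_N, REG_M, REG_O, REG_P, TOY_NREGS };

        struct toy_state {
          enum toy_ref_kind regs[TOY_NREGS];
        };

        /* add: consumes an owning ref, leaving a non-owning ref in the same slot. */
        static bool toy_add(struct toy_state *st, int reg)
        {
          if (st->regs[reg] != TOY_OWNING)
            return false;
          st->regs[reg] = TOY_NON_OWNING;
          return true;
        }

        /* first: yields a non-owning ref to a node that is in the tree. */
        static void toy_first(struct toy_state *st, int dst)
        {
          st->regs[dst] = TOY_NON_OWNING;
        }

        /*
         * remove: requires a non-owning ref. Since any other non-owning ref
         * may alias the removed node, all of them are invalidated before the
         * owning ref is handed back in dst.
         */
        static bool toy_remove(struct toy_state *st, int arg, int dst)
        {
          if (st->regs[arg] != TOY_NON_OWNING)
            return false;
          for (int i = 0; i < TOY_NREGS; i++)
            if (st->regs[i] == TOY_NON_OWNING)
              st->regs[i] = TOY_INVALID;
          st->regs[dst] = TOY_OWNING;
          return true;
        }

In this model statement 4 succeeds and invalidates both ``n`` and ``m``, so
statement 5 is rejected because ``m`` is no longer a non-owning reference.
``p`` never becomes owning, and the double-free in state 6 cannot be reached.
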