[linux-block.git] / Documentation / bpf / graph_ds_impl.rst

=========================
BPF Graph Data Structures
=========================

This document describes implementation details of new-style "graph" data
structures (linked_list, rbtree), with particular focus on the verifier's
implementation of semantics specific to those data structures.

Although no specific verifier code is referred to in this document, the document
assumes that the reader has general knowledge of BPF verifier internals, BPF
maps, and BPF program writing.

Note that the intent of this document is to describe the current state of
these graph data structures. **No guarantees** of stability for either
semantics or APIs are made or implied here.

.. contents::
    :local:
    :depth: 2

Introduction
------------

The BPF map API has historically been the main way to expose data structures
of various types for use within BPF programs. Some data structures fit naturally
with the map API (HASH, ARRAY), others less so. Consequentially, programs
interacting with the latter group of data structures can be hard to parse
for kernel programmers without previous BPF experience.

Luckily, some restrictions which necessitated the use of BPF map semantics are
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
BPF allocator, it is now possible to implement BPF data structures whose API
and semantics more closely match those exposed to the rest of the kernel.

Two such data structures - linked_list and rbtree - have many verification
details in common. Because both have "root"s ("head" for linked_list) and
"node"s, the verifier code and this document refer to common functionality
as "graph_api", "graph_root", "graph_node", etc.

Unless otherwise stated, examples and semantics below apply to both graph data
structures.

Unstable API
------------

Data structures implemented using the BPF map API have historically used BPF
helper functions - either standard map API helpers like ``bpf_map_update_elem``
or map-specific helpers. The new-style graph data structures instead use kfuncs
to define their manipulation helpers. Because there are no stability guarantees
for kfuncs, the API and semantics for these data structures can be evolved in
a way that breaks backwards compatibility if necessary.

Root and node types for the new data structures are opaquely defined in the
``uapi/linux/bpf.h`` header.

Locking
-------

The new-style data structures are intrusive and are defined similarly to their
vanilla kernel counterparts:

.. code-block:: c

        struct node_data {
          long key;
          long data;
          struct bpf_rb_node node;
        };

        struct bpf_spin_lock glock;
        struct bpf_rb_root groot __contains(node_data, node);

The "root" type for both linked_list and rbtree expects to be in a map_value
which also contains a ``bpf_spin_lock`` - in the above example both global
variables are placed in a single-value arraymap. The verifier considers this
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
the same map_value and will enforce that the correct lock is held when
verifying BPF programs that manipulate the tree. Since this lock checking
happens at verification time, there is no runtime penalty.

Non-owning references
---------------------

**Motivation**

Consider the following BPF code:

.. code-block:: c

        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n); /* PASSED */

        bpf_spin_unlock(&lock);

From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
program has ownership of the pointee's (object pointed to by ``n``) lifetime.
The BPF program must pass off ownership before exiting - either via
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
``bpf_rbtree_add``.

(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
"ownership is acquired" and "ownership is passed", respectively)

What should the verifier do with ``n`` after ownership is passed off? If the
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
the object is no longer valid. The underlying memory may have been reused for
some other allocation, unmapped, etc.

When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
but that would result in programs with useful, common coding patterns being
rejected, e.g.:

.. code-block:: c

        int x;
        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n); /* PASSED */
        x = n->data;
        n->data = 42;

        bpf_spin_unlock(&lock);

Both the read from and write to ``n->data`` would be rejected. The verifier
can do better, though, by taking advantage of two details:

  * Graph data structure APIs can only be used when the ``bpf_spin_lock``
    associated with the graph root is held

  * Both graph data structures have pointer stability

     * Because graph nodes are allocated with ``bpf_obj_new`` and
       adding / removing from the root involves fiddling with the
       ``bpf_{list,rb}_node`` field of the node struct, a graph node will
       remain at the same address after either operation.

Because the associated ``bpf_spin_lock`` must be held by any program adding
or removing, if we're in the critical section bounded by that lock, we know
that no other program can add or remove until the end of the critical section.
This combined with pointer stability means that, until the critical section
ends, we can safely access the graph node through ``n`` even after it was used
to pass ownership.

The verifier considers such a reference a *non-owning reference*. The ref
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
Both terms currently only have meaning in the context of graph nodes and API.

**Details**

Let's enumerate the properties of both types of references.

*owning reference*

  * This reference controls the lifetime of the pointee

  * Ownership of pointee must be 'released' by passing it to some graph API
    kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee

    * If not released before program ends, verifier considers program invalid

  * Access to the pointee's memory will not page fault

*non-owning reference*

  * This reference does not own the pointee

     * It cannot be used to add the graph node to a graph root, nor ``free``'d via
       ``bpf_obj_drop``

  * No explicit control of lifetime, but can infer valid lifetime based on
    non-owning ref existence (see explanation below)

  * Access to the pointee's memory will not page fault

From verifier's perspective non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the data structure like removing and ``free``-ing
via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
Or the memory could go away.

To prevent this logic violation all non-owning references are invalidated by the
verifier after a critical section ends. This is necessary to ensure the "will
not page fault" property of non-owning references. So if the verifier hasn't
invalidated a non-owning ref, accessing it will not page fault.

Currently ``bpf_obj_drop`` is not allowed in the critical section, so
if there's a valid non-owning ref, we must be in a critical section, and can
conclude that the ref's memory hasn't been dropped-and- ``free``'d or
dropped-and-reused.

Any reference to a node that is in an rbtree _must_ be non-owning, since
the tree has control of the pointee's lifetime. Similarly, any ref to a node
that isn't in rbtree _must_ be owning. This results in a nice property:
graph API add / remove implementations don't need to check if a node
has already been added (or already removed), as the ownership model
allows the verifier to prevent such a state from being valid by simply checking
types.

However, pointer aliasing poses an issue for the above "nice property".
Consider the following example:

.. code-block:: c

        struct node_data *n, *m, *o, *p;
        n = bpf_obj_new(typeof(*n));     /* 1 */

        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, n);        /* 2 */
        m = bpf_rbtree_first(&tree);     /* 3 */

        o = bpf_rbtree_remove(&tree, n); /* 4 */
        p = bpf_rbtree_remove(&tree, m); /* 5 */

        bpf_spin_unlock(&lock);

        bpf_obj_drop(o);
        bpf_obj_drop(p); /* 6 */

Assume the tree is empty before this program runs. If we track verifier state
changes here using numbers in above comments:

  1) n is an owning reference

  2) n is a non-owning reference, it's been added to the tree

  3) n and m are non-owning references, they both point to the same node

  4) o is an owning reference, n and m non-owning, all point to same node

  5) o and p are owning, n and m non-owning, all point to the same node

  6) a double-free has occurred, since o and p point to same node and o was
     ``free``'d in previous statement

States 4 and 5 violate our "nice property", as there are non-owning refs to
a node which is not in an rbtree. Statement 5 will try to remove a node which
has already been removed as a result of this violation. State 6 is a dangerous
double-free.

At a minimum we should prevent state 6 from being possible. If we can't also
prevent state 5 then we must abandon our "nice property" and check whether a
node has already been removed at runtime.

We prevent both by generalizing the "invalidate non-owning references" behavior
of ``bpf_spin_unlock`` and doing similar invalidation after
``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:

  * takes an arbitrary node argument

  * removes it from the data structure

  * returns an owning reference to the removed node

May result in a state where some other non-owning reference points to the same
node. So ``remove``-type kfuncs must be considered a non-owning reference
invalidation point as well.
Commit	Line	Data
c31315c3 DM	1	=========================
	2	BPF Graph Data Structures
	3	=========================
	4
	5	This document describes implementation details of new-style "graph" data
	6	structures (linked_list, rbtree), with particular focus on the verifier's
	7	implementation of semantics specific to those data structures.
	8
	9	Although no specific verifier code is referred to in this document, the document
	10	assumes that the reader has general knowledge of BPF verifier internals, BPF
	11	maps, and BPF program writing.
	12
	13	Note that the intent of this document is to describe the current state of
	14	these graph data structures. No guarantees of stability for either
	15	semantics or APIs are made or implied here.
	16
	17	.. contents::
	18	:local:
	19	:depth: 2
	20
	21	Introduction
	22	------------
	23
	24	The BPF map API has historically been the main way to expose data structures
	25	of various types for use within BPF programs. Some data structures fit naturally
	26	with the map API (HASH, ARRAY), others less so. Consequentially, programs
	27	interacting with the latter group of data structures can be hard to parse
	28	for kernel programmers without previous BPF experience.
	29
	30	Luckily, some restrictions which necessitated the use of BPF map semantics are
	31	no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
	32	BPF allocator, it is now possible to implement BPF data structures whose API
	33	and semantics more closely match those exposed to the rest of the kernel.
	34
	35	Two such data structures - linked_list and rbtree - have many verification
	36	details in common. Because both have "root"s ("head" for linked_list) and
	37	"node"s, the verifier code and this document refer to common functionality
	38	as "graph_api", "graph_root", "graph_node", etc.
	39
	40	Unless otherwise stated, examples and semantics below apply to both graph data
	41	structures.
	42
	43	Unstable API
	44	------------
	45
	46	Data structures implemented using the BPF map API have historically used BPF
	47	helper functions - either standard map API helpers like ``bpf_map_update_elem``
	48	or map-specific helpers. The new-style graph data structures instead use kfuncs
	49	to define their manipulation helpers. Because there are no stability guarantees
	50	for kfuncs, the API and semantics for these data structures can be evolved in
	51	a way that breaks backwards compatibility if necessary.
	52
	53	Root and node types for the new data structures are opaquely defined in the
	54	``uapi/linux/bpf.h`` header.
	55
	56	Locking
	57	-------
	58
	59	The new-style data structures are intrusive and are defined similarly to their
	60	vanilla kernel counterparts:
	61
	62	.. code-block:: c
e2d323a1	63
c31315c3 DM	64	struct node_data {
	65	long key;
	66	long data;
	67	struct bpf_rb_node node;
	68	};
	69
	70	struct bpf_spin_lock glock;
	71	struct bpf_rb_root groot __contains(node_data, node);
	72
	73	The "root" type for both linked_list and rbtree expects to be in a map_value
	74	which also contains a ``bpf_spin_lock`` - in the above example both global
	75	variables are placed in a single-value arraymap. The verifier considers this
	76	spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
	77	the same map_value and will enforce that the correct lock is held when
	78	verifying BPF programs that manipulate the tree. Since this lock checking
	79	happens at verification time, there is no runtime penalty.
	80
	81	Non-owning references
	82	---------------------
	83
	84	Motivation
	85
	86	Consider the following BPF code:
	87
	88	.. code-block:: c
	89
	90	struct node_data n = bpf_obj_new(typeof(n)); /* ACQUIRED */
	91
	92	bpf_spin_lock(&lock);
	93
	94	bpf_rbtree_add(&tree, n); /* PASSED */
	95
	96	bpf_spin_unlock(&lock);
	97
	98	From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
	99	has type ``PTR_TO_BTF_ID \| MEM_ALLOC``, with a ``btf_id`` of
	100	``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
	101	program has ownership of the pointee's (object pointed to by ``n``) lifetime.
	102	The BPF program must pass off ownership before exiting - either via
	103	``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
	104	``bpf_rbtree_add``.
	105
	106	(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
	107	"ownership is acquired" and "ownership is passed", respectively)
	108
	109	What should the verifier do with ``n`` after ownership is passed off? If the
	110	object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
	111	should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
	112	the object is no longer valid. The underlying memory may have been reused for
	113	some other allocation, unmapped, etc.
	114
	115	When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
	116	obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
	117	but that would result in programs with useful, common coding patterns being
	118	rejected, e.g.:
	119
	120	.. code-block:: c
	121
	122	int x;
	123	struct node_data n = bpf_obj_new(typeof(n)); /* ACQUIRED */
	124
	125	bpf_spin_lock(&lock);
	126
	127	bpf_rbtree_add(&tree, n); /* PASSED */
128	x = n->data;
129	n->data = 42;
130
131	bpf_spin_unlock(&lock);
132
133	Both the read from and write to ``n->data`` would be rejected. The verifier
134	can do better, though, by taking advantage of two details:
135
136	* Graph data structure APIs can only be used when the ``bpf_spin_lock``
137	associated with the graph root is held
138
139	* Both graph data structures have pointer stability
140
141	* Because graph nodes are allocated with ``bpf_obj_new`` and
142	adding / removing from the root involves fiddling with the
143	``bpf_{list,rb}_node`` field of the node struct, a graph node will
144	remain at the same address after either operation.
145
146	Because the associated ``bpf_spin_lock`` must be held by any program adding
147	or removing, if we're in the critical section bounded by that lock, we know
148	that no other program can add or remove until the end of the critical section.
149	This combined with pointer stability means that, until the critical section
150	ends, we can safely access the graph node through ``n`` even after it was used
151	to pass ownership.
152
153	The verifier considers such a reference a non-owning reference. The ref
154	returned by ``bpf_obj_new`` is accordingly considered an owning reference.
155	Both terms currently only have meaning in the context of graph nodes and API.
156
157	Details
158
159	Let's enumerate the properties of both types of references.
160
161	owning reference
162
163	* This reference controls the lifetime of the pointee
164
165	* Ownership of pointee must be 'released' by passing it to some graph API
166	kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
167
168	* If not released before program ends, verifier considers program invalid
169
170	* Access to the pointee's memory will not page fault
171
172	non-owning reference
173
174	* This reference does not own the pointee
175
176	* It cannot be used to add the graph node to a graph root, nor ``free``'d via
177	``bpf_obj_drop``
178
179	* No explicit control of lifetime, but can infer valid lifetime based on
180	non-owning ref existence (see explanation below)
181
182	* Access to the pointee's memory will not page fault
183
184	From verifier's perspective non-owning references can only exist
185	between spin_lock and spin_unlock. Why? After spin_unlock another program
186	can do arbitrary operations on the data structure like removing and ``free``-ing
187	via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
188	``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
189	Or the memory could go away.
190
191	To prevent this logic violation all non-owning references are invalidated by the
192	verifier after a critical section ends. This is necessary to ensure the "will
193	not page fault" property of non-owning references. So if the verifier hasn't
194	invalidated a non-owning ref, accessing it will not page fault.
195
196	Currently ``bpf_obj_drop`` is not allowed in the critical section, so
197	if there's a valid non-owning ref, we must be in a critical section, and can
198	conclude that the ref's memory hasn't been dropped-and- ``free``'d or
199	dropped-and-reused.
200
201	Any reference to a node that is in an rbtree _must_ be non-owning, since
202	the tree has control of the pointee's lifetime. Similarly, any ref to a node
203	that isn't in rbtree _must_ be owning. This results in a nice property:
204	graph API add / remove implementations don't need to check if a node
205	has already been added (or already removed), as the ownership model
206	allows the verifier to prevent such a state from being valid by simply checking
207	types.
208
209	However, pointer aliasing poses an issue for the above "nice property".
210	Consider the following example:
211
212	.. code-block:: c
213
214	struct node_data n, m, o, p;
215	n = bpf_obj_new(typeof(n)); / 1 */
216
217	bpf_spin_lock(&lock);
218
219	bpf_rbtree_add(&tree, n); /* 2 */
220	m = bpf_rbtree_first(&tree); /* 3 */
221
222	o = bpf_rbtree_remove(&tree, n); /* 4 */
223	p = bpf_rbtree_remove(&tree, m); /* 5 */
224
225	bpf_spin_unlock(&lock);
226
227	bpf_obj_drop(o);
228	bpf_obj_drop(p); /* 6 */
229
230	Assume the tree is empty before this program runs. If we track verifier state
231	changes here using numbers in above comments:
232
233	1) n is an owning reference
234
235	2) n is a non-owning reference, it's been added to the tree
236
237	3) n and m are non-owning references, they both point to the same node
238
239	4) o is an owning reference, n and m non-owning, all point to same node
240
241	5) o and p are owning, n and m non-owning, all point to the same node
242
243	6) a double-free has occurred, since o and p point to same node and o was
244	``free``'d in previous statement
245
246	States 4 and 5 violate our "nice property", as there are non-owning refs to
247	a node which is not in an rbtree. Statement 5 will try to remove a node which
248	has already been removed as a result of this violation. State 6 is a dangerous
249	double-free.
250
251	At a minimum we should prevent state 6 from being possible. If we can't also
252	prevent state 5 then we must abandon our "nice property" and check whether a
253	node has already been removed at runtime.
254
255	We prevent both by generalizing the "invalidate non-owning references" behavior
256	of ``bpf_spin_unlock`` and doing similar invalidation after
257	``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
258
259	* takes an arbitrary node argument
260
261	* removes it from the data structure
262
263	* returns an owning reference to the removed node
264
265	May result in a state where some other non-owning reference points to the same
266	node. So ``remove``-type kfuncs must be considered a non-owning reference
267	invalidation point as well.