[linux-block.git] / Documentation / bpf / classic_vs_extended.rst


===================
Classic BPF vs eBPF
===================

eBPF is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.

Some core changes of the eBPF format from classic BPF:

- Number of registers increase from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to be 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs are passing arguments to functions via registers
  the number of args from eBPF program to in-kernel function is restricted
  to 5 and one register is used to accept return value from an in-kernel
  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and eBPF calling convention maps directly to ABIs used by the kernel on
  64-bit architectures.

  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
  and may let more complex programs to be interpreted.

  R0 - R5 are scratch registers and eBPF program needs spill/fill them if
  necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions, it can only
  call predefined in-kernel functions, though.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit if they are being written to.
  That behavior maps directly to x86_64 and arm64 subregister definition, but
  makes other JITs more difficult.

  32-bit architectures run 64-bit eBPF programs via interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  native instruction set and let the rest being interpreted.

  Operation is 64-bit, because on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
  so 32-bit eBPF registers would otherwise require to define register-pair
  ABI, thus, there won't be able to use a direct eBPF register to HW register
  mapping and JIT would need to do combine/split/move operations for every
  register in and out of the function, which is complex, bug prone and slow.
  Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are being replaced into alternative constructs like
  ``if (cond) jump_true; /* else fall-through */``.

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the eBPF program needs to
  place function arguments into R1 to R5 registers to satisfy calling
  convention, then the interpreter will take them from registers and pass
  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
  that are used for argument passing on given architecture, the JIT compiler
  doesn't need to emit extra moves. Function arguments will be in the correct
  registers and BPF_CALL instruction will be JITed as single 'call' HW
  instruction. This calling convention was picked to cover common call
  situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  a return value of the function. Since R6 - R9 are callee saved, their state
  is preserved across the call.

  For example, consider three C functions::

    u64 f1() { return (*_f2)(1); }
    u64 f2(u64 a) { return f3(a + 1, a); }
    u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64::

    f1:
	movl $1, %edi
	movq _f2(%rip), %rax
	jmp  *%rax
    f3:
	movq %rdi, %rax
	subq %rsi, %rax
	ret

  Function f2 in eBPF may look like::

    f2:
	bpf_mov R2, R1
	bpf_add R1, 1
	bpf_call f3
	bpf_exit

  If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
  returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
  be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx' which is
  already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
  are currently not supported, but these restrictions can be lifted if necessary
  in the future.

  On 64-bit architectures all register map to HW registers one to one. For
  example, x86_64 JIT compiler can map them as ...

  ::

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
  and rbx, r12 - r15 are callee saved.

  Then the following eBPF pseudo-program::

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64 may look like::

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which is in this example equivalent in C to::

    u64 bpf_filter(u64 ctx)
    {
	return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
  registers and place their return value into ``%rax`` which is R0 in eBPF.
  Prologue and epilogue are emitted by JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
  them across the calls as defined by calling convention.

  For example the following program is invalid::

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  An in-kernel verifier.rst is used to validate eBPF programs.

Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and eBPF are two operand instructions,
which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.

The input context pointer for invoking the interpreter function is generic,
its content is defined by a specific use case. For seccomp register R1 points
to seccomp_data, for converted BPF filters R1 points to a skb.

A program, that is translated internally consists of the following elements::

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32

So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field
has room for new instructions. Some of them may use 16/24/32 byte encoding. New
instructions must be multiple of 8 bytes to preserve backward compatibility.

eBPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to eBPF.
For example, socket filters are not using ``exclusive add`` instruction, but
tracing filters may do to maintain counters of events, for example. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.

eBPF can be used as a generic assembler for last step performance
optimizations, socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In kernel usage
may not be bounded by security considerations, since generated eBPF code
may be optimizing internal code path and not being exposed to the user space.
Safety of eBPF can come from the verifier.rst. In such use cases as
described, it may be used as safe instruction set.

Just like the original BPF, eBPF runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: first step does depth-first-search to disallow
loops and other CFG validation; second step starts from the first insn and
descends all possible paths. It simulates execution of every insn and observes
the state change of registers and stack.

opcode encoding
===============

eBPF is reusing most of the opcode encoding from classic to simplify conversion
of classic BPF to eBPF.

For arithmetic and jump instructions the 8-bit 'code' field is divided into three
parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

The 4th bit encodes the source operand ...

    ::

	BPF_K     0x00
	BPF_X     0x08

 * in classic BPF, this means::

	BPF_SRC(code) == BPF_X - use register X as source operand
	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

 * in eBPF, this means::

	BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.

Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg

Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
operands for the comparisons instead.

For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation::

 B  - 1 byte
 H  - 2 byte
 W  - 4 byte
 DW - 8 byte (eBPF only)

Mode modifier is one of::

  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS     0x20
  BPF_IND     0x40
  BPF_MEM     0x60
  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
Commit	Line	Data
41db511a CH	1
	2	===================
	3	Classic BPF vs eBPF
	4	===================
	5
	6	eBPF is designed to be JITed with one to one mapping, which can also open up
	7	the possibility for GCC/LLVM compilers to generate optimized eBPF code through
	8	an eBPF backend that performs almost as fast as natively compiled code.
	9
	10	Some core changes of the eBPF format from classic BPF:
	11
	12	- Number of registers increase from 2 to 10:
	13
	14	The old format had two registers A and X, and a hidden frame pointer. The
	15	new layout extends this to be 10 internal registers and a read-only frame
	16	pointer. Since 64-bit CPUs are passing arguments to functions via registers
	17	the number of args from eBPF program to in-kernel function is restricted
	18	to 5 and one register is used to accept return value from an in-kernel
	19	function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
	20	sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
	21	registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
	22
	23	Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
	24	etc, and eBPF calling convention maps directly to ABIs used by the kernel on
	25	64-bit architectures.
	26
	27	On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
	28	and may let more complex programs to be interpreted.
	29
	30	R0 - R5 are scratch registers and eBPF program needs spill/fill them if
	31	necessary across calls. Note that there is only one eBPF program (== one
	32	eBPF main routine) and it cannot call other eBPF functions, it can only
	33	call predefined in-kernel functions, though.
	34
	35	- Register width increases from 32-bit to 64-bit:
	36
	37	Still, the semantics of the original 32-bit ALU operations are preserved
	38	via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
	39	subregisters that zero-extend into 64-bit if they are being written to.
	40	That behavior maps directly to x86_64 and arm64 subregister definition, but
	41	makes other JITs more difficult.
	42
	43	32-bit architectures run 64-bit eBPF programs via interpreter.
	44	Their JITs may convert BPF programs that only use 32-bit subregisters into
	45	native instruction set and let the rest being interpreted.
	46
	47	Operation is 64-bit, because on 64-bit architectures, pointers are also
	48	64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
	49	so 32-bit eBPF registers would otherwise require to define register-pair
	50	ABI, thus, there won't be able to use a direct eBPF register to HW register
	51	mapping and JIT would need to do combine/split/move operations for every
	52	register in and out of the function, which is complex, bug prone and slow.
	53	Another reason is the use of atomic 64-bit counters.
	54
	55	- Conditional jt/jf targets replaced with jt/fall-through:
	56
	57	While the original design has constructs such as ``if (cond) jump_true;
	58	else jump_false;``, they are being replaced into alternative constructs like
	59	``if (cond) jump_true; /* else fall-through */``.
	60
	61	- Introduces bpf_call insn and register passing convention for zero overhead
	62	calls from/to other kernel functions:
	63
	64	Before an in-kernel function call, the eBPF program needs to
65	place function arguments into R1 to R5 registers to satisfy calling
66	convention, then the interpreter will take them from registers and pass
67	to in-kernel function. If R1 - R5 registers are mapped to CPU registers
68	that are used for argument passing on given architecture, the JIT compiler
69	doesn't need to emit extra moves. Function arguments will be in the correct
70	registers and BPF_CALL instruction will be JITed as single 'call' HW
71	instruction. This calling convention was picked to cover common call
72	situations without performance penalty.
73
74	After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
75	a return value of the function. Since R6 - R9 are callee saved, their state
76	is preserved across the call.
77
78	For example, consider three C functions::
79
80	u64 f1() { return (*_f2)(1); }
81	u64 f2(u64 a) { return f3(a + 1, a); }
82	u64 f3(u64 a, u64 b) { return a - b; }
83
84	GCC can compile f1, f3 into x86_64::
85
86	f1:
87	movl $1, %edi
88	movq _f2(%rip), %rax
89	jmp *%rax
90	f3:
91	movq %rdi, %rax
92	subq %rsi, %rax
93	ret
94
95	Function f2 in eBPF may look like::
96
97	f2:
98	bpf_mov R2, R1
99	bpf_add R1, 1
100	bpf_call f3
101	bpf_exit
102
103	If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
104	returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
105	be used to call into f2.
106
107	For practical reasons all eBPF programs have only one argument 'ctx' which is
108	already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
109	can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
110	are currently not supported, but these restrictions can be lifted if necessary
111	in the future.
112
113	On 64-bit architectures all register map to HW registers one to one. For
114	example, x86_64 JIT compiler can map them as ...
115
116	::
117
118	R0 - rax
119	R1 - rdi
120	R2 - rsi
121	R3 - rdx
122	R4 - rcx
123	R5 - r8
124	R6 - rbx
125	R7 - r13
126	R8 - r14
127	R9 - r15
128	R10 - rbp
129
130	... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
131	and rbx, r12 - r15 are callee saved.
132
133	Then the following eBPF pseudo-program::
134
135	bpf_mov R6, R1 /* save ctx */
136	bpf_mov R2, 2
137	bpf_mov R3, 3
138	bpf_mov R4, 4
139	bpf_mov R5, 5
140	bpf_call foo
141	bpf_mov R7, R0 /* save foo() return value */
142	bpf_mov R1, R6 /* restore ctx for next call */
143	bpf_mov R2, 6
144	bpf_mov R3, 7
145	bpf_mov R4, 8
146	bpf_mov R5, 9
147	bpf_call bar
148	bpf_add R0, R7
149	bpf_exit
150
151	After JIT to x86_64 may look like::
152
153	push %rbp
154	mov %rsp,%rbp
155	sub $0x228,%rsp
156	mov %rbx,-0x228(%rbp)
157	mov %r13,-0x220(%rbp)
158	mov %rdi,%rbx
159	mov $0x2,%esi
160	mov $0x3,%edx
161	mov $0x4,%ecx
162	mov $0x5,%r8d
163	callq foo
164	mov %rax,%r13
165	mov %rbx,%rdi
166	mov $0x6,%esi
167	mov $0x7,%edx
168	mov $0x8,%ecx
169	mov $0x9,%r8d
170	callq bar
171	add %r13,%rax
172	mov -0x228(%rbp),%rbx
173	mov -0x220(%rbp),%r13
174	leaveq
175	retq
176
177	Which is in this example equivalent in C to::
178
179	u64 bpf_filter(u64 ctx)
180	{
181	return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
182	}
183
184	In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
185	arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
186	registers and place their return value into ``%rax`` which is R0 in eBPF.
187	Prologue and epilogue are emitted by JIT and are implicit in the
188	interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
189	them across the calls as defined by calling convention.
190
191	For example the following program is invalid::
192
193	bpf_mov R1, 1
194	bpf_call foo
195	bpf_mov R0, R1
196	bpf_exit
197
198	After the call the registers R1-R5 contain junk values and cannot be read.
199	An in-kernel verifier.rst is used to validate eBPF programs.
200
201	Also in the new design, eBPF is limited to 4096 insns, which means that any
202	program will terminate quickly and will only call a fixed number of kernel
203	functions. Original BPF and eBPF are two operand instructions,
204	which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
205
206	The input context pointer for invoking the interpreter function is generic,
207	its content is defined by a specific use case. For seccomp register R1 points
208	to seccomp_data, for converted BPF filters R1 points to a skb.
209
210	A program, that is translated internally consists of the following elements::
211
212	op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
213
214	So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field
215	has room for new instructions. Some of them may use 16/24/32 byte encoding. New
216	instructions must be multiple of 8 bytes to preserve backward compatibility.
217
218	eBPF is a general purpose RISC instruction set. Not every register and
219	every instruction are used during translation from original BPF to eBPF.
220	For example, socket filters are not using ``exclusive add`` instruction, but
221	tracing filters may do to maintain counters of events, for example. Register R9
222	is not used by socket filters either, but more complex filters may be running
223	out of registers and would have to resort to spill/fill to stack.
224
225	eBPF can be used as a generic assembler for last step performance
226	optimizations, socket filters and seccomp are using it as assembler. Tracing
227	filters may use it as assembler to generate code from kernel. In kernel usage
228	may not be bounded by security considerations, since generated eBPF code
229	may be optimizing internal code path and not being exposed to the user space.
230	Safety of eBPF can come from the verifier.rst. In such use cases as
231	described, it may be used as safe instruction set.
232
233	Just like the original BPF, eBPF runs within a controlled environment,
234	is deterministic and the kernel can easily prove that. The safety of the program
235	can be determined in two steps: first step does depth-first-search to disallow
236	loops and other CFG validation; second step starts from the first insn and
237	descends all possible paths. It simulates execution of every insn and observes
238	the state change of registers and stack.
239
240	opcode encoding
241	===============
242
243	eBPF is reusing most of the opcode encoding from classic to simplify conversion
244	of classic BPF to eBPF.
245
246	For arithmetic and jump instructions the 8-bit 'code' field is divided into three
247	parts::
248
249	+----------------+--------+--------------------+
250	\| 4 bits \| 1 bit \| 3 bits \|
251	\| operation code \| source \| instruction class \|
252	+----------------+--------+--------------------+
253	(MSB) (LSB)
254
255	Three LSB bits store instruction class which is one of:
256
257	=================== ===============
258	Classic BPF classes eBPF classes
259	=================== ===============
260	BPF_LD 0x00 BPF_LD 0x00
261	BPF_LDX 0x01 BPF_LDX 0x01
262	BPF_ST 0x02 BPF_ST 0x02
263	BPF_STX 0x03 BPF_STX 0x03
264	BPF_ALU 0x04 BPF_ALU 0x04
265	BPF_JMP 0x05 BPF_JMP 0x05
266	BPF_RET 0x06 BPF_JMP32 0x06
267	BPF_MISC 0x07 BPF_ALU64 0x07
268	=================== ===============
269
270	The 4th bit encodes the source operand ...
271
272	::
273
274	BPF_K 0x00
275	BPF_X 0x08
276
277	* in classic BPF, this means::
278
279	BPF_SRC(code) == BPF_X - use register X as source operand
280	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
281
282	* in eBPF, this means::
283
284	BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
285	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
286
287	... and four MSB bits store operation code.
288
289	If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
290
291	BPF_ADD 0x00
292	BPF_SUB 0x10
293	BPF_MUL 0x20
294	BPF_DIV 0x30
295	BPF_OR 0x40
296	BPF_AND 0x50
297	BPF_LSH 0x60
298	BPF_RSH 0x70
299	BPF_NEG 0x80
300	BPF_MOD 0x90
301	BPF_XOR 0xa0
302	BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
303	BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
304	BPF_END 0xd0 /* eBPF only: endianness conversion */
305
306	If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
307
308	BPF_JA 0x00 /* BPF_JMP only */
309	BPF_JEQ 0x10
310	BPF_JGT 0x20
311	BPF_JGE 0x30
312	BPF_JSET 0x40
313	BPF_JNE 0x50 /* eBPF only: jump != */
314	BPF_JSGT 0x60 /* eBPF only: signed '>' */
315	BPF_JSGE 0x70 /* eBPF only: signed '>=' */
316	BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
317	BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
318	BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
319	BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
320	BPF_JSLT 0xc0 /* eBPF only: signed '<' */
321	BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
322
323	So BPF_ADD \| BPF_X \| BPF_ALU means 32-bit addition in both classic BPF
324	and eBPF. There are only two registers in classic BPF, so it means A += X.
325	In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
326	BPF_XOR \| BPF_K \| BPF_ALU means A ^= imm32 in classic BPF and analogous
327	src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
328
329	Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
330	eBPF is using BPF_MOV \| BPF_X \| BPF_ALU code instead. Since there are no
331	BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
332	exactly the same operations as BPF_ALU, but with 64-bit wide operands
333	instead. So BPF_ADD \| BPF_X \| BPF_ALU64 means 64-bit addition, i.e.:
334	dst_reg = dst_reg + src_reg
335
336	Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
337	operation. Classic BPF_RET \| BPF_K means copy imm32 into return register
338	and perform function exit. eBPF is modeled to match CPU, so BPF_JMP \| BPF_EXIT
339	in eBPF means function exit only. The eBPF program needs to store return
340	value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
341	BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
342	operands for the comparisons instead.
343
344	For load and store instructions the 8-bit 'code' field is divided as::
345
346	+--------+--------+-------------------+
347	\| 3 bits \| 2 bits \| 3 bits \|
348	\| mode \| size \| instruction class \|
349	+--------+--------+-------------------+
350	(MSB) (LSB)
351
352	Size modifier is one of ...
353
354	::
355
356	BPF_W 0x00 /* word */
357	BPF_H 0x08 /* half word */
358	BPF_B 0x10 /* byte */
359	BPF_DW 0x18 /* eBPF only, double word */
360
361	... which encodes size of load/store operation::
362
363	B - 1 byte
364	H - 2 byte
365	W - 4 byte
366	DW - 8 byte (eBPF only)
367
368	Mode modifier is one of::
369
370	BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
371	BPF_ABS 0x20
372	BPF_IND 0x40
373	BPF_MEM 0x60
374	BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
375	BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
376	BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */