git.kernel.dk Git - linux-2.6-block.git/log

selftests/bpf: test_progs: drop extra trailing tab

Small (un)related cleanup.

Cc: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_progs: test__printf -> printf

Now that test__printf is a simple wrapper around printf, let's drop it
(and test__vprintf as well).

Cc: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test_progs: switch to open_memstream

Use open_memstream to override stdout during test execution.
The copy of the original stdout is held in env.stdout and used
to print subtest info and dump failed log.

test__{v,}printf are now simple wrappers around stdout and will be
removed in the next patch.
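
A minimal userspace sketch of the stdout override described above, assuming glibc (where stdout is an assignable variable). The struct and function names are hypothetical and only loosely follow the env.stdout idea; this is not the test_progs code itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical capture state, loosely modelled on env.stdout above. */
struct capture {
	FILE *saved_stdout; /* the real stdout, kept for reporting */
	char *log_buf;      /* filled in by open_memstream() */
	size_t log_cnt;     /* number of bytes in log_buf */
};

/* Redirect stdout into an in-memory stream. Returns 0 on success. */
int capture_begin(struct capture *c)
{
	FILE *mem;

	assert(c != NULL);
	mem = open_memstream(&c->log_buf, &c->log_cnt);
	if (!mem)
		return -1;
	c->saved_stdout = stdout;
	stdout = mem; /* assignable on glibc; not portable C */
	return 0;
}

/* Restore stdout; the caller owns c->log_buf and must free() it. */
void capture_end(struct capture *c)
{
	assert(c != NULL);
	fclose(stdout); /* flushes and finalizes log_buf/log_cnt */
	stdout = c->saved_stdout;
}

/* Dump the captured log to the real stdout, e.g. after a failed test. */
void capture_dump(const struct capture *c)
{
	assert(c != NULL);
	fwrite(c->log_buf, 1, c->log_cnt, c->saved_stdout);
}
```

Because the stream tracks the byte count itself, the caller never needs strlen() over the buffer, which matches the v3 note below.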

v5:
* fix -v crash by always setting env.std{in,err} (Alexei Starovoitov)
* drop force_log check from stdio_hijack (Andrii Nakryiko)

v4:
* one field per line for stdout/stderr (Andrii Nakryiko)

v3:
* don't do strlen over log_buf, log_cnt has it already (Andrii Nakryiko)

v2:
* add ifdef __GLIBC__ around open_memstream (maybe pointless since
we already depend on glibc for argp_parse)
* hijack stderr as well (Andrii Nakryiko)
* don't hijack for every test, do it once (Andrii Nakryiko)
* log_cap -> log_size (Andrii Nakryiko)
* do fseeko in a proper place (Andrii Nakryiko)
* check open_memstream returned value (Andrii Nakryiko)

Cc: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add loop test 5

Add a test with multiple exit conditions.
It's not an infinite loop only when the verifier can properly track
all math on variable 'i' through all possible ways of executing this loop.

barrier()s are needed to disable llvm optimization that combines multiple
branches into fewer branches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>

selftests/bpf: add loop test 4

Add a test that returns a 'random' number in the range [0, 2^20).
If state pruning is not working correctly for loop body the number of
processed insns will be 2^20 * num_of_insns_in_loop_body and the program
will be rejected.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>

Merge branch 'setsockopt-extra-mem'

Stanislav Fomichev says:

====================
Current setsockopt hook is limited to the size of the buffer that
user had supplied. Since we always allocate memory and copy the value
into kernel space, allocate just a little bit more in case BPF
program needs to override input data with a larger value.

The canonical example is TCP_CONGESTION socket option where
input buffer is a string and if user calls it with a short string,
BPF program has no way of extending it.

The tests are extended with TCP_CONGESTION use case.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: extend sockopt_sk selftest with TCP_CONGESTION use case

Ignore SOL_TCP:TCP_CONGESTION in getsockopt and always override
SOL_TCP:TCP_CONGESTION with "cubic" in setsockopt hook.

Call setsockopt(SOL_TCP, TCP_CONGESTION) with short optval ("nv")
to make sure BPF program has enough buffer space to replace it
with "cubic".

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: always allocate at least 16 bytes for setsockopt hook

Since we always allocate memory, allocate just a little bit more
for the BPF program in case it needs to override user input with
a bigger value. The canonical example is TCP_CONGESTION where
input string might be too small to override (nv -> bbr or cubic).

16 bytes are chosen to match the size of TCP_CA_NAME_MAX and can
be extended in the future if needed.
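
A userspace sketch of the sizing rule described above, not the kernel code. The helper names and MIN_HOOK_BUF are hypothetical; only the 16-byte minimum and the "nv" to "cubic" example come from this message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MIN_HOOK_BUF 16 /* mirrors the 16 bytes named above */

/* Size of the buffer handed to the hook for a user-supplied optlen. */
size_t hook_buf_len(size_t optlen)
{
	return optlen < MIN_HOOK_BUF ? MIN_HOOK_BUF : optlen;
}

/*
 * Stand-in for the hook: copy the user's value into a buffer of
 * hook_buf_len() bytes, then overwrite it with a longer name.
 * Returns the new option length, or 0 if the new value does not fit.
 */
size_t override_optval(const char *user_val, size_t optlen,
		       const char *new_val, char **out)
{
	size_t cap = hook_buf_len(optlen);
	size_t new_len = strlen(new_val) + 1; /* include the NUL */
	char *buf = calloc(1, cap);

	assert(buf != NULL);
	memcpy(buf, user_val, optlen);
	if (new_len > cap) {
		free(buf);
		*out = NULL;
		return 0;
	}
	memcpy(buf, new_val, new_len);
	*out = buf;
	return new_len;
}
```

With a 3-byte user buffer ("nv" plus NUL) the 6-byte "cubic" fits only because the allocation is rounded up to 16 bytes.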

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: add support for reporting the effective cgroup progs

Takshak said in the original submission:

With different bpf attach_flags available to attach bpf programs, especially
with BPF_F_ALLOW_OVERRIDE and BPF_F_ALLOW_MULTI, the list of effective
bpf-programs available to any sub-cgroups really needs to be available for
easy debugging.

Using BPF_F_QUERY_EFFECTIVE flag, one can get the list of not only attached
bpf-programs to a cgroup but also the inherited ones from parent cgroup.

So a new option is introduced to use BPF_F_QUERY_EFFECTIVE query flag here
to list all the effective bpf-programs available for execution at a specified
cgroup.

Reused modified test program test_cgroup_attach from tools/testing/selftests/bpf:
  # ./test_cgroup_attach

With old bpftool:

# bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/
  ID       AttachType      AttachFlags     Name
  271      egress          multi           pkt_cntr_1
  272      egress          multi           pkt_cntr_2

Attached new program pkt_cntr_4 in cg2 gives following:

# bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2
  ID       AttachType      AttachFlags     Name
  273      egress          override        pkt_cntr_4

And with new "effective" option it shows all effective programs for cg2:

# bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 effective
  ID       AttachType      AttachFlags     Name
  273      egress          override        pkt_cntr_4
  271      egress          override        pkt_cntr_1
  272      egress          override        pkt_cntr_2

Compared to original submission use a local flag instead of global
option.

We need to clear query_flags on every command, in case batch mode
wants to use varying settings.

v2: (Takshak)
- forbid duplicated flags;
- fix cgroup path freeing.

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Takshak Chahande <ctakshak@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: fix clearing buffered output between tests/subtests

Clear buffered output once a test or subtest finishes, even if the test was
successful. Not doing this leads to accumulation of output from previous
tests, and on the first failed test lots of irrelevant output will be
dumped, greatly confusing things.

v1->v2: fix Fixes tag, add more context to patch

Fixes: 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'gen-syn-cookie'

Petar Penkov says:

====================
This patch series introduces a BPF helper function that allows generating SYN
cookies from BPF. Currently, this helper is enabled at both the TC hook and the
XDP hook.

The first two patches in the series add/modify several TCP helper functions to
allow for SKB-less operation, as is the case at the XDP hook.

The third patch introduces the bpf_tcp_gen_syncookie helper function which
generates a SYN cookie for either XDP or TC programs. The return value of
this function contains both the MSS value, encoded in the cookie, and the
cookie itself.
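
A small sketch of how a caller might split that return value. It assumes the layout given in the helper documentation (cookie in the lower 32 bits, MSS in the next 16 bits, negative values are errors); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct syncookie_result {
	uint32_t cookie; /* lower 32 bits of the return value */
	uint16_t mss;    /* the 16 bits above the cookie */
};

/* Returns 0 and fills *res on success, or the negative error unchanged. */
int decode_syncookie_ret(int64_t ret, struct syncookie_result *res)
{
	assert(res != NULL);
	if (ret < 0)
		return (int)ret;
	res->cookie = (uint32_t)ret;
	res->mss = (uint16_t)(ret >> 32);
	return 0;
}
```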

The last three patches sync tools/ and add a test.

Performance evaluation:
I sent 10Mpps to a fixed port on a host with 2 10G bonded Mellanox 4 NICs from
random IPv6 source addresses. Without XDP I observed 7.2Mpps (syn-acks) being
sent out if the IPv6 packets carry 20 bytes of TCP options or 7.6Mpps if they
carry no options. If I attached a simple program that checks if a packet is
IPv6/TCP/SYN, looks up the socket, issues a cookie, and sends it back out after
swapping src/dest, recomputing the checksum, and setting the ACK flag, I
observed 10Mpps being sent back out.

Changes since v1:
1/ Added performance numbers to the cover letter
2/ Patch 2: Refactored a bit to fix compilation issues
3/ Patch 3: Changed ENOTSUPP to EOPNOTSUPP at Toke's suggestion

Changes since RFC:
1/ Cookie is returned in host order at Alexei's suggestion
2/ If cookies are not enabled via a sysctl, the helper function returns
   -ENOENT instead of -EINVAL at Lorenz's suggestion
3/ Fixed documentation to properly reflect that MSS is 16 bits at
   Lorenz's suggestion
4/ BPF helper requires TCP length to match ->doff field, rather than to simply
   be no more than 20 bytes at Eric and Alexei's suggestion
5/ Packet type is looked up from the packet version field, rather than from the
   socket. v4 packets are rejected on v6-only sockets but should work with
   dual stack listeners at Eric's suggestion
6/ Removed unnecessary `net` argument from helper function in patch 2 at
   Lorenz's suggestion
7/ Changed test to only pass MSS option so we can convince the verifier that the
   memory access is not out of bounds

Note that 7/ above illustrates that the verifier might need to be extended to allow
passing a variable tcph->doff to the helper function like below:

__u32 thlen = tcph->doff * 4;
if (thlen < sizeof(*tcph))
    return;
__s64 cookie = bpf_tcp_gen_syncookie(sk, ipv4h, 20, tcph, thlen);
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add test for bpf_tcp_gen_syncookie

Modify the existing bpf_tcp_check_syncookie test to also generate a
SYN cookie, pass the packet to the kernel, and verify that the two
cookies are the same (and both valid). Since cloned SKBs are skipped
during generic XDP, this test does not issue a SYN cookie when run in
XDP mode. We therefore only check that a valid SYN cookie was issued at
the TC hook.

Additionally, verify that the MSS for that SYN cookie is within
expected range.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: bpf_tcp_gen_syncookie->bpf_helpers

Expose bpf_tcp_gen_syncookie to selftests.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: sync bpf.h to tools/

Sync updated documentation for bpf_redirect_map.

Sync the bpf_tcp_gen_syncookie helper function definition with the one
in tools/uapi.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add bpf_tcp_gen_syncookie helper

This helper function allows BPF programs to try to generate SYN
cookies, given a reference to a listener socket. The function works
from XDP and with an skb context since bpf_skc_lookup_tcp can lookup a
socket in both cases.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tcp: add skb-less helpers to retrieve SYN cookie

This patch allows generation of a SYN cookie before an SKB has been
allocated, as is the case at XDP.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tcp: tcp_syn_flood_action read port from socket

This allows us to call this function before an SKB has been
allocated.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'devmap_hash'

Toke Høiland-Jørgensen says:

====================
This series adds a new map type, devmap_hash, that works like the existing
devmap type, but using a hash-based indexing scheme. This is useful for the use
case where a devmap is indexed by ifindex (for instance for use with the routing
table lookup helper). For this use case, the regular devmap needs to be sized
after the maximum ifindex number, not the number of devices in it. A hash-based
indexing scheme makes it possible to size the map after the number of devices it
should contain instead.

This was previously part of my patch series that also turned the regular
bpf_redirect() helper into a map-based one; for this series I just pulled out
the patches that introduced the new map type.

Changelog:

v5:

- Dynamically set the number of hash buckets by rounding up max_entries to the
nearest power of two (mirroring the regular hashmap), as suggested by Jesper.

v4:

- Remove check_memlock parameter that was left over from an earlier patch
series.
- Reorder struct members to avoid holes.

v3:

- Rework the split into different patches
- Use spin_lock_irqsave()
- Also add documentation and bash completion definitions for bpftool

v2:

- Split commit adding the new map type so uapi and tools changes are separate.

Changes to these patches since the previous series:

- Rebase on top of the other devmap changes (makes this one simpler!)
- Don't enforce key==val, but allow arbitrary indexes.
- Rename the type to devmap_hash to reflect the fact that it's just a hashmap now.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: Add definitions for devmap_hash map type

This adds selftest and bpftool updates for the devmap_hash map type.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/libbpf_probes: Add new devmap_hash type

This adds the definition for BPF_MAP_TYPE_DEVMAP_HASH to libbpf_probes.c in
tools/lib/bpf.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/include/uapi: Add devmap_hash BPF map type

This adds the devmap_hash BPF map type to the uapi headers in tools/.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

xdp: Add devmap_hash map type for looking up devices by hashed index

A common pattern when using xdp_redirect_map() is to create a device map
where the lookup key is simply ifindex. Because device maps are arrays,
this leaves holes in the map, and the map has to be sized to fit the
largest ifindex, regardless of how many devices are actually
needed in the map.

This patch adds a second type of device map where the key is looked up
using a hashmap, instead of being used as an array index. This allows maps
to be densely packed, so they can be smaller.
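
A rough sketch of the sizing difference, assuming power-of-two bucket counts (as the cover letter's v5 note describes) and a simple mask to pick a bucket. The function names are hypothetical and the kernel's actual hashing may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the nearest power of two (1 <= n <= 2^31). */
uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	assert(n >= 1 && n <= (UINT32_C(1) << 31));
	while (p < n)
		p <<= 1;
	return p;
}

/* Array-style devmap: slots needed when keyed directly by ifindex. */
uint32_t array_slots(const uint32_t *ifindexes, uint32_t count)
{
	uint32_t max = 0;

	for (uint32_t i = 0; i < count; i++)
		if (ifindexes[i] > max)
			max = ifindexes[i];
	return max + 1;
}

/* Hash-style devmap: bucket chosen by masking, valid for 2^k buckets. */
uint32_t bucket_of(uint32_t key, uint32_t n_buckets)
{
	return key & (n_buckets - 1);
}
```

For three devices with ifindexes 3, 1000 and 42, the array form needs 1001 slots while the hashed form needs only roundup_pow2(3) = 4 buckets.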

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

xdp: Refactor devmap allocation code for reuse

The subsequent patch to add a new devmap sub-type can re-use much of the
initialisation and allocation code, so refactor it into separate functions.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

include/bpf.h: Remove map_insert_ctx() stubs

When we changed the device and CPU maps to use linked lists instead of
bitmaps, we also removed the need for the map_insert_ctx() helpers to keep
track of the bitmaps inside each map. However, it seems I forgot to remove
the function definitions stubs, so remove those here.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'revamp-test_progs'

Andrii Nakryiko says:

====================
This patch set makes a number of changes to test_progs selftest, which is
a collection of many other tests (and sometimes sub-tests as well), to provide
a better testing experience and to start converting many individual test
programs under selftests/bpf into a single and convenient test runner.

Patch #1 fixes an issue with the Makefile that causes prog_tests/test.h to be
compiled as C code. This fix makes it possible to change how test.h is
generated, giving more control over which tests are run and how.

Patch #2 changes how test.h is auto-generated, which allows to have test
definitions, instead of just running test functions. This gives ability to do
more complicated test run policies.

Patch #3 adds `-t <test-name>` and `-n <test-num>` selectors to run only
subset of tests.

Patch #4 changes libbpf_set_print() to return previously set print callback,
allowing to temporarily replace current print callback and then set it back.
This is necessary for some tests that want more control over libbpf logging.

Patch #5 sets up and takes over libbpf logging from individual tests to
test_progs runner, adding -vv verbosity to capture debug output from libbpf.
This is useful when debugging failing tests.

Patch #6 furthers test output management and buffers it by default, emitting
log output only if test fails. This gives succinct and clean default test
output. It's possible to bypass this behavior with -v flag, which will turn
off test output buffering.

Patch #7 adds support for sub-tests. It also enhances -t and -n selectors to
both support ability to specify sub-test selectors, as well as enhancing
number selector to accept sets of test, instead of just individual test
number.

Patch #8 converts bpf_verif_scale.c test to use sub-test APIs.

Patch #9 converts send_signal.c tests to use sub-test APIs.

v2->v3:
  - fix buffered output rare uninitialized value bug (Alexei);
  - fix buffered output va_list reuse bug (Alexei);
  - fix buffered output truncation due to interleaving zero terminators;

v1->v2:
  - drop libbpf_swap_print, instead return previous function from
    libbpf_set_print (Stanislav);
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: convert send_signal.c to use subtests

Convert send_signal set of tests to be exposed as three sub-tests.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: convert bpf_verif_scale.c to sub-tests API

Expose each BPF verifier scale test as individual sub-test to allow
independent results output and test selection.

Test run results now look like this:

  $ sudo ./test_progs -t verif/
  #3/1 loop3.o:OK
  #3/2 test_verif_scale1.o:OK
  #3/3 test_verif_scale2.o:OK
  #3/4 test_verif_scale3.o:OK
  #3/5 pyperf50.o:OK
  #3/6 pyperf100.o:OK
  #3/7 pyperf180.o:OK
  #3/8 pyperf600.o:OK
  #3/9 pyperf600_nounroll.o:OK
  #3/10 loop1.o:OK
  #3/11 loop2.o:OK
  #3/12 strobemeta.o:OK
  #3/13 strobemeta_nounroll1.o:OK
  #3/14 strobemeta_nounroll2.o:OK
  #3/15 test_sysctl_loop1.o:OK
  #3/16 test_sysctl_loop2.o:OK
  #3/17 test_xdp_loop.o:OK
  #3/18 test_seg6_loop.o:OK
  #3 bpf_verif_scale:OK

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add sub-tests support for test_progs

Allow tests to have their own set of sub-tests. Also add ability to do
test/subtest selection using `-t <test-name>/<subtest-name>` and `-n
<test-nums-set>/<subtest-nums-set>`, as an extension of existing -t/-n
selector options. For the <test-num-set> format: it's a comma-separated
list of either individual test numbers (1-based), or range of test
numbers. E.g., all of the following are valid sets of test numbers:
  - 10
  - 1,2,3
  - 1-3
  - 5-10,1,3-4

'/<subtest-nums-set>' part is optional, but has the same format. E.g., to select
test #3 and its sub-tests #10 through #15, use: -n 3/10-15.
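
The number-set format above can be parsed along these lines. This is an illustrative sketch with a hypothetical function name, not the parser used by test_progs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse a set such as "5-10,1,3-4" into sel[], where sel[n] is true when
 * 1-based number n is selected. sel must hold max + 1 entries.
 * Returns 0 on success, -1 on malformed input or a number outside 1..max.
 */
int parse_num_set(const char *s, bool *sel, int max)
{
	assert(s != NULL && sel != NULL && max >= 0);
	for (int i = 0; i <= max; i++)
		sel[i] = false;

	while (*s) {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') { /* a range: parse its upper bound */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 1 || hi < lo || hi > max)
			return -1;
		for (long n = lo; n <= hi; n++)
			sel[n] = true;
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	return 0;
}
```

The same routine could be applied twice, once to the part before '/' and once to the part after it.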

Similarly, to select tests by name, use `-t verif/strobe`:

  $ sudo ./test_progs -t verif/strobe
  #3/12 strobemeta.o:OK
  #3/13 strobemeta_nounroll1.o:OK
  #3/14 strobemeta_nounroll2.o:OK
  #3 bpf_verif_scale:OK
  Summary: 1/3 PASSED, 0 FAILED

Example of using subtest API is in the next patch, converting
bpf_verif_scale.c tests to use sub-tests.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: abstract away test log output

This patch changes how test output is printed out. By default, if test
had no errors, the only output will be a single line with test number,
name, and verdict at the end, e.g.:

#31 xdp:OK

If test had any errors, all log output captured during test execution
will be output after test completes.

It's possible to force output of log with `-v` (`--verbose`) option, in
which case output won't be buffered and will be output immediately.

To support this, individual tests are required to use helper methods for
logging: `test__printf()` and `test__vprintf()`.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftest/bpf: centralize libbpf logging management for test_progs

Make test_progs test runner own libbpf logging. Also introduce two
levels of verbosity: -v and -vv. First one will be used in subsequent
patches to enable test log output always. Second one increases verbosity
level of libbpf logging further to include debug output as well.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: return previous print callback from libbpf_set_print

By returning previously set print callback from libbpf_set_print, it's
possible to restore it, eventually. This is useful when running many
independent tests with one default print function, but overriding log
verbosity for particular subset of tests.
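
The save-and-restore pattern this enables looks roughly like the sketch below. All names here are hypothetical and the callback type is simplified; libbpf's real callback also takes a print level argument.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Simplified, hypothetical callback type. */
typedef int (*print_fn_t)(const char *fmt, va_list args);

static print_fn_t current_print; /* NULL means "no callback set" */

/* Install fn and hand back whatever was installed before. */
print_fn_t set_print(print_fn_t fn)
{
	print_fn_t old = current_print;

	current_print = fn;
	return old;
}

/* A callback that swallows all output. */
int quiet_print(const char *fmt, va_list args)
{
	(void)fmt;
	(void)args;
	return 0;
}

/* Helper for demonstration: is the quiet callback active right now? */
int current_is_quiet(void)
{
	return current_print == quiet_print;
}

/* Run fn with logging silenced, then restore the caller's callback. */
int run_quietly(int (*fn)(void))
{
	print_fn_t old;
	int ret;

	assert(fn != NULL);
	old = set_print(quiet_print);
	ret = fn();
	set_print(old);
	return ret;
}
```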

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: add test selectors by number and name to test_progs

Add ability to specify either test number or test name substring to
narrow down the set of tests to run.

Usage:
sudo ./test_progs -n 1
sudo ./test_progs -t attach_probe

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: revamp test_progs to allow more control

Refactor test_progs to allow better control on what's being run.
Also use argp to do argument parsing, so that it's easier to keep adding
more options.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: prevent headers to be compiled as C code

Apparently listing a header as a normal dependency for a binary output
makes it go through compilation as if it was C code. This currently
works without a problem, but in subsequent commits causes problems for
differently generated test.h for test_progs. Marking those headers as
order-only dependency solves the issue.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'flow_dissector-input-flags'

Stanislav Fomichev says:

====================
C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible.
BPF flow dissector always parses as deep as possible which is sub-optimal.
Pass input flags to the BPF flow dissector as well so it can make the same
decisions.

Series outline:
* remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
* export FLOW_DISSECTOR_F_XXX flags as uapi and pass them to BPF
  flow dissector
* add documentation for the exported flags
* support input flags in BPF_PROG_TEST_RUN via ctx_{in,out}
* sync uapi to tools
* support FLOW_DISSECTOR_F_PARSE_1ST_FRAG in selftest
* support FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in kernel and selftest
* support FLOW_DISSECTOR_F_STOP_AT_ENCAP in selftest

Pros:
* makes BPF flow dissector faster by avoiding burning extra cycles
* existing BPF progs continue to work by ignoring the flags and always
  parsing as deep as possible

Cons:
* new UAPI which we need to support (OTOH, if we need to deprecate some
  flags, we can just stop setting them upon calling BPF programs)

Some numbers (with .repeat = 4000000 in test_flow_dissector):
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec

        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec

        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec

        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec

The improvement is around 10%, but it's in a tight cache-hot
BPF_PROG_TEST_RUN loop.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: support BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP

Exit as soon as we find that the packet is encapsulated when
BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP is passed.
Add appropriate selftest cases.

v2:
* Subtract sizeof(struct iphdr) from .iph_inner.tot_len (Willem de Bruijn)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf/flow_dissector: support ipv6 flow_label and BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL

Add support for exporting ipv6 flow label via bpf_flow_keys.
Export flow label from bpf_flow.c and also return early when
BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: support BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG

bpf_flow.c: exit early unless BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG is
passed in flags. Also, set ip_proto earlier, this makes sure we have
correct value with fragmented packets.

Add selftest cases to test ipv4/ipv6 fragments and skip eth_get_headlen
tests that don't have BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag.

eth_get_headlen calls flow dissector with
BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag so we can't run tests that
have different set of input flags against it.

v2:
* sefltests -> selftests (Willem de Bruijn)
* Reword a comment about eth_get_headlen flags (Song Liu)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/bpf: sync bpf_flow_keys flags

Export bpf_flow_keys flags to tools/libbpf/selftests.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN

This will allow us to write tests for those flags.

v2:
* Swap kfree(data) and kfree(user_ctx) (Song Liu)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf/flow_dissector: document flags

Describe what each input flag does and who uses it.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf/flow_dissector: pass input flags to BPF flow dissector program

C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible. Pass
those flags to the BPF flow dissector so it can make the same
decisions. In the next commits I'll add support for those flags to
our reference bpf_flow.c
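
A simplified sketch of how a dissector might act on those flags. The flag values are believed to match the uapi header but should be treated as an assumption here; struct pkt_info and keep_parsing() are hypothetical and are not taken from bpf_flow.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the uapi definitions. */
#define BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG     (1U << 0)
#define BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL (1U << 1)
#define BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP      (1U << 2)

/* Hypothetical, simplified view of what has been parsed so far. */
struct pkt_info {
	bool first_frag;     /* first fragment of a fragmented packet */
	bool has_flow_label; /* IPv6 header carries a non-zero flow label */
	bool encapsulated;   /* an inner header follows (e.g. IPIP) */
};

/* True when parsing should continue past the current header. */
bool keep_parsing(const struct pkt_info *p, uint32_t flags)
{
	assert(p != NULL);
	if (p->first_frag && !(flags & BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
		return false;
	if (p->has_flow_label &&
	    (flags & BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL))
		return false;
	if (p->encapsulated && (flags & BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP))
		return false;
	return true;
}
```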

v3:
* Export copy of flow dissector flags instead of moving (Alexei Starovoitov)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add selftests for bpf_perf_event_output

Software event output is only enabled by a few prog types.
This test is to ensure that all supported types are enabled for
bpf_perf_event_output successfully.

Signed-off-by: Allan Zhang <allanzhang@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Allow bpf_skb_event_output for a few prog types

Software event output is only enabled by a few prog types right now (TC,
LWT out, XDP, sockops). Many other skb based prog types need
bpf_skb_event_output to produce software event.

Added socket_filter, cg_skb, sk_skb prog types to generate sw event.

Test bpf code is generated from code snippet:

struct TMP {
    uint64_t tmp;
} tt;
tt.tmp = 5;
bpf_perf_event_output(skb, &connection_tracking_event_map, 0,
                      &tt, sizeof(tt));
return 1;

the bpf assembly from llvm is:
       0:       b7 02 00 00 05 00 00 00         r2 = 5
       1:       7b 2a f8 ff 00 00 00 00         *(u64 *)(r10 - 8) = r2
       2:       bf a4 00 00 00 00 00 00         r4 = r10
       3:       07 04 00 00 f8 ff ff ff         r4 += -8
       4:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00    r2 = 0ll
       6:       b7 03 00 00 00 00 00 00         r3 = 0
       7:       b7 05 00 00 08 00 00 00         r5 = 8
       8:       85 00 00 00 19 00 00 00         call 25
       9:       b7 00 00 00 01 00 00 00         r0 = 1
      10:       95 00 00 00 00 00 00 00         exit

Signed-off-by: Allan Zhang <allanzhang@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'convert-tests-to-libbpf'

Andrii Nakryiko says:

====================
There were a few more tests and samples that were using custom perf buffer setup
code from trace_helpers.h. This patch set gets rid of all the usages of those
and removes helpers themselves. Libbpf provides nicer, but equally powerful
set of APIs to work with perf ring buffers, so let's have all the samples use
it.

v1->v2:
- make logging message one long line instead of two (Song).
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: remove perf buffer helpers

libbpf's perf_buffer API supersedes trace_helpers.h's helpers.
Remove those helpers after all existing users were already moved to
perf_buffer API.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

samples/bpf: switch trace_output sample to perf_buffer API

Convert trace_output sample to libbpf's perf_buffer API.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

samples/bpf: convert xdp_sample_pkts_user to perf_buffer API

Convert xdp_sample_pkts_user to libbpf's perf_buffer API.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: switch test_tcpnotify to perf_buffer API

Switch test_tcpnotify test to use libbpf's perf_buffer API instead of
re-implementing a portion of it.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: convert test_get_stack_raw_tp to perf_buffer API

Convert test_get_stack_raw_tp test to new perf_buffer API.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: provide more helpful message on uninitialized global var

When a BPF program defines an uninitialized global variable, it's put into
a special COMMON section. Libbpf will reject such programs, but will
provide a very unhelpful message with a garbage-looking section index.

This patch detects special section cases and gives a more explicit error
message.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tc-testing: added tdc tests for [b|p]fifo qdisc

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_sock: Use consistent types for UUIDs

The rest of Hyper-V code is using new types for UUID handling.
Convert hv_sock as well.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-Offload-MPLS-actions'

John Hurley says:

====================
nfp: Offload MPLS actions

The module act_mpls has recently been added to the kernel. This allows the
manipulation of MPLS headers on packets including push, pop and modify.
Add these new actions and parameters to the intermediate representation
API for hardware offload. Follow this by implementing the offload of these
MPLS actions in the NFP driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: offload MPLS set action

Recent additions to the kernel include a TC action module to manipulate
MPLS headers on packets. Such actions are available to offload via the
flow_offload intermediate representation API.

Modify the NFP driver to allow the offload of MPLS set actions to
firmware. Set actions update the outermost MPLS header. The offload
includes a mask to specify which fields should be set.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: offload MPLS pop action

Recent additions to the kernel include a TC action module to manipulate
MPLS headers on packets. Such actions are available to offload via the
flow_offload intermediate representation API.

Modify the NFP driver to allow the offload of MPLS pop actions to
firmware. The act_mpls TC module enforces that the next protocol is
supplied along with the pop action. Passing this to firmware allows it
to properly rebuild the underlying packet after the pop.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: offload MPLS push action

Recent additions to the kernel include a TC action module to manipulate
MPLS headers on packets. Such actions are available to offload via the
flow_offload intermediate representation API.

Modify the NFP driver to allow the offload of MPLS push actions to
firmware.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: include mpls actions in hardware intermediate representation

A recent addition to TC actions is the ability to manipulate the MPLS
headers on packets.

In preparation to offload such actions to hardware, update the IR code to
accept and prepare the new actions.

Note that no driver currently implements the MPLS dec_ttl action so this
is not included.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: xsk: dynamically allocate mlx5e_channel_param

The structure is too large to put on the stack, resulting in a
warning on 32-bit ARM:

drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c:59:5: error: stack frame size of 1344 bytes in function
'mlx5e_open_xsk' [-Werror,-Wframe-larger-than=]

Use kvzalloc() instead.
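
A minimal userspace sketch of the same pattern, with calloc() and free()
standing in for kvzalloc() and kvfree(). struct big_param and
open_channel() are hypothetical names with an arbitrary size, not mlx5
code:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical parameter block, too large to keep on a small stack. */
struct big_param {
	unsigned char blob[1024];	/* size chosen arbitrarily */
};

/* Returns 0 on success or a negative errno, as kernel code usually does. */
static int open_channel(void)
{
	/* calloc() plays the role of kvzalloc(): zeroed heap memory. */
	struct big_param *param = calloc(1, sizeof(*param));
	int err = 0;

	if (!param)
		return -ENOMEM;

	/* ... use param here; it starts out zero-initialised ... */
	if (param->blob[0] != 0)
		err = -EINVAL;

	free(param);	/* kvfree() in the kernel */
	return err;
}
```

Only a pointer lives in the stack frame, so the frame size no longer
depends on the size of the structure.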

Fixes: a038e9794541 ("net/mlx5e: Add XSK zero-copy support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: jme: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.
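
The equivalence can be sketched in userspace. The struct definitions
below are simplified mock-ups, not the kernel's, and drvdata_old() and
drvdata_new() are hypothetical names for the two call patterns:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified mock-ups of the kernel types, for illustration only. */
struct device {
	void *driver_data;
};

struct pci_dev {
	int devfn;
	struct device dev;	/* generic device embedded in the PCI device */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct pci_dev *to_pci_dev(struct device *d)
{
	return container_of(d, struct pci_dev, dev);
}

static void *dev_get_drvdata(const struct device *d)
{
	return d->driver_data;
}

static void *pci_get_drvdata(struct pci_dev *pdev)
{
	return dev_get_drvdata(&pdev->dev);
}

/* Old pattern: go up to the PCI device, then back down to its data. */
static void *drvdata_old(struct device *d)
{
	return pci_get_drvdata(to_pci_dev(d));
}

/* New pattern: read the driver data straight from the device. */
static void *drvdata_new(struct device *d)
{
	return dev_get_drvdata(d);
}
```

Both helpers return the same pointer for a device embedded in a
pci_dev, which is why the round trip can be dropped.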

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Use dev_get_drvdata where possible

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fm10k: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1000e: Use dev_get_drvdata where possible

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: broadcom: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atheros: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: 3com: 3c59x: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atm: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata,
use dev_get_drvdata to make code simpler.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ftgmac100: Fix build.

drivers/net/ethernet/faraday/ftgmac100.c:777:13: error: 'skb_frag_t {aka struct bio_vec}' has no member named 'size'

Fallout from the skb_frag_t conversion to bio_vec, simply
use skb_frag_size().

Fixes: b8b576a16f79 ("net: Rename skb_frag_t size to bv_len")
Reported-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: Move drivers/net/ethernet/qlogic/qlge/ to drivers/staging/qlge/

The hardware was declared EOL by the vendor more than 5 years ago.
What's more relevant to the Linux kernel is that the quality of this driver
is not on par with many other mainline drivers.

Cc: Manish Chopra <manishc@marvell.com>
Message-id: <20190617074858.32467-1-bpoirier@suse.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Convert-skb_frag_t-to-bio_vec'

Matthew Wilcox says:

====================
Convert skb_frag_t to bio_vec

The skb_frag_t and bio_vec are fundamentally the same (page, offset,
length) tuple.  This patch series unifies the two, leaving the
skb_frag_t typedef in place.  This has the immediate advantage that
we already have iov_iter support for bvecs and don't need to add
support for iterating skbuffs.  It enables a long-term plan to use
bvecs more broadly within the kernel and should make network-storage
drivers able to do less work converting between skbuffs and biovecs.

It will consume more memory on 32-bit kernels.  If that proves
problematic, we can look at ways of addressing it.

v3: Rebase on latest Linus with net-next merged.
  - Reorder the uncontroversial 'Use skb accessors' patches first so you
    can apply just those two if you want to hold off on the full
    conversion.
  - Convert all the users of 'struct skb_frag_struct' to skb_frag_t.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert skb_frag_t to bio_vec

There are a lot of users of frag->page_offset, so use a union
to avoid converting those users today.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Rename skb_frag_t size to bv_len

Improved compatibility with bvec

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Rename skb_frag page to bv_page

One step closer to turning the skb_frag_t into a bio_vec.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Reorder the contents of skb_frag_t

Match the layout of bio_vec.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Increase the size of skb_frag_t

To increase commonality between block and net, we are going to replace
the skb_frag_t with the bio_vec. This patch increases the size of
skb_frag_t on 32-bit machines from 8 bytes to 12 bytes. The size is
unchanged on 64-bit machines.
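
The arithmetic can be checked with a small model. Fixed-width integers
stand in for 'struct page *' so the sizes hold on any host. The 16-bit
offset and size fields in the old 32-bit layout are an assumption
inferred from the 8-byte total, and the struct names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed old layout on 32-bit: 4-byte pointer plus two 16-bit fields. */
struct frag_old_32 {
	uint32_t page;		/* stands in for a 32-bit struct page * */
	uint16_t page_offset;
	uint16_t size;
};

/* Widened layout on 32-bit: offset and size become 32-bit. */
struct frag_new_32 {
	uint32_t page;
	uint32_t page_offset;
	uint32_t size;
};

/* 64-bit layout: 8-byte pointer plus two 32-bit fields, before and after. */
struct frag_64 {
	uint64_t page;		/* stands in for a 64-bit struct page * */
	uint32_t page_offset;
	uint32_t size;
};

_Static_assert(sizeof(struct frag_old_32) == 8, "old 32-bit frag is 8 bytes");
_Static_assert(sizeof(struct frag_new_32) == 12, "new 32-bit frag is 12 bytes");
_Static_assert(sizeof(struct frag_64) == 16, "64-bit frag stays at 16 bytes");
```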

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Use skb accessors in network core

In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Use skb accessors in network drivers

In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: Merge cpu_to_le32s + memcpy to put_unaligned_le32

Merge the combo uses of cpu_to_le32s and memcpy.
Use put_unaligned_le32 instead.
This simplifies the code.
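
A userspace sketch of why the two forms are interchangeable. The
model_* helpers are hypothetical stand-ins written for illustration,
not the kernel implementations:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of cpu_to_le32(): byte-swap only on big-endian hosts. */
static uint32_t model_cpu_to_le32(uint32_t v)
{
	const uint16_t probe = 1;

	if (*(const uint8_t *)&probe)	/* little-endian host: no-op */
		return v;
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Old pattern: convert in a temporary, then memcpy() into the buffer. */
static void store_old(void *dst, uint32_t val)
{
	uint32_t tmp = model_cpu_to_le32(val);

	memcpy(dst, &tmp, sizeof(tmp));
}

/* New pattern: model of put_unaligned_le32(), one byte at a time. */
static void model_put_unaligned_le32(uint32_t val, void *dst)
{
	uint8_t *p = dst;

	p[0] = (uint8_t)(val & 0xff);
	p[1] = (uint8_t)((val >> 8) & 0xff);
	p[2] = (uint8_t)((val >> 16) & 0xff);
	p[3] = (uint8_t)((val >> 24) & 0xff);
}
```

Both functions write the same four bytes at any alignment, so a single
call can replace the two-step sequence.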

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: improve rtl_rx

This patch improves a few aspects of rtl_rx, no functional change intended.

1. inline rtl8169_try_rx_copy
2. make pkt_size unsigned
3. use constant ETH_FCS_LEN instead of value 4
4. We just created the skb, so we don't need the checks in skb_put.
Also we don't need the return value of skb_put.
Set skb->tail and skb->len directly.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ax88179_178a: Merge memcpy + le32_to_cpus to get_unaligned_le32

Merge the combo use of memcpy and le32_to_cpus.
Use get_unaligned_le32 instead.
This simplifies the code.
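
The load direction can be sketched the same way in userspace. The
model_* helpers below are hypothetical stand-ins, not the kernel
implementations:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of le32_to_cpu(): byte-swap only on big-endian hosts. */
static uint32_t model_le32_to_cpu(uint32_t v)
{
	const uint16_t probe = 1;

	if (*(const uint8_t *)&probe)	/* little-endian host: no-op */
		return v;
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Old pattern: memcpy() into a temporary, then convert it in place. */
static uint32_t load_old(const void *src)
{
	uint32_t tmp;

	memcpy(&tmp, src, sizeof(tmp));
	return model_le32_to_cpu(tmp);
}

/* New pattern: model of get_unaligned_le32(), assembled byte by byte. */
static uint32_t model_get_unaligned_le32(const void *src)
{
	const uint8_t *p = src;

	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```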

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

usbnet: smsc75xx: Merge memcpy + le32_to_cpus to get_unaligned_le32

Merge the combo use of memcpy and le32_to_cpus.
Use get_unaligned_le32 instead.
This simplifies the code.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan78xx: Merge memcpy + lexx_to_cpus to get_unaligned_lexx

Merge the combo use of memcpy and lexx_to_cpus.
Use get_unaligned_lexx instead.
This simplifies the code.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Acked-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-ipv6-ndisc: add support for RFC7710 RA Captive Portal Identifier

This is trivial since we already have support for the entirely
identical (from the kernel's point of view) RDNSS and DNSSL that
also contain opaque data that needs to be passed down to userspace.

As specified in RFC7710, Captive Portal option contains a URL.
8-bit identifier of the option type as assigned by the IANA is 37.
This option should also be treated as userland.

Hence, treat ND option 37 as userland (Captive Portal support)

See:
https://tools.ietf.org/html/rfc7710
https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml

Fixes: e35f30c131a56
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Remin Nguyen Van <reminv@google.com>
Cc: Alexey I. Froloff <raorn@raorn.name>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull preemption Kconfig fix from Thomas Gleixner:
"The PREEMPT_RT stub config renamed PREEMPT to PREEMPT_LL and defined
  PREEMPT outside of the menu and made it selectable by both PREEMPT_LL
  and PREEMPT_RT.

  Stupid me missed that 114 defconfigs select CONFIG_PREEMPT which
  obviously can't work anymore. oldconfig builds are affected as well,
  but it's more obvious as the user gets asked. [old]defconfig silently
  fixes it up and selects PREEMPT_NONE.

  Unbreak it by undoing the rename and adding an intermediate config
  symbol which is selected by both PREEMPT and PREEMPT_RT. That requires
  chasing down a few #ifdefs, but it's better than tweaking 114
  defconfigs and annoying users"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y

Merge tag 'for-linus-20190722' of git://git./linux/kernel/git/brauner/linux

Pull pidfd polling fix from Christian Brauner:
"A fix for pidfd polling. It ensures that the task's exit state is
visible to all waiters"

* tag 'for-linus-20190722' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
pidfd: fix a poll race when setting exit_state

Merge tag 'for-5.3-rc1-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fixes for leaks caused by recently merged patches

- one build fix

- a fix to prevent mixing of incompatible features

* tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't leak extent_map in btrfs_get_io_geometry()
  btrfs: free checksum hash on in close_ctree
  btrfs: Fix build error while LIBCRC32C is module
  btrfs: inode: Don't compress if NODATASUM or NODATACOW set

sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y

The merge of the CONFIG_PREEMPT_RT stub renamed CONFIG_PREEMPT to
CONFIG_PREEMPT_LL which causes all defconfigs which have CONFIG_PREEMPT=y
set to fall back to CONFIG_PREEMPT_NONE because CONFIG_PREEMPT depends on
the preemption mode choice which defaults to NONE. This also affects
oldconfig builds.

So rather than changing 114 defconfig files and being an annoyance to
users, revert the rename and select a new config symbol PREEMPTION. That
keeps everything working smoothly and the relevant ifdef's are going to be
fixed up step by step.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: a50a3f4b6a31 ("sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Merge tag 'media/v5.3-2' of git://git./linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
"For two regressions in media core:

   - v4l2-subdev: fix regression in check_pad()

   - videodev2.h: change V4L2_PIX_FMT_BGRA444 define: fourcc was already
     in use"

* tag 'media/v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: videodev2.h: change V4L2_PIX_FMT_BGRA444 define: fourcc was already in use
  media: v4l2-subdev: fix regression in check_pad()

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Several netfilter fixes including a nfnetlink deadlock fix from
    Florian Westphal and fix for dropping VRF packets from Miaohe Lin.

2) Flow offload fixes from Pablo Neira Ayuso including a fix to restore
    proper block sharing.

3) Fix r8169 PHY init from Thomas Voegtle.

4) Fix memory leak in mac80211, from Lorenzo Bianconi.

5) Missing NULL check on object allocation in cxgb4, from Navid
    Emamdoost.

6) Fix scaling of RX power in sfp phy driver, from Andrew Lunn.

7) Check that there is actually an ip header to access in skb->data in
    VRF, from Peter Kosyh.

8) Remove spurious rcu unlock in hv_netvsc, from Haiyang Zhang.

9) One more tweak to the TCP fragmentation memory limit changes, to be
    less harmful to applications setting small SO_SNDBUF values. From
    Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
  tcp: be more careful in tcp_fragment()
  hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
  vrf: make sure skb->data contains ip header to make routing
  connector: remove redundant input callback from cn_dev
  qed: Prefer pcie_capability_read_word()
  igc: Prefer pcie_capability_read_word()
  cxgb4: Prefer pcie_capability_read_word()
  be2net: Synchronize be_update_queues with dev_watchdog
  bnx2x: Prevent load reordering in tx completion processing
  net: phy: sfp: hwmon: Fix scaling of RX power
  net: sched: verify that q!=NULL before setting q->flags
  chelsio: Fix a typo in a function name
  allocate_flower_entry: should check for null deref
  net: hns3: typo in the name of a constant
  kbuild: add net/netfilter/nf_tables_offload.h to header-test blacklist.
  tipc: Fix a typo
  mac80211: don't warn about CW params when not using them
  mac80211: fix possible memory leak in ieee80211_assign_beacon
  nl80211: fix NL80211_HE_MAX_CAPABILITY_LEN
  nl80211: fix VENDOR_CMD_RAW_DATA
  ...

pidfd: fix a poll race when setting exit_state

There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                  pidfd_poll
                                     if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                   pidfd_poll
                                      if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.

To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.
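
The orderings above can be replayed in a small deterministic model. It
is a single-threaded sketch, not kernel code: enum step and
poller_sees_exit() are hypothetical names, and a wakeup is modelled as
the sleeping poller re-checking the state:

```c
#include <assert.h>
#include <stdbool.h>

enum step { SET_STATE, NOTIFY, POLL };

/*
 * Replay the steps in the given order. Returns true if the poller ends
 * up seeing the exit, false if it is left sleeping forever.
 */
static bool poller_sees_exit(const enum step *order, int n)
{
	bool exit_state = false;	/* models tsk->exit_state != 0 */
	bool waiting = false;		/* poller asleep on the wait queue */
	bool ready = false;

	for (int i = 0; i < n; i++) {
		switch (order[i]) {
		case SET_STATE:
			exit_state = true;
			break;
		case NOTIFY:
			/* A woken poller re-checks and sleeps again if 0. */
			if (waiting && exit_state) {
				waiting = false;
				ready = true;
			}
			break;
		case POLL:
			if (exit_state)
				ready = true;
			else
				waiting = true;
			break;
		}
	}
	return ready;
}
```

With the state set before the notification, every position of POLL
relative to the other two steps ends with the poller ready. With the
notification first, a POLL that lands between the two steps is never
woken.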

Fixes: b53b0b9d9a6 ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>

tcp: be more careful in tcp_fragment()

Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15
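
A simplified sketch of the resulting rule. This is a toy model and does
not reproduce the kernel's limit arithmetic: struct rtx_model,
split_refused() and the 64KB allowance constant are assumptions made
for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed allowance: "about one full TSO skb (64KB size)". */
#define MODEL_TSO_ALLOWANCE (64u * 1024u)

/* Toy retransmit queue: skbs are identified by index, head to tail. */
struct rtx_model {
	unsigned int wmem_queued;	/* bytes currently queued */
	unsigned int sndbuf;		/* limit derived from SO_SNDBUF */
	int head;			/* index of the first skb */
	int tail;			/* index of the last skb */
};

/* Returns true if splitting the skb at idx is refused. */
static bool split_refused(const struct rtx_model *q, int idx)
{
	bool over_limit = q->wmem_queued > q->sndbuf + MODEL_TSO_ALLOWANCE;

	/* The first and last skb may always be split, so that
	 * retransmits can make progress even over the limit. */
	return over_limit && idx != q->head && idx != q->tail;
}
```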

Note for < 4.15 backports :
tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()

There is an extra rcu_read_unlock left in netvsc_recv_callback(),
after a previous patch that removes RCU from this function.
This patch removes the extra RCU unlock.

Fixes: 345ac08990b8 ("hv_netvsc: pass netvsc_device to receive callback")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Linux 5.3-rc1

vrf: make sure skb->data contains ip header to make routing

vrf_process_v4_outbound() and vrf_process_v6_outbound() do routing
using ip/ipv6 addresses, but don't make sure the header is available
in skb->data[] (skb_headlen() is less than header size).

Case:

1) igb driver from Intel.
2) Packet size is greater than 255.
3) MPLS forwards to VRF device.

So, patch adds pskb_may_pull() calls in vrf_process_v4/v6_outbound()
functions.

Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

connector: remove redundant input callback from cn_dev

A small cleanup: this callback is never used.
Originally fixed by Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
for OpenVZ7 bug OVZ-6877

cc: stanislav.kinsburskiy@gmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Prefer pcie_capability_read_word()

Commit 8c0d3a02c130 ("PCI: Add accessors for PCI Express Capability")
added accessors for the PCI Express Capability so that drivers didn't
need to be aware of differences between v1 and v2 of the PCI
Express Capability.

Replace pci_read_config_word() and pci_write_config_word() calls with
pcie_capability_read_word() and pcie_capability_write_word().

Signed-off-by: Frederick Lawler <fred@fredlawl.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igc: Prefer pcie_capability_read_word()

Commit 8c0d3a02c130 ("PCI: Add accessors for PCI Express Capability")
added accessors for the PCI Express Capability so that drivers didn't
need to be aware of differences between v1 and v2 of the PCI
Express Capability.

Replace pci_read_config_word() and pci_write_config_word() calls with
pcie_capability_read_word() and pcie_capability_write_word().

Signed-off-by: Frederick Lawler <fred@fredlawl.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Prefer pcie_capability_read_word()

Commit 8c0d3a02c130 ("PCI: Add accessors for PCI Express Capability")
added accessors for the PCI Express Capability so that drivers didn't
need to be aware of differences between v1 and v2 of the PCI
Express Capability.

Replace pci_read_config_word() and pci_write_config_word() calls with
pcie_capability_read_word() and pcie_capability_write_word().

Signed-off-by: Frederick Lawler <fred@fredlawl.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Synchronize be_update_queues with dev_watchdog

As pointed out by Firo Yang, a netdev tx timeout may trigger just before an
ethtool set_channels operation is started. be_tx_timeout(), which dumps
some queue structures, is not written to run concurrently with
be_update_queues(), which frees/allocates those queue structures. Add some
synchronization between the two.

Message-id: <CH2PR18MB31898E033896F9760D36BFF288C90@CH2PR18MB3189.namprd18.prod.outlook.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>