git.kernel.dk Git - linux-2.6-block.git/log

net/mlx5: Consider encapsulation properties when comparing destinations

Currently the driver identifies identical vport destinations by
comparing the vport ID. The FW extended destination feature enables the
driver to forward the packet to the same vport with multiple
encapsulation properties.

Change the vport destination comparison logic to compare
the encapsulation properties in addition to the vport ID.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

bpf: fix up uapi helper description and sync bpf header with tools

Minor markup fixup from bpf-next into net-next merge in the BPF helper
description of bpf_sk_lookup_tcp() and bpf_sk_lookup_udp(). Also sync
up the copy of bpf.h from tooling infrastructure.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-11

The following pull-request contains BPF updates for your *net-next* tree.

It has three minor merge conflicts, resolutions:

1) tools/testing/selftests/bpf/test_verifier.c

Take first chunk with alignment_prevented_execution.

2) net/core/filter.c

  [...]
  case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
  case bpf_ctx_range(struct __sk_buff, wire_len):
        return false;
  [...]

3) include/uapi/linux/bpf.h

  Take the second chunk for the two cases each.

The main changes are:

1) Add support for BPF line info via BTF and extend libbpf as well
   as bpftool's program dump to annotate output with BPF C code to
   facilitate debugging and introspection, from Martin.

2) Add support for BPF_ALU | BPF_ARSH | BPF_{K,X} in interpreter
   and all JIT backends, from Jiong.

3) Improve BPF test coverage on archs with no efficient unaligned
   access by adding an "any alignment" flag to the BPF program load
   to forcefully disable verifier alignment checks, from David.

4) Add a new bpf_prog_test_run_xattr() API to libbpf which allows for
   proper use of BPF_PROG_TEST_RUN with data_out, from Lorenz.

5) Extend tc BPF programs to use a new __sk_buff field called wire_len
   for more accurate accounting of packets going to wire, from Petar.

6) Improve bpftool to allow dumping the trace pipe from it and add
   several improvements in bash completion and map/prog dump,
   from Quentin.

7) Optimize arm64 BPF JIT to always emit movn/movk/movk sequence for
   kernel addresses and add a dedicated BPF JIT backend allocator,
   from Ard.

8) Add a BPF helper function for IR remotes to report mouse movements,
   from Sean.

9) Various cleanups in BPF prog dump e.g. to make UAPI bpf_prog_info
   member naming consistent with existing conventions, from Yonghong
   and Song.

10) Misc cleanups and improvements in allowing to pass interface name
    via cmdline for xdp1 BPF example, from Matteo.

11) Fix a potential segfault in BPF sample loader's kprobes handling,
    from Daniel T.

12) Fix SPDX license in libbpf's README.rst, from Andrey.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

neighbor: gc_list changes should be protected by table lock

Adding and removing neighbor entries to / from the gc_list need to be
done while holding the table lock; a couple of places were missed in the
original patch.

Move the list_add_tail in neigh_alloc to ___neigh_create where the lock
is already obtained. Since neighbor entries should rarely be moved
to/from PERMANENT state, add lock/unlock around the gc_list changes in
neigh_change_state rather than extending the lock hold around all
neighbor updates.

Fixes: 58956317c8de ("neighbor: Improve garbage collection")
Reported-by: Andrei Vagin <avagin@gmail.com>
Reported-by: syzbot+6cc2fd1d3bdd2e007363@syzkaller.appspotmail.com
Reported-by: syzbot+35e87b87c00f386b041f@syzkaller.appspotmail.com
Reported-by: syzbot+b354d1fb59091ea73c37@syzkaller.appspotmail.com
Reported-by: syzbot+3ddead5619658537909b@syzkaller.appspotmail.com
Reported-by: syzbot+424d47d5c456ce8b2bbe@syzkaller.appspotmail.com
Reported-by: syzbot+e4d42eb35f6a27b0a628@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5e-updates-2018-12-10' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed:

====================
mlx5e-updates-2018-12-10 (gre)

This patch set adds GRE offloading support to Mellanox ethernet driver.

Patches 1-5 replace the existing egdev mechanism with the new TC indirect
block binds mechanism that was introduced by Netronome:
7f76fa36754b ("net: sched: register callbacks for indirect tc block binds")

Patches 6-9 add GRE offloading support along with some required
refactoring work.

Patch 10, Add netif_is_gretap()/netif_is_ip6gretap()
- Changed the is_gretap_dev and is_ip6gretap_dev logic from structure
comparison to string comparison of the rtnl_link_ops kind field.

Patch 11, add GRE offloading support to mlx5.

Patch 12 removes the egdev mechanism from TC as it is no longer used by
any of the drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Remove egdev mechanism

The egdev mechanism was replaced by the TC indirect block notifications
platform.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Add GRE protocol offloading

Add HW offloading support for TC flower filters configured on
gretap/ip6gretap net devices.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net: Add netif_is_gretap()/netif_is_ip6gretap()

Changed the is_gretap_dev and is_ip6gretap_dev logic from structure
comparison to string comparison of the rtnl_link_ops kind field.

This approach aligns with the current identification methods and function
names of vxlan and geneve network devices.

Convert mlxsw to use these helpers and use them in downstream mlx5 patch.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Move TC tunnel offloading code to separate source file

Move tunnel offloading related code to a separate source file for better
code maintainability.

Code refactoring with no functional change.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Branch according to classified tunnel type

Currently the tunnel offloading encap/decap methods assumes that VXLAN
is the sole tunneling protocol. Lay the infrastructure for supporting
multiple tunneling protocols by branching according to the tunnel
net device kind.

Encap filters tunnel type is determined according to the egress/mirred
net device. Decap filters classify the tunnel type according to the
filter's ingress net device kind.

Distinguish between the tunnel type as defined by the SW model and
the FW reformat type that specifies the HW operation being made.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Refactor VXLAN tunnel decap offloading code

Separates the vxlan header match handling from the matching on the
general fields of ipv4/6 tunnels, thus allowing the common IP tunnel
match code to branch in down stream patch, to multiple IP tunnels.

This patch doesn't add any functionality.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Refactor VXLAN tunnel encap offloading code

Separates the vxlan header encap logic from the general ipv4/6
encapsulation methods, thus allowing the common IP encap/decap code to
branch in downstream patch to multiple IP tunnels.

Code refactoring with no functional change.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Replace egdev with indirect block notifications

Use TC indirect block notifications to offload filters that
are configured on higher level device interfaces (e.g. tunnel
devices). This mechanism replaces the current egdev implementation.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Propagate the filter's net device to mlx5e structures

Propagate the filter's net_device parameter to the tc flower parsed
attributes structure so that it can later be used in tunnel decap
offloading sequences.

Pre-step for replacing egdev logic with the indirect block
notification mechanism.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Provide the TC filter netdev as parameter to flower callbacks

Currently the driver controls flower filters that are installed on its
devices. However, with the introduction of the indirect block
notifications platform the driver may receive control events for filters
that are installed on higher level net devices (e.g. tunnel devices).
Therefore, the driver filter control API will not be able to implicitly
assume the filter's net device.

Explicitly specify the filter's net device, no functional change

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Support TC indirect block notifications for eswitch uplink reprs

Towards using this mechanism as the means to offload tunnel decap rules
set on SW tunnel devices instead of egdev, add the supporting structures
and functions.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Store eswitch uplink representor state on a dedicated struct

Currently only a single field in the representor private structure
is relevant for uplink representors. As a pre-step to allow adding
additional uplink representor fields, introduce uplink representor
private structure.

This is prepration step towards replacing egdev logic with the
indirect block notification mechanism. This patch doesn't change
any functionality.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux

mlx5-next shared branch with rdma subtree to avoid mlx5 rdma v.s. netdev
conflicts.

Highlights:

1) RDMA ODP (On Demand Paging) improvements and moving ODP logic to
mlx5 RDMA driver
2) Improved mlx5 core driver and device events handling and provided API
for upper layers to subscribe to device events.
3) RDMA only code cleanup from mlx5 core
4) Add helper to get CQE opcode
5) Rework handling of port module events
6) shared mlx5_ifc.h updates to avoid conflicts

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Merge branch 'rename-info_cnt-to-nr_info'

Yonghong Song says:

====================
Before func_info and line_info are added to the kernel, there are several
fields in structure bpf_prog_info specifying the "count" of a user buffer, e.g.,
        __u32 nr_jited_ksyms;
        __u32 nr_jited_func_lens;
The naming convention has the prefix "nr_".

The func_info and line_info support added several fields
        __u32 func_info_cnt;
        __u32 line_info_cnt;
        __u32 jited_line_info_cnt;
to indicate the "count" of buffers func_info, line_info and jited_line_info.
The original intention is to keep the field names the same as those in
structure bpf_attr, so it will be clear that the "count" returned to user
space will be the same as the one passed to the kernel during prog load.

Unfortunately, the field names *_info_cnt are not consistent with
other existing fields in bpf_prog_info.
This patch set renamed the fields *_info_cnt to nr_*_info
to keep naming convention consistent.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/bpf: rename *_info_cnt to nr_*_info

Rename all occurances of *_info_cnt field access
to nr_*_info in tools directory.

The local variables finfo_cnt, linfo_cnt and jited_linfo_cnt
in function do_dump() of tools/bpf/bpftool/prog.c are also
changed to nr_finfo, nr_linfo and nr_jited_linfo to
keep naming convention consistent.

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/bpf: sync kernel uapi bpf.h to tools directory

Sync kernel uapi bpf.h "*_info_cnt => nr_*_info"
changes to tools directory.

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: rename *_info_cnt to nr_*_info in bpf_prog_info

In uapi bpf.h, currently we have the following fields in
the struct bpf_prog_info:
__u32 func_info_cnt;
__u32 line_info_cnt;
__u32 jited_line_info_cnt;
The above field names "func_info_cnt" and "line_info_cnt"
also appear in union bpf_attr for program loading.

The original intention is to keep the names the same
between bpf_prog_info and bpf_attr
so it will imply what we returned to user space will be
the same as what the user space passed to the kernel.

Such a naming convention in bpf_prog_info is not consistent
with other fields like:
__u32 nr_jited_ksyms;
__u32 nr_jited_func_lens;

This patch made this adjustment so in bpf_prog_info
newly introduced *_info_cnt becomes nr_*_info.

Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: clean up bpf_prog_get_info_by_fd()

info.nr_jited_ksyms and info.nr_jited_func_lens cannot be 0 in these two
statements, so we don't need to check them.

Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net/mlx5: Remove the get protocol device interface entry

This isn't used anywhere across the mlx5 driver stack,
remove it.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Support extended destination format in flow steering command

Update the flow steering command formatting according to the extended
destination API.
Note that the FW dictates that multi destination FTEs that involve at
least one encap must use the extended destination format, while single
destination ones must use the legacy format.
Using extended destination format requires FW support. Check for its
capabilities and return error if not supported.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: E-Switch, Change vhca id valid bool field to bit flag

Change the driver flow destination struct to use bit flags with the vhca
id valid being the 1st one. The flags field is more extendable and will
be used in downstream patch.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Introduce extended destination fields

Extended destinations provide the ability to configure different
encapsulation properties per destination on a single FTE. This is
needed for use-cases such as remote mirroring over tunneled networks.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Revise gre and nvgre key formats

GRE RFC defines a 32 bit key field. NVGRE RFC splits the 32 bit
key field to 24 bit VSID (gre_key_h) and 8 bit flow entropy (gre_key_l).

Define the two key parsing alternatives in a union, thus enabling both
access methods.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Add monitor commands layout and event data

Will be used in downstream patch to monitor counter changes
by the HCA and report it to the driver by an event.
The driver will update its counters cached data accordingly.

Signed-off-by: Eyal Davidovich <eyald@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Add support for plugged-disabled cable status in PME

Support a new hardware module status in port module events:
- module_status=0x4 (Cable plugged, but disabled)

Signed-off-by: Mikhael Goikhman <migo@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Add support for PCIe power slot exceeded error in PME

Support a new hardware error type in port module events:
- error_type=0xc (PCIe system power slot exceeded)

Signed-off-by: Mikhael Goikhman <migo@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Rework handling of port module events

Add explicit HW defined error values. For simplicity, keep counters for all
statuses starting from 0, although currently status=0 is not used.

Additionally, when HW signals an unexpected cable status, it is reported
now rather than ignored. And status counter is now updated on errors.

Signed-off-by: Mikhael Goikhman <migo@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

bpf: bpftool: Fix newline and p_err issue

This patch fixes a few newline issues and also
replaces p_err with p_info in prog.c

Fixes: b053b439b72a ("bpf: libbpf: bpftool: Print bpf_line_info during prog dump")
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tcp: handle EOR and FIN conditions the same in tcp_tso_should_defer()

In commit f9bfe4e6a9d0 ("tcp: lack of available data can also cause
TSO defer") we moved the test in tcp_tso_should_defer() for packets
with a FIN flag, and we mentioned that the same would be done
later for EOR flag.

Both flags should be handled at the same time, after all other
heuristics have been considered. They both mean that no more bytes
can be added to this skb by an application.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-ksz-Add-reset-GPIO-handling'

Marek Vasut says:

====================
net: dsa: ksz: Add reset GPIO handling

Add code to handle optional reset GPIO in the KSZ switch driver and a
matching DT property adjustments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: ksz: Add reset GPIO handling

Add code to handle optional reset GPIO in the KSZ switch driver. The switch
has a reset GPIO line which can be controlled by the CPU, so make sure it is
configured correctly in such setups.

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Cc: Woojung Huh <woojung.huh@microchip.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Tristram Ha <Tristram.Ha@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: ksz: Add optional reset GPIO to Microchip KSZ switch binding

Add optional reset GPIO, as such a signal is available on the KSZ switches.

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Woojung Huh <Woojung.Huh@microchip.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: convert to DEFINE_SHOW_ATTRIBUTE

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fjes: convert to DEFINE_SHOW_ATTRIBUTE

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: xenbus: convert to DEFINE_SHOW_ATTRIBUTE

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ieee802154: at86rf230: convert to DEFINE_SHOW_ATTRIBUTE

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipvlan: Remove a useless comparison

Fix following gcc warning:

drivers/net/ipvlan/ipvlan_main.c:543:12: warning:
comparison is always false due to limited range of data type [-Wtype-limits]

'mode' is a u16 variable, IPVLAN_MODE_L2 is zero,
the comparison is always false

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix spelling mistake "offser" -> "offset"

There is a spelling mistake in a msg string, fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: relax verifier restriction on BPF_MOV | BPF_ALU

Currently, the destination register is marked as unknown for 32-bit
sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
SCALAR_VALUE.

This is too conservative that some valid cases will be rejected.
Especially, this may turn a constant scalar value into unknown value that
could break some assumptions of verifier.

For example, test_l4lb_noinline.c has the following C code:

    struct real_definition *dst

1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
2:    return TC_ACT_SHOT;
3:
4:  if (dst->flags & F_IPV6) {

get_packet_dst is responsible for initializing "dst" into valid pointer and
return true (1), otherwise return false (0). The compiled instruction
sequence using alu32 will be:

  412: (54) (u32) r7 &= (u32) 1
  413: (bc) (u32) r0 = (u32) r7
  414: (95) exit

insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even
r7 contains SCALAR_VALUE 1.

This causes trouble when verifier is walking the code path that hasn't
initialized "dst" inside get_packet_dst, for which case 0 is returned and
we would then expect verifier concluding line 1 in the above C code pass
the "if" check, therefore would skip fall through path starting at line 4.
Now, because r0 returned from callee has became unknown value, so verifier
won't skip analyzing path starting at line 4 and "dst->flags" requires
dereferencing the pointer "dst" which actually hasn't be initialized for
this path.

This patch relaxed the code marking sub-register move destination. For a
SCALAR_VALUE, it is safe to just copy the value from source then truncate
it into 32-bit.

A unit test also included to demonstrate this issue. This test will fail
before this patch.

This relaxation could let verifier skipping more paths for conditional
comparison against immediate. It also let verifier recording a more
accurate/strict value for one register at one state, if this state end up
with going through exit without rejection and it is used for state
comparison later, then it is possible an inaccurate/permissive value is
better. So the real impact on verifier processed insn number is complex.
But in all, without this fix, valid program could be rejected.

>From real benchmarking on kernel selftests and Cilium bpf tests, there is
no impact on processed instruction number when tests ares compiled with
default compilation options. There is slightly improvements when they are
compiled with -mattr=+alu32 after this patch.

Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
rejected before this fix.

Insn processed before/after this patch:

                        default     -mattr=+alu32

Kernel selftest

===
test_xdp.o              371/371      369/369
test_l4lb.o             6345/6345    5623/5623
test_xdp_noinline.o     2971/2971    rejected/2727
test_tcp_estates.o      429/429      430/430

Cilium bpf
===
bpf_lb-DLB_L3.o:        2085/2085     1685/1687
bpf_lb-DLB_L4.o:        2287/2287     1986/1982
bpf_lb-DUNKNOWN.o:      690/690       622/622
bpf_lxc.o:              95033/95033   N/A
bpf_netdev.o:           7245/7245     N/A
bpf_overlay.o:          2898/2898     3085/2947

NOTE:
  - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
    verifier due to another issue inside verifier on supporting alu32
    binary.
  - Each cilium bpf program could generate several processed insn number,
    above number is sum of them.

v1->v2:
- Restrict the change on SCALAR_VALUE.
- Update benchmark numbers on Cilium bpf tests.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge git://git./linux/kernel/git/davem/net

Several conflicts, seemingly all over the place.

I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.

The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.

The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.

cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.

__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy. Or at least I think it was :-)

Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.

The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Move flow counters data structures from flow steering header

After the following flow counters API refactoring:
("net/mlx5: Use flow counter IDs and not the wrapping cache object")
flow counters private data structures mlx5_fc_cache and mlx5_fc are
redundantly exposed in fs_core.h, they have nothing to do with flow
steering core and they are private to fs_counter.c, this patch moves them
to where they belong and reduces their exposure in the driver.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

IB/mlx5: Use helper to get CQE opcode

Use the new helper that extracts the opcode
from a CQE (completion queue entry) structure.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>

net/mlx5: Use helper to get CQE opcode

Introduce and use a helper that extracts the opcode
from a CQE (completion queue entry) structure.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: When fetching CQEs return CQE instead of void pointer

The function is only used to retrieve CQEs, use the proper type as the
return value.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Linux 4.20-rc6

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"A decent batch of fixes here. I'd say about half are for problems that
  have existed for a while, and half are for new regressions added in
  the 4.20 merge window.

   1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.

   2) Revert bogus emac driver change, from Benjamin Herrenschmidt.

   3) Handle BPF exported data structure with pointers when building
      32-bit userland, from Daniel Borkmann.

   4) Memory leak fix in act_police, from Davide Caratti.

   5) Check RX checksum offload in RX descriptors properly in aquantia
      driver, from Dmitry Bogdanov.

   6) SKB unlink fix in various spots, from Edward Cree.

   7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
      Eric Dumazet.

   8) Fix FID leak in mlxsw driver, from Ido Schimmel.

   9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.

  10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
      limits otherwise namespace exit can hang. From Jiri Wiesner.

  11) Address block parsing length fixes in x25 from Martin Schiller.

  12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.

  13) For tun interfaces, only iface delete works with rtnl ops, enforce
      this by disallowing add. From Nicolas Dichtel.

  14) Use after free in liquidio, from Pan Bian.

  15) Fix SKB use after passing to netif_receive_skb(), from Prashant
      Bhole.

  16) Static key accounting and other fixes in XPS from Sabrina Dubroca.

  17) Partially initialized flow key passed to ip6_route_output(), from
      Shmulik Ladkani.

  18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
      Falcon.

  19) Several small TCP fixes (off-by-one on window probe abort, NULL
      deref in tail loss probe, SNMP mis-estimations) from Yuchung
      Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
  net/sched: cls_flower: Reject duplicated rules also under skip_sw
  bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
  bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
  bnxt_en: Keep track of reserved IRQs.
  bnxt_en: Fix CNP CoS queue regression.
  net/mlx4_core: Correctly set PFC param if global pause is turned off.
  Revert "net/ibm/emac: wrong bit is used for STA control"
  neighbour: Avoid writing before skb->head in neigh_hh_output()
  ipv6: Check available headroom in ip6_xmit() even without options
  tcp: lack of available data can also cause TSO defer
  ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
  mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
  mlxsw: spectrum_router: Relax GRE decap matching check
  mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
  mlxsw: spectrum_nve: Remove easily triggerable warnings
  ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
  sctp: frag_point sanity check
  tcp: fix NULL ref in tail loss probe
  tcp: Do not underestimate rwnd_limited
  net: use skb_list_del_init() to remove from RX sublists
  ...

Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"Three fixes: a boot parameter re-(re-)fix, a retpoline build artifact
  fix and an LLVM workaround"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Drop implicit common-page-size linker flag
  x86/build: Fix compiler support check for CONFIG_RETPOLINE
  x86/boot: Clear RSDP address in boot_params for broken loaders

media: bpf: add bpf function to report mouse movement

Some IR remotes have a directional pad or other pointer-like thing that
can be used as a mouse. Make it possible to decode these types of IR
protocols in BPF.

Cc: netdev@vger.kernel.org
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull kprobes fixes from Ingo Molnar:
"Two kprobes fixes: a blacklist fix and an instruction patching related
  corruption fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes/x86: Blacklist non-attachable interrupt functions
  kprobes/x86: Fix instruction patching corruption when copying more than one RIP-relative instruction

Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull EFI fixes from Ingo Molnar:
"Two fixes: a large-system fix and an earlyprintk fix with certain
  resolutions"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/earlyprintk/efi: Fix infinite loop on some screen widths
  x86/efi: Allocate e820 buffer before calling efi_exit_boot_service

Merge branch 'bpf_line_info'

Martin Lau says:

====================
This patch series introduces the bpf_line_info.  Please see individual patch
for details.

It will be useful for introspection purpose, like:

[root@arch-fb-vm1 bpf]# ~/devshare/fb-kernel/linux/tools/bpf/bpftool/bpftool prog dump jited pinned /sys/fs/bpf/test_btf_haskv
[...]
int test_long_fname_2(struct dummy_tracepoint_args * arg):
bpf_prog_44a040bf25481309_test_long_fname_2:
; static int test_long_fname_2(struct dummy_tracepoint_args *arg)
   0:   push   %rbp
   1:   mov    %rsp,%rbp
   4:   sub    $0x30,%rsp
   b:   sub    $0x28,%rbp
   f:   mov    %rbx,0x0(%rbp)
  13:   mov    %r13,0x8(%rbp)
  17:   mov    %r14,0x10(%rbp)
  1b:   mov    %r15,0x18(%rbp)
  1f:   xor    %eax,%eax
  21:   mov    %rax,0x20(%rbp)
  25:   xor    %esi,%esi
; int key = 0;
  27:   mov    %esi,-0x4(%rbp)
; if (!arg->sock)
  2a:   mov    0x8(%rdi),%rdi
; if (!arg->sock)
  2e:   cmp    $0x0,%rdi
  32:   je     0x0000000000000070
  34:   mov    %rbp,%rsi
; counts = bpf_map_lookup_elem(&btf_map, &key);
  37:   add    $0xfffffffffffffffc,%rsi
  3b:   movabs $0xffff8881139d7480,%rdi
  45:   add    $0x110,%rdi
  4c:   mov    0x0(%rsi),%eax
  4f:   cmp    $0x4,%rax
  53:   jae    0x000000000000005e
  55:   shl    $0x3,%rax
  59:   add    %rdi,%rax
  5c:   jmp    0x0000000000000060
  5e:   xor    %eax,%eax
; if (!counts)
  60:   cmp    $0x0,%rax
  64:   je     0x0000000000000070
; counts->v6++;
  66:   mov    0x4(%rax),%edi
  69:   add    $0x1,%rdi
  6d:   mov    %edi,0x4(%rax)
  70:   mov    0x0(%rbp),%rbx
  74:   mov    0x8(%rbp),%r13
  78:   mov    0x10(%rbp),%r14
  7c:   mov    0x18(%rbp),%r15
  80:   add    $0x28,%rbp
  84:   leaveq
  85:   retq
[...]
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: libbpf: bpftool: Print bpf_line_info during prog dump

This patch adds print bpf_line_info function in 'prog dump jitted'
and 'prog dump xlated':

[root@arch-fb-vm1 bpf]# ~/devshare/fb-kernel/linux/tools/bpf/bpftool/bpftool prog dump jited pinned /sys/fs/bpf/test_btf_haskv
[...]
int test_long_fname_2(struct dummy_tracepoint_args * arg):
bpf_prog_44a040bf25481309_test_long_fname_2:
; static int test_long_fname_2(struct dummy_tracepoint_args *arg)
   0: push   %rbp
   1: mov    %rsp,%rbp
   4: sub    $0x30,%rsp
   b: sub    $0x28,%rbp
   f: mov    %rbx,0x0(%rbp)
  13: mov    %r13,0x8(%rbp)
  17: mov    %r14,0x10(%rbp)
  1b: mov    %r15,0x18(%rbp)
  1f: xor    %eax,%eax
  21: mov    %rax,0x20(%rbp)
  25: xor    %esi,%esi
; int key = 0;
  27: mov    %esi,-0x4(%rbp)
; if (!arg->sock)
  2a: mov    0x8(%rdi),%rdi
; if (!arg->sock)
  2e: cmp    $0x0,%rdi
  32: je     0x0000000000000070
  34: mov    %rbp,%rsi
; counts = bpf_map_lookup_elem(&btf_map, &key);
  37: add    $0xfffffffffffffffc,%rsi
  3b: movabs $0xffff8881139d7480,%rdi
  45: add    $0x110,%rdi
  4c: mov    0x0(%rsi),%eax
  4f: cmp    $0x4,%rax
  53: jae    0x000000000000005e
  55: shl    $0x3,%rax
  59: add    %rdi,%rax
  5c: jmp    0x0000000000000060
  5e: xor    %eax,%eax
; if (!counts)
  60: cmp    $0x0,%rax
  64: je     0x0000000000000070
; counts->v6++;
  66: mov    0x4(%rax),%edi
  69: add    $0x1,%rdi
  6d: mov    %edi,0x4(%rax)
  70: mov    0x0(%rbp),%rbx
  74: mov    0x8(%rbp),%r13
  78: mov    0x10(%rbp),%r14
  7c: mov    0x18(%rbp),%r15
  80: add    $0x28,%rbp
  84: leaveq
  85: retq
[...]

With linum:
[root@arch-fb-vm1 bpf]# ~/devshare/fb-kernel/linux/tools/bpf/bpftool/bpftool prog dump jited pinned /sys/fs/bpf/test_btf_haskv linum
int _dummy_tracepoint(struct dummy_tracepoint_args * arg):
bpf_prog_b07ccb89267cf242__dummy_tracepoint:
; return test_long_fname_1(arg); [file:/data/users/kafai/fb-kernel/linux/tools/testing/selftests/bpf/test_btf_haskv.c line_num:54 line_col:9]
   0: push   %rbp
   1: mov    %rsp,%rbp
   4: sub    $0x28,%rsp
   b: sub    $0x28,%rbp
   f: mov    %rbx,0x0(%rbp)
  13: mov    %r13,0x8(%rbp)
  17: mov    %r14,0x10(%rbp)
  1b: mov    %r15,0x18(%rbp)
  1f: xor    %eax,%eax
  21: mov    %rax,0x20(%rbp)
  25: callq  0x000000000000851e
; return test_long_fname_1(arg); [file:/data/users/kafai/fb-kernel/linux/tools/testing/selftests/bpf/test_btf_haskv.c line_num:54 line_col:2]
  2a: xor    %eax,%eax
  2c: mov    0x0(%rbp),%rbx
  30: mov    0x8(%rbp),%r13
  34: mov    0x10(%rbp),%r14
  38: mov    0x18(%rbp),%r15
  3c: add    $0x28,%rbp
  40: leaveq
  41: retq
[...]

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: libbpf: Add btf_line_info support to libbpf

This patch adds bpf_line_info support to libbpf:
1) Parsing the line_info sec from ".BTF.ext"
2) Relocating the line_info.  If the main prog *_info relocation
   fails, it will ignore the remaining subprog line_info and continue.
   If the subprog *_info relocation fails, it will bail out.
3) BPF_PROG_LOAD a prog with line_info

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: libbpf: Refactor and bug fix on the bpf_func_info loading logic

This patch refactor and fix a bug in the libbpf's bpf_func_info loading
logic.  The bug fix and refactoring are targeting the same
commit 2993e0515bb4 ("tools/bpf: add support to read .BTF.ext sections")
which is in the bpf-next branch.

1) In bpf_load_program_xattr(), it should retry when errno == E2BIG
   regardless of log_buf and log_buf_sz.  This patch fixes it.

2) btf_ext__reloc_init() and btf_ext__reloc() are essentially
   the same except btf_ext__reloc_init() always has insns_cnt == 0.
   Hence, btf_ext__reloc_init() is removed.

   btf_ext__reloc() is also renamed to btf_ext__reloc_func_info()
   to get ready for the line_info support in the next patch.

3) Consolidate func_info section logic from "btf_ext_parse_hdr()",
   "btf_ext_validate_func_info()" and "btf_ext__new()" to
   a new function "btf_ext_copy_func_info()" such that similar
   logic can be reused by the later libbpf's line_info patch.

4) The next line_info patch will store line_info_cnt instead of
   line_info_len in the bpf_program because the kernel is taking
   line_info_cnt also.  It will save a few "len" to "cnt" conversions
   and will also save some function args.

   Hence, this patch also makes bpf_program to store func_info_cnt
   instead of func_info_len.

5) btf_ext depends on btf.  e.g. the func_info's type_id
   in ".BTF.ext" is not useful when ".BTF" is absent.
   This patch only init the obj->btf_ext pointer after
   it has successfully init the obj->btf pointer.

   This can avoid always checking "obj->btf && obj->btf_ext"
   together for accessing ".BTF.ext".  Checking "obj->btf_ext"
   alone will do.

6) Move "struct btf_sec_func_info" from btf.h to btf.c.
   There is no external usage outside btf.c.

Fixes: 2993e0515bb4 ("tools/bpf: add support to read .BTF.ext sections")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add unit tests for bpf_line_info

Add unit tests for bpf_line_info for both BPF_PROG_LOAD and
BPF_OBJ_GET_INFO_BY_FD.

jit enabled:
[root@arch-fb-vm1 bpf]# ./test_btf -k 0
BTF prog info raw test[5] (line_info (No subprog)): OK
BTF prog info raw test[6] (line_info (No subprog. insn_off >= prog->len)): OK
BTF prog info raw test[7] (line_info (No subprog. zero tailing line_info): OK
BTF prog info raw test[8] (line_info (No subprog. nonzero tailing line_info)): OK
BTF prog info raw test[9] (line_info (subprog)): OK
BTF prog info raw test[10] (line_info (subprog + func_info)): OK
BTF prog info raw test[11] (line_info (subprog. missing 1st func line info)): OK
BTF prog info raw test[12] (line_info (subprog. missing 2nd func line info)): OK
BTF prog info raw test[13] (line_info (subprog. unordered insn offset)): OK

jit disabled:
BTF prog info raw test[5] (line_info (No subprog)): not jited. skipping jited_line_info check. OK
BTF prog info raw test[6] (line_info (No subprog. insn_off >= prog->len)): OK
BTF prog info raw test[7] (line_info (No subprog. zero tailing line_info): not jited. skipping jited_line_info check. OK
BTF prog info raw test[8] (line_info (No subprog. nonzero tailing line_info)): OK
BTF prog info raw test[9] (line_info (subprog)): not jited. skipping jited_line_info check. OK
BTF prog info raw test[10] (line_info (subprog + func_info)): not jited. skipping jited_line_info check. OK
BTF prog info raw test[11] (line_info (subprog. missing 1st func line info)): OK
BTF prog info raw test[12] (line_info (subprog. missing 2nd func line info)): OK
BTF prog info raw test[13] (line_info (subprog. unordered insn offset)): OK

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Refactor and bug fix in test_func_type in test_btf.c

1) bpf_load_program_xattr() is absorbing the EBIG error
   which makes testing this case impossible.  It is replaced
   with a direct syscall(__NR_bpf, BPF_PROG_LOAD,...).
2) The test_func_type() is renamed to test_info_raw() to
   prepare for the new line_info test in the next patch.
3) The bpf_obj_get_info_by_fd() testing for func_info
   is refactored to test_get_finfo().  A new
   test_get_linfo() will be added in the next patch
   for testing line_info purpose.
4) The test->func_info_cnt is checked instead of
   a static value "2".
5) Remove unnecessary "\n" in error message.
6) Adding back info_raw_test_num to the cmd arg such
   that a specific test case can be tested, like
   all other existing tests.

7) Fix a bug in handling expected_prog_load_failure.
   A test could pass even if prog_fd != -1 while
   expected_prog_load_failure is true.
8) The min rec_size check should be < 8 instead of < 4.

Fixes: 4798c4ba3ba9 ("tools/bpf: extends test_btf to test load/retrieve func_type info")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: tools: Sync uapi bpf.h

Sync uapi bpf.h to tools/include/uapi/linux for
the new bpf_line_info.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add bpf_line_info support

This patch adds bpf_line_info support.

It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
The "line_info", "line_info_cnt" and "line_info_rec_size" are added
to the "union bpf_attr".  The "line_info_rec_size" makes
bpf_line_info extensible in the future.

The new "check_btf_line()" ensures the userspace line_info is valid
for the kernel to use.

When the verifier is translating/patching the bpf_prog (through
"bpf_patch_insn_single()"), the line_infos' insn_off is also
adjusted by the newly added "bpf_adj_linfo()".

If the bpf_prog is jited, this patch also provides the jited addrs (in
aux->jited_linfo) for the corresponding line_info.insn_off.
"bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
It is currently called by the x86 jit.  Other jits can also use
"bpf_prog_fill_jited_linfo()" and it will be done in the followup patches.
In the future, if it deemed necessary, a particular jit could also provide
its own "bpf_prog_fill_jited_linfo()" implementation.

A few "*line_info*" fields are added to the bpf_prog_info such
that the user can get the xlated line_info back (i.e. the line_info
with its insn_off reflecting the translated prog).  The jited_line_info
is available if the prog is jited.  It is an array of __u64.
If the prog is not jited, jited_line_info_cnt is 0.

The verifier's verbose log with line_info will be done in
a follow up patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net/sched: cls_flower: Reject duplicated rules also under skip_sw

Currently, duplicated rules are rejected only for skip_hw or "none",
hence allowing users to push duplicates into HW for no reason.

Use the flower tables to protect for that.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reported-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-Bug-fixes'

Michael Chan says:

====================
bnxt_en: Bug fixes.

The first patch fixes a regression on CoS queue setup, introduced
recently by the 57500 new chip support patches. The rest are
fixes related to ring and resource accounting on the new 57500 chips.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.

The CP rings are accounted differently on the new 57500 chips. There
must be enough CP rings for the sum of RX and TX rings on the new
chips. The current logic may be over-estimating the RX and TX rings.

The output parameter max_cp should be the maximum NQs capped by
MSIX vectors available for networking in the context of 57500 chips.
The existing code which uses CMPL rings capped by the MSIX vectors
works most of the time but is not always correct.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.

The new 57500 chips have introduced the NQ structure in addition to
the existing CP rings in all chips.  We need to introduce a new
bnxt_nq_rings_in_use().  On legacy chips, the 2 functions are the
same and one will just call the other.  On the new chips, they
refer to the 2 separate ring structures.  The new function is now
called to determine the resource (NQ or CP rings) associated with
MSIX that are in use.

On 57500 chips, the RDMA driver does not use the CP rings so
we don't need to do the subtraction adjustment.

Fixes: 41e8d7983752 ("bnxt_en: Modify the ring reservation functions for 57500 series chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Keep track of reserved IRQs.

The new 57500 chips use 1 NQ per MSIX vector, whereas legacy chips use
1 CP ring per MSIX vector.  To better unify this, add a resv_irqs
field to struct bnxt_hw_resc.  On legacy chips, we initialize resv_irqs
with resv_cp_rings.  On new chips, we initialize it with the allocated
MSIX resources.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Fix CNP CoS queue regression.

Recent changes to support the 57500 devices have created this
regression. The bnxt_hwrm_queue_qportcfg() call was moved to be
called earlier before the RDMA support was determined, causing
the CoS queues configuration to be set before knowing whether RDMA
was supported or not. Fix it by moving it to the right place right
after RDMA support is determined.

Fixes: 98f04cf0f1fc ("bnxt_en: Check context memory requirements from firmware.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'char-misc-4.20-rc6' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
"Here are some small driver fixes for 4.20-rc6.

  There is a hyperv fix that for some reaon took forever to get into a
  shape that could be applied to the tree properly, but resolves a much
  reported issue. The others are some gnss patches, one a bugfix and the
  two others updates to the MAINTAINERS file to properly match the gnss
  files in the tree.

  All have been in linux-next for a while with no reported issues"

* tag 'char-misc-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  MAINTAINERS: exclude gnss from SIRFPRIMA2 regex matching
  MAINTAINERS: add gnss scm tree
  gnss: sirf: fix activation retry handling
  Drivers: hv: vmbus: Offload the handling of channels to two workqueues

Merge tag 'staging-4.20-rc6' of git://git./linux/kernel/git/gregkh/staging

Pull staging fixes from Greg KH:
"Here are two staging driver bugfixes for 4.20-rc6.

  One is a revert of a previously incorrect patch that was merged a
  while ago, and the other resolves a possible buffer overrun that was
  found by code inspection.

  Both of these have been in the linux-next tree with no reported
  issues"

* tag 'staging-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  Revert commit ef9209b642f "staging: rtl8723bs: Fix indenting errors and an off-by-one mistake in core/rtw_mlme_ext.c"
  staging: rtl8712: Fix possible buffer overrun

Merge tag 'tty-4.20-rc6' of git://git./linux/kernel/git/gregkh/tty

Pull tty driver fixes from Greg KH:
"Here are three small tty driver fixes for 4.20-rc6

  Nothing major, just some bug fixes for reported issues. Full details
  are in the shortlog.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  kgdboc: fix KASAN global-out-of-bounds bug in param_set_kgdboc_var()
  tty: serial: 8250_mtk: always resume the device in probe.
  tty: do not set TTY_IO_ERROR flag if console port

Merge tag 'usb-4.20-rc6' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB fixes for 4.20-rc6

  The "largest" here are some xhci fixes for reported issues. Also here
  is a USB core fix, some quirk additions, and a usb-serial fix which
  required the export of one of the tty layer's functions to prevent
  code duplication. The tty maintainer agreed with this change.

  All of these have been in linux-next with no reported issues"

* tag 'usb-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  xhci: Prevent U1/U2 link pm states if exit latency is too long
  xhci: workaround CSS timeout on AMD SNPS 3.0 xHC
  USB: check usb_get_extra_descriptor for proper size
  USB: serial: console: fix reported terminal settings
  usb: quirk: add no-LPM quirk on SanDisk Ultra Flair device
  USB: Fix invalid-free bug in port_over_current_notify()
  usb: appledisplay: Add 27" Apple Cinema Display

Merge tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
"Three small fixes: a fix for smb3 direct i/o, a fix for CIFS DFS for
  stable and a minor cifs Kconfig fix"

* tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Avoid returning EBUSY to upper layer VFS
  cifs: Fix separator when building path from dentry
  cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)

Merge tag 'dax-fixes-4.20-rc6' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull dax fixes from Dan Williams:
"The last of the known regression fixes and fallout from the Xarray
  conversion of the filesystem-dax implementation.

  On the path to debugging why the dax memory-failure injection test
  started failing after the Xarray conversion a couple more fixes for
  the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced.
  Those plus the bug that started the hunt are now addressed. These
  patches have appeared in a -next release with no issues reported.

  Note the touches to mm/memory-failure.c are just the conversion to the
  new function signature for dax_lock_page().

  Summary:

   - Fix the Xarray conversion of fsdax to properly handle
     dax_lock_mapping_entry() in the presense of pmd entries

   - Fix inode destruction racing a new lock request"

* tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix unlock mismatch with updated API
  dax: Don't access a freed inode
  dax: Check page->mapping isn't NULL

Merge tag 'libnvdimm-fixes-4.20-rc6' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
"A regression fix for the Address Range Scrub implementation, yes
  another one, and support for platforms that misalign persistent memory
  relative to the Linux memory hotplug section constraint. Longer term,
  support for sub-section memory hotplug would alleviate alignment
  waste, but until then this hack allows a 'struct page' memmap to be
  established for these misaligned memory regions.

  These have all appeared in a -next release, and thanks to Patrick for
  reporting and testing the alignment padding fix.

  Summary:

   - Unless and until the core mm handles memory hotplug units smaller
     than a section (128M), persistent memory namespaces must be padded
     to section alignment.

     The libnvdimm core already handled section collision with "System
     RAM", but some configurations overlap independent "Persistent
     Memory" ranges within a section, so additional padding injection is
     added for that case.

   - The recent reworks of the ARS (address range scrub) state machine
     to reduce the number of state flags inadvertantly missed a
     conversion of acpi_nfit_ars_rescan() call sites. Fix the regression
     whereby user-requested ARS results in a "short" scrub rather than a
     "long" scrub.

   - Fixup the unit tests to handle / test the 128M section alignment of
     mocked test resources.

* tag 'libnvdimm-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  acpi/nfit: Fix user-initiated ARS to be "ARS-long" rather than "ARS-short"
  libnvdimm, pfn: Pad pfn namespaces relative to other regions
  tools/testing/nvdimm: Align test resources to 128M

net: dsa: Make dsa_master_set_mtu() static

Add the missing static keyword.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Restore MTU on master device on unload

A previous change tries to set the MTU on the master device to take
into account the DSA overheads. This patch tries to reset the master
device back to the default MTU.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'platform-data-controls-for-mdio-gpio'

Andrew Lunn says:

====================
platform data controls for mdio-gpio

Soon to be mainlined is an x86 platform with a Marvell switch, and a
bit-banging MDIO bus. In order to make this work, the phy_mask of the
MDIO bus needs to be set to prevent scanning for PHYs, and the
phy_ignore_ta_mask needs to be set because the switch has broken
turnaround.

Add a platform_data structure with these parameters.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

The Marvell 6390 Ethernet switch family does not perform MDIO
turnaround correctly. Many hardware MDIO bus masters don't care about
this, but the bitbangging implementation in Linux does by default. Add
phy_ignore_ta_mask to the platform data so that the bitbangging code
can be told which devices are known to get TA wrong.

v2
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mdio-gpio: Add platform_data support for phy_mask

It is sometimes necessary to instantiate a bit-banging MDIO bus as a
platform device, without the aid of device tree.

When device tree is being used, the bus is not scanned for devices,
only those devices which are in device tree are probed. Without device
tree, by default, all addresses on the bus are scanned. This may then
find a device which is not a PHY, e.g. a switch. And the switch may
have registers containing values which look like a PHY. So during the
scan, a PHY device is wrongly created.

After the bus has been registered, a search is made for
mdio_board_info structures which indicates devices on the bus, and the
driver which should be used for them. This is typically used to
instantiate Ethernet switches from platform drivers. However, if the
scanning of the bus has created a PHY device at the same location as
indicated into the board info for a switch, the switch device is not
created, since the address is already busy.

This can be avoided by setting the phy_mask of the mdio bus. This mask
prevents addresses on the bus being scanned.

v2
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Correctly set PFC param if global pause is turned off.

rx_ppp and tx_ppp can be set between 0 and 255, so don't clamp to 1.

Fixes: 6e8814ceb7e8 ("net/mlx4_en: Fix mixed PFC and Global pause user control requests")
Signed-off-by: Tarick Bedeir <tarick@google.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fixes' of git://git./linux/kernel/git/evalenti/linux-soc-thermal

Pull thermal SoC fixes from Eduardo Valentin:
"Fixes for armada and broadcom thermal drivers"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
  thermal: broadcom: constify thermal_zone_of_device_ops structure
  thermal: armada: constify thermal_zone_of_device_ops structure
  thermal: bcm2835: Switch to SPDX identifier
  thermal: armada: fix legacy resource fixup
  thermal: armada: fix legacy validity test sense

ip: silence udp zerocopy smatch false positive

extra_uref is used in __ip(6)_append_data only if uarg is set.

Smatch sees that the variable is passed to sock_zerocopy_put_abort.
This function accesses it only when uarg is set, but smatch cannot
infer this.

Make this dependency explicit.

Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'asm-generic-4.20' of git://git./linux/kernel/git/arnd/asm-generic

Pull asm-generic fix from Arnd Bergmann:
"Multiple people reported a bug I introduced in asm-generic/unistd.h in
  4.20, this is the obvious bugfix to get glibc and others to correctly
  build again on new architectures that no longer provide the old
  fstatat64() family of system calls"

* tag 'asm-generic-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  asm-generic: unistd.h: fixup broken macro include.

Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"A few clk driver fixes this time:

   - Introduce protected-clock DT binding to fix breakage on qcom
     sdm845-mtp boards where the qspi clks introduced this merge window
     cause the firmware on those boards to take down the system if we
     try to read the clk registers

   - Fix a couple off-by-one errors found by Dan Carpenter

   - Handle failure in zynq fixed factor clk driver to avoid using
     uninitialized data"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: zynqmp: Off by one in zynqmp_is_valid_clock()
  clk: mmp: Off by one in mmp_clk_add()
  clk: mvebu: Off by one bugs in cp110_of_clk_get()
  arm64: dts: qcom: sdm845-mtp: Mark protected gcc clocks
  clk: qcom: Support 'protected-clocks' property
  dt-bindings: clk: Introduce 'protected-clocks' property
  clk: zynqmp: handle fixed factor param query error

Merge tag 'xfs-4.20-fixes-3' of git://git./fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
"Here are hopefully the last set of fixes for 4.20.

  There's a fix for a longstanding statfs reporting problem with project
  quotas, a correction for page cache invalidation behaviors when
  fallocating near EOF, and a fix for a broken metadata verifier return
  code.

  Finally, the most important fix is to the pipe splicing code (aka the
  generic copy_file_range fallback) to avoid pointless short directio
  reads by only asking the filesystem for as much data as there are
  available pages in the pipe buffer. Our previous fix (simulated short
  directio reads because the number of pages didn't match the length of
  the read requested) caused subtle problems on overlayfs, so that part
  is reverted.

  Anyhow, this series passes fstests -g all on xfs and overlay+xfs, and
  has passed 17 billion fsx operations problem-free since I started
  testing

  Summary:

   - Fix broken project quota inode counts

   - Fix incorrect PAGE_MASK/PAGE_SIZE usage

   - Fix incorrect return value in btree verifier

   - Fix WARN_ON remap flags false positive

   - Fix splice read overflows"

* tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)
  splice: don't read more than available pipe space
  vfs: allow some remap flags to be passed to vfs_clone_file_range
  xfs: fix inverted return from xfs_btree_sblock_verify_crc
  xfs: fix PAGE_MASK usage in xfs_free_file_space
  fs/xfs: fix f_ffree value for statfs when project quota is set

Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore
node-local hugepage allocations").  The movement of the thp allocation
policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
intended to only set __GFP_THISNODE for mempolicies that are not
MPOL_BIND whereas the revert could set this regardless of mempolicy.

While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
and alloc_pages_vma() was racy, that has since been removed since the
revert.  What is left is the possibility to use __GFP_THISNODE in
policy_node() when it is unexpected because the special handling for
hugepages in alloc_pages_vma()  was removed as part of the consolidation.

Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat
different policy for hugepage allocations, which were allocated through
alloc_hugepage_vma().  For hugepage allocations, if the allocating
process's node is in the set of allowed nodes, allocate with
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
__GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in
mm/mempolicy.c which is functionally different behavior and removes the
requirement to only allocate hugepages locally.

So this commit does a full revert of 89c83fb539f9 instead of the partial
revert that was done in 2f0799a0ffc0.  The result is the same thp
allocation policy for 4.20 that was in 4.19.

Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "net/ibm/emac: wrong bit is used for STA control"

This reverts commit 624ca9c33c8a853a4a589836e310d776620f4ab9.

This commit is completely bogus. The STACR register has two formats, old
and new, depending on the version of the IP block used. There's a pair of
device-tree properties that can be used to specify the format used:

has-inverted-stacr-oc
has-new-stacr-staopc

What this commit did was to change the bit definition used with the old
parts to match the new parts. This of course breaks the driver on all
the old ones.

Instead, the author should have set the appropriate properties in the
device-tree for the variant used on his board.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tc-testing-next'

Lucas Bates says:

====================
tc-testing: implement command timeouts and better results tracking

Patch 1 adds a timeout feature for any command tdc launches in a subshell.
This prevents tdc from hanging indefinitely.

Patches 2-4 introduce a new method for tracking and generating test case
results, and implements it across the core script and all applicable
plugins.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: gitignore, ignore generated test results

Ignore any .tap or .xml test result files generated by tdc.

Additionally, ignore plugin symlinks.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: Implement the TdcResults module in tdc

In tdc and the valgrind plugin, begin using the TdcResults module
to track executed test cases.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: Add new TdcResults module

This module includes new classes for tdc to use in keeping track
of test case results, instead of generating and tracking a lengthy
string.

The new module can be extended to support multiple formal
test result formats to be friendlier to automation.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: Add command timeout feature to tdc

Using an attribute set in the tdc_config.py file, limit the
amount of time tdc will wait for an executed command to
complete and prevent the script from hanging entirely.

This timeout will be applied to all executed commands.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'skb-headroom-slab-out-of-bounds'

Stefano Brivio says:

====================
Fix slab out-of-bounds on insufficient headroom for IPv6 packets

Patch 1/2 fixes a slab out-of-bounds occurring with short SCTP packets over
IPv4 over L2TP over IPv6 on a configuration with relatively low HEADER_MAX.

Patch 2/2 makes sure we avoid writing before the allocated buffer in
neigh_hh_output() in case the headroom is enough for the unaligned hardware
header size, but not enough for the aligned one, and that we warn if we hit
this condition.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

neighbour: Avoid writing before skb->head in neigh_hh_output()

While skb_push() makes the kernel panic if the skb headroom is less than
the unaligned hardware header size, it will proceed normally in case we
copy more than that because of alignment, and we'll silently corrupt
adjacent slabs.

In the case fixed by the previous patch,
"ipv6: Check available headroom in ip6_xmit() even without options", we
end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
header and write 16 bytes, starting 2 bytes before the allocated buffer.

Always check we're not writing before skb->head and, if the headroom is
not enough, warn and drop the packet.

v2:
- instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
   (Eric Dumazet)
- if we avoid the panic, though, we need to explicitly check the headroom
   before the memcpy(), otherwise we'll have corrupted slabs on a running
   kernel, after we warn
- use __skb_push() instead of skb_push(), as the headroom check is
   already implemented here explicitly (Eric Dumazet)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Check available headroom in ip6_xmit() even without options

Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.

On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().

Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().

KASan says:

[  264.967848] ==================================================================
[  264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
[  264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
[  264.967870]
[  264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[  264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[  264.967887] Call Trace:
[  264.967896] ([<00000000001347d6>] show_stack+0x56/0xa0)
[  264.967903]  [<00000000017e379c>] dump_stack+0x23c/0x290
[  264.967912]  [<00000000007bc594>] print_address_description+0xf4/0x290
[  264.967919]  [<00000000007bc8fc>] kasan_report+0x13c/0x240
[  264.967927]  [<000000000162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[  264.967935]  [<000000000163f890>] ip6_finish_output+0x430/0x7f0
[  264.967943]  [<000000000163fe44>] ip6_output+0x1f4/0x580
[  264.967953]  [<000000000163882a>] ip6_xmit+0xfea/0x1ce8
[  264.967963]  [<00000000017396e2>] inet6_csk_xmit+0x282/0x3f8
[  264.968033]  [<000003ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[  264.968037]  [<000003ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[  264.968041]  [<0000000001220020>] dev_hard_start_xmit+0x268/0x928
[  264.968069]  [<0000000001330e8e>] sch_direct_xmit+0x7ae/0x1350
[  264.968071]  [<000000000122359c>] __dev_queue_xmit+0x2b7c/0x3478
[  264.968075]  [<00000000013d2862>] ip_finish_output2+0xce2/0x11a0
[  264.968078]  [<00000000013d9b14>] ip_finish_output+0x56c/0x8c8
[  264.968081]  [<00000000013ddd1e>] ip_output+0x226/0x4c0
[  264.968083]  [<00000000013dbd6c>] __ip_queue_xmit+0x894/0x1938
[  264.968100]  [<000003ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[  264.968116]  [<000003ff80b7bf68>] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[  264.968131]  [<000003ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[  264.968146]  [<000003ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
[  264.968161]  [<000003ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[  264.968177]  [<000003ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[  264.968192]  [<000003ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[  264.968208]  [<000003ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[  264.968212]  [<0000000001197942>] __sys_connect+0x21a/0x450
[  264.968215]  [<000000000119aff8>] sys_socketcall+0x3d0/0xb08
[  264.968218]  [<000000000184ea7a>] system_call+0x2a2/0x2c0

[...]

Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.

This issue is older than git history.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: lack of available data can also cause TSO defer

tcp_tso_should_defer() can return true in three different cases :

1) We are cwnd-limited
2) We are rwnd-limited
3) We are application limited.

Neal pointed out that my recent fix went too far, since
it assumed that if we were not in 1) case, we must be rwnd-limited

Fix this by properly populating the is_cwnd_limited and
is_rwnd_limited booleans.

After this change, we can finally move the silly check for FIN
flag only for the application-limited case.

The same move for EOR bit will be handled in net-next,
since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
with eor mark (MSG_EOR)") is scheduled for linux-4.21

Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100
and checking none of them was rwnd_limited in the chrono_stat
output from "ss -ti" command.

Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: call sk_dst_reset when set SO_DONTROUTE

after set SO_DONTROUTE to 1, the IP layer should not route packets if
the dest IP address is not in link scope. But if the socket has cached
the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user set
SO_DONTROUTE option. Below are server/client python scripts which
could reprodue this issue:

server side code:

==========================================================================
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
    sock.send(b'foo')
    time.sleep(1)
==========================================================================

client side code:
==========================================================================
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
    data = s.recv(1024)
    print(data)
==========================================================================

Signed-off-by: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

neighbor: Improve garbage collection

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago gc kicks in
   walks the entire hash table evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---------x|----------|-----
            t-5         t         t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is over kill and contributes to trashing
   especially during startup.

This patch addresses these problems as follows:

1. Use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>