git.kernel.dk Git - linux-block.git/log

bnxt_en: Update MODULE_DESCRIPTION

Update MODULE_DESCRIPTION to the more generic adapter family name.
The old name only includes the first generation of supported
adapters.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-8-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Utilize ulp client resources if RoCE is not registered

If the RoCE driver is not registered for a RoCE capable device, add
flexibility to use the RoCE resources (MSIX/NQs) for L2 purposes,
such as additional rings configured by the user or for XDP.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Change MSIX/NQs allocation policy

The existing scheme sets aside a number of MSIX/NQs for the RoCE
driver whether the RoCE driver is registered or not.  This scheme
is not flexible and limits the resources available for the L2 rings
if RoCE is never used.

Modify the scheme so that the RoCE MSIX/NQs can be used by the L2
driver if they are not used for RoCE.  The MSIX/NQs are now
represented by 3 fields.  bp->ulp_num_msix_want contains the
desired default value, edev->ulp_num_msix_vec contains the
available value (but not necessarily in use), and
ulp_tbl->msix_requested contains the actual value in use by RoCE.

The L2 driver can dip into edev->ulp_num_msix_vec if necessary.

We need to add rtnl_lock() back in bnxt_register_dev() and
bnxt_unregister_dev() to synchronize the MSIX usage between L2 and
RoCE.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions

In its current form, bnxt_rdma_aux_device_init() not only initializes
the necessary data structures of the newly created aux device but also
adds the aux device into the aux bus subsytem. Refactor the logic into
separate functions, first function to initialize the aux device along
with the required resources and second, to actually add the device to
the aux bus subsytem.
This separation helps to create bnxt_en_dev much earlier and save its
resources separately.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Remove unneeded MSIX base structure fields and code

Ever since commit:

303432211324 ("bnxt_en: Remove runtime interrupt vector allocation")

The MSIX base vector is effectively always 0. Remove all unneeded
structure fields and code referencing the MSIX base.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Remove a redundant NULL check in bnxt_register_dev()

The memory for "edev->ulp_tbl" is allocated inside the
bnxt_rdma_aux_device_init() function. If it fails, the driver
will not create the auxiliary device for RoCE. Hence the NULL
check inside bnxt_register_dev() is unnecessary.

Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Skip ethtool RSS context configuration in ifdown state

The current implementation requires the ifstate to be up when
configuring the RSS contexts. It will try to fetch the RX ring
IDs and will crash if it is in ifdown state. Return error if
!netif_running() to prevent the crash.

An improved implementation is in the works to allow RSS contexts
to be changed while in ifdown state.

Fixes: b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_ethtool: Add support for 100Gb/s per lane link modes

The Spectrum-4 ASIC supports 100Gb/s per lane link modes, but the only
one currently supported by the driver is 800Gb/s over eight lanes.

Add support for 100Gb/s over one lane, 200Gb/s over two lanes and
400Gb/s over four lanes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/1d77830f6abcc4f0d57a7f845e5a6d97a75a434b.1712667750.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: fib_rule_tests: Add VRF tests

After commit 40867d74c374 ("net: Add l3mdev index to flow struct and
avoid oif reset for port devices") it is possible to configure FIB rules
that match on iif / oif being a l3mdev port. It was not possible before
as these parameters were reset to the ifindex of the l3mdev device
itself prior to the FIB rules lookup.

Add tests that cover this functionality as it does not seem to be
covered by existing ones and I am aware of at least one user that needs
this functionality in addition to the one mentioned in [1].

Reuse the existing FIB rules tests by simply configuring a VRF prior to
the test and removing it afterwards. Differentiate the output of the
non-VRF tests from the VRF tests by appending "(VRF)" to the test name
if a l3mdev FIB rule is present.

Verified that these tests do fail on kernel 5.15.y which does not
include the previously mentioned commit:

# ./fib_rule_tests.sh -t fib_rule6_vrf
[...]
     TEST: rule6 check: oif redirect to table (VRF)                      [FAIL]
[...]
     TEST: rule6 check: iif redirect to table (VRF)                      [FAIL]

# ./fib_rule_tests.sh -t fib_rule4_vrf
[...]
     TEST: rule4 check: oif redirect to table (VRF)                      [FAIL]
[...]
     TEST: rule4 check: iif redirect to table (VRF)                      [FAIL]

[1] https://lore.kernel.org/netdev/20200922131122.GB1601@ICIPI.localdomain/

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240409110816.2508498-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: team: fix incorrect maxattr

The maxattr should be the latest attr value, i.e. array size - 1,
not total array size.

Reported-by: syzbot+ecd7e07b4be038658c9f@syzkaller.appspotmail.com
Fixes: 948dbafc15da ("net: team: use policy generated by YAML spec")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240409092812.3999785-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wan: fsl_qmc_hdlc: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Herve Codina <herve.codina@bootlin.com>
Link: https://lore.kernel.org/r/20240409091203.39062-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink/specs: Add bond support to rt_link.yaml

Add bond support to rt_link.yaml. Here is an example output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \
   --do getlink --json '{"ifname": "bond0"}' --output-json | jq '.linkinfo'
{
   "kind": "bond",
   "data": {
     "mode": 4,
     "miimon": 100,
     ...
     "arp-interval": 0,
     "arp-ip-target": [
       "192.168.1.1",
       "192.168.1.2"
     ],
     "arp-validate": 0,
     "arp-all-targets": 0,
     "ns-ip6-target": [
       "2001::1",
       "2001::2"
     ],
     "primary-reselect": 0,
     ...
     "missed-max": 2,
     "ad-info": {
       "aggregator": 1,
       "num-ports": 1,
       "actor-key": 0,
       "partner-key": 1,
       "partner-mac": "00:00:00:00:00:00"
     }
   }
}

And here is the downlink info.

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \
   --do getlink --json '{"ifname": "dummy0"}' --output-json | jq '.linkinfo'
{
   "kind": "dummy",
   "slave-kind": "bond",
   "slave-data": {
     "state": 0,
     "mii-status": 0,
     "link-failure-count": 0,
     "perm-hwaddr": "f2:82:f7:cc:47:13",
     "queue-id": 0,
     "prio": 0
   }
}

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240409083504.3900877-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: update tsinfo statistics attribute docs with correct type

nla_put_uint can either write a u32 or u64 netlink attribute value. The
size depends on whether the value can be represented with a u32 or requires
a u64. Use a uint annotation in various documentation to represent this.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20240409232520.237613-2-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'optimise-local-cpu-skb_attempt_defer_free'

Pavel Begunkov says:

====================
optimise local CPU skb_attempt_defer_free

Optimise the case when an skb comes to skb_attempt_defer_free()
on the same CPU it was allocated on. The patch 1 enables skb caches
and gives frags a chance to hit the page pool's fast path.
CPU bound benchmarking with perfect skb_attempt_defer_free()
gives around 1% of extra throughput.
====================

Link: https://lore.kernel.org/r/cover.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: use SKB_CONSUMED in skb_attempt_defer_free()

skb_attempt_defer_free() is used to free already processed skbs, so pass
SKB_CONSUMED as the reason in kfree_skb_napi_cache().

Suggested-by: Jason Xing <kerneljasonxing@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/bcf5dbdda79688b074ab7ae2238535840a6d3fc2.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cache for same cpu skb_attempt_defer_free

Optimise skb_attempt_defer_free() when run by the same CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
I'd expect the win doubled with rx only benchmarks, as the optimisation
is for the receive path, but the test spends >55% of CPU doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/a887463fb219d973ec5ad275e31194812571f1f5.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: tweak tcp_sock_write_txrx size assertion

I forgot 32bit arches might have 64bit alignment for u64
fields.

tcp_sock_write_txrx group does not contain pointers,
but two u64 fields. It is possible that on 32bit kernel,
a 32bit hole is before tp->tcp_clock_cache.

I will try to remember a group can be bigger on 32bit
kernels in the future.

With help from Vladimir Oltean.

Fixes: d2c3a7eb1afa ("tcp: more struct tcp_sock adjustments")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404082207.HCEdQhUO-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240409140914.4105429-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-move-bpf-offload-test-from-bpf-to-net'

Jakub Kicinski says:

====================
selftests: move bpf-offload test from bpf to net

The test_offload.py test fits in networking and bpf equally
well. We started adding more Python tests in networking
and some of the code in test_offload.py can be reused,
so move it to networking. Looks like it bit rotted over
time and some fixes are needed.

Admittedly more code could be extracted but I only had
the time for a minor cleanup :(
====================

Link: https://lore.kernel.org/r/20240409031549.3531084-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: reuse common code in bpf_offload

net/lib/py/nsim.py already contains the most useful parts
of the netdevsim wrapper classes. Reuse them.

Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240409031549.3531084-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: declare section names for bpf_offload

Non-ancient ip (iproute2-5.15.0, libbpf 0.7.0) refuses to load
the sample with maps because we don't generate BTF:

   libbpf: BTF is required, but is missing or corrupted.
   ERROR: opening BPF object file failed

Enable BTF by adding -g to clang flags. With that done
neither of the programs load:

  libbpf: prog 'func': error relocating .BTF.ext function info: -22
  libbpf: prog 'func': failed to relocate calls: -22
  libbpf: failed to load object 'ksft-net-drv/net/sample_ret0.bpf.o'

Andrii explains that this is because we don't specify
section names for the code. Add the section names, too.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240409031549.3531084-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: bpf_offload: wait for maps

Maps are removed asynchronously. Either there's a bigger delay
now or the test has always been flaky. Retry waiting in the loop.

Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240409031549.3531084-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: move bpf-offload test from bpf to net

We're building more python tests on the netdev side, and some
of the classes from the venerable BPF offload tests can be reused.

Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240409031549.3531084-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: cls_api: fix slab-use-after-free in fl_dump_key

The filter counter is updated under the protection of cb_lock in the
cited commit. While waiting for the lock, it's possible the filter is
being deleted by other thread, and thus causes UAF when dump it.

Fix this issue by moving tcf_block_filter_cnt_update() after
tfilter_put().

==================================================================
BUG: KASAN: slab-use-after-free in fl_dump_key+0x1d3e/0x20d0 [cls_flower]
Read of size 4 at addr ffff88814f864000 by task tc/2973

CPU: 7 PID: 2973 Comm: tc Not tainted 6.9.0-rc2_for_upstream_debug_2024_04_02_12_41 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
  <TASK>
  dump_stack_lvl+0x7e/0xc0
  print_report+0xc1/0x600
  ? __virt_addr_valid+0x1cf/0x390
  ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  kasan_report+0xb9/0xf0
  ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  fl_dump_key+0x1d3e/0x20d0 [cls_flower]
  ? lock_acquire+0x1c2/0x530
  ? fl_dump+0x172/0x5c0 [cls_flower]
  ? lockdep_hardirqs_on_prepare+0x400/0x400
  ? fl_dump_key_options.part.0+0x10f0/0x10f0 [cls_flower]
  ? do_raw_spin_lock+0x12d/0x270
  ? spin_bug+0x1d0/0x1d0
  fl_dump+0x21d/0x5c0 [cls_flower]
  ? fl_tmplt_dump+0x1f0/0x1f0 [cls_flower]
  ? nla_put+0x15f/0x1c0
  tcf_fill_node+0x51b/0x9a0
  ? tc_skb_ext_tc_enable+0x150/0x150
  ? __alloc_skb+0x17b/0x310
  ? __build_skb_around+0x340/0x340
  ? down_write+0x1b0/0x1e0
  tfilter_notify+0x1a5/0x390
  ? fl_terse_dump+0x400/0x400 [cls_flower]
  tc_new_tfilter+0x963/0x2170
  ? tc_del_tfilter+0x1490/0x1490
  ? print_usage_bug.part.0+0x670/0x670
  ? lock_downgrade+0x680/0x680
  ? security_capable+0x51/0x90
  ? tc_del_tfilter+0x1490/0x1490
  rtnetlink_rcv_msg+0x75e/0xac0
  ? if_nlmsg_stats_size+0x4c0/0x4c0
  ? lockdep_set_lock_cmp_fn+0x190/0x190
  ? __netlink_lookup+0x35e/0x6e0
  netlink_rcv_skb+0x12c/0x360
  ? if_nlmsg_stats_size+0x4c0/0x4c0
  ? netlink_ack+0x15e0/0x15e0
  ? lockdep_hardirqs_on_prepare+0x400/0x400
  ? netlink_deliver_tap+0xcd/0xa60
  ? netlink_deliver_tap+0xcd/0xa60
  ? netlink_deliver_tap+0x1c9/0xa60
  netlink_unicast+0x43e/0x700
  ? netlink_attachskb+0x750/0x750
  ? lock_acquire+0x1c2/0x530
  ? __might_fault+0xbb/0x170
  netlink_sendmsg+0x749/0xc10
  ? netlink_unicast+0x700/0x700
  ? __might_fault+0xbb/0x170
  ? netlink_unicast+0x700/0x700
  __sock_sendmsg+0xc5/0x190
  ____sys_sendmsg+0x534/0x6b0
  ? import_iovec+0x7/0x10
  ? kernel_sendmsg+0x30/0x30
  ? __copy_msghdr+0x3c0/0x3c0
  ? entry_SYSCALL_64_after_hwframe+0x46/0x4e
  ? lock_acquire+0x1c2/0x530
  ? __virt_addr_valid+0x116/0x390
  ___sys_sendmsg+0xeb/0x170
  ? __virt_addr_valid+0x1ca/0x390
  ? copy_msghdr_from_user+0x110/0x110
  ? __delete_object+0xb8/0x100
  ? __virt_addr_valid+0x1cf/0x390
  ? do_sys_openat2+0x102/0x150
  ? lockdep_hardirqs_on_prepare+0x284/0x400
  ? do_sys_openat2+0x102/0x150
  ? __fget_light+0x53/0x1d0
  ? sockfd_lookup_light+0x1a/0x150
  __sys_sendmsg+0xb5/0x140
  ? __sys_sendmsg_sock+0x20/0x20
  ? lock_downgrade+0x680/0x680
  do_syscall_64+0x70/0x140
  entry_SYSCALL_64_after_hwframe+0x46/0x4e
RIP: 0033:0x7f98e3713367
Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffc74a64608 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000047eae0 RCX: 00007f98e3713367
RDX: 0000000000000000 RSI: 00007ffc74a64670 RDI: 0000000000000003
RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f98e360c5e8 R11: 0000000000000246 R12: 00007ffc74a6a508
R13: 00000000660d518d R14: 0000000000484a80 R15: 00007ffc74a6a50b
  </TASK>

Allocated by task 2973:
  kasan_save_stack+0x20/0x40
  kasan_save_track+0x10/0x30
  __kasan_kmalloc+0x77/0x90
  fl_change+0x27a6/0x4540 [cls_flower]
  tc_new_tfilter+0x879/0x2170
  rtnetlink_rcv_msg+0x75e/0xac0
  netlink_rcv_skb+0x12c/0x360
  netlink_unicast+0x43e/0x700
  netlink_sendmsg+0x749/0xc10
  __sock_sendmsg+0xc5/0x190
  ____sys_sendmsg+0x534/0x6b0
  ___sys_sendmsg+0xeb/0x170
  __sys_sendmsg+0xb5/0x140
  do_syscall_64+0x70/0x140
  entry_SYSCALL_64_after_hwframe+0x46/0x4e

Freed by task 283:
  kasan_save_stack+0x20/0x40
  kasan_save_track+0x10/0x30
  kasan_save_free_info+0x37/0x50
  poison_slab_object+0x105/0x190
  __kasan_slab_free+0x11/0x30
  kfree+0x111/0x340
  process_one_work+0x787/0x1490
  worker_thread+0x586/0xd30
  kthread+0x2df/0x3b0
  ret_from_fork+0x2d/0x70
  ret_from_fork_asm+0x11/0x20

Last potentially related work creation:
  kasan_save_stack+0x20/0x40
  __kasan_record_aux_stack+0x9b/0xb0
  insert_work+0x25/0x1b0
  __queue_work+0x640/0xc90
  rcu_work_rcufn+0x42/0x70
  rcu_core+0x6a9/0x1850
  __do_softirq+0x264/0x88f

Second to last potentially related work creation:
  kasan_save_stack+0x20/0x40
  __kasan_record_aux_stack+0x9b/0xb0
  __call_rcu_common.constprop.0+0x6f/0xac0
  queue_rcu_work+0x56/0x70
  fl_mask_put+0x20d/0x270 [cls_flower]
  __fl_delete+0x352/0x6b0 [cls_flower]
  fl_delete+0x97/0x160 [cls_flower]
  tc_del_tfilter+0x7d1/0x1490
  rtnetlink_rcv_msg+0x75e/0xac0
  netlink_rcv_skb+0x12c/0x360
  netlink_unicast+0x43e/0x700
  netlink_sendmsg+0x749/0xc10
  __sock_sendmsg+0xc5/0x190
  ____sys_sendmsg+0x534/0x6b0
  ___sys_sendmsg+0xeb/0x170
  __sys_sendmsg+0xb5/0x140
  do_syscall_64+0x70/0x140
  entry_SYSCALL_64_after_hwframe+0x46/0x4e

Fixes: 2081fd3445fe ("net: sched: cls_api: add filter counter")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'minor-cleanups-to-skb-frag-ref-unref'

Mina Almasry says:

====================
Minor cleanups to skb frag ref/unref (part)

This series is largely motivated by a recent discussion where there was
some confusion on how to properly ref/unref pp pages vs non pp pages:

https://lore.kernel.org/netdev/CAHS8izOoO-EovwMwAm9tLYetwikNPxC0FKyVGu1TPJWSz4bGoA@mail.gmail.com/T/#t

There is some subtely there because pp uses page->pp_ref_count for
refcounting, while non-pp uses get_page()/put_page() for ref counting.
Getting the refcounting pairs wrong can lead to kernel crash.
[...]

https://lore.kernel.org/lkml/CAHS8izN436pn3SndrzsCyhmqvJHLyxgCeDpWXA4r1ANt3RCDLQ@mail.gmail.com/T/
====================

Link: https://lore.kernel.org/r/20240408153000.2152844-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove napi_frag_unref

With the changes in the last patches, napi_frag_unref() is now
reduandant. Remove it and use skb_page_unref directly.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240408153000.2152844-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: make napi_frag_unref reuse skb_page_unref

The implementations of these 2 functions are almost identical. Remove
the implementation of napi_frag_unref, and make it a call into
skb_page_unref so we don't duplicate the implementation.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240408153000.2152844-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
net/e1000e, igb, igc: Remove redundant runtime resume

Bjorn Helgaas says:

e1000e, igb, and igc all have code to runtime resume the device during
ethtool operations.

Since f32a21376573 ("ethtool: runtime-resume netdev parent before ethtool
ioctl ops"), dev_ethtool() does this for us, so remove it from the
individual drivers.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  igc: Remove redundant runtime resume for ethtool ops
  igb: Remove redundant runtime resume for ethtool_ops
  e1000e: Remove redundant runtime resume for ethtool_ops
====================

Link: https://lore.kernel.org/r/20240408210849.3641172-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bonding-remove-rtnl-from-three-sysfs-files'

Eric Dumazet says:

====================
bonding: remove RTNL from three sysfs files

First patch might fix a potential deadlock.
sysfs handlers should use rtnl_trylock() instead of rtnl_lock().

Following files can be read without acquiring RTNL :

- /sys/class/net/bonding_masters
- /sys/class/net/<name>/bonding/slaves
- /sys/class/net/<name>/bonding/queue_id
====================

Link: https://lore.kernel.org/r/20240408190437.2214473-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: no longer use RTNL in bonding_show_queue_id()

Annotate lockless reads of slave->queue_id.

Annotate writes of slave->queue_id.

Switch bonding_show_queue_id() to rcu_read_lock()
and bond_for_each_slave_rcu().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20240408190437.2214473-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: no longer use RTNL in bonding_show_slaves()

Slave devices are already RCU protected, simply
switch to bond_for_each_slave_rcu(),

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20240408190437.2214473-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: no longer use RTNL in bonding_show_bonds()

netdev structures are already RCU protected.

Change bond_init() and bond_uninit() to use RCU
enabled list_add_tail_rcu() and list_del_rcu().

Then bonding_show_bonds() can use rcu_read_lock()
while iterating through bn->dev_list.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20240408190437.2214473-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: cake: Optimize the number of function calls and branches in heap construction

When constructing a heap, heapify operations are required on all
non-leaf nodes. Thus, determining the index of the first non-leaf node
is crucial. In a heap, the left child's index of node i is 2 * i + 1
and the right child's index is 2 * i + 2. Node CAKE_MAX_TINS *
CAKE_QUEUES / 2 has its left and right children at indexes
CAKE_MAX_TINS * CAKE_QUEUES + 1 and CAKE_MAX_TINS * CAKE_QUEUES + 2,
respectively, which are beyond the heap's range, indicating it as a
leaf node. Conversely, node CAKE_MAX_TINS * CAKE_QUEUES / 2 - 1 has a
left child at index CAKE_MAX_TINS * CAKE_QUEUES - 1, confirming its
non-leaf status. The loop should start from it since it's not a leaf
node.

By starting the loop from CAKE_MAX_TINS * CAKE_QUEUES / 2 - 1, we
minimize function calls and branch condition evaluations. This
adjustment theoretically reduces two function calls (one for
cake_heapify() and another for cake_heap_get_backlog()) and five branch
evaluations (one for iterating all non-leaf nodes, one within
cake_heapify()'s while loop, and three more within the while loop
with if conditions).

Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20240408174716.751069-1-visitorckw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cxgb4: flower: use NL_SET_ERR_MSG_MOD for validation errors

Replace netdev_{warn,err} with NL_SET_ERR_MSG_{FMT_,}MOD
to better inform the user about the problem.

Only compile-tested, no access to HW.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Link: https://lore.kernel.org/r/20240408165506.94483-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp8382x: keep WOL settings across suspends

Unlike other ethernet PHYs from TI, PHY dp8382x has WOL enabled
at reset. The driver explicitly disables WOL in config_init callback
which is called during init and during resume from suspend. Hence,
WOL is unconditionally disabled during resume, even if it was enabled
before the suspend. We make sure that WOL configuration is persistent
across suspends.

Signed-off-by: Catalin Popescu <catalin.popescu@leica-geosystems.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240408082602.3654090-1-catalin.popescu@leica-geosystems.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-micrel-lan8814-enable-ptp_pf_perout'

Horatiu Vultur says:

====================
net: phy: micrel: lan8814: Enable PTP_PF_PEROUT

Add support for PTP_PF_PEROUT to lan8814. First patch just enables
the LTC at probe time, such that it is not required to enable
timestamping to have the LTC enabled. While the second patch actually
adds support for PTP_PF_PEROUT.
====================

Link: https://lore.kernel.org/r/20240408064432.3881636-1-horatiu.vultur@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: micrel: lan8814: Add support for PTP_PF_PEROUT

Lan8814 has 24 GPIOs but only 2 GPIOs (GPIO 0 and GPIO 1) can be
configured to generate period signals. And there are 2 events (EVENT_A
and EVENT_B) but these events are hardcoded to the GPIO 0 and GPIO 1.
These events are used to generate period signals. It is possible to
configure the length, the start time and the period of the signal by
configuring the event.

These events are generated by comparing the target time with the PHC
time. In case the PHC time is changed to a value bigger than the target
time + reload time, then it would generate only 1 event and then it
would stop because target time + reload time is smaller than PHC time.
Therefore it is required to change also the target time every time when
the PHC is changed. The same will apply also when the PHC time is
changed to a smaller value.

This was tested using:
testptp -i 1 -L 1,2
testptp -i 1 -p 1000000000 -w 200000000

Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Divya Koppera <divya.koppera@microchip.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: micrel: lan8814: Enable LTC at probe time

The LTC for lan8814 was enabled only if timestamping was enabled,
otherwise it would be stopped. Meaning that LTC will not increase by
itself. This might break other features that don't required timestamping
like generating 1PPS. Therefore enable the LTC at probe time.

Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: rockchip-dwmac: use rgmii-id in example

The dwmac supports specifying the RGMII clock delays, but it is
recommended to use rgmii-id and to specify the delays in the phy node
instead [1].

Change the example accordingly to no longer promote this undesired
setting.

[1] https://lore.kernel.org/all/1a0de7b4-f0f7-4080-ae48-f5ffa9e76be3@lunn.ch/

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Dragan Simic <dsimic@manjaro.org>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240408-rockchip-dwmac-rgmii-id-binding-v1-1-3886d1a8bd54@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'tcp-fix-isn-selection-in-timewait-syn_recv'

Eric Dumazet says:

====================
tcp: fix ISN selection in TIMEWAIT -> SYN_RECV

TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt.

This SYN packet also bypasses normal checks against listen queue
being full or not.

Unfortunately this has been broken almost one decade ago.

This series fixes the issue, in two patches.

First patch refactors code to add tcp_tw_isn as a parameter
to ->route_req(), to make the second patch smaller.

Second patch fixes the issue, by no longer using TCP_SKB_CB(skb)
to store the tcp_tw_isn.

Following packetdrill test passes after this series:

// Set up a server listening socket.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// Establish connection
   +0 < S 0:0(0) win 32792 <mss 1460,nop,nop,sackOK>
   +0 > S. 0:0(0) ack 1    <mss 1460,nop,nop,sackOK>
+.01 < . 1:1(0) ack 1 win 32792

   +0 accept(3, ..., ...) = 4

// We close(), send a FIN, and get an ACK and FIN, in order to get into TIME_WAIT.

+.01 close(4) = 0
   +0 > F. 1:1(0) ack 1
+.01 < F. 1:1(0) ack 2 win 32792
   +0 > . 2:2(0) ack 2

// SYN hitting a TIME_WAIT -> should use an ISN based on TIMEWAIT tw_snd_nxt

+.01 < S 1000:1000(0) win 65535 <mss 1460,nop,nop,sackOK>
   +0 > S. 65539:65539(0) ack 1001 <mss 1460,nop,nop,sackOK>
====================

Link: https://lore.kernel.org/r/20240407093322.3172088-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt :

tcp_timewait_state_process()
...
    u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
    if (isn == 0)
        isn++;
    TCP_SKB_CB(skb)->tcp_tw_isn = isn;
    return TCP_TW_SYN;

This SYN packet also bypasses normal checks against listen queue
being full or not.

tcp_conn_request()
...
       __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
...
        /* TW buckets are converted to open requests without
         * limitations, they conserve resources and peer is
         * evidently real one.
         */
        if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                if (!want_cookie)
                        goto drop;
        }

This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

Unfortunately this field has been accidentally cleared
after the call to tcp_timewait_state_process() returning
TCP_TW_SYN.

Using a field in TCP_SKB_CB(skb) for a temporary state
is overkill.

Switch instead to a per-cpu variable.

As a bonus, we do not have to clear tcp_tw_isn in TCP receive
fast path.
It is temporarily set then cleared only in the TCP_TW_SYN dance.

Fixes: 4ad19de8774e ("net: tcp6: fix double call of tcp_v6_fill_cb()")
Fixes: eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()

tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
out if the request socket is created by a SYN hitting a TIMEWAIT socket.

This has been buggy for a decade, lets directly pass the information
from tcp_conn_request().

This is a preparatory patch to make the following one easier to review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'add-support-for-flower-actions-mirred-and-redirect'

Daniel Machon says:

====================
Add support for flower actions mirred and redirect

This series adds support for the two tc flower actions mirred and
redirect. Both actions are implemented by means of a port mask and a
mask mode. The mask mode controls how the mask is applied, and together
they are used by the switch to make a forwarding decision. Both actions
are configurable via the IS0 or IS2 VCAP's (ingress stage 0 and 2,
respectively).

Patch #1: adds support for tc flower mirred action.
Patch #2: adds support for tc flower redirect action.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
====================

Link: https://lore.kernel.org/r/20240405-mirror-redirect-actions-v2-0-875d4c1927c8@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sparx5: add support for tc flower redirect action

Add support for the flower redirect action. Two VCAP actions are encoded
in the rule - one for the port mask, and one for the port mask mode.
When the rule is hit, the port mask is used as the final destination
set, replacing all other port masks.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sparx5: add support for tc flower mirred action.

Add support for tc flower mirred action. Two VCAP actions are encoded in
the rule - one for the port mask, and one for the port mask mode. When
the rule is hit, the destination mask is OR'ed with the port mask.

Also add new VCAP function for supporting 72-bit wide actions, and a tc
helper for setting the port forwarding mask.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'support-icssg-based-ethernet-on-am65x-sr1-0-devices'

Diogo Ivo says:

====================
Support ICSSG-based Ethernet on AM65x SR1.0 devices

This series extends the current ICSSG-based Ethernet driver to support
AM65x Silicon Revision 1.0 devices.

Notable differences between the Silicon Revisions are that there is
no TX core in SR1.0 with this being handled by the firmware, requiring
extra DMA channels to manage communication with the firmware (with the
firmware being different as well) and in the packet classifier.

The motivation behind it is that a significant number of Siemens
devices containing SR1.0 silicon have been deployed in the field
and need to be supported and updated to newer kernel versions
without losing functionality.

This series is based on TI's 5.10 SDK [1].

The fifth version of this patch series can be found in [2].

Compared to the last version of the patch set there are only changes in
patch 05/10, where the fields of a struct are now explicitly declared as
__le32 so that we can properly interpret them.

Both of the problems mentioned in v4 have been addressed by disabling
those functionalities, meaning that this driver currently only supports
one TX queue and does not support a 100Mbit/s half-duplex connection.
The removal of these features has been commented in the appropriate
locations in the code.

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y
[2]: https://lore.kernel.org/netdev/20240326110709.26165-1-diogo.ivo@siemens.com/
====================

Link: https://lore.kernel.org/r/20240403104821.283832-1-diogo.ivo@siemens.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Add ICSSG Ethernet driver for AM65x SR1.0 platforms

Add the PRUeth driver for the ICSSG subsystem found in AM65x SR1.0 devices.
The main differences that set SR1.0 and SR2.0 apart are the missing TXPRU
core in SR1.0, two extra DMA channels for management purposes and different
firmware that needs to be configured accordingly.

Based on the work of Roger Quadros, Vignesh Raghavendra and
Grygorii Strashko in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Modify common functions for SR1.0

Some parts of the logic differ only slightly between Silicon Revisions.
In these cases add the bits that differ to a common function that
executes those bits conditionally based on the Silicon Revision.

Based on the work of Roger Quadros, Vignesh Raghavendra and
Grygorii Strashko in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Add functions to configure SR1.0 packet classifier

Add the functions to configure the SR1.0 packet classifier.

Based on the work of Roger Quadros in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Adjust the number of TX channels for SR1.0

As SR1.0 uses the current higher priority channel to send commands to
the firmware, take this into account when setting/getting the number
of channels to/from the user.

Based on the work of Roger Quadros in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Adjust IPG configuration for SR1.0

Correctly adjust the IPG based on the Silicon Revision.

Based on the work of Roger Quadros, Vignesh Raghavendra
and Grygorii Strashko in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Add SR1.0-specific description bits

Add a field to distinguish between SR1.0 and SR2.0 in the driver
as well as the necessary structures to program SR1.0.

Based on the work of Roger Quadros in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Add SR1.0-specific configuration bits

Define the firmware configuration structure and commands needed to
communicate with SR1.0 firmware, as well as SR1.0 buffer information
where it differs from SR2.0.

Based on the work of Roger Quadros, Murali Karicheri and
Grygorii Strashko in TI's 5.10 SDK [1].

[1]: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/?h=ti-linux-5.10.y

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ti: icssg-prueth: Move common functions into a separate file

In order to allow code sharing between Silicon Revisions 1.0 and 2.0
move all functions that can be shared into a common file. This commit
introduces no functional changes.

Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth: Move IPv4/IPv6 multicast address bases to their own symbols

As these addresses can be useful outside of checking if an address
is a multicast address (for example in device drivers) make them
accessible to users of etherdevice.h to avoid code duplication.

Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: Add support for AM65x SR1.0 in ICSSG

Silicon Revision 1.0 of the AM65x came with a slightly different ICSSG
support: Only 2 PRUs per slice are available and instead 2 additional
DMA channels are used for management purposes. We have no restrictions
on specified PRUs, but the DMA channels need to be adjusted.

Co-developed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: air_en8811h: fix some error codes

These error paths accidentally return "ret" which is zero/success
instead of the correct error code.

Fixes: 71e79430117d ("net: phy: air_en8811h: Add the Airoha EN8811H PHY driver")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/7ef2e230-dfb7-4a77-8973-9e5be1a99fc2@moroto.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

archnet: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts drivers/net/archnet/* from tasklet to BH workqueue.

Based on the work done by Tejun Heo <tj@kernel.org>
Branch: https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-6.10

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://lore.kernel.org/r/20240403162306.20258-1-apais@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

igc: Remove redundant runtime resume for ethtool ops

8c5ad0dae93c ("igc: Add ethtool support") added ethtool_ops.begin() and
.complete(), which used pm_runtime_get_sync() to resume suspended devices
before any ethtool_ops callback and allow suspend after it completed.

Subsequently, f32a21376573 ("ethtool: runtime-resume netdev parent before
ethtool ioctl ops") added pm_runtime_get_sync() in the dev_ethtool() path,
so the device is resumed before any ethtool_ops callback even if the driver
didn't supply a .begin() callback.

Remove the .begin() and .complete() callbacks, which are now redundant
because dev_ethtool() already resumes the device.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

igb: Remove redundant runtime resume for ethtool_ops

749ab2cd1270 ("igb: add basic runtime PM support") added
ethtool_ops.begin() and .complete(), which used pm_runtime_get_sync() to
resume suspended devices before any ethtool_ops callback and allow suspend
after it completed.

Subsequently, f32a21376573 ("ethtool: runtime-resume netdev parent before
ethtool ioctl ops") added pm_runtime_get_sync() in the dev_ethtool() path,
so the device is resumed before any ethtool_ops callback even if the driver
didn't supply a .begin() callback.

Remove the .begin() and .complete() callbacks, which are now redundant
because dev_ethtool() already resumes the device.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

e1000e: Remove redundant runtime resume for ethtool_ops

e60b22c5b7e5 ("e1000e: fix accessing to suspended device") added
ethtool_ops.begin() and .complete(), which used pm_runtime_get_sync() to
resume suspended devices before any ethtool_ops callback and allow suspend
after it completed.

3ef672ab1862 ("e1000e: ethtool unnecessarily takes device out of RPM
suspend") removed ethtool_ops.begin() and .complete() and instead did
pm_runtime_get_sync() only in the individual ethtool_ops callbacks that
access device registers.

Subsequently, f32a21376573 ("ethtool: runtime-resume netdev parent before
ethtool ioctl ops") added pm_runtime_get_sync() in the dev_ethtool() path,
so the device is resumed before *any* ethtool_ops callback, as it was
before 3ef672ab1862.

Remove most runtime resumes from ethtool_ops, which are now redundant
because the resume has already been done by dev_ethtool().  This is
essentially a revert of 3ef672ab1862 ("e1000e: ethtool unnecessarily takes
device out of RPM suspend").

There are a couple subtleties:

  - Prior to 3ef672ab1862, the device was resumed only for the duration of
    a single ethtool callback.  3ef672ab1862 changed e1000_set_phys_id() so
    the device was resumed for ETHTOOL_ID_ACTIVE and remained resumed until
    a subsequent callback for ETHTOOL_ID_INACTIVE.  Preserve that part of
    3ef672ab1862 so the device will not be runtime suspended while in the
    ETHTOOL_ID_ACTIVE state.

  - 3ef672ab1862 added "if (!pm_runtime_suspended())" in before reading the
    STATUS register in e1000_get_settings().  This was racy and is now
    unnecessary because dev_ethtool() has resumed the device already, so
    revert that.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

r8169: add support for RTL8168M

A user reported an unknown chip version. According to the r8168 vendor
driver it's called RTL8168M, but handling is identical to RTL8168H.
So let's simply treat it as RTL8168H.

Tested-by: Евгений <octobergun@gmail.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'devlink-io-eqs'

Parav Pandit says:

====================
devlink: Add port function attribute for IO EQs

Currently, PCI SFs and VFs use IO event queues to deliver netdev per
channel events. The number of netdev channels is a function of IO
event queues. In the second scenario of an RDMA device, the
completion vectors are also a function of IO event queues. Currently, an
administrator on the hypervisor has no means to provision the number
of IO event queues for the SF device or the VF device. Device/firmware
determines some arbitrary value for these IO event queues. Due to this,
the SF netdev channels are unpredictable, and consequently, the
performance is too.

This short series introduces a new port function attribute: max_io_eqs.
The goal is to provide administrators at the hypervisor level with the
ability to provision the maximum number of IO event queues for a
function. This gives the control to the administrator to provision
right number of IO event queues and have predictable performance.

Examples of when an administrator provisions (set) maximum number of
IO event queues when using switchdev mode:

  $ devlink port show pci/0000:06:00.0/1
      pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
          function:
          hw_addr 00:00:00:00:00:00 roce enable max_io_eqs 10

  $ devlink port function set pci/0000:06:00.0/1 max_io_eqs 20

  $ devlink port show pci/0000:06:00.0/1
      pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
          function:
          hw_addr 00:00:00:00:00:00 roce enable max_io_eqs 20

This sets the corresponding maximum IO event queues of the function
before it is enumerated. Thus, when the VF/SF driver reads the
capability from the device, it sees the value provisioned by the
hypervisor. The driver is then able to configure the number of channels
for the net device, as well as the number of completion vectors
for the RDMA device. The device/firmware also honors the provisioned
value, hence any VF/SF driver attempting to create IO EQs
beyond provisioned value results in an error.

With above setting now, the administrator is able to achieve the 2x
performance on SFs with 20 channels. In second example when SF was
provisioned for a container with 2 cpus, the administrator provisioned only
2 IO event queues, thereby saving device resources.

With the above settings now in place, the administrator achieved 2x
performance with the SF device with 20 channels. In the second example,
when the SF was provisioned for a container with 2 CPUs, the administrator
provisioned only 2 IO event queues, thereby saving device resources.

changelog:
v2->v3:
- limited to 80 chars per line in devlink
- fixed comments from Jakub in mlx5 driver to fix missing mutex unlock
  on error path
v1->v2:
- limited comment to 80 chars per line in header file
- fixed set function variables for reverse christmas tree
- fixed comments from Kalesh
- fixed missing kfree in get call
- returning error code for get cmd failure
- fixed error msg copy paste error in set on cmd failure
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlx5/core: Support max_io_eqs for a function

Implement get and set for the maximum IO event queues for SF and VF.
This enables administrator on the hypervisor to control the maximum
IO event queues which are typically used to derive the maximum and
default number of net device channels or rdma device completion vectors.

Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Support setting max_io_eqs

Many devices send event notifications for the IO queues,
such as tx and rx queues, through event queues.

Enable a privileged owner, such as a hypervisor PF, to set the number
of IO event queues for the VF and SF during the provisioning stage.

example:
Get maximum IO event queues of the VF device::

  $ devlink port show pci/0000:06:00.0/2
  pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
          hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

Set maximum IO event queues of the VF device::

  $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

  $ devlink port show pci/0000:06:00.0/2
  pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
          hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: display more skb fields in skb_dump()

Print these additional fields in skb_dump() to ease debugging.

- mac_len
- csum_start (in v2, at Willem suggestion)
- csum_offset (in v2, at Willem suggestion)
- priority
- mark
- alloc_cpu
- vlan_all
- encapsulation
- inner_protocol
- inner_mac_header
- inner_network_header
- inner_transport_header

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phy-cleanup-EEE'

Andrew Lunn says:

====================
net: Clean up some EEE code

Previous patches have reworked the API between phylib and MAC drivers
with respect to EEE, pushing most of the work into phylib. These two
patches rework two drivers to make use of the new API, and fix their
EEE implementation, so that EEE is configured in the MAC based on what
is actually negotiated during autoneg.

Compile tested only.
====================

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: lan743x: Fixup EEE

The enabling/disabling of EEE in the MAC should happen as a result of
auto negotiation. So move the enable/disable into
lan743x_phy_link_status_change() which gets called by phylib when
there is a change in link status.

lan743x_ethtool_set_eee() now just programs the hardware with the LTI
timer value, and passed everything else to phylib, so it can correctly
setup the PHY.

lan743x_ethtool_get_eee() relies on phylib doing most of the work, the
MAC driver just adds the LTI timer value.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: lan78xx: Fixup EEE

The enabling/disabling of EEE in the MAC should happen as a result of
auto negotiation. So move the enable/disable into
lan783xx_phy_link_status_change() which gets called by phylib when
there is a change in link status.

lan78xx_set_eee() now just programs the hardware with the LPI
timer value, and passed everything else to phylib, so it can correctly
setup the PHY.

lan743x_get_eee() relies on phylib doing most of the work, the
MAC driver just adds the LPI timer value.

Call phy_support_eee() to indicate the MAC does actually support EEE.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: add reset reason options in some places

The reason codes are handled in two ways nowadays (quoting Mat Martineau):
1. Sending in the MPTCP option on RST packets when there is no subflow
context available (these use subflow_add_reset_reason() and directly call
a TCP-level send_reset function)
2. The "normal" way via subflow->reset_reason. This will propagate to both
the outgoing reset packet and to a local path manager process via netlink
in mptcp_event_sub_closed()

RFC 8684 defines the skb reset reason behaviour which is not required
even though in some places:

    A host sends a TCP RST in order to close a subflow or reject
    an attempt to open a subflow (MP_JOIN). In order to let the
    receiving host know why a subflow is being closed or rejected,
    the TCP RST packet MAY include the MP_TCPRST option (Figure 15).
    The host MAY use this information to decide, for example, whether
    it tries to re-establish the subflow immediately, later, or never.

Since the commit dc87efdb1a5cd ("mptcp: add mptcp reset option support")
introduced this feature about three years ago, we can fully use it.
There remains some places where we could insert reason into skb as
we can see in this patch.

Many thanks to Mat and Paolo for help:)

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Set scope explicitly in ip_route_output().

Add a "scope" parameter to ip_route_output() so that callers don't have
to override the tos parameter with the RTO_ONLINK flag if they want a
local scope.

This will allow converting flowi4_tos to dscp_t in the future, thus
allowing static analysers to flag invalid interactions between
"tos" (the DSCP bits) and ECN.

Only three users ask for local scope (bonding, arp and atm). The others
continue to use RT_SCOPE_UNIVERSE. While there, add a comment to warn
users about the limitations of ip_route_output().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com> # infiniband
Signed-off-by: David S. Miller <davem@davemloft.net>

ipvlan: handle NETDEV_DOWN event

In case of stacked devices, to help propagate the down
link state from the parent/root device (to this leaf device),
handle NETDEV_DOWN event like it is done now for NETDEV_UP.

In the below example, ens5 is the host interface which is the
parent of the ipvlan interface eth0 in the container.

Host:

[root@gkn-podman-x64 ~]# ip link set ens5 down
[root@gkn-podman-x64 ~]# ip -d link show dev ens5
3: ens5: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN
      ...
[root@gkn-podman-x64 ~]#

Container:

[root@testnode-ol8 /]# ip -d link show dev eth0
2: eth0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UNKNOWN
        ...
    ipvlan mode l2 bridge
        ...
[root@testnode-ol8 /]#

eth0's state continues to show up as UP even though ens5 is now DOWN.

For macvlan the handling of NETDEV_DOWN event was added in
commit 80fd2d6ca546 ("macvlan: Change status when lower device goes down").

Reported-by: Gia-Khanh Nguyen <gia-khanh.nguyen@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_packet: avoid a false positive warning in packet_setsockopt()

Although the code is correct, the following line

copy_from_sockptr(&req_u.req, optval, len));

triggers this warning :

memcpy: detected field-spanning write (size 28) of single field "dst" at include/linux/sockptr.h:49 (size 16)

Refactor the code to be more explicit.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: handle HAS_IOPORT dependencies

In a future patch HAS_IOPORT=n will disable inb()/outb() and friends at
compile time. We thus need to add HAS_IOPORT as dependency for
those drivers requiring them. For the DEFXX driver the use of I/O
ports is optional and we only need to fence specific code paths. It also
turns out that with HAS_IOPORT handled explicitly HAMRADIO does not need
the !S390 dependency and successfully builds the bpqether driver.

Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Maciej W. Rozycki <macro@orcam.me.uk>
Co-developed-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-selftests'

Matthieu Baerts says:

====================
selftests: mptcp: cleanups and 'ip mptcp' support

Here are some patches from Geliang, doing different cleanups, and
supporting 'ip mptcp' in more MPTCP selftests.

Patch 1 checks that TC is available in selftests requiring it.

Patch 2 adds 'ms' units in TC commands, to avoid confusions.

Patches 3-9 are some prerequisites for patch 10: some export code from
mptcp_join.sh to mptcp_lib.sh, to be re-used in pm_netlink.sh,
mptcp_sockopt.sh and simult_flows.sh ; and others add helpers to
pm_netlink.sh to easily support both 'ip mptcp' and 'pm_nl_ctl' tools to
interact with the in-kernel MPTCP path-manager.

Patch 10 adds a '-i' parameter in mptcp_sockopt.sh, pm_netlink.sh, and
simult_flows.sh to use 'ip mptcp' tool instead of 'pm_nl_ctl'.

Patch 11 fixes some ShellCheck warnings in pm_netlink.sh, in order to
drop a ShellCheck's 'disable' instruction.
====================

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: netlink: drop disable=SC2086

Now there are only a few of variables are not using double quotes.
Modifying them, then "shellcheck disable=SC2086" can be dropped.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: ip_mptcp option for more scripts

This patch adds '-i' option for mptcp_sockopt.sh, pm_netlink.sh, and
simult_flows.sh, to use 'ip mptcp' command in the tests instead of
'pm_nl_ctl'. Update usage() correspondingly.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: use pm_nl endpoint ops

Use those newly added pm_nl endpoint ops helpers to replace all 'pm_nl_ctl'
commands with 'limits', 'add', 'del', 'flush', 'show' and 'set' arguments
in scripts mptcp_sockopt.sh and simult_flows.sh.

In pm_netlink.sh, add wrappers of there helpers to make the function names
shorter. Then use the wrappers to replace all 'pm_nl_ctl' commands.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: export pm_nl endpoint ops

This patch exports six endpoint operation helpers with pm_nl_ prefix,
pm_nl_set_limits(), pm_nl_add_endpoint(), pm_nl_del_endpoint(),
pm_nl_flush_endpoint(), pm_nl_show_endpoints() and pm_nl_change_endpoint()
into mptcp_lib.sh as public functions, and renamed each of them with a
mptcp_lib_ prefix. Then these old pm_nl_ prefix helpers in mptcp_join.sh
can be wrappers of mptcp_lib_ prefix ones.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: join: update endpoint ops

This patch uses 'case' statements to simplify pm_nl_add_endpoint() and
pm_nl_check_endpoint(). And simplify pm_nl_check_endpoint() with
check_output() helper. Also update pm_nl_del_endpoint() to avoid the
'double quote' shellcheck warning.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: netlink: add change_address helper

The output formats of 'ip mptcp' commands are much different from that
of 'pm_nl_ctl' commands.

A new 'change_address' helper is added here, to change the flag of an
address. This is a bit similar to mptcp_join.sh's pm_nl_change_endpoint().

Usage:
Address ID - pm_nl_change_endpoint $ns id $id $flags
IP address - change_address $ns $addr $flags

Use this new helper in pm_netlink.sh to replace all 'pm_nl_ctl set'
commands.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add {get,format}_endpoint(s) helpers

The output formats of 'ip mptcp' commands are much different from that
of 'pm_nl_ctl' commands.

This patch adds a new helper format_endpoints() to format the outputs of
'ip mptcp' and 'pm_nl_ctl' with 'endpoints' arguments to hide these
differences.

A new helper named get_endpoint() has also been added to show a specific
endpoint identified by the given address ID, similar to mptcp_join.sh's
pm_nl_show_endpoints() helper, but showing all entries.

Use these two helpers in mptcp_join.sh and pm_netlink.sh to replace all
'pm_nl_ctl get' commands and outputs of 'pm_nl_ctl dump/get'.

Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: netlink: add 'limits' helpers

The output format of 'ip mptcp limits' command is much different from
that of 'pm_nl_ctl limits' command.

This patch adds format_limits() helper to format the outputs of these
two commands to hide the difference. get_limits() has been added to show
the limits.

Use these two helpers in pm_netlink.sh to replace all 'pm_nl_ctl limits'
commands and outputs.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: export ip_mptcp to mptcp_lib

This patch exports ip_mptcp into mptcp_lib.sh as a public variable,
named MPTCP_LIB_IP_MPTCP. Add a helper mptcp_lib_set_ip_mptcp() to set
it, and a helper mptcp_lib_is_ip_mptcp() to test whether it is set. Use
these two helpers in mptcp_join.sh.

This patch is prepared for coming commits.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add ms units for tc-netem delay

'delay 1' in tc-netem is confusing, not sure if it's a delay of 1 second or
1 millisecond. This patch explicitly adds millisecond units to make these
commands clearer.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add tc check for check_tools

tc are used in some test scripts: mptcp_connect.sh, mptcp_join.sh and
simult_flows.sh. It makes sense to check if tc is installed before running
these scripts, just like other tools. So this patch add 'tc' check for
mptcp_lib_check_tools(), and check it in these test scripts.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: more struct tcp_sock adjustments

tp->recvmsg_inq is used from tcp recvmsg() thus should
be in tcp_sock_read_rx group.

tp->tcp_clock_cache and tp->tcp_mstamp are written
both in rx and tx paths, thus are better placed
in tcp_sock_write_txrx group.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: ax88179_178a: non necessary second random mac address

If the mac address can not be read from the device registers or the
devicetree, a random address is generated, but this was already done from
usbnet_probe, so it is not necessary to call eth_hw_addr_random from here
again to generate another random address.

Indeed, when reset was also executed from bind, generate another random mac
address invalidated the check from usbnet_probe to configure if the assigned
mac address for the interface was random or not, because it is comparing
with the initial generated random address. Now, with only a reset from open
operation, it is just a harmless simplification.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

pfcp: avoid copy warning by simplifing code

From Arnd comments:
"The memcpy() in the ip_tunnel_info_opts_set() causes
a string.h fortification warning, with at least gcc-13:

    In function 'fortify_memcpy_chk',
        inlined from 'ip_tunnel_info_opts_set' at include/net/ip_tunnels.h:619:3,
        inlined from 'pfcp_encap_recv' at drivers/net/pfcp.c:84:2:
    include/linux/fortify-string.h:553:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
      553 |                         __write_overflow_field(p_size_field, size);"

It is a false-positivie caused by ambiguity of the union.

However, as Arnd noticed, copying here is unescessary. The code can be
simplified to avoid calling ip_tunnel_info_opts_set(), which is doing
copying, setting flags and options_len.

Set correct flags and options_len directly on tun_info.

Fixes: 6dd514f48110 ("pfcp: always set pfcp metadata")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/netdev/701f8f93-f5fb-408b-822a-37a1d5c424ba@app.fastmail.com/
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ynl-tests'

Jakub Kicinski says:

====================
selftests: net: groundwork for YNL-based tests

Currently the options for writing networking tests are C, bash or
some mix of the two. YAML/Netlink gives us the ability to easily
interface with Netlink in higher level laguages. In particular,
there is a Python library already available in tree, under tools/net.
Add the scaffolding which allows writing tests using this library.

The "scaffolding" is needed because the library lives under
tools/net and uses YAML files from under Documentation/.
So we need a small amount of glue code to find those things
and add them to TEST_FILES.

This series adds both a basic SW sanity test and driver
test which can be run against netdevsim or a real device.
When I develop core code I usually test with netdevsim,
then a real device, and then a backport to Meta's kernel.
Because of the lack of integration, until now I had
to throw away the (YNL-based) test script and netdevsim code.

Running tests in tree directly:

$ ./tools/testing/selftests/net/nl_netdev.py
KTAP version 1
1..2
ok 1 nl_netdev.empty_check
ok 2 nl_netdev.lo_check
# Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

in tree via make:

$ make -C tools/testing/selftests/ TARGETS=net \
TEST_PROGS=nl_netdev.py TEST_GEN_PROGS="" run_tests
[ ... ]

and installed externally, all seem to work:

$ make -C tools/testing/selftests/ TARGETS=net \
install INSTALL_PATH=/tmp/ksft-net
$ /tmp/ksft-net/run_kselftest.sh -t net:nl_netdev.py
[ ... ]

For driver tests I followed the lead of net/forwarding and
get the device name from env and/or a config file.

v3:
- fix up netdevsim C
- various small nits in other patches (see changelog in patches)
v2: https://lore.kernel.org/all/20240403023426.1762996-1-kuba@kernel.org/
- don't add to TARGETS, create a deperate variable with deps
- support and use with
- support and use passing arguments to tests
v1: https://lore.kernel.org/all/20240402010520.1209517-1-kuba@kernel.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

testing: net-drv: add a driver test for stats reporting

Add a very simple test to make sure drivers report expected
stats. Drivers which implement FEC or pause configuration
should report relevant stats. Qstats must be reported,
at least packet and byte counts, and they must match
total device stats.

Tested with netdevsim, bnxt, in-tree and installed.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: drivers: add scaffolding for Netlink tests in Python

Add drivers/net as a target for mixed-use tests.
The setup is expected to work similarly to the forwarding tests.
Since we only need one interface (unlike forwarding tests)
read the target device name from NETIF. If not present we'll
try to run the test against netdevsim.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevsim: report stats by default, like a real device

Real devices should implement qstats. Devices which support
pause or FEC configuration should also report the relevant stats.

nsim was missing FEC stats completely, some of the qstats
and pause stats required toggling a debugfs knob.

Note that the tests which used pause always initialize the setting
so they shouldn't be affected by the different starting value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: nl_netdev: add a trivial Netlink netdev test

Add a trivial test using YNL.

  $ ./tools/testing/selftests/net/nl_netdev.py
  KTAP version 1
  1..2
  ok 1 nl_netdev.empty_check
  ok 2 nl_netdev.lo_check

Instantiate the family once, it takes longer than the test itself.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net: add scaffolding for Netlink tests in Python

Add glue code for accessing the YNL library which lives under
tools/net and YAML spec files from under Documentation/.
Automatically figure out if tests are run in tree or not.
Since we'll want to use this library both from net and
drivers/net test targets make the library a target as well,
and automatically include it when net or drivers/net are
included. Making net/lib a target ensures that we end up
with only one copy of it, and saves us some path guessing.

Add a tiny bit of formatting support to be able to output KTAP
from the start.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'batadv-next-pullrequest-20240405' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

- bump version strings, by Simon Wunderlich

- prefer kfree_rcu() over call_rcu() with free-only callbacks,
by Dmitry Antipov

- bypass empty buckets in batadv_purge_orig_ref(), by Eric Dumazet
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mdio-gpio: Use device_is_compatible()

Replace open coded variant of device_is_compatible().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dqs: use sysfs_emit() in favor of sprintf()

Commit 6025b9135f7a ("net: dqs: add NIC stall detector based on BQL")
added three sysfs files.

Use the recommended sysfs_emit() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip_tunnel: harden copying IP tunnel params to userspace

Structures which are about to be copied to userspace shouldn't have
uninitialized fields or paddings.
memset() the whole &ip_tunnel_parm in ip_tunnel_parm_to_user() before
filling it with the kernel data. The compilers will hopefully combine
writes to it.

Fixes: 117aef12a7b1 ("ip_tunnel: use a separate struct to store tunnel params in the kernel")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/netdev/5f63dd25-de94-4ca3-84e6-14095953db13@moroto.mountain
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: remove RTNL protection from ip6addrlbl_dump()

No longer hold RTNL while calling ip6addrlbl_dump()
("ip addrlabel show")

ip6addrlbl_dump() was already mostly relying on RCU anyway.

Add READ_ONCE()/WRITE_ONCE() annotations around
net->ipv6.ip6addrlbl_table.seq

Note that ifal_seq value is currently ignored in iproute2,
and a bit weak.

We might user later cb->seq / nl_dump_check_consistent()
protocol if needed.

Also change return value for a completed dump,
so that NLMSG_DONE can be appended to current skb,
saving one recvmsg() system call.

v2: read net->ipv6.ip6addrlbl_table.seq once, (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link:https://lore.kernel.org/netdev/67f5cb70-14a4-4455-8372-f039da2f15c2@kernel.org/
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: frags: delay fqdir_free_fn()

fqdir_free_fn() is using very expensive rcu_barrier()

When one netns is dismantled, we often call fqdir_exit()
multiple times, typically lauching fqdir_free_fn() twice.

Delaying by one second fqdir_free_fn() helps to reduce
the number of rcu_barrier() calls, and lock contention
on rcu_state.barrier_mutex.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>