git.kernel.dk Git - linux-2.6-block.git/log

net: airoha: Link the gdm port to the selected qdma controller

Link the running gdm port to the qdma controller used to connect with
the CPU. Moreover, load all QDMA controllers available on EN7581 SoC.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/95b515df34ba4727f7ae5b14a1d0462cceec84ff.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Start all qdma NAPIs in airoha_probe()

This is a preliminary patch to support multi-QDMA controllers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/b51cf69c94d8cbc81e0a0b35587f024d01e6d9c0.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Allow mapping IO region for multiple qdma controllers

Map MMIO regions of both qdma controllers available on EN7581 SoC.
Run airoha_hw_cleanup routine for both QDMA controllers available on
EN7581 SoC removing airoha_eth module or in airoha_probe error path.
This is a preliminary patch to support multi-QDMA controllers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/a734ae608da14b67ae749b375d880dbbc70868ea.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Use qdma pointer as private structure in airoha_irq_handler routine

This is a preliminary patch to support multi-QDMA controllers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/1e40c3cb973881c0eb3c3c247c78550da62054ab.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Add airoha_qdma pointer in airoha_tx_irq_queue/airoha_queue structures

Move airoha_eth pointer in airoha_qdma structure from
airoha_tx_irq_queue/airoha_queue ones. This is a preliminary patch to
introduce support for multi-QDMA controllers available on EN7581.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/074565b82fd0ceefe66e186f21133d825dbd48eb.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Move irq_mask in airoha_qdma structure

QDMA controllers have independent irq lines, so move irqmask in
airoha_qdma structure. This is a preliminary patch to support multiple
QDMA controllers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/1c8a06e8be605278a7b2f3cd8ac06e74bf5ebf2b.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Move airoha_queues in airoha_qdma

QDMA controllers available in EN7581 SoC have independent tx/rx hw queues
so move them in airoha_queues structure.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/795fc4797bffbf7f0a1351308aa9bf0e65b5126e.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Introduce airoha_qdma struct

Introduce airoha_qdma struct and move qdma IO register mapping in
airoha_qdma. This is a preliminary patch to enable both QDMA controllers
available on EN7581 SoC.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/7df163bdc72ee29c3d27a0cbf54522ffeeafe53c.1722522582.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: select DEVLINK and PAGE_POOL

Build bot reports undefined references to devlink functions.
And local testing revealed undefined references to page_pool functions.

Based on a patch by Jakub Kicinski <kuba@kernel.org>

Fixes: 1a9d48892ea5 ("eth: fbnic: Allocate core device specific structures and devlink interface")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408011219.hiPmwwAs-lkp@intel.com/
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240802-fbnic-select-v2-1-41f82a3e0178@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: ksft: print more of the stack for checks

Print more stack frames and the failing line when check fails.
This helps when tests use helpers to do the checks.

Before:

  # At ./ksft/drivers/net/hw/rss_ctx.py line 92:
  # Check failed 1037698 >= 396893.0 traffic on other queues:[344612, 462380, 233020, 449174, 342298]
  not ok 8 rss_ctx.test_rss_context_queue_reconfigure

After:

  # Check| At ./ksft/drivers/net/hw/rss_ctx.py, line 387, in test_rss_context_queue_reconfigure:
  # Check|     test_rss_queue_reconfigure(cfg, main_ctx=False)
  # Check| At ./ksft/drivers/net/hw/rss_ctx.py, line 230, in test_rss_queue_reconfigure:
  # Check|     _send_traffic_check(cfg, port, ctx_ref, { 'target': (0, 3),
  # Check| At ./ksft/drivers/net/hw/rss_ctx.py, line 92, in _send_traffic_check:
  # Check|     ksft_lt(sum(cnts[i] for i in params['noise']), directed / 2,
  # Check failed 1045235 >= 405823.5 traffic on other queues (context 1)':[460068, 351995, 565970, 351579, 127270]
  not ok 8 rss_ctx.test_rss_context_queue_reconfigure

Link: https://patch.msgid.link/20240801232317.545577-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: ksft: replace 95 with errno.EOPNOTSUPP

Petr suggested to use errno.EOPNOTSUPP instead of hard-coded 95
in the new test case. Adjust existing ones to match this style.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20240802000309.2368-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: ksft: support marking tests as disruptive

Add new @ksft_disruptive decorator to mark the tests that might
be disruptive to the system. Depending on how well the previous
test works in the CI we might want to disable disruptive tests
by default and only let the developers run them manually.

KSFT framework runs disruptive tests by default. DISRUPTIVE=False
environment (or config file) can be used to disable these tests.
ksft_setup should be called by the test cases that want to use
new decorator (ksft_setup is only called via NetDrvEnv/NetDrvEpEnv for now).

In the future we can add similar decorators to, for example, avoid
running slow tests all the time. And/or have some option to run
only 'fast' tests for some sort of smoke test scenario.

  $ DISRUPTIVE=False ./stats.py
  KTAP version 1
  1..5
  ok 1 stats.check_pause
  ok 2 stats.check_fec
  ok 3 stats.pkt_byte_sum
  ok 4 stats.qstat_by_ifindex
  ok 5 stats.check_down # SKIP marked as disruptive
  # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:1 error:0

v3:
- parse yes and properly treat non-zero nums as true (Petr)

v2:
- convert from cli argument to env variable (Jakub)

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20240802000309.2368-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net-drv: exercise queue stats when the device is down

Verify that total device stats don't decrease after it has been turned down.
Also make sure the device doesn't crash when we access per-queue stats
when it's down (in case it tries to access some pointers that are NULL).

  KTAP version 1
  1..5
  ok 1 stats.check_pause
  ok 2 stats.check_fec
  ok 3 stats.pkt_byte_sum
  ok 4 stats.qstat_by_ifindex
  ok 5 stats.check_down
  # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0

v3:
- use errno.EOPNOTSUPP (Petr)
- move qstat[0] under try (Petr)

v2:
- KTAP output formatting (Jakub)
- defer instead of try/finally (Jakub)
- disappearing stats is an error (Jakub)
- ksft_ge instead of open coding (Jakub)

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20240802000309.2368-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove IFF_* re-definition

We re-define values of enum netdev_priv_flags as preprocessor
macros with the same name. I guess this was done to avoid breaking
out of tree modules which may use #ifdef X for kernel compatibility?
Commit 7aa98047df95 ("net: move net_device priv_flags out from UAPI")
which added the enum doesn't say. In any case, the flags with defines
are quite old now, and defines for new flags don't get added.
OOT drivers have to resort to code greps for compat detection, anyway.
Let's delete these defines, save LoC, help LXR link to the right place.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20240801163401.378723-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'axienet-coding-style' into main

Radhey Shyam Pandey says:

====================
net: axienet: Fix coding style issues

This patchset replace all occurences of (1<<x) by BIT(x) to get rid
of checkpatch.pl "CHECK" output "Prefer using the BIT macro".

It also removes unnecessary ftrace-like logging, add missing blank line
after declaration and remove unnecessary parentheses around 'ndev->mtu
<= XAE_JUMBO_MTU' and 'ndev->mtu > XAE_MTU'.

Changes for v2:
- Split each coding style change into separate patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: axienet: remove unnecessary parentheses

Remove unnecessary parentheses around 'ndev->mtu
<= XAE_JUMBO_MTU' and 'ndev->mtu > XAE_MTU'. Reported
by checkpatch.

CHECK: Unnecessary parentheses around 'ndev->mtu > XAE_MTU'
+       if ((ndev->mtu > XAE_MTU) &&
+           (ndev->mtu <= XAE_JUMBO_MTU)) {

CHECK: Unnecessary parentheses around 'ndev->mtu <= XAE_JUMBO_MTU'
+       if ((ndev->mtu > XAE_MTU) &&
+           (ndev->mtu <= XAE_JUMBO_MTU)) {

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: axienet: remove unnecessary ftrace-like logging

remove unnecessary ftrace-like logging. Also fixes below
checkpatch WARNING.

WARNING: Unnecessary ftrace-like logging - prefer using ftrace
+ dev_dbg(&ndev->dev, "%s\n", __func__);

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: axienet: add missing blank line after declaration

Add missing blank line after declaration. Fixes below
checkpatch warnings.

WARNING: Missing a blank line after declarations
+       struct sockaddr *addr = p;
+       axienet_set_mac_address(ndev, addr->sa_data);

WARNING: Missing a blank line after declarations
+       struct axienet_local *lp = netdev_priv(ndev);
+       disable_irq(lp->tx_irq);

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: axienet: Replace the occurrences of (1<<x) by BIT(x)

Replace all occurences of (1<<x) by BIT(x) to get rid of checkpatch.pl
"CHECK" output "Prefer using the BIT macro".

Signed-off-by: Appana Durga Kedareswara Rao <appana.durga.rao@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vsock-virtio' into main

Luigi Leonardi says:

====================
vsock: avoid queuing on intermediate queue if possible

This series introduces an optimization for vsock/virtio to reduce latency
and increase the throughput: When the guest sends a packet to the host,
and the intermediate queue (send_pkt_queue) is empty, if there is enough
space, the packet is put directly in the virtqueue.

v3->v4
While running experiments on fio with 64B payload, I realized that there
was a mistake in my fio configuration, so I re-ran all the experiments
and now the latency numbers are indeed lower with the patch applied.
I also noticed that I was kicking the host without the lock.

- Fixed a configuration mistake on fio and re-ran all experiments.
- Fio latency measurement using 64B payload.
- virtio_transport_send_skb_fast_path sends kick with the tx_lock acquired
- Addressed all minor style changes requested by maintainer.
- Rebased on latest net-next
- Link to v3: https://lore.kernel.org/r/20240711-pinna-v3-0-697d4164fe80@outlook.com

v2->v3
- Performed more experiments using iperf3 using multiple streams
- Handling of reply packets removed from virtio_transport_send_skb,
  as is needed just by the worker.
- Removed atomic_inc/atomic_sub when queuing directly to the vq.
- Introduced virtio_transport_send_skb_fast_path that handles the
  steps for sending on the vq.
- Fixed a missing mutex_unlock in error path.
- Changed authorship of the second commit
- Rebased on latest net-next

v1->v2
In this v2 I replaced a mutex_lock with a mutex_trylock because it was
insidea RCU critical section. I also added a check on tx_run, so if the
module is being removed the packet is not queued. I'd like to thank Stefano
for reporting the tx_run issue.

Applied all Stefano's suggestions:
    - Minor code style changes
    - Minor commit text rewrite
Performed more experiments:
     - Check if all the packets go directly to the vq (Matias' suggestion)
     - Used iperf3 to see if there is any improvement in overall throughput
      from guest to host
     - Pinned the vhost process to a pCPU.
     - Run fio using 512B payload
Rebased on latest net-next
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

test/vsock: add ioctl unsent bytes test

Introduce two tests, one for SOCK_STREAM and one for SOCK_SEQPACKET,
which use SIOCOUTQ ioctl to check that the number of unsent bytes is
zero after delivering a packet.

vsock_connect and vsock_accept are no longer static: this is to
create more generic tests, allowing code to be reused for SEQPACKET
and STREAM.

Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock/virtio: add SIOCOUTQ support for all virtio based transports

Introduce support for virtio_transport_unsent_bytes
ioctl for virtio_transport, vhost_vsock and vsock_loopback.

For all transports the unsent bytes counter is incremented
in virtio_transport_get_credit.

In virtio_transport (G2H) and in vhost-vsock (H2G) the counter
is decremented when the skbuff is consumed. In vsock_loopback the
same skbuff is passed from the transmitter to the receiver, so
the counter is decremented before queuing the skbuff to the
receiver.

Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vsock: add support for SIOCOUTQ ioctl

Add support for ioctl(s) in AF_VSOCK.
The only ioctl available is SIOCOUTQ/TIOCOUTQ, which returns the number
of unsent bytes in the socket. This information is transport-specific
and is delegated to them using a callback.

Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Use of_property_read_bool()

Use of_property_read_bool() to read boolean properties rather than
of_find_property(). This is part of a larger effort to remove callers
of of_find_property() and similar functions. of_find_property() leaks
the DT struct property and data pointers which is a problem for
dynamically allocated nodes which may be freed.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Link: https://patch.msgid.link/20240731191601.1714639-2-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: Use of_property_count_u32_elems() to get property length

Replace of_get_property() with the type specific
of_property_count_u32_elems() to get the property length.

This is part of a larger effort to remove callers of of_get_property()
and similar functions. of_get_property() leaks the DT property data
pointer which is a problem for dynamically allocated nodes which may
be freed.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240731201514.1839974-2-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: qca807x: Drop unnecessary and broken DT validation

The check for "leds" and "gpio-controller" both being present is never
true because "leds" is a node, not a property. This could be fixed
with a check for child node, but there's really no need for the kernel
to validate a DT. Just continue ignoring the LEDs if GPIOs are present.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240731201703.1842022-2-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mctp: Consistent peer address handling in ioctl tag allocation

When executing ioctl to allocate tags, if the peer address is 0,
mctp_alloc_local_tag now replaces it with 0xff. However, during tag
dropping, this replacement is not performed, potentially causing the key
not to be dropped as expected.

Signed-off-by: John Wang <wangzhiqiang02@ieisystem.com>
Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20240730084636.184140-1-wangzhiqiang02@ieisystem.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Link: https://patch.msgid.link/20240801131917.34494-1-pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.11-rc2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from wireless, bleutooth, BPF and netfilter.

  Current release - regressions:

   - core: drop bad gso csum_start and offset in virtio_net_hdr

   - wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove

   - eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()

   - phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and
     aqr115c

  Current release - new code bugs:

   - smc: prevent UAF in inet_create()

   - bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend

   - eth: bnxt: reject unsupported hash functions

  Previous releases - regressions:

   - sched: act_ct: take care of padding in struct zones_ht_key

   - netfilter: fix null-ptr-deref in iptable_nat_table_init().

   - tcp: adjust clamping window for applications specifying SO_RCVBUF

  Previous releases - always broken:

   - ethtool: rss: small fixes to spec and GET

   - mptcp:
      - fix signal endpoint re-add
      - pm: fix backup support in signal endpoints

   - wifi: ath12k: fix soft lockup on suspend

   - eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()

   - eth: ice: fix AF_XDP ZC timeout and concurrency issues

   - eth: mlx5:
      - fix missing lock on sync reset reload
      - fix error handling in irq_pool_request_irq"

* tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  mptcp: fix duplicate data handling
  mptcp: fix bad RCVPRUNED mib accounting
  ipv6: fix ndisc_is_useropt() handling for PIO
  igc: Fix double reset adapter triggered from a single taprio cmd
  net: MAINTAINERS: Demote Qualcomm IPA to "maintained"
  net: wan: fsl_qmc_hdlc: Discard received CRC
  net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex
  net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys
  net/mlx5e: Fix CT entry update leaks of modify header context
  net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability
  net/mlx5: Fix missing lock on sync reset reload
  net/mlx5: Lag, don't use the hardcoded value of the first port
  net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule
  net/mlx5: Fix error handling in irq_pool_request_irq
  net/mlx5: Always drain health in shutdown callback
  net: Add skbuff.h to MAINTAINERS
  r8169: don't increment tx_dropped in case of NETDEV_TX_BUSY
  netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
  netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
  net: drop bad gso csum_start and offset in virtio_net_hdr
  ...

ethtool: Don't check for NULL info in prepare_data callbacks

Since commit f946270d05c2 ("ethtool: netlink: always pass genl_info to
.prepare_data") the info argument of prepare_data callbacks is never
NULL. Remove checks present in callback implementations.

Link: https://lore.kernel.org/netdev/20240703121237.3f8b9125@kernel.org/
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240731-prepare_data-null-check-v1-1-627f2320678f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

RDS: IB: Remove unused declarations

Commit f4f943c958a2 ("RDS: IB: ack more receive completions to improve performance")
removed rds_ib_recv_tasklet_fn() implementation but not the declaration.
And commit ec16227e1414 ("RDS/IB: Infiniband transport") declared but never implemented
other functions.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20240731063630.3592046-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_eth_soc: drop clocks unused by Ethernet driver

Clocks for SerDes and PHY are going to be handled by standalone drivers
for each of those hardware components. Drop them from the Ethernet driver.

The clocks which are being removed for this patch are responsible for
the for the SerDes PCS and PHYs used for the 2nd and 3rd MAC which are
anyway not yet supported. Hence backwards compatibility is not an issue.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://patch.msgid.link/b5faaf69b5c6e3e155c64af03706c3c423c6a1c9.1722335682.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Reclaim max 50K pages at once

In non FLR context, at times CX-5 requests release of ~8 million FW pages.
This needs humongous number of cmd mailboxes, which to be released once
the pages are reclaimed. Release of humongous number of cmd mailboxes is
consuming cpu time running into many seconds. Which with non preemptible
kernels is leading to critical process starving on that cpu’s RQ.
On top of it, the FW does not use all the mailbox messages as it has a
limit of releasing 50K pages at once per MLX5_CMD_OP_MANAGE_PAGES +
MLX5_PAGES_TAKE device command. Hence, the allocation of these many
mailboxes is extra and adds unnecessary overhead.
To alleviate this, this change restricts the total number of pages
a worker will try to reclaim to maximum 50K pages in one go.

Our tests have shown significant benefit of this change in terms of
time consumed by dma_pool_free().
During a test where an event was raised by HCA
to release 1.3 Million pages, following observations were made:

- Without this change:
Number of mailbox messages allocated was around 20K, to accommodate
the DMA addresses of 1.3 million pages.
The average time spent by dma_pool_free() to free the DMA pool is between
16 usec to 32 usec.
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        287
            1024 |@@@                                      1332
            2048 |@                                        656
            4096 |@@@@@                                    2599
            8192 |@@@@@@@@@@                               4755
           16384 |@@@@@@@@@@@@@@@                          7545
           32768 |@@@@@                                    2501
           65536 |                                         0

- With this change:
Number of mailbox messages allocated was around 800; this was to
accommodate DMA addresses of only 50K pages.
The average time spent by dma_pool_free() to free the DMA pool in this case
lies between 1 usec to 2 usec.
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@                       346
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   435
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         0

Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20240730073634.114407-1-anand.a.khoje@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-fix-duplicate-data-handling'

Matthieu Baerts says:

====================
mptcp: fix duplicate data handling

In some cases, the subflow-level's copied_seq counter was incorrectly
increased, leading to an unexpected subflow reset.

Patch 1/2 fixes the RCVPRUNED MIB counter that was attached to the wrong
event since its introduction in v5.14, backported to v5.11.

Patch 2/2 fixes the copied_seq counter issues, is present since v5.10.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
====================

Link: https://patch.msgid.link/20240731-upstream-net-20240731-mptcp-dup-data-v1-0-bde833fa628a@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: fix duplicate data handling

When a subflow receives and discards duplicate data, the mptcp
stack assumes that the consumed offset inside the current skb is
zero.

With multiple subflows receiving data simultaneously such assertion
does not held true. As a result the subflow-level copied_seq will
be incorrectly increased and later on the same subflow will observe
a bad mapping, leading to subflow reset.

Address the issue taking into account the skb consumed offset in
mptcp_subflow_discard_data().

Fixes: 04e4cd4f7ca4 ("mptcp: cleanup mptcp_subflow_discard_data()")
Cc: stable@vger.kernel.org
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/501
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: fix bad RCVPRUNED mib accounting

Since its introduction, the mentioned MIB accounted for the wrong
event: wake-up being skipped as not-needed on some edge condition
instead of incoming skb being dropped after landing in the (subflow)
receive queue.

Move the increment in the correct location.

Fixes: ce599c516386 ("mptcp: properly account bulk freed memory")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'nf-24-07-31' of git://git./linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Fix a possible null-ptr-deref sometimes triggered by iptables-restore at
boot time. Register iptables {ipv4,ipv6} nat table pernet in first place
to fix this issue. Patch #1 and #2 from Kuniyuki Iwashima.

netfilter pull request 24-07-31

* tag 'nf-24-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
====================

Link: https://patch.msgid.link/20240731213046.6194-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: fix ndisc_is_useropt() handling for PIO

The current logic only works if the PIO is between two
other ND user options. This fixes it so that the PIO
can also be either before or after other ND user options
(for example the first or last option in the RA).

side note: there's actually Android tests verifying
a portion of the old broken behaviour, so:
https://android-review.googlesource.com/c/kernel/tests/+/3196704
fixes those up.

Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Patrick Rohr <prohr@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Fixes: 048c796beb6e ("ipv6: adjust ndisc_is_useropt() to also return true for PIO")
Link: https://patch.msgid.link/20240730001748.147636-1-maze@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: skbuff: Skip early return in skb_unref when debugging

This patch modifies the skb_unref function to skip the early return
optimization when CONFIG_DEBUG_NET is enabled. The change ensures that
the reference count decrement always occurs in debug builds, allowing
for more thorough checking of SKB reference counting.

Previously, when the SKB's reference count was 1 and CONFIG_DEBUG_NET
was not set, the function would return early after a memory barrier
(smp_rmb()) without decrementing the reference count. This optimization
assumes it's safe to proceed with freeing the SKB without the overhead
of an atomic decrement from 1 to 0.

With this change:
- In non-debug builds (CONFIG_DEBUG_NET not set), behavior remains
  unchanged, preserving the performance optimization.
- In debug builds (CONFIG_DEBUG_NET set), the reference count is always
  decremented, even when it's 1, allowing for consistent behavior and
  potentially catching subtle SKB management bugs.

This modification enhances debugging capabilities for networking code
without impacting performance in production kernels. It helps kernel
developers identify and diagnose issues related to SKB management and
reference counting in the network stack.

Cc: Chris Mason <clm@fb.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240729104741.370327-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'ethernet-convert-from-tasklet-to-bh-workqueue'

Allen Pais says:

====================
ethernet: Convert from tasklet to BH workqueue [part]

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts a few drivers in drivers/ethernet/* from tasklet
to BH workqueue. The next set will be sent out after the next -rc is
out.

v2: https://lore.kernel.org/20240621183947.4105278-1-allen.lkml@gmail.com
v1: https://lore.kernel.org/20240507190111.16710-2-apais@linux.microsoft.com
====================

Link: https://patch.msgid.link/20240730183403.4176544-1-allen.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: macb: Convert tasklet API to new bottom half workqueue mechanism

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the macb driver. This transition ensures compatibility
with the latest design and enhances performance.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-5-allen.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cnic: Convert tasklet API to new bottom half workqueue mechanism

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the cnic driver. This transition ensures compatibility
with the latest design and enhances performance.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-4-allen.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xgbe: Convert tasklet API to new bottom half workqueue mechanism

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the xgbe driver. This transition ensures compatibility
with the latest design and enhances performance.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-3-allen.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: alteon: Convert tasklet API to new bottom half workqueue mechanism

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the alteon driver. This transition ensures compatibility
with the latest design and enhances performance.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-2-allen.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

igc: Fix double reset adapter triggered from a single taprio cmd

Following the implementation of "igc: Add TransmissionOverrun counter"
patch, when a taprio command is triggered by user, igc processes two
commands: TAPRIO_CMD_REPLACE followed by TAPRIO_CMD_STATS. However, both
commands unconditionally pass through igc_tsn_offload_apply() which
evaluates and triggers reset adapter. The double reset causes issues in
the calculation of adapter->qbv_count in igc.

TAPRIO_CMD_REPLACE command is expected to reset the adapter since it
activates qbv. It's unexpected for TAPRIO_CMD_STATS to do the same
because it doesn't configure any driver-specific TSN settings. So, the
evaluation in igc_tsn_offload_apply() isn't needed for TAPRIO_CMD_STATS.

To address this, commands parsing are relocated to
igc_tsn_enable_qbv_scheduling(). Commands that don't require an adapter
reset will exit after processing, thus avoiding igc_tsn_offload_apply().

Fixes: d3750076d464 ("igc: Add TransmissionOverrun counter")
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20240730173304.865479-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-core_thermal-small-cleanups'

Petr Machata says:

====================
mlxsw: core_thermal: Small cleanups

Ido Schimmel says:

Clean up various issues which I noticed while addressing feedback on a
different patchset.
====================

Link: https://patch.msgid.link/cover.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Fix -Wformat-truncation warning

The name of a thermal zone device cannot be longer than 19 characters
('THERMAL_NAME_LENGTH - 1'). The format string 'mlxsw-lc%d-module%d' can
exceed this limitation if the maximum number of line cards cannot be
represented using a single digit and the maximum number of transceiver
modules cannot be represented using two digits.

This is not the case with current systems nor future ones. Therefore,
increase the size of the result buffer beyond 'THERMAL_NAME_LENGTH' and
suppress the following build warning [1].

If this limitation is ever exceeded, we will know about it since the
thermal core validates the thermal device's name during registration.

[1]
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c: In function ‘mlxsw_thermal_modules_init.part.0’:
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:70: error: ‘%d’ directive output may be truncated writing between 1 and 3 bytes into a region of size between 2 and 4 [-Werror=format-truncation=]
  418 |                 snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d",
      |                                                                      ^~
In function ‘mlxsw_thermal_module_tz_init’,
    inlined from ‘mlxsw_thermal_module_init’ at drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:465:9,
    inlined from ‘mlxsw_thermal_modules_init.part.0’ at drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:500:9:
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:52: note: directive argument in the range [1, 256]
  418 |                 snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d",
      |                                                    ^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:17: note: ‘snprintf’ output between 18 and 22 bytes into a destination of size 20
  418 |                 snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d",
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  419 |                          module_tz->slot_index, module_tz->module + 1);
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/583a70c6dbe75e6bf0c2c58abbb3470a860d2dc3.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Remove unnecessary assignments

Setting both pointers to NULL is unnecessary since the code never checks
whether these pointers are NULL or not. Remove the assignments.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/ea646f5d7642fffd5393fa23650660ab8f77a511.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Remove unnecessary checks

mlxsw_thermal_module_fini() cannot be invoked with a thermal module
which is NULL or which is not associated with a thermal zone, so remove
these checks.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/8db5fe0a3a28ba09a15d4102cc03f7e8ca7675be.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Simplify rollback

During rollback, instead of calling mlxsw_thermal_module_fini() for all
the modules, only call it for modules that were successfully
initialized. This is not a bug fix since mlxsw_thermal_module_fini()
first checks that the module was initialized.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/905bebc45f6e246031f0c5c177bba8efe11e05f5.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Make mlxsw_thermal_module_{init, fini} symmetric

mlxsw_thermal_module_fini() de-initializes the module's thermal zone,
but mlxsw_thermal_module_init() does not initialize it. Make both
functions symmetric by moving the initialization of the module's thermal
zone to mlxsw_thermal_module_init().

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/a661ad468f8ad0d7d533d8334e4abf61dfe34342.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Remove unused arguments

'dev' and 'core' arguments are not used by mlxsw_thermal_module_init().
Remove them.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/563fc7383f61809a306b9954872219eaaf3c689b.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Fold two loops into one

There is no need to traverse the same array twice. Do it once by folding
both loops into one.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/81756744ed532aaa9249a83fc08757accfe8b07c.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Remove another unnecessary check

mlxsw_thermal_modules_init() allocates an array of modules and then
initializes each entry by calling mlxsw_thermal_module_init() which
among other things initializes the 'parent' pointer of the entry.

mlxsw_thermal_modules_init() then traverses over the array again, but
skips over entries that do not have their 'parent' pointer set which is
impossible given the above.

Therefore, remove the unnecessary check.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/fb3e8ded422a441436431d5785b900f11ffc9621.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Remove unnecessary check

mlxsw_thermal_modules_init() allocates an array of modules and then
calls mlxsw_thermal_module_init() to initialize each entry in the array.
It is therefore impossible for mlxsw_thermal_module_init() to encounter
an entry that is already initialized and has its 'parent' pointer set.

Remove the unnecessary check.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: core_thermal: Call thermal_zone_device_unregister() unconditionally

The function returns immediately if the thermal zone pointer is NULL so
there is no need to check it before calling the function.

Remove the check.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/0bd251aa8ce03d3c951983aa6b4300d8205b88a7.1722345311.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: MAINTAINERS: Demote Qualcomm IPA to "maintained"

To the best of my knowledge, Alex Elder is not being paid to support
Qualcomm IPA networking drivers, so drop the status from "supported" to
"maintained".

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Alex Elder <elder@kernel.org>
Link: https://patch.msgid.link/20240730104016.22103-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wan: fsl_qmc_hdlc: Discard received CRC

Received frame from QMC contains the CRC.
Upper layers don't need this CRC and tcpdump mentioned trailing junk
data due to this CRC presence.

As some other HDLC driver, simply discard this CRC.

Fixes: d0f2258e79fd ("net: wan: Add support for QMC HDLC")
Cc: stable@vger.kernel.org
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240730063133.179598-1-herve.codina@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex

The carrier_lock spinlock protects the carrier detection. While it is
held, framer_get_status() is called which in turn takes a mutex.
This is not correct and can lead to a deadlock.

A run with PROVE_LOCKING enabled detected the issue:
  [ BUG: Invalid wait context ]
  ...
  c204ddbc (&framer->mutex){+.+.}-{3:3}, at: framer_get_status+0x40/0x78
  other info that might help us debug this:
  context-{4:4}
  2 locks held by ifconfig/146:
  #0: c0926a38 (rtnl_mutex){+.+.}-{3:3}, at: devinet_ioctl+0x12c/0x664
  #1: c2006a40 (&qmc_hdlc->carrier_lock){....}-{2:2}, at: qmc_hdlc_framer_set_carrier+0x30/0x98

Avoid the spinlock usage and convert carrier_lock to a mutex.

Fixes: 54762918ca85 ("net: wan: fsl_qmc_hdlc: Add framer support")
Cc: stable@vger.kernel.org
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240730063104.179553-1-herve.codina@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlx5-misc-fixes-2024-07-30'

Tariq Toukan says:

====================
mlx5 misc fixes 2024-07-30

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/20240730061638.1831002-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys

Since the documentation for mlx5_toggle_port_link states that it should
only be used after setting the port register, we add a check for the
return value from mlx5_port_set_eth_ptys to ensure the register was
successfully set before calling it.

Fixes: 667daedaecd1 ("net/mlx5e: Toggle link only after modifying port parameters")
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Fix CT entry update leaks of modify header context

The cited commit allocates a new modify header to replace the old
one when updating CT entry. But if failed to allocate a new one, eg.
exceed the max number firmware can support, modify header will be
an error pointer that will trigger a panic when deallocating it. And
the old modify header point is copied to old attr. When the old
attr is freed, the old modify header is lost.

Fix it by restoring the old attr to attr when failed to allocate a
new modify header context. So when the CT entry is freed, the right
modify header context will be freed. And the panic of accessing
error pointer is also fixed.

Fixes: 94ceffb48eac ("net/mlx5e: Implement CT entry update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability

Require mlx5 classifier action support when creating IPSec chains in
offload path. MLX5_IPSEC_CAP_PRIO should only be set if CONFIG_MLX5_CLS_ACT
is enabled. If CONFIG_MLX5_CLS_ACT=n and MLX5_IPSEC_CAP_PRIO is set,
configuring IPsec offload will fail due to the mlxx5 ipsec chain rules
failing to be created due to lack of classifier action support.

Fixes: fa5aa2f89073 ("net/mlx5e: Use chains for IPsec policy priority offload")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Fix missing lock on sync reset reload

On sync reset reload work, when remote host updates devlink on reload
actions performed on that host, it misses taking devlink lock before
calling devlink_remote_reload_actions_performed() which results in
triggering lock assert like the following:

WARNING: CPU: 4 PID: 1164 at net/devlink/core.c:261 devl_assert_locked+0x3e/0x50
…
CPU: 4 PID: 1164 Comm: kworker/u96:6 Tainted: G S      W          6.10.0-rc2+ #116
Hardware name: Supermicro SYS-2028TP-DECTR/X10DRT-PT, BIOS 2.0 12/18/2015
Workqueue: mlx5_fw_reset_events mlx5_sync_reset_reload_work [mlx5_core]
RIP: 0010:devl_assert_locked+0x3e/0x50
…
Call Trace:
  <TASK>
  ? __warn+0xa4/0x210
  ? devl_assert_locked+0x3e/0x50
  ? report_bug+0x160/0x280
  ? handle_bug+0x3f/0x80
  ? exc_invalid_op+0x17/0x40
  ? asm_exc_invalid_op+0x1a/0x20
  ? devl_assert_locked+0x3e/0x50
  devlink_notify+0x88/0x2b0
  ? mlx5_attach_device+0x20c/0x230 [mlx5_core]
  ? __pfx_devlink_notify+0x10/0x10
  ? process_one_work+0x4b6/0xbb0
  process_one_work+0x4b6/0xbb0
[…]

Fixes: 84a433a40d0e ("net/mlx5: Lock mlx5 devlink reload callbacks")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Lag, don't use the hardcoded value of the first port

The cited commit didn't change the body of the loop as it should.
It shouldn't be using MLX5_LAG_P1.

Fixes: 7e978e7714d6 ("net/mlx5: Lag, use actual number of lag ports")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule

This patch reduces the size of hw_ste_arr_optimized array that is
allocated on stack from 640 bytes (5 match STEs + 5 action STES)
to 448 bytes (2 match STEs + 5 action STES).
This fixes the 'stack guard page was hit' issue, while still fitting
majority of the usecases (up to 2 match STEs).

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Fix error handling in irq_pool_request_irq

In case mlx5_irq_alloc fails, the previously allocated index remains
in the XArray, which could lead to inconsistencies.

Fix it by adding error handling that erases the allocated index
from the XArray if mlx5_irq_alloc returns an error.

Fixes: c36326d38d93 ("net/mlx5: Round-Robin EQs over IRQs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Always drain health in shutdown callback

There is no point in recovery during device shutdown. if health
work started need to wait for it to avoid races and NULL pointer
access.

Hence, drain health WQ on shutdown callback.

Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver")
Fixes: d2aa060d40fa ("net/mlx5: Cancel health poll before sending panic teardown command")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/20240730061638.1831002-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Add skbuff.h to MAINTAINERS

The network maintainers need to be copied if the skbuff.h is touched.

This also helps git-send-email to figure out the proper maintainers when
touching the file.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240730161404.2028175-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: don't increment tx_dropped in case of NETDEV_TX_BUSY

The skb isn't consumed in case of NETDEV_TX_BUSY, therefore don't
increment the tx_dropped counter.

Fixes: 188f4af04618 ("r8169: use NETDEV_TX_{BUSY/OK}")
Cc: stable@vger.kernel.org
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/bbba9c48-8bac-4932-9aa1-d2ed63bc9433@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-07-31

We've added 2 non-merge commits during the last 2 day(s) which contain
a total of 2 files changed, 2 insertions(+), 2 deletions(-).

The main changes are:

1) Fix BPF selftest build after tree sync with regards to a _GNU_SOURCE
   macro redefined compilation error, from Stanislav Fomichev.

2) Fix a wrong test in the ASSERT_OK() check in uprobe_syscall BPF selftest,
   from Jiri Olsa.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf/selftests: Fix ASSERT_OK condition check in uprobe_syscall test
  selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp
====================

Link: https://patch.msgid.link/20240731115706.19677-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx4: Add support for EEPROM high pages query for QSFP/QSFP+/QSFP28

Enable reading additional EEPROM information from high pages such as
thresholds and alarms on QSFP/QSFP+/QSFP28 modules.

"This is similar to commit a708fb7b1f8d ("net/mlx5e: ethtool, Add
support for EEPROM high pages query") but given all the required logic
already exists in mlx4_qsfp_eeprom_params_set() only s/_LEN/MAX_LEN/ is
needed.

Tested-by: Dan Merillat <git@dan.merillat.org>
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/b17c5336-6dc3-41f2-afa6-f9e79231f224@ans.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().

ip6table_nat_table_init() accesses net->gen->ptr[ip6table_nat_net_ops.id],
but the function is exposed to user space before the entry is allocated
via register_pernet_subsys().

Let's call register_pernet_subsys() before xt_register_template().

Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().

We had a report that iptables-restore sometimes triggered null-ptr-deref
at boot time. [0]

The problem is that iptable_nat_table_init() is exposed to user space
before the kernel fully initialises netns.

In the small race window, a user could call iptable_nat_table_init()
that accesses net_generic(net, iptable_nat_net_id), which is available
only after registering iptable_nat_net_ops.

Let's call register_pernet_subsys() before xt_register_template().

[0]:
bpfilter: Loaded bpfilter_umh pid 11702
Started bpfilter
BUG: kernel NULL pointer dereference, address: 0000000000000013
PF: supervisor write access in kernel mode
PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
PREEMPT SMP NOPTI
CPU: 2 PID: 11879 Comm: iptables-restor Not tainted 6.1.92-99.174.amzn2023.x86_64 #1
Hardware name: Amazon EC2 c6i.4xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:iptable_nat_table_init (net/ipv4/netfilter/iptable_nat.c:87 net/ipv4/netfilter/iptable_nat.c:121) iptable_nat
Code: 10 4c 89 f6 48 89 ef e8 0b 19 bb ff 41 89 c4 85 c0 75 38 41 83 c7 01 49 83 c6 28 41 83 ff 04 75 dc 48 8b 44 24 08 48 8b 0c 24 <48> 89 08 4c 89 ef e8 a2 3b a2 cf 48 83 c4 10 44 89 e0 5b 5d 41 5c
RSP: 0018:ffffbef902843cd0 EFLAGS: 00010246
RAX: 0000000000000013 RBX: ffff9f4b052caa20 RCX: ffff9f4b20988d80
RDX: 0000000000000000 RSI: 0000000000000064 RDI: ffffffffc04201c0
RBP: ffff9f4b29394000 R08: ffff9f4b07f77258 R09: ffff9f4b07f77240
R10: 0000000000000000 R11: ffff9f4b09635388 R12: 0000000000000000
R13: ffff9f4b1a3c6c00 R14: ffff9f4b20988e20 R15: 0000000000000004
FS: 00007f6284340000(0000) GS:ffff9f51fe280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000013 CR3: 00000001d10a6005 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259)
? xt_find_table_lock (net/netfilter/x_tables.c:1259)
? __die_body.cold (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:420)
? page_fault_oops (arch/x86/mm/fault.c:727)
? exc_page_fault (./arch/x86/include/asm/irqflags.h:40 ./arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1470 arch/x86/mm/fault.c:1518)
? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:570)
? iptable_nat_table_init (net/ipv4/netfilter/iptable_nat.c:87 net/ipv4/netfilter/iptable_nat.c:121) iptable_nat
xt_find_table_lock (net/netfilter/x_tables.c:1259)
xt_request_find_table_lock (net/netfilter/x_tables.c:1287)
get_info (net/ipv4/netfilter/ip_tables.c:965)
? security_capable (security/security.c:809 (discriminator 13))
? ns_capable (kernel/capability.c:376 kernel/capability.c:397)
? do_ipt_get_ctl (net/ipv4/netfilter/ip_tables.c:1656)
? bpfilter_send_req (net/bpfilter/bpfilter_kern.c:52) bpfilter
nf_getsockopt (net/netfilter/nf_sockopt.c:116)
ip_getsockopt (net/ipv4/ip_sockglue.c:1827)
__sys_getsockopt (net/socket.c:2327)
__x64_sys_getsockopt (net/socket.c:2342 net/socket.c:2339 net/socket.c:2339)
do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:81)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
RIP: 0033:0x7f62844685ee
Code: 48 8b 0d 45 28 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 09
RSP: 002b:00007ffd1f83d638 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
RAX: ffffffffffffffda RBX: 00007ffd1f83d680 RCX: 00007f62844685ee
RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 0000000000000004 R08: 00007ffd1f83d670 R09: 0000558798ffa2a0
R10: 00007ffd1f83d680 R11: 0000000000000246 R12: 00007ffd1f83e3b2
R13: 00007f628455baa0 R14: 00007ffd1f83d7b0 R15: 00007f628457a008
</TASK>
Modules linked in: iptable_nat(+) bpfilter rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache veth xt_state xt_connmark xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 vfat fat ghash_clmulni_intel aesni_intel ena crypto_simd ptp cryptd i8042 pps_core serio button sunrpc sch_fq_codel configfs loop dm_mod fuse dax dmi_sysfs crc32_pclmul crc32c_intel efivarfs
CR2: 0000000000000013

Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default")
Reported-by: Takahiro Kawahara <takawaha@amazon.co.jp>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

minmax: fix up min3() and max3() too

David Laight pointed out that we should deal with the min3() and max3()
mess too, which still does excessive expansion.

And our current macros are actually rather broken.

In particular, the macros did this:

  #define min3(x, y, z) min((typeof(x))min(x, y), z)
  #define max3(x, y, z) max((typeof(x))max(x, y), z)

and that not only is a nested expansion of possibly very complex
arguments with all that involves, the typing with that "typeof()" cast
is completely wrong.

For example, imagine what happens in max3() if 'x' happens to be a
'unsigned char', but 'y' and 'z' are 'unsigned long'.  The types are
compatible, and there's no warning - but the result is just random
garbage.

No, I don't think we've ever hit that issue in practice, but since we
now have sane infrastructure for doing this right, let's just use it.
It fixes any excessive expansion, and also avoids these kinds of broken
type issues.

Requested-by: David Laight <David.Laight@aculab.com>
Acked-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Add support for PIO p flag

draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information
Option to signal that the network can allocate a unique IPv6 prefix per
client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device).

When ra_honor_pio_pflag is enabled, the presence of a P-flag causes
SLAAC autoconfiguration to be disabled for that particular PIO.

An automated test has been added in Android (r.android.com/3195335) to
go along with this change.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Lamparter <equinox@opensourcerouting.org>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'smc-cleanups' into main

Zhengchao Shao says:

====================
net/smc: do some cleanups in smc module

Do some cleanups in smc module.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: remove unused input parameters in smcr_new_buf_create

The input parameter "is_rmb" of the smcr_new_buf_create function
has never been used, remove it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: remove redundant code in smc_connect_check_aclc

When the SMC client perform CLC handshake, it will check whether
the clc header type is correct in receiving SMC_CLC_ACCEPT packet.
The specific invoking path is as follows:
__smc_connect
  smc_connect_clc
    smc_clc_wait_msg
      smc_clc_msg_hdr_valid
        smc_clc_msg_acc_conf_valid
Therefore, the smc_connect_check_aclc interface invoked by
__smc_connect does not need to check type again.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: remove the fallback in __smc_connect

When the SMC client begins to connect to server, smcd_version is set
to SMC_V1 + SMC_V2. If fail to get VLAN ID, only SMC_V2 information
is left in smcd_version. And smcd_version will not be changed to 0.
Therefore, remove the fallback caused by the failure to get VLAN ID.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: remove unreferenced header in smc_loopback.h file

Because linux/err.h is unreferenced in smc_loopback.h file, so
remove it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: dsa: vsc73xx: add {rx,tx}-internal-delay-ps

Add a schema validator to vitesse,vsc73xx.yaml for MAC-level RGMII delays
in the CPU port. Additionally, valid values for VSC73XX were defined,
and a common definition for the RX and TX valid range was created.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: vsc73xx: make RGMII delays configurable

This patch switches hardcoded RGMII transmit/receive delay to
a configurable value. Delay values are taken from the properties of
the CPU port: 'tx-internal-delay-ps' and 'rx-internal-delay-ps'.

The default value is configured to 2.0 ns to maintain backward
compatibility with existing code.

Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l2tp-session-cleanup' into main

James Chapman says:

====================
l2tp: simplify tunnel and session cleanup

This series simplifies and improves l2tp tunnel and session cleanup.

* refactor l2tp management code to not use the tunnel socket's
   sk_user_data. This allows the tunnel and its socket to be closed
   and freed without sequencing the two using the socket's sk_destruct
   hook.

* export ip_flush_pending_frames and use it when closing l2tp_ip
   sockets.

* move the work of closing all sessions in the tunnel to the work
   queue so that sessions are deleted using the same codepath whether
   they are closed by user API request or their parent tunnel is
   closing.

* refactor l2tp_ppp pppox socket / session relationship to have the
   session keep the socket alive, not the other way around. Previously
   the pppox socket held a ref on the session, which complicated
   session delete by having to go through the pppox socket destructor.

* free sessions and pppox sockets by rcu.

* fix a possible tunnel refcount underflow.

* avoid using rcu_barrier in net exit handler.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: use pre_exit pernet hook to avoid rcu_barrier

Move the work of closing all tunnels from the pernet exit hook to
pre_exit since the core does rcu synchronisation between these steps
and we can therefore remove rcu_barrier from l2tp code.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: cleanup eth/ppp pseudowire setup code

l2tp eth/ppp pseudowire setup/cleanup uses kfree() in some error
paths. Drop the refcount instead such that the session object is
always freed when the refcount reaches 0.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: add idr consistency check in session_register

l2tp_session_register uses an idr_alloc then idr_replace pattern to
insert sessions into the session IDR. To catch invalid locking, add a
WARN_ON_ONCE if the IDR entry is modified by another thread between
alloc and replace steps.

Also add comments to make expectations clear.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: use rcu list add/del when updating lists

l2tp_v3_session_htable and tunnel->session_list are read by lockless
getters using RCU. Use rcu list variants when adding or removing list
items.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: prevent possible tunnel refcount underflow

When a session is created, it sets a backpointer to its tunnel. When
the session refcount drops to 0, l2tp_session_free drops the tunnel
refcount if session->tunnel is non-NULL. However, session->tunnel is
set in l2tp_session_create, before the tunnel refcount is incremented
by l2tp_session_register, which leaves a small window where
session->tunnel is non-NULL when the tunnel refcount hasn't been
bumped.

Moving the assignment to l2tp_session_register is trivial but
l2tp_session_create calls l2tp_session_set_header_len which uses
session->tunnel to get the tunnel's encap. Add an encap arg to
l2tp_session_set_header_len to avoid using session->tunnel.

If l2tpv3 sessions have colliding IDs, it is possible for
l2tp_v3_session_get to race with l2tp_session_register and fetch a
session which doesn't yet have session->tunnel set. Add a check for
this case.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: refactor ppp socket/session relationship

Each l2tp ppp session has an associated pppox socket. l2tp_ppp uses
the session's pppox socket refcount to manage session lifetimes; the
pppox socket holds a ref on the session which is dropped by the socket
destructor. This complicates session cleanup.

Given l2tp sessions are refcounted, it makes more sense to reverse
this relationship such that the session keeps the socket alive, not
the other way around. So refactor l2tp_ppp to have the session hold a
ref on its socket while it references it. When the session is closed,
it drops its socket ref when it detaches from its socket. If the
socket is closed first, it initiates the closing of its session, if
one is attached. The socket/session can then be freed asynchronously
when their refcounts drop to 0.

Use the session's session_close callback to detach the pppox socket
since this will be done on the work queue together with the rest of
the session cleanup via l2tp_session_delete.

Also, since l2tp_ppp uses the pppox socket's sk_user_data, use the rcu
sk_user_data access helpers when accessing it and set the socket's
SOCK_RCU_FREE flag to have pppox sockets freed by rcu.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: free sessions using rcu

l2tp sessions may be accessed under an rcu read lock. Have them freed
via rcu and remove the now unneeded synchronize_rcu when a session is
removed.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: delete sessions using work queue

When a tunnel is closed, l2tp_tunnel_closeall closes all sessions in
the tunnel. Move the work of deleting each session to the work queue
so that sessions are deleted using the same codepath whether they are
closed by user API request or their parent tunnel is closing. This
also avoids the locking dance in l2tp_tunnel_closeall where the
tunnel's session list lock was unlocked and relocked in the loop.

In l2tp_exit_net, use drain_workqueue instead of flush_workqueue
because the processing of tunnel_delete work may queue session_delete
work items which must also be processed.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: simplify tunnel and socket cleanup

When the l2tp tunnel socket used sk_user_data to point to its
associated l2tp tunnel, socket and tunnel cleanup had to make use of
the socket's destructor to free the tunnel only when the socket could
no longer be accessed.

Now that sk_user_data is no longer used, we can simplify socket and
tunnel cleanup:

  * If the tunnel closes first, it cleans up and drops its socket ref
    when the tunnel refcount drops to zero. If its socket was provided
    by userspace, the socket is closed and freed asynchronously, when
    userspace closes it. If its socket is a kernel socket, the tunnel
    closes the socket itself during cleanup and drops its socket ref
    when the tunnel's refcount drops to zero.

  * If the socket closes first, we initiate the closing of its
    associated tunnel. For UDP sockets, this is via the socket's
    encap_destroy hook. For L2TPIP sockets, this is via the socket's
    destroy callback. The tunnel holds a socket ref while it
    references the sock. When the tunnel is freed, it drops its socket
    ref and the socket will be cleaned up when its own refcount drops
    to zero, asynchronous to the tunnel free.

  * The tunnel socket destructor is no longer needed since the tunnel
    is no longer freed through the socket destructor.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: remove unused tunnel magic field

Since l2tp no longer derives tunnel pointers directly via
sk_user_data, it is no longer useful for l2tp to check tunnel pointers
using a magic feather. Drop the tunnel's magic field.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: don't set sk_user_data in tunnel socket

l2tp no longer uses the tunnel socket's sk_user_data so drop the code
which sets it.

In l2tp_validate_socket use l2tp_sk_to_tunnel to check whether a given
socket is already attached to an l2tp tunnel since we can no longer
use non-null sk_user_data to indicate this.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: don't use tunnel socket sk_user_data in ppp procfs output

l2tp's ppp procfs output can be used to show internal state of
pppol2tp. It includes a 'user-data-ok' field, which is derived from
the tunnel socket's sk_user_data being non-NULL. Use tunnel->sock
being non-NULL to indicate this instead.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: have l2tp_ip_destroy_sock use ip_flush_pending_frames

Use the recently exported ip_flush_pending_frames instead of a
free-coded version and lock the socket while we call it.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: export ip_flush_pending_frames

To avoid protocol modules implementing their own, export
ip_flush_pending_frames.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: lookup tunnel from socket without using sk_user_data

l2tp_sk_to_tunnel derives the tunnel from sk_user_data. Instead,
lookup the tunnel by walking the tunnel IDR for a tunnel using the
indicated sock. This is slow but l2tp_sk_to_tunnel is not used in
the datapath so performance isn't critical.

l2tp_tunnel_destruct needs a variant of l2tp_sk_to_tunnel which does
not bump the tunnel refcount since the tunnel refcount is already 0.

Change l2tp_sk_to_tunnel sk arg to const since it does not modify sk.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'for-6.11-rc1-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix regression in extent map rework when handling insertion of
   overlapping compressed extent

- fix unexpected file length when appending to a file using direct io
   and buffer not faulted in

- in zoned mode, fix accounting of unusable space when flipping
   read-only block group back to read-write

- fix page locking when COWing an inline range, assertion failure found
   by syzbot

- fix calculation of space info in debugging print

- tree-checker, add validation of data reference item

- fix a few -Wmaybe-uninitialized build warnings

* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
  btrfs: fix corruption after buffer fault in during direct IO append write
  btrfs: zoned: fix zone_unusable accounting on making block group read-write again
  btrfs: do not subtract delalloc from avail bytes
  btrfs: make cow_file_range_inline() honor locked_page on error
  btrfs: fix corrupt read due to bad offset of a compressed extent map
  btrfs: tree-checker: validate dref root and objectid