git.kernel.dk Git - linux-2.6-block.git/log

Florian Westphal [Thu, 11 Apr 2024 23:36:19 +0000 (01:36 +0200)]

selftests: netfilter: nft_flowtable.sh: move test to lib.sh infra

Use socat, the different nc implementations have too much variance wrt.
supported options.

Avoid sleeping until listener is up, use busywait helper for this,
this also greatly reduces test duration.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-15-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:18 +0000 (01:36 +0200)]

selftests: netfilter: nft_fib.sh: move to lib.sh infra

Also lower ping interval, wait times (helpers get called several times)
and set nodad for ipv6 addresses: 20s down to 4s.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-14-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:17 +0000 (01:36 +0200)]

selftests: netfilter: nft_conntrack_helper.sh: test to lib.sh infra

prefer socat over nc, nc has too many incompatible versions around.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-13-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:16 +0000 (01:36 +0200)]

selftests: netfilter: nf_nat_edemux.sh: move to lib.sh infra

While at it, use checktool helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-12-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:15 +0000 (01:36 +0200)]

selftests: netfilter: ipvs.sh: move to lib.sh infra

The setup_ns helper makes the netns names random, so replace nsX with $nsX
everywhere.

Replace nc with socat, otherwise script fails on my system due to
incompatible nc versions ("nc: cannot use -p and -l").

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-11-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:14 +0000 (01:36 +0200)]

selftests: netfilter: place checktool helper in lib.sh

... so it doesn't have to be repeated everywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-10-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:13 +0000 (01:36 +0200)]

selftests: netfilter: conntrack_ipip_mtu.sh" move to lib.sh infra

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-9-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:12 +0000 (01:36 +0200)]

selftests: netfilter: conntrack_vrf.sh: move to lib.sh infra

swap test for "ip" with "conntrack", former is already accounted for
via setup_ns helper. Also switch to bash.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-8-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:11 +0000 (01:36 +0200)]

selftests: netfilter: conntrack_sctp_collision.sh: move to lib.sh infra

While at it, address warnings generated by shellcheck and fix following
minor issues:

- some distros place netem in 'extra' modules package, so add a skip check for netem-attach
   failure.
- tc prints a warning for the 100mbit class:
   "Warning: sch_htb: quantum of class 10001 is big. Consider r2q change."
   Silence this by increasing the divisor.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:10 +0000 (01:36 +0200)]

selftests: netfilter: conntrack_tcp_unreplied.sh: move to lib.sh infra

Replace nc with socat. Too many different implementations of nc
are around with incompatible options ("nc: cannot use -p and -l").

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:09 +0000 (01:36 +0200)]

selftests: netfilter: conntrack_icmp_related.sh: move to lib.sh infra

Only relevant change is that netns names have random suffix names,
i.e. its safe to run this in parallel with other tests.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-5-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:08 +0000 (01:36 +0200)]

selftests: netfilter: br_netfilter.sh: move to lib.sh infra

Also, fix two issues reported by Pablo Neira:
1. Must modprobe br_netfilter in case its not loaded,
else sysctl cannot be set.
2. ping for netns4 fails if rp_filter is enabled in bridge netns,
so set all and default to 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-4-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:07 +0000 (01:36 +0200)]

selftests: netfilter: bridge_brouter.sh: move to lib.sh infra

Doing so gets us dynamically generated netns names.

Also:
* do not assume rp_filter is disabled, if its on script failed
* reduce timeout (-W) for "expected to fail" ping commands
* don't print PASS line for basic sanity ping
* shellcheck cleanups

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Florian Westphal [Thu, 11 Apr 2024 23:36:06 +0000 (01:36 +0200)]

selftests: netfilter: move to net subdir

.. so this can start re-using existing lib.sh infra in next patches.

Several of these scripts will not work, e.g. because they assume
rp_filter is disabled, or reliance on a particular version/flavor
of "netcat" tool.

Add config settings for them.

nft_trans_stress.sh script is removed, it also exists in the nftables
userspace selftests. I do not see a reason to keep two versions in
different repositories/projects.

The settings file is removed for now:

It was used to increase the timeout to avoid slow scripts from getting
zapped by the 45s timeout, but some of the slow scripts can be sped up.
Re-add it later for scripts that cannot be sped up easily.

Update MAINTAINERS to reflect that future updates to netfilter
scripts should go through netfilter-devel@.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David S. Miller [Fri, 12 Apr 2024 10:40:09 +0000 (11:40 +0100)]

Merge branch 'nfp-minor-improvements'

Louis Peens says:

====================
nfp: series of minor driver improvements

This short series bundles now only includes a small update to add a
board part number to devlink. Previously some dim patches also formed
part of this series, these were dropped in v5.

Patch1: Add new define for devlink string "board.part_number"
Patch2: Make use of this field in the nfp driver

Changes since V4:
- Dropped the dim patches, as there is a more significant rework in
  progress to make it more flexible, as mentioned in the V4 review:
  https://lore.kernel.org/all/1712547870-112976-2-git-send-email-hengqi@linux.alibaba.com/
- Updated the devlink description of 'board.part_number'

Changes since V3:
- Fixed: Documentation/networking/devlink/devlink-info.rst:150:
    WARNING: Title underline too short.

Changes since V2:
- After some discussion on the previous series it was agreed that only
  the "board.part_number" field makes sense in the common code. The
  "board.model" field which was moved to devlink common code in V1 is
  now kept in the driver. The field is specific to the nfp driver,
  exposing the codename of the board.
- In summary, add "board.part_number" to devlink, and populate it
  in the the nfp driver.

Changes since V1:
- Move nfp local defines to devlink common code as it is quite generic.
- Add new 'dim' profile instead of using driver local overrides, as this
  allows use of the 'dim' helpers.
- This expanded 2 patches to 4, as the common code changes are split
  into seperate patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fei Qin [Wed, 10 Apr 2024 11:26:36 +0000 (13:26 +0200)]

nfp: update devlink device info output

Newer NIC will introduce a new part number, now add it
into devlink device info.

This patch also updates the information of "board.id" in
nfp.rst to match the devlink-info.rst.

Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Fei Qin [Wed, 10 Apr 2024 11:26:35 +0000 (13:26 +0200)]

devlink: add a new info version tag

Add definition and documentation for the new generic
info "board.part_number".

The new one is for part number specific use, and board.id
is modified to match the documentation in devlink-info.

Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Hechao Li [Tue, 9 Apr 2024 16:43:55 +0000 (09:43 -0700)]

tcp: increase the default TCP scaling ratio

After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"),
we noticed an application-level timeout due to reduced throughput.

Before the commit, for a client that sets SO_RCVBUF to 65k, it takes
around 22 seconds to transfer 10M data. After the commit, it takes 40
seconds. Because our application has a 30-second timeout, this
regression broke the application.

The reason that it takes longer to transfer data is that
tp->scaling_ratio is initialized to a value that results in ~0.25 of
rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which
translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k
initial receive window.

Later, even though the scaling_ratio is updated to a more accurate
skb->len/skb->truesize, which is ~0.66 in our environment, the window
stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not
change together with the tp->scaling_ratio update when autotuning is
disabled due to SO_RCVBUF. As a result, the window size is capped at the
initial window_clamp, which is also ~0.25 * rcvbuf, and never grows
bigger.

Most modern applications let the kernel do autotuning, and benefit from
the increased scaling_ratio. But there are applications such as kafka
that has a default setting of SO_RCVBUF=64k.

This patch increases the initial scaling_ratio from ~25% to 50% in order
to make it backward compatible with the original default
sysctl_tcp_adv_win_scale for applications setting SO_RCVBUF.

Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Hechao Li <hli@netflix.com>
Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/20240402215405.432863-1-hli@netflix.com/
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Fri, 12 Apr 2024 09:18:19 +0000 (10:18 +0100)]

Merge branch 'rtl8226b-serdes-switching'

Eric Woudstra says:

====================
rtl8226b/8221b add C45 instances and SerDes switching

Based on the comments in [PATCH net-next]
"Realtek RTL822x PHY rework to c45 and SerDes interface switching"

Adds SerDes switching interface between 2500base-x and sgmii for
rtl8221b and rtl8226b.

Add get_rate_matching() for rtl8226b and rtl8221b, reading the serdes
mode from phy.

Driver instances are added for rtl8226b and rtl8221b for Clause 45
access only. The existing code is not touched, they use newly added
functions. They also use the same rtl822xb_config_init() and
rtl822xb_get_rate_matching() as these functions also can be used for
direct Clause 45 access. Also Adds definition of MMC 31 registers,
which cannot be used through C45-over-C22, only when phydev->is_c45
is set.

Change rtlgen_get_speed() so the register value is passed as argument.
Using Clause 45 access, this value is retrieved differently.
Rename it to rtlgen_decode_speed() and add a call to it in
rtl822x_c45_read_status().

Add rtl822x_c45_get_features() to set supported port for rtl8221b.

Then 1 quirk is added for sfp modules known to have a rtl8221b
behind RollBall, Clause 45 only, protocol.

Changed in PATCH v4:
* Changed switch to if statement in rtl822xb_get_rate_matching()
* Removed setting ETHTOOL_LINK_MODE_MII_BIT in rtl822x_c45_get_features()

Changed in PATCH v3:
* Only apply to rtl8221b and rtl8226b phy's
* Set phydev->rate_matching in .config_init()
* Removed OEM SFP fixup for now, as there are modules with the same
  vendor name/PN, but with different PHY's. We found rtl8221b, but
  also the ty8821, which is not yet supported.

Changed in PATCH v2:
* Set author to Marek for the commit of the new C45 instances
* Separate commit for setting supported ports
* Renamed rtlgen_get_speed to rtlgen_decode_speed
* Always fill in possible interfaces
* Renamed sfp_fixup_oem_2_5g to sfp_fixup_oem_2_5gbaset
* Only update phydev->interface when link is up

Alexander Couzens (1):
  net: phy: realtek: configure SerDes mode for rtl822xb PHYs

Eric Woudstra (3):
  net: phy: realtek: add get_rate_matching() for rtl822xb PHYs
  net: phy: realtek: Change rtlgen_get_speed() to rtlgen_decode_speed()
  net: phy: realtek: add rtl822x_c45_get_features() to set supported
    port
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Marek Behún [Tue, 9 Apr 2024 07:30:16 +0000 (09:30 +0200)]

net: sfp: add quirk for another multigig RollBall transceiver

Add quirk for another RollBall copper transceiver: Turris RTSFP-2.5G,
containing 2.5g capable RTL8221B PHY.

Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Woudstra [Tue, 9 Apr 2024 07:30:15 +0000 (09:30 +0200)]

net: phy: realtek: add rtl822x_c45_get_features() to set supported port

Sets ETHTOOL_LINK_MODE_TP_BIT in phydev->supported.

Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Woudstra [Tue, 9 Apr 2024 07:30:14 +0000 (09:30 +0200)]

net: phy: realtek: Change rtlgen_get_speed() to rtlgen_decode_speed()

The value of the register to determine the speed, is retrieved
differently when using Clause 45 only. To use the rtlgen_get_speed()
function in this case, pass the value of the register as argument to
rtlgen_get_speed(). The function would then always return 0, so change it
to void. A better name for this function now is rtlgen_decode_speed().

Replace a call to genphy_read_status() followed by rtlgen_get_speed()
with a call to rtlgen_read_status() in rtl822x_read_status().

Add reading speed to rtl822x_c45_read_status().

Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Marek Behún [Tue, 9 Apr 2024 07:30:13 +0000 (09:30 +0200)]

net: phy: realtek: Add driver instances for rtl8221b via Clause 45

Collected from several commits in [PATCH net-next]
"Realtek RTL822x PHY rework to c45 and SerDes interface switching"

The instances are used by Clause 45 only accessible PHY's on several sfp
modules, which are using RollBall protocol.

Signed-off-by: Marek Behún <kabel@kernel.org>
[ Added matching functions to differentiate C45 instances ]
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Woudstra [Tue, 9 Apr 2024 07:30:12 +0000 (09:30 +0200)]

net: phy: realtek: add get_rate_matching() for rtl822xb PHYs

Uses vendor register to determine if SerDes is setup in rate-matching mode.

Rate-matching only supported when SerDes is set to 2500base-x.

Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Alexander Couzens [Tue, 9 Apr 2024 07:30:11 +0000 (09:30 +0200)]

net: phy: realtek: configure SerDes mode for rtl822xb PHYs

The rtl8221b and rtl8226b series support switching SerDes mode between
2500base-x and sgmii based on the negotiated copper speed.

Configure this switching mode according to SerDes modes supported by
host.

There is an additional datasheet for RTL8226B/RTL8221B called
"SERDES MODE SETTING FLOW APPLICATION NOTE" where a sequence is
described to setup interface and rate adapter mode.

However, there is no documentation about the meaning of registers
and bits, it's literally just magic numbers and pseudo-code.

Signed-off-by: Alexander Couzens <lynxis@fe80.eu>
[ refactored, dropped HiSGMII mode and changed commit message ]
Signed-off-by: Marek Behún <kabel@kernel.org>
[ changed rtl822x_update_interface() to use vendor register ]
[ always fill in possible interfaces ]
[ only apply to rtl8221b and rtl8226b phy's ]
[ set phydev->rate_matching in .config_init() ]
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: should come before them, without any blank lines. As the
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jakub Kicinski [Fri, 12 Apr 2024 03:01:19 +0000 (20:01 -0700)]

Merge branch 'net-dsa-allow-phylink_mac_ops-in-dsa-drivers'

Russell King says:

====================
net: dsa: allow phylink_mac_ops in DSA drivers

This series showcases my idea of moving the phylink_mac_ops into DSA
drivers, using mv88e6xxx as an example. Since I'm only changing one
driver, providing the mac_ops has to be optional and the existing shims
need to be kept for unconverted drivers.

The first patch introduces a new helper that converts from the
phylink_config structure that phylink uses to communicate with MAC
drivers to the dsa_port structure. From this, DSA drivers can get
the dsa_switch structure and thus their implementation specific
data structure, and they can also retrieve the port index.

The second patch adds the support to the core DSA layer to allow
DSA drivers to provide phylink_mac_ops.

The third patch converts mv88e6xxx to use this.

I initially made this change after adding yet more phylink to DSA
driver shims for my work with phylink-based EEE support, and decided
that it was getting silly to keep implementing more and more shims.
There are cases where shims don't work well - we had already tripped
over a case a few years ago when the phylink mac_select_pcs operation
was introduced. Phylink tested for the presence of this in the ops
structure, but with DSA shims, this doesn't necessarily mean that
the sub-driver supports this method. The only way to find that out
is to call the method with dummy values and check the return code.

The same thing was partly true when adding EEE support, and I ended
up with this in phylink to determine whether the MAC supported EEE:

+static bool phylink_mac_supports_eee(struct phylink *pl)
+{
+       return pl->mac_ops->mac_disable_tx_lpi &&
+              pl->mac_ops->mac_enable_tx_lpi &&
+              pl->config->lpi_capabilities;
+}

because merely testing for the presence of the operations is
insufficient when shims are involved - and it wasn't possible to call
these functions in the way that mac_select_pcs could be called.

So, I think it's time to get away from this shimming model and instead
have drivers directly interface to the various subsystems.

This converts mv88e6xxx. I have similar patches for other DSA drivers
that will be sent once this has been reviewed.
====================

Link: https://lore.kernel.org/r/ZhbrbM+d5UfgafGp@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Russell King (Oracle) [Wed, 10 Apr 2024 19:42:48 +0000 (20:42 +0100)]

net: dsa: mv88e6xxx: provide own phylink MAC operations

Convert mv88e6xxx to provide its own phylink MAC operations, thus
avoiding the shim layer in DSA's port.c

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rudqK-006K9N-HY@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Russell King (Oracle) [Wed, 10 Apr 2024 19:42:43 +0000 (20:42 +0100)]

net: dsa: allow DSA switch drivers to provide their own phylink mac ops

Rather than having a shim for each and every phylink MAC operation,
allow DSA switch drivers to provide their own ops structure. When a
DSA driver provides the phylink MAC operations, the shimmed ops must
not be provided, so fail an attempt to register a switch with both
the phylink_mac_ops in struct dsa_switch and the phylink_mac_*
operations populated in dsa_switch_ops populated.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rudqF-006K9H-Cc@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Russell King (Oracle) [Wed, 10 Apr 2024 19:42:38 +0000 (20:42 +0100)]

net: dsa: introduce dsa_phylink_to_port()

We convert from a phylink_config struct to a dsa_port struct in many
places, let's provide a helper for this.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1rudqA-006K9B-85@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Colin Ian King [Wed, 10 Apr 2024 14:41:36 +0000 (15:41 +0100)]

tls: remove redundant assignment to variable decrypted

The variable decrypted is being assigned a value that is never read,
the control of flow after the assignment is via an return path and
decrypted is not referenced in this path. The assignment is redundant
and can be removed.

Cleans up clang scan warning:
net/tls/tls_sw.c:2150:4: warning: Value stored to 'decrypted' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240410144136.289030-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Guillaume Nault [Wed, 10 Apr 2024 13:14:29 +0000 (15:14 +0200)]

ipv4: Remove RTO_ONLINK.

RTO_ONLINK was a flag used in ->flowi4_tos that allowed to alter the
scope of an IPv4 route lookup. Setting this flag was equivalent to
specifying RT_SCOPE_LINK in ->flowi4_scope.

With commit ec20b2830093 ("ipv4: Set scope explicitly in
ip_route_output()."), the last users of RTO_ONLINK have been removed.
Therefore, we can now drop the code that checked this bit and stop
modifying ->flowi4_scope in ip_route_output_key_hash().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/57de760565cab55df7b129f523530ac6475865b2.1712754146.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jon Maloy [Tue, 9 Apr 2024 15:28:05 +0000 (11:28 -0400)]

tcp: add support for SO_PEEK_OFF socket option

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.

In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.

At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
[  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
[  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
[  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
[  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
[  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
[  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
[  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
[  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Breno Leitao [Tue, 9 Apr 2024 13:33:06 +0000 (06:33 -0700)]

net: usb: qmi_wwan: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240409133307.2058099-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Breno Leitao [Tue, 9 Apr 2024 13:33:05 +0000 (06:33 -0700)]

net: usb: qmi_wwan: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the qmi_wwan driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240409133307.2058099-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Wed, 10 Apr 2024 11:19:50 +0000 (11:19 +0000)]

mpls: no longer hold RTNL in mpls_netconf_dump_devconf()

- Use for_each_netdev_dump() to no longer rely
  on net->dev_index_head hash table.

- No longer care of net->dev_base_seq

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

- No longer grab RTNL, RCU protection is enough,
  afer adding one READ_ONCE(mdev->input_enabled)
  in mpls_netconf_fill_devconf()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Asbjørn Sloth Tønnesen [Wed, 10 Apr 2024 11:47:17 +0000 (11:47 +0000)]

flow_offload: fix flow_offload_has_one_action() kdoc

include/net/flow_offload.h:351: warning:
No description found for return value of 'flow_offload_has_one_action'

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Link: https://lore.kernel.org/r/20240410114718.15145-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Carolina Jubran [Wed, 10 Apr 2024 21:41:54 +0000 (00:41 +0300)]

net/mlx5e: Expose the VF/SF RX drop counter on the representor

Q counters are device-level counters that track specific
events, among which are out_of_buffer events. These events
occur when packets are dropped due to a lack of receive
buffer in the RX queue.

Expose the total number of out_of_buffer events on the
VFs/SFs to their respective representor, using the
"ip stats group link" under the name of "rx_missed".

The "rx_missed" equals the sum of all
Q counters out_of_buffer values allocated on the VFs/SFs.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410214154.250583-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 12 Apr 2024 02:29:27 +0000 (19:29 -0700)]

Merge branch 'minor-cleanups-to-skb-frag-ref-unref'

Mina Almasry says:

====================
Minor cleanups to skb frag ref/unref

This series is largely motivated by a recent discussion where there was
some confusion on how to properly ref/unref pp pages vs non pp pages:

https://lore.kernel.org/netdev/CAHS8izOoO-EovwMwAm9tLYetwikNPxC0FKyVGu1TPJWSz4bGoA@mail.gmail.com/T/#t

There is some subtely there because pp uses page->pp_ref_count for
refcounting, while non-pp uses get_page()/put_page() for ref counting.
Getting the refcounting pairs wrong can lead to kernel crash.

Additionally currently it may not be obvious to skb users unaware of
page pool internals how to properly acquire a ref on a pp frag. It
requires checking of skb->pp_recycle & is_pp_page() to make the correct
calls and may require some handling at the call site aware of arguable pp
internals.

This series is a minor refactor with a couple of goals:

1. skb users should be able to ref/unref a frag using
   [__]skb_frag_[un]ref() functions without needing to understand pp
   concepts and pp_ref_count vs get/put_page() differences.

2. reference counting functions should have a mirror opposite. I.e. there
   should be a foo_unref() to every foo_ref() with a mirror opposite
   implementation (as much as possible).

This is RFC to collect feedback if this change is desirable, but also so
that I don't race with the fix for the issue Dragos is seeing for his
crash.

https://lore.kernel.org/lkml/CAHS8izN436pn3SndrzsCyhmqvJHLyxgCeDpWXA4r1ANt3RCDLQ@mail.gmail.com/T/
====================

Link: https://lore.kernel.org/r/20240410190505.1225848-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Mina Almasry [Wed, 10 Apr 2024 19:05:02 +0000 (12:05 -0700)]

net: mirror skb frag ref/unref helpers

Refactor some of the skb frag ref/unref helpers for improved clarity.

Implement napi_pp_get_page() to be the mirror counterpart of
napi_pp_put_page().

Implement skb_page_ref() to be the mirror of skb_page_unref().

Improve __skb_frag_ref() to become a mirror counterpart of
__skb_frag_unref(). Previously unref could handle pp & non-pp pages,
while the ref could only handle non-pp pages. Now both the ref & unref
helpers can correctly handle both pp & non-pp pages.

Now that __skb_frag_ref() can handle both pp & non-pp pages, remove
skb_pp_frag_ref(), and use __skb_frag_ref() instead. This lets us
remove pp specific handling from skb_try_coalesce.

Additionally, since __skb_frag_ref() can now handle both pp & non-pp
pages, a latent issue in skb_shift() should now be fixed. Previously
this function would do a non-pp ref & pp unref on potential pp frags
(fragfrom). After this patch, skb_shift() should correctly do a pp
ref/unref on pp frags.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Mina Almasry [Wed, 10 Apr 2024 19:05:01 +0000 (12:05 -0700)]

net: move skb ref helpers to new header

Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
helpers. Many of the consumers of skbuff.h do not actually use any of
the skb ref helpers, and we can speed up compilation a bit by minimizing
this header file.

Additionally in the later patch in the series we add page_pool support
to skb_frag_ref(), which requires some page_pool dependencies. We can
now add these dependencies to skbuff_ref.h instead of a very ubiquitous
skbuff.h

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 11 Apr 2024 21:20:04 +0000 (14:20 -0700)]

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

net/unix/garbage.c
  47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
  4090fa373f0e ("af_unix: Replace garbage collection algorithm.")

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
  b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")

drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
  194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  958f56e48385 ("net/mlx5e: Un-expose functions in en.h")
  49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 18:46:31 +0000 (11:46 -0700)]

Merge tag 'net-6.9-rc4' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth.

  Current release - new code bugs:

   - netfilter: complete validation of user input

   - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev

  Previous releases - regressions:

   - core: fix u64_stats_init() for lockdep when used repeatedly in one
     file

   - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr

   - bluetooth: fix memory leak in hci_req_sync_complete()

   - batman-adv: avoid infinite loop trying to resize local TT

   - drv: geneve: fix header validation in geneve[6]_xmit_skb

   - drv: bnxt_en: fix possible memory leak in
     bnxt_rdma_aux_device_init()

   - drv: mlx5: offset comp irq index in name by one

   - drv: ena: avoid double-free clearing stale tx_info->xdpf value

   - drv: pds_core: fix pdsc_check_pci_health deadlock

  Previous releases - always broken:

   - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING

   - bluetooth: fix setsockopt not validating user input

   - af_unix: clear stale u->oob_skb.

   - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies

   - drv: virtio_net: fix guest hangup on invalid RSS update

   - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow

   - dsa: mt7530: trap link-local frames regardless of ST Port State"

* tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
  net: ena: Set tx_info->xdpf value to NULL
  net: ena: Fix incorrect descriptor free behavior
  net: ena: Wrong missing IO completions check order
  net: ena: Fix potential sign extension issue
  af_unix: Fix garbage collector racing against connect()
  net: dsa: mt7530: trap link-local frames regardless of ST Port State
  Revert "s390/ism: fix receive message buffer allocation"
  net: sparx5: fix wrong config being used when reconfiguring PCS
  net/mlx5: fix possible stack overflows
  net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
  net/mlx5e: RSS, Block XOR hash with over 128 channels
  net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
  net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
  net/mlx5e: Fix mlx5e_priv_init() cleanup flow
  net/mlx5e: RSS, Block changing channels number when RXFH is configured
  net/mlx5: Correctly compare pkt reformat ids
  net/mlx5: Properly link new fs rules into the tree
  net/mlx5: offset comp irq index in name by one
  net/mlx5: Register devlink first under devlink lock
  net/mlx5: E-switch, store eswitch pointer before registering devlink_param
  ...

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 18:42:11 +0000 (11:42 -0700)]

Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"The most important fix is the sg one because the regression it fixes
  (spurious warning and use after final put) is already backported to
  stable.

  The next biggest impact is the target fix for wrong credentials used
  to load a module because it's affecting new kernels installed on
  selinux based distributions.

  The other three fixes are an obvious off by one and SATA protocol
  issues"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
  scsi: hisi_sas: Modify the deadline for ata_wait_after_reset()
  scsi: hisi_sas: Handle the NCQ error returned by D2H frame
  scsi: target: Fix SELinux error when systemd-modules loads the target module
  scsi: sg: Avoid race in error handling & drop bogus warn

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 18:30:42 +0000 (11:30 -0700)]

Merge tag 'loongarch-fixes-6.9-1' of git://git./linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:

- make {virt, phys, page, pfn} translation work with KFENCE for
   LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE
   enabled)

- update dts files for Loongson-2K series to make devices work
   correctly

- fix a build error

* tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors
  LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET
  LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI
  LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC
  LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC
  LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE
  LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE
  mm: Move lowmem_page_address() a little later

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 18:24:55 +0000 (11:24 -0700)]

Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs

Pull more bcachefs fixes from Kent Overstreet:
"Notable user impacting bugs

   - On multi device filesystems, recovery was looping in
     btree_trans_too_many_iters(). This checks if a transaction has
     touched too many btree paths (because of iteration over many keys),
     and isuses a restart to drop unneeded paths.

     But it's now possible for some paths to exceed the previous limit
     without iteration in the interior btree update path, since the
     transaction commit will do alloc updates for every old and new
     btree node, and during journal replay we don't use the btree write
     buffer for locking reasons and thus those updates use btree paths
     when they wouldn't normally.

   - Fix a corner case in rebalance when moving extents on a
     durability=0 device. This wouldn't be hit when a device was
     formatted with durability=0 since in that case we'll only use it as
     a write through cache (only cached extents will live on it), but
     durability can now be changed on an existing device.

   - bch2_get_acl() could rarely forget to handle a transaction restart;
     this manifested as the occasional missing acl that came back after
     dropping caches.

   - Fix a major performance regression on high iops multithreaded write
     workloads (only since 6.9-rc1); a previous fix for a deadlock in
     the interior btree update path to check the journal watermark
     introduced a dependency on the state of btree write buffer flushing
     that we didn't want.

   - Assorted other repair paths and recovery fixes"

* tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
  bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
  bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
  bcachefs: Fix a race in btree_update_nodes_written()
  bcachefs: btree_node_scan: Respect member.data_allowed
  bcachefs: Don't scan for btree nodes when we can reconstruct
  bcachefs: Fix check_topology() when using node scan
  bcachefs: fix eytzinger0_find_gt()
  bcachefs: fix bch2_get_acl() transaction restart handling
  bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
  bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
  bcachefs: Rename struct field swap to prevent macro naming collision
  MAINTAINERS: Add entry for bcachefs documentation
  Documentation: filesystems: Add bcachefs toctree
  bcachefs: JOURNAL_SPACE_LOW
  bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
  bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
  bcachefs: fix rand_delete unit test
  bcachefs: fix ! vs ~ typo in __clear_bit_le64()
  bcachefs: Fix rebalance from durability=0 device
  bcachefs: Print shutdown journal sequence number
  ...

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 18:15:09 +0000 (11:15 -0700)]

Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git./linux/kernel/git/chrome-platform/linux

Pull chrome platform fix from Tzung-Bi Shih:
"Fix a NULL pointer dereference"

* tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_uart: properly fix race condition

commit | commitdiff | tree

Jakub Kicinski [Thu, 11 Apr 2024 15:19:55 +0000 (08:19 -0700)]

Merge branch 'mptcp-add-last-time-fields-in-mptcp_info'

Matthieu Baerts says:

====================
mptcp: add last time fields in mptcp_info

These patches from Geliang add support for the "last time" field in
MPTCP Info, and verify that the counters look valid.

Patch 1 adds these counters: last_data_sent, last_data_recv and
last_ack_recv. They are available in the MPTCP Info, so exposed via
getsockopt(MPTCP_INFO) and the Netlink Diag interface.

Patch 2 adds a test in diag.sh MPTCP selftest, to check that the
counters have moved by at least 250ms, after having waited twice that
time.

v1: https://lore.kernel.org/r/20240405-upstream-net-next-20240405-mptcp-last-time-info-v1-0-52dc49453649@kernel.org
====================

Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-0-f95bd6b33e51@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Geliang Tang [Wed, 10 Apr 2024 09:48:25 +0000 (11:48 +0200)]

selftests: mptcp: test last time mptcp_info

This patch adds a new helper chk_msk_info() to show the counters in
mptcp_info of the given info, and check that the timestamps move
forward. Use it to show newly added last_data_sent, last_data_recv
and last_ack_recv in mptcp_info in chk_last_time_info().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-2-f95bd6b33e51@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Geliang Tang [Wed, 10 Apr 2024 09:48:24 +0000 (11:48 +0200)]

mptcp: add last time fields in mptcp_info

This patch adds "last time" fields last_data_sent, last_data_recv and
last_ack_recv in struct mptcp_sock to record the last time data_sent,
data_recv and ack_recv happened. They all are initialized as
tcp_jiffies32 in __mptcp_init_sock(), and updated as tcp_jiffies32 too
when data is sent in __subflow_push_pending(), data is received in
__mptcp_move_skbs_from_subflow(), and ack is received in ack_update_msk().

Similar to tcpi_last_data_sent, tcpi_last_data_recv and tcpi_last_ack_recv
exposed with TCP, this patch exposes the last time "an action happened" for
MPTCP in mptcp_info, named mptcpi_last_data_sent, mptcpi_last_data_recv and
mptcpi_last_ack_recv, calculated in mptcp_diag_fill_info() as the time
deltas between now and the newly added last time fields in mptcp_sock.

Since msk->last_ack_recv needs to be protected by mptcp_data_lock/unlock,
and lock_sock_fast can sleep and be quite slow, move the entire
mptcp_data_lock/unlock block after the lock/unlock_sock_fast block.
Then mptcpi_last_data_sent and mptcpi_last_data_recv are set in
lock/unlock_sock_fast block, while mptcpi_last_ack_recv is set in
mptcp_data_lock/unlock block, which is protected by a spinlock and
should not block for too long.

Also add three reserved bytes in struct mptcp_info not to have holes in
this structure exposed to userspace.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/446
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-1-f95bd6b33e51@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 11 Apr 2024 01:04:24 +0000 (18:04 -0700)]

Merge branch mana-ib-flex of git://git./linux/kernel/git/rdma/rdma.git

Erick Archer says:

====================
mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part)

The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

Specifically, the first commit adds the flex member and the patches 2 and
3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".

This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code pattern
is the following:

virtual report

@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@

i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);

@script:python depends on report@
p1 << rule1.p1;
@@

msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/
v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/
====================

Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Erick Archer [Sat, 6 Apr 2024 14:23:37 +0000 (16:23 +0200)]

net: mana: Avoid open coded arithmetic

This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
and this structure ends in a flexible array:

struct mana_cfg_rx_steer_req_v2 {
[...]
mana_handle_t indir_tab[] __counted_by(num_indir_entries);
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kzalloc() function.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

This way, the code is more readable and safer.

This code was detected with the help of Coccinelle, and audited and
modified manually.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB7237A21355C86EC0DCC0D83B8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

commit | commitdiff | tree

Erick Archer [Sat, 6 Apr 2024 14:23:36 +0000 (16:23 +0200)]

RDMA/mana_ib: Prefer struct_size over open coded arithmetic

This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
and this structure ends in a flexible array:

struct mana_cfg_rx_steer_req_v2 {
[...]
mana_handle_t indir_tab[] __counted_by(num_indir_entries);
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kzalloc() function.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

This way, the code is more readable and safer.

This code was detected with the help of Coccinelle, and audited and
modified manually.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB72375EB06EE1A84A67BE722E8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

commit | commitdiff | tree

Erick Archer [Sat, 6 Apr 2024 14:23:35 +0000 (16:23 +0200)]

net: mana: Add flex array to struct mana_cfg_rx_steer_req_v2

The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

This is a previous step to refactor the two consumers of this structure.

drivers/infiniband/hw/mana/qp.c
drivers/net/ethernet/microsoft/mana/mana_en.c

The ultimate goal is to avoid the open-coded arithmetic in the memory
allocator functions [2] using the "struct_size" macro.

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB7237E2900247571C9CB84C678B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Thu, 11 Apr 2024 09:21:05 +0000 (11:21 +0200)]

Merge branch 'ena-driver-bug-fixes'

David Arinzon says:

====================
ENA driver bug fixes

From: David Arinzon <darinzon@amazon.com>

This patchset contains multiple bug fixes for the
ENA driver.
====================

Link: https://lore.kernel.org/r/20240410091358.16289-1-darinzon@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

David Arinzon [Wed, 10 Apr 2024 09:13:58 +0000 (09:13 +0000)]

net: ena: Set tx_info->xdpf value to NULL

The patch mentioned in the `Fixes` tag removed the explicit assignment
of tx_info->xdpf to NULL with the justification that there's no need
to set tx_info->xdpf to NULL and tx_info->num_of_bufs to 0 in case
of a mapping error. Both values won't be used once the mapping function
returns an error, and their values would be overridden by the next
transmitted packet.

While both values do indeed get overridden in the next transmission
call, the value of tx_info->xdpf is also used to check whether a TX
descriptor's transmission has been completed (i.e. a completion for it
was polled).

An example scenario:
1. Mapping failed, tx_info->xdpf wasn't set to NULL
2. A VF reset occurred leading to IO resource destruction and
   a call to ena_free_tx_bufs() function
3. Although the descriptor whose mapping failed was freed by the
   transmission function, it still passes the check
     if (!tx_info->skb)

   (skb and xdp_frame are in a union)
4. The xdp_frame associated with the descriptor is freed twice

This patch returns the assignment of NULL to tx_info->xdpf to make the
cleaning function knows that the descriptor is already freed.

Fixes: 504fd6a5390c ("net: ena: fix DMA mapping function issues in XDP")
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

David Arinzon [Wed, 10 Apr 2024 09:13:57 +0000 (09:13 +0000)]

net: ena: Fix incorrect descriptor free behavior

ENA has two types of TX queues:
- queues which only process TX packets arriving from the network stack
- queues which only process TX packets forwarded to it by XDP_REDIRECT
or XDP_TX instructions

The ena_free_tx_bufs() cycles through all descriptors in a TX queue
and unmaps + frees every descriptor that hasn't been acknowledged yet
by the device (uncompleted TX transactions).
The function assumes that the processed TX queue is necessarily from
the first category listed above and ends up using napi_consume_skb()
for descriptors belonging to an XDP specific queue.

This patch solves a bug in which, in case of a VF reset, the
descriptors aren't freed correctly, leading to crashes.

Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action")
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

David Arinzon [Wed, 10 Apr 2024 09:13:56 +0000 (09:13 +0000)]

net: ena: Wrong missing IO completions check order

Missing IO completions check is called every second (HZ jiffies).
This commit fixes several issues with this check:

1. Duplicate queues check:
   Max of 4 queues are scanned on each check due to monitor budget.
   Once reaching the budget, this check exits under the assumption that
   the next check will continue to scan the remainder of the queues,
   but in practice, next check will first scan the last already scanned
   queue which is not necessary and may cause the full queue scan to
   last a couple of seconds longer.
   The fix is to start every check with the next queue to scan.
   For example, on 8 IO queues:
   Bug: [0,1,2,3], [3,4,5,6], [6,7]
   Fix: [0,1,2,3], [4,5,6,7]

2. Unbalanced queues check:
   In case the number of active IO queues is not a multiple of budget,
   there will be checks which don't utilize the full budget
   because the full scan exits when reaching the last queue id.
   The fix is to run every TX completion check with exact queue budget
   regardless of the queue id.
   For example, on 7 IO queues:
   Bug: [0,1,2,3], [4,5,6], [0,1,2,3]
   Fix: [0,1,2,3], [4,5,6,0], [1,2,3,4]
   The budget may be lowered in case the number of IO queues is less
   than the budget (4) to make sure there are no duplicate queues on
   the same check.
   For example, on 3 IO queues:
   Bug: [0,1,2,0], [1,2,0,1]
   Fix: [0,1,2], [0,1,2]

Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

David Arinzon [Wed, 10 Apr 2024 09:13:55 +0000 (09:13 +0000)]

net: ena: Fix potential sign extension issue

Small unsigned types are promoted to larger signed types in
the case of multiplication, the result of which may overflow.
In case the result of such a multiplication has its MSB
turned on, it will be sign extended with '1's.
This changes the multiplication result.

Code example of the phenomenon:
-------------------------------
u16 x, y;
size_t z1, z2;

x = y = 0xffff;
printk("x=%x y=%x\n",x,y);

z1 = x*y;
z2 = (size_t)x*y;

printk("z1=%lx z2=%lx\n", z1, z2);

Output:
-------
x=ffff y=ffff
z1=fffffffffffe0001 z2=fffe0001

The expected result of ffff*ffff is fffe0001, and without the
explicit casting to avoid the unwanted sign extension we got
fffffffffffe0001.

This commit adds an explicit casting to avoid the sign extension
issue.

Fixes: 689b2bdaaa14 ("net: ena: add functions for handling Low Latency Queues in ena_com")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Paolo Abeni [Thu, 11 Apr 2024 08:42:43 +0000 (10:42 +0200)]

Merge tag 'for-net-2024-04-10' of git://git./linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

  - L2CAP: Don't double set the HCI_CONN_MGMT_CONNECTED bit
  - Fix memory leak in hci_req_sync_complete
  - hci_sync: Fix using the same interval and window for Coded PHY
  - Fix not validating setsockopt user input

* tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
  Bluetooth: hci_sock: Fix not validating setsockopt user input
  Bluetooth: ISO: Fix not validating setsockopt user input
  Bluetooth: L2CAP: Fix not validating setsockopt user input
  Bluetooth: RFCOMM: Fix not validating setsockopt user input
  Bluetooth: SCO: Fix not validating setsockopt user input
  Bluetooth: Fix memory leak in hci_req_sync_complete()
  Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
  Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
====================

Link: https://lore.kernel.org/r/20240410191610.4156653-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Michal Luczaj [Tue, 9 Apr 2024 20:09:39 +0000 (22:09 +0200)]

af_unix: Fix garbage collector racing against connect()

Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
---------------- ------------------------- -----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
// V count=1 inflight=0

NS = unix_peer(S)
skb2 = sock_alloc()
skb_queue_tail(NS, skb2[V])

// V became in-flight
// V count=2 inflight=1

close(V)

// V count=1 inflight=1
// GC candidate condition met

for u in gc_inflight_list:
  if (total_refs == inflight_refs)
    add u to gc_candidates

// gc_candidates={L, V}

for u in gc_candidates:
  scan_children(u, dec_inflight)

// embryo (skb1) was not
// reachable from L yet, so V's
// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
for u in gc_candidates:
  if (u.inflight)
    scan_children(u, inc_inflight_move_tail)

// V count=1 inflight=2 (!)

If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
this point, unix_inflight() can not happen because unix_gc_lock is already
taken. Inflight graph remains unaffected.

Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Arınç ÜNAL [Tue, 9 Apr 2024 15:01:14 +0000 (18:01 +0300)]

net: dsa: mt7530: trap link-local frames regardless of ST Port State

In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
(DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
are described; the medium access control (MAC) and logical link control
(LLC) sublayers. The MAC sublayer is the one facing the physical layer.

In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
Bridge component comprises a MAC Relay Entity for interconnecting the Ports
of the Bridge, at least two Ports, and higher layer entities with at least
a Spanning Tree Protocol Entity included.

Each Bridge Port also functions as an end station and shall provide the MAC
Service to an LLC Entity. Each instance of the MAC Service is provided to a
distinct LLC Entity that supports protocol identification, multiplexing,
and demultiplexing, for protocol data unit (PDU) transmission and reception
by one or more higher layer entities.

It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
Entity associated with each Bridge Port is modeled as being directly
connected to the attached Local Area Network (LAN).

On the switch with CPU port architecture, CPU port functions as Management
Port, and the Management Port functionality is provided by software which
functions as an end station. Software is connected to an IEEE 802 LAN that
is wholly contained within the system that incorporates the Bridge.
Software provides access to the LLC Entity associated with each Bridge Port
by the value of the source port field on the special tag on the frame
received by software.

We call frames that carry control information to determine the active
topology and current extent of each Virtual Local Area Network (VLAN),
i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
Registration Protocol Data Units (MVRPDUs), and frames from other link
constrained protocols, such as Extensible Authentication Protocol over LAN
(EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
are not forwarded by a Bridge. Permanently configured entries in the
filtering database (FDB) ensure that such frames are discarded by the
Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
detail:

Each of the reserved MAC addresses specified in Table 8-1
(01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
permanently configured in the FDB in C-VLAN components and ERs.

Each of the reserved MAC addresses specified in Table 8-2
(01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
configured in the FDB in S-VLAN components.

Each of the reserved MAC addresses specified in Table 8-3
(01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
in TPMR components.

The FDB entries for reserved MAC addresses shall specify filtering for all
Bridge Ports and all VIDs. Management shall not provide the capability to
modify or remove entries for reserved MAC addresses.

The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
propagation of PDUs within a Bridged Network, as follows:

  The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
  no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
  component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
  PDUs transmitted using this destination address, or any other addresses
  that appear in Table 8-1, Table 8-2, and Table 8-3
  (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
  therefore travel no further than those stations that can be reached via a
  single individual LAN from the originating station.

  The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
  address that no conformant S-VLAN component, C-VLAN component, or MAC
  Bridge can forward; however, this address is relayed by a TPMR component.
  PDUs using this destination address, or any of the other addresses that
  appear in both Table 8-1 and Table 8-2 but not in Table 8-3
  (01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
  by any TPMRs but will propagate no further than the nearest S-VLAN
  component, C-VLAN component, or MAC Bridge.

  The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
  address that no conformant C-VLAN component, MAC Bridge can forward;
  however, it is relayed by TPMR components and S-VLAN components. PDUs
  using this destination address, or any of the other addresses that appear
  in Table 8-1 but not in either Table 8-2 or Table 8-3
  (01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
  S-VLAN components but will propagate no further than the nearest C-VLAN
  component or MAC Bridge.

Because the LLC Entity associated with each Bridge Port is provided via CPU
port, we must not filter these frames but forward them to CPU port.

In a Bridge, the transmission Port is majorly decided by ingress and egress
rules, FDB, and spanning tree Port State functions of the Forwarding
Process. For link-local frames, only CPU port should be designated as
destination port in the FDB, and the other functions of the Forwarding
Process must not interfere with the decision of the transmission Port. We
call this process trapping frames to CPU port.

Therefore, on the switch with CPU port architecture, link-local frames must
be trapped to CPU port, and certain link-local frames received by a Port of
a Bridge comprising a TPMR component or an S-VLAN component must be
excluded from it.

A Bridge of the switch with CPU port architecture cannot comprise a
Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
(Management Port doesn't count) of this architecture will either function
as a standard MAC Bridge or a standard VLAN Bridge.

Therefore, a Bridge of this architecture can only comprise S-VLAN
components, C-VLAN components, or MAC Bridge components. Since there's no
TPMR component, we don't need to relay PDUs using the destination addresses
specified on the Nearest non-TPMR section, and the proportion of the
Nearest Customer Bridge section where they must be relayed by TPMR
components.

One option to trap link-local frames to CPU port is to add static FDB
entries with CPU port designated as destination port. However, because that
Independent VLAN Learning (IVL) is being used on every VID, each entry only
applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
Bridge component or a C-VLAN component, there would have to be 16 times
4096 entries. This switch intellectual property can only hold a maximum of
2048 entries. Using this option, there also isn't a mechanism to prevent
link-local frames from being discarded when the spanning tree Port State of
the reception Port is discarding.

The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
registers. Whilst this applies to every VID, it doesn't contain all of the
reserved MAC addresses without affecting the remaining Standard Group MAC
Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
The latter option provides better but not complete conformance.

This switch intellectual property also does not provide a mechanism to trap
link-local frames with specific destination addresses to CPU port by
Bridge, to conform to the filtering rules for the distinct Bridge
components.

Therefore, regardless of the type of the Bridge component, link-local
frames with these destination addresses will be trapped to CPU port:

01-80-C2-00-00-[00,01,02,03,0E]

In a Bridge comprising a MAC Bridge component or a C-VLAN component:

  Link-local frames with these destination addresses won't be trapped to
  CPU port which won't conform to IEEE Std 802.1Q-2022:

  01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]

In a Bridge comprising an S-VLAN component:

  Link-local frames with these destination addresses will be trapped to CPU
  port which won't conform to IEEE Std 802.1Q-2022:

  01-80-C2-00-00-00

  Link-local frames with these destination addresses won't be trapped to
  CPU port which won't conform to IEEE Std 802.1Q-2022:

  01-80-C2-00-00-[04,05,06,07,08,09,0A]

Currently on this switch intellectual property, if the spanning tree Port
State of the reception Port is discarding, link-local frames will be
discarded.

To trap link-local frames regardless of the spanning tree Port State, make
the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
intellectual property only lets the frames regarded as BPDUs bypass the
spanning tree Port State function of the Forwarding Process.

With this change, the only remaining interference is the ingress rules.
When the reception Port has no PVID assigned on software, VLAN-untagged
frames won't be allowed in. There doesn't seem to be a mechanism on the
switch intellectual property to have link-local frames bypass this function
of the Forwarding Process.

Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Link: https://lore.kernel.org/r/20240409-b4-for-net-mt7530-fix-link-local-when-stp-discarding-v2-1-07b1150164ac@arinc9.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Gerd Bayer [Tue, 9 Apr 2024 11:37:53 +0000 (13:37 +0200)]

Revert "s390/ism: fix receive message buffer allocation"

This reverts commit 58effa3476536215530c9ec4910ffc981613b413.
Review was not finished on this patch. So it's not ready for
upstreaming.

Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Link: https://lore.kernel.org/r/20240409113753.2181368-1-gbayer@linux.ibm.com
Fixes: 58effa347653 ("s390/ism: fix receive message buffer allocation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Machon [Tue, 9 Apr 2024 10:41:59 +0000 (12:41 +0200)]

net: sparx5: fix wrong config being used when reconfiguring PCS

The wrong port config is being used if the PCS is reconfigured. Fix this
by correctly using the new config instead of the old one.

Fixes: 946e7fd5053a ("net: sparx5: add port module support")
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409-link-mode-reconfiguration-fix-v2-1-db6a507f3627@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Arnd Bergmann [Mon, 8 Apr 2024 07:41:10 +0000 (09:41 +0200)]

net/mlx5: fix possible stack overflows

A couple of debug functions use a 512 byte temporary buffer and call another
function that has another buffer of the same size, which in turn exceeds the
usual warning limit for excessive stack usage:

drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1073:1: error: stack frame size (1448) exceeds limit (1024) in 'dr_dump_start' [-Werror,-Wframe-larger-than]
dr_dump_start(struct seq_file *file, loff_t *pos)
drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1009:1: error: stack frame size (1120) exceeds limit (1024) in 'dr_dump_domain' [-Werror,-Wframe-larger-than]
dr_dump_domain(struct seq_file *file, struct mlx5dr_domain *dmn)
drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:705:1: error: stack frame size (1104) exceeds limit (1024) in 'dr_dump_matcher_rx_tx' [-Werror,-Wframe-larger-than]
dr_dump_matcher_rx_tx(struct seq_file *file, bool is_rx,

Rework these so that each of the various code paths only ever has one of
these buffers in it, and exactly the functions that declare one have
the 'noinline_for_stack' annotation that prevents them from all being
inlined into the same caller.

Fixes: 917d1e799ddf ("net/mlx5: DR, Change SWS usage to debug fs seq_file interface")
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/all/20240219100506.648089-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240408074142.3007036-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 11 Apr 2024 02:55:11 +0000 (19:55 -0700)]

Merge branch 'bnxt_en-updates-for-net-next'

Michael Chan says:

====================
bnxt_en: Updates for net-next

The first patch prevents a driver crash when RSS contexts are
configred in ifdown state.  Patches 2 to 6 are improvements for
managing MSIX for the aux device (for RoCE).  The existing
scheme statically carves out the MSIX vectors for RoCE even if
the RoCE driver is not loaded.  The new scheme adds flexibility
and allows the L2 driver to use the RoCE MSIX vectors if needed
when they are unused by the RoCE driver.  The last patch updates
the MODULE_DESCRIPTION().
====================

Link: https://lore.kernel.org/r/20240409215431.41424-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Michael Chan [Tue, 9 Apr 2024 21:54:31 +0000 (14:54 -0700)]

bnxt_en: Update MODULE_DESCRIPTION

Update MODULE_DESCRIPTION to the more generic adapter family name.
The old name only includes the first generation of supported
adapters.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-8-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Vikas Gupta [Tue, 9 Apr 2024 21:54:30 +0000 (14:54 -0700)]

bnxt_en: Utilize ulp client resources if RoCE is not registered

If the RoCE driver is not registered for a RoCE capable device, add
flexibility to use the RoCE resources (MSIX/NQs) for L2 purposes,
such as additional rings configured by the user or for XDP.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Vikas Gupta [Tue, 9 Apr 2024 21:54:29 +0000 (14:54 -0700)]

bnxt_en: Change MSIX/NQs allocation policy

The existing scheme sets aside a number of MSIX/NQs for the RoCE
driver whether the RoCE driver is registered or not.  This scheme
is not flexible and limits the resources available for the L2 rings
if RoCE is never used.

Modify the scheme so that the RoCE MSIX/NQs can be used by the L2
driver if they are not used for RoCE.  The MSIX/NQs are now
represented by 3 fields.  bp->ulp_num_msix_want contains the
desired default value, edev->ulp_num_msix_vec contains the
available value (but not necessarily in use), and
ulp_tbl->msix_requested contains the actual value in use by RoCE.

The L2 driver can dip into edev->ulp_num_msix_vec if necessary.

We need to add rtnl_lock() back in bnxt_register_dev() and
bnxt_unregister_dev() to synchronize the MSIX usage between L2 and
RoCE.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Vikas Gupta [Tue, 9 Apr 2024 21:54:28 +0000 (14:54 -0700)]

bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions

In its current form, bnxt_rdma_aux_device_init() not only initializes
the necessary data structures of the newly created aux device but also
adds the aux device into the aux bus subsytem. Refactor the logic into
separate functions, first function to initialize the aux device along
with the required resources and second, to actually add the device to
the aux bus subsytem.
This separation helps to create bnxt_en_dev much earlier and save its
resources separately.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Vikas Gupta [Tue, 9 Apr 2024 21:54:27 +0000 (14:54 -0700)]

bnxt_en: Remove unneeded MSIX base structure fields and code

Ever since commit:

303432211324 ("bnxt_en: Remove runtime interrupt vector allocation")

The MSIX base vector is effectively always 0. Remove all unneeded
structure fields and code referencing the MSIX base.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Kalesh AP [Tue, 9 Apr 2024 21:54:26 +0000 (14:54 -0700)]

bnxt_en: Remove a redundant NULL check in bnxt_register_dev()

The memory for "edev->ulp_tbl" is allocated inside the
bnxt_rdma_aux_device_init() function. If it fails, the driver
will not create the auxiliary device for RoCE. Hence the NULL
check inside bnxt_register_dev() is unnecessary.

Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Pavan Chebbi [Tue, 9 Apr 2024 21:54:25 +0000 (14:54 -0700)]

bnxt_en: Skip ethtool RSS context configuration in ifdown state

The current implementation requires the ifstate to be up when
configuring the RSS contexts. It will try to fetch the RX ring
IDs and will crash if it is in ifdown state. Return error if
!netif_running() to prevent the crash.

An improved implementation is in the works to allow RSS contexts
to be changed while in ifdown state.

Fixes: b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240409215431.41424-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 11 Apr 2024 02:48:16 +0000 (19:48 -0700)]

Merge branch 'mlx5-misc-fixes'

Tariq Toukan says:

====================
mlx5 misc fixes

This patchset provides bug fixes to mlx5 driver.

This is V2 of the series previously submitted as PR by Saeed:
https://lore.kernel.org/netdev/20240326144646.2078893-1-saeed@kernel.org/T/

Series generated against:
commit 237f3cf13b20 ("xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING")
====================

Link: https://lore.kernel.org/r/20240409190820.227554-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Tariq Toukan [Tue, 9 Apr 2024 19:08:19 +0000 (22:08 +0300)]

net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev

Adaptations need to be made for the auxiliary device management in the
core driver level. Block this combination for now.

Fixes: 678eb448055a ("net/mlx5: SD, Implement basic query and instantiation")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Carolina Jubran [Tue, 9 Apr 2024 19:08:18 +0000 (22:08 +0300)]

net/mlx5e: RSS, Block XOR hash with over 128 channels

When supporting more than 128 channels, the RQT size is
calculated by multiplying the number of channels by 2
and rounding up to the nearest power of 2.

The index of the RQT is derived from the RSS hash
calculations. If XOR8 is used as the RSS hash function,
there are only 256 possible hash results, and therefore,
only 256 indexes can be reached in the RQT.

Block setting the RSS hash function to XOR when the number
of channels exceeds 128.

Fixes: 74a8dadac17e ("net/mlx5e: Preparations for supporting larger number of channels")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Rahul Rameshbabu [Tue, 9 Apr 2024 19:08:17 +0000 (22:08 +0300)]

net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit

Free Tx port timestamping metadata entries in the NAPI poll context and
consume metadata enties in the WQE xmit path. Do not free a Tx port
timestamping metadata entry in the WQE xmit path even in the error path to
avoid a race between two metadata entry producers.

Fixes: 3178308ad4ca ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Carolina Jubran [Tue, 9 Apr 2024 19:08:16 +0000 (22:08 +0300)]

net/mlx5e: HTB, Fix inconsistencies with QoS SQs number

When creating a new HTB class while the interface is down,
the variable that follows the number of QoS SQs (htb_max_qos_sqs)
may not be consistent with the number of HTB classes.

Previously, we compared these two values to ensure that
the node_qid is lower than the number of QoS SQs, and we
allocated stats for that SQ when they are equal.

Change the check to compare the node_qid with the current
number of leaf nodes and fix the checking conditions to
ensure allocation of stats_list and stats for each node.

Fixes: 214baf22870c ("net/mlx5e: Support HTB offload")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Carolina Jubran [Tue, 9 Apr 2024 19:08:15 +0000 (22:08 +0300)]

net/mlx5e: Fix mlx5e_priv_init() cleanup flow

When mlx5e_priv_init() fails, the cleanup flow calls mlx5e_selq_cleanup which
calls mlx5e_selq_apply() that assures that the `priv->state_lock` is held using
lockdep_is_held().

Acquire the state_lock in mlx5e_selq_cleanup().

Kernel log:
=============================
WARNING: suspicious RCU usage
6.8.0-rc3_net_next_841a9b5 #1 Not tainted
-----------------------------
drivers/net/ethernet/mellanox/mlx5/core/en/selq.c:124 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
2 locks held by systemd-modules/293:
#0: ffffffffa05067b0 (devices_rwsem){++++}-{3:3}, at: ib_register_client+0x109/0x1b0 [ib_core]
#1: ffff8881096c65c0 (&device->client_data_rwsem){++++}-{3:3}, at: add_client_context+0x104/0x1c0 [ib_core]

stack backtrace:
CPU: 4 PID: 293 Comm: systemd-modules Not tainted 6.8.0-rc3_net_next_841a9b5 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x8a/0xa0
lockdep_rcu_suspicious+0x154/0x1a0
mlx5e_selq_apply+0x94/0xa0 [mlx5_core]
mlx5e_selq_cleanup+0x3a/0x60 [mlx5_core]
mlx5e_priv_init+0x2be/0x2f0 [mlx5_core]
mlx5_rdma_setup_rn+0x7c/0x1a0 [mlx5_core]
rdma_init_netdev+0x4e/0x80 [ib_core]
? mlx5_rdma_netdev_free+0x70/0x70 [mlx5_core]
ipoib_intf_init+0x64/0x550 [ib_ipoib]
ipoib_intf_alloc+0x4e/0xc0 [ib_ipoib]
ipoib_add_one+0xb0/0x360 [ib_ipoib]
add_client_context+0x112/0x1c0 [ib_core]
ib_register_client+0x166/0x1b0 [ib_core]
? 0xffffffffa0573000
ipoib_init_module+0xeb/0x1a0 [ib_ipoib]
do_one_initcall+0x61/0x250
do_init_module+0x8a/0x270
init_module_from_file+0x8b/0xd0
idempotent_init_module+0x17d/0x230
__x64_sys_finit_module+0x61/0xb0
do_syscall_64+0x71/0x140
entry_SYSCALL_64_after_hwframe+0x46/0x4e
</TASK>

Fixes: 8bf30be75069 ("net/mlx5e: Introduce select queue parameters")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Carolina Jubran [Tue, 9 Apr 2024 19:08:14 +0000 (22:08 +0300)]

net/mlx5e: RSS, Block changing channels number when RXFH is configured

Changing the channels number after configuring the receive flow hash
indirection table may affect the RSS table size. The previous
configuration may no longer be compatible with the new receive flow
hash indirection table.

Block changing the channels number when RXFH is configured and changing
the channels number requires resizing the RSS table size.

Fixes: 74a8dadac17e ("net/mlx5e: Preparations for supporting larger number of channels")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Cosmin Ratiu [Tue, 9 Apr 2024 19:08:13 +0000 (22:08 +0300)]

net/mlx5: Correctly compare pkt reformat ids

struct mlx5_pkt_reformat contains a naked union of a u32 id and a
dr_action pointer which is used when the action is SW-managed (when
pkt_reformat.owner is set to MLX5_FLOW_RESOURCE_OWNER_SW). Using id
directly in that case is incorrect, as it maps to the least significant
32 bits of the 64-bit pointer in mlx5_fs_dr_action and not to the pkt
reformat id allocated in firmware.

For the purpose of comparing whether two rules are identical,
interpreting the least significant 32 bits of the mlx5_fs_dr_action
pointer as an id mostly works... until it breaks horribly and produces
the outcome described in [1].

This patch fixes mlx5_flow_dests_cmp to correctly compare ids using
mlx5_fs_dr_action_get_pkt_reformat_id for the SW-managed rules.

Link: https://lore.kernel.org/netdev/ea5264d6-6b55-4449-a602-214c6f509c1e@163.com/T/#u
Fixes: 6a48faeeca10 ("net/mlx5: Add direct rule fs_cmd implementation")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Cosmin Ratiu [Tue, 9 Apr 2024 19:08:12 +0000 (22:08 +0300)]

net/mlx5: Properly link new fs rules into the tree

Previously, add_rule_fg would only add newly created rules from the
handle into the tree when they had a refcount of 1. On the other hand,
create_flow_handle tries hard to find and reference already existing
identical rules instead of creating new ones.

These two behaviors can result in a situation where create_flow_handle
1) creates a new rule and references it, then
2) in a subsequent step during the same handle creation references it
again,
resulting in a rule with a refcount of 2 that is not linked into the
tree, will have a NULL parent and root and will result in a crash when
the flow group is deleted because del_sw_hw_rule, invoked on rule
deletion, assumes node->parent is != NULL.

This happened in the wild, due to another bug related to incorrect
handling of duplicate pkt_reformat ids, which lead to the code in
create_flow_handle incorrectly referencing a just-added rule in the same
flow handle, resulting in the problem described above. Full details are
at [1].

This patch changes add_rule_fg to add new rules without parents into
the tree, properly initializing them and avoiding the crash. This makes
it more consistent with how rules are added to an FTE in
create_flow_handle.

Fixes: 74491de93712 ("net/mlx5: Add multi dest support")
Link: https://lore.kernel.org/netdev/ea5264d6-6b55-4449-a602-214c6f509c1e@163.com/T/#u
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Michael Liang [Tue, 9 Apr 2024 19:08:11 +0000 (22:08 +0300)]

net/mlx5: offset comp irq index in name by one

The mlx5 comp irq name scheme is changed a little bit between
commit 3663ad34bc70 ("net/mlx5: Shift control IRQ to the last index")
and commit 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation").
The index in the comp irq name used to start from 0 but now it starts
from 1. There is nothing critical here, but it's harmless to change
back to the old behavior, a.k.a starting from 0.

Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Shay Drory [Tue, 9 Apr 2024 19:08:10 +0000 (22:08 +0300)]

net/mlx5: Register devlink first under devlink lock

In case device is having a non fatal FW error during probe, the
driver will report the error to user via devlink. This will trigger
a WARN_ON, since mlx5 is calling devlink_register() last.
In order to avoid the WARN_ON[1], change mlx5 to invoke devl_register()
first under devlink lock.

[1]
WARNING: CPU: 5 PID: 227 at net/devlink/health.c:483 devlink_recover_notify.constprop.0+0xb8/0xc0
CPU: 5 PID: 227 Comm: kworker/u16:3 Not tainted 6.4.0-rc5_for_upstream_min_debug_2023_06_12_12_38 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5_health0000:08:00.0 mlx5_fw_reporter_err_work [mlx5_core]
RIP: 0010:devlink_recover_notify.constprop.0+0xb8/0xc0
Call Trace:
<TASK>
? __warn+0x79/0x120
? devlink_recover_notify.constprop.0+0xb8/0xc0
? report_bug+0x17c/0x190
? handle_bug+0x3c/0x60
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
? devlink_recover_notify.constprop.0+0xb8/0xc0
devlink_health_report+0x4a/0x1c0
mlx5_fw_reporter_err_work+0xa4/0xd0 [mlx5_core]
process_one_work+0x1bb/0x3c0
? process_one_work+0x3c0/0x3c0
worker_thread+0x4d/0x3c0
? process_one_work+0x3c0/0x3c0
kthread+0xc6/0xf0
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

Fixes: cf530217408e ("devlink: Notify users when objects are accessible")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Shay Drory [Tue, 9 Apr 2024 19:08:09 +0000 (22:08 +0300)]

net/mlx5: E-switch, store eswitch pointer before registering devlink_param

Next patch will move devlink register to be first. Therefore, whenever
mlx5 will register a param, the user will be notified.
In order to notify the user, devlink is using the get() callback of
the param. Hence, resources that are being used by the get() callback
must be set before the devlink param is registered.

Therefore, store eswitch pointer inside mdev before registering the
param.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240409190820.227554-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 02:48:05 +0000 (19:48 -0700)]

Merge tag 'probes-fixes-v6.9-rc3' of git://git./linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:
"Fix possible use-after-free issue on kprobe registration.

  check_kprobe_address_safe() uses `is_module_text_address()` and
  `__module_text_address()` separately.

  As a result, if the probed address is in a module that is being
  unloaded, the first `is_module_text_address()` might return true but
  then the `__module_text_address()` call might return NULL if the
  module has been unloaded between the two.

  The result is that kprobe believes the probe is on the kernel text,
  and skips getting a module reference. In this case, when it arms a
  breakpoint on the probe address, it may cause a use-after-free.

  To fix this issue, only use `__module_text_address()` once and get a
  reference to the module then. If it fails, reject the probe"

* tag 'probes-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Fix possible use-after-free issue on kprobe registration

commit | commitdiff | tree

Ido Schimmel [Tue, 9 Apr 2024 13:22:14 +0000 (15:22 +0200)]

mlxsw: spectrum_ethtool: Add support for 100Gb/s per lane link modes

The Spectrum-4 ASIC supports 100Gb/s per lane link modes, but the only
one currently supported by the driver is 800Gb/s over eight lanes.

Add support for 100Gb/s over one lane, 200Gb/s over two lanes and
400Gb/s over four lanes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/1d77830f6abcc4f0d57a7f845e5a6d97a75a434b.1712667750.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Tue, 9 Apr 2024 12:07:41 +0000 (12:07 +0000)]

netfilter: complete validation of user input

In my recent commit, I missed that do_replace() handlers
use copy_from_sockptr() (which I fixed), followed
by unsafe copy_from_sockptr_offset() calls.

In all functions, we can perform the @optlen validation
before even calling xt_alloc_table_info() with the following
check:

if ((u64)optlen < (u64)tmp.size + sizeof(tmp))
return -EINVAL;

Fixes: 0c83842df40f ("netfilter: validate user input for expected length")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240409120741.3538135-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Linus Torvalds [Thu, 11 Apr 2024 02:42:45 +0000 (19:42 -0700)]

Merge tag 'bootconfig-fixes-v6.9-rc3' of git://git./linux/kernel/git/trace/linux-trace

Pull bootconfig fixes from Masami Hiramatsu:

- show the original cmdline only once, and only if it was modeified by
   bootconfig

* tag 'bootconfig-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fs/proc: Skip bootloader comment if no embedded kernel parameters
  fs/proc: remove redundant comments from /proc/bootconfig

commit | commitdiff | tree

Ido Schimmel [Tue, 9 Apr 2024 11:08:16 +0000 (14:08 +0300)]

selftests: fib_rule_tests: Add VRF tests

After commit 40867d74c374 ("net: Add l3mdev index to flow struct and
avoid oif reset for port devices") it is possible to configure FIB rules
that match on iif / oif being a l3mdev port. It was not possible before
as these parameters were reset to the ifindex of the l3mdev device
itself prior to the FIB rules lookup.

Add tests that cover this functionality as it does not seem to be
covered by existing ones and I am aware of at least one user that needs
this functionality in addition to the one mentioned in [1].

Reuse the existing FIB rules tests by simply configuring a VRF prior to
the test and removing it afterwards. Differentiate the output of the
non-VRF tests from the VRF tests by appending "(VRF)" to the test name
if a l3mdev FIB rule is present.

Verified that these tests do fail on kernel 5.15.y which does not
include the previously mentioned commit:

# ./fib_rule_tests.sh -t fib_rule6_vrf
[...]
     TEST: rule6 check: oif redirect to table (VRF)                      [FAIL]
[...]
     TEST: rule6 check: iif redirect to table (VRF)                      [FAIL]

# ./fib_rule_tests.sh -t fib_rule4_vrf
[...]
     TEST: rule4 check: oif redirect to table (VRF)                      [FAIL]
[...]
     TEST: rule4 check: iif redirect to table (VRF)                      [FAIL]

[1] https://lore.kernel.org/netdev/20200922131122.GB1601@ICIPI.localdomain/

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240409110816.2508498-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Hangbin Liu [Tue, 9 Apr 2024 09:28:12 +0000 (17:28 +0800)]

net: team: fix incorrect maxattr

The maxattr should be the latest attr value, i.e. array size - 1,
not total array size.

Reported-by: syzbot+ecd7e07b4be038658c9f@syzkaller.appspotmail.com
Fixes: 948dbafc15da ("net: team: use policy generated by YAML spec")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240409092812.3999785-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Uwe Kleine-König [Tue, 9 Apr 2024 09:12:04 +0000 (11:12 +0200)]

net: wan: fsl_qmc_hdlc: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Herve Codina <herve.codina@bootlin.com>
Link: https://lore.kernel.org/r/20240409091203.39062-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Hangbin Liu [Tue, 9 Apr 2024 08:35:04 +0000 (16:35 +0800)]

doc/netlink/specs: Add bond support to rt_link.yaml

Add bond support to rt_link.yaml. Here is an example output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \
   --do getlink --json '{"ifname": "bond0"}' --output-json | jq '.linkinfo'
{
   "kind": "bond",
   "data": {
     "mode": 4,
     "miimon": 100,
     ...
     "arp-interval": 0,
     "arp-ip-target": [
       "192.168.1.1",
       "192.168.1.2"
     ],
     "arp-validate": 0,
     "arp-all-targets": 0,
     "ns-ip6-target": [
       "2001::1",
       "2001::2"
     ],
     "primary-reselect": 0,
     ...
     "missed-max": 2,
     "ad-info": {
       "aggregator": 1,
       "num-ports": 1,
       "actor-key": 0,
       "partner-key": 1,
       "partner-mac": "00:00:00:00:00:00"
     }
   }
}

And here is the downlink info.

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \
   --do getlink --json '{"ifname": "dummy0"}' --output-json | jq '.linkinfo'
{
   "kind": "dummy",
   "slave-kind": "bond",
   "slave-data": {
     "state": 0,
     "mii-status": 0,
     "link-failure-count": 0,
     "perm-hwaddr": "f2:82:f7:cc:47:13",
     "queue-id": 0,
     "prio": 0
   }
}

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240409083504.3900877-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Rahul Rameshbabu [Tue, 9 Apr 2024 23:25:16 +0000 (16:25 -0700)]

ethtool: update tsinfo statistics attribute docs with correct type

nla_put_uint can either write a u32 or u64 netlink attribute value. The
size depends on whether the value can be represented with a u32 or requires
a u64. Use a uint annotation in various documentation to represent this.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20240409232520.237613-2-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Kent Overstreet [Wed, 10 Apr 2024 05:30:22 +0000 (01:30 -0400)]

bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()

We weren't respecting trans->journal_replay_not_finished - we shouldn't
be searching the journal keys unless we have a ref on them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 10 Apr 2024 04:10:18 +0000 (00:10 -0400)]

bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()

dropping read locks in bch2_btree_node_lock_write_nofail() dates from
before we had the cycle detector; we can now tell the cycle detector
directly when taking a lock may not fail because we can't handle
transaction restarts.

This is needed for adding should_be_locked asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Kent Overstreet [Wed, 10 Apr 2024 16:53:28 +0000 (12:53 -0400)]

bcachefs: Fix a race in btree_update_nodes_written()

One btree update might have terminated in a node update, and then while
it is in flight another btree update might free that original node.

This race has to be handled in btree_update_nodes_written() - we were
missing a READ_ONCE().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

commit | commitdiff | tree

Jakub Kicinski [Thu, 11 Apr 2024 02:27:34 +0000 (19:27 -0700)]

Merge branch 'optimise-local-cpu-skb_attempt_defer_free'

Pavel Begunkov says:

====================
optimise local CPU skb_attempt_defer_free

Optimise the case when an skb comes to skb_attempt_defer_free()
on the same CPU it was allocated on. The patch 1 enables skb caches
and gives frags a chance to hit the page pool's fast path.
CPU bound benchmarking with perfect skb_attempt_defer_free()
gives around 1% of extra throughput.
====================

Link: https://lore.kernel.org/r/cover.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Pavel Begunkov [Wed, 10 Apr 2024 01:28:10 +0000 (02:28 +0100)]

net: use SKB_CONSUMED in skb_attempt_defer_free()

skb_attempt_defer_free() is used to free already processed skbs, so pass
SKB_CONSUMED as the reason in kfree_skb_napi_cache().

Suggested-by: Jason Xing <kerneljasonxing@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/bcf5dbdda79688b074ab7ae2238535840a6d3fc2.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Pavel Begunkov [Wed, 10 Apr 2024 01:28:09 +0000 (02:28 +0100)]

net: cache for same cpu skb_attempt_defer_free

Optimise skb_attempt_defer_free() when run by the same CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
I'd expect the win doubled with rx only benchmarks, as the optimisation
is for the receive path, but the test spends >55% of CPU doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/a887463fb219d973ec5ad275e31194812571f1f5.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Heiner Kallweit [Wed, 10 Apr 2024 13:11:28 +0000 (15:11 +0200)]

r8169: add missing conditional compiling for call to r8169_remove_leds

Add missing dependency on CONFIG_R8169_LEDS. As-is a link error occurs
if config option CONFIG_R8169_LEDS isn't enabled.

Fixes: 19fa4f2a85d7 ("r8169: fix LED-related deadlock on module removal")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-By: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/d080038c-eb6b-45ac-9237-b8c1cdd7870f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Linux 6.x block layer and io_uring tree(s)

RSS Atom