git.kernel.dk Git - linux-block.git/log

Bharat Bhushan [Wed, 4 Dec 2024 05:56:54 +0000 (11:26 +0530)]

octeontx2-af: Disable backpressure between CPT and NIX

NIX can assert backpressure to CPT on the NIX<=>CPT link.
Keep the backpressure disabled for now. The NIX block already
handles backpressure asserted by the MAC due to PFC or flow
control packets.

Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Bharat Bhushan [Wed, 4 Dec 2024 05:56:53 +0000 (11:26 +0530)]

octeontx2-pf: Move skb fragment map/unmap to common code

Move the skb fragment map/unmap functions to a common file
so that they can be reused for outbound IPsec crypto offload.

Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Bharat Bhushan [Wed, 4 Dec 2024 05:56:52 +0000 (11:26 +0530)]

octeontx2-pf: map skb data as device writeable

Crypto hardware needs write permission for in-place encrypt
or decrypt operations on skb data to support IPsec crypto
offload. This patch uses skb_unshare() to make skb data writeable
for IPsec crypto offload and maps skb fragment memory as
device read-write.

Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Justin Lai [Fri, 6 Dec 2024 08:48:51 +0000 (16:48 +0800)]

rtase: Refine the if statement

Refine the if statement to improve readability.

Signed-off-by: Justin Lai <justinlai0215@realtek.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20241206084851.760475-1-justinlai0215@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Jakub Kicinski [Sun, 8 Dec 2024 01:49:47 +0000 (17:49 -0800)]

Merge branch 'net-net-add-negotiation-of-in-band-capabilities-remainder'

Russell King says:

====================
net: net: add negotiation of in-band capabilities (remainder)

Here are the last three patches which were not included in the non-RFC
posting, but were in the RFC posting. These add the .pcs_inband_caps()
method to the Lynx, MTK LynxI and XPCS drivers.
====================

Link: https://patch.msgid.link/Z1F1b8eh8s8T627j@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Russell King (Oracle) [Thu, 5 Dec 2024 09:42:34 +0000 (09:42 +0000)]

net: pcs: xpcs: implement pcs_inband_caps() method

Report the PCS in-band capabilities to phylink for XPCS.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tJ8NW-006L5V-I9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Russell King (Oracle) [Thu, 5 Dec 2024 09:42:29 +0000 (09:42 +0000)]

net: pcs: pcs-mtk-lynxi: implement pcs_inband_caps() method

Report the PCS in-band capabilities to phylink for the LynxI PCS.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tJ8NR-006L5P-E3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Russell King (Oracle) [Thu, 5 Dec 2024 09:42:24 +0000 (09:42 +0000)]

net: pcs: pcs-lynx: implement pcs_inband_caps() method

Report the PCS in-band capabilities to phylink for the Lynx PCS.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tJ8NM-006L5J-AH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Stas Sergeev [Thu, 5 Dec 2024 07:36:14 +0000 (10:36 +0300)]

tun: fix group permission check

Currently tun checks the group permission even if the user has matched.
Besides going against the usual permission semantics, this has a
very interesting implication: if the tun group is not among the
supplementary groups of the tun user, then effectively no one can
access the tun device. CAP_SYS_ADMIN still can, but it's the same as
not setting the tun ownership.

This patch relaxes the group checking so that either the user match
or the group match is enough. This avoids the situation when no one
can access the device even though the ownership is properly set.

Also I simplified the logic by removing the redundant inversions:
tun_not_capable() --> !tun_capable()

Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241205073614.294773-1-stsp2@yandex.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
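
The relaxed check described in the tun entry above can be modelled outside the kernel. The following is a minimal sketch, not the driver code: the `Caller` and `TunDevice` classes and both function names are hypothetical, the `privileged` flag stands in for the capability override mentioned in the message, and the sketch assumes that both an owner and a group are set on the device.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Caller:
    uid: int
    groups: Set[int]            # primary and supplementary group IDs
    privileged: bool = False    # stand-in for the capability override


@dataclass
class TunDevice:
    owner: int                  # assumption: an owner is configured
    group: int                  # assumption: a group is configured


def old_check(tun: TunDevice, caller: Caller) -> bool:
    """Behaviour described as current: owner AND group must both match."""
    owner_ok = caller.uid == tun.owner
    group_ok = tun.group in caller.groups
    return caller.privileged or (owner_ok and group_ok)


def new_check(tun: TunDevice, caller: Caller) -> bool:
    """Behaviour described by the patch: owner OR group is enough."""
    owner_ok = caller.uid == tun.owner
    group_ok = tun.group in caller.groups
    return caller.privileged or owner_ok or group_ok
```

With an owner whose supplementary groups do not include the tun group, `old_check` denies access to the owner while `new_check` grants it, which is the situation the message describes.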

Johannes Berg [Fri, 6 Dec 2024 10:30:57 +0000 (11:30 +0100)]

tools: ynl-gen-c: don't require -o argument

Without -o the tool currently crashes, but it's not marked
as required. The only thing we can't do without it is to
generate the correct #include for user source files, but
we can put a placeholder instead.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20241206113100.89d35bf124d6.I9228fb704e6d5c9d8e046ef15025a47a48439c1e@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Johannes Berg [Fri, 6 Dec 2024 10:30:56 +0000 (11:30 +0100)]

tools: ynl-gen-c: annotate valid choices for --mode

This makes argparse validate the input and helps users
understand which modes are possible.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20241206113100.e2ab5cf6937c.Ie149a0ca5df713860964b44fe9d9ae547f2e1553@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
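
As an illustration of the two ynl-gen-c entries above, the sketch below shows how argparse validates a `--mode` option through `choices` and how an output option can be left optional. It is not the tool's actual parser: `build_parser`, the `out_file` destination and the three mode names are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="sketch of a generator CLI")
    # 'choices' makes argparse reject unknown modes and list the valid
    # ones in --help. The mode names here are assumed, not taken from
    # the log.
    parser.add_argument("--mode", required=True,
                        choices=("user", "kernel", "uapi"))
    # Not marked as required: the caller must cope with out_file being None.
    parser.add_argument("-o", dest="out_file")
    return parser
```

Parsing `--mode user` without `-o` succeeds and leaves `out_file` as `None`, while an unknown mode makes argparse print the valid choices and exit.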

Jakub Kicinski [Sat, 7 Dec 2024 01:56:56 +0000 (17:56 -0800)]

Merge branch 'net-convert-some-udp-tunnel-drivers-to-netdev_pcpu_stat_dstats'

Guillaume Nault says:

====================
net: Convert some UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS.

VXLAN, Geneve and Bareudp use various device counters for managing
RX and TX statistics:

  * VXLAN uses the device core_stats for RX and TX drops, tstats for
    regular RX/TX counters and DEV_STATS_INC() for various types of
    RX/TX errors.

  * Geneve uses tstats for regular RX/TX counters and DEV_STATS_INC()
    for everything else, including RX/TX drops.

  * Bareudp was recently converted to follow VXLAN behaviour, that is,
    device core_stats for RX and TX drops, tstats for regular RX/TX
    counters and DEV_STATS_INC() for other counter types.

Let's consolidate statistics management around the dstats counters
instead. This avoids using core_stats in VXLAN and Bareudp, as
core_stats is supposed to be used by core networking code only (and not
in drivers).  This also allows Geneve to avoid using atomic increments
when updating RX and TX drop counters, as dstats is per-cpu. Finally,
this also simplifies the code as all three modules now handle stats in
the same way and with only two different sets of counters (the per-cpu
dstats and the atomic DEV_STATS_INC()).

Patch 1 creates dstats helper functions that can be used outside of VRF
(until then, dstats was VRF-specific).
Then patches 2 to 4 convert VXLAN, Geneve and Bareudp, one by one.
====================

Link: https://patch.msgid.link/cover.1733313925.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Guillaume Nault [Wed, 4 Dec 2024 12:11:32 +0000 (13:11 +0100)]

bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.

Bareudp uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
packet counters. It was also recently converted to use the device core
stats (dev_core_stats_*()) for RX and TX drops (see commit 788d5d655bc9
("bareudp: Use pcpu stats to update rx_dropped counter.")).

Since core stats are to be avoided in drivers, and for consistency with
VXLAN and Geneve, let's convert packet stats handling to DSTATS, which
can handle RX/TX stats and packet drops. Statistics that don't fit
DSTATS are still updated atomically with DEV_STATS_INC().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/0f4f8448db3ff449ac6e939872b28cf3f8982da7.1733313925.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Guillaume Nault [Wed, 4 Dec 2024 12:11:30 +0000 (13:11 +0100)]

geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.

Geneve uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
packet counters. All other counters are handled using atomic increments
with DEV_STATS_INC().

Let's convert packet stats handling to DSTATS, which has a per-cpu
counter for packet drops too, to avoid the cost of atomic increments
in these cases. Statistics that don't fit DSTATS are still updated
atomically with DEV_STATS_INC().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/7af5c09f3c26f0f231fbe383822ca5d1ce0278fa.1733313925.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Guillaume Nault [Wed, 4 Dec 2024 12:11:27 +0000 (13:11 +0100)]

vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.

VXLAN uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX and
TX packet counters. It also uses the device core stats
(dev_core_stats_*()) for RX and TX drops.

Let's consolidate that using the DSTATS infrastructure, which can
handle both packet counters and packet drops. Statistics that don't
fit DSTATS are still updated atomically with DEV_STATS_INC().

While there, convert the "len" variable of vxlan_encap_bypass() to
unsigned int, to respect the types of skb->len and
dev_dstats_[rt]x_add().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/145558b184b3cda77911ca5682b6eb83c3ffed8e.1733313925.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Guillaume Nault [Wed, 4 Dec 2024 12:11:21 +0000 (13:11 +0100)]

vrf: Make pcpu_dstats update functions available to other modules.

Currently vrf is the only module that uses NETDEV_PCPU_STAT_DSTATS.
In order to make this kind of statistics available to other modules,
we need to define the update functions in netdevice.h.

Therefore, let's define dev_dstats_*() functions for RX and TX packet
updates (packets, bytes and drops). Use these new functions in vrf.c
instead of vrf_rx_stats() and the other manual counter updates.

While there, update the type of the "len" variables to "unsigned int",
so that they're aligned with both skb->len and the new dstats update
functions.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/d7a552ee382c79f4854e7fcc224cf176cd21150d.1733313925.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
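
The idea behind the per-cpu dstats counters used in the entries above can be illustrated with a small userspace model. This is a sketch only: `CpuSlot` and `PcpuDstats` are hypothetical names, the CPU index is passed explicitly (the kernel would use the current CPU), and the method names only loosely mirror the dev_dstats_*() helpers mentioned in the message.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class CpuSlot:
    rx_packets: int = 0
    rx_bytes: int = 0
    rx_drops: int = 0
    tx_packets: int = 0
    tx_bytes: int = 0
    tx_drops: int = 0


class PcpuDstats:
    """Each CPU updates only its own slot, so no atomic operation is needed."""

    def __init__(self, nr_cpus: int) -> None:
        self.slots: List[CpuSlot] = [CpuSlot() for _ in range(nr_cpus)]

    def rx_add(self, cpu: int, length: int) -> None:
        self.slots[cpu].rx_packets += 1
        self.slots[cpu].rx_bytes += length

    def rx_dropped(self, cpu: int) -> None:
        self.slots[cpu].rx_drops += 1

    def tx_add(self, cpu: int, length: int) -> None:
        self.slots[cpu].tx_packets += 1
        self.slots[cpu].tx_bytes += length

    def tx_dropped(self, cpu: int) -> None:
        self.slots[cpu].tx_drops += 1

    def totals(self) -> Dict[str, int]:
        # Readers sum the per-CPU slots to obtain the device-wide counters.
        return {f.name: sum(getattr(s, f.name) for s in self.slots)
                for f in fields(CpuSlot)}
```

The trade-off is that updates are cheap and uncontended, while reading the statistics costs one pass over all CPUs.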

Jakub Kicinski [Sat, 7 Dec 2024 01:53:29 +0000 (17:53 -0800)]

Merge branch 'lan78xx-preparations-for-phylink'

Oleksij Rempel says:

====================
lan78xx: Preparations for PHYlink

This patch set is part of the preparatory work for migrating the lan78xx
USB Ethernet driver to the PHYlink framework. During extensive testing,
I observed that resetting the USB adapter can lead to various read/write
errors. While the errors themselves are acceptable, they generate
excessive log messages, resulting in significant log spam. This set
improves error handling to reduce logging noise by addressing errors
directly and returning early when necessary.

Key highlights of this series include:
- Enhanced error handling to reduce log spam while preserving the
  original error values, avoiding unnecessary overwrites.
- Improved error reporting using the `%pe` specifier for better clarity
  in log messages.
- Removal of redundant and problematic PHY fixups for LAN8835 and
  KSZ9031, with detailed explanations in the respective patches.
- Cleanup of code structure, including unified `goto` labels for better
  readability and maintainability, even in simple editors.
====================

Link: https://patch.msgid.link/20241204084142.1152696-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:42 +0000 (09:41 +0100)]

net: usb: lan78xx: Improve error handling in dataport and multicast writes

Update `lan78xx_dataport_write` and `lan78xx_deferred_multicast_write`
to:
- Handle errors during register read/write operations.
- Exit immediately on errors and log them using `%pe` for clarity.
- Avoid silent failures by propagating error codes properly.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-11-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:41 +0000 (09:41 +0100)]

net: usb: lan78xx: Add error handling to lan78xx_irq_bus_sync_unlock

Update `lan78xx_irq_bus_sync_unlock` to handle errors in register
read/write operations. If an error occurs, log it and exit the function
appropriately. This ensures proper handling of failures during IRQ
synchronization.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-10-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:40 +0000 (09:41 +0100)]

net: usb: lan78xx: Add error handling to set_rx_max_frame_length and set_mtu

Improve error handling in `lan78xx_set_rx_max_frame_length` by:
- Checking return values from register read/write operations and
  propagating errors.
- Exiting immediately on failure to ensure proper error reporting.

In `lan78xx_change_mtu`, log errors when changing MTU fails, using `%pe`
for clear error representation.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-9-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:39 +0000 (09:41 +0100)]

net: usb: lan78xx: Add error handling to lan78xx_init_ltm

Convert `lan78xx_init_ltm` to return error codes and handle errors
properly. Previously, errors during the LTM initialization process were
not propagated, potentially leading to undetected issues. This patch
ensures:

- Errors in `lan78xx_read_reg` and `lan78xx_write_reg` are checked and
  handled.
- Errors are logged with detailed messages using `%pe` for clarity.
- The function exits immediately on error, returning the error code.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-8-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:38 +0000 (09:41 +0100)]

net: usb: lan78xx: Improve error handling in EEPROM and OTP operations

Refine error handling in EEPROM and OTP read/write functions:
- Return error values immediately upon detection.
- Avoid overwriting correct error codes with `-EIO`.
- Preserve initial error codes, as they were appropriate for the
  specific failures.
- Use `-ETIMEDOUT` for timeout conditions instead of `-EIO`.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-7-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
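
The error-handling rules listed in the EEPROM and OTP entry above can be sketched in a few lines. This is a userspace model, not the driver code: `wait_not_busy`, the `read_reg` callback returning a `(ret, value)` pair, and `BUSY_BIT` are all hypothetical.

```python
import errno
from typing import Callable, Tuple

BUSY_BIT = 0x8000_0000  # hypothetical "operation in progress" flag


def wait_not_busy(read_reg: Callable[[], Tuple[int, int]],
                  max_polls: int) -> int:
    """Poll a status register until the busy flag clears.

    Returns 0 on success, the negative errno reported by read_reg on a
    read failure, or -ETIMEDOUT when the flag never clears.
    """
    for _ in range(max_polls):
        ret, value = read_reg()
        if ret < 0:
            return ret              # return at once, keep the original code
        if not (value & BUSY_BIT):
            return 0
    return -errno.ETIMEDOUT         # a timeout is not reported as -EIO
```

A caller can then tell a disconnected device (whatever code the read returned) apart from a device that stays busy.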

Oleksij Rempel [Wed, 4 Dec 2024 08:41:37 +0000 (09:41 +0100)]

net: usb: lan78xx: Fix error handling in MII read/write functions

Ensure proper error handling in `lan78xx_mdiobus_read` and
`lan78xx_mdiobus_write` by checking return values of register read/write
operations and returning errors to the caller.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-6-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:36 +0000 (09:41 +0100)]

net: usb: lan78xx: Improve error reporting with %pe specifier

Replace integer error codes with the `%pe` format specifier in register
read and write error messages. This change provides human-readable error
strings, making logs more informative and debugging easier.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-5-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
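
To illustrate the difference a symbolic error name makes in a log line, the sketch below imitates the effect of `%pe` in userspace with Python's `errno` table. `format_pe` and `log_read_failure` are hypothetical helpers and the message wording is invented for the example.

```python
import errno


def format_pe(err: int) -> str:
    """Render a negative errno symbolically, e.g. -EIO instead of a number."""
    name = errno.errorcode.get(-err)
    return f"-{name}" if name else str(err)


def log_read_failure(index: int, err: int) -> str:
    # Hypothetical log line; the real driver messages may differ.
    return f"Failed to read register index 0x{index:08x}: {format_pe(err)}"
```

For example, `log_read_failure(0x10, -errno.ENODEV)` ends in `-ENODEV`, which is easier to spot in a log than the bare integer.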

Oleksij Rempel [Wed, 4 Dec 2024 08:41:35 +0000 (09:41 +0100)]

net: usb: lan78xx: move functions to avoid forward definitions

Move the following functions to avoid forward declarations in the code:
- lan78xx_start_hw()
- lan78xx_stop_hw()
- lan78xx_flush_fifo()
- lan78xx_start_tx_path()
- lan78xx_stop_tx_path()
- lan78xx_flush_tx_fifo()
- lan78xx_start_rx_path()
- lan78xx_stop_rx_path()
- lan78xx_flush_rx_fifo()

These functions will be used in an upcoming PHYlink migration patch.

No modifications to the functionality of the code are made.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oleksij Rempel [Wed, 4 Dec 2024 08:41:34 +0000 (09:41 +0100)]

net: usb: lan78xx: Remove KSZ9031 PHY fixup

Remove the KSZ9031RNX PHY fixup from the lan78xx driver. The fixup applied
specific RGMII pad skew configurations globally, but these settings violate the
RGMII specification and cause more harm than benefit.

Key issues with the fixup:
1. **Non-Compliant Timing**: The fixup's delay settings fall outside the RGMII
   specification requirements of 1.5 ns to 2.0 ns:
   - RX Path: Total delay of **2.16 ns** (PHY internal delay of 1.2 ns + 0.96
     ns skew).
   - TX Path: Total delay of **0.96 ns**, significantly below the RGMII minimum
     of 1.5 ns.

2. **Redundant or Incorrect Configurations**:
   - The RGMII skew registers written by the fixup do not meaningfully alter
     the PHY's default behavior and fail to account for its internal delays.
   - The TX_DATA pad skew was not configured, relying on power-on defaults
     that are insufficient for RGMII compliance.

3. **Micrel Driver Support**: By setting `PHY_INTERFACE_MODE_RGMII_ID`, the
   Micrel driver can calculate and assign appropriate skew values for the
   KSZ9031 PHY.  This ensures better timing configurations without relying on
   external fixups.

4. **System Interference**: The fixup applied globally, reconfiguring all
   KSZ9031 PHYs in the system, even those unrelated to the LAN78xx adapter.
   This could lead to unintended and harmful behavior on unrelated interfaces.

While the fixup is removed, a better mechanism is still needed to dynamically
determine the optimal combination of PHY and MAC delays to fully meet RGMII
requirements without relying on Device Tree or global fixups. This would allow
for robust operation across different hardware configurations.

The Micrel driver is capable of using the interface mode value to calculate and
apply better skew values, providing a configuration much closer to the RGMII
specification than the fixup. Removing the fixup ensures better default
behavior and prevents harm to other system interfaces.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
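
The timing argument in the KSZ9031 entry above is simple arithmetic, reproduced here in picoseconds to avoid floating-point rounding. The constant and function names are hypothetical; the figures are the ones quoted in the message.

```python
RGMII_MIN_PS = 1500   # 1.5 ns lower bound quoted in the message
RGMII_MAX_PS = 2000   # 2.0 ns upper bound quoted in the message


def within_rgmii_window(total_delay_ps: int) -> bool:
    return RGMII_MIN_PS <= total_delay_ps <= RGMII_MAX_PS


# RX path: PHY internal delay (1.2 ns) plus the skew set by the fixup (0.96 ns).
rx_total_ps = 1200 + 960
# TX path: only the 0.96 ns figure quoted for the fixup.
tx_total_ps = 960

rx_ok = within_rgmii_window(rx_total_ps)   # 2160 ps is above the maximum
tx_ok = within_rgmii_window(tx_total_ps)   # 960 ps is below the minimum
```

Both paths fall outside the window: RX overshoots by 160 ps and TX falls short by 540 ps.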

Oleksij Rempel [Wed, 4 Dec 2024 08:41:33 +0000 (09:41 +0100)]

net: usb: lan78xx: Remove LAN8835 PHY fixup

Remove the PHY fixup for the LAN8835 PHY in the lan78xx driver for
the following reasons:

- There is no publicly available information about the LAN8835 PHY.
  However, it appears to be the integrated PHY used in the LAN7800 and
  LAN7850 USB Ethernet controllers. These PHYs use the GMII interface,
  not RGMII as configured by the fixup.

- The correct driver for handling the LAN8835 PHY functionality is the
  Microchip PHY driver (`drivers/net/phy/microchip.c`), which properly
  supports these integrated PHYs.

- The PHY ID `0x0007C130` is actually used by the LAN8742A PHY, which
  only supports RMII. This interface is incompatible with the LAN78xx
  MAC, as the LAN7801 (the only LAN78xx version without an integrated
  PHY) supports only RGMII.

- The mask applied for this fixup is overly broad, inadvertently
  covering both Microchip LAN88xx PHYs and unrelated SMSC LAN8742A PHYs,
  leading to potential conflicts with other devices.

- Testing has shown that removing this fixup for LAN7800 and LAN7850
  does not result in any noticeable difference in functionality, as the
  Microchip PHY driver (`drivers/net/phy/microchip.c`) handles all
  necessary configurations for these integrated PHYs.

- Registering this fixup globally (not limited to USB devices) risks
  conflicts by unintentionally modifying other interfaces whenever a
  LAN7801 adapter is connected to the system.

Note that both LAN7800 and LAN7850 USB Ethernet controllers use an
integrated PHY with the ID `0x0007C132`. Additionally, the LAN7515, a
specialized part for Raspberry Pi, includes an integrated LAN7800 USB
Ethernet controller and USB hub in a multifunctional chip design, and it
also uses the same PHY ID (`0x0007C132`).

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241204084142.1152696-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
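
The point about the overly broad mask in the LAN8835 entry above can be shown with a masked ID comparison. This is a sketch: `fixup_matches` is a hypothetical helper, and the mask value `0xfffffff0` is an assumption chosen for illustration (the message does not state it); the two PHY IDs are the ones quoted in the message.

```python
FIXUP_UID = 0x0007C130          # ID the fixup was registered for (from the message)
INTEGRATED_PHY_ID = 0x0007C132  # LAN7800/LAN7850 integrated PHY (from the message)
BROAD_MASK = 0xFFFFFFF0         # assumed mask: ignores the low four bits
EXACT_MASK = 0xFFFFFFFF


def fixup_matches(phy_id: int, fixup_uid: int, mask: int) -> bool:
    """A fixup applies when the masked IDs are equal."""
    return (phy_id & mask) == (fixup_uid & mask)


broad_hits_integrated = fixup_matches(INTEGRATED_PHY_ID, FIXUP_UID, BROAD_MASK)
exact_hits_integrated = fixup_matches(INTEGRATED_PHY_ID, FIXUP_UID, EXACT_MASK)
```

Under the assumed broad mask the two IDs are indistinguishable, so a fixup meant for one part is also applied to the other; with an exact mask only `0x0007C130` matches.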

Jakub Kicinski [Sat, 7 Dec 2024 01:47:34 +0000 (17:47 -0800)]

Merge branch 'net-phylib-eee-cleanups'

Russell King says:

====================
net: phylib EEE cleanups

Clean up phylib's EEE support. Patches previously posted as RFC as part
of the phylink EEE series.

Patch 1 changes the Marvell driver to use the state we store in
struct phy_device (phydev->eee_cfg.eee_enabled), rather than manually
calling genphy_c45_ethtool_get_eee().

Patch 2 avoids genphy_c45_ethtool_get_eee() setting ->eee_enabled, as
we copy that from phydev->eee_cfg.eee_enabled later, and after patch 1
no one uses this after calling genphy_c45_ethtool_get_eee(). In fact,
the only caller of this function now is phy_ethtool_get_eee().

As all callers to genphy_c45_eee_is_active() now pass NULL as its
is_enabled flag, this is no longer useful. Remove the argument in
patch 3.

Patch 4 updates the phylib documentation to make it absolutely clear
that phy_ethtool_get_eee() now fills in all members of struct
ethtool_keee, which is why we now have so many buggy network drivers.
====================

Link: https://patch.msgid.link/Z1GDZlFyF2fsFa3S@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Russell King (Oracle) [Thu, 5 Dec 2024 10:42:15 +0000 (10:42 +0000)]

net: phy: update phy_ethtool_get_eee() documentation

Update the phy_ethtool_get_eee() documentation to make it clear that
all members of struct ethtool_keee are written by this function.

keee.supported, keee.advertised, keee.lp_advertised and keee.eee_active
are all written by genphy_c45_ethtool_get_eee().

keee.tx_lpi_timer, keee.tx_lpi_enabled and keee.eee_enabled are all
written by eeecfg_to_eee().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tJ9JH-006LIz-SO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
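
The split of responsibilities described in the documentation entry above can be modelled as two fill steps that together cover every member. This is a sketch: `Keee`, `fill_from_phy` and `fill_from_eee_cfg` are hypothetical stand-ins for struct ethtool_keee, genphy_c45_ethtool_get_eee() and eeecfg_to_eee(), and the values written are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Keee:
    supported: Optional[int] = None
    advertised: Optional[int] = None
    lp_advertised: Optional[int] = None
    eee_active: Optional[bool] = None
    tx_lpi_timer: Optional[int] = None
    tx_lpi_enabled: Optional[bool] = None
    eee_enabled: Optional[bool] = None


def fill_from_phy(keee: Keee) -> None:
    # Members the message attributes to genphy_c45_ethtool_get_eee().
    keee.supported = 0x3
    keee.advertised = 0x1
    keee.lp_advertised = 0x1
    keee.eee_active = True


def fill_from_eee_cfg(keee: Keee) -> None:
    # Members the message attributes to eeecfg_to_eee().
    keee.tx_lpi_timer = 0
    keee.tx_lpi_enabled = False
    keee.eee_enabled = True


def get_eee() -> Keee:
    keee = Keee()
    fill_from_phy(keee)
    fill_from_eee_cfg(keee)
    return keee
```

After both steps no member is left unset, which is the property the documentation update spells out.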

Russell King (Oracle) [Thu, 5 Dec 2024 10:42:10 +0000 (10:42 +0000)]

net: phy: remove genphy_c45_eee_is_active()'s is_enabled arg

All callers to genphy_c45_eee_is_active() now pass NULL as the
is_enabled argument, which means we never use the value computed
in this function. Remove the argument and clean up this function.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/E1tJ9JC-006LIt-Ne@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Russell King (Oracle) [Thu, 5 Dec 2024 10:42:05 +0000 (10:42 +0000)]

net: phy: avoid genphy_c45_ethtool_get_eee() setting eee_enabled

genphy_c45_ethtool_get_eee() is only called from phy_ethtool_get_eee(),
which then calls eeecfg_to_eee(). eeecfg_to_eee() will overwrite
keee.eee_enabled, so there's no point setting keee.eee_enabled in
genphy_c45_ethtool_get_eee(). Remove this assignment.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/E1tJ9J7-006LIn-Jr@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Russell King (Oracle) [Thu, 5 Dec 2024 10:42:00 +0000 (10:42 +0000)]

net: phy: marvell: use phydev->eee_cfg.eee_enabled

Rather than calling genphy_c45_ethtool_get_eee() to retrieve whether
EEE is enabled, use the value stored in the phy_device eee_cfg
structure.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/E1tJ9J2-006LIh-Fl@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Joe Damato [Wed, 4 Dec 2024 16:32:39 +0000 (16:32 +0000)]

selftests: net: cleanup busy_poller.c

Fix various integer type conversions by using strtoull and a temporary
variable which is bounds checked before being cast into the
appropriate cfg_* variable for use by the test program.

While here:
  - free the strdup'd cfg string for overall hygiene.
  - initialize napi_id = 0 in setup_queue to avoid warnings on some
    compilers.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204163239.294123-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
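
The parse-then-bounds-check pattern described in the busy_poller.c entry above looks roughly like this when modelled in Python. `parse_bounded` is a hypothetical helper; the C test program uses strtoull and a temporary variable, whereas this sketch relies on Python's arbitrary-precision `int` as the wide temporary.

```python
def parse_bounded(text: str, max_value: int) -> int:
    """Parse a decimal string into a wide temporary, then range-check it
    before it is narrowed into the destination variable."""
    tmp = int(text, 10)                 # wide temporary, cannot overflow here
    if tmp < 0 or tmp > max_value:
        raise ValueError(f"{text!r} is out of range 0..{max_value}")
    return tmp
```

For a 16-bit destination, `parse_bounded("8080", 0xFFFF)` is accepted, while `"70000"` is rejected instead of being silently truncated.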

Eric Dumazet [Wed, 4 Dec 2024 21:02:34 +0000 (21:02 +0000)]

net: tipc: remove one synchronize_net() from tipc_nametbl_stop()

tipc_exit_net() is very slow and is abused by syzbot.

tipc_nametbl_stop() is called for each netns being dismantled.

Calling synchronize_net() right before freeing tn->nametbl
is a big hammer.

Replace this with kfree_rcu().

Note that RCU is not properly used here, otherwise
tn->nametbl should be cleared before the synchronize_net()
or kfree_rcu(), or even before the cleanup loop.

We might need to fix this at some point.

Also note tipc uses other synchronize_rcu() calls,
more work is needed to make tipc_exit_net() much faster.

List of remaining calls to synchronize_rcu()

  tipc_detach_loopback() (dev_remove_pack())
  tipc_bcast_stop()
  tipc_sk_rht_destroy()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241204210234.319484-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Rosen Penev [Tue, 3 Dec 2024 23:13:37 +0000 (15:13 -0800)]

net: simplify resource acquisition + ioremap

Getting the resource, calling request_mem_region and calling ioremap
can all be done by a single function.

Replace them with devm_platform_get_and_ioremap_resource, or with
devm_platform_ioremap_resource where res is not used.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> # sja1000_platform.c
Link: https://patch.msgid.link/20241203231337.182391-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

David S. Miller [Fri, 6 Dec 2024 13:41:53 +0000 (13:41 +0000)]

Merge branch 'ucc_geth-phylink-conversion'

Maxime Chevallier says:

====================
net: freescale: ucc_geth: Phylink conversion

This is V3 of the phylink conversion for ucc_geth.

The main changes in this V3 are related to error handling in the patches
1 and 10 to report an error when the deprecated "interface" property is
found in DT. Doing so, I found and addressed some issues with the jump
labels in the error paths, impacting patches 1 and 10.

The rest of the changes are just a rebase on net-next.

Some of the V2 changes haven't been reviewed, so I stress that I'm
still uncertain about the way WoL is handled in patches 4 and 10.

Thanks,

Maxime

Link to V1: https://lore.kernel.org/netdev/20241107170255.1058124-1-maxime.chevallier@bootlin.com/
Link to V2: https://lore.kernel.org/netdev/20241114153603.307872-1-maxime.chevallier@bootlin.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Maxime Chevallier [Tue, 3 Dec 2024 12:43:21 +0000 (13:43 +0100)]

net: freescale: ucc_geth: phylink conversion

ucc_geth is quite capable in terms of supported interfaces, and even
includes an externally controlled PCS (well, TBI). Port that driver to
phylink.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Maxime Chevallier [Tue, 3 Dec 2024 12:43:20 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Introduce a helper to check Reduced modes

A number of parallel MII interfaces also exist in a "Reduced" mode,
usually with higher clock rates and fewer data lines, to ease the
hardware design. This is what the 'R' stands for in RGMII, RMII, RTBI,
RXAUI, etc.

The UCC Geth controller has a special configuration bit that needs to be
set when the MII mode is one of the supported reduced modes.

Add a local helper for that.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
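
A helper of the kind described in the Reduced-modes entry above amounts to a set membership test. The sketch below is illustrative only: `is_reduced_mode` is a hypothetical name, the modes are written as device tree phy-mode strings rather than kernel enum values, and the exact set the driver treats as reduced is an assumption.

```python
# Assumed set of reduced modes; the driver's actual list may differ.
REDUCED_MODES = frozenset({
    "rmii",
    "rgmii", "rgmii-id", "rgmii-rxid", "rgmii-txid",
    "rtbi",
})


def is_reduced_mode(phy_mode: str) -> bool:
    """True when the interface mode needs the controller's 'reduced' bit."""
    return phy_mode in REDUCED_MODES
```

Keeping the list in one place means the configuration bit is set consistently wherever the interface mode is programmed.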

Maxime Chevallier [Tue, 3 Dec 2024 12:43:19 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Move the serdes configuration around

The uec_configure_serdes() function deals with serialized linkmodes
settings. It's used during the link bringup sequence. It is planned to
be used during the phylink conversion for mac configuration, but it
needs to be moved around in the process. To make the phylink port
clearer, this commit moves the function without any feature change.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Maxime Chevallier [Tue, 3 Dec 2024 12:43:18 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Hardcode the preamble length to 7 bytes

The preamble length can be configured in ucc_geth. However, it always
ends up being configured to 7 bytes, as nothing ever changes the
default value of 7.

Make that value the default value when the MACCFG2 register gets
initialized, and remove the code to configure that value altogether.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Maxime Chevallier [Tue, 3 Dec 2024 12:43:17 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Simplify frame length check

The frame length check is configured when the PHY interface is set up.
However, it's configured according to an internal flag that is always
false. So, just make sure that we disable the relevant bit in the MACCFG2
register upon accessing it for other MAC configuration operations.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Maxime Chevallier [Tue, 3 Dec 2024 12:43:16 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Use the correct type to store WoL opts

The WoL opts are represented through a bitmask stored in a u32. As this
mask is copied as-is in the driver, make sure we use the exact same type
to store them internally.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


Maxime Chevallier [Tue, 3 Dec 2024 12:43:15 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Fix WOL configuration

The get/set_wol ethtool ops rely on querying the PHY for its WoL
capabilities; checking for the presence of a PHY and a PHY interrupt
isn't enough. Address that by cleaning up the WoL configuration
sequence.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>


Maxime Chevallier [Tue, 3 Dec 2024 12:43:14 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Use netdev->phydev to access the PHY

As this driver pre-dates phylib, it uses a private pointer to get a
reference to the attached phy_device. Drop that pointer and use the
netdev's pointer instead.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Maxime Chevallier [Tue, 3 Dec 2024 12:43:13 +0000 (13:43 +0100)]

net: freescale: ucc_geth: split adjust_link for phylink conversion

In preparation for the phylink conversion, split the adjust_link callback
by clearly separating the mac configuration, link_up and link_down phases.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Maxime Chevallier [Tue, 3 Dec 2024 12:43:12 +0000 (13:43 +0100)]

net: freescale: ucc_geth: Drop support for the "interface" DT property

In April 2007, ucc_geth was converted to phylib with:

commit 728de4c927a3 ("ucc_geth: migrate ucc_geth to phylib").

In that commit, the device-tree property "interface", which could be used
to retrieve the PHY interface mode, was deprecated.

DTS files that still used that property were converted along the way, in
the following commit, also dating from April 2007:

commit 0fd8c47cccb1 ("[POWERPC] Replace undocumented interface properties in dts files")

17 years later, there are no users of that property left and I hope it's
safe to say we can remove support for it in the ucc_geth driver,
making the probe() function a bit simpler.

Should there be any users that have a DT that was generated when 2.6.21 was
cutting-edge, print an error message with hints on how to convert the
devicetree if the 'interface' property is found.

With that property gone, we can greatly simplify the parsing of the
phy-interface-mode from the devicetree by using of_get_phy_mode(),
allowing the removal of the open-coded parsing in the driver.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Jakub Kicinski [Fri, 6 Dec 2024 02:41:13 +0000 (18:41 -0800)]

Merge branch 'xdp-a-fistful-of-generic-changes-pt-i'

Alexander Lobakin says:

====================
xdp: a fistful of generic changes pt. I

XDP for idpf is currently 6 chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes (you are here);
* generic XDP and XSk code additions;
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).

Part III does the following:
* improves &xdp_buff_xsk cacheline placement;
* cleans up some generic functions by marking read-only bpf_prog and
xdp_buff arguments const;
* allows attaching already registered XDP memory model to RxQ info;
* makes system percpu page_pools valid XDP memory models;
* starts using netmems in the XDP core code (1 function);
* allows mixing pages from several page_pools within one XDP frame;
* optimizes &xdp_frame layout and removes no-more-used field.

Bullets 4-6 are the most important ones. All of them are prereqs to
libeth_xdp.
====================

Link: https://patch.msgid.link/20241203173733.3181246-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:31 +0000 (18:37 +0100)]

page_pool: make page_pool_put_page_bulk() handle array of netmems

Currently, page_pool_put_page_bulk() actually takes an array of pointers
to the data, not pages, despite the name. As one side effect, when
you're freeing frags from &skb_shared_info, xdp_return_frame_bulk()
converts page pointers to virtual addresses and then
page_pool_put_page_bulk() converts them back. Moreover, data pointers
assume every frag is placed in the host memory, making this function
non-universal.
Make page_pool_put_page_bulk() handle array of netmems. Pass frag
netmems directly and use virt_to_netmem() when freeing xdpf->data,
so that the PP core will then get the compound netmem and take care
of the rest.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241203173733.3181246-9-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:30 +0000 (18:37 +0100)]

netmem: add a couple of page helper wrappers

Add the following netmem counterparts:

* virt_to_netmem() -- simple page_to_netmem(virt_to_page()) wrapper;
* netmem_is_pfmemalloc() -- page_is_pfmemalloc() for page-backed
netmems, false otherwise;

and the following "unsafe" versions:

* __netmem_to_page()
* __netmem_get_pp()
* __netmem_address()

They do the same as their non-underscored buddies, but assume the netmem
is always page-backed. When working with header &page_pools, you don't
need to check whether netmem belongs to the host memory and you can
never get NULL instead of &page. Checks for the LSB, clearing the LSB,
branches take cycles and increase object code size, sometimes
significantly. When you're sure your PP is always host, you can avoid
this by using the underscored counterparts.
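
The trade-off described above can be illustrated with a userspace sketch of a tagged handle whose least significant bit marks memory that is not backed by a host page. All type and function names below are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_page { int id; };

/* Hypothetical handle: LSB set means "not backed by a host page". */
typedef uintptr_t demo_netmem_t;
#define DEMO_NON_PAGE_BIT ((uintptr_t)1)

static inline demo_netmem_t demo_page_to_netmem(struct demo_page *page)
{
	return (demo_netmem_t)page;
}

/* Checked variant: tests the LSB and may return NULL. */
static inline struct demo_page *demo_netmem_to_page(demo_netmem_t nm)
{
	if (nm & DEMO_NON_PAGE_BIT)
		return NULL;
	return (struct demo_page *)nm;
}

/* "Unsafe" variant: the caller guarantees the handle is page-backed,
 * so there is no test, no masking and no NULL result. */
static inline struct demo_page *demo_netmem_to_page_unchecked(demo_netmem_t nm)
{
	return (struct demo_page *)nm;
}
```

The unchecked variant is only correct when every handle it sees is known to be page-backed, which is the situation the commit message describes for header page pools.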

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241203173733.3181246-8-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Toke Høiland-Jørgensen [Tue, 3 Dec 2024 17:37:29 +0000 (18:37 +0100)]

xdp: register system page pool as an XDP memory model

To make the system page pool usable as a source for allocating XDP
frames, we need to register it with xdp_reg_mem_model(), so that page
return works correctly. This is done in preparation for using the system
page_pool to convert XDP_PASS XSk frames to skbs; for the same reason,
make the per-cpu variable non-static so we can access it from other
source files as well (but w/o exporting).

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:28 +0000 (18:37 +0100)]

xsk: allow attaching XSk pool via xdp_rxq_info_reg_mem_model()

When you register an XSk pool as XDP Rxq info memory model, you then
need to manually attach it after the registration.
Let the user combine both actions into one by just passing a pointer
to the pool directly to xdp_rxq_info_reg_mem_model(), which will take
care of calling xsk_pool_set_rxq_info(). This looks similar to how a
&page_pool gets registered and reduces repeated driver code.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:27 +0000 (18:37 +0100)]

xdp: allow attaching already registered memory model to xdp_rxq_info

One may need to register memory model separately from xdp_rxq_info. One
simple example may be XDP test run code, but in general, it might be
useful when memory model registering is managed by one layer and then
XDP RxQ info by a different one.
Allow such scenarios by adding a simple helper which "attaches"
already registered memory model to the desired xdp_rxq_info. As this
is mostly needed for Page Pool, add a special function to do that for
a &page_pool pointer.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:26 +0000 (18:37 +0100)]

xdp, xsk: constify read-only arguments of some static inline helpers

Lots of read-only helpers for &xdp_buff and &xdp_frame, such as getting
the frame length, skb_shared_info etc., don't have their arguments
marked with `const` for no reason. Add the missing annotations to leave
less room for mistakes and more for optimization.
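
As a rough illustration of what such an annotation looks like, here is a sketch with a simplified, hypothetical buffer descriptor (not the real &xdp_buff layout):

```c
#include <stddef.h>

/* Hypothetical, simplified buffer descriptor. */
struct demo_buff {
	unsigned char *data;
	unsigned char *data_end;
};

/* Read-only helper: 'const' documents that the descriptor is not
 * modified and lets callers pass pointers to const objects. */
static inline size_t demo_buff_len(const struct demo_buff *buf)
{
	return (size_t)(buf->data_end - buf->data);
}
```

Without the qualifier, a caller holding a `const struct demo_buff *` would need a cast just to ask for the length.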

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:25 +0000 (18:37 +0100)]

bpf, xdp: constify some bpf_prog * function arguments

In lots of places, bpf_prog pointer is used only for tracing or other
stuff that doesn't modify the structure itself. Same for net_device.
Address at least some of them and add `const` attributes there. The
object code didn't change, but that may prevent unwanted data
modifications and also allow more helpers to have const arguments.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Alexander Lobakin [Tue, 3 Dec 2024 17:37:24 +0000 (18:37 +0100)]

xsk: align &xdp_buff_xsk harder

After the series "XSk buff on a diet" by Maciej, the greatest power of
two by which &xdp_buff_xsk can be divided got reduced from 16 to 8 on
x86_64. Also, sizeof(xdp_buff_xsk) is now 120 bytes, which, taking the
previous sentence into account, means it leaves 8 bytes at the end of a
cacheline, so the elements of an array of buffs end up spread across
cachelines irregularly.
Use __aligned_largest for this struct. This alignment is usually 16
bytes, which makes the struct fill two full cachelines and aligns an
array nicely. ____cacheline_aligned may be excessive here, especially on
arches with 128-256 byte CLs, as well as 32-bit arches (76 -> 96
bytes on MIPS32R2), while not doing better than _largest.
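
The arithmetic can be checked with a small userspace sketch. It uses a hypothetical 120-byte struct and an explicit 16-byte alignment as a stand-in for __aligned_largest (whose actual value depends on the target), and assumes 64-byte cachelines.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64u

/* 120 bytes of payload with natural (8-byte) alignment. */
struct demo_buff_packed {
	uint64_t words[15];
};

/* Same payload with 16-byte alignment: sizeof is rounded up to 128. */
struct demo_buff_aligned {
	uint64_t words[15];
} __attribute__((aligned(16)));

/* Number of cachelines touched by array element 'idx' when the array
 * starts on a cacheline boundary and elements are 'size' bytes apart. */
static inline size_t demo_cachelines_touched(size_t idx, size_t size)
{
	size_t first = idx * size;
	size_t last = first + size - 1;

	return last / DEMO_CACHELINE - first / DEMO_CACHELINE + 1;
}
```

With a 120-byte stride, element 0 touches two cachelines but element 1 touches three; with a 128-byte stride every element touches exactly two.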

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241203173733.3181246-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Fri, 6 Dec 2024 02:02:15 +0000 (18:02 -0800)]

Merge branch 'net_sched-sch_sfq-reject-limit-of-1'

Octavian Purdila says:

====================
net_sched: sch_sfq: reject limit of 1

The implementation does not properly support limits of 1. Add an
in-kernel check, in addition to existing iproute2 check, since other
tools may be used for configuration.

This patch set also adds a selfcheck to test that a limit of 1 is
rejected.

As an alternative (or in addition), we could fix the implementation by
setting q->tail to NULL in sfq_drop if this is the last slot we marked
empty, e.g.:

  --- a/net/sched/sch_sfq.c
  +++ b/net/sched/sch_sfq.c
  @@ -317,8 +317,11 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
                  /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
                  x = q->tail->next;
                  slot = &q->slots[x];
  -               q->tail->next = slot->next;
                  q->ht[slot->hash] = SFQ_EMPTY_SLOT;
  +               if (x == slot->next)
  +                       q->tail = NULL; /* no more active slots */
  +               else
  +                       q->tail->next = slot->next;
                  goto drop;
          }
====================

Link: https://patch.msgid.link/20241204030520.2084663-1-tavip@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Octavian Purdila [Wed, 4 Dec 2024 03:05:20 +0000 (19:05 -0800)]

selftests/tc-testing: sfq: test that kernel rejects limit of 1

Add test to check that the kernel rejects a configuration with the
limit set to 1.
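
The behaviour under test amounts to a check along the following lines. This is a hypothetical userspace helper written for illustration; the function name and exact error handling are assumptions, not the code added to sch_sfq.c.

```c
#include <errno.h>

/* Hypothetical validation helper: a limit of 1 is rejected up front,
 * any other value is passed through to the rest of the configuration. */
static inline int demo_sfq_check_limit(unsigned int limit)
{
	if (limit == 1)
		return -EINVAL;
	return 0;
}
```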

Signed-off-by: Octavian Purdila <tavip@google.com>
Link: https://patch.msgid.link/20241204030520.2084663-3-tavip@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Octavian Purdila [Wed, 4 Dec 2024 03:05:19 +0000 (19:05 -0800)]

net_sched: sch_sfq: don't allow 1 packet limit

The current implementation does not work correctly with a limit of
1. iproute2 actually checks for this and this patch adds the check in
kernel as well.

This fixes the following syzkaller reported crash:

UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:210:6
index 65535 is out of range for type 'struct sfq_head[128]'
CPU: 0 PID: 2569 Comm: syz-executor101 Not tainted 5.10.0-smp-DEV #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack+0x125/0x19f lib/dump_stack.c:120
  ubsan_epilogue lib/ubsan.c:148 [inline]
  __ubsan_handle_out_of_bounds+0xed/0x120 lib/ubsan.c:347
  sfq_link net/sched/sch_sfq.c:210 [inline]
  sfq_dec+0x528/0x600 net/sched/sch_sfq.c:238
  sfq_dequeue+0x39b/0x9d0 net/sched/sch_sfq.c:500
  sfq_reset+0x13/0x50 net/sched/sch_sfq.c:525
  qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026
  tbf_reset+0x3d/0x100 net/sched/sch_tbf.c:319
  qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026
  dev_reset_queue+0x8c/0x140 net/sched/sch_generic.c:1296
  netdev_for_each_tx_queue include/linux/netdevice.h:2350 [inline]
  dev_deactivate_many+0x6dc/0xc20 net/sched/sch_generic.c:1362
  __dev_close_many+0x214/0x350 net/core/dev.c:1468
  dev_close_many+0x207/0x510 net/core/dev.c:1506
  unregister_netdevice_many+0x40f/0x16b0 net/core/dev.c:10738
  unregister_netdevice_queue+0x2be/0x310 net/core/dev.c:10695
  unregister_netdevice include/linux/netdevice.h:2893 [inline]
  __tun_detach+0x6b6/0x1600 drivers/net/tun.c:689
  tun_detach drivers/net/tun.c:705 [inline]
  tun_chr_close+0x104/0x1b0 drivers/net/tun.c:3640
  __fput+0x203/0x840 fs/file_table.c:280
  task_work_run+0x129/0x1b0 kernel/task_work.c:185
  exit_task_work include/linux/task_work.h:33 [inline]
  do_exit+0x5ce/0x2200 kernel/exit.c:931
  do_group_exit+0x144/0x310 kernel/exit.c:1046
  __do_sys_exit_group kernel/exit.c:1057 [inline]
  __se_sys_exit_group kernel/exit.c:1055 [inline]
  __x64_sys_exit_group+0x3b/0x40 kernel/exit.c:1055
do_syscall_64+0x6c/0xd0
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7fe5e7b52479
Code: Unable to access opcode bytes at RIP 0x7fe5e7b5244f.
RSP: 002b:00007ffd3c800398 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5e7b52479
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007fe5e7bcd2d0 R08: ffffffffffffffb8 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5e7bcd2d0
R13: 0000000000000000 R14: 00007fe5e7bcdd20 R15: 00007fe5e7b24270

The crash can also be reproduced with the following (with a tc
recompiled to allow for sfq limits of 1):

tc qdisc add dev dummy0 handle 1: root tbf rate 1Kbit burst 100b lat 1s
../iproute2-6.9.0/tc/tc qdisc add dev dummy0 handle 2: parent 1:10 sfq limit 1
ifconfig dummy0 up
ping -I dummy0 -f -c2 -W0.1 8.8.8.8
sleep 1

Scenario that triggers the crash:

* the first packet is sent and queued in TBF and SFQ; qdisc qlen is 1

* TBF dequeues: it peeks from SFQ which moves the packet to the
  gso_skb list and keeps qdisc qlen set to 1. TBF is out of tokens so
  it schedules itself for later.

* the second packet is sent and TBF tries to queue it to SFQ. qdisc
  qlen is now 2 and because the SFQ limit is 1 the packet is dropped
  by SFQ. At this point qlen is 1, and all of the SFQ slots are empty,
  however q->tail is not NULL.

At this point, assuming no more packets are queued, when sch_dequeue
runs again it will decrement the qlen for the current empty slot
causing an underflow and the subsequent out of bounds access.
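
The "index 65535" in the UBSAN report is what an unsigned 16-bit counter holds after being decremented from zero. A minimal sketch of that wrap-around, assuming the queue length is kept in a 16-bit unsigned type and using hypothetical names:

```c
#include <stdint.h>

#define DEMO_SFQ_MAX_DEPTH 128u

/* Assumption: queue lengths are stored in an unsigned 16-bit counter. */
typedef uint16_t demo_sfq_index;

/* Decrementing an already-empty slot wraps around instead of going
 * negative. */
static inline demo_sfq_index demo_dec_qlen(demo_sfq_index qlen)
{
	return (demo_sfq_index)(qlen - 1);
}

/* A wrapped length is far outside a 128-entry table. */
static inline int demo_index_in_range(demo_sfq_index idx)
{
	return idx < DEMO_SFQ_MAX_DEPTH;
}
```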

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241204030520.2084663-2-tavip@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Eric Dumazet [Wed, 4 Dec 2024 17:19:50 +0000 (17:19 +0000)]

net_sched: sch_fq: add three drop_reason

"tc -s qd" shows aggregate counters; it might be more useful
to use the drop_reason infrastructure for bug hunting.

Add three new drop_reason values, more precise than the generic QDISC_DROP:

1) SKB_DROP_REASON_FQ_BAND_LIMIT
   Whenever a packet is added while its band limit is hit.
   Corresponding value in "tc -s qd" is bandX_drops XXXX

2) SKB_DROP_REASON_FQ_HORIZON_LIMIT
   Whenever a packet has a timestamp too far in the future.
   Corresponding value in "tc -s qd" is horizon_drops XXXX

3) SKB_DROP_REASON_FQ_FLOW_LIMIT
   Whenever a flow has reached its limit.
   Corresponding value in "tc -s qd" is flows_plimit XXXX
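
The correspondence between the three reasons and the "tc -s qd" counters can be written down as a small lookup. The enum below is a hypothetical userspace mirror with illustrative values, not the kernel's enum skb_drop_reason.

```c
#include <stddef.h>

/* Hypothetical mirror of the three reasons; values are illustrative. */
enum demo_fq_drop_reason {
	DEMO_FQ_BAND_LIMIT,
	DEMO_FQ_HORIZON_LIMIT,
	DEMO_FQ_FLOW_LIMIT,
};

/* Map a drop reason to the "tc -s qd" counter named in the list above. */
static inline const char *demo_fq_counter_name(enum demo_fq_drop_reason reason)
{
	switch (reason) {
	case DEMO_FQ_BAND_LIMIT:
		return "bandX_drops";
	case DEMO_FQ_HORIZON_LIMIT:
		return "horizon_drops";
	case DEMO_FQ_FLOW_LIMIT:
		return "flows_plimit";
	}
	return NULL;
}
```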

Tested:
tc qd replace dev eth1 root fq flow_limit 10 limit 100000
perf record -a -e skb:kfree_skb sleep 1; perf script

      udp_stream   12329 [004]   216.929492: skb:kfree_skb: skbaddr=0xffff888eabe17e00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12385 [006]   216.929593: skb:kfree_skb: skbaddr=0xffff888ef8827f00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12389 [005]   216.929871: skb:kfree_skb: skbaddr=0xffff888ecb9ba500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12316 [009]   216.930398: skb:kfree_skb: skbaddr=0xffff888eca286b00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT
      udp_stream   12400 [008]   216.930490: skb:kfree_skb: skbaddr=0xffff888eabf93d00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT

tc qd replace dev eth1 root fq flow_limit 100 limit 10000
perf record -a -e skb:kfree_skb sleep 1; perf script

      udp_stream   18074 [001]  1058.318040: skb:kfree_skb: skbaddr=0xffffa23c881fc000 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   18126 [005]  1058.320651: skb:kfree_skb: skbaddr=0xffffa23c6aad4000 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   18118 [006]  1058.321065: skb:kfree_skb: skbaddr=0xffffa23df0d48a00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   18074 [001]  1058.321126: skb:kfree_skb: skbaddr=0xffffa23c881ffa00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT
      udp_stream   15815 [003]  1058.321224: skb:kfree_skb: skbaddr=0xffffa23c9835db00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT

tc -s -d qd sh dev eth1
qdisc fq 8023: root refcnt 257 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023
bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 weights 589824 196608 65536 quantum 18Kb
initial_quantum 92120b low_rate_threshold 550Kbit refill_delay 40ms
timer_slack 10us horizon 10s horizon_drop
Sent 492439603330 bytes 336953991 pkt (dropped 61724094, overlimits 0 requeues 4463)
backlog 14611228b 9995p requeues 4463
  flows 2965 (inactive 1151 throttled 0) band0_pkts 0 band1_pkts 9993 band2_pkts 0
  gc 6347 highprio 0 fastpath 30 throttled 5 latency 2.32us flows_plimit 7403693
band1_drops 54320401

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241204171950.89829-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Thu, 5 Dec 2024 20:03:09 +0000 (12:03 -0800)]

Merge branch 'ethtool-generate-uapi-header-from-the-spec'

Stanislav Fomichev says:

====================
ethtool: generate uapi header from the spec

We keep expanding ethtool netlink api surface and this leads to
constantly playing catchup on the ynl spec side. There are a couple
of things that prevent us from fully converting to generating
the header from the spec (stats and cable tests), but we can
generate 95% of the header which is still better than maintaining
C header and spec separately. The series adds a couple of missing
features on the ynl-gen-c side and separates the parts
that we can generate into new ethtool_netlink_generated.h.
====================

Link: https://patch.msgid.link/20241204155549.641348-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:49 +0000 (07:55 -0800)]

ethtool: regenerate uapi header from the spec

No functional changes. Mostly the following formatting:
- extra docs
- extra enums
- XXX_MAX = __XXX_CNT - 1 -> XXX_MAX = (__XXX_CNT - 1)
- newlines
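
For readers unfamiliar with the CNT/MAX convention mentioned in the third item, a sketch with made-up attribute names (the DEMO_A_* identifiers are hypothetical) in the parenthesized form:

```c
/* Hypothetical attribute enum following the CNT/MAX convention. */
enum {
	DEMO_A_UNSPEC,
	DEMO_A_FIRST,
	DEMO_A_SECOND,

	/* Sentinel: always one past the last real attribute. */
	__DEMO_A_CNT,
	DEMO_A_MAX = (__DEMO_A_CNT - 1)
};
```

The parentheses do not change the value; they only change how the line is rendered.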

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:48 +0000 (07:55 -0800)]

ethtool: remove the comments that are not gonna be generated

Clean up the header manually to make it easier to review the changes that ynl
generator brings in. No functional changes.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:47 +0000 (07:55 -0800)]

ethtool: separate definitions that are gonna be generated

Reshuffle definitions that are gonna be generated into
ethtool_netlink_generated.h and match ynl spec order.
This should make it easier to compare the output of the ynl-gen-c
to the existing uapi header. No functional changes.

Things that are still remaining to be manually defined:
- ETHTOOL_FLAG_ALL - probably no good way to add to spec?
- some of the cable test bits (not sure whether it's possible to move to
spec)
- some of the stats definitions (no way currently to move to spec)

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:46 +0000 (07:55 -0800)]

ynl: include uapi header after all dependencies

Essentially reverse the order of headers for userspace generated files.

Before (make -C tools/net/ynl/; cat tools/net/ynl/ethtool-user.h):
  #include <linux/ethtool_netlink_generated.h>
  #include <linux/ethtool.h>
  #include <linux/ethtool.h>
  #include <linux/ethtool.h>

After:
  #include <linux/ethtool.h>
  #include <linux/ethtool_netlink_generated.h>

While at it, make sure we track which headers we've already included
and include the headers only once.
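
The "include only once" part boils down to an order-preserving de-duplication. The generator itself is a Python script; the C helper below is only a hypothetical sketch of the idea, not its implementation.

```c
#include <stddef.h>
#include <string.h>

/* Copy 'in' to 'out', keeping only the first occurrence of each header
 * name and preserving order. Returns the number of entries written.
 * 'out' must have room for 'n' entries. */
static inline size_t demo_unique_headers(const char *const *in, size_t n,
					 const char **out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		int seen = 0;

		for (size_t j = 0; j < count; j++) {
			if (strcmp(out[j], in[i]) == 0) {
				seen = 1;
				break;
			}
		}
		if (!seen)
			out[count++] = in[i];
	}
	return count;
}
```

The quadratic scan is acceptable here because a generated file only pulls in a handful of headers.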

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:45 +0000 (07:55 -0800)]

ynl: add missing pieces to ethtool spec to better match uapi header

- __ETHTOOL_UDP_TUNNEL_TYPE_CNT and render max
- skip rendering stringset (empty enum)
- skip rendering c33-pse-ext-state (defined in ethtool.h)
- rename header flags to ethtool-flag-
- add attr-cnt-name to each attribute to use XXX_CNT instead of XXX_MAX
- add unspec 0 entry to each attribute
- carry some doc entries from the existing header
- tcp-header-split

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:44 +0000 (07:55 -0800)]

ynl: support directional specs in ynl-gen-c.py

The intent is to generate ethtool uapi headers. For now, some of the
things are hard-coded:
- <FAMILY>_MSG_{USER,KERNEL}_MAX
- the split between USER and KERNEL messages

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:43 +0000 (07:55 -0800)]

ynl: skip rendering attributes with header property in uapi mode

To allow omitting some of the attributes in the final generated file.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Stanislav Fomichev [Wed, 4 Dec 2024 15:55:42 +0000 (07:55 -0800)]

ynl: support enum-cnt-name attribute in legacy definitions

This is similar to existing attr-cnt-name in the attributes
to allow changing the name of the 'count' enum entry.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241204155549.641348-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Thu, 5 Dec 2024 19:48:58 +0000 (11:48 -0800)]

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.13-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Linus Torvalds [Thu, 5 Dec 2024 18:25:06 +0000 (10:25 -0800)]

Merge tag 'net-6.13-rc2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from can and netfilter.

  Current release - regressions:

   - rtnetlink: fix double call of rtnl_link_get_net_ifla()

   - tcp: populate XPS related fields of timewait sockets

   - ethtool: fix access to uninitialized fields in set RXNFC command

   - selinux: use sk_to_full_sk() in selinux_ip_output()

  Current release - new code bugs:

   - net: make napi_hash_lock irq safe

   - eth:
      - bnxt_en: support header page pool in queue API
      - ice: fix NULL pointer dereference in switchdev

  Previous releases - regressions:

   - core: fix icmp host relookup triggering ip_rt_bug

   - ipv6:
      - avoid possible NULL deref in modify_prefix_route()
      - release expired exception dst cached in socket

   - smc: fix LGR and link use-after-free issue

   - hsr: avoid potential out-of-bound access in fill_frame_info()

   - can: hi311x: fix potential use-after-free

   - eth: ice: fix VLAN pruning in switchdev mode

  Previous releases - always broken:

   - netfilter:
      - ipset: hold module reference while requesting a module
      - nft_inner: incorrect percpu area handling under softirq

   - can: j1939: fix skb reference counting

   - eth:
      - mlxsw: use correct key block on Spectrum-4
      - mlx5: fix memory leak in mlx5hws_definer_calc_layout"

* tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
  net: avoid potential UAF in default_operstate()
  vsock/test: verify socket options after setting them
  vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
  vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
  net/mlx5e: Remove workaround to avoid syndrome for internal port
  net/mlx5e: SD, Use correct mdev to build channel param
  net/mlx5: E-Switch, Fix switching to switchdev mode in MPV
  net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
  net/mlx5: HWS: Properly set bwc queue locks lock classes
  net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
  bnxt_en: handle tpa_info in queue API implementation
  bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
  bnxt_en: refactor tpa_info alloc/free into helpers
  geneve: do not assume mac header is set in geneve_xmit_skb()
  mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
  ethtool: Fix wrong mod state in case of verbose and no_mask bitset
  ipmr: tune the ipmr_can_free_table() checks.
  netfilter: nft_set_hash: skip duplicated elements pending gc run
  netfilter: ipset: Hold module reference while requesting a module
  ...


Linus Torvalds [Thu, 5 Dec 2024 18:17:55 +0000 (10:17 -0800)]

Merge tag 'trace-v6.13-rc1' of git://git./linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix trace histogram sort function cmp_entries_dup()

   The sort function cmp_entries_dup() returns either 1 or 0, and not -1
   if parameter "a" is less than "b" by memcmp().

- Fix archs that call trace_hardirqs_off() without RCU watching

   Both x86 and arm64 no longer call any tracepoints with RCU not
   watching. It was assumed that it was safe to get rid of the
   trace_*_rcuidle() version of the tracepoint calls. This was needed to
   get rid of the SRCU protection and be able to implement features like
   faultable tracepoints and add rust tracepoints.

   Unfortunately, there were a few architectures that still relied on
   that logic. There's only one file that has tracepoints that are
   called without RCU watching. Add macro logic around the tracepoints
   for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
   that will check if the code is in the idle path (the only place RCU
   isn't watching), and enable RCU around calling the tracepoint, but
   only do it if the tracepoint is enabled.
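
For the cmp_entries_dup() item above, the class of bug can be sketched in userspace: a comparator that only reports "equal" or "different" breaks the contract a sort routine expects. The struct and function names below are hypothetical and do not reproduce the tracing code.

```c
#include <string.h>

/* Hypothetical fixed-size key used only for this sketch. */
struct demo_entry {
	unsigned char key[8];
};

/* Broken: returns 1 whenever the keys differ, so it never returns a
 * negative value and cmp(a, b) == cmp(b, a) for unequal keys. */
static inline int demo_cmp_broken(const void *a, const void *b)
{
	const struct demo_entry *ea = a, *eb = b;

	return memcmp(ea->key, eb->key, sizeof(ea->key)) != 0;
}

/* Three-way: negative, zero or positive, as sort comparators require. */
static inline int demo_cmp_three_way(const void *a, const void *b)
{
	const struct demo_entry *ea = a, *eb = b;

	return memcmp(ea->key, eb->key, sizeof(ea->key));
}
```

The three-way variant has the signature expected by the C library's qsort(), so it can be passed to it directly.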

* tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix archs that still call tracepoints without RCU watching
  tracing: Fix cmp_entries_dup() to respect sort() comparison rules


Linus Torvalds [Thu, 5 Dec 2024 18:06:47 +0000 (10:06 -0800)]

Merge tag 'hid-for-linus-2024120501' of git://git./linux/kernel/git/hid/hid

Pull HID fixes from Benjamin Tissoires:

- regression fix in suspend/resume for i2c-hid (Kenny Levinsen)

- fix wacom driver assuming a name can not be null (WangYuli)

- a couple of constify changes/fixes (Thomas Weißschuh)

- a couple of selftests/hid fixes (Maximilian Heyne & Benjamin
   Tissoires)

* tag 'hid-for-linus-2024120501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  selftests/hid: fix kfunc inclusions with newer bpftool
  HID: bpf: drop unneeded casts discarding const
  HID: bpf: constify hid_ops
  selftests: hid: fix typo and exit code
  HID: wacom: fix when get product name maybe null pointer
  HID: i2c-hid: Revert to using power commands to wake on resume

Linus Torvalds [Thu, 5 Dec 2024 18:03:43 +0000 (10:03 -0800)]

Merge tag 'linux-watchdog-6.13-rc1' of git://linux-watchdog.org/linux-watchdog

Pull watchdog updates from Wim Van Sebroeck:

- Add support for exynosautov920 SoC

- Add support for Airoha EN7851 watchdog

- Add support for MT6735 TOPRGU/WDT

- Delete the cpu5wdt driver

- Always print when registering watchdog fails

- Several other small fixes and improvements

* tag 'linux-watchdog-6.13-rc1' of git://www.linux-watchdog.org/linux-watchdog: (36 commits)
  watchdog: rti: of: honor timeout-sec property
  watchdog: s3c2410_wdt: add support for exynosautov920 SoC
  dt-bindings: watchdog: Document ExynosAutoV920 watchdog bindings
  watchdog: mediatek: Add support for MT6735 TOPRGU/WDT
  watchdog: mediatek: Make sure system reset gets asserted in mtk_wdt_restart()
  dt-bindings: watchdog: fsl-imx-wdt: Add missing 'big-endian' property
  dt-bindings: watchdog: Document Qualcomm QCS8300
  docs: ABI: Fix spelling mistake in pretimeout_avaialable_governors
  Revert "watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs"
  watchdog: rzg2l_wdt: Power on the watchdog domain in the restart handler
  watchdog: Switch back to struct platform_driver::remove()
  watchdog: it87_wdt: add PWRGD enable quirk for Qotom QCML04
  watchdog: da9063: Remove __maybe_unused notations
  watchdog: da9063: Do not use a global variable
  watchdog: Delete the cpu5wdt driver
  watchdog: Add support for Airoha EN7851 watchdog
  dt-bindings: watchdog: airoha: document watchdog for Airoha EN7581
  watchdog: sl28cpld_wdt: don't print out if registering watchdog fails
  watchdog: rza_wdt: don't print out if registering watchdog fails
  watchdog: rti_wdt: don't print out if registering watchdog fails
  ...

Steven Rostedt [Wed, 4 Dec 2024 15:04:14 +0000 (10:04 -0500)]

tracing: Fix archs that still call tracepoints without RCU watching

Tracepoints require having RCU "watching" as they use RCU to do updates to
the tracepoints. There are some cases that would call a tracepoint when
RCU was not "watching". This was usually in the idle path where RCU has
"shutdown". For the few locations that had tracepoints without RCU
watching, there was a trace_*_rcuidle() variant that could be used. This
used SRCU for protection.

There are tracepoints that trace when interrupts and preemption are
enabled and disabled. In some architectures, these tracepoints are called
in a path where RCU is not watching. When x86 and arm64 removed these
locations, it was incorrectly assumed that it would be safe to remove the
trace_*_rcuidle() variant and also remove the SRCU logic, as it made the
code more complex and harder to implement new tracepoint features (like
faultable tracepoints and tracepoints in rust).

Instead of bringing back the trace_*_rcuidle(), as it will not be trivial
to do as new code has already been added depending on its removal, add a
workaround to the one file that still requires it (trace_preemptirq.c). If
the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if
the code is in the idle path, and if so, call ct_irq_enter/exit() which
will enable RCU around the tracepoint.
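
The control flow of that workaround can be sketched in userspace as
follows. This is a toy model under stated assumptions, not the contents
of trace_preemptirq.c: struct cpu_model and every function name are
hypothetical, and setting rcu_watching merely stands in for what
ct_irq_enter()/ct_irq_exit() achieve in the kernel.

```c
#include <stdbool.h>

/* Hypothetical model of the state the workaround cares about. */
struct cpu_model {
	bool rcu_watching;	/* is RCU watching right now? */
	bool idle;		/* are we in the idle path? */
	bool tp_enabled;	/* is the tracepoint enabled? */
	int safe_events;	/* tracepoint hits with RCU watching */
	int unsafe_events;	/* tracepoint hits without RCU watching */
};

static void model_tracepoint(struct cpu_model *m)
{
	if (m->rcu_watching)
		m->safe_events++;
	else
		m->unsafe_events++;
}

/* Guarded call: only touch RCU state when the tracepoint is enabled. */
static void trace_guarded(struct cpu_model *m)
{
	bool exit_rcu = false;

	if (!m->tp_enabled)
		return;
	if (m->idle) {
		m->rcu_watching = true;		/* stands in for ct_irq_enter() */
		exit_rcu = true;
	}
	model_tracepoint(m);
	if (exit_rcu)
		m->rcu_watching = false;	/* stands in for ct_irq_exit() */
}

/* Run one tracepoint call, guarded or not, and return the final state. */
static struct cpu_model simulate(bool idle, bool tp_enabled, bool guarded)
{
	struct cpu_model m = {
		.rcu_watching = !idle,
		.idle = idle,
		.tp_enabled = tp_enabled,
	};

	if (guarded)
		trace_guarded(&m);
	else if (m.tp_enabled)
		model_tracepoint(&m);
	return m;
}
```

In this model an unguarded call from the idle path is counted as unsafe,
while the guarded call is counted as safe and leaves RCU not watching
again afterwards.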

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()")
Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Shradha Gupta [Wed, 4 Dec 2024 05:48:20 +0000 (21:48 -0800)]

net :mana :Request a V2 response version for MANA_QUERY_GF_STAT

The currently requested response version (V1) for the MANA_QUERY_GF_STAT query
results in STATISTICS_FLAGS_TX_ERRORS_GDMA_ERROR value being set to
0 always.
In order to get the correct value for this counter we request the response
version to be V2.

Cc: stable@vger.kernel.org
Fixes: e1df5202e879 ("net :mana :Add remaining GDMA stats for MANA to ethtool")
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1733291300-12593-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Eric Dumazet [Tue, 3 Dec 2024 17:09:33 +0000 (17:09 +0000)]

net: avoid potential UAF in default_operstate()

syzbot reported a UAF in default_operstate() [1]

Issue is a race between device and netns dismantles.

After calling __rtnl_unlock() from netdev_run_todo(),
we can not assume the netns of each device is still alive.

Make sure the device is not in NETREG_UNREGISTERED state,
and add an ASSERT_RTNL() before the call to
__dev_get_by_index().

We might move this ASSERT_RTNL() in __dev_get_by_index()
in the future.

[1]

BUG: KASAN: slab-use-after-free in __dev_get_by_index+0x5d/0x110 net/core/dev.c:852
Read of size 8 at addr ffff888043eba1b0 by task syz.0.0/5339

CPU: 0 UID: 0 PID: 5339 Comm: syz.0.0 Not tainted 6.12.0-syzkaller-10296-gaaf20f870da0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
  print_address_description mm/kasan/report.c:378 [inline]
  print_report+0x169/0x550 mm/kasan/report.c:489
  kasan_report+0x143/0x180 mm/kasan/report.c:602
  __dev_get_by_index+0x5d/0x110 net/core/dev.c:852
  default_operstate net/core/link_watch.c:51 [inline]
  rfc2863_policy+0x224/0x300 net/core/link_watch.c:67
  linkwatch_do_dev+0x3e/0x170 net/core/link_watch.c:170
  netdev_run_todo+0x461/0x1000 net/core/dev.c:10894
  rtnl_unlock net/core/rtnetlink.c:152 [inline]
  rtnl_net_unlock include/linux/rtnetlink.h:133 [inline]
  rtnl_dellink+0x760/0x8d0 net/core/rtnetlink.c:3520
  rtnetlink_rcv_msg+0x791/0xcf0 net/core/rtnetlink.c:6911
  netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2541
  netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
  netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1347
  netlink_sendmsg+0x8e4/0xcb0 net/netlink/af_netlink.c:1891
  sock_sendmsg_nosec net/socket.c:711 [inline]
  __sock_sendmsg+0x221/0x270 net/socket.c:726
  ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2583
  ___sys_sendmsg net/socket.c:2637 [inline]
  __sys_sendmsg+0x269/0x350 net/socket.c:2669
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f2a3cb80809
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f2a3d9cd058 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f2a3cd45fa0 RCX: 00007f2a3cb80809
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000008
RBP: 00007f2a3cbf393e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f2a3cd45fa0 R15: 00007ffd03bc65c8
</TASK>

Allocated by task 5339:
  kasan_save_stack mm/kasan/common.c:47 [inline]
  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
  poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
  kasan_kmalloc include/linux/kasan.h:260 [inline]
  __kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4314
  kmalloc_noprof include/linux/slab.h:901 [inline]
  kmalloc_array_noprof include/linux/slab.h:945 [inline]
  netdev_create_hash net/core/dev.c:11870 [inline]
  netdev_init+0x10c/0x250 net/core/dev.c:11890
  ops_init+0x31e/0x590 net/core/net_namespace.c:138
  setup_net+0x287/0x9e0 net/core/net_namespace.c:362
  copy_net_ns+0x33f/0x570 net/core/net_namespace.c:500
  create_new_namespaces+0x425/0x7b0 kernel/nsproxy.c:110
  unshare_nsproxy_namespaces+0x124/0x180 kernel/nsproxy.c:228
  ksys_unshare+0x57d/0xa70 kernel/fork.c:3314
  __do_sys_unshare kernel/fork.c:3385 [inline]
  __se_sys_unshare kernel/fork.c:3383 [inline]
  __x64_sys_unshare+0x38/0x40 kernel/fork.c:3383
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 12:
  kasan_save_stack mm/kasan/common.c:47 [inline]
  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
  poison_slab_object mm/kasan/common.c:247 [inline]
  __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
  kasan_slab_free include/linux/kasan.h:233 [inline]
  slab_free_hook mm/slub.c:2338 [inline]
  slab_free mm/slub.c:4598 [inline]
  kfree+0x196/0x420 mm/slub.c:4746
  netdev_exit+0x65/0xd0 net/core/dev.c:11992
  ops_exit_list net/core/net_namespace.c:172 [inline]
  cleanup_net+0x802/0xcc0 net/core/net_namespace.c:632
  process_one_work kernel/workqueue.c:3229 [inline]
  process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310
  worker_thread+0x870/0xd30 kernel/workqueue.c:3391
  kthread+0x2f0/0x390 kernel/kthread.c:389
  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

The buggy address belongs to the object at ffff888043eba000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 432 bytes inside of
freed 2048-byte region [ffff888043eba000, ffff888043eba800)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x43eb8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x4fff00000000040(head|node=1|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 04fff00000000040 ffff88801ac42000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 04fff00000000040 ffff88801ac42000 dead000000000122 0000000000000000
head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 04fff00000000003 ffffea00010fae01 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5339, tgid 5338 (syz.0.0), ts 69674195892, free_ts 69663220888
  set_page_owner include/linux/page_owner.h:32 [inline]
  post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556
  prep_new_page mm/page_alloc.c:1564 [inline]
  get_page_from_freelist+0x3649/0x3790 mm/page_alloc.c:3474
  __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751
  alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
  alloc_slab_page+0x6a/0x140 mm/slub.c:2408
  allocate_slab+0x5a/0x2f0 mm/slub.c:2574
  new_slab mm/slub.c:2627 [inline]
  ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815
  __slab_alloc+0x58/0xa0 mm/slub.c:3905
  __slab_alloc_node mm/slub.c:3980 [inline]
  slab_alloc_node mm/slub.c:4141 [inline]
  __do_kmalloc_node mm/slub.c:4282 [inline]
  __kmalloc_noprof+0x2e6/0x4c0 mm/slub.c:4295
  kmalloc_noprof include/linux/slab.h:905 [inline]
  sk_prot_alloc+0xe0/0x210 net/core/sock.c:2165
  sk_alloc+0x38/0x370 net/core/sock.c:2218
  __netlink_create+0x65/0x260 net/netlink/af_netlink.c:629
  __netlink_kernel_create+0x174/0x6f0 net/netlink/af_netlink.c:2015
  netlink_kernel_create include/linux/netlink.h:62 [inline]
  uevent_net_init+0xed/0x2d0 lib/kobject_uevent.c:783
  ops_init+0x31e/0x590 net/core/net_namespace.c:138
  setup_net+0x287/0x9e0 net/core/net_namespace.c:362
page last free pid 1032 tgid 1032 stack trace:
  reset_page_owner include/linux/page_owner.h:25 [inline]
  free_pages_prepare mm/page_alloc.c:1127 [inline]
  free_unref_page+0xdf9/0x1140 mm/page_alloc.c:2657
  __slab_free+0x31b/0x3d0 mm/slub.c:4509
  qlink_free mm/kasan/quarantine.c:163 [inline]
  qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
  kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
  __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
  kasan_slab_alloc include/linux/kasan.h:250 [inline]
  slab_post_alloc_hook mm/slub.c:4104 [inline]
  slab_alloc_node mm/slub.c:4153 [inline]
  kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205
  __alloc_skb+0x1c3/0x440 net/core/skbuff.c:668
  alloc_skb include/linux/skbuff.h:1323 [inline]
  alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
  sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2881
  sock_alloc_send_skb include/net/sock.h:1797 [inline]
  mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747
  add_grhead net/ipv6/mcast.c:1850 [inline]
  add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988
  mld_send_initial_cr+0x228/0x4b0 net/ipv6/mcast.c:2234
  ipv6_mc_dad_complete+0x88/0x490 net/ipv6/mcast.c:2245
  addrconf_dad_completed+0x712/0xcd0 net/ipv6/addrconf.c:4342
addrconf_dad_work+0xdc2/0x16f0
  process_one_work kernel/workqueue.c:3229 [inline]
  process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310

Memory state around the buggy address:
ffff888043eba080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888043eba100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888043eba180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
ffff888043eba200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888043eba280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 8c55facecd7a ("net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down")
Reported-by: syzbot+1939f24bdb783e9e43d9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/674f3a18.050a0220.48a03.0041.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241203170933.2449307-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Paolo Abeni [Thu, 5 Dec 2024 10:49:14 +0000 (11:49 +0100)]

Merge tag 'nf-24-12-05' of git://git./linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix esoteric undefined behaviour due to uninitialized stack access
   in ip_vs_protocol_init(), from Jinghao Jia.

2) Fix iptables xt_LED slab-out-of-bounds due to incorrect sanitization
   of the led string identifier, reported by syzbot. Patch from
   Dmitry Antipov.

3) Remove WARN_ON_ONCE reachable from userspace to check for the maximum
   cgroup level, nft_socket cgroup matching is restricted to 255 levels,
   but cgroups allow for INT_MAX levels by default. Reported by syzbot.

4) Fix nft_inner incorrect use of percpu area to store tunnel parser
   context with softirqs, resulting in inconsistent inner header
   offsets that could lead to bogus rule mismatches, reported by syzbot.

5) Grab module reference on ipset core while requesting set type modules,
   otherwise kernel crash is possible by removing ipset core module,
   patch from Phil Sutter.

6) Fix possible double-free in nft_hash garbage collector due to unstable
   walk iterator that can provide twice the same element. Use a sequence
   number to skip expired/dead elements that have been already scheduled
   for removal. Based on patch from Laurent Fasnach

netfilter pull request 24-12-05

* tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_set_hash: skip duplicated elements pending gc run
  netfilter: ipset: Hold module reference while requesting a module
  netfilter: nft_inner: incorrect percpu area handling under softirq
  netfilter: nft_socket: remove WARN_ON_ONCE on maximum cgroup level
  netfilter: x_tables: fix LED ID check in led_tg_check()
  ipvs: fix UB due to uninitialized stack access in ip_vs_protocol_init()
====================

Link: https://patch.msgid.link/20241205002854.162490-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Paolo Abeni [Thu, 5 Dec 2024 10:39:36 +0000 (11:39 +0100)]

Merge branch 'vsock-test-fix-wrong-setsockopt-parameters'

Konstantin Shkolnyy says:

====================
vsock/test: fix wrong setsockopt() parameters

Parameters were created using wrong C types, which caused them to be of
wrong size on some architectures, causing problems.

The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64
didn't show it. After the fix, all tests pass on s390.
Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have
a similar problem, which turned out to be true, hence, the second patch.

Changes for v8:
- Fix whitespace warnings from "checkpatch.pl --strict"
- Add maintainers to Cc:
Changes for v7:
- Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
- Add the "net" tags to the subjects
Changes for v6:
- rework the patch #3 to avoid creating a new file for new functions,
and exclude vsock_perf from calling the new functions.
- add "Reviewed-by:" to the patch #2.
Changes for v5:
- in the patch #2 replace the introduced uint64_t with unsigned long long
to match documentation
- add a patch #3 that verifies every setsockopt() call.
Changes for v4:
- add "Reviewed-by:" to the first patch, and add a second patch fixing
SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now
a patch series.)
Changes for v3:
- fix the same problem in vsock_perf and update commit message
Changes for v2:
- add "Fixes:" lines to the commit message
====================

Link: https://patch.msgid.link/20241203150656.287028-1-kshk@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:56 +0000 (09:06 -0600)]

vsock/test: verify socket options after setting them

Replace setsockopt() calls with calls to functions that follow
setsockopt() with getsockopt() and check that the returned value and its
size are the same as have been set. (Except in vsock_perf.)
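
A set-then-verify helper of this kind could look roughly like the
sketch below. The names set_and_verify_int() and demo_rcvlowat() are
hypothetical and are not the functions added by this patch; the demo
uses an AF_UNIX socket only so that it runs without vsock support.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: set an int-sized option, read it back, and
 * check both the returned value and the returned length.
 * Returns 0 on success, -1 on any mismatch or error.
 */
static int set_and_verify_int(int fd, int level, int optname, int val)
{
	int chk = ~val;		/* differs from val, so a stale buffer is noticed */
	socklen_t len = sizeof(chk);

	if (setsockopt(fd, level, optname, &val, sizeof(val)) < 0)
		return -1;
	if (getsockopt(fd, level, optname, &chk, &len) < 0)
		return -1;
	if (len != sizeof(val) || chk != val)
		return -1;
	return 0;
}

/* Hypothetical demo: apply the helper to SO_RCVLOWAT on a fresh socket. */
static int demo_rcvlowat(int val)
{
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	int ret;

	if (fd < 0)
		return -1;
	ret = set_and_verify_int(fd, SOL_SOCKET, SO_RCVLOWAT, val);
	close(fd);
	return ret;
}
```

Checking the length as well as the value is what would catch an option
that the kernel silently read with a different size than the caller
passed.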

Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:55 +0000 (09:06 -0600)]

vsock/test: fix parameter types in SO_VM_SOCKETS_* calls

Change parameters of SO_VM_SOCKETS_* to unsigned long long as documented
in the vm_sockets.h, because the corresponding kernel code requires them
to be at least 64-bit, no matter what architecture. Otherwise they are
too small on 32-bit machines.

Fixes: 5c338112e48a ("test/vsock: rework message bounds test")
Fixes: 685a21c314a8 ("test/vsock: add big message test")
Fixes: 542e893fbadc ("vsock/test: two tests to check credit update logic")
Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility")
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Konstantin Shkolnyy [Tue, 3 Dec 2024 15:06:54 +0000 (09:06 -0600)]

vsock/test: fix failures due to wrong SO_RCVLOWAT parameter

This happens on 64-bit big-endian machines.
SO_RCVLOWAT requires an int parameter. However, instead of int, the test
uses unsigned long in one place and size_t in another. Both are 8 bytes
long on 64-bit machines. The kernel, having received the 8 bytes, doesn't
test for the exact size of the parameter, it only cares that it's >=
sizeof(int), and casts the 4 lower-addressed bytes to an int, which, on
a big-endian machine, contains 0. 0 doesn't trigger an error, SO_RCVLOWAT
returns with success and the socket stays with the default SO_RCVLOWAT = 1,
which results in vsock_test failures, while vsock_perf doesn't even notice
that it's failed to change it.
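
The byte-level effect can be reproduced on any host with a small sketch
that lays out the 8-byte value explicitly instead of relying on the
machine's own endianness. All function names here are hypothetical; the
sketch only models "read the 4 lower-addressed bytes as an int".

```c
#include <stdint.h>

/* Store a 64-bit value in big-endian byte order. */
static void put_be64(uint64_t v, unsigned char out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Store a 64-bit value in little-endian byte order. */
static void put_le64(uint64_t v, unsigned char out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (8 * i));
}

static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint32_t get_le32(const unsigned char *p)
{
	return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/* What a 4-byte read sees when the caller passed 8 bytes, big-endian. */
static uint32_t int_prefix_big_endian(uint64_t v)
{
	unsigned char buf[8];

	put_be64(v, buf);
	return get_be32(buf);	/* the high half of v */
}

/* The same read on a little-endian layout. */
static uint32_t int_prefix_little_endian(uint64_t v)
{
	unsigned char buf[8];

	put_le64(v, buf);
	return get_le32(buf);	/* the low half of v */
}
```

For a small threshold such as 128, the big-endian read yields 0 while
the little-endian read yields 128, which matches why x86-64 did not
show the problem.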

Fixes: b1346338fbae ("vsock_test: POLLIN + SO_RCVLOWAT test")
Fixes: 542e893fbadc ("vsock/test: two tests to check credit update logic")
Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility")
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Paolo Abeni [Thu, 5 Dec 2024 10:15:59 +0000 (11:15 +0100)]

Merge branch 'mitigate-the-two-reallocations-issue-for-iptunnels'

Justin Iurman says:

====================
Mitigate the two-reallocations issue for iptunnels

RESEND v5:
- v5 was sent just when net-next closed
v5:
- address Paolo's comments
- s/int dst_dev_overhead()/unsigned int dst_dev_overhead()/
v4:
- move static inline function to include/net/dst.h
v3:
- fix compilation error in seg6_iptunnel
v2:
- add missing "static" keywords in seg6_iptunnel
- use a static-inline function to return the dev overhead (as suggested
by Olek, thanks)

The same pattern is found in ioam6, rpl6, and seg6. Basically, it first
makes sure there is enough room for inserting a new header:

(1) err = skb_cow_head(skb, len + skb->mac_len);

Then, when the insertion (encap or inline) is performed, the input and
output handlers respectively make sure there is enough room for layer 2:

(2) err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));

skb_cow_head() does nothing when there is enough room. Otherwise, it
reallocates more room, which depends on the architecture. Briefly,
skb_cow_head() calls __skb_cow() which then calls pskb_expand_head() as
follows:

pskb_expand_head(skb, ALIGN(delta, NET_SKB_PAD), 0, GFP_ATOMIC);

"delta" represents the number of bytes to be added. This value is
aligned with NET_SKB_PAD, which is defined as follows:

NET_SKB_PAD = max(32, L1_CACHE_BYTES)

... where L1_CACHE_BYTES also depends on the architecture. In our case
(x86), it is defined as follows:

L1_CACHE_BYTES = (1 << CONFIG_X86_L1_CACHE_SHIFT)

... where (again, in our case) CONFIG_X86_L1_CACHE_SHIFT equals 6
(=X86_GENERIC).

All this to say, skb_cow_head() would reallocate to the next multiple of
NET_SKB_PAD (in our case a 64-byte multiple) when there is not enough
room.

Back to the main issue with the pattern: in some cases, two
reallocations are triggered, resulting in a performance drop (i.e.,
lines (1) and (2) would both trigger an implicit reallocation). How's
that possible? Well, this is kind of bad luck as we hit an exact
NET_SKB_PAD boundary and when skb->mac_len (=14) is smaller than
LL_RESERVED_SPACE(dst->dev) (=16 in our case). For an x86 arch, it
happens in the following cases (with the default needed_headroom):

- ioam6:
- (inline mode) pre-allocated data trace of 236 or 240 bytes
- (encap mode) pre-allocated data trace of 196 or 200 bytes
- seg6:
- (encap mode) for 13, 17, 21, 25, 29, 33, ...(+4)... prefixes

Let's illustrate the problem, i.e., when we fall on the exact
NET_SKB_PAD boundary. In the case of ioam6, for the above problematic
values, the total overhead is 256 bytes for both modes. Based on line
(1), skb->mac_len (=14) is added, therefore passing 270 bytes to
skb_cow_head(). At that moment, the headroom has 206 bytes available (in
our case). Since 270 > 206, skb_cow_head() performs a reallocation and
the new headroom is now 206 + 64 (NET_SKB_PAD) = 270. Which is exactly
the room we needed. After the insertion, the headroom has 0 bytes
available. But line (2) still needs 16 bytes, which triggers another
reallocation.

The same logic is applied to seg6 (although it does not happen with the
inline mode, i.e., -40 bytes). It happens with other L1 cache shifts too
(the larger the cache shift, the less often it happens). For example,
with a +32 cache shift (instead of +64), the following number of
segments would trigger two reallocations: 11, 15, 19, ... With a +128
cache shift, the following number of segments would trigger two
reallocations: 17, 25, 33, ... And so on and so forth. Note that it is
the same for both the "encap" and "l2encap" modes. For the "encap.red"
and "l2encap.red" modes, it is the same logic but with "segs+1" (e.g.,
14, 18, 22, 26, etc for a +64 cache shift). Note also that it may happen
with rpl6 (based on some calculations), although it did not in our case.

This series provides a solution to mitigate the aforementioned issue for
ioam6, seg6, and rpl6. It provides the dst_entry (in the cache) to
skb_cow_head() **before** the insertion (line (1)). As a result, the
very first iteration would still trigger two reallocations (i.e., empty
cache), while next iterations would only trigger a single reallocation.
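
The arithmetic above can be replayed with a small userspace model. It
is only a sketch: model_cow_head() and count_reallocs() are hypothetical
names, the headroom is a bare counter rather than an skb, and the
numbers (206 bytes of headroom, 256 bytes of overhead, mac_len of 14,
LL_RESERVED_SPACE of 16, cache shift of 6) are the ones quoted in this
cover letter.

```c
#define L1_CACHE_SHIFT	6
#define L1_CACHE_BYTES	(1u << L1_CACHE_SHIFT)
#define NET_SKB_PAD	(L1_CACHE_BYTES > 32u ? L1_CACHE_BYTES : 32u)

/* Round x up to the next multiple of a. */
static unsigned int align_up(unsigned int x, unsigned int a)
{
	return (x + a - 1) / a * a;
}

/*
 * Model of skb_cow_head(): do nothing if there is enough headroom,
 * otherwise grow it by the shortfall rounded up to NET_SKB_PAD.
 * Returns 1 if a reallocation happened, 0 otherwise.
 */
static int model_cow_head(unsigned int *headroom, unsigned int needed)
{
	if (*headroom >= needed)
		return 0;
	*headroom += align_up(needed - *headroom, NET_SKB_PAD);
	return 1;
}

/*
 * Count reallocations for one packet: line (1), the header insertion,
 * then line (2). first_overhead is what line (1) adds on top of the
 * header length: skb->mac_len before the series, the dev overhead after.
 */
static int count_reallocs(unsigned int headroom, unsigned int hdr_len,
			  unsigned int first_overhead,
			  unsigned int ll_reserved)
{
	int n = 0;

	n += model_cow_head(&headroom, hdr_len + first_overhead); /* (1) */
	headroom -= hdr_len;					   /* insertion */
	n += model_cow_head(&headroom, ll_reserved);		   /* (2) */
	return n;
}
```

In this model, count_reallocs(206, 256, 14, 16) hits the exact boundary
and reallocates at both steps, whereas count_reallocs(206, 256, 16, 16)
asks for the device overhead up front and reallocates only once.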
====================

Link: https://patch.msgid.link/20241203124945.22508-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Justin Iurman [Tue, 3 Dec 2024 12:49:45 +0000 (13:49 +0100)]

net: ipv6: rpl_iptunnel: mitigate 2-realloc issue

This patch mitigates the two-reallocations issue with rpl_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.

Performance tests before/after applying this patch clearly show
there is no negative impact (they even show an improvement):
- before: https://ibb.co/nQJhqwc
- after: https://ibb.co/4ZvW6wV

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Justin Iurman [Tue, 3 Dec 2024 12:49:44 +0000 (13:49 +0100)]

net: ipv6: seg6_iptunnel: mitigate 2-realloc issue

This patch mitigates the two-reallocations issue with seg6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.

Performance tests before/after applying this patch clearly show
the improvement:
- before: https://ibb.co/3Cg4sNH
- after: https://ibb.co/8rQ350r

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: David Lebrun <dlebrun@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Justin Iurman [Tue, 3 Dec 2024 12:49:43 +0000 (13:49 +0100)]

net: ipv6: ioam6_iptunnel: mitigate 2-realloc issue

This patch mitigates the two-reallocations issue with ioam6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration may still trigger
two reallocations (i.e., empty cache), while next iterations would only
trigger a single reallocation.

Performance tests before/after applying this patch clearly show
the improvement:
- inline mode:
  - before: https://ibb.co/LhQ8V63
  - after: https://ibb.co/x5YT2bS
- encap mode:
  - before: https://ibb.co/3Cjm5m0
  - after: https://ibb.co/TwpsxTC
- encap mode with tunsrc:
  - before: https://ibb.co/Gpy9QPg
  - after: https://ibb.co/PW1bZFT

This patch also fixes an incorrect behavior: after the insertion, the
second call to skb_cow_head() makes sure that the dev has enough
headroom in the skb for layer 2 and stuff. In that case, the "old"
dst_entry was used, which is now fixed. After discussing with Paolo, it
appears that both patches can be merged into a single one -this one-
(for the sake of readability) and target net-next.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Justin Iurman [Tue, 3 Dec 2024 12:49:42 +0000 (13:49 +0100)]

include: net: add static inline dst_dev_overhead() to dst.h

Add static inline dst_dev_overhead() function to include/net/dst.h. This
helper function is used by ioam6_iptunnel, rpl_iptunnel and
seg6_iptunnel to get the dev's overhead based on a cache entry
(dst_entry). If the cache is empty, the default and generic value
skb->mac_len is returned. Otherwise, LL_RESERVED_SPACE() over dst's dev
is returned.
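
The behaviour described above can be modelled in userspace as follows.
This is a sketch, not the code added to include/net/dst.h: the model_*
structs and functions are hypothetical stand-ins for the kernel types,
and model_ll_reserved_space() assumes that LL_RESERVED_SPACE() rounds
the link-layer header length plus needed headroom to a 16-byte
boundary, which is consistent with the value of 16 quoted for Ethernet
in the cover letter.

```c
#include <stddef.h>

#define HH_DATA_MOD 16u	/* assumed alignment unit for this model */

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct model_dev { unsigned int hard_header_len, needed_headroom; };
struct model_dst { const struct model_dev *dev; };
struct model_skb { unsigned int mac_len; };

/* Assumed model of LL_RESERVED_SPACE(dev). */
static unsigned int model_ll_reserved_space(const struct model_dev *dev)
{
	return ((dev->hard_header_len + dev->needed_headroom) &
		~(HH_DATA_MOD - 1)) + HH_DATA_MOD;
}

/* Empty cache: fall back to skb->mac_len; otherwise use the dst's dev. */
static unsigned int model_dst_dev_overhead(const struct model_dst *dst,
					   const struct model_skb *skb)
{
	if (dst)
		return model_ll_reserved_space(dst->dev);
	return skb->mac_len;
}

/* Convenience wrapper for an Ethernet-like device (14-byte header). */
static unsigned int overhead_for_ethernet(int cache_populated)
{
	const struct model_dev dev = { .hard_header_len = 14 };
	const struct model_dst dst = { .dev = &dev };
	const struct model_skb skb = { .mac_len = 14 };

	return model_dst_dev_overhead(cache_populated ? &dst : NULL, &skb);
}
```

With an empty cache the model returns 14 (the mac_len fallback); once a
dst_entry is cached it returns 16.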

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Jakub Kicinski [Thu, 5 Dec 2024 03:46:49 +0000 (19:46 -0800)]

Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-12-03 (ice, idpf, ixgbe, ixgbevf, igb)

This series contains updates to ice, idpf, ixgbe, ixgbevf, and igb
drivers.

For ice:
Arkadiusz corrects search for determining whether PHY clock recovery is
supported on the device.

Przemyslaw corrects mask used for PHY timestamps on ETH56G devices.

Wojciech adds missing virtchnl ops which caused NULL pointer
dereference.

Marcin fixes VLAN filter settings for uplink VSI in switchdev mode.

For idpf:
Josh restores setting of completion tag for empty buffers.

For ixgbevf:
Jake removes incorrect initialization/support of IPSEC for mailbox
version 1.5.

For ixgbe:
Jake rewords and downgrades misleading message when negotiation
of VF mailbox version is not supported.

Tore Amundsen corrects value for BASE-BX10 capability.

For igb:
Yuan Can adds proper teardown on failed pci_register_driver() call.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igb: Fix potential invalid memory access in igb_init_module()
  ixgbe: Correct BASE-BX10 compliance code
  ixgbe: downgrade logging of unsupported VF API version to debug
  ixgbevf: stop attempting IPSEC offload on Mailbox API 1.5
  idpf: set completion tag for "empty" bufs associated with a packet
  ice: Fix VLAN pruning in switchdev mode
  ice: Fix NULL pointer dereference in switchdev
  ice: fix PHY timestamp extraction for ETH56G
  ice: fix PHY Clock Recovery availability check
====================

Link: https://patch.msgid.link/20241203215521.1646668-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Heiner Kallweit [Tue, 3 Dec 2024 21:33:22 +0000 (22:33 +0100)]

r8169: simplify setting hwmon attribute visibility

Use new member visible to simplify setting the static visibility.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/dba77e76-be45-4a30-96c7-45e284072ad2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Jakub Kicinski [Thu, 5 Dec 2024 03:43:48 +0000 (19:43 -0800)]

Merge branch 'mlx5-misc-fixes-2024-12-03'

Tariq Toukan says:

====================
mlx5 misc fixes 2024-12-03

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/20241203204920.232744-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Jianbo Liu [Tue, 3 Dec 2024 20:49:20 +0000 (22:49 +0200)]

net/mlx5e: Remove workaround to avoid syndrome for internal port

Previously a workaround was added to avoid syndrome 0xcdb051. It is
triggered when offloading a rule with tunnel encapsulation, and
forwarding to another table, but not matching on the internal port in
firmware steering mode. The original workaround skips internal tunnel
port logic, which is not correct as not all cases are considered. As
an example, if vlan is configured on the uplink port, traffic can't
pass because vlan header is not added with this workaround. Besides,
there is no such issue for software steering. So, this patch removes
that, and returns error directly if trying to offload such rule for
firmware steering.

Fixes: 06b4eac9c4be ("net/mlx5e: Don't offload internal port if filter device is out device")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Tested-by: Frode Nordahl <frode.nordahl@canonical.com>
Reviewed-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Tariq Toukan [Tue, 3 Dec 2024 20:49:19 +0000 (22:49 +0200)]

net/mlx5e: SD, Use correct mdev to build channel param

In a multi-PF netdev, each traffic channel creates its own resources
against a specific PF.
In the cited commit, where this support was added, the channel_param
logic was mistakenly kept unchanged, so it always used the primary PF
which is found at priv->mdev.
In this patch we fix this by moving the logic to be per-channel, and
passing the correct mdev instance.

This bug is usually harmless, as the resulting cparam structures are
the same for all channels, due to identical FW logic and decisions.
However, in some use cases, like fwreset, this breaks.

This could lead to different symptoms. Example:
Error cqe on cqn 0x428, ci 0x0, qn 0x10a9, opcode 0xe, syndrome 0x4,
vendor syndrome 0x32
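
The per-channel idea described above can be illustrated with a minimal,
self-contained sketch. It is not the driver code: every name here
(toy_mdev, toy_priv, toy_dev_for_channel, the modulo mapping from channel
index to PF, the max_wqe_log field) is a hypothetical stand-in chosen for
illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEVS 4

/* Hypothetical stand-in for one PF's core device. */
struct toy_mdev {
	int pf_index;
	int max_wqe_log;	/* a value that could differ between PFs */
};

/* Hypothetical stand-in for the netdev private data of a multi-PF netdev. */
struct toy_priv {
	struct toy_mdev *mdev;			/* primary PF */
	struct toy_mdev *devs[TOY_MAX_DEVS];	/* all PFs, primary first */
	int num_devs;
};

struct toy_channel_param {
	int pf_index;
	int max_wqe_log;
};

/* Assumed mapping for this sketch: channels are spread round-robin over PFs. */
static struct toy_mdev *toy_dev_for_channel(const struct toy_priv *priv, int ix)
{
	assert(priv->num_devs > 0 && priv->num_devs <= TOY_MAX_DEVS);
	return priv->devs[ix % priv->num_devs];
}

/* Build the parameters from the device that is passed in, not from priv->mdev. */
static void toy_build_param(const struct toy_mdev *mdev,
			    struct toy_channel_param *param)
{
	param->pf_index = mdev->pf_index;
	param->max_wqe_log = mdev->max_wqe_log;
}

/* Returns the PF whose values end up in the parameters of channel ix. */
int toy_demo_pf_for_channel(int ix)
{
	struct toy_mdev pf0 = { .pf_index = 0, .max_wqe_log = 10 };
	struct toy_mdev pf1 = { .pf_index = 1, .max_wqe_log = 12 };
	struct toy_priv priv = {
		.mdev = &pf0,
		.devs = { &pf0, &pf1 },
		.num_devs = 2,
	};
	struct toy_channel_param param;

	toy_build_param(toy_dev_for_channel(&priv, ix), &param);
	return param.pf_index;
}
```

Building the parameters once from priv.mdev would always report PF 0 in
this toy model; building them per channel lets odd-numbered channels see
the values of PF 1.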

Fixes: e4f9686bdee7 ("net/mlx5e: Let channels be SD-aware")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Patrisious Haddad [Tue, 3 Dec 2024 20:49:18 +0000 (22:49 +0200)]

net/mlx5: E-Switch, Fix switching to switchdev mode in MPV

Fix the change made by the commit cited below for MPV mode. In MPV mode
the IB device is shared between different core devices, so with that
change, moving both devices to switchdev mode simultaneously lets the IB
device removal and re-addition race with itself, causing unexpected
behavior.

In such a case, do rescan_drivers() only once in order to add the ethernet
representor auxiliary device, and skip adding and removing IB devices.

Fixes: ab85ebf43723 ("net/mlx5: E-switch, refactor eswitch mode change")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Patrisious Haddad [Tue, 3 Dec 2024 20:49:17 +0000 (22:49 +0200)]

net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled

If the IB device is already disabled when moving to switchdev mode,
which can happen when working with LAG, rescan_drivers() must be called
before leaving in order to add the ethernet representor auxiliary device.

Fixes: ab85ebf43723 ("net/mlx5: E-switch, refactor eswitch mode change")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Cosmin Ratiu [Tue, 3 Dec 2024 20:49:16 +0000 (22:49 +0200)]

net/mlx5: HWS: Properly set bwc queue locks lock classes

The mentioned "Fixes" patch forgot to do that.

Fixes: 9addffa34359 ("net/mlx5: HWS, use lock classes for bwc locks")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Cosmin Ratiu [Tue, 3 Dec 2024 20:49:15 +0000 (22:49 +0200)]

net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout

It allocates a match template, which creates a compressed definer fc
struct, but that is not deallocated.

This commit fixes that.

Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Thu, 5 Dec 2024 03:23:36 +0000 (19:23 -0800)]

Merge branch 'bnxt_en-support-header-page-pool-in-queue-api'

David Wei says:

====================
bnxt_en: support header page pool in queue API

Commit 7ed816be35ab ("eth: bnxt: use page pool for head frags") added a
separate page pool for header frags. Now, frags are allocated from this
header page pool e.g. rxr->tpa_info.data.

The queue API did not properly handle rxr->tpa_info, so using the queue
API to, e.g., reset a queue results in pages being returned to the
incorrect page pool, causing inflight != 0 warnings.

Fix this bug by properly allocating/freeing tpa_info and copying/freeing
head_pool in the queue API implementation.

The 1st patch is a prep patch that refactors helpers out to be used by
the implementation patch later.

The 2nd patch is a drive-by refactor. Happy to take it out and re-send
to net-next if there are any objections.

The 3rd patch is the implementation patch that will properly alloc/free
rxr->tpa_info.
====================

Link: https://patch.msgid.link/20241204041022.56512-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


David Wei [Wed, 4 Dec 2024 04:10:22 +0000 (20:10 -0800)]

bnxt_en: handle tpa_info in queue API implementation

Commit 7ed816be35ab ("eth: bnxt: use page pool for head frags") added a
page pool for header frags, which may be distinct from the existing pool
for the aggregation ring. Prior to this change, frags used in the TPA
ring rx_tpa were allocated from system memory e.g. napi_alloc_frag()
meaning their lifetimes were not associated with a page pool. They can
be returned at any time and so the queue API did not alloc or free
rx_tpa.

But now frags come from a separate head_pool which may be different to
page_pool. Without allocating and freeing rx_tpa, frags allocated from
the old head_pool may be returned to a different new head_pool which
causes a mismatch between the pp hold/release count.

Fix this problem by properly freeing and allocating rx_tpa in the queue
API implementation.
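
The hold/release mismatch described above can be modelled with a toy
counter. This is only a sketch under stated assumptions: toy_pool,
toy_frag and the two restart functions are hypothetical names, and the
model tracks nothing but an in-flight count per pool. It is not the
page_pool or bnxt_en code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page pool: only counts outstanding frags. */
struct toy_pool {
	int inflight;
};

struct toy_frag {
	struct toy_pool *owner;	/* pool the frag was allocated from */
};

static struct toy_frag toy_alloc(struct toy_pool *pool)
{
	pool->inflight++;
	return (struct toy_frag){ .owner = pool };
}

/* Give a frag back to 'pool', which may or may not be its real owner. */
static void toy_release(struct toy_pool *pool, struct toy_frag *frag)
{
	assert(frag->owner != NULL);
	pool->inflight--;
	frag->owner = NULL;
}

/*
 * Queue restart that keeps the old frag: it is later released to the new
 * pool. Returns the old pool's in-flight count at teardown time.
 */
int restart_without_realloc(void)
{
	struct toy_pool old_pool = { 0 }, new_pool = { 0 };
	struct toy_frag frag = toy_alloc(&old_pool);

	/* The queue is restarted here and new_pool replaces old_pool. */
	toy_release(&new_pool, &frag);
	return old_pool.inflight;
}

/*
 * Queue restart that frees the frag to its own pool and allocates a fresh
 * one from the new pool. Returns the sum of both in-flight counts.
 */
int restart_with_realloc(void)
{
	struct toy_pool old_pool = { 0 }, new_pool = { 0 };
	struct toy_frag frag = toy_alloc(&old_pool);

	toy_release(&old_pool, &frag);
	frag = toy_alloc(&new_pool);
	toy_release(&new_pool, &frag);
	return old_pool.inflight + new_pool.inflight;
}
```

In this model the first variant leaves the old pool with one frag still
counted as in flight, which corresponds to the warning mentioned above,
while the second variant leaves both counters at zero.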

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


David Wei [Wed, 4 Dec 2024 04:10:21 +0000 (20:10 -0800)]

bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()

Refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap() for
allocating rx_agg_bmap.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-3-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


David Wei [Wed, 4 Dec 2024 04:10:20 +0000 (20:10 -0800)]

bnxt_en: refactor tpa_info alloc/free into helpers

Refactor bnxt_rx_ring_info->tpa_info operations into helpers that work
on a single tpa_info in prep for queue API using them.

There are 2 pairs of operations:

* bnxt_alloc_one_tpa_info()
* bnxt_free_one_tpa_info()

These alloc/free the tpa_info array itself.

* bnxt_alloc_one_tpa_info_data()
* bnxt_free_one_tpa_info_data()

These alloc/free the frags stored in tpa_info array.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kenjiro Nakayama [Tue, 3 Dec 2024 22:28:44 +0000 (07:28 +0900)]

selftests/net: call sendmmsg via udpgso_bench.sh

Currently, sendmmsg is implemented in udpgso_bench_tx.c,
but it is not called by any test script.

This patch adds a test for sendmmsg in udpgso_bench.sh.
This allows for basic API testing and benchmarking
comparisons with GSO.
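
For readers unfamiliar with the call, a minimal sketch of sendmmsg(2)
follows. It is not taken from udpgso_bench_tx.c. To stay self-contained it
uses an AF_UNIX datagram socketpair instead of UDP, and the function name
send_two_datagrams is a hypothetical one chosen for this example.

```c
#define _GNU_SOURCE	/* sendmmsg() and struct mmsghdr need this on glibc */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Queue two datagrams with a single sendmmsg() call.
 * Returns the number of messages sent, or -1 on error.
 */
int send_two_datagrams(void)
{
	int sv[2];
	char first[] = "first";
	char second[] = "second";
	struct iovec iov[2] = {
		{ .iov_base = first, .iov_len = sizeof(first) - 1 },
		{ .iov_base = second, .iov_len = sizeof(second) - 1 },
	};
	struct mmsghdr msgs[2];
	int sent;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	/* One mmsghdr per datagram, each pointing at its own iovec. */
	memset(msgs, 0, sizeof(msgs));
	msgs[0].msg_hdr.msg_iov = &iov[0];
	msgs[0].msg_hdr.msg_iovlen = 1;
	msgs[1].msg_hdr.msg_iov = &iov[1];
	msgs[1].msg_hdr.msg_iovlen = 1;

	sent = sendmmsg(sv[0], msgs, 2, 0);

	close(sv[0]);
	close(sv[1]);
	return sent;
}
```

The point of comparison with GSO is that sendmmsg() batches several
independent datagrams into one system call, whereas GSO hands the kernel
one large buffer to segment.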

Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241203222843.26983-1-nakayamakenjiro@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
