linux-block.git
8 weeks agotools: ynl-gen: move the count into a presence struct too
Jakub Kicinski [Mon, 5 May 2025 16:52:07 +0000 (09:52 -0700)]
tools: ynl-gen: move the count into a presence struct too

While we reshuffle the presence members, move the counts as well.
Previously array count members would have been place directly in
the struct, so:

  struct family_op_req {
      struct {
            u32 a:1;
            u32 b:1;
      } _present;
      struct {
            u32 bin;
      } _len;

      u32 a;
      u64 b;
      const unsigned char *bin;
      u32 n_multi;                 << count
      u32 *multi;                  << objects
  };

Since len has been moved to its own presence struct move the count
as well:

  struct family_op_req {
      struct {
            u32 a:1;
            u32 b:1;
      } _present;
      struct {
            u32 bin;
      } _len;
      struct {
            u32 multi;             << count
      } _count;

      u32 a;
      u64 b;
      const unsigned char *bin;
      u32 *multi;                  << objects
  };

This improves the consistency and allows us to remove some hacks
in the codegen. Unlike for len there is no known name collision
with the existing scheme.

Link: https://patch.msgid.link/20250505165208.248049-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agotools: ynl-gen: split presence metadata
Jakub Kicinski [Mon, 5 May 2025 16:52:06 +0000 (09:52 -0700)]
tools: ynl-gen: split presence metadata

Each YNL struct contains the data and a sub-struct indicating which
fields are valid. Something like:

  struct family_op_req {
      struct {
            u32 a:1;
            u32 b:1;
    u32 bin_len;
      } _present;

      u32 a;
      u64 b;
      const unsigned char *bin;
  };

Note that the bin object 'bin' has a length stored, and that length
has a _len suffix added to the field name. This breaks if there
is a explicit field called bin_len, which is the case for some
TC actions. Move the length fields out of the _present struct,
create a new struct called _len:

  struct family_op_req {
      struct {
            u32 a:1;
            u32 b:1;
      } _present;
      struct {
    u32 bin;
      } _len;

      u32 a;
      u64 b;
      const unsigned char *bin;
  };

This should prevent name collisions and help with the packing
of the struct.

Unfortunately this is a breaking change, but hopefully the migration
isn't too painful.

Link: https://patch.msgid.link/20250505165208.248049-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agotools: ynl-gen: rename basic presence from 'bit' to 'present'
Jakub Kicinski [Mon, 5 May 2025 16:52:05 +0000 (09:52 -0700)]
tools: ynl-gen: rename basic presence from 'bit' to 'present'

Internal change to the code gen. Rename how we indicate a type
has a single bit presence from using a 'bit' string to 'present'.
This is a noop in terms of generated code but will make next
breaking change easier.

Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250505165208.248049-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'lan78xx-phylink-prep'
David S. Miller [Wed, 7 May 2025 11:57:06 +0000 (12:57 +0100)]
Merge branch 'lan78xx-phylink-prep'

Oleksij Rempel says:

====================
lan78xx: preparation for PHYLINK conversion

This patch series contains the first part of the LAN78xx driver
refactoring in preparation for converting the driver to use the PHYLINK
framework.

The goal of this initial part is to reduce the size and complexity of
the final PHYLINK conversion by introducing incremental cleanups and
logical separation of concerns, such as:

- Improving error handling in the PHY initialization path
- Refactoring PHY detection and MAC-side configuration
- Moving LED DT configuration to a dedicated helper
- Separating USB link power and flow control setup from the main probe logic
- Extracting PHY interrupt acknowledgment logic

Each patch is self-contained and moves non-PHYLINK-specific logic out of
the way, setting the stage for the actual conversion in a follow-up
patch series.

changes v8 (as split from full v7 00/12 series):
- Split the original series to make review easier
- This part includes only preparation patches; actual PHYLINK
  integration will follow
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: Extract flow control configuration to helper
Oleksij Rempel [Mon, 5 May 2025 08:43:41 +0000 (10:43 +0200)]
net: usb: lan78xx: Extract flow control configuration to helper

Move flow control register configuration from
lan78xx_update_flowcontrol() into a new helper function
lan78xx_configure_flowcontrol(). This separates hardware-specific
programming from policy logic and simplifies the upcoming phylink
integration.

The values used in this initial version of
lan78xx_configure_flowcontrol() are taken over as-is from the original
implementation to avoid functional changes. While they may not be
optimal for all USB and link speed combinations, they are known to work
reliably. Optimization of pause time and thresholds based on runtime
conditions can be done in a separate follow-up patch.

The forward declaration of lan78xx_configure_flowcontrol() will also be
removed later during the phylink conversion.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: Refactor USB link power configuration into helper
Oleksij Rempel [Mon, 5 May 2025 08:43:40 +0000 (10:43 +0200)]
net: usb: lan78xx: Refactor USB link power configuration into helper

Move the USB link power configuration logic from lan78xx_link_reset()
to a new helper function lan78xx_configure_usb(). This simplifies the
main link reset path and isolates USB-specific logic.

The new function handles U1/U2 enablement based on Ethernet link speed,
but only for SuperSpeed-capable devices (LAN7800 and LAN7801). LAN7850,
a High-Speed-only device, is explicitly excluded. A warning is logged
if SuperSpeed is reported unexpectedly for LAN7850.

Add a forward declaration for lan78xx_configure_usb() as preparation for
the upcoming phylink conversion, where it will also be used from the
mac_link_up() callback.

Open questions remain:

- Why is the 1000 Mbps configuration split into two steps (U2 disable,
  then U1 enable), unlike the single-step config used for 10/100 Mbps?

- U1/U2 behavior appears to depend on proper EEPROM configuration.
  There are known devices in the field without EEPROM. Should the driver
  enforce safe defaults in such cases?

Due to lack of USB subsystem expertise, no changes were made to this logic
beyond structural refactoring.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: Extract PHY interrupt acknowledgment to helper
Oleksij Rempel [Mon, 5 May 2025 08:43:39 +0000 (10:43 +0200)]
net: usb: lan78xx: Extract PHY interrupt acknowledgment to helper

Move the PHY interrupt acknowledgment logic from lan78xx_link_reset()
to a new helper function lan78xx_phy_int_ack(). This simplifies the
code and prepares for reusing the acknowledgment logic independently
from the full link reset process, such as when using phylink.

No functional change intended.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: move LED DT configuration to helper
Oleksij Rempel [Mon, 5 May 2025 08:43:38 +0000 (10:43 +0200)]
net: usb: lan78xx: move LED DT configuration to helper

Extract the LED enable logic based on the "microchip,led-modes"
property into a new helper function lan78xx_configure_leds_from_dt().

This simplifies lan78xx_phy_init() and improves modularity.
No functional changes intended.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: refactor PHY init to separate detection and MAC configuration
Oleksij Rempel [Mon, 5 May 2025 08:43:37 +0000 (10:43 +0200)]
net: usb: lan78xx: refactor PHY init to separate detection and MAC configuration

Split out PHY detection into lan78xx_get_phy() and MAC-side setup into
lan78xx_mac_prepare_for_phy(), making the main lan78xx_phy_init() cleaner
and easier to follow.

This improves separation of concerns and prepares the code for a future
transition to phylink. Fixed PHY registration and interface selection
are now handled in lan78xx_get_phy(), while MAC-side delay configuration
is done in lan78xx_mac_prepare_for_phy().

The fixed PHY fallback is preserved for setups like EVB-KSZ9897-1,
where LAN7801 connects directly to a KSZ switch without a standard PHY
or device tree support.

No functional changes intended.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: remove explicit check for missing PHY driver
Oleksij Rempel [Mon, 5 May 2025 08:43:36 +0000 (10:43 +0200)]
net: usb: lan78xx: remove explicit check for missing PHY driver

RGMII timing correctness relies on the PHY providing internal delays.
This is typically ensured via PHY driver, strap pins, or PCB layout.

Explicitly checking for a PHY driver here is unnecessary and non-standard.
This logic applies to all MACs, not just LAN78xx, and should be left to
phylib, phylink, or platform configuration.

Drop the check and rely on standard subsystem behavior.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agonet: usb: lan78xx: Improve error handling in PHY initialization
Oleksij Rempel [Mon, 5 May 2025 08:43:35 +0000 (10:43 +0200)]
net: usb: lan78xx: Improve error handling in PHY initialization

Ensure that return values from `lan78xx_write_reg()`,
`lan78xx_read_reg()`, and `phy_find_first()` are properly checked and
propagated. Use `ERR_PTR(ret)` for error reporting in
`lan7801_phy_init()` and replace `-EIO` with `-ENODEV` where appropriate
to provide more accurate error codes.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 weeks agoMerge tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Wed, 7 May 2025 02:04:42 +0000 (19:04 -0700)]
Merge tag 'wireless-next-2025-05-06' of https://git./linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
wireless features, notably

 * stack
   - free SKBTX_WIFI_STATUS flag
   - fixes for VLAN multicast in multi-link
   - improve codel parameters (revert some old twiddling)
 * ath12k
   - Enable AHB support for IPQ5332.
   - Add monitor interface support to QCN9274.
   - Add MLO support to WCN7850.
   - Add 802.11d scan offload support to WCN7850.
 * ath11k
   - Restore hibernation support
 * iwlwifi
   - EMLSR on two 5 GHz links
 * mwifiex
   - cleanups/refactoring

along with many other small features/cleanups

* tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (177 commits)
  Revert "wifi: iwlwifi: clean up config macro"
  wifi: iwlwifi: move phy_filters to fw_runtime
  wifi: iwlwifi: pcie: make sure to lock rxq->read
  wifi: iwlwifi: add definitions for iwl_mac_power_cmd version 2
  wifi: iwlwifi: clean up config macro
  wifi: iwlwifi: mld: simplify iwl_mld_rx_fill_status()
  wifi: iwlwifi: mld: rx: simplify channel handling
  wifi: iwlwifi: clean up band in RX metadata
  wifi: iwlwifi: mld: skip unknown FW channel load values
  wifi: iwlwifi: define API for external FSEQ images
  wifi: iwlwifi: mld: allow EMLSR on separated 5 GHz subbands
  wifi: iwlwifi: mld: use cfg80211_chandef_get_width()
  wifi: iwlwifi: mld: fix iwl_mld_emlsr_disallowed_with_link() return
  wifi: iwlwifi: mld: clarify variable type
  wifi: iwlwifi: pcie: add support for the reset handshake in MSI
  wifi: mac80211_hwsim: Prevent tsf from setting if beacon is disabled
  wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
  wifi: nl80211: add link id of transmitted profile for MLO MBSSID
  wifi: ieee80211: Add helpers to fetch EMLSR delay and timeout values
  wifi: mac80211: update ML STA with EML capabilities
  ...

====================

Link: https://patch.msgid.link/20250506174656.119970-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'devlink-sanitize-variable-typed-attributes'
Jakub Kicinski [Wed, 7 May 2025 01:20:12 +0000 (18:20 -0700)]
Merge branch 'devlink-sanitize-variable-typed-attributes'

Jiri Pirko says:

====================
devlink: sanitize variable typed attributes

This is continuation based on first two patches of
https://lore.kernel.org/20250425214808.507732-1-saeed@kernel.org

Better to take it as a separate patchset, as the rest of the original
patchset is probably settled.

This patchset is taking care of incorrect usage of internal NLA_* values
in uapi, introduces new enum (in patch #2) that shadows NLA_* values and
makes that part of UAPI.

The last two patches removes unnecessary translations with maintaining
clear devlink param driver api.
====================

Link: https://patch.msgid.link/20250505114513.53370-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: use DEVLINK_VAR_ATTR_TYPE_* instead of NLA_* in fmsg
Jiri Pirko [Mon, 5 May 2025 11:45:13 +0000 (13:45 +0200)]
devlink: use DEVLINK_VAR_ATTR_TYPE_* instead of NLA_* in fmsg

Use newly introduced DEVLINK_VAR_ATTR_TYPE_* enum values instead of
internal NLA_* in fmsg health reporter code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: avoid param type value translations
Jiri Pirko [Mon, 5 May 2025 11:45:12 +0000 (13:45 +0200)]
devlink: avoid param type value translations

Assign DEVLINK_PARAM_TYPE_* enum values to DEVLINK_VAR_ATTR_TYPE_* to
ensure the same values are used internally and in UAPI. Benefit from
that by removing the value translations.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: define enum for attr types of dynamic attributes
Jiri Pirko [Mon, 5 May 2025 11:45:11 +0000 (13:45 +0200)]
devlink: define enum for attr types of dynamic attributes

Devlink param and health reporter fmsg use attributes with dynamic type
which is determined according to a different type. Currently used values
are NLA_*. The problem is, they are not part of UAPI. They may change
which would cause a break.

To make this future safe, introduce a enum that shadows NLA_* values in
it and is part of UAPI.

Also, this allows to possibly carry types that are unrelated to NLA_*
values.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agotools: ynl-gen: allow noncontiguous enums
Jiri Pirko [Mon, 5 May 2025 11:45:10 +0000 (13:45 +0200)]
tools: ynl-gen: allow noncontiguous enums

in case the enum has holes, instead of hard stop, generate a validation
callback to check valid enum values.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoio_uring/zcrx: selftests: fix setting ntuple rule into rss
David Wei [Sat, 3 May 2025 04:30:07 +0000 (21:30 -0700)]
io_uring/zcrx: selftests: fix setting ntuple rule into rss

Fix ethtool syntax for setting ntuple rule into rss. It should be
`context' instead of `action'.

Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250503043007.857215-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'net-phy-realtek-add-support-for-phy-leds'
Paolo Abeni [Tue, 6 May 2025 11:31:28 +0000 (13:31 +0200)]
Merge branch 'net-phy-realtek-add-support-for-phy-leds'

Michael Klein says:

====================
net: phy: realtek: Add support for PHY LEDs

Changes in V7:
- Remove some unused macros (patch 1)
- Add more register defines for RTL8211F (patch 3)
- Revise macro definition order once more (patch 4)

Changes in V6:
- fix macro definition order (patch 1)
- introduce two more register defines (patch 2)

Changes in V5:
- Split cleanup patch and improve code formatting

Changes in V4:
- Change (!ret) to (ret == 0)
- Replace set_bit() by __set_bit()

Changes in V3:
- move definition of rtl8211e_read_ext_page() to patch 2
- Wrap overlong lines

Changes in V2:
- Designate to net-next
- Add ExtPage access cleanup patch as suggested by Andrew Lunn
====================

Link: https://patch.msgid.link/20250504172916.243185-1-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: phy: realtek: Add support for PHY LEDs on RTL8211E
Michael Klein [Sun, 4 May 2025 17:29:16 +0000 (19:29 +0200)]
net: phy: realtek: Add support for PHY LEDs on RTL8211E

Like the RTL8211F, the RTL8211E PHY supports up to three LEDs.
Add netdev trigger support for them, too.

Signed-off-by: Michael Klein <michael@fossekall.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250504172916.243185-7-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: phy: realtek: use __set_bit() in rtl8211f_led_hw_control_get()
Michael Klein [Sun, 4 May 2025 17:29:15 +0000 (19:29 +0200)]
net: phy: realtek: use __set_bit() in rtl8211f_led_hw_control_get()

rtl8211f_led_hw_control_get() does not need atomic bit operations,
replace set_bit() by __set_bit().

Signed-off-by: Michael Klein <michael@fossekall.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250504172916.243185-6-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: phy: realtek: Group RTL82* macro definitions
Michael Klein [Sun, 4 May 2025 17:29:14 +0000 (19:29 +0200)]
net: phy: realtek: Group RTL82* macro definitions

Group macro definitions by PHY in lexicographic order. Within each PHY
block, definitions are order by page number and then register number.

Signed-off-by: Michael Klein <michael@fossekall.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250504172916.243185-5-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: phy: realtek: add RTL8211F register defines
Michael Klein [Sun, 4 May 2025 17:29:13 +0000 (19:29 +0200)]
net: phy: realtek: add RTL8211F register defines

Add some more defines for RTL8211F page and register numbers.

Signed-off-by: Michael Klein <michael@fossekall.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250504172916.243185-4-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: phy: realtek: Clean up RTL821x ExtPage access
Michael Klein [Sun, 4 May 2025 17:29:12 +0000 (19:29 +0200)]
net: phy: realtek: Clean up RTL821x ExtPage access

Factor out RTL8211E extension page access code to
rtl821x_modify_ext_page() and clean up rtl8211e_config_init()

Signed-off-by: Michael Klein <michael@fossekall.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250504172916.243185-3-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: phy: realtek: remove unsed RTL821x_PHYSR* macros
Michael Klein [Sun, 4 May 2025 17:29:11 +0000 (19:29 +0200)]
net: phy: realtek: remove unsed RTL821x_PHYSR* macros

These macros have there since the first revision but were never used, so
let's just remove them.

Signed-off-by: Michael Klein <michael@fossekall.de>
Link: https://patch.msgid.link/20250504172916.243185-2-michael@fossekall.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge tag 'nf-next-25-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilt...
Paolo Abeni [Tue, 6 May 2025 11:19:01 +0000 (13:19 +0200)]
Merge tag 'nf-next-25-05-06' of git://git./linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Apparently, nf_conntrack_bridge changes the way in which fragments
   are handled, dealing to packet drop. From Huajian Yang.

2) Add a selftest to stress the conntrack subsystem, from Florian Westphal.

3) nft_quota depletion is off-by-one byte, Zhongqiu Duan.

4) Rewrites the procfs to read the conntrack table to speed it up,
   from Florian Westphal.

5) Two patches to prevent overflow in nft_pipapo lookup table and to
   clamp the maximum bucket size.

6) Update nft_fib selftest to check for loopback packet bypass.
   From Florian Westphal.

netfilter pull request 25-05-06

* tag 'nf-next-25-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  selftests: netfilter: nft_fib.sh: check lo packets bypass fib lookup
  netfilter: nft_set_pipapo: clamp maximum map bucket size to INT_MAX
  netfilter: nft_set_pipapo: prevent overflow in lookup table allocation
  netfilter: nf_conntrack: speed up reads from nf_conntrack proc file
  netfilter: nft_quota: match correctly when the quota just depleted
  selftests: netfilter: add conntrack stress test
  netfilter: bridge: Move specific fragmented packet to slow_path instead of dropping it
====================

Link: https://patch.msgid.link/20250505234151.228057-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoeth: fbnic: fix `tx_dropped` counting
Mohsin Bashir [Sat, 3 May 2025 02:01:45 +0000 (19:01 -0700)]
eth: fbnic: fix `tx_dropped` counting

Fix the tracking of rtnl_link_stats.tx_dropped. The counter
`tmi.drop.frames` is being double counted whereas, the counter
`tti.cm_drop.frames` is being skipped.

Fixes: f2957147ae7a ("eth: fbnic: add support for TTI HW stats")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250503020145.1868252-1-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoselftests: net: exit cleanly on SIGTERM / timeout
Jakub Kicinski [Sat, 3 May 2025 01:18:56 +0000 (18:18 -0700)]
selftests: net: exit cleanly on SIGTERM / timeout

ksft runner sends 2 SIGTERMs in a row if a test runs out of time.
Handle this in a similar way we handle SIGINT - cleanup and stop
running further tests.

Because we get 2 signals we need a bit of logic to ignore
the subsequent one, they come immediately one after the other
(due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM
to runner child")).

This change makes sure we run cleanup (scheduled defer()s)
and also print a stack trace on SIGTERM, which doesn't happen
by default. Tests occasionally hang in NIPA and it's impossible
to tell what they are waiting from or doing.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250503011856.46308-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge branch 'net-ibmveth-make-ibmveth-use-new-reset-function-and-new-kunit-testsg'
Paolo Abeni [Tue, 6 May 2025 08:24:40 +0000 (10:24 +0200)]
Merge branch 'net-ibmveth-make-ibmveth-use-new-reset-function-and-new-kunit-testsg'

Dave Marquardt says:

====================
net: ibmveth: Make ibmveth use new reset function and new KUnit testsg

- Fixed struct ibmveth_adapter indentation
- Made ibmveth driver use WARN_ON with recovery rather than BUG_ON. Some
  recovery code schedules a reset through new function ibmveth_reset. Also
  removed a conflicting and unneeded forward declaration.
- Added KUnit tests for some areas changed by the WARN_ON changes.
====================

Link: https://patch.msgid.link/20250501194944.283729-1-davemarq@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: ibmveth: added KUnit tests for some buffer pool functions
Dave Marquardt [Thu, 1 May 2025 19:49:44 +0000 (14:49 -0500)]
net: ibmveth: added KUnit tests for some buffer pool functions

Added KUnit tests for ibmveth_remove_buffer_from_pool and
ibmveth_rxq_get_buffer under new IBMVETH_KUNIT_TEST config option.

Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501194944.283729-4-davemarq@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: ibmveth: Reset the adapter when unexpected states are detected
Dave Marquardt [Thu, 1 May 2025 19:49:43 +0000 (14:49 -0500)]
net: ibmveth: Reset the adapter when unexpected states are detected

Reset the adapter through new function ibmveth_reset, called in
WARN_ON situations. Removed conflicting and unneeded forward
declaration.

Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Link: https://patch.msgid.link/20250501194944.283729-3-davemarq@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: ibmveth: Indented struct ibmveth_adapter correctly
Dave Marquardt [Thu, 1 May 2025 19:49:42 +0000 (14:49 -0500)]
net: ibmveth: Indented struct ibmveth_adapter correctly

Made struct ibmveth_adapter follow indentation rules

Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501194944.283729-2-davemarq@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agovhost/net: Defer TX queue re-enable until after sendmsg
Jon Kohler [Thu, 1 May 2025 02:04:28 +0000 (19:04 -0700)]
vhost/net: Defer TX queue re-enable until after sendmsg

In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
batches up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after the last ring
check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
kick the vhost worker. This may happen while the worker is still
processing the sendmsg, leading to wasteful exit(s).

This is particularly problematic for single-threaded guest submission
threads, as they must exit, wait for the exit to be processed
(potentially involving a TTWU), and then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg as the "busy"
period by keeping kicks disabled during the final sendmsg and
performing one additional ring check afterward provides a significant
performance improvement without any excess busy poll cycles.

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.

Test Case
    TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
    RX VM: taskset -c 2 iperf3 -s -p 5200 -D
    6.12.0, each worker backed by tun interface with IFF_NAPI setup.
    Note: TCP side is largely unchanged as that was copy bound

6.12.0 unpatched
    EPT_MISCONFIG/second: 5411
    Datagrams/second: ~382k
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
    EPT_MISCONFIG/second: 58 (~93x reduction)
    Datagrams/second: ~650k  (~1.7x increase)
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250501020428.1889162-1-jon@nutanix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipv4: ip_tunnel: Replace strcpy use with strscpy
Ruben Wauters [Thu, 1 May 2025 20:23:55 +0000 (21:23 +0100)]
ipv4: ip_tunnel: Replace strcpy use with strscpy

Use of strcpy is decpreated, replaces the use of strcpy with strscpy as
recommended.

strscpy was chosen as it requires a NUL terminated non-padded string,
which is the case here.

I am aware there is an explicit bounds check above the second instance,
however using strscpy protects against buffer overflows in any future
code, and there is no good reason I can see to not use it.

I have also replaced the scrscpy above that had 3 params with the
version using 2 params. These are functionally equivalent, but it is
cleaner to have both using 2 params.

Signed-off-by: Ruben Wauters <rubenru09@aol.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501202935.46318-1-rubenru09@aol.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'net-ethtool-introduce-ethnl-dump-helpers'
Jakub Kicinski [Tue, 6 May 2025 00:17:43 +0000 (17:17 -0700)]
Merge branch 'net-ethtool-introduce-ethnl-dump-helpers'

Maxime Chevallier says:

====================
net: ethtool: Introduce ethnl dump helpers

This is V8 for per-phy DUMP helpers, improving support for ->dumpit()
operations for PHY targetting commands.

This V8 fixes some issues spotted by Jakub (thanks !) on the multi-part
DUMP sequence. The netdev reftracking was reworked to make sure that
during a filtered DUMP, we only keep a ref on the netdev during
individual .dumpit() calls.

v1: https://lore.kernel.org/20250305141938.319282-1-maxime.chevallier@bootlin.com
v2: https://lore.kernel.org/20250308155440.267782-1-maxime.chevallier@bootlin.com
v3: https://lore.kernel.org/20250313182647.250007-1-maxime.chevallier@bootlin.com
v4: https://lore.kernel.org/20250324104012.367366-1-maxime.chevallier@bootlin.com
v5: https://lore.kernel.org/20250410123350.174105-1-maxime.chevallier@bootlin.com
v6: https://lore.kernel.org/20250415085155.132963-1-maxime.chevallier@bootlin.com
v7: https://lore.kernel.org/20250422161717.164440-1-maxime.chevallier@bootlin.com
====================

Link: https://patch.msgid.link/20250502085242.248645-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethtool: netlink: Use netdev_hold for dumpit() operations
Maxime Chevallier [Fri, 2 May 2025 08:52:41 +0000 (10:52 +0200)]
net: ethtool: netlink: Use netdev_hold for dumpit() operations

Move away from dev_hold and use netdev_hold with a local reftracker when
performing a DUMP on each netdev.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250502085242.248645-4-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethtool: phy: Convert the PHY_GET command to generic phy dump
Maxime Chevallier [Fri, 2 May 2025 08:52:40 +0000 (10:52 +0200)]
net: ethtool: phy: Convert the PHY_GET command to generic phy dump

Now that we have an infrastructure in ethnl for perphy DUMPs, we can get
rid of the custom ->doit and ->dumpit to deal with PHY listing commands.

As most of the code was custom, this basically means re-writing how we
deal with PHY listing.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250502085242.248645-3-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethtool: Introduce per-PHY DUMP operations
Maxime Chevallier [Fri, 2 May 2025 08:52:39 +0000 (10:52 +0200)]
net: ethtool: Introduce per-PHY DUMP operations

ethnl commands that target a phy_device need a DUMP implementation that
will fill the reply for every PHY behind a netdev. We therefore need to
iterate over the dev->topo to list them.

When multiple PHYs are behind the same netdev, it's also useful to
perform DUMP with a filter on a given netdev, to get the capability of
every PHY.

Implement dedicated genl ->start(), ->dumpit() and ->done() operations
for PHY-targetting command, allowing filtered dumps and using a dump
context that keep track of the PHY iteration for multi-message dump.

PSE-PD and PLCA are converted to this new set of ops along the way.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250502085242.248645-2-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: iou-zcrx: Clean up build warnings for error format
Haiyue Wang [Fri, 2 May 2025 17:50:25 +0000 (01:50 +0800)]
selftests: iou-zcrx: Clean up build warnings for error format

Clean up two build warnings:

[1]

iou-zcrx.c: In function â€˜process_recvzc’:
iou-zcrx.c:263:37: warning: too many arguments for format [-Wformat-extra-args]
  263 |                         error(1, 0, "payload mismatch at ", i);
      |                                     ^~~~~~~~~~~~~~~~~~~~~~

[2] Use "%zd" for ssize_t type as better

iou-zcrx.c: In function â€˜run_client’:
iou-zcrx.c:357:47: warning: format â€˜%d’ expects argument of type â€˜int’, but argument 4 has type â€˜ssize_t’ {aka â€˜long int’} [-Wformat=]
  357 |                         error(1, 0, "send(): %d", sent);
      |                                              ~^   ~~~~
      |                                               |   |
      |                                               int ssize_t {aka long int}
      |                                              %ld

Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250502175136.1122-1-haiyuewa@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'selftests-mptcp-increase-code-coverage'
Jakub Kicinski [Mon, 5 May 2025 23:53:04 +0000 (16:53 -0700)]
Merge branch 'selftests-mptcp-increase-code-coverage'

Matthieu Baerts says:

====================
selftests: mptcp: increase code coverage

Here are various patches slightly improving MPTCP code coverage:

- Patch 1: avoid a harmless 'grep: write error' warning.

- Patch 2: use getaddrinfo() with IPPROTO_MPTCP in more places.

- Patch 3-6: prepare and add support to get info for a specific subflow
  when giving the 5-tuple.

- Patch 7: validate the previous patch and cover "subflow_get_info_size"
  in the kernel code (mptcp_diag.c).
====================

Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-0-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: add chk_sublfow in diag.sh
Gang Yan [Fri, 2 May 2025 12:29:27 +0000 (14:29 +0200)]
selftests: mptcp: add chk_sublfow in diag.sh

This patch aims to add chk_dump_subflow in diag.sh. The subflow's
info can be obtained through "ss -tin", then use the 'mptcp_diag'
to verify the token in subflow_info.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/524
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-7-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: add helpers to get subflow_info
Gang Yan [Fri, 2 May 2025 12:29:26 +0000 (14:29 +0200)]
selftests: mptcp: add helpers to get subflow_info

This patch adds 'get_subflow_info' in 'mptcp_diag', which can check whether
a TCP connection is an MPTCP subflow based on the "INET_ULP_INFO_MPTCP"
with tcp_diag method.

The helper 'print_subflow_info' in 'mptcp_diag' can print the subflow_filed
of an MPTCP subflow for further checking the 'subflow_info' through
inet_diag method.

The example of the whole output should be:

  $ ./mptcp_diag -s "127.0.0.1:10000 127.0.0.1:38984"
  127.0.0.1:10000 -> 127.0.0.1:38984
  It's a mptcp subflow, the subflow info:
   flags:Mec token:0000(id:0)/4278e77e(id:0) seq:9288466187236176036 \
   sfseq:1 ssnoff:2317083055 maplen:215

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-6-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: refactor NLMSG handling with 'proto'
Gang Yan [Fri, 2 May 2025 12:29:25 +0000 (14:29 +0200)]
selftests: mptcp: refactor NLMSG handling with 'proto'

This patch introduces the '__u32 proto' variable to the 'send_query' and
'recv_nlmsg' functions for further extending function.

In the 'send_query' function, the inclusion of this variable makes the
structure clearer and more readable.

In the 'recv_nlmsg' function, the '__u32 proto' variable ensures that
the 'diag_info' field remains unmodified when processing IPPROTO_TCP data,
thereby preventing unintended transformation into 'mptcp_info' format.

While at it, increment iovlen directly when an item is added to simplify
this portion of the code and improve its readaility.

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-5-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: refactor send_query parameters for code clarity
Gang Yan [Fri, 2 May 2025 12:29:24 +0000 (14:29 +0200)]
selftests: mptcp: refactor send_query parameters for code clarity

This patch use 'inet_diag_req_v2' instead of 'token' as parameters of
send_query, and construct the req in 'get_mptcpinfo'.

This modification enhances the clarity of the code, and prepare for the
dump_subflow_info.

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-4-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: add struct params in mptcp_diag
Gang Yan [Fri, 2 May 2025 12:29:23 +0000 (14:29 +0200)]
selftests: mptcp: add struct params in mptcp_diag

This patch adds a struct named 'params' to save 'target_token' and other
future parameters. This structure facilitates future function expansions.

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-3-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: sockopt: use IPPROTO_MPTCP for getaddrinfo
Geliang Tang [Fri, 2 May 2025 12:29:22 +0000 (14:29 +0200)]
selftests: mptcp: sockopt: use IPPROTO_MPTCP for getaddrinfo

getaddrinfo MPTCP is recently supported in glibc and IPPROTO_MPTCP for
getaddrinfo is used in mptcp_connect.c. But in mptcp_sockopt.c and
mptcp_inq.c, IPPROTO_TCP are still used for getaddrinfo, So this patch
updates them.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-2-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: info: hide 'grep: write error' warnings
Matthieu Baerts (NGI0) [Fri, 2 May 2025 12:29:21 +0000 (14:29 +0200)]
selftests: mptcp: info: hide 'grep: write error' warnings

mptcp_lib_get_info_value() will only print the first entry that match
the filter because of the ';q' at the end. As a consequence, the 'sed'
command could finish before the previous 'grep' one and print a 'write
error' warning because it is trying to write data to the closed pipe.

Such warnings are not interesting, they can be hidden by muting stderr
here for grep.

While at it, clearly indicate that mptcp_lib_get_info_value() will only
print the first matched entry to avoid confusions later on.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-1-68eec95898fb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agosctp: Remove unused sctp_assoc_del_peer and sctp_chunk_iif
Dr. David Alan Gilbert [Thu, 1 May 2025 23:38:15 +0000 (00:38 +0100)]
sctp: Remove unused sctp_assoc_del_peer and sctp_chunk_iif

sctp_assoc_del_peer() last use was removed in 2015 by
commit 73e6742027f5 ("sctp: Do not try to search for the transport twice")
which now uses rm_peer instead of del_peer.

sctp_chunk_iif() last use was removed in 2016 by
commit 1f45f78f8e51 ("sctp: allow GSO frags to access the chunk too")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250501233815.99832-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: phy: Refactor fwnode_get_phy_node()
Andy Shevchenko [Wed, 30 Apr 2025 14:38:02 +0000 (17:38 +0300)]
net: phy: Refactor fwnode_get_phy_node()

Refactor to check if the fwnode we got is correct and return if so,
otherwise do additional checks. Using same pattern in all conditionals
makes it slightly easier to read and understand.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20250430143802.3714405-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agostrparser: Remove unused __strp_unpause
Dr. David Alan Gilbert [Thu, 1 May 2025 00:24:02 +0000 (01:24 +0100)]
strparser: Remove unused __strp_unpause

The last use of __strp_unpause() was removed in 2022 by
commit 84c61fe1a75b ("tls: rx: do not use the standard strparser")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501002402.308843-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Mon, 5 May 2025 20:22:58 +0000 (13:22 -0700)]
Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-05-02

We've added 14 non-merge commits during the last 10 day(s) which contain
a total of 13 files changed, 740 insertions(+), 121 deletions(-).

The main changes are:

1) Avoid skipping or repeating a sk when using a UDP bpf_iter,
   from Jordan Rife.

2) Fixed a crash when a bpf qdisc is set in
   the net.core.default_qdisc, from Amery Hung.

3) A few other fixes in the bpf qdisc, from Amery Hung.
   - Always call qdisc_watchdog_init() in the .init prologue such that
     the .reset/.destroy epilogue can always call qdisc_watchdog_cancel()
     without issue.
   - bpf_qdisc_init_prologue() was incorrectly returning an error
     when the bpf qdisc is set as the default_qdisc and the mq is creating
     the default_qdisc. It is now fixed.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Cleanup bpf qdisc selftests
  selftests/bpf: Test attaching a bpf qdisc with incomplete operators
  bpf: net_sched: Make some Qdisc_ops ops mandatory
  selftests/bpf: Test setting and creating bpf qdisc as default qdisc
  bpf: net_sched: Fix bpf qdisc init prologue when set as default qdisc
  selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
  selftests/bpf: Return socket cookies from sock_iter_batch progs
  bpf: udp: Avoid socket skips and repeats during iteration
  bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
  bpf: udp: Get rid of st_bucket_done
  bpf: udp: Make sure iter->batch always contains a full bucket snapshot
  bpf: udp: Make mem flags configurable through bpf_iter_udp_realloc_batch
  bpf: net_sched: Fix using bpf qdisc as default qdisc
  selftests/bpf: Fix compilation errors
====================

Link: https://patch.msgid.link/20250503010755.4030524-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: netfilter: nft_fib.sh: check lo packets bypass fib lookup
Florian Westphal [Wed, 23 Apr 2025 09:57:29 +0000 (11:57 +0200)]
selftests: netfilter: nft_fib.sh: check lo packets bypass fib lookup

With reverted fix:
PASS: fib expression did not cause unwanted packet drops
[   37.285169] ns1-KK76Kt nft_rpfilter: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=32287 DF PROTO=ICMP TYPE=8 CODE=0 ID=1818 SEQ=1
FAIL: rpfilter did drop packets
FAIL: ns1-KK76Kt cannot reach 127.0.0.1, ret 0

Check for this.

Link: https://lore.kernel.org/netfilter/20250422114352.GA2092@breakpoint.cc/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agonetfilter: nft_set_pipapo: clamp maximum map bucket size to INT_MAX
Pablo Neira Ayuso [Tue, 22 Apr 2025 19:52:44 +0000 (21:52 +0200)]
netfilter: nft_set_pipapo: clamp maximum map bucket size to INT_MAX

Otherwise, it is possible to hit WARN_ON_ONCE in __kvmalloc_node_noprof()
when resizing hashtable because __GFP_NOWARN is unset.

Similar to:

  b541ba7d1f5a ("netfilter: conntrack: clamp maximum hashtable size to INT_MAX")

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agonetfilter: nft_set_pipapo: prevent overflow in lookup table allocation
Pablo Neira Ayuso [Tue, 22 Apr 2025 19:52:43 +0000 (21:52 +0200)]
netfilter: nft_set_pipapo: prevent overflow in lookup table allocation

When calculating the lookup table size, ensure the following
multiplication does not overflow:

- desc->field_len[] maximum value is U8_MAX multiplied by
  NFT_PIPAPO_GROUPS_PER_BYTE(f) that can be 2, worst case.
- NFT_PIPAPO_BUCKETS(f->bb) is 2^8, worst case.
- sizeof(unsigned long), from sizeof(*f->lt), lt in
  struct nft_pipapo_field.

Then, use check_mul_overflow() to multiply by bucket size and then use
check_add_overflow() to the alignment for avx2 (if needed). Finally, add
lt_size_check_overflow() helper and use it to consolidate this.

While at it, replace leftover allocation using the GFP_KERNEL to
GFP_KERNEL_ACCOUNT for consistency, in pipapo_resize().

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agonetfilter: nf_conntrack: speed up reads from nf_conntrack proc file
Florian Westphal [Tue, 22 Apr 2025 13:17:29 +0000 (15:17 +0200)]
netfilter: nf_conntrack: speed up reads from nf_conntrack proc file

Dumping all conntrack entries via proc interface can take hours due to
linear search to skip entries dumped so far in each cycle.

Apply same strategy used to speed up ipvs proc reading done in
commit 178883fd039d ("ipvs: speed up reads from ip_vs_conn proc file")
to nf_conntrack.

Note that the ctnetlink interface doesn't suffer from this problem, but
many scripts depend on the nf_conntrack proc interface.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agonetfilter: nft_quota: match correctly when the quota just depleted
Zhongqiu Duan [Thu, 17 Apr 2025 15:49:30 +0000 (15:49 +0000)]
netfilter: nft_quota: match correctly when the quota just depleted

The xt_quota compares skb length with remaining quota, but the nft_quota
compares it with consumed bytes.

The xt_quota can match consumed bytes up to quota at maximum. But the
nft_quota break match when consumed bytes equal to quota.

i.e., nft_quota match consumed bytes in [0, quota - 1], not [0, quota].

Fixes: 795595f68d6c ("netfilter: nft_quota: dump consumed quota")
Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agoselftests: netfilter: add conntrack stress test
Florian Westphal [Thu, 17 Apr 2025 15:14:28 +0000 (17:14 +0200)]
selftests: netfilter: add conntrack stress test

Add a new test case to check:
 - conntrack_max limit is effective
 - conntrack_max limit cannot be exceeded from within a netns
 - resizing the hash table while packets are inflight works
 - removal of all conntrack rules disables conntrack in netns
 - conntrack tool dump (conntrack -L) returns expected number
   of (unique) entries
 - procfs interface - if available - has same number of entries
   as conntrack -L dump

Expected output with selftest framework:
 selftests: net/netfilter: conntrack_resize.sh
 PASS: got 1 connections: netns conntrack_max is pernet bound
 PASS: got 100 connections: netns conntrack_max is init_net bound
 PASS: dump in netns had same entry count (-C 1778, -L 1778, -p 1778, /proc 0)
 PASS: dump in netns had same entry count (-C 2000, -L 2000, -p 2000, /proc 0)
 PASS: test parallel conntrack dumps
 PASS: resize+flood
 PASS: got 0 connections: conntrack disabled
 PASS: got 1 connections: conntrack enabled
ok 1 selftests: net/netfilter: conntrack_resize.sh

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agonetfilter: bridge: Move specific fragmented packet to slow_path instead of dropping it
Huajian Yang [Thu, 17 Apr 2025 09:29:53 +0000 (17:29 +0800)]
netfilter: bridge: Move specific fragmented packet to slow_path instead of dropping it

The config NF_CONNTRACK_BRIDGE will change the bridge forwarding for
fragmented packets.

The original bridge does not know that it is a fragmented packet and
forwards it directly, after NF_CONNTRACK_BRIDGE is enabled, function
nf_br_ip_fragment and br_ip6_fragment will check the headroom.

In original br_forward, insufficient headroom of skb may indeed exist,
but there's still a way to save the skb in the device driver after
dev_queue_xmit.So droping the skb will change the original bridge
forwarding in some cases.

Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system")
Signed-off-by: Huajian Yang <huajianyang@asrmicro.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 weeks agoipv4: Honor "ignore_routes_with_linkdown" sysctl in nexthop selection
Ido Schimmel [Wed, 30 Apr 2025 10:02:40 +0000 (13:02 +0300)]
ipv4: Honor "ignore_routes_with_linkdown" sysctl in nexthop selection

Commit 32607a332cfe ("ipv4: prefer multipath nexthop that matches source
address") changed IPv4 nexthop selection to prefer a nexthop whose
nexthop device is assigned the specified source address for locally
generated traffic.

While the selection honors the "fib_multipath_use_neigh" sysctl and will
not choose a nexthop with an invalid neighbour, it does not honor the
"ignore_routes_with_linkdown" sysctl and can choose a nexthop without a
carrier:

 $ sysctl net.ipv4.conf.all.ignore_routes_with_linkdown
 net.ipv4.conf.all.ignore_routes_with_linkdown = 1
 $ ip route show 198.51.100.0/24
 198.51.100.0/24
         nexthop via 192.0.2.2 dev dummy1 weight 1
         nexthop via 192.0.2.18 dev dummy2 weight 1 dead linkdown
 $ ip route get 198.51.100.1 from 192.0.2.17
 198.51.100.1 from 192.0.2.17 via 192.0.2.18 dev dummy2 uid 0

Solve this by skipping over nexthops whose assigned hash upper bound is
minus one, which is the value assigned to nexthops that do not have a
carrier when the "ignore_routes_with_linkdown" sysctl is set.

In practice, this probably does not matter a lot as the initial route
lookup for the source address would not choose a nexthop that does not
have a carrier in the first place, but the change does make the code
clearer.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 months agoipv6: Restore fib6_config validation for SIOCADDRT.
Kuniyuki Iwashima [Thu, 1 May 2025 00:53:29 +0000 (17:53 -0700)]
ipv6: Restore fib6_config validation for SIOCADDRT.

syzkaller reported out-of-bounds read in ipv6_addr_prefix(),
where the prefix length was over 128.

The cited commit accidentally removed some fib6_config
validation from the ioctl path.

Let's restore the validation.

[0]:
BUG: KASAN: slab-out-of-bounds in ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814)
Read of size 1 at addr ff11000138020ad4 by task repro/261

CPU: 3 UID: 0 PID: 261 Comm: repro Not tainted 6.15.0-rc3-00614-g0d15a26b247d #87 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:123)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:636)
 ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814)
 ip6_route_add (net/ipv6/route.c:3902)
 ipv6_route_ioctl (net/ipv6/route.c:4523)
 inet6_ioctl (net/ipv6/af_inet6.c:577)
 sock_do_ioctl (net/socket.c:1190)
 sock_ioctl (net/socket.c:1314)
 __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:906 fs/ioctl.c:892 fs/ioctl.c:892)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f518fb2de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007fff14f38d18 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f518fb2de5d
RDX: 00000000200015c0 RSI: 000000000000890b RDI: 0000000000000003
RBP: 00007fff14f38d30 R08: 0000000000000800 R09: 0000000000000800
R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff14f38e48
R13: 0000000000401136 R14: 0000000000403df0 R15: 00007f518fd3c000
 </TASK>

Fixes: fa76c1674f2e ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Yi Lai <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/netdev/aBAcKDEFoN%2FLntBF@ly-workstation/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250501005335.53683-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agomptcp: Align mptcp_inet6_sk with other protocols
Pedro Falcato [Wed, 30 Apr 2025 15:45:41 +0000 (16:45 +0100)]
mptcp: Align mptcp_inet6_sk with other protocols

Ever since commit f5f80e32de12 ("ipv6: remove hard coded limitation on
ipv6_pinfo") that protocols stopped using the old "obj_size -
sizeof(struct ipv6_pinfo)" way of grabbing ipv6_pinfo, that severely
restricted struct layout and caused fun, hard to see issues.

However, mptcp_inet6_sk wasn't fixed (unlike tcp_inet6_sk). Do so.
The non-cloned sockets already do the right thing using
ipv6_pinfo_offset + the generic IPv6 code.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250430154541.1038561-1-pfalcato@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'net-stmmac-replace-speed_mode_2500-method'
Jakub Kicinski [Sat, 3 May 2025 01:25:11 +0000 (18:25 -0700)]
Merge branch 'net-stmmac-replace-speed_mode_2500-method'

Russell King says:

====================
net: stmmac: replace speed_mode_2500() method

This series replaces the speed_mode_2500() method with a new method
that is more flexible, allowing the platform glue driver to populate
phylink's supported_interfaces and set the PHY-side interface mode.

The only user of this method is currently dwmac-intel, which we
update to use this new method.
====================

Link: https://patch.msgid.link/aBNe0Vt81vmqVCma@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: remove speed_mode_2500() method
Russell King (Oracle) [Thu, 1 May 2025 11:45:27 +0000 (12:45 +0100)]
net: stmmac: remove speed_mode_2500() method

Remove the speed_mode_2500() platform method which is no longer used
or necessary, being superseded by the more flexible get_interfaces()
method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1uASM3-0021R3-2B@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: intel: convert speed_mode_2500() to get_interfaces()
Russell King (Oracle) [Thu, 1 May 2025 11:45:21 +0000 (12:45 +0100)]
net: stmmac: intel: convert speed_mode_2500() to get_interfaces()

TGL platforms support either SGMII or 2500BASE-X, which is determined
by reading a SERDES register.

Thus, plat->phy_interface (and phylink's supported_interfaces) depend
on this. Use the new .get_interfaces() method to set both
plat->phy_interface and the supported_interfaces bitmap.

This removes the only user of the .speed_mode_2500() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1uASLx-0021Qs-Uz@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: intel: move phy_interface init to tgl_common_data()
Russell King (Oracle) [Thu, 1 May 2025 11:45:16 +0000 (12:45 +0100)]
net: stmmac: intel: move phy_interface init to tgl_common_data()

Move the initialisation of plat->phy_interface to tgl_common_data()
as all callers set this same interface mode. This moves it to a
single location to make the change to get_interfaces() more obvious.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1uASLs-0021Qk-Qt@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: add get_interfaces() platform method
Russell King (Oracle) [Thu, 1 May 2025 11:45:11 +0000 (12:45 +0100)]
net: stmmac: add get_interfaces() platform method

Add a get_interfaces() platform method to allow platforms to indicate
to phylink which interface modes they support - which then allows
phylink to validate on initialisation that the configured PHY interface
mode is actually supported.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uASLn-0021Qd-Mi@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: use priv->plat->phy_interface directly
Russell King (Oracle) [Thu, 1 May 2025 11:45:06 +0000 (12:45 +0100)]
net: stmmac: use priv->plat->phy_interface directly

Avoid using a local variable for priv->plat->phy_interface as this
may be modified in the .get_interfaces() method added in a future
commit.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1uASLi-0021QX-HG@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: stmmac: use a local variable for priv->phylink_config
Russell King (Oracle) [Thu, 1 May 2025 11:45:01 +0000 (12:45 +0100)]
net: stmmac: use a local variable for priv->phylink_config

Use a local variable for priv->phylink_config in stmmac_phy_setup()
which makes the code a bit easier to read, allowing some lines to be
merged.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1uASLd-0021QR-Cu@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'fix-bpf-qdisc-bugs-and-clean-up'
Martin KaFai Lau [Fri, 2 May 2025 21:50:09 +0000 (14:50 -0700)]
Merge branch 'fix-bpf-qdisc-bugs-and-clean-up'

Amery Hung says:

====================
Fix bpf qdisc bugs and clean up

This patchset fixes the following bugs in bpf qdisc and clean up the
selftest.

- A null-pointer dereference can happen in qdisc_watchdog_cancel() if
  the timer is not initialized when 1) .init is not defined by user so
  init prologue is not generated. 2) .init fails and qdisc_create()
  calls .destroy

- bpf qdisc fails to attach to mq/mqprio when being set as the default
  qdisc due to failed qdisc_lookup() in init prologue

v2
- Rebase to bpf-next/net
- Fix erroneous commit messages
- Fix and simplify selftests cleanup
  v1: https://lore.kernel.org/bpf/20250501223025.569020-1-ameryhung@gmail.com/
====================

Link: https://patch.msgid.link/20250502201624.3663079-1-ameryhung@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Cleanup bpf qdisc selftests
Amery Hung [Fri, 2 May 2025 20:16:24 +0000 (13:16 -0700)]
selftests/bpf: Cleanup bpf qdisc selftests

Some cleanups:
- Remove unnecessary kfuncs declaration
- Use _ns in the test name to run tests in a separate net namespace
- Call skeleton __attach() instead of bpf_map__attach_struct_ops() to
  simplify tests.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Test attaching a bpf qdisc with incomplete operators
Amery Hung [Fri, 2 May 2025 20:16:23 +0000 (13:16 -0700)]
selftests/bpf: Test attaching a bpf qdisc with incomplete operators

Implement .destroy in bpf_fq and bpf_fifo as it is now mandatory.

Test attaching a bpf qdisc with a missing operator .init. This is not
allowed as bpf qdisc qdisc_watchdog_cancel() could have been called with
an uninitialized timer.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agobpf: net_sched: Make some Qdisc_ops ops mandatory
Amery Hung [Fri, 2 May 2025 20:16:22 +0000 (13:16 -0700)]
bpf: net_sched: Make some Qdisc_ops ops mandatory

The patch makes all currently supported Qdisc_ops (i.e., .enqueue,
.dequeue, .init, .reset, and .destroy) mandatory.

Make .init, .reset and .destroy mandatory as bpf qdisc relies on prologue
and epilogue to check attach points and correctly initialize/cleanup
resources. The prologue/epilogue will only be generated for an struct_ops
operator only if users implement the operator.

Make .enqueue and .dequeue mandatory as bpf qdisc infra does not provide
a default data path.

Fixes: c8240344956e ("bpf: net_sched: Support implementation of Qdisc_ops in bpf")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Test setting and creating bpf qdisc as default qdisc
Amery Hung [Fri, 2 May 2025 20:16:21 +0000 (13:16 -0700)]
selftests/bpf: Test setting and creating bpf qdisc as default qdisc

First, test that bpf qdisc can be set as default qdisc. Then, attach
an mq qdisc to see if bpf qdisc can be successfully created and grafted.

The test is a sequential test as net.core.default_qdisc is global.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agobpf: net_sched: Fix bpf qdisc init prologue when set as default qdisc
Amery Hung [Fri, 2 May 2025 20:16:20 +0000 (13:16 -0700)]
bpf: net_sched: Fix bpf qdisc init prologue when set as default qdisc

Allow .init to proceed if qdisc_lookup() returns NULL as it only happens
when called by qdisc_create_dflt() in mq/mqprio_init and the parent qdisc
has not been added to qdisc_hash yet. In qdisc_create(), the caller,
__tc_modify_qdisc(), would have made sure the parent qdisc already exist.

In addition, call qdisc_watchdog_init() whether .init succeeds or not to
prevent null-pointer dereference. In qdisc_create() and
qdisc_create_dflt(), if .init fails, .destroy will be called. As a
result, the destroy epilogue could call qdisc_watchdog_cancel() with an
uninitialized timer, causing null-pointer deference in hrtimer_cancel().

Fixes: c8240344956e ("bpf: net_sched: Support implementation of Qdisc_ops in bpf")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoMerge branch 'bpf-udp-exactly-once-socket-iteration'
Martin KaFai Lau [Fri, 2 May 2025 17:32:02 +0000 (10:32 -0700)]
Merge branch 'bpf-udp-exactly-once-socket-iteration'

Jordan Rife says:

====================
bpf: udp: Exactly-once socket iteration

Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket, which is a measure of the number of matching sockets
from the current bucket that have been seen or processed by the
iterator. On subsequent iterations, if the current bucket has
unprocessed items, we skip at least iter->offset matching items in the
bucket before adding any remaining items to the next batch. However,
iter->offset isn't always an accurate measure of "things already seen"
when the underlying bucket changes between reads which can lead to
repeated or skipped sockets. Instead, this series remembers the cookies
of the sockets we haven't seen yet in the current bucket and resumes
from the first cookie in that list that we can find on the next
iteration. This series focuses on UDP socket iterators, but a later
series will apply a similar approach to TCP socket iterators.

To be more specific, this series replaces struct sock **batch inside
struct bpf_udp_iter_state with union bpf_udp_iter_batch_item *batch,
where union bpf_udp_iter_batch_item can contain either a pointer to a
socket or a socket cookie. During reads, batch contains pointers to all
sockets in the current batch while between reads batch contains all the
cookies of the sockets in the current bucket that have yet to be
processed. On subsequent reads, when iteration resumes,
bpf_iter_udp_batch finds the first saved cookie that matches a socket in
the bucket's socket list and picks up from there to construct the next
batch. On average, assuming it's rare that the next socket disappears
before the next read occurs, we should only need to scan as much as we
did with the offset-based approach to find the starting point. In the
case that the next socket is no longer there, we keep scanning through
the saved cookies list until we find a match. The worst case is when
none of the sockets from last time exist anymore, but again, this should
be rare.

CHANGES
=======
v6 -> v7:
* Move initialization of iter->state.bucket to -1 from patch five ("bpf:
  udp: Avoid socket skips and repeats during iteration") to patch three
  ("bpf: udp: Get rid of st_bucket_done") to avoid skipping the first
  bucket in the patch three and four (Martin).
* Rename sock to sk in bpf_iter_batch_item (Martin).
* Use ASSERT_OK_PTR in do_resume_test to check if counts is NULL
  (Martin).
* goto done in do_resume_test when calloc or sock_iter_batch__open fails
  to make sure things are cleaned up properly, and initialize pointers
  to NULL explicitly to silence warnings from llvm 20 in CI.

v5 -> v6:
* Rework the logic in patch two ("bpf: udp: Make sure iter->batch
  always contains a full bucket snapshot") again to simplify it:
  * Only try realloc with GFP_USER one time instead of two (Alexei).
  * v5 introduced a second call to bpf_iter_udp_realloc_batch inside the
    loop to handle the GFP_ATOMIC case. In v6, move the GFP_USER case
    inside the loop as well, so it's all in once place. This, I feel,
    makes it a bit easier to understand the control flow. Consequently,
    it also simplifies the logic outside the loop.
* Use GFP_NOWAIT instead of GFP_ATOMIC to avoid depleting memory
  reserves, since iterators are not critical operation (Alexei). Alexei
  suggested using __GFP_NOWARN as well with GFP_NOWAIT, but this is
  already set inside bpf_iter_udp_realloc_batch, so no change was needed
  there.
* Introduce patch three ("bpf: udp: Get rid of st_bucket_done") to
  simplify things further, since with patch two, st_bucket_done == true
  is equivalent to iter->cur_sk == iter->end_sk.
* In patch five ("bpf: udp: Avoid socket skips and repeats during
  iteration"), initialize iter->state.bucket to -1 so that on the first
  call to bpf_iter_udp_batch, the resume_bucket condition is not hit.
  This avoids adding a special case to the condition around
  bpf_iter_udp_resume for bucket zero.

v4 -> v5:
* Rework the logic from patch two ("bpf: udp: Make sure iter->batch
  always contains a full bucket snapshot") to move the handling of the
  GFP_ATOMIC case inside the main loop and get rid of the extra lock
  variable. This makes the logic clearer and makes it clearer that the
  bucket lock is always released (Martin).
* Introduce udp_portaddr_for_each_entry_from in patch two instead of
  patch four ("bpf: udp: Avoid socket skips and repeats during
  iteration"), since patch two now needs to be able to resume list
  iteration from an arbitrary point in the GFP_ATOMIC case.
* Similarly, introduce the memcpy inside bpf_iter_udp_realloc_batch in
  patch two instead of patch four, since in the GFP_ATOMIC case the new
  batch needs to remember the sockets from the old batch.
* Use sock_gen_cookie instead of __sock_gen_cookie inside
  bpf_iter_udp_put_batch, since it can be called from a preemptible
  context (Martin).

v3 -> v4:
* Explicitly assign sk = NULL on !iter->end_sk exit condition
  (Kuniyuki).
* Reword the commit message of patch two ("bpf: udp: Make sure
  iter->batch always contains a full bucket snapshot") to make the
  reasoning for GFP_ATOMIC more clear.

v2 -> v3:
* Guarantee that iter->batch is always a full snapshot of a bucket to
  prevent socket repeat scenarios [3]. This supercedes the patch from v2
  that simply propagated ENOMEM up from bpf_iter_udp_batch and covers
  the scenario where the batch size is still too small after a realloc.
* Fix up self tests (Martin)
  * ASSERT_EQ(nread, sizeof(out), "nread") instead of
    ASSERT_GE(nread, 1, "nread) in read_n.
  * Use ASSERT_OK and ASSERT_OK_FD in several places.
  * Add missing free(counts) to do_resume_test.
  * Move int local_port declaration to the top of do_resume_test.
  * Remove unnecessary guards before close and free.

v1 -> v2:
* Drop WARN_ON_ONCE from bpf_iter_udp_realloc_batch (Kuniyuki).
* Fixed memcpy size parameter in bpf_iter_udp_realloc_batch; before it
  was missing sizeof(elem) * (Kuniyuki).
* Move "bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch" to patch
  two in the series (Kuniyuki).

rfc [1] -> v1:
* Use hlist_entry_safe directly to retrieve the first socket in the
  current bucket's linked list instead of immediately breaking from
  udp_portaddr_for_each_entry (Martin).
* Cancel iteration if bpf_iter_udp_realloc_batch() can't grab enough
  memory to contain a full snapshot of the current bucket to prevent
  unwanted skips or repeats [2].

[1]: https://lore.kernel.org/bpf/20250404220221.1665428-1-jordan@jrife.io/
[2]: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/d323d417-3e8b-48af-ae94-bc28469ac0c1@linux.dev/
====================

Link: https://patch.msgid.link/20250502161528.264630-1-jordan@jrife.io
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Add tests for bucket resume logic in UDP socket iterators
Jordan Rife [Fri, 2 May 2025 16:15:26 +0000 (09:15 -0700)]
selftests/bpf: Add tests for bucket resume logic in UDP socket iterators

Introduce a set of tests that exercise various bucket resume scenarios:

* remove_seen resumes iteration after removing a socket from the bucket
  that we've already processed. Before, with the offset-based approach,
  this test would have skipped an unseen socket after resuming
  iteration. With the cookie-based approach, we now see all sockets
  exactly once.
* remove_unseen exercises the condition where the next socket that we
  would have seen is removed from the bucket before we resume iteration.
  This tests the scenario where we need to scan past the first cookie in
  our remembered cookies list to find the socket from which to resume
  iteration.
* remove_all exercises the condition where all sockets we remembered
  were removed from the bucket to make sure iteration terminates and
  returns no more results.
* add_some exercises the condition where a few, but not enough to
  trigger a realloc, sockets are added to the head of the current bucket
  between reads. Before, with the offset-based approach, this test would
  have repeated sockets we've already seen. With the cookie-based
  approach, we now see all sockets exactly once.
* force_realloc exercises the condition that we need to realloc the
  batch on a subsequent read, since more sockets than can be held in the
  current batch array were added to the current bucket. This exercies
  the logic inside bpf_iter_udp_realloc_batch that copies cookies into
  the new batch to make sure nothing is skipped or repeated.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Return socket cookies from sock_iter_batch progs
Jordan Rife [Fri, 2 May 2025 16:15:25 +0000 (09:15 -0700)]
selftests/bpf: Return socket cookies from sock_iter_batch progs

Extend the iter_udp_soreuse and iter_tcp_soreuse programs to write the
cookie of the current socket, so that we can track the identity of the
sockets that the iterator has seen so far. Update the existing do_test
function to account for this change to the iterator program output. At
the same time, teach both programs to work with AF_INET as well.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agobpf: udp: Avoid socket skips and repeats during iteration
Jordan Rife [Fri, 2 May 2025 16:15:24 +0000 (09:15 -0700)]
bpf: udp: Avoid socket skips and repeats during iteration

Replace the offset-based approach for tracking progress through a bucket
in the UDP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.

This approach guarantees that all sockets that existed when iteration
began and continue to exist throughout will be visited exactly once.
Sockets that are added to the table during iteration may or may not be
seen, but if they are they will be seen exactly once.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agobpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
Jordan Rife [Fri, 2 May 2025 16:15:23 +0000 (09:15 -0700)]
bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items

Prepare for the next patch that tracks cookies between iterations by
converting struct sock **batch to union bpf_udp_iter_batch_item *batch
inside struct bpf_udp_iter_state.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
2 months agobpf: udp: Get rid of st_bucket_done
Jordan Rife [Fri, 2 May 2025 16:15:22 +0000 (09:15 -0700)]
bpf: udp: Get rid of st_bucket_done

Get rid of the st_bucket_done field to simplify UDP iterator state and
logic. Before, st_bucket_done could be false if bpf_iter_udp_batch
returned a partial batch; however, with the last patch ("bpf: udp: Make
sure iter->batch always contains a full bucket snapshot"),
st_bucket_done == true is equivalent to iter->cur_sk == iter->end_sk.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agobpf: udp: Make sure iter->batch always contains a full bucket snapshot
Jordan Rife [Fri, 2 May 2025 16:15:21 +0000 (09:15 -0700)]
bpf: udp: Make sure iter->batch always contains a full bucket snapshot

Require that iter->batch always contains a full bucket snapshot. This
invariant is important to avoid skipping or repeating sockets during
iteration when combined with the next few patches. Before, there were
two cases where a call to bpf_iter_udp_batch may only capture part of a
bucket:

1. When bpf_iter_udp_realloc_batch() returns -ENOMEM [1].
2. When more sockets are added to the bucket while calling
   bpf_iter_udp_realloc_batch(), making the updated batch size
   insufficient [2].

In cases where the batch size only covers part of a bucket, it is
possible to forget which sockets were already visited, especially if we
have to process a bucket in more than two batches. This forces us to
choose between repeating or skipping sockets, so don't allow this:

1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
   fails instead of continuing with a partial batch.
2. Try bpf_iter_udp_realloc_batch() with GFP_USER just as before, but if
   we still aren't able to capture the full bucket, call
   bpf_iter_udp_realloc_batch() again while holding the bucket lock to
   guarantee the bucket does not change. On the second attempt use
   GFP_NOWAIT since we hold onto the spin lock.

Introduce the udp_portaddr_for_each_entry_from macro and use it instead
of udp_portaddr_for_each_entry to make it possible to continue iteration
from an arbitrary socket. This is required for this patch in the
GFP_NOWAIT case to allow us to fill the rest of a batch starting from
the middle of a bucket and the later patch which skips sockets that were
already seen.

Testing all scenarios directly is a bit difficult, but I did some manual
testing to exercise the code paths where GFP_NOWAIT is used and where
ERR_PTR(err) is returned. I used the realloc test case included later
in this series to trigger a scenario where a realloc happens inside
bpf_iter_udp_batch and made a small code tweak to force the first
realloc attempt to allocate a too-small batch, thus requiring
another attempt with GFP_NOWAIT. Some printks showed both reallocs with
the tests passing:

Apr 25 23:16:24 crow kernel: go again GFP_USER
Apr 25 23:16:24 crow kernel: go again GFP_NOWAIT

With this setup, I also forced each of the bpf_iter_udp_realloc_batch
calls to return -ENOMEM to ensure that iteration ends and that the
read() in userspace fails.

[1]: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/7ed28273-a716-4638-912d-f86f965e54bb@linux.dev/

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agobpf: udp: Make mem flags configurable through bpf_iter_udp_realloc_batch
Jordan Rife [Fri, 2 May 2025 16:15:20 +0000 (09:15 -0700)]
bpf: udp: Make mem flags configurable through bpf_iter_udp_realloc_batch

Prepare for the next patch which needs to be able to choose either
GFP_USER or GFP_NOWAIT for calls to bpf_iter_udp_realloc_batch.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
2 months agoMerge branch 'tools-ynl-gen-additional-c-types-and-classic-netlink-handling'
Paolo Abeni [Fri, 2 May 2025 10:41:05 +0000 (12:41 +0200)]
Merge branch 'tools-ynl-gen-additional-c-types-and-classic-netlink-handling'

Jakub Kicinski says:

====================
tools: ynl-gen: additional C types and classic netlink handling

This series is a bit of a random grab bag adding things we need
to generate code for rt-link.

First two patches are pretty random code cleanups.

Patch 3 adds default values if the spec is missing them.

Patch 4 adds support for setting Netlink request flags
(NLM_F_CREATE, NLM_F_REPLACE etc.). Classic netlink uses those
quite a bit.

Patches 5 and 6 extend the notification handling for variations
used in classic netlink. Patch 6 adds support for when notification
ID is the same as the ID of the response message to GET.

Next 4 patches add support for handling a couple of complex types.
These are supported by the schema and Python but C code gen wasn't
there.

Patch 11 is a bit of a hack, it skips code related to kernel
policy generation, since we don't need it for classic netlink.

Patch 12 adds support for having different fixed headers per op.
Something we could avoid in previous rtnetlink specs but some
specs do mix.

v2: https://lore.kernel.org/20250425024311.1589323-1-kuba@kernel.org
v1: https://lore.kernel.org/20250424021207.1167791-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250429154704.2613851-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl: allow fixed-header to be specified per op
Jakub Kicinski [Tue, 29 Apr 2025 15:47:04 +0000 (08:47 -0700)]
tools: ynl: allow fixed-header to be specified per op

rtnetlink has variety of ops with different fixed headers.
Detect that op fixed header is not the same as family one,
and use sizeof() directly. For reverse parsing we need to
pass the fixed header len along the policy (in the socket
state).

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-13-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: don't init enum checks for classic netlink
Jakub Kicinski [Tue, 29 Apr 2025 15:47:03 +0000 (08:47 -0700)]
tools: ynl-gen: don't init enum checks for classic netlink

rt-link has a vlan-protocols enum with:

   name: 8021q     value: 33024
   name: 8021ad    value: 34984

It's nice to have, since it converts the values to strings in Python.
For C, however, the codegen is trying to use enums to generate strict
policy checks. Parsing such sparse enums is not possible via policies.

Since for classic netlink we don't support kernel codegen and policy
generation - skip the auto-generation of checks from enums.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-12-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: array-nest: support binary array with exact-len
Jakub Kicinski [Tue, 29 Apr 2025 15:47:02 +0000 (08:47 -0700)]
tools: ynl-gen: array-nest: support binary array with exact-len

IPv6 addresses are expressed as binary arrays since we don't have u128.
Since they are not variable length, however, they are relatively
easy to represent as an array of known size.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-11-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: array-nest: support put for scalar
Jakub Kicinski [Tue, 29 Apr 2025 15:47:01 +0000 (08:47 -0700)]
tools: ynl-gen: array-nest: support put for scalar

C codegen supports ArrayNest AKA indexed-array carrying scalars,
but only for the netlink -> struct parsing. Support rendering
from struct to netlink.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-10-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: mutli-attr: support binary types with struct
Jakub Kicinski [Tue, 29 Apr 2025 15:47:00 +0000 (08:47 -0700)]
tools: ynl-gen: mutli-attr: support binary types with struct

Binary types with struct are fixed size, relatively easy to
handle for multi attr. Declare the member as a pointer.
Count the members, allocate an array, copy in the data.
Allow the netlink attr to be smaller or larger than our view
of the struct in case the build headers are newer or older
than the running kernel.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-9-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: multi-attr: type gen for string
Jakub Kicinski [Tue, 29 Apr 2025 15:46:59 +0000 (08:46 -0700)]
tools: ynl-gen: multi-attr: type gen for string

Add support for multi attr strings (needed for link alt_names).
We record the length individual strings in a len member, to do
the same for multi-attr create a struct ynl_string in ynl.h
and use it as a layer holding both the string and its length.
Since strings may be arbitrary length dynamically allocate each
individual one.

Adjust arg_member and struct member to avoid spacing the double
pointers to get "type **name;" rather than "type * *name;"

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-8-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: support CRUD-like notifications for classic Netlink
Jakub Kicinski [Tue, 29 Apr 2025 15:46:58 +0000 (08:46 -0700)]
tools: ynl-gen: support CRUD-like notifications for classic Netlink

Allow CRUD-style notification where the notification is more
like the response to the request, which can optionally be
looped back onto the requesting socket. Since the notification
and request are different ops in the spec, for example:

    -
      name: delrule
      doc: Remove an existing FIB rule
      attribute-set: fib-rule-attrs
      do:
        request:
          value: 33
          attributes: *fib-rule-all
    -
      name: delrule-ntf
      doc: Notify a rule deletion
      value: 33
      notify: getrule

We need to find the request by ID. Ideally we'd detect this model
from the spec properties, rather than assume that its what all
classic netlink families do. But maybe that'd cause this model
to spread and its easy to get wrong. For now assume CRUD == classic.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-7-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: support using dump types for ntf
Jakub Kicinski [Tue, 29 Apr 2025 15:46:57 +0000 (08:46 -0700)]
tools: ynl-gen: support using dump types for ntf

Classic Netlink has GET callbacks with no doit support, just dumps.
Support using their responses in notifications. If notification points
at a type which only has a dump - use the dump's type.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-6-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl: let classic netlink requests specify extra nlflags
Jakub Kicinski [Tue, 29 Apr 2025 15:46:56 +0000 (08:46 -0700)]
tools: ynl: let classic netlink requests specify extra nlflags

Classic netlink makes extensive use of flags. Support specifying
them the same way as attributes are specified (using a helper),
for example:

     rt_link_newlink_req_set_nlflags(req, NLM_F_CREATE | NLM_F_ECHO);

Wrap the code up in a RenderInfo predicate. I think that some
genetlink families may want this, too. It should be easy to
add a spec property later.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-5-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: fill in missing empty attr lists
Jakub Kicinski [Tue, 29 Apr 2025 15:46:55 +0000 (08:46 -0700)]
tools: ynl-gen: fill in missing empty attr lists

The C codegen refers to op attribute lists all over the place,
without checking if they are present, even tho attribute list
is technically an optional property. Add them automatically
at init if missing so that we don't have to make specs longer.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-4-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: factor out free_needs_iter for a struct
Jakub Kicinski [Tue, 29 Apr 2025 15:46:54 +0000 (08:46 -0700)]
tools: ynl-gen: factor out free_needs_iter for a struct

Instead of walking the entries in the code gen add a method
for the struct class to return if any of the members need
an iterator.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agotools: ynl-gen: fix comment about nested struct dict
Jakub Kicinski [Tue, 29 Apr 2025 15:46:53 +0000 (08:46 -0700)]
tools: ynl-gen: fix comment about nested struct dict

The dict stores struct objects (of class Struct), not just
a trivial set with directions.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250429154704.2613851-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2 months agodt-bindings: net: via-rhine: Convert to YAML
Alexey Charkov [Wed, 30 Apr 2025 10:42:45 +0000 (14:42 +0400)]
dt-bindings: net: via-rhine: Convert to YAML

Rewrite the textual description for the VIA Rhine platform Ethernet
controller as YAML schema, and switch the filename to follow the
compatible string. These are used in several VIA/WonderMedia SoCs

Signed-off-by: Alexey Charkov <alchark@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250430-rhine-binding-v2-1-4290156c0f57@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup
Andrea Mayer [Tue, 29 Apr 2025 13:24:53 +0000 (15:24 +0200)]
ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup

Recent updates to the locking mechanism that protects IPv6 routing tables
[1] have affected the SRv6 networking subsystem. Such changes cause
problems with some SRv6 Endpoints behaviors, like End.B6.Encaps and also
impact SRv6 counters.

Starting from commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE."), the inet6_rtm_newroute() function no longer needs to
acquire the RTNL lock for creating and configuring IPv6 routes and set up
lwtunnels.
The RTNL lock can be avoided because the ip6_route_add() function
finishes setting up a new route in a section protected by RCU.
This makes sure that no dev/nexthops can disappear during the operation.
Because of this, the steps for setting up lwtunnels - i.e., calling
lwtunnel_build_state() - are now done in a RCU lock section and not
under the RTNL lock anymore.

However, creating and configuring a lwtunnel instance in an
RCU-protected section can be problematic when that tunnel needs to
allocate memory using the GFP_KERNEL flag.
For example, the following trace shows what happens when an SRv6
End.B6.Encaps behavior is instantiated after commit 169fd62799e8 ("ipv6:
Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE."):

[ 3061.219696] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[ 3061.226136] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 445, name: ip
[ 3061.232101] preempt_count: 0, expected: 0
[ 3061.235414] RCU nest depth: 1, expected: 0
[ 3061.238622] 1 lock held by ip/445:
[ 3061.241458]  #0: ffffffff83ec64a0 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x41/0x1e0
[ 3061.248520] CPU: 1 UID: 0 PID: 445 Comm: ip Not tainted 6.15.0-rc3-micro-vm-dev-00590-ge527e891492d #2058 PREEMPT(full)
[ 3061.248532] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 3061.248549] Call Trace:
[ 3061.248620]  <TASK>
[ 3061.248633]  dump_stack_lvl+0xa9/0xc0
[ 3061.248846]  __might_resched+0x218/0x360
[ 3061.248871]  __kmalloc_node_track_caller_noprof+0x332/0x4e0
[ 3061.248889]  ? rcu_is_watching+0x3a/0x70
[ 3061.248902]  ? parse_nla_srh+0x56/0xa0
[ 3061.248938]  kmemdup_noprof+0x1c/0x40
[ 3061.248952]  parse_nla_srh+0x56/0xa0
[ 3061.248969]  seg6_local_build_state+0x2e0/0x580
[ 3061.248992]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.249013]  ? do_raw_spin_lock+0x111/0x1d0
[ 3061.249027]  ? __pfx_seg6_local_build_state+0x10/0x10
[ 3061.249068]  ? lwtunnel_build_state+0xe1/0x3a0
[ 3061.249274]  lwtunnel_build_state+0x10d/0x3a0
[ 3061.249303]  fib_nh_common_init+0xce/0x1e0
[ 3061.249337]  ? __pfx_fib_nh_common_init+0x10/0x10
[ 3061.249352]  ? in6_dev_get+0xaf/0x1f0
[ 3061.249369]  ? __rcu_read_unlock+0x64/0x2e0
[ 3061.249392]  fib6_nh_init+0x290/0xc30
[ 3061.249422]  ? __pfx_fib6_nh_init+0x10/0x10
[ 3061.249447]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.249459]  ? _raw_spin_unlock_irqrestore+0x22/0x70
[ 3061.249624]  ? ip6_route_info_create+0x423/0x520
[ 3061.249641]  ? rcu_is_watching+0x3a/0x70
[ 3061.249683]  ip6_route_info_create_nh+0x190/0x390
[ 3061.249715]  ip6_route_add+0x71/0x1e0
[ 3061.249730]  ? __pfx_inet6_rtm_newroute+0x10/0x10
[ 3061.249743]  inet6_rtm_newroute+0x426/0xc50
[ 3061.249764]  ? avc_has_perm_noaudit+0x13d/0x360
[ 3061.249853]  ? __pfx_inet6_rtm_newroute+0x10/0x10
[ 3061.249905]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.249962]  ? rtnetlink_rcv_msg+0x52f/0x890
[ 3061.249996]  ? __pfx_inet6_rtm_newroute+0x10/0x10
[ 3061.250012]  rtnetlink_rcv_msg+0x551/0x890
[ 3061.250040]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 3061.250065]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.250092]  netlink_rcv_skb+0xbd/0x1f0
[ 3061.250108]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 3061.250124]  ? __pfx_netlink_rcv_skb+0x10/0x10
[ 3061.250179]  ? netlink_deliver_tap+0x10b/0x700
[ 3061.250210]  netlink_unicast+0x2e7/0x410
[ 3061.250232]  ? __pfx_netlink_unicast+0x10/0x10
[ 3061.250241]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.250280]  netlink_sendmsg+0x366/0x670
[ 3061.250306]  ? __pfx_netlink_sendmsg+0x10/0x10
[ 3061.250313]  ? find_held_lock+0x2d/0xa0
[ 3061.250344]  ? import_ubuf+0xbc/0xf0
[ 3061.250370]  ? __pfx_netlink_sendmsg+0x10/0x10
[ 3061.250381]  __sock_sendmsg+0x13e/0x150
[ 3061.250420]  ____sys_sendmsg+0x33d/0x450
[ 3061.250442]  ? __pfx_____sys_sendmsg+0x10/0x10
[ 3061.250453]  ? __pfx_copy_msghdr_from_user+0x10/0x10
[ 3061.250489]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
[ 3061.250514]  ___sys_sendmsg+0xe5/0x160
[ 3061.250530]  ? __pfx____sys_sendmsg+0x10/0x10
[ 3061.250568]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.250617]  ? find_held_lock+0x2d/0xa0
[ 3061.250678]  ? __virt_addr_valid+0x199/0x340
[ 3061.250704]  ? preempt_count_sub+0xf/0xc0
[ 3061.250736]  __sys_sendmsg+0xca/0x140
[ 3061.250750]  ? __pfx___sys_sendmsg+0x10/0x10
[ 3061.250786]  ? syscall_exit_to_user_mode+0xa2/0x1e0
[ 3061.250825]  do_syscall_64+0x62/0x140
[ 3061.250844]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 3061.250855] RIP: 0033:0x7f0b042ef914
[ 3061.250868] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
[ 3061.250876] RSP: 002b:00007ffc2d113ef8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 3061.250885] RAX: ffffffffffffffda RBX: 00000000680f93fa RCX: 00007f0b042ef914
[ 3061.250891] RDX: 0000000000000000 RSI: 00007ffc2d113f60 RDI: 0000000000000003
[ 3061.250897] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008
[ 3061.250902] R10: fffffffffffff26d R11: 0000000000000246 R12: 0000000000000001
[ 3061.250907] R13: 000055a961f8a520 R14: 000055a961f63eae R15: 00007ffc2d115270
[ 3061.250952]  </TASK>

To solve this issue, we replace the GFP_KERNEL flag with the GFP_ATOMIC
one in those SRv6 Endpoints that need to allocate memory during the
setup phase. This change makes sure that memory allocations are handled
in a way that works with RCU critical sections.

[1] - https://lore.kernel.org/all/20250418000443.43734-1-kuniyu@amazon.com/

Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250429132453.31605-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: phy: factor out provider part from mdio_bus.c
Heiner Kallweit [Tue, 29 Apr 2025 06:04:46 +0000 (08:04 +0200)]
net: phy: factor out provider part from mdio_bus.c

After 52358dd63e34 ("net: phy: remove function stubs") there's a
problem if CONFIG_MDIO_BUS is set, but CONFIG_PHYLIB is not.
mdiobus_scan() uses phylib functions like get_phy_device().
Bringing back the stub wouldn't make much sense, because it would
allow to compile mdiobus_scan(), but the function would be unusable.
The stub returned NULL, and we have the following in mdiobus_scan():

phydev = get_phy_device(bus, addr, c45);
        if (IS_ERR(phydev))
                return phydev;

So calling mdiobus_scan() w/o CONFIG_PHYLIB would cause a crash later in
mdiobus_scan(). In general the PHYLIB functionality isn't optional here.
Consequently, MDIO bus providers depend on PHYLIB.
Therefore factor it out and build it together with the libphy core
modules. In addition make all MDIO bus providers under /drivers/net/mdio
depend on PHYLIB. Same applies to enetc MDIO bus provider. Note that
PHYLIB selects MDIO_DEVRES, therefore we can omit this here.

Fixes: 52358dd63e34 ("net: phy: remove function stubs")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202504270639.mT0lh2o1-lkp@intel.com/
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/c74772a9-dab6-44bf-a657-389df89d85c2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agonet: ethernet: mtk_eth_soc: add support for MT7988 internal 2.5G PHY
Daniel Golle [Sun, 27 Apr 2025 01:01:29 +0000 (02:01 +0100)]
net: ethernet: mtk_eth_soc: add support for MT7988 internal 2.5G PHY

The MediaTek MT7988 SoC comes with an single built-in Ethernet PHY for
2500Base-T/1000Base-T/100Base-TX/10Base-T link partners in addition to
the built-in 1GE switch. The built-in PHY only supports full duplex.

Add muxes allowing to select GMAC2->2.5G PHY path and add basic support
for XGMAC as the built-in 2.5G PHY is internally connected via XGMII.
The XGMAC features will also be used by 5GBase-R, 10GBase-R and USXGMII
SerDes modes which are going to be added once support for standalone PCS
drivers is in place.

In order to make use of the built-in 2.5G PHY the appropriate PHY driver
as well as (proprietary) PHY firmware has to be present as well.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/9072cefbff6db969720672ec98ed5cef65e8218c.1745715380.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agor8152: use SHA-256 library API instead of crypto_shash API
Eric Biggers [Mon, 28 Apr 2025 19:16:06 +0000 (12:16 -0700)]
r8152: use SHA-256 library API instead of crypto_shash API

This user of SHA-256 does not support any other algorithm, so the
crypto_shash abstraction provides no value.  Just use the SHA-256
library API instead, which is much simpler and easier to use.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://patch.msgid.link/20250428191606.856198-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>