Lorenzo Bianconi [Wed, 9 Apr 2025 09:47:15 +0000 (11:47 +0200)]
net: airoha: Add L2 hw acceleration support
Similar to mtk driver, introduce the capability to offload L2 traffic
defining flower rules in the PSE/PPE engine available on EN7581 SoC.
Since the hw always reports L2/L3/L4 flower rules, link all L2 rules
sharing the same L2 info (with different L3/L4 info) in the L2 subflows
list of a given L2 PPE entry.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/20250409-airoha-flowtable-l2b-v2-2-4a1e3935ea92@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Wed, 9 Apr 2025 09:47:14 +0000 (11:47 +0200)]
net: airoha: Add l2_flows rhashtable
Introduce l2_flows rhashtable in airoha_ppe struct in order to
store L2 flows committed by upper layers of the kernel. This is a
preliminary patch in order to offload L2 traffic rules.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/20250409-airoha-flowtable-l2b-v2-1-4a1e3935ea92@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 12 Apr 2025 01:58:13 +0000 (18:58 -0700)]
Merge branch 'net-retire-dccp-socket'
Kuniyuki Iwashima says:
====================
net: Retire DCCP socket.
As announced by commit
b144fcaf46d4 ("dccp: Print deprecation
notice."), it's time to remove DCCP socket.
The patch 2 removes net/dccp, LSM code, doc, and etc, leaving
DCCP netfilter modules.
The patch 3 unexports shared functions for DCCP, and the patch 4
renames tcp_or_dccp_get_hashinfo() to tcp_get_hashinfo().
We can do more cleanup; for example, remove IPPROTO_TCP checks in
__inet6?_check_established(), remove __module_get() for twsk,
remove timewait_sock_ops.twsk_destructor(), etc, but it will be
more of TCP stuff, so I'll defer to a later series.
v2: https://lore.kernel.org/
20250409003014.19697-1-kuniyu@amazon.com
v1: https://lore.kernel.org/
20250407231823.95927-1-kuniyu@amazon.com
====================
Link: https://patch.msgid.link/20250410023921.11307-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 10 Apr 2025 02:36:47 +0000 (19:36 -0700)]
tcp: Rename tcp_or_dccp_get_hashinfo().
DCCP was removed, so tcp_or_dccp_get_hashinfo() should be renamed.
Let's rename it to tcp_get_hashinfo().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250410023921.11307-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 10 Apr 2025 02:36:46 +0000 (19:36 -0700)]
net: Unexport shared functions for DCCP.
DCCP was removed, so many inet functions no longer need to
be exported.
Let's unexport or use EXPORT_IPV6_MOD() for such functions.
sk_free_unlock_clone() is inlined in sk_clone_lock() as it's
the only caller.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250410023921.11307-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 10 Apr 2025 02:36:45 +0000 (19:36 -0700)]
net: Retire DCCP socket.
DCCP was orphaned in 2021 by commit
054c4610bd05 ("MAINTAINERS: dccp:
move Gerrit Renker to CREDITS"), which noted that the last maintainer
had been inactive for five years.
In recent years, it has become a playground for syzbot, and most changes
to DCCP have been odd bug fixes triggered by syzbot. Apart from that,
the only changes have been driven by treewide or networking API updates
or adjustments related to TCP.
Thus, in 2023, we announced we would remove DCCP in 2025 via commit
b144fcaf46d4 ("dccp: Print deprecation notice.").
Since then, only one individual has contacted the netdev mailing list. [0]
There is ongoing research for Multipath DCCP. The repository is hosted
on GitHub [1], and development is not taking place through the upstream
community. While the repository is published under the GPLv2 license,
the scheduling part remains proprietary, with a LICENSE file [2] stating:
"This is not Open Source software."
The researcher mentioned a plan to address the licensing issue, upstream
the patches, and step up as a maintainer, but there has been no further
communication since then.
Maintaining DCCP for a decade without any real users has become a burden.
Therefore, it's time to remove it.
Removing DCCP will also provide significant benefits to TCP. It allows
us to freely reorganize the layout of struct inet_connection_sock, which
is currently shared with DCCP, and optimize it to reduce the number of
cachelines accessed in the TCP fast path.
Note that we keep DCCP netfilter modules as requested. [3]
Link: https://lore.kernel.org/netdev/20230710182253.81446-1-kuniyu@amazon.com/T/#u
Link: https://github.com/telekom/mp-dccp
Link: https://github.com/telekom/mp-dccp/blob/mpdccp_v03_k5.10/net/dccp/non_gpl_scheduler/LICENSE
Link: https://lore.kernel.org/netdev/Z_VQ0KlCRkqYWXa-@calendula/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM and SELinux)
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Link: https://patch.msgid.link/20250410023921.11307-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Thu, 10 Apr 2025 02:36:44 +0000 (19:36 -0700)]
selftest: net: Remove DCCP bits.
We will remove DCCP.
Let's remove DCCP bits from selftest.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250410023921.11307-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lucien.Jheng [Wed, 9 Apr 2025 15:09:02 +0000 (23:09 +0800)]
net: phy: air_en8811h: Add clk provider for CKO pin
EN8811H outputs 25MHz or 50MHz clocks on CKO, selected by GPIO3.
CKO clock operates continuously from power-up through md32 loading.
Implement clk provider driver so we can disable the clock output in case
it isn't needed, which also helps to reduce EMF noise
Signed-off-by: Lucien.Jheng <lucienx123@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250409150902.3596-1-lucienx123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zijun Hu [Thu, 10 Apr 2025 01:01:27 +0000 (09:01 +0800)]
sock: Correct error checking condition for (assign|release)_proto_idx()
(assign|release)_proto_idx() wrongly check find_first_zero_bit() failure
by condition '(prot->inuse_idx == PROTO_INUSE_NR - 1)' obviously.
Fix by correcting the condition to '(prot->inuse_idx == PROTO_INUSE_NR)'
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250410-fix_net-v2-1-d69e7c5739a4@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Mon, 7 Apr 2025 19:15:35 +0000 (20:15 +0100)]
net: stmmac: stm32: simplify clock handling
Some stm32 implementations need the receive clock running in suspend,
as indicated by dwmac->ops->clk_rx_enable_in_suspend. The existing
code achieved this in a rather complex way, by passing a flag around.
However, the clk API prepare/enables are counted - which means that a
clock won't be stopped as long as there are more prepare and enables
than disables and unprepares, just like a reference count.
Therefore, we can simplify this logic by calling clk_prepare_enable()
an additional time in the probe function if this flag is set, and then
balancing that at remove time.
With this, we can avoid passing a "are we suspending" and "are we
resuming" flag to various functions in the driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Wed, 9 Apr 2025 19:14:47 +0000 (21:14 +0200)]
r8169: add helper rtl8125_phy_param
The integrated PHY's of RTL8125/8126 have an own mechanism to access
PHY parameters, similar to what r8168g_phy_param does on earlier PHY
versions. Add helper rtl8125_phy_param to simplify the code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/847b7356-12d6-441b-ade9-4b6e1539b84a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Wed, 9 Apr 2025 19:05:37 +0000 (21:05 +0200)]
r8169: add helper rtl_csi_mod for accessing extended config space
Add a helper for the Realtek-specific mechanism for accessing extended
config space if native access isn't possible.
This avoids code duplication.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/b368fd91-57d7-4cb5-9342-98b4d8fe9aea@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhengchao Shao [Wed, 9 Apr 2025 03:33:21 +0000 (11:33 +0800)]
ipv4: remove unnecessary judgment in ip_route_output_key_hash_rcu
In the ip_route_output_key_cash_rcu function, the input fl4 member saddr is
first checked to be non-zero before entering multicast, broadcast and
arbitrary IP address checks. However, the fact that the IP address is not
0 has already ruled out the possibility of any address, so remove
unnecessary judgment.
Signed-off-by: Zhengchao Shao <shaozhengchao@163.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250409033321.108244-1-shaozhengchao@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Apr 2025 03:14:43 +0000 (20:14 -0700)]
Merge branch 'tools-ynl-c-basic-netlink-raw-support'
Jakub Kicinski says:
====================
tools: ynl: c: basic netlink-raw support
Basic support for netlink-raw AKA classic netlink in user space C codegen.
This series is enough to read routes and addresses from the kernel
(see the samples in patches 12 and 13).
Specs need to be slightly adjusted and decorated with the c naming info.
In terms of codegen this series includes just the basic plumbing required
to skip genlmsghdr and handle request types which may technically also
be legal in genetlink-legacy but are very uncommon there.
Subsequent series will add support for:
- handling CRUD-style notifications
- code gen for array types classic netlink uses
- sub-message support
v1: https://lore.kernel.org/
20250409000400.492371-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250410014658.782120-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:58 +0000 (18:46 -0700)]
tools: ynl: generate code for rt-route and add a sample
YNL C can now generate code for simple classic netlink families.
Include rt-route in the Makefile for generation and add a sample.
$ ./tools/net/ynl/samples/rt-route
oif: wlp0s20f3 gateway: 192.168.1.1
oif: wlp0s20f3 dst: 192.168.1.0/24
oif: vpn0 dst: fe80::/64
oif: wlp0s20f3 dst: fe80::/64
oif: wlp0s20f3 gateway: fe80::200:5eff:fe00:201
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-14-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:57 +0000 (18:46 -0700)]
tools: ynl: generate code for rt-addr and add a sample
YNL C can now generate code for simple classic netlink families.
Include rt-addr in the Makefile for generation and add a sample.
$ ./tools/net/ynl/samples/rt-addr
lo: 127.0.0.1
wlp0s20f3: 192.168.1.101
lo: ::
wlp0s20f3: fe80::6385:be6:746e:8116
vpn0: fe80::3597:d353:b5a7:66dd
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:56 +0000 (18:46 -0700)]
tools: ynl-gen: use family c-name in notifications
Family names may include dashes. Fix notification handling
code gen to the c-compatible name.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:55 +0000 (18:46 -0700)]
tools: ynl-gen: consider dump ops without a do "type-consistent"
If the type for the response to do and dump are the same we don't
generate it twice. This is called "type_consistent" in the generator.
Consider operations which only have dump to also be consistent.
This removes unnecessary "_dump" from the names. There's a number
of GET ops in classic Netlink which only have dump handlers.
Make sure we output the "onesided" types, normally if the type
is consistent we only output it when we render the do structures.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250410014658.782120-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:54 +0000 (18:46 -0700)]
tools: ynl: don't use genlmsghdr in classic netlink
Make sure the codegen calls the right YNL lib helper to start
the request based on family type. Classic netlink request must
not include the genl header.
Conversely don't expect genl headers in the responses.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:53 +0000 (18:46 -0700)]
tools: ynl-gen: don't consider requests with fixed hdr empty
C codegen skips generating the structs if request/reply has no attrs.
In such cases the request op takes no argument and return int
(rather than response struct). In case of classic netlink a lot of
information gets passed using the fixed struct, however, so adjust
the logic to consider a request empty only if it has no attrs _and_
no fixed struct.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250410014658.782120-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:52 +0000 (18:46 -0700)]
tools: ynl: support creating non-genl sockets
Classic netlink has static family IDs specified in YAML,
there is no family name -> ID lookup. Support providing
the ID info to the library via the generated struct and
make library use it. Since NETLINK_ROUTE is ID 0 we need
an extra boolean to indicate classic_id is to be used.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:51 +0000 (18:46 -0700)]
netlink: specs: rt-route: add C naming info
Add properties needed for C codegen to match names with uAPI headers.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:50 +0000 (18:46 -0700)]
netlink: specs: rt-addr: add C naming info
Add properties needed for C codegen to match names with uAPI headers.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:49 +0000 (18:46 -0700)]
netlink: specs: rt-route: remove the fixed members from attrs
The purpose of the attribute list is to list the attributes
which will be included in a given message to shrink the objects
for families with huge attr spaces. Fixed headers are always
present in their entirety (between netlink header and the attrs)
so there's no point in listing their members. Current C codegen
doesn't expect them and tries to look them up in the attribute space.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:48 +0000 (18:46 -0700)]
netlink: specs: rt-addr: remove the fixed members from attrs
The purpose of the attribute list is to list the attributes
which will be included in a given message to shrink the objects
for families with huge attr spaces. Fixed headers are always
present in their entirety (between netlink header and the attrs)
so there's no point in listing their members. Current C codegen
doesn't expect them and tries to look them up in the attribute space.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:47 +0000 (18:46 -0700)]
netlink: specs: rt-route: specify fixed-header at operations level
The C codegen currently stores the fixed-header as part of family
info, so it only supports one fixed-header type per spec. Luckily
all rtm route message have the same fixed header so just move it up
to the higher level.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 01:46:46 +0000 (18:46 -0700)]
netlink: specs: rename rtnetlink specs in accordance with family name
The rtnetlink family names are set to rt-$name within the YAML
but the files are called rt_$name. C codegen assumes that the
generated file name will match the family. The use of dashes
is in line with our general expectation that name properties
in the spec use dashes not underscores (even tho, as Donald
points out most genl families use underscores in the name).
We have 3 un-ideal options to choose from:
- accept the slight inconsistency with old families using _, or
- accept the slight annoyance with all languages having to do s/-/_/
when looking up family ID, or
- accept the inconsistency with all name properties in new YAML spec
being separated with - and just the family name always using _.
Pick option 1 and rename the rtnl spec files.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Krzysztof Hałasa [Tue, 8 Apr 2025 11:59:41 +0000 (13:59 +0200)]
usbnet: asix AX88772: leave the carrier control to phylink
ASIX AX88772B based USB 10/100 Ethernet adapter doesn't come
up ("carrier off"), despite the built-in 100BASE-FX PHY positive link
indication. The internal PHY is configured (using EEPROM) in fixed
100 Mbps full duplex mode.
The primary problem appears to be using carrier_netif_{on,off}() while,
at the same time, delegating carrier management to phylink. Use only the
latter and remove "manual control" in the asix driver.
I don't have any other AX88772 board here, but the problem doesn't seem
specific to a particular board or settings - it's probably
timing-dependent.
Remove unused asix_adjust_link() as well.
Signed-off-by: Krzysztof Hałasa <khalasa@piap.pl>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/m3plhmdfte.fsf_-_@t19.piap.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Apr 2025 01:34:08 +0000 (18:34 -0700)]
Merge branch 'trace-add-tracepoint-for-tcp_sendmsg_locked'
Breno Leitao says:
====================
trace: add tracepoint for tcp_sendmsg_locked()
Meta has been using BPF programs to monitor tcp_sendmsg() for years,
indicating significant interest in observing this important
functionality. Adding a proper tracepoint provides a stable API for all
users who need visibility into TCP message transmission.
David Ahern is using a similar functionality with a custom patch[1]. So,
this means we have more than a single use case for this request, and it
might be a good idea to have such feature upstream.
Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/
v2: https://lore.kernel.org/
20250407-tcpsendmsg-v2-0-
9f0ea843ef99@debian.org
v1: https://lore.kernel.org/
20250224-tcpsendmsg-v1-1-
bac043c59cc8@debian.org
====================
Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-0-208b87064c28@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao [Tue, 8 Apr 2025 18:32:02 +0000 (11:32 -0700)]
trace: tcp: Add tracepoint for tcp_sendmsg_locked()
Add a tracepoint to monitor TCP send operations, enabling detailed
visibility into TCP message transmission.
Create a new tracepoint within the tcp_sendmsg_locked function,
capturing traditional fields along with size_goal, which indicates the
optimal data size for a single TCP segment. Additionally, a reference to
the struct sock sk is passed, allowing direct access for BPF programs.
The implementation is largely based on David's patch[1] and suggestions.
Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao [Tue, 8 Apr 2025 18:32:01 +0000 (11:32 -0700)]
net: pass const to msg_data_left()
The msg_data_left() function doesn't modify the struct msghdr parameter,
so mark it as const. This allows the function to be used with const
references, improving type safety and making the API more flexible.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-1-208b87064c28@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Apr 2025 01:31:56 +0000 (18:31 -0700)]
Merge branch 'net-stmmac-stmmac_pltfr_find_clk'
Russell King says:
====================
net: stmmac: stmmac_pltfr_find_clk()
The GBETH glue driver that is being proposed duplicates the clock
finding from the bulk clock data in the stmmac platform data structure.
iLet's provide a generic implementation that glue drivers can use, and
convert dwc-qos-eth to use it.
====================
Link: https://patch.msgid.link/Z_Yn3dJjzcOi32uU@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Wed, 9 Apr 2025 08:02:25 +0000 (09:02 +0100)]
net: stmmac: dwc-qos: use stmmac_pltfr_find_clk()
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1u2QO9-001Rp8-Ii@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Wed, 9 Apr 2025 08:02:20 +0000 (09:02 +0100)]
net: stmmac: provide stmmac_pltfr_find_clk()
Provide a generic way to find a clock in the bulk data.
Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u2QO4-001Rp2-Dy@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Apr 2025 01:29:27 +0000 (18:29 -0700)]
Merge branch 'tcp-add-a-new-tw_paws-drop-reason'
Jiayuan Chen says:
====================
tcp: add a new TW_PAWS drop reason
Devices in the networking path, such as firewalls, NATs, or routers, which
can perform SNAT or DNAT, use addresses from their own limited address
pools to masquerade the source address during forwarding, causing PAWS
verification to fail more easily under TW status.
Currently, packet loss statistics for PAWS can only be viewed through MIB,
which is a global metric and cannot be precisely obtained through tracing
to get the specific 4-tuple of the dropped packet. In the past, we had to
use kprobe ret to retrieve relevant skb information from
tcp_timewait_state_process().
We add a drop_reason pointer and a new counter.
I didn't provide a packetdrill script.
I struggled for a long time to get packetdrill to fix the client port, but
ultimately failed to do so...
Instead, I wrote my own program to trigger PAWS, which can be found at
https://github.com/mrpre/nettrigger/tree/main
'''
//assume nginx running on 172.31.75.114:9999, current host is 172.31.75.115
iptables -t filter -I OUTPUT -p tcp --sport 12345 --tcp-flags RST RST -j DROP
./nettrigger -i eth0 -s 172.31.75.115:12345 -d 172.31.75.114:9999 -action paws
'''
v2: https://lore.kernel.org/
5cdc1bdd9caee92a6ae932638a862fd5c67630e8@linux.dev
v3: https://lore.kernel.org/
20250407140001.13886-1-jiayuan.chen@linux.dev
====================
Link: https://patch.msgid.link/20250409112614.16153-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiayuan Chen [Wed, 9 Apr 2025 11:26:05 +0000 (19:26 +0800)]
tcp: add LINUX_MIB_PAWS_TW_REJECTED counter
When TCP is in TIME_WAIT state, PAWS verification uses
LINUX_PAWSESTABREJECTED, which is ambiguous and cannot be distinguished
from other PAWS verification processes.
We added a new counter, like the existing PAWS_OLD_ACK one.
Also we update the doc with previously missing PAWS_OLD_ACK.
usage:
'''
nstat -az | grep PAWSTimewait
TcpExtPAWSTimewait 1 0.0
'''
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250409112614.16153-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiayuan Chen [Wed, 9 Apr 2025 11:26:04 +0000 (19:26 +0800)]
tcp: add TCP_RFC7323_TW_PAWS drop reason
Devices in the networking path, such as firewalls, NATs, or routers, which
can perform SNAT or DNAT, use addresses from their own limited address
pools to masquerade the source address during forwarding, causing PAWS
verification to fail more easily.
Currently, packet loss statistics for PAWS can only be viewed through MIB,
which is a global metric and cannot be precisely obtained through tracing
to get the specific 4-tuple of the dropped packet. In the past, we had to
use kprobe ret to retrieve relevant skb information from
tcp_timewait_state_process().
We add a drop_reason pointer, similar to what previous commit does:
commit
e34100c2ecbb ("tcp: add a drop_reason pointer to tcp_check_req()")
This commit addresses the PAWSESTABREJECTED case and also sets the
corresponding drop reason.
We use 'pwru' to test.
Before this commit:
''''
./pwru 'port 9999'
2025/04/07 13:40:19 Listening for events..
TUPLE FUNC
172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_NOT_SPECIFIED)
'''
After this commit:
'''
./pwru 'port 9999'
2025/04/07 13:51:34 Listening for events..
TUPLE FUNC
172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_TCP_RFC7323_TW_PAWS)
'''
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250409112614.16153-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Luczaj [Wed, 9 Apr 2025 12:50:58 +0000 (14:50 +0200)]
af_unix: Remove unix_unhash()
Dummy unix_unhash() was introduced for sockmap in commit
94531cfcbe79
("af_unix: Add unix_stream_proto for sockmap"), but there's no need to
implement it anymore.
->unhash() is only called conditionally: in unix_shutdown() since commit
d359902d5c35 ("af_unix: Fix NULL pointer bug in unix_shutdown"), and in BPF
proto's sock_map_unhash() since commit
5b4a79ba65a1 ("bpf, sockmap: Don't
let sock_map_{close,destroy,unhash} call itself").
Remove it.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250409-cleanup-drop-unix-unhash-v1-1-1659e5b8ee84@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 23:49:17 +0000 (16:49 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc2).
Conflict:
Documentation/networking/netdevices.rst
net/core/lock_debug.c
04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE")
03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 10 Apr 2025 15:52:18 +0000 (08:52 -0700)]
Merge tag 'net-6.15-rc2' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- core: hold instance lock during NETDEV_CHANGE
- rtnetlink: fix bad unlock balance in do_setlink()
- ipv6:
- fix null-ptr-deref in addrconf_add_ifaddr()
- align behavior across nexthops during path selection
Previous releases - regressions:
- sctp: prevent transport UaF in sendmsg
- mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Previous releases - always broken:
- sched:
- make ->qlen_notify() idempotent
- ensure sufficient space when sending filter netlink notifications
- sch_sfq: really don't allow 1 packet limit
- netfilter: fix incorrect avx2 match of 5th field octet
- tls: explicitly disallow disconnect
- eth: octeontx2-pf: fix VF root node parent queue priority"
* tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
ethtool: cmis_cdb: Fix incorrect read / write length extension
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
net: ppp: Add bound checking for skb data on ppp_sync_txmung
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
ipv6: Align behavior across nexthops during path selection
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net: libwx: handle page_pool_dev_alloc_pages error
selftests: mptcp: validate MPJoin HMacFailure counters
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
rtnetlink: Fix bad unlock balance in do_setlink().
net: ethtool: Don't call .cleanup_data when prepare_data fails
tc: Ensure we have enough buffer space when sending filter netlink notifications
net: libwx: Fix the wrong Rx descriptor field
octeontx2-pf: qos: fix VF root node parent queue index
selftests: tls: check that disconnect does nothing
...
Linus Torvalds [Thu, 10 Apr 2025 14:04:23 +0000 (07:04 -0700)]
Merge tag 'for-linus-6.15a-rc2-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A simple fix adding the module description of the Xenbus frontend
module
- A fix correcting the xen-acpi-processor Kconfig dependency for PVH
Dom0 support
- A fix for the Xen balloon driver when running as Xen Dom0 in PVH mode
- A fix for PVH Dom0 in order to avoid problems with CPU idle and
frequency drivers conflicting with Xen
* tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: disable CPU idle and frequency drivers for PVH dom0
x86/xen: fix balloon target initialization for PVH dom0
xen: Change xen-acpi-processor dom0 dependency
xenbus: add module description
Linus Torvalds [Thu, 10 Apr 2025 14:02:22 +0000 (07:02 -0700)]
Merge tag 'block-6.15-
20250410' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Add a missing ublk selftest script, from test additions added last
week
- Two fixes for ublk error recovery and reissue
- Cleanup of ublk argument passing
* tag 'block-6.15-
20250410' of git://git.kernel.dk/linux:
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ublk: fix handling recovery & reissue in ublk_abort_queue()
selftests: ublk: fix test_stripe_04
Linus Torvalds [Thu, 10 Apr 2025 14:00:21 +0000 (07:00 -0700)]
Merge tag 'io_uring-6.15-
20250410' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Reject zero sized legacy provided buffers upfront. No ill side
effects from this one, only really done to shut up a silly syzbot
test case.
- Fix for a regression in tag posting for registered files or buffers,
where the tag would be posted even when the registration failed.
- two minor zcrx cleanups for code added this merge window.
* tag 'io_uring-6.15-
20250410' of git://git.kernel.dk/linux:
io_uring/kbuf: reject zero sized provided buffers
io_uring/zcrx: separate niov number from pages
io_uring/zcrx: put refill data into separate cache line
io_uring: don't post tag CQEs on file/buffer registration failure
Linus Torvalds [Thu, 10 Apr 2025 13:58:06 +0000 (06:58 -0700)]
Merge tag 'gpio-fixes-for-v6.15-rc2' of git://git./linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix resource handling in gpio-tegra186
- fix wakeup source leaks in gpio-mpc8xxx and gpio-zynq
- fix minor issues with some GPIO OF quirks
- deprecate GPIOD_FLAGS_BIT_NONEXCLUSIVE and devm_gpiod_unhinge()
symbols and add a TODO task to track replacing them with a better
solution
* tag 'gpio-fixes-for-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: of: Move Atmel HSMCI quirk up out of the regulator comment
gpiolib: of: Fix the choice for Ingenic NAND quirk
gpio: zynq: Fix wakeup source leaks on device unbind
gpio: mpc8xxx: Fix wakeup source leaks on device unbind
gpio: TODO: track the removal of regulator-related workarounds
MAINTAINERS: add more keywords for the GPIO subsystem entry
gpio: deprecate devm_gpiod_unhinge()
gpio: deprecate the GPIOD_FLAGS_BIT_NONEXCLUSIVE flag
gpio: tegra186: fix resource handling in ACPI probe path
Linus Torvalds [Thu, 10 Apr 2025 13:56:25 +0000 (06:56 -0700)]
Merge tag 'mtd/fixes-for-6.15-rc2' of git://git./linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal:
"Two important fixes: the build of the SPI NAND layer with old GCC
versions as well as the fix of the Qpic Makefile which was wrong in
the first place.
There are also two smaller fixes about a missing error and status
check"
* tag 'mtd/fixes-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: spinand: Fix build with gcc < 7.5
mtd: rawnand: Add status chack in r852_ready()
mtd: inftlcore: Add error check for inftl_read_oob()
mtd: nand: Drop explicit test for built-in CONFIG_SPI_QPIC_SNAND
Ido Schimmel [Wed, 9 Apr 2025 11:24:40 +0000 (14:24 +0300)]
ethtool: cmis_cdb: Fix incorrect read / write length extension
The 'read_write_len_ext' field in 'struct ethtool_cmis_cdb_cmd_args'
stores the maximum number of bytes that can be read from or written to
the Local Payload (LPL) page in a single multi-byte access.
Cited commit started overwriting this field with the maximum number of
bytes that can be read from or written to the Extended Payload (LPL)
pages in a single multi-byte access. Transceiver modules that support
auto paging can advertise a number larger than 255 which is problematic
as 'read_write_len_ext' is a 'u8', resulting in the number getting
truncated and firmware flashing failing [1].
Fix by ignoring the maximum EPL access size as the kernel does not
currently support auto paging (even if the transceiver module does) and
will not try to read / write more than 128 bytes at once.
[1]
Transceiver module firmware flashing started for device enp177s0np0
Transceiver module firmware flashing in progress for device enp177s0np0
Progress: 0%
Transceiver module firmware flashing encountered an error for device enp177s0np0
Status message: Write FW block EPL command failed, LPL length is longer
than CDB read write length extension allows.
Fixes:
9a3b0d078bd8 ("net: ethtool: Add support for writing firmware blocks using EPL payload")
Reported-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Closes: https://lore.kernel.org/netdev/
20250402183123.321036-3-michael.chan@broadcom.com/
Tested-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250409112440.365672-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 10 Apr 2025 11:13:35 +0000 (13:13 +0200)]
Merge tag 'nf-25-04-10' of git://git./linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains a Netfilter fix and improved test coverage:
1) Fix AVX2 matching in nft_pipapo, from Florian Westphal.
2) Extend existing test to improve coverage for the aforementioned bug,
also from Florian.
netfilter pull request 25-04-10
* tag 'nf-25-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
====================
Link: https://patch.msgid.link/20250410103647.1030244-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Florian Westphal [Mon, 7 Apr 2025 17:40:19 +0000 (19:40 +0200)]
selftests: netfilter: add test case for recent mismatch bug
Without 'nft_set_pipapo: fix incorrect avx2 match of 5th field octet"
this fails:
TEST: reported issues
Add two elements, flush, re-add 1s [ OK ]
net,mac with reload 0s [ OK ]
net,port,proto 3s [ OK ]
avx2 false match 0s [FAIL]
False match for fe80:dead:01fe:0a02:0b03:6007:8009:a001
Other tests do not detect the kernel bug as they only alter parts in
the /64 netmask.
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 7 Apr 2025 17:40:18 +0000 (19:40 +0200)]
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
Given a set element like:
icmpv6 . dead:beef:00ff::1
The value of 'ff' is irrelevant, any address will be matched
as long as the other octets are the same.
This is because of too-early register clobbering:
ymm7 is reloaded with new packet data (pkt[9]) but it still holds data
of an earlier load that wasn't processed yet.
The existing tests in nft_concat_range.sh selftests do exercise this code
path, but do not trigger incorrect matching due to the network prefix
limitation.
Fixes:
7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Reported-by: sontu mazumdar <sontu21@gmail.com>
Closes: https://lore.kernel.org/netfilter/CANgxkqwnMH7fXra+VUfODT-8+qFLgskq3set1cAzqqJaV4iEZg@mail.gmail.com/T/#t
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Arnaud Lecomte [Tue, 8 Apr 2025 15:55:08 +0000 (17:55 +0200)]
net: ppp: Add bound checking for skb data on ppp_sync_txmung
Ensure we have enough data in linear buffer from skb before accessing
initial bytes. This prevents potential out-of-bounds accesses
when processing short packets.
When ppp_sync_txmung receives an incoming package with an empty
payload:
(remote) gef➤ p *(struct pppoe_hdr *) (skb->head + skb->network_header)
$18 = {
type = 0x1,
ver = 0x1,
code = 0x0,
sid = 0x2,
length = 0x0,
tag = 0xffff8880371cdb96
}
from the skb struct (trimmed)
tail = 0x16,
end = 0x140,
head = 0xffff88803346f400 "4",
data = 0xffff88803346f416 ":\377",
truesize = 0x380,
len = 0x0,
data_len = 0x0,
mac_len = 0xe,
hdr_len = 0x0,
it is not safe to access data[2].
Reported-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=
29fc8991b0ecb186cf40
Tested-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com
Fixes:
1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
Link: https://patch.msgid.link/20250408-bound-checking-ppp_txmung-v2-1-94bb6e1b92d0@arnaud-lcm.com
[pabeni@redhat.com: fixed subj typo]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Mengyuan Lou [Tue, 8 Apr 2025 09:15:56 +0000 (17:15 +0800)]
net: txgbe: add sriov function support
Add sriov_configure for driver ops.
Add mailbox handler wx_msg_task for txgbe.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/ECDC57CF4F2316B9+20250408091556.9640-7-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mengyuan Lou [Tue, 8 Apr 2025 09:15:55 +0000 (17:15 +0800)]
net: ngbe: add sriov function support
Add sriov_configure for driver ops.
Add mailbox handler wx_msg_task for ngbe in
the interrupt handler.
Add the notification flow when the vfs exist.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/C9A0A43732966022+20250408091556.9640-6-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mengyuan Lou [Tue, 8 Apr 2025 09:15:54 +0000 (17:15 +0800)]
net: libwx: Add msg task func
Implement wx_msg_task which is used to process mailbox
messages sent by vf.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/8257B39B95CDB469+20250408091556.9640-5-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mengyuan Lou [Tue, 8 Apr 2025 09:15:53 +0000 (17:15 +0800)]
net: libwx: Redesign flow when sriov is enabled
Reallocate queue and int resources when sriov is enabled.
Redefine macro VMDQ to make it work in VT mode.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/64B616774ABE3C5A+20250408091556.9640-4-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mengyuan Lou [Tue, 8 Apr 2025 09:15:52 +0000 (17:15 +0800)]
net: libwx: Add sriov api for wangxun nics
Implement sriov_configure interface for wangxun nics in libwx.
Enable VT mode and initialize vf control structure, when sriov
is enabled. Do not be allowed to disable sriov when vfs are
assigned.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/81EA45C21B0A98B0+20250408091556.9640-3-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mengyuan Lou [Tue, 8 Apr 2025 09:15:51 +0000 (17:15 +0800)]
net: libwx: Add mailbox api for wangxun pf drivers
Implements the mailbox interfaces for wangxun pf drivers
ngbe and txgbe.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/70017BD4D67614A4+20250408091556.9640-2-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Walleij [Tue, 8 Apr 2025 09:26:58 +0000 (11:26 +0200)]
net: ethernet: cortina: Use TOE/TSO on all TCP
It is desireable to push the hardware accelerator to also
process non-segmented TCP frames: we pass the skb->len
to the "TOE/TSO" offloader and it will handle them.
Without this quirk the driver becomes unstable and lock
up and and crash.
I do not know exactly why, but it is probably due to the
TOE (TCP offload engine) feature that is coupled with the
segmentation feature - it is not possible to turn one
part off and not the other, either both TOE and TSO are
active, or neither of them.
Not having the TOE part active seems detrimental, as if
that hardware feature is not really supposed to be turned
off.
The datasheet says:
"Based on packet parsing and TCP connection/NAT table
lookup results, the NetEngine puts the packets
belonging to the same TCP connection to the same queue
for the software to process. The NetEngine puts
incoming packets to the buffer or series of buffers
for a jumbo packet. With this hardware acceleration,
IP/TCP header parsing, checksum validation and
connection lookup are offloaded from the software
processing."
After numerous tests with the hardware locking up after
something between minutes and hours depending on load
using iperf3 I have concluded this is necessary to stabilize
the hardware.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://patch.msgid.link/20250408-gemini-ethernet-tso-always-v1-1-e669f932359c@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 02:13:46 +0000 (19:13 -0700)]
Merge branch 'bridge-prevent-unicast-arp-ns-packets-from-being-suppressed-by-bridge'
Amit Cohen says:
====================
bridge: Prevent unicast ARP/NS packets from being suppressed by bridge
Currently, unicast ARP requests/NS packets are replied by bridge when
suppression is enabled, then they are also forwarded, which results two
replicas of ARP reply/NA - one from the bridge and second from the target.
The purpose of ARP/ND suppression is to reduce flooding in the broadcast
domain, which is not relevant for unicast packets. In addition, the use
case of unicast ARP/NS is to poll a specific host, so it does not make
sense to have the switch answer on behalf of the host.
Forward ARP requests/NS packets and prevent the bridge from replying to
them.
Patch set overview:
Patch #1 prevents unicast ARP/NS packets from being suppressed by bridge
Patch #2 adds test cases for unicast ARP/NS with suppression enabled
====================
Link: https://patch.msgid.link/cover.1744123493.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Tue, 8 Apr 2025 15:40:24 +0000 (17:40 +0200)]
selftests: test_bridge_neigh_suppress: Test unicast ARP/NS with suppression
Add test cases to check that unicast ARP/NS packets are replied once, even
if ARP/ND suppression is enabled.
Without the previous patch:
$ ./test_bridge_neigh_suppress.sh
...
Unicast ARP, per-port ARP suppression - VLAN 10
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [FAIL]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
Unicast ARP, per-port ARP suppression - VLAN 20
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [FAIL]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
...
Unicast NS, per-port NS suppression - VLAN 10
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [FAIL]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
Unicast NS, per-port NS suppression - VLAN 20
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [FAIL]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
...
Tests passed: 156
Tests failed: 4
With the previous patch:
$ ./test_bridge_neigh_suppress.sh
...
Unicast ARP, per-port ARP suppression - VLAN 10
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [ OK ]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
Unicast ARP, per-port ARP suppression - VLAN 20
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [ OK ]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
...
Unicast NS, per-port NS suppression - VLAN 10
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [ OK ]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
Unicast NS, per-port NS suppression - VLAN 20
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [ OK ]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
...
Tests passed: 160
Tests failed: 0
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/dc240b9649b31278295189f412223f320432c5f2.1744123493.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Tue, 8 Apr 2025 15:40:23 +0000 (17:40 +0200)]
net: bridge: Prevent unicast ARP/NS packets from being suppressed by bridge
When Proxy ARP or ARP/ND suppression are enabled, ARP/NS packets can be
handled by bridge in br_do_proxy_suppress_arp()/br_do_suppress_nd().
For broadcast packets, they are replied by bridge, but later they are not
flooded. Currently, unicast packets are replied by bridge when suppression
is enabled, and they are also forwarded, which results two replicas of
ARP reply/NA - one from the bridge and second from the target.
RFC 1122 describes use case for unicat ARP packets - "unicast poll" -
actively poll the remote host by periodically sending a point-to-point ARP
request to it, and delete the entry if no ARP reply is received from N
successive polls.
The purpose of ARP/ND suppression is to reduce flooding in the broadcast
domain. If a host is sending a unicast ARP/NS, then it means it already
knows the address and the switches probably know it as well and there
will not be any flooding.
In addition, the use case of unicast ARP/NS is to poll a specific host,
so it does not make sense to have the switch answer on behalf of the host.
According to RFC 9161:
"A PE SHOULD reply to broadcast/multicast address resolution messages,
i.e., ARP Requests, ARP probes, NS messages, as well as DAD NS messages.
An ARP probe is an ARP Request constructed with an all-zero sender IP
address that may be used by hosts for IPv4 Address Conflict Detection as
specified in [RFC5227]. A PE SHOULD NOT reply to unicast address resolution
requests (for instance, NUD NS messages)."
Forward such requests and prevent the bridge from replying to them.
Reported-by: Denis Yulevych <denisyu@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/6bf745a149ddfe5e6be8da684a63aa574a326f8d.1744123493.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima [Mon, 7 Apr 2025 16:33:11 +0000 (09:33 -0700)]
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.
# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
# lsmod | grep cifs
cifs
1159168 0
This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.
[0]:
CIFS_SERVER="10.0.0.137"
CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST"
DEV="enp0s3"
CRED="/root/WindowsCredential.txt"
MNT=$(mktemp -d /tmp/XXXXXX)
mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1
iptables -A INPUT -s ${CIFS_SERVER} -j DROP
for i in $(seq 10);
do
umount ${MNT}
rmmod cifs
sleep 1
done
rm -r ${MNT}
iptables -D INPUT -s ${CIFS_SERVER} -j DROP
[1]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
...
Call Trace:
<IRQ>
__lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178)
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
...
BUG: kernel NULL pointer dereference, address:
00000000000000c4
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178)
Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44
RSP: 0018:
ffa0000000468a10 EFLAGS:
00010046
RAX:
0000000000000000 RBX:
ff1100010091cc38 RCX:
0000000000000027
RDX:
ff1100081f09ca48 RSI:
0000000000000001 RDI:
ff1100010091cc88
RBP:
ff1100010091c200 R08:
ff1100083fe6e228 R09:
00000000ffffbfff
R10:
ff1100081eca0000 R11:
ff1100083fe10dc0 R12:
ff1100010091cc88
R13:
0000000000000001 R14:
0000000000000000 R15:
00000000000424b1
FS:
0000000000000000(0000) GS:
ff1100081f080000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000000000c4 CR3:
0000000002c4a003 CR4:
0000000000771ef0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe07f0 DR7:
0000000000000400
PKRU:
55555554
Call Trace:
<IRQ>
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234)
ip_sublist_rcv_finish (net/ipv4/ip_input.c:576)
ip_list_rcv_finish (net/ipv4/ip_input.c:628)
ip_list_rcv (net/ipv4/ip_input.c:670)
__netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986)
netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129)
napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496)
e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815)
__napi_poll.constprop.0 (net/core/dev.c:7191)
net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382)
handle_softirqs (kernel/softirq.c:561)
__irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662)
irq_exit_rcu (kernel/softirq.c:680)
common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14))
</IRQ>
<TASK>
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693)
RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744)
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:
ffa00000000ffee8 EFLAGS:
00000202
RAX:
000000000000640b RBX:
ff1100010091c200 RCX:
0000000000061aa4
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
ffffffff812f30c5
RBP:
000000000000000a R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000001 R11:
0000000000000002 R12:
0000000000000000
R13:
0000000000000000 R14:
0000000000000000 R15:
0000000000000000
? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118)
do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:315)
common_startup_64 (arch/x86/kernel/head_64.S:421)
</TASK>
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CR2:
00000000000000c4
Fixes:
ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 8 Apr 2025 20:27:42 +0000 (20:27 +0000)]
net: remove cpu stall in txq_trans_update()
txq_trans_update() currently uses txq->xmit_lock_owner
to conditionally update txq->trans_start.
For regular devices, txq->xmit_lock_owner is updated
from HARD_TX_LOCK() and HARD_TX_UNLOCK(), and this apparently
causes cpu stalls.
Using dev->lltx, which sits in a read-mostly cache-line,
and already used in HARD_TX_LOCK() and HARD_TX_UNLOCK()
helps cpu prediction.
On an AMD EPYC 7B12 dual socket server, tcp_rr with 128 threads
and 30,000 flows gets a 5 % increase in throughput.
As explained in commit
95ecba62e2fd ("net: fix races in
netdev_tx_sent_queue()/dev_watchdog()") I am planning
to no longer update txq->trans_start in the fast path
in a followup patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250408202742.2145516-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 8 Apr 2025 08:43:16 +0000 (11:43 +0300)]
ipv6: Align behavior across nexthops during path selection
A nexthop is only chosen when the calculated multipath hash falls in the
nexthop's hash region (i.e., the hash is smaller than the nexthop's hash
threshold) and when the nexthop is assigned a non-negative score by
rt6_score_route().
Commit
4d0ab3a6885e ("ipv6: Start path selection from the first
nexthop") introduced an unintentional difference between the first
nexthop and the rest when the score is negative.
When the first nexthop matches, but has a negative score, the code will
currently evaluate subsequent nexthops until one is found with a
non-negative score. On the other hand, when a different nexthop matches,
but has a negative score, the code will fallback to the nexthop with
which the selection started ('match').
Align the behavior across all nexthops and fallback to 'match' when the
first nexthop matches, but has a negative score.
Fixes:
3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Fixes:
4d0ab3a6885e ("ipv6: Start path selection from the first nexthop")
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wentao Liang [Tue, 8 Apr 2025 03:26:02 +0000 (11:26 +0800)]
octeontx2-pf: Add error log forcn10k_map_unmap_rq_policer()
The cn10k_free_matchall_ipolicer() calls the cn10k_map_unmap_rq_policer()
for each queue in a for loop without checking for any errors.
Check the return value of the cn10k_map_unmap_rq_policer() function during
each loop, and report a warning if the function fails.
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250408032602.2909-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 7 Apr 2025 09:40:42 +0000 (12:40 +0300)]
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
DSA has 2 kinds of drivers:
1. Those who call dsa_switch_suspend() and dsa_switch_resume() from
their device PM ops: qca8k-8xxx, bcm_sf2, microchip ksz
2. Those who don't: all others. The above methods should be optional.
For type 1, dsa_switch_suspend() calls dsa_user_suspend() -> phylink_stop(),
and dsa_switch_resume() calls dsa_user_resume() -> phylink_start().
These seem good candidates for setting mac_managed_pm = true because
that is essentially its definition [1], but that does not seem to be the
biggest problem for now, and is not what this change focuses on.
Talking strictly about the 2nd category of DSA drivers here (which
do not have MAC managed PM, meaning that for their attached PHYs,
mdio_bus_phy_suspend() and mdio_bus_phy_resume() should run in full),
I have noticed that the following warning from mdio_bus_phy_resume() is
triggered:
WARN_ON(phydev->state != PHY_HALTED && phydev->state != PHY_READY &&
phydev->state != PHY_UP);
because the PHY state machine is running.
It's running as a result of a previous dsa_user_open() -> ... ->
phylink_start() -> phy_start() having been initiated by the user.
The previous mdio_bus_phy_suspend() was supposed to have called
phy_stop_machine(), but it didn't. So this is why the PHY is in state
PHY_NOLINK by the time mdio_bus_phy_resume() runs.
mdio_bus_phy_suspend() did not call phy_stop_machine() because for
phylink, the phydev->adjust_link function pointer is NULL. This seems a
technicality introduced by commit
fddd91016d16 ("phylib: fix PAL state
machine restart on resume"). That commit was written before phylink
existed, and was intended to avoid crashing with consumer drivers which
don't use the PHY state machine - phylink always does, when using a PHY.
But phylink itself has historically not been developed with
suspend/resume in mind, and apparently not tested too much in that
scenario, allowing this bug to exist unnoticed for so long. Plus, prior
to the WARN_ON(), it would have likely been invisible.
This issue is not in fact restricted to type 2 DSA drivers (according to
the above ad-hoc classification), but can be extrapolated to any MAC
driver with phylink and MDIO-bus-managed PHY PM ops. DSA is just where
the issue was reported. Assuming mac_managed_pm is set correctly, a
quick search indicates the following other drivers might be affected:
$ grep -Zlr PHYLINK_NETDEV drivers/ | xargs -0 grep -L mac_managed_pm
drivers/net/ethernet/atheros/ag71xx.c
drivers/net/ethernet/microchip/sparx5/sparx5_main.c
drivers/net/ethernet/microchip/lan966x/lan966x_main.c
drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
drivers/net/ethernet/freescale/ucc_geth.c
drivers/net/ethernet/freescale/enetc/enetc_pf_common.c
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
drivers/net/ethernet/marvell/mvneta.c
drivers/net/ethernet/marvell/prestera/prestera_main.c
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/altera/altera_tse_main.c
drivers/net/ethernet/wangxun/txgbe/txgbe_phy.c
drivers/net/ethernet/meta/fbnic/fbnic_phylink.c
drivers/net/ethernet/tehuti/tn40_phy.c
drivers/net/ethernet/mscc/ocelot_net.c
Make the existing conditions dependent on the PHY device having a
phydev->phy_link_change() implementation equal to the default
phy_link_change() provided by phylib. Otherwise, we implicitly know that
the phydev has the phylink-provided phylink_phy_change() callback, and
when phylink is used, the PHY state machine always needs to be stopped/
started on the suspend/resume path. The code is structured as such that
if phydev->phy_link_change() is absent, it is a matter of time until the
kernel will crash - no need to further complicate the test.
Thus, for the situation where the PM is not managed by the MAC, we will
make the MDIO bus PM ops treat identically the phylink-controlled PHYs
with the phylib-controlled PHYs where an adjust_link() callback is
supplied. In both cases, the MDIO bus PM ops should stop and restart the
PHY state machine.
[1] https://lore.kernel.org/netdev/Z-1tiW9zjcoFkhwc@shell.armlinux.org.uk/
Fixes:
744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
Reported-by: Wei Fang <wei.fang@nxp.com>
Tested-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250407094042.2155633-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 7 Apr 2025 09:38:59 +0000 (12:38 +0300)]
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
In an upcoming change, mdio_bus_phy_may_suspend() will need to
distinguish a phylib-based PHY client from a phylink PHY client.
For that, it will need to compare the phydev->phy_link_change() function
pointer with the eponymous phy_link_change() provided by phylib.
To avoid forward function declarations, the default PHY link state
change method should be moved upwards. There is no functional change
associated with this patch, it is only to reduce the noise from a real
bug fix.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250407093900.2155112-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stanislav Fomichev [Wed, 2 Apr 2025 17:23:05 +0000 (10:23 -0700)]
configs/debug: run and debug PREEMPT
Recent change [0] resulted in a "BUG: using __this_cpu_read() in
preemptible" splat [1]. PREEMPT kernels have additional requirements
on what can and can not run with/without preemption enabled.
Expose those constrains in the debug kernels.
0: https://lore.kernel.org/netdev/
20250314120048.12569-2-justin.iurman@uliege.be/
1: https://lore.kernel.org/netdev/
20250402094458.
006ba2a7@kernel.org/T/#mbf72641e9d7d274daee9003ef5edf6833201f1bc
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250402172305.1775226-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Julian Vetter [Tue, 8 Apr 2025 09:19:46 +0000 (11:19 +0200)]
net: ipvlan: remove __get_unaligned_cpu32 from ipvlan driver
The __get_unaligned_cpu32 function is deprecated. So, replace it with
the more generic get_unaligned and just cast the input parameter.
Signed-off-by: Julian Vetter <julian@outer-limits.org>
Link: https://patch.msgid.link/20250408091946.2266271-1-julian@outer-limits.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Julian Vetter [Tue, 8 Apr 2025 09:15:48 +0000 (11:15 +0200)]
net: remove __get_unaligned_cpu32 from macvlan driver
The __get_unaligned_cpu32 function is deprecated. So, replace it with
the more generic get_unaligned and just cast the input parameter.
Signed-off-by: Julian Vetter <julian@outer-limits.org>
Link: https://patch.msgid.link/20250408091548.2263911-1-julian@outer-limits.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 10 Apr 2025 00:01:54 +0000 (17:01 -0700)]
Merge branch 'net-depend-on-instance-lock-for-queue-related-netlink-ops'
Jakub Kicinski says:
====================
net: depend on instance lock for queue related netlink ops
netdev-genl used to be protected by rtnl_lock. In previous release
we already switched the queue management ops (for Rx zero-copy) to
the instance lock. This series converts other ops to depend on the
instance lock when possible.
Unfortunately queue related state is hard to lock (unlike NAPI)
as the process of switching the number of queues usually involves
a large reconfiguration of the driver. The reconfig process has
historically been under rtnl_lock, but for drivers which opt into
ops locking it is also under the instance lock. Leverage that
and conditionally take rtnl_lock or instance lock depending
on the device capabilities.
v1: https://lore.kernel.org/
20250407190117.16528-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250408195956.412733-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:55 +0000 (12:59 -0700)]
netdev: depend on netdev->lock for qstats in ops locked drivers
We mostly needed rtnl_lock in qstat to make sure the queue count
is stable while we work. For "ops locked" drivers the instance
lock protects the queue count, so we don't have to take rtnl_lock.
For currently ops-locked drivers: netdevsim and bnxt need
the protection from netdev going down while we dump, which
instance lock provides. gve doesn't care.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:54 +0000 (12:59 -0700)]
docs: netdev: break down the instance locking info per ops struct
Explicitly list all the ops structs and what locking they provide.
Use "ops locked" as a term for drivers which have ops called under
the instance lock.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250408195956.412733-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:53 +0000 (12:59 -0700)]
netdev: depend on netdev->lock for xdp features
Writes to XDP features are now protected by netdev->lock.
Other things we report are based on ops which don't change
once device has been registered. It is safe to stop taking
rtnl_lock, and depend on netdev->lock instead.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:52 +0000 (12:59 -0700)]
xdp: double protect netdev->xdp_flags with netdev->lock
Protect xdp_features with netdev->lock. This way pure readers
no longer have to take rtnl_lock to access the field.
This includes calling NETDEV_XDP_FEAT_CHANGE under the lock.
Looks like that's fine for bonding, the only "real" listener,
it's the same as ethtool feature change.
In terms of normal drivers - only GVE need special consideration
(other drivers don't use instance lock or don't support XDP).
It calls xdp_set_features_flag() helper from gve_init_priv() which
in turn is called from gve_reset_recovery() (locked), or prior
to netdev registration. So switch to _locked.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Harshitha Ramamurthy <hramamurthy@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250408195956.412733-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:51 +0000 (12:59 -0700)]
netdev: don't hold rtnl_lock over nl queue info get when possible
Netdev queue dump accesses: NAPI, memory providers, XSk pointers.
All three are "ops protected" now, switch to the op compat locking.
rtnl lock does not have to be taken for "ops locked" devices.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:50 +0000 (12:59 -0700)]
netdev: add "ops compat locking" helpers
Add helpers to "lock a netdev in a backward-compatible way",
which for ops-locked netdevs will mean take the instance lock.
For drivers which haven't opted into the ops locking we'll take
rtnl_lock.
The scoped foreach is dropping and re-taking the lock for each
device, even if prev and next are both under rtnl_lock.
I hope that's fine since we expect that netdev nl to be mostly
supported by modern drivers, and modern drivers should also
opt into the instance locking.
Note that these helpers are mostly needed for queue related state,
because drivers modify queue config in their ops in a non-atomic
way. Or differently put, queue changes don't have a clear-cut API
like NAPI configuration. Any state that can should just use the
instance lock directly, not the "compat" hacks.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:49 +0000 (12:59 -0700)]
net: designate XSK pool pointers in queues as "ops protected"
Read accesses go via xsk_get_pool_from_qid(), the call coming
from the core and gve look safe (other "ops locked" drivers
don't support XSK).
Write accesses go via xsk_reg_pool_at_qid() and xsk_clear_pool_at_qid().
Former is already under the ops lock, latter is not (both coming from
the workqueue via xp_clear_dev() and NETDEV_UNREGISTER via xsk_notifier()).
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 19:59:48 +0000 (12:59 -0700)]
net: avoid potential race between netdev_get_by_index_lock() and netns switch
netdev_get_by_index_lock() performs following steps:
rcu_lock();
dev = lookup(netns, ifindex);
dev_get(dev);
rcu_unlock();
[... lock & validate the dev ...]
return dev
Validation right now only checks if the device is registered but since
the lookup is netns-aware we must also protect against the device
switching netns right after we dropped the RCU lock. Otherwise
the caller in netns1 may get a pointer to a device which has just
switched to netns2.
We can't hold the lock for the entire netns change process (because of
the NETDEV_UNREGISTER notifier), and there's no existing marking to
indicate that the netns is unlisted because of netns move, so add one.
AFAIU none of the existing netdev_get_by_index_lock() callers can
suffer from this problem (NAPI code double checks the netns membership
and other callers are either under rtnl_lock or not ns-sensitive),
so this patch does not have to be treated as a fix.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 9 Apr 2025 23:02:44 +0000 (16:02 -0700)]
Merge tag 'linux_kselftest-fixes-6.15-rc2' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- Fixes tpm2, futex, and mincore tests
- Create a dedicated .gitignore for tpm2 tests
* tag 'linux_kselftest-fixes-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/mincore: Allow read-ahead pages to reach the end of the file
selftests/futex: futex_waitv wouldblock test should fail
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
selftests: tpm2: create a dedicated .gitignore
Caleb Sander Mateos [Wed, 9 Apr 2025 01:29:26 +0000 (19:29 -0600)]
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
The ublk_ctrl_*() handlers all take struct io_uring_cmd *cmd but only
use it to get struct ublksrv_ctrl_cmd *header from the io_uring SQE.
Since the caller ublk_ctrl_uring_cmd() has already computed header, pass
it instead of cmd.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250409012928.3527198-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 9 Apr 2025 01:14:42 +0000 (09:14 +0800)]
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ubq->canceling is set with request queue quiesced when io_uring context is
exiting. USER_RECOVERY or !RECOVERY_FAIL_IO requires request to be re-queued
and re-dispatch after device is recovered.
However commit
d796cea7b9f3 ("ublk: implement ->queue_rqs()") still may fail
any request in case of ubq->canceling, this way breaks USER_RECOVERY or
!RECOVERY_FAIL_IO.
Fix it by calling __ublk_abort_rq() in case of ubq->canceling.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Reported-by: Uday Shankar <ushankar@purestorage.com>
Closes: https://lore.kernel.org/linux-block/Z%2FQkkTRHfRxtN%2FmB@dev-ushankar.dev.purestorage.com/
Fixes:
d796cea7b9f3 ("ublk: implement ->queue_rqs()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 9 Apr 2025 01:14:41 +0000 (09:14 +0800)]
ublk: fix handling recovery & reissue in ublk_abort_queue()
Commit
8284066946e6 ("ublk: grab request reference when the request is handled
by userspace") doesn't grab request reference in case of recovery reissue.
Then the request can be requeued & re-dispatch & failed when canceling
uring command.
If it is one zc request, the request can be freed before io_uring
returns the zc buffer back, then cause kernel panic:
[ 126.773061] BUG: kernel NULL pointer dereference, address:
00000000000000c8
[ 126.773657] #PF: supervisor read access in kernel mode
[ 126.774052] #PF: error_code(0x0000) - not-present page
[ 126.774455] PGD 0 P4D 0
[ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI
[ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full)
[ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[ 126.776275] Workqueue: iou_exit io_ring_exit_work
[ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv]
Fixes it by always grabbing request reference for aborting the request.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/
Fixes:
8284066946e6 ("ublk: grab request reference when the request is handled by userspace")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
David S. Miller [Wed, 9 Apr 2025 11:55:48 +0000 (12:55 +0100)]
Merge branch 'sch_sfq-derived-limit'
Octavian Purdila says:
====================
net_sched: sch_sfq: reject a derived limit of 1
Because sfq parameters can influence each other there can be
situations where although the user sets a limit of 2 it can be lowered
to 1:
$ tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 depth 1
$ tc qdisc show dev dummy0
qdisc sfq 1: dev dummy0 root refcnt 2 limit 1p quantum 1514b depth 1 divisor 1024
$ tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 10 depth 1 divisor 1
$ tc qdisc show dev dummy0
qdisc sfq 2: root refcnt 2 limit 1p quantum 1514b depth 1 divisor 1
As a limit of 1 is invalid, this patch series moves the limit
validation to after all configuration changes have been done. To do
so, the configuration is done in a temporary work area then applied to
the internal state.
The patch series also adds new test cases.
v3:
- remove a couple of unnecessary comments
- rearrange local variables to use reverse Christmas tree style
declaration order
v2: https://lore.kernel.org/all/
20250402162750.
1671155-1-tavip@google.com/
- remove tmp struct and directly use local variables
v1: https://lore.kernel.org/all/
20250328201634.
3876474-1-tavip@google.com/
===================
Signed-off-by: David S. Miller <davem@davemloft.net>
Octavian Purdila [Mon, 7 Apr 2025 20:24:09 +0000 (13:24 -0700)]
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
Because the limit is updated indirectly when other parameters are
updated, there are cases where even though the user requests a limit
of 2 it can actually be set to 1.
Add the following test cases to check that the kernel rejects them:
- limit 2 depth 1 flows 1
- limit 2 depth 1 divisor 1
Signed-off-by: Octavian Purdila <tavip@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Octavian Purdila [Mon, 7 Apr 2025 20:24:08 +0000 (13:24 -0700)]
net_sched: sch_sfq: move the limit validation
It is not sufficient to directly validate the limit on the data that
the user passes as it can be updated based on how the other parameters
are changed.
Move the check at the end of the configuration update process to also
catch scenarios where the limit is indirectly updated, for example
with the following configurations:
tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 depth 1
tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 divisor 1
This fixes the following syzkaller reported crash:
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:6
index 65535 is out of range for type 'struct sfq_head[128]'
CPU: 1 UID: 0 PID: 3037 Comm: syz.2.16 Not tainted 6.14.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x201/0x300 lib/dump_stack.c:120
ubsan_epilogue lib/ubsan.c:231 [inline]
__ubsan_handle_out_of_bounds+0xf5/0x120 lib/ubsan.c:429
sfq_link net/sched/sch_sfq.c:203 [inline]
sfq_dec+0x53c/0x610 net/sched/sch_sfq.c:231
sfq_dequeue+0x34e/0x8c0 net/sched/sch_sfq.c:493
sfq_reset+0x17/0x60 net/sched/sch_sfq.c:518
qdisc_reset+0x12e/0x600 net/sched/sch_generic.c:1035
tbf_reset+0x41/0x110 net/sched/sch_tbf.c:339
qdisc_reset+0x12e/0x600 net/sched/sch_generic.c:1035
dev_reset_queue+0x100/0x1b0 net/sched/sch_generic.c:1311
netdev_for_each_tx_queue include/linux/netdevice.h:2590 [inline]
dev_deactivate_many+0x7e5/0xe70 net/sched/sch_generic.c:1375
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes:
10685681bafc ("net_sched: sch_sfq: don't allow 1 packet limit")
Signed-off-by: Octavian Purdila <tavip@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Octavian Purdila [Mon, 7 Apr 2025 20:24:07 +0000 (13:24 -0700)]
net_sched: sch_sfq: use a temporary work area for validating configuration
Many configuration parameters have influence on others (e.g. divisor
-> flows -> limit, depth -> limit) and so it is difficult to correctly
do all of the validation before applying the configuration. And if a
validation error is detected late it is difficult to roll back a
partially applied configuration.
To avoid these issues use a temporary work area to update and validate
the configuration and only then apply the configuration to the
internal state.
Signed-off-by: Octavian Purdila <tavip@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Luczaj [Mon, 7 Apr 2025 19:01:02 +0000 (21:01 +0200)]
net: Drop unused @sk of __skb_try_recv_from_queue()
__skb_try_recv_from_queue() deals with a queue, @sk is not used
since commit
e427cad6eee4 ("net: datagram: drop 'destructor'
argument from several helpers"). Remove sk from function parameters,
adapt callers.
No functional change intended.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250407-cleanup-drop-param-sk-v1-1-cd076979afac@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 9 Apr 2025 01:19:45 +0000 (18:19 -0700)]
Merge branch 'udp_tunnel-gro-optimizations'
Paolo Abeni says:
====================
udp_tunnel: GRO optimizations
The UDP tunnel GRO stage is source of measurable overhead for workload
based on UDP-encapsulated traffic: each incoming packets requires a full
UDP socket lookup and an indirect call.
In the most common setups a single UDP tunnel device is used. In such
case we can optimize both the lookup and the indirect call.
Patch 1 tracks per netns the active UDP tunnels and replaces the socket
lookup with a single destination port comparison when possible.
Patch 2 tracks the different types of UDP tunnels and replaces the
indirect call with a static one when there is a single UDP tunnel type
active.
I measure ~10% performance improvement in TCP over UDP tunnel stream
tests on top of this series.
v4: https://lore.kernel.org/cover.
1741718157.git.pabeni@redhat.com
v3: https://lore.kernel.org/cover.
1741632298.git.pabeni@redhat.com
v2: https://lore.kernel.org/cover.
1741338765.git.pabeni@redhat.com
v1: https://lore.kernel.org/cover.
1741275846.git.pabeni@redhat.com
====================
Link: https://patch.msgid.link/cover.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Mon, 7 Apr 2025 15:45:42 +0000 (17:45 +0200)]
udp_tunnel: use static call for GRO hooks when possible
It's quite common to have a single UDP tunnel type active in the
whole system. In such a case we can replace the indirect call for
the UDP tunnel GRO callback with a static call.
Add the related accounting in the control path and switch to static
call when possible. To keep the code simple use a static array for
the registered tunnel types, and size such array based on the kernel
config.
Note that there are valid kernel configurations leading to
UDP_MAX_TUNNEL_TYPES == 0 even with IS_ENABLED(CONFIG_NET_UDP_TUNNEL),
Explicitly skip the accounting in such a case, to avoid compile warning
when accessing "udp_tunnel_gro_types".
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/53d156cdfddcc9678449e873cc83e68fa1582653.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Mon, 7 Apr 2025 15:45:41 +0000 (17:45 +0200)]
udp_tunnel: create a fastpath GRO lookup.
Most UDP tunnels bind a socket to a local port, with ANY address, no
peer and no interface index specified.
Additionally it's quite common to have a single tunnel device per
namespace.
Track in each namespace the UDP tunnel socket respecting the above.
When only a single one is present, store a reference in the netns.
When such reference is not NULL, UDP tunnel GRO lookup just need to
match the incoming packet destination port vs the socket local port.
The tunnel socket never sets the reuse[port] flag[s]. When bound to no
address and interface, no other socket can exist in the same netns
matching the specified local port.
Matching packets with non-local destination addresses will be
aggregated, and eventually segmented as needed - no behavior changes
intended.
Restrict the optimization to kernel sockets only: it covers all the
relevant use-cases, and user-space owned sockets could be disconnected
and rebound after setup_udp_tunnel_sock(), breaking the uniqueness
assumption
Note that the UDP tunnel socket reference is stored into struct
netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep
all the fastpath-related netns fields in the same struct and allow
cacheline-based optimization. Currently both the IPv4 and IPv6 socket
pointer share the same cacheline as the `udp_table` field.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/41d16bc8d1257d567f9344c445b4ae0b4a91ede4.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 9 Apr 2025 00:16:43 +0000 (17:16 -0700)]
Merge tag 'linux_kselftest-kunit-6.15-rc2' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah Khan:
- Fix the tool to report test count in case of a late test plan when
tests are specified before the test plan
- Fix spelling error
* tag 'linux_kselftest-kunit-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Spelling s/slowm/slow/
kunit: tool: fix count of tests if late test plan
Chenyuan Yang [Mon, 7 Apr 2025 18:49:52 +0000 (13:49 -0500)]
net: libwx: handle page_pool_dev_alloc_pages error
page_pool_dev_alloc_pages could return NULL. There was a WARN_ON(!page)
but it would still proceed to use the NULL pointer and then crash.
This is similar to commit
001ba0902046
("net: fec: handle page_pool_dev_alloc_pages error").
This is found by our static analysis tool KNighter.
Signed-off-by: Chenyuan Yang <chenyuan0y@gmail.com>
Fixes:
3c47e8ae113a ("net: libwx: Support to receive packets in NAPI")
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250407184952.2111299-1-chenyuan0y@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Apr 2025 23:16:23 +0000 (16:16 -0700)]
Merge branch 'mptcp-only-inc-mpjoinackhmacfailure-for-hmac-failures'
Matthieu Baerts says:
====================
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was strangely not zero on the server side.
The first patch fixes this issue -- present since v5.9 -- and the second
one validates it in the selftests.
====================
Link: https://patch.msgid.link/20250407-net-mptcp-hmac-failure-mib-v1-0-3c9ecd0a3a50@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts (NGI0) [Mon, 7 Apr 2025 18:26:33 +0000 (20:26 +0200)]
selftests: mptcp: validate MPJoin HMacFailure counters
The parent commit fixes an issue around these counters where one of them
-- MPJoinAckHMacFailure -- was wrongly incremented in some cases.
This makes sure the counter is always 0. It should be incremented only
in case of corruption, or a wrong implementation, which should not be
the case in these selftests.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250407-net-mptcp-hmac-failure-mib-v1-2-3c9ecd0a3a50@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts (NGI0) [Mon, 7 Apr 2025 18:26:32 +0000 (20:26 +0200)]
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was not zero on the server side. The
counter was in fact incremented when the PM rejected new subflows,
because the 'subflow' limit was reached.
The fix is easy, simply dissociating the two cases: only the HMAC
validation check should increase MPTCP_MIB_JOINACKMAC counter.
Fixes:
4cf8b7e48a09 ("subflow: introduce and use mptcp_can_accept_new_subflow()")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250407-net-mptcp-hmac-failure-mib-v1-1-3c9ecd0a3a50@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Victor Nogueira [Mon, 7 Apr 2025 21:56:56 +0000 (18:56 -0300)]
selftests: tc-testing: Pre-load IFE action and its submodules
Recently we had some issues in parallel TDC where some of IFE tests are
failing due to some of IFE's submodules (like act_meta_skbtcindex and
act_meta_skbprio) taking too long to load [1]. To avoid that issue,
pre-load IFE and all its submodules before running any of the tests in
tdc.sh
[1] https://lore.kernel.org/netdev/
e909b2a0-244e-4141-9fa9-
1b7d96ab7d71@mojatatu.com/T/#u
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250407215656.2535990-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Qiuxu Zhuo [Tue, 11 Mar 2025 08:09:40 +0000 (16:09 +0800)]
selftests/mincore: Allow read-ahead pages to reach the end of the file
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Edward Liaw [Fri, 4 Apr 2025 22:12:20 +0000 (22:12 +0000)]
selftests/futex: futex_waitv wouldblock test should fail
Testcase should fail if -EWOULDBLOCK is not returned when expected value
differs from actual value from the waiter.
Link: https://lore.kernel.org/r/20250404221225.1596324-1-edliaw@google.com
Fixes:
9d57f7c79748920636f8293d2f01192d702fe390 ("selftests: futex: Test sys_futex_waitv() wouldblock")
Signed-off-by: Edward Liaw <edliaw@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Geert Uytterhoeven [Thu, 27 Mar 2025 15:33:01 +0000 (16:33 +0100)]
kunit: Spelling s/slowm/slow/
Fix a misspelling of "slow".
Link: https://lore.kernel.org/r/1f7ebf98598418914ec9f5b6d5cb8583d24a4bf0.1743089563.git.geert@linux-m68k.org
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <shuah@kernel.org>
Rae Moar [Wed, 19 Mar 2025 22:33:51 +0000 (22:33 +0000)]
kunit: tool: fix count of tests if late test plan
Fix test count with late test plan.
For example,
TAP version 13
ok 1 test1
1..4
Returns a count of 1 passed, 1 crashed (because it expects tests after
the test plan): returning the total count of 2 tests
Change this to be 1 passed, 1 error: total count of 1 test
Link: https://lore.kernel.org/r/20250319223351.1517262-1-rmoar@google.com
Signed-off-by: Rae Moar <rmoar@google.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <shuah@kernel.org>