Eric Dumazet [Thu, 3 Oct 2024 12:12:18 +0000 (12:12 +0000)]
net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute
Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ
packet scheduler.
Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.
This patch adds dev->max_pacing_offload_horizon expressing
this timing wheel horizon in nsec units.
This is a read-only attribute.
Unless a driver sets it, dev->max_pacing_offload_horizon
is zero.
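A minimal sketch of how a driver could advertise its horizon (the value
and call site are hypothetical; only the dev->max_pacing_offload_horizon
field itself comes from this patch):

    /* Hypothetical probe-time snippet: advertise a ~12 s timing wheel
     * horizon; the real value is device specific.
     */
    netdev->max_pacing_offload_horizon = 12ULL * NSEC_PER_SEC;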
v2: addressed Jakub's feedback (https://lore.kernel.org/netdev/20240930152304.472767-2-edumazet@google.com/T/#mf6294d714c41cc459962154cc2580ce3c9693663)
v3: added yaml doc (also per Jakub feedback)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241003121219.2396589-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mahesh Bandewar [Thu, 3 Oct 2024 10:15:06 +0000 (03:15 -0700)]
selftest/ptp: update ptp selftest to exercise the gettimex options
With the inclusion of commit c259acab839e ("ptp/ioctl: support
MONOTONIC{,_RAW} timestamps for PTP_SYS_OFFSET_EXTENDED"), the
PTP_SYS_OFFSET_EXTENDED ioctl now allows retrieval of pre/post
timestamps for the CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW timebases
along with the previously supported CLOCK_REALTIME.
This patch adds a command line option 'y' to the testptp program to
choose one of the allowed timebases (realtime aka system, monotonic,
and monotonic-raw).
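A hedged userspace sketch of the underlying ioctl the new option
exercises (the clockid field name is assumed from the referenced commit;
the exact testptp option syntax is whatever the patch defines):

    struct ptp_sys_offset_extended req = { .n_samples = 5 };

    req.clockid = CLOCK_MONOTONIC_RAW;  /* previously implicitly CLOCK_REALTIME */
    if (ioctl(ptp_fd, PTP_SYS_OFFSET_EXTENDED, &req))
            perror("PTP_SYS_OFFSET_EXTENDED");
    /* req.ts[i][0] and req.ts[i][2] now hold pre/post timestamps taken
     * on the chosen timebase, req.ts[i][1] holds the PHC time.
     */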
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://patch.msgid.link/20241003101506.769418-1-maheshb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 22:34:42 +0000 (15:34 -0700)]
Merge branch 'tcp-add-fast-path-in-timer-handlers'
Eric Dumazet says:
====================
tcp: add fast path in timer handlers
As mentioned at Netconf 2024:
TCP retransmit and delack timers are not stopped from
inet_csk_clear_xmit_timer() because we do not define
INET_CSK_CLEAR_TIMERS.
Enabling INET_CSK_CLEAR_TIMERS leads to lower performance,
mainly because del_timer() and mod_timer() are quite often
called from different CPUs.
What we can do instead is to add fast paths to tcp_write_timer()
and tcp_delack_timer() to avoid socket spinlock acquisition.
====================
Link: https://patch.msgid.link/20241002173042.917928-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 2 Oct 2024 17:30:42 +0000 (17:30 +0000)]
tcp: add a fast path in tcp_delack_timer()
The delack timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice: inet_csk_clear_xmit_timer()
is often called from another CPU. Calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_delack_timer() is called
at the timer expiration, while there is no ACK to transmit.
This can be detected very early, avoiding the socket spinlock.
Notes:
- test about tp->compressed_ack is racy,
but in the unlikely case there is a race, the dedicated
compressed_ack_timer hrtimer would close it.
- Even if the fast path is not taken, reading
icsk->icsk_ack.pending and tp->compressed_ack
before acquiring the socket spinlock reduces
acquisition time and chances of contention.
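A hedged sketch of the fast-path shape described above (illustrative;
the exact access primitives and surrounding code are the ones in the
patch). The tcp_write_timer() patch below applies the same idea to
icsk->icsk_pending:

    static void tcp_delack_timer(struct timer_list *t)
    {
            struct inet_connection_sock *icsk =
                    from_timer(icsk, t, icsk_delack_timer);
            struct sock *sk = &icsk->icsk_inet.sk;

            /* No delayed ACK and no compressed ACK pending: nothing to
             * do, return before touching the socket spinlock.
             */
            if (!READ_ONCE(icsk->icsk_ack.pending) &&
                !READ_ONCE(tcp_sk(sk)->compressed_ack))
                    goto out;

            bh_lock_sock(sk);
            /* ... existing slow path ... */
            bh_unlock_sock(sk);
    out:
            sock_put(sk);
    }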
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 2 Oct 2024 17:30:41 +0000 (17:30 +0000)]
tcp: add a fast path in tcp_write_timer()
The retransmit timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice: for active TCP flows, it is better
to only call mod_timer(), because there is a better chance of
keeping the timer unchanged. Also, inet_csk_clear_xmit_timer()
is often called from another CPU, and calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_write_timer() is called
at the timer expiration, while there is nothing to retransmit.
This can be detected very early, avoiding the socket spinlock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 2 Oct 2024 17:30:40 +0000 (17:30 +0000)]
tcp: annotate data-races around icsk->icsk_pending
icsk->icsk_pending can be read locklessly already.
Following patch in the series will add another lockless read.
Add smp_load_acquire() and smp_store_release() annotations
because following patch will add a test in tcp_write_timer(),
and READ_ONCE()/WRITE_ONCE() alone would possibly lead to races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 22:34:10 +0000 (15:34 -0700)]
Merge branch 'selftests-net-ioam-add-tunsrc-support'
Justin Iurman says:
====================
selftests: net: ioam: add tunsrc support
TL;DR: this patch set comes from a discussion we had with Jakub and Paolo
on aligning the ioam selftests with the new "tunsrc" feature.
This patch updates the IOAM selftests to support the new "tunsrc"
feature of IOAM. As a consequence, some changes were required. For
example, the IPv6 header must be accessed to check some fields (i.e.,
the source address for the "tunsrc" feature), which is not possible,
AFAIK, with IPv6 raw sockets. The latter are currently used with
IPV6_RECVHOPOPTS and were introduced by commit 187bbb6968af
("selftests: ioam: refactoring to align with the fix") to fix an issue.
Packet sockets are really needed here, which is one of the changes in
this patch (see the description of the topology at the top of ioam6.sh
for explanations). Another change is that all IPv6 addresses used in the
topology are now based on the documentation prefix (2001:db8::/32).
Also, the tests have been improved and there are now many more of them.
Overall, the script is more robust.
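A hedged illustration of why the socket switch matters (not the actual
selftest code; buffer handling is simplified): an IPv6 raw socket hands
userspace the data after the IPv6 header, whereas a packet socket
delivers the header too, so the tunnel source address can be checked.

    /* needs <sys/socket.h>, <arpa/inet.h>, <linux/if_ether.h>, <linux/ipv6.h> */
    int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IPV6));
    unsigned char buf[1500];
    ssize_t len = recv(fd, buf, sizeof(buf), 0);
    struct ipv6hdr *ip6h = (struct ipv6hdr *)buf;
    /* ip6h->saddr can now be compared against the configured tunsrc */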
====================
Link: https://patch.msgid.link/20241002162731.19847-1-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Justin Iurman [Wed, 2 Oct 2024 16:27:31 +0000 (18:27 +0200)]
selftests: net: add new ioam tests
This patch re-adds the (updated) ioam selftests with support for the
tunsrc feature.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20241002162731.19847-3-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Justin Iurman [Wed, 2 Oct 2024 16:27:30 +0000 (18:27 +0200)]
selftests: net: remove ioam tests
This patch entirely removes the ioam selftests to prepare for the next
patch in this series, which re-adds the new ioam selftests for better
readability.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20241002162731.19847-2-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Walleij [Tue, 1 Oct 2024 09:27:21 +0000 (11:27 +0200)]
net: dsa: mv88e6xxx: Support LED control
This adds control over the hardware LEDs in the Marvell
MV88E6xxx DSA switch and enables it for MV88E6352.
This fixes an imminent problem on the Inteno XG6846, which
has a WAN LED that simply does not work with hardware
defaults: a driver amendment is necessary.
The patch is modeled after Christian Marangi's LED support
code for the QCA8k DSA switch; I got help with the register
definitions from Tim Harvey.
After this patch it is possible to activate hardware link
indication like this (or with a similar script):
cd /sys/class/leds/Marvell\
88E6352:05:00:green:wan/
echo netdev > trigger
echo 1 > link
This makes the green link indicator come up on any link
speed. It is also possible to be more elaborate, like this:
cd /sys/class/leds/Marvell\
88E6352:05:00:green:wan/
echo netdev > trigger
echo 1 > link_1000
cd /sys/class/leds/Marvell\
88E6352:05:01:amber:wan/
echo netdev > trigger
echo 1 > link_100
This makes the green LED come on for a gigabit link and the
amber LED come on for a 100 Mbit link.
Each port has 2 LED slots (the hardware may use just one or
none) and the hardware triggers are specified in four bits per
LED; some of the hardware triggers are only available on the
SFP (fiber) uplink. The restrictions are described in the
port.h header file where the registers are described. For
example, selector 1 set for LED 1 on port 5 or 6 will indicate
Fiber 1000 (gigabit) and activity with a blinking LED, but
ONLY for an SFP connection. If port 5/6 is used with something
other than SFP, this selector is a no-op: something else needs
to be selected.
After the previous series rewriting the MV88E6xxx DT
bindings to use YAML, a "leds" subnode is already valid
for each port; in my scratch device tree it looks like
this:

    leds {
        #address-cells = <1>;
        #size-cells = <0>;

        led@0 {
            reg = <0>;
            color = <LED_COLOR_ID_GREEN>;
            function = LED_FUNCTION_LAN;
            default-state = "off";
            linux,default-trigger = "netdev";
        };

        led@1 {
            reg = <1>;
            color = <LED_COLOR_ID_AMBER>;
            function = LED_FUNCTION_LAN;
            default-state = "off";
        };
    };
This DT config does not yet configure everything: when the netdev
default trigger is assigned, the hw acceleration callbacks are
not called, and there is no way to set the netdev sub-trigger
type (such as link_1000) from the device tree, for example if you
want a gigabit link indicator. This has to be done from userspace
at this point.
We add LED operations to all switches in the 6352 family:
6172, 6176, 6240 and 6352.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241001-mv88e6xxx-leds-v4-1-cc11c4f49b18@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Kelley [Thu, 3 Oct 2024 03:53:33 +0000 (20:53 -0700)]
hv_netvsc: Don't assume cpu_possible_mask is dense
Current code allocates the pcpu_sum array with size num_possible_cpus().
This code assumes the cpu_possible_mask is dense, which is not true in
the general case per [1]. If cpu_possible_mask is sparse, the array
might be indexed by a value beyond the size of the array.
However, the configurations that Hyper-V provides to guest VMs on x86
and ARM64 hardware, in combination with how architecture specific code
assigns Linux CPU numbers, *does* always produce a dense cpu_possible_mask.
So the dense assumption is not currently causing failures. But for
robustness against future changes in how cpu_possible_mask is populated,
update the code to no longer assume dense.
The correct approach is to allocate and initialize the array using size
"nr_cpu_ids". While this leaves unused array entries corresponding to
holes in cpu_possible_mask, the holes are assumed to be minimal and hence
the amount of memory wasted by unused entries is minimal.
[1] https://lore.kernel.org/lkml/SN6PR02MB4157210CC36B2593F8572E5ED4692@SN6PR02MB4157.namprd02.prod.outlook.com/
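A hedged sketch of the sizing rule described above (illustrative types
and names, not the exact netvsc code):

    u64 *pcpu_sum;
    int cpu;

    /* Size by nr_cpu_ids so that any CPU number in cpu_possible_mask is
     * a valid index, even if the mask is sparse.
     */
    pcpu_sum = kvmalloc_array(nr_cpu_ids, sizeof(*pcpu_sum), GFP_KERNEL);
    if (!pcpu_sum)
            return;
    for_each_possible_cpu(cpu)
            pcpu_sum[cpu] = 0;      /* entries for impossible CPUs stay unused */
    kvfree(pcpu_sum);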
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20241003035333.49261-6-mhklinux@outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Thu, 3 Oct 2024 16:23:10 +0000 (09:23 -0700)]
ethtool: rss: fix rss key initialization warning
This warning is emitted when a driver does not populate a default RSS
key when one is not provided from userspace. Some devices do not
support individual RSS keys per context. For these devices, it is OK
to leave the key zeroed out in ethtool_rxfh_context. Do not warn on a
zeroed key when ethtool_ops.rxfh_per_ctx_key == 0.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20241003162310.1310576-1-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 19:30:18 +0000 (12:30 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-10-01 (ice)
This series contains updates to ice driver only.
Karol cleans up current PTP GPIO pin handling, fixes minor bugs,
refactors implementation for all products, introduces SDP (Software
Definable Pins) for E825C and implements reading SDP section from NVM
for E810 products.
Sergey replaces multiple aux buses and devices used in the PTP support
code with struct ice_adapter holding the necessary shared data.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Drop auxbus use for PTP to finalize ice_adapter move
ice: Use ice_adapter for PTP shared data instead of auxdev
ice: Initial support for E825C hardware in ice_adapter
ice: Add ice_get_ctrl_ptp() wrapper to simplify the code
ice: Introduce ice_get_phy_model() wrapper
ice: Enable 1PPS out from CGU for E825C products
ice: Read SDP section from NVM for pin definitions
ice: Disable shared pin on E810 on setfunc
ice: Cache perout/extts requests and check flags
ice: Align E810T GPIO to other products
ice: Add SDPs support for E825C
ice: Implement ice_ptp_pin_desc
====================
Link: https://patch.msgid.link/20241001201702.3252954-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 18:52:22 +0000 (11:52 -0700)]
Merge branch 'add-option-to-provide-opt_id-value-via-cmsg'
Vadim Fedorenko says:
====================
Add option to provide OPT_ID value via cmsg
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg.
The first patch in the series adds all needed definitions and implements
the function for UDP sockets. The explicit check of socket's type is not
added because subsequent patches in the series will add support for other
types of sockets. The documentation is also included into the first
patch.
Patch 2/4 adds support for TCP sockets. This part is simple and straight
forward.
Patch 3/4 adds support for RAW sockets. It's a bit tricky because
sock_tx_timestamp functions has to be refactored to receive full socket
cookie information to fill in ID. The commit
b534dc46c8ae ("net_tstamp:
add SOF_TIMESTAMPING_OPT_ID_TCP") did the conversion of sk_tsflags to
u32 but sock_tx_timestamp functions were not converted and still receive
16b flags. It wasn't a problem because SOF_TIMESTAMPING_OPT_ID_TCP was
not checked in these functions, that's why no backporting is needed.
Patch 4/4 adds selftests for new feature.
[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
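A hedged userspace sketch of the new control message (it assumes
SOF_TIMESTAMPING_OPT_ID is already enabled on the socket; iov/name setup
is omitted; see the documentation added in patch 1 for the normative
description):

    __u32 tskey = 42;                       /* caller-chosen timestamp key */
    char control[CMSG_SPACE(sizeof(tskey))] = {};
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_TS_OPT_ID;
    cmsg->cmsg_len = CMSG_LEN(sizeof(tskey));
    memcpy(CMSG_DATA(cmsg), &tskey, sizeof(tskey));
    sendmsg(fd, &msg, 0);                   /* TX timestamp will carry tskey */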
====================
Link: https://patch.msgid.link/20241001125716.2832769-1-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vadim Fedorenko [Tue, 1 Oct 2024 12:57:16 +0000 (05:57 -0700)]
selftests: txtimestamp: add SCM_TS_OPT_ID test
Extend txtimestamp test to run with fixed tskey using
SCM_TS_OPT_ID control message for all types of sockets.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-4-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vadim Fedorenko [Tue, 1 Oct 2024 12:57:15 +0000 (05:57 -0700)]
net_tstamp: add SCM_TS_OPT_ID for RAW sockets
RAW sockets are the last type of socket which supports
SOF_TIMESTAMPING_OPT_ID. To add the new option, this patch converts all
callers (direct and indirect) of _sock_tx_timestamp to provide a
sockcm_cookie instead of tsflags. And while here, fix
__sock_tx_timestamp to receive tsflags as __u32 instead of __u16.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-3-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vadim Fedorenko [Tue, 1 Oct 2024 12:57:14 +0000 (05:57 -0700)]
net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message
The SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps with packets sent via a socket. Unfortunately, there is no way
to reliably predict the socket timestamp ID value in case of an error
returned by sendmsg. For UDP sockets it is impossible because of the
lockless nature of UDP transmit: several threads may send packets in
parallel. In the case of RAW sockets, the MSG_MORE option makes things
complicated. More details are in the conversation [1].
This patch adds a new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing an ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.
[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 18:33:48 +0000 (11:33 -0700)]
Merge branch 'net-mlx5-hw-counters-refactor'
Tariq Toukan says:
====================
net/mlx5: hw counters refactor
This is a patchset re-post, see:
https://lore.kernel.org/20240815054656.2210494-7-tariqt@nvidia.com
In this patchset, Cosmin refactors hw counters and solves a perf scaling
issue.
Series generated against:
commit c824deb1a897 ("cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"")
HW counters are central to mlx5 driver operations. They are hardware
objects created and used alongside most steering operations, and queried
from a variety of places. Most counters are queried in bulk from a
periodic task in fs_counters.c.
Counter performance is important and, as such, a variety of improvements
have been made over the years. Currently, counters are allocated from
pools, which are bulk allocated to amortize the cost of firmware
commands. Counters are managed through an IDR, a doubly linked list and
two atomic single linked lists. Adding/removing counters is a complex
dance between user contexts requesting it and the mlx5_fc_stats_work
task which does most of the work.
Under high load (e.g. from connection tracking flow insertion/deletion),
the counter code becomes a bottleneck, as seen on flame graphs. Whenever
a counter is deleted, it gets added to a list and the wq task is
scheduled to run immediately to actually delete it. This is done via
mod_delayed_work which uses an internal spinlock. In some tests, waiting
for this spinlock took up to 66% of all samples.
This series refactors the counter code to use a more straightforward
approach, avoiding the mod_delayed_work problem and making the code
easier to understand. For that:
- patch #1 moves counters data structs to a more appropriate place.
- patch #2 simplifies the bulk query allocation scheme by using vmalloc.
- patch #3 replaces the IDR+3 lists with an xarray. This is the main
patch of the series, solving the spinlock congestion issue.
- patch #4 removes an unnecessary cacheline alignment causing a lot of
memory to be wasted.
- patches #5 and #6 are small cleanups enabled by the refactoring.
====================
Link: https://patch.msgid.link/20241001103709.58127-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:09 +0000 (13:37 +0300)]
net/mlx5: hw counters: Remove mlx5_fc_create_ex
It no longer serves any purpose and is identical to mlx5_fc_create, upon
which it was originally based.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241001103709.58127-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:08 +0000 (13:37 +0300)]
net/mlx5: hw counters: Don't maintain a counter count
num_counters is only used for deciding whether to grow the bulk query
buffer, which is done once more counters than a small initial threshold
are present. After that, maintaining num_counters serves no purpose.
This commit replaces that with an actual xarray traversal to count the
counters. This appears expensive at first sight, but is only done when
the number of counters is less than the initial threshold (8) and only
once every sampling interval. Once the number of counters goes above the
threshold, the bulk query buffer is grown to max size and the xarray
traversal is never done again.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241001103709.58127-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:07 +0000 (13:37 +0300)]
net/mlx5: hw counters: Drop unneeded cacheline alignment
The mlx5_fc struct has a cache for values queried from hw, which is
cacheline aligned. On x86_64, this results in:
struct mlx5_fc {
        u32                  id;            /*     0     4 */
        bool                 aging;         /*     4     1 */

        /* XXX 3 bytes hole, try to pack */

        struct mlx5_fc_bulk *bulk;          /*     8     8 */

        /* XXX 48 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        struct mlx5_fc_cache cache __attribute__((__aligned__(64))); /*    64    24 */
        u64                  lastpackets;   /*    88     8 */
        u64                  lastbytes;     /*    96     8 */

        /* size: 128, cachelines: 2, members: 6 */
        /* sum members: 53, holes: 2, sum holes: 51 */
        /* padding: 24 */
        /* forced aligns: 1, forced holes: 1, sum forced holes: 48 */
} __attribute__((__aligned__(64)));
(output from pahole).
...So a 48+24=72 byte waste. As far as I can determine, this serves no
purpose other than maybe making sure that the values in the cache do not
span two cachelines in the worst case scenario, but that's not a valid
enough reason to waste 72 bytes per counter, especially since this code
is not performance-critical. There could potentially be hundreds of
thousands of counters (e.g. for connection-tracking), so this quickly
adds up to multiple MB wasted.
This commit removes the alignment, resulting in:
struct mlx5_fc {
        [...]
        /* size: 56, cachelines: 1, members: 6 */
        /* sum members: 53, holes: 1, sum holes: 3 */
        /* last cacheline: 56 bytes */
};
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241001103709.58127-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:06 +0000 (13:37 +0300)]
net/mlx5: hw counters: Replace IDR+lists with xarray
Previously, managing counters was a complicated affair involving an IDR,
a sorted double linked list, two single linked lists and a complex dance
between a non-periodic wq task and users adding/deleting counters.
Adding was done by inserting new counters into the IDR and into a single
linked list, leaving the wq to process the list and actually add the
counters into the double linked list, maintained sorted with the IDR.
Deleting involved adding the counter into another single linked list,
leaving the wq to actually unlink the counter from the other structures
and release it.
Dumping the counters is done with the bulk query API, which relies on
the counter list being sorted and immutable during querying to
efficiently retrieve cached counter values.
Finally, the IDR data structure is deprecated.
This commit replaces all of that with an xarray.
Adding is now done directly, by using xa_lock.
Deleting is also done directly, under the xa_lock.
Querying is done from a periodic task running every sampling_interval
(default 1s) and uses the bulk query API for efficiency.
It works by iterating over the xarray:
- when a new bulk needs to be started, the bulk information is computed
under the xa_lock.
- the xa iteration state is saved and the xa_lock dropped.
- the HW is queried for bulk counter values.
- the xa_lock is reacquired.
- counter caches with ids covered by the bulk response are updated.
Querying always requests the max bulk length, for simplicity.
Counters can be added/deleted while the HW is queried. This is safe:
the HW API simply returns unknown values for counters not in HW, but
those values are never accessed. Only counters that were present in the
xarray before the bulk query have their cached values updated from it.
This cuts down the size of mlx5_fc by 4 pointers (88->56 bytes), which
amounts to ~3MB / 100K counters.
But more importantly, this solves the wq spinlock congestion issue seen
happening on high-rate counter insertion+deletion.
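A hedged sketch of the iteration pattern described above (the xarray
field name and bulk-query details are illustrative, not the exact mlx5
code):

    XA_STATE(xas, &fc_stats->counters, 0);
    struct mlx5_fc *counter;

    xas_lock(&xas);
    xas_for_each(&xas, counter, ULONG_MAX) {
            /* compute the next bulk to query under the lock ... */
            xas_pause(&xas);                /* save iteration state */
            xas_unlock(&xas);
            /* ... issue the bulk query to HW without holding the lock ... */
            xas_lock(&xas);
            /* ... update cached values for ids covered by the response */
    }
    xas_unlock(&xas);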
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241001103709.58127-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:05 +0000 (13:37 +0300)]
net/mlx5: hw counters: Use kvmalloc for bulk query buffer
The bulk query buffer starts out small (see [1]) and as soon as the
number of counters goes past the initial threshold grows to max
size (32K entries, 512KB) with a retry scheme.
This commit switches to using kvmalloc for the buffer, which has a near
zero likelihood of failing, and thus the explicit retry scheme becomes
superfluous and is taken out. On the low chance the allocation fails, it
will still be retried every sampling_interval, when the wq task runs.
[1] commit b247f32aecad ("net/mlx5: Dynamically resize flow counters
query buffer")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241001103709.58127-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:04 +0000 (13:37 +0300)]
net/mlx5: hw counters: Make fc_stats & fc_pool private
The mlx5_fc_stats and mlx5_fc_pool structs are only used from
fs_counters.c. As such, make them private there.
mlx5_fc_pool is not used or referenced at all outside fs_counters.
mlx5_fc_stats is referenced from mlx5_core_dev, so instead of having it
as a direct member (which requires exporting it from fs_counters), store
a pointer to it, allocate it on init and clear it on destroy.
One caveat is that a simple container_of() from the 'work' struct to
the outermost mlx5_core_dev struct no longer works, so an extra
pointer back to the parent dev had to be added to mlx5_fc_stats.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241001103709.58127-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Riyan Dhiman [Tue, 1 Oct 2024 11:05:43 +0000 (16:35 +0530)]
octeontx2-af: Change block parameter to const pointer in get_lf_str_list
Convert struct rvu_block block to const struct rvu_block *block in
the get_lf_str_list() function parameters. This improves efficiency by
avoiding structure copying and reflects the function's read-only
access to block.
Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241001110542.5404-2-riyandhiman14@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aleksander Jan Bajkowski [Thu, 3 Oct 2024 17:19:41 +0000 (19:19 +0200)]
net: macb: Adding support for Jumbo Frames up to 10240 Bytes in SAMA5D2
Per the SAMA5D2 device specification, the controller supports jumbo
frames of up to 10240 bytes, but the corresponding capability flag and
maximum length were not set in the driver's config structure.
As a result, setting an MTU greater than 1500 failed:
sudo ifconfig eth1 mtu 9000
SIOCSIFMTU: Invalid argument
Add this support to the driver so that it works as expected and designed.
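A hedged sketch of the kind of change described (the flag and field
names follow the pattern used by other macb configs and are illustrative
here):

    static const struct macb_config sama5d2_config = {
            /* existing fields kept as-is */
            .caps = MACB_CAPS_JUMBO,        /* OR'ed into the existing caps */
            .jumbo_max_len = 10240,         /* per the SAMA5D2 datasheet */
    };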
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Link: https://patch.msgid.link/20241003171941.8814-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 16:54:29 +0000 (09:54 -0700)]
Merge branch 'net-airoha-fix-pse-memory-configuration'
Lorenzo Bianconi says:
====================
net: airoha: Fix PSE memory configuration
Align PSE memory configuration to vendor SDK.
Increase initial value of PSE reserved memory in
airoha_fe_pse_ports_init() by the value used for the second Packet
Processor Engine (PPE2).
Do not overwrite the default value for the number of PSE reserved pages
in airoha_fe_set_pse_oq_rsv().
These changes fix issues which are not visible to the user.
v1: https://lore.kernel.org/20240930-airoha-eth-pse-fix-v1-0-f41f2f35abb9@kernel.org
====================
Link: https://patch.msgid.link/20241001-airoha-eth-pse-fix-v2-0-9a56cdffd074@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Tue, 1 Oct 2024 10:10:25 +0000 (12:10 +0200)]
net: airoha: fix PSE memory configuration in airoha_fe_pse_ports_init()
Align the PSE memory configuration with the vendor SDK. In particular,
increase the initial value of PSE reserved memory in the
airoha_fe_pse_ports_init() routine by the value used for the second
Packet Processor Engine (PPE2) and do not overwrite the default value.
Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet
support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241001-airoha-eth-pse-fix-v2-2-9a56cdffd074@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Tue, 1 Oct 2024 10:10:24 +0000 (12:10 +0200)]
net: airoha: read default PSE reserved pages value before updating
Store the default value for the number of PSE reserved pages in orig_val
at the beginning of the airoha_fe_set_pse_oq_rsv() routine, before
updating it with airoha_fe_set_pse_queue_rsv_pages().
Introduce the airoha_fe_get_pse_all_rsv() utility routine.
Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet
support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241001-airoha-eth-pse-fix-v2-1-9a56cdffd074@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 16:28:28 +0000 (09:28 -0700)]
Merge branch 'net-switch-to-scoped-device_for_each_child_node'
Javier Carrasco says:
====================
net: switch to scoped device_for_each_child_node()
This series switches from the device_for_each_child_node() macro to its
scoped variant. This makes the code more robust if new early exits are
added to the loops, because there is no need for explicit calls to
fwnode_handle_put(), which also simplifies existing code.
The non-scoped macros to walk over nodes turn error-prone as soon as
the loop contains early exits (break, goto, return), and patches to
fix them show up regularly, sometimes due to new error paths in an
existing loop [1].
Note that the child node is now declared in the macro, and therefore the
explicit declaration is no longer required.
The general functionality should not be affected by this modification.
If functional changes are found, please report them back as errors.
Link: https://lore.kernel.org/20240901160829.709296395@linuxfoundation.org
v1: https://lore.kernel.org/r/20240930-net-device_for_each_child_node_scoped-v1-0-bbdd7f9fd649@gmail.com
====================
Link: https://patch.msgid.link/20240930-net-device_for_each_child_node_scoped-v2-0-35f09333c1d7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Javier Carrasco [Mon, 30 Sep 2024 20:38:26 +0000 (22:38 +0200)]
net: hns: hisilicon: hns_dsaf_mac: switch to scoped device_for_each_child_node()
Use device_for_each_child_node_scoped() to simplify the code by removing
the need for explicit calls to fwnode_handle_put() in every error path.
This approach also accounts for any error path that could be added.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Link: https://patch.msgid.link/20240930-net-device_for_each_child_node_scoped-v2-2-35f09333c1d7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Javier Carrasco [Mon, 30 Sep 2024 20:38:25 +0000 (22:38 +0200)]
net: mdio: thunder: switch to scoped device_for_each_child_node()
There has already been an issue with the handling of early exits from
device_for_each_child_node() in this driver, and it was solved with commit
b1de5c78ebe9 ("net: mdio: thunder: Add missing fwnode_handle_put()") by
adding a call to fwnode_handle_put() right after the loop.
That solution is indeed valid, but if a new error path with a 'return'
is added to the loop, it will fail again. A safer approach is to use
the scoped variant of the macro, which automatically decrements the
refcount of the child node when it goes out of scope, removing the need
for explicit calls to fwnode_handle_put().
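A hedged sketch of the resulting pattern (the loop body and helper are
illustrative):

    /* The scoped variant declares 'child' itself and drops its reference
     * automatically on every exit path, so early returns no longer leak
     * a fwnode reference.
     */
    device_for_each_child_node_scoped(dev, child) {
            err = do_something(child);      /* hypothetical helper */
            if (err)
                    return err;             /* no fwnode_handle_put() needed */
    }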
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Link: https://patch.msgid.link/20240930-net-device_for_each_child_node_scoped-v2-1-35f09333c1d7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 16:25:18 +0000 (09:25 -0700)]
Merge branch 'qed-ethtool-d-faster-less-latency'
Michal Schmidt says:
====================
qed: 'ethtool -d' faster, less latency
Here is a patch to make 'ethtool -d' on a qede network device a lot
faster and 3 patches to make it cause less latency for other tasks on
non-preemptible kernels.
====================
Link: https://patch.msgid.link/20240930201307.330692-1-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Schmidt [Mon, 30 Sep 2024 20:13:07 +0000 (22:13 +0200)]
qed: put cond_resched() in qed_dmae_operation_wait()
It is OK to sleep in qed_dmae_operation_wait, because it is called only
in process context, while holding p_hwfn->dmae_info.mutex from one of
the qed_dmae_{host,grc}2{host,grc} functions.
The udelay(DMAE_MIN_WAIT_TIME=2) in the function is too short to replace
with usleep_range, but at least it's a suitable point for checking if we
should give up the CPU with cond_resched().
This lowers the latency caused by 'ethtool -d' from 10 ms to less than
2 ms on my test system with voluntary preemption.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://patch.msgid.link/20240930201307.330692-5-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Schmidt [Mon, 30 Sep 2024 20:13:06 +0000 (22:13 +0200)]
qed: allow the callee of qed_mcp_nvm_read() to sleep
qed_mcp_nvm_read has a loop where it calls qed_mcp_nvm_rd_cmd with the
argument b_can_sleep=false. And it sleeps once every 0x1000 bytes
read.
Simplify this by letting qed_mcp_nvm_rd_cmd itself sleep
(b_can_sleep=true). It will have slept at least once when successful
(in the "Wait for the MFW response" loop). So the extra sleep once every
0x1000 bytes becomes superfluous. Delete it.
On my test system with voluntary preemption, this lowers the latency
caused by 'ethtool -d' from 53 ms to 10 ms.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://patch.msgid.link/20240930201307.330692-4-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Schmidt [Mon, 30 Sep 2024 20:13:05 +0000 (22:13 +0200)]
qed: put cond_resched() in qed_grc_dump_ctx_data()
On a kernel with preemption none or voluntary, 'ethtool -d'
on a qede network device can cause a big latency spike.
The biggest part of it is the loop in qed_grc_dump_ctx_data.
The function is called only from the .get_size and .perform_dump
callbacks for the "grc" feature defined in qed_features_lookup[].
As far as I can see, they are used in:
- qed's devlink health reporter .dump op
- qede's ethtool get_regs/get_regs_len/get_dump_data ops
- qedf's qedf_get_grc_dump, called from:
- qedf_sysfs_write_grcdump - "grcdump" sysfs attribute write
- qedf_wq_grcdump - a workqueue
It is safe to sleep in all of them.
Let's insert a cond_resched() in the outer loop to let other tasks run.
Measured using this script:
#!/bin/bash
DEV=ens3f1
echo wakeup_rt > /sys/kernel/tracing/current_tracer
echo 0 > /sys/kernel/tracing/tracing_max_latency
echo 1 > /sys/kernel/tracing/tracing_on
echo "Setting the task CPU affinity"
taskset -p 1 $$ > /dev/null
echo "Starting the real-time task"
chrt -f 50 bash -c 'while sleep 0.01; do :; done' &
sleep 1
echo "Running: ethtool -d $DEV"
time ethtool -d $DEV > /dev/null
kill %1
echo 0 > /sys/kernel/tracing/tracing_on
echo "Measured latency: $(</sys/kernel/tracing/tracing_max_latency) us"
echo "To see the latency trace: less /sys/kernel/tracing/trace"
The patch lowers the latency from 180 ms to 53 ms on my test system with
voluntary preemption.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://patch.msgid.link/20240930201307.330692-3-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Schmidt [Mon, 30 Sep 2024 20:13:04 +0000 (22:13 +0200)]
qed: make 'ethtool -d' 10 times faster
As a side effect of commit 5401c3e09928 ("qed: allow sleep in
qed_mcp_trace_dump()"), 'ethtool -d' became much slower.
Almost all the time is spent collecting the "mcp_trace".
It is caused by sleeping too long in _qed_mcp_cmd_and_union.
When called with sleeping not allowed, the function delays for 10 µs
between firmware polls. But if sleeping is allowed, it sleeps for 10 ms
instead.
The sleeps in _qed_mcp_cmd_and_union are unnecessarily long.
Replace msleep with usleep_range, which allows achieving a polling
interval similar to the no-sleeping mode (10 - 20 µs).
The only caller, qed_mcp_cmd_and_union, can stop doing the
multiplication/division of the usecs/max_retries. The polling interval
and the number of retries do not need to be parameters at all.
On my test system, 'ethtool -d' now takes 4 seconds instead of 44.
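A hedged sketch of the polling change described (variable names are
illustrative):

    /* Poll the MFW with a ~10-20 us interval in both modes instead of
     * sleeping a full 10 ms per iteration when sleeping is allowed.
     */
    if (can_sleep)
            usleep_range(10, 20);           /* was msleep(10) */
    else
            udelay(10);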
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://patch.msgid.link/20240930201307.330692-2-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 15:59:49 +0000 (08:59 -0700)]
Merge branch 'net-mv643xx-devm-fixes'
Rosen Penev says:
====================
net: mv643xx: devm fixes
Small simplification and a fix for a seemingly wrong function usage.
====================
Link: https://patch.msgid.link/20240930202951.297737-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 20:29:51 +0000 (13:29 -0700)]
net: mv643xx: fix wrong devm_clk_get usage
This clock should be optional. In addition, PTR_ERR can be -EPROBE_DEFER,
in which case the driver should return it.
devm_clk_get_optional_enabled also allows removing the explicit clock
enable and disable calls.
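A hedged sketch of the replacement pattern (illustrative variable
names):

    /* A missing clock yields NULL (a no-op clock); real errors, including
     * -EPROBE_DEFER, are propagated.  Enable/disable are handled by devm,
     * so the explicit calls go away.
     */
    clk = devm_clk_get_optional_enabled(&pdev->dev, NULL);
    if (IS_ERR(clk))
            return PTR_ERR(clk);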
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240930202951.297737-3-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 20:29:50 +0000 (13:29 -0700)]
net: mv643xx: use devm_platform_ioremap_resource
This combines multiple steps in one function.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240930202951.297737-2-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 4 Oct 2024 15:56:41 +0000 (08:56 -0700)]
Merge branch 'net-ag71xx-small-cleanups'
Rosen Penev says:
====================
net: ag71xx: small cleanups
More devm and some loose ends.
====================
Link: https://patch.msgid.link/20240930181823.288892-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 18:18:23 +0000 (11:18 -0700)]
net: ag71xx: move assignment into main loop
Effectively, there is a main loop and an identical one below it that
adds a single assignment. It is simpler to move that assignment into the
main loop.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20240930181823.288892-6-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 18:18:22 +0000 (11:18 -0700)]
net: ag71xx: replace INIT_LIST_HEAD
LIST_HEAD is a shorter macro. No real difference.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20240930181823.288892-5-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 18:18:21 +0000 (11:18 -0700)]
net: ag71xx: remove platform_set_drvdata
platform_get_drvdata is never called as a result of all the devm
conversions.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20240930181823.288892-4-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 18:18:20 +0000 (11:18 -0700)]
net: ag71xx: use some dev_err_probe
These functions can return EPROBE_DEFER. Don't warn in such a case.
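A hedged sketch of the pattern (the call site and field are
hypothetical, not the exact ag71xx diff):

    /* dev_err_probe() stays silent for -EPROBE_DEFER and records the
     * deferral reason; other errors are still logged.
     */
    ag->clk_eth = devm_clk_get(&pdev->dev, "eth");
    if (IS_ERR(ag->clk_eth))
            return dev_err_probe(&pdev->dev, PTR_ERR(ag->clk_eth),
                                 "Failed to get eth clk\n");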
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20240930181823.288892-3-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 18:18:19 +0000 (11:18 -0700)]
net: ag71xx: use devm_ioremap_resource
We can just use res directly.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20240930181823.288892-2-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dr. David Alan Gilbert [Mon, 30 Sep 2024 13:29:53 +0000 (14:29 +0100)]
appletalk: Remove deadcode
alloc_ltalkdev in net/appletalk/dev.c is dead since commit
00f3696f7555 ("net: appletalk: remove cops support").
Removing it (and its helper) leaves dev.c and if_ltalk.h empty;
remove them and the Makefile entry.
tun.c was including that if_ltalk.h but actually wanted
the uapi version for LTALK_ALEN; fix up the path.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nick Child [Tue, 1 Oct 2024 16:35:31 +0000 (11:35 -0500)]
ibmvnic: Add stat for tx direct vs tx batched
Allow tracking of packets sent with send_subcrq direct vs
indirect. `ethtool -S <dev>` will now provide a counter
of the number of uses of each xmit method. This metric will
be useful in performance debugging.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241001163531.1803152-1-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rosen Penev [Mon, 30 Sep 2024 21:16:28 +0000 (14:16 -0700)]
net: marvell: mvmdio: use clk_get_optional
The code seems to be handling EPROBE_DEFER explicitly and if there's no
error, enables the clock. clk_get_optional exists for that.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240930211628.330703-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Oct 2024 23:58:27 +0000 (16:58 -0700)]
Merge branch 'gve-link-irqs-queues-and-napi-instances'
Joe Damato says:
====================
gve: Link IRQs, queues, and NAPI instances
This series uses the netdev-genl API to link IRQs and queues to NAPI IDs
so that this information is queryable by user apps. This is particularly
useful for epoll-based busy polling apps which rely on having access to
the NAPI ID.
I've tested these commits on a GCP instance with a GVE NIC configured
and have included test output in the commit messages for each patch
showing how to query the information.
[1]: https://lore.kernel.org/netdev/20240926030025.226221-1-jdamato@fastly.com/
====================
Link: https://patch.msgid.link/20240930210731.1629-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joe Damato [Mon, 30 Sep 2024 21:07:08 +0000 (21:07 +0000)]
gve: Map NAPI instances to queues
Use the netdev-genl interface to map NAPI instances to queues so that
this information is accessible to user programs via netlink.
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 8313, 'type': 'rx'},
{'id': 1, 'ifindex': 2, 'napi-id': 8314, 'type': 'rx'},
{'id': 2, 'ifindex': 2, 'napi-id': 8315, 'type': 'rx'},
{'id': 3, 'ifindex': 2, 'napi-id': 8316, 'type': 'rx'},
{'id': 4, 'ifindex': 2, 'napi-id': 8317, 'type': 'rx'},
[...]
{'id': 0, 'ifindex': 2, 'napi-id': 8297, 'type': 'tx'},
{'id': 1, 'ifindex': 2, 'napi-id': 8298, 'type': 'tx'},
{'id': 2, 'ifindex': 2, 'napi-id': 8299, 'type': 'tx'},
{'id': 3, 'ifindex': 2, 'napi-id': 8300, 'type': 'tx'},
{'id': 4, 'ifindex': 2, 'napi-id': 8301, 'type': 'tx'},
[...]
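A hedged sketch of the calls that establish these links (illustrative
call sites covering both this patch and the IRQ patch below):

    /* Associate the per-block NAPI with its IRQ and with its RX/TX queue
     * ids so netdev-genl can report the mapping.
     */
    netif_napi_set_irq(&block->napi, irq);
    netif_queue_set_napi(dev, q_idx, NETDEV_QUEUE_TYPE_RX, &block->napi);
    netif_queue_set_napi(dev, q_idx, NETDEV_QUEUE_TYPE_TX, &block->napi);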
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Link: https://patch.msgid.link/20240930210731.1629-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joe Damato [Mon, 30 Sep 2024 21:07:07 +0000 (21:07 +0000)]
gve: Map IRQs to NAPI instances
Use netdev-genl interface to map IRQs to NAPI instances so that this
information is accessible by user apps via netlink.
$ cat /proc/interrupts | grep gve | grep -v mgmnt | cut -f1 --delimiter=':'
34
35
36
37
38
39
40
[...]
65
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'id': 8288, 'ifindex': 2, 'irq': 65},
[...]
{'id': 8263, 'ifindex': 2, 'irq': 40},
{'id': 8262, 'ifindex': 2, 'irq': 39},
{'id': 8261, 'ifindex': 2, 'irq': 38},
{'id': 8260, 'ifindex': 2, 'irq': 37},
{'id': 8259, 'ifindex': 2, 'irq': 36},
{'id': 8258, 'ifindex': 2, 'irq': 35},
{'id': 8257, 'ifindex': 2, 'irq': 34}]
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Link: https://patch.msgid.link/20240930210731.1629-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sean Anderson [Mon, 30 Sep 2024 16:29:34 +0000 (12:29 -0400)]
selftests: net: csum: Clean up recv_verify_packet_ipv6
Rename ip_len to payload_len since the length in this case refers only
to the payload, and not the entire IP packet like for IPv4. While we're
at it, just use the variable directly when calling
recv_verify_packet_udp/tcp.
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240930162935.980712-1-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Mon, 30 Sep 2024 15:12:50 +0000 (17:12 +0200)]
selftests: mlxsw: rtnetlink: Use devlink_reload() API
The test runs "devlink reload" explicitly. Instead, it is better to use
devlink_reload() which waits for udev events to be processed. Do not sleep
after reload, as devlink_reload() blocks until all the netdevs are renamed.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/844509e3057b65277a7181a23c95b71ec95e8a56.1727706741.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dr. David Alan Gilbert [Mon, 30 Sep 2024 13:43:58 +0000 (14:43 +0100)]
net/rds: remove unused struct 'rds_ib_dereg_odp_mr'
'rds_ib_dereg_odp_mr' has been unused since the original commit
2eafa1746f17 ("net/rds: Handle ODP mr registration/unregistration").
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20240930134358.48647-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
FUJITA Tomonori [Mon, 30 Sep 2024 13:40:37 +0000 (13:40 +0000)]
rust: net::phy always define device_table in module_phy_driver macro
device_table in module_phy_driver macro is defined only when the
driver is built as a module. So a PHY driver imports phy::DeviceId
module in the following way then hits `unused import` warning when
it's compiled as built-in:
use kernel::net::phy::DeviceId;

kernel::module_phy_driver! {
    drivers: [PhyQT2025],
    device_table: [
        DeviceId::new_with_driver::<PhyQT2025>(),
    ],
Put device_table in a const. It's not included in the kernel image if
unused (when the driver is compiled as built-in), and the compiler
doesn't complain.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20240930134038.1309-1-fujita.tomonori@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Divya Koppera [Tue, 1 Oct 2024 14:44:21 +0000 (20:14 +0530)]
net: phy: microchip_t1: Interrupt support for lan887x
Add support for link up and link down interrupts in lan887x.
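A hedged sketch of the usual phylib plumbing for this (the register and
bit names are hypothetical, not the actual lan887x defines):

    static irqreturn_t lan887x_handle_interrupt(struct phy_device *phydev)
    {
            int irq_status;

            irq_status = phy_read_mmd(phydev, MDIO_MMD_VEND1, LAN887X_INT_STS);
            if (irq_status < 0) {
                    phy_error(phydev);
                    return IRQ_NONE;
            }
            if (!(irq_status & LAN887X_INT_LINK_CHANGE))
                    return IRQ_NONE;

            phy_trigger_machine(phydev);    /* re-evaluate link state */
            return IRQ_HANDLED;
    }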
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20241001144421.6661-1-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Oct 2024 23:21:23 +0000 (16:21 -0700)]
Merge branch 'ipv4-convert-ip_route_input_slow-and-its-callers-to-dscp_t'
Guillaume Nault says:
====================
ipv4: Convert ip_route_input_slow() and its callers to dscp_t.
Prepare ip_route_input_slow() and its call chain for the future
conversion of ->flowi4_tos.
The ->flowi4_tos field of "struct flowi4" is used in many different
places, which makes it hard to convert it from __u8 to dscp_t.
In order to avoid a big patch updating all its users at once, this
patch series gradually converts some users to dscp_t. Those users now
set ->flowi4_tos from a dscp_t variable that is converted to __u8 using
inet_dscp_to_dsfield().
Once all users of ->flowi4_tos use a dscp_t variable, converting
that field to dscp_t will just be a matter of removing all the
inet_dscp_to_dsfield() conversions.
This series concentrates on ip_route_input_slow() and its direct and
indirect callers.
====================
Link: https://patch.msgid.link/cover.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Tue, 1 Oct 2024 19:29:01 +0000 (21:29 +0200)]
ipv4: Convert ip_route_input_slow() to dscp_t.
Pass a dscp_t variable to ip_route_input_slow(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Only ip_route_input_rcu() actually calls ip_route_input_slow(). Since
it already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/d6bca5f87eea9e83a3861e6e05594cdd252583c9.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Tue, 1 Oct 2024 19:28:55 +0000 (21:28 +0200)]
ipv4: Convert ip_route_input_rcu() to dscp_t.
Pass a dscp_t variable to ip_route_input_rcu(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input_rcu() to consider are:
* ip_route_input_noref(), which already has a dscp_t variable to pass
as parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* inet_rtm_getroute(), which receives a u8 from user space and needs
to convert it with inet_dsfield_to_dscp().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/c4dbb5aa9cbc79c4fcb317abbffa7c7156bc56a7.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Tue, 1 Oct 2024 19:28:49 +0000 (21:28 +0200)]
ipv4: Convert ip_route_input_noref() to dscp_t.
Pass a dscp_t variable to ip_route_input_noref(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input_noref() to consider are:
* arp_process() in net/ipv4/arp.c. This function sets the tos
parameter to 0, which is already a valid dscp_t value, so it
doesn't need to be adjusted for the new prototype.
* ip_route_input(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* ipvlan_l3_rcv(), bpf_lwt_input_reroute(), ip_expire(),
ip_rcv_finish_core(), xfrm4_rcv_encap_finish() and
xfrm4_rcv_encap(), which get the DSCP directly from IPv4 headers
and can simply use the ip4h_dscp() helper.
While there, declare the IPv4 header pointers as const in
ipvlan_l3_rcv() and bpf_lwt_input_reroute().
Also, modify the declaration of ip_route_input_noref() in
include/net/route.h so that it matches the prototype of its
implementation in net/ipv4/route.c.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/a8a747bed452519c4d0cc06af32c7e7795d7b627.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Tue, 1 Oct 2024 19:28:43 +0000 (21:28 +0200)]
ipv4: Convert ip_route_input() to dscp_t.
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() to consider are:
* input_action_end_dx4_finish() and input_action_end_dt4() in
net/ipv6/seg6_local.c. These functions set the tos parameter to 0,
which is already a valid dscp_t value, so they don't need to be
adjusted for the new prototype.
* icmp_route_lookup(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(),
which get the DSCP directly from IPv4 headers. Define a helper to
read the .tos field of struct iphdr as dscp_t, so that these
functions don't have to do the conversion manually.
While there, declare *iph as const in br_nf_pre_routing_finish(),
declare its local variables in reverse-christmas-tree order and move
the "err = ip_route_input()" assignment out of the conditional to avoid
a checkpatch warning.
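The helper roughly as one would expect it to look (a sketch based on the
description above, not necessarily the exact definition in the patch):

    static inline dscp_t ip4h_dscp(const struct iphdr *ip4h)
    {
            return inet_dsfield_to_dscp(ip4h->tos);
    }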
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault [Tue, 1 Oct 2024 19:28:37 +0000 (21:28 +0200)]
ipv4: Convert icmp_route_lookup() to dscp_t.
Pass a dscp_t variable to icmp_route_lookup(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos. Rename that
variable ("tos" -> "dscp") to make the intent clear.
While there, reorganise the function parameters to fill up horizontal
space.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/294fead85c6035bcdc5fcf9a6bb4ce8798c45ba1.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexandre Ferrieux [Tue, 1 Oct 2024 23:14:38 +0000 (01:14 +0200)]
ipv4: avoid quadratic behavior in FIB insertion of common address
Mix the netns into all IPv4 FIB hashes to avoid massive collisions when
inserting the same address in many netns.
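A hedged sketch of the idea (the function name and bit width are
illustrative):

    /* Fold a per-netns salt into the bucket hash so that the same address
     * inserted in many netns no longer piles up in a single chain.
     */
    static inline u32 fib_hash_netns(const struct net *net, u32 val, int bits)
    {
            return hash_32(val ^ net_hash_mix(net), bits);
    }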
Signed-off-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241001231438.3855035-1-alexandre.ferrieux@orange.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Oct 2024 23:13:49 +0000 (16:13 -0700)]
Merge branch 'ena-link-irqs-queues-and-napi-instances'
Joe Damato says:
====================
ena: Link IRQs, queues, and NAPI instances
This series uses the netdev-genl API to link IRQs and queues to NAPI IDs
so that this information is queryable by user apps. This is particularly
useful for epoll-based busy polling apps which rely on having access to
the NAPI ID.
I've tested these commits on an EC2 instance with an ENA NIC configured
and have included test output in the commit messages for each patch
showing how to query the information.
I noted in the implementation that the driver requests an IRQ for
management purposes which does not have an associated NAPI. I tried
to take this into account in patch 1, but would appreciate if ENA
maintainers can verify I did this correctly.
v1: https://lore.kernel.org/all/20240930195617.37369-1-jdamato@fastly.com/
====================
Link: https://patch.msgid.link/20241002001331.65444-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joe Damato [Wed, 2 Oct 2024 00:13:28 +0000 (00:13 +0000)]
ena: Link queues to NAPIs
Link queues to NAPIs using the netdev-genl API so this information is
queryable.
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 8201, 'type': 'rx'},
{'id': 1, 'ifindex': 2, 'napi-id': 8202, 'type': 'rx'},
{'id': 2, 'ifindex': 2, 'napi-id': 8203, 'type': 'rx'},
{'id': 3, 'ifindex': 2, 'napi-id': 8204, 'type': 'rx'},
{'id': 4, 'ifindex': 2, 'napi-id': 8205, 'type': 'rx'},
{'id': 5, 'ifindex': 2, 'napi-id': 8206, 'type': 'rx'},
{'id': 6, 'ifindex': 2, 'napi-id': 8207, 'type': 'rx'},
{'id': 7, 'ifindex': 2, 'napi-id': 8208, 'type': 'rx'},
{'id': 0, 'ifindex': 2, 'napi-id': 8201, 'type': 'tx'},
{'id': 1, 'ifindex': 2, 'napi-id': 8202, 'type': 'tx'},
{'id': 2, 'ifindex': 2, 'napi-id': 8203, 'type': 'tx'},
{'id': 3, 'ifindex': 2, 'napi-id': 8204, 'type': 'tx'},
{'id': 4, 'ifindex': 2, 'napi-id': 8205, 'type': 'tx'},
{'id': 5, 'ifindex': 2, 'napi-id': 8206, 'type': 'tx'},
{'id': 6, 'ifindex': 2, 'napi-id': 8207, 'type': 'tx'},
{'id': 7, 'ifindex': 2, 'napi-id': 8208, 'type': 'tx'}]
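The per-ring registration behind this is a single call; a hedged sketch
of the pattern (the adapter/NAPI field names are illustrative, not the
actual ENA ones):
/* Illustrative only: associate each RX/TX queue index with its NAPI
 * instance so netdev-genl can report the mapping queried above.
 */
netif_queue_set_napi(adapter->netdev, qid, NETDEV_QUEUE_TYPE_RX,
		     &adapter->ena_napi[qid].napi);
netif_queue_set_napi(adapter->netdev, qid, NETDEV_QUEUE_TYPE_TX,
		     &adapter->ena_napi[qid].napi);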
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20241002001331.65444-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joe Damato [Wed, 2 Oct 2024 00:13:27 +0000 (00:13 +0000)]
ena: Link IRQs to NAPI instances
Link IRQs to NAPI instances with netif_napi_set_irq. This information
can be queried with the netdev-genl API. Note that the ENA device
appears to allocate an IRQ for management purposes which does not have a
NAPI associated with it; this commit takes this into consideration to
accurately construct a map between IRQs and NAPI instances.
Compare the output of /proc/interrupts for my ena device with the output of
netdev-genl after applying this patch:
$ cat /proc/interrupts | grep enp55s0 | cut -f1 --delimiter=':'
94
95
96
97
98
99
100
101
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'id': 8208, 'ifindex': 2, 'irq': 101},
{'id': 8207, 'ifindex': 2, 'irq': 100},
{'id': 8206, 'ifindex': 2, 'irq': 99},
{'id': 8205, 'ifindex': 2, 'irq': 98},
{'id': 8204, 'ifindex': 2, 'irq': 97},
{'id': 8203, 'ifindex': 2, 'irq': 96},
{'id': 8202, 'ifindex': 2, 'irq': 95},
{'id': 8201, 'ifindex': 2, 'irq': 94}]
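The IRQ association is similarly a one-liner per vector; a hedged sketch
(loop bounds and field names are illustrative, and the management vector
without a NAPI is skipped):
/* Illustrative only: tag each I/O NAPI with its IRQ number so the
 * napi-get dump above can report it.
 */
for (i = 0; i < adapter->num_io_queues; i++)
	netif_napi_set_irq(&adapter->ena_napi[i].napi,
			   adapter->io_irq[i].vector);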
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20241002001331.65444-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Oct 2024 22:32:06 +0000 (15:32 -0700)]
Merge branch 'packing-various-improvements-and-kunit-tests'
Jacob Keller says:
====================
packing: various improvements and KUnit tests
This series contains a handful of improvements and fixes for the packing
library, including the addition of KUnit tests.
There are two major changes which might be considered bug fixes:
1) The library is updated to handle arbitrary buffer lengths, fixing
undefined behavior when operating on buffers which are not a multiple
of 4 bytes.
2) The behavior of QUIRK_MSB_ON_THE_RIGHT is fixed to match the intended
behavior when operating on packings that are not byte aligned.
These are not sent to net because no driver currently depends on this
behavior. For (1), the existing users of the packing API all operate on
buffers which are multiples of 4-bytes. For (2), no driver currently uses
QUIRK_MSB_ON_THE_RIGHT. The incorrect behavior was found while writing
KUnit tests.
This series also includes a handful of minor cleanups from Vladimir, as
well as a change to introduce a separated pack() and unpack() API. This API
is not (yet) used by a driver, but is the first step in implementing
pack_fields() and unpack_fields() which will be used in future changes for
the ice driver and changes Vladimir has in progress for other drivers using
the packing API.
This series is part 1 of a 2-part series for implementing use of
lib/packing in the ice driver. The 2nd part includes a new pack_fields()
and unpack_fields() implementation inspired by the ice driver's existing
bit packing code. It is built on top of the split pack() and unpack()
code. Additionally, the KUnit tests are built on top of pack() and
unpack(), based on original selftests written by Vladimir.
Fitting the entire library changes and drivers changes into a single series
exceeded the usual series limits.
v1: https://lore.kernel.org/r/20240930-packing-kunit-tests-and-split-pack-unpack-v1-0-94b1f04aca85@intel.com
====================
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-0-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:59 +0000 (14:51 -0700)]
lib: packing: use GENMASK() for box_mask
This is a u8, so using GENMASK_ULL() for unsigned long long is
unnecessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-10-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:58 +0000 (14:51 -0700)]
lib: packing: use BITS_PER_BYTE instead of 8
This helps clarify what the 8 is for.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-9-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Wed, 2 Oct 2024 21:51:57 +0000 (14:51 -0700)]
lib: packing: fix QUIRK_MSB_ON_THE_RIGHT behavior
The QUIRK_MSB_ON_THE_RIGHT quirk is intended to modify pack() and unpack()
so that the most significant bit of each byte in the packed layout is on
the right.
The way the quirk is currently implemented is broken whenever the packing
code packs or unpacks any value that is not exactly a full byte.
The broken behavior can occur when packing any values smaller than one
byte, when packing any value that is not exactly a whole number of bytes,
or when the packing is not aligned to a byte boundary.
This quirk is documented in the following way:
1. Normally (no quirks), we would do it like this:
::
63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32
7                       6                       5                       4
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
3                       2                       1                       0
<snip>
2. If QUIRK_MSB_ON_THE_RIGHT is set, we do it like this:
::
56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39
7                       6                       5                       4
24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23  8  9 10 11 12 13 14 15  0  1  2  3  4  5  6  7
3                       2                       1                       0
That is, QUIRK_MSB_ON_THE_RIGHT does not affect byte positioning, but
inverts bit offsets inside a byte.
Essentially, the mapping of physical bit offsets should be reversed for a
given byte within the payload. This reversal should be fixed to the bytes
in the packing layout.
The logic to implement this quirk is handled within the
adjust_for_msb_right_quirk() function. This function does not work properly
when dealing with the bytes that contain only a partial amount of data.
In particular, consider trying to pack or unpack the range 53-44. We should
always be mapping the bits from the logical ordering to their physical
ordering in the same way, regardless of what sequence of bits we are
unpacking.
Thus, we should grab the following logical bits:
Logical:  55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40
                 ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
And pack them into the physical bits:
Physical: 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47
Logical:  48 49 50 51 52 53                   44 45 46 47
           ^  ^  ^  ^  ^  ^                    ^  ^  ^  ^
The current logic in adjust_for_msb_right_quirk is broken. I believe it is
intending to map according to the following:
Physical: 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47
Logical:        48 49 50 51 52 53 44 45 46 47
                 ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
That is, it tries to keep the bits at the start and end of a packing
together. This is wrong, as it makes the packing change what bit is being
mapped to what based on which bits you're currently packing or unpacking.
Worse, the actual calculations within adjust_for_msb_right_quirk don't make
sense.
Consider the case when packing the last byte of an unaligned packing. It
might have a start bit of 7 and an end bit of 5. This would have a width of
3 bits. The new_start_bit will be calculated as the width - the box_end_bit
- 1. This will underflow and produce a negative value, which will
ultimately result in generating a new box_mask of all 0s.
For any other values, the result of the calculations of the
new_box_end_bit, new_box_start_bit, and the new box_mask will result in the
exact same values for the box_end_bit, box_start_bit, and box_mask. This
makes the calculations completely irrelevant.
If box_end_bit is 0, and box_start_bit is 7, then the entire function of
adjust_for_msb_right_quirk will boil down to just:
*to_write = bitrev8(*to_write)
The other adjustments are attempting (incorrectly) to keep the bits in the
same place but just reversed. This is not the right behavior even if
implemented correctly, as it leaves the mapping dependent on the bit values
being packed or unpacked.
Remove adjust_for_msb_right_quirk() and just use bitrev8 to reverse the
byte order when interacting with the packed data.
In particular, for packing, we need to reverse both the box_mask and the
physical value being packed. This is done after shifting the value by
box_end_bit so that the reversed mapping is always aligned to the physical
buffer byte boundary. The box_mask is reversed as we're about to use it to
clear any stale bits in the physical buffer at this block.
For unpacking, we need to reverse the contents of the physical buffer
*before* masking with the box_mask. This is critical, as the box_mask is a
logical mask of the bit layout before handling the QUIRK_MSB_ON_THE_RIGHT.
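Condensed into code, the pack-side handling looks roughly like the sketch
below (a hedged illustration of the approach, not the literal
lib/packing.c hunk):
#include <linux/bitrev.h>

/* Write pval into one byte of the packed buffer under
 * QUIRK_MSB_ON_THE_RIGHT. pval is already shifted by box_end_bit, so
 * reversing both the mask and the value keeps the reversal anchored to
 * the physical byte boundary.
 */
static void pack_byte_msb_right(u8 *byte, u8 box_mask, u8 pval)
{
	box_mask = bitrev8(box_mask);
	pval = bitrev8(pval);

	*byte &= ~box_mask;	/* clear stale bits in the physical buffer */
	*byte |= pval;
}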
Add several new tests which cover this behavior. These tests will fail
without the fix and pass afterwards. Note that no current drivers make use
of QUIRK_MSB_ON_THE_RIGHT. I suspect this is why there have been no reports
of this inconsistency before.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-8-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Wed, 2 Oct 2024 21:51:56 +0000 (14:51 -0700)]
lib: packing: add additional KUnit tests
While reviewing the initial KUnit tests for lib/packing, Przemek pointed
out that the test values have duplicate bytes in the input sequence.
In addition, I noticed that the unit tests pack and unpack on a byte
boundary, instead of crossing bytes. Thus, we lack good coverage of the
corner cases of the API.
Add additional unit tests to cover packing and unpacking byte buffers which
do not have duplicate bytes in the unpacked value, and which pack and
unpack to an unaligned offset.
A careful reviewer may note the lack of tests for QUIRK_MSB_ON_THE_RIGHT. This
is because I found issues with that quirk during test implementation. This
quirk will be fixed and the tests will be included in a future change.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-7-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Wed, 2 Oct 2024 21:51:55 +0000 (14:51 -0700)]
lib: packing: add KUnit tests adapted from selftests
Add 24 simple KUnit tests for the lib/packing.c pack() and unpack() APIs.
The first 16 tests exercise all combinations of quirks with a simple magic
number value on a 16-byte buffer. The remaining 8 tests cover
non-multiple-of-4 buffer sizes.
These tests were originally written by Vladimir as simple selftest
functions. I adapted them to KUnit, refactoring them into a table driven
approach. This will aid in adding additional tests in the future.
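To give a feel for the shape of the result, here is a heavily simplified,
hypothetical sketch of a table-driven KUnit round-trip test (the struct,
case values and suite names are illustrative, not the actual
lib/packing_test.c contents; the pack()/unpack() prototypes are assumed
from the wrappers introduced in this series):
#include <kunit/test.h>
#include <linux/kernel.h>
#include <linux/packing.h>
#include <linux/string.h>

/* Hypothetical test table; the real cases differ. */
struct packing_case {
	u64 uval;		/* value to pack and unpack again */
	size_t start, end;	/* bit range within the packed buffer */
	u8 quirks;
};

static const struct packing_case cases[] = {
	{ .uval = 0xcafeULL, .start = 27, .end = 12, .quirks = 0 },
	{ .uval = 0xcafeULL, .start = 27, .end = 12,
	  .quirks = QUIRK_LSW32_IS_FIRST },
};

static void packing_roundtrip_test(struct kunit *test)
{
	u8 pbuf[16];
	u64 out;
	int i;

	for (i = 0; i < ARRAY_SIZE(cases); i++) {
		const struct packing_case *c = &cases[i];

		memset(pbuf, 0, sizeof(pbuf));
		out = 0;
		KUNIT_ASSERT_EQ(test, 0, pack(pbuf, c->uval, c->start, c->end,
					      sizeof(pbuf), c->quirks));
		KUNIT_ASSERT_EQ(test, 0, unpack(pbuf, &out, c->start, c->end,
						sizeof(pbuf), c->quirks));
		KUNIT_EXPECT_EQ(test, c->uval, out);
	}
}

static struct kunit_case packing_kunit_cases[] = {
	KUNIT_CASE(packing_roundtrip_test),
	{}
};

static struct kunit_suite packing_kunit_suite = {
	.name = "packing-roundtrip",
	.test_cases = packing_kunit_cases,
};
kunit_test_suite(packing_kunit_suite);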
Co-developed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-6-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:54 +0000 (14:51 -0700)]
lib: packing: duplicate pack() and unpack() implementations
packing() is now used in some hot paths, and it would be good to get rid
of some ifs and buts that depend on "op", to speed things up a little bit.
With the main implementations now taking size_t endbit, we no longer
have to check for negative values. Update the local integer variables to
also be size_t to match.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-5-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:53 +0000 (14:51 -0700)]
lib: packing: add pack() and unpack() wrappers over packing()
Geert Uytterhoeven described packing() as "really bad API" because of
not being able to enforce const correctness. The same function is used
both when "pbuf" is input and "uval" is output, as in the other way
around.
Create 2 wrapper functions where const correctness can be ensured.
Do ugly type casts inside, to be able to reuse packing() as currently
implemented - which will _not_ modify the input argument.
Also, take the opportunity to change the type of startbit and endbit to
size_t - an unsigned type - in these new function prototypes. When int,
an extra check for negative values is necessary. Hopefully, when
packing() goes away completely, that check can be dropped.
My concern is that code which does rely on the conditional directionality
of packing() is harder to refactor without blowing up in size. So it may
take a while to completely eliminate packing(). But let's make alternatives
available for those who do not need that.
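For the sake of illustration, the wrappers could look roughly like this
(a hedged sketch; the exact prototypes and checks in lib/packing.c may
differ):
/* Const-correct wrappers delegating to the legacy packing(); the
 * (void *) cast is the "ugly type cast" mentioned above and relies on
 * packing() not modifying the buffer in the UNPACK direction.
 */
int pack(void *pbuf, u64 uval, size_t startbit, size_t endbit,
	 size_t pbuflen, u8 quirks)
{
	return packing(pbuf, &uval, startbit, endbit, pbuflen, PACK, quirks);
}

int unpack(const void *pbuf, u64 *uval, size_t startbit, size_t endbit,
	   size_t pbuflen, u8 quirks)
{
	return packing((void *)pbuf, uval, startbit, endbit, pbuflen,
		       UNPACK, quirks);
}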
Link: https://lore.kernel.org/netdev/20210223112003.2223332-1-geert+renesas@glider.be/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-4-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:52 +0000 (14:51 -0700)]
lib: packing: remove kernel-doc from header file
It is not necessary to have the kernel-doc duplicated both in the
header and in the implementation. It is better to have it near the
implementation of the function, since in C, a function can have N
declarations, but only one definition.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-3-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:51 +0000 (14:51 -0700)]
lib: packing: adjust definitions and implementation for arbitrary buffer lengths
Jacob Keller has a use case for packing() in the intel/ice networking
driver, but it cannot be used as-is.
Simply put, the API quirks for LSW32_IS_FIRST and LITTLE_ENDIAN are
naively implemented with the undocumented assumption that the buffer
length must be a multiple of 4. All calculations of group offsets and
offsets of bytes within groups assume that this is the case. But in the
ice case, this does not hold true. For example, packing into a buffer
of 22 bytes would yield wrong results, but pretending it was a 24 byte
buffer would work.
Rather than requiring such hacks, and leaving a big question mark when
it comes to discontinuities in the accessible bit fields of such a buffer,
we should extend the packing API to support this use case.
It turns out that we can keep the design in terms of groups of 4 bytes,
but also make it work if the total length is not a multiple of 4.
Just like before, imagine the buffer as a big number, and its most
significant bytes (the ones that would make up to a multiple of 4) are
missing. Thus, with a big endian (no quirks) interpretation of the
buffer, those most significant bytes would be absent from the beginning
of the buffer, and with a LSW32_IS_FIRST interpretation, they would be
absent from the end of the buffer. The LITTLE_ENDIAN quirk, in the
packing() API world, only affects byte ordering within groups of 4.
Thus, it does not change which bytes are missing. Only the significance
of the remaining bytes within the (smaller) group.
No change intended for buffer sizes which are multiples of 4. Tested
with the sja1105 driver and with downstream unit tests.
Link: https://lore.kernel.org/netdev/a0338310-e66c-497c-bc1f-a597e50aa3ff@intel.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-2-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Wed, 2 Oct 2024 21:51:50 +0000 (14:51 -0700)]
lib: packing: refuse operating on bit indices which exceed size of buffer
While reworking the implementation, it became apparent that this check
does not exist.
There is no functional issue yet, because at call sites, "startbit" and
"endbit" are always hardcoded to correct values, and never come from the
user.
Even with the upcoming support of arbitrary buffer lengths, the
"startbit >= 8 * pbuflen" check will remain correct. This is because
we intend to always interpret the packed buffer in a way that avoids
discontinuities in the available bit indices.
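In rough terms, the added sanity check amounts to the following (a
sketch, not the literal hunk):
/* Reject a reversed range or bit indices beyond the packed buffer;
 * startbit is the more significant (larger) index by convention.
 */
if (startbit < endbit || startbit >= BITS_PER_BYTE * pbuflen)
	return -EINVAL;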
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241002-packing-kunit-tests-and-split-pack-unpack-v2-1-8373e551eae3@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Oct 2024 17:05:55 +0000 (10:05 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
No conflicts and no adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 3 Oct 2024 16:44:00 +0000 (09:44 -0700)]
Merge tag 'net-6.12-rc2' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from ieee802154, bluetooth and netfilter.
Current release - regressions:
- eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
- eth: am65-cpsw: fix forever loop in cleanup code
Current release - new code bugs:
- eth: mlx5: HWS, fixed double-free in error flow of creating SQ
Previous releases - regressions:
- core: avoid potential underflow in qdisc_pkt_len_init() with UFO
- core: test for not too small csum_start in virtio_net_hdr_to_skb()
- vrf: revert "vrf: remove unnecessary RCU-bh critical section"
- bluetooth:
- fix uaf in l2cap_connect
- fix possible crash on mgmt_index_removed
- dsa: improve shutdown sequence
- eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
- eth: ip_gre: fix drops of small packets in ipgre_xmit
Previous releases - always broken:
- core: fix gso_features_check to check for both
dev->gso_{ipv4_,}max_size
- core: fix tcp fraglist segmentation after pull from frag_list
- netfilter: nf_tables: prevent nf_skb_duplicated corruption
- sctp: set sk_state back to CLOSED if autobind fails in
sctp_listen_start
- mac802154: fix potential RCU dereference issue in
mac802154_scan_worker
- eth: fec: restart PPS after link state change"
* tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
doc: net: napi: Update documentation for napi_schedule_irqoff
net/ncsi: Disable the ncsi work before freeing the associated structure
net: phy: qt2025: Fix warning: unused import DeviceId
gso: fix udp gso fraglist segmentation after pull from frag_list
bridge: mcast: Fail MDB get request on empty entry
vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code
net: phy: realtek: Check the index value in led_hw_control_get
ppp: do not assume bh is held in ppp_channel_bridge_input()
selftests: rds: move include.sh to TEST_FILES
net: test for not too small csum_start in virtio_net_hdr_to_skb()
net: gso: fix tcp fraglist segmentation after pull from frag_list
ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
net: add more sanity checks to qdisc_pkt_len_init()
net: avoid potential underflow in qdisc_pkt_len_init() with UFO
net: ethernet: ti: cpsw_ale: Fix warning on some platforms
net: microchip: Make FDMA config symbol invisible
...
Linus Torvalds [Thu, 3 Oct 2024 16:38:16 +0000 (09:38 -0700)]
Merge tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- small cleanup patches leveraging struct size to improve access bounds checking
* tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: Use struct_size() to improve smb_direct_rdma_xmit()
ksmbd: Annotate struct copychunk_ioctl_req with __counted_by_le()
ksmbd: Use struct_size() to improve get_file_alternate_info()
Linus Torvalds [Thu, 3 Oct 2024 16:22:50 +0000 (09:22 -0700)]
Merge tag 'vfs-6.12-rc2.fixes.2' of git://git./linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"vfs:
- Ensure that iter_folioq_get_pages() advances to the next slot,
otherwise it will end up using the same folio with an out-of-bounds
offset.
iomap:
- Don't unshare delalloc extents which can't be reflinked, and thus
can't be shared.
- Constrain the file range passed to iomap_file_unshare() directly in
iomap instead of requiring the callers to do it.
netfs:
- Use folioq_count instead of folioq_nr_slot to prevent an
uninitialized value warning in netfs_clear_buffer().
- Fix missing wakeup after issuing writes by scheduling the write
collector only if all the subrequest queues are empty and thus no
writes are pending.
- Fix two minor documentation bugs"
* tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: constrain the file range passed to iomap_file_unshare
iomap: don't bother unsharing delalloc extents
netfs: Fix missing wakeup after issuing writes
Documentation: add missing folio_queue entry
folio_queue: fix documentation
netfs: Fix a KMSAN uninit-value error in netfs_clear_buffer
iov_iter: fix advancing slot in iter_folioq_get_pages()
Erni Sri Satya Vennela [Mon, 30 Sep 2024 05:42:14 +0000 (22:42 -0700)]
net: mana: Add get_link and get_link_ksettings in ethtool
Add support for the ethtool get_link and get_link_ksettings
operations. Display standard port information using ethtool.
Before the change:
$ethtool enP30832s1
> No data available
After the change:
$ethtool enP30832s1
> Settings for enP30832s1:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1727674934-12130-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Xin Long [Mon, 30 Sep 2024 20:49:51 +0000 (16:49 -0400)]
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
In sctp_listen_start() invoked by sctp_inet_listen(), it should set the
sk_state back to CLOSED if sctp_autobind() fails due to whatever reason.
Otherwise, next time when calling sctp_inet_listen(), if sctp_sk(sk)->reuse
is already set via setsockopt(SCTP_REUSE_PORT), sctp_sk(sk)->bind_hash will
be dereferenced as sk_state is LISTENING, which causes a crash as bind_hash
is NULL.
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:sctp_inet_listen+0x7f0/0xa20 net/sctp/socket.c:8617
Call Trace:
<TASK>
__sys_listen_socket net/socket.c:1883 [inline]
__sys_listen+0x1b7/0x230 net/socket.c:1894
__do_sys_listen net/socket.c:1902 [inline]
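The shape of the fix is a one-line state rollback; a hedged sketch (see
net/sctp/socket.c for the actual code, the helper used there may differ):
/* In sctp_listen_start(): undo the LISTENING transition if autobind
 * fails, so a later listen() cannot dereference a NULL bind_hash.
 */
if (!ep->base.bind_addr.port) {
	if (sctp_autobind(sk)) {
		inet_sk_set_state(sk, SCTP_SS_CLOSED);
		return -EAGAIN;
	}
}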
Fixes:
5e8f3f703ae4 ("sctp: simplify sctp listening code")
Reported-by: syzbot+f4e0f821e3a3b7cee51d@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/a93e655b3c153dc8945d7a812e6d8ab0d52b7aa0.1727729391.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ravikanth Tuniki [Mon, 30 Sep 2024 19:13:35 +0000 (00:43 +0530)]
dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
Add the missing reg minItems since, per the current binding document,
only the ethernet MAC IO space is a supported configuration.
There is a bug in the schema: the current examples contain both 64-bit
and 32-bit addressing. Schema validation only passes incidentally,
because it treats one 64-bit reg address as two 32-bit reg address
entries. If we change the axi_ethernet_eth1 example node's reg
addressing to 32-bit, schema validation reports:
Documentation/devicetree/bindings/net/xlnx,axi-ethernet.example.dtb:
ethernet@40000000: reg: [[1073741824, 262144]] is too short
To fix it add missing reg minItems constraints and to make things clearer
stick to 32-bit addressing in examples.
Fixes:
cbb1ca6d5f9a ("dt-bindings: net: xlnx,axi-ethernet: convert bindings document to yaml")
Signed-off-by: Ravikanth Tuniki <ravikanth.tuniki@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/1727723615-2109795-1-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sean Anderson [Mon, 30 Sep 2024 15:39:54 +0000 (11:39 -0400)]
doc: net: napi: Update documentation for napi_schedule_irqoff
Since commit
8380c81d5c4f ("net: Treat __napi_schedule_irqoff() as
__napi_schedule() on PREEMPT_RT"), napi_schedule_irqoff will do the
right thing if IRQs are threaded. Therefore, there is no need to use
IRQF_NO_THREAD.
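In driver terms, the guidance boils down to something like the sketch
below (driver names are hypothetical):
#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mynic_queue {
	struct napi_struct napi;
	/* ... */
};

/* A hard IRQ handler may call napi_schedule_irqoff() directly even if
 * the interrupt ends up force-threaded: on PREEMPT_RT it behaves like
 * napi_schedule(), so IRQF_NO_THREAD is not needed when requesting it.
 */
static irqreturn_t mynic_rx_irq(int irq, void *data)
{
	struct mynic_queue *q = data;

	napi_schedule_irqoff(&q->napi);
	return IRQ_HANDLED;
}

/* ... request_irq(irq, mynic_rx_irq, 0, "mynic-rx", q); */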
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240930153955.971657-1-sean.anderson@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 3 Oct 2024 10:01:04 +0000 (12:01 +0200)]
Merge tag 'nf-24-10-02' of git://git./linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix incorrect documentation in uapi/linux/netfilter/nf_tables.h
regarding flowtable hooks, from Phil Sutter.
2) Fix nft_audit.sh selftests with newer nft binaries, due to different
(valid) audit output, also from Phil.
3) Disable BH when duplicating packets via nf_dup infrastructure,
otherwise race on nf_skb_duplicated for locally generated traffic.
From Eric.
4) Missing return in callback of selftest C program, from zhang jiao.
netfilter pull request 24-10-02
* tag 'nf-24-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: Add missing return value
netfilter: nf_tables: prevent nf_skb_duplicated corruption
selftests: netfilter: Fix nft_audit.sh for newer nft binaries
netfilter: uapi: NFTA_FLOWTABLE_HOOK is NLA_NESTED
====================
Link: https://patch.msgid.link/20241002202421.1281311-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shradha Gupta [Mon, 30 Sep 2024 03:44:35 +0000 (20:44 -0700)]
net: mana: Increase the DEF_RX_BUFFERS_PER_QUEUE to 1024
Through some experiments, we found that increasing the default
RX buffer count from 512 to 1024 gives slightly better throughput
and significantly reduces the no_wqe_rx errors on the receiver side.
Along with these, other parameters such as CPU usage and retransmitted
segments also show some improvement with the 1024 value.
Following are some snippets from the experiments:
ntttcp tests with 512 Rx buffers
---------------------------------------
connections | throughput  | no_wqe errs |
---------------------------------------
          1 |  40.93Gbps  |     123,211 |
         16 | 180.15Gbps  |     190,120 |
        128 | 180.20Gbps  |     173,508 |
        256 | 180.27Gbps  |     189,884 |
ntttcp tests with 1024 Rx buffers
---------------------------------------
connections | throughput  | no_wqe errs |
---------------------------------------
          1 |  44.22Gbps  |      19,864 |
         16 | 180.19Gbps  |       4,430 |
        128 | 180.21Gbps  |       2,560 |
        256 | 180.29Gbps  |       1,529 |
So, increase the default RX buffers per queue count to 1024.
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/1727667875-29908-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
zhang jiao [Fri, 27 Sep 2024 04:00:50 +0000 (12:00 +0800)]
selftests/net: Add missing va_end.
There is no va_end after va_copy, just add it.
Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240927040050.7851-1-zhangjiao2@cmss.chinamobile.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Darrick J. Wong [Wed, 2 Oct 2024 15:02:13 +0000 (08:02 -0700)]
iomap: constrain the file range passed to iomap_file_unshare
File contents can only be shared (i.e. reflinked) below EOF, so it makes
no sense to try to unshare ranges beyond EOF. Constrain the file range
parameters here so that we don't have to do that in the callers.
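A hedged sketch of the clamp at the top of iomap_file_unshare() (the
actual patch may differ in detail):
/* Sharing only exists below EOF, so bail out for ranges starting at or
 * beyond EOF and trim the length to the in-core size.
 */
loff_t size = i_size_read(inode);

if (pos < 0 || pos >= size)
	return 0;
len = min_t(loff_t, len, size - pos);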
Fixes:
5f4e5752a8a3 ("fs: add iomap_file_dirty")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20241002150213.GC21853@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Darrick J. Wong [Wed, 2 Oct 2024 15:00:40 +0000 (08:00 -0700)]
iomap: don't bother unsharing delalloc extents
If unshare encounters a delalloc reservation in the srcmap, that means
that the file range isn't shared because delalloc reservations cannot be
reflinked. Therefore, don't try to unshare them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20241002150040.GB21853@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Eddie James [Wed, 25 Sep 2024 15:55:23 +0000 (10:55 -0500)]
net/ncsi: Disable the ncsi work before freeing the associated structure
The work function can run after the ncsi device is freed, resulting
in use-after-free bugs or kernel panic.
Fixes:
2d283bdd079c ("net/ncsi: Resource management")
Signed-off-by: Eddie James <eajames@linux.ibm.com>
Link: https://patch.msgid.link/20240925155523.1017097-1-eajames@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
FUJITA Tomonori [Thu, 26 Sep 2024 12:14:03 +0000 (12:14 +0000)]
net: phy: qt2025: Fix warning: unused import DeviceId
Fix the following warning when the driver is compiled as built-in:
warning: unused import: `DeviceId`
  --> drivers/net/phy/qt2025.rs:18:5
   |
18 |     DeviceId, Driver,
   |     ^^^^^^^^
   |
   = note: `#[warn(unused_imports)]` on by default
device_table in module_phy_driver macro is defined only when the
driver is built as a module. Use phy::DeviceId in the macro instead of
importing `DeviceId` since `phy` is always used.
Fixes:
fd3eaad826da ("net: phy: add Applied Micro QT2025 PHY driver")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409190717.i135rfVo-lkp@intel.com/
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Reviewed-by: Fiona Behrens <me@kloenk.dev>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://patch.msgid.link/20240926121404.242092-1-fujita.tomonori@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Oct 2024 00:32:02 +0000 (17:32 -0700)]
Merge branch 'net-pcs-xpcs-cleanups-batch-1'
Russell King says:
====================
net: pcs: xpcs: cleanups batch 1
First, sorry for the bland series subject - this is the first in a
number of cleanup series to the XPCS driver. This series has some
functional changes beyond merely cleanups, notably the first patch.
This series starts off with a patch that moves the PCS reset from
the xpcs_create*() family of calls to when phylink first configures
the PHY. The motivation for this change is to get rid of the
interface argument to the xpcs_create*() functions, which I see as
unnecessary complexity. This patch should be tested on Wangxun
and STMMAC drivers.
Patch 2 removes the now unnecessary interface argument from the
internal xpcs_create() and xpcs_init_iface() functions. With this,
xpcs_init_iface() becomes a misnamed function, but patch 3 removes
this function, moving its now meager contents to xpcs_create().
Patch 4 adds xpcs_destroy_pcs() and xpcs_create_pcs_mdiodev()
functions which return and take a phylink_pcs, allowing SJA1105
and Wangxun drivers to be converted to using the phylink_pcs
structure internally.
Patches 5 through 8 convert both these drivers to that end.
Patch 9 drops the interface argument from the remaining xpcs_create*()
functions, addressing the only remaining caller of these functions,
that being the STMMAC driver.
As patch 7 removed the direct calls to the XPCS config/link-up
functions, the last patch makes these functions static.
====================
Link: https://patch.msgid.link/ZvwdKIp3oYSenGdH@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Oct 2024 16:04:57 +0000 (17:04 +0100)]
net: pcs: xpcs: make xpcs_do_config() and xpcs_link_up() internal
As nothing outside pcs-xpcs.c calls either xpcs_do_config() or
xpcs_link_up(), remove their exports and prototypes.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1svfMv-005ZIv-2M@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Oct 2024 16:04:51 +0000 (17:04 +0100)]
net: pcs: xpcs: drop interface argument from xpcs_create*()
The XPCS sub-driver no longer uses the "interface" argument to the
xpcs_create_mdiodev() and xpcs_create_fwnode() functions. Remove
this now unnecessary argument, updating the stmmac driver
appropriately.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1svfMp-005ZIp-UX@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Oct 2024 16:04:46 +0000 (17:04 +0100)]
net: dsa: sja1105: use phylink_pcs internally
Use xpcs_create_pcs_mdiodev() to create the XPCS instance, storing
and using the phylink_pcs pointer internally, rather than dw_xpcs.
Use xpcs_destroy_pcs() to destroy the XPCS instance when we've
finished with it.
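For orientation, the resulting create/destroy flow looks roughly like
this (the xpcs_create_pcs_mdiodev() signature and the driver fields are
assumptions based on the description above):
struct phylink_pcs *pcs;

pcs = xpcs_create_pcs_mdiodev(bus, mdio_addr);
if (IS_ERR(pcs))
	return PTR_ERR(pcs);
priv->pcs[port] = pcs;	/* store only the generic phylink_pcs handle */

/* ... and on teardown: */
xpcs_destroy_pcs(priv->pcs[port]);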
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1svfMk-005ZIj-R3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Oct 2024 16:04:41 +0000 (17:04 +0100)]
net: dsa: sja1105: call PCS config/link_up via pcs_ops structure
Call the PCS operations through the ops structure, which avoids needing
to export xpcs internal functions.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1svfMf-005ZId-Mx@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Oct 2024 16:04:36 +0000 (17:04 +0100)]
net: dsa: sja1105: simplify static configuration reload
The static configuration reload saves the port speed in the static
configuration tables by first converting it from the internal
respresentation to the SPEED_xxx ethtool representation, and then
converts it back to restore the setting. This is because
sja1105_adjust_port_config() takes the speed as SPEED_xxx.
However, this is unnecessarily complex. If we split
sja1105_adjust_port_config() up, we can simply save and restore the
mac[port].speed member in the static configuration tables.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1svfMa-005ZIX-If@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Tue, 1 Oct 2024 16:04:31 +0000 (17:04 +0100)]
net: wangxun: txgbe: use phylink_pcs internally
Use xpcs_create_pcs_mdiodev() to create the XPCS instance, storing
and using the phylink_pcs pointer internally, rather than dw_xpcs.
Use xpcs_destroy_pcs() to destroy the XPCS instance when we've
finished with it.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1svfMV-005ZIR-FE@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>