git.kernel.dk Git - linux-block.git/log

net/mlx5: Make vport QoS enablement more flexible for future extensions

Refactor esw_qos_vport_enable to support more generic configurations,
allowing it to be reused for new vport node types in future patches.

This refactor includes a new way to change the vport parent node by
disabling the current setup and re-enabling it with the new parent.
This change sets the foundation for adapting configuration based on the
parent type in future patches.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Integrate esw_qos_vport_enable logic into rate operations

Fold the esw_qos_vport_enable function into operations for configuring
maximum and minimum rates, simplifying QoS logic. This change
consolidates enabling and updating the scheduling element
configuration, streamlining how vport QoS is initialized and adjusted.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Generalize scheduling element operations

Introduce helper functions to create and destroy scheduling elements,
allowing flexible configuration for different scheduling element types.

The new helper functions streamline the process by centralizing error
handling and logging through esw_qos_sched_elem_op_warn, which now
accepts the operation type (create, destroy, or modify).

The changes also adjust the esw_qos_vport_enable and
mlx5_esw_qos_vport_disable functions to leverage the new generalized
create/destroy helpers.

The destroy functions now log errors with esw_warn without returning
them. This prevents unnecessary error handling since the node was
already destroyed and no further action is required from callers.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Refactor scheduling element configuration bitmasks

Refactor esw_qos_sched_elem_config to set bitmasks only when max_rate
or bw_share values change, allowing the function to configure nodes
with only one of these parameters.

This enables more flexible usage for nodes where only one parameter
requires configuration.

Remove scattered assignments and checks to centralize them within this
function, removing the now redundant esw_qos_set_node_max_rate
entirely.

With this refactor, also remove the assignment of the vport scheduling
node max rate to the parent max rate for unlimited vports
(where max rate is set to zero), as firmware already handles this
behavior.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Generalize max_rate and min_rate setting for nodes

Refactor max_rate and min_rate setting functions to operate on
mlx5_esw_sched_node, allowing for generalized handling of both vports
and nodes.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Simplify QoS normalization by removing error handling

This change updates esw_qos_normalize_min_rate to not return errors,
significantly simplifying the code.

Normalization failures are software bugs, and it's unnecessary to
handle them with rollback mechanisms. Instead,
`esw_qos_update_sched_node_bw_share` and `esw_qos_normalize_min_rate`
now return void, with any errors logged as warnings to indicate
potential software issues.

This approach avoids compensating for hidden bugs and removes error
handling from all places that perform normalization, streamlining
future patches.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-switch, refactor eswitch mode change

The E-switch mode was previously updated before removing and re-adding the
IB device, which could cause a temporary mismatch between the E-switch mode
and the IB device configuration.

To prevent this discrepancy, the IB device is now removed first, then
the E-switch mode is updated, and finally, the IB device is re-added.
This sequence ensures consistent alignment between the E-switch mode and
the IB device whenever the mode changes, regardless of the new mode value.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107194357.683732-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: Cache pmtu for all packet paths if multipath enabled

Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.

An example topology showing the problem:

                    |  host1
                +---------+
                |  dummy0 | 10.179.20.18/32  mtu9000
                +---------+
        +-----------+----------------+
    +---------+                     +---------+
    | ens17f0 |  10.179.2.141/31    | ens17f1 |  10.179.2.13/31
    +---------+                     +---------+
        |    (all here have mtu 9000)    |
    +------+                         +------+
    | ro1  |  10.179.2.140/31        | ro2  |  10.179.2.12/31
    +------+                         +------+
        |                                |
---------+------------+-------------------+------
                        |
                    +-----+
                    | ro3 | 10.10.10.10  mtu1500
                    +-----+
                        |
    ========================================
                some networks
    ========================================
                        |
                    +-----+
                    | eth0| 10.10.30.30  mtu9000
                    +-----+
                        |  host2

host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:

default proto static src 10.179.20.18
        nexthop via 10.179.2.12 dev ens17f1 weight 1
        nexthop via 10.179.2.140 dev ens17f0 weight 1

When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.

Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.

Host1 now have this routes to host2:

ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
    cache expires 521sec mtu 1500

ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
    cache

So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.

Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241108093427.317942-1-deliran@verdict.gg
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: netconsole: selftests: Check if netdevsim is available

The netconsole selftest relies on the availability of the netdevsim module.
To ensure the test can run correctly, we need to check if the netdevsim
module is either loaded or built-in before proceeding.

Update the netconsole selftest to check for the existence of
the /sys/bus/netdevsim/new_device file before running the test. If the
file is not found, the test is skipped with an explanation that the
CONFIG_NETDEVSIM kernel config option may not be enabled.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241108-netcon_selftest_deps-v1-1-1789cbf3adcd@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phylink-phylink_resolve-cleanups'

Russell King says:

====================
net: phylink: phylink_resolve() cleanups

This series does a bit of clean-up in phylink_resolve() to make the code
a little easier to follow.

Patch 1 moves the manual flow control setting in two of the switch
cases to after the switch().

Patch 2 changes the MLO_AN_FIXED case to be a simple if() statement,
reducing its indentation.

Patch 3 changes the MLO_AN_PHY case to also be a simple if() statment,
also reducing its indentation.

Patch 4 does the same for the last case.

Patch 5 reformats the code and comments for the reduced indentation,
making it easier to read.
====================

Link: https://patch.msgid.link/Zy411lVWe2SikuOs@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: clean up phylink_resolve()

Now that we have reduced the indentation level, clean up the code
formatting.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1t9RQz-002Ff5-EA@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: remove switch() statement in resolve handling

The switch() statement doesn't sit very well with the preceeding if()
statements, so let's just convert everything to if()s. As a result of
the two preceding commits, there is now only one case in the switch()
statement. Remove the switch statement and reduce the code indentation.
Code reformatting will be in the following commit.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1t9RQu-002Fez-AA@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: move MLO_AN_PHY resolve handling to if() statement

The switch() statement doesn't sit very well with the preceeding if()
statements, and results in excessive indentation that spoils code
readability. Continue cleaning this up by converting the MLO_AN_PHY
case to use an if() statmeent.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1t9RQp-002Fet-5W@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: move MLO_AN_FIXED resolve handling to if() statement

The switch() statement doesn't sit very well with the preceeding if()
statements, and results in excessive indentation that spoils code
readability. Begin cleaning this up by converting the MLO_AN_FIXED case
to an if() statement.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1t9RQk-002Fen-1A@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: move manual flow control setting

Move the handling of manual flow control configuration to a common
location during resolve. We currently evaluate this for all but
fixed links.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1t9RQe-002Feh-T1@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'suspend-irqs-during-application-busy-periods'

Joe Damato says:

====================
Suspend IRQs during application busy periods

This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.

Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.

I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].

~ The short explanation (TL;DR)

We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.

If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.

If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).

The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.

For a more in depth explanation, please continue reading.

~ Comparison with existing mechanisms

Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:

At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.

At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.

While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.

An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.

We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.

This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.

~ Usage details

IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.

Here's how it is intended to work:
  - The user application (or system administrator) uses the netdev-genl
    netlink interface to set the pre-existing napi_defer_hard_irqs and
    gro_flush_timeout NAPI config parameters to enable IRQ deferral.

  - The user application (or system administrator) sets the proposed
    irq_suspend_timeout parameter via the netdev-genl netlink interface
    to a larger value than gro_flush_timeout to enable IRQ suspension.

  - The user application issues the existing epoll ioctl to set the
    prefer_busy_poll flag on the epoll context.

  - The user application then calls epoll_wait to busy poll for network
    events, as it normally would.

  - If epoll_wait returns events to userland, IRQs are suspended for the
    duration of irq_suspend_timeout.

  - If epoll_wait finds no events and the thread is about to go to
    sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
    is resumed.

As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.

Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.

~ Usage scenario

The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).

gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.

irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.

~ Important call out in the implementation

  - Enabling per epoll-context preferred busy poll will now effectively
    lead to a nonblocking iteration through napi_busy_loop, even when
    busy_poll_usecs is 0. See patch 4.

~ Benchmark configs & descriptions

The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].

To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.

Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):

  - base:
    - no other options enabled
  - deferX:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to X,000
  - napibusy:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to 200,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
      busy_poll_budget = 64, prefer_busy_poll = true)
  - fullbusy:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to 5,000,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
      busy_poll_budget = 64, prefer_busy_poll = true)
    - change memcached's nonblocking epoll_wait invocation (via
      libevent) to using a 1 ms timeout
  - suspend0:
    - set defer_hard_irqs to 0
    - set gro_flush_timeout to 0
    - set irq_suspend_timeout to 20,000,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
      busy_poll_budget = 64, prefer_busy_poll = true)
  - suspendX:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to X,000
    - set irq_suspend_timeout to 20,000,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
      busy_poll_budget = 64, prefer_busy_poll = true)

~ Benchmark results

Tested on:

Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC

The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.

Results:

Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.

The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.

These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.

The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.

Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
  example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
  increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
  nearly the same performance as the base case (this is FAQ item #1).

The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).

Note: we've reorganized the results to make comparison among testcases
with the same load easier.

  testcase  load     qps  avglat  95%lat  99%lat     cpu     cpq     ipq
      base  200K  199946     112     239     416      26   12973   11343
   defer10  200K  199971      54     124     142      29   19412   17460
   defer20  200K  199986      60     130     153      26   15644   14095
   defer50  200K  200025      79     144     182      23   12122   11632
  defer200  200K  199999     164     254     309      19    8923    9635
  fullbusy  200K  199998      46     118     133     100   43658   23133
  napibusy  200K  199983     100     237     277      56   24840   24716
  suspend0  200K  200020     105     249     432      30   14264   11796
suspend10  200K  199950      53     123     141      32   19518   16903
suspend20  200K  200037      58     126     151      30   16426   14736
suspend50  200K  199961      73     136     177      26   13310   12633
suspend200  200K  199998     149     251     306      21    9566   10203

  testcase  load     qps  avglat  95%lat  99%lat     cpu     cpq     ipq
      base  400K  400014     139     269     707      41    9476    9343
   defer10  400K  400016      59     133     166      53   13991   12989
   defer20  400K  399952      67     140     172      47   12063   11644
   defer50  400K  400007      87     162     198      39    9384    9880
  defer200  400K  399979     181     274     330      31    7089    8430
  fullbusy  400K  399987      50     123     156     100   21827   16037
  napibusy  400K  400014      76     222     272      83   18185   16529
  suspend0  400K  400015     127     350     776      47   10699    9603
suspend10  400K  400023      57     129     164      54   13758   13178
suspend20  400K  400043      62     135     169      49   12071   11826
suspend50  400K  400071      76     149     186      42   10011   10301
suspend200  400K  399961     154     269     327      34    7827    8774

  testcase  load     qps  avglat  95%lat  99%lat     cpu     cpq     ipq
      base  600K  599951     149     266     574      61    9265    8876
   defer10  600K  600006      71     147     203      76   11866   10936
   defer20  600K  600123      76     152     203      66   10430   10342
   defer50  600K  600162      95     172     217      54    8526    9142
  defer200  600K  599942     200     301     357      46    6977    8212
  fullbusy  600K  599990      55     127     177     100   14551   13983
  napibusy  600K  600035      63     160     250      96   13937   14140
  suspend0  600K  599903     127     320     732      68   10166    8963
suspend10  600K  599908      63     137     192      69   10902   11100
suspend20  600K  599961      66     141     194      65    9976   10370
suspend50  600K  599973      80     159     204      57    8678    9381
suspend200  600K  600010     157     277     346      48    7133    8381

  testcase  load     qps  avglat  95%lat  99%lat     cpu     cpq     ipq
      base  800K  800039     181     300     536      87    9585    8304
   defer10  800K  800038     181     530     939      96   10564    8970
   defer20  800K  800029     112     225     329      90   10056    8935
   defer50  800K  799999     120     208     296      82    9234    8562
  defer200  800K  800066     227     338     401      63    7117    8129
  fullbusy  800K  800040      61     134     190     100   10913   12608
  napibusy  800K  799944      64     141     214      99   10828   12588
  suspend0  800K  799911     126     248     509      85    9346    8498
suspend10  800K  800006      69     143     200      83    9410    9845
suspend20  800K  800120      74     150     207      78    8786    9454
suspend50  800K  799989      87     168     224      71    7946    8833
suspend200  800K  799987     160     292     357      62    6923    8229

  testcase  load     qps  avglat  95%lat  99%lat     cpu     cpq     ipq
      base 1000K  906879    4079    5751    6216      98    9496    7904
   defer10 1000K  860849    3643    6274    6730      99   10040    8676
   defer20 1000K  896063    3298    5840    6349      98    9620    8237
   defer50 1000K  919782    2962    5513    5807      97    9284    7951
  defer200 1000K  970941    3059    5348    5984      95    8593    7959
  fullbusy 1000K  999950      70     150     207     100    8732   10777
  napibusy 1000K  999996      78     154     223     100    8722   10656
  suspend0 1000K  949706    2666    5770    6660      99    9071    8046
suspend10 1000K 1000024      80     160     220      92    8137    9035
suspend20 1000K 1000059      83     165     226      89    7850    8804
suspend50 1000K  999955      95     180     240      84    7411    8459
suspend200 1000K  999914     163     299     366      77    6833    8078

  testcase  load     qps  avglat  95%lat  99%lat     cpu     cpq     ipq
      base   MAX 1037654    4184    5453    5810     100    8411    7938
   defer10   MAX  905607    4840    6151    6380     100    9639    8431
   defer20   MAX  986463    4455    5594    5796     100    8848    8110
   defer50   MAX 1077030    4000    5073    5299     100    8104    7920
  defer200   MAX 1040728    4152    5385    5765     100    8379    7849
  fullbusy   MAX 1247536    3518    3935    3984     100    6998    7930
  napibusy   MAX 1136310    3799    7756    9964     100    7670    7877
  suspend0   MAX 1057509    4132    5724    6185     100    8253    7918
suspend10   MAX 1215147    3580    3957    4041     100    7185    7944
suspend20   MAX 1216469    3576    3953    3988     100    7175    7950
suspend50   MAX 1215871    3577    3961    4075     100    7181    7949
suspend200   MAX 1216882    3556    3951    3988     100    7175    7955

~ FAQ

  - Why is a new parameter needed? Does irq_suspend_timeout override
    gro_flush_timeout?

    Using the suspend mechanism causes the system to alternate between
    polling mode and irq-driven packet delivery. During busy periods,
    irq_suspend_timeout overrides gro_flush_timeout and keeps the system
    busy polling, but when epoll finds no events, the setting of
    gro_flush_timeout and napi_defer_hard_irqs determine the next step.

    There are essentially three possible loops for network processing and
    packet delivery:

    1) hardirq -> softirq -> napi poll; basic interrupt delivery
    2) timer -> softirq -> napi poll; deferred irq processing
    3) epoll -> busy-poll -> napi poll; busy looping

    Loop 2 can take control from Loop 1, if gro_flush_timeout and
    napi_defer_hard_irqs are set.

    If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
    3 "wrestle" with each other for control. During busy periods,
    irq_suspend_timeout is used as timer in Loop 2, which essentially
    tilts this in favour of Loop 3.

    If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
    cannot take control from Loop 1.

    Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
    recommended usage, because otherwise setting irq_suspend_timeout
    might not have any discernible effect.

    This is shown in the results above: compare suspend0 with the base
    case. Note that the lack of napi_defer_hard_irqs and
    gro_flush_timeout produce similar results for both, which encourages
    the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
    irq_suspend_timeout.

  - Can the new timeout value be threaded through the new epoll ioctl ?

    It is possible, but presents challenges for userspace. User
    applications must ensure that the file descriptors added to epoll
    contexts have the same NAPI ID to support busy polling.

    An epoll context is not permanently tied to any particular NAPI ID.
    So, a user application could decide to clear the file descriptors
    from the context and add a new set of file descriptors with a
    different NAPI ID to the context. Busy polling would work as
    expected, but the meaning of the suspend timeout becomes ambiguous
    because IRQs are not inherently associated with epoll contexts, but
    rather with the NAPI. The user program would need to reissue the
    ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
    and gro_flush_timeout settings would come from the NAPI's
    napi_config (which are set either by sysfs or by netlink). Such an
    interface seems awkard to use from a user perspective.

    Further, IRQs are related to NAPIs, which is why they are stored in
    the napi_config space. Putting the irq_suspend_timeout in
    the epoll context while other IRQ deferral mechanisms remain in the
    NAPI's napi_config space seems like an odd design choice.

    We've opted to keep all of the IRQ deferral parameters together and
    place the irq_suspend_timeout in napi_config. This has nice benefits
    for userspace: if a user app were to remove all file descriptors
    from an epoll context and add new file descriptors with a new NAPI ID,
    the correct suspend timeout for that NAPI ID would be used automatically
    without the user application needing to do anything (like re-issuing an
    ioctl, for example). All IRQ deferral related parameters are in one
    place and can all be set the same way: with netlink.

  - Can irq suspend be built by combining NIC coalescing and
    gro_flush_timeout ?

    No. The problem is that the long timeout must engage if and only if
    prefer-busy is active.

    When using NIC coalescing for the short timeout (without
    napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
    period will trigger softirq, which will run napi polling. At this
    point, prefer-busy is not active, so NIC interrupts would be
    re-enabled. Then it is not possible for the longer timeout to
    interject to switch control back to polling. In other words, only by
    using the software timer for the short timeout, it is possible to
    extend the timeout without having to reprogram the NIC timer or
    reach down directly and disable interrupts.

    Using gro_flush_timeout for the long timeout also has problems, for
    the same underlying reason. In the current napi implementation,
    gro_flush_timeout is not tied to prefer-busy. We'd either have to
    change that and in the process modify the existing deferral
    mechanism, or introduce a state variable to determine whether
    gro_flush_timeout is used as long timeout for irq suspend or whether
    it is used for its default purpose. In an earlier version, we did
    try something similar to the latter and made it work, but it ends up
    being a lot more convoluted than our current proposal.

  - Isn't it already possible to combine busy looping with irq deferral?

    Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
    gro_flush_timeout is a precondition for prefer_busy_poll to have an
    effect. If the application also uses a tight busy loop with
    essentially nonblocking epoll_wait (accomplished with a very short
    timeout parameter), this is the fullbusy case shown in the results.
    An application using blocking epoll_wait is shown as the napibusy
    case in the results. It's a hybrid approach that provides limited
    latency benefits compared to the base case and plain irq deferral,
    but not as good as fullbusy or suspend.

~ Special thanks

Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:

  - Peter Cai (CC'd), for the initial kernel patch and his contributions
    to the paper.

  - Mohammadamin Shafie (CC'd), for testing various versions of the kernel
    patch and providing helpful feedback.

Thanks,
Martin and Joe

[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/libevent.patch
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results

v8: https://lore.kernel.org/20241108045337.292905-1-jdamato@fastly.com
v7: https://lore.kernel.org/20241108023912.98416-1-jdamato@fastly.com
v6: https://lore.kernel.org/20241104215542.215919-1-jdamato@fastly.com
v5: https://lore.kernel.org/20241103052421.518856-1-jdamato@fastly.com
v4: https://lore.kernel.org/20241102005214.32443-1-jdamato@fastly.com
v3: https://lore.kernel.org/20241101004846.32532-1-jdamato@fastly.com
v2: https://lore.kernel.org/20241021015311.95468-1-jdamato@fastly.com
====================

Link: https://patch.msgid.link/20241109050245.191288-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: networking: Describe irq suspension

Describe irq suspension, the epoll ioctls, and the tradeoffs of using
different gro_flush_timeout values.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://patch.msgid.link/20241109050245.191288-7-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: Add busy_poll_test

Add an epoll busy poll test using netdevsim.

This test is comprised of:
  - busy_poller (via busy_poller.c)
  - busy_poll_test.sh which loads netdevsim, sets up network namespaces,
    and runs busy_poller to receive data and socat to send data.

The selftest tests two different scenarios:
  - busy poll (the pre-existing version in the kernel)
  - busy poll with suspend enabled (what this series adds)

The data transmit is a 1MiB temporary file generated from /dev/urandom
and the test is considered passing if the md5sum of the input file to
socat matches the md5sum of the output file from busy_poller.

netdevsim was chosen instead of veth due to netdevsim's support for
netdev-genl.

For now, this test uses the functionality that netdevsim provides. In the
future, perhaps netdevsim can be extended to emulate device IRQs to more
thoroughly test all pre-existing kernel options (like defer_hard_irqs)
and suspend.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241109050245.191288-6-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eventpoll: Control irq suspension for prefer_busy_poll

When events are reported to userland and prefer_busy_poll is set, irqs
are temporarily suspended using napi_suspend_irqs.

If no events are found and ep_poll would go to sleep, irq suspension is
cancelled using napi_resume_irqs.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-5-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set

Setting prefer_busy_poll now leads to an effectively nonblocking
iteration though napi_busy_loop, even when busy_poll_usecs is 0.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Add control functions for irq suspension

The napi_suspend_irqs routine bootstraps irq suspension by elongating
the defer timeout to irq_suspend_timeout.

The napi_resume_irqs routine effectively cancels irq suspension by
forcing the napi to be scheduled immediately.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Add napi_struct parameter irq_suspend_timeout

Add a per-NAPI IRQ suspension parameter, which can be get/set with
netdev-genl.

This patch doesn't change any behavior but prepares the code for other
changes in the following commits which use irq_suspend_timeout as a
timeout for IRQ suspension.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: add unlocked version of bnxt_refclk_read

Serialization of PHC read with FW reset mechanism uses ptp_lock which
also protects timecounter updates. This means we cannot grab it when
called from bnxt_cc_read(). Let's move locking into different function.

Fixes: 6c0828d00f07 ("bnxt_en: replace PTP spinlock with seqlock")
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20241107214917.2980976-1-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rtnetlink-convert-rtnl_newlink-to-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
rtnetlink: Convert rtnl_newlink() to per-netns RTNL.

Patch 1 - 3 removes __rtnl_link_unregister and protect link_ops by
its dedicated mutex to move synchronize_srcu() out of RTNL scope.

Patch 4 introduces struct rtnl_nets and helper functions to acquire
multiple per-netns RTNL in rtnl_newlink().

Patch 5 - 8 are to prefetch the peer device's netns in rtnl_newlink().

Patch 9 converts rtnl_newlink() to per-netns RTNL.

Patch 10 pushes RTNL down to rtnl_dellink() and rtnl_setlink(), but
the conversion will not be completed unless we support cases with
peer/upper/lower devices.

I confirmed v3 survived ./rtnetlink.sh; rmmod netdevsim.ko; without
lockdep splat.

v3: https://lore.kernel.org/20241107022900.70287-1-kuniyu@amazon.com
v2: https://lore.kernel.org/20241106022432.13065-1-kuniyu@amazon.com
v1: https://lore.kernel.org/20241105020514.41963-1-kuniyu@amazon.com
====================

Link: https://patch.msgid.link/20241108004823.29419-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP.

Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted
to per-netns RTNL due to a lack of handling peer/lower/upper devices in
different netns.

For example, when we change a device in rtnl_setlink() and need to
propagate that to its upper devices, we want to avoid acquiring all netns
locks, for which we do not know the upper limit.

The same situation happens when we remove a device.

rtnl_dellink() could be transformed to remove a single device in the
requested netns and delegate other devices to per-netns work, and
rtnl_setlink() might be ?

Until we come up with a better idea, let's use a new flag
RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink().

This will unblock converting RTNL users where such devices are not related.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241108004823.29419-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Convert RTM_NEWLINK to per-netns RTNL.

Now, we are ready to convert rtnl_newlink() to per-netns RTNL;
rtnl_link_ops is protected by SRCU and netns is prefetched in
rtnl_newlink().

Let's register rtnl_newlink() with RTNL_FLAG_DOIT_PERNET and
push RTNL down as rtnl_nets_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netkit: Set IFLA_NETKIT_PEER_INFO to netkit_link_ops.peer_type.

For per-netns RTNL, we need to prefetch the peer device's netns.

Let's set rtnl_link_ops.peer_type and accordingly remove duplicated
validation in ->newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vxcan: Set VXCAN_INFO_PEER to vxcan_link_ops.peer_type.

For per-netns RTNL, we need to prefetch the peer device's netns.

Let's set rtnl_link_ops.peer_type and accordingly remove duplicated
validation in ->newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

veth: Set VETH_INFO_PEER to veth_link_ops.peer_type.

For per-netns RTNL, we need to prefetch the peer device's netns.

Let's set rtnl_link_ops.peer_type and accordingly remove duplicated
validation in ->newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Add peer_type in struct rtnl_link_ops.

In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with
a net pointer, which is the first argument of ->newlink().

rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID
and IFLA_NET_NS_FD in the peer device's attributes.

We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink()
for per-netns RTNL.

All of the three get the peer netns in the same way:

  1. Call rtnl_nla_parse_ifinfomsg()
  2. Call ops->validate() (vxcan doesn't have)
  3. Call rtnl_link_get_net_tb()

Let's add a new field peer_type to struct rtnl_link_ops and prefetch
netns in the peer ifla to add it to rtnl_nets in rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Introduce struct rtnl_nets and helpers.

rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device
and 1 for its peer.

We will add rtnl_nets_lock() later, which performs the nested locking
based on struct rtnl_nets, which has an array of struct net pointers.

rtnl_nets_add() adds a net pointer to the array and sorts it so that
rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2].

Before calling rtnl_nets_add(), get_net() must be called for the net,
and rtnl_nets_destroy() will call put_net() for each.

Let's apply the helpers to rtnl_newlink().

When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call
rtnl_net_lock() thus do not care about the array order, so
rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add()
can be optimised to NOP.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Remove __rtnl_link_register()

link_ops is protected by link_ops_mutex and no longer needs RTNL,
so we have no reason to have __rtnl_link_register() separately.

Let's remove it and call rtnl_link_register() from ifb.ko and
dummy.ko.

Note that both modules' init() work on init_net only, so we need
not export pernet_ops_rwsem and can use rtnl_net_lock() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Protect link_ops by mutex.

rtnl_link_unregister() holds RTNL and calls synchronize_srcu(),
but rtnl_newlink() will acquire SRCU frist and then RTNL.

Then, we need to unlink ops and call synchronize_srcu() outside
of RTNL to avoid the deadlock.

   rtnl_link_unregister()       rtnl_newlink()
   ----                         ----
   lock(rtnl_mutex);
                                lock(&ops->srcu);
                                lock(rtnl_mutex);
   sync(&ops->srcu);

Let's move as such and add a mutex to protect link_ops.

Now, link_ops is protected by its dedicated mutex and
rtnl_link_register() no longer needs to hold RTNL.

While at it, we move the initialisation of ops->dellink and
ops->srcu out of the mutex scope.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: Remove __rtnl_link_unregister().

rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(),
where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests
for per-netns RTNL.

We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko
and dummy.ko.

However, rtnl_newlink() will acquire SRCU before RTNL later in this
series.  Then, lockdep will detect the deadlock:

   rtnl_link_unregister()       rtnl_newlink()
   ----                         ----
   lock(rtnl_mutex);
                                lock(&ops->srcu);
                                lock(rtnl_mutex);
   sync(&ops->srcu);

To avoid the problem, we must call synchronize_srcu() before RTNL in
rtnl_link_unregister().

As a preparation, let's remove __rtnl_link_unregister().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: use helper r8169_mod_reg8_cond to simplify rtl_jumbo_config

Use recently added helper r8169_mod_reg8_cond() to simplify jumbo
mode configuration.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/3df1d484-a02e-46e7-8f75-db5b428e422e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-ncdevmem-add-ncdevmem-to-ksft'

Stanislav Fomichev says:

====================
selftests: ncdevmem: Add ncdevmem to ksft

The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.

ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
====================

Link: https://patch.msgid.link/20241107181211.3934153-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Add automated test

Only RX side for now and small message to test the setup.
In the future, we can extend it to TX side and to testing
both sides with a couple of megs of data.

  make \
   -C tools/testing/selftests \
   TARGETS="drivers/hw/net" \
   install INSTALL_PATH=~/tmp/ksft

  scp ~/tmp/ksft ${HOST}:
  scp ~/tmp/ksft ${PEER}:

  cfg+="NETIF=${DEV}\n"
  cfg+="LOCAL_V6=${HOST_IP}\n"
  cfg+="REMOTE_V6=${PEER_IP}\n"
  cfg+="REMOTE_TYPE=ssh\n"
  cfg+="REMOTE_ARGS=root@${PEER}\n"

  echo -e "$cfg" | ssh root@${HOST} "cat > ksft/drivers/net/net.config"
  ssh root@${HOST} "cd ksft && ./run_kselftest.sh -t drivers/net:devmem.py"

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241107181211.3934153-13-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Move ncdevmem under drivers/net/hw

This is where all the tests that depend on the HW functionality live in
and this is where the automated test is gonna be added in the next
patch.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-12-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Run selftest when none of the -s or -c has been provided

This will be used as a 'probe' mode in the selftest to check whether
the device supports the devmem or not. Use hard-coded queue layout
(two last queues) and prevent user from passing custom -q and/or -t.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-11-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Remove hard-coded queue numbers

Use single last queue of the device and probe it dynamically.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-10-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Use YNL to enable TCP header split

In the next patch the hard-coded queue numbers are gonna be removed.
So introduce some initial support for ethtool YNL and use
it to enable header split.

Also, tcp-data-split requires latest ethtool which is unlikely
to be present in the distros right now.

(ideally, we should not shell out to ethtool at all).

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Properly reset flow steering

ntuple off/on might be not enough to do it on all NICs.
Add a bunch of shell crap to explicitly remove the rules.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Switch to AF_INET6

Use dualstack socket to support both v4 and v6. v4-mapped-v6 address
can be used to do v4.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Remove default arguments

To make it clear what's required and what's not. Also, some of the
values don't seem like a good defaults; for example eth1.

Move the invocation comment to the top, add missing -s to the client
and cleanup the client invocation a bit to make more readable.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Make client_ip optional

Support 3-tuple filtering by making client_ip optional. When -c is
not passed, don't specify src-ip/src-port in the filter.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Unify error handling

There is a bunch of places where error() calls look out of place.
Use the same error(1, errno, ...) pattern everywhere.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Separate out dmabuf provider

So we can plug the other ones in the future if needed.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: ncdevmem: Redirect all non-payload output to stderr

That should make it possible to do expected payload validation on
the caller side.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107181211.3934153-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-dwmac4-fixes-issues-in-dwmac4'

Ley Foon Tan says:

====================
net: stmmac: dwmac4: Fixes issues in dwmac4

This patch series fixes issues in the dwmac4 driver. These three patches
don't cause any user-visible issues, so they are targeted for net-next.

Patch #1:
Corrects the masking logic in the MTL Operation Mode RTC mask and shift
macros. The current code lacks the use of the ~ operator, which is
necessary to clear the bits properly.

Patch #2:
Addresses inaccuracies in the MTL_OP_MODE_*_MASK macros. The RTC fields
are located in bits [1:0], and this patch ensures the mask and shift
macros use the appropriate values to reflect this.

Patch #3:
Moves the handling of the Receive Watchdog Timeout (RWT) out of the
Abnormal Interrupt Summary (AIS) condition. According to the databook,
the RWT interrupt is not included in the AIS.

v1: https://lore.kernel.org/20241023112005.GN402847@kernel.org
v2: https://lore.kernel.org/20241101082336.1552084-3-leyfoon.tan@starfivetech.com
====================

Link: https://patch.msgid.link/20241107063637.2122726-1-leyfoon.tan@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac4: Receive Watchdog Timeout is not in abnormal interrupt summary

The Receive Watchdog Timeout (RWT, bit[9]) is not part of Abnormal
Interrupt Summary (AIS). Move the RWT handling out of the AIS
condition statement.

From databook, the AIS is the logical OR of the following interrupt bits:

- Bit 1: Transmit Process Stopped
- Bit 7: Receive Buffer Unavailable
- Bit 8: Receive Process Stopped
- Bit 10: Early Transmit Interrupt
- Bit 12: Fatal Bus Error
- Bit 13: Context Descriptor Error

Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241107063637.2122726-4-leyfoon.tan@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac4: Fix the MTL_OP_MODE_*_MASK operation

In order to mask off the bits, we need to use the '~' operator to invert
all the bits of _MASK and clear them.

Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241107063637.2122726-3-leyfoon.tan@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac4: Fix MTL_OP_MODE_RTC mask and shift macros

RTC fields are located in bits [1:0]. Correct the _MASK and _SHIFT
macros to use the appropriate mask and shift.

Signed-off-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241107063637.2122726-2-leyfoon.tan@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: aquantia: Add mdix config and reporting

Add support for configuring MDI-X state of PHY.
Add reporting of resolved MDI-X state in status information.

Tested on AQR113C.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Link: https://patch.msgid.link/20241106222057.3965379-1-paul.davey@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'introduce-vlan-support-in-hsr'

MD Danish Anwar says:

====================
Introduce VLAN support in HSR

This series adds VLAN support to HSR framework.
This series also adds VLAN support to HSR mode of ICSSG Ethernet driver.

[1] https://gist.githubusercontent.com/danish-ti/d309f92c640134ccc4f2c0c442de5be1/raw/9cfb5f8bd12b374ae591f4bd9ba3e91ae509ed4f/hsr_vlan_logs
v1 https://lore.kernel.org/all/20241004074715.791191-1-danishanwar@ti.com/
v2 https://lore.kernel.org/all/20241024103056.3201071-1-danishanwar@ti.com/
====================

Link: https://patch.msgid.link/20241106091710.3308519-1-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: hsr: Add test for VLAN

Add test for VLAN ping for HSR. The test adds vlan interfaces to the hsr
interface and then verifies if ping to them works.

Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20241106091710.3308519-5-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icssg-prueth: Add VLAN support for HSR mode

Add support for VLAN addition/deletion in HSR mode.
In HSR mode, even if the host port is not a member of
the VLAN domain, the slave ports should simply forward the
frames. So allow forwarding of all VLAN frames in HSR mode.

Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20241106091710.3308519-4-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hsr: Add VLAN CTAG filter support

This patch adds support for VLAN ctag based filtering at slave devices.
The slave ethernet device may be capable of filtering ethernet packets
based on VLAN ID. This requires that when the VLAN interface is created
over an HSR/PRP interface, it passes the VID information to the
associated slave ethernet devices so that it updates the hardware
filters to filter ethernet frames based on VID. This patch adds the
required functions to propagate the vid information to the slave
devices.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20241106091710.3308519-3-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hsr: Add VLAN support

Add support for creating VLAN interfaces over HSR/PRP interface.

Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20241106091710.3308519-2-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'side-mdio-support-for-lan937x-switches'

Oleksij Rempel says:

====================
Side MDIO Support for LAN937x Switches

This patch set introduces support for an internal MDIO bus in LAN937x
switches, enabling the use of a side MDIO channel for PHY management
while keeping SPI as the main interface for switch configuration.

other changelogs are added to separate patches.
====================

Link: https://patch.msgid.link/20241106075942.1636998-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: parse PHY config from device tree

Introduce ksz_parse_dt_phy_config() to validate and parse PHY
configuration from the device tree for KSZ switches. This function
ensures proper setup of internal PHYs by checking `phy-handle`
properties, verifying expected PHY IDs, and handling parent node
mismatches. Sets the PHY mask on the MII bus if validation is
successful. Returns -EINVAL on configuration errors.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20241106075942.1636998-7-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: add support for side MDIO interface in LAN937x

Implement side MDIO channel support for LAN937x switches, providing an
alternative to SPI for PHY management alongside existing SPI-based
switch configuration. This is needed to reduce SPI load, as SPI can be
relatively expensive for small packets compared to MDIO support.

Also, implemented static mappings for PHY addresses for various LAN937x
models to support different internal PHY configurations. Since the PHY
address mappings are not equal to the port indexes, this patch also
provides PHY address calculation based on hardware strapping
configuration.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-6-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: cleanup error handling in ksz_mdio_register

Replace repeated cleanup code with a single error path using a label.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-5-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: Refactor MDIO handling for side MDIO access

Add support for accessing PHYs via a side MDIO interface in LAN937x
switches. The existing code already supports accessing PHYs via main
management interfaces, which can be SPI, I2C, or MDIO, depending on the
chip variant. This patch enables using a side MDIO bus, where SPI is
used for the main switch configuration and MDIO for managing the
integrated PHYs. On LAN937x, this is optional, allowing them to operate
in both configurations: SPI only, or SPI + MDIO. Typically, the SPI
interface is used for switch configuration, while MDIO handles PHY
management.

Additionally, update interrupt controller code to support non-linear
port to PHY address mapping, enabling correct interrupt handling for
configurations where PHY addresses do not directly correspond to port
indexes. This change ensures that the interrupt mechanism properly
aligns with the new, flexible PHY address mappings introduced by side
MDIO support.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: microchip: add mdio-parent-bus property for internal MDIO

Introduce `mdio-parent-bus` property in the ksz DSA bindings to
reference the parent MDIO bus when the internal MDIO bus is attached to
it, bypassing the main management interface.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: microchip: add internal MDIO bus description

Add description for the internal MDIO bus, including integrated PHY
nodes, to ksz DSA bindings.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241106075942.1636998-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:

1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
   order to prevent IRQs from being moved to certain CPUs that run a
   real-time workload.

2. atlantic device reopening will resets the affinity
   in aq_ndev_open().

3. atlantic has no idea about irqbalance's config, so it may move an IRQ to
   a banned CPU. The real-time workload suffers unacceptable latency.

Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241107120739.415743-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:

1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
   order to prevent IRQs from being moved to certain CPUs that run a
   real-time workload.

2. nfp device reopening will resets the affinity
   in nfp_net_netdev_open().

3. nfp has no idea about irqbalance's config, so it may move an IRQ to
   a banned CPU. The real-time workload suffers unacceptable latency.

Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Link: https://patch.msgid.link/20241107115002.413358-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:

1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
    order to prevent IRQs from being moved to certain CPUs that run a
    real-time workload.

2. bnxt_en device reopening will resets the affinity
    in bnxt_open().

3. bnxt_en has no idea about irqbalance's config, so it may move an IRQ to
    a banned CPU. The real-time workload suffers unacceptable latency.

Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Link: https://patch.msgid.link/20241106180811.385175-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rxrpc: Add a tracepoint for aborts being proposed

Add a tracepoint to rxrpc to trace the proposal of an abort. The abort is
performed asynchronously by the I/O thread.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/726356.1730898045@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Fix soft lockups in fib6_select_path under high next hop churn

Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.

Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.

Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.

Kernel log:

0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
    [exception RIP: fib6_select_path+299]
    RIP: ffffffff8ddafe7b  RSP: ffffbd13003d37b8  RFLAGS: 00000287
    RAX: ffff975850b43600  RBX: ffff975850b40200  RCX: 0000000000000000
    RDX: 000000003fffffff  RSI: 0000000051d383e4  RDI: ffff975850b43618
    RBP: ffffbd13003d3800   R8: 0000000000000000   R9: ffff975850b40200
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffbd13003d3830
    R13: ffff975850b436a8  R14: ffff975850b43600  R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel@aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi@menlosecurity.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241106010236.1239299-1-omid.ehtemamhaghighi@menlosecurity.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'knobs-for-npc-default-rule-counters'

Linu Cherian says:

====================
Knobs for NPC default rule counters

Patch 1 introduce _rvu_mcam_remove/add_counter_from/to_rule
by refactoring existing code

Patch 2 adds a devlink param to enable/disable counters for default
rules. Once enabled, counters can

Patch 3 adds documentation for devlink params

v4: https://lore.kernel.org/20241029035739.1981839-1-lcherian@marvell.com
====================

Link: https://patch.msgid.link/20241105125620.2114301-1-lcherian@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Add documentation for OcteonTx2 AF

Add documentation for the following devlink params
- npc_mcam_high_zone_percent
- npc_def_rule_cntr
- nix_maxlf

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Link: https://patch.msgid.link/20241105125620.2114301-4-lcherian@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Knobs for NPC default rule counters

Add devlink knobs to enable/disable counters on NPC
default rule entries.

Sample command to enable default rule counters:
devlink dev param set <dev> name npc_def_rule_cntr value true cmode runtime

Sample command to read the counter:
cat /sys/kernel/debug/cn10k/npc/mcam_rules

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Link: https://patch.msgid.link/20241105125620.2114301-3-lcherian@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Refactor few NPC mcam APIs

Introduce lowlevel variant of rvu_mcam_remove/add_counter_from/to_rule
for better code reuse, which assumes necessary locks are taken at
higher level.

These low level functions would be used for implementing default rule
counter APIs in the subsequent patch.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241105125620.2114301-2-lcherian@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlx5/core: deduplicate {mlx5_,}eq_update_ci()

The logic of eq_update_ci() is duplicated in mlx5_eq_update_ci(). The
only additional work done by mlx5_eq_update_ci() is to increment
eq->cons_index. Call eq_update_ci() from mlx5_eq_update_ci() to avoid
the duplication.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183054.2443218-2-csander@purestorage.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlx5/core: relax memory barrier in eq_update_ci()

The memory barrier in eq_update_ci() after the doorbell write is a
significant hot spot in mlx5_eq_comp_int(). Under heavy TCP load, we see
3% of CPU time spent on the mfence instruction.

98df6d5b877c ("net/mlx5: A write memory barrier is sufficient in EQ ci
update") already relaxed the full memory barrier to just a write barrier
in mlx5_eq_update_ci(), which duplicates eq_update_ci(). So replace mb()
with wmb() in eq_update_ci() too.

On strongly ordered architectures, no barrier is actually needed because
the MMIO writes to the doorbell register are guaranteed to appear to the
device in the order they were made. However, the kernel's ordered MMIO
primitive writel() lacks a convenient big-endian interface.
Therefore, we opt to stick with __raw_writel() + a barrier.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183054.2443218-1-csander@purestorage.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'macsec-inherit-lower-device-s-features-and-tso-limits-when-offloading'

Sabrina Dubroca says:

====================
macsec: inherit lower device's features and TSO limits when offloading

When macsec is offloaded to a NIC, we can take advantage of some of
its features, mainly TSO and checksumming. This increases performance
significantly. Some features cannot be inherited, because they require
additional ops that aren't provided by the macsec netdevice.

We also need to inherit TSO limits from the lower device, like
VLAN/macvlan devices do.

This series also moves the existing macsec offload selftest to the
netdevsim selftests before adding tests for the new features. To allow
this new selftest to work, netdevsim's hw_features are expanded.
====================

Link: https://patch.msgid.link/cover.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: netdevsim: add ethtool features to macsec offload tests

The test verifies that available features aren't changed by toggling
offload on the device. Creating a device with offload off and then
enabling it later should result in the same features as creating the
device with offload enabled directly.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ba801bd0a75b02de2dddbfc77f9efceb8b3d8a2e.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: netdevsim: add test toggling macsec offload

The test verifies that toggling offload works (both via rtnetlink and
macsec's genetlink APIs). This is only possible when no SA is
configured.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/bf8e27ee0d921caa4eb35f1e830eca6d4080ddb2.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: move macsec offload tests from net/rtnetlink to drivers/net/netdvesim

We're going to expand this test, and macsec offload is only lightly
related to rtnetlink.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/a1f92c250cc129b4bb111a206c4b560bab4e24a5.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: inherit lower device's TSO limits when offloading

If macsec is offloaded, we need to follow the lower device's
capabilities, like VLAN devices do.

Leave the limits unchanged when the offload is disabled.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8240c0181e851f169d815f59658a01fb9dfc5073.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: clean up local variables in macsec_notify

For all events, we need to loop over the list of secys, so let's move
the common variables out of the switch/case.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/9b8996af518fbeb3b7d527feb15d5788495e3108.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

macsec: add some of the lower device's features when offloading

This commit extends the set of netdevice features supported by macsec
devices when offload is enabled, which increases performance
significantly (for a single TCP stream: 17.5Gbps to 38.5Gbps on my
test machines).

Commit c850240b6c41 ("net: macsec: report real_dev features when HW
offloading is enabled") previously attempted something similar, but
had to be reverted (commit 8bcd560ae878 ("Revert "net: macsec: report
real_dev features when HW offloading is enabled"")) because the set of
features it exposed was too large.

During initialization, all features are set, and they're then removed
via ndo_fix_features (macsec_fix_features). This allows the
offloadable features to be automatically enabled if offloading is
turned on after device creation.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8b32c3011d269d6f149724e80c1ffe67c9534067.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: netdevsim: add a test checking ethtool features

Add a test checking that some features are active by default and
changeable.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/fff58fa70f8a300440958b5020f6a4eb2e9dad61.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: add more hw_features

netdevsim currently only set HW_TC in its hw_features, but other
features should also be present to better reflect the behavior of real
HW.

In my macsec offload testing, this ends up as HW_CSUM being missing
from hw_features, so it doesn't stick in wanted_features when offload
is turned off. Then HW_CSUM (and thus TSO, thanks to
netdev_fix_features) is not automatically turned back on when offload
is re-enabled.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/b918dc4dd76410a57f7516a855f66b0a2bd58326.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'replace-page_frag-with-page_frag_cache-part-1'

Yunsheng Lin says:

====================
Replace page_frag with page_frag_cache (Part-1)

This is part 1 of "Replace page_frag with page_frag_cache",
which mainly contain refactoring and optimization for the
implementation of page_frag API before the replacing.

As the discussion in [1], it would be better to target net-next
tree to get more testing as all the callers page_frag API are
in networking, and the chance of conflicting with MM tree seems
low as implementation of page_frag API seems quite self-contained.

After [2], there are still two implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().

This patchset tries to unfiy the page frag implementation
by replacing page_frag with page_frag_cache for sk_page_frag()
first. net_high_order_alloc_disable_key for the implementation
in net/core/sock.c doesn't seems matter that much now as pcp
is also supported for high-order pages:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists")

As the related change is mostly related to networking, so
targeting the net-next. And will try to replace the rest
of page_frag in the follow patchset.

After this patchset:
1. Unify the page frag implementation by taking the best out of
   two the existing implementations: we are able to save some space
   for the 'page_frag_cache' API user, and avoid 'get_page()' for
   the old 'page_frag' API user.
2. Future bugfix and performance can be done in one place, hence
   improving maintainability of page_frag's implementation.

Kernel Image changing:
    Linux Kernel   total |      text      data        bss
    ------------------------------------------------------
    after     45250307 |   27274279   17209996     766032
    before    45254134 |   27278118   17209984     766032
    delta        -3827 |      -3839        +12         +0

1. https://lore.kernel.org/all/add10dd4-7f5d-4aa1-aa04-767590f944e0@redhat.com/
2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
====================

Link: https://patch.msgid.link/20241028115343.3405838-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm: page_frag: use __alloc_pages() to replace alloc_pages_node()

It seems there is about 24Bytes binary size increase for
__page_frag_cache_refill() after refactoring in arm64 system
with 64K PAGE_SIZE. By doing the gdb disassembling, It seems
we can have more than 100Bytes decrease for the binary size
by using __alloc_pages() to replace alloc_pages_node(), as
there seems to be some unnecessary checking for nid being
NUMA_NO_NODE, especially when page_frag is part of the mm
system.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241028115343.3405838-8-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'

Currently there is one 'struct page_frag' for every 'struct
sock' and 'struct task_struct', we are about to replace the
'struct page_frag' with 'struct page_frag_cache' for them.
Before begin the replacing, we need to ensure the size of
'struct page_frag_cache' is not bigger than the size of
'struct page_frag', as there may be tens of thousands of
'struct sock' and 'struct task_struct' instances in the
system.

By or'ing the page order & pfmemalloc with lower bits of
'va' instead of using 'u16' or 'u32' for page size and 'u8'
for pfmemalloc, we are able to avoid 3 or 5 bytes space waste.
And page address & pfmemalloc & order is unchanged for the
same page in the same 'page_frag_cache' instance, it makes
sense to fit them together.

After this patch, the size of 'struct page_frag_cache' should be
the same as the size of 'struct page_frag'.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241028115343.3405838-7-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xtensa: remove the get_order() implementation

As the get_order() implemented by xtensa supporting 'nsau'
instruction seems be the same as the generic implementation
in include/asm-generic/getorder.h when size is not a constant
value as the generic implementation calling the fls*() is also
utilizing the 'nsau' instruction for xtensa.

So remove the get_order() implemented by xtensa, as using the
generic implementation may enable the compiler to do the
computing when size is a constant value instead of runtime
computing and enable the using of get_order() in BUILD_BUG_ON()
macro in next patch.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241028115343.3405838-6-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm: page_frag: avoid caller accessing 'page_frag_cache' directly

Use appropriate frag_page API instead of caller accessing
'page_frag_cache' directly.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm: page_frag: use initial zero offset for page_frag_alloc_align()

We are about to use page_frag_alloc_*() API to not just
allocate memory for skb->data, but also use them to do
the memory allocation for skb frag too. Currently the
implementation of page_frag in mm subsystem is running
the offset as a countdown rather than count-up value,
there may have several advantages to that as mentioned
in [1], but it may have some disadvantages, for example,
it may disable skb frag coalescing and more correct cache
prefetching

We have a trade-off to make in order to have a unified
implementation and API for page_frag, so use a initial zero
offset in this patch, and the following patch will try to
make some optimization to avoid the disadvantages as much
as possible.

1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241028115343.3405838-4-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm: move the page fragment allocator from page_alloc into its own file

Inspired by [1], move the page fragment allocator from page_alloc
into its own c file and header file, as we are about to make more
change for it to replace another page_frag implementation in
sock.c

As this patchset is going to replace 'struct page_frag' with
'struct page_frag_cache' in sched.h, including page_frag_cache.h
in sched.h has a compiler error caused by interdependence between
mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler
error by moving 'struct page_frag_cache' to mm_types_task.h as
suggested by Alexander, see [3].

1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/
2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/
3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/
CC: David Howells <dhowells@redhat.com>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241028115343.3405838-3-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm: page_frag: add a test module for page_frag

The testing is done by ensuring that the fragment allocated
from a frag_frag_cache instance is pushed into a ptr_ring
instance in a kthread binded to a specified cpu, and a kthread
binded to a specified cpu will pop the fragment from the
ptr_ring and free the fragment.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241028115343.3405838-2-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: convert to nla_get_*_default()

Most of the original conversion is from the spatch below,
but I edited some and left out other instances that were
either buggy after conversion (where default values don't
fit into the type) or just looked strange.

    @@
    expression attr, def;
    expression val;
    identifier fn =~ "^nla_get_.*";
    fresh identifier dfn = fn ## "_default";
    @@
    (
    -if (attr)
    -  val = fn(attr);
    -else
    -  val = def;
    +val = dfn(attr, def);
    |
    -if (!attr)
    -  val = def;
    -else
    -  val = fn(attr);
    +val = dfn(attr, def);
    |
    -if (!attr)
    -  return def;
    -return fn(attr);
    +return dfn(attr, def);
    |
    -attr ? fn(attr) : def
    +dfn(attr, def)
    |
    -!attr ? def : fn(attr)
    +dfn(attr, def)
    )

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: netlink: add nla_get_*_default() accessors

There are quite a number of places that use patterns
such as

  if (attr)
     val = nla_get_u16(attr);
  else
     val = DEFAULT;

Add nla_get_u16_default() and friends like that to
not have to type this out all the time.

Acked-by: Toke Høiland-Jørgensen <toke@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20241108114145.acd2aadb03ac.I3df6aac71d38a5baa1c0a03d0c7e82d4395c030e@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: Allow deleting FDB entries with non-existent VLAN

It is currently impossible to delete individual FDB entries (as opposed
to flushing) that were added with a VLAN that no longer exists:

# ip link add name dummy1 up type dummy
# ip link add name br1 up type bridge vlan_filtering 1
# ip link set dev dummy1 master br1
# bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1
# bridge vlan del vid 1 dev dummy1
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static
# bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1
RTNETLINK answers: Invalid argument
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static

This is in contrast to MDB entries that can be deleted after the VLAN
was deleted:

# bridge vlan add vid 10 dev dummy1
# bridge mdb add dev br1 port dummy1 grp 239.1.1.1 permanent vid 10
# bridge vlan del vid 10 dev dummy1
# bridge mdb get dev br1 grp 239.1.1.1 vid 10
dev br1 port dummy1 grp 239.1.1.1 permanent vid 10
# bridge mdb del dev br1 port dummy1 grp 239.1.1.1 permanent vid 10
# bridge mdb get dev br1 grp 239.1.1.1 vid 10
Error: bridge: MDB entry not found.

Align the two interfaces and allow user space to delete FDB entries that
were added with a VLAN that no longer exists:

# ip link add name dummy1 up type dummy
# ip link add name br1 up type bridge vlan_filtering 1
# ip link set dev dummy1 master br1
# bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1
# bridge vlan del vid 1 dev dummy1
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static
# bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
Error: Fdb entry not found.

Add a selftest to make sure this behavior does not regress:

# ./rtnetlink.sh -t kci_test_fdb_del
PASS: bridge fdb del

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andy Roulin <aroulin@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241105133954.350479-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlx5/core: Schedule EQ comp tasklet only if necessary

Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
to call mlx5_cq_tasklet_cb() if it processes any completions. For CQs
whose completions don't need to be processed in tasklet context, this
adds unnecessary overhead. In a heavy TCP workload, we see 4% of CPU
time spent on the tasklet_trylock() in tasklet_action_common(), with a
smaller amount spent on the atomic operations in tasklet_schedule(),
tasklet_clear_sched(), and locking the spinlock in mlx5_cq_tasklet_cb().
TCP completions are handled by mlx5e_completion_event(), which schedules
NAPI to poll the queue, so they don't need tasklet processing.

Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this
overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs
to be processed in tasklet context, so it can schedule the tasklet. CQs
that need tasklet processing have their interrupt comp handler set to
mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that
don't need tasklet processing won't schedule the tasklet. To avoid
scheduling the tasklet multiple times during the same interrupt, only
schedule the tasklet in mlx5_add_cq_to_tasklet() if the tasklet work
queue was empty before the new CQ was pushed to it.

The additional branch in mlx5_add_cq_to_tasklet(), called for each EQE,
may add a small cost for the userspace Infiniband CQs whose completions
are processed in tasklet context. But this seems worth it to avoid the
tasklet overhead for CQs that don't need it.

Note that the mlx4 driver works the same way: it schedules the tasklet
in mlx4_add_cq_to_tasklet() and only if the work queue was empty before.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20241105204000.1807095-1-csander@purestorage.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'improve-neigh_flush_dev-performance'

Gilad Naaman says:

====================
Improve neigh_flush_dev performance

This patchsets improves the performance of neigh_flush_dev.

Currently, the only way to implement it requires traversing
all neighbours known to the kernel, across all network-namespaces.

This means that some flows are slowed down as a function of neigh-scale,
even if the specific link they're handling has little to no neighbours.

In order to solve this, this patchset adds a netdev->neighbours list,
as well as making the original linked-list doubly-, so that it is
possible to unlink neighbours without traversing the hash-bucket to
obtain the previous neighbour.

The original use-case we encountered was mass-deletion of links (12K
VLANs) while there are 50K ARPs and 50K NDPs in the system; though the
slowdowns would also appear when the links are set down.
====================

Link: https://patch.msgid.link/20241107160444.2913124-1-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

neighbour: Create netdev->neighbour association

Create a mapping between a netdev and its neighoburs,
allowing for much cheaper flushes.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241107160444.2913124-7-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

neighbour: Remove bare neighbour::next pointer

Remove the now-unused neighbour::next pointer, leaving struct neighbour
solely with the hlist_node implementation.

Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-6-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>