git.kernel.dk Git - linux-block.git/log

selftests: net: lib: Introduce deferred commands

In commit 8510801a9dbd ("selftests: drv-net: add ability to schedule
cleanup with defer()"), a defer helper was added to Python selftests.
The idea is to keep cleanup commands close to their dirtying counterparts,
thereby making it more transparent what is cleaning up what, making it
harder to miss a cleanup, and make the whole cleanup business exception
safe. All these benefits are applicable to bash as well, exception safety
can be interpreted in terms of safety vs. a SIGINT.

This patch therefore introduces a framework of several helpers that serve
to schedule cleanups in bash selftests:

- defer_scope_push(), defer_scope_pop(): Deferred statements can be batched
  together in scopes. When a scope is popped, the deferred commands
  scheduled in that scope are executed in the order opposite to order of
  their scheduling.

- defer(): Schedules a defer to the most recently pushed scope (or the
  default scope if none was pushed.)

- defer_prio(): Schedules a defer on the priority track. The priority defer
  queue is run before the default defer queue when scope is popped.

  The issue that this is addressing is specifically the one of restoring
  devlink shared buffer threshold type. When setting up static thresholds,
  one has to first change the threshold type to static, then override the
  individual thresholds. When cleaning up, it would be natural to reset the
  threshold values first, then change the threshold type. But the values
  that are valid for dynamic thresholds are generally invalid for static
  thresholds and vice versa. Attempts to restore the values first would be
  bounced. Thus one has to first reset the threshold type, then adjust the
  thresholds.

  (You could argue that the shared buffer threshold type API is broken and
  you would be right, but here we are.)

  This cannot be solved by pure defers easily. I considered making it
  possible to disable an existing defer, so that one could then schedule a
  new defer and disable the original. But this forward-shifting of the
  defer job would have to take place after every threshold-adjusting
  command, which would make it very awkward to schedule these jobs.

- defer_scopes_cleanup(): Pops any unpopped scopes, including the default
  one. The selftests that use defer should run this in their exit trap.
  This is important to get cleanups of interrupted scripts.

- in_defer_scope(): Sometimes a function would like to introduce a new
  defer scope, then run whatever it is that it wants to run, and then pop
  the scope to run the deferred cleanups. The helper in_defer_scope() can
  be used to run another command within such environment, such that any
  scheduled defers run after the command finishes.

The framework is added as a separate file lib/sh/defer.sh so that it can be
used by all bash selftests, including those that do not currently use
lib.sh. lib.sh however includes the file by default, because ideally all
tests would use these helpers instead of hand-rolling their cleanups.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: marvell: Add mdix status reporting

Report MDI-X resolved state after link up.

Tested on Linkstreet 88E6193X internal PHYs.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241017015026.255224-1-paul.davey@alliedtelesis.co.nz
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Programming sequence for VLAN packets with split header

Currently reset state configuration of split header works fine for
non-tagged packets and we see no corruption in payload of any size

We need additional programming sequence with reset configuration to
handle VLAN tagged packets to avoid corruption in payload for packets
of size greater than 256 bytes.

Without this change ping application complains about corruption
in payload when the size of the VLAN packet exceeds 256 bytes.

With this change tagged and non-tagged packets of any size works fine
and there is no corruption seen.

Current configuration which has the issue for VLAN packet
----------------------------------------------------------

Split happens at the position at Layer 3 header
|MAC-DA|MAC-SA|Vlan Tag|Ether type|IP header|IP data|Rest of the payload|
                         2 bytes            ^
                                            |

With the fix we are making sure that the split happens now at
Layer 2 which is end of ethernet header and start of IP payload

Ip traffic split
-----------------

Bits which take care of this are SPLM and SPLOFST
SPLM = Split mode is set to Layer 2
SPLOFST = These bits indicate the value of offset from the beginning
of Length/Type field at which header split should take place when the
appropriate SPLM is selected. Reset value is 2bytes.

Un-tagged data (without VLAN)
|MAC-DA|MAC-SA|Ether type|IP header|IP data|Rest of the payload|
                  2bytes ^
|

Tagged data (with VLAN)
|MAC-DA|MAC-SA|VLAN Tag|Ether type|IP header|IP data|Rest of the payload|
                          2bytes  ^
  |

Non-IP traffic split such AV packet
------------------------------------

Bits which take care of this are
SAVE = Split AV Enable
SAVO = Split AV Offset, similar to SPLOFST but this is for AVTP
packets.

|Preamble|MAC-DA|MAC-SA|VLAN tag|Ether type|IEEE 1722 payload|CRC|
    2bytes ^
   |

Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241016234313.3992214-1-quic_abchauha@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'rtnetlink-refactor-rtnl_-new-del-set-link-for-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
rtnetlink: Refactor rtnl_{new,del,set}link() for per-netns RTNL.

This is a prep for the next series where we will push RTNL down to
rtnl_{new,del,set}link().

That means, for example, __rtnl_newlink() is always under RTNL, but
rtnl_newlink() has a non-RTNL section.

As a prerequisite for per-netns RTNL, we will move netns validation
(and RTNL-independent validations if possible) to that section.

rtnl_link_ops and rtnl_af_ops will be protected with SRCU not to
depend on RTNL.

Changes:
  v2:
    * Add Eric's Reviewed-by to patch 1-4,6,8-11, (no tag on 5,7,12-14)
    * Patch 7
      * Handle error of init_srcu_struct().
      * Call cleanup_srcu_struct() after synchronize_srcu().
    * Patch 12
      * Move put_net() before errorout label
    * Patch 13
      * Newly added as prep for patch 14
    * Patch 14
      * Handle error of init_srcu_struct().
      * Call cleanup_srcu_struct() after synchronize_srcu().

  v1: https://lore.kernel.org/netdev/20241009231656.57830-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20241016185357.83849-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Protect struct rtnl_af_ops with SRCU.

Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_af_ops is alive during inflight RTM_SETLINK
even when its module is being unloaded.

Let's use SRCU to protect ops.

rtnl_af_lookup() now iterates rtnl_af_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_af_put()
to release the pointer after the use.

Also, rtnl_af_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_SETLINK requests to
complete.

Note that rtnl_af_ops needs to be protected by its dedicated lock
when RTNL is removed.

Note also that BUG_ON() in do_setlink() is changed to the normal
error handling as a different af_ops might be found after
validate_linkmsg().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Return int from rtnl_af_register().

The next patch will add init_srcu_struct() in rtnl_af_register(),
then we need to handle its error.

Let's add the error handling in advance to make the following
patch cleaner.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Call rtnl_link_get_net_capable() in do_setlink().

We will push RTNL down to rtnl_setlink().

RTM_SETLINK could call rtnl_link_get_net_capable() in do_setlink()
to move a dev to a new netns, but the netns needs to be fetched before
holding rtnl_net_lock().

Let's move it to rtnl_setlink() and pass the netns to do_setlink().

Now, RTM_NEWLINK paths (rtnl_changelink() and rtnl_group_changelink())
can pass the prefetched netns to do_setlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Clean up rtnl_setlink().

We will push RTNL down to rtnl_setlink().

Let's unify the error path to make it easy to place rtnl_net_lock().

While at it, keep the variables in reverse xmas order.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Clean up rtnl_dellink().

We will push RTNL down to rtnl_delink().

Let's unify the error path to make it easy to place rtnl_net_lock().

While at it, keep the variables in reverse xmas order.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Fetch IFLA_LINK_NETNSID in rtnl_newlink().

Another netns option for RTM_NEWLINK is IFLA_LINK_NETNSID and
is fetched in rtnl_newlink_create().

This must be done before holding rtnl_net_lock().

Let's move IFLA_LINK_NETNSID processing to rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Call rtnl_link_get_net_capable() in rtnl_newlink().

As a prerequisite of per-netns RTNL, we must fetch netns before
looking up dev or moving it to another netns.

rtnl_link_get_net_capable() is called in rtnl_newlink_create() and
do_setlink(), but both of them need to be moved to the RTNL-independent
region, which will be rtnl_newlink().

Let's call rtnl_link_get_net_capable() in rtnl_newlink() and pass the
netns down to where needed.

Note that the latter two have not passed the nets to do_setlink() yet
but will do so after the remaining rtnl_link_get_net_capable() is moved
to rtnl_setlink() later.

While at it, dest_net is renamed to tgt_net in rtnl_newlink_create() to
align with rtnl_{del,set}link().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Protect struct rtnl_link_ops with SRCU.

Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_link_ops is alive during inflight RTM_NEWLINK
even when its module is being unloaded.

Let's use SRCU to protect ops.

rtnl_link_ops_get() now iterates link_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_link_ops_put()
to release the pointer after the use.

Also, __rtnl_link_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_NEWLINK requests to
complete.

Note that link_ops needs to be protected by its dedicated lock
when RTNL is removed.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Move ops->validate to rtnl_newlink().

ops->validate() does not require RTNL.

Let's move it to rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Move rtnl_link_ops_get() and retry to rtnl_newlink().

Currently, if neither dev nor rtnl_link_ops is found in __rtnl_newlink(),
we release RTNL and redo the whole process after request_module(), which
complicates the logic.

The ops will be RTNL-independent later.

Let's move the ops lookup to rtnl_newlink() and do the retry earlier.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Move simple validation from __rtnl_newlink() to rtnl_newlink().

We will push RTNL down to rtnl_newlink().

Let's move RTNL-independent validation to rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Factorise do_setlink() path from __rtnl_newlink().

__rtnl_newlink() got too long to maintain.

For example, netdev_master_upper_dev_get()->rtnl_link_ops is fetched even
when IFLA_INFO_SLAVE_DATA is not specified.

Let's factorise the single dev do_setlink() path to a separate function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Call validate_linkmsg() in do_setlink().

There are 3 paths that finally call do_setlink(), and validate_linkmsg()
is called in each path.

  1. RTM_NEWLINK
    1-1. dev is found in __rtnl_newlink()
    1-2. dev isn't found, but IFLA_GROUP is specified in
          rtnl_group_changelink()
  2. RTM_SETLINK

The next patch factorises 1-1 to a separate function.

As a preparation, let's move validate_linkmsg() calls to do_setlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Allocate linkinfo[] as struct rtnl_newlink_tbs.

We will move linkinfo to rtnl_newlink() and pass it down to other
functions.

Let's pack it into rtnl_newlink_tbs.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-mlx5-refactor-esw-qos-to-support-generalized-operations'

Tariq Toukan says:

====================
net/mlx5: Refactor esw QoS to support generalized operations

This patch series from the team to mlx5 core driver consists of one main
QoS part followed by small misc patches.

This main part (patches 1 to 11) by Carolina refactors the QoS handling
to generalize operations on scheduling groups and vports. These changes
are necessary to support new features that will extend group
functionality, introduce new group types, and support deeper
hierarchies.

Additionally, this refactor updates the terminology from "group" to
"node" to better reflect the hardware’s rate hierarchy and its use
of scheduling element nodes.

Simplify group scheduling element creation:
- net/mlx5: Refactor QoS group scheduling element creation

Refactor to support generalized operations for QoS:
- net/mlx5: Introduce node type to rate group structure
- net/mlx5: Add parent group support in rate group structure
- net/mlx5: Restrict domain list insertion to root TSAR ancestors
- net/mlx5: Rename vport QoS group reference to parent
- net/mlx5: Introduce node struct and rename group terminology to node
- net/mlx5: Refactor vport scheduling element creation function
- net/mlx5: Refactor vport QoS to use scheduling node structure
- net/mlx5: Remove vport QoS enabled flag

Support generalized operations for QoS elements:
- net/mlx5: Simplify QoS scheduling element configuration
- net/mlx5: Generalize QoS operations for nodes and vports

On top, patch 12 by Moshe handles FW request to move to drop mode.

In patch 13, Benjamin Poirier removes an empty eswitch flow table when
not used, which improves packet processing performance.

Patches 14 and 15 by Moshe are small field renamings as preparation for
future fields addition to these structures.

Series generated against:
commit c531f2269a53 ("net: bcmasp: enable SW timestamping")
====================

Link: https://patch.msgid.link/20241016173617.217736-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: fs, rename modify header struct member action

As preparation for HW Steering support, rename modify header struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code line shorter and
more readable.

Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: fs, rename packet reformat struct member action

As preparation for HW Steering support, rename packet reformat struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code line shorter and
more readable.

Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Only create VEPA flow table when in VEPA mode

Currently, when VFs are created, two flow tables are added for the eswitch:
the "fdb" table, which contains rules for each VF and the "vepa_fdb" table.
In the default VEB mode, the vepa_fdb table is empty. When switching to
VEPA mode, flow steering rules are added to vepa_fdb. Even though the
vepa_fdb table is empty in VEB mode, its presence adds some cost to packet
processing. In some workloads, this leads to drops which are reported by
the rx_discards_phy ethtool counter.

In order to improve performance, only create vepa_fdb when in VEPA mode.

Tests were done on a ConnectX-6 Lx adapter forwarding 64B packets between
both ports using dpdk-testpmd. Numbers are Rx-pps for each port, as
reported by testpmd.

Without changes:
traffic to unknown mac
testpmd on PF
numvfs=0,0
35257998,35264499
numvfs=1,1
24590124,24590888
testpmd on VF with numvfs=1,1
20434338,20434887
traffic to VF mac
testpmd on VF with numvfs=1,1
30341014,30340749

With changes:
traffic to unknown mac
testpmd on PF
numvfs=0,0
35404361,35383378
numvfs=1,1
29801247,29790757
testpmd on VF with numvfs=1,1
24310435,24309084
traffic to VF mac
testpmd on VF with numvfs=1,1
34811436,34781706

Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Add sync reset drop mode support

On sync reset flow, firmware may request a PF, which already
acknowledged the unload event, to move to drop mode. Drop mode means
that this PF will reduce polling frequency, as this PF is not going to
have another active part in the reset, but only reload back after the
reset.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Generalize QoS operations for nodes and vports

Refactor QoS normalization and rate calculation functions to operate
on mlx5_esw_sched_node, allowing for generalized handling of both
vports and nodes.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Simplify QoS scheduling element configuration

Simplify the configuration of QoS scheduling elements by removing the
separate functions `esw_qos_node_config` and `esw_qos_vport_config`.

Instead, directly use the existing `esw_qos_sched_elem_config` function
for both nodes and vports.

This unification helps in generalizing operations on scheduling
elements nodes.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Remove vport QoS enabled flag

Remove the `enabled` flag from the `vport->qos` struct, as QoS now
relies solely on the `sched_node` pointer to determine whether QoS
features are in use.

Currently, the vport `qos` struct consists only of the `sched_node`,
introducing an unnecessary two-level reference. However, the qos struct
is retained as it will be extended in future patches to support new QoS
features.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Refactor vport QoS to use scheduling node structure

Refactor the vport QoS structure by moving group membership and
scheduling details into the `mlx5_esw_sched_node` structure.

This change consolidates the vport into the rate hierarchy by unifying
the handling of different types of scheduling element nodes.

In addition, add a direct reference to the mlx5_vport within the
mlx5_esw_sched_node structure, to ensure that the vport is easily
accessible when a scheduling node is associated with a vport.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Refactor vport scheduling element creation function

Modify the vport scheduling element creation function to get the parent
node directly, aligning it with the group creation function.

This ensures a consistent flow for scheduling elements creation, as the
parent nodes already contain the device and parent element index.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Introduce node struct and rename group terminology to node

Introduce the `mlx5_esw_sched_node` struct, consolidating all rate
hierarchy related details, including membership and scheduling
parameters.

Since the group concept aligns with the `mlx5_esw_sched_node`, replace
the `mlx5_esw_rate_group` struct with it and rename the "group"
terminology to "node" throughout the rate hierarchy.

All relevant code paths and structures have been updated to use the
"node" terminology accordingly, laying the groundwork for future
patches that will unify the handling of different types of members
within the rate hierarchy.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Rename vport QoS group reference to parent

Rename the `group` field in the `mlx5_vport` structure to `parent` to
clarify the vport's role as a member of a parent group and distinguish
it from the concept of a general group.

Additionally, rename `group_entry` to `parent_entry` to reflect this
update.

This distinction will be important for handling more complex group
structures and scheduling elements.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Restrict domain list insertion to root TSAR ancestors

Update the logic for adding rate groups to the E-Switch domain list,
ensuring only groups with the root Transmit Scheduling Arbiter as their
parent are included.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Add parent group support in rate group structure

Introduce a `parent` field in the `mlx5_esw_rate_group` structure to
support hierarchical group relationships.

The `parent` can reference another group or be set to `NULL`,
indicating the group is connected to the root TSAR.

This change enables the ability to manage groups in a hierarchical
structure for future enhancements.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Introduce node type to rate group structure

Introduce the `sched_node_type` enum to represent both the group and
its members as scheduling nodes in the rate hierarchy.

Add the `type` field to the rate group structure to specify the type of
the node membership in the rate hierarchy.

Generalize comments to reflect this flexibility within the rate group
structure.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Refactor QoS group scheduling element creation

Introduce `esw_qos_create_group_sched_elem` to handle the creation of
group scheduling elements for E-Switch QoS, Transmit Scheduling
Arbiter (TSAR).

This reduces duplication and simplifies code for TSAR setup.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'add-support-of-hibmcge-ethernet-driver'

Jijie Shao says:

====================
Add support of HIBMCGE Ethernet Driver

This patch set adds the support of Hisilicon BMC Gigabit Ethernet Driver.

This patch set includes basic Rx/Tx functionality. It also includes
the registration and interrupt codes.

This work provides the initial support to the HIBMCGE and
would incrementally add features or enhancements.
====================

Link: https://patch.msgid.link/20241015123516.4035035-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add maintainer for hibmcge

Add myself as the maintainer for the hibmcge ethernet driver.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add a Makefile and update Kconfig for hibmcge

Add a Makefile and update Kconfig to build hibmcge driver.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement some ethtool_ops functions

Implement the .get_drvinfo .get_link .get_link_ksettings to get
the basic information and working status of the driver.
Implement the .set_link_ksettings to modify the rate, duplex,
and auto-negotiation status.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement rx_poll function to receive packets

Implement rx_poll function to read the rx descriptor after
receiving the rx interrupt. Adjust the skb based on the
descriptor to complete the reception of the packet.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement .ndo_start_xmit function

Implement .ndo_start_xmit function to fill the information of the packet
to be transmitted into the tx descriptor, and then the hardware will
transmit the packet using the information in the tx descriptor.
In addition, we also implemented the tx_handler function to enable the
tx descriptor to be reused, and .ndo_tx_timeout function to print some
information when the hardware is busy.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement some .ndo functions

Implement the .ndo_open() .ndo_stop() .ndo_set_mac_address()
and .ndo_change_mtu functions().
And .ndo_validate_addr calls the eth_validate_addr function directly

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add interrupt supported in this module

The driver supports four interrupts: TX interrupt, RX interrupt,
mdio interrupt, and error interrupt.

Actually, the driver does not use the mdio interrupt.
Therefore, the driver does not request the mdio interrupt.

The error interrupt distinguishes different error information
by using different masks. To distinguish different errors,
the statistics count is added for each error.

To ensure the consistency of the code process, masks are added for the
TX interrupt and RX interrupt.

This patch implements interrupt request, and provides a
unified entry for the interrupt handler function. However,
the specific interrupt handler function of each interrupt
is not implemented currently.

Because of pcim_enable_device(), the interrupt vector
is already device managed and does not need to be free actively.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add mdio and hardware configuration supported in this module

Implements the C22 read and write PHY registers interfaces.

Some hardware interfaces related to the PHY are also implemented
in this patch.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add read/write registers supported through the bar space

Add support for to read and write registers through the pic bar space.

Some driver parameters, such as mac_id, are determined by the
board form. Therefore, these parameters are initialized
from the register as device specifications.

the device specifications register are initialized and written by bmc.
driver will read these registers when loading.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add pci table supported in this module

Add pci table supported in this module, and implement pci_driver function
to initialize this driver.

hibmcge is a passthrough network device. Its software runs
on the host side, and the MAC hardware runs on the BMC side
to reduce the host CPU area. The software interacts with the
MAC hardware through the PCIe.

  ┌─────────────────────────┐
  │ HOST CPU network device │
  │    ┌──────────────┐     │
  │    │hibmcge driver│     │
  │    └─────┬─┬──────┘     │
  │          │ │            │
  │HOST  ┌───┴─┴───┐        │
  │      │ PCIE RC │        │
  └──────┴───┬─┬───┴────────┘
             │ │
            PCIE
             │ │
  ┌──────┬───┴─┴───┬────────┐
  │      │ PCIE EP │        │
  │BMC   └───┬─┬───┘        │
  │          │ │            │
  │ ┌────────┴─┴──────────┐ │
  │ │        GE           │ │
  │ │ ┌─────┐    ┌─────┐  │ │
  │ │ │ MAC │    │ MAC │  │ │
  └─┴─┼─────┼────┼─────┼──┴─┘
      │ PHY │    │ PHY │
      └─────┘    └─────┘

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sfp: change quirks for Alcatel Lucent G-010S-P

Seems Alcatel Lucent G-010S-P also have the same problem that it uses
TX_FAULT pin for SOC uart. So apply sfp_fixup_ignore_tx_fault to it.

Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
Link: https://patch.msgid.link/TYCPR01MB84373677E45A7BFA5A28232C98792@TYCPR01MB8437.jpnprd01.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge git://git./linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.12-rc4).

Conflicts:

107a034d5c1e ("net/mlx5: qos: Store rate groups in a qos domain")
1da9cfd6c41c ("net/mlx5: Unregister notifier on eswitch init failure")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ftgmac100: correct the phy interface of NC-SI mode

In NC-SI specification, NC-SI is using RMII, not MII.

Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Message-ID: <20241018053331.1900100-1-jacky_chou@aspeedtech.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

eth: Fix typo 'accelaration'. 'exprienced' and 'rewritting'

There are some spelling mistakes of 'accelaration', 'exprienced' and
'rewritting' in comments which should be 'acceleration', 'experienced'
and 'rewriting'.

Suggested-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/all/20241017162846.GA51712@kernel.org/
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <90D42CB167CA0842+20241018021910.31359-1-wangyuli@uniontech.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

r8169: enable EEE at 2.5G per default on RTL8125B

Register a6d/12 is shadowing register MDIO_AN_EEE_ADV2. So this line
disables advertisement of EEE at 2.5G. Latest vendor driver r8125
doesn't do this (any longer?), so this mode seems to be safe.
EEE saves quite some energy, therefore enable this mode per default.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <95dd5a0c-09ea-4847-94d9-b7aa3063e8ff@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phy: realtek: add RTL8125D-internal PHY

The first boards show up with Realtek's RTL8125D. This MAC/PHY chip
comes with an integrated 2.5Gbps PHY with ID 0x001cc841. It's not
clear yet whether there's an external version of this PHY and how
Realtek calls it, therefore use the numeric id for now.

Link: https://lore.kernel.org/netdev/2ada65e1-5dfa-456c-9334-2bc51272e9da@gmail.com/T/
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Message-ID: <7d2924de-053b-44d2-a479-870dc3878170@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: airoha: Reset BQL stopping the netdevice

Run airoha_qdma_cleanup_tx_queue() in ndo_stop callback in order to
unmap pending skbs. Moreover, reset BQL txq state stopping the netdevice,

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Message-ID: <20241017-airoha-en7581-reset-bql-v1-1-08c0c9888de5@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phy: mediatek-ge-soc: Propagate error code correctly in cal_cycle()

This patch propagates error code correctly in cal_cycle()
and improve with FIELD_GET().

Signed-off-by: SkyLake.Huang <skylake.huang@mediatek.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phy: mediatek-ge-soc: Shrink line wrapping to 80 characters

This patch shrinks line wrapping to 80 chars. Also, in
tx_amp_fill_result(), use FIELD_PREP() to prettify code.

Signed-off-by: SkyLake.Huang <skylake.huang@mediatek.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phy: mediatek-ge-soc: Fix coding style

This patch fixes spelling errors, re-arrange vars with
reverse Xmas tree and remove unnecessary parens in
mediatek-ge-soc.c.

Signed-off-by: SkyLake.Huang <skylake.huang@mediatek.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

r8169: remove rtl_dash_loop_wait_high/low

Remove rtl_dash_loop_wait_high/low to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <fb8c490c-2d92-48f5-8bbf-1fc1f2ee1649@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

r8169: avoid duplicated messages if loading firmware fails and switch to warn level

In case of a problem with firmware loading we inform at the driver level,
in addition the firmware load code itself issues warnings. Therefore
switch to firmware_request_nowarn() to avoid duplicated error messages.
In addition switch to warn level because the firmware is optional and
typically just fixes compatibility issues.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <d9c5094c-89a6-40e2-b5fe-8df7df4624ef@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

r8169: replace custom flag with disable_work() et al

So far we use a custom flag to define when a task can be scheduled and
when not. Let's use the standard mechanism with disable_work() et al
instead.
Note that in rtl8169_close() we can remove the call to cancel_work()
because we now call disable_work_sync() in rtl8169_down() already.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

r8169: don't take RTNL lock in rtl_task()

There's not really a benefit here in taking the RTNL lock. The task
handler does exception handling only, so we're in trouble anyway when
we come here, and there's no need to protect against e.g. a parallel
ethtool call.
A benefit of removing the RTNL lock here is that we now can
synchronously cancel the workqueue from a context holding the RTNL mutex.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

eth: fbnic: add CONFIG_PTP_1588_CLOCK_OPTIONAL dependency

fbnic fails to link as built-in when PTP support is in a loadable
module:

aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_ethtool.o: in function `fbnic_get_ts_info':
fbnic_ethtool.c:(.text+0x428): undefined reference to `ptp_clock_index'
aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_time.o: in function `fbnic_time_start':
fbnic_time.c:(.text+0x820): undefined reference to `ptp_schedule_worker'
aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_time.o: in function `fbnic_ptp_setup':
fbnic_time.c:(.text+0xa68): undefined reference to `ptp_clock_register'

Add the appropriate dependency to enforce this.

Fixes: 6a2b3ede9543 ("eth: fbnic: add RX packets timestamping support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Message-ID: <20241016062303.2551686-1-arnd@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: vxlan: update the document for vxlan_snoop()

The function vxlan_snoop() returns drop reasons now, so update the
document of it too.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vxlan: replace VXLAN_INVALID_HDR with VNI_NOT_FOUND

Replace the drop reason "SKB_DROP_REASON_VXLAN_INVALID_HDR" with
"SKB_DROP_REASON_VXLAN_VNI_NOT_FOUND" in encap_bypass_if_local(), as the
latter is more accurate.

Fixes: 790961d88b0e ("net: vxlan: use kfree_skb_reason() in encap_bypass_if_local()")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: airoha: Fix typo in REG_CDM2_FWD_CFG configuration

Fix typo in airoha_fe_init routine configuring CDM2_OAM_QSEL_MASK field
of REG_CDM2_FWD_CFG register.
This bug is not introducing any user visible problem since Frame Engine
CDM2 port is used just by the second QDMA block and we currently enable
just QDMA1 block connected to the MT7530 dsa switch via CDM1 port.

Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet
support for EN7581 SoC")

Reported-by: ChihWei Cheng <chihwei.cheng@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <20241015-airoha-eth-cdm2-fixes-v1-1-9dc6993286c3@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Add VLAN checksum support

The GbEth IP supports offloading checksum calculation for VLAN-tagged
packets, provided that the EtherType is 0x8100 and only one VLAN tag is
present.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Enable IPv6 TX checksum offload for GbEth

The GbEth IP supports offloading IPv6 TCP, UDP & ICMPv6 checksums in the
TX path.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Enable IPv6 RX checksum offloading for GbEth

The GbEth IP supports offloading IPv6 TCP, UDP & ICMPv6 checksums in the
RX path.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Simplify UDP TX checksum offload

The GbEth IP will pass through a zero UDP checksum without asserting any
error flags so we do not need to resort to software checksum calculation
in this case.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Disable IP header TX checksum offloading

For IPv4 packets, the header checksum will always be calculated in software
in the TX path (Documentation/networking/checksum-offloads.rst says "No
offloading of the IP header checksum is performed; it is always done in
software.") so there is no advantage in asking the hardware to also
calculate this checksum.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Simplify types in RX csum validation

The hardware checksum value is used as a 16-bit flag, it is zero when
the checksum has been validated and non-zero otherwise. Therefore we
don't need to treat this as an actual __wsum type or call csum_unfold(),
we can just use a u16 pointer.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Combine if conditions in RX csum validation

We can merge the two if conditions on skb_is_nonlinear(). Since
skb_frag_size_sub() and skb_trim() do not free memory, it is still safe
to access the trimmed bytes at the end of the packet after these calls.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Drop IP protocol check from RX csum verification

We do not need to confirm that the protocol is IPv4. If the hardware
encounters an unsupported protocol, it will set the checksum value to
0xFFFF.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Disable IP header RX checksum offloading

For IPv4 packets, the header checksum will always be checked in software
in the RX path (inet_gro_receive() calls ip_fast_csum() unconditionally)
so there is no advantage in asking the hardware to also calculate this
checksum.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ravb: Factor out checksum offload enable bits

Introduce new constants for the CSR1 (TX) and CSR2 (RX) checksum enable
bits, removing the risk of inconsistency when we change which flags we
enable.

Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

tg3: Increase buffer size for IRQ label

GCC is not happy with the current code, e.g.:

.../tg3.c:11313:37: error: ‘-txrx-’ directive output may be truncated writing 6 bytes into a region of size between 1 and 16 [-Werror=format-truncation=]
11313 |                                  "%s-txrx-%d", tp->dev->name, irq_num);
      |                                     ^~~~~~
.../tg3.c:11313:34: note: using the range [-2147483648, 2147483647] for directive argument
11313 |                                  "%s-txrx-%d", tp->dev->name, irq_num);

When `make W=1` is supplied, this prevents kernel building. Fix it by
increasing the buffer size for IRQ label and use sizeoF() instead of
hard coded constants.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Message-ID: <20241016090647.691022-1-andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phylink: remove "using_mac_select_pcs"

With DSA's implementation of the mac_select_pcs() method removed, we
can now remove the detection of mac_select_pcs() implementation.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phylink: remove use of pl->pcs in phylink_validate_mac_and_pcs()

When the mac_select_pcs() method is not implemented, there is no way
for pl->pcs to be set to a non-NULL value. This was here to support
the old phylink_set_pcs() method which has been removed a few years
ago. Simplify the code in phylink_validate_mac_and_pcs().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: phylink: allow mac_select_pcs() to remove a PCS

phylink has historically not permitted a PCS to be removed. An attempt
to permit this with phylink_set_pcs() resulted in comments indicating
that there was no need for this. This behaviour has been propagated
forward to the mac_select_pcs() approach as it was believed from these
comments that changing this would be NAK'd.

However, with mac_select_pcs(), it takes more code and thus complexity
to maintain this behaviour, which can - and in this case has - resulted
in a bug. If mac_select_pcs() returns NULL for a particular interface
type, but there is already a PCS in-use, then we skip the pcs_validate()
method, but continue using the old PCS. Also, it wouldn't be expected
behaviour by implementers of mac_select_pcs().

Allow this by removing this old unnecessary restriction.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: dsa: mv88e6xxx: return NULL when no PCS is present

Rather than returning an EOPNOTSUPP error pointer when the switch
has no support for PCS, return NULL to indicate that no PCS is
required.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: dsa: remove dsa_port_phylink_mac_select_pcs()

There is no longer any reason to implement the mac_select_pcs()
callback in DSA. Returning ERR_PTR(-EOPNOTSUPP) is functionally
equivalent to not providing the function.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ks8851: use %*ph to print small buffer

Use %*ph format to print small buffer as hex string. It will change
the output format from 32-bit words to byte hexdump, but this is not
critical as it's only a debug message.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <20241016132615.899037-1-andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: usb: sr9700: only store little-endian values in __le16 variable

In sr_mdio_read() the local variable res is used to store both
little-endian and host byte order values. This prevents Sparse
from helping us by flagging when endian miss matches occur - the
detection process hinges on the type of variables matching the
byte order of values stored in them.

Address this by adding a new local variable, word, to store little-endian
values; change the type of res to int, and use it to store host-byte
order values.

Flagged by Sparse as:

.../sr9700.c:205:21: warning: incorrect type in assignment (different base types)
.../sr9700.c:205:21:    expected restricted __le16 [addressable] [usertype] res
.../sr9700.c:205:21:    got int
.../sr9700.c:207:21: warning: incorrect type in assignment (different base types)
.../sr9700.c:207:21:    expected restricted __le16 [addressable] [usertype] res
.../sr9700.c:207:21:    got int
.../sr9700.c:212:16: warning: incorrect type in return expression (different base types)
.../sr9700.c:212:16:    expected int
.../sr9700.c:212:16:    got restricted __le16 [addressable] [usertype] res

Compile tested only.
No functional change intended.

Signed-off-by: Simon Horman <horms@kernel.org>
Message-ID: <20241016-blackbird-le16-v1-1-97ba8de6b38f@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: ethernet: ti: am65-cpsw: Fix uninitialized variable

The *ndev pointer needs to be set or it leads to an uninitialized variable
bug in the caller.

Fixes: 4a7b2ba94a59 ("net: ethernet: ti: am65-cpsw: Use tstats instead of open coded version")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Message-ID: <b168d5c7-704b-4452-84f9-1c1762b1f4ce@stanley.mountain>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Merge tag 'net-6.12-rc4' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Current release - new code bugs:

   - eth: mlx5: HWS, don't destroy more bwc queue locks than allocated

  Previous releases - regressions:

   - ipv4: give an IPv4 dev to blackhole_netdev

   - udp: compute L4 checksum as usual when not segmenting the skb

   - tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().

   - eth: mlx5e: don't call cleanup on profile rollback failure

   - eth: microchip: vcap api: fix memory leaks in
     vcap_api_encode_rule_test()

   - eth: enetc: disable Tx BD rings after they are empty

   - eth: macb: avoid 20s boot delay by skipping MDIO bus registration
     for fixed-link PHY

  Previous releases - always broken:

   - posix-clock: fix missing timespec64 check in pc_clock_settime()

   - genetlink: hold RCU in genlmsg_mcast()

   - mptcp: prevent MPC handshake on port-based signal endpoints

   - eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame

   - eth: stmmac: dwmac-tegra: fix link bring-up sequence

   - eth: bcmasp: fix potential memory leak in bcmasp_xmit()

  Misc:

   - add Andrew Lunn as a co-maintainer of all networking drivers"

* tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net/mlx5e: Don't call cleanup on profile rollback failure
  net/mlx5: Unregister notifier on eswitch init failure
  net/mlx5: Fix command bitmask initialization
  net/mlx5: Check for invalid vector index on EQ creation
  net/mlx5: HWS, use lock classes for bwc locks
  net/mlx5: HWS, don't destroy more bwc queue locks than allocated
  net/mlx5: HWS, fixed double free in error flow of definer layout
  net/mlx5: HWS, removed wrong access to a number of rules variable
  mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
  net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
  vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
  net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
  net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
  net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
  net: phy: mdio-bcm-unimac: Add BCM6846 support
  dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio
  udp: Compute L4 checksum as usual when not segmenting the skb
  genetlink: hold RCU in genlmsg_mcast()
  net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
  tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
  ...

net: phy: realtek: merge the drivers for internal NBase-T PHY's

The Realtek RTL8125/RTL8126 NBase-T MAC/PHY chips have internal PHY's
which are register-compatible, at least for the registers we use here.
So let's use just one PHY driver to support all of them.
These internal PHY's exist also as external C45 PHY's, but on the
internal PHY's no access to MMD registers is possible. This can be
used to differentiate between the internal and external version.

As a side effect the drivers for two now external-only drivers don't
require read_mmd/write_mmd hooks any longer.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/c57081a6-811f-4571-ab35-34f4ca6de9af@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth: fbnic: Add hardware monitoring support via HWMON interface

This patch adds support for hardware monitoring to the fbnic driver,
allowing for temperature and voltage sensor data to be exposed to
userspace via the HWMON interface. The driver registers a HWMON device
and provides callbacks for reading sensor data, enabling system
admins to monitor the health and operating conditions of fbnic.

Signed-off-by: Sanman Pradhan <sanmanpradhan@meta.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20241014152709.2123811-1-sanman.p211993@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mlx5-misc-fixes-2024-10-15'

Tariq Toukan says:

====================
mlx5 misc fixes 2024-10-15

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.

Series generated against:
commit 174714f0e505 ("selftests: drivers: net: fix name not defined")
====================

Link: https://patch.msgid.link/20241015093208.197603-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5e: Don't call cleanup on profile rollback failure

When profile rollback fails in mlx5e_netdev_change_profile, the netdev
profile var is left set to NULL. Avoid a crash when unloading the driver
by not calling profile->cleanup in such a case.

This was encountered while testing, with the original trigger that
the wq rescuer thread creation got interrupted (presumably due to
Ctrl+C-ing modprobe), which gets converted to ENOMEM (-12) by
mlx5e_priv_init, the profile rollback also fails for the same reason
(signal still active) so the profile is left as NULL, leading to a crash
later in _mlx5e_remove.

[  732.473932] mlx5_core 0000:08:00.1: E-Switch: Unload vfs: mode(OFFLOADS), nvfs(2), necvfs(0), active vports(2)
[  734.525513] workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR
[  734.557372] mlx5_core 0000:08:00.1: mlx5e_netdev_init_profile:6235:(pid 6086): mlx5e_priv_init failed, err=-12
[  734.559187] mlx5_core 0000:08:00.1 eth3: mlx5e_netdev_change_profile: new profile init failed, -12
[  734.560153] workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR
[  734.589378] mlx5_core 0000:08:00.1: mlx5e_netdev_init_profile:6235:(pid 6086): mlx5e_priv_init failed, err=-12
[  734.591136] mlx5_core 0000:08:00.1 eth3: mlx5e_netdev_change_profile: failed to rollback to orig profile, -12
[  745.537492] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  745.538222] #PF: supervisor read access in kernel mode
<snipped>
[  745.551290] Call Trace:
[  745.551590]  <TASK>
[  745.551866]  ? __die+0x20/0x60
[  745.552218]  ? page_fault_oops+0x150/0x400
[  745.555307]  ? exc_page_fault+0x79/0x240
[  745.555729]  ? asm_exc_page_fault+0x22/0x30
[  745.556166]  ? mlx5e_remove+0x6b/0xb0 [mlx5_core]
[  745.556698]  auxiliary_bus_remove+0x18/0x30
[  745.557134]  device_release_driver_internal+0x1df/0x240
[  745.557654]  bus_remove_device+0xd7/0x140
[  745.558075]  device_del+0x15b/0x3c0
[  745.558456]  mlx5_rescan_drivers_locked.part.0+0xb1/0x2f0 [mlx5_core]
[  745.559112]  mlx5_unregister_device+0x34/0x50 [mlx5_core]
[  745.559686]  mlx5_uninit_one+0x46/0xf0 [mlx5_core]
[  745.560203]  remove_one+0x4e/0xd0 [mlx5_core]
[  745.560694]  pci_device_remove+0x39/0xa0
[  745.561112]  device_release_driver_internal+0x1df/0x240
[  745.561631]  driver_detach+0x47/0x90
[  745.562022]  bus_remove_driver+0x84/0x100
[  745.562444]  pci_unregister_driver+0x3b/0x90
[  745.562890]  mlx5_cleanup+0xc/0x1b [mlx5_core]
[  745.563415]  __x64_sys_delete_module+0x14d/0x2f0
[  745.563886]  ? kmem_cache_free+0x1b0/0x460
[  745.564313]  ? lockdep_hardirqs_on_prepare+0xe2/0x190
[  745.564825]  do_syscall_64+0x6d/0x140
[  745.565223]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[  745.565725] RIP: 0033:0x7f1579b1288b

Fixes: 3ef14e463f6e ("net/mlx5e: Separate between netdev objects and mlx5e profiles initialization")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Unregister notifier on eswitch init failure

It otherwise remains registered and a subsequent attempt at eswitch
enabling might trigger warnings of the sort:

[  682.589148] ------------[ cut here ]------------
[  682.590204] notifier callback eswitch_vport_event [mlx5_core] already registered
[  682.590256] WARNING: CPU: 13 PID: 2660 at kernel/notifier.c:31 notifier_chain_register+0x3e/0x90
[...snipped]
[  682.610052] Call Trace:
[  682.610369]  <TASK>
[  682.610663]  ? __warn+0x7c/0x110
[  682.611050]  ? notifier_chain_register+0x3e/0x90
[  682.611556]  ? report_bug+0x148/0x170
[  682.611977]  ? handle_bug+0x36/0x70
[  682.612384]  ? exc_invalid_op+0x13/0x60
[  682.612817]  ? asm_exc_invalid_op+0x16/0x20
[  682.613284]  ? notifier_chain_register+0x3e/0x90
[  682.613789]  atomic_notifier_chain_register+0x25/0x40
[  682.614322]  mlx5_eswitch_enable_locked+0x1d4/0x3b0 [mlx5_core]
[  682.614965]  mlx5_eswitch_enable+0xc9/0x100 [mlx5_core]
[  682.615551]  mlx5_device_enable_sriov+0x25/0x340 [mlx5_core]
[  682.616170]  mlx5_core_sriov_configure+0x50/0x170 [mlx5_core]
[  682.616789]  sriov_numvfs_store+0xb0/0x1b0
[  682.617248]  kernfs_fop_write_iter+0x117/0x1a0
[  682.617734]  vfs_write+0x231/0x3f0
[  682.618138]  ksys_write+0x63/0xe0
[  682.618536]  do_syscall_64+0x4c/0x100
[  682.618958]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 7624e58a8b3a ("net/mlx5: E-switch, register event handler before arming the event")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Fix command bitmask initialization

Command bitmask have a dedicated bit for MANAGE_PAGES command, this bit
isn't Initialize during command bitmask Initialization, only during
MANAGE_PAGES.

In addition, mlx5_cmd_trigger_completions() is trying to trigger
completion for MANAGE_PAGES command as well.

Hence, in case health error occurred before any MANAGE_PAGES command
have been invoke (for example, during mlx5_enable_hca()),
mlx5_cmd_trigger_completions() will try to trigger completion for
MANAGE_PAGES command, which will result in null-ptr-deref error.[1]

Fix it by Initialize command bitmask correctly.

While at it, re-write the code for better understanding.

[1]
BUG: KASAN: null-ptr-deref in mlx5_cmd_trigger_completions+0x1db/0x600 [mlx5_core]
Write of size 4 at addr 0000000000000214 by task kworker/u96:2/12078
CPU: 10 PID: 12078 Comm: kworker/u96:2 Not tainted 6.9.0-rc2_for_upstream_debug_2024_04_07_19_01 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5_health0000:08:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
Call Trace:
<TASK>
dump_stack_lvl+0x7e/0xc0
kasan_report+0xb9/0xf0
kasan_check_range+0xec/0x190
mlx5_cmd_trigger_completions+0x1db/0x600 [mlx5_core]
mlx5_cmd_flush+0x94/0x240 [mlx5_core]
enter_error_state+0x6c/0xd0 [mlx5_core]
mlx5_fw_fatal_reporter_err_work+0xf3/0x480 [mlx5_core]
process_one_work+0x787/0x1490
? lockdep_hardirqs_on_prepare+0x400/0x400
? pwq_dec_nr_in_flight+0xda0/0xda0
? assign_work+0x168/0x240
worker_thread+0x586/0xd30
? rescuer_thread+0xae0/0xae0
kthread+0x2df/0x3b0
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x70
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>

Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Check for invalid vector index on EQ creation

Currently, mlx5 driver does not enforce vector index to be lower than
the maximum number of supported completion vectors when requesting a
new completion EQ. Thus, mlx5_comp_eqn_get() fails when trying to
acquire an IRQ with an improper vector index.

To prevent the case above, enforce that vector index value is
valid and lower than maximum in mlx5_comp_eqn_get() before handling the
request.

Fixes: f14c1a14e632 ("net/mlx5: Allocate completion EQs dynamically")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: HWS, use lock classes for bwc locks

The HWS BWC API uses one lock per queue and usually acquires one of
them, except when doing changes which require locking all queues in
order. Naturally, lockdep isn't too happy about acquiring the same lock
class multiple times, so inform it that each queue lock is a different
class to avoid false positives.

Fixes: 2ca62599aa0b ("net/mlx5: HWS, added send engine and context handling")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: HWS, don't destroy more bwc queue locks than allocated

hws_send_queues_bwc_locks_destroy destroyed more queue locks than
allocated, leading to memory corruption (occasionally) and warnings such
as DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock)) in __mutex_destroy because
sometimes, the 'mutex' being destroyed was random memory.
The severity of this problem is proportional to the number of queues
configured because the code overreaches beyond the end of the
bwc_send_queue_locks array by 2x its length.

Fix that by using the correct number of bwc queues.

Fixes: 2ca62599aa0b ("net/mlx5: HWS, added send engine and context handling")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: HWS, fixed double free in error flow of definer layout

Fix error flow bug that could lead to double free of a buffer
during a failure to calculate a suitable definer layout.

Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: HWS, removed wrong access to a number of rules variable

Removed wrong access to the num_of_rules field of the matcher.
This is a usual u32 variable, but the access was as if it was atomic.

This fixes the following CI warnings:
mlx5hws_bwc.c:708:17: warning: large atomic operation may incur significant performance penalty;
the access size (4 bytes) exceeds the max lock-free size (0 bytes) [-Watomic-alignment]

Fixes: 510f9f61a112 ("net/mlx5: HWS, added API and enabled HWS support")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409291101.6NdtMFVC-lkp@intel.com/
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow

Syzkaller reported this splat:

  ==================================================================
  BUG: KASAN: slab-use-after-free in mptcp_pm_nl_rm_addr_or_subflow+0xb44/0xcc0 net/mptcp/pm_netlink.c:881
  Read of size 4 at addr ffff8880569ac858 by task syz.1.2799/14662

  CPU: 0 UID: 0 PID: 14662 Comm: syz.1.2799 Not tainted 6.12.0-rc2-syzkaller-00307-g36c254515dc6 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:377 [inline]
   print_report+0xc3/0x620 mm/kasan/report.c:488
   kasan_report+0xd9/0x110 mm/kasan/report.c:601
   mptcp_pm_nl_rm_addr_or_subflow+0xb44/0xcc0 net/mptcp/pm_netlink.c:881
   mptcp_pm_nl_rm_subflow_received net/mptcp/pm_netlink.c:914 [inline]
   mptcp_nl_remove_id_zero_address+0x305/0x4a0 net/mptcp/pm_netlink.c:1572
   mptcp_pm_nl_del_addr_doit+0x5c9/0x770 net/mptcp/pm_netlink.c:1603
   genl_family_rcv_msg_doit+0x202/0x2f0 net/netlink/genetlink.c:1115
   genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
   genl_rcv_msg+0x565/0x800 net/netlink/genetlink.c:1210
   netlink_rcv_skb+0x165/0x410 net/netlink/af_netlink.c:2551
   genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
   netlink_unicast_kernel net/netlink/af_netlink.c:1331 [inline]
   netlink_unicast+0x53c/0x7f0 net/netlink/af_netlink.c:1357
   netlink_sendmsg+0x8b8/0xd70 net/netlink/af_netlink.c:1901
   sock_sendmsg_nosec net/socket.c:729 [inline]
   __sock_sendmsg net/socket.c:744 [inline]
   ____sys_sendmsg+0x9ae/0xb40 net/socket.c:2607
   ___sys_sendmsg+0x135/0x1e0 net/socket.c:2661
   __sys_sendmsg+0x117/0x1f0 net/socket.c:2690
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e
  RIP: 0023:0xf7fe4579
  Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
  RSP: 002b:00000000f574556c EFLAGS: 00000296 ORIG_RAX: 0000000000000172
  RAX: ffffffffffffffda RBX: 000000000000000b RCX: 0000000020000140
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000296 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
   </TASK>

  Allocated by task 5387:
   kasan_save_stack+0x33/0x60 mm/kasan/common.c:47
   kasan_save_track+0x14/0x30 mm/kasan/common.c:68
   poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
   __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:394
   kmalloc_noprof include/linux/slab.h:878 [inline]
   kzalloc_noprof include/linux/slab.h:1014 [inline]
   subflow_create_ctx+0x87/0x2a0 net/mptcp/subflow.c:1803
   subflow_ulp_init+0xc3/0x4d0 net/mptcp/subflow.c:1956
   __tcp_set_ulp net/ipv4/tcp_ulp.c:146 [inline]
   tcp_set_ulp+0x326/0x7f0 net/ipv4/tcp_ulp.c:167
   mptcp_subflow_create_socket+0x4ae/0x10a0 net/mptcp/subflow.c:1764
   __mptcp_subflow_connect+0x3cc/0x1490 net/mptcp/subflow.c:1592
   mptcp_pm_create_subflow_or_signal_addr+0xbda/0x23a0 net/mptcp/pm_netlink.c:642
   mptcp_pm_nl_fully_established net/mptcp/pm_netlink.c:650 [inline]
   mptcp_pm_nl_work+0x3a1/0x4f0 net/mptcp/pm_netlink.c:943
   mptcp_worker+0x15a/0x1240 net/mptcp/protocol.c:2777
   process_one_work+0x958/0x1b30 kernel/workqueue.c:3229
   process_scheduled_works kernel/workqueue.c:3310 [inline]
   worker_thread+0x6c8/0xf00 kernel/workqueue.c:3391
   kthread+0x2c1/0x3a0 kernel/kthread.c:389
   ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 113:
   kasan_save_stack+0x33/0x60 mm/kasan/common.c:47
   kasan_save_track+0x14/0x30 mm/kasan/common.c:68
   kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:579
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x51/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:230 [inline]
   slab_free_hook mm/slub.c:2342 [inline]
   slab_free mm/slub.c:4579 [inline]
   kfree+0x14f/0x4b0 mm/slub.c:4727
   kvfree+0x47/0x50 mm/util.c:701
   kvfree_rcu_list+0xf5/0x2c0 kernel/rcu/tree.c:3423
   kvfree_rcu_drain_ready kernel/rcu/tree.c:3563 [inline]
   kfree_rcu_monitor+0x503/0x8b0 kernel/rcu/tree.c:3632
   kfree_rcu_shrink_scan+0x245/0x3a0 kernel/rcu/tree.c:3966
   do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
   shrink_slab+0x32b/0x12a0 mm/shrinker.c:662
   shrink_one+0x47e/0x7b0 mm/vmscan.c:4818
   shrink_many mm/vmscan.c:4879 [inline]
   lru_gen_shrink_node mm/vmscan.c:4957 [inline]
   shrink_node+0x2452/0x39d0 mm/vmscan.c:5937
   kswapd_shrink_node mm/vmscan.c:6765 [inline]
   balance_pgdat+0xc19/0x18f0 mm/vmscan.c:6957
   kswapd+0x5ea/0xbf0 mm/vmscan.c:7226
   kthread+0x2c1/0x3a0 kernel/kthread.c:389
   ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Last potentially related work creation:
   kasan_save_stack+0x33/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xba/0xd0 mm/kasan/generic.c:541
   kvfree_call_rcu+0x74/0xbe0 kernel/rcu/tree.c:3810
   subflow_ulp_release+0x2ae/0x350 net/mptcp/subflow.c:2009
   tcp_cleanup_ulp+0x7c/0x130 net/ipv4/tcp_ulp.c:124
   tcp_v4_destroy_sock+0x1c5/0x6a0 net/ipv4/tcp_ipv4.c:2541
   inet_csk_destroy_sock+0x1a3/0x440 net/ipv4/inet_connection_sock.c:1293
   tcp_done+0x252/0x350 net/ipv4/tcp.c:4870
   tcp_rcv_state_process+0x379b/0x4f30 net/ipv4/tcp_input.c:6933
   tcp_v4_do_rcv+0x1ad/0xa90 net/ipv4/tcp_ipv4.c:1938
   sk_backlog_rcv include/net/sock.h:1115 [inline]
   __release_sock+0x31b/0x400 net/core/sock.c:3072
   __tcp_close+0x4f3/0xff0 net/ipv4/tcp.c:3142
   __mptcp_close_ssk+0x331/0x14d0 net/mptcp/protocol.c:2489
   mptcp_close_ssk net/mptcp/protocol.c:2543 [inline]
   mptcp_close_ssk+0x150/0x220 net/mptcp/protocol.c:2526
   mptcp_pm_nl_rm_addr_or_subflow+0x2be/0xcc0 net/mptcp/pm_netlink.c:878
   mptcp_pm_nl_rm_subflow_received net/mptcp/pm_netlink.c:914 [inline]
   mptcp_nl_remove_id_zero_address+0x305/0x4a0 net/mptcp/pm_netlink.c:1572
   mptcp_pm_nl_del_addr_doit+0x5c9/0x770 net/mptcp/pm_netlink.c:1603
   genl_family_rcv_msg_doit+0x202/0x2f0 net/netlink/genetlink.c:1115
   genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
   genl_rcv_msg+0x565/0x800 net/netlink/genetlink.c:1210
   netlink_rcv_skb+0x165/0x410 net/netlink/af_netlink.c:2551
   genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
   netlink_unicast_kernel net/netlink/af_netlink.c:1331 [inline]
   netlink_unicast+0x53c/0x7f0 net/netlink/af_netlink.c:1357
   netlink_sendmsg+0x8b8/0xd70 net/netlink/af_netlink.c:1901
   sock_sendmsg_nosec net/socket.c:729 [inline]
   __sock_sendmsg net/socket.c:744 [inline]
   ____sys_sendmsg+0x9ae/0xb40 net/socket.c:2607
   ___sys_sendmsg+0x135/0x1e0 net/socket.c:2661
   __sys_sendmsg+0x117/0x1f0 net/socket.c:2690
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e

  The buggy address belongs to the object at ffff8880569ac800
   which belongs to the cache kmalloc-512 of size 512
  The buggy address is located 88 bytes inside of
   freed 512-byte region [ffff8880569ac800, ffff8880569aca00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x569ac
  head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  flags: 0x4fff00000000040(head|node=1|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 04fff00000000040 ffff88801ac42c80 dead000000000100 dead000000000122
  raw: 0000000000000000 0000000080100010 00000001f5000000 0000000000000000
  head: 04fff00000000040 ffff88801ac42c80 dead000000000100 dead000000000122
  head: 0000000000000000 0000000080100010 00000001f5000000 0000000000000000
  head: 04fff00000000002 ffffea00015a6b01 ffffffffffffffff 0000000000000000
  head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 10238, tgid 10238 (kworker/u32:6), ts 597403252405, free_ts 597177952947
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x2d1/0x350 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x101e/0x3070 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x223/0x25a0 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x2c9/0x610 mm/mempolicy.c:2265
   alloc_slab_page mm/slub.c:2412 [inline]
   allocate_slab mm/slub.c:2578 [inline]
   new_slab+0x2ba/0x3f0 mm/slub.c:2631
   ___slab_alloc+0xd1d/0x16f0 mm/slub.c:3818
   __slab_alloc.constprop.0+0x56/0xb0 mm/slub.c:3908
   __slab_alloc_node mm/slub.c:3961 [inline]
   slab_alloc_node mm/slub.c:4122 [inline]
   __kmalloc_cache_noprof+0x2c5/0x310 mm/slub.c:4290
   kmalloc_noprof include/linux/slab.h:878 [inline]
   kzalloc_noprof include/linux/slab.h:1014 [inline]
   mld_add_delrec net/ipv6/mcast.c:743 [inline]
   igmp6_leave_group net/ipv6/mcast.c:2625 [inline]
   igmp6_group_dropped+0x4ab/0xe40 net/ipv6/mcast.c:723
   __ipv6_dev_mc_dec+0x281/0x360 net/ipv6/mcast.c:979
   addrconf_leave_solict net/ipv6/addrconf.c:2253 [inline]
   __ipv6_ifa_notify+0x3f6/0xc30 net/ipv6/addrconf.c:6283
   addrconf_ifdown.isra.0+0xef9/0x1a20 net/ipv6/addrconf.c:3982
   addrconf_notify+0x220/0x19c0 net/ipv6/addrconf.c:3781
   notifier_call_chain+0xb9/0x410 kernel/notifier.c:93
   call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:1996
   call_netdevice_notifiers_extack net/core/dev.c:2034 [inline]
   call_netdevice_notifiers net/core/dev.c:2048 [inline]
   dev_close_many+0x333/0x6a0 net/core/dev.c:1589
  page last free pid 13136 tgid 13136 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_page+0x5f4/0xdc0 mm/page_alloc.c:2638
   stack_depot_save_flags+0x2da/0x900 lib/stackdepot.c:666
   kasan_save_stack+0x42/0x60 mm/kasan/common.c:48
   kasan_save_track+0x14/0x30 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x89/0x90 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4085 [inline]
   slab_alloc_node mm/slub.c:4134 [inline]
   kmem_cache_alloc_noprof+0x121/0x2f0 mm/slub.c:4141
   skb_clone+0x190/0x3f0 net/core/skbuff.c:2084
   do_one_broadcast net/netlink/af_netlink.c:1462 [inline]
   netlink_broadcast_filtered+0xb11/0xef0 net/netlink/af_netlink.c:1540
   netlink_broadcast+0x39/0x50 net/netlink/af_netlink.c:1564
   uevent_net_broadcast_untagged lib/kobject_uevent.c:331 [inline]
   kobject_uevent_net_broadcast lib/kobject_uevent.c:410 [inline]
   kobject_uevent_env+0xacd/0x1670 lib/kobject_uevent.c:608
   device_del+0x623/0x9f0 drivers/base/core.c:3882
   snd_card_disconnect.part.0+0x58a/0x7c0 sound/core/init.c:546
   snd_card_disconnect+0x1f/0x30 sound/core/init.c:495
   snd_usx2y_disconnect+0xe9/0x1f0 sound/usb/usx2y/usbusx2y.c:417
   usb_unbind_interface+0x1e8/0x970 drivers/usb/core/driver.c:461
   device_remove drivers/base/dd.c:569 [inline]
   device_remove+0x122/0x170 drivers/base/dd.c:561

That's because 'subflow' is used just after 'mptcp_close_ssk(subflow)',
which will initiate the release of its memory. Even if it is very likely
the release and the re-utilisation will be done later on, it is of
course better to avoid any issues and read the content of 'subflow'
before closing it.

Fixes: 1c1f72137598 ("mptcp: pm: only decrement add_addr_accepted for MPJ req")
Cc: stable@vger.kernel.org
Reported-by: syzbot+3c8b7a8e7df6a2a226ca@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/670d7337.050a0220.4cbc0.004f.GAE@google.com
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20241015-net-mptcp-uaf-pm-rm-v1-1-c4ee5d987a64@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init

The loop responsible for allocating up to MTK_FQ_DMA_LENGTH buffers must
only touch as many descriptors, otherwise it ends up corrupting unrelated
memory. Fix the loop iteration count accordingly.

Fixes: c57e55819443 ("net: ethernet: mtk_eth_soc: handle dma buffer size soc specific")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241015081755.31060-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame

Andrew and Nikolay reported connectivity issues with Cilium's service
load-balancing in case of vmxnet3.

If a BPF program for native XDP adds an encapsulation header such as
IPIP and transmits the packet out the same interface, then in case
of vmxnet3 a corrupted packet is being sent and subsequently dropped
on the path.

vmxnet3_xdp_xmit_frame() which is called e.g. via vmxnet3_run_xdp()
through vmxnet3_xdp_xmit_back() calculates an incorrect DMA address:

  page = virt_to_page(xdpf->data);
  tbi->dma_addr = page_pool_get_dma_addr(page) +
                  VMXNET3_XDP_HEADROOM;
  dma_sync_single_for_device(&adapter->pdev->dev,
                             tbi->dma_addr, buf_size,
                             DMA_TO_DEVICE);

The above assumes a fixed offset (VMXNET3_XDP_HEADROOM), but the XDP
BPF program could have moved xdp->data. While the passed buf_size is
correct (xdpf->len), the dma_addr needs to have a dynamic offset which
can be calculated as xdpf->data - (void *)xdpf, that is, xdp->data -
xdp->data_hard_start.

Fixes: 54f00cce1178 ("vmxnet3: Add XDP support.")
Reported-by: Andrew Sauber <andrew.sauber@isovalent.com>
Reported-by: Nikolay Nikolaev <nikolay.nikolaev@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Nikolay Nikolaev <nikolay.nikolaev@isovalent.com>
Acked-by: Anton Protopopov <aspsk@isovalent.com>
Cc: William Tu <witu@nvidia.com>
Cc: Ronak Doshi <ronak.doshi@broadcom.com>
Link: https://patch.msgid.link/a0888656d7f09028f9984498cc698bb5364d89fc.1728931137.git.daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'ethtool-rss-track-rss-ctx-busy-from-core'

Daniel Zahka says:

====================
ethtool: rss: track rss ctx busy from core

This series prevents deletion of rss contexts that are
in use by ntuple filters from ethtool core.
====================

Link: https://patch.msgid.link/20241011183549.1581021-1-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: drv-net: rss_ctx: add rss ctx busy testcase

It should be invalid to delete an rss context while it is being
referenced from an ntuple filter. ethtool core should prevent this
from happening. This patch adds a testcase to verify this behavior.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ethtool: rss: prevent rss ctx deletion when in use

ntuple filters can specify an rss context to use for packet hashing
and queue selection. When a filter is referencing an rss context, it
should be invalid for that context to be deleted. A list of active
ntuple filters and their associated rss contexts can be compiled by
querying a device's ethtool_ops.get_rxnfc. This patch checks to see if
any ntuple filters are referencing an rss context during context
deletion, and prevents the deletion if the requested context is still
in use.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>