linux-2.6-block.git
4 months agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Thu, 25 Apr 2024 03:05:31 +0000 (20:05 -0700)]
Merge branch '100GbE' of git://git./linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: Support 5 layer Tx scheduler topology

Mateusz Polchlopek says:

For performance reasons there is a need to have support for selectable
Tx scheduler topology. Currently firmware supports only the default
9-layer and 5-layer topology. This patch series enables switch from
default to 5-layer topology, if user decides to opt-in.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Document tx_scheduling_layers parameter
  ice: Add tx_scheduling_layers devlink param
  ice: Enable switching default Tx scheduler topology
  ice: Adjust the VSI/Aggregator layers
  ice: Support 5 layer topology
  devlink: extend devlink_param *set pointer
====================

Link: https://lore.kernel.org/r/20240422203913.225151-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoocteontx2-pf: flower: check for unsupported control flags
Asbjørn Sloth Tønnesen [Mon, 22 Apr 2024 15:27:34 +0000 (15:27 +0000)]
octeontx2-pf: flower: check for unsupported control flags

Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.

In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.

Remove FLOW_DIS_FIRST_FRAG specific error message,
and treat it as any other unsupported control flag.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
Link: https://lore.kernel.org/r/20240422152735.175693-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: hns3: flower: validate control flags
Asbjørn Sloth Tønnesen [Mon, 22 Apr 2024 15:27:16 +0000 (15:27 +0000)]
net: hns3: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Also propagate extack to hclge_get_cls_key_ip(), and convert it to
return error code.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Tested-by: Jijie Shao <shaojijie@huawei.com>
Link: https://lore.kernel.org/r/20240422152717.175659-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: ethernet: ti: cpsw: flower: validate control flags
Asbjørn Sloth Tønnesen [Mon, 22 Apr 2024 15:26:55 +0000 (15:26 +0000)]
net: ethernet: ti: cpsw: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_match_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_match_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240422152656.175627-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: ethernet: ti: am65-cpsw: flower: validate control flags
Asbjørn Sloth Tønnesen [Mon, 22 Apr 2024 15:26:42 +0000 (15:26 +0000)]
net: ethernet: ti: am65-cpsw: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_match_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_match_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240422152643.175592-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agobnxt_en: flower: validate control flags
Asbjørn Sloth Tønnesen [Mon, 22 Apr 2024 15:26:23 +0000 (15:26 +0000)]
bnxt_en: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_match_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_match_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Link: https://lore.kernel.org/r/20240422152626.175569-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: ethernet: ti: am65-cpsw-nuss: Enable SGMII mode for J784S4 CPSW9G
Chintan Vankar [Mon, 22 Apr 2024 12:45:15 +0000 (18:15 +0530)]
net: ethernet: ti: am65-cpsw-nuss: Enable SGMII mode for J784S4 CPSW9G

TI's J784S4 SoC supports SGMII mode with CPSW9G instance of the CPSW
Ethernet Switch. Thus, enable it by adding SGMII mode to the
extra_modes member of the "j784s4_cpswxg_pdata" SoC data.

Reviewed-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Link: https://lore.kernel.org/r/20240422124515.887511-1-c-vankar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: fix conntrack_dump_flush retval on unsupported kernel
Florian Westphal [Mon, 22 Apr 2024 10:33:53 +0000 (12:33 +0200)]
selftests: netfilter: fix conntrack_dump_flush retval on unsupported kernel

With CONFIG_NETFILTER=n test passes instead of skip.  Before:

 ./run_kselftest.sh -t net/netfilter:conntrack_dump_flush
[..]
 # Starting 3 tests from 1 test cases.
 #  RUN           conntrack_dump_flush.test_dump_by_zone ...
 mnl_socket_open: Protocol not supported
[..]
 ok 3 conntrack_dump_flush.test_flush_by_zone_default
 # PASSED: 3 / 3 tests passed.
 # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

After:
  mnl_socket_open: Protocol not supported
[..]
  ok 3 conntrack_dump_flush.test_flush_by_zone_default # SKIP cannot open netlink_netfilter socket
  # PASSED: 3 / 3 tests passed.
  # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:3 error:0

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240422103358.3511-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: nft_zones_many.sh: set ct sysctl after ruleset load
Florian Westphal [Mon, 22 Apr 2024 10:25:42 +0000 (12:25 +0200)]
selftests: netfilter: nft_zones_many.sh: set ct sysctl after ruleset load

nf_conntrack_udp_timeout sysctl only exist once conntrack module is loaded,
if this test runs standalone on a modular kernel sysctl setting fails,
this can result in test failure as udp conntrack entries expire too fast.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240422102546.2494-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'selftest-netfilter-additional-cleanups'
Jakub Kicinski [Thu, 25 Apr 2024 00:12:49 +0000 (17:12 -0700)]
Merge branch 'selftest-netfilter-additional-cleanups'

Florian Westphal says:

====================
selftest: netfilter: additional cleanups

This is the last planned series of the netfilter-selftest-move.
It contains cleanups (and speedups) and a few small updates to
scripts to improve error/skip reporting.

I intend to route future changes, if any, via nf(-next) trees
now that the 'massive code churn' phase is over.
====================

Link: https://lore.kernel.org/r/20240423130604.7013-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: conntrack_vrf.sh: prefer socat, not iperf3
Florian Westphal [Tue, 23 Apr 2024 13:05:50 +0000 (15:05 +0200)]
selftests: netfilter: conntrack_vrf.sh: prefer socat, not iperf3

Use socat, like most of the other scripts already do.  This also makes
the script complete slightly faster (3s -> 1s).

iperf3 establishes two connections (1 control connection, and 1+x
depending on test), so adjust expected counter values as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-8-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: skip tests on early errors
Florian Westphal [Tue, 23 Apr 2024 13:05:49 +0000 (15:05 +0200)]
selftests: netfilter: skip tests on early errors

br_netfilter: If we can't add the needed initial nftables ruleset skip the
test, kernel doesn't support a required feature.

rpath: run a subset of the tests if possible, but make sure we return
the skip return value so they are marked appropriately by the kselftest
framework.

nft_audit.sh: provide version information when skipping, this should
help catching kernel problem (feature not available in kernel) vs.
userspace issue (parser doesn't support keyword).

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: nft_flowtable.sh: shellcheck cleanups
Florian Westphal [Tue, 23 Apr 2024 13:05:48 +0000 (15:05 +0200)]
selftests: netfilter: nft_flowtable.sh: shellcheck cleanups

no functional changes intended except that test will now SKIP in
case kernel lacks bridge support and initial rule load failure provides
nft version information.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: nft_flowtable.sh: re-run with random mtu sizes
Florian Westphal [Tue, 23 Apr 2024 13:05:47 +0000 (15:05 +0200)]
selftests: netfilter: nft_flowtable.sh: re-run with random mtu sizes

Now that the test runs much faster, also re-run it with random MTU sizes
for the different link legs.  flowtable should pass ip fragments, if
any, up to the normal forwarding path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-5-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: nft_concat_range.sh: shellcheck cleanups
Florian Westphal [Tue, 23 Apr 2024 13:05:46 +0000 (15:05 +0200)]
selftests: netfilter: nft_concat_range.sh: shellcheck cleanups

no functional changes intended.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-4-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: nft_concat_range.sh: drop netcat support
Florian Westphal [Tue, 23 Apr 2024 13:05:45 +0000 (15:05 +0200)]
selftests: netfilter: nft_concat_range.sh: drop netcat support

Tests fail on my workstation with netcat 110, instead of debugging+more
workarounds just remove this.

Tests will fall back to bash or socat.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: netfilter: nft_concat_range.sh: move to lib.sh infra
Florian Westphal [Tue, 23 Apr 2024 13:05:44 +0000 (15:05 +0200)]
selftests: netfilter: nft_concat_range.sh: move to lib.sh infra

Use busywait helper instead of unconditional sleep, reduces run time
from 6m to 2:30 on my system.

The busywait helper calls the function passed to it as argument; disable
the shellcheck test for unreachable code, it generates many (false)
warnings here.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240423130604.7013-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: openvswitch: Release reference to netdev
Jun Gu [Tue, 23 Apr 2024 07:37:51 +0000 (15:37 +0800)]
net: openvswitch: Release reference to netdev

dev_get_by_name will provide a reference on the netdev. So ensure that
the reference of netdev is released after completed.

Fixes: 2540088b836f ("net: openvswitch: Check vport netdev name")
Signed-off-by: Jun Gu <jun.gu@easystack.cn>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20240423073751.52706-1-jun.gu@easystack.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'sparx5-port-mirroring'
David S. Miller [Wed, 24 Apr 2024 12:08:07 +0000 (13:08 +0100)]
Merge branch 'sparx5-port-mirroring'

Daniel Machon says:

====================
net: sparx5: add support for port mirroring

This series adds support for port mirroring, and port mirroring stats,
through tc matchall action FLOW_ACTION_MIRRED.

The hardware has three independent mirroring probes. Each probe can be
configured with a separate set of filtering conditions that must be
fulfilled before traffic is mirrored.

A mirror probe can have up to 64 source ports and a single monitor port.
The direction of a mirror probe determines if rx or tx traffic is
mirrored from the source port to the monitor port.

To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Lars Povlsen <lars.povlsen@microchip.com>
To: Steen Hegelund <Steen.Hegelund@microchip.com>
To: UNGLinuxDriver@microchip.com
To: Russell King <linux@armlinux.org.uk>
Cc: netdev@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Cc: Yue Haibing <yuehaibing@huawei.com>
---
Changes in v3:
- Ditch do_div() (patch #3) to fix warning on hexagon arch, reported by intel bot
- Link to v2: https://lore.kernel.org/r/20240418-port-mirroring-v2-0-20642868b386@microchip.com

Changes in v2:
- Fix clang build warning about uninitialized variable 'err'
- Link to v1: https://lore.kernel.org/r/20240418-port-mirroring-v1-0-e05c35007c55@microchip.com
====================

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: sparx5: add support for matchall mirror stats
Daniel Machon [Sat, 20 Apr 2024 19:29:14 +0000 (21:29 +0200)]
net: sparx5: add support for matchall mirror stats

Add support for tc matchall mirror stats. When a new matchall mirror
rule is added, the baseline stats for that port is saved.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: sparx5: add the tc glue to support port mirroring
Daniel Machon [Sat, 20 Apr 2024 19:29:13 +0000 (21:29 +0200)]
net: sparx5: add the tc glue to support port mirroring

Add the necessary tc glue to add and delete mirror rules through tc
matchall.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: sparx5: add port mirroring implementation
Daniel Machon [Sat, 20 Apr 2024 19:29:12 +0000 (21:29 +0200)]
net: sparx5: add port mirroring implementation

The hardware supports three independent mirroring probes. Each probe can
be configured to mirror rx or tx traffic (direction).

Using tc matchall, it is now possible to add a source port and a monitor
port to a mirror probe. Depending on the mirror direction, rx or tx
traffic from a source port will be mirrored to the monitor port.

A single source port can be a member of multiple mirror probes.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: sparx5: add bookkeeping code for matchall rules
Daniel Machon [Sat, 20 Apr 2024 19:29:11 +0000 (21:29 +0200)]
net: sparx5: add bookkeeping code for matchall rules

In preparation for new tc matchall rules, we add a bit of bookkeeping
code to keep track of them. The rules are identified by the cookie
passed from the tc stack.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: sparx5: add new register definitions
Daniel Machon [Sat, 20 Apr 2024 19:29:10 +0000 (21:29 +0200)]
net: sparx5: add new register definitions

In preparation for port mirroring support through tc matchall, add the
required register definitions.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agoMerge branch 'net-dunamic-dummy-device'
David S. Miller [Wed, 24 Apr 2024 11:00:17 +0000 (12:00 +0100)]
Merge branch 'net-dunamic-dummy-device'

Breno Leitao says:

====================
allocate dummy device dynamically

struct net_device shouldn't be embedded into any structure, instead,
the owner should use the private space to embed their state into
net_device.

But, in some cases the net_device is embedded inside the private
structure, which blocks the usage of zero-length arrays inside
net_device.

Create a helper to allocate a dummy device at dynamically runtime, and
move the Ethernet devices to use it, instead of embedding the dummy
device inside the private structure.

This fixes all the network cases plus some wireless drivers.

PS: Due to lack of hardware, unfortunately most these patches are
compiled tested only, except ath11k that was kindly tested by Kalle Valo.

---
Changelog:

v7:
* Document the return value of alloc_netdev_dummy()
v6:
* No code change. Just added Reviewed-by: and fix a commit message
v5:
* Added a new patch to fix some typos in the previous code
* Rebased to net-net/main
v4:
* Added a new patch to add dummy device at free_netdev(), as suggested
  by Jakub.
* Added support for some wireless driver.
* Added some Acked-by and Reviewed-by.
v3:
* Use free_netdev() instead of kfree() as suggested by Jakub.
* Change the free_netdev() place in ipa driver, as suggested by
  Alex Elder.
* Set err in the error path in the Marvell driver, as suggested
  by Simon Horman.
v2:
* Patch 1: Use a pre-defined name ("dummy#") for the dummy
  net_devices.
* Patch 2-5: Added users for the new helper.
v1:
* https://lore.kernel.org/all/20240327200809.512867-1-leitao@debian.org/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agowifi: ath11k: allocate dummy net_device dynamically
Breno Leitao [Mon, 22 Apr 2024 12:39:03 +0000 (05:39 -0700)]
wifi: ath11k: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from struct ath11k_ext_irq_grp by converting it
into a pointer. Then use the leverage alloc_netdev() to allocate the
net_device object at ath11k_ahb_config_ext_irq() for ahb, and
ath11k_pcic_ext_irq_config() for pcic.

The free of the device occurs at ath11k_ahb_free_ext_irq() for the ahb
case, and ath11k_pcic_free_ext_irq() for the pcic case.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Kalle Valo <kvalo@kernel.org>
Acked-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agowifi: ath10k: allocate dummy net_device dynamically
Breno Leitao [Mon, 22 Apr 2024 12:39:02 +0000 (05:39 -0700)]
wifi: ath10k: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from struct ath10k by converting it
into a pointer. Then use the leverage alloc_netdev() to allocate the
net_device object at ath10k_core_create(). The free of the device occurs
at ath10k_core_destroy().

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agowifi: qtnfmac: Use netdev dummy allocator helper
Breno Leitao [Mon, 22 Apr 2024 12:39:01 +0000 (05:39 -0700)]
wifi: qtnfmac: Use netdev dummy allocator helper

There is a new dummy netdev allocator, use it instead of
alloc_netdev()/init_dummy_netdev combination.

Using alloc_netdev() with init_dummy_netdev might cause some memory
corruption at the driver removal side.

Fixes: 61cdb09ff760 ("wifi: qtnfmac: allocate dummy net_device dynamically")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: ibm/emac: allocate dummy net_device dynamically
Breno Leitao [Mon, 22 Apr 2024 12:39:00 +0000 (05:39 -0700)]
net: ibm/emac: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: ipa: allocate dummy net_device dynamically
Breno Leitao [Mon, 22 Apr 2024 12:38:59 +0000 (05:38 -0700)]
net: ipa: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: mediatek: mtk_eth_sock: allocate dummy net_device dynamically
Breno Leitao [Mon, 22 Apr 2024 12:38:58 +0000 (05:38 -0700)]
net: mediatek: mtk_eth_sock: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: marvell: prestera: allocate dummy net_device dynamically
Breno Leitao [Mon, 22 Apr 2024 12:38:57 +0000 (05:38 -0700)]
net: marvell: prestera: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from the private struct by converting it
into a pointer. Then use the leverage the new alloc_netdev_dummy()
helper to allocate and initialize dummy devices.

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Elad Nachman <enachman@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: create a dummy net_device allocator
Breno Leitao [Mon, 22 Apr 2024 12:38:56 +0000 (05:38 -0700)]
net: create a dummy net_device allocator

It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.

This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzero them all. This causes
some problems as reported here:

https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/

Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.

init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().

Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
basically simplify the allocation of dummy devices, by allocating and
initializing it. Freeing the device continue to be done through
free_netdev()

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: free_netdev: exit earlier if dummy
Breno Leitao [Mon, 22 Apr 2024 12:38:55 +0000 (05:38 -0700)]
net: free_netdev: exit earlier if dummy

For dummy devices, exit earlier at free_netdev() instead of executing
the whole function. This is necessary, because dummy devices are
special, and shouldn't have the second part of the function executed.

Otherwise reg_state, which is NETREG_DUMMY, will be overwritten and
there will be no way to identify that this is a dummy device. Also, this
device do not need the final put_device(), since dummy devices are not
registered (through register_netdevice()), where the device reference is
increased (at netdev_register_kobject()/device_add()).

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: core: Fix documentation
Breno Leitao [Mon, 22 Apr 2024 12:38:54 +0000 (05:38 -0700)]
net: core: Fix documentation

Fix bad grammar in description of init_dummy_netdev() function.  This
topic showed up in the review of the "allocate dummy device dynamically"
patch set.

Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agoMerge branch 'dsa-mt7530-improvements'
David S. Miller [Wed, 24 Apr 2024 10:57:03 +0000 (11:57 +0100)]
Merge branch 'dsa-mt7530-improvements'

Arınç ÜNAL says:

====================
MT7530 DSA Subdriver Improvements Act IV

This is the forth patch series with the goal of simplifying the MT7530 DSA
subdriver and improving support for MT7530, MT7531, and the switch on the
MT7988 SoC.

I have done a simple ping test to confirm basic communication on all switch
ports on MCM and standalone MT7530, and MT7531 switch with this patch
series applied.

MT7621 Unielec, MCM MT7530:

rgmii-only-gmac0-mt7621-unielec-u7621-06-16m.dtb
gmac0-and-gmac1-mt7621-unielec-u7621-06-16m.dtb

tftpboot 0x80008000 mips-uzImage.bin; tftpboot 0x83000000 mips-rootfs.cpio.uboot; tftpboot 0x83f00000 $dtb; bootm 0x80008000 0x83000000 0x83f00000

MT7622 Bananapi, MT7531:

gmac0-and-gmac1-mt7622-bananapi-bpi-r64.dtb

tftpboot 0x40000000 arm64-Image; tftpboot 0x45000000 arm64-rootfs.cpio.uboot; tftpboot 0x4a000000 $dtb; booti 0x40000000 0x45000000 0x4a000000

MT7623 Bananapi, standalone MT7530:

rgmii-only-gmac0-mt7623n-bananapi-bpi-r2.dtb
gmac0-and-gmac1-mt7623n-bananapi-bpi-r2.dtb

tftpboot 0x80008000 arm-zImage; tftpboot 0x83000000 arm-rootfs.cpio.uboot; tftpboot 0x83f00000 $dtb; bootz 0x80008000 0x83000000 0x83f00000

This patch series finalises the patch series linked below.

https://lore.kernel.org/r/20230522121532.86610-1-arinc.unal@arinc9.com

---
Changes in v2:
- Add two new patches to the end.
- Patch 13
  - Add the missing patch log.
- Link to v1: https://lore.kernel.org/r/20240419-for-netnext-mt7530-improvements-4-v1-0-6d852ca79b1d@arinc9.com
====================

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: explain exposing MDIO bus of MT7531AE better
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:22 +0000 (10:15 +0300)]
net: dsa: mt7530: explain exposing MDIO bus of MT7531AE better

Unlike MT7531BE, the GPIO 6-12 pins are not used for RGMII on MT7531AE.
Therefore, the GPIO 11-12 pins are set to function as MDC and MDIO to
expose the MDIO bus of the switch. Replace the comment with a better
explanation.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: do not pass port variable to mt7531_rgmii_setup()
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:21 +0000 (10:15 +0300)]
net: dsa: mt7530: do not pass port variable to mt7531_rgmii_setup()

The mt7531_rgmii_setup() function does not use the port variable, do not
pass the variable to it.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: use priv->ds->num_ports instead of MT7530_NUM_PORTS
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:20 +0000 (10:15 +0300)]
net: dsa: mt7530: use priv->ds->num_ports instead of MT7530_NUM_PORTS

Use priv->ds->num_ports on all for loops which configure the switch
registers. In the future, the value of MT7530_NUM_PORTS will depend on
priv->id. Therefore, this change prepares the subdriver for a simpler
implementation.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: get rid of mac_port_validate member of mt753x_info
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:19 +0000 (10:15 +0300)]
net: dsa: mt7530: get rid of mac_port_validate member of mt753x_info

The mac_port_validate member of the mt753x_info structure is not being
used, remove it. Improve the member description section in the process.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: refactor MT7530_PMEEECR_P()
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:18 +0000 (10:15 +0300)]
net: dsa: mt7530: refactor MT7530_PMEEECR_P()

The MT7530_PMEEECR_P() register is on MT7530, MT7531, and the switch on the
MT7988 SoC. Rename the definition for them to MT753X_PMEEECR_P(). Use the
FIELD_PREP and FIELD_GET macros. Rename GET_LPI_THRESH() and
SET_LPI_THRESH() to LPI_THRESH_GET() and LPI_THRESH_SET().

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: get rid of function sanity check
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:17 +0000 (10:15 +0300)]
net: dsa: mt7530: get rid of function sanity check

Get rid of checking whether functions are filled properly. priv->info which
is an mt753x_info structure is filled and checked for before this check.
It's unnecessary checking whether it's filled properly.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: define MAC speed capabilities per switch model
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:16 +0000 (10:15 +0300)]
net: dsa: mt7530: define MAC speed capabilities per switch model

With the support of the MT7988 SoC switch, the MAC speed capabilities
defined on mt753x_phylink_get_caps() won't apply to all switch models
anymore. Move them to more appropriate locations instead of overwriting
config->mac_capabilities.

Remove the comment on mt753x_phylink_get_caps() as it's become invalid with
the support of MT7531 and MT7988 SoC switch.

Add break to case 6 of mt7988_mac_port_get_caps() to be explicit.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: return mt7530_setup_mdio & mt7531_setup_common on error
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:15 +0000 (10:15 +0300)]
net: dsa: mt7530: return mt7530_setup_mdio & mt7531_setup_common on error

The mt7530_setup_mdio() and mt7531_setup_common() functions should be
checked for errors. Return if the functions return a non-zero value.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: move MT753X_MTRAP operations for MT7530
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:14 +0000 (10:15 +0300)]
net: dsa: mt7530: move MT753X_MTRAP operations for MT7530

On MT7530, the media-independent interfaces of port 5 and 6 are controlled
by the MT7530_P5_DIS and MT7530_P6_DIS bits of the hardware trap. Deal with
these bits only when the relevant port is being enabled or disabled. This
ensures that these ports will be disabled when they are not in use.

Do not set MT7530_CHG_TRAP on mt7530_setup_port5() as that's already being
done on mt7530_setup().

Instead of globally setting MT7530_P5_MAC_SEL, clear it, then set it only
on the appropriate case.

If PHY muxing is detected, clear MT7530_P5_DIS before calling
mt7530_setup_port5().

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: refactor MT7530_HWTRAP and MT7530_MHWTRAP
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:13 +0000 (10:15 +0300)]
net: dsa: mt7530: refactor MT7530_HWTRAP and MT7530_MHWTRAP

The MT7530_HWTRAP and MT7530_MHWTRAP registers are on MT7530 and MT7531.
It's called hardware trap on MT7530, software trap on MT7531. That's
because some bits of the trap on MT7530 cannot be modified by software
whilst all bits of the trap on MT7531 can. Rename the definitions for them
to MT753X_TRAP and MT753X_MTRAP. Add MT7530 and MT7531 prefixes to the
definitions specific to the switch model.

Remove the extra parentheses from MT7530_XTAL_40MHZ and MT7530_XTAL_20MHZ.

Rename MHWTRAP_PHY0_SEL, MHWTRAP_MANUAL, and MHWTRAP_PHY_ACCESS to be on
par with the "MT7621 Giga Switch Programming Guide v0.3" document.

Make an enumaration for the XTAL frequency. Set the data type of the xtal
variable on mt7531_pll_setup() to it.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: refactor MT7530_MFC and MT7531_CFC, add MT7531_QRY_FFP
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:12 +0000 (10:15 +0300)]
net: dsa: mt7530: refactor MT7530_MFC and MT7531_CFC, add MT7531_QRY_FFP

The MT7530_MFC register is on MT7530, MT7531, and the switch on the MT7988
SoC. Rename it to MT753X_MFC. Bit 7 to 0 differs between MT7530 and
MT7531/MT7988. Add MT7530 prefix to these definitions, and define the
IGMP/MLD Query Frame Flooding Ports mask for MT7531.

Rename the cases of MIRROR_MASK to MIRROR_PORT_MASK.

Move mt753x_mirror_port_get() and mt753x_port_mirror_set() to mt7530.h as
macros.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: rename mt753x_bpdu_port_fw enum to mt753x_to_cpu_fw
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:11 +0000 (10:15 +0300)]
net: dsa: mt7530: rename mt753x_bpdu_port_fw enum to mt753x_to_cpu_fw

The mt753x_bpdu_port_fw enum is globally used for manipulating the process
of deciding the forwardable ports, specifically concerning the CPU port(s).
Therefore, rename it and the values in it to mt753x_to_cpu_fw.

Change FOLLOW_MFC to SYSTEM_DEFAULT to be on par with the switch documents.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: rename p5_intf_sel and use only for MT7530 switch
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:10 +0000 (10:15 +0300)]
net: dsa: mt7530: rename p5_intf_sel and use only for MT7530 switch

The p5_intf_sel pointer is used to store the information of whether PHY
muxing is used or not. PHY muxing is a feature specific to port 5 of the
MT7530 switch. Do not use it for other switch models.

Rename the pointer to p5_mode to store the mode the port is being used in.
Rename the p5_interface_select enum to mt7530_p5_mode, the string
representation to mt7530_p5_mode_str, and the enum elements.

If PHY muxing is not detected, the default mode, GMAC5, will be used.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: refactor MT7530_PMCR_P()
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:09 +0000 (10:15 +0300)]
net: dsa: mt7530: refactor MT7530_PMCR_P()

The MT7530_PMCR_P() registers are on MT7530, MT7531, and the switch on the
MT7988 SoC. Rename the definition for them to MT753X_PMCR_P(). Bit 15 is
for MT7530 only. Add MT7530 prefix to the definition for bit 15.

Use GENMASK and FIELD_PREP for PMCR_IFG_XMIT().

Rename PMCR_TX_EN and PMCR_RX_EN to PMCR_MAC_TX_EN and PMCR_MAC_TX_EN to
follow the naming on the "MT7621 Giga Switch Programming Guide v0.3",
"MT7531 Reference Manual for Development Board v1.0", and "MT7988A Wi-Fi 7
Generation Router Platform: Datasheet (Open Version) v0.1" documents.

These documents show that PMCR_RX_FC_EN is at bit 5. Correct this along
with renaming it to PMCR_FORCE_RX_FC_EN, and the same for PMCR_TX_FC_EN.

Remove PMCR_SPEED_MASK which doesn't have a use.

Rename the force mode definitions for MT7531 to FORCE_MODE. Add MASK at the
end for the mask that includes all force mode definitions.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agonet: dsa: mt7530: disable EEE abilities on failure on MT7531 and MT7988
Arınç ÜNAL [Mon, 22 Apr 2024 07:15:08 +0000 (10:15 +0300)]
net: dsa: mt7530: disable EEE abilities on failure on MT7531 and MT7988

The MT7531_FORCE_EEE1G and MT7531_FORCE_EEE100 bits let the
PMCR_FORCE_EEE1G and PMCR_FORCE_EEE100 bits determine the 1G/100 EEE
abilities of the MAC. If MT7531_FORCE_EEE1G and MT7531_FORCE_EEE100 are
unset, the abilities are left to be determined by PHY auto polling.

The commit 40b5d2f15c09 ("net: dsa: mt7530: Add support for EEE features")
made it so that the PMCR_FORCE_EEE1G and PMCR_FORCE_EEE100 bits are set on
mt753x_phylink_mac_link_up(). But it did not set the MT7531_FORCE_EEE1G and
MT7531_FORCE_EEE100 bits. Because of this, the EEE abilities will be
determined by PHY auto polling, regardless of the result of phy_init_eee().

Define these bits and add them to the MT7531_FORCE_MODE mask which is set
in mt7531_setup_common(). With this, there won't be any EEE abilities set
when phy_init_eee() returns a negative value.

Thanks to Russell for explaining when phy_init_eee() could return a
negative value below.

Looking at phy_init_eee(), it could return a negative value when:

1. phydev->drv is NULL
2. if genphy_c45_eee_is_active() returns negative
3. if genphy_c45_eee_is_active() returns zero, it returns -EPROTONOSUPPORT
4. if phy_set_bits_mmd() fails (e.g. communication error with the PHY)

If we then look at genphy_c45_eee_is_active(), then:

genphy_c45_read_eee_adv() and genphy_c45_read_eee_lpa() propagate their
non-zero return values, otherwise this function returns zero or positive
integer.

If we then look at genphy_c45_read_eee_adv(), then a failure of
phy_read_mmd() would cause a negative value to be returned.

Looking at genphy_c45_read_eee_lpa(), the same is true.

So, it can be summarised as:

- phydev->drv is NULL
- there is a communication error accessing the PHY
- EEE is not active

otherwise, it returns zero on success.

If one wishes to determine whether an error occurred vs EEE not being
supported through negotiation for the negotiated speed, if it returns
-EPROTONOSUPPORT in the latter case. Other error codes mean either the
driver has been unloaded or communication error.

In conclusion, determining the EEE abilities by PHY auto polling shouldn't
result in having any EEE abilities enabled, when one of the last two
situations in the summary happens. And it seems that if phydev->drv is
NULL, there would be bigger problems with the device than a broken link. So
this is not a bugfix.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 months agoneighbour: fix neigh_master_filtered()
Eric Dumazet [Sun, 21 Apr 2024 18:57:53 +0000 (18:57 +0000)]
neighbour: fix neigh_master_filtered()

If we no longer hold RTNL, we must use netdev_master_upper_dev_get_rcu()
instead of netdev_master_upper_dev_get().

Fixes: ba0f78069423 ("neighbour: no longer hold RTNL in neigh_dump_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240421185753.1808077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'selftests-drv-net-support-testing-with-a-remote-system'
Jakub Kicinski [Tue, 23 Apr 2024 17:13:58 +0000 (10:13 -0700)]
Merge branch 'selftests-drv-net-support-testing-with-a-remote-system'

Jakub Kicinski says:

====================
selftests: drv-net: support testing with a remote system

Implement support for tests which require access to a remote system /
endpoint which can generate traffic.
This series concludes the "groundwork" for upstream driver tests.

I wanted to support the three models which came up in discussions:
 - SW testing with netdevsim
 - "local" testing with two ports on the same system in a loopback
 - "remote" testing via SSH
so there is a tiny bit of an abstraction which wraps up how "remote"
commands are executed. Otherwise hopefully there's nothing surprising.

I'm only adding a ping test. I had a bigger one written but I was
worried we'll get into discussing the details of the test itself
and how I chose to hack up netdevsim, instead of the test infra...
So that test will be a follow up :)

v4: https://lore.kernel.org/all/20240418233844.2762396-1-kuba@kernel.org
v3: https://lore.kernel.org/all/20240417231146.2435572-1-kuba@kernel.org
v2: https://lore.kernel.org/all/20240416004556.1618804-1-kuba@kernel.org
v1: https://lore.kernel.org/all/20240412233705.1066444-1-kuba@kernel.org
====================

Link: https://lore.kernel.org/r/20240420025237.3309296-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: add require_XYZ() helpers for validating env
Jakub Kicinski [Sat, 20 Apr 2024 02:52:37 +0000 (19:52 -0700)]
selftests: drv-net: add require_XYZ() helpers for validating env

Wrap typical checks like whether given command used by the test
is available in helpers.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: add a TCP ping test case (and useful helpers)
Jakub Kicinski [Sat, 20 Apr 2024 02:52:36 +0000 (19:52 -0700)]
selftests: drv-net: add a TCP ping test case (and useful helpers)

More complex tests often have to spawn a background process,
like a server which will respond to requests or tcpdump.

Add support for creating such processes using the with keyword:

  with bkg("my-daemon", ..):
     # my-daemon is alive in this block

My initial thought was to add this support to cmd() directly
but it runs the command in the constructor, so by the time
we __enter__ it's too late to make sure we used "background=True".

Second useful helper transplanted from net_helper.sh is
wait_port_listen().

The test itself uses socat, which insists on v6 addresses
being wrapped in [], it's not the only command which requires
this format, so add the wrapped address to env. The hope
is to save test code from checking if address is v6.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: net: support matching cases by name prefix
Jakub Kicinski [Sat, 20 Apr 2024 02:52:35 +0000 (19:52 -0700)]
selftests: net: support matching cases by name prefix

While writing tests with a lot more cases I got tired of having
to jump back and forth to add the name of the test to the ksft_run()
list. Most unittest frameworks do some name matching, e.g. assume
that functions with names starting with test_ are test cases.

Support similar flow in ksft_run(). Let the author list the desired
prefixes. globals() need to be passed explicitly, IDK how to work
around that.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: add a trivial ping test
Jakub Kicinski [Sat, 20 Apr 2024 02:52:34 +0000 (19:52 -0700)]
selftests: drv-net: add a trivial ping test

Add a very simple test for testing with a remote system.
Both IPv4 and IPv6 connectivity is optional, later change
will add checks to skip tests based on available addresses.

Using netdevsim:

 $ ./run_kselftest.sh -t drivers/net:ping.py
 TAP version 13
 1..1
 # timeout set to 45
 # selftests: drivers/net: ping.py
 # KTAP version 1
 # 1..2
 # ok 1 ping.test_v4
 # ok 2 ping.test_v6
 # # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 1 selftests: drivers/net: ping.py

Command line SSH:

 $ NETIF=virbr0 REMOTE_TYPE=ssh REMOTE_ARGS=root@192.168.122.123 \
    LOCAL_V4=192.168.122.1 REMOTE_V4=192.168.122.123 \
    ./tools/testing/selftests/drivers/net/ping.py
 KTAP version 1
 1..2
 ok 1 ping.test_v4
 ok 2 ping.test_v6 # SKIP Test requires IPv6 connectivity
 # Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0

Existing devices placed in netns (and using net.config):

 $ cat drivers/net/net.config
 NETIF=veth0
 REMOTE_TYPE=netns
 REMOTE_ARGS=red
 LOCAL_V4="192.168.1.1"
 REMOTE_V4="192.168.1.2"

 $ ./run_kselftest.sh -t drivers/net:ping.py
 TAP version 13
 1..1
 # timeout set to 45
 # selftests: drivers/net: ping.py
 # KTAP version 1
 # 1..2
 # ok 1 ping.test_v4
 # ok 2 ping.test_v6 # SKIP Test requires IPv6 connectivity
 # # Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: construct environment for running tests which require an endpoint
Jakub Kicinski [Sat, 20 Apr 2024 02:52:33 +0000 (19:52 -0700)]
selftests: drv-net: construct environment for running tests which require an endpoint

Nothing surprising here, hopefully. Wrap the variables from
the environment into a class or spawn a netdevsim based env
and pass it to the tests.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: factor out parsing of the env
Jakub Kicinski [Sat, 20 Apr 2024 02:52:32 +0000 (19:52 -0700)]
selftests: drv-net: factor out parsing of the env

The tests with a remote end will use a different class,
for clarity, but will also need to parse the env.
So factor parsing the env out to a function.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: define endpoint structures
Jakub Kicinski [Sat, 20 Apr 2024 02:52:31 +0000 (19:52 -0700)]
selftests: drv-net: define endpoint structures

Define the remote endpoint "model". To execute most meaningful device
driver tests we need to be able to communicate with a remote system,
and have it send traffic to the device under test.

Various test environments will have different requirements.

0) "Local" netdevsim-based testing can simply use net namespaces.
netdevsim supports connecting two devices now, to form a veth-like
construct.

1) Similarly on hosts with multiple NICs, the NICs may be connected
together with a loopback cable or internal device loopback.
One interface may be placed into separate netns, and tests
would proceed much like in the netdevsim case. Note that
the loopback config or the moving of one interface
into a netns is not expected to be part of selftest code.

2) Some systems may need to communicate with the remote endpoint
via SSH.

3) Last but not least environment may have its own custom communication
method.

Fundamentally we only need two operations:
 - run a command remotely
 - deploy a binary (if some tool we need is built as part of kselftests)

Wrap these two in a class. Use dynamic loading to load the Remote
class. This will allow very easy definition of other communication
methods without bothering upstream code base.

Stick to the "simple" / "no unnecessary abstractions" model for
referring to the remote endpoints. The host / remote object are
passed as an argument to the usual cmd() or ip() invocation.
For example:

 ip("link show", json=True, host=remote)

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240420025237.3309296-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'netdev-support-dumping-a-single-netdev-in-qstats'
Jakub Kicinski [Tue, 23 Apr 2024 17:09:52 +0000 (10:09 -0700)]
Merge branch 'netdev-support-dumping-a-single-netdev-in-qstats'

Jakub Kicinski says:

====================
netdev: support dumping a single netdev in qstats

I was writing a test for page pool which depended on qstats,
and got tired of having to filter dumps in user space.
Add support for dumping stats for a single netdev.

To get there we first need to add full support for extack
in dumps (and fix a dump error handling bug in YNL, sent
separately to the net tree).
====================

Link: https://lore.kernel.org/r/20240420023543.3300306-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: test dumping qstats per device
Jakub Kicinski [Sat, 20 Apr 2024 02:35:42 +0000 (19:35 -0700)]
selftests: drv-net: test dumping qstats per device

Add a test for dumping qstats device by device.

ksft framework grows a ksft_raises() helper, to be used
under with, which should be familiar to unittest users.

Link: https://lore.kernel.org/r/20240420023543.3300306-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonetlink: support all extack types in dumps
Jakub Kicinski [Sat, 20 Apr 2024 02:35:41 +0000 (19:35 -0700)]
netlink: support all extack types in dumps

Note that when this commit message refers to netlink dump
it only means the actual dumping part, the parsing / dump
start is handled by the same code as "doit".

Commit 4a19edb60d02 ("netlink: Pass extack to dump handlers")
added support for returning extack messages from dump handlers,
but left out other extack info, e.g. bad attribute.

This used to be fine because until YNL we had little practical
use for the machine readable attributes, and only messages were
used in practice.

YNL flips the preference 180 degrees, it's now much more useful
to point to a bad attr with NL_SET_BAD_ATTR() than type
an English message saying "attribute XYZ is $reason-why-bad".

Support all of extack. The fact that extack only gets added if
it fits remains unaddressed.

Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240420023543.3300306-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonetlink: move extack writing helpers
Jakub Kicinski [Sat, 20 Apr 2024 02:35:40 +0000 (19:35 -0700)]
netlink: move extack writing helpers

Next change will need them in netlink_dump_done(), pure move.

Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240420023543.3300306-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonetdev: support dumping a single netdev in qstats
Jakub Kicinski [Sat, 20 Apr 2024 02:35:39 +0000 (19:35 -0700)]
netdev: support dumping a single netdev in qstats

Having to filter the right ifindex in the tests is a bit tedious.
Add support for dumping qstats for a single ifindex.

Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240420023543.3300306-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoaf_unix: Don't access successor in unix_del_edges() during GC.
Kuniyuki Iwashima [Fri, 19 Apr 2024 23:51:02 +0000 (16:51 -0700)]
af_unix: Don't access successor in unix_del_edges() during GC.

syzbot reported use-after-free in unix_del_edges().  [0]

What the repro does is basically repeat the following quickly.

  1. pass a fd of an AF_UNIX socket to itself

    socketpair(AF_UNIX, SOCK_DGRAM, 0, [3, 4]) = 0
    sendmsg(3, {..., msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET,
                                   cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], ...}, 0) = 0

  2. pass other fds of AF_UNIX sockets to the socket above

    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, [5, 6]) = 0
    sendmsg(3, {..., msg_control=[{cmsg_len=48, cmsg_level=SOL_SOCKET,
                                   cmsg_type=SCM_RIGHTS, cmsg_data=[5, 6]}], ...}, 0) = 0

  3. close all sockets

Here, two skb are created, and every unix_edge->successor is the first
socket.  Then, __unix_gc() will garbage-collect the two skb:

  (a) free skb with self-referencing fd
  (b) free skb holding other sockets

After (a), the self-referencing socket will be scheduled to be freed
later by the delayed_fput() task.

syzbot repeated the sequences above (1. ~ 3.) quickly and triggered
the task concurrently while GC was running.

So, at (b), the socket was already freed, and accessing it was illegal.

unix_del_edges() accesses the receiver socket as edge->successor to
optimise GC.  However, we should not do it during GC.

Garbage-collecting sockets does not change the shape of the rest
of the graph, so we need not call unix_update_graph() to update
unix_graph_grouped when we purge skb.

However, if we clean up all loops in the unix_walk_scc_fast() path,
unix_graph_maybe_cyclic remains unchanged (true), and __unix_gc()
will call unix_walk_scc_fast() continuously even though there is no
socket to garbage-collect.

To keep that optimisation while fixing UAF, let's add the same
updating logic of unix_graph_maybe_cyclic in unix_walk_scc_fast()
as done in unix_walk_scc() and __unix_walk_scc().

Note that when unix_del_edges() is called from other places, the
receiver socket is always alive:

  - sendmsg: the successor's sk_refcnt is bumped by sock_hold()
             unix_find_other() for SOCK_DGRAM, connect() for SOCK_STREAM

  - recvmsg: the successor is the receiver, and its fd is alive

[0]:
BUG: KASAN: slab-use-after-free in unix_edge_successor net/unix/garbage.c:109 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edge net/unix/garbage.c:165 [inline]
BUG: KASAN: slab-use-after-free in unix_del_edges+0x148/0x630 net/unix/garbage.c:237
Read of size 8 at addr ffff888079c6e640 by task kworker/u8:6/1099

CPU: 0 PID: 1099 Comm: kworker/u8:6 Not tainted 6.9.0-rc4-next-20240418-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: events_unbound __unix_gc
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 unix_edge_successor net/unix/garbage.c:109 [inline]
 unix_del_edge net/unix/garbage.c:165 [inline]
 unix_del_edges+0x148/0x630 net/unix/garbage.c:237
 unix_destroy_fpl+0x59/0x210 net/unix/garbage.c:298
 unix_detach_fds net/unix/af_unix.c:1811 [inline]
 unix_destruct_scm+0x13e/0x210 net/unix/af_unix.c:1826
 skb_release_head_state+0x100/0x250 net/core/skbuff.c:1127
 skb_release_all net/core/skbuff.c:1138 [inline]
 __kfree_skb net/core/skbuff.c:1154 [inline]
 kfree_skb_reason+0x16d/0x3b0 net/core/skbuff.c:1190
 __skb_queue_purge_reason include/linux/skbuff.h:3251 [inline]
 __skb_queue_purge include/linux/skbuff.h:3256 [inline]
 __unix_gc+0x1732/0x1830 net/unix/garbage.c:575
 process_one_work kernel/workqueue.c:3218 [inline]
 process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
 kthread+0x2f0/0x390 kernel/kthread.c:389
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>

Allocated by task 14427:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 unpoison_slab_object mm/kasan/common.c:312 [inline]
 __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338
 kasan_slab_alloc include/linux/kasan.h:201 [inline]
 slab_post_alloc_hook mm/slub.c:3897 [inline]
 slab_alloc_node mm/slub.c:3957 [inline]
 kmem_cache_alloc_noprof+0x135/0x290 mm/slub.c:3964
 sk_prot_alloc+0x58/0x210 net/core/sock.c:2074
 sk_alloc+0x38/0x370 net/core/sock.c:2133
 unix_create1+0xb4/0x770
 unix_create+0x14e/0x200 net/unix/af_unix.c:1034
 __sock_create+0x490/0x920 net/socket.c:1571
 sock_create net/socket.c:1622 [inline]
 __sys_socketpair+0x33e/0x720 net/socket.c:1773
 __do_sys_socketpair net/socket.c:1822 [inline]
 __se_sys_socketpair net/socket.c:1819 [inline]
 __x64_sys_socketpair+0x9b/0xb0 net/socket.c:1819
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 1805:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
 kasan_slab_free include/linux/kasan.h:184 [inline]
 slab_free_hook mm/slub.c:2190 [inline]
 slab_free mm/slub.c:4393 [inline]
 kmem_cache_free+0x145/0x340 mm/slub.c:4468
 sk_prot_free net/core/sock.c:2114 [inline]
 __sk_destruct+0x467/0x5f0 net/core/sock.c:2208
 sock_put include/net/sock.h:1948 [inline]
 unix_release_sock+0xa8b/0xd20 net/unix/af_unix.c:665
 unix_release+0x91/0xc0 net/unix/af_unix.c:1049
 __sock_release net/socket.c:659 [inline]
 sock_close+0xbc/0x240 net/socket.c:1421
 __fput+0x406/0x8b0 fs/file_table.c:422
 delayed_fput+0x59/0x80 fs/file_table.c:445
 process_one_work kernel/workqueue.c:3218 [inline]
 process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
 kthread+0x2f0/0x390 kernel/kthread.c:389
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

The buggy address belongs to the object at ffff888079c6e000
 which belongs to the cache UNIX of size 1920
The buggy address is located 1600 bytes inside of
 freed 1920-byte region [ffff888079c6e000ffff888079c6e780)

Reported-by: syzbot+f3f3eef1d2100200e593@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f3f3eef1d2100200e593
Fixes: 77e5593aebba ("af_unix: Skip GC if no cycle exists.")
Fixes: fd86344823b5 ("af_unix: Try not to hold unix_gc_lock during accept().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240419235102.31707-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agoMerge branch 'net-ipa-eight-simple-cleanups'
Paolo Abeni [Tue, 23 Apr 2024 11:10:16 +0000 (13:10 +0200)]
Merge branch 'net-ipa-eight-simple-cleanups'

Alex Elder says:

====================
net: ipa: eight simple cleanups

This series contains a mix of cleanups, some dating back to
December, 2022.  Version 1 was based on an older version of
net-next/main; this version has simply been rebased.

The first two make it so the IPA SUSPEND interrupt only gets enabled
when necessary.  That make it possible in the third patch to call
device_init_wakeup() during an earlier phase of initialization, and
remove two functions.

The next patch removes IPA register definitions that are never used.
The fifth patch makes ipa_table_hash_support() a real function, so
the IPA structure only needs to be declared rather than defined when
that file is parsed.

The sixth patch fixes improper argument names in two function
declarations.  The seventh removes the declaration for a function
that does not exist, and makes ipa_cmd_init() actually get called.
And the last one eliminates ipa_version_supported(), in favor of
just deciding that if a device is probed because its compatible
matches, that device is assumed to be supported.
====================

Link: https://lore.kernel.org/r/20240419151800.2168903-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: kill ipa_version_supported()
Alex Elder [Fri, 19 Apr 2024 15:18:00 +0000 (10:18 -0500)]
net: ipa: kill ipa_version_supported()

The only place ipa_version_supported() is called is in the probe
function.  The version comes from the match data.  Rather than
checking the version validity separately, just consider anything
that has match data to be supported.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: fix two minor ipa_cmd problems
Alex Elder [Fri, 19 Apr 2024 15:17:59 +0000 (10:17 -0500)]
net: ipa: fix two minor ipa_cmd problems

In "ipa_cmd.h", ipa_cmd_data_valid() is declared, but that function
does not exist.  So delete that declaration.

Also, for some reason ipa_cmd_init() never gets called.  It isn't
really critical--it just validates that some memory offsets and a
size can be represented in some register fields, and they won't fail
with current data.  Regardless, call the function in ipa_probe().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: fix two bogus argument names
Alex Elder [Fri, 19 Apr 2024 15:17:58 +0000 (10:17 -0500)]
net: ipa: fix two bogus argument names

In "ipa_endpoint.h", two function declarations have bogus argument
names.  Fix these.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: make ipa_table_hash_support() a real function
Alex Elder [Fri, 19 Apr 2024 15:17:57 +0000 (10:17 -0500)]
net: ipa: make ipa_table_hash_support() a real function

With the exception of ipa_table_hash_support(), nothing defined in
"ipa_table.h" requires the full definition of the IPA structure.

Change that function to be a "real" function rather than an inline,
to avoid requring the IPA structure to be defined.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: remove unneeded FILT_ROUT_HASH_EN definitions
Alex Elder [Fri, 19 Apr 2024 15:17:56 +0000 (10:17 -0500)]
net: ipa: remove unneeded FILT_ROUT_HASH_EN definitions

The FILT_ROUT_HASH_EN register is only used for IPA v4.2.  There,
routing and filter table hashing are not supported, and so the
register must be written to disable the feature.  No other version
uses this register, so its definition can be removed.  If we need to
use these some day (for example, explicitly enable the feature) this
commit can be reverted.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: call device_init_wakeup() earlier
Alex Elder [Fri, 19 Apr 2024 15:17:55 +0000 (10:17 -0500)]
net: ipa: call device_init_wakeup() earlier

Currently, enabling wakeup for the IPA device doesn't occur until
the setup phase of initialization (in ipa_power_setup()).

There is no need to delay doing that, however.  We can conveniently
do it during the config phase, in ipa_interrupt_config(), where we
enable power management wakeup mode for the IPA interrupt.

Moving the device_init_wakeup() out of ipa_power_setup() leaves that
function empty, so it can just be eliminated.

Similarly, rearrange all of the matching inverse calls, disabling
device wakeup in ipa_interrupt_deconfig() and removing that function
as well.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: only enable the SUSPEND IPA interrupt when needed
Alex Elder [Fri, 19 Apr 2024 15:17:54 +0000 (10:17 -0500)]
net: ipa: only enable the SUSPEND IPA interrupt when needed

Only enable the SUSPEND IPA interrupt type when at least one
endpoint has that interrupt enabled.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ipa: maintain bitmap of suspend-enabled endpoints
Alex Elder [Fri, 19 Apr 2024 15:17:53 +0000 (10:17 -0500)]
net: ipa: maintain bitmap of suspend-enabled endpoints

Keep track of which endpoints have the SUSPEND IPA interrupt enabled
in a variable-length bitmap.  This will be used in the next patch to
allow the SUSPEND interrupt type to be disabled except when needed.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agoMerge branch 'net-stmmac-fix-mac-capabilities-procedure'
Paolo Abeni [Tue, 23 Apr 2024 10:25:39 +0000 (12:25 +0200)]
Merge branch 'net-stmmac-fix-mac-capabilities-procedure'

Serge Semin says:

====================
net: stmmac: Fix MAC-capabilities procedure

The series got born as a result of the discussions around the recent
Yanteng' series adding the Loongson LS7A1000, LS2K1000, LS7A2000, LS2K2000
MACs support:
Link: https://lore.kernel.org/netdev/fu3f6uoakylnb6eijllakeu5i4okcyqq7sfafhp5efaocbsrwe@w74xe7gb6x7p
In particular the Yanteng' patchset needed to implement the Loongson
MAC-specific constraints applied to the link speed and link duplex mode.
As a result of the discussion with Russel the next preliminary patch was
born:
Link: https://lore.kernel.org/netdev/df31e8bcf74b3b4ddb7ddf5a1c371390f16a2ad5.1712917541.git.siyanteng@loongson.cn
The patch above was a temporal solution utilized by Yanteng for further
developments and to move on with the on-going review. This patchset is a
refactored version of that single patch with formatting required for the
fixes patches.

The main part of the series has already been merged in on v1 stage. The
leftover is the cleanup patches which rename
stmmac_ops::phylink_get_caps() callback to stmmac_ops::update_caps() and
move the MAC-capabilities init/re-init to the phylink MAC-capabilities
getter.

Link: https://lore.kernel.org/netdev/20240412180340.7965-1-fancer.lancer@gmail.com/
Changelog v2:
- Add a new patch (Romain):
  [PATCH net-next v2 1/2] net: stmmac: Rename phylink_get_caps() callback to update_caps()
- Resubmit the leftover patches to net-next tree (Paolo).

Link: https://lore.kernel.org/netdev/20240417140013.12575-1-fancer.lancer@gmail.com/
Changelog v3:
- Just resubmit (Jakub).

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
====================

Link: https://lore.kernel.org/r/20240419090357.5547-1-fancer.lancer@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: stmmac: Move MAC caps init to phylink MAC caps getter
Serge Semin [Fri, 19 Apr 2024 09:03:06 +0000 (12:03 +0300)]
net: stmmac: Move MAC caps init to phylink MAC caps getter

After a set of recent fixes the stmmac_phy_setup() and
stmmac_reinit_queues() methods have turned to having some duplicated code.
Let's get rid from the duplication by moving the MAC-capabilities
initialization to the PHYLINK MAC-capabilities getter. The getter is
called during each network device interface open/close cycle. So the
MAC-capabilities will be initialized in generic device open procedure and
in case of the Tx/Rx queues re-initialization as the original code
semantics implies.

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: stmmac: Rename phylink_get_caps() callback to update_caps()
Serge Semin [Fri, 19 Apr 2024 09:03:05 +0000 (12:03 +0300)]
net: stmmac: Rename phylink_get_caps() callback to update_caps()

Since recent commits the stmmac_ops::phylink_get_caps() callback has no
longer been responsible for the phylink MAC capabilities getting, but
merely updates the MAC capabilities in the mac_device_info::link::caps
field. Rename the callback to comply with the what the method does now.

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agoMerge branch 'enable-rx-hw-timestamp-for-ptp-packets-using-cpts-fifo'
Paolo Abeni [Tue, 23 Apr 2024 10:07:25 +0000 (12:07 +0200)]
Merge branch 'enable-rx-hw-timestamp-for-ptp-packets-using-cpts-fifo'

Chintan Vankar says:

====================
Enable RX HW timestamp for PTP packets using CPTS FIFO

The CPSW offers two mechanisms for communicating packet ingress timestamp
information to the host.

The first mechanism is via the CPTS Event FIFO which records timestamp
when triggered by certain events. One such event is the reception of an
Ethernet packet with a specified EtherType field. This is used to capture
ingress timestamps for PTP packets. With this mechanism the host must
read the timestamp (from the CPTS FIFO) separately from the packet payload
which is delivered via DMA.

In the second mechanism of timestamping, CPSW driver enables hardware
timestamping for all received packets by setting the TSTAMP_EN bit in
CPTS_CONTROL register, which directs the CPTS module to timestamp all
received packets, followed by passing timestamp via DMA descriptors.
This mechanism is responsible for triggering errata i2401:
"CPSW: Host Timestamps Cause CPSW Port to Lock up."

The errata affects all K3 SoCs. Link to errata for AM64x:
https://www.ti.com/lit/er/sprz457h/sprz457h.pdf

As a workaround we can use first mechanism to timestamp received
packets.
====================

Link: https://lore.kernel.org/r/20240419082626.57225-1-c-vankar@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ethernet: ti: am65-cpsw/ethtool: Enable RX HW timestamp only for PTP packets
Chintan Vankar [Fri, 19 Apr 2024 08:26:26 +0000 (13:56 +0530)]
net: ethernet: ti: am65-cpsw/ethtool: Enable RX HW timestamp only for PTP packets

In the current mechanism of timestamping, am65-cpsw-nuss driver
enables hardware timestamping for all received packets by setting
the TSTAMP_EN bit in CPTS_CONTROL register, which directs the CPTS
module to timestamp all received packets, followed by passing
timestamp via DMA descriptors. This mechanism causes CPSW Port to
Lock up.

To prevent port lock up, don't enable rx packet timestamping by
setting TSTAMP_EN bit in CPTS_CONTROL register. The workaround for
timestamping received packets is to utilize the CPTS Event FIFO
that records timestamps corresponding to certain events. The CPTS
module is configured to generate timestamps for Multicast Ethernet,
UDP/IPv4 and UDP/IPv6 PTP packets.

Update supported hwtstamp_rx_filters values for CPSW's timestamping
capability.

Fixes: b1f66a5bee07 ("net: ethernet: ti: am65-cpsw-nuss: enable packet timestamping support")
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ethernet: ti: am65-cpts: Enable RX HW timestamp for PTP packets using CPTS FIFO
Chintan Vankar [Fri, 19 Apr 2024 08:26:25 +0000 (13:56 +0530)]
net: ethernet: ti: am65-cpts: Enable RX HW timestamp for PTP packets using CPTS FIFO

Add a new function "am65_cpts_rx_timestamp()" which checks for PTP
packets from header and timestamps them.

Add another function "am65_cpts_find_rx_ts()" which finds CPTS FIFO
Event to get the timestamp of received PTP packet.

Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agoMerge branch 'read-phy-address-of-switch-from-device-tree-on-mt7530-dsa-subdriver'
Paolo Abeni [Tue, 23 Apr 2024 08:32:43 +0000 (10:32 +0200)]
Merge branch 'read-phy-address-of-switch-from-device-tree-on-mt7530-dsa-subdriver'

Arınç ÜNAL says:

====================
Read PHY address of switch from device tree on MT7530 DSA subdriver

This patch series makes the driver read the PHY address the switch listens
on from the device tree which, in result, brings support for MT7530
switches listening on a different PHY address than 31. And the patch series
simplifies the core operations.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
====================

Link: https://lore.kernel.org/r/20240418-b4-for-netnext-mt7530-phy-addr-from-dt-and-simplify-core-ops-v3-0-3b5fb249b004@arinc9.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: dsa: mt7530: simplify core operations
Arınç ÜNAL [Thu, 18 Apr 2024 05:35:31 +0000 (08:35 +0300)]
net: dsa: mt7530: simplify core operations

The core_rmw() function calls core_read_mmd_indirect() to read the
requested register, and then calls core_write_mmd_indirect() to write the
requested value to the register. Because Clause 22 is used to access Clause
45 registers, some operations on core_write_mmd_indirect() are
unnecessarily run. Get rid of core_read_mmd_indirect() and
core_write_mmd_indirect(), and run only the necessary operations on
core_write() and core_rmw().

Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: dsa: mt7530-mdio: read PHY address of switch from device tree
Arınç ÜNAL [Thu, 18 Apr 2024 05:35:30 +0000 (08:35 +0300)]
net: dsa: mt7530-mdio: read PHY address of switch from device tree

Read the PHY address the switch listens on from the reg property of the
switch node on the device tree. This change brings support for MT7530
switches on boards with such bootstrapping configuration where the switch
listens on a different PHY address than the hardcoded PHY address on the
driver, 31.

As described on the "MT7621 Programming Guide v0.4" document, the MT7530
switch and its PHYs can be configured to listen on the range of 7-12,
15-20, 23-28, and 31 and 0-4 PHY addresses.

There are operations where the switch PHY registers are used. For the PHY
address of the control PHY, transform the MT753X_CTRL_PHY_ADDR constant
into a macro and use it. The PHY address for the control PHY is 0 when the
switch listens on 31. In any other case, it is one greater than the PHY
address the switch listens on.

Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 months agonet: ethernet: mtk_eth_soc: flower: validate control flags
Asbjørn Sloth Tønnesen [Thu, 18 Apr 2024 16:18:15 +0000 (16:18 +0000)]
net: ethernet: mtk_eth_soc: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240418161821.189263-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agodpaa2-switch: flower: validate control flags
Asbjørn Sloth Tønnesen [Thu, 18 Apr 2024 16:18:01 +0000 (16:18 +0000)]
dpaa2-switch: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_match_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_match_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Only compile-tested.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20240418161802.189247-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agocxgb4: flower: validate control flags
Asbjørn Sloth Tønnesen [Thu, 18 Apr 2024 16:17:49 +0000 (16:17 +0000)]
cxgb4: flower: validate control flags

This driver currently doesn't support any control flags.

Use flow_rule_match_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.

In case any control flags are masked, flow_rule_match_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.

Only compile-tested.

Only compile tested, no hardware available.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240418161751.189226-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: openvswitch: Check vport netdev name
Jun Gu [Fri, 19 Apr 2024 06:14:25 +0000 (14:14 +0800)]
net: openvswitch: Check vport netdev name

Ensure that the provided netdev name is not one of its aliases to
prevent unnecessary creation and destruction of the vport by
ovs-vswitchd.

Signed-off-by: Jun Gu <jun.gu@easystack.cn>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20240419061425.132723-1-jun.gu@easystack.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'netlink-add-nftables-spec-w-multi-messages'
Jakub Kicinski [Tue, 23 Apr 2024 00:20:45 +0000 (17:20 -0700)]
Merge branch 'netlink-add-nftables-spec-w-multi-messages'

Donald Hunter says:

====================
netlink: Add nftables spec w/ multi messages

This series adds a ynl spec for nftables and extends ynl with a --multi
command line option that makes it possible to send transactional batches
for nftables.

This series includes a patch for nfnetlink which adds ACK processing for
batch begin/end messages. If you'd prefer that to be sent separately to
nf-next then I can do so, but I included it here so that it gets seen in
context.

An example of usage is:

./tools/net/ynl/cli.py \
 --spec Documentation/netlink/specs/nftables.yaml \
 --multi batch-begin '{"res-id": 10}' \
 --multi newtable '{"name": "test", "nfgen-family": 1}' \
 --multi newchain '{"name": "chain", "table": "test", "nfgen-family": 1}' \
 --multi batch-end '{"res-id": 10}'
[None, None, None, None]

It can also be used for bundling get requests:

./tools/net/ynl/cli.py \
 --spec Documentation/netlink/specs/nftables.yaml \
 --multi gettable '{"name": "test", "nfgen-family": 1}' \
 --multi getchain '{"name": "chain", "table": "test", "nfgen-family": 1}' \
 --output-json
[{"name": "test", "use": 1, "handle": 1, "flags": [],
 "nfgen-family": 1, "version": 0, "res-id": 2},
 {"table": "test", "name": "chain", "handle": 1, "use": 0,
 "nfgen-family": 1, "version": 0, "res-id": 2}]

There are 2 issues that may be worth resolving:

 - ynl reports errors by raising an NlError exception so only the first
   error gets reported. This could be changed to add errors to the list
   of responses so that multiple errors could be reported.

 - If any message does not get a response (e.g. batch-begin w/o patch 2)
   then ynl waits indefinitely. A recv timeout could be added which
   would allow ynl to terminate.
====================

Link: https://lore.kernel.org/r/20240418104737.77914-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonetfilter: nfnetlink: Handle ACK flags for batch messages
Donald Hunter [Thu, 18 Apr 2024 10:47:37 +0000 (11:47 +0100)]
netfilter: nfnetlink: Handle ACK flags for batch messages

The NLM_F_ACK flag is ignored for nfnetlink batch begin and end
messages. This is a problem for ynl which wants to receive an ack for
every message it sends, not just the commands in between the begin/end
messages.

Add processing for ACKs for begin/end messages and provide responses
when requested.

I have checked that iproute2, pyroute2 and systemd are unaffected by
this change since none of them use NLM_F_ACK for batch begin/end.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agotools/net/ynl: Add multi message support to ynl
Donald Hunter [Thu, 18 Apr 2024 10:47:36 +0000 (11:47 +0100)]
tools/net/ynl: Add multi message support to ynl

Add a "--multi <do-op> <json>" command line to ynl that makes it
possible to add several operations to a single netlink request payload.
The --multi command line option is repeated for each operation.

This is used by the nftables family for transaction batches. For
example:

./tools/net/ynl/cli.py \
 --spec Documentation/netlink/specs/nftables.yaml \
 --multi batch-begin '{"res-id": 10}' \
 --multi newtable '{"name": "test", "nfgen-family": 1}' \
 --multi newchain '{"name": "chain", "table": "test", "nfgen-family": 1}' \
 --multi batch-end '{"res-id": 10}'
[None, None, None, None]

It can also be used for bundling get requests:

./tools/net/ynl/cli.py \
 --spec Documentation/netlink/specs/nftables.yaml \
 --multi gettable '{"name": "test", "nfgen-family": 1}' \
 --multi getchain '{"name": "chain", "table": "test", "nfgen-family": 1}' \
 --output-json
[{"name": "test", "use": 1, "handle": 1, "flags": [],
 "nfgen-family": 1, "version": 0, "res-id": 2},
 {"table": "test", "name": "chain", "handle": 1, "use": 0,
 "nfgen-family": 1, "version": 0, "res-id": 2}]

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-4-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agotools/net/ynl: Fix extack decoding for directional ops
Donald Hunter [Thu, 18 Apr 2024 10:47:35 +0000 (11:47 +0100)]
tools/net/ynl: Fix extack decoding for directional ops

NetlinkProtocol.decode() was looking up ops by response value which breaks
when it is used for extack decoding of directional ops. Instead, pass
the op to decode().

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-3-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agodoc/netlink/specs: Add draft nftables spec
Donald Hunter [Thu, 18 Apr 2024 10:47:34 +0000 (11:47 +0100)]
doc/netlink/specs: Add draft nftables spec

Add a spec for nftables that has nearly complete coverage of the ops,
but limited coverage of rule types and subexpressions.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-2-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'for-uring-ubufops' into HEAD
Jakub Kicinski [Mon, 22 Apr 2024 23:33:10 +0000 (16:33 -0700)]
Merge branch 'for-uring-ubufops' into HEAD

Pavel Begunkov says:

====================
implement io_uring notification (ubuf_info) stacking (net part)

To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.

The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.

liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.

bytes  | before | after (Kqps)
1200   | 195    | 1023
4000   | 193    | 1386
8000   | 154    | 1058
====================

Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: add callback for setting a ubuf_info to skb
Pavel Begunkov [Fri, 19 Apr 2024 11:08:40 +0000 (12:08 +0100)]
net: add callback for setting a ubuf_info to skb

At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.

Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: extend ubuf_info callback to ops structure
Pavel Begunkov [Fri, 19 Apr 2024 11:08:39 +0000 (12:08 +0100)]
net: extend ubuf_info callback to ops structure

We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from a more smarter
io_uring notification management introduced in next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'tcp-avoid-sending-too-small-packets'
Jakub Kicinski [Mon, 22 Apr 2024 21:25:32 +0000 (14:25 -0700)]
Merge branch 'tcp-avoid-sending-too-small-packets'

Eric Dumazet says:

====================
tcp: avoid sending too small packets

tcp_sendmsg() cooks 'large' skbs, that are later split
if needed from tcp_write_xmit().

After a split, the leftover skb size is smaller than the optimal
size, and this causes a performance drop.

In this series, tcp_grow_skb() helper is added to shift
payload from the second skb in the write queue to the first
skb to always send optimal sized skbs.

This increases TSO efficiency, and decreases number of ACK
packets.
====================

Link: https://lore.kernel.org/r/20240418214600.1291486-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agotcp: try to send bigger TSO packets
Eric Dumazet [Thu, 18 Apr 2024 21:46:00 +0000 (21:46 +0000)]
tcp: try to send bigger TSO packets

While investigating TCP performance, I found that TCP would
sometimes send big skbs followed by a single MSS skb,
in a 'locked' pattern.

For instance, BIG TCP is enabled, MSS is set to have 4096 bytes
of payload per segment. gso_max_size is set to 181000.

This means that an optimal TCP packet size should contain
44 * 4096 = 180224 bytes of payload,

However, I was seeing packets sizes interleaved in this pattern:

172032, 8192, 172032, 8192, 172032, 8192, <repeat>

tcp_tso_should_defer() heuristic is defeated, because after a split of
a packet in write queue for whatever reason (this might be a too small
CWND or a small enough pacing_rate),
the leftover packet in the queue is smaller than the optimal size.

It is time to try to make 'leftover packets' bigger so that
tcp_tso_should_defer() can give its full potential.

After this patch, we can see the following output:

14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980

With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
of TSO packets (and ACK packets) of about 30 %.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agotcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
Eric Dumazet [Thu, 18 Apr 2024 21:45:59 +0000 (21:45 +0000)]
tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()

tcp_write_xmit() calls tcp_init_tso_segs()
to set gso_size and gso_segs on the packet.

tcp_init_tso_segs() requires the stack to maintain
an up to date tcp_skb_pcount(), and this makes sense
for packets in rtx queue. Not so much for packets
still in the write queue.

In the following patch, we don't want to deal with
tcp_skb_pcount() when moving payload from 2nd
skb to 1st skb in the write queue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agotcp: remove dubious FIN exception from tcp_cwnd_test()
Eric Dumazet [Thu, 18 Apr 2024 21:45:58 +0000 (21:45 +0000)]
tcp: remove dubious FIN exception from tcp_cwnd_test()

tcp_cwnd_test() has a special handing for the last packet in
the write queue if it is smaller than one MSS and has the FIN flag.

This is in violation of TCP RFC, and seems quite dubious.

This packet can be sent only if the current CWND is bigger
than the number of packets in flight.

Making tcp_cwnd_test() result independent of the first skb
in the write queue is needed for the last patch of the series.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>