linux-block.git
6 months agonet: Re-use and set mono_delivery_time bit for userspace tstamp packets
Abhishek Chauhan [Fri, 1 Mar 2024 20:13:48 +0000 (12:13 -0800)]
net: Re-use and set mono_delivery_time bit for userspace tstamp packets

Bridge driver today has no support to forward the userspace timestamp
packets and ends up resetting the timestamp. ETF qdisc checks the
packet coming from userspace and encounters to be 0 thereby dropping
time sensitive packets. These changes will allow userspace timestamps
packets to be forwarded from the bridge to NIC drivers.

Setting the same bit (mono_delivery_time) to avoid dropping of
userspace tstamp packets in the forwarding path.

Existing functionality of mono_delivery_time remains unaltered here,
instead just extended with userspace tstamp support for bridge
forwarding path.

Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240301201348.2815102-1-quic_abchauha@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch 'net-gro-cleanups-and-fast-path-refinement'
Paolo Abeni [Tue, 5 Mar 2024 12:30:15 +0000 (13:30 +0100)]
Merge branch 'net-gro-cleanups-and-fast-path-refinement'

Eric Dumazet says:

====================
net: gro: cleanups and fast path refinement

Current GRO stack has a 'fast path' for a subset of drivers,
users of napi_frags_skb().

With TCP zerocopy/direct uses, header split at receive is becoming
more important, and GRO fast path is disabled.

This series makes GRO (a bit) more efficient for almost all use cases.
====================

Link: https://lore.kernel.org/r/20240301193740.3436871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agotcp: gro: micro optimizations in tcp[4]_gro_complete()
Eric Dumazet [Fri, 1 Mar 2024 19:37:40 +0000 (19:37 +0000)]
tcp: gro: micro optimizations in tcp[4]_gro_complete()

In tcp_gro_complete() :

Moving the skb->inner_transport_header setting
allows the compiler to reuse the previously loaded value
of skb->transport_header.

Caching skb_shinfo() avoids duplications as well.

In tcp4_gro_complete(), doing a single change on
skb_shinfo(skb)->gso_type also generates better code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: gro: enable fast path for more cases
Eric Dumazet [Fri, 1 Mar 2024 19:37:39 +0000 (19:37 +0000)]
net: gro: enable fast path for more cases

Currently the so-called GRO fast path is only enabled for
napi_frags_skb() callers.

After the prior patch, we no longer have to clear frag0 whenever
we pulled bytes to skb->head.

We therefore can initialize frag0 to skb->data so that GRO
fast path can be used in the following additional cases:

- Drivers using header split (populating skb->data with headers,
  and having payload in one or more page fragments).

- Drivers not using any page frag (entire packet is in skb->data)

Add a likely() in skb_gro_may_pull() to help the compiler
to generate better code if possible.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: gro: change skb_gro_network_header()
Eric Dumazet [Fri, 1 Mar 2024 19:37:38 +0000 (19:37 +0000)]
net: gro: change skb_gro_network_header()

Change skb_gro_network_header() to accept a const sk_buff
and to no longer check if frag0 is NULL or not.

This allows to remove skb_gro_frag0_invalidate()
which is seen in profiles when header-split is enabled.

sk_buff parameter is constified for skb_gro_header_fast(),
inet_gro_compute_pseudo() and ip6_gro_compute_pseudo().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: gro: rename skb_gro_header_hard()
Eric Dumazet [Fri, 1 Mar 2024 19:37:37 +0000 (19:37 +0000)]
net: gro: rename skb_gro_header_hard()

skb_gro_header_hard() is renamed to skb_gro_may_pull() to match
the convention used by common helpers like pskb_may_pull().

This means the condition is inverted:

if (skb_gro_header_hard(skb, hlen))
slow_path();

becomes:

if (!skb_gro_may_pull(skb, hlen))
slow_path();

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch 'mt7530-dsa-subdriver-improvements-act-iii'
Paolo Abeni [Tue, 5 Mar 2024 11:23:45 +0000 (12:23 +0100)]
Merge branch 'mt7530-dsa-subdriver-improvements-act-iii'

 says:

====================
MT7530 DSA Subdriver Improvements Act III

This is the third patch series with the goal of simplifying the MT7530 DSA
subdriver and improving support for MT7530, MT7531, and the switch on the
MT7988 SoC.

I have done a simple ping test to confirm basic communication on all switch
ports on MCM and standalone MT7530, and MT7531 switch with this patch
series applied.

MT7621 Unielec, MCM MT7530:

rgmii-only-gmac0-mt7621-unielec-u7621-06-16m.dtb
gmac0-and-gmac1-mt7621-unielec-u7621-06-16m.dtb

tftpboot 0x80008000 mips-uzImage.bin; tftpboot 0x83000000 mips-rootfs.cpio.uboot; tftpboot 0x83f00000 $dtb; bootm 0x80008000 0x83000000 0x83f00000

MT7622 Bananapi, MT7531:

gmac0-and-gmac1-mt7622-bananapi-bpi-r64.dtb

tftpboot 0x40000000 arm64-Image; tftpboot 0x45000000 arm64-rootfs.cpio.uboot; tftpboot 0x4a000000 $dtb; booti 0x40000000 0x45000000 0x4a000000

MT7623 Bananapi, standalone MT7530:

rgmii-only-gmac0-mt7623n-bananapi-bpi-r2.dtb
gmac0-and-gmac1-mt7623n-bananapi-bpi-r2.dtb

tftpboot 0x80008000 arm-zImage; tftpboot 0x83000000 arm-rootfs.cpio.uboot; tftpboot 0x83f00000 $dtb; bootz 0x80008000 0x83000000 0x83f00000

This patch series is the continuation of the patch series linked below.

https://lore.kernel.org/r/20230522121532.86610-1-arinc.unal@arinc9.com

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
---
Changes in v3:
- Patch 8
  - Explain properly the behaviour of setting link down on all ports at
    setup.
  - Split the changes for simplifying the link settings operations out to
    another patch.
- Link to v2: https://lore.kernel.org/r/20240216-for-netnext-mt7530-improvements-3-v2-0-094cae3ff23b@arinc9.com

Changes in v2:
- Patch 8
  - Use a single mt7530_rmw() instead of two mt7530_clear() and
    mt7530_set() commands.
- Link to v1: https://lore.kernel.org/r/20240208-for-netnext-mt7530-improvements-3-v1-0-d7c1cfd502ca@arinc9.com
====================

Link: https://lore.kernel.org/r/20240301-for-netnext-mt7530-improvements-3-v3-0-449f4f166454@arinc9.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: simplify link operations
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:05 +0000 (12:43 +0200)]
net: dsa: mt7530: simplify link operations

The "MT7621 Giga Switch Programming Guide v0.3", "MT7531 Reference Manual
for Development Board v1.0", and "MT7988A Wi-Fi 7 Generation Router
Platform: Datasheet (Open Version) v0.1" documents show that these bits are
enabled at reset:

PMCR_IFG_XMIT(1) (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_MAC_MODE (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_TX_EN
PMCR_RX_EN
PMCR_BACKOFF_EN (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_BACKPR_EN (not part of PMCR_LINK_SETTINGS_MASK)
PMCR_TX_FC_EN
PMCR_RX_FC_EN

These bits also don't exist on the MT7530_PMCR_P(6) register of the switch
on the MT7988 SoC:

PMCR_IFG_XMIT()
PMCR_MAC_MODE
PMCR_BACKOFF_EN
PMCR_BACKPR_EN

Remove the setting of the bits not part of PMCR_LINK_SETTINGS_MASK on
phylink_mac_config as they're already set.

The bit for setting the port on force mode is already done on
mt7530_setup() and mt7531_setup_common(). So get rid of
PMCR_FORCE_MODE_ID() which helped determine which bit to use for the switch
model.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: sort link settings ops and force link down on all ports
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:04 +0000 (12:43 +0200)]
net: dsa: mt7530: sort link settings ops and force link down on all ports

port_enable and port_disable clears the link settings. Move that to
mt7530_setup() and mt7531_setup_common() which set up the switches. This
way, the link settings are cleared on all ports at setup, and then only
once with phylink_mac_link_down() when a link goes down.

Enable force mode at setup to apply the force part of the link settings.
This ensures that disabled ports will have their link down.

Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: put initialising PCS devices code back to original order
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:03 +0000 (12:43 +0200)]
net: dsa: mt7530: put initialising PCS devices code back to original order

The commit fae463084032 ("net: dsa: mt753x: fix pcs conversion regression")
fixes regression caused by cpu_port_config manually calling phylink
operations. cpu_port_config was deemed useless and was removed. Therefore,
put initialising PCS devices code back to its original order.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: get rid of mt753x_mac_config()
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:02 +0000 (12:43 +0200)]
net: dsa: mt7530: get rid of mt753x_mac_config()

There is no need for a separate function to call
priv->info->mac_port_config(). Call it from mt753x_phylink_mac_config()
instead and remove mt753x_mac_config().

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: get rid of priv->info->cpu_port_config()
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:01 +0000 (12:43 +0200)]
net: dsa: mt7530: get rid of priv->info->cpu_port_config()

priv->info->cpu_port_config() is used for MT7531 and the switch on the
MT7988 SoC. It sets up the ports described as a CPU port earlier than the
phylink code path would do.

This function is useless as:
- Configuring the MACs can be done from the phylink_mac_config code path
  instead.
- All the link configuration it does on the CPU ports are later undone with
  the port_enable, phylink_mac_config, and then phylink_mac_link_up code
  path [1].

priv->p5_interface and priv->p6_interface were being used to prevent
configuring the MACs from the phylink_mac_config code path. Remove them now
that they hold no purpose.

Remove priv->info->cpu_port_config(). On mt753x_phylink_mac_config, switch
to if statements to simplify the code.

Remove the overwriting of the speed and duplex interfaces for certain
interface modes. Phylink already provides the speed and duplex variables
with proper values. Phylink already sets the max speed of TRGMII to
SPEED_1000. Add SPEED_2500 for PHY_INTERFACE_MODE_2500BASEX to where the
speed and EEE bits are set instead.

On the switch on the MT7988 SoC, PHY_INTERFACE_MODE_INTERNAL is being used
to describe the interface mode of the 10G MAC, which is of port 6. On
mt7988_cpu_port_config() PMCR_FORCE_SPEED_1000 was set via the
PMCR_CPU_PORT_SETTING() mask. Add SPEED_10000 case to where the speed bits
are set to cover this. No need to add it to where the EEE bits are set as
the "MT7988A Wi-Fi 7 Generation Router Platform: Datasheet (Open Version)
v0.1" document shows that these bits don't exist on the MT7530_PMCR_P(6)
register.

Remove the definition of PMCR_CPU_PORT_SETTING() now that it holds no
purpose.

Change mt753x_cpu_port_enable() to void now that there're no error cases
left.

Link: https://lore.kernel.org/netdev/ZHy2jQLesdYFMQtO@shell.armlinux.org.uk/
Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: get rid of useless error returns on phylink code path
Arınç ÜNAL [Fri, 1 Mar 2024 10:43:00 +0000 (12:43 +0200)]
net: dsa: mt7530: get rid of useless error returns on phylink code path

Remove error returns on the cases where they are already handled with the
function the mac_port_get_caps member in mt753x_table points to.

mt7531_mac_config() is also called from mt7531_cpu_port_config() outside of
phylink but the port and interface modes are already handled there.

Change the functions and the mac_port_config function pointer to void now
that there're no error returns anymore.

Remove mt753x_is_mac_port() that used to help the said error returns.

On mt7531_mac_config(), switch to if statements to simplify the code.

Remove internal phy cases from mt753x_phylink_mac_config(), there is no
need to check the interface mode as that's already handled with the
function the mac_port_get_caps member in mt753x_table points to.

Acked-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: do not use SW_PHY_RST to reset MT7531 switch
Arınç ÜNAL [Fri, 1 Mar 2024 10:42:59 +0000 (12:42 +0200)]
net: dsa: mt7530: do not use SW_PHY_RST to reset MT7531 switch

According to the document MT7531 Reference Manual for Development Board
v1.0, the SW_PHY_RST bit on the SYS_CTRL register doesn't exist for
MT7531. This is likely why forcing link down on all ports is necessary for
MT7531.

Therefore, do not set SW_PHY_RST on mt7531_setup().

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: set interrupt register only for MT7530
Arınç ÜNAL [Fri, 1 Mar 2024 10:42:58 +0000 (12:42 +0200)]
net: dsa: mt7530: set interrupt register only for MT7530

Setting this register related to interrupts is only needed for the MT7530
switch. Make an exclusive check to ensure this.

Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Acked-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: dsa: mt7530: remove .mac_port_config for MT7988 and make it optional
Arınç ÜNAL [Fri, 1 Mar 2024 10:42:57 +0000 (12:42 +0200)]
net: dsa: mt7530: remove .mac_port_config for MT7988 and make it optional

For the switch on the MT7988 SoC, the mac_port_config member for ID_MT7988
in mt753x_table is not needed as the interfaces of all MACs are already
handled on mt7988_mac_port_get_caps().

Therefore, remove the mac_port_config member from ID_MT7988 in
mt753x_table. Before calling priv->info->mac_port_config(), if there's no
mac_port_config member in mt753x_table, exit mt753x_mac_config()
successfully.

Remove calling priv->info->mac_port_config() from the sanity check as the
sanity check requires a pointer to a mac_port_config function to be
non-NULL. This will fail for MT7988 as mac_port_config won't be a member of
its info table.

Co-developed-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch 'remove-page-frag-implementation-in-vhost_net'
Paolo Abeni [Tue, 5 Mar 2024 10:38:16 +0000 (11:38 +0100)]
Merge branch 'remove-page-frag-implementation-in-vhost_net'

Yunsheng Lin says:

====================
remove page frag implementation in vhost_net

Currently there are three implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
   xdp frame, and it's implementation seems to be a mix of
   the above two.

This patchset tries to unfiy the page frag implementation a
little bit by unifying gfp bit for order 3 page allocation
and replacing page frag implementation in vhost.c with the
one in page_alloc.c.

After this patchset, we are not only able to unify the page
frag implementation a little, but also able to have about
0.5% performance boost testing by using the vhost_net_test
introduced in the last patch.

Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     305325.78 msec task-clock                       #    1.738 CPUs utilized               ( +-  0.12% )
       1048668      context-switches                 #    3.435 K/sec                       ( +-  0.00% )
            11      cpu-migrations                   #    0.036 /sec                        ( +- 17.64% )
            33      page-faults                      #    0.108 /sec                        ( +-  0.49% )
  244651819491      cycles                           #    0.801 GHz                         ( +-  0.43% )  (64)
   64714638024      stalled-cycles-frontend          #   26.45% frontend cycles idle        ( +-  2.19% )  (67)
   30774313491      stalled-cycles-backend           #   12.58% backend cycles idle         ( +-  7.68% )  (70)
  201749748680      instructions                     #    0.82  insn per cycle
                                              #    0.32  stalled cycles per insn     ( +-  0.41% )  (66.76%)
   65494787909      branches                         #  214.508 M/sec                       ( +-  0.35% )  (64)
    4284111313      branch-misses                    #    6.54% of all branches             ( +-  0.45% )  (66)

       175.699 +- 0.189 seconds time elapsed  ( +-  0.11% )

After this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     303974.38 msec task-clock                       #    1.739 CPUs utilized               ( +-  0.14% )
       1048807      context-switches                 #    3.450 K/sec                       ( +-  0.00% )
            14      cpu-migrations                   #    0.046 /sec                        ( +- 12.86% )
            33      page-faults                      #    0.109 /sec                        ( +-  0.46% )
  251289376347      cycles                           #    0.827 GHz                         ( +-  0.32% )  (60)
   67885175415      stalled-cycles-frontend          #   27.01% frontend cycles idle        ( +-  0.48% )  (63)
   27809282600      stalled-cycles-backend           #   11.07% backend cycles idle         ( +-  0.36% )  (71)
  195543234672      instructions                     #    0.78  insn per cycle
                                              #    0.35  stalled cycles per insn     ( +-  0.29% )  (69.04%)
   62423183552      branches                         #  205.357 M/sec                       ( +-  0.48% )  (67)
    4135666632      branch-misses                    #    6.63% of all branches             ( +-  0.63% )  (67)

       174.764 +- 0.214 seconds time elapsed  ( +-  0.12% )

Changelog:
V6: Add timeout for poll() and simplify some logic as suggested
    by Jason.

V5: Address the comment from jason in vhost_net_test.c and the
    comment about leaving out the gfp change for page frag in
    sock.c as suggested by Paolo.

V4: Resend based on latest net-next branch.

V3:
1. Add __page_frag_alloc_align() which is passed with the align mask
   the original function expected as suggested by Alexander.
2. Drop patch 3 in v2 suggested by Alexander.
3. Reorder patch 4 & 5 in v2 suggested by Alexander.

Note that placing this gfp flags handing for order 3 page in an inline
function is not considered, as we may be able to unify the page_frag
and page_frag_cache handling.

V2: Change 'xor'd' to 'masked off', add vhost tx testing for
    vhost_net_test.

V1: Fix some typo, drop RFC tag and rebase on latest net-next.
====================

Link: https://lore.kernel.org/r/20240228093013.8263-1-linyunsheng@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agotools: virtio: introduce vhost_net_test
Yunsheng Lin [Wed, 28 Feb 2024 09:30:12 +0000 (17:30 +0800)]
tools: virtio: introduce vhost_net_test

introduce vhost_net_test for both vhost_net tx and rx basing
on virtio_test to test vhost_net changing in the kernel.

Steps for vhost_net tx testing:
1. Prepare a out buf.
2. Kick the vhost_net to do tx processing.
3. Do the receiving in the tun side.
4. verify the data received by tun is correct.

Steps for vhost_net rx testing:
1. Prepare a in buf.
2. Do the sending in the tun side.
3. Kick the vhost_net to do rx processing.
4. verify the data received by vhost_net is correct.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agovhost/net: remove vhost_net_page_frag_refill()
Yunsheng Lin [Wed, 28 Feb 2024 09:30:11 +0000 (17:30 +0800)]
vhost/net: remove vhost_net_page_frag_refill()

The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but it's
implementation is similar to page_frag_alloc_align() now.

This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frag using page_frag_alloc_align().

The added benefit is that not only unifying the page frag
implementation a little, but also having about 0.5% performance
boost testing by using the vhost_net_test introduced in the
last patch.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: introduce page_frag_cache_drain()
Yunsheng Lin [Wed, 28 Feb 2024 09:30:10 +0000 (17:30 +0800)]
net: introduce page_frag_cache_drain()

When draining a page_frag_cache, most user are doing
the similar steps, so introduce an API to avoid code
duplication.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agopage_frag: unify gfp bits for order 3 page allocation
Yunsheng Lin [Wed, 28 Feb 2024 09:30:09 +0000 (17:30 +0800)]
page_frag: unify gfp bits for order 3 page allocation

Currently there seems to be three page frag implementations
which all try to allocate order 3 page, if that fails, it
then fail back to allocate order 0 page, and each of them
all allow order 3 page allocation to fail under certain
condition by using specific gfp bits.

The gfp bits for order 3 page allocation are different
between different implementation, __GFP_NOMEMALLOC is
or'd to forbid access to emergency reserves memory for
__page_frag_cache_refill(), but it is not or'd in other
implementions, __GFP_DIRECT_RECLAIM is masked off to avoid
direct reclaim in vhost_net_page_frag_refill(), but it is
not masked off in __page_frag_cache_refill().

This patch unifies the gfp bits used between different
implementions by or'ing __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
possible pressure for mm.

Leave the gfp unifying for page frag implementation in sock.c
for now as suggested by Paolo Abeni.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agomm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
Yunsheng Lin [Wed, 28 Feb 2024 09:30:08 +0000 (17:30 +0800)]
mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument

napi_alloc_frag_align() and netdev_alloc_frag_align() accept
align as an argument, and they are thin wrappers around the
__napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs
doing the alignment checking and align mask conversion, in order
to call page_frag_alloc_align() directly. The intention here is
to keep the alignment checking and the alignmask conversion in
in-line wrapper to avoid those kind of operations during execution
time since it can usually be handled during compile time.

We are going to use page_frag_alloc_align() in vhost_net.c, it
need the same kind of alignment checking and alignmask conversion,
so split up page_frag_alloc_align into an inline wrapper doing the
above operation, and add __page_frag_alloc_align() which is passed
with the align mask the original function expected as suggested by
Alexander.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: txgbe: fix to clear interrupt status after handling IRQ
Jiawen Wu [Fri, 1 Mar 2024 09:29:56 +0000 (17:29 +0800)]
net: txgbe: fix to clear interrupt status after handling IRQ

GPIO EOI is not set to clear interrupt status after handling the
interrupt. It should be done in irq_chip->irq_ack, but this function
is not called in handle_nested_irq(). So executing function
txgbe_gpio_irq_ack() manually in txgbe_gpio_irq_handler().

Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20240301092956.18544-2-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agonet: txgbe: fix GPIO interrupt blocking
Jiawen Wu [Fri, 1 Mar 2024 09:29:55 +0000 (17:29 +0800)]
net: txgbe: fix GPIO interrupt blocking

The register of GPIO interrupt status is masked before MAC IRQ
is enabled. This is because of hardware deficiency. So manually
clear the interrupt status before using them. Otherwise, GPIO
interrupts will never be reported again. There is a workaround for
clearing interrupts to set GPIO EOI in txgbe_up_complete().

Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20240301092956.18544-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 months agoMerge branch 'intel-wired-lan-driver-updates-2024-02-28-ixgbe-igc-igb-e1000e-e100'
Jakub Kicinski [Tue, 5 Mar 2024 04:50:01 +0000 (20:50 -0800)]
Merge branch 'intel-wired-lan-driver-updates-2024-02-28-ixgbe-igc-igb-e1000e-e100'

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-02-28 (ixgbe, igc, igb, e1000e, e100)

This series contains updates to ixgbe, igc, igb, e1000e, and e100
drivers.

Jon Maxwell makes module parameter values readable in sysfs for ixgbe,
igb, and e100.

Ernesto Castellotti adds support for 1000BASE-BX on ixgbe.

Arnd Bergmann fixes build failure due to dependency issues for igc.

Vitaly refactors error check to be more concise and prevent future
issues on e1000e.

v1: https://lore.kernel.org/netdev/20240229004135.741586-1-anthony.l.nguyen@intel.com/
====================

Link: https://lore.kernel.org/r/20240301184806.2634508-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoe1000e: Minor flow correction in e1000_shutdown function
Vitaly Lifshits [Fri, 1 Mar 2024 18:48:05 +0000 (10:48 -0800)]
e1000e: Minor flow correction in e1000_shutdown function

Add curly braces to avoid entering to an if statement where it is not
always required in e1000_shutdown function.
This improves code readability and might prevent non-deterministic
behaviour in the future.

Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240301184806.2634508-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoigc: fix LEDS_CLASS dependency
Arnd Bergmann [Fri, 1 Mar 2024 18:48:04 +0000 (10:48 -0800)]
igc: fix LEDS_CLASS dependency

When IGC is built-in but LEDS_CLASS is a loadable module, there is
a link failure:

x86_64-linux-ld: drivers/net/ethernet/intel/igc/igc_leds.o: in function `igc_led_setup':
igc_leds.c:(.text+0x75c): undefined reference to `devm_led_classdev_register_ext'

Add another dependency that prevents this combination.

Fixes: ea578703b03d ("igc: Add support for LEDs on i225/i226")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240301184806.2634508-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoixgbe: Add 1000BASE-BX support
Ernesto Castellotti [Fri, 1 Mar 2024 18:48:03 +0000 (10:48 -0800)]
ixgbe: Add 1000BASE-BX support

Added support for 1000BASE-BX, i.e. Gigabit Ethernet over single strand
of single-mode fiber.
The initialization of a 1000BASE-BX SFP is the same as 1000BASE-SX/LX
with the only difference that the Bit Rate Nominal Value must be
checked to make sure it is a Gigabit Ethernet transceiver, as described
by the SFF-8472 specification.

This was tested with the FS.com SFP-GE-BX 1310/1490nm 10km transceiver:
$ ethtool -m eth4
        Identifier                                : 0x03 (SFP)
        Extended identifier                       : 0x04 (GBIC/SFP defined by 2-wire interface ID)
        Connector                                 : 0x07 (LC)
        Transceiver codes                         : 0x00 0x00 0x00 0x40 0x00 0x00 0x00 0x00 0x00
        Transceiver type                          : Ethernet: BASE-BX10
        Encoding                                  : 0x01 (8B/10B)
        BR, Nominal                               : 1300MBd
        Rate identifier                           : 0x00 (unspecified)
        Length (SMF,km)                           : 10km
        Length (SMF)                              : 10000m
        Length (50um)                             : 0m
        Length (62.5um)                           : 0m
        Length (Copper)                           : 0m
        Length (OM3)                              : 0m
        Laser wavelength                          : 1310nm
        Vendor name                               : FS
        Vendor OUI                                : 64:9d:99
        Vendor PN                                 : SFP-GE-BX
        Vendor rev                                :
        Option values                             : 0x20 0x0a
        Option                                    : RX_LOS implemented
        Option                                    : TX_FAULT implemented
        Option                                    : Power level 3 requirement
        BR margin, max                            : 0%
        BR margin, min                            : 0%
        Vendor SN                                 : S2202359108
        Date code                                 : 220307
        Optical diagnostics support               : Yes
        Laser bias current                        : 17.650 mA
        Laser output power                        : 0.2132 mW / -6.71 dBm
        Receiver signal average optical power     : 0.2740 mW / -5.62 dBm
        Module temperature                        : 47.30 degrees C / 117.13 degrees F
        Module voltage                            : 3.2576 V
        Alarm/warning flags implemented           : Yes
        Laser bias current high alarm             : Off
        Laser bias current low alarm              : Off
        Laser bias current high warning           : Off
        Laser bias current low warning            : Off
        Laser output power high alarm             : Off
        Laser output power low alarm              : Off
        Laser output power high warning           : Off
        Laser output power low warning            : Off
        Module temperature high alarm             : Off
        Module temperature low alarm              : Off
        Module temperature high warning           : Off
        Module temperature low warning            : Off
        Module voltage high alarm                 : Off
        Module voltage low alarm                  : Off
        Module voltage high warning               : Off
        Module voltage low warning                : Off
        Laser rx power high alarm                 : Off
        Laser rx power low alarm                  : Off
        Laser rx power high warning               : Off
        Laser rx power low warning                : Off
        Laser bias current high alarm threshold   : 110.000 mA
        Laser bias current low alarm threshold    : 1.000 mA
        Laser bias current high warning threshold : 100.000 mA
        Laser bias current low warning threshold  : 1.000 mA
        Laser output power high alarm threshold   : 0.7079 mW / -1.50 dBm
        Laser output power low alarm threshold    : 0.0891 mW / -10.50 dBm
        Laser output power high warning threshold : 0.6310 mW / -2.00 dBm
        Laser output power low warning threshold  : 0.1000 mW / -10.00 dBm
        Module temperature high alarm threshold   : 90.00 degrees C / 194.00 degrees F
        Module temperature low alarm threshold    : -45.00 degrees C / -49.00 degrees F
        Module temperature high warning threshold : 85.00 degrees C / 185.00 degrees F
        Module temperature low warning threshold  : -40.00 degrees C / -40.00 degrees F
        Module voltage high alarm threshold       : 3.7950 V
        Module voltage low alarm threshold        : 2.8050 V
        Module voltage high warning threshold     : 3.4650 V
        Module voltage low warning threshold      : 3.1350 V
        Laser rx power high alarm threshold       : 0.7079 mW / -1.50 dBm
        Laser rx power low alarm threshold        : 0.0028 mW / -25.53 dBm
        Laser rx power high warning threshold     : 0.6310 mW / -2.00 dBm
        Laser rx power low warning threshold      : 0.0032 mW / -24.95 dBm

Signed-off-by: Ernesto Castellotti <ernesto@castellotti.net>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240301184806.2634508-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agointel: make module parameters readable in sys filesystem
Jon Maxwell [Fri, 1 Mar 2024 18:48:02 +0000 (10:48 -0800)]
intel: make module parameters readable in sys filesystem

Linux users sometimes need an easy way to check current values of module
parameters. For example the module may be manually reloaded with different
parameters. Make these visible and readable in the /sys filesystem to allow
that. But don't make the "debug" module parameter visible as debugging is
enabled via ethtool msglvl.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240301184806.2634508-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agotcp: align tcp_sock_write_rx group
Eric Dumazet [Fri, 1 Mar 2024 17:19:45 +0000 (17:19 +0000)]
tcp: align tcp_sock_write_rx group

Stephen Rothwell and kernel test robot reported that some arches
(parisc, hexagon) and/or compilers would not like blamed commit.

Lets make sure tcp_sock_write_rx group does not start with a hole.

While we are at it, correct tcp_sock_write_tx CACHELINE_ASSERT_GROUP_SIZE()
since after the blamed commit, we went to 105 bytes.

Fixes: 99123622050f ("tcp: remove some holes in struct tcp_sock")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/netdev/20240301121108.5d39e4f9@canb.auug.org.au/
Closes: https://lore.kernel.org/oe-kbuild-all/202403011451.csPYOS3C-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://lore.kernel.org/r/20240301171945.2958176-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoselftests/tc-testing: require an up to date iproute2 for blockcast tests
Pedro Tammela [Thu, 29 Feb 2024 14:38:25 +0000 (11:38 -0300)]
selftests/tc-testing: require an up to date iproute2 for blockcast tests

Add the dependsOn test check for all the mirred blockcast tests.
It will prevent the issue reported by LKFT which happens when an older
iproute2 is used to run the current tdc.

Tests are skipped if the dependsOn check fails.

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://lore.kernel.org/r/20240229143825.1373550-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoselftests: net: Correct couple of spelling mistakes
Prabhav Kumar Vaish [Wed, 28 Feb 2024 12:07:01 +0000 (17:37 +0530)]
selftests: net: Correct couple of spelling mistakes

Changes :
- "excercise" is corrected to "exercise" in drivers/net/mlxsw/spectrum-2/tc_flower.sh
- "mutliple" is corrected to "multiple" in drivers/net/netdevsim/ethtool-fec.sh

Signed-off-by: Prabhav Kumar Vaish <pvkumar5749404@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228120701.422264-1-pvkumar5749404@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'mptcp-userspace-pm'
David S. Miller [Mon, 4 Mar 2024 13:07:46 +0000 (13:07 +0000)]
Merge branch 'mptcp-userspace-pm'

Matthieu Baerts says:

====================
mptcp: userspace pm: 'dump addrs' and 'get addr'

This series from Geliang adds two new Netlink commands to the userspace
PM:

- one to dump all addresses of a specific MPTCP connection:
  - feature added in patches 3 to 5
  - test added in patches 7, 8 and 10

- and one to get a specific address for an MPTCP connection:
  - feature added in patches 11 to 13
  - test added in patches 14 and 15

These new Netlink commands can be useful if an MPTCP daemon lost track
of the different connections, e.g. after having been restarted.

The other patches are some clean-ups and small improvements added
while working on the new features.

====================

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
6 months agoselftests: mptcp: userspace pm get addr tests
Geliang Tang [Fri, 1 Mar 2024 18:18:39 +0000 (19:18 +0100)]
selftests: mptcp: userspace pm get addr tests

This patch adds a new helper userspace_pm_get_addr() in mptcp_join.sh.
In it, parse the token value from the output of 'pm_nl_ctl events', then
pass it to pm_nl_ctl get_addr command. Use this helper in userspace pm
dump tests.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoselftests: mptcp: add token for get_addr
Geliang Tang [Fri, 1 Mar 2024 18:18:38 +0000 (19:18 +0100)]
selftests: mptcp: add token for get_addr

The command get_addr() of pm_nl_ctl can be used like this in in-kernel PM:

pm_nl_ctl get $id

This patch adds token argument for it to support userspace PM:

pm_nl_ctl get $id token $token

If 'token $token' is passed to get_addr(), copy it into the kernel netlink.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: get addr in userspace pm list
Geliang Tang [Fri, 1 Mar 2024 18:18:37 +0000 (19:18 +0100)]
mptcp: get addr in userspace pm list

This patch renames mptcp_pm_nl_get_addr_doit() as a dedicated in-kernel
netlink PM get addr function mptcp_pm_nl_get_addr(). and invoke a new
wrapper mptcp_pm_get_addr() in mptcp_pm_nl_get_addr_doit.

If a token is gotten in the wrapper, that means a userspace PM is used.
So invoke mptcp_userspace_pm_get_addr() to get addr in userspace PM list.
Otherwise, invoke mptcp_pm_nl_get_addr().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: implement mptcp_userspace_pm_get_addr
Geliang Tang [Fri, 1 Mar 2024 18:18:36 +0000 (19:18 +0100)]
mptcp: implement mptcp_userspace_pm_get_addr

This patch implements mptcp_userspace_pm_get_addr() to get an address
from userspace pm address list according the given 'token' and 'id'.
Use nla_get_u32() to get the u32 value of 'token', then pass it to
mptcp_token_get_sock() to get the msk. Pass 'msk' and 'id' to the helper
mptcp_userspace_pm_lookup_addr_by_id() to get the address entry. Put
this entry to userspace using mptcp_pm_nl_put_entry_info().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: add userspace_pm_lookup_addr_by_id helper
Geliang Tang [Fri, 1 Mar 2024 18:18:35 +0000 (19:18 +0100)]
mptcp: add userspace_pm_lookup_addr_by_id helper

Corresponding __lookup_addr_by_id() helper in the in-kernel netlink PM,
this patch adds a new helper mptcp_userspace_pm_lookup_addr_by_id() to
lookup the address entry with the given id on the userspace pm local
address list.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoselftests: mptcp: dump userspace addrs list
Geliang Tang [Fri, 1 Mar 2024 18:18:34 +0000 (19:18 +0100)]
selftests: mptcp: dump userspace addrs list

This patch adds a new helper userspace_pm_dump() to dump addresses
for the userspace PM. Use this helper to check whether an ID 0 subflow
is listed in the output of dump command after creating an ID 0 subflow
in "userspace pm create id 0 subflow" test. Dump userspace PM addresses
list in "userspace pm add & remove address" test and in "userspace pm
create destroy subflow" test.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoselftests: mptcp: add mptcp_lib_check_output helper
Geliang Tang [Fri, 1 Mar 2024 18:18:33 +0000 (19:18 +0100)]
selftests: mptcp: add mptcp_lib_check_output helper

Extract the main part of check() in pm_netlink.sh into a new helper
named mptcp_lib_check_output in mptcp_lib.sh.

This helper will be used for userspace dump addresses tests.

Co-developed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoselftests: mptcp: add token for dump_addr
Geliang Tang [Fri, 1 Mar 2024 18:18:32 +0000 (19:18 +0100)]
selftests: mptcp: add token for dump_addr

The command dump_addr() of pm_nl_ctl can be used like this in in-kernel PM:

        pm_nl_ctl dump

This patch adds token argument for it to support userspace PM:

        pm_nl_ctl dump token $token

If 'token $token' is passed to dump_addr(), copy it into the kernel
netlink.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoselftests: mptcp: add userspace pm subflow flag
Geliang Tang [Fri, 1 Mar 2024 18:18:31 +0000 (19:18 +0100)]
selftests: mptcp: add userspace pm subflow flag

This patch adds the address flag MPTCP_PM_ADDR_FLAG_SUBFLOW in csf() in
pm_nl_ctl.c when subflow is created by a userspace PM.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: check userspace pm flags
Geliang Tang [Fri, 1 Mar 2024 18:18:30 +0000 (19:18 +0100)]
mptcp: check userspace pm flags

Just like MPTCP_PM_ADDR_FLAG_SIGNAL flag is checked in userspace PM
announce mptcp_pm_nl_announce_doit(), PM flags should be checked in
mptcp_pm_nl_subflow_create_doit() too.

If MPTCP_PM_ADDR_FLAG_SUBFLOW flag is not set, there's no flags field
in the output of dump_addr. This looks a bit strange:

        id 10 flags  10.0.3.2

This patch uses mptcp_pm_parse_entry() instead of mptcp_pm_parse_addr()
to get the PM flags of the entry and check it. MPTCP_PM_ADDR_FLAG_SIGNAL
flag shouldn't be set here, and if MPTCP_PM_ADDR_FLAG_SUBFLOW flag is
missing from the netlink attribute, always set this flag.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: dump addrs in userspace pm list
Geliang Tang [Fri, 1 Mar 2024 18:18:29 +0000 (19:18 +0100)]
mptcp: dump addrs in userspace pm list

This patch renames mptcp_pm_nl_get_addr_dumpit() as a dedicated in-kernel
netlink PM dump addrs function mptcp_pm_nl_dump_addr(), and invoke a newly
added wrapper mptcp_pm_dump_addr() in mptcp_pm_nl_get_addr_dumpit().

Invoke in-kernel PM dump addrs function mptcp_pm_nl_dump_addr() or
userspace PM dump addrs function mptcp_userspace_pm_dump_addr() based on
whether the token parameter is passed in or not in the wrapper.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: add token for get-addr in yaml
Geliang Tang [Fri, 1 Mar 2024 18:18:28 +0000 (19:18 +0100)]
mptcp: add token for get-addr in yaml

This patch adds token parameter together with addr in get-addr section in
mptcp_pm.yaml, then use the following commands to update mptcp_pm_gen.c
and mptcp_pm_gen.h:

./tools/net/ynl/ynl-gen-c.py --mode kernel \
        --spec Documentation/netlink/specs/mptcp_pm.yaml --source \
        -o net/mptcp/mptcp_pm_gen.c
./tools/net/ynl/ynl-gen-c.py --mode kernel \
        --spec Documentation/netlink/specs/mptcp_pm.yaml --header \
        -o net/mptcp/mptcp_pm_gen.h

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: implement mptcp_userspace_pm_dump_addr
Geliang Tang [Fri, 1 Mar 2024 18:18:27 +0000 (19:18 +0100)]
mptcp: implement mptcp_userspace_pm_dump_addr

This patch implements mptcp_userspace_pm_dump_addr() to dump addresses
from userspace pm address list. Use mptcp_token_get_sock() to get the
msk from the given token, if userspace PM is enabled in it, traverse
each address entry in address list, put every entry to userspace using
mptcp_pm_nl_put_entry_msg().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: export mptcp_genl_family & mptcp_nl_fill_addr
Geliang Tang [Fri, 1 Mar 2024 18:18:26 +0000 (19:18 +0100)]
mptcp: export mptcp_genl_family & mptcp_nl_fill_addr

This patch exports struct mptcp_genl_family and mptcp_nl_fill_addr() helper
to allow them can be used in pm_userspace.c.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: make pm_remove_addrs_and_subflows static
Geliang Tang [Fri, 1 Mar 2024 18:18:25 +0000 (19:18 +0100)]
mptcp: make pm_remove_addrs_and_subflows static

mptcp_pm_remove_addrs_and_subflows() is only used in pm_netlink.c, it's
no longer used in pm_userspace.c any more since the commit 8b1c94da1e48
("mptcp: only send RM_ADDR in nl_cmd_remove"). So this patch changes it
to a static function.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'ipa-device-pointer-access'
David S. Miller [Mon, 4 Mar 2024 11:44:41 +0000 (11:44 +0000)]
Merge branch 'ipa-device-pointer-access'

Alex Elder says:

====================
net: ipa: simplify device pointer access

This version of this patch series fixes the bugs in the first patch
(which were fixed in the second), where ipa_interrupt_config() had
two remaining spots that returned a pointer rather than an integer.

Outside of initialization, all uses of the platform device pointer
stored in the IPA structure determine the address of device
structure embedded within the platform device structure.

By changing some of the initialization functions to take a platform
device as argument we can simplify getting at the device structure
address by storing it (instead of the platform device pointer) in
the IPA structure.

The first two patches split the interrupt initialization code into
two parts--one done earlier than before.  The next four patches
update some initialization functions to take a platform device
pointer as argument.  And the last patch replaces the platform
device pointer with a device pointer, and converts all remaining
references to the &ipa->pdev->dev to use ipa->dev.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: don't save the platform device
Alex Elder [Fri, 1 Mar 2024 17:02:42 +0000 (11:02 -0600)]
net: ipa: don't save the platform device

The IPA platform device is now only used as the structure containing
the IPA device structure.  Replace the platform device pointer with
a pointer to the device structure.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: pass a platform device to ipa_smp2p_init()
Alex Elder [Fri, 1 Mar 2024 17:02:41 +0000 (11:02 -0600)]
net: ipa: pass a platform device to ipa_smp2p_init()

Rather than using the platform device pointer field in the IPA
pointer, pass a platform device pointer to ipa_smp2p_init().  Use
that pointer throughout that function.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: pass a platform device to ipa_smp2p_irq_init()
Alex Elder [Fri, 1 Mar 2024 17:02:40 +0000 (11:02 -0600)]
net: ipa: pass a platform device to ipa_smp2p_irq_init()

Rather than using the platform device pointer field in the IPA
pointer, pass a platform device pointer to ipa_smp2p_irq_init().
Use that pointer throughout that function (without assuming it's
the same as the IPA platform device pointer).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: pass a platform device to ipa_mem_init()
Alex Elder [Fri, 1 Mar 2024 17:02:39 +0000 (11:02 -0600)]
net: ipa: pass a platform device to ipa_mem_init()

Rather than using the platform device pointer field in the IPA
pointer, pass a platform device pointer to ipa_mem_init().  Use
that pointer throughout that function.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: pass a platform device to ipa_reg_init()
Alex Elder [Fri, 1 Mar 2024 17:02:38 +0000 (11:02 -0600)]
net: ipa: pass a platform device to ipa_reg_init()

Rather than using the platform device pointer field in the IPA
pointer, pass a platform device pointer to ipa_reg_init().  Use
that pointer throughout that function.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: introduce ipa_interrupt_init()
Alex Elder [Fri, 1 Mar 2024 17:02:37 +0000 (11:02 -0600)]
net: ipa: introduce ipa_interrupt_init()

Create a new function ipa_interrupt_init() that is called at probe
time to allocate and initialize the IPA interrupt data structure.
Create ipa_interrupt_exit() as its inverse.

This follows the normal IPA driver pattern of *_init() functions
doing things that can be done before access to hardware is required.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ipa: change ipa_interrupt_config() prototype
Alex Elder [Fri, 1 Mar 2024 17:02:36 +0000 (11:02 -0600)]
net: ipa: change ipa_interrupt_config() prototype

Change the return type of ipa_interrupt_config() to be an error
code rather than an IPA interrupt structure pointer, and assign the
the pointer within that function.

Change ipa_interrupt_deconfig() to take the IPA pointer as argument
and have it invalidate the ipa->interrupt pointer.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'mptcp-lowat-sockopt'
David S. Miller [Mon, 4 Mar 2024 10:50:28 +0000 (10:50 +0000)]
Merge branch 'mptcp-lowat-sockopt'

Matthieu Baerts says:

====================
mptcp: add TCP_NOTSENT_LOWAT sockopt support

Patch 3 does the magic of adding TCP_NOTSENT_LOWAT support, all the
other ones are minor cleanup seen along when working on the new
feature.

Note that this feature relies on the existing accounting for snd_nxt.
Such accounting is not 110% accurate as it tracks the most recent
sequence number queued to any subflow, and not the actual sequence
number sent on the wire. Paolo experimented a lot, trying to implement
the latter, and in the end it proved to be both "too complex" and "not
necessary".

The complexity raises from the need for additional lock and a lot of
refactoring to introduce such protections without adding significant
overhead. Additionally, snd_nxt is currently used and exposed with the
current semantic by the internal packet scheduling. Introducing a
different tracking will still require us to keep the old one.

More interestingly, a more accurate tracking could be not strictly
necessary: as the MPTCP socket enqueues data to the subflows only up
to the available send window, any enqueue data is sent on the wire
instantly, without any blocking operation short or a drop in the tx
path at the nft or TC layer.
====================

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
6 months agomptcp: cleanup SOL_TCP handling
Paolo Abeni [Fri, 1 Mar 2024 17:43:47 +0000 (18:43 +0100)]
mptcp: cleanup SOL_TCP handling

Most TCP-level socket options get an integer from user space, and
set the corresponding field under the msk-level socket lock.

Reduce the code duplication moving such operations in the common code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: implement TCP_NOTSENT_LOWAT support
Paolo Abeni [Fri, 1 Mar 2024 17:43:46 +0000 (18:43 +0100)]
mptcp: implement TCP_NOTSENT_LOWAT support

Add support for such socket option storing the user-space provided
value in a new msk field, and using such data to implement the
_mptcp_stream_memory_free() helper, similar to the TCP one.

To avoid adding more indirect calls in the fast path, open-code
a variant of sk_stream_memory_free() in mptcp_sendmsg() and add
direct calls to the mptcp stream memory free helper where possible.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/464
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: avoid some duplicate code in socket option handling
Paolo Abeni [Fri, 1 Mar 2024 17:43:45 +0000 (18:43 +0100)]
mptcp: avoid some duplicate code in socket option handling

The mptcp_get_int_option() helper is needless open-coded in a
couple of places, replace the duplicate code with the helper
call.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agomptcp: cleanup writer wake-up
Paolo Abeni [Fri, 1 Mar 2024 17:43:44 +0000 (18:43 +0100)]
mptcp: cleanup writer wake-up

After commit 5cf92bbadc58 ("mptcp: re-enable sndbuf autotune"), the
MPTCP_NOSPACE bit is redundant: it is always set and cleared together with
SOCK_NOSPACE.

Let's drop the first and always relay on the latter, dropping a bunch
of useless code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: nlmon: Simplify nlmon_get_stats64
Breno Leitao [Fri, 1 Mar 2024 13:42:14 +0000 (05:42 -0800)]
net: nlmon: Simplify nlmon_get_stats64

Do not set rtnl_link_stats64 fields to zero, since they are zeroed
before ops->ndo_get_stats64 is called in core dev_get_stats() function.

Also, simplify the data collection by removing the temporary variable.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: nlmon: Remove init and uninit functions
Breno Leitao [Fri, 1 Mar 2024 13:42:13 +0000 (05:42 -0800)]
net: nlmon: Remove init and uninit functions

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the nlmon driver and leverage the network
core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoOcteontx2-af: Fix an issue in firmware shared data reserved space
Hariprasad Kelam [Fri, 1 Mar 2024 11:03:14 +0000 (16:33 +0530)]
Octeontx2-af: Fix an issue in firmware shared data reserved space

The last patch which added support to extend the firmware shared
data to add channel data information has introduced a bug due to
the reserved space not adjusted accordingly.

This patch fixes the issue and also adds BUILD_BUG to avoid this
regression error.

Fixes: 997814491cee ("Octeontx2-af: Fetch MAC channel info from firmware")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoeth: igc: remove unused embedded struct net_device
Jakub Kicinski [Fri, 1 Mar 2024 07:02:55 +0000 (23:02 -0800)]
eth: igc: remove unused embedded struct net_device

struct net_device poll_dev in struct igc_q_vector was added
in one of the initial commits, but never used.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'net-gve-header-split-support'
David S. Miller [Mon, 4 Mar 2024 10:03:32 +0000 (10:03 +0000)]
Merge branch 'net-gve-header-split-support'

Ziwei Xiao says:

====================
gve: Add header split support

Currently, the ethtool's ringparam has added a new field
tcp-data-split for enabling and disabling header split. These three
patches will utilize that ethtool flag to support header split in GVE
driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agogve: Add header split ethtool stats
Jeroen de Borst [Thu, 29 Feb 2024 21:22:36 +0000 (13:22 -0800)]
gve: Add header split ethtool stats

To record the stats of header split packets, three stats are added in
the driver's ethtool stats.

- rx_hsplit_pkt is the split packets count with header split
- rx_hsplit_bytes is the received header bytes count with header split
- rx_hsplit_unsplit_pkt is the unsplit packet count due to header buffer
  overflow or zero header length when header split is enabled

Currently, it's entering the stats_update critical section more than
once per packet. We have plans to avoid that in the future change to let
all the stats_update happen in one place at the end of
`gve_rx_poll_dqo`.

Co-developed-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agogve: Add header split data path
Jeroen de Borst [Thu, 29 Feb 2024 21:22:35 +0000 (13:22 -0800)]
gve: Add header split data path

Add header buffers and ethtool support to enable header split via the
tcp-data-split flag in ethtool's ringparam config. A coherent dma memory
is allocated for the header buffers. There is one header buffer per ring
entry by calculating the offset to the header-buffers starting address.
The header buffer is always copied directly into the skb and payload is
always added as frags. When there is a header buffer overflow or the
header length is 0, the driver places the whole unsplit packet in frags.

When toggling header split, the driver will call gve_adjust_config to
set its queues appropriately. If header split is enabled by the user and
the max packet buffer size is no less than 4KB, driver will set the
packet buffer size as 4KB to support TCP_ZEROCOPY_RECEIVE. Otherwise the
driver will use the default 2KB as the packet buffer size.

`ethtool -G <dev> tcp-data-split on/off` is the command to toggle header
split.
`ethtool -g <dev>` will show the status of header split with the field
of `tcp-data-split`.

Co-developed-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agogve: Add header split device option
Jeroen de Borst [Thu, 29 Feb 2024 21:22:34 +0000 (13:22 -0800)]
gve: Add header split device option

To enable header split via ethtool, we first need to query the device to
get the max rx buffer size and header buffer size. Add a device option
to get these values and store them in the driver. If the header buffer
size received from the device is non-zero, it means header split is
supported in the device.

Currently the max rx buffer size will only be used when header split is
enabled which will set the data_buffer_size_dqo to be the max rx buffer
size. Also change the data_buffer_size_dqo from int to u16 since we are
modifying it and making it to be consistent with max_rx_buffer_size.

Co-developed-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Ziwei Xiao <ziweixiao@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: ip6_tunnel: Leverage core stats allocator
Breno Leitao [Wed, 28 Feb 2024 18:03:17 +0000 (10:03 -0800)]
net: ip6_tunnel: Leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the ip6_tunnel driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'ionic-cleanups-and-perf-tuning'
David S. Miller [Mon, 4 Mar 2024 09:38:14 +0000 (09:38 +0000)]
Merge branch 'ionic-cleanups-and-perf-tuning'

Shannon Nelson says:

====================
ionic: code cleanup and performance tuning

Brett has been performance testing and code tweaking and has
come up with several improvements for our fast path operations.

In a simple single thread / single queue iperf case on a 1500 MTU
connection we see an improvement from 74.2 to 86.7 Gbits/sec.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: change MODULE_AUTHOR to person name
Shannon Nelson [Thu, 29 Feb 2024 19:39:35 +0000 (11:39 -0800)]
ionic: change MODULE_AUTHOR to person name

The MODULE_AUTHOR macro is supposed to be a person
not a company.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Clean RCT ordering issues
Brett Creeley [Thu, 29 Feb 2024 19:39:34 +0000 (11:39 -0800)]
ionic: Clean RCT ordering issues

Clean up complaints from an xmastree.py scan.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Use CQE profile for dim
Brett Creeley [Thu, 29 Feb 2024 19:39:33 +0000 (11:39 -0800)]
ionic: Use CQE profile for dim

Use the kernel's CQE dim table to align better with the
driver's use of completion queues, and use the tx moderation
when using Tx interrupts.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: change the hwstamp likely check
Brett Creeley [Thu, 29 Feb 2024 19:39:32 +0000 (11:39 -0800)]
ionic: change the hwstamp likely check

An earlier change moved the hwstamp queue check into a helper
function with an unlikely(). However, it makes more sense for
the caller to decide if it's likely() or unlikely(), so make
the change to support that.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: reduce the use of netdev
Shannon Nelson [Thu, 29 Feb 2024 19:39:31 +0000 (11:39 -0800)]
ionic: reduce the use of netdev

To help make sure we're only accessing things we really need
to access we can cut down on the q->lif->netdev references by
using q->dev which is already in cache.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Pass local netdev instead of referencing struct
Brett Creeley [Thu, 29 Feb 2024 19:39:30 +0000 (11:39 -0800)]
ionic: Pass local netdev instead of referencing struct

Instead of using q->lif->netdev, just pass the netdev when it's
locally defined.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Check stop no restart
Brett Creeley [Thu, 29 Feb 2024 19:39:29 +0000 (11:39 -0800)]
ionic: Check stop no restart

If there is a lot of transmit traffic the driver can get into a
situation that the device is starved due to the doorbell never
being rung. This can happen if xmit_more is set constantly
and __netdev_tx_sent_queue() keeps returning false. Fix this
by checking if the queue needs to be stopped right before
calling __netdev_tx_sent_queue(). Use MAX_SKB_FRAGS + 1 as the
stop condition because that's the maximum number of frags
supported for non-TSO transmit.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Clean up BQL logic
Brett Creeley [Thu, 29 Feb 2024 19:39:28 +0000 (11:39 -0800)]
ionic: Clean up BQL logic

The driver currently calls netdev_tx_completed_queue() for every
Tx completion. However, this API is only meant to be called once
per NAPI if any Tx work is done.  Make the necessary changes to
support calling netdev_tx_completed_queue() only once per NAPI.

Also, use the __netdev_tx_sent_queue() API, which supports the
xmit_more functionality.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Make use napi_consume_skb
Brett Creeley [Thu, 29 Feb 2024 19:39:27 +0000 (11:39 -0800)]
ionic: Make use napi_consume_skb

Make use of napi_consume_skb so that skb recycling
can happen by way of the napi_skb_cache.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Shorten a Tx hotpath
Brett Creeley [Thu, 29 Feb 2024 19:39:26 +0000 (11:39 -0800)]
ionic: Shorten a Tx hotpath

Perf was showing some hot spots in ionic_tx_descs_needed()
for TSO traffic.  Rework the function to return sooner where
possible.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Change default number of descriptors for Tx and Rx
Brett Creeley [Thu, 29 Feb 2024 19:39:25 +0000 (11:39 -0800)]
ionic: Change default number of descriptors for Tx and Rx

Cut down the number of default Tx and Rx descriptors to save
initial memory requirements.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoionic: Rework Tx start/stop flow
Brett Creeley [Thu, 29 Feb 2024 19:39:24 +0000 (11:39 -0800)]
ionic: Rework Tx start/stop flow

Currently the driver attempts to wake the Tx queue
for every descriptor processed. However, this is
overkill and can cause thrashing since Tx xmit can be
running concurrently on a different CPU than Tx clean.
Fix this by refactoring Tx cq servicing into its own
function so the Tx wake code can run after processing
all Tx descriptors.

The driver isn't using the expected memory barriers
to make sure the stop/start bits are coherent. Fix
this by  making sure to use the correct memory barriers.

Also, the driver is using the wake API during Tx
xmit even though it's already scheduled. Fix this by
using the start API during Tx xmit.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agodt-bindings: leds: pwm-multicolour: re-allow active-low
Conor Dooley [Thu, 29 Feb 2024 18:24:00 +0000 (18:24 +0000)]
dt-bindings: leds: pwm-multicolour: re-allow active-low

active-low was lifted to the common schema for leds, but it went
unnoticed that the leds-multicolour binding had "additionalProperties:
false" where the other users had "unevaluatedProperties: false", thereby
disallowing active-low for multicolour leds. Explicitly permit it again.

Fixes: c94d1783136e ("dt-bindings: net: phy: Make LED active-low property common")
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Lee Jones <lee@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: bareudp: Remove generic .ndo_get_stats64
Breno Leitao [Thu, 29 Feb 2024 17:04:24 +0000 (09:04 -0800)]
net: bareudp: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: bareudp: Do not allocate stats in the driver
Breno Leitao [Thu, 29 Feb 2024 17:04:23 +0000 (09:04 -0800)]
net: bareudp: Do not allocate stats in the driver

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the bareudp driver and leverage the network
core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'skb-helpers'
David S. Miller [Mon, 4 Mar 2024 08:47:07 +0000 (08:47 +0000)]
Merge branch 'skb-helpers'

Eric Dumazet says:

====================
net: better use of skb helpers

First patch is a pure cleanup.

Second patch adds a DEBUG_NET_WARN_ON_ONCE() in skb_network_header_len(),
this could help to discover old bugs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: adopt skb_network_header_len() more broadly
Eric Dumazet [Thu, 29 Feb 2024 09:39:08 +0000 (09:39 +0000)]
net: adopt skb_network_header_len() more broadly

(skb_transport_header(skb) - skb_network_header(skb))
can be replaced by skb_network_header_len(skb)

Add a DEBUG_NET_WARN_ON_ONCE() in skb_network_header_len()
to catch cases were the transport_header was not set.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonet: adopt skb_network_offset() and similar helpers
Eric Dumazet [Thu, 29 Feb 2024 09:39:07 +0000 (09:39 +0000)]
net: adopt skb_network_offset() and similar helpers

This is a cleanup patch, making code a bit more concise.

1) Use skb_network_offset(skb) in place of
       (skb_network_header(skb) - skb->data)

2) Use -skb_network_offset(skb) in place of
       (skb->data - skb_network_header(skb))

3) Use skb_transport_offset(skb) in place of
       (skb_transport_header(skb) - skb->data)

4) Use skb_inner_transport_offset(skb) in place of
       (skb_inner_transport_header(skb) - skb->data)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Sun, 3 Mar 2024 04:50:59 +0000 (20:50 -0800)]
Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-02-29

We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).

The main changes are:

1) Extend the BPF verifier to enable static subprog calls in spin lock
   critical sections, from Kumar Kartikeya Dwivedi.

2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
   in BPF global subprogs, from Andrii Nakryiko.

3) Larger batch of riscv BPF JIT improvements and enabling inlining
   of the bpf_kptr_xchg() for RV64, from Pu Lehui.

4) Allow skeleton users to change the values of the fields in struct_ops
   maps at runtime, from Kui-Feng Lee.

5) Extend the verifier's capabilities of tracking scalars when they
   are spilled to stack, especially when the spill or fill is narrowing,
   from Maxim Mikityanskiy & Eduard Zingerman.

6) Various BPF selftest improvements to fix errors under gcc BPF backend,
   from Jose E. Marchesi.

7) Avoid module loading failure when the module trying to register
   a struct_ops has its BTF section stripped, from Geliang Tang.

8) Annotate all kfuncs in .BTF_ids section which eventually allows
   for automatic kfunc prototype generation from bpftool, from Daniel Xu.

9) Several updates to the instruction-set.rst IETF standardization
   document, from Dave Thaler.

10) Shrink the size of struct bpf_map resp. bpf_array,
    from Alexei Starovoitov.

11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
    from Benjamin Tissoires.

12) Fix bpftool to be more portable to musl libc by using POSIX's
    basename(), from Arnaldo Carvalho de Melo.

13) Add libbpf support to gcc in CORE macro definitions,
    from Cupertino Miranda.

14) Remove a duplicate type check in perf_event_bpf_event,
    from Florian Lehner.

15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
    with notrace correctly, from Yonghong Song.

16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
    array to fix build warnings, from Kees Cook.

17) Fix resolve_btfids cross-compilation to non host-native endianness,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  selftests/bpf: Test if shadow types work correctly.
  bpftool: Add an example for struct_ops map and shadow type.
  bpftool: Generated shadow variables for struct_ops maps.
  libbpf: Convert st_ops->data to shadow type.
  libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
  bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
  bpf, arm64: use bpf_prog_pack for memory management
  arm64: patching: implement text_poke API
  bpf, arm64: support exceptions
  arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
  bpf: add is_async_callback_calling_insn() helper
  bpf: introduce in_sleepable() helper
  bpf: allow more maps in sleepable bpf programs
  selftests/bpf: Test case for lacking CFI stub functions.
  bpf: Check cfi_stubs before registering a struct_ops type.
  bpf: Clarify batch lookup/lookup_and_delete semantics
  bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
  bpf, docs: Fix typos in instruction-set.rst
  selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
  bpf: Shrink size of struct bpf_map/bpf_array.
  ...
====================

Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'inet_dump_ifaddr-no-rtnl'
David S. Miller [Fri, 1 Mar 2024 11:09:40 +0000 (11:09 +0000)]
Merge branch 'inet_dump_ifaddr-no-rtnl'

Eric Dumazet says:

====================
inet: no longer use RTNL to protect inet_dump_ifaddr()

This series convert inet so that a dump of addresses (ip -4 addr)
no longer requires RTNL.
====================

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: use xa_array iterator to implement inet_dump_ifaddr()
Eric Dumazet [Thu, 29 Feb 2024 11:40:16 +0000 (11:40 +0000)]
inet: use xa_array iterator to implement inet_dump_ifaddr()

1) inet_dump_ifaddr() can can run under RCU protection
   instead of RTNL.

2) properly return 0 at the end of a dump, avoiding an
   an extra recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: prepare inet_base_seq() to run without RTNL
Eric Dumazet [Thu, 29 Feb 2024 11:40:15 +0000 (11:40 +0000)]
inet: prepare inet_base_seq() to run without RTNL

In the following patch, inet_base_seq() will no longer be called
with RTNL held.

Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: annotate data-races around ifa->ifa_flags
Eric Dumazet [Thu, 29 Feb 2024 11:40:14 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_flags

ifa->ifa_flags can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: annotate data-races around ifa->ifa_preferred_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:13 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_preferred_lft

ifa->ifa_preferred_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: annotate data-races around ifa->ifa_valid_lft
Eric Dumazet [Thu, 29 Feb 2024 11:40:12 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_valid_lft

ifa->ifa_valid_lft can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoinet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp
Eric Dumazet [Thu, 29 Feb 2024 11:40:11 +0000 (11:40 +0000)]
inet: annotate data-races around ifa->ifa_tstamp and ifa->ifa_cstamp

ifa->ifa_tstamp can be read locklessly.

Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

Do the same for ifa->ifa_cstamp to prepare upcoming changes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'netdevsim-link'
David S. Miller [Fri, 1 Mar 2024 10:43:11 +0000 (10:43 +0000)]
Merge branch 'netdevsim-link'

David Wei says:

====================
netdevsim: link and forward skbs between ports

This patchset adds the ability to link two netdevsim ports together and
forward skbs between them, similar to veth. The goal is to use netdevsim
for testing features e.g. zero copy Rx using io_uring.

This feature was tested locally on QEMU, and a selftest is included.

I ran netdev selftests CI style and all tests but the following passed:
- gro.sh
- l2tp.sh
- ip_local_port_range.sh

gro.sh fails because virtme-ng mounts as read-only and it tries to write
to log.txt. This issue was reported to virtme-ng upstream.

l2tp.sh and ip_local_port_range.sh both fail for me on net-next/main as
well.

---
v13->v14:
- implement ndo_get_iflink()
- fix returning 0 if peer is already linked during linking or not linked
  during unlinking
- bump dropped counter if nsim_ipsec_tx() fails and generally reorder
  nsim_start_xmit()
- fix overflowing lines and indentations

v12->v13:
- wait for socat listening port to be ready before sending data in
  selftest

v11->v12:
- fix leaked netns refs
- fix rtnetlink.sh kci_test_ipsec_offload() selftest

v10->v11:
- add udevadm settle after creating netdevsims in selftest

v9->v10:
- fix not freeing skb when not there is no peer
- prevent possible id clashes in selftest
- cleanup selftest on error paths

v8->v9:
- switch to getting netns using fd rather than id
- prevent linking a netdevsim to itself
- update tests

v7->v8:
- fix not dereferencing RCU ptr using rcu_dereference()
- remove unused variables in selftest

v6->v7:
- change link syntax to netnsid:ifidx
- replace dev_get_by_index() with __dev_get_by_index()
- check for NULL peer when linking
- add a sysfs attribute for unlinking
- only update Tx stats if not dropped
- update selftest

v5->v6:
- reworked to link two netdevsims using sysfs attribute on the bus
  device instead of debugfs due to deadlock possibility if a netdevsim
  is removed during linking
- removed unnecessary patch maintaining a list of probed nsim_devs
- updated selftest

v4->v5:
- reduce nsim_dev_list_lock critical section
- fixed missing mutex unlock during unwind ladder
- rework nsim_dev_peer_write synchronization to take devlink lock as
  well as rtnl_lock
- return err msgs to user during linking if port doesn't exist or
  linking to self
- update tx stats outside of RCU lock

v3->v4:
- maintain a mutex protected list of probed nsim_devs instead of using
  nsim_bus_dev
- fixed synchronization issues by taking rtnl_lock
- track tx_dropped skbs

v2->v3:
- take lock when traversing nsim_bus_dev_list
- take device ref when getting a nsim_bus_dev
- return 0 if nsim_dev_peer_read cannot find the port
- address code formatting
- do not hard code values in selftests
- add Makefile for selftests

v1->v2:
- renamed debugfs file from "link" to "peer"
- replaced strstep() with sscanf() for consistency
- increased char[] buf sz to 22 for copying id + port from user
- added err msg w/ expected fmt when linking as a hint to user
- prevent linking port to itself
- protect peer ptr using RCU

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonetdevsim: fix rtnetlink.sh selftest
David Wei [Wed, 28 Feb 2024 23:22:53 +0000 (15:22 -0800)]
netdevsim: fix rtnetlink.sh selftest

I cleared IFF_NOARP flag from netdevsim dev->flags in order to support
skb forwarding. This breaks the rtnetlink.sh selftest
kci_test_ipsec_offload() test because ipsec does not connect to peers it
cannot transmit to.

Fix the issue by adding a neigh entry manually. ipsec_offload test now
successfully pass.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agonetdevsim: add selftest for forwarding skb between connected ports
David Wei [Wed, 28 Feb 2024 23:22:52 +0000 (15:22 -0800)]
netdevsim: add selftest for forwarding skb between connected ports

Connect two netdevsim ports in different namespaces together, then send
packets between them using socat.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net>
Signed-off-by: David S. Miller <davem@davemloft.net>