git.kernel.dk Git - linux-2.6-block.git/log

tipc: apply bearer link tolerance on running links

Currently, the default link tolerance set in struct tipc_bearer only
has effect on links going up after that moment. I.e., a user has to
reset all the node's links across that bearer to have the new value
applied. This is too limiting and disturbing on a running cluster to
be useful.

We now change this so that also already existing links are updated
dynamically, without any need for a reset, when the bearer value is
changed. We leverage the already existing per-link functionality
for this to achieve the wanted effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-speed-up-reading-on-chip-memory'

Rahul Lakkireddy says:

====================
cxgb4: speed up reading on-chip memory

This series of patches speed up reading on-chip memory (EDC and MC)
by reading 64-bits at a time.

Patch 1 reworks logic to read EDC and MC.

Patch 2 adds logic to read EDC and MC 64-bits at a time.

v2:
- Dropped AVX CPU intrinsic instructions.
- Use readq() to read 64-bits at a time.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: speed up on-chip memory read

Use readq() (via t4_read_reg64()) to read 64-bits at a time.
Read residual in 32-bit multiples.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: rework on-chip memory read

Rework logic to read EDC and MC. Do 32-bit reads at a time.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Move ipv4 set_lwt_redirect helper to lwtunnel

IPv4 uses set_lwt_redirect to set the lwtunnel redirect functions as
needed. Move it to lwtunnel.h as lwtunnel_set_redirect and change
IPv6 to also use it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'PTP-support-for-DSA-and-mv88e6xxx-driver'

Andrew Lunn says:

====================
PTP support for DSA and mv88e6xxx driver.

This patchset adds support for using the PTP hardware in switches
supported by the mv88e6xxx driver. The code was produces in
collaboration with Brandon Streiff doing the initial implementation,
and then Richard Cochran and Andrew Lunn making further changes and
cleanups.

The code is sufficient to use ptp4l on a single DSA interface, either
as a master or a slave. Due to the use of an MDIO bus to access the
switch, reading hardware timestamps is slower than what ptp4l
expects. Thus it is necessary to use the option
--tx_timestamp_timeout=32. Heavy use of ethtool -S, or bridge fdb show
can also upset ptp4l. Patches to address this will follow.

Further work is requires to support bridges using Boundary Clock or
Transparent Clock mode.

Since the RFC, an overflow bug has been fixed. Brandon Streiff
has also Acked-by: the updates to his initial patchset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add workaround for 6341 timestamping

88E6341 devices default to timestamping at the PHY, but due to a
hardware issue, timestamps via this component are unreliable. For
this family, configure the PTP hardware to force the timestamping
to occur at the MAC.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add rx/tx timestamping support

This patch implements RX/TX timestamping support.

The Marvell PTP hardware supports RX timestamping individual message
types, but for simplicity we only support the EVENT receive filter since
few if any clients bother with the more specific filter types.

checkpatch and reverse Christmas tree changes by Andrew Lunn.

Re-factor duplicated code paths and avoid IfOk anti-pattern, use the
common ptp worker thread from the class layer and time stamp UDP/IPv4
frames as well as Layer-2 frame by Richard Cochran.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: forward timestamping callbacks to switch drivers

Forward the rx/tx timestamp machinery from the dsa infrastructure to the
switch driver.

On the rx side, defer delivery of skbs until we have an rx timestamp.
This mimicks the behavior of skb_defer_rx_timestamp.

On the tx side, identify PTP packets, clone them, and pass them to the
underlying switch driver before we transmit. This mimicks the behavior
of skb_tx_timestamp.

Adjusted txstamp API to keep the allocation and freeing of the clone
in the same central function by Richard Cochran

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: forward hardware timestamping ioctls to switch driver

This patch adds support to the dsa slave network device so that
switch drivers can implement the SIOC[GS]HWTSTAMP ioctls and the
ethtool timestamp-info interface.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add support for event capture

This patch adds support for configuring mv88e6xxx GPIO lines as PTP
pins, so that they may be used for time stamping external events or for
periodic output.

Checkpatch and reverse Christmas tree fixes by Andrew Lunn

Periodic output removed by Richard Cochran, until a better abstraction
of a VCO is added to Linux in general.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add support for GPIO configuration

MV88E6352 and later switches support GPIO control through the "Scratch
& Misc" global2 register. (Older switches do too, though with a slightly
different register interface. Only the 6352-style is implemented here.)

Add a new file, global2_scratch.c, for operations in the Scratch & Misc
space. Additionally, add a GPIO operations structure to present an
abstract view over GPIO manipulation.

Reverse Christmas tree and unsigned has been replaced with unsigned
int by Andrew Lunn.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: expose switch time as a PTP hardware clock

This patch adds basic support for exposing the 32-bit timestamp counter
inside the mv88e6xxx switch as a ptp_clock.

Adjfine implemented by Richard Cochran.
Andrew Lunn: fix return value of PTP stub function.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add accessors for PTP/TAI registers

This patch implements support for accessing the Precision Time Protocol
and Time Application Interface registers via the AVB register interface
in the Global 2 register.

The register interface differs slightly between different models; older
models use a 3-bit operations field, while newer models use a 2-bit
field. The operations values and the special "global port" values are
different between the two. This is a similar split to the differences
in the "Ingress Rate" register between models, so, like in that case,
we call the two variants "6352" and "6390" and create an ops structure
to abstract between the two.

checkpatch fixups by Andrew Lunn

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: export g2 register accessors

Let the mv88e6xxx_g2_* register accessor functions be accessible
outside of global2.c.

Signed-off-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ptp: Add stub for ptp_classify_raw()

When NET_PTP_CLASSIFY is disabled, a stub function is required in
order that the drivers compile.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: try to keep packet if SYN_RCV race is lost

배석진 reported that in some situations, packets for a given 5-tuple
end up being processed by different CPUS.

This involves RPS, and fragmentation.

배석진 is seeing packet drops when a SYN_RECV request socket is
moved into ESTABLISH state. Other states are protected by socket lock.

This is caused by a CPU losing the race, and simply not caring enough.

Since this seems to occur frequently, we can do better and perform
a second lookup.

Note that all needed memory barriers are already in the existing code,
thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
and reqsk_put(). The second lookup must find the new socket,
unless it has already been accepted and closed by another cpu.

Note that the fragmentation could be avoided in the first place by
use of a correct TCP MSS option in the SYN{ACK} packet, but this
does not mean we can not be more robust.

Many thanks to 배석진 for a very detailed analysis.

Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-dev-Make-protocol-ptr-dependent-on-CONFIG'

David Ahern says:

====================
net: dev: Make protocol ptr dependent on CONFIG

Found these in a branch from 3-years ago. Still relevant today.
Make decnet, ax25, and atalk ptrs in net_device based on their
respective CONFIG.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Remove atalk header from socket.c

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make atalk_ptr depend on ATALK or IRDA

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make ax25_ptr depend on CONFIG_AX25

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make dn_ptr depend on CONFIG_DECNET

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-02-13

This series contains updates to i40e and i40evf.

Wei Yongjun fixes a function that needed to be "static". Also fixes the
use of GFP_KERNEL to GFP_ATOMIC when we have taken a spinlock.

Mitch cleans up several info messages to not include the memory
addresses being used on the off chance this information could be used
maliciously.

Alan provides several fixes to the broadcast filters starting with the
triggering of overflow promiscuous in circumstances where we run out of
space for broadcast filters to prevent traffic from being unexpectedly
dropped. Refactored the code to improve the readability and
maintainability when we are concerned about when and how overflow
promiscuous is changed.

Harshitha cleans up a message to make it more clear on what is being
reset, so users are not confused and think the PF is resetting.

Dave fixes an issue where the MAC, firmware version and NPAR checks used
to determine if shutting off the firmware LLDP engine is supported or
not, instead set a hardware flag which ethtool can use.

Jake updates the VF driver to use __dev_uc_sync and __dev_mc_sync, like
the PF driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: Add FIB onlink tests

Add test cases verifying FIB onlink commands work as expected in
various conditions - IPv4, IPv6, main table, and VRF.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40evf: Make VF reset warning message more clear

When the PF resets the VF, the VF puts out a warning message
indicating that the VF received a reset message from the PF.
Make this message more clear so that we do not mistakenly
think that the PF is undergoing a reset.

Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: use __dev_[um]c_sync routines in .set_rx_mode

Similar to changes done to the PF driver in commit 6622f5cdbaf3 ("i40e:
make use of __dev_uc_sync and __dev_mc_sync"), replace our
home-rolled method for updating the internal status of MAC filters with
__dev_uc_sync and __dev_mc_sync.

These new functions use internal state within the netdev struct in order
to efficiently break the question of "which filters in this list need to
be added or removed" into singular "add this filter" and "delete this
filter" requests.

This vastly improves our handling of .set_rx_mode especially with large
number of MAC filters being added to the device, and even results in
a simpler .set_rx_mode handler.

Under some circumstances, such as when attached to a bridge, we may
receive a request to delete our own permanent address. Prevent deletion
of this address during i40evf_addr_unsync so that we don't accidentally
stop receiving traffic.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: i40e: Change ethtool check from MAC to HW flag

The MAC, FW Version and NPAR check used to determine
if shutting off the FW LLDP engine is supported is not
using the usual feature check mechanism.

This patch fixes the problem by moving the feature check
to i40e_sw_init in order to set a flag in pf->hw_features
that ethtool will use for priv_flags disable operation.

Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: do not force filter failure in overflow promiscuous

Broadcast filters can now cause overflow promiscuous to trigger when
adding "too many" VLANs to all the ports of a device and the driver
needs a way to exit overflow promiscuous once triggered.

Currently the driver looks to see if there are "too many" filters and/or
we have any failed filters to determine when it is safe to exit overflow
promiscuous.  If we trigger overflow promiscuous with broadcast filters,
any new filters added will be "auto-failed" until we exit overflow
promiscuous.  Since the user can't manually remove the failed broadcast
filters for VLANs (nor should we expect the user to do such), there is
no way to exit overflow promiscuous without reloading the driver.

The easiest way to do this is to remove the shortcut to "auto-fail"
filters in overflow promiscuous.  If the user removes the VLANs, the
failed filters will be removed and since we're no longer "auto-failing"
new filters, we'll eventually get a good set of filters and exit
overflow promiscuous.

This has the side benefit of making filter state more explicit in that
if a filter says it's failed we know for a fact it failed and not just
assuming it will if we're in overflow promiscuous.  This is nice because
if the user removes some filters and then adds some, even if we're in
overflow promiscuous, the filter might succeed; we were just assuming it
won't because the user hasn't rectified other existing failed filters.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: refactor promisc_changed in i40e_sync_vsi_filters

This code here is quite complex and easy to screw up.  Let's see if we
can't improve the readability and maintainability a bit.  This refactors
out promisc_changed into two variables 'old_overflow' and 'new_overflow'
which makes it a bit clearer when we're concerned about when and how
overflow promiscuous is changed.  This also makes so that we no longer
need to pass a boolean pointer to i40e_aqc_add_filters.  Instead we can
simply check if we changed the overflow promiscuous flag since the
function start.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: Use an iterator of the same type as the list

When iterating through the linked list of VLAN filters, make the
iterator the same type as that of the linked list.

Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: broadcast filters can trigger overflow promiscuous

When adding a bunch of VLANs to all the ports on a device, it's possible
to run out of space for broadcast filters. The driver should trigger
overflow promiscuous in this circumstance to prevent traffic from being
unexpectedly dropped.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: don't leak memory addresses

Could a Bad Person do Bad Things to a server if they found these
addresses printed in the log? Who knows? But let's not take that risk.

Remove pointers from a bunch of printks. In some cases, I was able to
adjust the message to indicate whether or not the value was null. In
others, I just removed the entire message as there was really no hope of
saving it.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: use GFP_ATOMIC under spin lock

A spin lock is taken here so we should use GFP_ATOMIC.

Fixes: 504398f0a78e ("i40evf: use spinlock to protect (mac|vlan)_filter_list")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Make local function i40e_get_link_speed static

Fixes the following sparse warning:

drivers/net/ethernet/intel/i40e/i40e_main.c:5440:5: warning:
symbol 'i40e_get_link_speed' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'selftests-fib_tests-simplifications-verbosity-and-a-race'

David Ahern says:

====================
selftests: fib_tests: simplifications, verbosity and a race

Improve efficiency of fib_tests.sh and make the test result more verbose,
from this summary:
$ fib_tests.sh is failing in a VM:
    $ fib_tests.sh
    Running netdev unregister tests
    PASS: unicast route test
    PASS: multipath route test
    Running netdev down tests
    PASS: unicast route test
    PASS: multipath route test
    Running netdev carrier change tests
    PASS: local route carrier test
    FAIL: unicast route carrier test

where a single entry actually corresponds to many checks to a much more
verbse output that clarifies test cases:
$fib_tests.sh
Single path route carrier test
    ....
    Carrier down
        IPv4 fibmatch                                         [ OK ]
        IPv6 fibmatch                                         [ OK ]
        IPv4 linkdown flag set                                [FAIL]
        IPv6 linkdown flag set                                [FAIL]
    Second address added with carrier down
        IPv4 fibmatch                                         [ OK ]
        IPv6 fibmatch                                         [ OK ]
        IPv4 linkdown flag set                                [FAIL]
        IPv6 linkdown flag set                                [ OK ]

And then fix the race in changing carrier down on dummy device to checking
the corresponding routes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fib_tests: sleep after changing carrier

sleep for a second after setting carrier down to allow linkwatch
to propagate the change to the routing stack via netdev_state_change.
As it stands there is a race setting carrier down on the dummy
device and then checking the linkdown flag in the routes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fib_tests: Move admin of dummy0 to helpers

Move setup and teardown of testns and dummy0 to helpers.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fib_tests: Make test results more verbose

fib_tests.sh is failing in a VM:
    $ fib_tests.sh
    Running netdev unregister tests
    PASS: unicast route test
    PASS: multipath route test
    Running netdev down tests
    PASS: unicast route test
    PASS: multipath route test
    Running netdev carrier change tests
    PASS: local route carrier test
    FAIL: unicast route carrier test

The last test corresponds to fib_carrier_unicast_test which 12 places
that could be failing. Be more verbose in the output so a failure is
easier to track down and separate test setup failures with set -e and
set +e pairs.

With the verbose logging it is easier to see which checks are failing:
    $fib_tests.sh
    Single path route carrier test
        ....
        Carrier down
            IPv4 fibmatch                                         [ OK ]
            IPv6 fibmatch                                         [ OK ]
            IPv4 linkdown flag set                                [FAIL]
            IPv6 linkdown flag set                                [FAIL]
        Second address added with carrier down
            IPv4 fibmatch                                         [ OK ]
            IPv6 fibmatch                                         [ OK ]
            IPv4 linkdown flag set                                [FAIL]
            IPv6 linkdown flag set                                [ OK ]

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: fib_tests: simplify ip commands in a namespace

'ip netns exec testns ip' is more efficiently handled using 'ip -netns';
runs the ip command after switching the namespace and avoids an exec.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv4: Unexport fib_multipath_hash and fib_select_path

Do not export fib_multipath_hash or fib_select_path; both are only used
by core ipv4 code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv4: Simplify fib_select_path

If flow oif is set and it is not an l3mdev, then fib_select_path
can jump to the source address check.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sctp-rename-sctp-diag-file-and-add-file-comments-for-it'

Xin Long says:

====================
sctp: rename sctp diag file and add file comments for it

This patchset is to remove the sctp_ prefix for sctp diag file,
and also to add the missing file comments for it.

v1->v2:
split them into two patches as Marcelo suggested.
====================

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add file comments in diag.c

This patch is to add the missing file comments for sctp diag file.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: rename sctp_diag.c as diag.c

Remove 'sctp_' prefix for diag file, to keep consistent with other
files' names.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Use NL_SET_ERR_MSG_MOD

Use NL_SET_ERR_MSG_MOD helper which adds the module name instead
of specifying the prefix each time.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-SPAN-cleanups'

Jiri Pirko says:

====================
mlxsw: SPAN cleanups

In patch one of this short series, a misplaced pointer star is moved to
the correct place.

In the second patch, we observe that if SPAN entries carry their
reference count anyway, it's redundant to also carry a "used" flag.

In the third patch, SPAN support code is moved to a separate module.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Move SPAN code to separate module

For the upcoming work on SPAN, it makes sense to move the current code
to a module of its own. It already has a well-defined API boundary to
the mirror management (which is used from matchall and ACL code). A
couple more functions need to be exported for the functions that
spectrum.c needs to use for MTU handling and subsystem init/fini.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Drop struct span_entry.used

The member ref_count already determines whether a given SPAN entry is
used, and is as easy to use as a dedicated boolean.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Fix a coding style nit

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-IPIP-cleanups'

Jiri Pirko says:

====================
mlxsw: IPIP cleanups

In the first patch, a forgotten #include is added. Even though the code
compiles as-is, the include is necessary for modules that should include
spectrum_ipip.h.

The second patch corrects an assumption that IPv6 tunnels use struct
ip_tunnel_parm to store tunnel parameters.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Distinguish between IPv4/6 tunnels

struct ip_tunnel_parm, where GRE and several other tunnel types hold
information, is IPv4-specific. The current router / ipip code in mlxsw
however uses it as if it were generic.

Make it clear that it's not. Rename many functions from _params_ to
_params4_. mlxsw_sp_ipip_parms_saddr() and _daddr() take a proto
argument to dispatch on it. Move the dispatch logic to
mlxsw_sp_ipip_netdev_saddr() and _daddr(), and replace with
single-protocol functions.

In struct mlxsw_sp_ipip_entry, move the "parms" field to a (for the time
being, singleton) union. Update users throughout.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_ipip: Add a forgotten include

struct ip_tunnel_parm, which is used in spectrum_ipip.h, is defined in
if_tunnel.h. However, the former neglects to include the latter.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa_eth: fix incorrect comment

The comment stated that a thread was started, but
that is not the case.

Signed-off-by: Jake Moroni <mail@jakemoroni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Replacing-net_mutex-with-rw_semaphore'

Kirill Tkhai says:

====================
Replacing net_mutex with rw_semaphore

this is the third version of the patchset introducing net_sem
instead of net_mutex. The patchset adds net_sem in addition
to net_mutex and allows pernet_operations to be "async". This
flag means, the pernet_operations methods are safe to be
executed with any other pernet_operations (un)initializing
another net.

If there are only async pernet_operations in the system,
net_mutex is not used either for setup_net() or for cleanup_net().

The pernet_operations converted in this patchset allow
to create minimal .config to have network working, and
the changes improve the performance like you may see
below:

    %for i in {1..10000}; do unshare -n bash -c exit; done

    *before*
    real 1m40,377s
    user 0m9,672s
    sys 0m19,928s

    *after*
    real 0m17,007s
    user 0m5,311s
    sys 0m11,779

    (5.8 times faster)

In the future, when all pernet_operations become async,
we'll just remove this "async" field tree-wide.

All the new logic is concentrated in patches [1-5/32].
The rest of patches converts specific operations:
review, rationale of they can be converted, and setting
of async flag.

Kirill

v3: Improved patches descriptions. Added comment into [5/32].
Added [32/32] converting netlink_tap_net_ops (new pernet operations
introduced in 2018).

v2: Single patch -> patchset with rationale of every conversion
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert netlink_tap_net_ops

These pernet_operations init just allocated net memory,
and they obviously can be executed in parallel in any
others.

v3: New

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert diag_net_ops

These pernet operations just create and destroy netlink
socket. The socket is pernet and else operations don't
touch it.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert default_device_ops

These pernet operations consist of exit() and exit_batch() methods.

default_device_exit() moves not-local and virtual devices to init_net.
There is nothing exciting, because this may happen in any time
on a working system, and rtnl_lock() and synchronize_net() protect
us from all cases of external dereference.

The same for default_device_exit_batch(). Similar unregisteration
may happen in any time on a system. Here several lists (like todo_list),
which are accessed under rtnl_lock(). After rtnl_unlock() and
netdev_run_todo() all the devices are flushed.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert loopback_net_ops

These pernet_operations have only init() method. It allocates
memory for net_device, calls register_netdev() and assigns
net::loopback_dev.

register_netdev() is allowed be used without additional locks,
as it's synchronized on rtnl_lock(). There are many examples
of using this functon directly from ioctl().

The only difference, compared to ioctl(), is that net is not
completely alive at this moment. But it looks like, there is
no way for parallel pernet_operations to dereference
the net_device, as the most of struct net_device lists,
where it's linked, are related to net, and the net is not liked.

The exceptions are net_device::unreg_list, close_list, todo_list,
used for unregistration, and ::link_watch_list, where net_device
may be linked to global lists.

Unregistration of loopback_dev obviously can't happen, when
loopback_net_init() is executing, as the net as alive. It occurs
in default_device_ops, which currently requires net_mutex,
and it behaves as a barrier at the moment. It will be considered
in next patch.

Speaking about link_watch_list, it seems, there is no way
for loopback_dev at time of registration to be linked in lweventlist
and be available for another pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert addrconf_ops

These pernet_operations (un)register sysctl, which
are not touched by anybody else.

So, it's safe to make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert ipv4_sysctl_ops

These pernet_operations create and destroy sysctl,
which are not touched by anybody else.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert packet_net_ops

These pernet_operations just create and destroy /proc entry,
and another operations do not touch it.

Also, nobody else are interested in foreign net::packet::sklist.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert unix_net_ops

These pernet_operations are just create and destroy
/proc and sysctl entries, and are not touched by
foreign pernet_operations.

So, we are able to make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert pernet_subsys, registered from inet_init()

arp_net_ops just addr/removes /proc entry.

devinet_ops allocates and frees duplicate of init_net tables
and (un)registers sysctl entries.

fib_net_ops allocates and frees pernet tables, creates/destroys
netlink socket and (un)initializes /proc entries. Foreign
pernet_operations do not touch them.

ip_rt_proc_ops only modifies pernet /proc entries.

xfrm_net_ops creates/destroys /proc entries, allocates/frees
pernet statistics, hashes and tables, and (un)initializes
sysctl files. These are not touched by foreigh pernet_operations

xfrm4_net_ops allocates/frees private pernet memory, and
configures sysctls.

sysctl_route_ops creates/destroys sysctls.

rt_genid_ops only initializes fields of just allocated net.

ipv4_inetpeer_ops allocated/frees net private memory.

igmp_net_ops just creates/destroys /proc files and socket,
noone else interested in.

tcp_sk_ops seems to be safe, because tcp_sk_init() does not
depend on any other pernet_operations modifications. Iteration
over hash table in inet_twsk_purge() is made under RCU lock,
and it's safe to iterate the table this way. Removing from
the table happen from inet_twsk_deschedule_put(), but this
function is safe without any extern locks, as it's synchronized
inside itself. There are many examples, it's used in different
context. So, it's safe to leave tcp_sk_exit_batch() unlocked.

tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.

udplite4_net_ops only creates/destroys pernet /proc file.

icmp_sk_ops creates percpu sockets, not touched by foreign
pernet_operations.

ipmr_net_ops creates/destroys pernet fib tables, (un)registers
fib rules and /proc files. This seem to be safe to execute
in parallel with foreign pernet_operations.

af_inet_ops just sets up default parameters of newly created net.

ipv4_mib_ops creates and destroys pernet percpu statistics.

raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
and ip_proc_ops only create/destroy pernet /proc files.

ip4_frags_ops creates and destroys sysctl file.

So, it's safe to make the pernet_operations async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert sysctl_core_ops

These pernet_operations register and destroy sysctl
directory, and it's not interesting for foreign
pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert wext_pernet_ops

These pernet_operations initialize and purge net::wext_nlevents
queue, and are not touched by foreign pernet_operations.

Mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert genl_pernet_ops

This pernet_operations create and destroy net::genl_sock.
Foreign pernet_operations don't touch it.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert subsys_initcall() registered pernet_operations from net/sched

psched_net_ops only creates and destroyes /proc entry,
and safe to be executed in parallel with any foreigh
pernet_operations.

tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht,
which is not touched by foreign pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert fib_* pernet_operations, registered via subsys_initcall

Both of them create and initialize lists, which are not touched
by another foreing pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert pernet_subsys ops, registered via net_dev_init()

There are:
1)dev_proc_ops and dev_mc_net_ops, which create and destroy
pernet proc file and not interesting for another net namespaces;
2)netdev_net_ops, which creates pernet hashes, which are not
touched by another pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert proto_net_ops

This patch starts to convert pernet_subsys, registered
from subsys initcalls.

It seems safe to be executed in parallel with others,
as it's only creates/destoyes proc entry,
which nobody else is not interested in.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert uevent_net_ops

uevent_net_init() and uevent_net_exit() create and
destroy netlink socket, and these actions serialized
in netlink code.

Parallel execution with other pernet_operations
makes the socket disappear earlier from uevent_sock_list
on ->exit. As userspace can't be interested in broadcast
messages of dying net, and, as I see, no one in kernel
listen them, we may safely make uevent_net_ops async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert audit_net_ops

This patch starts to convert pernet_subsys, registered
from postcore initcalls.

audit_net_init() creates netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list are not interested
in the socket, so we make audit_net_ops async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert rtnetlink_net_ops

rtnetlink_net_init() and rtnetlink_net_exit()
create and destroy netlink socket net::rtnl.

The socket is used to send rtnl notification via
rtnl_net_notifyid(). There is no a problem
to create and destroy it in parallel with other
pernet operations, as we link net in setup_net()
after the socket is created, and destroy
in cleanup_net() after net is unhashed from all
the lists and there is no RCU references on it.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert netlink_net_ops

The methods of netlink_net_ops create and destroy "netlink"
file, which are not interesting for foreigh pernet_operations.
So, netlink_net_ops may safely be made async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert net_defaults_ops

net_defaults_ops introduce only net_defaults_init_net method,
and it acts on net::core::sysctl_somaxconn, which
is not interesting for the rest of pernet_subsys and
pernet_device lists. Then, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert net_inuse_ops

net_inuse_ops methods expose statistics in /proc.
No one from the rest of pernet_subsys or pernet_device
lists touch net::core::inuse.

So, it's safe to make net_inuse_ops async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert nf_log_net_ops

The pernet_operations would have had a problem in parallel
execution with others, if init_net had been able to released.
But it's not, and the rest is safe for that.
There is memory allocation, which nobody else interested in,
and sysctl registration. So, we make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert netfilter_net_ops

Methods netfilter_net_init() and netfilter_net_exit()
initialize net::nf::hooks and change net-related proc
directory of net. Another pernet_operations are not
interested in forein net::nf::hooks or proc entries,
so it's safe to make them executed in parallel with
methods of other pernet operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert sysctl_pernet_ops

This patch starts to convert pernet_subsys, registered
from core initcalls.

Methods sysctl_net_init() and sysctl_net_exit() initialize
net::sysctls table of a namespace.

pernet_operations::init()/exit() methods from the rest
of the list do not touch net::sysctls of strangers,
so it's safe to execute sysctl_pernet_ops's methods
in parallel with any other pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert net_ns_ops methods

This patch starts to convert pernet_subsys, registered
from pure initcalls.

net_ns_ops::net_ns_net_init/net_ns_net_init, methods use only
ida_simple_* functions, which are not need a synchronization.
They are synchronized by idr subsystem.

So, net_ns_ops methods are able to be executed
in parallel with methods of other pernet operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Convert proc_net_ns_ops

This patch starts to convert pernet_subsys, registered
before initcalls.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
{un,}register pernet net->proc_net and ->proc_net_stat.

Constructors and destructors of another pernet_operations
are not interested in foreign net's proc_net and proc_net_stat.
Proc filesystem privitives are synchronized on proc_subdir_lock.

So, proc_net_ns_ops methods are able to be executed
in parallel with methods of any other pernet operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Allow pernet_operations to be executed in parallel

This adds new pernet_operations::async flag to indicate operations,
which ->init(), ->exit() and ->exit_batch() methods are allowed
to be executed in parallel with the methods of any other pernet_operations.

When there are only asynchronous pernet_operations in the system,
net_mutex won't be taken for a net construction and destruction.

Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic()
without replacing with the equivalent net_sem check, as there is
one more lockdep assert below.

v3: Add comment near net_mutex.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Move mutex_unlock() in cleanup_net() up

net_sem protects from pernet_list changing, while
ops_free_list() makes simple kfree(), and it can't
race with other pernet_operations callbacks.

So we may release net_mutex earlier then it was.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Introduce net_sem for protection of pernet_list

Currently, the mutex is mostly used to protect pernet operations
list. It orders setup_net() and cleanup_net() with parallel
{un,}register_pernet_operations() calls, so ->exit{,batch} methods
of the same pernet operations are executed for a dying net, as
were used to call ->init methods, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().

But there are several problems with scalability. The first one
is that more than one net can't be created or destroyed
at the same moment on the node. For big machines with many cpus
running many containers it's very sensitive.

The second one is that it's need to synchronize_rcu() after net
is removed from net_namespace_list():

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on the systems with many processors
and/or when preemptible RCU is enabled in config. So, all the time, while
cleanup_net() is waiting for RCU grace period, creation of new net namespaces
is not possible, the tasks, who makes it, are sleeping on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
with preemptible RCU enabled after CRIU tests round is finished.

The solution is to convert net_mutex to the rw_semaphore and add fine grain
locks to really small number of pernet_operations, what really need them.

Then, pernet_operations::init/::exit methods, modifying the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list (i.e., when modules are being loaded and unloaded).

This gives signify performance increase, after all patch set is applied,
like you may see here:

%for i in {1..10000}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779

(5.8 times faster)

This patch starts replacing net_mutex to net_sem. It adds rw_semaphore,
describes the variables it protects, and makes to use, where appropriate.
net_mutex is still present, and next patches will kick it out step-by-step.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Cleanup in copy_net_ns()

Line up destructors actions in the revers order
to constructors. Next patches will add more actions,
and this will be comfortable, if there is the such
order.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Assign net to net_namespace_list in setup_net()

This patch merges two repeating pieces of code in one,
and they will live in setup_net() now.

The only change is that assignment:

init_net_initialized = true;

becomes reordered with:

list_add_tail_rcu(&net->list, &net_namespace_list);

The order does not have visible effect, and it is a simple
cleanup because of:

init_net_initialized is used in !CONFIG_NET_NS case
to order proc_net_ns_ops registration occuring at boot time:

start_kernel()->proc_root_init()->proc_net_init(),
with
net_ns_init()->setup_net(&init_net, &init_user_ns)

also occuring in boot time from the same init_task.

When there are no another tasks to race with them,
for the single task it does not matter, which order
two sequential independent loads should be made.
So we make them reordered.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-02-12

This series contains updates to i40e and i40evf.

Alan fixes a spelling mistake in code comments. Fixes an issue on older
firmware versions or NPAR enabled PFs which do not support the
I40E_FLAG_DISABLE_FW_LLDP flag and would get into a situation where any
attempt to change any priv flag would be forbidden.

Alex got busy with the ITR code and made several cleanups and fixes so
that we can more easily understand what is going on. The fixes included
a computational fix when determining the register offset, as well as a
fix for unnecessarily toggling the CLEARPBA bit which could lead to
potential lost events if auto-masking is not enabled.

Filip adds a necessary delay to recover after a EMP reset when using
firmware version 4.33.

Paweł adds a warning message for MFP devices when the link-down-on-close
flag is set because it may affect other partitions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

i40e/i40evf: Add support for new mechanism of updating adaptive ITR

This patch replaces the existing mechanism for determining the correct
value to program for adaptive ITR with yet another new and more
complicated approach.

The basic idea from a 30K foot view is that this new approach will push the
Rx interrupt moderation up so that by default it starts in low latency and
is gradually pushed up into a higher latency setup as long as doing so
increases the number of packets processed, if the number of packets drops
to 4 to 1 per packet we will reset and just base our ITR on the size of the
packets being received. For Tx we leave it floating at a high interrupt
delay and do not pull it down unless we start processing more than 112
packets per interrupt. If we start exceeding that we will cut our interrupt
rates in half until we are back below 112.

The side effect of these patches are that we will be processing more
packets per interrupt. This is both a good and a bad thing as it means we
will not be blocking processing in the case of things like pktgen and XDP,
but we will also be consuming a bit more CPU in the cases of things such as
network throughput tests using netperf.

One delta from this versus the ixgbe version of the changes is that I have
made the interrupt moderation a bit more aggressive when we are in bulk
mode by moving our "goldilocks zone" up from 48 to 96 to 56 to 112. The
main motivation behind moving this is to address the fact that we need to
update less frequently, and have more fine grained control due to the
separate Tx and Rx ITR times.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Split container ITR into current_itr and target_itr

This patch is mostly prep-work for replacing the current approach to
programming the dynamic aka adaptive ITR. Specifically here what we are
doing is splitting the Tx and Rx ITR each into two separate values.

The first value current_itr represents the current value of the register.

The second value target_itr represents the desired value of the register.

The general plan by doing this is to allow for deferring the update of the
ITR value under certain circumstances. For now we will work with what we
have, but in the future I hope to change the behavior so that we always
only update one ITR at a time using some simple logic to determine which
ITR requires an update.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: Correctly populate rxitr_idx and txitr_idx

While testing code for the recent ITR changes I found that updating the Tx
ITR appeared to have no effect with everything defaulting to the Rx ITR. A
bit of digging narrowed it down the fact that we were asking the PF to
associate all causes with ITR 0 as we weren't populating the itr_idx values
for either Rx or Tx.

To correct it I have added the configuration for these values to this
patch. In addition I did some minor clean-up to just add a local pointer
for the vector map instead of dereferencing it based off of the index
repeatedly. In my opinion this makes the resultant code a bit more readable
and saves us a few characters.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Use usec value instead of reg value for ITR defines

Instead of using the register value for the defines when setting up the
ring ITR we can just use the actual values and avoid the use of shifts and
macros to translate between the values we have and the values we want.

This helps to make the code more readable as we can quickly translate from
one value to the other.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

net: make getname() functions return length rather than use int* parameter

Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e/i40evf: Don't bother setting the CLEARPBA bit

The CLEARPBA bit in the dynamic interrupt control register actually has
no effect either way on the hardware. As per errata 28 in the XL710
specification update the interrupt is actually cleared any time the
register is written with the INTENA_MSK bit set to 0. As such the act of
toggling the enable bit actually will trigger the interrupt being
cleared and could lead to potential lost events if auto-masking is
not enabled.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Clean-up of bits related to using q_vector->reg_idx

This patch is a further clean-up related to the change over to using
q_vector->reg_idx when accessing the ITR registers. Specifically the code
appears to have several other spots where we were computing the register
offset manually and this resulted in errors in a few spots.

Specifically in the i40evf functions for mapping queues to vectors it
appears we may have had an off by 1 error since (v_idx - 1) for the first
q_vector with an index of 0 would result in us returning -1 if I am not
mistaken.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: use changed_flags to check I40E_FLAG_DISABLE_FW_LLDP

Currently in i40e_set_priv_flags we use new_flags to check for the
I40E_FLAG_DISABLE_FW_LLDP flag.  This is an issue for a few a reasons.
DISABLE_FW_LLDP is persistent across reboots/driver reloads.  This means
we need some way to detect if FW LLDP is enabled on init.  We do this by
trying to init_dcb and if it fails with EPERM we know LLDP is disabled
in FW.

This could be a problem on older FW versions or NPAR enabled PFs because
there are situations where the FW could disable LLDP, but they do _not_
support using this flag to change it.  If we do end up in this
situation, the flag will be set, then when the user tries to change any
priv flags, the driver thinks the user is trying to disable FW LLDP on a
FW that doesn't support it and essentially forbids any priv flag
changes.

The fix is simple, instead of checking if this flag is set, we should be
checking if the user is trying to _change_ the flag on unsupported FW
versions.

This patch also adds a comment explaining that the cmpxchg is the point
of no return.  Once we put the new flags into pf->flags we can't back
out.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Warn when setting link-down-on-close while in MFP

This patch adds a warning message when the link-down-on-close flag is
setting on. The warning is printed only on MFP devices

Signed-off-by: Paweł Jabłoński <pawel.jablonski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Add delay after EMP reset for firmware to recover

This patch adds necessary delay for 4.33 firmware to recover after
EMP reset. Without this patch driver occasionally reinitializes
structures too quickly to communicate with firmware after EMP reset
causing AdminQ to timeout.

Signed-off-by: Filip Sadowski <filip.sadowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Clean up logic for adaptive ITR

The logic for dynamic ITR update is confusing at best as there were odd
paths chosen for how to find the rings associated with a given queue based
on the vector index and other inconsistencies throughout the code.

This patch is an attempt to clean up the logic so that we can more easily
understand what is going on. Specifically if there is a Rx or Tx ring that
is enabled in dynamic mode on the q_vector it is allowed to override the
other side of the interrupt moderation. While it isn't correct all this
patch is doing is cleaning up the logic for now so that when we come
through and fix it we can more easily identify that this is wrong.

The other big change made here is that we replace references to:
vsi->rx_rings[q_vector->v_idx]->itr_setting
with:
q_vector->rx.ring->itr_setting

The general idea is we can avoid the long pointer chase since just
accessing q_vector->rx.ring is a single pointer access versus having to
chase down vsi->rx_rings, and then finding the pointer in the array, and
finally chasing down the itr_setting from there.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Only track one ITR setting per ring instead of Tx/Rx

The rings are already split out into Tx and Rx rings so it doesn't make
sense to have any single ring store both a Tx and Rx itr_setting value.
Since that is the case drop the pair in favor of storing just a single ITR
value.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: fix typo in function description

'bufer' should be 'buffer'

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>