linux-2.6-block.git
5 years ago IB/mlx5: Support scatter to CQE for DC transport type
Yonatan Cohen [Tue, 9 Oct 2018 09:05:13 +0000 (12:05 +0300)]
IB/mlx5: Support scatter to CQE for DC transport type

Scatter to CQE is a HW offload that saves PCI writes by scattering the
payload to the CQE.
This patch extends the existing functionality to support the DC
transport type.

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago Merge remote-tracking branch 'mlx5-next' into for-next
Doug Ledford [Wed, 17 Oct 2018 15:23:37 +0000 (11:23 -0400)]
Merge remote-tracking branch 'mlx5-next' into for-next

Pick up changes to mlx5_ifc.h needed for direct Scatter to CQE support
series to come next.

Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/drivers: Use core provided API for registering device attributes
Parav Pandit [Thu, 11 Oct 2018 19:31:54 +0000 (22:31 +0300)]
RDMA/drivers: Use core provided API for registering device attributes

Use rdma_set_device_sysfs_group() to register device attributes and
simplify the driver.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/core: Allow existing drivers to set one sysfs group per device
Parav Pandit [Thu, 11 Oct 2018 19:31:53 +0000 (22:31 +0300)]
RDMA/core: Allow existing drivers to set one sysfs group per device

Currently many RDMA drivers create device attribute files using
device_create_file() with device-specific attributes.  In the future,
device-specific attributes should be exposed via well-defined netlink
device attributes.

Introduce an API, rdma_set_device_sysfs_group(), that lets existing drivers
set a single group of legacy sysfs attributes.

This API is only for exposing legacy attributes that have existed for some
time.  New drivers should not use this API and should follow the netlink
path instead.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago IB/rxe: Remove unnecessary enum values
Nathan Chancellor [Thu, 27 Sep 2018 05:12:23 +0000 (22:12 -0700)]
IB/rxe: Remove unnecessary enum values

Clang warns when an enumerated type is implicitly converted to another.

drivers/infiniband/sw/rxe/rxe.c:106:27: warning: implicit conversion
from enumeration type 'enum rxe_device_param' to different enumeration
type 'enum ib_atomic_cap' [-Wenum-conversion]
        rxe->attr.atomic_cap                    = RXE_ATOMIC_CAP;
                                                ~ ^~~~~~~~~~~~~~
drivers/infiniband/sw/rxe/rxe.c:131:22: warning: implicit conversion
from enumeration type 'enum rxe_port_param' to different enumeration
type 'enum ib_port_state' [-Wenum-conversion]
        port->attr.state                = RXE_PORT_STATE;
                                        ~ ^~~~~~~~~~~~~~
drivers/infiniband/sw/rxe/rxe.c:132:24: warning: implicit conversion
from enumeration type 'enum rxe_port_param' to different enumeration
type 'enum ib_mtu' [-Wenum-conversion]
        port->attr.max_mtu              = RXE_PORT_MAX_MTU;
                                        ~ ^~~~~~~~~~~~~~~~
drivers/infiniband/sw/rxe/rxe.c:133:27: warning: implicit conversion
from enumeration type 'enum rxe_port_param' to different enumeration
type 'enum ib_mtu' [-Wenum-conversion]
        port->attr.active_mtu           = RXE_PORT_ACTIVE_MTU;
                                        ~ ^~~~~~~~~~~~~~~~~~~
drivers/infiniband/sw/rxe/rxe.c:151:24: warning: implicit conversion
from enumeration type 'enum rxe_port_param' to different enumeration
type 'enum ib_mtu' [-Wenum-conversion]
                                ib_mtu_enum_to_int(RXE_PORT_ACTIVE_MTU);
                                ~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~~~~~~~
5 warnings generated.

Use the appropriate values from the expected enumerated type so that no
conversion needs to happen, then remove the unneeded definitions.
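
The idea can be sketched in plain C. The enum, struct and function names
below are hypothetical stand-ins and are not the rxe or ib_verbs
definitions; the point is only that each field is assigned an enumerator of
its own type.

```c
/* Hypothetical stand-in for a core enum such as the ones the warnings name. */
enum core_mtu {
	CORE_MTU_256 = 1,
	CORE_MTU_512 = 2,
	CORE_MTU_1024 = 3,
};

struct port_attr {
	enum core_mtu max_mtu;
	enum core_mtu active_mtu;
};

/*
 * Assign enumerators of the field's own type. A driver-private alias such as
 * "enum drv_param { DRV_PORT_MAX_MTU = CORE_MTU_1024 };" would carry the same
 * value but a different type, which is what -Wenum-conversion reports.
 */
void init_port_attr(struct port_attr *attr)
{
	attr->max_mtu = CORE_MTU_1024;
	attr->active_mtu = CORE_MTU_256;
}
```

Once every assignment uses the core enumerators directly, the private
aliases have no remaining users and can be dropped.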

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago net/mlx5: Expose DC scatter to CQE capability bit
Yonatan Cohen [Tue, 9 Oct 2018 09:05:12 +0000 (12:05 +0300)]
net/mlx5: Expose DC scatter to CQE capability bit

The dc_req_scat_data_cqe capability bit indicates whether
requester scatter to CQE is available for 64-byte CQEs over
the DC transport type.

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years ago RDMA/umad: Use kernel API to allocate umad indexes
Leon Romanovsky [Tue, 2 Oct 2018 08:13:30 +0000 (11:13 +0300)]
RDMA/umad: Use kernel API to allocate umad indexes

Replace the custom index allocation code with the generic kernel API.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/uverbs: Use kernel API to allocate uverbs indexes
Leon Romanovsky [Tue, 2 Oct 2018 08:13:29 +0000 (11:13 +0300)]
RDMA/uverbs: Use kernel API to allocate uverbs indexes

Replace the custom index allocation code with the generic kernel API.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/core: Increase total number of RDMA ports across all devices
Leon Romanovsky [Tue, 2 Oct 2018 08:13:28 +0000 (11:13 +0300)]
RDMA/core: Increase total number of RDMA ports across all devices

IDA adds overhead to store the ID bitmap, and the maximal IDA value
can be up to 2099202 (IDA_MAX = 0x80000000U / IDA_BITMAP_BITS - 1).

However, there is no need for such an enormous number of devices,
and it is enough for now to limit it to 8192.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago IB/mlx4: Add port and TID to MAD debug print
Håkon Bugge [Tue, 9 Oct 2018 13:27:40 +0000 (15:27 +0200)]
IB/mlx4: Add port and TID to MAD debug print

Add said information and make the debug print format consistent.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago IB/mlx4: Enable debug print of SMPs
Håkon Bugge [Tue, 9 Oct 2018 13:27:39 +0000 (15:27 +0200)]
IB/mlx4: Enable debug print of SMPs

IB Subnet Management Packets (SMPs) were excluded from debug prints.

Fixed by enabling print even on QP0 MADs.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/core: Rename ports_parent to ports_kobj
Parav Pandit [Sun, 7 Oct 2018 09:12:41 +0000 (12:12 +0300)]
RDMA/core: Rename ports_parent to ports_kobj

Normally kobj objects have kobj suffix to reflect it.
Rename ports_parent to ports_kobj.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/core: Do not expose unsupported counters
Parav Pandit [Sun, 7 Oct 2018 09:12:40 +0000 (12:12 +0300)]
RDMA/core: Do not expose unsupported counters

If the provider driver (such as rdma_rxe) doesn't support PMA counters,
avoid exposing their directory, as is already done for the optional
hw_counters directory. If the core fails to read a PMA counter, return an
error so that the user can retry later if needed.

Fixes: 35c4cbb17811 ("IB/core: Create get_perf_mad function in sysfs.c")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago IB/mlx4: Refer to the device kobject instead of ports_parent
Parav Pandit [Sun, 7 Oct 2018 09:12:39 +0000 (12:12 +0300)]
IB/mlx4: Refer to the device kobject instead of ports_parent

iov sysfs tree is created under ib device at
/sys/class/infiniband/mlx4_0/iov.
And,
ibdev->ports_parent->parent = &ibdev->dev.

Therefore, refer to device's kobject directly instead of
indirect access to it.

Additionally, iov entries are created under device kobject and deleted
before device is removed. There is no need to hold additional reference
to device kobject in provider driver.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/nldev: Allow IB device rename through RDMA netlink
Leon Romanovsky [Wed, 10 Oct 2018 06:19:12 +0000 (09:19 +0300)]
RDMA/nldev: Allow IB device rename through RDMA netlink

Provide an option to rename an IB device through RDMA netlink and
limit it to users with the ADMIN capability only.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/core: Implement IB device rename function
Leon Romanovsky [Wed, 10 Oct 2018 06:19:11 +0000 (09:19 +0300)]
RDMA/core: Implement IB device rename function

Generic implementation of IB device rename function.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/core: Annotate timeout as unsigned long
Leon Romanovsky [Thu, 11 Oct 2018 14:30:05 +0000 (17:30 +0300)]
RDMA/core: Annotate timeout as unsigned long

The ucma users supply the timeout in u32 format, which means that any
number with the most significant bit set will be converted to a negative
value by the various rdma_*, cma_* and sa_query functions, which treat the
timeout as int.

At the lowest level, the timeout is converted back to unsigned long.
Remove this ambiguous conversion by updating all function signatures to
receive unsigned long.
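
A small userspace sketch of the conversion problem follows. The function
names are hypothetical and only model the two kinds of signature; they are
not the rdma_*, cma_* or sa_query functions themselves.

```c
#include <stdint.h>

/*
 * Models an intermediate helper that takes the timeout as int. The
 * conversion of an out-of-range value is implementation-defined, but on
 * the usual two's complement targets a value with the top bit set comes
 * out negative.
 */
int timeout_as_int(uint32_t user_ms)
{
	return (int)user_ms;
}

/* Models the updated signature: the value is carried as unsigned long. */
unsigned long timeout_as_ulong(uint32_t user_ms)
{
	return (unsigned long)user_ms;
}
```

With 0x80000000 as input, the int variant yields a negative number while the
unsigned long variant keeps the value the user supplied.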

Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/core: Align multiple functions to kernel coding style
Leon Romanovsky [Thu, 11 Oct 2018 14:30:04 +0000 (17:30 +0300)]
RDMA/core: Align multiple functions to kernel coding style

This patch aligns a small number of functions with the kernel coding
style. It is needed to minimize the diffstat of the following patch. It
doesn't change any functionality.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago RDMA/cma: Remove unused timeout_ms parameter from cma_resolve_iw_route()
Leon Romanovsky [Thu, 11 Oct 2018 14:30:03 +0000 (17:30 +0300)]
RDMA/cma: Remove unused timeout_ms parameter from cma_resolve_iw_route()

cma_resolve_iw_route() doesn't use timeout_ms parameter, so let's remove it.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years ago IB/mlx5: Fix MR cache initialization
Artemy Kovalyov [Mon, 15 Oct 2018 11:13:35 +0000 (14:13 +0300)]
IB/mlx5: Fix MR cache initialization

Schedule MR cache work only after bucket was initialized.

Cc: <stable@vger.kernel.org> # 4.10
Fixes: 49780d42dfc9 ("IB/mlx5: Expose MR cache for mlx5_ib")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/cm: Respect returned status of cm_init_av_by_path
Leon Romanovsky [Thu, 11 Oct 2018 19:36:10 +0000 (22:36 +0300)]
RDMA/cm: Respect returned status of cm_init_av_by_path

Add missing check for failure of cm_init_av_by_path

Fixes: e1444b5a163e ("IB/cm: Fix automatic path migration support")
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago IB/ipoib: Clear IPCB before icmp_send
Denis Drozdov [Thu, 11 Oct 2018 19:33:57 +0000 (22:33 +0300)]
IB/ipoib: Clear IPCB before icmp_send

IPCB should be cleared before icmp_send, since it may contain data from
previous layers and the data could be misinterpreted as ip header options,
which later caused the ihl to be set to an invalid value and resulted in
the following stack corruption:

[ 1083.031512] ib0: packet len 57824 (> 2048) too long to send, dropping
[ 1083.031843] ib0: packet len 37904 (> 2048) too long to send, dropping
[ 1083.032004] ib0: packet len 4040 (> 2048) too long to send, dropping
[ 1083.032253] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.032481] ib0: packet len 23960 (> 2048) too long to send, dropping
[ 1083.033149] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.033439] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.033700] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.034124] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.034387] ==================================================================
[ 1083.034602] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xf08/0x1310
[ 1083.034798] Write of size 4 at addr ffff880353457c5f by task kworker/u16:0/7
[ 1083.034990]
[ 1083.035104] CPU: 7 PID: 7 Comm: kworker/u16:0 Tainted: G           O      4.19.0-rc5+ #1
[ 1083.035316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 1083.035573] Workqueue: ipoib_wq ipoib_cm_skb_reap [ib_ipoib]
[ 1083.035750] Call Trace:
[ 1083.035888]  dump_stack+0x9a/0xeb
[ 1083.036031]  print_address_description+0xe3/0x2e0
[ 1083.036213]  kasan_report+0x18a/0x2e0
[ 1083.036356]  ? __ip_options_echo+0xf08/0x1310
[ 1083.036522]  __ip_options_echo+0xf08/0x1310
[ 1083.036688]  icmp_send+0x7b9/0x1cd0
[ 1083.036843]  ? icmp_route_lookup.constprop.9+0x1070/0x1070
[ 1083.037018]  ? netif_schedule_queue+0x5/0x200
[ 1083.037180]  ? debug_show_all_locks+0x310/0x310
[ 1083.037341]  ? rcu_dynticks_curr_cpu_in_eqs+0x85/0x120
[ 1083.037519]  ? debug_locks_off+0x11/0x80
[ 1083.037673]  ? debug_check_no_obj_freed+0x207/0x4c6
[ 1083.037841]  ? check_flags.part.27+0x450/0x450
[ 1083.037995]  ? debug_check_no_obj_freed+0xc3/0x4c6
[ 1083.038169]  ? debug_locks_off+0x11/0x80
[ 1083.038318]  ? skb_dequeue+0x10e/0x1a0
[ 1083.038476]  ? ipoib_cm_skb_reap+0x2b5/0x650 [ib_ipoib]
[ 1083.038642]  ? netif_schedule_queue+0xa8/0x200
[ 1083.038820]  ? ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
[ 1083.038996]  ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
[ 1083.039174]  process_one_work+0x912/0x1830
[ 1083.039336]  ? wq_pool_ids_show+0x310/0x310
[ 1083.039491]  ? lock_acquire+0x145/0x3a0
[ 1083.042312]  worker_thread+0x87/0xbb0
[ 1083.045099]  ? process_one_work+0x1830/0x1830
[ 1083.047865]  kthread+0x322/0x3e0
[ 1083.050624]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 1083.053354]  ret_from_fork+0x3a/0x50

For instance, __ip_options_echo fails to proceed when invalid srr and
optlen values are passed from another layer via IPCB:

[  762.139568] IPv4: __ip_options_echo rr=0 ts=0 srr=43 cipso=0
[  762.139720] IPv4: ip_options_build: IPCB 00000000f3cd969e opt 000000002ccb3533
[  762.139838] IPv4: __ip_options_echo in srr: optlen 197 soffset 84
[  762.139852] IPv4: ip_options_build srr=0 is_frag=0 rr_needaddr=0 ts_needaddr=0 ts_needtime=0 rr=0 ts=0
[  762.140269] ==================================================================
[  762.140713] IPv4: __ip_options_echo rr=0 ts=0 srr=0 cipso=0
[  762.141078] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x12ec/0x1680
[  762.141087] Write of size 4 at addr ffff880353457c7f by task kworker/u16:0/7
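
The failure mode can be modelled in userspace as a scratch area that one
layer fills and the next layer reinterprets. Everything below is a
hypothetical miniature (struct pkt, struct ip_cb_view and the helper names
are invented for illustration); in the kernel the equivalent step would
presumably be a memset over IPCB(skb) before calling icmp_send.

```c
#include <string.h>

#define CB_SIZE 48

/* Hypothetical packet buffer with a per-layer scratch area. */
struct pkt {
	unsigned char cb[CB_SIZE];
};

/* Hypothetical view that the IP layer takes of the same scratch area. */
struct ip_cb_view {
	unsigned char optlen;
	unsigned char srr;
};

/* A lower layer leaves arbitrary private bytes behind in the scratch area. */
void lower_layer_use(struct pkt *p)
{
	memset(p->cb, 0xC5, CB_SIZE);
}

/* Zero the scratch area before handing the packet to the IP layer. */
void clear_cb(struct pkt *p)
{
	memset(p->cb, 0, CB_SIZE);
}

/* The IP layer reads its option length out of the scratch area. */
unsigned char ip_layer_optlen(const struct pkt *p)
{
	struct ip_cb_view v;

	memcpy(&v, p->cb, sizeof(v));
	return v.optlen;
}
```

Without the clear, the IP layer sees whatever the previous owner wrote and
treats it as an option length; after the clear it sees no options at all.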

Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/restrack: Protect from reentry to resource return path
Leon Romanovsky [Thu, 11 Oct 2018 19:10:10 +0000 (22:10 +0300)]
RDMA/restrack: Protect from reentry to resource return path

Nullify the resource task struct pointer to ensure that subsequent calls
won't try to release task_struct again.

------------[ cut here ]------------
ODEBUG: free active (active state 1) object type: rcu_head hint:
(null)
WARNING: CPU: 0 PID: 6048 at lib/debugobjects.c:329
debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 6048 Comm: syz-executor022 Not tainted
4.19.0-rc7-next-20181008+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x244/0x3ab lib/dump_stack.c:113
  panic+0x238/0x4e7 kernel/panic.c:184
  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
  report_bug+0x254/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
  do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:969
RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Code: 41 88 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 92 00 00 00 48 8b 14 dd
60 02 41 88 4c 89 fe 48 c7 c7 00 f8 40 88 e8 36 2f b4 fd <0f> 0b 83 05 a9
f4 5e 06 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f
RSP: 0018:ffff8801d8c3eda8 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8164d235 RDI: 0000000000000005
RBP: ffff8801d8c3ede8 R08: ffff8801d70aa280 R09: ffffed003b5c3eda
R10: ffffed003b5c3eda R11: ffff8801dae1f6d7 R12: 0000000000000001
R13: ffffffff8939a760 R14: 0000000000000000 R15: ffffffff8840fca0
  __debug_check_no_obj_freed lib/debugobjects.c:786 [inline]
  debug_check_no_obj_freed+0x3ae/0x58d lib/debugobjects.c:818
  kmem_cache_free+0x202/0x290 mm/slab.c:3759
  free_task_struct kernel/fork.c:163 [inline]
  free_task+0x16e/0x1f0 kernel/fork.c:457
  __put_task_struct+0x2e6/0x620 kernel/fork.c:730
  put_task_struct include/linux/sched/task.h:96 [inline]
  finish_task_switch+0x66c/0x900 kernel/sched/core.c:2715
  context_switch kernel/sched/core.c:2834 [inline]
  __schedule+0x8d7/0x21d0 kernel/sched/core.c:3480
  schedule+0xfe/0x460 kernel/sched/core.c:3524
  freezable_schedule include/linux/freezer.h:172 [inline]
  futex_wait_queue_me+0x3f9/0x840 kernel/futex.c:2530
  futex_wait+0x45c/0xa50 kernel/futex.c:2645
  do_futex+0x31a/0x26d0 kernel/futex.c:3528
  __do_sys_futex kernel/futex.c:3589 [inline]
  __se_sys_futex kernel/futex.c:3557 [inline]
  __x64_sys_futex+0x472/0x6a0 kernel/futex.c:3557
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446549
Code: e8 2c b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 2b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f3a998f5da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00000000006dbc38 RCX: 0000000000446549
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00000000006dbc38
RBP: 00000000006dbc30 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc3c
R13: 2f646e6162696e69 R14: 666e692f7665642f R15: 00000000006dbd2c
Kernel Offset: disabled
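
The guard described at the top of this message can be sketched as follows.
The types and helpers are hypothetical userspace stand-ins, not the
restrack structures; a plain counter replaces the real task_struct
reference count.

```c
#include <stddef.h>

/* Hypothetical refcounted owner standing in for a task_struct. */
struct task {
	int refs;
};

/* Hypothetical tracked resource holding one reference on its owner. */
struct res_entry {
	struct task *task;
};

void put_task(struct task *t)
{
	t->refs--;
}

/* Drop the owner reference at most once, however often this is called. */
void res_release_task(struct res_entry *res)
{
	if (!res->task)
		return;
	put_task(res->task);
	res->task = NULL;	/* a second call now returns early */
}
```

Calling res_release_task() twice on the same entry drops the reference only
once, which is the property the nullification is meant to provide.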

Reported-by: syzbot+71aff6ea121ffefc280f@syzkaller.appspotmail.com
Fixes: ed7a01fd3fd7 ("RDMA/restrack: Release task struct which was hold by CM_ID object")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/mlx5: Add support for flow tag to raw create flow
Mark Bloch [Wed, 10 Oct 2018 06:55:10 +0000 (09:55 +0300)]
RDMA/mlx5: Add support for flow tag to raw create flow

A user can provide a hint which will be attached to the packet and written
to the CQE on receive. This can be used to offload operations into the HW,
for example detecting whether a packet is tunneled and, if so, passing 0x1
as the hint. The software can use that hint to decapsulate the packet and
parse only the inner headers, thus saving CPU cycles.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/mlx5: Remove extraneous error check
Gal Pressman [Mon, 8 Oct 2018 16:44:03 +0000 (19:44 +0300)]
RDMA/mlx5: Remove extraneous error check

Remove double error check from create user RQ error flow.

Fixes: 79b20a6c3014 ("IB/mlx5: Add receive Work Queue verbs")
Signed-off-by: Gal Pressman <pressmangal@gmail.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago IB/mlx5: Verify DEVX object type
Yishai Hadas [Sun, 7 Oct 2018 09:06:34 +0000 (12:06 +0300)]
IB/mlx5: Verify DEVX object type

Verify that the input DEVX object type matches the created object.

As the obj_id in the firmware is not globally unique, the object type must
be taken into account when checking for a valid object id.

Once both the type and the id match we know that the lock was taken on the
correct object by the uverbs layer.
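
One way to compare type and id together is to fold both into a single key.
This is a minimal sketch under that assumption; the function names and the
16-bit/32-bit field widths are illustrative and are not taken from the
driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the object type and the per-type id into one 64-bit key. */
uint64_t encode_obj_key(uint16_t obj_type, uint32_t obj_id)
{
	return ((uint64_t)obj_type << 32) | obj_id;
}

/* A request is valid only if both its type and its id match the stored key. */
bool obj_matches(uint64_t stored_key, uint16_t req_type, uint32_t req_id)
{
	return stored_key == encode_obj_key(req_type, req_id);
}
```

Two objects of different types that happen to share an id then produce
different keys, so a request naming the wrong type is rejected.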

Fixes: e662e14d801b ("IB/mlx5: Add DEVX support for modify and query commands")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/hns: Add FRMR support for hip08
Yixian Liu [Fri, 5 Oct 2018 09:53:24 +0000 (17:53 +0800)]
RDMA/hns: Add FRMR support for hip08

This patch adds fast register physical memory region (FRMR) support for
hip08.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Avoid resource leak in case the NQ registration fails
Selvin Xavier [Mon, 8 Oct 2018 10:28:04 +0000 (03:28 -0700)]
RDMA/bnxt_re: Avoid resource leak in case the NQ registration fails

In case the NQ alloc/enable fails, free up the already allocated/enabled
NQ before reporting failure. Also, track the alloc/enable using proper
state checking.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Wait for delayed work to finish before device removal
Selvin Xavier [Mon, 8 Oct 2018 10:28:03 +0000 (03:28 -0700)]
RDMA/bnxt_re: Wait for delayed work to finish before device removal

The delayed work bnxt_re_worker may still be running even after
cancel_delayed_work returns. This causes a crash as the driver proceeds
with device removal. To make sure that the work has finished before
returning, use cancel_delayed_work_sync.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Limit max_pkey to 16 bit value
Devesh Sharma [Mon, 8 Oct 2018 10:28:02 +0000 (03:28 -0700)]
RDMA/bnxt_re: Limit max_pkey to 16 bit value

Some FW versions return pkey values greater than 0xFFFF. The pkey_tbl_len
field of ib_port_attr is a 16-bit value, so restrict max_pkeys to 0xFFFF.
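
The restriction amounts to a saturating narrow from the firmware value to
16 bits. A minimal sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* Saturate a firmware-reported count so it fits a 16-bit field. */
uint16_t clamp_max_pkeys(uint32_t fw_max_pkeys)
{
	return fw_max_pkeys > 0xFFFFu ? (uint16_t)0xFFFFu
				      : (uint16_t)fw_max_pkeys;
}
```

A plain cast would silently truncate instead, turning a value such as
0x10001 into 1.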

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Fix qp async event reporting
Devesh Sharma [Mon, 8 Oct 2018 10:28:01 +0000 (03:28 -0700)]
RDMA/bnxt_re: Fix qp async event reporting

Report affiliated async events on the qp-async event channel instead of
the global event channel.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Report out of sequence hw counters
Selvin Xavier [Mon, 8 Oct 2018 10:28:00 +0000 (03:28 -0700)]
RDMA/bnxt_re: Report out of sequence hw counters

Expose out of sequence errors received from FW.  This counter is a 32-bit
counter, so the driver has to accumulate it. Store the previous value so
that the difference can be calculated in the next query.
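
The accumulation can be sketched as below, assuming the hardware counter
simply wraps at 32 bits. The struct and function names are hypothetical and
not the driver's.

```c
#include <stdint.h>

/* Hypothetical software state kept alongside a 32-bit hardware counter. */
struct oos_counter {
	uint32_t prev;	/* raw value seen at the previous query */
	uint64_t total;	/* accumulated count reported to the user */
};

/* Fold the newest raw value into the 64-bit running total. */
void oos_update(struct oos_counter *c, uint32_t hw_now)
{
	/* Truncating to 32 bits keeps the delta correct across a wrap. */
	c->total += (uint32_t)(hw_now - c->prev);
	c->prev = hw_now;
}
```

Because the delta is computed modulo 2^32, one wrap of the hardware counter
between two queries does not disturb the total.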

Also, update the HW statistics structure with new fields.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Expose rx discards and drop counters
Selvin Xavier [Mon, 8 Oct 2018 10:27:59 +0000 (03:27 -0700)]
RDMA/bnxt_re: Expose rx discards and drop counters

Expose the RoCE discard and drop counters from the HW statistics context

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Prevent driver crash due to NULL pointer in error message print
Somnath Kotur [Mon, 8 Oct 2018 10:27:58 +0000 (03:27 -0700)]
RDMA/bnxt_re: Prevent driver crash due to NULL pointer in error message print

crsqe->resp would be NULL in case the host command timed out before
getting a response from HW. Check for NULL pointer to avoid a potential
crash while printing the error message.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Drop L2 async events silently
Devesh Sharma [Mon, 8 Oct 2018 10:27:57 +0000 (03:27 -0700)]
RDMA/bnxt_re: Drop L2 async events silently

In some FW versions, the RoCE driver also receives an async notification
which was directed to the L2 driver.  The RoCE driver does not handle this
and prints a message to syslog.  Drop these notifications silently.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Avoid accessing nq->bar_reg_iomem in failure case
Selvin Xavier [Mon, 8 Oct 2018 10:27:56 +0000 (03:27 -0700)]
RDMA/bnxt_re: Avoid accessing nq->bar_reg_iomem in failure case

In the failure path, nq->bar_reg_iomem is accessed before it has been
initialized. Avoid this by calling bnxt_qplib_nq_stop_irq only if the
initialization is complete.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Fixes: 6e04b1035689 ("RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Avoid NULL check after accessing the pointer
Selvin Xavier [Mon, 8 Oct 2018 10:27:55 +0000 (03:27 -0700)]
RDMA/bnxt_re: Avoid NULL check after accessing the pointer

This is reported by a smatch check.  rcfw->creq_bar_reg_iomem is accessed
in bnxt_qplib_rcfw_stop_irq, so checking this variable afterwards doesn't
make sense.  Also, rcfw->creq_bar_reg_iomem will never be NULL.  Remove
this check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 6e04b1035689 ("RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Remove the unnecessary version macro definition
Selvin Xavier [Mon, 8 Oct 2018 10:27:54 +0000 (03:27 -0700)]
RDMA/bnxt_re: Remove the unnecessary version macro definition

The version macro is not required as the driver does not maintain a
version. Remove the references to this macro too.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Fix recursive lock warning in debug kernel
Selvin Xavier [Mon, 8 Oct 2018 10:27:53 +0000 (03:27 -0700)]
RDMA/bnxt_re: Fix recursive lock warning in debug kernel

Fix a possible recursive lock warning. It's a false warning as the locks
are part of two different HW queue data structures, cmdq and creq. The
debug kernel throws the following warning and stack trace.

[  783.914967] ============================================
[  783.914970] WARNING: possible recursive locking detected
[  783.914973] 4.19.0-rc2+ #33 Not tainted
[  783.914976] --------------------------------------------
[  783.914979] swapper/2/0 is trying to acquire lock:
[  783.914982] 000000002aa3949d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[  783.914999]
but task is already holding lock:
[  783.915002] 00000000be73920d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x2a/0x350 [bnxt_re]
[  783.915013]
other info that might help us debug this:
[  783.915016]  Possible unsafe locking scenario:

[  783.915019]        CPU0
[  783.915021]        ----
[  783.915034]   lock(&(&hwq->lock)->rlock);
[  783.915035]   lock(&(&hwq->lock)->rlock);
[  783.915037]
 *** DEADLOCK ***

[  783.915038]  May be due to missing lock nesting notation

[  783.915039] 1 lock held by swapper/2/0:
[  783.915040]  #0: 00000000be73920d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x2a/0x350 [bnxt_re]
[  783.915044]
stack backtrace:
[  783.915046] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.19.0-rc2+ #33
[  783.915047] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
[  783.915048] Call Trace:
[  783.915049]  <IRQ>
[  783.915054]  dump_stack+0x90/0xe3
[  783.915058]  __lock_acquire+0x106c/0x1080
[  783.915061]  ? sched_clock+0x5/0x10
[  783.915063]  lock_acquire+0xbd/0x1a0
[  783.915065]  ? bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[  783.915069]  _raw_spin_lock_irqsave+0x4a/0x90
[  783.915071]  ? bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[  783.915073]  bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[  783.915078]  tasklet_action_common.isra.17+0x197/0x1b0
[  783.915081]  __do_softirq+0xcb/0x3a6
[  783.915084]  irq_exit+0xe9/0x100
[  783.915085]  do_IRQ+0x6a/0x120
[  783.915087]  common_interrupt+0xf/0xf
[  783.915088]  </IRQ>

Use nested notation for the spin_lock to avoid this warning.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago RDMA/bnxt_re: Add missing spin lock initialization
Selvin Xavier [Mon, 8 Oct 2018 10:27:52 +0000 (03:27 -0700)]
RDMA/bnxt_re: Add missing spin lock initialization

Add the missing initialization of the cq_lock and qplib.flush_lock.

Fixes: 942c9b6ca8de ("RDMA/bnxt_re: Avoid Hard lockup during error CQE processing")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago Merge branch 'for-rc' into rdma.git for-next
Jason Gunthorpe [Tue, 16 Oct 2018 06:01:02 +0000 (00:01 -0600)]
Merge branch 'for-rc' into rdma.git for-next

From git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

This is required to resolve dependencies of the next series of RDMA
patches.

The code motion conflicts in drivers/infiniband/core/cache.c were
resolved.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years ago IB/mlx5: Unmap DMA addr from HCA before IOMMU
Valentine Fatiev [Wed, 10 Oct 2018 06:56:25 +0000 (09:56 +0300)]
IB/mlx5: Unmap DMA addr from HCA before IOMMU

The function that puts the MR back in the cache also removes the DMA
address from the HCA. Therefore we need to call this function before we
remove the DMA mapping from the IOMMU. Otherwise the HCA may access memory
that is no longer DMA mapped.

Call trace:
NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0-rc6+ #4
Hardware name: HP ProLiant DL360p Gen8, BIOS P71 08/20/2012
RIP: 0010:intel_idle+0x73/0x120
Code: 80 5c 01 00 0f ae 38 0f ae f0 31 d2 65 48 8b 04 25 80 5c 01 00 48 89 d1 0f 60 02
RSP: 0018:ffffffff9a403e38 EFLAGS: 00000046
RAX: 0000000000000030 RBX: 0000000000000005 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff9a5790c0 RDI: 0000000000000000
RBP: 0000000000000030 R08: 0000000000000000 R09: 0000000000007cf9
R10: 000000000000030a R11: 0000000000000018 R12: 0000000000000000
R13: ffffffff9a5792b8 R14: ffffffff9a5790c0 R15: 0000002b48471e4d
FS:  0000000000000000(0000) GS:ffff9c6caf400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5737185000 CR3: 0000000590c0a002 CR4: 00000000000606f0
Call Trace:
 cpuidle_enter_state+0x7e/0x2e0
 do_idle+0x1ed/0x290
 cpu_startup_entry+0x6f/0x80
 start_kernel+0x524/0x544
 ? set_init_arg+0x55/0x55
 secondary_startup_64+0xa4/0xb0
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [04:00.0] fault addr b34d2000 [fault reason 06] PTE Read access is not set
DMAR: [DMA Read] Request device [01:00.2] fault addr bff8b000 [fault reason 06] PTE Read access is not set

Fixes: f3f134f5260a ("RDMA/mlx5: Fix crash while accessing garbage pointer and freed memory")
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
5 years agoRDMA/restrack: Release task struct which was hold by CM_ID object
Leon Romanovsky [Tue, 2 Oct 2018 08:48:03 +0000 (11:48 +0300)]
RDMA/restrack: Release task struct which was hold by CM_ID object

Tracking CM_ID resource is performed in two stages: creation of cm_id
and connecting it to the cma_dev. It is needed because rdma-cm protocol
exports two separate user-visible calls rdma_create_id and rdma_accept.

At the time of CM_ID creation, the real owner of that object is not yet
known and we need to grab task_struct. This task_struct is released or
reassigned later on in the attach phase, but a call to rdma_destroy_id
left this task_struct unreleased.

Such separation is unique to CM_ID; other restrack objects are initialized
in one shot. It means that it is safe to use the "res->valid" check to catch
an unfinished CM_ID flow and release the task_struct for that object.

Fixes: 00313983cda6 ("RDMA/nldev: provide detailed CM_ID information")
Reported-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/restrack: Consolidate task name updates in one place
Leon Romanovsky [Tue, 2 Oct 2018 08:48:02 +0000 (11:48 +0300)]
RDMA/restrack: Consolidate task name updates in one place

Unify task update and kernel name set in one place.

Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/restrack: Un-inline set task implementation
Leon Romanovsky [Tue, 2 Oct 2018 08:48:01 +0000 (11:48 +0300)]
RDMA/restrack: Un-inline set task implementation

Prepare rdma_restrack_set_task() call to accommodate more
code by moving its implementation from *.h to *.c.

Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Check error status of rdma_find_ndev_for_src_ip_rcu
Parav Pandit [Fri, 21 Sep 2018 14:18:24 +0000 (09:18 -0500)]
RDMA/core: Check error status of rdma_find_ndev_for_src_ip_rcu

rdma_find_ndev_for_src_ip_rcu() returns either valid netdev pointer or
ERR_PTR().  Instead of checking for NULL, check for error.
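
As a rough userspace illustration of why a NULL test is not enough (this is
not the patch itself), the sketch below uses simplified stand-ins for the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers. The fake_netdev type and the
find_ndev()/lookup_ifindex() functions are hypothetical names made up for
the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel helpers (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical device type, for illustration only. */
struct fake_netdev { int ifindex; };

static struct fake_netdev example_dev = { .ifindex = 7 };

/* Hypothetical lookup: returns a valid pointer or ERR_PTR(), never NULL. */
static struct fake_netdev *find_ndev(int have_match)
{
	return have_match ? &example_dev : ERR_PTR(-ENODEV);
}

/* Returns the ifindex on success or a negative errno on failure. */
static int lookup_ifindex(int have_match)
{
	struct fake_netdev *ndev = find_ndev(have_match);

	if (IS_ERR(ndev))	/* a "!ndev" test would not catch this */
		return (int)PTR_ERR(ndev);
	return ndev->ifindex;
}
```

With a "!ndev" check the failing case would fall through and dereference
the encoded error value.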

Fixes: caf1e3ae9fa6 ("RDMA/core Introduce and use rdma_find_ndev_for_src_ip_rcu")
Reported-by: syzbot+20c32fa6ff84a2d28c36@syzkaller.appspotmail.com
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt
Venkata Sandeep Dhanalakota [Wed, 26 Sep 2018 17:44:52 +0000 (10:44 -0700)]
IB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt

This patch moves ruc_loopback() from hfi1 into rdmavt for code sharing
with the qib driver.

Reviewed-by: Brian Welty <brian.welty@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/{hfi1, qib, rdmavt}: Move send completion logic to rdmavt
Venkata Sandeep Dhanalakota [Wed, 26 Sep 2018 17:44:42 +0000 (10:44 -0700)]
IB/{hfi1, qib, rdmavt}: Move send completion logic to rdmavt

Moving send completion code into rdmavt in order to have shared logic
between qib and hfi1 drivers.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/{hfi1, qib, rdmavt}: Move copy SGE logic into rdmavt
Brian Welty [Wed, 26 Sep 2018 17:44:33 +0000 (10:44 -0700)]
IB/{hfi1, qib, rdmavt}: Move copy SGE logic into rdmavt

This patch moves hfi1_copy_sge() into rdmavt for sharing with qib.
This patch also moves all the wss_*() functions into rdmavt as
several wss_*() functions are called from hfi1_copy_sge()

When SGE copy mode is adaptive, cacheless copy may be done in some cases
for performance reasons. In those cases, X86 cacheless copy function
is called since the drivers that use rdmavt and may set SGE copy mode
to adaptive are X86 only. For this reason, this patch adds
"depends on X86_64" to rdmavt/Kconfig.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx4: Avoid implicit enumerated type conversion
Nathan Chancellor [Mon, 24 Sep 2018 19:57:16 +0000 (12:57 -0700)]
IB/mlx4: Avoid implicit enumerated type conversion

Clang warns when one enumerated type is implicitly converted to another.

drivers/infiniband/hw/mlx4/mad.c:1811:41: warning: implicit conversion
from enumeration type 'enum mlx4_ib_qp_flags' to different enumeration
type 'enum ib_qp_create_flags' [-Wenum-conversion]
                qp_init_attr.init_attr.create_flags = MLX4_IB_SRIOV_TUNNEL_QP;
                                                    ~ ^~~~~~~~~~~~~~~~~~~~~~~

drivers/infiniband/hw/mlx4/mad.c:1819:41: warning: implicit conversion
from enumeration type 'enum mlx4_ib_qp_flags' to different enumeration
type 'enum ib_qp_create_flags' [-Wenum-conversion]
                qp_init_attr.init_attr.create_flags = MLX4_IB_SRIOV_SQP;
                                                    ~ ^~~~~~~~~~~~~~~~~

The type mlx4_ib_qp_flags explicitly provides supplemental values to the
type ib_qp_create_flags. Make that clear to Clang by changing the
create_flags type to u32.
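
A minimal sketch of the idea, assuming two made-up enums (core_create_flags
and drv_create_flags are hypothetical, not the real mlx4 or ib_verbs
definitions): once the field is a plain 32-bit integer, values from either
enum can be stored without an enum-to-enum conversion.

```c
#include <stdint.h>

/* Hypothetical core flag set. */
enum core_create_flags {
	CORE_FLAG_A = 1 << 0,
	CORE_FLAG_B = 1 << 1,
};

/* Hypothetical driver flags that supplement the core ones. */
enum drv_create_flags {
	DRV_FLAG_TUNNEL = 1 << 29,
	DRV_FLAG_SQP    = 1 << 30,
};

struct init_attr {
	/* A plain integer field accepts values from both enums. */
	uint32_t create_flags;
};

static uint32_t build_flags(void)
{
	struct init_attr attr = { 0 };

	attr.create_flags = CORE_FLAG_A;	/* core-provided value */
	attr.create_flags |= DRV_FLAG_TUNNEL;	/* driver-provided value */
	return attr.create_flags;
}
```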

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Bugfix for atomic operation
Lijun Ou [Sun, 30 Sep 2018 09:00:38 +0000 (17:00 +0800)]
RDMA/hns: Bugfix for atomic operation

Atomic operations do not support inline data. Besides, a standard atomic
operation supports only one sge, and that sge is placed in the wqe.

Fix: 384f881("RDMA/hns: Add atomic support")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Add vlan enable bit for hip08
Lijun Ou [Sun, 30 Sep 2018 09:00:37 +0000 (17:00 +0800)]
RDMA/hns: Add vlan enable bit for hip08

In order to extend the vlan device range, the design adds two fields to the
qp context for checking vlan packets on the sender and on the receiver.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Support local invalidate for hip08 in kernel space
Lijun Ou [Sun, 30 Sep 2018 09:00:36 +0000 (17:00 +0800)]
RDMA/hns: Support local invalidate for hip08 in kernel space

This patch adds local invalidate Memory Region (MR) support in the kernel
space driver.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Update some fields of qp context
Lijun Ou [Sun, 30 Sep 2018 09:00:35 +0000 (17:00 +0800)]
RDMA/hns: Update some fields of qp context

The hip08 hardware has two versions. The version ids are 0x20 and 0x21
according to the pci revision. Some fields need to be adjusted to
support new features. The specific updates include:

1. Add some fields for supporting new features by enabling some reserved
   fields in the 0x20 version.
2. Remove some fields that are not visible to the user in order to support
   the extended features.
3. Init some fields with zero.

These updates are compatible with the 0x20 version.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Limit extend sq sge num
Lijun Ou [Sun, 30 Sep 2018 09:00:34 +0000 (17:00 +0800)]
RDMA/hns: Limit extend sq sge num

According to the hip08 limit, the buffer size of the extend sge needs to be
an integer multiple of the wqe_sge_buf_page size. For example, the value of
the sge_shift field of the qp context is greater than or equal to eight when
the buffer page size is 4K. The value of the sge_shift field of the qp
context is assigned according to hr_qp->sge.sge_cnt.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Update some attributes of the RoCE device
Lijun Ou [Sun, 30 Sep 2018 09:00:33 +0000 (17:00 +0800)]
RDMA/hns: Update some attributes of the RoCE device

According to the IB protocol definition, the driver needs to show the
correct device information, and this information will be queried through
the device attributes.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Configure ecn field of ip header
Lijun Ou [Sun, 30 Sep 2018 09:00:32 +0000 (17:00 +0800)]
RDMA/hns: Configure ecn field of ip header

In order to be compatible with third-party RoCE devices, the hardware
changes how the ecn field of the ip header is set in the new hip08
version. The high 6 bits of tclass are assigned to the dscp field of the
packet.
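
For illustration only, the split of an 8-bit traffic class can be sketched
as below. The helper names are hypothetical. Treating the low two bits as
ECN follows the standard IP header layout and is an assumption about this
hardware, since the text above only states where the high six bits go.

```c
#include <stdint.h>

/* High six bits of the traffic class carry the DSCP value. */
static inline uint8_t tclass_to_dscp(uint8_t tclass)
{
	return tclass >> 2;
}

/* Low two bits are the ECN field in the standard IP header layout. */
static inline uint8_t tclass_to_ecn(uint8_t tclass)
{
	return tclass & 0x3;
}
```

For example, a tclass of 0xB9 (binary 1011 1001) splits into dscp 46 and
ecn 1.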

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Limit the size of extend sge of sq
Lijun Ou [Sun, 30 Sep 2018 09:00:31 +0000 (17:00 +0800)]
RDMA/hns: Limit the size of extend sge of sq

The hip08 comes in two hardware versions. The version ids are 0x20 and 0x21
according to the PCI revision. The max size of the extend sge of the sq is
limited to 2M for the 0x20 version and 8M for the 0x21 version. The product
of the wqe count and the extend sge number of every wqe, as computed by the
algorithm, may exceed 2M, but it is always less than 8M.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Bugfix for CM test
Lijun Ou [Sun, 30 Sep 2018 09:00:30 +0000 (17:00 +0800)]
RDMA/hns: Bugfix for CM test

A warning is printed when the MSB of SLID is not zero while the
cm_req_handler function runs during a CM test. The bit needs to be fixed to
zero when testing a RoCE device.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Submit bad wr when post send wr exception
Lijun Ou [Sun, 30 Sep 2018 09:00:29 +0000 (17:00 +0800)]
RDMA/hns: Submit bad wr when post send wr exception

When the user issues an RDMA read and enables sq inline, a bad wr needs to
be reported to the user.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Bugfix for reserved qp number
Lijun Ou [Sun, 30 Sep 2018 09:00:28 +0000 (17:00 +0800)]
RDMA/hns: Bugfix for reserved qp number

The reserved qp count needs to include two special qps for every port. The
hip08 has four ports, so eight qp numbers are reserved in total.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/netlink: Simplify netlink listener existence check
Leon Romanovsky [Tue, 2 Oct 2018 08:49:24 +0000 (11:49 +0300)]
RDMA/netlink: Simplify netlink listener existence check

All users of rdma_nl_chk_listeners() only want a boolean answer to whether
the netlink socket has listeners, so convert it to a boolean function and
update all callers.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA: Remove unused parameter from ib_modify_qp_is_ok()
Kamal Heib [Tue, 2 Oct 2018 13:11:21 +0000 (16:11 +0300)]
RDMA: Remove unused parameter from ib_modify_qp_is_ok()

The ll parameter is not used in ib_modify_qp_is_ok(), so remove it.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rxe: Remove unused addr_same()
Kamal Heib [Tue, 2 Oct 2018 08:03:10 +0000 (11:03 +0300)]
RDMA/rxe: Remove unused addr_same()

This function is not in use - delete it.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rxe: avoid srq memory leak
Zhu Yanjun [Sun, 30 Sep 2018 06:27:16 +0000 (02:27 -0400)]
IB/rxe: avoid srq memory leak

In rxe_queue_init, q and q->buf are allocated. In do_mmap_info, q->ip is
allocated. When an error occurs, rxe_srq_from_init and the later error
handler do not free this allocated memory, which causes a memory leak.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mthca: Fix error return code in __mthca_init_one()
Wei Yongjun [Sat, 29 Sep 2018 03:55:16 +0000 (03:55 +0000)]
IB/mthca: Fix error return code in __mthca_init_one()

Fix to return a negative error code from the mthca_cmd_init() error
handling case instead of 0, as done elsewhere in this function.
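
The bug class can be sketched in plain C as follows. This is an
illustration under assumptions rather than the mthca code: fake_cmd_init()
and both init_one_*() functions are hypothetical.

```c
#include <errno.h>

/* Hypothetical init step: 0 on success, negative errno on failure. */
static int fake_cmd_init(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

/* Buggy shape: the failure is detected but err is never assigned. */
static int init_one_buggy(int should_fail)
{
	int err = 0;

	if (fake_cmd_init(should_fail))
		goto out;	/* err is still 0 here */
	/* further setup would go here */
out:
	return err;
}

/* Fixed shape: keep the callee's error code and propagate it. */
static int init_one_fixed(int should_fail)
{
	int err;

	err = fake_cmd_init(should_fail);
	if (err)
		goto out;
	/* further setup would go here */
out:
	return err;
}
```

A caller of the buggy variant sees 0 and treats a failed initialization as
success.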

Fixes: 80fd8238734c ("[PATCH] IB/mthca: Encapsulate command interface init")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/uverbs: Fix RCU annotation for radix slot deference
Jason Gunthorpe [Fri, 28 Sep 2018 22:28:02 +0000 (16:28 -0600)]
RDMA/uverbs: Fix RCU annotation for radix slot deference

The uapi radix tree is a write-once data structure protected by kref.
Once we get to the ioctl() fop it is not possible for anything else
to be writing to it, so the access should use rcu_dereference_protected.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA: Fix building with CONFIG_MMU=n
Jason Gunthorpe [Fri, 28 Sep 2018 21:20:23 +0000 (15:20 -0600)]
RDMA: Fix building with CONFIG_MMU=n

The zap_vma_ptes() is declared but not defined on NOMMU kernels, causing a
link error for the newly added uverbs code:

drivers/infiniband/core/uverbs_main.o: In function `uverbs_user_mmap_disassociate':
uverbs_main.c:(.text+0x114c): undefined reference to `zap_vma_ptes'
drivers/infiniband/core/uverbs_main.o: In function `rdma_umap_open':
uverbs_main.c:(.text+0x53c): undefined reference to `zap_vma_ptes'

Since all user access for all of our drivers depends on remapping pages to
user space, disable USER_ACCESS when there is no mmu.

Fixes: 5f9794dc94f5 ("RDMA/ucontext: Add a core API for mmaping driver IO memory")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cma: Introduce and use cma_ib_acquire_dev()
Parav Pandit [Sat, 15 Sep 2018 09:07:57 +0000 (12:07 +0300)]
RDMA/cma: Introduce and use cma_ib_acquire_dev()

When RDMA CM connect request arrives for IB transport, it already contains
device, port, netdevice (optional).

Instead of traversing all the cma devices, use the cma device already
found by the cma_find_listener() for which a listener id is provided.

iWarp devices don't need to derive RoCE GIDs, therefore drop RoCE
specific checks from cma_acquire_dev() and rename it to
cma_iw_acquire_dev().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cma: Introduce and use cma_acquire_dev_by_src_ip()
Parav Pandit [Sat, 15 Sep 2018 09:07:56 +0000 (12:07 +0300)]
RDMA/cma: Introduce and use cma_acquire_dev_by_src_ip()

Add a lightweight version of cma_acquire_dev() just for binding with an rdma
device based on the source IP(v4/v6) address.

This simplifies cma_acquire_dev() by avoiding listen_id specific checks and
also prepares for subsequent simplification of IB vs iWarp handling.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/cma: Allow accepting requests for multi port rdma device
Parav Pandit [Sat, 15 Sep 2018 09:07:55 +0000 (12:07 +0300)]
RDMA/cma: Allow accepting requests for multi port rdma device

When IP failover is used between multiple ports of a given rdma device,
allow accepting CM requests from either of the ports.  This is applicable
for IPv4 and IPv6 non link local addressing scheme.

IPv6 link local addresses are bound. IP failover requests for listen
cm_ids bound to specific netdev interfaces cannot be supported.
(Similar to traditional sockets).

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Use VL15 for SM packets
Kaike Wan [Wed, 26 Sep 2018 17:56:12 +0000 (10:56 -0700)]
IB/hfi1: Use VL15 for SM packets

Subnet Management Packets (SMP) should exclusively use VL15 and their SL
is ignored (IBTA v1.3, Section 3.5.8.2). Therefore, when an SMP is posted,
the SL in the address handle can be set to 0 by a user
application. Consequently, when an address handle is created by the IB
core, some fields in struct rvt_ah may not be set correctly by using the
SL2SC and SC2VL tables at the time. Subsequently, when the request is post
sent, the incoming swqe may fail the validation check, resulting in the
rejection of the send request.

This patch fixes the problem by using VL15 for any validation, ignoring
the SL in the address handle.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Add mtu check for operational data VLs
Alex Estrin [Wed, 26 Sep 2018 17:56:03 +0000 (10:56 -0700)]
IB/hfi1: Add mtu check for operational data VLs

Since Virtual Lanes BCT credits and MTU are set through separate MADs, we
have to ensure both are valid, and data VLs are ready for transmission
before we allow port transition to Armed state.

Fixes: 5e2d6764a729 ("IB/hfi1: Verify port data VLs credits on transition to Armed")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Ensure ucast_dlid access doesnt exceed bounds
Dennis Dalessandro [Wed, 26 Sep 2018 17:55:53 +0000 (10:55 -0700)]
IB/hfi1: Ensure ucast_dlid access doesnt exceed bounds

The dlid assignment made by looking into the u_ucast_dlid array does not
do an explicit check for the size of the array. The code path to arrive at
def_port, the index value, is long and complicated, so it's best to just
have an explicit check here.
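
A minimal sketch of such an explicit check, assuming a made-up four-entry
table (NUM_PORTS, the table contents and get_dlid() are hypothetical and
not taken from hfi1):

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_PORTS 4	/* hypothetical array size */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const uint32_t ucast_dlid[NUM_PORTS] = { 0x10, 0x11, 0x12, 0x13 };

/*
 * Stores the dlid for def_port and returns 0, or returns -1 without
 * touching *dlid when def_port is outside the array.
 */
static int get_dlid(unsigned int def_port, uint32_t *dlid)
{
	if (def_port >= ARRAY_SIZE(ucast_dlid))
		return -1;

	*dlid = ucast_dlid[def_port];
	return 0;
}
```

Checking at the point of use keeps the access safe no matter how the index
was computed upstream.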

Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Add static trace for iowait
Kaike Wan [Wed, 26 Sep 2018 17:27:03 +0000 (10:27 -0700)]
IB/hfi1: Add static trace for iowait

This patch adds the static trace for resource wait.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Prepare resource waits for dual leg
Dennis Dalessandro [Fri, 28 Sep 2018 14:17:09 +0000 (07:17 -0700)]
IB/hfi1: Prepare resource waits for dual leg

Current implementation allows each qp to have only one send engine.  As
such, each qp has only one list to queue prebuilt packets when send engine
resources are not available. To improve performance, it is desired to
support multiple send engines for each qp.

This patch creates the framework to support two send engines
(two legs) for each qp for the TID RDMA protocol, which can be easily
extended to support more send engines. It achieves the goal by creating a
leg specific struct, iowait_work in the iowait struct, to hold the
work_struct and the tx_list as well as a pointer to the parent iowait
struct.
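
The layout described above can be pictured roughly as follows. All *_stub
types and the leg names are hypothetical simplifications for illustration;
the real definitions live in the hfi1 driver and differ in detail.

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel's list_head and work_struct. */
struct list_head_stub { struct list_head_stub *next, *prev; };
struct work_stub { void (*func)(struct work_stub *work); };

struct iowait_stub;

/* Per-leg state: work item, tx list, and a pointer back to the parent. */
struct iowait_work_stub {
	struct work_stub work;
	struct list_head_stub tx_list;
	struct iowait_stub *iow;
};

enum { LEG_IB = 0, LEG_TID = 1, NUM_LEGS = 2 };	/* hypothetical names */

/* The parent holds one leg-specific struct per send engine. */
struct iowait_stub {
	struct iowait_work_stub wait[NUM_LEGS];
};

static void iowait_init_stub(struct iowait_stub *w)
{
	for (int i = 0; i < NUM_LEGS; i++) {
		w->wait[i].work.func = NULL;
		/* An empty circular list points at itself. */
		w->wait[i].tx_list.next = &w->wait[i].tx_list;
		w->wait[i].tx_list.prev = &w->wait[i].tx_list;
		w->wait[i].iow = w;
	}
}

/* A waiter that is handed only a leg can still reach the parent. */
static struct iowait_stub *leg_to_parent(struct iowait_work_stub *leg)
{
	return leg->iow;
}
```

Supporting more send engines would then amount to growing the array.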

The hfi1_pkt_state now has an additional field to record the current leg's
work structure and that is now passed to all egress waiters to determine
the leg that needs to wait via a new iowait helper.  The APIs are adjusted
to use the new leg specific struct as required.

Many new and modified helpers are added to support this change.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rdmavt: Rename check_send_wqe as setup_wqe
Kaike Wan [Wed, 26 Sep 2018 17:26:44 +0000 (10:26 -0700)]
IB/rdmavt: Rename check_send_wqe as setup_wqe

The driver-provided function check_send_wqe allows the hardware driver to
check and set up the incoming send wqe before it is inserted into the swqe
ring. This patch will rename it as setup_wqe to better reflect its
usage. In addition, this function is only called when all setup is
complete in rdmavt.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: remove set but not used variable 'dseg'
YueHaibing [Fri, 28 Sep 2018 10:59:53 +0000 (10:59 +0000)]
RDMA/hns: remove set but not used variable 'dseg'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function 'hns_roce_v2_post_send':
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:194:35: warning:
 variable 'dseg' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/qedr: Remove enumerated type qed_roce_ll2_tx_dest
Nathan Chancellor [Thu, 27 Sep 2018 20:55:58 +0000 (13:55 -0700)]
RDMA/qedr: Remove enumerated type qed_roce_ll2_tx_dest

Clang warns when one enumerated type is implicitly converted to another.

drivers/infiniband/hw/qedr/qedr_roce_cm.c:198:28: warning: implicit
conversion from enumeration type 'enum qed_roce_ll2_tx_dest' to
different enumeration type 'enum qed_ll2_tx_dest' [-Wenum-conversion]
        ll2_tx_pkt.tx_dest = pkt->tx_dest;
                           ~ ~~~~~^~~~~~~
1 warning generated.

Turns out that QED_ROCE_LL2_TX_DEST_NW and QED_ROCE_LL2_TX_DEST_LB are
only used once in the whole tree and QED_ROCE_LL2_TX_DEST_MAX is used
nowhere. Remove them and use the equivalent values from qed_ll2_tx_dest
in their place.

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Michal Kalderon <michal.kalderon@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Error path MAD response size is incorrect
Michael J. Ruhl [Fri, 28 Sep 2018 14:34:57 +0000 (07:34 -0700)]
IB/hfi1: Error path MAD response size is incorrect

If a MAD packet has incorrect header information, the logic uses the reply
path to report the error.  The reply path expects *resp_len to be set
prior to return.  Unfortunately, *resp_len is set to 0 for this path.
This causes an incorrect response packet.

Fix by ensuring that the *resp_len is defaulted to the incoming packet
size (wc->bytes_len - sizeof(GRH)).
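
As a sketch of the arithmetic only: the InfiniBand GRH is 40 bytes, so the
default reply length is the received byte count minus 40.
default_resp_len() is a hypothetical helper, and the guard against
underflow is an addition made for this example, not something the patch is
stated to do.

```c
#include <stdint.h>

#define GRH_SIZE 40u	/* size of the IB Global Route Header in bytes */

/* Default reply length: incoming packet size without the GRH. */
static uint32_t default_resp_len(uint32_t byte_len)
{
	/* Underflow guard added for this sketch only. */
	return byte_len > GRH_SIZE ? byte_len - GRH_SIZE : 0;
}
```

For a 296-byte receive this yields 256, the size of a standard MAD.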

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rxe: replace kvfree with vfree
Zhu Yanjun [Sun, 30 Sep 2018 05:57:42 +0000 (01:57 -0400)]
IB/rxe: replace kvfree with vfree

The buf is allocated by vmalloc_user in the function rxe_queue_init.
So it is better to free it by vfree.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/iser: Fix possible NULL deref at iser_inv_desc()
Israel Rukshin [Wed, 26 Sep 2018 09:44:18 +0000 (09:44 +0000)]
IB/iser: Fix possible NULL deref at iser_inv_desc()

If the target remote-invalidates a bogus rkey and signature is not used,
pi_ctx is NULL and gets dereferenced.

The commit also fails the connection on bogus remote invalidation.

Fixes: 59caaed7a72a ("IB/iser: Support the remote invalidation exception")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Enable DEVX on IB
Yishai Hadas [Thu, 20 Sep 2018 18:45:21 +0000 (21:45 +0300)]
IB/mlx5: Enable DEVX on IB

IB has additional protections with SELinux that cannot be extended to the
DEVX domain. SELinux can restrict access to pkeys. The first version of
DEVX blocked IB entirely until this could be understood.

Since DEVX requires CAP_NET_RAW, it supersedes the SELinux restriction and
allows userspace to form arbitrary packets with arbitrary pkeys.

Thus we enable IB for DEVX when CAP_NET_RAW is given.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Enable DEVX white list commands
Yishai Hadas [Thu, 20 Sep 2018 18:45:20 +0000 (21:45 +0300)]
IB/mlx5: Enable DEVX white list commands

Enable DEVX white list commands without the need for CAP_NET_RAW.

DEVX uid must exist from the ucontext or the device so that the firmware
will mask unprivileged capabilities.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Manage device uid for DEVX white list commands
Yishai Hadas [Thu, 20 Sep 2018 18:45:19 +0000 (21:45 +0300)]
IB/mlx5: Manage device uid for DEVX white list commands

Manage device uid for DEVX white list commands.  The created device uid
will be used on white list commands if the user didn't supply its own uid.

This will enable the firmware to filter out non-privileged functionality
based on its recognition of the uid.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Expose RAW QP device handles to user space
Yishai Hadas [Thu, 20 Sep 2018 18:45:18 +0000 (21:45 +0300)]
IB/mlx5: Expose RAW QP device handles to user space

Expose RAW QP device handles to user space by extending the UHW part of
mlx5_ib_create_qp_resp.

This data is returned only when DEVX context is used where it may be
applicable.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Acquire and release mmap_sem on page range
Parav Pandit [Tue, 25 Sep 2018 09:04:04 +0000 (12:04 +0300)]
RDMA/core: Acquire and release mmap_sem on page range

Currently mmap_sem is read locked while pinning the memory.  In a
multi-threaded process, holding the mmap_sem lock creates contention with
other threads that might be registering memory, creating QPs or simply
doing mmap(), as such operations also need to hold the mmap_sem write
lock.

None of these operations can make forward progress until the memory pin
operation is completed.  It becomes worse if the memory is unpinned
and/or the memory registration is large (in the GB range).

Therefore, instead of holding mmap_sem for too long (for whole region
pinning), acquire and release the lock for every few pages.  For example
on x86 with 4K page size, acquire and release mmap_sem for every 2Mbytes
memory chunk.

This allows other competing threads, which might wish to hold mmap_sem for
a shorter duration, to make progress.
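
The chunking can be sketched in userspace as below. With 4K pages a 2Mbyte
chunk is 512 pages, so the lock is taken once per 512 pages rather than
once for the whole region. fake_lock(), fake_unlock() and pin_region() are
hypothetical stand-ins that only count lock cycles; no real locking or
pinning is done.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL
#define CHUNK_BYTES     (2UL * 1024 * 1024)
#define PAGES_PER_CHUNK (CHUNK_BYTES / PAGE_SIZE_BYTES)	/* 512 */

struct pin_stats {
	unsigned long lock_cycles;	/* acquire/release pairs */
	unsigned long pages_pinned;
};

/* Stand-ins for taking and dropping the lock. */
static void fake_lock(struct pin_stats *s) { s->lock_cycles++; }
static void fake_unlock(struct pin_stats *s) { (void)s; }

/* Pin npages, holding the lock for at most one chunk at a time. */
static struct pin_stats pin_region(unsigned long npages)
{
	struct pin_stats s = { 0, 0 };

	while (npages) {
		unsigned long batch =
			npages < PAGES_PER_CHUNK ? npages : PAGES_PER_CHUNK;

		fake_lock(&s);
		s.pages_pinned += batch;	/* stand-in for pinning a batch */
		fake_unlock(&s);	/* other threads can run here */
		npages -= batch;
	}
	return s;
}
```

A 5Mbyte region (1280 pages) is handled in three lock cycles of 512, 512
and 256 pages.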

When memory registration latency is measured using [1] for memory sizes
ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many
runs no difference is seen other than run-to-run variance.

In other targeted tests of users with large memory, desired improvements
are seen due to reduced contention of mmap_sem.

[1] https://github.com/paravmellanox/rtool

$ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A

It registers pinned memory from 4K to 48GB size with 500 iterations for
each memory size.

$ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4

4 competing threads pin memory, each of 12GB size with 500 iterations.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: fix spelling mistake "reseved" -> "reserved"
Colin Ian King [Thu, 27 Sep 2018 13:24:30 +0000 (14:24 +0100)]
RDMA/hns: fix spelling mistake "reseved" -> "reserved"

Trivial fix to spelling mistake in dev_err error message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/sa: simplify return code logic for ib_nl_send_msg()
Alex Estrin [Wed, 26 Sep 2018 17:02:32 +0000 (10:02 -0700)]
IB/sa: simplify return code logic for ib_nl_send_msg()

rdma_nl_multicast() returns either negative error code
or zero if succeeded. Remove unnecessary ret code checks
and reassignments.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Move UnsupportedVL bits definitions to the correct header
Michael J. Ruhl [Wed, 26 Sep 2018 17:02:22 +0000 (10:02 -0700)]
IB/hfi1: Move UnsupportedVL bits definitions to the correct header

The UnsupportedVL SendCtrl register bit information is defined in
the module rather than the chip register header file.

Move the defines to the appropriate header file.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mthca: remove redundant inner check of mdev->mthca_flags
Colin Ian King [Wed, 26 Sep 2018 12:26:08 +0000 (13:26 +0100)]
IB/mthca: remove redundant inner check of mdev->mthca_flags

The inner check for mdev->mthca_flags & MTHCA_FLAG_MSI_X is redundant:
it is always true because an identical check in the enclosing if
statement has already passed.  Remove it.

Detected by cppcheck:
(warning) Identical inner 'if' condition is always true.
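
The pattern that cppcheck reports can be sketched as below. This is an
illustration only: struct demo_dev, DEMO_FLAG_MSI_X and both functions are
hypothetical and do not reproduce the mthca code.

```c
/* Hypothetical flag and device structure, for illustration only. */
#define DEMO_FLAG_MSI_X 0x1u

struct demo_dev {
	unsigned int flags;
};

/* Before: the inner condition repeats the outer one, so it is always
 * true whenever it is reached. */
int setup_before(const struct demo_dev *dev)
{
	int steps = 0;

	if (dev->flags & DEMO_FLAG_MSI_X) {
		steps++;
		if (dev->flags & DEMO_FLAG_MSI_X)	/* always true here */
			steps++;
	}
	return steps;
}

/* After: the inner check is removed; behaviour is unchanged. */
int setup_after(const struct demo_dev *dev)
{
	int steps = 0;

	if (dev->flags & DEMO_FLAG_MSI_X) {
		steps++;
		steps++;
	}
	return steps;
}
```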

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Add MW support for hip08
Yixian Liu [Sun, 23 Sep 2018 09:20:46 +0000 (17:20 +0800)]
RDMA/hns: Add MW support for hip08

This patch adds memory window (mw) support in the kernel space.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Add enable judgement for UD vlan
Lijun Ou [Sat, 22 Sep 2018 08:21:08 +0000 (16:21 +0800)]
RDMA/hns: Add enable judgement for UD vlan

Following a hardware modification, the ud_vlan_en field of the UD wqe
determines whether a vlan header is added to the UD packet. The driver
fills in the ud_vlan_en field according to the net device.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Add CM of vlan device support
Lijun Ou [Sat, 22 Sep 2018 08:21:07 +0000 (16:21 +0800)]
RDMA/hns: Add CM of vlan device support

This patch mainly sets the vlan_id field in the WC for rdma_listen() to
work over vlan. This is required by ib_init_ah_attr_from_wc() which is
called by the CM REQ handler.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Add atomic support
Lijun Ou [Sat, 22 Sep 2018 08:21:06 +0000 (16:21 +0800)]
RDMA/hns: Add atomic support

This patch adds atomic operations for hip08, including the fetchadd and
cmpswap operations.  In order to enable atomics, the driver needs to do
the following steps:

1. Enable the atomic caps for RoCE device
2. Post the wqe context of atomic type
3. Configure the atomic type of mtpt
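
For readers unfamiliar with the two operations, their semantics can be
sketched on host memory with C11 atomics. This is only an analogy: RDMA
atomics are issued as work requests against remote memory, and the names
demo_fetch_add() and demo_cmp_swap() are hypothetical and do not come from
the hns driver.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Fetch-and-add: add 'delta' to *target and return the previous value. */
uint64_t demo_fetch_add(_Atomic uint64_t *target, uint64_t delta)
{
	return atomic_fetch_add(target, delta);
}

/* Compare-and-swap: store 'swap' only if *target equals 'compare'.
 * The previous value of *target is returned in both cases. */
uint64_t demo_cmp_swap(_Atomic uint64_t *target, uint64_t compare,
		       uint64_t swap)
{
	uint64_t expected = compare;

	/* On failure the current value is written into 'expected';
	 * on success 'expected' already equals the old value. */
	atomic_compare_exchange_strong(target, &expected, swap);
	return expected;
}
```

In both cases the caller gets back the value that was in memory before the
operation, which lets it tell whether a compare-and-swap took effect.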

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Refactor the codes for setting transport opcode
Lijun Ou [Sat, 22 Sep 2018 08:21:05 +0000 (16:21 +0800)]
RDMA/hns: Refactor the codes for setting transport opcode

Currently the transport opcodes that come from user configuration are set
by several pieces of similar code. This patch simplifies it.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/ulp: Use dev_name instead of ibdev->name
Jason Gunthorpe [Thu, 20 Sep 2018 22:42:27 +0000 (16:42 -0600)]
RDMA/ulp: Use dev_name instead of ibdev->name

These return the same thing but dev_name is a more conventional use of the
kernel API.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
5 years agoRDMA/drivers: Use dev_name instead of ibdev->name
Jason Gunthorpe [Thu, 20 Sep 2018 22:42:26 +0000 (16:42 -0600)]
RDMA/drivers: Use dev_name instead of ibdev->name

These return the same thing but dev_name is a more conventional use of the
kernel API.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
5 years agoRDMA/core: Use dev_name instead of ibdev->name
Jason Gunthorpe [Thu, 20 Sep 2018 22:42:25 +0000 (16:42 -0600)]
RDMA/core: Use dev_name instead of ibdev->name

These return the same thing but dev_name is a more conventional use of the
kernel API.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
5 years agoRDMA/drivers: Use dev_err/dbg/etc instead of pr_* + ibdev->name
Jason Gunthorpe [Thu, 20 Sep 2018 22:42:24 +0000 (16:42 -0600)]
RDMA/drivers: Use dev_err/dbg/etc instead of pr_* + ibdev->name

Kernel convention is that a driver for a subsystem will print using
dev_* on the subsystem's struct device, or with dev_* on the physical
device. Drivers should rarely use a pr_* function.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>