linux-2.6-block.git
8 years ago  Merge branch 'for-4.13/block' into for-4.13/merge  for-4.13/merge
Jens Axboe [Fri, 30 Jun 2017 00:09:58 +0000 (18:09 -0600)]
Merge branch 'for-4.13/block' into for-4.13/merge

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years ago  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 29 Jun 2017 21:30:07 +0000 (14:30 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Need to access netdev->num_rx_queues behind an accessor in netvsc
    driver otherwise the build breaks with some configs, from Arnd
    Bergmann.

 2) Add dummy xfrm_dev_event() so that build doesn't fail when
    CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu.

 3) Don't OOPS when pfkey_msg2xfrm_state() signals an error, from Dan
    Carpenter.

 4) Fix MCDI command size for filter operations in sfc driver, from
    Martin Habets.

 5) Fix UFO segmenting so that we don't calculate incorrect checksums,
    from Michal Kubecek.

 6) When ipv6 datagram connects fail, reset destination address and
    port. From Wei Wang.

 7) TCP disconnect must reset the cached receive DST, from WANG Cong.

 8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric
    Dumazet.

 9) fman driver has to depend on HAS_DMA, from Madalin Bucur.

10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann.

11) Fix negative page counts with GRO, from Michal Kubecek.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  sfc: fix attempt to translate invalid filter ID
  net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
  bpf: prevent leaking pointer via xadd on unpriviledged
  arcnet: com20020-pci: add missing pdev setup in netdev structure
  arcnet: com20020-pci: fix dev_id calculation
  arcnet: com20020: remove needless base_addr assignment
  Trivial fix to spelling mistake in arc_printk message
  arcnet: change irq handler to lock irqsave
  rocker: move dereference before free
  mlxsw: spectrum_router: Fix NULL pointer dereference
  net: sched: Fix one possible panic when no destroy callback
  virtio-net: serialize tx routine during reset
  net: usb: asix88179_178a: Add support for the Belkin B2B128
  fsl/fman: add dependency on HAS_DMA
  net: prevent sign extension in dev_get_stats()
  tcp: reset sk_rx_dst in tcp_disconnect()
  net: ipv6: reset daddr and dport in sk if connect() fails
  bnx2x: Don't log mc removal needlessly
  bnxt_en: Fix netpoll handling.
  bnxt_en: Add missing logic to handle TPA end error conditions.
  ...

8 years ago  Merge tag 'for-4.12/dm-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 29 Jun 2017 21:23:02 +0000 (14:23 -0700)]
Merge tag 'for-4.12/dm-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - dm thinp fix for crash that will occur when metadata device failure
   races with discard passdown to the underlying data device.

 - dm raid fix to not access the superblock's >= 1.9.0 'sectors' member
   unconditionally.

* tag 'for-4.12/dm-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: do not queue freed thin mapping for next stage processing
  dm raid: fix oops on upgrading to extended superblock format

8 years ago  Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Linus Torvalds [Thu, 29 Jun 2017 21:10:37 +0000 (14:10 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Two fixes that should go into this release.

  One is an nvme regression fix from Keith, fixing a missing queue
  freeze if the controller is being reset. This causes the reset to
  hang.

  The other is a fix for a leak of the bio protection info, if smaller
  sized O_DIRECT is used. This fix should be more involved as we have
  other problematic paths in the kernel, but given as this isn't a
  regression in this series, we'll tackle those for 4.13"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: provide bio_uninit() free freeing integrity/task associations
  nvme/pci: Fix stuck nvme reset

8 years ago  sfc: fix attempt to translate invalid filter ID
Edward Cree [Thu, 29 Jun 2017 15:50:06 +0000 (16:50 +0100)]
sfc: fix attempt to translate invalid filter ID

When filter insertion fails with no rollback, we were trying to convert
 EFX_EF10_FILTER_ID_INVALID to an id to store in 'ids' (which is either
 vlan->uc or vlan->mc).  This would WARN_ON_ONCE and then record a bogus
 filter ID of 0x1fff, neither of which is a good thing.

Fixes: 0ccb998bf46d ("sfc: fix filter_id misinterpretation in edge case")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
Michal Kubeček [Thu, 29 Jun 2017 09:13:36 +0000 (11:13 +0200)]
net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()

Recently I started seeing warnings about pages with refcount -1. The
problem was traced to packets being reused after their head was merged into
a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
commit c21b48cc1bbf ("net: adjust skb->truesize in ___pskb_trim()") and
I have never seen it on a kernel with it reverted, I believe the real
problem appeared earlier when the option to merge head frag in GRO was
implemented.

Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
and head is merged (which in my case happens after the skb_condense()
call added by the commit mentioned above), the skb is reused including the
head that has been merged. As a result, we release the page reference
twice and eventually end up with negative page refcount.

To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
the same way it's done in napi_skb_finish().

Fixes: d7e8883cfcf4 ("net: make GRO aware of skb->head_frag")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  bpf: prevent leaking pointer via xadd on unpriviledged
Daniel Borkmann [Thu, 29 Jun 2017 01:04:59 +0000 (03:04 +0200)]
bpf: prevent leaking pointer via xadd on unpriviledged

Leaking kernel addresses to unprivileged users is generally disallowed,
for example, verifier rejects the following:

  0: (b7) r0 = 0
  1: (18) r2 = 0xffff897e82304400
  3: (7b) *(u64 *)(r1 +48) = r2
  R2 leaks addr into ctx

Doing pointer arithmetic on them is also forbidden, so that they
don't turn into an unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:

  0: (b7) r0 = 0
  1: (7b) *(u64 *)(r1 +48) = r0
  2: (18) r2 = 0xffff897e82304400 ; map
  4: (db) lock *(u64 *)(r1 +48) += r2
  5: (95) exit

We could store the pointer into skb->cb, lose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:

   0: (bf) r6 = r1
   1: (7a) *(u64 *)(r10 -8) = 0
   2: (bf) r2 = r10
   3: (07) r2 += -8
   4: (18) r1 = 0x0
   6: (85) call bpf_map_lookup_elem#1
   7: (15) if r0 == 0x0 goto pc+3
   R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
   8: (b7) r3 = 0
   9: (7b) *(u64 *)(r0 +0) = r3
  10: (db) lock *(u64 *)(r0 +0) += r6
  11: (b7) r0 = 0
  12: (95) exit

  from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
  11: (b7) r0 = 0
  12: (95) exit

Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.
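
As a rough illustration only (not the verifier code itself), the added
check can be sketched as below. The register types, struct reg_state
and check_xadd() are simplified, hypothetical stand-ins for the real
verifier state:

```c
#include <stdbool.h>

/* Hypothetical, simplified register types; the real verifier tracks more. */
enum reg_type { SCALAR_VALUE, PTR_TO_CTX, PTR_TO_MAP_VALUE, PTR_TO_STACK };

struct reg_state {
	enum reg_type type;
};

/* In this toy model, anything that is not a plain scalar is a pointer. */
static bool is_pointer(const struct reg_state *reg)
{
	return reg->type != SCALAR_VALUE;
}

/*
 * Toy xadd check: returns 0 if the instruction is acceptable and -1 if
 * an unprivileged program tries to add a pointer register into memory,
 * from where the address could later be read back.
 */
static int check_xadd(const struct reg_state *regs, int src_reg,
		      bool privileged)
{
	if (!privileged && is_pointer(&regs[src_reg]))
		return -1;
	return 0;
}
```

With R6=ctx as in the second listing above, check_xadd(regs, 6, false)
rejects the program in this model, while a scalar source register is
still accepted.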

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  Merge branch 'arcnet-fixes'
David S. Miller [Thu, 29 Jun 2017 19:18:38 +0000 (15:18 -0400)]
Merge branch 'arcnet-fixes'

Michael Grzeschik says:

====================
arcnet: Collection of latest fixes

Here we sum up the recent fixes I collected while using and
stabilising the framework. One of them prevents a possible deadlock,
another fixes the calculation of the dev_id that can be set up by a
rotary encoder. Besides that, we add a trivial spelling patch and fix
some wrong and missing assignments, which improves the code footprint.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  arcnet: com20020-pci: add missing pdev setup in netdev structure
Michael Grzeschik [Wed, 28 Jun 2017 16:28:37 +0000 (18:28 +0200)]
arcnet: com20020-pci: add missing pdev setup in netdev structure

We add the pdev data to the PCI device's netdev structure. This way
the interface gets consistent device names in userspace (udev).

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  arcnet: com20020-pci: fix dev_id calculation
Michael Grzeschik [Wed, 28 Jun 2017 16:28:36 +0000 (18:28 +0200)]
arcnet: com20020-pci: fix dev_id calculation

The dev_id was miscalculated. Only the two bits 4-5 are relevant for the
MA1 card. PCIARC1 and PCIFB2 use the four bits 4-7 for id selection.
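
The bit selection described above can be sketched roughly as follows.
This is only an illustration: enum card_kind and dev_id_from_rotary()
are hypothetical names, and the raw rotary value is assumed to be an
8-bit register read:

```c
#include <stdint.h>

/* Hypothetical card identifiers, for illustration only. */
enum card_kind { CARD_MA1, CARD_PCIARC1, CARD_PCIFB2 };

/*
 * Extract the device id from the raw rotary encoder value.
 * MA1 uses only bits 4-5 (2-bit mask), while PCIARC1 and PCIFB2
 * use bits 4-7 (4-bit mask).
 */
static uint8_t dev_id_from_rotary(uint8_t rotary, enum card_kind kind)
{
	uint8_t mask = (kind == CARD_MA1) ? 0x3 : 0xf;

	return (rotary >> 4) & mask;
}
```

For a raw value of 0xF0 this gives 3 on an MA1 card and 15 on the
other two cards.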

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  arcnet: com20020: remove needless base_addr assignment
Michael Grzeschik [Wed, 28 Jun 2017 16:28:35 +0000 (18:28 +0200)]
arcnet: com20020: remove needless base_addr assignment

The assignment is superfluous.

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  Trivial fix to spelling mistake in arc_printk message
Colin Ian King [Wed, 28 Jun 2017 16:28:34 +0000 (18:28 +0200)]
Trivial fix to spelling mistake in arc_printk message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  arcnet: change irq handler to lock irqsave
Michael Grzeschik [Wed, 28 Jun 2017 16:28:33 +0000 (18:28 +0200)]
arcnet: change irq handler to lock irqsave

This patch prevents the following deadlock in the arcnet driver.

[   41.273910] ======================================================
[   41.280397] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[   41.287433] 4.4.0-00034-gc0ae784 #536 Not tainted
[   41.292366] ------------------------------------------------------
[   41.298863] arcecho/233 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
[   41.305628]  (&(&lp->lock)->rlock){+.+...}, at: [<bf083bc8>] arcnet_send_packet+0x60/0x1c0 [arcnet]
[   41.315199]
[   41.315199] and this task is already holding:
[   41.321324]  (_xmit_ARCNET#2){+.-...}, at: [<c06b934c>] packet_direct_xmit+0xfc/0x1c8
[   41.329593] which would create a new lock dependency:
[   41.334893]  (_xmit_ARCNET#2){+.-...} -> (&(&lp->lock)->rlock){+.+...}
[   41.341801]
[   41.341801] but this new dependency connects a SOFTIRQ-irq-safe lock:
[   41.350108]  (_xmit_ARCNET#2){+.-...}
... which became SOFTIRQ-irq-safe at:
[   41.357539]   [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.362677]   [<c063ab8c>] dev_watchdog+0x5c/0x264
[   41.367723]   [<c0094edc>] call_timer_fn+0x6c/0xf4
[   41.372759]   [<c00950b8>] run_timer_softirq+0x154/0x210
[   41.378340]   [<c0036b30>] __do_softirq+0x144/0x298
[   41.383469]   [<c0036fb4>] irq_exit+0xcc/0x130
[   41.388138]   [<c0085c50>] __handle_domain_irq+0x60/0xb4
[   41.393728]   [<c0014578>] __irq_svc+0x58/0x78
[   41.398402]   [<c0010274>] arch_cpu_idle+0x24/0x3c
[   41.403443]   [<c007127c>] cpu_startup_entry+0x1f8/0x25c
[   41.409029]   [<c09adc90>] start_kernel+0x3c0/0x3cc
[   41.414170]
[   41.414170] to a SOFTIRQ-irq-unsafe lock:
[   41.419931]  (&(&lp->lock)->rlock){+.+...}
... which became SOFTIRQ-irq-unsafe at:
[   41.427996] ...  [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.433409]   [<bf083d54>] arcnet_interrupt+0x2c/0x800 [arcnet]
[   41.439646]   [<c0089120>] handle_nested_irq+0x8c/0xec
[   41.445063]   [<c03c1170>] regmap_irq_thread+0x190/0x314
[   41.450661]   [<c0087244>] irq_thread_fn+0x1c/0x34
[   41.455700]   [<c0087548>] irq_thread+0x13c/0x1dc
[   41.460649]   [<c0050f10>] kthread+0xe4/0xf8
[   41.465158]   [<c000f810>] ret_from_fork+0x14/0x24
[   41.470207]
[   41.470207] other info that might help us debug this:
[   41.470207]
[   41.478627]  Possible interrupt unsafe locking scenario:
[   41.478627]
[   41.485763]        CPU0                    CPU1
[   41.490521]        ----                    ----
[   41.495279]   lock(&(&lp->lock)->rlock);
[   41.499414]                                local_irq_disable();
[   41.505636]                                lock(_xmit_ARCNET#2);
[   41.511967]                                lock(&(&lp->lock)->rlock);
[   41.518741]   <Interrupt>
[   41.521490]     lock(_xmit_ARCNET#2);
[   41.525356]
[   41.525356]  *** DEADLOCK ***
[   41.525356]
[   41.531587] 1 lock held by arcecho/233:
[   41.535617]  #0:  (_xmit_ARCNET#2){+.-...}, at: [<c06b934c>] packet_direct_xmit+0xfc/0x1c8
[   41.544355]
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[   41.552362] -> (_xmit_ARCNET#2){+.-...} ops: 27 {
[   41.557357]    HARDIRQ-ON-W at:
[   41.560664]                     [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.567445]                     [<c063ba28>] dev_deactivate_many+0x114/0x304
[   41.574866]                     [<c063bc3c>] dev_deactivate+0x24/0x38
[   41.581646]                     [<c0630374>] linkwatch_do_dev+0x40/0x74
[   41.588613]                     [<c06305d8>] __linkwatch_run_queue+0xec/0x140
[   41.596120]                     [<c0630658>] linkwatch_event+0x2c/0x34
[   41.602991]                     [<c004af30>] process_one_work+0x188/0x40c
[   41.610131]                     [<c004b200>] worker_thread+0x4c/0x480
[   41.616912]                     [<c0050f10>] kthread+0xe4/0xf8
[   41.623048]                     [<c000f810>] ret_from_fork+0x14/0x24
[   41.629735]    IN-SOFTIRQ-W at:
[   41.633039]                     [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.639820]                     [<c063ab8c>] dev_watchdog+0x5c/0x264
[   41.646508]                     [<c0094edc>] call_timer_fn+0x6c/0xf4
[   41.653190]                     [<c00950b8>] run_timer_softirq+0x154/0x210
[   41.660425]                     [<c0036b30>] __do_softirq+0x144/0x298
[   41.667201]                     [<c0036fb4>] irq_exit+0xcc/0x130
[   41.673518]                     [<c0085c50>] __handle_domain_irq+0x60/0xb4
[   41.680754]                     [<c0014578>] __irq_svc+0x58/0x78
[   41.687077]                     [<c0010274>] arch_cpu_idle+0x24/0x3c
[   41.693769]                     [<c007127c>] cpu_startup_entry+0x1f8/0x25c
[   41.701006]                     [<c09adc90>] start_kernel+0x3c0/0x3cc
[   41.707791]    INITIAL USE at:
[   41.711003]                    [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.717696]                    [<c063ba28>] dev_deactivate_many+0x114/0x304
[   41.725026]                    [<c063bc3c>] dev_deactivate+0x24/0x38
[   41.731718]                    [<c0630374>] linkwatch_do_dev+0x40/0x74
[   41.738593]                    [<c06305d8>] __linkwatch_run_queue+0xec/0x140
[   41.746011]                    [<c0630658>] linkwatch_event+0x2c/0x34
[   41.752789]                    [<c004af30>] process_one_work+0x188/0x40c
[   41.759847]                    [<c004b200>] worker_thread+0x4c/0x480
[   41.766541]                    [<c0050f10>] kthread+0xe4/0xf8
[   41.772596]                    [<c000f810>] ret_from_fork+0x14/0x24
[   41.779198]  }
[   41.780945]  ... key      at: [<c124d620>] netdev_xmit_lock_key+0x38/0x1c8
[   41.788192]  ... acquired at:
[   41.791309]    [<c007bed8>] lock_acquire+0x70/0x90
[   41.796361]    [<c06f9140>] _raw_spin_lock_irqsave+0x40/0x54
[   41.802324]    [<bf083bc8>] arcnet_send_packet+0x60/0x1c0 [arcnet]
[   41.808844]    [<c06b9380>] packet_direct_xmit+0x130/0x1c8
[   41.814622]    [<c06bc7e4>] packet_sendmsg+0x3b8/0x680
[   41.820034]    [<c05fe8b0>] sock_sendmsg+0x14/0x24
[   41.825091]    [<c05ffd68>] SyS_sendto+0xb8/0xe0
[   41.829956]    [<c05ffda8>] SyS_send+0x18/0x20
[   41.834638]    [<c000f780>] ret_fast_syscall+0x0/0x1c
[   41.839954]
[   41.841514]
the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[   41.850302] -> (&(&lp->lock)->rlock){+.+...} ops: 5 {
[   41.855644]    HARDIRQ-ON-W at:
[   41.858945]                     [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.865726]                     [<bf083d54>] arcnet_interrupt+0x2c/0x800 [arcnet]
[   41.873607]                     [<c0089120>] handle_nested_irq+0x8c/0xec
[   41.880666]                     [<c03c1170>] regmap_irq_thread+0x190/0x314
[   41.887901]                     [<c0087244>] irq_thread_fn+0x1c/0x34
[   41.894593]                     [<c0087548>] irq_thread+0x13c/0x1dc
[   41.901195]                     [<c0050f10>] kthread+0xe4/0xf8
[   41.907338]                     [<c000f810>] ret_from_fork+0x14/0x24
[   41.914025]    SOFTIRQ-ON-W at:
[   41.917328]                     [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.924106]                     [<bf083d54>] arcnet_interrupt+0x2c/0x800 [arcnet]
[   41.931981]                     [<c0089120>] handle_nested_irq+0x8c/0xec
[   41.939028]                     [<c03c1170>] regmap_irq_thread+0x190/0x314
[   41.946264]                     [<c0087244>] irq_thread_fn+0x1c/0x34
[   41.952954]                     [<c0087548>] irq_thread+0x13c/0x1dc
[   41.959548]                     [<c0050f10>] kthread+0xe4/0xf8
[   41.965689]                     [<c000f810>] ret_from_fork+0x14/0x24
[   41.972379]    INITIAL USE at:
[   41.975595]                    [<c06f8fc8>] _raw_spin_lock+0x30/0x40
[   41.982283]                    [<bf083d54>] arcnet_interrupt+0x2c/0x800 [arcnet]
[   41.990063]                    [<c0089120>] handle_nested_irq+0x8c/0xec
[   41.997027]                    [<c03c1170>] regmap_irq_thread+0x190/0x314
[   42.004172]                    [<c0087244>] irq_thread_fn+0x1c/0x34
[   42.010766]                    [<c0087548>] irq_thread+0x13c/0x1dc
[   42.017267]                    [<c0050f10>] kthread+0xe4/0xf8
[   42.023314]                    [<c000f810>] ret_from_fork+0x14/0x24
[   42.029903]  }
[   42.031648]  ... key      at: [<bf0854cc>] __key.42091+0x0/0xfffff0f8 [arcnet]
[   42.039255]  ... acquired at:
[   42.042372]    [<c007bed8>] lock_acquire+0x70/0x90
[   42.047413]    [<c06f9140>] _raw_spin_lock_irqsave+0x40/0x54
[   42.053364]    [<bf083bc8>] arcnet_send_packet+0x60/0x1c0 [arcnet]
[   42.059872]    [<c06b9380>] packet_direct_xmit+0x130/0x1c8
[   42.065634]    [<c06bc7e4>] packet_sendmsg+0x3b8/0x680
[   42.071030]    [<c05fe8b0>] sock_sendmsg+0x14/0x24
[   42.076069]    [<c05ffd68>] SyS_sendto+0xb8/0xe0
[   42.080926]    [<c05ffda8>] SyS_send+0x18/0x20
[   42.085601]    [<c000f780>] ret_fast_syscall+0x0/0x1c
[   42.090918]
[   42.092481]
[   42.092481] stack backtrace:
[   42.097065] CPU: 0 PID: 233 Comm: arcecho Not tainted 4.4.0-00034-gc0ae784 #536
[   42.104751] Hardware name: Generic AM33XX (Flattened Device Tree)
[   42.111183] [<c0017ec8>] (unwind_backtrace) from [<c00139d0>] (show_stack+0x10/0x14)
[   42.119337] [<c00139d0>] (show_stack) from [<c02a82c4>] (dump_stack+0x8c/0x9c)
[   42.126937] [<c02a82c4>] (dump_stack) from [<c0078260>] (check_usage+0x4bc/0x63c)
[   42.134815] [<c0078260>] (check_usage) from [<c0078438>] (check_irq_usage+0x58/0xb0)
[   42.142964] [<c0078438>] (check_irq_usage) from [<c007aaa0>] (__lock_acquire+0x1524/0x20b0)
[   42.151740] [<c007aaa0>] (__lock_acquire) from [<c007bed8>] (lock_acquire+0x70/0x90)
[   42.159886] [<c007bed8>] (lock_acquire) from [<c06f9140>] (_raw_spin_lock_irqsave+0x40/0x54)
[   42.168768] [<c06f9140>] (_raw_spin_lock_irqsave) from [<bf083bc8>] (arcnet_send_packet+0x60/0x1c0 [arcnet])
[   42.179115] [<bf083bc8>] (arcnet_send_packet [arcnet]) from [<c06b9380>] (packet_direct_xmit+0x130/0x1c8)
[   42.189182] [<c06b9380>] (packet_direct_xmit) from [<c06bc7e4>] (packet_sendmsg+0x3b8/0x680)
[   42.198059] [<c06bc7e4>] (packet_sendmsg) from [<c05fe8b0>] (sock_sendmsg+0x14/0x24)
[   42.206199] [<c05fe8b0>] (sock_sendmsg) from [<c05ffd68>] (SyS_sendto+0xb8/0xe0)
[   42.213978] [<c05ffd68>] (SyS_sendto) from [<c05ffda8>] (SyS_send+0x18/0x20)
[   42.221388] [<c05ffda8>] (SyS_send) from [<c000f780>] (ret_fast_syscall+0x0/0x1c)

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
   ---
   v1 -> v2: removed unneeded zero assignment of flags
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  rocker: move dereference before free
Dan Carpenter [Wed, 28 Jun 2017 11:44:21 +0000 (14:44 +0300)]
rocker: move dereference before free

My static checker complains that ofdpa_neigh_del() can sometimes free
"found".   It just makes sense to use it first before deleting it.

Fixes: ecf244f753e0 ("rocker: fix maybe-uninitialized warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  mlxsw: spectrum_router: Fix NULL pointer dereference
Ido Schimmel [Wed, 28 Jun 2017 06:03:12 +0000 (09:03 +0300)]
mlxsw: spectrum_router: Fix NULL pointer dereference

In case a VLAN device is enslaved to a bridge we shouldn't create a
router interface (RIF) for it when it's configured with an IP address.
This is already handled by the driver for other types of netdevs, such
as physical ports and LAG devices.

If this IP address is then removed and the interface is subsequently
unlinked from the bridge, a NULL pointer dereference can happen, as the
original 802.1d FID was replaced with an rFID which was then deleted.

To reproduce:
$ ip link set dev enp3s0np9 up
$ ip link add name enp3s0np9.111 link enp3s0np9 type vlan id 111
$ ip link set dev enp3s0np9.111 up
$ ip link add name br0 type bridge
$ ip link set dev br0 up
$ ip link set enp3s0np9.111 master br0
$ ip address add dev enp3s0np9.111 192.168.0.1/24
$ ip address del dev enp3s0np9.111 192.168.0.1/24
$ ip link set dev enp3s0np9.111 nomaster

Fixes: 99724c18fc66 ("mlxsw: spectrum: Introduce support for router interfaces")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Petr Machata <petrm@mellanox.com>
Tested-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  net: sched: Fix one possible panic when no destroy callback
Gao Feng [Wed, 28 Jun 2017 04:53:54 +0000 (12:53 +0800)]
net: sched: Fix one possible panic when no destroy callback

When a qdisc fails to init, qdisc_create() invokes the destroy callback
to clean up. But there is no check that the callback actually exists, so
this causes a panic for qdiscs that have no destroy callback, such as
codel, fq, and so on.

Take codel as an example:
When a malicious user constructs an invalid netlink message, it causes
codel_init->codel_change->nla_parse_nested to fail.
The kernel then invokes the destroy callback directly, but qdisc codel
doesn't define one. The result is a panic.

Now add a check for the destroy callback to avoid the possible panic.
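
A minimal sketch of the pattern, assuming a toy ops table rather than
the real struct Qdisc_ops (all toy_* names and the two example
callbacks are hypothetical):

```c
#include <errno.h>
#include <stddef.h>

struct toy_qdisc;

/* Toy ops table: destroy is optional and may be NULL. */
struct toy_qdisc_ops {
	int (*init)(struct toy_qdisc *q);
	void (*destroy)(struct toy_qdisc *q);
};

struct toy_qdisc {
	const struct toy_qdisc_ops *ops;
	int destroyed;
};

/* Example init that always fails, e.g. on an invalid netlink message. */
static int failing_init(struct toy_qdisc *q)
{
	(void)q;
	return -EINVAL;
}

/* Example destroy callback that just records that it ran. */
static void mark_destroyed(struct toy_qdisc *q)
{
	q->destroyed = 1;
}

static int toy_qdisc_create(struct toy_qdisc *q,
			    const struct toy_qdisc_ops *ops)
{
	int err;

	q->ops = ops;
	q->destroyed = 0;

	err = ops->init ? ops->init(q) : 0;
	if (err) {
		/* Only clean up through the callback if one exists. */
		if (ops->destroy)
			ops->destroy(q);
		return err;
	}
	return 0;
}
```

With an ops table whose destroy member is NULL, the failed init now
simply returns the error instead of calling through a NULL pointer.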

Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  virtio-net: serialize tx routine during reset
Jason Wang [Wed, 28 Jun 2017 01:51:03 +0000 (09:51 +0800)]
virtio-net: serialize tx routine during reset

We don't hold any tx lock when trying to disable TX during reset. This
can lead to a use after free, since ndo_start_xmit() tries to access
the virtqueue which has already been freed. Fix this by calling
netif_tx_disable() before freeing the vqs, which makes sure no tx
happens after the vqs are freed.

Reported-by: Jean-Philippe Menil <jpmenil@gmail.com>
Tested-by: Jean-Philippe Menil <jpmenil@gmail.com>
Fixes: f600b6905015 ("virtio_net: Add XDP support")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Robert McCabe <robert.mccabe@rockwellcollins.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago  nvme: Makefile: remove dead build rule
Valentin Rothberg [Thu, 29 Jun 2017 06:59:07 +0000 (08:59 +0200)]
nvme: Makefile: remove dead build rule

Remove dead build rule for drivers/nvme/host/scsi.c which has been
removed by commit ("nvme: Remove SCSI translations").

Signed-off-by: Valentin Rothberg <vrothberg@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years ago  blk-mq: map all HWQ also in hyperthreaded system
Max Gurtovoy [Thu, 29 Jun 2017 14:40:11 +0000 (08:40 -0600)]
blk-mq: map all HWQ also in hyperthreaded system

This patch performs sequential mapping between CPUs and queues.
If the system has more CPUs than HWQs, some CPUs are still left
to map to HWQs. In a hyperthreaded system, map the unmapped CPUs
and their siblings to the same HWQ.
This fixes a bug that left HWQs unmapped in a system with
2 sockets, 18 cores per socket, 2 threads per core (72 CPUs in total)
running NVMEoF (which opens up to a maximum of 64 HWQs).
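
The mapping rule can be sketched as below. This is a simplified model,
not the blk-mq code: it assumes the hyperthread sibling of CPU c is
CPU c + nr_cores (so the first sibling is c % nr_cores), and all
toy_* names are hypothetical:

```c
/*
 * Assumption: thread siblings are numbered cpu and cpu + nr_cores,
 * so the first sibling of any CPU is cpu % nr_cores.
 */
static unsigned int toy_first_sibling(unsigned int cpu,
				      unsigned int nr_cores)
{
	return cpu % nr_cores;
}

/* Fill map[cpu] with the hardware queue index for each CPU. */
static void toy_map_queues(unsigned int *map, unsigned int nr_cpus,
			   unsigned int nr_cores, unsigned int nr_queues)
{
	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned int first = toy_first_sibling(cpu, nr_cores);

		if (cpu < nr_queues || first == cpu)
			map[cpu] = cpu % nr_queues; /* sequential mapping */
		else
			map[cpu] = map[first]; /* share the sibling's queue */
	}
}

/* Count how many queues have at least one CPU mapped to them. */
static unsigned int toy_mapped_queues(const unsigned int *map,
				      unsigned int nr_cpus,
				      unsigned int nr_queues)
{
	unsigned int count = 0;

	for (unsigned int q = 0; q < nr_queues; q++) {
		for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
			if (map[cpu] == q) {
				count++;
				break;
			}
		}
	}
	return count;
}
```

For the 72-CPU, 36-core, 64-queue case described above, this model maps
CPUs 0-63 to queues 0-63 and CPUs 64-71 to the queues of their first
siblings (28-35), so every queue ends up with at least one CPU.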

Performance results running fio (72 jobs, 128 iodepth)
using null_blk (w/w.o patch):

bs     IOPS(read submit_queues=72)  IOPS(write submit_queues=72)  IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
-----  ---------------------------  ----------------------------  ---------------------------  ----------------------------
512    4890.4K/4723.5K              4524.7K/4324.2K               4280.2K/4264.3K              3902.4K/3909.5K
1k     4910.1K/4715.2K              4535.8K/4309.6K               4296.7K/4269.1K              3906.8K/3914.9K
2k     4906.3K/4739.7K              4526.7K/4330.6K               4301.1K/4262.4K              3890.8K/3900.1K
4k     4918.6K/4730.7K              4556.1K/4343.6K               4297.6K/4264.5K              3886.9K/3893.9K
8k     4906.4K/4748.9K              4550.9K/4346.7K               4283.2K/4268.8K              3863.4K/3858.2K
16k    4903.8K/4782.6K              4501.5K/4233.9K               4292.3K/4282.3K              3773.1K/3773.5K
32k    4885.8K/4782.4K              4365.9K/4184.2K               4307.5K/4289.4K              3780.3K/3687.3K
64k    4822.5K/4762.7K              2752.8K/2675.1K               4308.8K/4312.3K              2651.5K/2655.7K
128k   2388.5K/2313.8K              1391.9K/1375.7K               2142.8K/2152.2K              1395.5K/1374.2K

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years ago  arch: remove unused macro/function thread_saved_pc()
Tobias Klauser [Wed, 28 Jun 2017 13:30:02 +0000 (15:30 +0200)]
arch: remove unused macro/function thread_saved_pc()

The only user of thread_saved_pc() in non-arch-specific code was removed
in commit 8243d5597793 ("sched/core: Remove pointless printout in
sched_show_task()").  Remove the implementations as well.

Some architectures use thread_saved_pc() in their arch-specific code.
Leave their thread_saved_pc() intact.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8 years ago  block: provide bio_uninit() free freeing integrity/task associations
Jens Axboe [Wed, 28 Jun 2017 21:30:13 +0000 (15:30 -0600)]
block: provide bio_uninit() free freeing integrity/task associations

Wen reports significant memory leaks with DIF and O_DIRECT:

"With nvme devive + T10 enabled, On a system it has 256GB and started
logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
leaking.

/proc/meminfo | grep SUnreclaim...

SUnreclaim:      6752128 kB
SUnreclaim:      6874880 kB
SUnreclaim:      7238080 kB
....
SUnreclaim:     22307264 kB
SUnreclaim:     22485888 kB
SUnreclaim:     22720256 kB

When testcases with T10 enabled call into __blkdev_direct_IO_simple,
code doesn't free memory allocated by bio_integrity_alloc. The patch
fixes the issue. HTX has been run with +60 hours without failure."

Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.

Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bio's that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit, it's just the most obvious one out there.
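
The shape of the problem and of the fix can be sketched with a toy
model. This is not the block layer code: struct toy_bio and all toy_*
functions are hypothetical, and malloc() stands in for the integrity
payload allocation:

```c
#include <stdlib.h>

/* Toy bio: lives on the caller's stack, but may own heap data. */
struct toy_bio {
	void *integrity; /* ancillary data, allocated on demand */
};

static int toy_live_allocs; /* outstanding ancillary allocations */

static void toy_bio_init(struct toy_bio *bio)
{
	bio->integrity = NULL;
}

static int toy_bio_integrity_alloc(struct toy_bio *bio, size_t len)
{
	bio->integrity = malloc(len);
	if (!bio->integrity)
		return -1;
	toy_live_allocs++;
	return 0;
}

/* Release everything attached to the bio, but not the bio itself. */
static void toy_bio_uninit(struct toy_bio *bio)
{
	if (bio->integrity) {
		free(bio->integrity);
		bio->integrity = NULL;
		toy_live_allocs--;
	}
}

static int toy_direct_io_simple(size_t integrity_len)
{
	struct toy_bio bio; /* on the stack: no regular free path runs */

	toy_bio_init(&bio);
	if (toy_bio_integrity_alloc(&bio, integrity_len))
		return -1;

	/* ... submit the I/O and wait for completion ... */

	toy_bio_uninit(&bio); /* without this call the payload leaks */
	return 0;
}
```

In this model, dropping the toy_bio_uninit() call leaves
toy_live_allocs growing by one per I/O, which mirrors the steadily
increasing SUnreclaim numbers in the report.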

Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years ago  Merge tag 'nfs-for-4.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Linus Torvalds [Wed, 28 Jun 2017 20:27:15 +0000 (13:27 -0700)]
Merge tag 'nfs-for-4.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Bugfixes include:

   - stable fix for exclusive create if the server supports the umask
     attribute

   - trunking detection should handle ERESTARTSYS/EINTR

   - stable fix for a race in the LAYOUTGET function

   - stable fix to revert "nfs_rename() handle -ERESTARTSYS dentry left
     behind"

   - nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()"

* tag 'nfs-for-4.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()
  Revert "NFS: nfs_rename() handle -ERESTARTSYS dentry left behind"
  NFSv4.1: Fix a race in nfs4_proc_layoutget
  NFS: Trunking detection should handle ERESTARTSYS/EINTR
  NFSv4.2: Don't send mode again in post-EXCLUSIVE4_1 SETATTR with umask

8 years ago  Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Wed, 28 Jun 2017 20:22:26 +0000 (13:22 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "This is the final set of fixes for -rc8, just a few i915 and one
  vmwgfx ones.

  I'm off on holidays for a week, so if anything shows up for fixes I've
  asked Daniel or Sean Paul to herd it in the right direction"

[ The additional etnaviv fixes were already herded towards me as seen in
  my previous pull - Linus ]

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
  drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations
  drm/i915: Hold struct_mutex for per-file stats in debugfs/i915_gem_object
  drm/i915: Retire the VMA's fence tracker before unbinding

8 years ago  Merge branch 'etnaviv/fixes' of git://git.pengutronix.de/git/lst/linux
Linus Torvalds [Wed, 28 Jun 2017 20:13:48 +0000 (13:13 -0700)]
Merge branch 'etnaviv/fixes' of git://git.pengutronix.de/git/lst/linux

Pull drm/etnaviv fixes from Lucas Stach:
 "I realized I just missed the cut-off point for the final drm fixes
  pull, but I have 2 more etnaviv fixes that need to go into 4.12, as
  they fix fallout from the explicit sync work introduced in the last
  merge window"

[ Pulling directly because Dave is on vacation. Noted by Daniel Vetter,
  and acked by Dave Airlie  - Linus ]

* 'etnaviv/fixes' of git://git.pengutronix.de/git/lst/linux:
  drm/etnaviv: Fix implicit/explicit sync sense inversion
  drm/etnaviv: fix submit flags getting overwritten by BO content

8 years agonvmet-rdma: register ib_client to not deadlock in device removal
Sagi Grimberg [Tue, 27 Jun 2017 06:23:33 +0000 (09:23 +0300)]
nvmet-rdma: register ib_client to not deadlock in device removal

We can deadlock if we get a device removal event on a
queue which is already in the process of destroying
its cm_id, as the destroy blocks until all events on
this cm_id have drained. On the other hand, we cannot
guarantee that rdma_destroy_id was invoked, as we only
have an indication that the queue disconnect flow has
been queued (the queue state is updated before the
release work has been queued).

So, we leave all the queue removal to a separate ib_client
to avoid this deadlock as ib_client device removal is in
a different context than the cm_id itself.

Reported-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme_fc: fix error recovery on link down.
James Smart [Thu, 22 Jun 2017 00:43:21 +0000 (17:43 -0700)]
nvme_fc: fix error recovery on link down.

Currently, the fc transport invokes nvme_fc_error_recovery() on every
io in which the transport detects an error, which means:
a) it's really noisy on large io loads that all get hit by a link down.
b) we repeatedly call nvme_stop_queues() even though queues are
 stopped upon the first error or as first steps of reset_work.

Correct by:
Errors are only meaningful if the controller is in the LIVE state.
Thus, enact the reset_work only if LIVE. If called repeatedly, state
will have already transitioned.
There's no need to stop the queues here. Let the first steps of
reset_work do the queue stopping.
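
The LIVE-only gating can be sketched in plain user-space C. This is an
illustration of the idea, not the driver code: the enum, the struct and
error_recovery() are hypothetical stand-ins, and the sketch is
single-threaded, so it ignores the locking a real state change needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller states; only the LIVE/RESETTING pair matters here. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

struct ctrl {
	enum ctrl_state state;
	int resets_scheduled;	/* how many times reset work was queued */
};

/*
 * Called once per failed io. Only the first caller sees LIVE, moves the
 * controller to RESETTING and schedules the reset. Later callers find
 * the state already changed and return without doing anything, so the
 * queues are not stopped again and nothing is logged repeatedly.
 */
bool error_recovery(struct ctrl *c)
{
	assert(c != NULL);

	if (c->state != CTRL_LIVE)
		return false;

	c->state = CTRL_RESETTING;
	c->resets_scheduled++;
	return true;
}
```

Calling error_recovery() for every io hit by the same link down
schedules exactly one reset, no matter how many ios fail.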

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvmet_fc: fix crashes on bad opcodes
James Smart [Fri, 16 Jun 2017 06:41:41 +0000 (23:41 -0700)]
nvmet_fc: fix crashes on bad opcodes

If an nvme command is issued with an opcode that is not supported by
the target (example: opcode 21 - detach namespace), the target
crashes due to a null pointer.

nvmet_req_init() detects the bad opcode and immediately calls the nvme
command done routine with an error status, allowing the transport to
send the response. However, the FC transport was aborting the command
on error, so the abort freed the lldd pointer, but the rsp transmit path
referenced it post the free.

Fix by removing the abort call on nvmet_req_init() failure.
The completion response will be sent with an error status code.

As the completion path will terminate the io, ensure the data_sg
lists show an unused state so that teardown paths are successful.

Signed-off-by: Paul Ely <Paul.Ely@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme_fc: Fix crash when nvme controller connection fails.
James Smart [Fri, 16 Jun 2017 06:40:54 +0000 (23:40 -0700)]
nvme_fc: Fix crash when nvme controller connection fails.

If a controller connection is attempted (say to a subsystem that
does not exist), the first attempt errors out.  If another connect
is attempted, it crashes.

The issue is that the prior controller has yet to execute its final put,
thus it's still on lists. However, the opts pointers on it have been
cleared, thus causing the crash if they are referenced.

Fix is to add the missing put after the nvme_uninit_ctrl() call on
the attachment failure.

Signed-off-by: Paul Ely <Paul.Ely@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme_fc: replace ioabort msleep loop with completion
James Smart [Mon, 22 May 2017 22:28:42 +0000 (15:28 -0700)]
nvme_fc: replace ioabort msleep loop with completion

Per the recommendation by Sagi on:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html

The wait for io aborts to complete is converted from an msleep loop to
using a struct completion.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme_fc: fix double calls to nvme_cleanup_cmd()
James Smart [Thu, 22 Jun 2017 00:43:05 +0000 (17:43 -0700)]
nvme_fc: fix double calls to nvme_cleanup_cmd()

Current fc transport code, on io termination, is calling
nvme_cleanup_cmd() followed by the transport dma unmap routine
which also calls nvme_cleanup_cmd(). This means two kfrees occur
on the same address, wreaking havoc. This resulted in odd data errors,
effectively corruption.

Fix by removing the extraneous double calls. Call now occurs only in
teardown paths and as part of dma unmap routine.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme-fabrics: verify that a controller returns the correct NQN
Christoph Hellwig [Mon, 26 Jun 2017 10:39:04 +0000 (12:39 +0200)]
nvme-fabrics: verify that a controller returns the correct NQN

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: simplify nvme_dev_attrs_are_visible
Christoph Hellwig [Mon, 26 Jun 2017 10:39:03 +0000 (12:39 +0200)]
nvme: simplify nvme_dev_attrs_are_visible

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: read the subsystem NQN from Identify Controller
Christoph Hellwig [Mon, 26 Jun 2017 10:39:02 +0000 (12:39 +0200)]
nvme: read the subsystem NQN from Identify Controller

NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the
Identify controller data structures.  Use this NQN for the subsysnqn
sysfs attribute by storing it in the nvme_ctrl structure after verifying
it.  For older controllers we generate a "fake" NQN per non-normative
text in the NVMe 1.3 spec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: remove a misleading comment on struct nvme_ns
Christoph Hellwig [Mon, 26 Jun 2017 10:39:01 +0000 (12:39 +0200)]
nvme: remove a misleading comment on struct nvme_ns

While an NVMe Namespace is somewhat similar to a SCSI Logical Unit (and not
a Logical Unit Number anyway) there are subtle differences.  Remove the
misleading comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grmberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: explicitly disable APST on quirked devices
Kai-Heng Feng [Mon, 26 Jun 2017 20:39:54 +0000 (16:39 -0400)]
nvme: explicitly disable APST on quirked devices

A user reports APST is enabled, even when the NVMe is quirked or with
option "default_ps_max_latency_us=0".

The current logic will not set APST if the device is quirked. But the
NVMe in question will enable APST automatically.

Separate the logic "apst is supported" and "to enable apst", so we can
use the latter one to explicitly disable APST at initialization.

BugLink: https://bugs.launchpad.net/bugs/1699004
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: use a single NVME_AQ_DEPTH and relax it to 32
Sagi Grimberg [Sun, 18 Jun 2017 13:15:59 +0000 (16:15 +0300)]
nvme: use a single NVME_AQ_DEPTH and relax it to 32

No need to differentiate fabrics from pci/loop, also lower
it to 32 as we don't really need 256 inflight admin commands.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: add hostid token to fabric options
Johannes Thumshirn [Tue, 20 Jun 2017 12:23:01 +0000 (14:23 +0200)]
nvme: add hostid token to fabric options

Currently we have no way to define a stable host-id but always use the one
which is randomly generated when we add the host or use the default host.

Provide a "hostid=%s" option for user-space to pass in a persistent
host-id which overrides the randomly generated one.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: Remove SCSI translations
Keith Busch [Tue, 20 Jun 2017 19:09:56 +0000 (15:09 -0400)]
nvme: Remove SCSI translations

The SCSI-to-NVMe translations were added to assist storage applications
utilizing SG_IO transitioning to NVMe. It was always recommended,
however, to use native NVMe for device management as too much is lost
in translation and the maintenance burden in keeping this kludgey
layer around has been neglected such that much of the translations are
completely broken.

This patch removes SG_IO handling from NVMe to avoid any confusion
regarding maintenance support for this interface. The config option for
NVMe SCSI emulation has been disabled by default since 4.5. The driver
has supported native nvme user commands since the beginning, and native
tooling is publicly available for use or as reference for anyone writing
their own tools, so there's no excuse for hanging onto a broken crutch.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme-pci: open-code polling logic in nvme_poll
Sagi Grimberg [Sun, 18 Jun 2017 14:28:10 +0000 (17:28 +0300)]
nvme-pci: open-code polling logic in nvme_poll

Given that the code is simple enough it seems better
than passing a tag by reference for each call site. Also,
we can now get rid of __nvme_process_cq.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme-pci: factor out the cqe reading mechanics from __nvme_process_cq
Sagi Grimberg [Sun, 18 Jun 2017 14:28:09 +0000 (17:28 +0300)]
nvme-pci: factor out the cqe reading mechanics from __nvme_process_cq

Also, maintain a consumed counter to rely on for doorbell and
cqe_seen update instead of directly relying on the cq head and phase.
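
As a rough user-space illustration of the consumed-counter idea (not
the nvme-pci code): the ring layout, CQ_DEPTH, read_cqe() and
process_cq() below are hypothetical, and the doorbell is modelled as a
plain field rather than an MMIO write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CQ_DEPTH 4

struct cqe {
	uint16_t command_id;
	uint16_t status;	/* bit 0 is the phase tag */
};

struct cq {
	struct cqe entries[CQ_DEPTH];
	uint16_t head;
	uint8_t phase;			/* phase value that marks a new entry */
	unsigned int doorbell_writes;	/* how often the doorbell was rung */
	uint16_t doorbell_value;	/* last head value written to it */
};

/* Copy out one entry if its phase tag matches; flip the phase on wrap. */
bool read_cqe(struct cq *q, struct cqe *out)
{
	assert(q != NULL && out != NULL);

	if ((q->entries[q->head].status & 1) != q->phase)
		return false;

	*out = q->entries[q->head];
	if (++q->head == CQ_DEPTH) {
		q->head = 0;
		q->phase = !q->phase;
	}
	return true;
}

/* Drain the queue and ring the doorbell once if anything was consumed. */
int process_cq(struct cq *q)
{
	struct cqe cqe;
	int consumed = 0;

	while (read_cqe(q, &cqe))
		consumed++;	/* a real driver would complete the request here */

	if (consumed) {
		q->doorbell_value = q->head;
		q->doorbell_writes++;
	}
	return consumed;
}
```

The caller decides what to do from the returned count alone, without
comparing the head and phase values from before and after the loop.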

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme-pci: factor out cqe handling into a dedicated routine
Sagi Grimberg [Sun, 18 Jun 2017 14:28:08 +0000 (17:28 +0300)]
nvme-pci: factor out cqe handling into a dedicated routine

Makes the code slightly more readable.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme-pci: Introduce nvme_ring_cq_doorbell
Sagi Grimberg [Sun, 18 Jun 2017 14:28:07 +0000 (17:28 +0300)]
nvme-pci: Introduce nvme_ring_cq_doorbell

Nice abstraction of the actual mechanics of how to do it.
Note the change that we call it after we assign nvmeq->cq_head
to avoid passing it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agofs/fcntl: use copy_to/from_user() for u64 types
Jens Axboe [Wed, 28 Jun 2017 14:09:45 +0000 (08:09 -0600)]
fs/fcntl: use copy_to/from_user() for u64 types

Some architectures (at least PPC) don't like get/put_user with
64-bit types on a 32-bit system. Use the variably sized copy
to/from user variants instead.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agodrm/etnaviv: Fix implicit/explicit sync sense inversion
Daniel Stone [Thu, 22 Jun 2017 11:22:22 +0000 (12:22 +0100)]
drm/etnaviv: Fix implicit/explicit sync sense inversion

We were reading the no-implicit sync flag the wrong way around,
synchronizing too much for the explicit case, and not at all for the
implicit case. Oops.

Signed-off-by: Daniel Stone <daniels@collabora.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
8 years agodrm/etnaviv: fix submit flags getting overwritten by BO content
Lucas Stach [Tue, 27 Jun 2017 14:02:51 +0000 (16:02 +0200)]
drm/etnaviv: fix submit flags getting overwritten by BO content

The addition of the flags member to etnaviv_gem_submit structure didn't
take into account that the last member of this structure is a variable
length array.
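
A minimal sketch of the layout rule involved, assuming a C99 flexible
array member. The struct and function names are hypothetical and are
not the etnaviv definitions. Any fixed field has to be declared before
the trailing array, otherwise writes to the array elements land on top
of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct submit_bo {
	uint32_t handle;
};

/* Hypothetical submit object: fixed fields first, variable-length array last. */
struct submit {
	uint32_t nr_bos;
	uint32_t flags;			/* declared before the trailing array */
	struct submit_bo bos[];		/* C99 flexible array member */
};

struct submit *submit_create(uint32_t nr_bos, uint32_t flags)
{
	struct submit *s;
	uint32_t i;

	/* Sanity check of the layout this sketch relies on. */
	assert(offsetof(struct submit, flags) < offsetof(struct submit, bos));

	s = malloc(sizeof(*s) + nr_bos * sizeof(s->bos[0]));
	if (!s)
		return NULL;

	s->nr_bos = nr_bos;
	s->flags = flags;
	for (i = 0; i < nr_bos; i++)
		s->bos[i].handle = 0xffffffffu;	/* cannot overwrite flags */
	return s;
}
```

With the C99 form a compiler rejects a member declared after bos[]. The
older GNU zero-length form (bos[0]) is accepted silently, which is one
way this class of bug can slip in.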

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
8 years agoMerge tag 'drm-intel-fixes-2017-06-27' of git://anongit.freedesktop.org/git/drm-intel...
Dave Airlie [Wed, 28 Jun 2017 07:07:15 +0000 (17:07 +1000)]
Merge tag 'drm-intel-fixes-2017-06-27' of git://anongit.freedesktop.org/git/drm-intel into drm-fixes

Just a few minor fixes. Important one is the execbuf async fix (aka
ANDROID_native_sync). There was another patch for a display coherency
corner case on APL, but we've random-walked in that space too much,
and the cherry-pick looked really invasive.

* tag 'drm-intel-fixes-2017-06-27' of git://anongit.freedesktop.org/git/drm-intel:
  drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations
  drm/i915: Hold struct_mutex for per-file stats in debugfs/i915_gem_object
  drm/i915: Retire the VMA's fence tracker before unbinding

8 years agoMerge branch 'vmwgfx-fixes-4.12' of git://people.freedesktop.org/~thomash/linux into...
Dave Airlie [Wed, 28 Jun 2017 07:06:58 +0000 (17:06 +1000)]
Merge branch 'vmwgfx-fixes-4.12' of git://people.freedesktop.org/~thomash/linux into drm-fixes

Single vmwgfx fix
* 'vmwgfx-fixes-4.12' of git://people.freedesktop.org/~thomash/linux:
  drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr

8 years agoNFSv4.1: nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()
Trond Myklebust [Tue, 27 Jun 2017 21:40:50 +0000 (17:40 -0400)]
NFSv4.1: nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()

The current code works only for the case where we have exactly one slot,
which is no longer true.
nfs4_free_slot() will automatically declare the callback channel to be
drained when all slots have been returned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoRevert "NFS: nfs_rename() handle -ERESTARTSYS dentry left behind"
Benjamin Coddington [Fri, 16 Jun 2017 15:12:59 +0000 (11:12 -0400)]
Revert "NFS: nfs_rename() handle -ERESTARTSYS dentry left behind"

This reverts commit 920b4530fb80430ff30ef83efe21ba1fa5623731 which could
call d_move() without holding the directory's i_mutex, and reverts commit
d4ea7e3c5c0e341c15b073016dbf3ab6c65f12f3 "NFS: Fix old dentry rehash after
move", which was a follow-up fix.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 920b4530fb80 ("NFS: nfs_rename() handle -ERESTARTSYS dentry left behind")
Cc: stable@vger.kernel.org # v4.10+
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.1: Fix a race in nfs4_proc_layoutget
Trond Myklebust [Tue, 27 Jun 2017 21:33:38 +0000 (17:33 -0400)]
NFSv4.1: Fix a race in nfs4_proc_layoutget

If the task calling layoutget is signalled, then it is possible for the
calls to nfs4_sequence_free_slot() and nfs4_layoutget_prepare() to race,
in which case we leak a slot.
The fix is to move the call to nfs4_sequence_free_slot() into the
nfs4_layoutget_release() so that it gets called at task teardown time.

Fixes: 2e80dbe7ac51 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFS: Trunking detection should handle ERESTARTSYS/EINTR
Trond Myklebust [Wed, 21 Jun 2017 14:16:56 +0000 (10:16 -0400)]
NFS: Trunking detection should handle ERESTARTSYS/EINTR

Currently, it will return EIO in those cases.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agodrbd: Drop unnecessary static
Julia Lawall [Tue, 27 Jun 2017 23:56:50 +0000 (17:56 -0600)]
drbd: Drop unnecessary static

Drop static on a local variable, when the variable is initialized before
any use, on every possible execution path through the function.  The
static has no benefit, and dropping it reduces the code size.
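
For illustration only, here is the pattern in miniature. The two
functions are made up for this example and are not taken from drbd.
Both return the same value; the second simply lets the compiler keep
the variable in a register or on the stack.

```c
#include <assert.h>

/*
 * Before: the local is static although it is assigned before every use,
 * so its value never needs to survive from one call to the next.
 */
int scale_before(int x)
{
	static int factor;

	factor = 3;
	return x * factor;
}

/* After: same behaviour, with an ordinary automatic variable. */
int scale_after(int x)
{
	int factor;

	factor = 3;
	return x * factor;
}
```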

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@bad exists@
position p;
identifier x;
type T;
@@

static T x@p;
...
x = <+...x...+>

@@
identifier x;
expression e;
type T;
position p != bad.p;
@@

-static
 T x@p;
 ... when != x
     when strict
?x = e;
// </smpl>

The change in code size is indicated by the following output from the size
command.

before:
   text    data     bss     dec     hex filename
  67299    2291    1056   70646   113f6 drivers/block/drbd/drbd_nl.o

after:
   text    data     bss     dec     hex filename
  67283    2291    1056   70630   113e6 drivers/block/drbd/drbd_nl.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme/pci: Fix stuck nvme reset
Keith Busch [Tue, 27 Jun 2017 23:44:05 +0000 (17:44 -0600)]
nvme/pci: Fix stuck nvme reset

The controller state is set to resetting prior to disabling the
controller, so this patch accounts for that state when deciding if it
needs to freeze the queues. Without this, an 'nvme reset /dev/nvme0'
blocks forever because the queues were never frozen.

Fixes: 82b057caefaf ("nvme-pci: fix multiple ctrl removal scheduling")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonet: usb: asix88179_178a: Add support for the Belkin B2B128
Andrew F. Davis [Mon, 26 Jun 2017 17:41:20 +0000 (12:41 -0500)]
net: usb: asix88179_178a: Add support for the Belkin B2B128

The Belkin B2B128 is a USB 3.0 Hub + Gigabit Ethernet Adapter. The
Ethernet adapter uses the ASIX AX88179 USB 3.0 to Gigabit Ethernet
chip supported by this driver, so add the USB ID for it.

This patch is based on work by Geoffrey Tran <geoffrey.tran@gmail.com>
who has indicated they would like this upstreamed by someone more
familiar with the upstreaming process.

Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agofsl/fman: add dependency on HAS_DMA
Madalin Bucur [Mon, 26 Jun 2017 15:47:00 +0000 (18:47 +0300)]
fsl/fman: add dependency on HAS_DMA

A previous commit (5567e989198b5a8d) inserted a dependency on DMA
API that requires HAS_DMA to be added in Kconfig.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodm thin: do not queue freed thin mapping for next stage processing
Vallish Vaidyeshwara [Fri, 23 Jun 2017 18:53:06 +0000 (18:53 +0000)]
dm thin: do not queue freed thin mapping for next stage processing

process_prepared_discard_passdown_pt1() should clean up
dm_thin_new_mapping in cases of error.

dm_pool_inc_data_range() can fail trying to get a block reference:

metadata operation 'dm_pool_inc_data_range' failed: error = -61

When dm_pool_inc_data_range() fails, dm thin aborts current metadata
transaction and marks pool as PM_READ_ONLY. Memory for thin mapping
is released as well. However, current thin mapping will be queued
onto next stage as part of queue_passdown_pt2() or passdown_endio().
This dangling thin mapping memory when processed and accessed in
next stage will lead to device mapper crashing.

Code flow without fix:
-> process_prepared_discard_passdown_pt1(m)
   -> dm_thin_remove_range()
   -> discard passdown
      --> passdown_endio(m) queues m onto next stage
   -> dm_pool_inc_data_range() fails, frees memory m
            but does not remove it from next stage queue

-> process_prepared_discard_passdown_pt2(m)
   -> processes freed memory m and crashes

One such stack:

Call Trace:
[<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
[<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
[<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
[<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
[<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
[<ffffffff8152bf54>] ? __schedule+0x244/0x680
[<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
[<ffffffff81089f53>] process_one_work+0x153/0x3f0
[<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
[<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
[<ffffffff8108fd6a>] kthread+0xca/0xe0
[<ffffffff8108fca0>] ? kthread_park+0x60/0x60
[<ffffffff81530b45>] ret_from_fork+0x25/0x30

The fix is to first take the block ref count for discarded block and
then do a passdown discard of this block. If block ref count fails,
then bail out aborting current metadata transaction, mark pool as
PM_READ_ONLY and also free current thin mapping memory (existing error
handling code) without queueing this thin mapping onto next stage of
processing. If block ref count succeeds, then passdown discard of this
block. Discard callback of passdown_endio() will queue this thin mapping
onto next stage of processing.

Code flow with fix:
-> process_prepared_discard_passdown_pt1(m)
   -> dm_thin_remove_range()
   -> dm_pool_inc_data_range()
      --> if fails, free memory m and bail out
   -> discard passdown
      --> passdown_endio(m) queues m onto next stage
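
The fixed ordering can be sketched in user-space C as follows. This is
a simplified model, not the dm-thin code: struct pool, the single-slot
next_stage field and the inc_fails knob are hypothetical, and
inc_data_range() merely stands in for dm_pool_inc_data_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct mapping {
	int id;
};

struct pool {
	struct mapping *next_stage;	/* single-slot stand-in for the pt2 queue */
	bool inc_fails;			/* knob: make the ref-count step fail */
	bool read_only;
};

/* Stand-in for dm_pool_inc_data_range(): 0 on success, -61 as in the log above. */
int inc_data_range(const struct pool *p)
{
	return p->inc_fails ? -61 : 0;
}

/*
 * Take the block reference first. On failure, mark the pool read-only,
 * free m and bail out, so m is never visible to the next stage. Only on
 * success is m handed on (here by a plain assignment, where the real
 * code would queue it from the discard completion callback).
 */
int discard_passdown_pt1(struct pool *p, struct mapping *m)
{
	int r;

	assert(p != NULL && m != NULL);

	r = inc_data_range(p);
	if (r) {
		p->read_only = true;
		free(m);
		return r;
	}

	p->next_stage = m;
	return 0;
}
```
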

Cc: stable <stable@vger.kernel.org> # v4.9+
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Reviewed-by: Cristian Gafton <gafton@amazon.com>
Reviewed-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Vallish Vaidyeshwara <vallish@amazon.com>
Reviewed-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
8 years agonet: prevent sign extension in dev_get_stats()
Eric Dumazet [Tue, 27 Jun 2017 14:02:20 +0000 (07:02 -0700)]
net: prevent sign extension in dev_get_stats()

Similar to the fix provided by Dominik Heidler in commit
9b3dc0a17d73 ("l2tp: cast l2tp traffic counter to unsigned")
we need to take care of 32bit kernels in dev_get_stats().

When using atomic_long_read(), we add a 'long' to u64 and
might misinterpret high order bit, unless we cast to unsigned.
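
A small user-space sketch of the effect, using int32_t as a stand-in
for 'long' on a 32-bit kernel. The function names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The signed 32-bit counter is sign-extended when it is converted to
 * 64 bits, so a value with the high bit set turns into a huge number.
 */
uint64_t add_signed(uint64_t total, int32_t counter)
{
	return total + counter;
}

/*
 * Casting to the unsigned 32-bit type first makes the conversion
 * zero-extend, so the upper 32 bits of the addend stay clear.
 */
uint64_t add_unsigned(uint64_t total, int32_t counter)
{
	return total + (uint32_t)counter;
}
```

For a counter whose bit pattern is 0x80000000, add_signed(0, ...)
yields 0xFFFFFFFF80000000 while add_unsigned(0, ...) yields 0x80000000.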

Fixes: caf586e5f23ce ("net: add a core netdev->rx_dropped counter")
Fixes: 015f0688f57ca ("net: net: add a core netdev->tx_dropped counter")
Fixes: 6e7333d315a76 ("net: add rx_nohandler stat counter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoblock, bfq: update wr_busy_queues if needed on a queue split
Paolo Valente [Tue, 27 Jun 2017 18:30:47 +0000 (12:30 -0600)]
block, bfq: update wr_busy_queues if needed on a queue split

This commit fixes a bug triggered by a non-trivial sequence of
events. These events are briefly described in the next two
paragraphs. The impatient, or those who are familiar with queue
merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a
bfq_queue that is the result of the merge of two or more bfq_queues,
BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
many random I/O requests have arrived for the bfq_queue; if the device
is non rotational, then random requests must be also small for the
bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
detected as seeky, then a split occurs: the bfq I/O context of the
process that has issued the request is redirected from the shared
bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
shared bfq_queue actually happens to be shared only by one process
(because of previous splits), then no new bfq_queue is created: the
state of the shared bfq_queue is just changed from shared to non
shared.

Regardless of whether a brand new non-shared bfq_queue is created, or
the pre-existing shared bfq_queue is just turned into a non-shared
bfq_queue, several parameters of the non-shared bfq_queue are set
(restored) to the original values they had when the bfq_queue
associated with the bfq I/O context of the process (that has just
issued an I/O request) was merged with the shared bfq_queue. One of
these parameters is the weight-raising state.

If, on the split of a shared bfq_queue,
1) a pre-existing shared bfq_queue is turned into a non-shared
bfq_queue;
2) the previously shared bfq_queue happens to be busy;
3) the weight-raising state of the previously shared bfq_queue happens
to change;
the number of weight-raised busy queues changes. The field
wr_busy_queues must then be updated accordingly, but such an update
was missing. This commit adds the missing update.
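
A simplified model of the missing update, in user-space C. The structs
and resume_state_on_split() are hypothetical and far smaller than the
real BFQ data structures. A weight-raising coefficient above 1 is taken
to mean "weight-raised", which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sched {
	int wr_busy_queues;	/* number of busy, weight-raised queues */
};

struct queue {
	bool busy;
	unsigned int wr_coeff;		/* > 1 means weight-raised (assumed) */
	unsigned int saved_wr_coeff;	/* value saved at merge time */
};

bool is_weight_raised(const struct queue *q)
{
	return q->wr_coeff > 1;
}

/*
 * Restore the weight-raising state saved at merge time. If the queue is
 * busy and its weight-raised status changes as a result, adjust the
 * counter so it keeps matching the actual set of busy raised queues.
 */
void resume_state_on_split(struct sched *s, struct queue *q)
{
	bool was_raised;

	assert(s != NULL && q != NULL);

	was_raised = is_weight_raised(q);
	q->wr_coeff = q->saved_wr_coeff;

	if (!q->busy)
		return;

	if (was_raised && !is_weight_raised(q))
		s->wr_busy_queues--;
	else if (!was_raised && is_weight_raised(q))
		s->wr_busy_queues++;
}
```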

Reported-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agommc/block: remove a call to blk_queue_bounce_limit
Christoph Hellwig [Mon, 19 Jun 2017 07:26:28 +0000 (09:26 +0200)]
mmc/block: remove a call to blk_queue_bounce_limit

BLK_BOUNCE_ANY is the default now, so the call is superfluous.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agodm: don't set bounce limit
Christoph Hellwig [Mon, 19 Jun 2017 07:26:27 +0000 (09:26 +0200)]
dm: don't set bounce limit

Now that all queue allocators come without a bounce limit by default,
dm doesn't have to override this anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblock: don't set bounce limit in blk_init_queue
Christoph Hellwig [Mon, 19 Jun 2017 07:26:26 +0000 (09:26 +0200)]
block: don't set bounce limit in blk_init_queue

Instead move it to the callers.  Those that either don't use bio_data() or
page_address() or are specific to architectures that do not support highmem
are skipped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblock: don't set bounce limit in blk_init_allocated_queue
Christoph Hellwig [Mon, 19 Jun 2017 07:26:25 +0000 (09:26 +0200)]
block: don't set bounce limit in blk_init_allocated_queue

And just move it into scsi_transport_sas which needs it due to low-level
drivers directly dereferencing bio_data, and into blk_init_queue_node,
which will need a further push into the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblk-mq: don't bounce by default
Christoph Hellwig [Mon, 19 Jun 2017 07:26:24 +0000 (09:26 +0200)]
blk-mq: don't bounce by default

For historical reasons we default to bouncing highmem pages for all block
queues.  But the blk-mq drivers are easy to audit to ensure that we don't
need this - scsi and mtip32xx set explicit limits and everyone else doesn't
have any particular ones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblock: don't bother with bounce limits for make_request drivers
Christoph Hellwig [Mon, 19 Jun 2017 07:26:23 +0000 (09:26 +0200)]
block: don't bother with bounce limits for make_request drivers

We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblock: remove the queue_bounce_pfn helper
Christoph Hellwig [Mon, 19 Jun 2017 07:26:22 +0000 (09:26 +0200)]
block: remove the queue_bounce_pfn helper

Only used inside the bounce code, and opencoding it makes it more obvious
what is going on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblock: move bounce declarations to block/blk.h
Christoph Hellwig [Mon, 19 Jun 2017 07:26:21 +0000 (09:26 +0200)]
block: move bounce declarations to block/blk.h

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblk-map: call blk_queue_bounce from blk_rq_append_bio
Christoph Hellwig [Tue, 27 Jun 2017 18:13:21 +0000 (12:13 -0600)]
blk-map: call blk_queue_bounce from blk_rq_append_bio

This moves the knowledge about bouncing out of the callers into the
block core (just like we do for the normal I/O path), and allows us to
unexport blk_queue_bounce.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agopktcdvd: remove the call to blk_queue_bounce
Christoph Hellwig [Mon, 19 Jun 2017 07:26:19 +0000 (09:26 +0200)]
pktcdvd: remove the call to blk_queue_bounce

pktcdvd is a make_request based stacking driver and thus doesn't have any
addressing limits of its own.  It also doesn't use bio_data() or
page_address(), so it doesn't need a lowmem bounce either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agonvme: add support for streams and directives
Jens Axboe [Tue, 27 Jun 2017 18:03:06 +0000 (12:03 -0600)]
nvme: add support for streams and directives

This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that the device can do so most efficiently. If an application is
managing and writing data with different lifetimes, mixing data with
different retention onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and
lifetime of the device.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agobtrfs: add support for passing in write hints for buffered writes
Jens Axboe [Tue, 27 Jun 2017 17:51:28 +0000 (11:51 -0600)]
btrfs: add support for passing in write hints for buffered writes

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoxfs: add support for passing in write hints for buffered writes
Jens Axboe [Tue, 27 Jun 2017 15:34:01 +0000 (09:34 -0600)]
xfs: add support for passing in write hints for buffered writes

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoext4: add support for passing in write hints for buffered writes
Jens Axboe [Tue, 27 Jun 2017 15:32:37 +0000 (09:32 -0600)]
ext4: add support for passing in write hints for buffered writes

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agofs: add support for buffered writeback to pass down write hints
Jens Axboe [Tue, 27 Jun 2017 15:30:05 +0000 (09:30 -0600)]
fs: add support for buffered writeback to pass down write hints

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agofs: add O_DIRECT and aio support for sending down write life time hints
Jens Axboe [Tue, 27 Jun 2017 17:01:22 +0000 (11:01 -0600)]
fs: add O_DIRECT and aio support for sending down write life time hints

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblk-mq: expose write hints through debugfs
Jens Axboe [Mon, 26 Jun 2017 14:15:27 +0000 (08:15 -0600)]
blk-mq: expose write hints through debugfs

Useful to verify that things are working the way they should.
Reading the file will return the number of kb written with each
write hint. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->write_hints[] if they handle a given
write hint.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoblock: add support for write hints in a bio
Jens Axboe [Tue, 27 Jun 2017 15:22:02 +0000 (09:22 -0600)]
block: add support for write hints in a bio

No functional changes in this patch, we just use up some holes
in the bio and request structures to define a write hint that
we pass down the stack.

Ensure that we don't merge requests that have different life time
hints assigned to them, and that we inherit the write hint when
cloning a bio.
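
The merge and clone rules above can be illustrated with a minimal sketch.
The type and function names (sketch_req, sketch_can_merge, sketch_clone)
are hypothetical and are not the kernel's own structures; the sketch
assumes only that each request carries a write hint, that mismatched
hints block a merge, and that a clone inherits the hint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block request; not the kernel's struct request. */
struct sketch_req {
	uint64_t sector;	/* first sector of the request */
	uint32_t nr_sectors;	/* length in sectors */
	uint8_t write_hint;	/* write life time hint carried down the stack */
};

/* A back merge is allowed only if b follows a directly and the hints match. */
static bool sketch_can_merge(const struct sketch_req *a,
			     const struct sketch_req *b)
{
	assert(a != NULL && b != NULL);
	if (a->write_hint != b->write_hint)
		return false;
	return a->sector + a->nr_sectors == b->sector;
}

/* A clone copies every field, so it inherits the write hint as well. */
static struct sketch_req sketch_clone(const struct sketch_req *src)
{
	assert(src != NULL);
	return *src;
}
```

Two contiguous requests with the same hint merge; the same two requests
with different hints do not, which keeps differently hinted data apart
on its way to the device.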

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agofs: add fcntl() interface for setting/getting write life time hints
Jens Axboe [Tue, 27 Jun 2017 17:47:04 +0000 (11:47 -0600)]
fs: add fcntl() interface for setting/getting write life time hints

Define a set of write life time hints:

RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Add an fcntl interface for querying these flags, and also for
setting them as well:

F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.

F_SET_RW_HINT Set one of the above write hints on the
underlying inode.

F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.

F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.

The user passes in a pointer to a 64-bit value to get/set these hints,
and the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write
hints is below.

Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode have a write hint, then we
use the one in the file. The file hint can be used for sync/direct
IO; for buffered writeback only the inode hint is available.

This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.

/*
 * writehint.c: get or set an inode write hint
 */
 #include <stdio.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdbool.h>
 #include <inttypes.h>

 #ifndef F_GET_RW_HINT
 #define F_LINUX_SPECIFIC_BASE 1024
 #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
 #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
 #endif

/* Hint names, indexed by the numeric hint value */
static const char *str[] = { "RWH_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
				"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
				"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	/* Optional second argument: numeric hint to set on the inode */
	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	/* Always read the hint back and report it */
	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	/* Guard the table lookup against values outside the known hints */
	if (hint < sizeof(str) / sizeof(str[0]))
		printf("%s: hint %s\n", argv[1], str[hint]);
	else
		printf("%s: unknown hint %" PRIu64 "\n", argv[1], hint);
	close(fd);
	return 0;
}
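
The precedence rule described in this message (use the file hint when
one is set, otherwise the inode hint, with only the inode hint available
for buffered writeback) can be sketched as a small helper. The enum
order follows the hint list in this message; the names sketch_rw_hint
and sketch_effective_hint are assumptions made for illustration and this
is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the hint list; order follows the commit message. */
enum sketch_rw_hint {
	SKETCH_LIFE_NOT_SET = 0,
	SKETCH_LIFE_NONE,
	SKETCH_LIFE_SHORT,
	SKETCH_LIFE_MEDIUM,
	SKETCH_LIFE_LONG,
	SKETCH_LIFE_EXTREME,
};

/*
 * Pick the hint to attach to a write. For sync/direct IO the file hint
 * wins when it is set; buffered writeback only ever sees the inode hint.
 */
static enum sketch_rw_hint
sketch_effective_hint(enum sketch_rw_hint file_hint,
		      enum sketch_rw_hint inode_hint,
		      bool buffered_writeback)
{
	assert(file_hint <= SKETCH_LIFE_EXTREME);
	assert(inode_hint <= SKETCH_LIFE_EXTREME);

	if (!buffered_writeback && file_hint != SKETCH_LIFE_NOT_SET)
		return file_hint;
	return inode_hint;
}
```

With a file hint of SHORT and an inode hint of LONG, a direct write
would carry SHORT while buffered writeback of the same file would carry
LONG.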

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoMerge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Linus Torvalds [Tue, 27 Jun 2017 15:56:52 +0000 (08:56 -0700)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM fixes from Russell King:
 "Three more fixes:

   - Fix the previous fix merged in the last pull for the Thumb2
     decompressor.

   - A fix from Vladimir to correctly identify the V7M cache type.

   - The optimised 3G vmsplit case does not work with LPAE, so don't
     allow this to be selected for LPAE configurations"

* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8682/1: V7M: Set cacheid iff DminLine or IminLine is nonzero
  ARM: 8681/1: make VMSPLIT_3G_OPT depends on !ARM_LPAE
  ARM: 8680/1: boot/compressed: fix inappropriate Thumb2 mnemonic for __nop

8 years agolightnvm: if LUNs are already allocated fix return
Rakesh Pandit [Tue, 27 Jun 2017 11:55:33 +0000 (14:55 +0300)]
lightnvm: if LUNs are already allocated fix return

While creating a new device with NVM_DEV_CREATE, if LUNs are already
allocated the ioctl would return -ENOMEM, which is wrong.  This patch
propagates -EBUSY from nvm_reserve_luns, which is the correct response.

Fixes: ade69e243 ("lightnvm: merge gennvm with core")
Reviewed-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: fail gracefully on irrec. error
Javier González [Mon, 26 Jun 2017 09:57:29 +0000 (11:57 +0200)]
lightnvm: pblk: fail gracefully on irrec. error

Because user writes are decoupled from media writes by an intermediate
write buffer, irrecoverable media write errors lead to pblk stalling;
user writes fill up the buffer and end up in an infinite retry loop.

In order to let user writes fail gracefully, it is necessary for pblk to
keep track of its own internal state and prevent further writes from
being placed into the write buffer.

This patch implements a state machine to keep track of internal errors
and, in case of failure, fail further user writes in a standard way.
Depending on the type of error, pblk will do its best to persist
buffered writes (which are already acknowledged) and close down in a
graceful manner. This way, data might be recovered by re-instantiating
pblk. Such a state machine paves the way for a state-based FTL log.
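
A minimal sketch of such a state machine is shown below. The state names
and helpers are invented for illustration and are not taken from the
pblk source; the sketch assumes only what the message describes: after
an irrecoverable error, new user writes are rejected while already
buffered writes are still persisted before the target stops.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical target states; the real pblk states may differ. */
enum sketch_state { SKETCH_RUNNING, SKETCH_STOPPING, SKETCH_STOPPED };

struct sketch_dev {
	enum sketch_state state;
	unsigned int buffered;	/* acknowledged writes still in the buffer */
};

/* User writes enter the buffer only while the target is running. */
static int sketch_user_write(struct sketch_dev *dev)
{
	assert(dev != NULL);
	if (dev->state != SKETCH_RUNNING)
		return -EIO;	/* fail the write instead of retrying forever */
	dev->buffered++;
	return 0;
}

/* An irrecoverable media error stops admission of new writes. */
static void sketch_media_error(struct sketch_dev *dev)
{
	assert(dev != NULL);
	if (dev->state == SKETCH_RUNNING)
		dev->state = SKETCH_STOPPING;
}

/* Persist what was already acknowledged, then finish shutting down. */
static void sketch_flush(struct sketch_dev *dev)
{
	assert(dev != NULL);
	dev->buffered = 0;
	if (dev->state == SKETCH_STOPPING)
		dev->state = SKETCH_STOPPED;
}
```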

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: set mempool and workqueue params.
Javier González [Mon, 26 Jun 2017 09:57:28 +0000 (11:57 +0200)]
lightnvm: pblk: set mempool and workqueue params.

Make constants to define sizes for internal mempools and workqueues. In
this process, adjust the values to be more meaningful given the internal
constraints of the FTL. In order to do this for workqueues, separate the
current auxiliary workqueue into two dedicated workqueues to manage
lines being closed and bad blocks.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: redesign GC algorithm
Javier González [Mon, 26 Jun 2017 09:57:27 +0000 (11:57 +0200)]
lightnvm: pblk: redesign GC algorithm

At the moment, in order to get enough read parallelism, we recycle
several lines at the same time. This approach has proven not to work
well when reaching capacity, since we end up mixing valid data from all
lines, thus not maintaining a sustainable free/recycled line ratio.

The new design relies on a two-level workqueue mechanism. In the first
level, we read the metadata for a number of lines based on the GC list
they reside on (this is governed by the number of valid sectors in each
line). In the second level, we recycle a single line at a time. Here, we
issue reads in parallel, while a single GC write thread places data in
the write buffer. This design allows us to (i) only move data from one
line at a time, thus maintaining a sane free/recycled ratio and (ii)
keep the GC writer busy with recycled data.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: add lock assertions on helpers
Javier González [Mon, 26 Jun 2017 09:57:26 +0000 (11:57 +0200)]
lightnvm: pblk: add lock assertions on helpers

Add lockdep assertions on helper functions.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: cleanup unnecessary code
Javier González [Mon, 26 Jun 2017 09:57:25 +0000 (11:57 +0200)]
lightnvm: pblk: cleanup unnecessary code

Cleanup unnecessary headers and code lines.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: set metadata list for all I/Os
Javier González [Mon, 26 Jun 2017 09:57:24 +0000 (11:57 +0200)]
lightnvm: pblk: set metadata list for all I/Os

Set a DMA area for all I/Os in order to read/write from/to the metadata
stored on the per-sector out-of-band area.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: choose optimal victim GC line
Javier González [Mon, 26 Jun 2017 09:57:23 +0000 (11:57 +0200)]
lightnvm: pblk: choose optimal victim GC line

At the moment, we separate the closed lines into three different lists
based on their number of valid sectors. GC recycles lines from each list
based on capacity. Lines from each list are taken in a FIFO fashion.

Since the number of lines is limited (it corresponds to the number of
blocks in a LUN, which is somewhere between 1000 and 2000), we can afford
to scan the lists to choose the optimal line to be recycled. This helps
especially for lines with a high number of valid sectors.

If the number of blocks per LUN increases, we will consider a more
efficient policy.
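
The scan can be sketched as a simple linear search. This assumes that
the "optimal" victim is the line with the fewest valid sectors (the
least data to move), which the message implies but does not state; the
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical closed line; only the fields needed for victim selection. */
struct sketch_line {
	unsigned int id;
	unsigned int valid_sectors;	/* sectors that GC would have to move */
};

/*
 * Scan a GC list and return the line with the fewest valid sectors, or
 * NULL if the list is empty. With one to two thousand lines, a linear
 * scan is cheap compared to the I/O cost of recycling a line.
 */
static const struct sketch_line *
sketch_pick_victim(const struct sketch_line *lines, size_t nr_lines)
{
	const struct sketch_line *best = NULL;
	size_t i;

	assert(lines != NULL || nr_lines == 0);

	for (i = 0; i < nr_lines; i++)
		if (!best || lines[i].valid_sectors < best->valid_sectors)
			best = &lines[i];
	return best;
}
```

Compared with FIFO order, this picks the cheapest line on the list
regardless of when it was closed; ties go to the line seen first.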

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: decouple bad block from line alloc
Javier González [Mon, 26 Jun 2017 09:57:22 +0000 (11:57 +0200)]
lightnvm: pblk: decouple bad block from line alloc

Decouple bad block discovery from line allocation logic. This allows
returning meaningful error codes in case of bad block discovery failure.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: simplify meta. memory allocation
Javier González [Mon, 26 Jun 2017 09:57:21 +0000 (11:57 +0200)]
lightnvm: pblk: simplify meta. memory allocation

smeta size will always be suitable for a kmalloc allocation. Simplify
the code and leave the vmalloc fallback only for emeta, where the pblk
configuration has an impact.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: issue multiplane reads if possible
Javier González [Mon, 26 Jun 2017 09:57:20 +0000 (11:57 +0200)]
lightnvm: pblk: issue multiplane reads if possible

If a read request is sequential and its size aligns with a
multi-plane page size, use the multi-plane hint to process the I/O in
parallel in the controller.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: delete redundant buffer pointer
Javier González [Mon, 26 Jun 2017 09:57:19 +0000 (11:57 +0200)]
lightnvm: pblk: delete redundant buffer pointer

After refactoring the metadata path, the backpointer controlling
synced I/Os in a line becomes unnecessary; metadata is scheduled
on the write thread, thus we know when the end of the line is reached
and act on it directly.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: delete redundant debug line stat
Javier González [Mon, 26 Jun 2017 09:57:18 +0000 (11:57 +0200)]
lightnvm: pblk: delete redundant debug line stat

Remove a legacy variable that helped verify the consistency of the
run-time metadata for the free line list. With the new metadata layout,
this check is no longer necessary.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: sched. metadata on write thread
Javier González [Mon, 26 Jun 2017 09:57:17 +0000 (11:57 +0200)]
lightnvm: pblk: sched. metadata on write thread

At the moment, line metadata is persisted on a separate work queue that
is kicked each time a line is closed. The assumption when designing
this was that freeing the write thread from creating a new write request
was better than the potential impact of writes colliding on the media
(user I/O and metadata I/O). Experimentation has proven that this
assumption is wrong; collisions can cost up to 25% of bandwidth and
introduce long tail latencies on the write thread, which potentially
cause user write threads to spend more time spinning to get a free entry
on the write buffer.

This patch moves the metadata logic to the write thread. When a line is
closed, remaining metadata is written in memory and is placed on a
metadata queue. The write thread then takes the metadata corresponding
to the previous line, creates the write request and schedules it to
minimize collisions on the media. Using this approach, we see that we
can saturate the media's bandwidth, which helps reduce both write
latencies and the spinning time for user writer threads.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: rename read request pool
Javier González [Mon, 26 Jun 2017 22:27:13 +0000 (16:27 -0600)]
lightnvm: pblk: rename read request pool

Read requests allocate some extra memory to store their per-I/O context.
Instead of requiring yet another memory pool for other types of requests,
generalize this context allocation (and change naming accordingly).

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: generalize erase path
Javier González [Mon, 26 Jun 2017 09:57:15 +0000 (11:57 +0200)]
lightnvm: pblk: generalize erase path

Erase I/Os are scheduled with the following goals in mind: (i) minimize
LUN collisions with write I/Os, and (ii) even out the price of erasing
on every write, instead of putting all the burden on garbage
collection when it runs. This works well on the current design, but is
specific to the default mapping algorithm.

This patch generalizes the erase path so that other mapping algorithms
can select an arbitrary line to be erased instead. It also gets rid of
the erase semaphore since it creates jittering for user writes.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: expose max sec per write on sysfs
Javier González [Mon, 26 Jun 2017 09:57:14 +0000 (11:57 +0200)]
lightnvm: pblk: expose max sec per write on sysfs

Allow configuring the maximum number of sectors per write command
through sysfs. This makes it easier to tune write command sizes for
different controller configurations.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: add debug stat for read cache hits
Javier González [Mon, 26 Jun 2017 09:57:13 +0000 (11:57 +0200)]
lightnvm: pblk: add debug stat for read cache hits

Add a new debug counter to measure cache hits on the read path.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: pblk: spare double cpu_to_le64 calc.
Javier González [Mon, 26 Jun 2017 09:57:12 +0000 (11:57 +0200)]
lightnvm: pblk: spare double cpu_to_le64 calc.

Spare a double calculation on the fast write path.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: propagate right error code to target
Javier González [Mon, 26 Jun 2017 09:57:11 +0000 (11:57 +0200)]
lightnvm: propagate right error code to target

If nvme_alloc_request fails, propagate the right error, instead of
assuming ENOMEM.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agolightnvm: re-convert ppa format on I/O failure
Javier González [Mon, 26 Jun 2017 09:57:10 +0000 (11:57 +0200)]
lightnvm: re-convert ppa format on I/O failure

In case of a failure when submitting a request, convert the ppa_list
addresses to the target format so that it can interpret ppas for
recovery.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 years agoMerge tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming
Linus Torvalds [Mon, 26 Jun 2017 19:25:59 +0000 (12:25 -0700)]
Merge tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming

Pull c6x fixlet from Mark Salter:
 "Update maintainer email"

* tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
  MAINTAINERS: update email address for C6x maintainer