linux-2.6-block.git
8 years agoIB/hfi1: Add tracing support for send with invalidate opcode
Jianxin Xiong [Tue, 24 May 2016 19:50:17 +0000 (12:50 -0700)]
IB/hfi1: Add tracing support for send with invalidate opcode

Enable trace generation for packets with the "Send Last with
Invalidate" and "Send Only with Invalidate" opcodes.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1, qib: Add ieth to the packet header definitions
Jianxin Xiong [Tue, 24 May 2016 19:50:10 +0000 (12:50 -0700)]
IB/hfi1, qib: Add ieth to the packet header definitions

A new union member "ieth" (Invalidate Extended Transport Header) is
added to the packet header definition in preparation of supporting
the send with invalidate opcode.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris...
Linus Torvalds [Thu, 26 May 2016 16:15:19 +0000 (09:15 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security

Pull Yama locking fix from James Morris:
 "Fix for the Yama LSM"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  Yama: fix double-spinlock and user access in atomic context

8 years agoMerge branches 'misc-4.7-2', 'ipoib' and 'ib-router' into k.o/for-4.7
Doug Ledford [Thu, 26 May 2016 15:55:19 +0000 (11:55 -0400)]
Merge branches 'misc-4.7-2', 'ipoib' and 'ib-router' into k.o/for-4.7

8 years agoIB/hfi1: Move driver out of staging
Dennis Dalessandro [Thu, 19 May 2016 12:26:51 +0000 (05:26 -0700)]
IB/hfi1: Move driver out of staging

The TODO list for the hfi1 driver was completed during 4.6. In addition
other objections raised (which are far beyond what was in the TODO list)
have been addressed as well. It is now time to remove the driver from
staging and into the drivers/infiniband sub-tree.

Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Do not free hfi1 cdev parent structure early
Dennis Dalessandro [Thu, 19 May 2016 12:26:44 +0000 (05:26 -0700)]
IB/hfi1: Do not free hfi1 cdev parent structure early

The deletion of a cdev is not a fence for holding off references to the
structure. The driver attempts to delete the cdev and then proceeds to
free the parent structure, the hfi1_devdata, or dd. This can potentially
lead to a kernel panic in situations where a user has an FD for the cdev
open, and the pci device gets removed. If the user then closes the FD
there will be a NULL dereference when trying to do put on the cdev's
kobject.

Fix this by pointing the cdev's kobject.parent at a new kobject embedded
in its parent structure. Also take a reference when the device is opened
and put it back when it is closed.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Add trace message in user IOCTL handling
Dennis Dalessandro [Thu, 19 May 2016 12:26:37 +0000 (05:26 -0700)]
IB/hfi1: Add trace message in user IOCTL handling

Add a trace message to HFI1s user IOCTL handling. This allows debugging
of which IOCTLs are being handled by the driver.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove write(), use ioctl() for user cmds
Dennis Dalessandro [Thu, 19 May 2016 12:26:31 +0000 (05:26 -0700)]
IB/hfi1: Remove write(), use ioctl() for user cmds

Remove the write() handler for user space commands now that ioctl
handling is available. User apps will need to change to use ioctl from
this point forward.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Add ioctl() interface for user commands
Dennis Dalessandro [Thu, 19 May 2016 12:26:24 +0000 (05:26 -0700)]
IB/hfi1: Add ioctl() interface for user commands

IOCTL is more suited to what user space commands need to do than the
write() interface. Add IOCTL definitions for all existing write commands
and the handling for those. The write() interface will be removed in a
follow on patch.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove unused user command
Dennis Dalessandro [Thu, 19 May 2016 12:26:17 +0000 (05:26 -0700)]
IB/hfi1: Remove unused user command

The HFI1_CMD_SDMA_STATUS_UPD command was never implemented it has no
reason to live in the driver. Remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove snoop/diag interface
Dennis Dalessandro [Thu, 19 May 2016 12:26:10 +0000 (05:26 -0700)]
IB/hfi1: Remove snoop/diag interface

The snoop/diag interface is better served by an implementation which is
more general and usable by other drivers perhaps. Go ahead and remove
the code now and get rid of the char dev. We can put the feature back
when we have a more agreeable solution.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove EPROM functionality from data device
Dennis Dalessandro [Thu, 19 May 2016 12:26:04 +0000 (05:26 -0700)]
IB/hfi1: Remove EPROM functionality from data device

Remove EPROM handling from the cdev which is used for user application
data traffic.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove UI char device
Dennis Dalessandro [Thu, 19 May 2016 12:25:57 +0000 (05:25 -0700)]
IB/hfi1: Remove UI char device

Remove UI char device which exposes direct access to registers for user
space. This was put in to aid in debugging the hardware. We are looking
into alternatives means of providing the same functionality. This
removes another char device from HFI1's footprint.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove multiple device cdev
Dennis Dalessandro [Thu, 19 May 2016 12:25:50 +0000 (05:25 -0700)]
IB/hfi1: Remove multiple device cdev

hfi1 current exports a cdev that can be used to target all of the hfi's
in the system. However there is a problem with this approach in
that the devices could be on different subnets. This is a problem that
user space can figure out and explicitly tell the driver on which device
to create a context.

Remove the multi-purpose cdev leaving a dedicated cdev for each port.
Also remove the striping capability that is dependent upon the user
choosing the multi-purpose cdev. It is now up to user space to determine
how to stripe contexts.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove anti-pattern in cdev init
Dennis Dalessandro [Thu, 19 May 2016 12:22:03 +0000 (05:22 -0700)]
IB/hfi1: Remove anti-pattern in cdev init

Remove the usage of an anti-pattern goto in hfi1_cdev_init to improve
code readability.

Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix bug that blocks process on exit after port bounce
Jianxin Xiong [Thu, 19 May 2016 12:21:57 +0000 (05:21 -0700)]
IB/hfi1: Fix bug that blocks process on exit after port bounce

During the processing of a user SDMA request, if there was an
error before the request counter was increased, the state of
the packet queue could be updated incorrectly, causing the
counter to underflow. As the result, the process could get
stuck later since the counter could never get back to 0.

This patch adds a condition to guard the packet queue update
so that the counter is only decreased if it has been increased
before the error happens.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/qib: Remove unused qib_7322_intr_msgs[]
Jubin John [Thu, 19 May 2016 12:21:50 +0000 (05:21 -0700)]
IB/qib: Remove unused qib_7322_intr_msgs[]

Building the qib driver with gcc version 6.1.0 raises the following
build warning:
drivers/infiniband/hw/qib/qib_iba7322.c:1311:39: warning:
'qib_7322_intr_msgs' defined but not used [-Wunused-const-variable=]
 static const struct  qib_hwerror_msgs qib_7322_intr_msgs[] = {
                                       ^~~~~~~~~~~~~~~~~~
Remove the unused qib_7322_intr_msgs[]

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove unnecessary comment
Ira Weiny [Thu, 19 May 2016 12:21:44 +0000 (05:21 -0700)]
IB/hfi1: Remove unnecessary comment

This comment was old, the MTU enums have been defined.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix sdma_event_names[] build warning
Jubin John [Thu, 19 May 2016 12:21:37 +0000 (05:21 -0700)]
IB/hfi1: Fix sdma_event_names[] build warning

sdma_event_names[] is only used within CONFIG_SDMA_VERBOSITY ifdefs, so
when CONFIG_SDMA_VERBOSITY is disabled, it results in the following
0-day build warning:
>> drivers/infiniband/hw/hfi1/sdma.c:137:27: warning: 'sdma_event_names'
>> defined but not used [-Wunused-const-variable=]
    static const char * const sdma_event_names[] = {
                              ^~~~~~~~~~~~~~~~
This occurs on the following compiler:
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430

For more information check:
https://lists.01.org/pipermail/kbuild-all/2016-May/020060.html

Fix this warning by defining sdma_event_name[] only within the
CONFIG_SDMA_VERBOSITY ifdefs.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/rdmavt: Use kzalloc_node
Jubin John [Thu, 19 May 2016 12:21:31 +0000 (05:21 -0700)]
IB/rdmavt: Use kzalloc_node

Use kzalloc_node instead of kzalloc for rdmavt memory region segment
allocation to optimize for performance on NUMA platforms.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/rdmavt: Insure QP vmalloc variants zero memory
Mike Marciniszyn [Thu, 19 May 2016 12:21:25 +0000 (05:21 -0700)]
IB/rdmavt: Insure QP vmalloc variants zero memory

The usage of the various vmalloc APIs do not consistently zero memory
when allocating the swqe. Insure zeroing variants are used.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix an interval RB node reference count leak
Mitko Haralanov [Thu, 19 May 2016 12:21:18 +0000 (05:21 -0700)]
IB/hfi1: Fix an interval RB node reference count leak

Commit e88c9271d9f8 ("IB/hfi1: Fix buffer cache corner case which
may cause corruption") introduced a bug which may cause a reference
count of a interval RB node to be leaked in the case where an SDMA
transfer from that node completes at the same time as the node is
being extended.

If a node is being extended, it is first removed from the RB tree
in order to be processed without the risk of an invalidation event
removing the node at the same time.

If a SDMA completion happens during that time, the completion handler
will fail to find the node in the RB tree and, therefore, fail to
correctly decrement its refcount. This leaves the node in the tree and
its pages pinned for the duration of the user process.

To prevent this from happening the io vector adds a reference to the
RB node, which is used during the SDMA completion instead of looking
up the node in the RB tree.

This change adds a performance improvement as a side effect by avoiding
the RB tree lookup.

Fixes: e88c9271d9f8 ("IB/hfi1: Fix buffer cache corner case which may cause corruption")
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoblk-mq: clear q->mq_ops if init fail
Ming Lin [Thu, 26 May 2016 06:23:27 +0000 (23:23 -0700)]
blk-mq: clear q->mq_ops if init fail

blk_mq_init_queue() calls blk_mq_init_allocated_queue(), but q->mq_ops
was not cleared when blk_mq_init_allocated_queue() fails.
Then blk_cleanup_queue() calls blk_mq_free_queue() which will crash because:
- q->all_q_node is not added to all_q_list yet
- q->tag_set is NULL
- hctx was not setup yet or already freed

Fixed it by clearing q->mq_ops on error path.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
8 years agopnfs: pnfs_update_layout needs to consider if strict iomode checking is on
Tom Haynes [Wed, 25 May 2016 14:31:14 +0000 (07:31 -0700)]
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on

As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically
support enforcing that a IOMODE_RW segment will not allow READ I/O.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agonfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading...
Tom Haynes [Wed, 25 May 2016 14:31:13 +0000 (07:31 -0700)]
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
8 years agoDocumentation/hwmon: Update links in max34440
Glenn Dayton [Thu, 26 May 2016 09:06:53 +0000 (11:06 +0200)]
Documentation/hwmon: Update links in max34440

It appears the website for maxim-ic.com changed to
maximintegrated.com.

Signed-off-by: Glenn Dayton <glenn.dayton24@gmail.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
8 years agohwmon: (emc2103) Fix typo in MODULE_PARM_DESC
Dan Carpenter [Thu, 26 May 2016 09:06:53 +0000 (11:06 +0200)]
hwmon: (emc2103) Fix typo in MODULE_PARM_DESC

"apd" was intended here instead of "init".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
8 years agorestore killability of old mutex_lock_killable(&inode->i_mutex) users
Al Viro [Thu, 26 May 2016 04:05:12 +0000 (00:05 -0400)]
restore killability of old mutex_lock_killable(&inode->i_mutex) users

The ones that are taking it exclusive, that is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
8 years agoadd down_write_killable_nested()
Al Viro [Thu, 26 May 2016 04:04:58 +0000 (00:04 -0400)]
add down_write_killable_nested()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
8 years agoupdate D/f/directory-locking
Al Viro [Thu, 26 May 2016 04:04:18 +0000 (00:04 -0400)]
update D/f/directory-locking

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
8 years agoRevert "mtd: atmel_nand: Support variable RB_EDGE interrupts"
Wenyou Yang [Mon, 9 May 2016 06:51:18 +0000 (14:51 +0800)]
Revert "mtd: atmel_nand: Support variable RB_EDGE interrupts"

This reverts commit 5ddc7bd43ccc ("mtd: atmel_nand: Support variable
RB_EDGE interrupts")

Because for current SoCs, the RB_EDGE3(i.e. bit 27) of HSMC_SR
register does not exist, the RB_EDGE0 (i.e. bit 24) is the ready/busy
line edge status bit. It is a datasheet bug.

Cc: <stable@vger.kernel.org>
Fixes: commit 5ddc7bd43ccc ("mtd: atmel_nand: Support variable RB_EDGE interrupts")
Signed-off-by: Wenyou Yang <wenyou.yang@atmel.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
8 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 26 May 2016 00:37:33 +0000 (17:37 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Misc fixes: EFI, entry code, pkeys and MPX fixes, TASK_SIZE cleanups
  and a tsc frequency table fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Switch from TASK_SIZE to TASK_SIZE_MAX in the page fault code
  x86/fsgsbase/64: Use TASK_SIZE_MAX for FSBASE/GSBASE upper limits
  x86/mm/mpx: Work around MPX erratum SKD046
  x86/entry/64: Fix stack return address retrieval in thunk
  x86/efi: Fix 7-parameter efi_call()s
  x86/cpufeature, x86/mm/pkeys: Fix broken compile-time disabling of pkeys
  x86/tsc: Add missing Cherrytrail frequency to the table

8 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 26 May 2016 00:11:43 +0000 (17:11 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Two fixes: one for a lost wakeup, the other to fix the compiler
  optimizing out preempt operations on ARM64 (and possibly other non-x86
  architectures)"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix remote wakeups
  sched/preempt: Fix preempt_count manipulations

8 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 26 May 2016 00:05:40 +0000 (17:05 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:
 "Mostly tooling and PMU driver fixes, but also a number of late updates
  such as the reworking of the call-chain size limiting logic to make
  call-graph recording more robust, plus tooling side changes for the
  new 'backwards ring-buffer' extension to the perf ring-buffer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  perf record: Read from backward ring buffer
  perf record: Rename variable to make code clear
  perf record: Prevent reading invalid data in record__mmap_read
  perf evlist: Add API to pause/resume
  perf trace: Use the ptr->name beautifier as default for "filename" args
  perf trace: Use the fd->name beautifier as default for "fd" args
  perf report: Add srcline_from/to branch sort keys
  perf evsel: Record fd into perf_mmap
  perf evsel: Add overwrite attribute and check write_backward
  perf tools: Set buildid dir under symfs when --symfs is provided
  perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
  perf annotate: Sort list of recognised instructions
  perf annotate: Fix identification of ARM blt and bls instructions
  perf tools: Fix usage of max_stack sysctl
  perf callchain: Stop validating callchains by the max_stack sysctl
  perf trace: Fix exit_group() formatting
  perf top: Use machine->kptr_restrict_warned
  perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
  perf machine: Do not bail out if not managing to read ref reloc symbol
  perf/x86/intel/p4: Trival indentation fix, remove space
  ...

8 years agoYama: fix double-spinlock and user access in atomic context
Jann Horn [Sun, 22 May 2016 04:01:34 +0000 (06:01 +0200)]
Yama: fix double-spinlock and user access in atomic context

Commit 8a56038c2aef ("Yama: consolidate error reporting") causes lockups
when someone hits a Yama denial. Call chain:

process_vm_readv -> process_vm_rw -> process_vm_rw_core -> mm_access
-> ptrace_may_access
task_lock(...) is taken
__ptrace_may_access -> security_ptrace_access_check
-> yama_ptrace_access_check -> report_access -> kstrdup_quotable_cmdline
-> get_cmdline -> access_process_vm -> get_task_mm
task_lock(...) is taken again

task_lock(p) just calls spin_lock(&p->alloc_lock), so at this point,
spin_lock() is called on a lock that is already held by the current
process.

Also: Since the alloc_lock is a spinlock, sleeping inside
security_ptrace_access_check hooks is probably not allowed at all? So it's
not even possible to print the cmdline from in there because that might
involve paging in userspace memory.

It would be tempting to rewrite ptrace_may_access() to drop the alloc_lock
before calling the LSM, but even then, ptrace_may_access() itself might be
called from various contexts in which you're not allowed to sleep; for
example, as far as I understand, to be able to hold a reference to another
task, usually an RCU read lock will be taken (see e.g. kcmp() and
get_robust_list()), so that also prohibits sleeping. (And using e.g. FUSE,
a user can cause pagefault handling to take arbitrary amounts of time -
see https://bugs.chromium.org/p/project-zero/issues/detail?id=808.)

Therefore, AFAIK, in order to print the name of a process below
security_ptrace_access_check(), you'd have to either grab a reference to
the mm_struct and defer the access violation reporting or just use the
"comm" value that's stored in kernelspace and accessible without big
complications. (Or you could try to use some kind of atomic remote VM
access that fails if the memory isn't paged in, similar to
copy_from_user_inatomic(), and if necessary fall back to comm, but
that'd be kind of ugly because the comm/cmdline choice would look
pretty random to the user.)

Fix it by deferring reporting of the access violation until current
exits kernelspace the next time.

v2: Don't oops on PTRACE_TRACEME, call report_access under
task_lock(current). Also fix nonsensical comment. And don't use
GPF_ATOMIC for memory allocation with no locks held.
This patch is tested both for ptrace attach and ptrace traceme.

Fixes: 8a56038c2aef ("Yama: consolidate error reporting")
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
8 years agoMerge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 25 May 2016 23:52:19 +0000 (16:52 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull objtool build fix from Ingo Molnar:
 "An libtool fix for older libelf versions"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Allow building with older libelf

8 years agoceph: fix wake_up_session_cb()
Yan, Zheng [Thu, 19 May 2016 11:15:19 +0000 (19:15 +0800)]
ceph: fix wake_up_session_cb()

We should reset i_requested_max_size before waking the waiters.
(zero i_requested_max_size make waiter re-request the max size)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't use truncate_pagecache() to invalidate read cache
Yan, Zheng [Wed, 18 May 2016 12:58:26 +0000 (20:58 +0800)]
ceph: don't use truncate_pagecache() to invalidate read cache

truncate_pagecache() drops dirty pages, it's dangerous to use it
to invalidate read cache. Besides, we shouldn't start invalidating
read cache while there are buffer writers. Because buffer writers
may add dirty pages later.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: SetPageError() for writeback pages if writepages fails
Yan, Zheng [Fri, 13 May 2016 09:54:17 +0000 (17:54 +0800)]
ceph: SetPageError() for writeback pages if writepages fails

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: handle interrupted ceph_writepage()
Yan, Zheng [Fri, 13 May 2016 09:29:51 +0000 (17:29 +0800)]
ceph: handle interrupted ceph_writepage()

writepage() can be interrupted when it's called by direct memory
reclaimer (the direct memory relaimer is killed). To avoid lossing
data, we redirty the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: make ceph_update_writeable_page() uninterruptible
Yan, Zheng [Fri, 13 May 2016 03:30:24 +0000 (11:30 +0800)]
ceph: make ceph_update_writeable_page() uninterruptible

ceph_update_writeable_page() is used by ceph_write_begin(). It beaks
atomicity of write operation if it's interruptible.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agolibceph: make ceph_osdc_wait_request() uninterruptible
Yan, Zheng [Fri, 13 May 2016 03:04:33 +0000 (11:04 +0800)]
libceph: make ceph_osdc_wait_request() uninterruptible

Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: handle -EAGAIN returned by ceph_update_writeable_page()
Yan, Zheng [Tue, 10 May 2016 11:09:06 +0000 (19:09 +0800)]
ceph: handle -EAGAIN returned by ceph_update_writeable_page()

when ceph_update_writeable_page() return -EAGAIN, caller should
lock the page and call ceph_update_writeable_page() again.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
Yan, Zheng [Tue, 10 May 2016 10:59:13 +0000 (18:59 +0800)]
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: block non-fatal signals for fault/page_mkwrite
Yan, Zheng [Tue, 10 May 2016 10:40:28 +0000 (18:40 +0800)]
ceph: block non-fatal signals for fault/page_mkwrite

Fault and page_mkwrite are supposed to be uninterruptable. But they
call ceph functions that are interruptible. So they should block
signals before calling functions that are interruptible

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: make logical calculation functions return bool
Zhang Zhuoyu [Fri, 25 Mar 2016 09:18:39 +0000 (05:18 -0400)]
ceph: make logical calculation functions return bool

This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.

No functional change.

Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
8 years agoceph: tolerate bad i_size for symlink inode
Yan, Zheng [Thu, 5 May 2016 08:40:17 +0000 (16:40 +0800)]
ceph: tolerate bad i_size for symlink inode

A mds bug can cause symlink's size to be truncated to zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: improve fragtree change detection
Yan, Zheng [Wed, 4 May 2016 03:40:30 +0000 (11:40 +0800)]
ceph: improve fragtree change detection

check if number of splits in i_fragtree is equal to number of splits
in mds reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: keep leaf frag when updating fragtree
Yan, Zheng [Wed, 4 May 2016 03:05:10 +0000 (11:05 +0800)]
ceph: keep leaf frag when updating fragtree

Nodes in i_fragtree are sorted according to ceph_compare_frag().
It means frag node in i_fragtree always follow its direct parent
node. To check if a leaf node is valid, we just need to check if
it's child of previous split node.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: fix dir_auth check in ceph_fill_dirfrag()
Yan, Zheng [Tue, 3 May 2016 14:33:20 +0000 (22:33 +0800)]
ceph: fix dir_auth check in ceph_fill_dirfrag()

-1 is CDIR_AUTH_PARENT, it means dir's auth mds is the same as
inode's auth mds

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't assume frag tree splits in mds reply are sorted
Yan, Zheng [Tue, 3 May 2016 12:55:50 +0000 (20:55 +0800)]
ceph: don't assume frag tree splits in mds reply are sorted

The algorithm that updates i_fragtree relies on that the frag tree
splits in mds reply are of the same order of i_fragtree. This is not
true because current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t. But nodes in i_fragtree are sorted according to
ceph_frag_compare().

The fix is sort the frag tree splits first, then updates i_fragtree.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: fix inode reference leak
Yan, Zheng [Fri, 29 Apr 2016 15:40:23 +0000 (23:40 +0800)]
ceph: fix inode reference leak

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: using hash value to compose dentry offset
Yan, Zheng [Fri, 29 Apr 2016 03:27:30 +0000 (11:27 +0800)]
ceph: using hash value to compose dentry offset

If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry hash hash collision)

This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't forbid marking directory complete after forward seek
Yan, Zheng [Thu, 28 Apr 2016 14:56:44 +0000 (22:56 +0800)]
ceph: don't forbid marking directory complete after forward seek

Forward seek within same frag does not update fi->last_name, it will
not affect contents of later readdir reply. So there is no need to
forbid marking directory complete

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: record 'offset' for each entry of readdir result
Yan, Zheng [Thu, 28 Apr 2016 07:17:40 +0000 (15:17 +0800)]
ceph: record 'offset' for each entry of readdir result

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: define 'end/complete' in readdir reply as bit flags
Yan, Zheng [Wed, 27 Apr 2016 09:48:30 +0000 (17:48 +0800)]
ceph: define 'end/complete' in readdir reply as bit flags

Set a flag in readdir request, which indicates that client interprets
'end/complete' as bit flags. So that mds can reply additional flags in
readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: define struct for dir entry in readdir reply
Yan, Zheng [Thu, 28 Apr 2016 01:37:39 +0000 (09:37 +0800)]
ceph: define struct for dir entry in readdir reply

This avoids defining multiple arrays for entries in readdir reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: simplify 'offset in frag'
Yan, Zheng [Wed, 27 Apr 2016 09:32:34 +0000 (17:32 +0800)]
ceph: simplify 'offset in frag'

don't distinguish leftmost frag from other frags. always use 2 as
first entry's offset.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: remove unnecessary checks in __dcache_readdir
Yan, Zheng [Fri, 29 Apr 2016 07:58:32 +0000 (15:58 +0800)]
ceph: remove unnecessary checks in __dcache_readdir

we never add snapdir and the hidden .ceph dir into readdir cache

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: search cache postion for dcache readdir
Yan, Zheng [Thu, 28 Apr 2016 09:43:35 +0000 (17:43 +0800)]
ceph: search cache postion for dcache readdir

use binary search to find cache index that corresponds to readdir
postion.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: use CEPH_MDS_OP_RMXATTR request to remove xattr
Yan, Zheng [Thu, 21 Apr 2016 04:11:54 +0000 (12:11 +0800)]
ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr

Setxattr with NULL value and XATTR_REPLACE flag should be equivalent
to removexattr. But current MDS does not support deleting vxattrs through
MDS_OP_SETXATTR request. The workaround is sending MDS_OP_RMXATTR request
if setxattr actually removs xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: report mount root in session metadata
Yan, Zheng [Thu, 21 Apr 2016 03:09:55 +0000 (11:09 +0800)]
ceph: report mount root in session metadata

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't show symlink target in debugfs/mdsc
Yan, Zheng [Mon, 18 Apr 2016 08:51:37 +0000 (16:51 +0800)]
ceph: don't show symlink target in debugfs/mdsc

symlink target is useless for debug and can be very long. It's annoying
to show it in debugfs/mdsc.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't call truncate_pagecache in ceph_writepages_start
Yan, Zheng [Fri, 15 Apr 2016 05:56:12 +0000 (13:56 +0800)]
ceph: don't call truncate_pagecache in ceph_writepages_start

truncate_pagecache() may decrease inode's reference. This can cause
deadlock if inode's last reference is dropped and iput_final() wants
to evict the inode. (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return).

The fix is use work thead to truncate dirty pages. Also add 'forced
umount' check to ceph_update_writeable_page(), which prevents new
pages getting dirty.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: renew caps for read/write if mds session got killed.
Yan, Zheng [Fri, 8 Apr 2016 07:27:16 +0000 (15:27 +0800)]
ceph: renew caps for read/write if mds session got killed.

When mds session gets killed, read/write operation may hang.
Client waits for Frw caps, but mds does not know what caps client
wants. To recover this, client sends an open request to mds. The
request will tell mds what caps client wants.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: CEPH_FEATURE_MDSENC support
Yan, Zheng [Thu, 31 Mar 2016 07:53:01 +0000 (15:53 +0800)]
ceph: CEPH_FEATURE_MDSENC support

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: multiple filesystem support
Yan, Zheng [Wed, 30 Mar 2016 09:18:34 +0000 (17:18 +0800)]
ceph: multiple filesystem support

To access non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds
namespace id.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: support for subscribing to "mdsmap.<id>" maps
Ilya Dryomov [Wed, 25 May 2016 22:05:01 +0000 (00:05 +0200)]
libceph: support for subscribing to "mdsmap.<id>" maps

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: replace ceph_monc_request_next_osdmap()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:28 +0000 (16:07 +0200)]
libceph: replace ceph_monc_request_next_osdmap()

... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: take osdc->lock in osdmap_show() and dump flags in hex
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: take osdc->lock in osdmap_show() and dump flags in hex

There is now about a dozen CEPH_OSDMAP_* flags.  This is a debugging
interface, so just dump in hex instead of spelling each flag out.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: pool deletion detection
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: pool deletion detection

This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: async MON client generic requests
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: async MON client generic requests

For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion.  Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
switched over and remains sync.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: support for checking on status of watch
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: support for checking on status of watch

Implement ceph_osdc_watch_check() to be able to check on status of
watch.  Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: support for sending notifies
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: support for sending notifies

Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph, rbd: ceph_osd_linger_request, watch/notify v2
Ilya Dryomov [Wed, 25 May 2016 23:15:02 +0000 (01:15 +0200)]
libceph, rbd: ceph_osd_linger_request, watch/notify v2

This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agorbd: rbd_dev_header_unwatch_sync() variant
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
rbd: rbd_dev_header_unwatch_sync() variant

Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks.  This is for the new rados_watcherrcb_t, which would be
called from a notify callback.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: wait_request_timeout()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: wait_request_timeout()

The unwatch timeout is currently implemented in rbd.  With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: request_init() and request_release_checks()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: request_init() and request_release_checks()

These are going to be used by request_reinit() code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: a major OSD client update
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: a major OSD client update

This is a major sync up, up to ~Jewel.  The highlights are:

- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support

The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished.  This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: protect osdc->osd_lru list with a spinlock
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: protect osdc->osd_lru list with a spinlock

OSD client is getting moved from the big per-client lock to a set of
per-session locks.  The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: allocate ceph_osd with GFP_NOFAIL
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: allocate ceph_osd with GFP_NOFAIL

create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.

If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: osd_init() and osd_cleanup()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: osd_init() and osd_cleanup()

These are going to be used by homeless OSD sessions code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: handle_one_map()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: handle_one_map()

Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message.  This sets up the scene for the updated OSD
client.

Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.

We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agoMerge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Wed, 25 May 2016 22:59:09 +0000 (15:59 -0700)]
Merge branch 'work.iov_iter' of git://git./linux/kernel/git/viro/vfs

Pull vfs iov_iter regression fix from Al Viro:
 "Fix for braino in 'fold checks into iterate_and_advance()'"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do "fold checks into iterate_and_advance()" right

8 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Wed, 25 May 2016 22:54:35 +0000 (15:54 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs

Pull vfs xattr regression fixes from Al Viro.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make xattr_resolve_handlers() safe to use with NULL ->s_xattr
  xattr: Fail with -EINVAL for NULL attribute names

8 years agoMerge tag 'acpi-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Wed, 25 May 2016 22:38:56 +0000 (15:38 -0700)]
Merge tag 'acpi-4.7-rc1-more' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
 "Additional ACPI update for v4.7-rc1

  Just one fix for incorrect async_synchronize_cookie() usage in the
  ACPI battery driver (Chris Wilson)"

* tag 'acpi-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / battery: Correctly serialise with the pending async probe

8 years agolibceph: allocate dummy osdmap in ceph_osdc_init()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: allocate dummy osdmap in ceph_osdc_init()

This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: schedule tick from ceph_osdc_init()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: schedule tick from ceph_osdc_init()

Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: move schedule_delayed_work() in ceph_osdc_init()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: move schedule_delayed_work() in ceph_osdc_init()

ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails.  Call schedule_delayed_work() after everything is
setup, just before returning.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: redo callbacks and factor out MOSDOpReply decoding
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: redo callbacks and factor out MOSDOpReply decoding

If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
very confusing.  Redo this so that only one of them is called:

    ->r_unsafe_callback(true), on ack
    ->r_unsafe_callback(false), on commit

or

    ->r_callback, on ack|commit

Decode everything in decode_MOSDOpReply() to reduce clutter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: drop msg argument from ceph_osdc_callback_t
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: drop msg argument from ceph_osdc_callback_t

finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success.  This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: switch to calc_target(), part 2
Ilya Dryomov [Wed, 25 May 2016 22:29:52 +0000 (00:29 +0200)]
libceph: switch to calc_target(), part 2

The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target.  Encoding now happens within ceph_osdc_start_request().

Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded.  If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.

Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op().  While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: switch to calc_target(), part 1
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: switch to calc_target(), part 1

Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.

ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid.  This is fixed in
"libceph: switch to calc_target(), part 2".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: introduce ceph_osd_request_target, calc_target()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: introduce ceph_osd_request_target, calc_target()

Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: pi->min_size, pi->last_force_request_resend
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: pi->min_size, pi->last_force_request_resend

Add and decode pi->min_size and pi->last_force_request_resend.  These
are going to be used by calc_target().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: make pgid_cmp() global
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: make pgid_cmp() global

calc_target() code is going to need to know how to compare PGs.  Take
lhs and rhs pgid by const * while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: rename ceph_calc_pg_primary()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: rename ceph_calc_pg_primary()

Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: ceph_osds, ceph_pg_to_up_acting_osds()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: ceph_osds, ceph_pg_to_up_acting_osds()

Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes.  This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around.  Introduce and switch to ceph_osds to help with that.

Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: rename ceph_oloc_oid_to_pg()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: rename ceph_oloc_oid_to_pg()

Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: fix ceph_eversion encoding
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: fix ceph_eversion encoding

eversion_t is version+epoch in userspace and is encoded in that order.
ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
in __send_request().  Reoder ceph_eversion fields.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>