linux-2.6-block.git
11 years agoNVMe: Do not set IO queue depth beyond device max
Keith Busch [Fri, 27 Jul 2012 17:57:23 +0000 (13:57 -0400)]
NVMe: Do not set IO queue depth beyond device max

Set the depth for IO queues to the device's maximum supported queue
entries if the requested depth exceeds the device's capabilities.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
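
The clamping described above can be sketched in user-space C as follows.  This is an illustration rather than the patch itself: clamp_queue_depth() and cap_max_queue_entries() are hypothetical names, and the sketch assumes the limit comes from the 0's based CAP.MQES field in bits 15:0 of the controller capabilities register.

```c
#include <stdint.h>

/* CAP.MQES is a 0's based value: the device supports MQES + 1 entries. */
static uint32_t cap_max_queue_entries(uint64_t cap)
{
    return (uint32_t)(cap & 0xffff) + 1;
}

/* Hypothetical helper: never ask for more entries than the device supports. */
static uint32_t clamp_queue_depth(uint64_t cap, uint32_t requested)
{
    uint32_t max = cap_max_queue_entries(cap);

    return requested > max ? max : requested;
}
```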
11 years agoNVMe: Set block queue max sectors
Keith Busch [Thu, 26 Jul 2012 17:29:57 +0000 (11:29 -0600)]
NVMe: Set block queue max sectors

Set the max hw sectors in a namespace's request queue if the nvme device
has a max data transfer size.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
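
As a rough sketch of the arithmetic involved, assuming the transfer limit is the Identify Controller MDTS field (a power of two in units of the minimum page size, with 0 meaning no limit) and that the minimum page size is 2^(12 + CAP.MPSMIN) bytes.  The function name is hypothetical.

```c
#include <stdint.h>

/* Returns the limit in 512-byte sectors, or 0 when the device reports none. */
static uint32_t max_hw_sectors(uint8_t mdts, uint8_t mpsmin)
{
    unsigned page_shift = 12u + mpsmin;      /* log2 of the minimum page size */
    unsigned shift = mdts + page_shift - 9;  /* bytes -> 512-byte sectors */

    if (mdts == 0)
        return 0;                            /* no maximum transfer size */
    if (shift >= 32)
        return UINT32_MAX;                   /* saturate instead of overflowing */
    return 1u << shift;
}
```

With 4k pages, an MDTS of 5 gives 2^5 pages, i.e. 128k or 256 sectors.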
11 years agoNVMe: use namespace id for nvme_get_features
Keith Busch [Wed, 25 Jul 2012 22:06:38 +0000 (16:06 -0600)]
NVMe: use namespace id for nvme_get_features

The specification does not provide a use for command dword11 in the NVMe
Get Features command, but does use the NSID for some features.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
11 years agoNVMe: replace nvme_ns with nvme_dev for user admin
Keith Busch [Wed, 25 Jul 2012 22:07:55 +0000 (16:07 -0600)]
NVMe: replace nvme_ns with nvme_dev for user admin

The function nvme_user_admin_command does not require a namespace to
proceed.  Replace with the nvme_dev structure so that it can be called
from contexts that do not have a namespace.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
11 years agoNVMe: Fix nvme module init when nvme_major is set
Keith Busch [Wed, 25 Jul 2012 22:05:18 +0000 (16:05 -0600)]
NVMe: Fix nvme module init when nvme_major is set

register_blkdev returns 0 when given a valid major number.

Reported-by: Ross Zwisler <ross.zwisler@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
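
The return convention at issue can be modelled like this.  register_blkdev() returns a negative errno on failure, the newly allocated major when called with major 0, and 0 when a specific major was requested and accepted.  The helper below is a hypothetical user-space model of how a caller might interpret that result, not the driver's code.

```c
/*
 * ret is the value returned by register_blkdev(requested_major, ...).
 * Returns the major number now in use, or a negative errno.
 */
static int major_from_register_result(int requested_major, int ret)
{
    if (ret < 0)
        return ret;              /* registration failed */
    if (ret > 0)
        return ret;              /* a major was allocated dynamically */
    return requested_major;      /* 0: the requested major was accepted */
}
```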
11 years agoNVMe: Set request queue logical block size
Keith Busch [Tue, 24 Jul 2012 21:01:04 +0000 (15:01 -0600)]
NVMe: Set request queue logical block size

Sets the request queue logical block size with the block size of the
namespace.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Set number of queues correctly
Matthew Wilcox [Wed, 11 Jan 2012 14:29:56 +0000 (07:29 -0700)]
NVMe: Set number of queues correctly

The number of submission & completion queues should be set by calling
Set Features, not Get Features.

Reported-by: Kwok Kong <Kwok.Kong@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Version 0.8
Matthew Wilcox [Tue, 10 Jan 2012 21:30:15 +0000 (16:30 -0500)]
NVMe: Version 0.8

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Set queue flags correctly
Matthew Wilcox [Tue, 10 Jan 2012 21:35:08 +0000 (14:35 -0700)]
NVMe: Set queue flags correctly

QUEUE_FLAG_* are bit numbers, not bitmasks (other than QUEUE_FLAG_DEFAULT),
so they cannot be ORed together.  Set the queue flags using
queue_flag_set_unlocked().

Reported-by: Donald Wood <donald.e.wood@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
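
A small sketch of why this matters, assuming the QUEUE_FLAG_* constants are bit numbers, as queue_flag_set_unlocked() expects.  The constants and helpers below are made up for illustration and are not the kernel's definitions.

```c
/* Illustrative bit numbers only; not the kernel's values. */
enum { FLAG_A_BIT = 5, FLAG_B_BIT = 12 };

/* Correct: convert the bit number to a mask before setting it. */
static unsigned long flag_set(unsigned long flags, unsigned bit)
{
    return flags | (1UL << bit);
}

static int flag_test(unsigned long flags, unsigned bit)
{
    return (int)((flags >> bit) & 1UL);
}

/* Wrong: ORing the bit numbers themselves yields an unrelated value. */
static unsigned long flags_ored_bit_numbers(void)
{
    return FLAG_A_BIT | FLAG_B_BIT;
}
```

Setting bits 5 and 12 properly gives 0x1020, whereas 5 | 12 is 13, which sets bits 0, 2 and 3 instead.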
12 years agoNVMe: Simplify nvme_unmap_user_pages
Matthew Wilcox [Fri, 6 Jan 2012 20:52:56 +0000 (13:52 -0700)]
NVMe: Simplify nvme_unmap_user_pages

By using the iod->nents field (the same way other I/O paths do), we can
avoid recalculating the number of sg entries at unmap time, and make
nvme_unmap_user_pages() easier to call.

Also, use the 'write' parameter instead of assuming DMA_FROM_DEVICE.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Mark the end of the sg list
Matthew Wilcox [Fri, 6 Jan 2012 20:49:25 +0000 (13:49 -0700)]
NVMe: Mark the end of the sg list

For user I/O and admin commands, we were forgetting to mark the end of
the SG list.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix DMA mapping for admin commands
Matthew Wilcox [Fri, 6 Jan 2012 20:42:45 +0000 (13:42 -0700)]
NVMe: Fix DMA mapping for admin commands

We were always mapping as DMA_FROM_DEVICE then unmapping with
DMA_TO_DEVICE which was clearly not correct.  Follow the same pattern as
nvme_submit_io() and key off the bottom bit of the opcode to determine
whether this is a read or a write.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
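
The opcode test mentioned above might look like the following sketch.  It relies on the NVMe convention that bit 0 of an opcode indicates a host-to-controller data transfer.  The enum and function names are hypothetical stand-ins for the kernel's DMA direction constants.

```c
#include <stdint.h>

enum xfer_dir { XFER_TO_DEVICE, XFER_FROM_DEVICE };

/* Bit 0 set: the command carries data from host to device (a write). */
static enum xfer_dir dir_for_opcode(uint8_t opcode)
{
    return (opcode & 1) ? XFER_TO_DEVICE : XFER_FROM_DEVICE;
}
```

The point of the fix is that the same direction must then be used for both the map and the unmap call.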
12 years agoNVMe: Rename IO_TIMEOUT to NVME_IO_TIMEOUT
Matthew Wilcox [Tue, 20 Dec 2011 18:53:01 +0000 (13:53 -0500)]
NVMe: Rename IO_TIMEOUT to NVME_IO_TIMEOUT

IO_TIMEOUT is a little too generic and might be used by other parts of
the kernel in the future.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Merge the nvme_bio and nvme_prp data structures
Matthew Wilcox [Tue, 20 Dec 2011 18:34:52 +0000 (13:34 -0500)]
NVMe: Merge the nvme_bio and nvme_prp data structures

The new merged data structure is called nvme_iod.  This improves performance
for mid-sized I/Os (in the 16k range) since we save a memory allocation.
It is also a slightly simpler interface to use.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Change nvme_completion_fn to take a dev
Matthew Wilcox [Tue, 20 Dec 2011 16:54:53 +0000 (11:54 -0500)]
NVMe: Change nvme_completion_fn to take a dev

The queue is only needed for some rare occasions, and it's more consistent
to pass the device around.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Change get_nvmeq to take a dev instead of a namespace
Matthew Wilcox [Tue, 20 Dec 2011 16:04:12 +0000 (11:04 -0500)]
NVMe: Change get_nvmeq to take a dev instead of a namespace

Upcoming patches require calling get_nvmeq when we don't have a namespace.
Some callers already have the device in a local variable anyway.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Simplify completion handling
Matthew Wilcox [Sat, 15 Oct 2011 11:33:46 +0000 (07:33 -0400)]
NVMe: Simplify completion handling

Instead of encoding the handler type in the bottom two bits of the
per-completion context pointer, store the handler function as well
as the context pointer.  This gives us more flexibility and the code
is clearer.  It comes at the cost of an extra 8k of memory per queue,
but this feels like a reasonable price to pay.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
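
A minimal sketch of the "store the handler next to the context" idea.  All names here are hypothetical; the driver's real structures and handler signature may differ.

```c
#include <stddef.h>

typedef void (*completion_fn)(void *ctx, int status);

/* One slot per command ID: an explicit handler plus its context pointer. */
struct cmd_info {
    completion_fn fn;
    void *ctx;
};

/* Example handler: store the status where the submitter can see it. */
static void record_status(void *ctx, int status)
{
    *(int *)ctx = status;
}

/* Clear the slot before invoking the handler so it is called at most once. */
static void complete_cmd(struct cmd_info *info, int status)
{
    completion_fn fn = info->fn;
    void *ctx = info->ctx;

    info->fn = NULL;
    info->ctx = NULL;
    if (fn)
        fn(ctx, status);
}
```

Compared with hiding a handler type in the low bits of the pointer, this costs one extra pointer per slot but needs no masking or decoding.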
12 years agoNVMe: Update Identify Controller data structure
Matthew Wilcox [Fri, 4 Nov 2011 20:24:23 +0000 (16:24 -0400)]
NVMe: Update Identify Controller data structure

The driver was still using an old definition of Identify Controller
which only came to light once we started using the 'number of namespaces'
field properly.

Reported-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Reported-by: Khosrow Panah <Khosrow.Panah@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Implement doorbell stride capability
Matthew Wilcox [Thu, 20 Oct 2011 21:00:41 +0000 (17:00 -0400)]
NVMe: Implement doorbell stride capability

The doorbell stride allows devices to spread out their doorbells instead
of packing them tightly.  This feature was added as part of ECN 003.

This patch also enables support for more than 512 queues :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
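
For reference, the doorbell layout with a stride can be sketched as below.  It assumes the specification's layout: doorbells start at offset 0x1000, each queue pair has a submission tail doorbell followed by a completion head doorbell, and consecutive doorbells are (4 << CAP.DSTRD) bytes apart, with DSTRD in bits 35:32 of CAP.  The function names are hypothetical.

```c
#include <stdint.h>

static uint32_t cap_dstrd(uint64_t cap)
{
    return (uint32_t)((cap >> 32) & 0xf);
}

/* Byte offset of the submission queue tail doorbell for queue qid. */
static uint32_t sq_tail_doorbell(uint64_t cap, uint16_t qid)
{
    return 0x1000u + (2u * qid) * (4u << cap_dstrd(cap));
}

/* Byte offset of the completion queue head doorbell for queue qid. */
static uint32_t cq_head_doorbell(uint64_t cap, uint16_t qid)
{
    return 0x1000u + (2u * qid + 1u) * (4u << cap_dstrd(cap));
}
```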
12 years agoNVMe: Version 0.7
Matthew Wilcox [Fri, 7 Oct 2011 17:20:37 +0000 (13:20 -0400)]
NVMe: Version 0.7

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Don't probe namespace 0
Matthew Wilcox [Fri, 7 Oct 2011 17:10:13 +0000 (13:10 -0400)]
NVMe: Don't probe namespace 0

ECN 001 documented that namespace 0 is not valid.  Sending an Identify
with CNS of 0 and Namespace of 0 is an undefined command.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoFix calculation of number of pages in a PRP List
Nisheeth Bhat [Thu, 29 Sep 2011 14:10:10 +0000 (10:10 -0400)]
Fix calculation of number of pages in a PRP List

The existing calculation underestimated the number of pages required
as it did not take into account the pointer at the end of each page.
The replacement calculation may overestimate the number of pages required
if the last page in the PRP List is entirely full.  By using ->npages
as a counter as we fill in the pages, we ensure that we don't try to
free a page that was never allocated.

Signed-off-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
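
One way to write a calculation with the properties described above, as a sketch assuming 4k pages and 8-byte PRP entries.  Each list page gives up its last slot to the pointer that chains to the next page, so only PAGE - 8 bytes per page hold data entries.  The names are hypothetical and this is not claimed to be the exact code from the patch.

```c
#define NVME_SKETCH_PAGE 4096u
#define DIV_ROUND_UP_U(n, d) (((n) + (d) - 1) / (d))

/* Upper bound on the pages needed to hold the PRP list for 'size' bytes. */
static unsigned prp_list_pages(unsigned size)
{
    /* One extra page covers a transfer that does not start page-aligned. */
    unsigned nprps = DIV_ROUND_UP_U(size + NVME_SKETCH_PAGE, NVME_SKETCH_PAGE);

    /* The last 8 bytes of every list page are reserved for the chain pointer. */
    return DIV_ROUND_UP_U(8 * nprps, NVME_SKETCH_PAGE - 8);
}
```

The result may be one page too many when the final list page is exactly full, which is why counting pages as they are actually allocated matters at free time.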
12 years agoNVMe: Create nvme_identify and nvme_get_features functions
Matthew Wilcox [Mon, 19 Sep 2011 21:08:14 +0000 (17:08 -0400)]
NVMe: Create nvme_identify and nvme_get_features functions

Instead of open-coding calls to nvme_submit_admin_cmd, these
small wrappers are simpler to use (the patch removes 14 lines from
nvme_dev_add() for example).

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix memory leak in nvme_dev_add()
Matthew Wilcox [Mon, 19 Sep 2011 21:14:53 +0000 (17:14 -0400)]
NVMe: Fix memory leak in nvme_dev_add()

The driver was allocating 8k of memory, then freeing 4k of it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix calls to dma_unmap_sg
Nisheeth Bhat [Thu, 15 Sep 2011 20:52:24 +0000 (16:52 -0400)]
NVMe: Fix calls to dma_unmap_sg

dma_unmap_sg() must be called with the same 'nents' passed to
dma_map_sg(), not the number returned from dma_map_sg().

Signed-off-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Correct sg list setup in nvme_map_user_pages
Matthew Wilcox [Tue, 13 Sep 2011 21:01:39 +0000 (17:01 -0400)]
NVMe: Correct sg list setup in nvme_map_user_pages

Our SG list was constructed to always fill the entire first page, even
if that was more than the length of the I/O.  This is probably harmless,
but some IOMMUs might do something bad.

Correcting the first call to sg_set_page() made it look a lot closer to
the sg_set_page() in the loop, so fold the first call to sg_set_page()
into the loop.

Reported-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
12 years agoFix bug in NVME_IOCTL_SUBMIT_IO
Matthew Wilcox [Tue, 9 Aug 2011 16:56:37 +0000 (12:56 -0400)]
Fix bug in NVME_IOCTL_SUBMIT_IO

Missing 'break' in the switch statement meant that we'd fall through
to the 'return -EINVAL' case.

12 years agoNVMe: Rework ioctls
Matthew Wilcox [Fri, 20 May 2011 17:03:42 +0000 (13:03 -0400)]
NVMe: Rework ioctls

Remove the special-purpose IDENTIFY, GET_RANGE_TYPE, DOWNLOAD_FIRMWARE
and ACTIVATE_FIRMWARE commands.  Replace them with a generic ADMIN_CMD
ioctl that can submit any admin command.

Add a new ID ioctl that returns the namespace ID of the queried device.
It corresponds to the SCSI Idlun ioctl.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add the nvme thread to the wait queue before waking it up
Matthew Wilcox [Fri, 20 May 2011 13:34:43 +0000 (09:34 -0400)]
NVMe: Add the nvme thread to the wait queue before waking it up

If the I/O was not completed by a single NVMe command, we add the
bio to the congestion list and wake up the kthread to resubmit it.
But the kthread calls remove_wait_queue() unconditionally, which
will oops if it's not on the wait queue.  So add the kthread to
the wait queue before waking it up.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Return real error from nvme_create_queue
Matthew Wilcox [Wed, 11 May 2011 20:30:59 +0000 (13:30 -0700)]
NVMe: Return real error from nvme_create_queue

nvme_setup_io_queues() was assuming that a NULL return from
nvme_create_queue() was an out-of-memory error.  That's not necessarily
true; the adapter might return -EIO, for example.  Change the calling
convention to return an ERR_PTR on failure instead of NULL.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
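
The ERR_PTR convention encodes a small negative errno in the pointer value itself, so a single return value can carry either a valid pointer or the real error.  The following is a user-space re-implementation for illustration only; the lower-case names are stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR().

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer in the top 4095 addresses. */
static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for values in the reserved error range; NULL is not an error. */
static int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

A caller can then distinguish -ENOMEM from -EIO instead of guessing from a NULL return.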
12 years agoNVMe: Version 0.6
Matthew Wilcox [Thu, 12 May 2011 01:38:57 +0000 (21:38 -0400)]
NVMe: Version 0.6

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add a few calling convention notes
Matthew Wilcox [Thu, 12 May 2011 01:36:38 +0000 (21:36 -0400)]
NVMe: Add a few calling convention notes

For the benefit of reviewers, add comments to a few functions describing
their calling context.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Handle failures from memory allocations in nvme_setup_prps
Matthew Wilcox [Thu, 12 May 2011 17:51:41 +0000 (13:51 -0400)]
NVMe: Handle failures from memory allocations in nvme_setup_prps

If any of the memory allocations in nvme_setup_prps fail, handle it by
modifying the passed-in data length to reflect the number of bytes we are
actually able to send.  Also allow the caller to specify the GFP flags
they need; for user-initiated commands, we can use GFP_KERNEL allocations.

The various callers are updated to handle this possibility; the main
I/O path is already prepared for this possibility (as it may happen
due to nvme_map_bio being unable to map all the segments of the I/O).
The other callers return -ENOMEM instead of doing partial I/Os.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Use an IDA to allocate minor numbers
Matthew Wilcox [Fri, 6 May 2011 12:45:47 +0000 (08:45 -0400)]
NVMe: Use an IDA to allocate minor numbers

The current approach of using the namespace ID as the minor number
doesn't work when there are multiple adapters in the machine.  Rather
than statically partitioning the number of namespaces between adapters,
dynamically allocate minor numbers to namespaces as they are detected.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
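
To illustrate dynamic allocation of small integer IDs, here is a toy lowest-free-ID allocator.  It is only a stand-in for the kernel's IDA, which is not limited to 64 IDs and has a different interface.  Every name below is hypothetical.

```c
#include <stdint.h>

#define SKETCH_MAX_IDS 64

static uint64_t used_ids;   /* bit i set means ID i is in use */

/* Returns the lowest free ID, or -1 if all are taken. */
static int id_alloc(void)
{
    for (int i = 0; i < SKETCH_MAX_IDS; i++) {
        if (!(used_ids & (1ULL << i))) {
            used_ids |= 1ULL << i;
            return i;
        }
    }
    return -1;
}

static void id_free(int id)
{
    used_ids &= ~(1ULL << id);
}
```

Because IDs are handed out as namespaces are detected, two adapters never collide even if both expose a namespace with the same namespace ID.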
12 years agoNVMe: Add include of delay.h for msleep
Matthew Wilcox [Fri, 6 May 2011 12:37:54 +0000 (08:37 -0400)]
NVMe: Add include of delay.h for msleep

Previously it was being implicitly included through some other header file.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add support for timing out I/Os
Matthew Wilcox [Thu, 12 May 2011 17:50:28 +0000 (13:50 -0400)]
NVMe: Add support for timing out I/Os

In the kthread, walk the list of outstanding I/Os and check they've not
hit the timeout.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Rename cancel_cmdid_data to cancel_cmdid
Matthew Wilcox [Fri, 29 Apr 2011 06:17:36 +0000 (23:17 -0700)]
NVMe: Rename cancel_cmdid_data to cancel_cmdid

The trailing '_data' on the end was annoying and inconsistent.  Also, make
it actually return the data since this is needed for timing out commands.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix bug in error handling
Matthew Wilcox [Fri, 29 Apr 2011 06:09:09 +0000 (23:09 -0700)]
NVMe: Fix bug in error handling

When an I/O completed with an error, we would call bio_endio twice
(once with -EIO and once with 0).  Found by inspection.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Time out initialisation after a few seconds
Matthew Wilcox [Tue, 19 Apr 2011 19:04:20 +0000 (15:04 -0400)]
NVMe: Time out initialisation after a few seconds

The device reports (in its capability register) how long it will take
to initialise.  If that time elapses before the ready bit becomes set,
conclude the device is broken and refuse to initialise it.  Log a nice
error message so the user knows why we did nothing.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix warning in free_irq
Matthew Wilcox [Sun, 27 Mar 2011 12:52:06 +0000 (08:52 -0400)]
NVMe: Fix warning in free_irq

We need to clear the affinity mask before calling free_irq()

Reported-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Correct the Controller Configuration settings
Matthew Wilcox [Tue, 22 Mar 2011 19:55:45 +0000 (15:55 -0400)]
NVMe: Correct the Controller Configuration settings

The arbitration field was extended by one bit, shifting the shutdown
notification bits by one.  Also, the SQ/CQ entry size was made
configurable for future extensions.

Reported-by: Paul Luse <paul.e.luse@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Version 0.5
Matthew Wilcox [Mon, 21 Mar 2011 14:28:43 +0000 (10:28 -0400)]
NVMe: Version 0.5

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Change the definition of nvme_user_io
Matthew Wilcox [Mon, 21 Mar 2011 13:48:57 +0000 (09:48 -0400)]
NVMe: Change the definition of nvme_user_io

The read and write commands don't define a 'result', so there's no need
to copy it back to userspace.

Remove the ability of the ioctl to submit commands to a different
namespace; it's just asking for trouble, and the use case I have in mind
will be addressed through a different ioctl in the future.  That removes
the need for both the block_shift and nsid arguments.

Check that the opcode is one of 'read' or 'write'.  Other opcodes may
be added in the future, but we will need a different structure definition
for them.

The nblocks field is redefined to be 0-based.  This allows the user to
request the full 65536 blocks.

Don't byteswap the reftag, apptag and appmask.  Martin Petersen tells
me these are calculated in big-endian and are transmitted to the device
in big-endian.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
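
The 0-based nblocks encoding can be sketched as follows, assuming a 16-bit field in which the value n means n + 1 blocks.  The helper names are hypothetical.

```c
#include <stdint.h>

/* Encode a block count of 1..65536 into a 0-based 16-bit field. */
static int encode_nblocks(uint32_t count, uint16_t *field)
{
    if (count == 0 || count > 65536)
        return -1;               /* not representable */
    *field = (uint16_t)(count - 1);
    return 0;
}

static uint32_t decode_nblocks(uint16_t field)
{
    return (uint32_t)field + 1;
}
```

A 1-based 16-bit field would top out at 65535 blocks; the 0-based form reaches 65536 at the cost of being unable to express an empty transfer.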
12 years agoNVMe: Correct the definitions of two ioctls
Matthew Wilcox [Sun, 20 Mar 2011 11:27:10 +0000 (07:27 -0400)]
NVMe: Correct the definitions of two ioctls

NVME_IOCTL_SUBMIT_IO has a struct nvme_user_io, not a struct nvme_rw_command
as a parameter, and NVME_IOCTL_DOWNLOAD_FW is a Write, not a Read.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add compat_ioctl
Matthew Wilcox [Sat, 19 Mar 2011 18:55:38 +0000 (14:55 -0400)]
NVMe: Add compat_ioctl

Make ioctls work for 32-bit applications on 64-bit kernels.  The structures
are defined to be the same for both 32- and 64-bit applications, so
we can use the same handler for both.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Simplify queue lookup
Matthew Wilcox [Wed, 16 Mar 2011 20:52:19 +0000 (16:52 -0400)]
NVMe: Simplify queue lookup

Fill in all the num_possible_cpus() entries with duplicate pointers.
This reduces the complexity of the frequently-called get_nvmeq(), as
well as avoiding a bug in it when there are fewer queues than CPUs.

Reported-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
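
A sketch of the lookup-table idea: fill one slot per possible CPU up front so the hot path is a plain array index.  The round-robin (modulo) assignment of queues to CPUs is an assumption made for this example, and the names are hypothetical.

```c
#include <stddef.h>

struct io_queue {
    int id;
};

/* Give every CPU slot a queue pointer, reusing queues when there are fewer
 * queues than CPUs.  nqueues must be non-zero. */
static void fill_cpu_queue_table(struct io_queue **table, size_t ncpus,
                                 struct io_queue *queues, size_t nqueues)
{
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        table[cpu] = &queues[cpu % nqueues];
}
```

After this, a lookup is simply table[cpu], with no special case for machines that have more CPUs than queues.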
12 years agoNVMe: Remove the kthread from the wait queue
Matthew Wilcox [Wed, 16 Mar 2011 20:45:49 +0000 (16:45 -0400)]
NVMe: Remove the kthread from the wait queue

Once there are no more bios on the congestion list, we can stop waking
up the nvme kthread every time a completion happens.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix off-by-one when filling in PRP lists
Matthew Wilcox [Wed, 16 Mar 2011 20:43:40 +0000 (16:43 -0400)]
NVMe: Fix off-by-one when filling in PRP lists

If the last element in the PRP list fits on the end of the page, there's
no need to allocate an extra page to put that single element in.  It can
fit on the end of the page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix interpretation of 'Number of Namespaces' field
Matthew Wilcox [Wed, 16 Mar 2011 20:29:58 +0000 (16:29 -0400)]
NVMe: Fix interpretation of 'Number of Namespaces' field

The spec says this is a 0s based value.  We don't need to handle the
maximal value because it's reserved to mean "every namespace".

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Remove outdated comments
Matthew Wilcox [Wed, 16 Mar 2011 20:29:24 +0000 (16:29 -0400)]
NVMe: Remove outdated comments

The head can never overrun the tail since we won't allocate enough command
IDs to let that happen.  The status codes are in sync with the spec.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Fix comment formatting
Matthew Wilcox [Wed, 16 Mar 2011 20:29:00 +0000 (16:29 -0400)]
NVMe: Fix comment formatting

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Convert comments to kernel-doc notation
Matthew Wilcox [Wed, 16 Mar 2011 20:28:24 +0000 (16:28 -0400)]
NVMe: Convert comments to kernel-doc notation

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Update admin opcodes to match the 1.0RC spec
Krzysztof Wierzbicki [Mon, 28 Feb 2011 07:27:13 +0000 (08:27 +0100)]
NVMe: Update admin opcodes to match the 1.0RC spec

Signed-off-by: Krzysztof Wierzbicki <krzysztof.wierzbicki@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Version 0.4
Matthew Wilcox [Thu, 24 Feb 2011 21:20:14 +0000 (16:20 -0500)]
NVMe: Version 0.4

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Reduce maximum queue depth by 1
Matthew Wilcox [Thu, 24 Feb 2011 13:49:41 +0000 (08:49 -0500)]
NVMe: Reduce maximum queue depth by 1

The spec says we're not allowed to completely fill the submission queue.
Solve this by reducing the number of allocatable cmdids by 1.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
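
The reason for the limit is the usual ring-buffer ambiguity: if every slot could be used, head == tail would mean both "empty" and "full".  A sketch with hypothetical names:

```c
#include <stdint.h>

struct sq_state {
    uint16_t head;   /* next entry the device will consume */
    uint16_t tail;   /* next free entry for the host */
    uint16_t depth;  /* number of slots in the ring */
};

/* The queue is full when advancing the tail would make it equal the head. */
static int sq_full(const struct sq_state *q)
{
    return (uint16_t)((q->tail + 1) % q->depth) == q->head;
}

/* Consequently at most depth - 1 commands can be outstanding. */
static unsigned sq_max_outstanding(const struct sq_state *q)
{
    return q->depth - 1u;
}
```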
12 years agoNVMe: Fix discontiguous accesses
Matthew Wilcox [Thu, 24 Feb 2011 13:46:00 +0000 (08:46 -0500)]
NVMe: Fix discontiguous accesses

When we submit subsequent portions of the I/O, we need to access the
updated block, not start reading again from the original position.
This was showing up as miscompares in the XFS randholes testcase.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Handle bios that contain non-virtually contiguous addresses
Matthew Wilcox [Wed, 23 Feb 2011 20:20:00 +0000 (15:20 -0500)]
NVMe: Handle bios that contain non-virtually contiguous addresses

NVMe scatterlists must be virtually contiguous, like almost all I/Os.
However, when the filesystem lays out files with a hole, it can be that
adjacent LBAs map to non-adjacent virtual addresses.  Handle this by
submitting one NVMe command at a time for each virtually discontiguous
range.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Implement Flush
Matthew Wilcox [Tue, 22 Feb 2011 19:18:30 +0000 (14:18 -0500)]
NVMe: Implement Flush

Linux implements Flush as a bit in the bio.  That means there may also be
data associated with the flush; if so the flush should be sent before the
data.  To avoid completing the bio twice, I add CMD_CTX_FLUSH to indicate
the completion routine should do nothing.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Mark CMD_CTX_CANCELLED as being unlikely
Matthew Wilcox [Tue, 22 Feb 2011 19:15:34 +0000 (14:15 -0500)]
NVMe: Mark CMD_CTX_CANCELLED as being unlikely

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Correct SQ doorbell semantics
Matthew Wilcox [Wed, 16 Feb 2011 14:59:59 +0000 (09:59 -0500)]
NVMe: Correct SQ doorbell semantics

The value written to the doorbell needs to be the first free index in
the queue, not the most recently used index in the queue.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
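
In other words, the tail is advanced first and the new tail is what gets written.  A sketch with hypothetical names, in which the actual MMIO write is left out and the function just returns the value that would be written:

```c
#include <stdint.h>

struct sq_ring {
    uint16_t tail;   /* index of the first free slot */
    uint16_t depth;
};

/*
 * Call after copying a command into slot q->tail.  Returns the value to
 * write to the doorbell: the first free index, with wrap-around.
 */
static uint16_t sq_advance_tail(struct sq_ring *q)
{
    q->tail = (uint16_t)((q->tail + 1) % q->depth);
    return q->tail;
}
```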
12 years agoNVMe: Let the kthread take care of devices earlier
Matthew Wilcox [Tue, 15 Feb 2011 21:28:20 +0000 (16:28 -0500)]
NVMe: Let the kthread take care of devices earlier

If interrupts are misconfigured, the kthread will be needed to process
admin queue completions.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Rename nr_queues to nr_io_queues
Matthew Wilcox [Tue, 15 Feb 2011 21:16:02 +0000 (16:16 -0500)]
NVMe: Rename nr_queues to nr_io_queues

I got confused about whether this included the admin queue or not, and
had to resort to reading the spec.  It doesn't include the admin queue,
so make that clear in the name.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Remove setting of 'flags' in rw command
Matthew Wilcox [Tue, 15 Feb 2011 18:44:13 +0000 (13:44 -0500)]
NVMe: Remove setting of 'flags' in rw command

This was the data transfer bit until spec rev 0.92

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Release 0.3
Matthew Wilcox [Mon, 14 Feb 2011 22:35:00 +0000 (17:35 -0500)]
NVMe: Release 0.3

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add a kthread to handle the congestion list
Matthew Wilcox [Wed, 2 Mar 2011 23:37:18 +0000 (18:37 -0500)]
NVMe: Add a kthread to handle the congestion list

Instead of trying to resubmit I/Os in the I/O completion path (in
interrupt context), wake up a kthread which will resubmit I/O from
user context.  This allows mke2fs to run to completion.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Handle failures differently in nvme_submit_bio_queue()
Matthew Wilcox [Mon, 14 Feb 2011 20:55:33 +0000 (15:55 -0500)]
NVMe: Handle failures differently in nvme_submit_bio_queue()

Return -EBUSY if the queue is full or -ENOMEM if we failed to allocate
memory (or map a scatterlist).  Also use GFP_ATOMIC to allocate the
nvme_bio and move the locking to the callers of nvme_submit_bio_queue().

In nvme_make_request(), don't permit an I/O to jump the queue -- if the
congestion list already has an entry, just add to the tail, rather than
trying to submit.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Update BAR structure to match the current spec
Matthew Wilcox [Mon, 14 Feb 2011 17:20:15 +0000 (12:20 -0500)]
NVMe: Update BAR structure to match the current spec

Add two reserved registers in the middle of the BAR to match the 1.0
spec plus ECN 0002.

Also rename IMC and ISC to INTMC and INTSC to conform with the spec.
We still don't need to use them :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Handle physical merging of bvec entries
Matthew Wilcox [Thu, 10 Feb 2011 18:55:39 +0000 (13:55 -0500)]
NVMe: Handle physical merging of bvec entries

In order to not overrun the sg array, we have to merge physically
contiguous pages into a single sg entry.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
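
The merging step can be sketched like this, using a simplified segment type in place of the kernel's bio_vec and scatterlist structures.  All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct seg {
    uint64_t addr;   /* physical start address */
    uint32_t len;    /* length in bytes */
};

/*
 * Copy n input segments to out[], folding each segment that starts exactly
 * where the previous one ended into that previous entry.  Returns the number
 * of entries written; out[] must have room for n entries.
 */
static size_t merge_segments(const struct seg *in, size_t n, struct seg *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < n; i++) {
        if (nout && out[nout - 1].addr + out[nout - 1].len == in[i].addr)
            out[nout - 1].len += in[i].len;
        else
            out[nout++] = in[i];
    }
    return nout;
}
```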
12 years agoNVMe: Check for DMA mapping failure
Matthew Wilcox [Thu, 10 Feb 2011 17:01:09 +0000 (12:01 -0500)]
NVMe: Check for DMA mapping failure

If dma_map_sg returns 0 (failure), we need to fail the I/O.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Pass the nvme_dev to nvme_free_prps and nvme_setup_prps
Matthew Wilcox [Thu, 10 Feb 2011 15:47:55 +0000 (10:47 -0500)]
NVMe: Pass the nvme_dev to nvme_free_prps and nvme_setup_prps

We were passing the nvme_queue to access the q_dmadev for the
dma_alloc_coherent calls, but since we moved to the dma pool API,
we really only need the nvme_dev.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Optimise memory usage for I/Os between 4k and 128k
Matthew Wilcox [Thu, 10 Feb 2011 15:30:34 +0000 (10:30 -0500)]
NVMe: Optimise memory usage for I/Os between 4k and 128k

Add a second memory pool for smaller I/Os.  We can pack 16 of these on a
single page instead of using an entire page for each one.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
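
The numbers work out as follows, assuming 4k pages and 8-byte PRP entries: a sixteenth of a page is 256 bytes, which holds 32 entries, and 32 entries of 4k each describe 128k of data.  The pool-selection test below is a sketch with hypothetical names.

```c
enum {
    SKETCH_PAGE_BYTES = 4096,
    SKETCH_PRP_ENTRY_BYTES = 8,
    SKETCH_SMALL_LIST_BYTES = SKETCH_PAGE_BYTES / 16   /* 16 lists per page */
};

/* True if a PRP list with nprps entries fits in a small-pool allocation. */
static int fits_small_pool(unsigned nprps)
{
    return nprps <= SKETCH_SMALL_LIST_BYTES / SKETCH_PRP_ENTRY_BYTES;
}
```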
12 years agoNVMe: Switch to use DMA Pool API
Matthew Wilcox [Thu, 10 Feb 2011 14:56:01 +0000 (09:56 -0500)]
NVMe: Switch to use DMA Pool API

Calling dma_free_coherent from interrupt context causes warnings.
Using the DMA pools delays freeing until pool destruction, so avoids
the problem.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Rename nvme_req_info to nvme_bio
Matthew Wilcox [Thu, 10 Feb 2011 14:03:06 +0000 (09:03 -0500)]
NVMe: Rename nvme_req_info to nvme_bio

There are too many things called 'info' in this driver.  This data
structure is auxiliary information for a struct bio, so call it nvme_bio,
or nbio when used as a variable.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Initial PRP List support
Shane Michael Matthews [Thu, 10 Feb 2011 13:51:24 +0000 (08:51 -0500)]
NVMe: Initial PRP List support

Add a pointer to the nvme_req_info to hold a new data structure
(nvme_prps) which contains a list of the pages allocated to this
particular request for holding PRP list entries.  nvme_setup_prps()
now returns this pointer.

To allocate and free the memory used for PRP lists, we need a struct
device, so we need to pass the nvme_queue pointer to many functions
which didn't use to need it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Advance the sg pointer when filling in an sg list
Matthew Wilcox [Thu, 10 Feb 2011 13:49:59 +0000 (08:49 -0500)]
NVMe: Advance the sg pointer when filling in an sg list

For multipage BIOs, we were always using sg[0] instead of advancing
through the list.  Oops :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Renumber the special context values
Matthew Wilcox [Mon, 7 Feb 2011 20:55:59 +0000 (15:55 -0500)]
NVMe: Renumber the special context values

If POISON_POINTER_DELTA isn't defined, ensure they're in page 0 which
should never be mapped.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Handle the congestion list a little better
Matthew Wilcox [Mon, 7 Feb 2011 17:45:24 +0000 (12:45 -0500)]
NVMe: Handle the congestion list a little better

In the bio completion handler, check for bios on the congestion list
for this NVM queue.  Also, lock the congestion list in the make_request
function as the queue may end up being shared between multiple CPUs.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Record the timeout for each command
Matthew Wilcox [Sun, 6 Feb 2011 23:30:16 +0000 (18:30 -0500)]
NVMe: Record the timeout for each command

In addition to recording the completion data for each command, record
the anticipated completion time.  Choose a timeout of 5 seconds for
normal I/Os and 60 seconds for admin I/Os.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
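
A sketch of recording and checking a per-command deadline with a wrapping tick counter, similar in spirit to comparing against jiffies.  The 5 and 60 second values come from the message above; the names, the 32-bit counter and the hz parameter are assumptions for the example.

```c
#include <stdint.h>

enum { SKETCH_IO_TIMEOUT_SEC = 5, SKETCH_ADMIN_TIMEOUT_SEC = 60 };

/* Deadline in ticks for a command submitted at 'now'; hz is ticks per second. */
static uint32_t cmd_deadline(uint32_t now, int is_admin, uint32_t hz)
{
    uint32_t secs = is_admin ? SKETCH_ADMIN_TIMEOUT_SEC : SKETCH_IO_TIMEOUT_SEC;

    return now + secs * hz;
}

/* Wrap-safe comparison: true once 'now' has reached or passed the deadline. */
static int cmd_timed_out(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}
```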
12 years agoNVMe: Need to lock queue during interrupt handling
Matthew Wilcox [Sun, 6 Feb 2011 14:01:00 +0000 (09:01 -0500)]
NVMe: Need to lock queue during interrupt handling

If we're sharing a queue between multiple CPUs and we cancel a sync I/O,
we must have the queue locked to avoid corrupting the stack of the thread
that submitted the I/O.  It turns out this is the same locking that's needed
for the threaded irq handler, so share that code.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Detect command IDs completing that are out of range
Matthew Wilcox [Sun, 6 Feb 2011 13:51:15 +0000 (08:51 -0500)]
NVMe: Detect command IDs completing that are out of range

If the adapter completes a command ID that is outside the bounds of
the array, return CMD_CTX_INVALID instead of random data, and print a
message in the sync_completion handler (which is rapidly becoming the
misc completion handler :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Detect commands that are completed twice
Matthew Wilcox [Sun, 6 Feb 2011 13:49:55 +0000 (08:49 -0500)]
NVMe: Detect commands that are completed twice

Set the context value to CMD_CTX_COMPLETED, and print a message in the
sync_completion handler if we see it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Use a symbolic name to represent cancelled commands instead of 0
Matthew Wilcox [Sun, 6 Feb 2011 12:53:23 +0000 (07:53 -0500)]
NVMe: Use a symbolic name to represent cancelled commands instead of 0

I have plans for other special values in sync_completion.  Plus, this
is more self-documenting, and lets us detect bogus usages.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add a module parameter to use a threaded interrupt
Matthew Wilcox [Sun, 6 Feb 2011 12:28:06 +0000 (07:28 -0500)]
NVMe: Add a module parameter to use a threaded interrupt

We're currently calling bio_endio from hard interrupt context.  This is
not a good idea for preemptible kernels as it will cause longer latencies.
Using a threaded interrupt will run the entire queue processing mechanism
(including bio_endio) in a thread, which can be preempted.  Unfortunately,
it also adds about 7us of latency to the single-I/O case, so make it a
module parameter for the moment.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Call put_nvmeq() before calling nvme_submit_sync_cmd()
Matthew Wilcox [Fri, 4 Feb 2011 21:14:30 +0000 (16:14 -0500)]
NVMe: Call put_nvmeq() before calling nvme_submit_sync_cmd()

We can't have preemption disabled when we call schedule().  Accept the
possibility that we'll get preempted, and it'll cost us some cacheline
bounces.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Allow fatal signals to interrupt I/O
Matthew Wilcox [Fri, 4 Feb 2011 21:03:56 +0000 (16:03 -0500)]
NVMe: Allow fatal signals to interrupt I/O

If the user sends a fatal signal, sleeping in the TASK_KILLABLE state
permits the task to be aborted.  The only wrinkle is making sure that,
if/when the command completes later, it doesn't upset anything.
Handle this by setting the data pointer to 0, and checking the value
isn't NULL in the sync completion path.  Eventually, bios can be cancelled
through this path too.  Note that the cmdid isn't freed to prevent reuse.

We should also abort the command in the future, but this is a good start.
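
A userspace sketch of the cancellation check described above, with
hypothetical names (sync_cmd, cancel_cmd, sync_completion).  It models only
the NULL test; the real path also has to deal with locking and with waking
the sleeping task.

```c
#include <stddef.h>

/* Hypothetical state shared between the submitter and the completion path. */
struct sync_cmd {
    int      done;
    unsigned result;
};

/* On a fatal signal, clear the slot's data pointer.  The slot itself is
 * deliberately not freed, so the command ID cannot be reused. */
static void cancel_cmd(void **slot)
{
    *slot = NULL;
}

/* Completion path: a NULL context means the submitter has gone away, so
 * nothing is written through it.  Returns 1 if the result was delivered. */
static int sync_completion(void *ctx, unsigned result)
{
    struct sync_cmd *cmd = ctx;

    if (cmd == NULL)
        return 0;
    cmd->result = result;
    cmd->done = 1;
    return 1;
}
```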

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Release 0.2
Matthew Wilcox [Thu, 3 Feb 2011 19:36:07 +0000 (14:36 -0500)]
NVMe: Release 0.2

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add download / activate firmware ioctls
Matthew Wilcox [Thu, 3 Feb 2011 15:58:26 +0000 (10:58 -0500)]
NVMe: Add download / activate firmware ioctls

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add remaining status codes
Matthew Wilcox [Thu, 3 Feb 2011 14:20:57 +0000 (09:20 -0500)]
NVMe: Add remaining status codes

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Move sysfs entries to the right place
Matthew Wilcox [Tue, 1 Feb 2011 17:49:38 +0000 (12:49 -0500)]
NVMe: Move sysfs entries to the right place

Because I wasn't setting driverfs_dev, the devices were showing up under
/sys/devices/virtual/block.  Now they appear underneath the PCI device
which they belong to.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Disable the device before we write the admin queues
Shane Michael Matthews [Tue, 1 Feb 2011 16:31:55 +0000 (11:31 -0500)]
NVMe: Disable the device before we write the admin queues

In case the card has been left in a partially-configured state,
write 0 to the Enable bit.
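
As an illustration only: in the NVMe specification the Enable bit (EN) is
bit 0 of the Controller Configuration register.  The helper below shows the
bit manipulation on a plain 32-bit value; the macro and function names are
assumptions, and the driver itself may simply write the whole register.

```c
#include <stdint.h>

/* Enable bit of the Controller Configuration register (macro name assumed). */
#define CC_ENABLE (1u << 0)

/* Return the register value with the Enable bit cleared; other bits kept. */
static uint32_t cc_clear_enable(uint32_t cc)
{
    return cc & ~CC_ENABLE;
}
```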

Signed-off-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Request I/O regions
Matthew Wilcox [Tue, 1 Feb 2011 21:24:35 +0000 (16:24 -0500)]
NVMe: Request I/O regions

Calling pci_request_selected_regions() reserves these regions for our use.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Allow queues to be allocated above 4GB
Matthew Wilcox [Tue, 1 Feb 2011 21:23:39 +0000 (16:23 -0500)]
NVMe: Allow queues to be allocated above 4GB

Need to call dma_set_coherent_mask() to allow queues to be allocated
above 4GB.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Enable device DMA
Matthew Wilcox [Tue, 1 Feb 2011 14:01:59 +0000 (09:01 -0500)]
NVMe: Enable device DMA

Need to call pci_set_master() to enable device DMA.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Enable and disable the PCI device
Shane Michael Matthews [Tue, 1 Feb 2011 13:49:30 +0000 (08:49 -0500)]
NVMe: Enable and disable the PCI device

Call pci_enable_device_mem() at initialisation and pci_disable_device
at exit.

Signed-off-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Check returns from nvme_alloc_queue()
Matthew Wilcox [Tue, 1 Feb 2011 13:39:04 +0000 (08:39 -0500)]
NVMe: Check returns from nvme_alloc_queue()

It can return NULL, so handle that.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Remove 'node' from nvme_dev
Matthew Wilcox [Mon, 31 Jan 2011 15:46:14 +0000 (10:46 -0500)]
NVMe: Remove 'node' from nvme_dev

We don't keep a list of nvme_dev any more.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Read the model, serial & firmware rev from the controller
Matthew Wilcox [Tue, 1 Feb 2011 21:18:08 +0000 (16:18 -0500)]
NVMe: Read the model, serial & firmware rev from the controller

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Add NVME_IOCTL_SUBMIT_IO
Matthew Wilcox [Tue, 1 Feb 2011 21:13:29 +0000 (16:13 -0500)]
NVMe: Add NVME_IOCTL_SUBMIT_IO

Allow userspace to submit synchronous I/O like the SCSI sg interface does.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Create nvme_map_user_pages() and nvme_unmap_user_pages()
Matthew Wilcox [Wed, 26 Jan 2011 22:05:50 +0000 (17:05 -0500)]
NVMe: Create nvme_map_user_pages() and nvme_unmap_user_pages()

These are generalisations of the code that was in
nvme_submit_user_admin_command().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
12 years agoNVMe: Change NVME_IOCTL_GET_RANGE_TYPE to return all the ranges
Matthew Wilcox [Wed, 26 Jan 2011 19:34:32 +0000 (14:34 -0500)]
NVMe: Change NVME_IOCTL_GET_RANGE_TYPE to return all the ranges

Factor out most of nvme_identify() into a new nvme_submit_user_admin_command()
function.  Change nvme_get_range_type() to call it and change nvme_ioctl to
realise that it's getting back all 64 ranges.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>