Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git...
author Linus Torvalds <torvalds@linux-foundation.org>
Wed, 31 Jan 2018 19:05:47 +0000 (11:05 -0800)
committer Linus Torvalds <torvalds@linux-foundation.org>
Wed, 31 Jan 2018 19:05:47 +0000 (11:05 -0800)
Pull device mapper updates from Mike Snitzer:

 - DM core fixes to ensure that bio submission follows a depth-first
   tree walk; this is critical to allow forward progress without the
   need to use the bioset's BIOSET_NEED_RESCUER.

 - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.

 - DM core cleanups and improvements to make bio-based DM more efficient
   (e.g. reduced memory footprint as well as leveraging per-bio-data more).

 - Introduce a new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
   the more direct IO submission path in the block layer; this mode is
   used by DM multipath and also optimizes targets like DM thin-pool
   that stack directly on an NVMe data device (a schematic dmsetup
   sketch follows this list).

 - DM multipath improvements to factor out legacy SCSI-only (e.g.
   scsi_dh) code paths, allowing more optimized support for NVMe
   multipath.

 - A fix for DM multipath path selectors (service-time and queue-length)
   to select paths in a more balanced way; largely academic but doesn't
   hurt.

 - Numerous DM raid target fixes and improvements.

 - Add a new DM "unstriped" target that enables Intel to work around
   firmware limitations in some NVMe drives that are striped internally
   (this target also works when stacked above the DM "striped" target;
   also shown in the sketch after this list).

 - Various Documentation fixes and improvements.

 - Misc cleanups and fixes across various DM infrastructure and targets
   (e.g. bufio, flakey, log-writes, snapshot).
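
As a rough sketch of how the two new user-visible pieces are driven from
userspace (device names and table sizes below are placeholders, and the
path-group portion of the multipath line is only schematic; the unstriped
line mirrors the unstriped.txt documentation added in this diff):

  # New bio-based NVMe multipath mode, selected via the "queue_mode nvme"
  # feature argument in a multipath table (schematic single-path table):
  dmsetup create mp0 --table '0 2097152 multipath 2 queue_mode nvme 0 1 1 round-robin 0 1 1 /dev/nvme0n1 1'

  # New "unstriped" target: expose stripe 0 of a device striped internally
  # across 2 cores with a 256-sector (128K) chunk:
  dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'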

* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
  dm cache: Documentation: update default migration_throttling value
  dm mpath selector: more evenly distribute ties
  dm unstripe: fix target length versus number of stripes size check
  dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
  dm table: fix NVMe bio-based dm_table_determine_type() validation
  dm: various cleanups to md->queue initialization code
  dm mpath: delay the retry of a request if the target responded as busy
  dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
  dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
  dm log writes: fix max length used for kstrndup
  dm: backfill missing calls to mutex_destroy()
  dm snapshot: use mutex instead of rw_semaphore
  dm flakey: check for null arg_name in parse_features()
  dm thin: extend thinpool status format string with omitted fields
  dm thin: fixes in thin-provisioning.txt
  dm thin: document representation of <highest mapped sector> when there is none
  dm thin: fix documentation relative to low water mark threshold
  dm cache: be consistent in specifying sectors and SI units in cache.txt
  dm cache: delete obsoleted paragraph in cache.txt
  dm cache: fix grammar in cache-policies.txt
  ...

31 files changed:
Documentation/device-mapper/cache-policies.txt
Documentation/device-mapper/cache.txt
Documentation/device-mapper/dm-raid.txt
Documentation/device-mapper/snapshot.txt
Documentation/device-mapper/thin-provisioning.txt
Documentation/device-mapper/unstriped.txt [new file with mode: 0644]
drivers/md/Kconfig
drivers/md/Makefile
drivers/md/dm-bufio.c
drivers/md/dm-core.h
drivers/md/dm-crypt.c
drivers/md/dm-delay.c
drivers/md/dm-flakey.c
drivers/md/dm-io.c
drivers/md/dm-kcopyd.c
drivers/md/dm-log-writes.c
drivers/md/dm-mpath.c
drivers/md/dm-queue-length.c
drivers/md/dm-raid.c
drivers/md/dm-rq.c
drivers/md/dm-service-time.c
drivers/md/dm-snap.c
drivers/md/dm-stats.c
drivers/md/dm-table.c
drivers/md/dm-thin.c
drivers/md/dm-unstripe.c [new file with mode: 0644]
drivers/md/dm-zoned-metadata.c
drivers/md/dm-zoned-target.c
drivers/md/dm.c
drivers/md/dm.h
include/linux/device-mapper.h

index d3ca8af21a31adf4eed39eade2067a48bdb8dda6..86786d87d9a8e91b217ef1915e8f844b6ed29f33 100644 (file)
@@ -60,7 +60,7 @@ Memory usage:
 The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.
 
-smq uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement its data structures rather than
 pointers.  It avoids storing an explicit hit count for each block.  It
 has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
@@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions.
 
 Adaptability:
 The mq policy maintained a hit count for each cache block.  For a
-different block to get promoted to the cache it's hit count has to
+different block to get promoted to the cache its hit count has to
 exceed the lowest currently in the cache.  This meant it could take a
 long time for the cache to adapt between varying IO patterns.
 
index cdfd0feb294e6f0be9851223b7669de3109d875a..ff0841711fd524063e2acf0fa4644ec2fbb4b764 100644 (file)
@@ -59,7 +59,7 @@ Fixed block size
 The origin is divided up into blocks of a fixed size.  This block size
 is configurable when you first create the cache.  Typically we've been
 using block sizes of 256KB - 1024KB.  The block size must be between 64
-(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
+sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
 
 Having a fixed block size simplifies the target a lot.  But it is
 something of a compromise.  For instance, a small part of a block may be
@@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments.
 
 For the time being, a message "migration_threshold <#sectors>"
 can be used to set the maximum number of sectors being migrated,
-the default being 204800 sectors (or 100MB).
+the default being 2048 sectors (1MB).
 
 Updating on-disk metadata
 -------------------------
@@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small.  Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.
 
-For instance, the 'mq' policy, which is currently the default policy,
-uses this facility to store the hit count of the cache blocks.  If
-there's a crash this information will be lost, which means the cache
-may be less efficient until those hit counts are regenerated.
-
 Policy hints affect performance, not correctness.
 
 Policy messaging
index 32df07e29f6860d4c1de9e3c3c63d2b4d0fc454f..390c145f01d717fc9bc9499fe4f0fc5d8dbc1257 100644 (file)
@@ -343,5 +343,8 @@ Version History
 1.11.0  Fix table line argument order
        (wrong raid10_copies/raid10_format sequence)
 1.11.1  Add raid4/5/6 journal write-back support via journal_mode option
-1.12.1  fix for MD deadlock between mddev_suspend() and md_write_start() available
+1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A')
+1.13.1  Fix deadlock caused by early md_stop_writes().  Also fix size and
+       state races.
+1.13.2  Fix raid redundancy validation and avoid keeping raid set frozen
index ad6949bff2e392d63e7ddb82391746687ea6235e..b8bbb516f9894a86f2f94b93c7a056633c45f85c 100644 (file)
@@ -49,6 +49,10 @@ The difference between persistent and transient is with transient
 snapshots less metadata must be saved on disk - they can be kept in
 memory by the kernel.
 
+When loading or unloading the snapshot target, the corresponding
+snapshot-origin or snapshot-merge target must be suspended. A failure to
+suspend the origin target could result in data corruption.
+
 
 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
 
index 1699a55b7b709adddd18b97cd769cb8946662e48..4bcd4b7f79f9437077b5c0b5dc0e9b8e119fb9ac 100644 (file)
@@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size.  If
 free space on the data device drops below this level then a dm event
 will be triggered which a userspace daemon should catch allowing it to
 extend the pool device.  Only one such event will be sent.
-Resuming a device with a new table itself triggers an event so the
-userspace daemon can use this to detect a situation where a new table
-already exceeds the threshold.
+
+No special event is triggered if a just resumed device's free space is below
+the low water mark. However, resuming a device always triggers an
+event; a userspace daemon should verify that free space exceeds the low
+water mark when handling this event.
 
 A low water mark for the metadata device is maintained in the kernel and
 will trigger a dm event if free space on the metadata device drops below
@@ -274,7 +276,8 @@ ii) Status
 
     <transaction id> <used metadata blocks>/<total metadata blocks>
     <used data blocks>/<total data blocks> <held metadata root>
-    [no_]discard_passdown ro|rw
+    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
+    needs_check|-
 
     transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
@@ -394,3 +397,6 @@ ii) Status
        If the pool has encountered device errors and failed, the status
        will just contain the string 'Fail'.  The userspace recovery
        tools should then be used.
+
+    In the case where <nr mapped sectors> is 0, there is no highest
+    mapped sector and the value of <highest mapped sector> is unspecified.
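
(Illustrative only: with the pool status fields documented above, a
thin-pool line reported by "dmsetup status" would look roughly like the
following; the length and block counts are made-up placeholders.)

  0 419430400 thin-pool 0 441/4161600 13/102400 - rw discard_passdown queue_if_no_space -
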
diff --git a/Documentation/device-mapper/unstriped.txt b/Documentation/device-mapper/unstriped.txt
new file mode 100644 (file)
index 0000000..0b2a306
--- /dev/null
@@ -0,0 +1,124 @@
+Introduction
+============
+
+The device-mapper "unstriped" target provides a transparent mechanism to
+unstripe a device-mapper "striped" target to access the underlying disks
+without having to touch the true backing block-device.  It can also be
+used to unstripe a hardware RAID-0 to access backing disks.
+
+Parameters:
+<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
+
+<number of stripes>
+        The number of stripes in the RAID 0.
+
+<chunk size>
+       The amount of 512B sectors in the chunk striping.
+
+<dev_path>
+       The block device you wish to unstripe.
+
+<stripe #>
+        The stripe number within the device that corresponds to the
+        physical drive you wish to unstripe.  This must be 0-indexed.
+
+
+Why use this module?
+====================
+
+An example of undoing an existing dm-stripe
+-------------------------------------------
+
+This small bash script will set up 4 loop devices and use the existing
+striped target to combine the 4 devices into one.  It will then use
+the unstriped target on top of the striped device to access the
+individual backing loop devices.  We write data to the newly exposed
+unstriped devices and verify that the data written matches the correct
+underlying device on the striped array.
+
+#!/bin/bash
+
+MEMBER_SIZE=$((128 * 1024 * 1024))
+NUM=4
+SEQ_END=$((${NUM}-1))
+CHUNK=256
+BS=4096
+
+RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
+DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
+COUNT=$((${MEMBER_SIZE} / ${BS}))
+
+for i in $(seq 0 ${SEQ_END}); do
+  dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
+  losetup /dev/loop${i} member-${i}
+  DM_PARMS+=" /dev/loop${i} 0"
+done
+
+echo $DM_PARMS | dmsetup create raid0
+for i in $(seq 0 ${SEQ_END}); do
+  echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
+done;
+
+for i in $(seq 0 ${SEQ_END}); do
+  dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
+  diff /dev/mapper/set-${i} member-${i}
+done;
+
+for i in $(seq 0 ${SEQ_END}); do
+  dmsetup remove set-${i}
+done
+
+dmsetup remove raid0
+
+for i in $(seq 0 ${SEQ_END}); do
+  losetup -d /dev/loop${i}
+  rm -f member-${i}
+done
+
+Another example
+---------------
+
+Intel NVMe drives contain two cores on the physical device.
+Each core of the drive has segregated access to its LBA range.
+The current LBA model has a RAID 0 128k chunk on each core, resulting
+in a 256k stripe across the two cores:
+
+   Core 0:       Core 1:
+  __________    __________
+  | LBA 512|    | LBA 768|
+  | LBA 0  |    | LBA 256|
+  ----------    ----------
+
+The purpose of this unstriping is to provide better QoS in noisy
+neighbor environments. When two partitions are created on the
+aggregate drive without this unstriping, reads on one partition
+can affect writes on another partition.  This is because the partitions
+are striped across the two cores.  When we unstripe this hardware RAID 0
+and make partitions on each new exposed device the two partitions are now
+physically separated.
+
+With the dm-unstriped target we're able to segregate an fio script that
+has read and write jobs that are independent of each other.  Compared to
+when we run the test on a combined drive with partitions, we were able
+to get a 92% reduction in read latency using this device mapper target.
+
+
+Example dmsetup usage
+=====================
+
+unstriped ontop of Intel NVMe device that has 2 cores
+-----------------------------------------------------
+dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
+dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
+
+There will now be two devices that expose Intel NVMe core 0 and 1
+respectively:
+/dev/mapper/nvmset0
+/dev/mapper/nvmset1
+
+unstriped ontop of striped with 4 drives using 128K chunk size
+--------------------------------------------------------------
+dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
+dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
+dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
+dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
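
(A brief note on the arithmetic behind the examples above: the <chunk size>
argument is given in 512-byte sectors, so a 128K chunk becomes
128 * 1024 / 512 = 256. Reusing the placeholder /dev/mapper/striped device
from the last example:)

  CHUNK=$((128 * 1024 / 512))   # 256 sectors = 128K
  dmsetup create raid_disk2 --table "0 512 unstriped 4 ${CHUNK} 2 /dev/mapper/striped 0"
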
index 83b9362be09c61cec0a171d8715963c2dc812a38..2c8ac3688815e5f0360a47bde82d1a68bf888ce1 100644 (file)
@@ -269,6 +269,13 @@ config DM_BIO_PRISON
 
 source "drivers/md/persistent-data/Kconfig"
 
+config DM_UNSTRIPED
+       tristate "Unstriped target"
+       depends on BLK_DEV_DM
+       ---help---
+         Unstripes I/O so it is issued solely on a single drive in a HW
+         RAID0 or dm-striped target.
+
 config DM_CRYPT
        tristate "Crypt target support"
        depends on BLK_DEV_DM
index f701bb2117839e7dd6c9fd081b1be3c3c0159aa1..63255f3ebd97c6cbbb70a5caab97a83a2f7785ca 100644 (file)
@@ -43,6 +43,7 @@ obj-$(CONFIG_BCACHE)          += bcache/
 obj-$(CONFIG_BLK_DEV_MD)       += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)       += dm-mod.o
 obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
+obj-$(CONFIG_DM_UNSTRIPED)     += dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO)         += dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON)    += dm-bio-prison.o
 obj-$(CONFIG_DM_CRYPT)         += dm-crypt.o
index c546b567f3b50a3f43b0c074e9319ca908ec5971..414c9af54ded2fde89531cedf3da430c53fcac9d 100644 (file)
@@ -662,7 +662,7 @@ static void submit_io(struct dm_buffer *b, int rw, bio_end_io_t *end_io)
 
        sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
 
-       if (rw != WRITE) {
+       if (rw != REQ_OP_WRITE) {
                n_sectors = 1 << b->c->sectors_per_block_bits;
                offset = 0;
        } else {
@@ -740,7 +740,7 @@ static void __write_dirty_buffer(struct dm_buffer *b,
        b->write_end = b->dirty_end;
 
        if (!write_list)
-               submit_io(b, WRITE, write_endio);
+               submit_io(b, REQ_OP_WRITE, write_endio);
        else
                list_add_tail(&b->write_list, write_list);
 }
@@ -753,7 +753,7 @@ static void __flush_write_list(struct list_head *write_list)
                struct dm_buffer *b =
                        list_entry(write_list->next, struct dm_buffer, write_list);
                list_del(&b->write_list);
-               submit_io(b, WRITE, write_endio);
+               submit_io(b, REQ_OP_WRITE, write_endio);
                cond_resched();
        }
        blk_finish_plug(&plug);
@@ -1123,7 +1123,7 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
                return NULL;
 
        if (need_submit)
-               submit_io(b, READ, read_endio);
+               submit_io(b, REQ_OP_READ, read_endio);
 
        wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
 
@@ -1193,7 +1193,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
                        dm_bufio_unlock(c);
 
                        if (need_submit)
-                               submit_io(b, READ, read_endio);
+                               submit_io(b, REQ_OP_READ, read_endio);
                        dm_bufio_release(b);
 
                        cond_resched();
@@ -1454,7 +1454,7 @@ retry:
                old_block = b->block;
                __unlink_buffer(b);
                __link_buffer(b, new_block, b->list_mode);
-               submit_io(b, WRITE, write_endio);
+               submit_io(b, REQ_OP_WRITE, write_endio);
                wait_on_bit_io(&b->state, B_WRITING,
                               TASK_UNINTERRUPTIBLE);
                __unlink_buffer(b);
@@ -1716,7 +1716,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
                        if (!DM_BUFIO_CACHE_NAME(c)) {
                                r = -ENOMEM;
                                mutex_unlock(&dm_bufio_clients_lock);
-                               goto bad_cache;
+                               goto bad;
                        }
                }
 
@@ -1727,7 +1727,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
                        if (!DM_BUFIO_CACHE(c)) {
                                r = -ENOMEM;
                                mutex_unlock(&dm_bufio_clients_lock);
-                               goto bad_cache;
+                               goto bad;
                        }
                }
        }
@@ -1738,27 +1738,28 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 
                if (!b) {
                        r = -ENOMEM;
-                       goto bad_buffer;
+                       goto bad;
                }
                __free_buffer_wake(b);
        }
 
+       c->shrinker.count_objects = dm_bufio_shrink_count;
+       c->shrinker.scan_objects = dm_bufio_shrink_scan;
+       c->shrinker.seeks = 1;
+       c->shrinker.batch = 0;
+       r = register_shrinker(&c->shrinker);
+       if (r)
+               goto bad;
+
        mutex_lock(&dm_bufio_clients_lock);
        dm_bufio_client_count++;
        list_add(&c->client_list, &dm_bufio_all_clients);
        __cache_size_refresh();
        mutex_unlock(&dm_bufio_clients_lock);
 
-       c->shrinker.count_objects = dm_bufio_shrink_count;
-       c->shrinker.scan_objects = dm_bufio_shrink_scan;
-       c->shrinker.seeks = 1;
-       c->shrinker.batch = 0;
-       register_shrinker(&c->shrinker);
-
        return c;
 
-bad_buffer:
-bad_cache:
+bad:
        while (!list_empty(&c->reserved_buffers)) {
                struct dm_buffer *b = list_entry(c->reserved_buffers.next,
                                                 struct dm_buffer, lru_list);
@@ -1767,6 +1768,7 @@ bad_cache:
        }
        dm_io_client_destroy(c->dm_io);
 bad_dm_io:
+       mutex_destroy(&c->lock);
        kfree(c);
 bad_client:
        return ERR_PTR(r);
@@ -1811,6 +1813,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
                BUG_ON(c->n_buffers[i]);
 
        dm_io_client_destroy(c->dm_io);
+       mutex_destroy(&c->lock);
        kfree(c);
 }
 EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
index 6a14f945783c6cb296eabf999e5dcd05fea6596b..3222e21cbbf82cab0039cf1a992f14bf48383b8a 100644 (file)
@@ -91,8 +91,7 @@ struct mapped_device {
        /*
         * io objects are allocated from here.
         */
-       mempool_t *io_pool;
-
+       struct bio_set *io_bs;
        struct bio_set *bs;
 
        /*
@@ -130,8 +129,6 @@ struct mapped_device {
        struct srcu_struct io_barrier;
 };
 
-void dm_init_md_queue(struct mapped_device *md);
-void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
index 2ad429100d25df57974f2981e8b6cdcd587b7af6..8168f737590e7cff141b933fe91421da8a6ac326 100644 (file)
@@ -2193,6 +2193,8 @@ static void crypt_dtr(struct dm_target *ti)
        kzfree(cc->cipher_auth);
        kzfree(cc->authenc_key);
 
+       mutex_destroy(&cc->bio_alloc_lock);
+
        /* Must zero key material before freeing */
        kzfree(cc);
 }
@@ -2702,8 +2704,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
                goto bad;
        }
 
-       cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS |
-                                           BIOSET_NEED_RESCUER));
+       cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
        if (!cc->bs) {
                ti->error = "Cannot allocate crypt bioset";
                goto bad;
index 288386bfbfb569e62f5bc46f3b0baffaf8e09ba1..1783d80c9cad029b61c261c41ed5c5dbdc8c0624 100644 (file)
@@ -229,6 +229,8 @@ static void delay_dtr(struct dm_target *ti)
        if (dc->dev_write)
                dm_put_device(ti, dc->dev_write);
 
+       mutex_destroy(&dc->timer_lock);
+
        kfree(dc);
 }
 
index b82cb1ab1eaa338a67800bfdb0c38d36f7913ef2..1b907b15f5c3be312569452ea1476ada54e01db5 100644 (file)
@@ -70,6 +70,11 @@ static int parse_features(struct dm_arg_set *as, struct flakey_c *fc,
                arg_name = dm_shift_arg(as);
                argc--;
 
+               if (!arg_name) {
+                       ti->error = "Insufficient feature arguments";
+                       return -EINVAL;
+               }
+
                /*
                 * drop_writes
                 */
index b4357ed4d541621fa69eec798fdb3a7d85ae68a4..a8d914d5abbe6c91b4050349c46e80ad02f26249 100644 (file)
@@ -58,8 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
        if (!client->pool)
                goto bad;
 
-       client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS |
-                                                 BIOSET_NEED_RESCUER));
+       client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
        if (!client->bios)
                goto bad;
 
index eb45cc3df31da109bcb9846967cbf8926d7524b8..e6e7c686646d2946f5461b8b4ab0bdb347af96f2 100644 (file)
@@ -477,8 +477,10 @@ static int run_complete_job(struct kcopyd_job *job)
         * If this is the master job, the sub jobs have already
         * completed so we can free everything.
         */
-       if (job->master_job == job)
+       if (job->master_job == job) {
+               mutex_destroy(&job->lock);
                mempool_free(job, kc->job_pool);
+       }
        fn(read_err, write_err, context);
 
        if (atomic_dec_and_test(&kc->nr_jobs))
@@ -750,6 +752,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
         * followed by SPLIT_COUNT sub jobs.
         */
        job = mempool_alloc(kc->job_pool, GFP_NOIO);
+       mutex_init(&job->lock);
 
        /*
         * set up for the read.
@@ -811,7 +814,6 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
        if (job->source.count <= SUB_JOB_SIZE)
                dispatch_job(job);
        else {
-               mutex_init(&job->lock);
                job->progress = 0;
                split_job(job);
        }
index 189badbeddaf7bf39ffd97544232aa8ba7a2f9d4..3362d866793b320149bcb7a83b663e1bb68cf927 100644 (file)
@@ -594,7 +594,7 @@ static int log_mark(struct log_writes_c *lc, char *data)
                return -ENOMEM;
        }
 
-       block->data = kstrndup(data, maxsize, GFP_KERNEL);
+       block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);
        if (!block->data) {
                DMERR("Error copying mark data");
                kfree(block);
index ef57c6d1c887c6739b9255157bced1a638aa8bc3..7d3e572072f51f865d07a79ae5a0ac974bd98e4a 100644 (file)
@@ -64,36 +64,30 @@ struct priority_group {
 
 /* Multipath context */
 struct multipath {
-       struct list_head list;
-       struct dm_target *ti;
-
-       const char *hw_handler_name;
-       char *hw_handler_params;
+       unsigned long flags;            /* Multipath state flags */
 
        spinlock_t lock;
-
-       unsigned nr_priority_groups;
-       struct list_head priority_groups;
-
-       wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
+       enum dm_queue_mode queue_mode;
 
        struct pgpath *current_pgpath;
        struct priority_group *current_pg;
        struct priority_group *next_pg; /* Switch to this PG if set */
 
-       unsigned long flags;            /* Multipath state flags */
+       atomic_t nr_valid_paths;        /* Total number of usable paths */
+       unsigned nr_priority_groups;
+       struct list_head priority_groups;
 
+       const char *hw_handler_name;
+       char *hw_handler_params;
+       wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
        unsigned pg_init_retries;       /* Number of times to retry pg_init */
        unsigned pg_init_delay_msecs;   /* Number of msecs before pg_init retry */
-
-       atomic_t nr_valid_paths;        /* Total number of usable paths */
        atomic_t pg_init_in_progress;   /* Only one pg_init allowed at once */
        atomic_t pg_init_count;         /* Number of times pg_init called */
 
-       enum dm_queue_mode queue_mode;
-
        struct mutex work_mutex;
        struct work_struct trigger_event;
+       struct dm_target *ti;
 
        struct work_struct process_queued_bios;
        struct bio_list queued_bios;
@@ -135,10 +129,10 @@ static struct pgpath *alloc_pgpath(void)
 {
        struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
 
-       if (pgpath) {
-               pgpath->is_active = true;
-               INIT_DELAYED_WORK(&pgpath->activate_path, activate_path_work);
-       }
+       if (!pgpath)
+               return NULL;
+
+       pgpath->is_active = true;
 
        return pgpath;
 }
@@ -193,13 +187,8 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
        if (m) {
                INIT_LIST_HEAD(&m->priority_groups);
                spin_lock_init(&m->lock);
-               set_bit(MPATHF_QUEUE_IO, &m->flags);
                atomic_set(&m->nr_valid_paths, 0);
-               atomic_set(&m->pg_init_in_progress, 0);
-               atomic_set(&m->pg_init_count, 0);
-               m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
                INIT_WORK(&m->trigger_event, trigger_event);
-               init_waitqueue_head(&m->pg_init_wait);
                mutex_init(&m->work_mutex);
 
                m->queue_mode = DM_TYPE_NONE;
@@ -221,13 +210,26 @@ static int alloc_multipath_stage2(struct dm_target *ti, struct multipath *m)
                        m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
                else
                        m->queue_mode = DM_TYPE_REQUEST_BASED;
-       } else if (m->queue_mode == DM_TYPE_BIO_BASED) {
+
+       } else if (m->queue_mode == DM_TYPE_BIO_BASED ||
+                  m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
                INIT_WORK(&m->process_queued_bios, process_queued_bios);
-               /*
-                * bio-based doesn't support any direct scsi_dh management;
-                * it just discovers if a scsi_dh is attached.
-                */
-               set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
+
+               if (m->queue_mode == DM_TYPE_BIO_BASED) {
+                       /*
+                        * bio-based doesn't support any direct scsi_dh management;
+                        * it just discovers if a scsi_dh is attached.
+                        */
+                       set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
+               }
+       }
+
+       if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+               set_bit(MPATHF_QUEUE_IO, &m->flags);
+               atomic_set(&m->pg_init_in_progress, 0);
+               atomic_set(&m->pg_init_count, 0);
+               m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
+               init_waitqueue_head(&m->pg_init_wait);
        }
 
        dm_table_set_type(ti->table, m->queue_mode);
@@ -246,6 +248,7 @@ static void free_multipath(struct multipath *m)
 
        kfree(m->hw_handler_name);
        kfree(m->hw_handler_params);
+       mutex_destroy(&m->work_mutex);
        kfree(m);
 }
 
@@ -264,29 +267,23 @@ static struct dm_mpath_io *get_mpio_from_bio(struct bio *bio)
        return dm_per_bio_data(bio, multipath_per_bio_data_size());
 }
 
-static struct dm_bio_details *get_bio_details_from_bio(struct bio *bio)
+static struct dm_bio_details *get_bio_details_from_mpio(struct dm_mpath_io *mpio)
 {
        /* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */
-       struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
        void *bio_details = mpio + 1;
-
        return bio_details;
 }
 
-static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p,
-                                       struct dm_bio_details **bio_details_p)
+static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p)
 {
        struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
-       struct dm_bio_details *bio_details = get_bio_details_from_bio(bio);
+       struct dm_bio_details *bio_details = get_bio_details_from_mpio(mpio);
 
-       memset(mpio, 0, sizeof(*mpio));
-       memset(bio_details, 0, sizeof(*bio_details));
-       dm_bio_record(bio_details, bio);
+       mpio->nr_bytes = bio->bi_iter.bi_size;
+       mpio->pgpath = NULL;
+       *mpio_p = mpio;
 
-       if (mpio_p)
-               *mpio_p = mpio;
-       if (bio_details_p)
-               *bio_details_p = bio_details;
+       dm_bio_record(bio_details, bio);
 }
 
 /*-----------------------------------------------
@@ -340,6 +337,9 @@ static void __switch_pg(struct multipath *m, struct priority_group *pg)
 {
        m->current_pg = pg;
 
+       if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+               return;
+
        /* Must we initialise the PG first, and queue I/O till it's ready? */
        if (m->hw_handler_name) {
                set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
@@ -385,7 +385,8 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
        unsigned bypassed = 1;
 
        if (!atomic_read(&m->nr_valid_paths)) {
-               clear_bit(MPATHF_QUEUE_IO, &m->flags);
+               if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
+                       clear_bit(MPATHF_QUEUE_IO, &m->flags);
                goto failed;
        }
 
@@ -516,12 +517,10 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
                return DM_MAPIO_KILL;
        } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
                   test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
-               if (pg_init_all_paths(m))
-                       return DM_MAPIO_DELAY_REQUEUE;
-               return DM_MAPIO_REQUEUE;
+               pg_init_all_paths(m);
+               return DM_MAPIO_DELAY_REQUEUE;
        }
 
-       memset(mpio, 0, sizeof(*mpio));
        mpio->pgpath = pgpath;
        mpio->nr_bytes = nr_bytes;
 
@@ -530,12 +529,23 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
        clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
        if (IS_ERR(clone)) {
                /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
-               bool queue_dying = blk_queue_dying(q);
-               if (queue_dying) {
+               if (blk_queue_dying(q)) {
                        atomic_inc(&m->pg_init_in_progress);
                        activate_or_offline_path(pgpath);
+                       return DM_MAPIO_DELAY_REQUEUE;
                }
-               return DM_MAPIO_DELAY_REQUEUE;
+
+               /*
+                * blk-mq's SCHED_RESTART can cover this requeue, so we
+                * needn't deal with it by DELAY_REQUEUE. More importantly,
+                * we have to return DM_MAPIO_REQUEUE so that blk-mq can
+                * get the queue busy feedback (via BLK_STS_RESOURCE),
+                * otherwise I/O merging can suffer.
+                */
+               if (q->mq_ops)
+                       return DM_MAPIO_REQUEUE;
+               else
+                       return DM_MAPIO_DELAY_REQUEUE;
        }
        clone->bio = clone->biotail = NULL;
        clone->rq_disk = bdev->bd_disk;
@@ -557,9 +567,9 @@ static void multipath_release_clone(struct request *clone)
 /*
  * Map cloned bios (bio-based multipath)
  */
-static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_mpath_io *mpio)
+
+static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
 {
-       size_t nr_bytes = bio->bi_iter.bi_size;
        struct pgpath *pgpath;
        unsigned long flags;
        bool queue_io;
@@ -568,7 +578,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
        pgpath = READ_ONCE(m->current_pgpath);
        queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
        if (!pgpath || !queue_io)
-               pgpath = choose_pgpath(m, nr_bytes);
+               pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
 
        if ((pgpath && queue_io) ||
            (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
@@ -576,14 +586,62 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
                spin_lock_irqsave(&m->lock, flags);
                bio_list_add(&m->queued_bios, bio);
                spin_unlock_irqrestore(&m->lock, flags);
+
                /* PG_INIT_REQUIRED cannot be set without QUEUE_IO */
                if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
                        pg_init_all_paths(m);
                else if (!queue_io)
                        queue_work(kmultipathd, &m->process_queued_bios);
-               return DM_MAPIO_SUBMITTED;
+
+               return ERR_PTR(-EAGAIN);
        }
 
+       return pgpath;
+}
+
+static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
+{
+       struct pgpath *pgpath;
+       unsigned long flags;
+
+       /* Do we need to select a new pgpath? */
+       /*
+        * FIXME: currently only switching path if no path (due to failure, etc)
+        * - which negates the point of using a path selector
+        */
+       pgpath = READ_ONCE(m->current_pgpath);
+       if (!pgpath)
+               pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
+
+       if (!pgpath) {
+               if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
+                       /* Queue for the daemon to resubmit */
+                       spin_lock_irqsave(&m->lock, flags);
+                       bio_list_add(&m->queued_bios, bio);
+                       spin_unlock_irqrestore(&m->lock, flags);
+                       queue_work(kmultipathd, &m->process_queued_bios);
+
+                       return ERR_PTR(-EAGAIN);
+               }
+               return NULL;
+       }
+
+       return pgpath;
+}
+
+static int __multipath_map_bio(struct multipath *m, struct bio *bio,
+                              struct dm_mpath_io *mpio)
+{
+       struct pgpath *pgpath;
+
+       if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+               pgpath = __map_bio_nvme(m, bio);
+       else
+               pgpath = __map_bio(m, bio);
+
+       if (IS_ERR(pgpath))
+               return DM_MAPIO_SUBMITTED;
+
        if (!pgpath) {
                if (must_push_back_bio(m))
                        return DM_MAPIO_REQUEUE;
@@ -592,7 +650,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
        }
 
        mpio->pgpath = pgpath;
-       mpio->nr_bytes = nr_bytes;
 
        bio->bi_status = 0;
        bio_set_dev(bio, pgpath->path.dev->bdev);
@@ -601,7 +658,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
        if (pgpath->pg->ps.type->start_io)
                pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
                                              &pgpath->path,
-                                             nr_bytes);
+                                             mpio->nr_bytes);
        return DM_MAPIO_REMAPPED;
 }
 
@@ -610,8 +667,7 @@ static int multipath_map_bio(struct dm_target *ti, struct bio *bio)
        struct multipath *m = ti->private;
        struct dm_mpath_io *mpio = NULL;
 
-       multipath_init_per_bio_data(bio, &mpio, NULL);
-
+       multipath_init_per_bio_data(bio, &mpio);
        return __multipath_map_bio(m, bio, mpio);
 }
 
@@ -619,7 +675,8 @@ static void process_queued_io_list(struct multipath *m)
 {
        if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED)
                dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table));
-       else if (m->queue_mode == DM_TYPE_BIO_BASED)
+       else if (m->queue_mode == DM_TYPE_BIO_BASED ||
+                m->queue_mode == DM_TYPE_NVME_BIO_BASED)
                queue_work(kmultipathd, &m->process_queued_bios);
 }
 
@@ -649,7 +706,9 @@ static void process_queued_bios(struct work_struct *work)
 
        blk_start_plug(&plug);
        while ((bio = bio_list_pop(&bios))) {
-               r = __multipath_map_bio(m, bio, get_mpio_from_bio(bio));
+               struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
+               dm_bio_restore(get_bio_details_from_mpio(mpio), bio);
+               r = __multipath_map_bio(m, bio, mpio);
                switch (r) {
                case DM_MAPIO_KILL:
                        bio->bi_status = BLK_STS_IOERR;
@@ -752,34 +811,11 @@ static int parse_path_selector(struct dm_arg_set *as, struct priority_group *pg,
        return 0;
 }
 
-static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
-                              struct dm_target *ti)
+static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **error)
 {
-       int r;
-       struct pgpath *p;
-       struct multipath *m = ti->private;
-       struct request_queue *q = NULL;
+       struct request_queue *q = bdev_get_queue(bdev);
        const char *attached_handler_name;
-
-       /* we need at least a path arg */
-       if (as->argc < 1) {
-               ti->error = "no device given";
-               return ERR_PTR(-EINVAL);
-       }
-
-       p = alloc_pgpath();
-       if (!p)
-               return ERR_PTR(-ENOMEM);
-
-       r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
-                         &p->path.dev);
-       if (r) {
-               ti->error = "error getting device";
-               goto bad;
-       }
-
-       if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
-               q = bdev_get_queue(p->path.dev->bdev);
+       int r;
 
        if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
@@ -811,26 +847,59 @@ retain:
                        char b[BDEVNAME_SIZE];
 
                        printk(KERN_INFO "dm-mpath: retaining handler on device %s\n",
-                               bdevname(p->path.dev->bdev, b));
+                              bdevname(bdev, b));
                        goto retain;
                }
                if (r < 0) {
-                       ti->error = "error attaching hardware handler";
-                       dm_put_device(ti, p->path.dev);
-                       goto bad;
+                       *error = "error attaching hardware handler";
+                       return r;
                }
 
                if (m->hw_handler_params) {
                        r = scsi_dh_set_params(q, m->hw_handler_params);
                        if (r < 0) {
-                               ti->error = "unable to set hardware "
-                                                       "handler parameters";
-                               dm_put_device(ti, p->path.dev);
-                               goto bad;
+                               *error = "unable to set hardware handler parameters";
+                               return r;
                        }
                }
        }
 
+       return 0;
+}
+
+static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
+                                struct dm_target *ti)
+{
+       int r;
+       struct pgpath *p;
+       struct multipath *m = ti->private;
+
+       /* we need at least a path arg */
+       if (as->argc < 1) {
+               ti->error = "no device given";
+               return ERR_PTR(-EINVAL);
+       }
+
+       p = alloc_pgpath();
+       if (!p)
+               return ERR_PTR(-ENOMEM);
+
+       r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
+                         &p->path.dev);
+       if (r) {
+               ti->error = "error getting device";
+               goto bad;
+       }
+
+       if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+               INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
+               r = setup_scsi_dh(p->path.dev->bdev, m, &ti->error);
+               if (r) {
+                       dm_put_device(ti, p->path.dev);
+                       goto bad;
+               }
+       }
+
        r = ps->type->add_path(ps, &p->path, as->argc, as->argv, &ti->error);
        if (r) {
                dm_put_device(ti, p->path.dev);
@@ -838,7 +907,6 @@ retain:
        }
 
        return p;
-
  bad:
        free_pgpath(p);
        return ERR_PTR(r);
@@ -933,7 +1001,8 @@ static int parse_hw_handler(struct dm_arg_set *as, struct multipath *m)
        if (!hw_argc)
                return 0;
 
-       if (m->queue_mode == DM_TYPE_BIO_BASED) {
+       if (m->queue_mode == DM_TYPE_BIO_BASED ||
+           m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
                dm_consume_args(as, hw_argc);
                DMERR("bio-based multipath doesn't allow hardware handler args");
                return 0;
@@ -1022,6 +1091,8 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
 
                        if (!strcasecmp(queue_mode_name, "bio"))
                                m->queue_mode = DM_TYPE_BIO_BASED;
+                       else if (!strcasecmp(queue_mode_name, "nvme"))
+                               m->queue_mode = DM_TYPE_NVME_BIO_BASED;
                        else if (!strcasecmp(queue_mode_name, "rq"))
                                m->queue_mode = DM_TYPE_REQUEST_BASED;
                        else if (!strcasecmp(queue_mode_name, "mq"))
@@ -1122,7 +1193,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
        ti->num_discard_bios = 1;
        ti->num_write_same_bios = 1;
        ti->num_write_zeroes_bios = 1;
-       if (m->queue_mode == DM_TYPE_BIO_BASED)
+       if (m->queue_mode == DM_TYPE_BIO_BASED || m->queue_mode == DM_TYPE_NVME_BIO_BASED)
                ti->per_io_data_size = multipath_per_bio_data_size();
        else
                ti->per_io_data_size = sizeof(struct dm_mpath_io);
@@ -1151,16 +1222,19 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
 
 static void flush_multipath_work(struct multipath *m)
 {
-       set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
-       smp_mb__after_atomic();
+       if (m->hw_handler_name) {
+               set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+               smp_mb__after_atomic();
+
+               flush_workqueue(kmpath_handlerd);
+               multipath_wait_for_pg_init_completion(m);
+
+               clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+               smp_mb__after_atomic();
+       }
 
-       flush_workqueue(kmpath_handlerd);
-       multipath_wait_for_pg_init_completion(m);
        flush_workqueue(kmultipathd);
        flush_work(&m->trigger_event);
-
-       clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
-       smp_mb__after_atomic();
 }
 
 static void multipath_dtr(struct dm_target *ti)
@@ -1496,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
        if (error && blk_path_error(error)) {
                struct multipath *m = ti->private;
 
-               r = DM_ENDIO_REQUEUE;
+               if (error == BLK_STS_RESOURCE)
+                       r = DM_ENDIO_DELAY_REQUEUE;
+               else
+                       r = DM_ENDIO_REQUEUE;
 
                if (pgpath)
                        fail_path(pgpath);
@@ -1521,7 +1598,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 }
 
 static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
-               blk_status_t *error)
+                               blk_status_t *error)
 {
        struct multipath *m = ti->private;
        struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
@@ -1546,9 +1623,6 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
                goto done;
        }
 
-       /* Queue for the daemon to resubmit */
-       dm_bio_restore(get_bio_details_from_bio(clone), clone);
-
        spin_lock_irqsave(&m->lock, flags);
        bio_list_add(&m->queued_bios, clone);
        spin_unlock_irqrestore(&m->lock, flags);
@@ -1656,6 +1730,9 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
                        case DM_TYPE_BIO_BASED:
                                DMEMIT("queue_mode bio ");
                                break;
+                       case DM_TYPE_NVME_BIO_BASED:
+                               DMEMIT("queue_mode nvme ");
+                               break;
                        case DM_TYPE_MQ_REQUEST_BASED:
                                DMEMIT("queue_mode mq ");
                                break;
index 23f1786417946e9ee8375a051544e1024db1f7d1..969c4f1a3633636c1f5f49ad93057a2bf59a19d0 100644 (file)
@@ -195,9 +195,6 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
        if (list_empty(&s->valid_paths))
                goto out;
 
-       /* Change preferred (first in list) path to evenly balance. */
-       list_move_tail(s->valid_paths.next, &s->valid_paths);
-
        list_for_each_entry(pi, &s->valid_paths, list) {
                if (!best ||
                    (atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
@@ -210,6 +207,9 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
        if (!best)
                goto out;
 
+       /* Move most recently used to least preferred to evenly balance. */
+       list_move_tail(&best->list, &s->valid_paths);
+
        ret = best->path;
 out:
        spin_unlock_irqrestore(&s->lock, flags);
index e5ef0757fe230a8b4712c148564246a287e04739..7ef469e902c620126b95f69d899e528ae114b3bc 100644 (file)
@@ -29,6 +29,9 @@
  */
 #define        MIN_RAID456_JOURNAL_SPACE (4*2048)
 
+/* Global list of all raid sets */
+static LIST_HEAD(raid_sets);
+
 static bool devices_handle_discard_safely = false;
 
 /*
@@ -105,8 +108,6 @@ struct raid_dev {
 #define CTR_FLAG_JOURNAL_DEV           (1 << __CTR_FLAG_JOURNAL_DEV)
 #define CTR_FLAG_JOURNAL_MODE          (1 << __CTR_FLAG_JOURNAL_MODE)
 
-#define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET)
-
 /*
  * Definitions of various constructor flags to
  * be used in checks of valid / invalid flags
@@ -209,6 +210,8 @@ struct raid_dev {
 #define RT_FLAG_UPDATE_SBS             3
 #define RT_FLAG_RESHAPE_RS             4
 #define RT_FLAG_RS_SUSPENDED           5
+#define RT_FLAG_RS_IN_SYNC             6
+#define RT_FLAG_RS_RESYNCING           7
 
 /* Array elements of 64 bit needed for rebuild/failed disk bits */
 #define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8)
@@ -224,8 +227,8 @@ struct rs_layout {
 
 struct raid_set {
        struct dm_target *ti;
+       struct list_head list;
 
-       uint32_t bitmap_loaded;
        uint32_t stripe_cache_entries;
        unsigned long ctr_flags;
        unsigned long runtime_flags;
@@ -270,6 +273,19 @@ static void rs_config_restore(struct raid_set *rs, struct rs_layout *l)
        mddev->new_chunk_sectors = l->new_chunk_sectors;
 }
 
+/* Find any raid_set in active slot for @rs on global list */
+static struct raid_set *rs_find_active(struct raid_set *rs)
+{
+       struct raid_set *r;
+       struct mapped_device *md = dm_table_get_md(rs->ti->table);
+
+       list_for_each_entry(r, &raid_sets, list)
+               if (r != rs && dm_table_get_md(r->ti->table) == md)
+                       return r;
+
+       return NULL;
+}
+
 /* raid10 algorithms (i.e. formats) */
 #define        ALGORITHM_RAID10_DEFAULT        0
 #define        ALGORITHM_RAID10_NEAR           1
@@ -572,7 +588,7 @@ static const char *raid10_md_layout_to_format(int layout)
 }
 
 /* Return md raid10 algorithm for @name */
-static int raid10_name_to_format(const char *name)
+static const int raid10_name_to_format(const char *name)
 {
        if (!strcasecmp(name, "near"))
                return ALGORITHM_RAID10_NEAR;
@@ -675,15 +691,11 @@ static struct raid_type *get_raid_type_by_ll(const int level, const int layout)
        return NULL;
 }
 
-/*
- * Conditionally change bdev capacity of @rs
- * in case of a disk add/remove reshape
- */
-static void rs_set_capacity(struct raid_set *rs)
+/* Adjust rdev sectors */
+static void rs_set_rdev_sectors(struct raid_set *rs)
 {
        struct mddev *mddev = &rs->md;
        struct md_rdev *rdev;
-       struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
 
        /*
         * raid10 sets rdev->sector to the device size, which
@@ -692,8 +704,16 @@ static void rs_set_capacity(struct raid_set *rs)
        rdev_for_each(rdev, mddev)
                if (!test_bit(Journal, &rdev->flags))
                        rdev->sectors = mddev->dev_sectors;
+}
 
-       set_capacity(gendisk, mddev->array_sectors);
+/*
+ * Change bdev capacity of @rs in case of a disk add/remove reshape
+ */
+static void rs_set_capacity(struct raid_set *rs)
+{
+       struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
+
+       set_capacity(gendisk, rs->md.array_sectors);
        revalidate_disk(gendisk);
 }
 
@@ -744,6 +764,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
 
        mddev_init(&rs->md);
 
+       INIT_LIST_HEAD(&rs->list);
        rs->raid_disks = raid_devs;
        rs->delta_disks = 0;
 
@@ -761,6 +782,9 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
        for (i = 0; i < raid_devs; i++)
                md_rdev_init(&rs->dev[i].rdev);
 
+       /* Add @rs to global list. */
+       list_add(&rs->list, &raid_sets);
+
        /*
         * Remaining items to be initialized by further RAID params:
         *  rs->md.persistent
@@ -773,6 +797,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
        return rs;
 }
 
+/* Free all @rs allocations and remove it from global list. */
 static void raid_set_free(struct raid_set *rs)
 {
        int i;
@@ -790,6 +815,8 @@ static void raid_set_free(struct raid_set *rs)
                        dm_put_device(rs->ti, rs->dev[i].data_dev);
        }
 
+       list_del(&rs->list);
+
        kfree(rs);
 }
 
@@ -1002,7 +1029,7 @@ static int validate_raid_redundancy(struct raid_set *rs)
                    !rs->dev[i].rdev.sb_page)
                        rebuild_cnt++;
 
-       switch (rs->raid_type->level) {
+       switch (rs->md.level) {
        case 0:
                break;
        case 1:
@@ -1017,6 +1044,11 @@ static int validate_raid_redundancy(struct raid_set *rs)
                break;
        case 10:
                copies = raid10_md_layout_to_copies(rs->md.new_layout);
+               if (copies < 2) {
+                       DMERR("Bogus raid10 data copies < 2!");
+                       return -EINVAL;
+               }
+
                if (rebuild_cnt < copies)
                        break;
 
@@ -1576,6 +1608,24 @@ static sector_t __rdev_sectors(struct raid_set *rs)
        return 0;
 }
 
+/* Check that calculated dev_sectors fits all component devices. */
+static int _check_data_dev_sectors(struct raid_set *rs)
+{
+       sector_t ds = ~0;
+       struct md_rdev *rdev;
+
+       rdev_for_each(rdev, &rs->md)
+               if (!test_bit(Journal, &rdev->flags) && rdev->bdev) {
+                       ds = min(ds, to_sector(i_size_read(rdev->bdev->bd_inode)));
+                       if (ds < rs->md.dev_sectors) {
+                               rs->ti->error = "Component device(s) too small";
+                               return -EINVAL;
+                       }
+               }
+
+       return 0;
+}
+
 /* Calculate the sectors per device and per array used for @rs */
 static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
 {
@@ -1625,7 +1675,7 @@ static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
        mddev->array_sectors = array_sectors;
        mddev->dev_sectors = dev_sectors;
 
-       return 0;
+       return _check_data_dev_sectors(rs);
 bad:
        rs->ti->error = "Target length not divisible by number of data devices";
        return -EINVAL;
@@ -1674,8 +1724,11 @@ static void do_table_event(struct work_struct *ws)
        struct raid_set *rs = container_of(ws, struct raid_set, md.event_work);
 
        smp_rmb(); /* Make sure we access most actual mddev properties */
-       if (!rs_is_reshaping(rs))
+       if (!rs_is_reshaping(rs)) {
+               if (rs_is_raid10(rs))
+                       rs_set_rdev_sectors(rs);
                rs_set_capacity(rs);
+       }
        dm_table_event(rs->ti->table);
 }
 
@@ -1860,7 +1913,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
        if (rs_takeover_requested(rs))
                return false;
 
-       if (!mddev->level)
+       if (rs_is_raid0(rs))
                return false;
 
        change = mddev->new_layout != mddev->layout ||
@@ -1868,7 +1921,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
                 rs->delta_disks;
 
        /* Historical case to support raid1 reshape without delta disks */
-       if (mddev->level == 1) {
+       if (rs_is_raid1(rs)) {
                if (rs->delta_disks)
                        return !!rs->delta_disks;
 
@@ -1876,7 +1929,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
                       mddev->raid_disks != rs->raid_disks;
        }
 
-       if (mddev->level == 10)
+       if (rs_is_raid10(rs))
                return change &&
                       !__is_raid10_far(mddev->new_layout) &&
                       rs->delta_disks >= 0;
@@ -2340,7 +2393,7 @@ static int super_init_validation(struct raid_set *rs, struct md_rdev *rdev)
                        DMERR("new device%s provided without 'rebuild'",
                              new_devs > 1 ? "s" : "");
                        return -EINVAL;
-               } else if (rs_is_recovering(rs)) {
+               } else if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags) && rs_is_recovering(rs)) {
                        DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)",
                              (unsigned long long) mddev->recovery_cp);
                        return -EINVAL;
@@ -2640,12 +2693,19 @@ static int rs_adjust_data_offsets(struct raid_set *rs)
         * Make sure we got a minimum amount of free sectors per device
         */
        if (rs->data_offset &&
-           to_sector(i_size_read(rdev->bdev->bd_inode)) - rdev->sectors < MIN_FREE_RESHAPE_SPACE) {
+           to_sector(i_size_read(rdev->bdev->bd_inode)) - rs->md.dev_sectors < MIN_FREE_RESHAPE_SPACE) {
                rs->ti->error = data_offset ? "No space for forward reshape" :
                                              "No space for backward reshape";
                return -ENOSPC;
        }
 out:
+       /*
+        * Raise recovery_cp in case data_offset != 0 to
+        * avoid false recovery positives in the constructor.
+        */
+       if (rs->md.recovery_cp < rs->md.dev_sectors)
+               rs->md.recovery_cp += rs->dev[0].rdev.data_offset;
+
        /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */
        rdev_for_each(rdev, &rs->md) {
                if (!test_bit(Journal, &rdev->flags)) {
@@ -2682,14 +2742,14 @@ static int rs_setup_takeover(struct raid_set *rs)
        sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset;
 
        if (rt_is_raid10(rs->raid_type)) {
-               if (mddev->level == 0) {
+               if (rs_is_raid0(rs)) {
                        /* Userspace reordered disks -> adjust raid_disk indexes */
                        __reorder_raid_disk_indexes(rs);
 
                        /* raid0 -> raid10_far layout */
                        mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR,
                                                                   rs->raid10_copies);
-               } else if (mddev->level == 1)
+               } else if (rs_is_raid1(rs))
                        /* raid1 -> raid10_near layout */
                        mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR,
                                                                   rs->raid_disks);
@@ -2777,6 +2837,23 @@ static int rs_prepare_reshape(struct raid_set *rs)
        return 0;
 }
 
+/* Get reshape sectors from data_offsets or raid set */
+static sector_t _get_reshape_sectors(struct raid_set *rs)
+{
+       struct md_rdev *rdev;
+       sector_t reshape_sectors = 0;
+
+       rdev_for_each(rdev, &rs->md)
+               if (!test_bit(Journal, &rdev->flags)) {
+                       reshape_sectors = (rdev->data_offset > rdev->new_data_offset) ?
+                                       rdev->data_offset - rdev->new_data_offset :
+                                       rdev->new_data_offset - rdev->data_offset;
+                       break;
+               }
+
+       return max(reshape_sectors, (sector_t) rs->data_offset);
+}
+
 /*
  *
  * - change raid layout
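
The new _get_reshape_sectors() helper above derives the out-of-place headroom for a reshape from the gap between a component's current and new data offset, falling back to the constructor-supplied data_offset; raid_ctr() later uses the same value to tell a plain resize apart from a table reload during an ongoing reshape. A minimal user-space sketch of that arithmetic (fake_rdev and reshape_sectors() are illustrative stand-ins, not the kernel types):

    #include <stdio.h>

    typedef unsigned long long sector_t;

    /* Illustrative stand-in for one raid component device. */
    struct fake_rdev { sector_t data_offset, new_data_offset; };

    /*
     * Absolute distance between the old and new data offset of the first
     * device, or the constructor-supplied offset if that is larger.
     */
    static sector_t reshape_sectors(const struct fake_rdev *rdevs, int n,
                                    sector_t ctr_data_offset)
    {
        sector_t d = 0;

        if (n > 0) {
            const struct fake_rdev *r = &rdevs[0];

            d = (r->data_offset > r->new_data_offset) ?
                r->data_offset - r->new_data_offset :
                r->new_data_offset - r->data_offset;
        }
        return d > ctr_data_offset ? d : ctr_data_offset;
    }

    int main(void)
    {
        struct fake_rdev devs[2] = { { 8192, 0 }, { 8192, 0 } };

        /* 8192 sectors of out-of-place space are available for the reshape. */
        printf("%llu\n", reshape_sectors(devs, 2, 0));
        return 0;
    }
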
@@ -2788,6 +2865,7 @@ static int rs_setup_reshape(struct raid_set *rs)
 {
        int r = 0;
        unsigned int cur_raid_devs, d;
+       sector_t reshape_sectors = _get_reshape_sectors(rs);
        struct mddev *mddev = &rs->md;
        struct md_rdev *rdev;
 
@@ -2804,13 +2882,13 @@ static int rs_setup_reshape(struct raid_set *rs)
        /*
         * Adjust array size:
         *
-        * - in case of adding disks, array size has
+        * - in case of adding disk(s), array size has
         *   to grow after the disk adding reshape,
         *   which'll happen in the event handler;
         *   reshape will happen forward, so space has to
         *   be available at the beginning of each disk
         *
-        * - in case of removing disks, array size
+        * - in case of removing disk(s), array size
         *   has to shrink before starting the reshape,
         *   which'll happen here;
         *   reshape will happen backward, so space has to
@@ -2841,7 +2919,7 @@ static int rs_setup_reshape(struct raid_set *rs)
                        rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector;
                }
 
-               mddev->reshape_backwards = 0; /* adding disks -> forward reshape */
+               mddev->reshape_backwards = 0; /* adding disk(s) -> forward reshape */
 
        /* Remove disk(s) */
        } else if (rs->delta_disks < 0) {
@@ -2874,6 +2952,15 @@ static int rs_setup_reshape(struct raid_set *rs)
                mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1;
        }
 
+       /*
+        * Adjust device size for forward reshape
+        * because md_finish_reshape() reduces it.
+        */
+       if (!mddev->reshape_backwards)
+               rdev_for_each(rdev, &rs->md)
+                       if (!test_bit(Journal, &rdev->flags))
+                               rdev->sectors += reshape_sectors;
+
        return r;
 }
 
@@ -2890,7 +2977,7 @@ static void configure_discard_support(struct raid_set *rs)
        /*
         * XXX: RAID level 4,5,6 require zeroing for safety.
         */
-       raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6);
+       raid456 = rs_is_raid456(rs);
 
        for (i = 0; i < rs->raid_disks; i++) {
                struct request_queue *q;
@@ -2915,7 +3002,7 @@ static void configure_discard_support(struct raid_set *rs)
         * RAID1 and RAID10 personalities require bio splitting,
         * RAID0/4/5/6 don't and process large discard bios properly.
         */
-       ti->split_discard_bios = !!(rs->md.level == 1 || rs->md.level == 10);
+       ti->split_discard_bios = !!(rs_is_raid1(rs) || rs_is_raid10(rs));
        ti->num_discard_bios = 1;
 }
 
@@ -2935,10 +3022,10 @@ static void configure_discard_support(struct raid_set *rs)
 static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
        int r;
-       bool resize;
+       bool resize = false;
        struct raid_type *rt;
        unsigned int num_raid_params, num_raid_devs;
-       sector_t calculated_dev_sectors, rdev_sectors;
+       sector_t calculated_dev_sectors, rdev_sectors, reshape_sectors;
        struct raid_set *rs = NULL;
        const char *arg;
        struct rs_layout rs_layout;
@@ -3021,7 +3108,10 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
                goto bad;
        }
 
-       resize = calculated_dev_sectors != rdev_sectors;
+
+       reshape_sectors = _get_reshape_sectors(rs);
+       if (calculated_dev_sectors != rdev_sectors)
+               resize = calculated_dev_sectors != (reshape_sectors ? rdev_sectors - reshape_sectors : rdev_sectors);
 
        INIT_WORK(&rs->md.event_work, do_table_event);
        ti->private = rs;
@@ -3105,19 +3195,22 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
                        goto bad;
                }
 
-               /*
-                 * We can only prepare for a reshape here, because the
-                 * raid set needs to run to provide the repective reshape
-                 * check functions via its MD personality instance.
-                 *
-                 * So do the reshape check after md_run() succeeded.
-                 */
-               r = rs_prepare_reshape(rs);
-               if (r)
-                       return r;
+               /* Out-of-place space has to be available to allow for a reshape unless raid1! */
+               if (reshape_sectors || rs_is_raid1(rs)) {
+                       /*
+                         * We can only prepare for a reshape here, because the
+                         * raid set needs to run to provide the respective reshape
+                         * check functions via its MD personality instance.
+                         *
+                         * So do the reshape check after md_run() succeeded.
+                         */
+                       r = rs_prepare_reshape(rs);
+                       if (r)
+                               return r;
 
-               /* Reshaping ain't recovery, so disable recovery */
-               rs_setup_recovery(rs, MaxSector);
+                       /* Reshaping ain't recovery, so disable recovery */
+                       rs_setup_recovery(rs, MaxSector);
+               }
                rs_set_cur(rs);
        } else {
                /* May not set recovery when a device rebuild is requested */
@@ -3144,7 +3237,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        mddev_lock_nointr(&rs->md);
        r = md_run(&rs->md);
        rs->md.in_sync = 0; /* Assume already marked dirty */
-
        if (r) {
                ti->error = "Failed to run raid array";
                mddev_unlock(&rs->md);
@@ -3248,25 +3340,27 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
 }
 
 /* Return string describing the current sync action of @mddev */
-static const char *decipher_sync_action(struct mddev *mddev)
+static const char *decipher_sync_action(struct mddev *mddev, unsigned long recovery)
 {
-       if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
+       if (test_bit(MD_RECOVERY_FROZEN, &recovery))
                return "frozen";
 
-       if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
-           (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))) {
-               if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+       /* The MD sync thread can be done with io but still be running */
+       if (!test_bit(MD_RECOVERY_DONE, &recovery) &&
+           (test_bit(MD_RECOVERY_RUNNING, &recovery) ||
+            (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery)))) {
+               if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
                        return "reshape";
 
-               if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-                       if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+               if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
+                       if (!test_bit(MD_RECOVERY_REQUESTED, &recovery))
                                return "resync";
-                       else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
+                       else if (test_bit(MD_RECOVERY_CHECK, &recovery))
                                return "check";
                        return "repair";
                }
 
-               if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery))
+               if (test_bit(MD_RECOVERY_RECOVER, &recovery))
                        return "recover";
        }
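
decipher_sync_action() now decodes a snapshot of the recovery bits taken once in raid_status(), so the reported action cannot drift from the health characters derived from the same snapshot, and a sync thread that has finished its I/O but not yet exited (MD_RECOVERY_DONE) is reported as idle. A standalone sketch of this style of flag decoding (the bit names and BIT_SET() macro are made up; the real MD_RECOVERY_* constants and test_bit() live in the MD headers):

    #include <stdio.h>

    /* Made-up recovery-state bit positions; not the MD_RECOVERY_* values. */
    enum { FROZEN, RUNNING, NEEDED, RESHAPE, SYNC, REQUESTED, CHECK, RECOVER, DONE };
    #define BIT_SET(flags, b)   (((flags) >> (b)) & 1UL)

    static const char *sync_action(unsigned long recovery, int read_only)
    {
        if (BIT_SET(recovery, FROZEN))
            return "frozen";

        /* A sync thread can be done with its I/O but still be running. */
        if (!BIT_SET(recovery, DONE) &&
            (BIT_SET(recovery, RUNNING) ||
             (!read_only && BIT_SET(recovery, NEEDED)))) {
            if (BIT_SET(recovery, RESHAPE))
                return "reshape";
            if (BIT_SET(recovery, SYNC))
                return !BIT_SET(recovery, REQUESTED) ? "resync" :
                       (BIT_SET(recovery, CHECK) ? "check" : "repair");
            if (BIT_SET(recovery, RECOVER))
                return "recover";
        }
        return "idle";
    }

    int main(void)
    {
        unsigned long rec = (1UL << RUNNING) | (1UL << SYNC);

        printf("%s\n", sync_action(rec, 0));    /* prints "resync" */
        return 0;
    }
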
 
@@ -3283,7 +3377,7 @@ static const char *decipher_sync_action(struct mddev *mddev)
  *  'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device
  *  '-' = Non-existing device (i.e. uspace passed '- -' into the ctr)
  */
-static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync)
+static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev)
 {
        if (!rdev->bdev)
                return "-";
@@ -3291,85 +3385,108 @@ static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev,
                return "D";
        else if (test_bit(Journal, &rdev->flags))
                return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a";
-       else if (!array_in_sync || !test_bit(In_sync, &rdev->flags))
+       else if (test_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags) ||
+                (!test_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags) &&
+                 !test_bit(In_sync, &rdev->flags)))
                return "a";
        else
                return "A";
 }
 
-/* Helper to return resync/reshape progress for @rs and @array_in_sync */
-static sector_t rs_get_progress(struct raid_set *rs,
-                               sector_t resync_max_sectors, bool *array_in_sync)
+/* Helper to return resync/reshape progress for @rs and set runtime flags reflecting whether the raid set is in sync or resynchronizing */
+static sector_t rs_get_progress(struct raid_set *rs, unsigned long recovery,
+                               sector_t resync_max_sectors)
 {
-       sector_t r, curr_resync_completed;
+       sector_t r;
        struct mddev *mddev = &rs->md;
 
-       curr_resync_completed = mddev->curr_resync_completed ?: mddev->recovery_cp;
-       *array_in_sync = false;
+       clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
+       clear_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
 
        if (rs_is_raid0(rs)) {
                r = resync_max_sectors;
-               *array_in_sync = true;
+               set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
 
        } else {
-               r = mddev->reshape_position;
-
-               /* Reshape is relative to the array size */
-               if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ||
-                   r != MaxSector) {
-                       if (r == MaxSector) {
-                               *array_in_sync = true;
-                               r = resync_max_sectors;
-                       } else {
-                               /* Got to reverse on backward reshape */
-                               if (mddev->reshape_backwards)
-                                       r = mddev->array_sectors - r;
-
-                               /* Devide by # of data stripes */
-                               sector_div(r, mddev_data_stripes(rs));
-                       }
-
-               /* Sync is relative to the component device size */
-               } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-                       r = curr_resync_completed;
+               if (test_bit(MD_RECOVERY_NEEDED, &recovery) ||
+                   test_bit(MD_RECOVERY_RESHAPE, &recovery) ||
+                   test_bit(MD_RECOVERY_RUNNING, &recovery))
+                       r = mddev->curr_resync_completed;
                else
                        r = mddev->recovery_cp;
 
-               if ((r == MaxSector) ||
-                   (test_bit(MD_RECOVERY_DONE, &mddev->recovery) &&
-                    (mddev->curr_resync_completed == resync_max_sectors))) {
+               if (r >= resync_max_sectors &&
+                   (!test_bit(MD_RECOVERY_REQUESTED, &recovery) ||
+                    (!test_bit(MD_RECOVERY_FROZEN, &recovery) &&
+                     !test_bit(MD_RECOVERY_NEEDED, &recovery) &&
+                     !test_bit(MD_RECOVERY_RUNNING, &recovery)))) {
                        /*
                         * Sync complete.
                         */
-                       *array_in_sync = true;
-                       r = resync_max_sectors;
-               } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
+                       /* In case we have finished recovering, the array is in sync. */
+                       if (test_bit(MD_RECOVERY_RECOVER, &recovery))
+                               set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
+
+               } else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) {
+                       /*
+                        * In case we are recovering, the array is not in sync
+                        * and health chars should show the recovering legs.
+                        */
+                       ;
+
+               } else if (test_bit(MD_RECOVERY_SYNC, &recovery) &&
+                          !test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
+                       /*
+                        * If "resync" is occurring, the raid set
+                        * is or may be out of sync, hence the health
+                        * characters shall be 'a'.
+                        */
+                       set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
+
+               } else if (test_bit(MD_RECOVERY_RESHAPE, &recovery) &&
+                          !test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
+                       /*
+                        * If "reshape" is occurring, the raid set
+                        * is or may be out of sync, hence the health
+                        * characters shall be 'a'.
+                        */
+                       set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
+
+               } else if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
                        /*
                         * If "check" or "repair" is occurring, the raid set has
                         * undergone an initial sync and the health characters
                         * should not be 'a' anymore.
                         */
-                       *array_in_sync = true;
+                       set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
+
                } else {
                        struct md_rdev *rdev;
 
+                       /*
+                        * We are idle and recovery is needed, so prevent an 'A' chars race
+                        * caused by components still set to in-sync by the constructor.
+                        */
+                       if (test_bit(MD_RECOVERY_NEEDED, &recovery))
+                               set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
+
                        /*
                         * The raid set may be doing an initial sync, or it may
                         * be rebuilding individual components.  If all the
                         * devices are In_sync, then it is the raid set that is
                         * being initialized.
                         */
+                       set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
                        rdev_for_each(rdev, mddev)
                                if (!test_bit(Journal, &rdev->flags) &&
-                                   !test_bit(In_sync, &rdev->flags))
-                                       *array_in_sync = true;
-#if 0
-                       r = 0; /* HM FIXME: TESTME: https://bugzilla.redhat.com/show_bug.cgi?id=1210637 ? */
-#endif
+                                   !test_bit(In_sync, &rdev->flags)) {
+                                       clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
+                                       break;
+                               }
                }
        }
 
-       return r;
+       return min(r, resync_max_sectors);
 }
 
 /* Helper to return @dev name or "-" if !@dev */
@@ -3385,7 +3502,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
        struct mddev *mddev = &rs->md;
        struct r5conf *conf = mddev->private;
        int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0;
-       bool array_in_sync;
+       unsigned long recovery;
        unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */
        unsigned int sz = 0;
        unsigned int rebuild_disks;
@@ -3405,17 +3522,18 @@ static void raid_status(struct dm_target *ti, status_type_t type,
 
                /* Access most recent mddev properties for status output */
                smp_rmb();
+               recovery = rs->md.recovery;
                /* Get sensible max sectors even if raid set not yet started */
                resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ?
                                      mddev->resync_max_sectors : mddev->dev_sectors;
-               progress = rs_get_progress(rs, resync_max_sectors, &array_in_sync);
+               progress = rs_get_progress(rs, recovery, resync_max_sectors);
                resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ?
                                    atomic64_read(&mddev->resync_mismatches) : 0;
-               sync_action = decipher_sync_action(&rs->md);
+               sync_action = decipher_sync_action(&rs->md, recovery);
 
                /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */
                for (i = 0; i < rs->raid_disks; i++)
-                       DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync));
+                       DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev));
 
                /*
                 * In-sync/Reshape ratio:
@@ -3466,7 +3584,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
                 * v1.10.0+:
                 */
                DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ?
-                             __raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-");
+                             __raid_dev_status(rs, &rs->journal_dev.rdev) : "-");
                break;
 
        case STATUSTYPE_TABLE:
@@ -3622,24 +3740,19 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
        blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs));
 }
 
-static void raid_presuspend(struct dm_target *ti)
-{
-       struct raid_set *rs = ti->private;
-
-       md_stop_writes(&rs->md);
-}
-
 static void raid_postsuspend(struct dm_target *ti)
 {
        struct raid_set *rs = ti->private;
 
        if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+               /* Writes have to be stopped before suspending to avoid deadlocks. */
+               if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
+                       md_stop_writes(&rs->md);
+
                mddev_lock_nointr(&rs->md);
                mddev_suspend(&rs->md);
                mddev_unlock(&rs->md);
        }
-
-       rs->md.ro = 1;
 }
 
 static void attempt_restore_of_faulty_devices(struct raid_set *rs)
@@ -3816,10 +3929,33 @@ static int raid_preresume(struct dm_target *ti)
        struct raid_set *rs = ti->private;
        struct mddev *mddev = &rs->md;
 
-       /* This is a resume after a suspend of the set -> it's already started */
+       /* This is a resume after a suspend of the set -> it's already started. */
        if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags))
                return 0;
 
+       if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags)) {
+               struct raid_set *rs_active = rs_find_active(rs);
+
+               if (rs_active) {
+                       /*
+                        * In case no rebuilds have been requested
+                        * and an active table slot exists, copy
+                        * current resynchronization completed and
+                        * reshape position pointers across from
+                        * suspended raid set in the active slot.
+                        *
+                        * This resumes the new mapping at current
+                        * offsets to continue recover/reshape without
+                        * necessarily redoing a raid set partially or
+                        * causing data corruption in case of a reshape.
+                        */
+                       if (rs_active->md.curr_resync_completed != MaxSector)
+                               mddev->curr_resync_completed = rs_active->md.curr_resync_completed;
+                       if (rs_active->md.reshape_position != MaxSector)
+                               mddev->reshape_position = rs_active->md.reshape_position;
+               }
+       }
+
        /*
         * The superblocks need to be updated on disk if the
         * array is new or new devices got added (thus zeroed
@@ -3851,11 +3987,10 @@ static int raid_preresume(struct dm_target *ti)
                mddev->resync_min = mddev->recovery_cp;
        }
 
-       rs_set_capacity(rs);
-
        /* Check for any reshape request unless new raid set */
-       if (test_and_clear_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) {
+       if (test_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) {
                /* Initiate a reshape. */
+               rs_set_rdev_sectors(rs);
                mddev_lock_nointr(mddev);
                r = rs_start_reshape(rs);
                mddev_unlock(mddev);
@@ -3881,21 +4016,15 @@ static void raid_resume(struct dm_target *ti)
                attempt_restore_of_faulty_devices(rs);
        }
 
-       mddev->ro = 0;
-       mddev->in_sync = 0;
-
-       /*
-        * Keep the RAID set frozen if reshape/rebuild flags are set.
-        * The RAID set is unfrozen once the next table load/resume,
-        * which clears the reshape/rebuild flags, occurs.
-        * This ensures that the constructor for the inactive table
-        * retrieves an up-to-date reshape_position.
-        */
-       if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS))
-               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-
        if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+               /* Only reduce raid set size before running a disk removing reshape. */
+               if (mddev->delta_disks < 0)
+                       rs_set_capacity(rs);
+
                mddev_lock_nointr(mddev);
+               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+               mddev->ro = 0;
+               mddev->in_sync = 0;
                mddev_resume(mddev);
                mddev_unlock(mddev);
        }
@@ -3903,7 +4032,7 @@ static void raid_resume(struct dm_target *ti)
 
 static struct target_type raid_target = {
        .name = "raid",
-       .version = {1, 13, 0},
+       .version = {1, 13, 2},
        .module = THIS_MODULE,
        .ctr = raid_ctr,
        .dtr = raid_dtr,
@@ -3912,7 +4041,6 @@ static struct target_type raid_target = {
        .message = raid_message,
        .iterate_devices = raid_iterate_devices,
        .io_hints = raid_io_hints,
-       .presuspend = raid_presuspend,
        .postsuspend = raid_postsuspend,
        .preresume = raid_preresume,
        .resume = raid_resume,
index b7d175e94a0233312e63b70678f4db7a57f6b9be..aeaaaef43effda534a338f28d8ca660513994e21 100644 (file)
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
                /* The target wants to requeue the I/O */
                dm_requeue_original_request(tio, false);
                break;
+       case DM_ENDIO_DELAY_REQUEUE:
+               /* The target wants to requeue the I/O after a delay */
+               dm_requeue_original_request(tio, true);
+               break;
        default:
                DMWARN("unimplemented target endio return value: %d", r);
                BUG();
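
DM_ENDIO_DELAY_REQUEUE gives a request-based target a third completion verdict: requeue the request, but only after a delay, which multipath uses when a path reports a transient busy condition. A user-space sketch of dispatching such a verdict (the enum values and requeue_request() are hypothetical, not the DM_ENDIO_* constants from the device-mapper headers):

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical stand-ins for the DM_ENDIO_* completion codes. */
    enum endio_action { ENDIO_DONE, ENDIO_REQUEUE, ENDIO_DELAY_REQUEUE };

    /* Hypothetical requeue hook: 'delayed' selects requeue-after-a-delay. */
    static void requeue_request(int rq_id, bool delayed)
    {
        printf("requeue rq %d %s\n", rq_id, delayed ? "after a delay" : "immediately");
    }

    static void complete_request(int rq_id, enum endio_action verdict)
    {
        switch (verdict) {
        case ENDIO_DONE:
            printf("rq %d completed\n", rq_id);
            break;
        case ENDIO_REQUEUE:
            requeue_request(rq_id, false);
            break;
        case ENDIO_DELAY_REQUEUE:
            /* the target saw a transient (busy) condition */
            requeue_request(rq_id, true);
            break;
        }
    }

    int main(void)
    {
        complete_request(1, ENDIO_DELAY_REQUEUE);
        return 0;
    }
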
@@ -713,7 +717,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t)
        /* disable dm_old_request_fn's merge heuristic by default */
        md->seq_rq_merge_deadline_usecs = 0;
 
-       dm_init_normal_md_queue(md);
        blk_queue_softirq_done(md->queue, dm_softirq_done);
 
        /* Initialize the request-based DM worker thread */
@@ -821,7 +824,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
                err = PTR_ERR(q);
                goto out_tag_set;
        }
-       dm_init_md_queue(md);
 
        return 0;
 
index 7b8642045c557b21ac5f4dab4411ef5993bc1687..f006a9005593baf301ea9bbf2e1dbeae062aecce 100644 (file)
@@ -282,9 +282,6 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
        if (list_empty(&s->valid_paths))
                goto out;
 
-       /* Change preferred (first in list) path to evenly balance. */
-       list_move_tail(s->valid_paths.next, &s->valid_paths);
-
        list_for_each_entry(pi, &s->valid_paths, list)
                if (!best || (st_compare_load(pi, best, nr_bytes) < 0))
                        best = pi;
@@ -292,6 +289,9 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
        if (!best)
                goto out;
 
+       /* Move most recently used to least preferred to evenly balance. */
+       list_move_tail(&best->list, &s->valid_paths);
+
        ret = best->path;
 out:
        spin_unlock_irqrestore(&s->lock, flags);
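
Instead of rotating the head of the list before scanning, the selector now scans in priority order and demotes only the path it actually used; because the comparison is a strict '<', earlier entries win ties, so equally loaded paths end up being handed out round-robin. A toy sketch of that rotate-on-use policy using plain arrays instead of the kernel's list.h (the path names and loads are made up):

    #include <stdio.h>

    #define NR_PATHS 3

    /* Toy path table: equal loads, so selection order shows the tie-breaking. */
    struct path { const char *name; unsigned load; };

    int main(void)
    {
        struct path paths[NR_PATHS] = { { "sda", 0 }, { "sdb", 0 }, { "sdc", 0 } };
        int order[NR_PATHS] = { 0, 1, 2 };          /* index 0 = most preferred */

        for (int round = 0; round < 4; round++) {
            int best = order[0];

            /* strict '<': earlier (more preferred) entries win ties */
            for (int i = 1; i < NR_PATHS; i++)
                if (paths[order[i]].load < paths[best].load)
                    best = order[i];

            printf("round %d -> %s\n", round, paths[best].name);

            /* move the path just used to the least-preferred slot */
            int pos = 0;
            while (order[pos] != best)
                pos++;
            for (; pos < NR_PATHS - 1; pos++)
                order[pos] = order[pos + 1];
            order[NR_PATHS - 1] = best;
        }
        return 0;
    }
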
index a0613bd8ed00efc17d545a3335c39cb6cfb83919..216035be566185e3108e562d58389fa2409b7df8 100644 (file)
@@ -47,7 +47,7 @@ struct dm_exception_table {
 };
 
 struct dm_snapshot {
-       struct rw_semaphore lock;
+       struct mutex lock;
 
        struct dm_dev *origin;
        struct dm_dev *cow;
@@ -439,9 +439,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
                if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
                        continue;
 
-               down_read(&s->lock);
+               mutex_lock(&s->lock);
                active = s->active;
-               up_read(&s->lock);
+               mutex_unlock(&s->lock);
 
                if (active) {
                        if (snap_src)
@@ -909,7 +909,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
        int r;
        chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
 
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
 
        /*
         * Process chunks (and associated exceptions) in reverse order
@@ -924,7 +924,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
        b = __release_queued_bios_after_merge(s);
 
 out:
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
        if (b)
                flush_bios(b);
 
@@ -983,9 +983,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
                if (linear_chunks < 0) {
                        DMERR("Read error in exception store: "
                              "shutting down merge");
-                       down_write(&s->lock);
+                       mutex_lock(&s->lock);
                        s->merge_failed = 1;
-                       up_write(&s->lock);
+                       mutex_unlock(&s->lock);
                }
                goto shut;
        }
@@ -1026,10 +1026,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
                previous_count = read_pending_exceptions_done_count();
        }
 
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
        s->first_merging_chunk = old_chunk;
        s->num_merging_chunks = linear_chunks;
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
 
        /* Wait until writes to all 'linear_chunks' drain */
        for (i = 0; i < linear_chunks; i++)
@@ -1071,10 +1071,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context)
        return;
 
 shut:
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
        s->merge_failed = 1;
        b = __release_queued_bios_after_merge(s);
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
        error_bios(b);
 
        merge_shutdown(s);
@@ -1173,7 +1173,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        s->exception_start_sequence = 0;
        s->exception_complete_sequence = 0;
        INIT_LIST_HEAD(&s->out_of_order_list);
-       init_rwsem(&s->lock);
+       mutex_init(&s->lock);
        INIT_LIST_HEAD(&s->list);
        spin_lock_init(&s->pe_lock);
        s->state_bits = 0;
@@ -1338,9 +1338,9 @@ static void snapshot_dtr(struct dm_target *ti)
        /* Check whether exception handover must be cancelled */
        (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
        if (snap_src && snap_dest && (s == snap_src)) {
-               down_write(&snap_dest->lock);
+               mutex_lock(&snap_dest->lock);
                snap_dest->valid = 0;
-               up_write(&snap_dest->lock);
+               mutex_unlock(&snap_dest->lock);
                DMERR("Cancelling snapshot handover.");
        }
        up_read(&_origins_lock);
@@ -1371,6 +1371,8 @@ static void snapshot_dtr(struct dm_target *ti)
 
        dm_exception_store_destroy(s->store);
 
+       mutex_destroy(&s->lock);
+
        dm_put_device(ti, s->cow);
 
        dm_put_device(ti, s->origin);
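
The snapshot lock only ever needed exclusive access, so the rw_semaphore becomes a plain mutex, and the destructor above now pairs the constructor's mutex_init() with the previously missing mutex_destroy() (the same backfill shows up below for dm-stats, dm-thin and dm-zoned). The equivalent lifecycle in a user-space pthread sketch (snap_state is a made-up stand-in for the state guarded by s->lock):

    #include <pthread.h>
    #include <stdio.h>

    /* Made-up stand-in for the snapshot state protected by s->lock. */
    struct snap_state {
        pthread_mutex_t lock;
        int valid;
    };

    int main(void)
    {
        struct snap_state s = { .valid = 0 };

        pthread_mutex_init(&s.lock, NULL);  /* ctr: mutex_init()            */

        pthread_mutex_lock(&s.lock);        /* was down_read()/down_write() */
        s.valid = 1;
        pthread_mutex_unlock(&s.lock);      /* was up_read()/up_write()     */

        printf("valid=%d\n", s.valid);

        pthread_mutex_destroy(&s.lock);     /* dtr: the backfilled mutex_destroy() */
        return 0;
    }
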
@@ -1458,7 +1460,7 @@ static void pending_complete(void *context, int success)
 
        if (!success) {
                /* Read/write error - snapshot is unusable */
-               down_write(&s->lock);
+               mutex_lock(&s->lock);
                __invalidate_snapshot(s, -EIO);
                error = 1;
                goto out;
@@ -1466,14 +1468,14 @@ static void pending_complete(void *context, int success)
 
        e = alloc_completed_exception(GFP_NOIO);
        if (!e) {
-               down_write(&s->lock);
+               mutex_lock(&s->lock);
                __invalidate_snapshot(s, -ENOMEM);
                error = 1;
                goto out;
        }
        *e = pe->e;
 
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
        if (!s->valid) {
                free_completed_exception(e);
                error = 1;
@@ -1498,7 +1500,7 @@ out:
                full_bio->bi_end_io = pe->full_bio_end_io;
        increment_pending_exceptions_done_count();
 
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
 
        /* Submit any pending write bios */
        if (error) {
@@ -1694,7 +1696,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
 
        /* FIXME: should only take write lock if we need
         * to copy an exception */
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
 
        if (!s->valid || (unlikely(s->snapshot_overflowed) &&
            bio_data_dir(bio) == WRITE)) {
@@ -1717,9 +1719,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
        if (bio_data_dir(bio) == WRITE) {
                pe = __lookup_pending_exception(s, chunk);
                if (!pe) {
-                       up_write(&s->lock);
+                       mutex_unlock(&s->lock);
                        pe = alloc_pending_exception(s);
-                       down_write(&s->lock);
+                       mutex_lock(&s->lock);
 
                        if (!s->valid || s->snapshot_overflowed) {
                                free_pending_exception(pe);
@@ -1754,7 +1756,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
                    bio->bi_iter.bi_size ==
                    (s->store->chunk_size << SECTOR_SHIFT)) {
                        pe->started = 1;
-                       up_write(&s->lock);
+                       mutex_unlock(&s->lock);
                        start_full_bio(pe, bio);
                        goto out;
                }
@@ -1764,7 +1766,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
                if (!pe->started) {
                        /* this is protected by snap->lock */
                        pe->started = 1;
-                       up_write(&s->lock);
+                       mutex_unlock(&s->lock);
                        start_copy(pe);
                        goto out;
                }
@@ -1774,7 +1776,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
        }
 
 out_unlock:
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
 out:
        return r;
 }
@@ -1810,7 +1812,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
 
        chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
 
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
 
        /* Full merging snapshots are redirected to the origin */
        if (!s->valid)
@@ -1841,12 +1843,12 @@ redirect_to_origin:
        bio_set_dev(bio, s->origin->bdev);
 
        if (bio_data_dir(bio) == WRITE) {
-               up_write(&s->lock);
+               mutex_unlock(&s->lock);
                return do_origin(s->origin, bio);
        }
 
 out_unlock:
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
 
        return r;
 }
@@ -1878,7 +1880,7 @@ static int snapshot_preresume(struct dm_target *ti)
        down_read(&_origins_lock);
        (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
        if (snap_src && snap_dest) {
-               down_read(&snap_src->lock);
+               mutex_lock(&snap_src->lock);
                if (s == snap_src) {
                        DMERR("Unable to resume snapshot source until "
                              "handover completes.");
@@ -1888,7 +1890,7 @@ static int snapshot_preresume(struct dm_target *ti)
                              "source is suspended.");
                        r = -EINVAL;
                }
-               up_read(&snap_src->lock);
+               mutex_unlock(&snap_src->lock);
        }
        up_read(&_origins_lock);
 
@@ -1934,11 +1936,11 @@ static void snapshot_resume(struct dm_target *ti)
 
        (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
        if (snap_src && snap_dest) {
-               down_write(&snap_src->lock);
-               down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
+               mutex_lock(&snap_src->lock);
+               mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
                __handover_exceptions(snap_src, snap_dest);
-               up_write(&snap_dest->lock);
-               up_write(&snap_src->lock);
+               mutex_unlock(&snap_dest->lock);
+               mutex_unlock(&snap_src->lock);
        }
 
        up_read(&_origins_lock);
@@ -1953,9 +1955,9 @@ static void snapshot_resume(struct dm_target *ti)
        /* Now we have correct chunk size, reregister */
        reregister_snapshot(s);
 
-       down_write(&s->lock);
+       mutex_lock(&s->lock);
        s->active = 1;
-       up_write(&s->lock);
+       mutex_unlock(&s->lock);
 }
 
 static uint32_t get_origin_minimum_chunksize(struct block_device *bdev)
@@ -1995,7 +1997,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
        switch (type) {
        case STATUSTYPE_INFO:
 
-               down_write(&snap->lock);
+               mutex_lock(&snap->lock);
 
                if (!snap->valid)
                        DMEMIT("Invalid");
@@ -2020,7 +2022,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
                                DMEMIT("Unknown");
                }
 
-               up_write(&snap->lock);
+               mutex_unlock(&snap->lock);
 
                break;
 
@@ -2086,7 +2088,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
                if (dm_target_is_snapshot_merge(snap->ti))
                        continue;
 
-               down_write(&snap->lock);
+               mutex_lock(&snap->lock);
 
                /* Only deal with valid and active snapshots */
                if (!snap->valid || !snap->active)
@@ -2113,9 +2115,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 
                pe = __lookup_pending_exception(snap, chunk);
                if (!pe) {
-                       up_write(&snap->lock);
+                       mutex_unlock(&snap->lock);
                        pe = alloc_pending_exception(snap);
-                       down_write(&snap->lock);
+                       mutex_lock(&snap->lock);
 
                        if (!snap->valid) {
                                free_pending_exception(pe);
@@ -2158,7 +2160,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
                }
 
 next_snapshot:
-               up_write(&snap->lock);
+               mutex_unlock(&snap->lock);
 
                if (pe_to_start_now) {
                        start_copy(pe_to_start_now);
index 29bc51084c82be8527e65c9cd157c54b86a708cd..56059fb56e2d255a0e635fdf56f9c6edfa20f0df 100644 (file)
@@ -228,6 +228,7 @@ void dm_stats_cleanup(struct dm_stats *stats)
                dm_stat_free(&s->rcu_head);
        }
        free_percpu(stats->last);
+       mutex_destroy(&stats->mutex);
 }
 
 static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
index aaffd0c0ee9a76c71f23f9bb1074ec8057b8c6f7..5fe7ec356c333c940edb41e61f0e5844de0aea72 100644 (file)
@@ -866,7 +866,8 @@ EXPORT_SYMBOL(dm_consume_args);
 static bool __table_type_bio_based(enum dm_queue_mode table_type)
 {
        return (table_type == DM_TYPE_BIO_BASED ||
-               table_type == DM_TYPE_DAX_BIO_BASED);
+               table_type == DM_TYPE_DAX_BIO_BASED ||
+               table_type == DM_TYPE_NVME_BIO_BASED);
 }
 
 static bool __table_type_request_based(enum dm_queue_mode table_type)
@@ -909,13 +910,33 @@ static bool dm_table_supports_dax(struct dm_table *t)
        return true;
 }
 
+static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
+
+struct verify_rq_based_data {
+       unsigned sq_count;
+       unsigned mq_count;
+};
+
+static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev,
+                             sector_t start, sector_t len, void *data)
+{
+       struct request_queue *q = bdev_get_queue(dev->bdev);
+       struct verify_rq_based_data *v = data;
+
+       if (q->mq_ops)
+               v->mq_count++;
+       else
+               v->sq_count++;
+
+       return queue_is_rq_based(q);
+}
+
 static int dm_table_determine_type(struct dm_table *t)
 {
        unsigned i;
        unsigned bio_based = 0, request_based = 0, hybrid = 0;
-       unsigned sq_count = 0, mq_count = 0;
+       struct verify_rq_based_data v = {.sq_count = 0, .mq_count = 0};
        struct dm_target *tgt;
-       struct dm_dev_internal *dd;
        struct list_head *devices = dm_table_get_devices(t);
        enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
 
@@ -923,6 +944,14 @@ static int dm_table_determine_type(struct dm_table *t)
                /* target already set the table's type */
                if (t->type == DM_TYPE_BIO_BASED)
                        return 0;
+               else if (t->type == DM_TYPE_NVME_BIO_BASED) {
+                       if (!dm_table_does_not_support_partial_completion(t)) {
+                               DMERR("nvme bio-based is only possible with devices"
+                                     " that don't support partial completion");
+                               return -EINVAL;
+                       }
+                       /* Fallthru, also verify all devices are blk-mq */
+               }
                BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
                goto verify_rq_based;
        }
@@ -937,8 +966,8 @@ static int dm_table_determine_type(struct dm_table *t)
                        bio_based = 1;
 
                if (bio_based && request_based) {
-                       DMWARN("Inconsistent table: different target types"
-                              " can't be mixed up");
+                       DMERR("Inconsistent table: different target types"
+                             " can't be mixed up");
                        return -EINVAL;
                }
        }
@@ -959,8 +988,18 @@ static int dm_table_determine_type(struct dm_table *t)
                /* We must use this table as bio-based */
                t->type = DM_TYPE_BIO_BASED;
                if (dm_table_supports_dax(t) ||
-                   (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED))
+                   (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
                        t->type = DM_TYPE_DAX_BIO_BASED;
+               } else {
+                       /* Check if upgrading to NVMe bio-based is valid or required */
+                       tgt = dm_table_get_immutable_target(t);
+                       if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
+                               t->type = DM_TYPE_NVME_BIO_BASED;
+                               goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
+                       } else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
+                               t->type = DM_TYPE_NVME_BIO_BASED;
+                       }
+               }
                return 0;
        }
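
For a bio-based table the type is now picked in three steps: DAX bio-based when every device supports DAX, NVMe bio-based when the table is a single immutable target that never splits I/O and sits on devices that cannot partially complete requests (a choice still subject to the blk-mq verification below), otherwise plain bio-based. A condensed sketch of that decision (the booleans stand in for the real capability probes and the enum is not the kernel's DM_TYPE_* set):

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative table types, not the kernel's DM_TYPE_* enum. */
    enum table_type { BIO_BASED, DAX_BIO_BASED, NVME_BIO_BASED };

    struct table_caps {
        bool supports_dax;            /* every device is DAX-capable                */
        bool immutable_non_splitting; /* single immutable target without max_io_len */
        bool no_partial_completion;   /* e.g. every device is an NVMe namespace     */
    };

    static enum table_type pick_bio_based_type(const struct table_caps *c)
    {
        if (c->supports_dax)
            return DAX_BIO_BASED;
        if (c->immutable_non_splitting && c->no_partial_completion)
            return NVME_BIO_BASED;    /* still has to pass the blk-mq check */
        return BIO_BASED;
    }

    int main(void)
    {
        struct table_caps caps = { false, true, true };

        printf("type=%d\n", pick_bio_based_type(&caps));    /* NVME_BIO_BASED */
        return 0;
    }
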
 
@@ -980,7 +1019,8 @@ verify_rq_based:
         * (e.g. request completion process for partial completion.)
         */
        if (t->num_targets > 1) {
-               DMWARN("Request-based dm doesn't support multiple targets yet");
+               DMERR("%s DM doesn't support multiple targets",
+                     t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
                return -EINVAL;
        }
 
@@ -997,28 +1037,29 @@ verify_rq_based:
                return 0;
        }
 
-       /* Non-request-stackable devices can't be used for request-based dm */
-       list_for_each_entry(dd, devices, list) {
-               struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);
-
-               if (!queue_is_rq_based(q)) {
-                       DMERR("table load rejected: including"
-                             " non-request-stackable devices");
-                       return -EINVAL;
-               }
+       tgt = dm_table_get_immutable_target(t);
+       if (!tgt) {
+               DMERR("table load rejected: immutable target is required");
+               return -EINVAL;
+       } else if (tgt->max_io_len) {
+               DMERR("table load rejected: immutable target that splits IO is not supported");
+               return -EINVAL;
+       }
 
-               if (q->mq_ops)
-                       mq_count++;
-               else
-                       sq_count++;
+       /* Non-request-stackable devices can't be used for request-based dm */
+       if (!tgt->type->iterate_devices ||
+           !tgt->type->iterate_devices(tgt, device_is_rq_based, &v)) {
+               DMERR("table load rejected: including non-request-stackable devices");
+               return -EINVAL;
        }
-       if (sq_count && mq_count) {
+       if (v.sq_count && v.mq_count) {
                DMERR("table load rejected: not all devices are blk-mq request-stackable");
                return -EINVAL;
        }
-       t->all_blk_mq = mq_count > 0;
+       t->all_blk_mq = v.mq_count > 0;
 
-       if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) {
+       if (!t->all_blk_mq &&
+           (t->type == DM_TYPE_MQ_REQUEST_BASED || t->type == DM_TYPE_NVME_BIO_BASED)) {
                DMERR("table load rejected: all devices are not blk-mq request-stackable");
                return -EINVAL;
        }
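
Rather than walking the table's device list directly, the blk-mq verification now funnels through the immutable target's iterate_devices hook with a small accumulator, gathering the sq/mq counts and the request-stackable check in one pass. A user-space sketch of that callback-plus-accumulator pattern (fake_dev, rq_based_counts and for_each_dev() are hypothetical, not the dm_table API):

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical stand-ins for the devices a target sits on. */
    struct fake_dev { bool is_blk_mq; bool rq_stackable; };

    struct rq_based_counts { unsigned sq, mq; };

    /* Per-device callback: tallies sq vs mq and reports rq-stackability. */
    static int device_is_rq_based(const struct fake_dev *dev, void *data)
    {
        struct rq_based_counts *c = data;

        if (dev->is_blk_mq)
            c->mq++;
        else
            c->sq++;
        return dev->rq_stackable;
    }

    /* Visits every device, remembering whether any callback said "yes". */
    static int for_each_dev(const struct fake_dev *devs, int n,
                            int (*fn)(const struct fake_dev *, void *), void *data)
    {
        int ret = 0;

        for (int i = 0; i < n; i++)
            ret |= fn(&devs[i], data);
        return ret;
    }

    int main(void)
    {
        struct fake_dev devs[2] = { { true, true }, { true, true } };
        struct rq_based_counts c = { 0, 0 };

        if (!for_each_dev(devs, 2, device_is_rq_based, &c) || (c.sq && c.mq))
            printf("table load rejected\n");
        else
            printf("all_blk_mq=%d\n", c.mq > 0);
        return 0;
    }
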
@@ -1079,7 +1120,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
 {
        enum dm_queue_mode type = dm_table_get_type(t);
        unsigned per_io_data_size = 0;
-       struct dm_target *tgt;
+       unsigned min_pool_size = 0;
+       struct dm_target *ti;
        unsigned i;
 
        if (unlikely(type == DM_TYPE_NONE)) {
@@ -1089,11 +1131,13 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
 
        if (__table_type_bio_based(type))
                for (i = 0; i < t->num_targets; i++) {
-                       tgt = t->targets + i;
-                       per_io_data_size = max(per_io_data_size, tgt->per_io_data_size);
+                       ti = t->targets + i;
+                       per_io_data_size = max(per_io_data_size, ti->per_io_data_size);
+                       min_pool_size = max(min_pool_size, ti->num_flush_bios);
                }
 
-       t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_io_data_size);
+       t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported,
+                                          per_io_data_size, min_pool_size);
        if (!t->mempools)
                return -ENOMEM;
 
@@ -1705,6 +1749,20 @@ static bool dm_table_all_devices_attribute(struct dm_table *t,
        return true;
 }
 
+static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev,
+                                       sector_t start, sector_t len, void *data)
+{
+       char b[BDEVNAME_SIZE];
+
+       /* For now, NVMe devices are the only devices of this class */
+       return (strncmp(bdevname(dev->bdev, b), "nvme", 4) == 0);
+}
+
+static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
+{
+       return dm_table_all_devices_attribute(t, device_no_partial_completion);
+}
+
 static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
                                         sector_t start, sector_t len, void *data)
 {
@@ -1820,6 +1878,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
        }
        blk_queue_write_cache(q, wc, fua);
 
+       if (dm_table_supports_dax(t))
+               queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
        if (dm_table_supports_dax_write_cache(t))
                dax_write_cache(t->md->dax_dev, true);
 
index f91d771fff4b6e9d9a488a7a67916326a1e85897..629c555890c1be277d882ae42069c2f4342c8fa3 100644 (file)
@@ -492,6 +492,11 @@ static void pool_table_init(void)
        INIT_LIST_HEAD(&dm_thin_pool_table.pools);
 }
 
+static void pool_table_exit(void)
+{
+       mutex_destroy(&dm_thin_pool_table.mutex);
+}
+
 static void __pool_table_insert(struct pool *pool)
 {
        BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
@@ -1717,7 +1722,7 @@ static void __remap_and_issue_shared_cell(void *context,
                    bio_op(bio) == REQ_OP_DISCARD)
                        bio_list_add(&info->defer_bios, bio);
                else {
-                       struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));;
+                       struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
 
                        h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds);
                        inc_all_io_entry(info->tc->pool, bio);
@@ -4387,6 +4392,8 @@ static void dm_thin_exit(void)
        dm_unregister_target(&pool_target);
 
        kmem_cache_destroy(_new_mapping_cache);
+
+       pool_table_exit();
 }
 
 module_init(dm_thin_init);
diff --git a/drivers/md/dm-unstripe.c b/drivers/md/dm-unstripe.c
new file mode 100644 (file)
index 0000000..65f838f
--- /dev/null
@@ -0,0 +1,219 @@
+/*
+ * Copyright (C) 2017 Intel Corporation.
+ *
+ * This file is released under the GPL.
+ */
+
+#include "dm.h"
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/bitops.h>
+#include <linux/device-mapper.h>
+
+struct unstripe_c {
+       struct dm_dev *dev;
+       sector_t physical_start;
+
+       uint32_t stripes;
+
+       uint32_t unstripe;
+       sector_t unstripe_width;
+       sector_t unstripe_offset;
+
+       uint32_t chunk_size;
+       u8 chunk_shift;
+};
+
+#define DM_MSG_PREFIX "unstriped"
+
+static void cleanup_unstripe(struct unstripe_c *uc, struct dm_target *ti)
+{
+       if (uc->dev)
+               dm_put_device(ti, uc->dev);
+       kfree(uc);
+}
+
+/*
+ * Construct an unstriped mapping.
+ * <number of stripes> <chunk size> <stripe #> <dev_path> <offset>
+ */
+static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+       struct unstripe_c *uc;
+       sector_t tmp_len;
+       unsigned long long start;
+       char dummy;
+
+       if (argc != 5) {
+               ti->error = "Invalid number of arguments";
+               return -EINVAL;
+       }
+
+       uc = kzalloc(sizeof(*uc), GFP_KERNEL);
+       if (!uc) {
+               ti->error = "Memory allocation for unstriped context failed";
+               return -ENOMEM;
+       }
+
+       if (kstrtouint(argv[0], 10, &uc->stripes) || !uc->stripes) {
+               ti->error = "Invalid stripe count";
+               goto err;
+       }
+
+       if (kstrtouint(argv[1], 10, &uc->chunk_size) || !uc->chunk_size) {
+               ti->error = "Invalid chunk_size";
+               goto err;
+       }
+
+       // FIXME: must support non power of 2 chunk_size, dm-stripe.c does
+       if (!is_power_of_2(uc->chunk_size)) {
+               ti->error = "Non power of 2 chunk_size is not supported yet";
+               goto err;
+       }
+
+       if (kstrtouint(argv[2], 10, &uc->unstripe)) {
+               ti->error = "Invalid stripe number";
+               goto err;
+       }
+
+       if (uc->unstripe > uc->stripes && uc->stripes > 1) {
+               ti->error = "Please provide stripe between [0, # of stripes]";
+               goto err;
+       }
+
+       if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table), &uc->dev)) {
+               ti->error = "Couldn't get striped device";
+               goto err;
+       }
+
+       if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) {
+               ti->error = "Invalid striped device offset";
+               goto err;
+       }
+       uc->physical_start = start;
+
+       uc->unstripe_offset = uc->unstripe * uc->chunk_size;
+       uc->unstripe_width = (uc->stripes - 1) * uc->chunk_size;
+       uc->chunk_shift = fls(uc->chunk_size) - 1;
+
+       tmp_len = ti->len;
+       if (sector_div(tmp_len, uc->chunk_size)) {
+               ti->error = "Target length not divisible by chunk size";
+               goto err;
+       }
+
+       if (dm_set_target_max_io_len(ti, uc->chunk_size)) {
+               ti->error = "Failed to set max io len";
+               goto err;
+       }
+
+       ti->private = uc;
+       return 0;
+err:
+       cleanup_unstripe(uc, ti);
+       return -EINVAL;
+}
+
+static void unstripe_dtr(struct dm_target *ti)
+{
+       struct unstripe_c *uc = ti->private;
+
+       cleanup_unstripe(uc, ti);
+}
+
+static sector_t map_to_core(struct dm_target *ti, struct bio *bio)
+{
+       struct unstripe_c *uc = ti->private;
+       sector_t sector = bio->bi_iter.bi_sector;
+
+       /* Shift us up to the right "row" on the stripe */
+       sector += uc->unstripe_width * (sector >> uc->chunk_shift);
+
+       /* Account for what stripe we're operating on */
+       sector += uc->unstripe_offset;
+
+       return sector;
+}
+
+static int unstripe_map(struct dm_target *ti, struct bio *bio)
+{
+       struct unstripe_c *uc = ti->private;
+
+       bio_set_dev(bio, uc->dev->bdev);
+       bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start;
+
+       return DM_MAPIO_REMAPPED;
+}
+
+static void unstripe_status(struct dm_target *ti, status_type_t type,
+                           unsigned int status_flags, char *result, unsigned int maxlen)
+{
+       struct unstripe_c *uc = ti->private;
+       unsigned int sz = 0;
+
+       switch (type) {
+       case STATUSTYPE_INFO:
+               break;
+
+       case STATUSTYPE_TABLE:
+               DMEMIT("%d %llu %d %s %llu",
+                      uc->stripes, (unsigned long long)uc->chunk_size, uc->unstripe,
+                      uc->dev->name, (unsigned long long)uc->physical_start);
+               break;
+       }
+}
+
+static int unstripe_iterate_devices(struct dm_target *ti,
+                                   iterate_devices_callout_fn fn, void *data)
+{
+       struct unstripe_c *uc = ti->private;
+
+       return fn(ti, uc->dev, uc->physical_start, ti->len, data);
+}
+
+static void unstripe_io_hints(struct dm_target *ti,
+                              struct queue_limits *limits)
+{
+       struct unstripe_c *uc = ti->private;
+
+       limits->chunk_sectors = uc->chunk_size;
+}
+
+static struct target_type unstripe_target = {
+       .name = "unstriped",
+       .version = {1, 0, 0},
+       .module = THIS_MODULE,
+       .ctr = unstripe_ctr,
+       .dtr = unstripe_dtr,
+       .map = unstripe_map,
+       .status = unstripe_status,
+       .iterate_devices = unstripe_iterate_devices,
+       .io_hints = unstripe_io_hints,
+};
+
+static int __init dm_unstripe_init(void)
+{
+       int r;
+
+       r = dm_register_target(&unstripe_target);
+       if (r < 0)
+               DMERR("target registration failed");
+
+       return r;
+}
+
+static void __exit dm_unstripe_exit(void)
+{
+       dm_unregister_target(&unstripe_target);
+}
+
+module_init(dm_unstripe_init);
+module_exit(dm_unstripe_exit);
+
+MODULE_DESCRIPTION(DM_NAME " unstriped target");
+MODULE_AUTHOR("Scott Bauer <scott.bauer@intel.com>");
+MODULE_LICENSE("GPL");
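
The new unstriped target above recovers one member of an internally striped device: map_to_core() advances the incoming sector by one unstripe_width for every chunk-sized row already consumed, then adds the member's fixed column offset. A standalone sketch of that arithmetic for a hypothetical 4-member set with 256-sector chunks (unstripe_params and map_to_member() are illustrative, not the target's structures):

    #include <stdio.h>

    typedef unsigned long long sector_t;

    /* Mirrors the <stripes> <chunk size> <stripe #> constructor arguments. */
    struct unstripe_params {
        unsigned stripes;     /* members in the original stripe set        */
        unsigned chunk_size;  /* chunk size in sectors (power of two here) */
        unsigned member;      /* which member is being extracted           */
    };

    static sector_t map_to_member(const struct unstripe_params *p, sector_t sector)
    {
        unsigned chunk_shift = __builtin_ctz(p->chunk_size);
        sector_t width = (sector_t)(p->stripes - 1) * p->chunk_size;

        sector += width * (sector >> chunk_shift);      /* skip the other members' rows */
        sector += (sector_t)p->member * p->chunk_size;  /* land on this member's column */
        return sector;
    }

    int main(void)
    {
        struct unstripe_params p = { 4, 256, 1 };

        /*
         * Sector 256 of the unstriped view is the member's second chunk,
         * i.e. one full stripe (4 * 256 = 1024) plus one chunk (256) into
         * the striped device.
         */
        printf("%llu\n", map_to_member(&p, 256));   /* prints 1280 */
        return 0;
    }

Under those assumptions a table line for member 1 might look like '0 <member_len> unstriped 4 256 1 /dev/nvme0n1 0', following the <number of stripes> <chunk size> <stripe #> <dev_path> <offset> argument order documented in the constructor above.
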
index 70485de37b669abc7503f759109d4dde6856cfe0..969954915566f2cf1a15b684eb5b3d33dca58160 100644 (file)
@@ -2333,6 +2333,9 @@ static void dmz_cleanup_metadata(struct dmz_metadata *zmd)
 
        /* Free the zone descriptors */
        dmz_drop_zones(zmd);
+
+       mutex_destroy(&zmd->mblk_flush_lock);
+       mutex_destroy(&zmd->map_lock);
 }
 
 /*
index 6d7bda6f8190583136608898d778a0ec161c7aee..caff02caf0837ae4fae4df55b55ebabb9fb557bd 100644 (file)
@@ -827,6 +827,7 @@ err_fwq:
 err_cwq:
        destroy_workqueue(dmz->chunk_wq);
 err_bio:
+       mutex_destroy(&dmz->chunk_lock);
        bioset_free(dmz->bio_set);
 err_meta:
        dmz_dtr_metadata(dmz->metadata);
@@ -861,6 +862,8 @@ static void dmz_dtr(struct dm_target *ti)
 
        dmz_put_zoned_device(ti);
 
+       mutex_destroy(&dmz->chunk_lock);
+
        kfree(dmz);
 }
 
index 8c26bfc3533562453b1d9e83479d333b34d9a491..d6de00f367efdde4137055dfa4c82fefcbcd1f9a 100644 (file)
@@ -60,18 +60,73 @@ void dm_issue_global_event(void)
 }
 
 /*
- * One of these is allocated per bio.
+ * One of these is allocated (on-stack) per original bio.
  */
+struct clone_info {
+       struct dm_table *map;
+       struct bio *bio;
+       struct dm_io *io;
+       sector_t sector;
+       unsigned sector_count;
+};
+
+/*
+ * One of these is allocated per clone bio.
+ */
+#define DM_TIO_MAGIC 7282014
+struct dm_target_io {
+       unsigned magic;
+       struct dm_io *io;
+       struct dm_target *ti;
+       unsigned target_bio_nr;
+       unsigned *len_ptr;
+       bool inside_dm_io;
+       struct bio clone;
+};
+
+/*
+ * One of these is allocated per original bio.
+ * It contains the first clone used for that original.
+ */
+#define DM_IO_MAGIC 5191977
 struct dm_io {
+       unsigned magic;
        struct mapped_device *md;
        blk_status_t status;
        atomic_t io_count;
-       struct bio *bio;
+       struct bio *orig_bio;
        unsigned long start_time;
        spinlock_t endio_lock;
        struct dm_stats_aux stats_aux;
+       /* last member of dm_target_io is 'struct bio' */
+       struct dm_target_io tio;
 };
 
+void *dm_per_bio_data(struct bio *bio, size_t data_size)
+{
+       struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
+       if (!tio->inside_dm_io)
+               return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
+       return (char *)bio - offsetof(struct dm_target_io, clone) - offsetof(struct dm_io, tio) - data_size;
+}
+EXPORT_SYMBOL_GPL(dm_per_bio_data);
+
+struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
+{
+       struct dm_io *io = (struct dm_io *)((char *)data + data_size);
+       if (io->magic == DM_IO_MAGIC)
+               return (struct bio *)((char *)io + offsetof(struct dm_io, tio) + offsetof(struct dm_target_io, clone));
+       BUG_ON(io->magic != DM_TIO_MAGIC);
+       return (struct bio *)((char *)io + offsetof(struct dm_target_io, clone));
+}
+EXPORT_SYMBOL_GPL(dm_bio_from_per_bio_data);
+
+unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
+{
+       return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
+}
+EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr);
+
 #define MINOR_ALLOCED ((void *)-1)
 
 /*
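
The pointer arithmetic in dm_per_bio_data() and dm_bio_from_per_bio_data() follows directly from how the two biosets front-pad their bios; the comment block below is a layout sketch for orientation, not code from the patch:

/*
 * A clone allocated from md->io_bs (the first clone made for an original
 * bio) sits at the tail of a dm_io, with the target's per-bio data in the
 * bioset front pad just before it:
 *
 *   [ per-bio data ][ dm_io ... [ dm_target_io ... [ struct bio clone ]]]
 *
 * Additional clones come from md->bs and carry only a dm_target_io:
 *
 *   [ per-bio data ][ dm_target_io ... [ struct bio clone ]]
 *
 * tio->inside_dm_io and the DM_IO_MAGIC/DM_TIO_MAGIC values tell the
 * accessors which of the two layouts a given bio belongs to, i.e. how many
 * bytes to step back from &tio->clone to reach the per-bio data.
 */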
@@ -93,8 +148,8 @@ static int dm_numa_node = DM_NUMA_NODE;
  * For mempools pre-allocation at the table loading time.
  */
 struct dm_md_mempools {
-       mempool_t *io_pool;
        struct bio_set *bs;
+       struct bio_set *io_bs;
 };
 
 struct table_device {
@@ -103,7 +158,6 @@ struct table_device {
        struct dm_dev dm_dev;
 };
 
-static struct kmem_cache *_io_cache;
 static struct kmem_cache *_rq_tio_cache;
 static struct kmem_cache *_rq_cache;
 
@@ -170,14 +224,9 @@ static int __init local_init(void)
 {
        int r = -ENOMEM;
 
-       /* allocate a slab for the dm_ios */
-       _io_cache = KMEM_CACHE(dm_io, 0);
-       if (!_io_cache)
-               return r;
-
        _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0);
        if (!_rq_tio_cache)
-               goto out_free_io_cache;
+               return r;
 
        _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request),
                                      __alignof__(struct request), 0, NULL);
@@ -212,8 +261,6 @@ out_free_rq_cache:
        kmem_cache_destroy(_rq_cache);
 out_free_rq_tio_cache:
        kmem_cache_destroy(_rq_tio_cache);
-out_free_io_cache:
-       kmem_cache_destroy(_io_cache);
 
        return r;
 }
@@ -225,7 +272,6 @@ static void local_exit(void)
 
        kmem_cache_destroy(_rq_cache);
        kmem_cache_destroy(_rq_tio_cache);
-       kmem_cache_destroy(_io_cache);
        unregister_blkdev(_major, _name);
        dm_uevent_exit();
 
@@ -486,18 +532,69 @@ out:
        return r;
 }
 
-static struct dm_io *alloc_io(struct mapped_device *md)
+static void start_io_acct(struct dm_io *io);
+
+static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 {
-       return mempool_alloc(md->io_pool, GFP_NOIO);
+       struct dm_io *io;
+       struct dm_target_io *tio;
+       struct bio *clone;
+
+       clone = bio_alloc_bioset(GFP_NOIO, 0, md->io_bs);
+       if (!clone)
+               return NULL;
+
+       tio = container_of(clone, struct dm_target_io, clone);
+       tio->inside_dm_io = true;
+       tio->io = NULL;
+
+       io = container_of(tio, struct dm_io, tio);
+       io->magic = DM_IO_MAGIC;
+       io->status = 0;
+       atomic_set(&io->io_count, 1);
+       io->orig_bio = bio;
+       io->md = md;
+       spin_lock_init(&io->endio_lock);
+
+       start_io_acct(io);
+
+       return io;
 }
 
 static void free_io(struct mapped_device *md, struct dm_io *io)
 {
-       mempool_free(io, md->io_pool);
+       bio_put(&io->tio.clone);
+}
+
+static struct dm_target_io *alloc_tio(struct clone_info *ci, struct dm_target *ti,
+                                     unsigned target_bio_nr, gfp_t gfp_mask)
+{
+       struct dm_target_io *tio;
+
+       if (!ci->io->tio.io) {
+               /* the dm_target_io embedded in ci->io is available */
+               tio = &ci->io->tio;
+       } else {
+               struct bio *clone = bio_alloc_bioset(gfp_mask, 0, ci->io->md->bs);
+               if (!clone)
+                       return NULL;
+
+               tio = container_of(clone, struct dm_target_io, clone);
+               tio->inside_dm_io = false;
+       }
+
+       tio->magic = DM_TIO_MAGIC;
+       tio->io = ci->io;
+       tio->ti = ti;
+       tio->target_bio_nr = target_bio_nr;
+
+       return tio;
 }
 
 static void free_tio(struct dm_target_io *tio)
 {
+       if (tio->inside_dm_io)
+               return;
        bio_put(&tio->clone);
 }
 
@@ -510,17 +607,15 @@ int md_in_flight(struct mapped_device *md)
 static void start_io_acct(struct dm_io *io)
 {
        struct mapped_device *md = io->md;
-       struct bio *bio = io->bio;
-       int cpu;
+       struct bio *bio = io->orig_bio;
        int rw = bio_data_dir(bio);
 
        io->start_time = jiffies;
 
-       cpu = part_stat_lock();
-       part_round_stats(md->queue, cpu, &dm_disk(md)->part0);
-       part_stat_unlock();
+       generic_start_io_acct(md->queue, rw, bio_sectors(bio), &dm_disk(md)->part0);
+
        atomic_set(&dm_disk(md)->part0.in_flight[rw],
-               atomic_inc_return(&md->pending[rw]));
+                  atomic_inc_return(&md->pending[rw]));
 
        if (unlikely(dm_stats_used(&md->stats)))
                dm_stats_account_io(&md->stats, bio_data_dir(bio),
@@ -531,7 +626,7 @@ static void start_io_acct(struct dm_io *io)
 static void end_io_acct(struct dm_io *io)
 {
        struct mapped_device *md = io->md;
-       struct bio *bio = io->bio;
+       struct bio *bio = io->orig_bio;
        unsigned long duration = jiffies - io->start_time;
        int pending;
        int rw = bio_data_dir(bio);
@@ -752,15 +847,6 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo)
        return 0;
 }
 
-/*-----------------------------------------------------------------
- * CRUD START:
- *   A more elegant soln is in the works that uses the queue
- *   merge fn, unfortunately there are a couple of changes to
- *   the block layer that I want to make for this.  So in the
- *   interests of getting something for people to use I give
- *   you this clearly demarcated crap.
- *---------------------------------------------------------------*/
-
 static int __noflush_suspending(struct mapped_device *md)
 {
        return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
@@ -780,8 +866,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
        /* Push-back supersedes any I/O errors */
        if (unlikely(error)) {
                spin_lock_irqsave(&io->endio_lock, flags);
-               if (!(io->status == BLK_STS_DM_REQUEUE &&
-                               __noflush_suspending(md)))
+               if (!(io->status == BLK_STS_DM_REQUEUE && __noflush_suspending(md)))
                        io->status = error;
                spin_unlock_irqrestore(&io->endio_lock, flags);
        }
@@ -793,7 +878,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
                         */
                        spin_lock_irqsave(&md->deferred_lock, flags);
                        if (__noflush_suspending(md))
-                               bio_list_add_head(&md->deferred, io->bio);
+                               /* NOTE early return due to BLK_STS_DM_REQUEUE below */
+                               bio_list_add_head(&md->deferred, io->orig_bio);
                        else
                                /* noflush suspend was interrupted. */
                                io->status = BLK_STS_IOERR;
@@ -801,7 +887,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
                }
 
                io_error = io->status;
-               bio = io->bio;
+               bio = io->orig_bio;
                end_io_acct(io);
                free_io(md, io);
 
@@ -847,7 +933,7 @@ static void clone_endio(struct bio *bio)
        struct mapped_device *md = tio->io->md;
        dm_endio_fn endio = tio->ti->type->end_io;
 
-       if (unlikely(error == BLK_STS_TARGET)) {
+       if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
                if (bio_op(bio) == REQ_OP_WRITE_SAME &&
                    !bio->bi_disk->queue->limits.max_write_same_sectors)
                        disable_write_same(md);
@@ -1005,7 +1091,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
- * allowed for all bio types except REQ_PREFLUSH.
+ * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
  *
  * dm_accept_partial_bio informs the dm that the target only wants to process
  * additional n_sectors sectors of the bio and the rest of the data should be
@@ -1055,7 +1141,7 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
 {
 #ifdef CONFIG_BLK_DEV_ZONED
        struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
-       struct bio *report_bio = tio->io->bio;
+       struct bio *report_bio = tio->io->orig_bio;
        struct blk_zone_report_hdr *hdr = NULL;
        struct blk_zone *zone;
        unsigned int nr_rep = 0;
@@ -1122,67 +1208,15 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
 }
 EXPORT_SYMBOL_GPL(dm_remap_zone_report);
 
-/*
- * Flush current->bio_list when the target map method blocks.
- * This fixes deadlocks in snapshot and possibly in other targets.
- */
-struct dm_offload {
-       struct blk_plug plug;
-       struct blk_plug_cb cb;
-};
-
-static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
-{
-       struct dm_offload *o = container_of(cb, struct dm_offload, cb);
-       struct bio_list list;
-       struct bio *bio;
-       int i;
-
-       INIT_LIST_HEAD(&o->cb.list);
-
-       if (unlikely(!current->bio_list))
-               return;
-
-       for (i = 0; i < 2; i++) {
-               list = current->bio_list[i];
-               bio_list_init(&current->bio_list[i]);
-
-               while ((bio = bio_list_pop(&list))) {
-                       struct bio_set *bs = bio->bi_pool;
-                       if (unlikely(!bs) || bs == fs_bio_set ||
-                           !bs->rescue_workqueue) {
-                               bio_list_add(&current->bio_list[i], bio);
-                               continue;
-                       }
-
-                       spin_lock(&bs->rescue_lock);
-                       bio_list_add(&bs->rescue_list, bio);
-                       queue_work(bs->rescue_workqueue, &bs->rescue_work);
-                       spin_unlock(&bs->rescue_lock);
-               }
-       }
-}
-
-static void dm_offload_start(struct dm_offload *o)
-{
-       blk_start_plug(&o->plug);
-       o->cb.callback = flush_current_bio_list;
-       list_add(&o->cb.list, &current->plug->cb_list);
-}
-
-static void dm_offload_end(struct dm_offload *o)
-{
-       list_del(&o->cb.list);
-       blk_finish_plug(&o->plug);
-}
-
-static void __map_bio(struct dm_target_io *tio)
+static blk_qc_t __map_bio(struct dm_target_io *tio)
 {
        int r;
        sector_t sector;
-       struct dm_offload o;
        struct bio *clone = &tio->clone;
+       struct dm_io *io = tio->io;
+       struct mapped_device *md = io->md;
        struct dm_target *ti = tio->ti;
+       blk_qc_t ret = BLK_QC_T_NONE;
 
        clone->bi_end_io = clone_endio;
 
@@ -1191,44 +1225,37 @@ static void __map_bio(struct dm_target_io *tio)
         * anything, the target has assumed ownership of
         * this io.
         */
-       atomic_inc(&tio->io->io_count);
+       atomic_inc(&io->io_count);
        sector = clone->bi_iter.bi_sector;
 
-       dm_offload_start(&o);
        r = ti->type->map(ti, clone);
-       dm_offload_end(&o);
-
        switch (r) {
        case DM_MAPIO_SUBMITTED:
                break;
        case DM_MAPIO_REMAPPED:
                /* the bio has been remapped so dispatch it */
                trace_block_bio_remap(clone->bi_disk->queue, clone,
-                                     bio_dev(tio->io->bio), sector);
-               generic_make_request(clone);
+                                     bio_dev(io->orig_bio), sector);
+               if (md->type == DM_TYPE_NVME_BIO_BASED)
+                       ret = direct_make_request(clone);
+               else
+                       ret = generic_make_request(clone);
                break;
        case DM_MAPIO_KILL:
-               dec_pending(tio->io, BLK_STS_IOERR);
                free_tio(tio);
+               dec_pending(io, BLK_STS_IOERR);
                break;
        case DM_MAPIO_REQUEUE:
-               dec_pending(tio->io, BLK_STS_DM_REQUEUE);
                free_tio(tio);
+               dec_pending(io, BLK_STS_DM_REQUEUE);
                break;
        default:
                DMWARN("unimplemented target map return value: %d", r);
                BUG();
        }
-}
 
-struct clone_info {
-       struct mapped_device *md;
-       struct dm_table *map;
-       struct bio *bio;
-       struct dm_io *io;
-       sector_t sector;
-       unsigned sector_count;
-};
+       return ret;
+}
 
 static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
 {
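
For reference, the DM_MAPIO_REMAPPED case handled above is the path a simple pass-through target takes; a minimal hypothetical map method (names invented, essentially what dm-linear does) looks like:

static int passthrough_map(struct dm_target *ti, struct bio *bio)
{
        struct passthrough_c *pc = ti->private; /* set up by a hypothetical ctr */

        bio_set_dev(bio, pc->dev->bdev);
        if (bio_sectors(bio))
                bio->bi_iter.bi_sector = pc->start +
                        dm_target_offset(ti, bio->bi_iter.bi_sector);

        return DM_MAPIO_REMAPPED;       /* __map_bio() then submits the clone */
}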
@@ -1272,28 +1299,49 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
        return 0;
 }
 
-static struct dm_target_io *alloc_tio(struct clone_info *ci,
-                                     struct dm_target *ti,
-                                     unsigned target_bio_nr)
+static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
+                               struct dm_target *ti, unsigned num_bios)
 {
        struct dm_target_io *tio;
-       struct bio *clone;
+       int try;
 
-       clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs);
-       tio = container_of(clone, struct dm_target_io, clone);
+       if (!num_bios)
+               return;
 
-       tio->io = ci->io;
-       tio->ti = ti;
-       tio->target_bio_nr = target_bio_nr;
+       if (num_bios == 1) {
+               tio = alloc_tio(ci, ti, 0, GFP_NOIO);
+               bio_list_add(blist, &tio->clone);
+               return;
+       }
 
-       return tio;
+       for (try = 0; try < 2; try++) {
+               int bio_nr;
+               struct bio *bio;
+
+               if (try)
+                       mutex_lock(&ci->io->md->table_devices_lock);
+               for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
+                       tio = alloc_tio(ci, ti, bio_nr, try ? GFP_NOIO : GFP_NOWAIT);
+                       if (!tio)
+                               break;
+
+                       bio_list_add(blist, &tio->clone);
+               }
+               if (try)
+                       mutex_unlock(&ci->io->md->table_devices_lock);
+               if (bio_nr == num_bios)
+                       return;
+
+               while ((bio = bio_list_pop(blist))) {
+                       tio = container_of(bio, struct dm_target_io, clone);
+                       free_tio(tio);
+               }
+       }
 }
 
-static void __clone_and_map_simple_bio(struct clone_info *ci,
-                                      struct dm_target *ti,
-                                      unsigned target_bio_nr, unsigned *len)
+static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
+                                          struct dm_target_io *tio, unsigned *len)
 {
-       struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
        struct bio *clone = &tio->clone;
 
        tio->len_ptr = len;
@@ -1302,16 +1350,22 @@ static void __clone_and_map_simple_bio(struct clone_info *ci,
        if (len)
                bio_setup_sector(clone, ci->sector, *len);
 
-       __map_bio(tio);
+       return __map_bio(tio);
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
                                  unsigned num_bios, unsigned *len)
 {
-       unsigned target_bio_nr;
+       struct bio_list blist = BIO_EMPTY_LIST;
+       struct bio *bio;
+       struct dm_target_io *tio;
+
+       alloc_multiple_bios(&blist, ci, ti, num_bios);
 
-       for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++)
-               __clone_and_map_simple_bio(ci, ti, target_bio_nr, len);
+       while ((bio = bio_list_pop(&blist))) {
+               tio = container_of(bio, struct dm_target_io, clone);
+               (void) __clone_and_map_simple_bio(ci, tio, len);
+       }
 }
 
 static int __send_empty_flush(struct clone_info *ci)
@@ -1327,32 +1381,22 @@ static int __send_empty_flush(struct clone_info *ci)
 }
 
 static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
-                                    sector_t sector, unsigned *len)
+                                   sector_t sector, unsigned *len)
 {
        struct bio *bio = ci->bio;
        struct dm_target_io *tio;
-       unsigned target_bio_nr;
-       unsigned num_target_bios = 1;
-       int r = 0;
+       int r;
 
-       /*
-        * Does the target want to receive duplicate copies of the bio?
-        */
-       if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
-               num_target_bios = ti->num_write_bios(ti, bio);
-
-       for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
-               tio = alloc_tio(ci, ti, target_bio_nr);
-               tio->len_ptr = len;
-               r = clone_bio(tio, bio, sector, *len);
-               if (r < 0) {
-                       free_tio(tio);
-                       break;
-               }
-               __map_bio(tio);
+       tio = alloc_tio(ci, ti, 0, GFP_NOIO);
+       tio->len_ptr = len;
+       r = clone_bio(tio, bio, sector, *len);
+       if (r < 0) {
+               free_tio(tio);
+               return r;
        }
+       (void) __map_bio(tio);
 
-       return r;
+       return 0;
 }
 
 typedef unsigned (*get_num_bios_fn)(struct dm_target *ti);
@@ -1379,56 +1423,50 @@ static bool is_split_required_for_discard(struct dm_target *ti)
        return ti->split_discard_bios;
 }
 
-static int __send_changing_extent_only(struct clone_info *ci,
+static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
                                       get_num_bios_fn get_num_bios,
                                       is_split_required_fn is_split_required)
 {
-       struct dm_target *ti;
        unsigned len;
        unsigned num_bios;
 
-       do {
-               ti = dm_table_find_target(ci->map, ci->sector);
-               if (!dm_target_is_valid(ti))
-                       return -EIO;
-
-               /*
-                * Even though the device advertised support for this type of
-                * request, that does not mean every target supports it, and
-                * reconfiguration might also have changed that since the
-                * check was performed.
-                */
-               num_bios = get_num_bios ? get_num_bios(ti) : 0;
-               if (!num_bios)
-                       return -EOPNOTSUPP;
+       /*
+        * Even though the device advertised support for this type of
+        * request, that does not mean every target supports it, and
+        * reconfiguration might also have changed that since the
+        * check was performed.
+        */
+       num_bios = get_num_bios ? get_num_bios(ti) : 0;
+       if (!num_bios)
+               return -EOPNOTSUPP;
 
-               if (is_split_required && !is_split_required(ti))
-                       len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
-               else
-                       len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti));
+       if (is_split_required && !is_split_required(ti))
+               len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
+       else
+               len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti));
 
-               __send_duplicate_bios(ci, ti, num_bios, &len);
+       __send_duplicate_bios(ci, ti, num_bios, &len);
 
-               ci->sector += len;
-       } while (ci->sector_count -= len);
+       ci->sector += len;
+       ci->sector_count -= len;
 
        return 0;
 }
 
-static int __send_discard(struct clone_info *ci)
+static int __send_discard(struct clone_info *ci, struct dm_target *ti)
 {
-       return __send_changing_extent_only(ci, get_num_discard_bios,
+       return __send_changing_extent_only(ci, ti, get_num_discard_bios,
                                           is_split_required_for_discard);
 }
 
-static int __send_write_same(struct clone_info *ci)
+static int __send_write_same(struct clone_info *ci, struct dm_target *ti)
 {
-       return __send_changing_extent_only(ci, get_num_write_same_bios, NULL);
+       return __send_changing_extent_only(ci, ti, get_num_write_same_bios, NULL);
 }
 
-static int __send_write_zeroes(struct clone_info *ci)
+static int __send_write_zeroes(struct clone_info *ci, struct dm_target *ti)
 {
-       return __send_changing_extent_only(ci, get_num_write_zeroes_bios, NULL);
+       return __send_changing_extent_only(ci, ti, get_num_write_zeroes_bios, NULL);
 }
 
 /*
@@ -1441,17 +1479,17 @@ static int __split_and_process_non_flush(struct clone_info *ci)
        unsigned len;
        int r;
 
-       if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
-               return __send_discard(ci);
-       else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
-               return __send_write_same(ci);
-       else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
-               return __send_write_zeroes(ci);
-
        ti = dm_table_find_target(ci->map, ci->sector);
        if (!dm_target_is_valid(ti))
                return -EIO;
 
+       if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
+               return __send_discard(ci, ti);
+       else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
+               return __send_write_same(ci, ti);
+       else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
+               return __send_write_zeroes(ci, ti);
+
        if (bio_op(bio) == REQ_OP_ZONE_REPORT)
                len = ci->sector_count;
        else
@@ -1468,34 +1506,33 @@ static int __split_and_process_non_flush(struct clone_info *ci)
        return 0;
 }
 
+static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
+                           struct dm_table *map, struct bio *bio)
+{
+       ci->map = map;
+       ci->io = alloc_io(md, bio);
+       ci->sector = bio->bi_iter.bi_sector;
+}
+
 /*
  * Entry point to split a bio into clones and submit them to the targets.
  */
-static void __split_and_process_bio(struct mapped_device *md,
-                                   struct dm_table *map, struct bio *bio)
+static blk_qc_t __split_and_process_bio(struct mapped_device *md,
+                                       struct dm_table *map, struct bio *bio)
 {
        struct clone_info ci;
+       blk_qc_t ret = BLK_QC_T_NONE;
        int error = 0;
 
        if (unlikely(!map)) {
                bio_io_error(bio);
-               return;
+               return ret;
        }
 
-       ci.map = map;
-       ci.md = md;
-       ci.io = alloc_io(md);
-       ci.io->status = 0;
-       atomic_set(&ci.io->io_count, 1);
-       ci.io->bio = bio;
-       ci.io->md = md;
-       spin_lock_init(&ci.io->endio_lock);
-       ci.sector = bio->bi_iter.bi_sector;
-
-       start_io_acct(ci.io);
+       init_clone_info(&ci, md, map, bio);
 
        if (bio->bi_opf & REQ_PREFLUSH) {
-               ci.bio = &ci.md->flush_bio;
+               ci.bio = &ci.io->md->flush_bio;
                ci.sector_count = 0;
                error = __send_empty_flush(&ci);
                /* dec_pending submits any data associated with flush */
@@ -1506,32 +1543,95 @@ static void __split_and_process_bio(struct mapped_device *md,
        } else {
                ci.bio = bio;
                ci.sector_count = bio_sectors(bio);
-               while (ci.sector_count && !error)
+               while (ci.sector_count && !error) {
                        error = __split_and_process_non_flush(&ci);
+                       if (current->bio_list && ci.sector_count && !error) {
+                               /*
+                                * Remainder must be passed to generic_make_request()
+                                * so that it gets handled *after* bios already submitted
+                                * have been completely processed.
+                                * We take a clone of the original to store in
+                                * ci.io->orig_bio to be used by end_io_acct() and
+                                * for dec_pending to use for completion handling.
+                                * As this path is not used for REQ_OP_ZONE_REPORT,
+                                * the usage of io->orig_bio in dm_remap_zone_report()
+                                * won't be affected by this reassignment.
+                                */
+                               struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+                                                                md->queue->bio_split);
+                               ci.io->orig_bio = b;
+                               bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
+                               bio_chain(b, bio);
+                               ret = generic_make_request(bio);
+                               break;
+                       }
+               }
        }
 
        /* drop the extra reference count */
        dec_pending(ci.io, errno_to_blk_status(error));
+       return ret;
 }
-/*-----------------------------------------------------------------
- * CRUD END
- *---------------------------------------------------------------*/
 
 /*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * Optimized variant of __split_and_process_bio that leverages the
+ * fact that targets that use it do _not_ have a need to split bios.
  */
-static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
+static blk_qc_t __process_bio(struct mapped_device *md,
+                             struct dm_table *map, struct bio *bio)
+{
+       struct clone_info ci;
+       blk_qc_t ret = BLK_QC_T_NONE;
+       int error = 0;
+
+       if (unlikely(!map)) {
+               bio_io_error(bio);
+               return ret;
+       }
+
+       init_clone_info(&ci, md, map, bio);
+
+       if (bio->bi_opf & REQ_PREFLUSH) {
+               ci.bio = &ci.io->md->flush_bio;
+               ci.sector_count = 0;
+               error = __send_empty_flush(&ci);
+               /* dec_pending submits any data associated with flush */
+       } else {
+               struct dm_target *ti = md->immutable_target;
+               struct dm_target_io *tio;
+
+               /*
+                * Defend against IO still getting in during teardown
+                * - as was seen for a time with nvme-fcloop
+                */
+               if (unlikely(WARN_ON_ONCE(!ti || !dm_target_is_valid(ti)))) {
+                       error = -EIO;
+                       goto out;
+               }
+
+               tio = alloc_tio(&ci, ti, 0, GFP_NOIO);
+               ci.bio = bio;
+               ci.sector_count = bio_sectors(bio);
+               ret = __clone_and_map_simple_bio(&ci, tio, NULL);
+       }
+out:
+       /* drop the extra reference count */
+       dec_pending(ci.io, errno_to_blk_status(error));
+       return ret;
+}
+
+typedef blk_qc_t (process_bio_fn)(struct mapped_device *, struct dm_table *, struct bio *);
+
+static blk_qc_t __dm_make_request(struct request_queue *q, struct bio *bio,
+                                 process_bio_fn process_bio)
 {
-       int rw = bio_data_dir(bio);
        struct mapped_device *md = q->queuedata;
+       blk_qc_t ret = BLK_QC_T_NONE;
        int srcu_idx;
        struct dm_table *map;
 
        map = dm_get_live_table(md, &srcu_idx);
 
-       generic_start_io_acct(q, rw, bio_sectors(bio), &dm_disk(md)->part0);
-
        /* if we're suspended, we have to queue this io for later */
        if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
                dm_put_live_table(md, srcu_idx);
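
The remainder handling in __split_and_process_bio() above leans on bio_chain() semantics; the hypothetical helper below (not the DM code itself) shows the same hand-off in isolation:

/*
 * Keep an accounting clone for the part already mapped, advance the
 * original past it, and resubmit the original so the remainder is only
 * processed after the clones that were already queued (depth-first).
 */
static struct bio *chain_off_remainder(struct bio *bio, unsigned done_sectors,
                                       struct bio_set *bs)
{
        struct bio *acct = bio_clone_bioset(bio, GFP_NOIO, bs);

        bio_advance(bio, done_sectors << 9);    /* original = remainder only */
        bio_chain(acct, bio);                   /* bio completes only after acct */
        generic_make_request(bio);              /* remainder queued for later */

        return acct;    /* caller ends this with bio_endio() for the mapped part */
}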
@@ -1540,12 +1640,27 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
                        queue_io(md, bio);
                else
                        bio_io_error(bio);
-               return BLK_QC_T_NONE;
+               return ret;
        }
 
-       __split_and_process_bio(md, map, bio);
+       ret = process_bio(md, map, bio);
+
        dm_put_live_table(md, srcu_idx);
-       return BLK_QC_T_NONE;
+       return ret;
+}
+
+/*
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
+ */
+static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
+{
+       return __dm_make_request(q, bio, __split_and_process_bio);
+}
+
+static blk_qc_t dm_make_request_nvme(struct request_queue *q, struct bio *bio)
+{
+       return __dm_make_request(q, bio, __process_bio);
 }
 
 static int dm_any_congested(void *congested_data, int bdi_bits)
@@ -1626,20 +1741,9 @@ static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);
 
-void dm_init_md_queue(struct mapped_device *md)
-{
-       /*
-        * Initialize data that will only be used by a non-blk-mq DM queue
-        * - must do so here (in alloc_dev callchain) before queue is used
-        */
-       md->queue->queuedata = md;
-       md->queue->backing_dev_info->congested_data = md;
-}
-
-void dm_init_normal_md_queue(struct mapped_device *md)
+static void dm_init_normal_md_queue(struct mapped_device *md)
 {
        md->use_blk_mq = false;
-       dm_init_md_queue(md);
 
        /*
         * Initialize aspects of queue that aren't relevant for blk-mq
@@ -1653,9 +1757,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
                destroy_workqueue(md->wq);
        if (md->kworker_task)
                kthread_stop(md->kworker_task);
-       mempool_destroy(md->io_pool);
        if (md->bs)
                bioset_free(md->bs);
+       if (md->io_bs)
+               bioset_free(md->io_bs);
 
        if (md->dax_dev) {
                kill_dax(md->dax_dev);
@@ -1681,6 +1786,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
                md->bdev = NULL;
        }
 
+       mutex_destroy(&md->suspend_lock);
+       mutex_destroy(&md->type_lock);
+       mutex_destroy(&md->table_devices_lock);
+
        dm_mq_cleanup_mapped_device(md);
 }
 
@@ -1734,10 +1843,10 @@ static struct mapped_device *alloc_dev(int minor)
        md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
        if (!md->queue)
                goto bad;
+       md->queue->queuedata = md;
+       md->queue->backing_dev_info->congested_data = md;
 
-       dm_init_md_queue(md);
-
-       md->disk = alloc_disk_node(1, numa_node_id);
+       md->disk = alloc_disk_node(1, md->numa_node_id);
        if (!md->disk)
                goto bad;
 
@@ -1820,17 +1929,22 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
        struct dm_md_mempools *p = dm_table_get_md_mempools(t);
 
-       if (md->bs) {
-               /* The md already has necessary mempools. */
-               if (dm_table_bio_based(t)) {
-                       /*
-                        * Reload bioset because front_pad may have changed
-                        * because a different table was loaded.
-                        */
+       if (dm_table_bio_based(t)) {
+               /*
+                * The md may already have mempools that need changing.
+                * If so, reload bioset because front_pad may have changed
+                * because a different table was loaded.
+                */
+               if (md->bs) {
                        bioset_free(md->bs);
-                       md->bs = p->bs;
-                       p->bs = NULL;
+                       md->bs = NULL;
+               }
+               if (md->io_bs) {
+                       bioset_free(md->io_bs);
+                       md->io_bs = NULL;
                }
+
+       } else if (md->bs) {
                /*
                 * There's no need to reload with request-based dm
                 * because the size of front_pad doesn't change.
@@ -1842,13 +1956,12 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
                goto out;
        }
 
-       BUG_ON(!p || md->io_pool || md->bs);
+       BUG_ON(!p || md->bs || md->io_bs);
 
-       md->io_pool = p->io_pool;
-       p->io_pool = NULL;
        md->bs = p->bs;
        p->bs = NULL;
-
+       md->io_bs = p->io_bs;
+       p->io_bs = NULL;
 out:
        /* mempool bind completed, no longer need any mempools in the table */
        dm_table_free_md_mempools(t);
@@ -1894,6 +2007,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 {
        struct dm_table *old_map;
        struct request_queue *q = md->queue;
+       bool request_based = dm_table_request_based(t);
        sector_t size;
 
        lockdep_assert_held(&md->suspend_lock);
@@ -1917,12 +2031,15 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
         * This must be done before setting the queue restrictions,
         * because request-based dm may be run just after the setting.
         */
-       if (dm_table_request_based(t)) {
+       if (request_based)
                dm_stop_queue(q);
+
+       if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) {
                /*
-                * Leverage the fact that request-based DM targets are
-                * immutable singletons and establish md->immutable_target
-                * - used to optimize both dm_request_fn and dm_mq_queue_rq
+                * Leverage the fact that request-based DM targets and
+                * NVMe bio based targets are immutable singletons
+                * - used to optimize both dm_request_fn and dm_mq_queue_rq;
+                *   and __process_bio.
                 */
                md->immutable_target = dm_table_get_immutable_target(t);
        }
@@ -1962,13 +2079,18 @@ static struct dm_table *__unbind(struct mapped_device *md)
  */
 int dm_create(int minor, struct mapped_device **result)
 {
+       int r;
        struct mapped_device *md;
 
        md = alloc_dev(minor);
        if (!md)
                return -ENXIO;
 
-       dm_sysfs_init(md);
+       r = dm_sysfs_init(md);
+       if (r) {
+               free_dev(md);
+               return r;
+       }
 
        *result = md;
        return 0;
@@ -2026,6 +2148,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 
        switch (type) {
        case DM_TYPE_REQUEST_BASED:
+               dm_init_normal_md_queue(md);
                r = dm_old_init_request_queue(md, t);
                if (r) {
                        DMERR("Cannot initialize queue for request-based mapped device");
@@ -2043,15 +2166,10 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
        case DM_TYPE_DAX_BIO_BASED:
                dm_init_normal_md_queue(md);
                blk_queue_make_request(md->queue, dm_make_request);
-               /*
-                * DM handles splitting bios as needed.  Free the bio_split bioset
-                * since it won't be used (saves 1 process per bio-based DM device).
-                */
-               bioset_free(md->queue->bio_split);
-               md->queue->bio_split = NULL;
-
-               if (type == DM_TYPE_DAX_BIO_BASED)
-                       queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
+               break;
+       case DM_TYPE_NVME_BIO_BASED:
+               dm_init_normal_md_queue(md);
+               blk_queue_make_request(md->queue, dm_make_request_nvme);
                break;
        case DM_TYPE_NONE:
                WARN_ON_ONCE(true);
@@ -2130,7 +2248,6 @@ EXPORT_SYMBOL_GPL(dm_device_name);
 
 static void __dm_destroy(struct mapped_device *md, bool wait)
 {
-       struct request_queue *q = dm_get_md_queue(md);
        struct dm_table *map;
        int srcu_idx;
 
@@ -2141,7 +2258,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
        set_bit(DMF_FREEING, &md->flags);
        spin_unlock(&_minor_lock);
 
-       blk_set_queue_dying(q);
+       blk_set_queue_dying(md->queue);
 
        if (dm_request_based(md) && md->kworker_task)
                kthread_flush_worker(&md->kworker);
@@ -2752,11 +2869,12 @@ int dm_noflush_suspending(struct dm_target *ti)
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
-                                           unsigned integrity, unsigned per_io_data_size)
+                                           unsigned integrity, unsigned per_io_data_size,
+                                           unsigned min_pool_size)
 {
        struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id);
        unsigned int pool_size = 0;
-       unsigned int front_pad;
+       unsigned int front_pad, io_front_pad;
 
        if (!pools)
                return NULL;
@@ -2764,16 +2882,19 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
        switch (type) {
        case DM_TYPE_BIO_BASED:
        case DM_TYPE_DAX_BIO_BASED:
-               pool_size = dm_get_reserved_bio_based_ios();
+       case DM_TYPE_NVME_BIO_BASED:
+               pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size);
                front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
-       
-               pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
-               if (!pools->io_pool)
+               io_front_pad = roundup(front_pad,  __alignof__(struct dm_io)) + offsetof(struct dm_io, tio);
+               pools->io_bs = bioset_create(pool_size, io_front_pad, 0);
+               if (!pools->io_bs)
+                       goto out;
+               if (integrity && bioset_integrity_create(pools->io_bs, pool_size))
                        goto out;
                break;
        case DM_TYPE_REQUEST_BASED:
        case DM_TYPE_MQ_REQUEST_BASED:
-               pool_size = dm_get_reserved_rq_based_ios();
+               pool_size = max(dm_get_reserved_rq_based_ios(), min_pool_size);
                front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
                /* per_io_data_size is used for blk-mq pdu at queue allocation */
                break;
@@ -2781,7 +2902,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
                BUG();
        }
 
-       pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER);
+       pools->bs = bioset_create(pool_size, front_pad, 0);
        if (!pools->bs)
                goto out;
 
@@ -2801,10 +2922,10 @@ void dm_free_md_mempools(struct dm_md_mempools *pools)
        if (!pools)
                return;
 
-       mempool_destroy(pools->io_pool);
-
        if (pools->bs)
                bioset_free(pools->bs);
+       if (pools->io_bs)
+               bioset_free(pools->io_bs);
 
        kfree(pools);
 }
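
To see how the two front pads computed in dm_alloc_md_mempools() above nest, plug in purely illustrative numbers (all made up for the example): per_io_data_size = 60, __alignof__(struct dm_target_io) = 8, offsetof(struct dm_target_io, clone) = 40, __alignof__(struct dm_io) = 8 and offsetof(struct dm_io, tio) = 48. Then:

    front_pad    = roundup(60, 8)  + 40 = 64  + 40 = 104
    io_front_pad = roundup(104, 8) + 48 = 104 + 48 = 152

so every bio allocated from md->io_bs is preceded by a complete dm_io (with the target's per-io data in front of it), and every bio from md->bs by a dm_target_io plus the same per-io data, which is exactly the layout dm_per_bio_data() walks back through.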
index 36399bb875ddac7fd0e74981d55fcd4d24b1b030..114a81b27c3732a5bcefb2bab77c5b1f40ba9f6a 100644
@@ -49,7 +49,6 @@ struct dm_md_mempools;
 /*-----------------------------------------------------------------
  * Internal table functions.
  *---------------------------------------------------------------*/
-void dm_table_destroy(struct dm_table *t);
 void dm_table_event_callback(struct dm_table *t,
                             void (*fn)(void *), void *context);
 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
@@ -206,7 +205,8 @@ void dm_kcopyd_exit(void);
  * Mempool operations
  */
 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
-                                           unsigned integrity, unsigned per_bio_data_size);
+                                           unsigned integrity, unsigned per_bio_data_size,
+                                           unsigned min_pool_size);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
 
 /*
index a5538433c927abebad75ec6763a3e6ab8f1cc0e2..da83f64952e7d07ebccded358bb363807d49b7b5 100644
@@ -28,6 +28,7 @@ enum dm_queue_mode {
        DM_TYPE_REQUEST_BASED    = 2,
        DM_TYPE_MQ_REQUEST_BASED = 3,
        DM_TYPE_DAX_BIO_BASED    = 4,
+       DM_TYPE_NVME_BIO_BASED   = 5,
 };
 
 typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
@@ -220,14 +221,6 @@ struct target_type {
 #define DM_TARGET_WILDCARD             0x00000008
 #define dm_target_is_wildcard(type)    ((type)->features & DM_TARGET_WILDCARD)
 
-/*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices.  This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
 /*
  * A target implements own bio data integrity.
  */
@@ -291,13 +284,6 @@ struct dm_target {
         */
        unsigned per_io_data_size;
 
-       /*
-        * If defined, this function is called to find out how many
-        * duplicate bios should be sent to the target when writing
-        * data.
-        */
-       dm_num_write_bios_fn num_write_bios;
-
        /* target specific data */
        void *private;
 
@@ -329,35 +315,9 @@ struct dm_target_callbacks {
        int (*congested_fn) (struct dm_target_callbacks *, int);
 };
 
-/*
- * For bio-based dm.
- * One of these is allocated for each bio.
- * This structure shouldn't be touched directly by target drivers.
- * It is here so that we can inline dm_per_bio_data and
- * dm_bio_from_per_bio_data
- */
-struct dm_target_io {
-       struct dm_io *io;
-       struct dm_target *ti;
-       unsigned target_bio_nr;
-       unsigned *len_ptr;
-       struct bio clone;
-};
-
-static inline void *dm_per_bio_data(struct bio *bio, size_t data_size)
-{
-       return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
-}
-
-static inline struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
-{
-       return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone));
-}
-
-static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
-{
-       return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
-}
+void *dm_per_bio_data(struct bio *bio, size_t data_size);
+struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size);
+unsigned dm_bio_get_target_bio_nr(const struct bio *bio);
 
 int dm_register_target(struct target_type *t);
 void dm_unregister_target(struct target_type *t);
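
With these helpers now out-of-line, targets consume them exactly as before; a minimal hypothetical usage pattern (all names invented) is to size the per-io area in the constructor and read it back in map and end_io:

struct my_per_bio {                     /* hypothetical per-bio state */
        unsigned long start_jiffies;
};

static int my_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
        /* ... parse argv, set up ti->private ... */
        ti->per_io_data_size = sizeof(struct my_per_bio);
        return 0;
}

static int my_map(struct dm_target *ti, struct bio *bio)
{
        struct my_per_bio *pb = dm_per_bio_data(bio, sizeof(struct my_per_bio));

        pb->start_jiffies = jiffies;
        /* ... remap bio to the backing device ... */
        return DM_MAPIO_REMAPPED;
}

static int my_end_io(struct dm_target *ti, struct bio *bio, blk_status_t *error)
{
        struct my_per_bio *pb = dm_per_bio_data(bio, sizeof(struct my_per_bio));

        if (time_after(jiffies, pb->start_jiffies + HZ))
                DMWARN("I/O took more than a second");
        return DM_ENDIO_DONE;
}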
@@ -499,6 +459,11 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type);
  */
 int dm_table_complete(struct dm_table *t);
 
+/*
+ * Destroy the table when finished.
+ */
+void dm_table_destroy(struct dm_table *t);
+
 /*
  * Target may require that it is never sent I/O larger than len.
  */
@@ -585,6 +550,7 @@ do {                                                                        \
 #define DM_ENDIO_DONE          0
 #define DM_ENDIO_INCOMPLETE    1
 #define DM_ENDIO_REQUEUE       2
+#define DM_ENDIO_DELAY_REQUEUE 3
 
 /*
  * Definitions of return values from target map function.
@@ -592,7 +558,7 @@ do {                                                                        \
 #define DM_MAPIO_SUBMITTED     0
 #define DM_MAPIO_REMAPPED      1
 #define DM_MAPIO_REQUEUE       DM_ENDIO_REQUEUE
-#define DM_MAPIO_DELAY_REQUEUE 3
+#define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL          4
 
 #define dm_sector_div64(x, y)( \
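
DM_ENDIO_DELAY_REQUEUE is handled on the request-based completion side (it is what the multipath busy-handling change in this pull relies on); the stripped-down hypothetical rq_end_io hook below shows the intent, whereas a real target would also check blk_path_error() and its remaining paths:

static int my_rq_end_io(struct dm_target *ti, struct request *clone,
                        blk_status_t error, union map_info *map_context)
{
        if (error == BLK_STS_RESOURCE)          /* device reported busy */
                return DM_ENDIO_DELAY_REQUEUE;  /* retry after a delay */
        if (error)
                return DM_ENDIO_REQUEUE;        /* retry immediately */
        return DM_ENDIO_DONE;                   /* hand the result back as-is */
}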