[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block
5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block
5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical
5578752 on dev /dev/mapper/test-scratch1, physical
5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical
5578752 on dev /dev/mapper/test-scratch1, physical
5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block
5578752 mirror 1 has bad bytenr, has 0 want
5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical
5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block
5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block
5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical
5570560 on dev /dev/mapper/test-scratch1 physical
5570560
BTRFS warning (device dm-4): header error at logical
5570560 on dev /dev/mapper/test-scratch1, physical
5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical
5570560 on dev /dev/mapper/test-scratch1, physical
5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe
5570560 has unrepaired metadata sector at
5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_bio(bbio, mirror);
}
-static void flush_scrub_stripes(struct scrub_ctx *sctx)
+static bool stripe_has_metadata_error(struct scrub_stripe *stripe)
+{
+ int i;
+
+ for_each_set_bit(i, &stripe->error_bitmap, stripe->nr_sectors) {
+ if (stripe->sectors[i].is_metadata) {
+ struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+
+ btrfs_err(fs_info,
+ "stripe %llu has unrepaired metadata sector at %llu",
+ stripe->logical,
+ stripe->logical + (i << fs_info->sectorsize_bits));
+ return true;
+ }
+ }
+ return false;
+}
+
+static int flush_scrub_stripes(struct scrub_ctx *sctx)
{
struct btrfs_fs_info *fs_info = sctx->fs_info;
struct scrub_stripe *stripe;
const int nr_stripes = sctx->cur_stripe;
+ int ret = 0;
if (!nr_stripes)
- return;
+ return 0;
ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state));
/* Submit for dev-replace. */
if (sctx->is_dev_replace) {
+ /*
+ * For dev-replace, if we know there is something wrong with
+ * metadata, we should immedately abort.
+ */
+ for (int i = 0; i < nr_stripes; i++) {
+ if (stripe_has_metadata_error(&sctx->stripes[i])) {
+ ret = -EIO;
+ goto out;
+ }
+ }
for (int i = 0; i < nr_stripes; i++) {
unsigned long good;
wait_scrub_stripe_io(stripe);
scrub_reset_stripe(stripe);
}
+out:
sctx->cur_stripe = 0;
+ return ret;
}
static void raid56_scrub_wait_endio(struct bio *bio)
int ret;
/* No available slot, submit all stripes and wait for them. */
- if (sctx->cur_stripe >= SCRUB_STRIPES_PER_SCTX)
- flush_scrub_stripes(sctx);
+ if (sctx->cur_stripe >= SCRUB_STRIPES_PER_SCTX) {
+ ret = flush_scrub_stripes(sctx);
+ if (ret < 0)
+ return ret;
+ }
stripe = &sctx->stripes[sctx->cur_stripe];
const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
const u64 chunk_logical = bg->start;
int ret;
+ int ret2;
u64 physical = map->stripes[stripe_index].physical;
const u64 dev_stripe_len = btrfs_calc_stripe_length(em);
const u64 physical_end = physical + dev_stripe_len;
break;
}
out:
- flush_scrub_stripes(sctx);
+ ret2 = flush_scrub_stripes(sctx);
+ if (!ret2)
+ ret = ret2;
if (sctx->raid56_data_stripes) {
for (int i = 0; i < nr_data_stripes(map); i++)
release_scrub_stripe(&sctx->raid56_data_stripes[i]);