Kent Overstreet [Wed, 2 Apr 2025 18:31:12 +0000 (14:31 -0400)]
bcachefs: Fix check_snapshot_exists() restart handling
Codepaths that create entries in the snapshots btree currently call
bch2_mark_snapshot(), which updates the in-memory snapshot table, before
transaction commit.
This is because bch2_mark_snapshot() is an atomic trigger, run with
btree write locks held, and isn't allowed to fail - but it might need to
reallocate the table, hence we call it early when we're still allowed to
fail.
This is generally harmless - if we fail, we'll have left an entry in the
snapshots table around, but nothing will reference it and it'll get
overwritten if reused by another transaction.
But check_snapshot_exists(), which reconstructs snapshots when the
snapshots btree has been corrupted or lost, was erronously rechecking if
the snapshot exists inside the transaction commit loop - so on
transaction restart (in this case mem_realloced), the second iteration
would return without repairing.
This code needs some cleanup: splitting out a "maybe realloc snapshots
table" helper would have avoided this, that will be in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Bharadwaj Raju [Wed, 2 Apr 2025 18:15:53 +0000 (23:45 +0530)]
bcachefs: use nonblocking variant of print_string_as_lines in error path
The inconsistency error path calls print_string_as_lines, which calls
console_lock, which is a potentially-sleeping function and so can't be
called in an atomic context.
Replace calls to it with the nonblocking variant which is safe to call.
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 1 Apr 2025 17:01:29 +0000 (13:01 -0400)]
bcachefs: Fix scheduling while atomic from logging changes
Two fixes from the recent logging changes:
bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt
context, or with rcu_read_lock() held.
The one syzbot found is in
bch2_bkey_pick_read_device
bch2_dev_rcu
bch2_fs_inconsistent
We're starting to switch to lift the printbufs up to higher levels so we
can emit better log messages and print them all in one go (avoid
garbling), so that conversion will help with spotting these in the
future; when we declare a printbuf it must be flagged if we're in an
atomic context.
Secondly, in btree_node_write_endio:
00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321
00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6
00085 preempt_count: 10001, expected: 0
00085 RCU nest depth: 0, expected: 0
00085 4 locks held by bch-reclaim/fa6/618:
00085 #0:
ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198
00085 #1:
ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440
00085 #2:
ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440
00085 #3:
ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130
00085 irq event stamp: 328
00085 hardirqs last enabled at (327): [<
ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0
00085 hardirqs last disabled at (328): [<
ffffffc080971a10>] el1_interrupt+0x20/0x60
00085 softirqs last enabled at (0): [<
ffffffc08002f920>] copy_process+0x7c8/0x2118
00085 softirqs last disabled at (0): [<
0000000000000000>] 0x0
00085 Preemption disabled at:
00085 [<
ffffffc08003ada0>] irq_enter_rcu+0x18/0x90
00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted
6.14.0-rc6-ktest-g04630bde23e8 #18798
00085 Hardware name: linux,dummy-virt (DT)
00085 Call trace:
00085 show_stack+0x1c/0x30 (C)
00085 dump_stack_lvl+0x84/0xc0
00085 dump_stack+0x14/0x20
00085 __might_resched+0x180/0x288
00085 __might_sleep+0x4c/0x88
00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0
00085 krealloc_noprof+0x1a0/0x2d8
00085 bch2_printbuf_make_room+0x9c/0x120
00085 bch2_prt_printf+0x60/0x1b8
00085 btree_node_write_endio+0x1b0/0x2d8
00085 bio_endio+0x138/0x1f0
00085 btree_node_write_endio+0xe8/0x2d8
00085 bio_endio+0x138/0x1f0
00085 blk_update_request+0x220/0x4c0
00085 blk_mq_end_request+0x28/0x148
00085 virtblk_request_done+0x64/0xe8
00085 blk_mq_complete_request+0x34/0x40
00085 virtblk_done+0x78/0x130
00085 vring_interrupt+0x6c/0xb0
00085 __handle_irq_event_percpu+0x8c/0x2e0
00085 handle_irq_event+0x50/0xb0
00085 handle_fasteoi_irq+0xc4/0x250
00085 handle_irq_desc+0x44/0x60
00085 generic_handle_domain_irq+0x20/0x30
00085 gic_handle_irq+0x54/0xc8
00085 call_on_irq_stack+0x24/0x40
Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wentao Liang [Wed, 2 Apr 2025 13:45:44 +0000 (21:45 +0800)]
bcachefs: Add error handling for zlib_deflateInit2()
In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in pstore_compress().
Add an error check and return 0 immediately if the initialzation fails.
Fixes:
986e9842fb68 ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eric Biggers [Wed, 2 Apr 2025 03:26:48 +0000 (20:26 -0700)]
bcachefs: add missing selection of XARRAY_MULTI
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits
the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert
a large folio in the page cache. Fix this by making bcachefs select
XARRAY_MULTI.
Fixes:
be212d86b19c ("bcachefs: bs > ps support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 31 Mar 2025 20:09:39 +0000 (16:09 -0400)]
bcachefs: bch_dev_usage_full
All the fastpaths that need device usage don't need the sector totals or
fragmentation, just bucket counts.
Split bch_dev_usage up into two different versions, the normal one with
just bucket counts.
This is also a stack usage improvement, since we have a bch_dev_usage on
the stack in the allocation path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Mar 2025 19:18:51 +0000 (15:18 -0400)]
bcachefs: Kill btree_iter.trans
This was planned to be done ages ago, now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Mar 2025 21:03:12 +0000 (17:03 -0400)]
bcachefs: do_trace_key_cache_fill()
Reducing stack frame usage; this moves the printbuf out of the main
stack frame.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Mar 2025 03:11:08 +0000 (23:11 -0400)]
bcachefs: Split up bch_dev.io_ref
We now have separate per device io_refs for read and write access.
This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.
It's also a bit of hardening; we no longer allow writes to devices that
are read-only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 1 Apr 2025 05:40:19 +0000 (01:40 -0400)]
bcachefs: fix ref leak in btree_node_read_all_replicas
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 31 Mar 2025 20:19:04 +0000 (16:19 -0400)]
bcachefs: Fix null ptr deref in bch2_write_endio()
This was previously hard to hit since it requires racing with device
removal, but splitting up io_ref uncovered it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 31 Mar 2025 18:32:31 +0000 (14:32 -0400)]
bcachefs: Fix field spanning write warning
Struct with embedded VLA...
memcpy: detected field-spanning write (size 8) of single field "&gc->r.e" at fs/bcachefs/ec.c:465 (size 3)
WARNING: CPU: 1 PID: 936 at fs/bcachefs/ec.c:465 bch2_trigger_stripe+0x706/0x730
Modules linked in:
CPU: 1 UID: 0 PID: 936 Comm: mount.bcachefs Not tainted
6.14.0-rc6-ktest-00236-gefb0b5c62dbc #55
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:bch2_trigger_stripe+0x706/0x730
Code: b4 00 01 b9 03 00 00 00 48 89 fb 48 c7 c7 33 54 da 81 48 89 d6 49 89 d6 48 c7 c2 c3 36 db 81 e8 60 54 c5 ff 48 89 df 4c 89 f2 <0f> 0b e9 5c fd ff ff e8 fe 5e 4e 00 bf 10 00 00 00 48 c7 c6 ff ff
RSP: 0018:
ffff88817081f680 EFLAGS:
00010246
RAX:
f8fe7dd1c56b5600 RBX:
ffff888101265368 RCX:
0000000000000027
RDX:
0000000000000008 RSI:
00000000fffbffff RDI:
ffff888101265368
RBP:
0000000000000000 R08:
000000000003ffff R09:
ffff88817f1fe000
R10:
00000000000bfffd R11:
0000000000000004 R12:
ffff8881012652c0
R13:
0000000000000000 R14:
0000000000000008 R15:
ffff88817081f6c9
FS:
00007fc428bc7c80(0000) GS:
ffff888179280000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007ffd3ee4a038 CR3:
000000010a9bc000 CR4:
0000000000750eb0
PKRU:
55555554
Call Trace:
<TASK>
? __warn+0xce/0x1b0
? bch2_trigger_stripe+0x706/0x730
? report_bug+0x11b/0x1a0
? bch2_trigger_stripe+0x706/0x730
? handle_bug+0x5e/0x90
? exc_invalid_op+0x1a/0x50
? asm_exc_invalid_op+0x1a/0x20
? bch2_trigger_stripe+0x706/0x730
bch2_gc_mark_key+0x2cf/0x430
bch2_check_allocations+0x1a64/0x1ed0
? vsnprintf+0x1ad/0x420
? bch2_check_allocations+0x191f/0x1ed0
bch2_run_recovery_passes+0x13b/0x2b0
bch2_fs_recovery+0x9b7/0x1290
? __bch2_print+0xb2/0xf0
? bch2_printbuf_exit+0x1e/0x30
? print_mount_opts+0x153/0x180
bch2_fs_start+0x274/0x3b0
bch2_fs_get_tree+0x516/0x6e0
vfs_get_tree+0x21/0xa0
do_new_mount+0x153/0x350
__x64_sys_mount+0x16c/0x1f0
do_syscall_64+0x6c/0x140
? arch_exit_to_user_mode_prepare+0x9/0x40
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 31 Mar 2025 01:15:57 +0000 (21:15 -0400)]
bcachefs: Fix striping behaviour
For striping across devices, we maintain "clocks", and we advance them
by the inverse of "how much free space this device has left", so that we
round robin biased in favor of devices with more free space.
This code was originally trying to do EWMA-ish stuff when originally
written, ~10 years ago, and was never properly cleaned up when it was
realized that an EWMA is not the right approach here.
That left a bug, when we rescale to keep all the clocks in the correct
range and prevent overflow.
It was assumed that we'd always be allocated from the device with the
smallest clock hand, but that's actually not correct: with the target
options, allocations will be first tried from a subset of devices, and
then the entire filesystem if that fails.
Thus, the rescale from the first allocation - allocating from a subset
of devices - can pick the wrong rescale value and cause the rest of the
clocks to go to 0, losing information.
This resuls in incorrect striping behaviour when the desired number of
replicas doesn't fit on the foreground target.
Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 31 Mar 2025 00:04:16 +0000 (20:04 -0400)]
bcachefs: fix bch2_write_point_to_text() units
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Mar 2025 20:57:21 +0000 (16:57 -0400)]
bcachefs: Log original key being moved in data updates
There's something going on with the data move path; log the original key
being moved for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Mar 2025 20:50:59 +0000 (16:50 -0400)]
bcachefs: BCH_JSET_ENTRY_log_bkey
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Mar 2025 13:30:04 +0000 (09:30 -0400)]
bcachefs: Reorder error messages that include journal debug
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Mar 2025 00:58:32 +0000 (20:58 -0400)]
bcachefs: Don't use designated initializers for disk_accounting_pos
Not all compilers fully initialize these - they're not guaranteed to
because of the union shenanigans.
Fixes: https://github.com/koverstreet/bcachefs/issues/844
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 30 Mar 2025 00:02:44 +0000 (20:02 -0400)]
bcachefs: Silence errors after emergency shutdown
We don't care about errors from asynchronous ops that were because we
did an emergency shutdown; silence them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Mar 2025 23:29:33 +0000 (19:29 -0400)]
bcachefs: fix units in rebalance_status
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Mar 2025 23:01:09 +0000 (19:01 -0400)]
bcachefs: bch2_ioctl_subvolume_destroy() fixes
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly
pruning the dcache.
Also, fix missing permissions checks.
Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Mar 2025 21:59:50 +0000 (17:59 -0400)]
bcachefs: Clear fs_path_parent on subvolume unlink
This fixes recursive subvolume removal.
Subvolume deletion is asynchronous; fs_path_parent, and thus the entry
in the subvolume_children btree, need to be cleared when the subvolume
is unlinked from the fs heirarchy - else we'll spuriously think a
subvolume has children and deletion will fail.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Mar 2025 18:22:29 +0000 (14:22 -0400)]
bcachefs: Change btree_insert_node() assertion to error
Debug for https://github.com/koverstreet/bcachefs/issues/843
Print useful debug info and go emergency read-only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 14:41:33 +0000 (10:41 -0400)]
bcachefs: Better printing of inconsistency errors
Build up and emit the error message for an inconsistency error all at
once, instead of spread over multiple printk calls, so they're not
jumbled in the dmesg log.
Also, add better indenting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Mar 2025 16:15:32 +0000 (12:15 -0400)]
bcachefs: bch2_count_fsck_err()
Factor out a helper from __bch2_fsck_err(), for counting the error in
the superblock and deciding whether to print or ratelimit - will be used
to replace some log_fsck_err() calls, where we want to lift out printing
the error message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Mar 2025 15:59:09 +0000 (11:59 -0400)]
bcachefs: Better helpers for inconsistency errors
An inconsistency error often happens as part of an event with multiple
error messages, and we want to build up one single error message with
proper indenting to produce more readable log messages that don't get
garbled.
Add new helpers that emit messages to a printbuf instead of printing
them directly, next patch will convert to use them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 17:21:11 +0000 (13:21 -0400)]
bcachefs: Consistent indentation of multiline fsck errors
Add the new helper printbuf_indent_add_nextline(), and use it in
__bch2_fsck_err() to centralize setting the indentation of multiline
fsck errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 25 Mar 2025 17:19:40 +0000 (13:19 -0400)]
bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
To be used by the mount helper in userspace, where we still have options
to be parsed by other layers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 25 Mar 2025 14:52:00 +0000 (10:52 -0400)]
bcachefs: bch2_time_stats_init_no_pcpu()
Add a mode to disable automatic switching to percpu mode, useful when a
time_stats will only be used by one thread and we don't want to have to
flush the percpu buffers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Florian Albrechtskirchinger [Thu, 27 Mar 2025 13:31:08 +0000 (14:31 +0100)]
bcachefs: Fix bch2_fs_get_tree() error path
When a filesystem is mounted read-only, subsequent attempts to mount it
as read-write fail with EBUSY. Previously, the error path in
bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(),
improperly freeing resources for a filesystem that was still actively
mounted. This change modifies the error path to only call
__bch2_fs_stop() if the superblock has no valid root dentry, ensuring
resources are not cleaned up prematurely when the filesystem is in use.
Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Mar 2025 16:01:41 +0000 (12:01 -0400)]
bcachefs: fix logging in journal_entry_err_msg()
We want to log errors all at once, not spread across multiple printks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 14:20:52 +0000 (10:20 -0400)]
bcachefs: add missing newline in bch2_trans_updates_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 15:57:03 +0000 (11:57 -0400)]
bcachefs: print_string_as_lines: fix extra newline
Don't print a newline on empty string; this was causing us to also print
an extra newline when we got to the end of th string.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Mar 2025 16:35:05 +0000 (12:35 -0400)]
bcachefs: Fix WARN() in bch2_bkey_pick_read_device()
syzbot discovered that this one is possible: we have pointers, but none
of them are to valid devices.
Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Mar 2025 15:29:04 +0000 (11:29 -0400)]
bcachefs: Don't return 0 size holes from bch2_seek_hole()
The hole we find in the btree might be fully dirty in the page cache. If
so, keep searching.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 27 Mar 2025 17:34:13 +0000 (13:34 -0400)]
bcachefs: Fix bch2_seek_hole() locking
We can't call bch2_seek_pagecache_hole(), and block on page locks, with
btree locks held.
This is easily fixed because we're at the end of the transaction - we
can just unlock, we don't need a drop_locks_do().
Reported-by: https://github.com/nagalun
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 15:41:07 +0000 (11:41 -0400)]
bcachefs: Recovery no longer holds state_lock
state_lock guards against devices coming or leaving, changing state, or
the filesystem changing between ro <-> rw.
But it's not necessary for running recovery passes, and holding it
blocks asynchronous events that would cause us to go RO or kick out
devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Mar 2025 15:03:14 +0000 (11:03 -0400)]
bcachefs: Fix permissions on version modparam
There's no reason for this not to be world readable - it provides the
currently supported on disk format version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 15:44:30 +0000 (11:44 -0400)]
bcachefs: cond_resched() in journal_key_sort_cmp()
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 26 Mar 2025 15:26:30 +0000 (11:26 -0400)]
bcachefs: Fix 'hung task' messages in btree node scan
btree node scan has to wait on kthread workers that scan each device,
potentially for awhile.
We would like this to be interruptible, but we may need a different
mechanism than signals for that - we've had bugs in the past where
mounts were failing due to checking for signals, and no explanation on
where they came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Mar 2025 19:07:06 +0000 (15:07 -0400)]
bcachefs: Fix btree iter flags in data move (2)
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts
requires a not_extents iterator; this fixes the path where we're walking
the extents btree and chase a reflink pointer into the reflink btree.
bch2_lookup_indirect_extent() requires working with an extents iterator
(due to peek_slot() semantics), so we implement
bch2_lookup_indirect_extent_for_move().
This is simplified because there's no need to report
indirect_extent_missing_errors here, that can be deferred until fsck or
when a user reads that data.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 24 Mar 2025 20:25:53 +0000 (16:25 -0400)]
bcachefs: Don't unnecessarily decrypt data when moving
There's various checks for "are we going to compress this" - but we're
not going to compress if we know it's incompressible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 25 Mar 2025 14:28:53 +0000 (10:28 -0400)]
bcachefs: Document disk accounting keys and conuters
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 25 Mar 2025 14:06:33 +0000 (10:06 -0400)]
bcachefs: Validate number of counters for accounting keys
We weren't checking that accounting keys have the expected number of
accounters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.
This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 18 Mar 2025 21:35:50 +0000 (17:35 -0400)]
bcachefs: Use print_string_as_lines() for journal stuck messages
They were being truncated, printk has a 1k limit per call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 24 Mar 2025 20:40:22 +0000 (16:40 -0400)]
bcachefs: Fix duplicate checksum error messages in write path
Also, improve the message in prep_encoded_data() - it now prints
good/bad checksums, and checksum type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 24 Mar 2025 15:51:01 +0000 (11:51 -0400)]
bcachefs: Fix silent short reads in data read retry path
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size
to "the size we can read from the current extent" with a swap, and
restores it to "the size for the total read" after the read_extent call
with another swap.
But we neglected to do the restore before the "if (ret) goto err;" -
which is a problem if we're retrying those errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 25 Mar 2025 15:40:35 +0000 (11:40 -0400)]
bcachefs: Fix nonce inconsistency in bch2_write_prep_encoded_data()
If we're moving an extent that was partially overwritten,
bch2_write_rechecksum() will trim it to the currenty live range.
If we then also want to compress it, it'll be decrypted - but the nonce
has been advanced for the overwritten start of the extent that we
dropped, and we were using the nonce we calculated before rechecksum().
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Fixes:
127d90d2823e ("bcachefs: bch2_write_prep_encoded_data() now returns errcode")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 22 Mar 2025 01:16:50 +0000 (21:16 -0400)]
bcachefs: Kill unnecessary bch2_dev_usage_read()
bch2_dev_usage_read() is fairly expensive, we should optimize this more.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 22 Mar 2025 20:26:32 +0000 (16:26 -0400)]
bcachefs: btree node write errors now print btree node
It turned out a user was wondering why we were going read-only after a
write error, and he didn't realize he didn't have replication enabled -
this will make that more obvious, and we should be printing it anyways.
Link: https://www.reddit.com/r/bcachefs/comments/1jf9akl/large_data_transfers_switched_bcachefs_to_readonly/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 22 Mar 2025 01:53:41 +0000 (21:53 -0400)]
bcachefs: Fix race in print_chain()
00636 Unable to handle kernel NULL pointer dereference at virtual address
00000000000000b0
00636 Mem abort info:
00636 ESR = 0x0000000096000005
00636 EC = 0x25: DABT (current EL), IL = 32 bits
00636 SET = 0, FnV = 0
00636 EA = 0, S1PTW = 0
00636 FSC = 0x05: level 1 translation fault
00636 Data abort info:
00636 ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
00636 CM = 0, WnR = 0, TnD = 0, TagAccess = 0
00636 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00636 user pgtable: 4k pages, 39-bit VAs, pgdp=
0000000101b10000
00636 [
00000000000000b0] pgd=
0000000000000000, p4d=
0000000000000000, pud=
0000000000000000
00636 Internal error: Oops:
0000000096000005 [#1] SMP
00636 Modules linked in:
00636 CPU: 12 UID: 0 PID: 79369 Comm: cat Not tainted
6.14.0-rc6-ktest-g3783b8973ab7 #17757
00636 Hardware name: linux,dummy-virt (DT)
00636 pstate:
20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00636 pc : print_chain+0xb8/0x170
00636 lr : print_chain+0xa0/0x170
00636 sp :
ffffff80d9c1bbb0
00636 x29:
ffffff80d9c1bbb0 x28:
0000000000000002 x27:
ffffff80c1be8250
00636 x26:
ffffff80dd9b0000 x25:
0000000000000020 x24:
000000000000002d
00636 x23:
000000000000003c x22:
ffffffc080a54518 x21:
ffffff80da6e00d0
00636 x20:
ffffff80da6e0170 x19:
ffffff80c1a1d240 x18:
00000000ffffffff
00636 x17:
3535303937202d3c x16:
203139202d3c2035 x15:
00000000ffffffff
00636 x14:
0000000000000000 x13:
ffffff80d71b63f1 x12:
0000000000000006
00636 x11:
ffffffc080beb1c0 x10:
0000000000000020 x9 :
00000000000134cc
00636 x8 :
0000000000000020 x7 :
0000000000000004 x6 :
0000000000000020
00636 x5 :
ffffff80d71b63f7 x4 :
ffffffc080a5451b x3 :
0000000000000000
00636 x2 :
0000000000000000 x1 :
0000000000000000 x0 :
0000000000000000
00636 Call trace:
00636 print_chain+0xb8/0x170 (P)
00636 bch2_check_for_deadlock+0x444/0x5a0
00636 bch2_btree_deadlock_read+0xb4/0x1c8
00636 full_proxy_read+0x74/0xd8
00636 vfs_read+0x90/0x300
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Mar 2025 18:22:39 +0000 (14:22 -0400)]
bcachefs: btree_trans_restart_foreign_task()
In debug mode, we save the call stack on transaction restart - but
there's no locking, so we can't touch it if we're issuing the restart
from another thread.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Mar 2025 16:29:56 +0000 (12:29 -0400)]
bcachefs: bch2_disk_accounting_mod2()
We're hitting some issues with uninitialized struct padding, flagged by
kmsan.
They appear to be falso positives, otherwise bch2_accounting_validate()
would have flagged them as "junk at end". But for now, we'll need to
initialize disk_accounting_pos with memset().
This adds a new helper, bch2_disk_accounting_mod2(), that initializes a
disk_accounting_pos and does the accounting mod all at once - so overall
things actually get slightly more ergonomic.
BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning
about them and they're a bit special.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Mar 2025 15:30:09 +0000 (11:30 -0400)]
bcachefs: zero init journal bios
fix a kmsan splat
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 18:15:33 +0000 (14:15 -0400)]
bcachefs: Eliminate padding in move_bucket_key
We appear to be tripping over a compiler/kmsan bug with padding fields -
this is an easy workaround.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 18:54:49 +0000 (14:54 -0400)]
bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
We may sometimes read from uninitialized memory; we know, and that's ok.
We check if a btree node has been reused before waiting on any
outstanding IO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 18:17:53 +0000 (14:17 -0400)]
bcachefs: kmsan asserts
Catching these early makes them a lot easier to track down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 17:24:50 +0000 (13:24 -0400)]
bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
We store to all fields, so the kmsan warnings were spurious - but
initializing via stores to bitfields appear to have been giving the
compiler/kmsan trouble, and they're not necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 16:38:59 +0000 (12:38 -0400)]
bcachefs: Disable asm memcpys when kmsan enabled
kmsan doesn't know about inline assembly, obviously; this will close a
ton of syzbot bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 19 Mar 2025 21:01:38 +0000 (17:01 -0400)]
bcachefs: Handle backpointers with unknown data types
New data types might be added later, so we don't want to disallow
unknown data types - that'll be a compatibility hassle later. Instead,
ignore them.
Reported-by: syzbot+3a290f5ff67ca3023834@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 15:53:50 +0000 (11:53 -0400)]
bcachefs: Count BCH_DATA_parity backpointers correctly
These are counted as stripe data in the corresponding alloc keys.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 15:41:07 +0000 (11:41 -0400)]
bcachefs: Run bch2_check_dirent_target() at lookup time
More on the "full online self healing" project:
We now run most of the dirent <-> inode consistency checks, with repair
code, at runtime - the exact same check and repair code that fsck runs.
This will allow us to repair the "dirent points to inode that does not
point back" inconsistencies that have been popping up at runtime.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 15:06:50 +0000 (11:06 -0400)]
bcachefs: Refactor bch2_check_dirent_target()
Prep work for calling bch2_check_dirent_target() from bch2_lookup().
- Add an inline wrapper, if the target and backpointer match we can skip
the function call.
- We don't (yet?) want to remove the dirent we did the lookup from (when
we find a directory or subvol with multiple valid dirents pointing to
it), we can defer on that until later. For now, add an "are we in
fsck?" parameter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 15:06:50 +0000 (11:06 -0400)]
bcachefs: Move bch2_check_dirent_target() to namei.c
We're gradually running more and more fsck.c checks at runtime,
whereever applicable; when we do so they get moved out of fsck.c.
Next patch will call bch2_check_dirent_target() from bch2_lookup().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 14:53:52 +0000 (10:53 -0400)]
bcachefs: fs-common.c -> namei.c
name <-> inode, code for managing the relationships between inodes and
dirents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 14:16:48 +0000 (10:16 -0400)]
bcachefs: EIO cleanup
Replace these with proper private error codes, so that when we get an
error message we're not sifting through the entire codebase to see where
it came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 14:30:51 +0000 (10:30 -0400)]
bcachefs: bch2_write_prep_encoded_data() now returns errcode
Prep work for killing off EIO and replacing them with proper private
error codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Mar 2025 14:25:15 +0000 (10:25 -0400)]
bcachefs: Simplify bch2_write_op_error()
There's no reason for the caller to do the actual logging, it's all done
the same.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 19 Mar 2025 16:33:40 +0000 (12:33 -0400)]
bcachefs: Fix block/btree node size defaults
We're fixing option parsing in userspace, it now obeys
OPT_SB_FIELD_SECTORS
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alan Huang [Tue, 18 Mar 2025 07:50:00 +0000 (15:50 +0800)]
bcachefs: Add missing smp_rmb()
The smp_rmb() guarantees that reads from reservations.counter
occur before accessing cur_entry_u64s. It's paired with the
atomic64_try_cmpxchg in journal_entry_open.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 18 Mar 2025 19:52:08 +0000 (15:52 -0400)]
bcachefs: Kill JOURNAL_ERRORS()
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Mar 2025 04:54:10 +0000 (00:54 -0400)]
bcachefs: Filesystem discard option now propagates to devices
the discard option is special, because it's both a filesystem and a
device option.
When set at the filesytsem level, it's supposed to propagate to (if set
persistently via sysfs) or override (if non persistently as a mount
option) the devices - that now works correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Mar 2025 04:55:52 +0000 (00:55 -0400)]
bcachefs: Device state is now a runtime option
Other options can normally be set at runtime via sysfs, no reason for
this one not to be as well - it just doesn't support the degraded flags
argument this way, that requires the ioctl.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Mar 2025 04:55:23 +0000 (00:55 -0400)]
bcachefs: Setting foreground_target at runtime now triggers rebalance
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 11 Mar 2025 22:44:25 +0000 (18:44 -0400)]
bcachefs: Device options now use standard sysfs code
Device options now use the common code for sysfs, and can superblock
fields (in a struct bch_member).
This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Mar 2025 16:05:50 +0000 (12:05 -0400)]
bcachefs: Kill BCH_DEV_OPT_SETTERS()
Previously, device options had their superblock option field listed
separately, which was weird and easy to miss when defining options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alan Huang [Tue, 18 Mar 2025 07:50:01 +0000 (15:50 +0800)]
bcachefs: Remove spurious smp_mb()
The smp_mb() is paired with nothing.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alan Huang [Mon, 17 Mar 2025 17:54:24 +0000 (01:54 +0800)]
bcachefs: Fix incorrect state count
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is
the condition in journal_entry_open where we return JOURNAL_ERR_max_open,
so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf
corresponding to seq has started to write.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Mar 2025 19:07:06 +0000 (15:07 -0400)]
bcachefs: Fix btree iter flags in data move
Rebalance requires a not_extents iterator.
This wasn't hit before because all_snapshots disableds is_extents on
snapshots btrees - but has no effect on the reflink btree.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Mar 2025 17:58:51 +0000 (13:58 -0400)]
bcachefs: Validate bch_sb.offset field
This was missed - but it needs to be correct for the superblock recovery
tool that scans the start and end of the device for backup superblocks:
we don't want to pick up superblocks that belong to a different
partition that starts at a different offset.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Mar 2025 14:54:21 +0000 (10:54 -0400)]
bcachefs: bch2_sb_validate() doesn't need bch_sb_handle
Minor refactoring, so that bch2_sb_validate() can be used in the new
userspace superblock recovery tool.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Mar 2025 15:28:26 +0000 (11:28 -0400)]
bcachefs: Add missing random.h includes
Fix build in userspace, and good hygeine.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 15 Mar 2025 23:57:20 +0000 (19:57 -0400)]
bcachefs: Better incompat version/feature error messages
If we can't mount because of an incompatibility, print what's supported
and unsupported - to help solve PEBKAC issues.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 15 Mar 2025 21:27:27 +0000 (17:27 -0400)]
bcachefs: Fix offset_into_extent in data move path
Fixes the following:
[ 17.607394] kernel BUG at fs/bcachefs/reflink.c:261!
[ 17.608316] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 17.608485] CPU: 0 UID: 0 PID: 564 Comm: bch-rebalance/3 Tainted: G OE
6.14.0-rc6-arch1-gfcb0bd9609d2 #7
0efd7a8f4a00afeb2c5fb6e7ecb1aec8ddcbb1e1
[ 17.608616] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 17.608736] Hardware name: Micro-Star International Co., Ltd. MS-7D75/MAG B650 TOMAHAWK WIFI (MS-7D75), BIOS 1.74 08/01/2023
[ 17.608855] RIP: 0010:bch2_lookup_indirect_extent+0x252/0x290 [bcachefs]
[ 17.609006] Code: 00 00 00 00 e8 7f 51 f5 ff 89 c3 85 c0 74 52 48 8b 7d b0 4c 89 ee e8 4d 4b f4 ff 48 63 d3 48 89 d0 31 d2 e9 2e ff ff ff 0f 0b <0f> 0b 48 8b 7d b0 4c 89 ee 48 89 55 a8 e8 2c 4b f4 ff 4c 8b 55 a8
[ 17.609136] RSP: 0018:
ffffa3714455f850 EFLAGS:
00010246
[ 17.609261] RAX:
0000000000000080 RBX:
ffff895891098790 RCX:
0000000000000000
[ 17.609387] RDX:
0000000000000080 RSI:
ffffa3714455fa90 RDI:
ffff895889550000
[ 17.609511] RBP:
ffffa3714455f8c0 R08:
ffff895891098790 R09:
0000000000000001
[ 17.609637] R10:
ffffa3714455f8d8 R11:
ffffa3714455f950 R12:
ffffa3714455fa58
[ 17.609763] R13:
ffff895891098790 R14:
ffffa3714455fa58 R15:
ffff895889550000
[ 17.609888] FS:
0000000000000000(0000) GS:
ffff896757c00000(0000) knlGS:
0000000000000000
[ 17.610015] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 17.610143] CR2:
0000716b8cda2750 CR3:
0000000914e22000 CR4:
0000000000f50ef0
[ 17.610272] PKRU:
55555554
[ 17.610403] Call Trace:
[ 17.610535] <TASK>
[ 17.610662] ? __die_body.cold+0x19/0x27
[ 17.610791] ? die+0x2e/0x50
[ 17.610918] ? do_trap+0xca/0x110
[ 17.611049] ? do_error_trap+0x6a/0x90
[ 17.611178] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.611331] ? exc_invalid_op+0x50/0x70
[ 17.611468] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.611620] ? asm_exc_invalid_op+0x1a/0x20
[ 17.611757] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.611911] ? bch2_move_data_btree+0x58a/0x6c0 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.612084] bch2_move_data_btree+0x58a/0x6c0 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.612256] ? __pfx_rebalance_pred+0x10/0x10 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.612431] ? bch2_move_extent+0x3d7/0x6e0 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.612607] ? __bch2_move_data+0xea/0x200 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.612782] __bch2_move_data+0xea/0x200 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.612959] ? __pfx_rebalance_pred+0x10/0x10 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.613149] do_rebalance+0x517/0x8d0 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.613342] ? local_clock_noinstr+0xd/0xd0
[ 17.613518] ? local_clock+0x15/0x30
[ 17.613693] ? __bch2_trans_get+0x152/0x300 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.613890] ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
[ 17.614090] bch2_rebalance_thread+0x66/0xb0 [bcachefs
c42b95c23facdfe11d39755520127cd771dddec2]
The offset_into_extent bit was copied from the read path, but it's
unnecessary here, where we always want to read and move the entire
indirect extent, and it causes the assertion pop - because we're using a
non-extents iterator, which always points to the end of the reflink
pointer.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eric Biggers [Sun, 16 Mar 2025 03:47:17 +0000 (20:47 -0700)]
bcachefs: use sha256() instead of crypto_shash API
Just use sha256() instead of the clunky crypto API. This is much
simpler.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eric Biggers [Sun, 16 Mar 2025 03:03:19 +0000 (20:03 -0700)]
bcachefs: Remove unnecessary softdeps on crc32c and crc64
Since bcachefs does not access crc32c and crc64 through the crypto API,
there is no need to use module softdeps to ensure they are loaded.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Mar 2025 17:39:14 +0000 (13:39 -0400)]
bcachefs: #if 0 out (enable|disable)_encryption()
These weren't hooked up, but they probably should be - add some comments
for context.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Mar 2025 01:32:33 +0000 (21:32 -0400)]
bcachefs: Improve can_write_extent()
This fixes another "rebalance spinning and doing no work" issue;
rebalance was reading extents it wanted to move, but then failing in
bch2_write() -> bch2_alloc_sectors_start() due to being unable to
allocate sufficient replicas.
This was triggered by a user playing with the durability settings, the
foreground device was an NVME device with durability=2, and originally
he'd set the background device to durability=2 as well, but changed it
back to 1 (the default) after seeing IO errors.
That meant that with replicas=2, we want to move data off the NVME
device which satisfies that constraint, but with a single durability=1
device on the background target there's no way to move the extent to
that target while satisfiying the "required replicas" constraint.
The solution for now is for bch2_data_update_init() to check for this,
and return an error - before kicking off the read.
bch2_data_update_init() already had two different checks for "will we be
able to write this extent", with partially duplicated code, so this
patch combines and improves that logic.
Additionally, we now always bail out and return an error if there's
insufficient space on the destination target. Previously, we only did
this for BCH_WRITE_alloc_nowait moves, because it might be the case that
copygc just needs to free up space on the destination target.
But we really shouldn't kick off a move if the destination is full, we
can't currently distinguish between "really full" and "just need to wait
for copygc", and if we are going to wait on copygc it'd be better to do
that before kicking off the move.
This will additionally fix "rebalance spinning" issues caused by a
filesystem that has more data than can fit in background_target - which
is a valid scenario, since we don't exclude foreground/cache devices
when calculating filesystem capacity.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 15 Mar 2025 23:24:44 +0000 (19:24 -0400)]
bcachefs: trace_io_move_write_fail
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alan Huang [Sat, 15 Mar 2025 07:39:42 +0000 (15:39 +0800)]
bcachefs: Increase blacklist range
Now there are 16 journal buffers, 8 is too small to be enough.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 10 Mar 2025 17:33:41 +0000 (13:33 -0400)]
bcachefs: __bch2_read() now takes a btree_trans
Next patch will be checking if the extent we're reading from matches the
IO failure we saw before marking the failure.
For this to work, __bch2_read() needs to take the same transaction
context that bch2_rbio_retry() uses to do that check.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 12 Mar 2025 20:56:09 +0000 (16:56 -0400)]
bcachefs: BCH_READ_data_update -> bch_read_bio.data_update
Read flags are codepath dependent and change as they're passed around,
while the fields in rbio._state are mostly fixed properties of that
particular object.
Losing track of BCH_READ_data_update would be bad, and previously it was
not obvious if it was always correctly set in the rbio, so this is a
safety cleanup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Mar 2025 17:56:43 +0000 (12:56 -0500)]
bcachefs: Checksum errors get additional retries
It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.
This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!
- bch2_bkey_pick_read_device() is substantially reworked, and
bch2_dev_io_failures is expanded to record more information about the
type of failure (i.e. number of checksum errors).
It now returns an error code that describes more precisely the reason
for the failure - checksum error, io error, or offline device, instead
of the previous generic "insufficient devices". This is important for
the next patches that add poisoning, as we only want to poison extents
when we've got real checksum errors (or perhaps IO errors?) - not
because a device was offline.
- Add a new option and superblock field for the number of checksum
retries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Mar 2025 23:42:34 +0000 (18:42 -0500)]
bcachefs: Print message on successful read retry
Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Mar 2025 00:37:10 +0000 (19:37 -0500)]
bcachefs: Return errors to top level bch2_rbio_retry()
Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.
Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Mar 2025 16:37:51 +0000 (11:37 -0500)]
bcachefs: BCH_ERR_data_read_buffer_too_small
Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Mar 2025 16:24:22 +0000 (11:24 -0500)]
bcachefs: Read error message now indicates if it was for an internal move
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 11 Mar 2025 13:04:09 +0000 (09:04 -0400)]
bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modified the
buffer at the same time.
When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.
But the retry path propagates read flags differently, so needs special
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Mar 2025 22:20:22 +0000 (17:20 -0500)]
bcachefs: Convert read path to standard error codes
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occured.
This is going to be used for the data move path, to move but poison
extents with checksum errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Mar 2025 23:42:56 +0000 (18:42 -0500)]
bcachefs: Debug params for data corruption injection
dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry ptahs
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>