linux-block.git
6 months ago  bcachefs: better backpointer_target_not_found() error message
Kent Overstreet [Tue, 10 Dec 2024 19:04:39 +0000 (14:04 -0500)]
bcachefs: better backpointer_target_not_found() error message

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers
Kent Overstreet [Tue, 12 Nov 2024 08:46:31 +0000 (03:46 -0500)]
bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers

Continuing on with the self healing theme, we should be running any
check and repair code at runtime that we can - instead of declaring the
filesystem inconsistent.

This will also let us skip running the backpointers -> extents fsck pass
except in debug mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches
Kent Overstreet [Fri, 15 Nov 2024 21:31:54 +0000 (16:31 -0500)]
bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches

Instead of walking every extent and every backpointer it points to,
first sum up backpointers in each bucket and check for mismatches, and
only look for missing backpointers if mismatches were detected, and only
check extents in those buckets.

This is a major fsck scalability improvement, since the two backpointers
passes (backpointers -> extents and extents -> backpointers) are the
most expensive fsck passes by far.

Additionally, to speed up the upgrade for backpointer bucket gens, or in
situations when we have to rebuild alloc info, add a special case for
when no backpointers are found in a bucket - don't check each individual
backpointer (in particular, avoiding the write buffer flushes), just
recreate them.
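
A minimal sketch of the sum-and-compare idea described above, not the
actual fsck code. The struct and function names are hypothetical, and
buckets are modelled as plain array indices rather than btree keys:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record; not the on-disk bch_backpointer. */
struct bp_sketch {
	size_t   bucket;   /* bucket the backpointer refers to */
	uint32_t sectors;  /* sectors covered by the backpointer */
};

/*
 * Sum backpointer sectors per bucket, then record which buckets disagree
 * with the allocator's sector count. Only those buckets would need the
 * expensive per-extent check. Returns the number of mismatched buckets.
 */
static size_t find_mismatched_buckets(const struct bp_sketch *bps, size_t nr_bps,
				      const uint32_t *alloc_sectors,
				      uint32_t *bp_sums,   /* scratch, nr_buckets entries */
				      size_t nr_buckets,
				      size_t *mismatched)  /* out, nr_buckets entries */
{
	size_t nr_mismatched = 0;

	for (size_t i = 0; i < nr_buckets; i++)
		bp_sums[i] = 0;

	for (size_t i = 0; i < nr_bps; i++) {
		assert(bps[i].bucket < nr_buckets);
		bp_sums[bps[i].bucket] += bps[i].sectors;
	}

	for (size_t i = 0; i < nr_buckets; i++)
		if (bp_sums[i] != alloc_sectors[i])
			mismatched[nr_mismatched++] = i;

	return nr_mismatched;
}
```

The cheap pass touches each backpointer once; the per-extent walk is
then limited to the buckets returned in mismatched[].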

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Add write buffer flush param to backpointer_get_key()
Kent Overstreet [Fri, 15 Nov 2024 03:13:29 +0000 (22:13 -0500)]
bcachefs: Add write buffer flush param to backpointer_get_key()

In an upcoming patch bch2_backpointer_get_key() will be repairing when
it finds a dangling backpointer; it will need to flush the btree write
buffer before it can definitively say there's an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: kill __bch2_extent_ptr_to_bp()
Kent Overstreet [Sun, 17 Nov 2024 23:37:41 +0000 (18:37 -0500)]
bcachefs: kill __bch2_extent_ptr_to_bp()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_extent_ptr_to_bp() no longer depends on device
Kent Overstreet [Mon, 18 Nov 2024 04:58:21 +0000 (23:58 -0500)]
bcachefs: bch2_extent_ptr_to_bp() no longer depends on device

bch_backpointer no longer contains the bucket_offset field, it's just a
direct LBA mapping (with low bits to account for compressed extent
splitting), so we don't need to refer to the device to construct it
anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bcachefs_metadata_version_disk_accounting_big_endian
Kent Overstreet [Fri, 29 Nov 2024 22:41:43 +0000 (17:41 -0500)]
bcachefs: bcachefs_metadata_version_disk_accounting_big_endian

Fix sort order for disk accounting keys, in order to fix a regression on
mount times.

The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.

This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.
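
To illustrate why putting the type tag in the most significant byte
groups keys by type, here is a small sketch. The 4-byte key layout and
all names are assumptions made for illustration; real disk accounting
keys are larger and live in a btree, not a flat array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical key: bytes[0] is the type tag, the most significant byte. */
struct acct_key_sketch {
	uint8_t bytes[4];
};

/* Big-endian ordering is just a bytewise comparison from byte 0 onwards. */
static int acct_key_cmp(const void *l, const void *r)
{
	return memcmp(l, r, sizeof(struct acct_key_sketch));
}

/*
 * Sort the keys, then count runs of equal type tags. Because the tag is
 * compared first, each type forms exactly one contiguous run.
 */
static size_t sort_and_count_type_runs(struct acct_key_sketch *keys, size_t nr)
{
	size_t runs = 0;

	assert(keys || !nr);
	qsort(keys, nr, sizeof(keys[0]), acct_key_cmp);

	for (size_t i = 0; i < nr; i++)
		if (!i || keys[i].bytes[0] != keys[i - 1].bytes[0])
			runs++;
	return runs;
}
```

With one run per type, a reader that does not care about a given type
can seek past the whole run instead of stepping over interleaved keys.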

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bcachefs_metadata_version_backpointer_bucket_gen
Kent Overstreet [Sun, 17 Nov 2024 04:53:07 +0000 (23:53 -0500)]
bcachefs: bcachefs_metadata_version_backpointer_bucket_gen

New on disk format version: backpointers now include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.

This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.

It's a forced incompatible upgrade because the alternative would've been
permanently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.

It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_btree_path_peek_slot() doesn't return errors
Kent Overstreet [Fri, 13 Dec 2024 10:58:34 +0000 (05:58 -0500)]
bcachefs: bch2_btree_path_peek_slot() doesn't return errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: trace_key_cache_fill
Kent Overstreet [Fri, 13 Dec 2024 10:43:00 +0000 (05:43 -0500)]
bcachefs: trace_key_cache_fill

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Log message in journal for snapshot deletion
Kent Overstreet [Thu, 12 Dec 2024 09:00:40 +0000 (04:00 -0500)]
bcachefs: Log message in journal for snapshot deletion

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_trans_log_msg()
Kent Overstreet [Thu, 12 Dec 2024 05:44:28 +0000 (00:44 -0500)]
bcachefs: bch2_trans_log_msg()

Export a helper for logging to the journal when we're already in a
transaction context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Kill snapshot_t->equiv
Kent Overstreet [Thu, 12 Dec 2024 09:03:32 +0000 (04:03 -0500)]
bcachefs: Kill snapshot_t->equiv

Now entirely dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Snapshot deletion no longer uses snapshot_t->equiv
Kent Overstreet [Thu, 12 Dec 2024 08:03:58 +0000 (03:03 -0500)]
bcachefs: Snapshot deletion no longer uses snapshot_t->equiv

Switch to generating a private list of interior nodes to delete, instead
of using the equivalence class in the global data structure.

This eliminates possible races with snapshot creation, and is much
cleaner - it'll let us delete a lot of janky code for calculating and
maintaining the equivalence classes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key()
Kent Overstreet [Thu, 12 Dec 2024 07:41:37 +0000 (02:41 -0500)]
bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key()

When deleting dead snapshots, we move keys from redundant interior
snapshot nodes to child nodes - unless there's already a key, in which
case the ancestor key is deleted.

Previously, we tracked via equiv_seen whether the child snapshot had a
key, but this was tricky w.r.t. transaction restarts, and not
transactionally safe w.r.t. updates in the child snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Don't run overwrite triggers before insert
Kent Overstreet [Thu, 12 Dec 2024 07:27:52 +0000 (02:27 -0500)]
bcachefs: Don't run overwrite triggers before insert

This breaks when the trigger is inserting updates for the same btree, as
the inode trigger now does.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: alloc_data_type_set() happens in alloc trigger
Kent Overstreet [Thu, 12 Dec 2024 07:32:32 +0000 (02:32 -0500)]
bcachefs: alloc_data_type_set() happens in alloc trigger

Originally, we ran insert triggers before overwrite so that if an extent
was being moved (by fallocate insert/collapse range), the bucket sector
count wouldn't hit 0 partway through, and so we don't trigger state
changes caused by that too soon.

But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Fix key cache + BTREE_ITER_all_snapshots
Kent Overstreet [Fri, 13 Dec 2024 10:29:27 +0000 (05:29 -0500)]
bcachefs: Fix key cache + BTREE_ITER_all_snapshots

Normally, whiteouts (KEY_TYPE_whiteout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.

This means that the key cache has to store whiteouts, and key cache
fills cannot filter them.
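
The filtering rule can be sketched as follows. The enum and function
are hypothetical stand-ins; the real iterator code works on bkeys and
iterator flags, not on a flat array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical key types for illustration only. */
enum key_type_sketch {
	KEY_SKETCH_normal,
	KEY_SKETCH_whiteout,
};

/*
 * Copy the keys a lookup should return into @out. Whiteouts are dropped
 * unless the caller asked to see all snapshots, in which case they must
 * be passed through so that they can be cleaned up later.
 */
static size_t filter_keys(const enum key_type_sketch *keys, size_t nr,
			  bool all_snapshots, enum key_type_sketch *out)
{
	size_t n = 0;

	assert(keys && out);
	for (size_t i = 0; i < nr; i++)
		if (all_snapshots || keys[i] != KEY_SKETCH_whiteout)
			out[n++] = keys[i];
	return n;
}
```

A cache sitting in front of this lookup has to keep the whiteouts, since
it cannot know in advance which mode the next lookup will use.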

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots
Kent Overstreet [Thu, 12 Dec 2024 07:26:15 +0000 (02:26 -0500)]
bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots

In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: tidy btree_trans_peek_journal()
Kent Overstreet [Fri, 13 Dec 2024 11:02:24 +0000 (06:02 -0500)]
bcachefs: tidy btree_trans_peek_journal()

Change to match bch2_btree_trans_peek_updates() calling convention.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: tidy up __bch2_btree_iter_peek()
Kent Overstreet [Thu, 12 Dec 2024 08:38:14 +0000 (03:38 -0500)]
bcachefs: tidy up __bch2_btree_iter_peek()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: check_indirect_extents can run online
Kent Overstreet [Mon, 9 Dec 2024 02:10:27 +0000 (21:10 -0500)]
bcachefs: check_indirect_extents can run online

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Refactor c->opts.reconstruct_alloc
Kent Overstreet [Tue, 10 Dec 2024 18:23:47 +0000 (13:23 -0500)]
bcachefs: Refactor c->opts.reconstruct_alloc

Now handled in one place.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Add empty statement between label and declaration in check_inode_hash_info_...
Nathan Chancellor [Tue, 10 Dec 2024 18:12:07 +0000 (11:12 -0700)]
bcachefs: Add empty statement between label and declaration in check_inode_hash_info_matches_root()

Clang 18 and newer warns (or errors with CONFIG_WERROR=y):

  fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
    164 |         struct bch_inode_unpacked inode;
        |         ^

In Clang 17 and prior, this is an unconditional hard error:

  fs/bcachefs/str_hash.c:164:2: error: expected expression
    164 |         struct bch_inode_unpacked inode;
        |         ^
  fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode'
    165 |         ret = bch2_inode_unpack(k, &inode);
        |                                     ^
  fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode'
    169 |         struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
        |                                                              ^
  fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode'
    171 |                 ret = repair_inode_hash_info(trans, &inode);
        |                                                      ^

Add an empty statement between the label and the declaration to fix the
warning/error without disturbing the code too much.
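
For readers unfamiliar with the rule: before C23, a label must be
followed by a statement, and a declaration is not a statement. The
standalone sketch below shows the same pattern on a hypothetical
function (it is not the code from fs/bcachefs/str_hash.c):

```c
#include <assert.h>

/* Hypothetical helper: parse one decimal digit into *out. */
static int parse_digit(char c, int *out)
{
	assert(out);

	if (c >= '0' && c <= '9')
		goto found;
	return -1;
found:
	;	/* empty statement, so the label is not directly followed by a declaration */
	int value = c - '0';

	*out = value;
	return 0;
}
```

Removing the lone ';' after the label reproduces the class of
diagnostics quoted above on compilers that do not accept the C23 form.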

Fixes: 2519d3b0d656 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: trace_write_buffer_maybe_flush
Kent Overstreet [Tue, 10 Dec 2024 15:29:12 +0000 (10:29 -0500)]
bcachefs: trace_write_buffer_maybe_flush

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_snapshot_exists()
Kent Overstreet [Mon, 9 Dec 2024 06:31:43 +0000 (01:31 -0500)]
bcachefs: bch2_snapshot_exists()

bch2_snapshot_equiv() is going away; convert users that just wanted to
know if the snapshot exists to something better.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_check_key_has_snapshot() prints btree id
Kent Overstreet [Mon, 9 Dec 2024 03:30:19 +0000 (22:30 -0500)]
bcachefs: bch2_check_key_has_snapshot() prints btree id

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_str_hash_check_key() now checks inode hash info
Kent Overstreet [Mon, 9 Dec 2024 02:47:34 +0000 (21:47 -0500)]
bcachefs: bch2_str_hash_check_key() now checks inode hash info

Versions of the same inode in different snapshots must have the same
hash info; this is critical for lookups to work correctly.

We're going to be running the str_hash checks online, at readdir or
xattr list time, so we now need str_hash_check_key() to check for inode
hash seed mismatches, since it won't be run right after check_inodes().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Don't BUG_ON() inode unpack error
Kent Overstreet [Mon, 9 Dec 2024 03:00:36 +0000 (22:00 -0500)]
bcachefs: Don't BUG_ON() inode unpack error

Bkey validation checks that inodes are well-formed and unpack
successfully, so an unpack error should always indicate memory
corruption or some other kind of hardware bug - but these are still
errors we can recover from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Use proper errcodes for inode unpack errors
Kent Overstreet [Mon, 9 Dec 2024 02:42:49 +0000 (21:42 -0500)]
bcachefs: Use proper errcodes for inode unpack errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: kill sysfs internal/accounting
Kent Overstreet [Mon, 9 Dec 2024 01:55:03 +0000 (20:55 -0500)]
bcachefs: kill sysfs internal/accounting

Since we added per-inode counters there are now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Kill unnecessary mark_lock usage
Kent Overstreet [Sun, 8 Dec 2024 09:11:21 +0000 (04:11 -0500)]
bcachefs: Kill unnecessary mark_lock usage

We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.

It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.

Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Don't start rewriting btree nodes until after journal replay
Kent Overstreet [Mon, 9 Dec 2024 11:00:33 +0000 (06:00 -0500)]
bcachefs: Don't start rewriting btree nodes until after journal replay

This fixes a deadlock during journal replay when btree node read errors
kick off a ton of rewrites: we don't want them competing with journal
replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty...
Kent Overstreet [Sat, 7 Dec 2024 04:15:05 +0000 (23:15 -0500)]
bcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty transition

For each bucket we track when the bucket became nonempty and when it
became empty again: if we can ensure that there will be no journal
flushes in the range [nonempty, empty) (possibly because they occurred at
the same journal sequence number), then it's safe to reuse the bucket
without waiting for a journal commit.

This is a major performance optimization for erasure coding, where
writes are initially replicated, but the extra replicas are quickly
dropped: if those buckets are reused and overwritten without issuing a
cache flush to the underlying device, then they only cost bus bandwidth.

But there's a tricky corner case when there's multiple empty -> nonempty
-> empty transitions in quick succession, i.e. when data is getting
overwritten immediately as it's being written.

If this happens and the previous empty transition hasn't been flushed,
we need to continue tracking the previous nonempty transition - not
start a new one.

Fixing this means we now need to track both the nonempty and empty
transitions in bch_alloc_v4.
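
The half-open range check at the heart of this optimization can be
sketched as below. Modelling the journal as a flat array of flushed
sequence numbers is an assumption made for illustration, and the
function name is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if no flushed journal sequence number lies in the
 * half-open range [nonempty_seq, empty_seq). When the two are equal the
 * range is empty, so the bucket can always be reused immediately.
 */
static bool no_flush_in_range(const uint64_t *flushed_seqs, size_t nr,
			      uint64_t nonempty_seq, uint64_t empty_seq)
{
	assert(nonempty_seq <= empty_seq);

	for (size_t i = 0; i < nr; i++)
		if (flushed_seqs[i] >= nonempty_seq &&
		    flushed_seqs[i] <  empty_seq)
			return false;
	return true;
}
```

The corner case described above is about which nonempty_seq to feed into
this check: if an earlier empty transition has not been flushed yet, the
older nonempty sequence number has to be kept, not overwritten.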

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_journal_noflush_seq() now takes [start, end)
Kent Overstreet [Sun, 8 Dec 2024 05:28:16 +0000 (00:28 -0500)]
bcachefs: bch2_journal_noflush_seq() now takes [start, end)

Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition
Kent Overstreet [Sun, 8 Dec 2024 01:43:07 +0000 (20:43 -0500)]
bcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Don't add unknown accounting types to eytzinger tree
Kent Overstreet [Thu, 5 Dec 2024 17:35:43 +0000 (12:35 -0500)]
bcachefs: Don't add unknown accounting types to eytzinger tree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Plumb bkey_validate_context to journal_entry_validate
Kent Overstreet [Sun, 8 Dec 2024 02:36:15 +0000 (21:36 -0500)]
bcachefs: Plumb bkey_validate_context to journal_entry_validate

This lets us print the exact location in the journal if it was found in
the journal, or correctly print if it was found in the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Use a heap for handling overwrites in btree node scan
Kent Overstreet [Sat, 7 Dec 2024 00:23:22 +0000 (19:23 -0500)]
bcachefs: Use a heap for handling overwrites in btree node scan

Fix an O(n^2) issue when we find many overlapping (overwritten) btree
nodes - especially when one node overwrites many smaller nodes.

This was discovered to be an issue with the bcachefs
merge_torture_flakey test - if we had a large btree that was then
emptied, the number of difficult overwrites can be unbounded.

Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  lib min_heap: Switch to size_t
Kent Overstreet [Sat, 7 Dec 2024 00:16:02 +0000 (19:16 -0500)]
lib min_heap: Switch to size_t

size_t is the correct type for a count of objects that can fit in
memory: this also means heaps now have the same memory layout as darrays
(fs/bcachefs/darray.h), and darrays can be used as heaps.
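
As an illustration of treating a dynamic-array-shaped struct as a heap,
here is a small self-contained sketch. The struct and both functions
are hypothetical; they are neither the lib/min_heap.h API nor the
darray.h macros, and the storage is fixed-size to keep the example short:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical darray-shaped struct: element count, capacity, storage. */
struct int_darray_sketch {
	size_t nr, size;
	int *data;
};

static void swap_ints(int *a, int *b)
{
	int t = *a; *a = *b; *b = t;
}

/* Append @v and sift it up until its parent is no larger. */
static bool heap_push(struct int_darray_sketch *h, int v)
{
	assert(h && h->data);
	if (h->nr == h->size)
		return false;

	size_t i = h->nr++;
	h->data[i] = v;
	while (i) {
		size_t parent = (i - 1) / 2;

		if (h->data[parent] <= h->data[i])
			break;
		swap_ints(&h->data[parent], &h->data[i]);
		i = parent;
	}
	return true;
}

/* Remove the minimum, move the last element to the root and sift down. */
static bool heap_pop(struct int_darray_sketch *h, int *out)
{
	assert(h && h->data && out);
	if (!h->nr)
		return false;

	*out = h->data[0];
	h->data[0] = h->data[--h->nr];

	for (size_t i = 0;;) {
		size_t l = 2 * i + 1, r = l + 1, min = i;

		if (l < h->nr && h->data[l] < h->data[min])
			min = l;
		if (r < h->nr && h->data[r] < h->data[min])
			min = r;
		if (min == i)
			break;
		swap_ints(&h->data[i], &h->data[min]);
		i = min;
	}
	return true;
}
```

Because the heap needs nothing beyond a count, a capacity and a data
pointer, any container with those three fields in the same order can
serve as its backing store.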

Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Coly Li <colyli@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Minor bucket alloc optimization
Kent Overstreet [Sat, 7 Dec 2024 03:37:42 +0000 (22:37 -0500)]
bcachefs: Minor bucket alloc optimization

Check open buckets and buckets waiting for journal commit before doing
other expensive lookups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Mark more errors autofix
Kent Overstreet [Sat, 7 Dec 2024 00:49:46 +0000 (19:49 -0500)]
bcachefs: Mark more errors autofix

Tested repairing from a bug uncovered by the merge_torture_flakey test.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: fix bch2_btree_node_header_to_text() format string
Kent Overstreet [Sat, 7 Dec 2024 01:11:16 +0000 (20:11 -0500)]
bcachefs: fix bch2_btree_node_header_to_text() format string

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Journal space calculations should skip durability=0 devices
Kent Overstreet [Thu, 5 Dec 2024 17:35:17 +0000 (12:35 -0500)]
bcachefs: Journal space calculations should skip durability=0 devices

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: factor out str_hash.c
Kent Overstreet [Thu, 5 Dec 2024 04:36:33 +0000 (23:36 -0500)]
bcachefs: factor out str_hash.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: kill flags param to bch2_subvolume_get()
Kent Overstreet [Thu, 5 Dec 2024 04:40:26 +0000 (23:40 -0500)]
bcachefs: kill flags param to bch2_subvolume_get()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Don't call bch2_btree_interior_update_will_free_node() until after update...
Kent Overstreet [Thu, 5 Dec 2024 01:43:01 +0000 (20:43 -0500)]
bcachefs: Don't call bch2_btree_interior_update_will_free_node() until after update succeeds

Originally, btree splits always succeeded once we got to the point of
recursing to the btree_insert_node() call.

But that changed when we switched to not taking intent locks all the way
up to the root, and that introduced a bug, because
bch2_btree_interior_update_will_free_node() cancels pending writes and
reparents a node that's going to be made visible on disk by another
btree update to the current btree update.

This was discovered in recent backpointers work, because
bch2_btree_interior_update_will_free_node() also clears the
will_make_reachable flag, causing backpointer target lookup to
spuriously think it had found a dangling backpointer (when the
backpointer just hadn't been created yet by
btree_update_nodes_written()).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Make sure __bch2_run_explicit_recovery_pass() signals to rewind
Kent Overstreet [Thu, 5 Dec 2024 00:46:35 +0000 (19:46 -0500)]
bcachefs: Make sure __bch2_run_explicit_recovery_pass() signals to rewind

We should always signal to rewind if the requested pass hasn't been run,
even if called multiple times.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Call bch2_btree_lost_data() on btree read error
Kent Overstreet [Thu, 5 Dec 2024 00:41:38 +0000 (19:41 -0500)]
bcachefs: Call bch2_btree_lost_data() on btree read error

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Journal write path refactoring, debug improvements
Kent Overstreet [Wed, 4 Dec 2024 23:14:14 +0000 (18:14 -0500)]
bcachefs: Journal write path refactoring, debug improvements

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: dev_alloc_list.devs -> dev_alloc_list.data
Kent Overstreet [Thu, 5 Dec 2024 00:21:22 +0000 (19:21 -0500)]
bcachefs: dev_alloc_list.devs -> dev_alloc_list.data

This lets us use darray macros on dev_alloc_list (and it will become a
darray eventually, when we increase the maximum number of devices).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Fix failure to allocate journal write on discard retry
Kent Overstreet [Wed, 4 Dec 2024 23:16:25 +0000 (18:16 -0500)]
bcachefs: Fix failure to allocate journal write on discard retry

When allocating a journal write fails, then retries after doing
discards, we were failing to count already allocated replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: BCH_ERR_insufficient_journal_devices
Kent Overstreet [Wed, 4 Dec 2024 22:53:38 +0000 (17:53 -0500)]
bcachefs: BCH_ERR_insufficient_journal_devices

kill another standard error code use

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Silence "unable to allocate journal write" if we're already RO
Kent Overstreet [Wed, 4 Dec 2024 22:48:06 +0000 (17:48 -0500)]
bcachefs: Silence "unable to allocate journal write" if we're already RO

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: trace_accounting_mem_insert
Kent Overstreet [Wed, 4 Dec 2024 22:44:25 +0000 (17:44 -0500)]
bcachefs: trace_accounting_mem_insert

Add a tracepoint for inserting new accounting entries: we're seeing odd
spinning behaviour in accounting read.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Advance to next bp on BCH_ERR_backpointer_to_overwritten_btree_node
Kent Overstreet [Wed, 4 Dec 2024 06:19:28 +0000 (01:19 -0500)]
bcachefs: Advance to next bp on BCH_ERR_backpointer_to_overwritten_btree_node

Don't spin.

Fixes: de95cc201a97 ("bcachefs: Kill bch2_get_next_backpointer()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Simplify disk accounting validate late
Kent Overstreet [Wed, 4 Dec 2024 03:03:18 +0000 (22:03 -0500)]
bcachefs: Simplify disk accounting validate late

The validate late path was iterating over accounting entries in
eytzinger order, which is unnecessarily tricky when we may have to
remove entries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: logged ops only use inum 0 of logged ops btree
Kent Overstreet [Mon, 2 Dec 2024 02:35:11 +0000 (21:35 -0500)]
bcachefs: logged ops only use inum 0 of logged ops btree

We wish to use the logged ops btree for other items that aren't strictly
logged ops: cursors for inode allocation.

There's no reason to create another cached btree for inode allocator
cursors - so reserve different parts of the keyspace for different
purposes.

Older versions will ignore or delete the cursors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: rcu_pending now works in userspace
Kent Overstreet [Wed, 4 Dec 2024 02:22:26 +0000 (21:22 -0500)]
bcachefs: rcu_pending now works in userspace

Introduce a typedef to handle the difference between unsigned
long/struct urcu_gp_poll_state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: BCACHEFS_PATH_TRACEPOINTS should depend on TRACING
Geert Uytterhoeven [Tue, 3 Dec 2024 16:40:10 +0000 (17:40 +0100)]
bcachefs: BCACHEFS_PATH_TRACEPOINTS should depend on TRACING

When tracing is disabled, there is no point in asking the user about
enabling extra btree_path tracepoints in bcachefs.

Fixes: 32ed4a620c5405be ("bcachefs: Btree path tracepoints")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Fix allocating too big journal entry
Kent Overstreet [Tue, 3 Dec 2024 04:36:38 +0000 (23:36 -0500)]
bcachefs: Fix allocating too big journal entry

The "journal space available" calculations didn't take into account
mismatched bucket sizes; we need to take the minimum space available out
of our devices.
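
A minimal sketch of taking the minimum across devices, assuming the
per-device available space has already been computed. The function name
and the flat-array representation are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A journal entry is written to every device in the set, so it can be
 * no larger than the smallest per-device space. Returns 0 when there
 * are no devices.
 */
static uint64_t journal_space_min(const uint64_t *dev_space, size_t nr_devs)
{
	assert(dev_space || !nr_devs);

	if (!nr_devs)
		return 0;

	uint64_t min = dev_space[0];

	for (size_t i = 1; i < nr_devs; i++)
		if (dev_space[i] < min)
			min = dev_space[i];
	return min;
}
```

Using the first device's figure, or an average, would overestimate the
usable space whenever bucket sizes differ between devices.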

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Improve "unable to allocate journal write" message
Kent Overstreet [Sun, 1 Dec 2024 21:39:54 +0000 (16:39 -0500)]
bcachefs: Improve "unable to allocate journal write" message

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: fix bch2_journal_key_insert_take() seq
Kent Overstreet [Sun, 1 Dec 2024 04:27:45 +0000 (23:27 -0500)]
bcachefs: fix bch2_journal_key_insert_take() seq

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_async_btree_node_rewrites_flush()
Kent Overstreet [Fri, 29 Nov 2024 23:53:26 +0000 (18:53 -0500)]
bcachefs: bch2_async_btree_node_rewrites_flush()

Add a method to flush btree node rewrites at the end of recovery, to
ensure that corrected errors are persisted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: If we did repair on a btree node, make sure we rewrite it
Kent Overstreet [Fri, 29 Nov 2024 23:17:00 +0000 (18:17 -0500)]
bcachefs: If we did repair on a btree node, make sure we rewrite it

Ensure that "invalid bkey" repair gets persisted, so that it doesn't
repeatedly spam the logs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bkey_fsck_err now respects errors_silent
Kent Overstreet [Fri, 29 Nov 2024 23:20:42 +0000 (18:20 -0500)]
bcachefs: bkey_fsck_err now respects errors_silent

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: list_pop_entry()
Kent Overstreet [Sat, 30 Nov 2024 00:13:54 +0000 (19:13 -0500)]
bcachefs: list_pop_entry()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Convert write path errors to inum_to_path()
Kent Overstreet [Thu, 14 Nov 2024 04:08:57 +0000 (23:08 -0500)]
bcachefs: Convert write path errors to inum_to_path()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_inum_to_path()
Kent Overstreet [Sat, 28 Sep 2024 19:40:49 +0000 (15:40 -0400)]
bcachefs: bch2_inum_to_path()

Add a function for walking backpointers to find a path from a given
inode number, and convert various error messages to use it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Fix fsck.c build in userspace
Kent Overstreet [Sat, 30 Nov 2024 02:12:47 +0000 (21:12 -0500)]
bcachefs: Fix fsck.c build in userspace

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Add missing parameter description to bch2_bucket_alloc_trans()
Yang Li [Fri, 29 Nov 2024 06:38:27 +0000 (14:38 +0800)]
bcachefs: Add missing parameter description to bch2_bucket_alloc_trans()

The function bch2_bucket_alloc_trans() lacked a description for the
nowait parameter in its documentation comment block. This patch adds the
missing description to ensure all parameters are properly documented.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=12179
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Don't recurse in check_discard_freespace_key
Kent Overstreet [Fri, 29 Nov 2024 00:30:23 +0000 (19:30 -0500)]
bcachefs: Don't recurse in check_discard_freespace_key

When calling check_discard_freespace_key from the allocator, we can't
repair without recursing - run it asynchronously instead.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: Check for extent crc uncompressed/compressed size mismatch
Kent Overstreet [Fri, 29 Nov 2024 00:02:18 +0000 (19:02 -0500)]
bcachefs: Check for extent crc uncompressed/compressed size mismatch

When not compressed, these must be equal - this fixes an assertion pop
in bch2_rechecksum_bio().
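
The invariant being validated can be sketched as below. The struct is a
hypothetical simplification of an extent crc entry, and treating
compression type 0 as "not compressed" is an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an extent's checksum/compression info. */
struct crc_sketch {
	unsigned compression_type;	/* 0 means not compressed in this sketch */
	uint32_t compressed_size;
	uint32_t uncompressed_size;
};

/* An uncompressed extent must report identical sizes. */
static bool crc_sizes_valid(const struct crc_sketch *crc)
{
	assert(crc);

	if (!crc->compression_type)
		return crc->compressed_size == crc->uncompressed_size;
	return true;
}
```

Rejecting such keys at validation time means later code can rely on the
invariant instead of asserting on it.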

Reported-by: syzbot+50d3544c9b8db9c99fd2@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: bch2_trans_relock() is trylock for lockdep
Kent Overstreet [Thu, 28 Nov 2024 23:05:06 +0000 (18:05 -0500)]
bcachefs: bch2_trans_relock() is trylock for lockdep

fix some spurious lockdep splats

Reported-by: syzbot+e088be3c2d5c05aaac35@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago  bcachefs: cryptographic MACs on superblock are not (yet?) supported
Kent Overstreet [Thu, 28 Nov 2024 22:57:55 +0000 (17:57 -0500)]
bcachefs: cryptographic MACs on superblock are not (yet?) supported

We should add support for cryptographic MACs on the superblock - and it
won't be hard, but it'll need an incompatible feature bit (and we have a
new incompatible feature versioning scheme coming).

For now, just add a guard to avoid a null ptr deref in gen_poly_key().

Reported-by: syzbot+dd3d9835055dacb66f35@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Check for inode journal seq in the future
Kent Overstreet [Thu, 28 Nov 2024 22:48:20 +0000 (17:48 -0500)]
bcachefs: Check for inode journal seq in the future

More check and repair code: this fixes a warning in
bch2_journal_flush_seq_async()

Reported-by: syzbot+d119b445ec739e7f3068@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Check for bucket journal seq in the future
Kent Overstreet [Thu, 28 Nov 2024 21:59:40 +0000 (16:59 -0500)]
bcachefs: Check for bucket journal seq in the future

This fixes an assertion pop in bch2_journal_noflush_seq() - log the
error to the superblock and continue instead.

Reported-by: syzbot+85700120f75fc10d4e18@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
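
A rough sketch of the "record the error and continue" pattern, as opposed to asserting. All names here are hypothetical, the counter merely stands in for logging to the superblock, and clamping to the current sequence number is an assumption made for the example, not necessarily what the patch does.

```c
#include <stdint.h>

/* Hypothetical filesystem state for the sketch. */
struct fs_sketch {
	uint64_t cur_journal_seq; /* newest sequence number the journal has issued */
	unsigned nr_seq_errors;   /* stands in for "log the error to the superblock" */
};

/*
 * Returns the sequence number to use. A value newer than anything the
 * journal has issued is counted as an error and clamped (an assumption),
 * so the caller can carry on instead of hitting an assertion.
 */
static uint64_t checked_journal_seq(struct fs_sketch *fs, uint64_t seq)
{
	if (seq > fs->cur_journal_seq) {
		fs->nr_seq_errors++;
		return fs->cur_journal_seq;
	}
	return seq;
}
```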
6 months ago bcachefs: do_fsck_ask_yn()
Kent Overstreet [Thu, 28 Nov 2024 21:25:41 +0000 (16:25 -0500)]
bcachefs: do_fsck_ask_yn()

__bch2_fsck_err() is huge, and badly needs more refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Don't error out when logging fsck error
Kent Overstreet [Thu, 28 Nov 2024 21:14:06 +0000 (16:14 -0500)]
bcachefs: Don't error out when logging fsck error

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: mark more errors AUTOFIX
Kent Overstreet [Thu, 28 Nov 2024 21:09:15 +0000 (16:09 -0500)]
bcachefs: mark more errors AUTOFIX

Mark errors as autofix where syzbot has hit the repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: add missing printbuf_reset()
Kent Overstreet [Thu, 28 Nov 2024 21:09:04 +0000 (16:09 -0500)]
bcachefs: add missing printbuf_reset()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Fix journal_iter list corruption
Kent Overstreet [Thu, 28 Nov 2024 20:10:24 +0000 (15:10 -0500)]
bcachefs: Fix journal_iter list corruption

Fix exiting an iterator that wasn't initialized.

Reported-by: syzbot+2f7c2225ed8a5cb24af1@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Guard against backpointers to unknown btrees
Kent Overstreet [Thu, 28 Nov 2024 03:29:54 +0000 (22:29 -0500)]
bcachefs: Guard against backpointers to unknown btrees

Reported-by: syzbot+997f0573004dcb964555@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Issue a transaction restart after commit in repair
Kent Overstreet [Thu, 28 Nov 2024 03:09:29 +0000 (22:09 -0500)]
bcachefs: Issue a transaction restart after commit in repair

transaction commits invalidate pointers to btree values, and they also
downgrade intent locks.

This breaks the interior btree update path, which takes intent locks and
then calls into the allocator.

This isn't an ideal solution: we can't unconditionally issue a restart
after a transaction commit, because that would break other codepaths.

Reported-by: syzbot+78d82470c16a49702682@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Guard against journal seq overflow
Kent Overstreet [Thu, 28 Nov 2024 02:58:43 +0000 (21:58 -0500)]
bcachefs: Guard against journal seq overflow

Wraparound is impractical to handle since in various places we use 0 as
a sentinel value - but 64 bits (or 56, because the btree write buffer
steals a few bits) is enough for all practical purposes.

Reported-by: syzbot+73ed43fbe826227bd4e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
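
A minimal sketch of such a guard, assuming the 56-bit limit mentioned above. SKETCH_JOURNAL_SEQ_MAX and journal_seq_advance() are hypothetical names used only for this example; they are not taken from the bcachefs source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: 56 usable bits, per the commit message. */
#define SKETCH_JOURNAL_SEQ_MAX ((UINT64_C(1) << 56) - 1)

/*
 * Advance *seq by one. Returns false, leaving *seq untouched, rather than
 * exceeding the limit: a wrapped value could collide with 0, which is used
 * as a sentinel.
 */
static bool journal_seq_advance(uint64_t *seq)
{
	if (*seq >= SKETCH_JOURNAL_SEQ_MAX)
		return false;

	(*seq)++;
	return true;
}
```

Refusing to advance turns an impossible-in-practice overflow into an explicit error path instead of undefined behaviour further down the line.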
6 months ago bcachefs: BCH_FS_recovery_running
Kent Overstreet [Wed, 27 Nov 2024 08:00:54 +0000 (03:00 -0500)]
bcachefs: BCH_FS_recovery_running

If we're autofixing topology errors, we shouldn't shut down if we're
still in recovery.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Make topology errors autofix
Kent Overstreet [Mon, 25 Nov 2024 02:28:07 +0000 (21:28 -0500)]
bcachefs: Make topology errors autofix

These repair paths are well tested; we can repair them without explicit
user intervention.

This also tweaks bch2_topology_error() so that we run topology repair if
we're in recovery, not just fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: struct bkey_validate_context
Kent Overstreet [Wed, 27 Nov 2024 05:29:52 +0000 (00:29 -0500)]
bcachefs: struct bkey_validate_context

Add a new parameter to bkey validate functions, and use it to improve
invalid bkey error messages: we can now print the btree and depth it
came from, or if it came from the journal, or is a btree root.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Ignore empty btree root journal entries
Kent Overstreet [Wed, 27 Nov 2024 06:03:41 +0000 (01:03 -0500)]
bcachefs: Ignore empty btree root journal entries

There's no reason to treat them as errors: just ignore them, and go with
a previous btree root if we had one.

Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Fix null ptr deref in btree_path_lock_root()
Kent Overstreet [Wed, 27 Nov 2024 03:59:27 +0000 (22:59 -0500)]
bcachefs: Fix null ptr deref in btree_path_lock_root()

Historically, we required that all btree node roots point to a valid
(possibly fake) node, but we're improving our ability to continue in the
presence of errors.

Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Go RW earlier, for normal rw mount
Kent Overstreet [Wed, 27 Nov 2024 02:27:16 +0000 (21:27 -0500)]
bcachefs: Go RW earlier, for normal rw mount

Previously, when mounting read-write after a clean shutdown, we wouldn't
go read-write until after all the recovery passes completed.

Now, go RW early in recovery, the same as in any other situation where we
need to go read-write. This fixes a bug where we discover unlinked inodes
after a clean shutdown: repair fails because we're read only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Fix bch2_btree_node_update_key_early()
Kent Overstreet [Tue, 26 Nov 2024 20:16:57 +0000 (15:16 -0500)]
bcachefs: Fix bch2_btree_node_update_key_early()

Fix an assertion pop from the recent btree cache freelist fixes.

Fixes: baefd3f849ed ("bcachefs: btree_cache.freeable list fixes")
Reported-by: Tyler <th020394@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Change "disk accounting version 0" check to commit only
Kent Overstreet [Mon, 25 Nov 2024 22:03:13 +0000 (17:03 -0500)]
bcachefs: Change "disk accounting version 0" check to commit only

6.11 had a bug where we'd sometimes create disk accounting keys with
version 0, which causes issues for journal replay - but we don't need to
delete existing accounting keys with version 0.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Don't try to en/decrypt when encryption not available
Kent Overstreet [Mon, 25 Nov 2024 07:05:02 +0000 (02:05 -0500)]
bcachefs: Don't try to en/decrypt when encryption not available

If a btree node says it's encrypted, but the superblock never had an
encryption key - whoops, that needs to be handled.

Reported-by: syzbot+026f1857b12f5eb3f9e9@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Fix dup/misordered check in btree node read
Kent Overstreet [Mon, 25 Nov 2024 06:26:56 +0000 (01:26 -0500)]
bcachefs: Fix dup/misordered check in btree node read

We were checking for out of order keys, but not duplicate keys.

Reported-by: syzbot+dedbd67513939979f84f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
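
The difference between the two checks comes down to a strict versus non-strict comparison. The sketch below uses plain integers as keys, which is a simplification: real btree keys are multi-field positions compared with a dedicated comparison function, and keys_strictly_sorted() is a hypothetical helper, not a bcachefs function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true only if the keys are strictly increasing, i.e. there are
 * neither misordered nor duplicate keys.
 */
static bool keys_strictly_sorted(const uint64_t *keys, size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		if (keys[i - 1] >= keys[i]) /* '>' alone would let duplicates through */
			return false;
	return true;
}
```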
6 months ago bcachefs: Bad btree roots are now autofix
Kent Overstreet [Mon, 25 Nov 2024 05:21:27 +0000 (00:21 -0500)]
bcachefs: Bad btree roots are now autofix

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Kill bch2_bucket_alloc_new_fs()
Kent Overstreet [Mon, 25 Nov 2024 04:28:21 +0000 (23:28 -0500)]
bcachefs: Kill bch2_bucket_alloc_new_fs()

The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.

Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Fix btree node scan when unknown btree IDs are present
Kent Overstreet [Mon, 25 Nov 2024 03:57:01 +0000 (22:57 -0500)]
bcachefs: Fix btree node scan when unknown btree IDs are present

btree_root entries for unknown btree IDs are created during recovery,
before reading those btree roots.

But btree_node_scan may find btree nodes with unknown btree IDs when we
haven't seen roots for those btrees.

Reported-by: syzbot+1f202d4da221ec6ebf8e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: backpointer_to_missing_ptr is now autofix
Kent Overstreet [Mon, 25 Nov 2024 03:45:25 +0000 (22:45 -0500)]
bcachefs: backpointer_to_missing_ptr is now autofix

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6 months ago bcachefs: Fix accounting_read when we rewind
Kent Overstreet [Mon, 25 Nov 2024 03:28:41 +0000 (22:28 -0500)]
bcachefs: Fix accounting_read when we rewind

If we rewind recovery to run topology repair, that causes
accounting_read to run twice.

This fixes accounting being double counted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
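
One general way to make a pass safe to run twice is to have it rebuild its totals from scratch rather than add to whatever is already there. The sketch below shows that idea only; the commit message does not say how the actual fix works, and struct accounting_sketch and accounting_read_sketch() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory accounting state. */
struct accounting_sketch {
	uint64_t sectors;
};

/*
 * Rebuild the in-memory total from a list of stored deltas. Resetting the
 * total first makes the pass idempotent, so a second run after a rewind
 * does not double count.
 */
static void accounting_read_sketch(struct accounting_sketch *acc,
				   const uint64_t *deltas, size_t nr)
{
	acc->sectors = 0;
	for (size_t i = 0; i < nr; i++)
		acc->sectors += deltas[i];
}
```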