linux-block.git
Joshua Ashton [Sat, 12 Aug 2023 21:26:30 +0000 (22:26 +0100)]
bcachefs: Optimize bch2_dirent_name_bytes

Avoids doing a full strnlen for getting the length of the name of a
dirent entry.

Given the fact that the name of dirents is stored at the end of the
bkey's value, and we know the length of that in u64s, we can find the
last u64 and figure out how many NUL bytes are at the end of the string.

On little endian systems this ends up being the leading zeros of the
last u64, whereas on big endian systems this ends up being the trailing
zeros of the last u64.
We can take that value in bits and divide it by 8 to get the number of
NUL bytes at the end.

There is no endian-fixup or other compatibility here as this is string
data interpreted as a u64.
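
As a rough userspace illustration of the idea above (not the code from this patch), the sketch below counts the NUL padding in the last 8-byte word with the GCC/Clang builtins. The helper names trailing_nul_bytes() and padded_name_bytes() are hypothetical, and an all-zero last word is treated as 8 padding bytes because clz/ctz are undefined for 0.

```c
#include <stdint.h>
#include <string.h>

/* Number of NUL padding bytes at the end of the last 8-byte word. */
static unsigned trailing_nul_bytes(uint64_t last_word)
{
	if (last_word == 0)
		return 8;	/* clz/ctz are undefined for 0 */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	/* Last bytes in memory are the most significant: leading zeros. */
	return (unsigned)__builtin_clzll(last_word) / 8;
#else
	/* Last bytes in memory are the least significant: trailing zeros. */
	return (unsigned)__builtin_ctzll(last_word) / 8;
#endif
}

/* Length of a NUL-padded name occupying u64s 8-byte words. */
static unsigned padded_name_bytes(const void *name, unsigned u64s)
{
	uint64_t last;

	if (u64s == 0)
		return 0;

	/* memcpy avoids alignment and aliasing problems in this sketch. */
	memcpy(&last, (const unsigned char *)name + (u64s - 1) * 8, sizeof(last));
	return u64s * 8 - trailing_nul_bytes(last);
}
```

Only the last word is inspected, so the cost does not depend on the name length, unlike a strnlen() over the whole buffer.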

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Joshua Ashton [Sat, 12 Aug 2023 21:26:29 +0000 (22:26 +0100)]
bcachefs: Introduce bch2_dirent_get_name

A nice cleanup that avoids open-coding name/string handling wherever
dirents are used.

Will be used by the casefolding implementation in future commits.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 21:10:42 +0000 (17:10 -0400)]
bcachefs: six locks: Guard against wakee exiting in __six_lock_wakeup()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 20:51:45 +0000 (16:51 -0400)]
bcachefs: Don't open code closure_nr_remaining()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 20:52:33 +0000 (16:52 -0400)]
bcachefs: Fix lifetime in bch2_write_done(), add assertion

We're hunting for an open_bucket leak, so add an assertion to help track
it down. Also, we can't use the bch_fs after dropping our write ref to it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 20:46:54 +0000 (16:46 -0400)]
bcachefs: Add a comment for should_drop_open_bucket()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 19:05:06 +0000 (15:05 -0400)]
bcachefs: six locks: Fix missing barrier on wait->lock_acquired

Six locks do lock handoff via the wakeup path: the thread doing the
wakeup also takes the lock on behalf of the waiter, which means the
waiter only has to look at its waitlist entry, and doesn't have to touch
the lock cacheline while another thread is using it.

Linus noticed that this needs a real barrier, which this patch fixes.

Also add a comment for the should_sleep_fn() error path.
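
The handoff described above can be pictured with a userspace C11 analogue. This is only a sketch under assumptions: the struct and function names are hypothetical, and release/acquire atomics stand in for whatever kernel barrier primitives the actual patch uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical waitlist entry: the waiter only ever looks at this. */
struct toy_waiter {
	atomic_bool	lock_acquired;
	int		payload;	/* stands in for state set up by the waker */
};

/* Waker side: take the lock on the waiter's behalf, then publish it. */
static void toy_hand_off(struct toy_waiter *w, int payload)
{
	w->payload = payload;
	/* Release: everything written above is visible before the flag. */
	atomic_store_explicit(&w->lock_acquired, true, memory_order_release);
}

/* Waiter side: returns true once the lock has been handed over. */
static bool toy_check_handoff(struct toy_waiter *w, int *payload)
{
	/* Acquire: pairs with the release store in toy_hand_off(). */
	if (!atomic_load_explicit(&w->lock_acquired, memory_order_acquire))
		return false;

	*payload = w->payload;
	return true;
}
```

Without the release/acquire pairing, the waiter could observe the flag as set while still seeing stale data written before the handoff.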

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: linux-bcachefs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Kent Overstreet [Sat, 12 Aug 2023 16:34:47 +0000 (12:34 -0400)]
bcachefs: Check for directories in deleted inodes btree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Joshua Ashton [Sat, 12 Aug 2023 14:47:45 +0000 (15:47 +0100)]
bcachefs: Add btree_trans* to inode_set_fn

This will be used when we need to re-hash a directory tree when setting
flags.

It is not possible to have concurrent btree_trans on a thread.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 16:13:19 +0000 (12:13 -0400)]
bcachefs: Improve bch2_write_points_to_text()

Now we also print the open_buckets owned by each write_point - this is
to help with debugging a shutdown hang.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 12 Aug 2023 02:22:31 +0000 (22:22 -0400)]
bcachefs: Fix check_version_upgrade()

We were failing to upgrade to the latest compatible version - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 11 Aug 2023 23:30:38 +0000 (19:30 -0400)]
bcachefs: Fix 'journal not marked as containing replicas'

This fixes the replicas_write_errors test: the patch
  bcachefs: mark journal replicas before journal write submission

partially fixed replicas marking for the journal, but it broke the case
where one replica failed - this patch re-adds marking after the journal
write completes, when we know how many replicas succeeded.

Additionally, we do not consider it a fsck error when the very last
journal entry is not correctly marked, since there is an inherent race
there.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 5 Aug 2023 20:08:44 +0000 (16:08 -0400)]
bcachefs: btree_journal_iter.c

Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 5 Aug 2023 19:54:38 +0000 (15:54 -0400)]
bcachefs: sb-clean.c

Pull code for bch_sb_field_clean out into its own file.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 5 Aug 2023 19:43:00 +0000 (15:43 -0400)]
bcachefs: Move bch_sb_field_crypt code to checksum.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 5 Aug 2023 19:40:21 +0000 (15:40 -0400)]
bcachefs: sb-members.c

Split out a new file for bch_sb_field_members - we'll likely want to
move more code here in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 5 Aug 2023 16:55:08 +0000 (12:55 -0400)]
bcachefs: Split up btree_update_leaf.c

We now have
  btree_trans_commit.c
  btree_update.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 22:18:21 +0000 (18:18 -0400)]
bcachefs: Split up fs-io.[ch]

fs-io.c is too big - time for some reorganization
 - fs-dio.c: direct io
 - fs-pagecache.c: pagecache data structures (bch_folio), utility code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 7 Aug 2023 16:04:05 +0000 (12:04 -0400)]
bcachefs: Fix assorted checkpatch nits

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 8 Aug 2023 00:44:56 +0000 (20:44 -0400)]
bcachefs: Fix for sb buffer being misaligned

On old kernels, kmalloc() may return an allocation that's not naturally
aligned - this resulted in a bug where we allocated a bio with not
enough biovecs. Fix this by using buf_pages().
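
To see why alignment matters here, the sketch below computes how many pages a buffer touches. It is an illustration only: pages_spanned() and TOY_PAGE_SIZE are hypothetical names, and this is not claimed to be the body of buf_pages().

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size for the example */

/*
 * Number of pages touched by a buffer of len bytes starting at addr.
 * The offset within the first page has to be counted, otherwise a
 * misaligned buffer is reported as needing one page too few.
 */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
	size_t offset = addr & (TOY_PAGE_SIZE - 1);

	return (offset + len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}
```

A 4096-byte buffer that starts 2048 bytes into a page spans two pages, so sizing a bio by len / PAGE_SIZE alone would come up one biovec short.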

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 6 Aug 2023 16:43:31 +0000 (12:43 -0400)]
bcachefs: Convert journal validation to bkey_invalid_flags

This fixes a bug where we were already passing bkey_invalid_flags
around, but treating the parameter as just read/write - so the compat
code wasn't being run correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 6 Aug 2023 14:57:25 +0000 (10:57 -0400)]
bcachefs: Improve journal_entry_err_msg()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 6 Aug 2023 14:04:37 +0000 (10:04 -0400)]
bcachefs: BCH_COMPAT_bformat_overflow_done no longer required

A while back, we changed bkey_format generation to ensure that the packed
representation could never represent fields larger than the unpacked
representation.

This was to ensure that bkey_packed_successor() always gave a sensible
result, but in the current code bkey_packed_successor() is only used in
a debug assertion - not for anything important.

This kills the requirement that we've gotten rid of those weird bkey
formats, and instead changes the assertion to check if we're dealing
with an old weird bkey format.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 6 Aug 2023 14:02:41 +0000 (10:02 -0400)]
bcachefs: kill EBUG_ON() redefinition in bkey.c

Our debug mode assertions in bkey.c haven't been getting run, whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 6 Aug 2023 14:04:05 +0000 (10:04 -0400)]
bcachefs: Add logging to bch2_inode_peek() & related

Add error messages when we fail to lookup an inode, and also add a few
missing bch2_err_class() calls.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 07:39:49 +0000 (03:39 -0400)]
bcachefs: Fix lock thrashing in __bchfs_fallocate()

We've observed significant lock thrashing on fstests generic/083 in
fallocate, due to dropping and retaking btree locks when checking the
pagecache for data.

This adds a nonblocking mode to bch2_clamp_data_hole(), where we only
use folio_trylock(), and can thus be used safely while btree locks are
held - thus we only have to drop btree locks as a fallback, on actual
lock contention.
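
The general pattern (try the inner lock without blocking, and only drop the outer lock when that fails) can be sketched in userspace as follows. All names are hypothetical; a spinning atomic_flag stands in for the btree and folio locks, and this is not the bch2_clamp_data_hole() implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_lock { atomic_flag held; };

static bool toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_lock_blocking(struct toy_lock *l)
{
	while (!toy_trylock(l))
		;	/* spin; a real lock would sleep */
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Returns false if nonblock was requested and the inner lock was busy. */
static bool inspect_inner(struct toy_lock *inner, bool nonblock)
{
	if (!toy_trylock(inner)) {
		if (nonblock)
			return false;
		toy_lock_blocking(inner);
	}
	/* ... look at the data protected by the inner lock here ... */
	toy_unlock(inner);
	return true;
}

/* Caller holds outer. Returns how often outer had to be dropped. */
static unsigned inspect_with_outer_held(struct toy_lock *outer,
					struct toy_lock *inner)
{
	unsigned drops = 0;

	if (!inspect_inner(inner, true)) {
		/* Fallback: only on real contention do we drop outer. */
		toy_unlock(outer);
		drops++;
		inspect_inner(inner, false);
		toy_lock_blocking(outer);
	}
	return drops;
}
```

In the uncontended case the outer lock is never released, which is the thrashing this kind of change avoids.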

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 4 Aug 2023 14:51:02 +0000 (10:51 -0400)]
bcachefs: Fix for bch2_copygc() spuriously returning -EEXIST

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 23:36:28 +0000 (19:36 -0400)]
bcachefs: Convert btree_err_type to normal error codes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 4 Aug 2023 00:32:46 +0000 (20:32 -0400)]
bcachefs: Fix btree_err() macro

The error code wasn't being propagated correctly; change it to match
fsck_err().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 4 Aug 2023 00:57:06 +0000 (20:57 -0400)]
bcachefs: Ensure topology repair runs

This fixes should_restart_for_topology_repair() - previously it was
returning false if the btree io path had already selected topology
repair to run, even if it hadn't run yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 4 Aug 2023 00:37:32 +0000 (20:37 -0400)]
bcachefs: Log a message when running an explicit recovery pass

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 21:33:20 +0000 (17:33 -0400)]
bcachefs: Print out required recovery passes on version upgrade

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 20:38:36 +0000 (16:38 -0400)]
bcachefs: Fix shift by 64 in set_inc_field()

UBSAN was complaining about a shift by 64 in set_inc_field().

This only happened when the value being shifted was 0, so in theory
should be harmless - a shift by 64 (or register width) should logically
give a result of 0, but CPUs will in practice leave the input unchanged
when the number of bits to shift by wraps - and since our input here is
0, the output is still what we want.

But, it's still undefined behaviour and we need our UBSAN output to be
clean, so it needs to be fixed.
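
One common way to keep such a shift well defined is to special-case the full-width count. The helper below is a hypothetical sketch, not the change made to set_inc_field().

```c
#include <stdint.h>

/*
 * Left shift that is defined for every count: in C, shifting a 64-bit
 * value by 64 or more is undefined behaviour, so return the logically
 * expected result (0) explicitly instead.
 */
static inline uint64_t shl64_safe(uint64_t v, unsigned bits)
{
	return bits < 64 ? v << bits : 0;
}
```

With this helper, shl64_safe(0, 64) is simply 0 and UBSAN has nothing to report.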

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 18:42:37 +0000 (14:42 -0400)]
bcachefs: bkey_format helper improvements

 - add a to_text() method for bkey_format

 - convert bch2_bkey_format_validate() to modern error message style,
   where we pass a printbuf for the error string instead of returning a
   static string

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:56:29 +0000 (00:56 -0400)]
bcachefs: bcachefs_metadata_version_deleted_inodes

Add a new bitset btree for inodes pending deletion; this means we no
longer have to scan the full inodes btree after an unclean shutdown.

Specifically, this adds:
 - a trigger to update the deleted_inodes btree based on changes to the
   inodes btree
 - a new recovery pass
 - and check_inodes is now only a fsck pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 07:29:42 +0000 (03:29 -0400)]
bcachefs: Fix folio leak in folio_hole_offset()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 02:42:26 +0000 (22:42 -0400)]
bcachefs: Fix overlapping extent repair

A number of smallish fixes for overlapping extent repair, and (part of)
a new unit test. This fixes all the issues turned up by bhzhu203, in his
filesystem image from running mongodb + snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 3 Aug 2023 00:19:58 +0000 (20:19 -0400)]
bcachefs: In debug mode, run fsck again after fixing errors

We want to ensure that fsck actually fixed all the errors it found - the
second fsck run should be clean.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Aug 2023 23:49:24 +0000 (19:49 -0400)]
bcachefs: recovery_types.h

Move some code out of bcachefs.h, which is too much of an everything
header.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Aug 2023 16:51:51 +0000 (12:51 -0400)]
bcachefs: Handle weird opt string from sys_fsconfig()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Aug 2023 00:06:45 +0000 (20:06 -0400)]
bcachefs: Assorted fixes for clang

clang had a few more warnings about enum conversion, and also didn't
like the opts.c initializer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 07:20:08 +0000 (03:20 -0400)]
bcachefs: Move fsck_inode_rm() to inode.c

Prep work for the new deleted inodes btree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 09:38:45 +0000 (05:38 -0400)]
bcachefs: Consolidate btree id properties

This refactoring centralizes defining per-btree properties.

bch2_key_types_allowed was also about to overflow a u32, so expand that
to a u64.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 04:27:19 +0000 (00:27 -0400)]
bcachefs: bch2_trans_update_extent_overwrite()

Factor out a new helper, to be used when fsck has to repair overlapping
extents.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 03:13:43 +0000 (23:13 -0400)]
bcachefs: Fix minor memory leak on invalid bkey

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 03:14:05 +0000 (23:14 -0400)]
bcachefs: Move some declarations to the correct header

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 21 Jul 2023 02:47:59 +0000 (22:47 -0400)]
bcachefs: Fix btree iter leak in __bch2_insert_snapshot_whiteouts()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Jul 2023 23:30:53 +0000 (19:30 -0400)]
bcachefs: Fix a null ptr deref in check_xattr()

We were attempting to initialize inode hash info when no inodes were
found.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:56:07 +0000 (00:56 -0400)]
bcachefs: bch2_btree_bit_mod()

New helper for bitset btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:41:48 +0000 (00:41 -0400)]
bcachefs: move inode triggers to inode.c

bit of reorg

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 04:12:58 +0000 (00:12 -0400)]
bcachefs: fsck: delete dead code

Delete the old, now reimplemented overlapping extent check/repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 03:19:49 +0000 (23:19 -0400)]
bcachefs: Make topology repair a normal recovery pass

This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 03:21:17 +0000 (23:21 -0400)]
bcachefs: bch2_run_explicit_recovery_pass()

This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snapshot cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 20 Jul 2023 22:09:26 +0000 (18:09 -0400)]
bcachefs: Print version, options earlier in startup path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:06 +0000 (08:53 -0400)]
bcachefs: use prejournaled key updates for write buffer flushes

The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has been shown to cause
journal recovery ordering problems in the event of an untimely
crash.

For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of the key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.

To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.

Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.
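
The ordering problem can be reproduced with a toy model of replay for a single key, where the entry with the newest journal seq decides the final state. This is a sketch under assumptions: struct toy_jentry and key_exists_after_replay() are hypothetical and do not correspond to bcachefs data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One journaled update to a single key: an insert or a delete. */
struct toy_jentry {
	uint64_t	seq;
	bool		is_delete;
};

/*
 * Toy replay: entries are applied in seq order, so the entry with the
 * newest seq determines whether the key exists afterwards.
 */
static bool key_exists_after_replay(const struct toy_jentry *e, size_t n)
{
	uint64_t newest = 0;
	bool exists = false;

	for (size_t i = 0; i < n; i++)
		if (e[i].seq >= newest) {
			newest = e[i].seq;
			exists = !e[i].is_delete;
		}
	return exists;
}
```

If the flush rejournals the original insert at seq 3, after the delete at seq 2, replay resurrects the key. If the btree update instead reuses the seq of the write buffer insert, the journal holds only the insert at seq 1 and the delete at seq 2, and the delete wins as intended.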

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:05 +0000 (08:53 -0400)]
bcachefs: support btree updates of prejournaled keys

Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.

Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.

Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:04 +0000 (08:53 -0400)]
bcachefs: fold bch2_trans_update_by_path_trace() into callers

There is only one other caller so eliminate some boilerplate.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:03 +0000 (08:53 -0400)]
bcachefs: remove unnecessary btree_insert_key_leaf() wrapper

This is in preparation to support prejournaled keys. We want the
ability to optionally pass a seq stored in the btree update rather
than the seq of the committing transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Wed, 19 Jul 2023 12:53:02 +0000 (08:53 -0400)]
bcachefs: remove duplicate code between backpointer update paths

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Thu, 20 Jul 2023 13:00:33 +0000 (09:00 -0400)]
MAINTAINERS: add Brian Foster as a reviewer for bcachefs

Brian has been playing with bcachefs for several months now and has
offered to commit time to patch review.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 02:31:19 +0000 (22:31 -0400)]
bcachefs: Suppresss various error messages in no_data_io mode

We commonly use no_data_io mode when debugging filesystem metadata
dumps, where data checksum/compression errors are expected and
unimportant - this patch suppresses these.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 01:56:18 +0000 (21:56 -0400)]
bcachefs: Fix lookup_inode_for_snapshot()

This fixes a use-after-free.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 17 Jul 2023 01:09:37 +0000 (21:09 -0400)]
bcachefs: need_snapshot_cleanup shouldn't be a fsck error

We currently don't track whether snapshot cleanup still needs to finish
(aside from running a full fsck), so it shouldn't be a fsck error yet -
fsck -n after fsck has successfully completed shouldn't error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 22:15:01 +0000 (18:15 -0400)]
bcachefs: Improve key_visible_in_snapshot()

Delete a redundant bch2_snapshot_is_ancestor() check, and convert some
assertions to debug assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 19:12:25 +0000 (15:12 -0400)]
bcachefs: Refactor overlapping extent checks

Make the overlapping extent check/repair code more self contained.

This is prep work for hopefully reducing key_visible_in_snapshot() usage
here as well, and also includes a nice performance optimization to not
check ref_visible2() unless the extents potentially overlap.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:55:33 +0000 (14:55 -0400)]
bcachefs: check_extent(): don't use key_visible_in_snapshot()

This changes the main part of check_extents(), that checks the extent
against the corresponding inode, to not use key_visible_in_snapshot().

key_visible_in_snapshot() has to iterate over the list of ancestor
overwrites repeatedly calling bch2_snapshot_is_ancestor(), so this is a
significant performance improvement.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:45:23 +0000 (14:45 -0400)]
bcachefs: check_extent() refactoring

More prep work for reducing key_visible_in_snapshot() usage - this
rearranges how KEY_TYPE_whiteout keys are handled, so that they can be
marked off in inode_walker->inode->seen_this_pos.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:19:08 +0000 (14:19 -0400)]
bcachefs: fsck: walk_inode() now takes is_whiteout

We only want to synthesize an inode for the current snapshot ID for non
whiteouts - this refactoring lets us call walk_inode() earlier and clean
up some control flow.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 05:41:02 +0000 (01:41 -0400)]
bcachefs: Simplify check_extent()

Minor refactoring/dead code deletion, prep work for reworking
check_extent() to avoid key_visible_in_snapshot().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 07:11:16 +0000 (03:11 -0400)]
bcachefs: overlapping_extents_found()

This improves the repair path for overlapping extents - we now verify
that we find in the btree the overlapping extents that the algorithm
detected, and fail the fsck run with a more useful error if they don't
match.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:24:36 +0000 (14:24 -0400)]
bcachefs: fsck: inode_walker: last_pos, seen_this_pos

Prep work for changing check_extent() to avoid
key_visible_in_snapshot() - this adds the state to track whether an
inode has seen an extent at this pos.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 18:33:57 +0000 (14:33 -0400)]
bcachefs: check_extents(): make sure to check i_sectors for last inode

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 19:59:40 +0000 (15:59 -0400)]
bcachefs: Inline bch2_snapshot_is_ancestor() fast path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 16 Jul 2023 01:03:26 +0000 (21:03 -0400)]
bcachefs: Upgrade path fixes

Some minor fixes to not print errors that are actually due to a version
upgrade.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 06:43:29 +0000 (02:43 -0400)]
bcachefs: is_ancestor bitmap

Further optimization for bch2_snapshot_is_ancestor(). We add a small
inline bitmap to snapshot_t, which indicates which of the next 128
snapshot IDs are ancestors of the current id - eliminating the last few
iterations of the loop in bch2_snapshot_is_ancestor().
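
A toy version of the bitmap idea is sketched below. It assumes that a parent always has a numerically greater ID than its children and that ID 0 means "no parent"; the struct layout and all toy_* names are hypothetical, and the plain parent walk stands in for whatever faster traversal the real code uses.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_ANCESTOR_BITS 128u

struct toy_snapshot {
	uint32_t parent;	/* 0 means "no parent" */
	/* Bit d is set if ID (own id + 1 + d) is an ancestor. */
	uint64_t is_ancestor[TOY_ANCESTOR_BITS / 64];
};

/* Fill in the bitmap for id by walking its parent chain once. */
static void toy_build_bitmap(struct toy_snapshot *t, uint32_t id)
{
	t[id].is_ancestor[0] = t[id].is_ancestor[1] = 0;

	for (uint32_t a = t[id].parent; a; a = t[a].parent) {
		uint32_t d = a - id - 1;

		if (a > id && d < TOY_ANCESTOR_BITS)
			t[id].is_ancestor[d / 64] |= UINT64_C(1) << (d % 64);
	}
}

/* An ID counts as its own ancestor in this sketch. */
static bool toy_is_ancestor(const struct toy_snapshot *t,
			    uint32_t id, uint32_t ancestor)
{
	/* Walk up only while the target is beyond the bitmap's reach. */
	while (id && id < ancestor && ancestor - id > TOY_ANCESTOR_BITS)
		id = t[id].parent;

	if (!id || id > ancestor)
		return false;
	if (id == ancestor)
		return true;

	/* Within 128 IDs: one bit test replaces the remaining walk. */
	uint32_t d = ancestor - id - 1;
	return (t[id].is_ancestor[d / 64] >> (d % 64)) & 1;
}
```

Once the walk gets within 128 IDs of the target, the answer comes from a single bit test rather than further iterations.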

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Mikulas Patocka [Thu, 13 Jul 2023 16:00:28 +0000 (18:00 +0200)]
bcachefs: mark bch_inode_info and bkey_cached as reclaimable

Mark these caches as reclaimable, so that available memory is correctly
reported when there are a lot of cached inodes.

Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 02:27:16 +0000 (22:27 -0400)]
bcachefs: Compression levels

This allows including a compression level when specifying a compression
type, e.g.
  compression=zstd:15

Values from 1 through 15 indicate compression levels, 0 or unspecified
indicates the default.

For LZ4, values 3-15 specify that the HC algorithm should be used.

Note that for compatibility, extents themselves only include the
compression type, not the compression level. This means that specifying
the same compression algorithm but different compression levels for the
compression and background_compression options will have no effect.

XXX: perhaps we could add a warning for this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 02:06:37 +0000 (22:06 -0400)]
bcachefs: Extent sb compression type fields to 8 bits

The upper 4 bits are for compression level.
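
A minimal sketch of such an 8-bit encoding follows. It assumes the lower 4 bits keep the compression type (an inference from the field previously being narrower) and uses hypothetical helper names; the type number in the usage note is made up.

```c
#include <stdint.h>

/* Pack a 4-bit type and a 4-bit level into one 8-bit option value. */
static inline uint8_t toy_compression_pack(unsigned type, unsigned level)
{
	return (uint8_t)((type & 0xf) | ((level & 0xf) << 4));
}

/* Lower 4 bits: compression type (assumed). */
static inline unsigned toy_compression_type(uint8_t v)
{
	return v & 0xf;
}

/* Upper 4 bits: compression level, 0 meaning "use the default". */
static inline unsigned toy_compression_level(uint8_t v)
{
	return v >> 4;
}
```

For example, a made-up type number 3 with level 15 packs to 0xf3. A value written before the change, with the upper bits clear, decodes as level 0, i.e. the default.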

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jul 2023 02:06:11 +0000 (22:06 -0400)]
bcachefs: bcachefs_format.h should be using __u64

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 12 Jul 2023 03:47:29 +0000 (23:47 -0400)]
bcachefs: fix_errors option is now a proper enum

Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.

But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
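
A rough sketch of the special case, not the actual option parser. Only
FSCK_FIX_yes is named above; the other enum values, the name table and
parse_fix_errors() are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Only FSCK_FIX_yes comes from the text above; the rest are assumed. */
enum fsck_fix {
	FSCK_FIX_exit,
	FSCK_FIX_yes,
	FSCK_FIX_no,
	FSCK_FIX_ask,
};

static const char * const fsck_fix_names[] = { "exit", "yes", "no", "ask" };

/*
 * val is NULL or "" when the option was given without a value, which
 * keeps the old bool-style behaviour and maps to FSCK_FIX_yes.
 * Returns 0 on success, -1 if the value is not recognised.
 */
static int parse_fix_errors(const char *val, enum fsck_fix *out)
{
	size_t i;

	if (!val || !*val) {
		*out = FSCK_FIX_yes;
		return 0;
	}

	for (i = 0; i < sizeof(fsck_fix_names) / sizeof(fsck_fix_names[0]); i++)
		if (!strcmp(val, fsck_fix_names[i])) {
			*out = (enum fsck_fix)i;
			return 0;
		}

	return -1;
}
```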

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: bch_opt_fn
Kent Overstreet [Thu, 13 Jul 2023 01:48:32 +0000 (21:48 -0400)]
bcachefs: bch_opt_fn

Minor refactoring to get rid of some unneeded token pasting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Convert snapshot table to RCU array
Kent Overstreet [Wed, 12 Jul 2023 17:55:03 +0000 (13:55 -0400)]
bcachefs: Convert snapshot table to RCU array

This switches the generic radix tree for the in-memory table of snapshot
nodes to a simple rcu array. This means we have to add new locking to
deal with reallocations, but is faster than traversing the radix tree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Add a race_fault() for write buffer slowpath
Kent Overstreet [Wed, 12 Jul 2023 15:43:03 +0000 (11:43 -0400)]
bcachefs: Add a race_fault() for write buffer slowpath

We haven't hooked up dynamic fault injection quite yet, but we will soon

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Add buffered IO fallback for userspace
Kent Overstreet [Tue, 11 Jul 2023 00:30:04 +0000 (20:30 -0400)]
bcachefs: Add buffered IO fallback for userspace

In userspace, we want to be able to switch to buffered IO when we're
dealing with an image on a filesystem/device that doesn't support the
blocksize the filesystem was formatted with.

This plumbs through !opts.direct_io -> FMODE_BUFFERED, which will be
supported by the shim version of blkdev_get_by_path() in -tools, and it
adds a fallback to disable direct IO and retry for userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Fallocate now checks page cache
Kent Overstreet [Mon, 10 Jul 2023 02:28:08 +0000 (22:28 -0400)]
bcachefs: Fallocate now checks page cache

Previously, fallocate would only check the state of the extents btree
when determining if we need to create a reservation.

But the page cache might already have dirty data or a disk reservation.
This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to
check for this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Don't start copygc until recovery is finished
Kent Overstreet [Mon, 10 Jul 2023 21:23:59 +0000 (17:23 -0400)]
bcachefs: Don't start copygc until recovery is finished

With "bcachefs: Snapshot depth, skiplist fields", we now can't run data
move operations until after bch2_check_snapshots() is complete.

Ideally we'd have the copygc (and rebalance) threads wait until
c->curr_recovery_pass has advanced, but the waitlist handling is tricky
- so for now, move starting copygc back to read_write_late().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Fix build error on weird gcc
Kent Overstreet [Mon, 10 Jul 2023 19:56:05 +0000 (15:56 -0400)]
bcachefs: Fix build error on weird gcc

fixes
./include/linux/stddef.h:8:14: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Snapshot depth, skiplist fields
Kent Overstreet [Sun, 25 Jun 2023 22:04:46 +0000 (18:04 -0400)]
bcachefs: Snapshot depth, skiplist fields

This extends KEY_TYPE_snapshot to include some new fields:
 - depth, to indicate depth of this particular node from the root
 - skip[3], skiplist entries for quickly walking back up to the root

These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.

Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.
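
To show the idea, a simplified userspace sketch of an ancestor check
that uses depth and skip[3]. It is not bch2_snapshot_is_ancestor():
nodes are plain array indices here, and struct snap_node and both
functions are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory node; indices into a flat table. */
struct snap_node {
	int      parent;	/* index of parent, -1 for the root */
	unsigned depth;		/* distance from the root */
	int      skip[3];	/* indices of some ancestors, -1 if unused */
};

/*
 * Pick the furthest jump towards the root that does not go above
 * target_depth. Falls back to the parent, so each step gains at least
 * one level. Only called when t[id].depth > target_depth, so id is
 * never the root and parent is valid.
 */
static int step_towards(const struct snap_node *t, int id, unsigned target_depth)
{
	int best = t[id].parent;
	size_t i;

	for (i = 0; i < 3; i++) {
		int s = t[id].skip[i];

		if (s >= 0 &&
		    t[s].depth >= target_depth &&
		    t[s].depth < t[best].depth)
			best = s;
	}
	return best;
}

/* A node counts as its own ancestor in this sketch. */
static bool snap_is_ancestor(const struct snap_node *t, int id, int ancestor)
{
	while (t[id].depth > t[ancestor].depth)
		id = step_towards(t, id, t[ancestor].depth);

	return id == ancestor;
}
```

With no usable skip entries this degrades to walking parent pointers,
which is the O(n) behaviour the new fields are meant to avoid.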

This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Version table now lists required recovery passes
Kent Overstreet [Mon, 10 Jul 2023 17:42:26 +0000 (13:42 -0400)]
bcachefs: Version table now lists required recovery passes

Now that we've got forward compatibility sorted out, we should be doing
more frequent version upgrades in the future.

To avoid having to run a full fsck for every version upgrade, this
improves the BCH_METADATA_VERSIONS() table to explicitly specify a
bitmask of recovery passes to run when upgrading to or past a given
version.

This means we can also delete PASS_UPGRADE().
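
A sketch of how such a table could be consulted, under assumptions: the
version numbers, pass bits and passes_for_upgrade() below are made up
for illustration and do not reflect the real BCH_METADATA_VERSIONS()
entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recovery pass bits. */
enum {
	PASS_A = 1u << 0,
	PASS_B = 1u << 1,
	PASS_C = 1u << 2,
};

struct version_entry {
	unsigned version;
	uint64_t upgrade_passes;  /* run when upgrading to or past version */
};

/* Hypothetical table, ordered by version. */
static const struct version_entry version_table[] = {
	{ 10, 0 },
	{ 11, PASS_A },
	{ 12, PASS_B | PASS_C },
};

/* OR together the passes of every version in (old_version, new_version]. */
static uint64_t passes_for_upgrade(unsigned old_version, unsigned new_version)
{
	uint64_t passes = 0;
	size_t i;

	for (i = 0; i < sizeof(version_table) / sizeof(version_table[0]); i++)
		if (version_table[i].version > old_version &&
		    version_table[i].version <= new_version)
			passes |= version_table[i].upgrade_passes;

	return passes;
}
```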

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: bch2_sb_maybe_downgrade(), bch2_sb_upgrade()
Kent Overstreet [Mon, 10 Jul 2023 16:23:01 +0000 (12:23 -0400)]
bcachefs: bch2_sb_maybe_downgrade(), bch2_sb_upgrade()

Add some new helpers, and fix upgrade/downgrade in bch2_fs_initialize().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Fix a write buffer flush deadlock
Kent Overstreet [Mon, 10 Jul 2023 15:17:56 +0000 (11:17 -0400)]
bcachefs: Fix a write buffer flush deadlock

We're not supposed to block if BTREE_INSERT_JOURNAL_RECLAIM && watermark
!= BCH_WATERMARK_reclaim.

This should really be a separate BTREE_INSERT_NONBLOCK flag - add some
comments to that effect, it's not important for this patch.

btree write buffer flush depends on this behaviour though - the first
loop tries to flush sequentially, which doesn't free up space in the
journal optimally. If that can't proceed we bail out and flush in
journal order - that won't work if we're blocked instead of returning an
error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: bcachefs_metadata_version_major_minor
Kent Overstreet [Wed, 28 Jun 2023 02:09:35 +0000 (22:09 -0400)]
bcachefs: bcachefs_metadata_version_major_minor

This introduces major/minor versioning to the superblock version number.
Major version number changes indicate incompatible releases; we can move
forward to a new major version number, but not backwards. Minor version
numbers indicate compatible changes - these add features, but can still
be mounted and used by old versions.
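
A small sketch of the compatibility rule, assuming the major number is
stored in the bits above a 10 bit minor field. The bit split and all
names below are assumptions for illustration, not taken from
bcachefs_format.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define VERSION_MINOR_BITS	10	/* assumed split */

static inline uint16_t version_make(unsigned major, unsigned minor)
{
	return (uint16_t)((major << VERSION_MINOR_BITS) | minor);
}

static inline unsigned version_major(uint16_t v)
{
	return v >> VERSION_MINOR_BITS;
}

static inline unsigned version_minor(uint16_t v)
{
	return v & ((1u << VERSION_MINOR_BITS) - 1);
}

/*
 * A filesystem with a newer major version than the running code cannot
 * be used; a newer minor version within the same major is fine.
 */
static bool can_mount(uint16_t code_version, uint16_t fs_version)
{
	return version_major(fs_version) <= version_major(code_version);
}
```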

With the recent patches that make it possible to roll out new btrees and
key types without breaking compatibility, we should be able to roll out
most new features without incompatible changes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Add new assertions for shutdown path
Kent Overstreet [Sun, 9 Jul 2023 19:13:30 +0000 (15:13 -0400)]
bcachefs: Add new assertions for shutdown path

We've been seeing assertions pop that indicate the btree node cache or
key cache have dirty items when we just did a clean shutdown.

Add some more assertions so we can catch this when we're dirtying items.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: bch2_xattr_set() now updates ctime
Kent Overstreet [Sun, 9 Jul 2023 18:18:28 +0000 (14:18 -0400)]
bcachefs: bch2_xattr_set() now updates ctime

Fixes fstests generic/728

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Kill bch2_xattr_get()
Kent Overstreet [Sun, 9 Jul 2023 18:12:58 +0000 (14:12 -0400)]
bcachefs: Kill bch2_xattr_get()

Inline it into the only caller

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Fix try_decrease_writepoints()
Kent Overstreet [Sun, 9 Jul 2023 17:49:34 +0000 (13:49 -0400)]
bcachefs: Fix try_decrease_writepoints()

We were freeing open buckets on the writepoint list, but forgetting to
take them off the writepoint list - whoops

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Mark as EXPERIMENTAL
Kent Overstreet [Sun, 9 Jul 2023 17:20:29 +0000 (13:20 -0400)]
bcachefs: Mark as EXPERIMENTAL

As discussed on list, bcachefs is going to be marked as experimental for
a few releases, until the inevitable tide of new bug reports subsides.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Enumerate recovery passes
Kent Overstreet [Fri, 7 Jul 2023 06:42:28 +0000 (02:42 -0400)]
bcachefs: Enumerate recovery passes

Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.

This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.

The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.

By consolidating the "should run this recovery pass" logic, future
on-disk format upgrades will be able to say "upgrading to this version
requires x passes to run", instead of forcing all of fsck to run.
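
The shape of the loop can be sketched in userspace as follows. Only
curr_recovery_pass is taken from the text above; the pass names, the
"when" flags and struct fs_state are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical conditions under which a pass runs. */
enum pass_when {
	PASS_ALWAYS  = 1 << 0,
	PASS_FSCK    = 1 << 1,
	PASS_UNCLEAN = 1 << 2,
};

/* Hypothetical stand-in for the filesystem state. */
struct fs_state {
	bool     fsck;
	bool     unclean_shutdown;
	size_t   curr_recovery_pass;	/* index of the pass being run */
	unsigned passes_ran;		/* bitmask, for the example only */
};

struct recovery_pass {
	const char *name;
	int (*fn)(struct fs_state *);
	unsigned when;
};

/* Dummy pass body: just record that this pass ran. */
static int record_pass(struct fs_state *c)
{
	c->passes_ran |= 1u << c->curr_recovery_pass;
	return 0;
}

/* Passes always run in table order. */
static const struct recovery_pass recovery_passes[] = {
	{ "pass_a", record_pass, PASS_ALWAYS },
	{ "pass_b", record_pass, PASS_FSCK },
	{ "pass_c", record_pass, PASS_UNCLEAN },
	{ "pass_d", record_pass, PASS_ALWAYS },
};

#define NR_RECOVERY_PASSES \
	(sizeof(recovery_passes) / sizeof(recovery_passes[0]))

/* The single place that decides whether a pass should run. */
static bool should_run_pass(const struct fs_state *c,
			    const struct recovery_pass *p)
{
	return (p->when & PASS_ALWAYS) ||
	       ((p->when & PASS_FSCK) && c->fsck) ||
	       ((p->when & PASS_UNCLEAN) && c->unclean_shutdown);
}

static int run_recovery_passes(struct fs_state *c)
{
	for (c->curr_recovery_pass = 0;
	     c->curr_recovery_pass < NR_RECOVERY_PASSES;
	     c->curr_recovery_pass++) {
		const struct recovery_pass *p =
			&recovery_passes[c->curr_recovery_pass];
		int ret;

		if (!should_run_pass(c, p))
			continue;

		ret = p->fn(c);
		if (ret)
			return ret;
	}
	return 0;
}
```

In this sketch, comparing curr_recovery_pass against a pass index tells
other code whether that pass has already finished, which is what the
ad-hoc flag bits used to encode.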

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Stash journal replay params in bch_fs
Kent Overstreet [Sun, 9 Jul 2023 02:33:29 +0000 (22:33 -0400)]
bcachefs: Stash journal replay params in bch_fs

For the upcoming enumeration of recovery passes, we need all recovery
passes to be called the same way - including journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 months agobcachefs: Kill bch2_bucket_gens_read()
Kent Overstreet [Sun, 9 Jul 2023 02:27:03 +0000 (22:27 -0400)]
bcachefs: Kill bch2_bucket_gens_read()

This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.

This is prep work for enumerating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>