linux-block.git
Kent Overstreet [Mon, 10 Oct 2022 02:25:19 +0000 (22:25 -0400)]
bcachefs: bch2_btree_iter_peek() now works with interior nodes

Needed by the next patch, which will be iterating over keys in nodes at
level 1.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 09:04:38 +0000 (05:04 -0400)]
bcachefs: bch2_btree_insert_node() no longer uses lock_write_nofail

Now that we have an error path plumbed through, there's no need to be
using bch2_btree_node_lock_write_nofail().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 2 Oct 2022 02:15:30 +0000 (22:15 -0400)]
bcachefs: Add error path to btree_split()

The next patch in the series is (finally!) going to change btree splits
(and interior updates in general) to not take intent locks all the way
up to the root - instead only locking the nodes they'll need to modify.

However, this will be introducing a race since if we're not holding a
write lock on a btree node it can be written out by another thread, and
then we might not have enough space for a new bset entry.

We can handle this by retrying - we just need to introduce a new error
path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 1 Oct 2022 04:34:02 +0000 (00:34 -0400)]
bcachefs: Write new btree nodes after parent update

In order to avoid locking all btree nodes up to the root for btree node
splits, we're going to have to introduce a new error path into
bch2_btree_insert_node(); this means we can't have done any writes or
modified global state before that point.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 08:55:02 +0000 (04:55 -0400)]
bcachefs: Simplify break_cycle()

We'd like to prioritize aborting transactions that have done less work -
however, it appears breaking cycles by telling other threads to abort
may still be buggy, so disable that for now.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 08:29:04 +0000 (04:29 -0400)]
bcachefs: Print cycle on unrecoverable deadlock

Some lock operations can't fail; a cycle of nofail locks is impossible
to recover from. So we want to get rid of these nofail locking
operations, but as this is tricky it'll be done incrementally.

If such a cycle happens, this patch prints out which codepaths are
involved so we know what to work on next.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 07:32:17 +0000 (03:32 -0400)]
bcachefs: Handle dropping pointers in data_update path

Cached pointers are generally dropped, not moved: this led to an
assertion firing in the data update path when there were no new replicas
being written.

This patch adds a data_options field for pointers to be dropped, and
tweaks move_extent() to check if we're only dropping pointers, not
writing new ones, before kicking off a data update operation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 06:30:50 +0000 (02:30 -0400)]
bcachefs: Ratelimit ec error message

We should fix the underlying issue, but for now this makes things more
usable.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 06:25:53 +0000 (02:25 -0400)]
bcachefs: Use btree_type_has_ptrs() more consistently

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 05:08:51 +0000 (01:08 -0400)]
bcachefs: Fix "multiple types of data in same bucket" with ec

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 9 Oct 2022 04:54:36 +0000 (00:54 -0400)]
bcachefs: Ensure fsck error is printed before panic

When errors=panic, we want to make sure we print the error before
calling bch2_inconsistent_error().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 3 Oct 2022 20:41:17 +0000 (16:41 -0400)]
bcachefs: Fix a deadlock in btree_update_nodes_written()

btree_node_lock_nopath() is something we'd like to get rid of: it's
always prone to deadlocks if we're accidentally holding other locks,
because it doesn't mark the lock it's taking in a path. We'll get rid of
it in the future, but for now this patch works around the problem by
calling bch2_trans_unlock() first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 3 Oct 2022 20:39:49 +0000 (16:39 -0400)]
bcachefs: bch2_trans_locked()

Useful debugging function.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 2 Oct 2022 05:41:08 +0000 (01:41 -0400)]
bcachefs: Improve btree_deadlock debugfs output

This changes bch2_check_for_deadlock() to print the longest chains it
finds - when we have a deadlock because the cycle detector isn't finding
something, this will let us see what it's missing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 2 Oct 2022 03:54:46 +0000 (23:54 -0400)]
bcachefs: Don't quash error in bch2_bucket_alloc_set_trans()

We were incorrectly returning -BCH_ERR_insufficient_devices when we'd
received a different error from bch2_bucket_alloc_trans(), which
(erroneously) turns into -EROFS further up the call chain.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 28 Sep 2022 14:16:57 +0000 (10:16 -0400)]
bcachefs: Fix a trans path overflow in bch2_btree_delete_range_trans()

bch2_btree_delete_range_trans() was using btree_trans_too_many_iters()
to avoid path overflow, but this was buggy here (and also
btree_trans_too_many_iters() is suspect in general).

btree_trans_too_many_iters() only returns true when we're close to the
maximum number of paths - within 8 - but extent insert/delete assumes
that it can use more paths than that.

Instead, we need to call bch2_trans_begin() on every loop iteration.
Since we don't want to call bch2_trans_begin() (restarting the outer
transaction) if the call was a no-op - if we had no work to do - we have
to structure things a bit oddly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
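The loop structure described above - a restart point at the top of every
iteration, with the no-op check done before anything is committed - can be
sketched roughly like this. All names and helpers below are invented for
illustration; this is a toy model, not the bcachefs code:

```c
#include <assert.h>

#define RESTART (-1)

static int begins, commits;

/* stand-in for bch2_trans_begin(): resets the (toy) transaction */
static void trans_begin(void)
{
	begins++;
}

/* stand-in for delete + commit; fails once to simulate a restart */
static int commit_delete(int *nr, int *fail_once)
{
	if (*fail_once) {
		*fail_once = 0;
		return RESTART;
	}
	(*nr)--;
	commits++;
	return 0;
}

static int delete_range(int nr, int fail_once)
{
	int ret;

	while (1) {
		trans_begin();		/* restart point, every iteration */
		if (!nr)
			return 0;	/* nothing to do: never commit */
		ret = commit_delete(&nr, &fail_once);
		if (ret && ret != RESTART)
			return ret;	/* real error */
	}
}
```

Note the ordering: the no-op check happens after trans_begin() but before any
commit, so a call with no work to do does nothing beyond resetting the
transaction.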
Kent Overstreet [Fri, 4 Nov 2022 20:06:55 +0000 (16:06 -0400)]
bcachefs: bucket_alloc_state

This refactoring puts our various allocation path counters into a
dedicated struct - the upcoming nocow patch is going to add another
counter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Sep 2022 22:56:57 +0000 (18:56 -0400)]
bcachefs: Fix bch2_btree_path_up_until_good_node()

There was a rare bug when path->locks_want was nonzero, but not
BTREE_MAX_DEPTH, where we'd return on a valid node that wasn't locked -
oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Sep 2022 21:17:23 +0000 (17:17 -0400)]
bcachefs: Factor out bch2_write_drop_io_error_ptrs()

Move slowpath code to a separate, non-inline function.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 27 Sep 2022 02:34:49 +0000 (22:34 -0400)]
bcachefs: Break out bch2_btree_path_traverse_cached_slowpath()

Prep work for further refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Sep 2022 22:21:07 +0000 (18:21 -0400)]
bcachefs: Kill io_in_flight semaphore

This used to be needed more for buffered IO, but now the block layer has
writeback throttling, so we can delete this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Sep 2022 22:18:00 +0000 (18:18 -0400)]
bcachefs: Improve bucket_alloc tracepoint

It now includes more info - whether the bucket was for metadata or data
- and we now call it in the same place as the bucket_alloc_fail
tracepoint.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Sep 2022 22:15:33 +0000 (18:15 -0400)]
bcachefs: Mark __bch2_trans_iter_init as inline

This function is fairly small and only used in two places: one very hot,
the other cold, so it should definitely be inlined.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Sep 2022 22:13:29 +0000 (18:13 -0400)]
bcachefs: Inline fast path of check_pos_snapshot_overwritten()

This moves the slowpath of check_pos_snapshot_overwritten() to a
separate function, and inlines the fast path - helping performance on
btrees that don't use snapshots and for users that aren't using
snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
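The pattern behind this commit (and several of the surrounding ones) - keep
the inlinable fast path tiny and push the rare case out of line - looks
roughly like this. The names are illustrative, not the actual bcachefs
functions:

```c
#include <assert.h>

static int slowpath_calls;

/* rare case: out of line, free to be large */
static int check_overwritten_slowpath(int snapshots)
{
	slowpath_calls++;
	/* an expensive scan for overwrites would go here */
	return snapshots > 1;
}

/* common case: small enough for the compiler to inline at every caller */
static inline int check_overwritten(int snapshots)
{
	if (snapshots <= 1)	/* btree not using snapshots */
		return 0;
	return check_overwritten_slowpath(snapshots);
}
```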
Kent Overstreet [Mon, 26 Sep 2022 20:23:19 +0000 (16:23 -0400)]
bcachefs: Improve jset_validate()

Previously, jset_validate() was formatting the initial part of an error
string for every entry it validated - expensive.

This moves that code to journal_entry_err_msg(), which is now only
called if there's an actual error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
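The optimization described here - building the error-message prefix lazily,
only once validation actually fails - can be sketched as follows; the helper
names are invented:

```c
#include <assert.h>
#include <stdio.h>

static int msgs_formatted;

/* only called on an actual error, so the (relatively expensive)
 * formatting stays off the common path */
static void entry_err_msg(char *buf, size_t len, int entry)
{
	snprintf(buf, len, "journal entry %d: ", entry);
	msgs_formatted++;
}

/* validate nr entries; entry 'bad' is the only invalid one */
static int validate_entries(int nr, int bad)
{
	char buf[64];
	int errors = 0;

	for (int i = 0; i < nr; i++) {
		if (i != bad)
			continue;	/* fast path: no formatting at all */
		entry_err_msg(buf, sizeof(buf), i);
		errors++;
	}
	return errors;
}
```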
Kent Overstreet [Mon, 26 Sep 2022 20:19:56 +0000 (16:19 -0400)]
bcachefs: Optimize btree_path_alloc()

 - move slowpath code to a separate function, btree_path_overflow()
 - no need to use hweight64
 - copy nr_max_paths from btree_transaction_stats to btree_trans,
   avoiding a data dependency in the fast path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Sep 2022 20:15:17 +0000 (16:15 -0400)]
bcachefs: Inline bch2_trans_kmalloc() fast path

Small performance optimization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 26 Sep 2022 02:26:48 +0000 (22:26 -0400)]
bcachefs: Run bch2_fs_counters_init() earlier

We need counters to be initialized before initializing shrinkers - the
shrinker callbacks will update those counters. This fixes a segfault in
userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Sep 2022 22:22:54 +0000 (18:22 -0400)]
bcachefs: btree_err() now uses bch2_print_string_as_lines()

We've seen long error messages get truncated here, so convert to the new
bch2_print_string_as_lines().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Sep 2022 22:18:48 +0000 (18:18 -0400)]
bcachefs: Improve bch2_fsck_err()

 - factor out fsck_err_get()
 - if the "bcachefs (%s):" prefix has already been applied, don't
   duplicate it
 - convert to printbufs instead of static char arrays
 - tidy up control flow a bit
 - use bch2_print_string_as_lines(), to avoid messages getting truncated

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Sep 2022 20:43:55 +0000 (16:43 -0400)]
bcachefs: bch2_print_string_as_lines()

This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
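The helper's job - splitting a large buffer on newlines and printing each
chunk separately, so no single print exceeds the ~1k printk limit - amounts
to something like this. A userspace sketch, not the kernel implementation;
it returns the line count just to make the behavior easy to check:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* print str one line at a time; returns the number of lines printed */
static int print_string_as_lines(const char *str)
{
	const char *p;
	int lines = 0;

	while (*str) {
		p = strchr(str, '\n');
		if (!p) {
			printf("%s\n", str);	/* final, unterminated line */
			return lines + 1;
		}
		printf("%.*s\n", (int)(p - str), str);
		str = p + 1;
		lines++;
	}
	return lines;
}
```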
Kent Overstreet [Sun, 25 Sep 2022 20:42:53 +0000 (16:42 -0400)]
bcachefs: bch2_btree_node_relock_notrace()

Most of the node_relock_fail trace events are generated from
bch2_btree_path_verify_level(), when debugcheck_iterators is enabled -
but we're not interested in these trace events, they don't indicate that
we're in a slowpath.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Sep 2022 18:49:14 +0000 (14:49 -0400)]
bcachefs: bch2_btree_cache_scan() improvement

We're still seeing OOM issues caused by the btree node cache shrinker
not sufficiently freeing memory: thus, this patch changes the shrinker
to not exit if __GFP_FS was not supplied.

Instead, tweak btree node memory allocation so that we never invoke
memory reclaim while holding the btree node cache lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Sep 2022 01:00:24 +0000 (21:00 -0400)]
bcachefs: Fix blocking with locks held

This is a major oopsy - we should always be unlocking before calling
closure_sync(), else we'll cause a deadlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 Sep 2022 04:20:21 +0000 (00:20 -0400)]
bcachefs: btree_update_nodes_written() needs BTREE_INSERT_USE_RESERVE

This fixes an obvious deadlock - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 Sep 2022 01:27:42 +0000 (21:27 -0400)]
bcachefs: Fix error handling in bch2_btree_update_start()

We were checking for -EAGAIN, but that's not what's returned when we
didn't pass a closure to wait with - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 2 Sep 2022 02:56:27 +0000 (22:56 -0400)]
bcachefs: Improve bch2_btree_trans_to_text()

This is just a formatting/readability improvement.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 22 Aug 2022 19:29:53 +0000 (15:29 -0400)]
bcachefs: Kill normalize_read_intent_locks()

Before we had the deadlock cycle detector, we didn't want to be holding
read locks when taking intent locks, because blocking on an intent lock
while holding a read lock was a lock ordering violation that could
cause a deadlock.

With the cycle detector this is no longer an issue, so this code can be
deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Mon, 6 Mar 2023 13:58:02 +0000 (08:58 -0500)]
bcachefs: Ensure bch2_btree_node_lock_write_nofail() never fails

In order for bch2_btree_node_lock_write_nofail() to never produce a
deadlock, we must ensure we're never holding read locks when using it.
Fortunately, it's only used from code paths where any read locks may be
safely dropped.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 22 Aug 2022 19:29:53 +0000 (15:29 -0400)]
bcachefs: Delete old deadlock avoidance code

This deletes our old lock ordering based deadlock avoidance code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Tue, 23 Aug 2022 03:12:11 +0000 (23:12 -0400)]
bcachefs: Print deadlock cycle in debugfs

In the event that we're not finished debugging the cycle detector, this
adds a new file to debugfs that shows what the cycle detector finds, if
anything. By comparing this with btree_transactions, which shows held
locks for every btree_transaction, we'll be able to determine if it's
the cycle detector that's buggy or something else.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 22 Aug 2022 17:23:47 +0000 (13:23 -0400)]
bcachefs: Deadlock cycle detector

We've outgrown our own deadlock avoidance strategy.

The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.

Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.

That approach had some issues, though.
 - Sometimes we'd issue transaction restarts unnecessarily, when no
   deadlock would have actually occurred. Lock ordering restarts have
   become our primary cause of transaction restarts, on some workloads
   totaling 20% of actual transaction commits.

 - To avoid deadlock or livelock, we'd often have to take intent locks
   when we only wanted a read lock: with the lock ordering approach, it
   is actually illegal to hold _any_ read lock while blocking on an intent
   lock, and this has been causing us unnecessary lock contention.

 - It was getting fragile - the various lock ordering rules are not
   trivial, and we'd been seeing occasional livelock issues related to
   this machinery.

So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.

When we block taking a btree lock, after adding ourselves to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.

If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we cannot allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.

Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
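The DFS described above - walk the graph of "transaction A waits on a lock
held by transaction B" edges starting from the current transaction, and
restart if we can reach ourselves again - reduces to standard cycle
detection. A minimal sketch; the adjacency matrix stands in for the real
held-lock and waitlist walking, and nothing here is actual bcachefs code:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TRANS 8

/* waits_on[a][b]: transaction a is blocked on a lock held by b */
static bool waits_on[MAX_TRANS][MAX_TRANS];

static bool dfs(int node, bool *on_stack, bool *visited)
{
	if (on_stack[node])
		return true;		/* back edge: found a cycle */
	if (visited[node])
		return false;

	visited[node] = true;
	on_stack[node] = true;
	for (int j = 0; j < MAX_TRANS; j++)
		if (waits_on[node][j] && dfs(j, on_stack, visited))
			return true;
	on_stack[node] = false;
	return false;
}

/* called after adding ourselves to a waitlist, before sleeping */
static bool would_deadlock(int trans)
{
	bool on_stack[MAX_TRANS] = { false };
	bool visited[MAX_TRANS] = { false };

	return dfs(trans, on_stack, visited);
}
```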
Kent Overstreet [Fri, 5 Aug 2022 17:06:44 +0000 (13:06 -0400)]
bcachefs: Fix bch2_btree_node_upgrade()

Previously, if we were trying to upgrade from a read to an intent lock
but we held an additional read lock via another btree_path,
bch2_btree_node_upgrade() would always fail, in six_lock_tryupgrade().

This patch factors out the code that __bch2_btree_node_lock_write() uses
to temporarily drop extra read locks, so that six_lock_tryupgrade() can
succeed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Mon, 19 Sep 2022 18:14:01 +0000 (14:14 -0400)]
bcachefs: Add a debug assert

Chasing down a strange locking bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 26 Aug 2022 23:22:24 +0000 (19:22 -0400)]
six locks: Wakeup now takes lock on behalf of waiter

This brings back an important optimization, to avoid touching the wait
lists an extra time, while preserving the property that a thread is on a
lock waitlist iff it is waiting - it is never removed from the waitlist
until it has the lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 15 Oct 2022 04:34:38 +0000 (00:34 -0400)]
six locks: Fix a lost wakeup

There was a lost wakeup between a read unlock in percpu mode and a write
lock. The unlock path unlocks, then executes a barrier, then checks for
waiters; correspondingly, the lock side should set the wait bit and
execute a barrier, then attempt to take the lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
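The ordering requirement can be made concrete with a tiny state machine:
the unlock side stores, fences, then loads the wait bit; the lock side
stores the wait bit, fences, then retries the lock. With both orders in
place, at least one side observes the other's store, so the waiter is
either woken or acquires the lock. Names, bit layout and the seq-cst
fences are illustrative only - this is the shape of the protocol, not the
six-lock code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED	1u
#define WAITING	2u

static atomic_uint state;
static int wakeups;

static bool try_lock(void)
{
	unsigned v = atomic_load(&state);

	return !(v & LOCKED) &&
	       atomic_compare_exchange_strong(&state, &v, v | LOCKED);
}

/* lock side: set the wait bit, barrier, then retry before sleeping */
static bool lock_or_wait(void)
{
	atomic_fetch_or(&state, WAITING);
	atomic_thread_fence(memory_order_seq_cst);
	return try_lock();
}

/* unlock side: drop the lock, barrier, then check for waiters */
static void unlock(void)
{
	atomic_fetch_and(&state, ~LOCKED);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&state) & WAITING)
		wakeups++;	/* the wakeup would go here */
}
```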
Kent Overstreet [Sat, 24 Sep 2022 04:13:56 +0000 (00:13 -0400)]
six locks: Enable lockdep

Now that we have lockdep_set_no_check_recursion(), we can enable lockdep
checking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 Sep 2022 05:33:13 +0000 (01:33 -0400)]
six locks: Add start_time to six_lock_waiter

This is needed by the cycle detector in bcachefs - we need a way to
iterate over waitlist entries while dropping and retaking the waitlist
lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 27 Aug 2022 20:22:51 +0000 (16:22 -0400)]
six locks: six_lock_waiter()

This allows passing in the wait list entry - to be used for a deadlock
cycle detector.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 25 Aug 2022 14:49:52 +0000 (10:49 -0400)]
six locks: Simplify wait lists

This switches to a single list of waiters, instead of separate lists for
read and intent, and switches write locks to also use the wait lists
instead of being handled differently.

Also, removal from the wait list is now done by the process waiting on
the lock, not the process doing the wakeup. This is needed for the new
deadlock cycle detector - we need tasks to stay on the waitlist until
they've successfully acquired the lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Sep 2022 21:10:33 +0000 (17:10 -0400)]
bcachefs: Add private error codes for ENOSPC

Continuing the saga of introducing private dedicated error codes for
each error path, this patch converts ENOSPC to error codes that are
subtypes of ENOSPC. We've recently had a test failure where we got
-ENOSPC where we shouldn't have, and didn't have enough information to
tell where it came from, so this patch will solve that problem.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Sep 2022 19:43:50 +0000 (15:43 -0400)]
bcachefs: Errcodes can now subtype standard error codes

The next patch is going to be adding private error codes for all the
places we return -ENOSPC.

Additionally, this patch updates return paths at all module boundaries
to call bch2_err_class(), to return the standard error code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
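The idea - private error codes outside the errno range that remember exactly
where an error originated, collapsed back to their standard parent code at
module boundaries - can be sketched like this. The codes and names below are
invented for illustration; this only models what bch2_err_class() does at
the boundary:

```c
#include <assert.h>
#include <errno.h>

/* private codes live well above the errno range */
enum {
	ERR_START = 2048,
	/* subtypes of ENOSPC, one per failing path */
	ERR_nospc_bucket_alloc,
	ERR_nospc_journal_prereservation,
	ERR_MAX,
};

/* map a private subtype back to the standard code it refines;
 * everything else passes through unchanged */
static int err_class(int err)
{
	if (err > ERR_START && err < ERR_MAX)
		return ENOSPC;
	return err;
}
```

Inside the filesystem, each failing path returns its own subtype, so a test
failure can report which path produced the -ENOSPC; userspace only ever sees
the standard code.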
Kent Overstreet [Sun, 18 Sep 2022 17:37:34 +0000 (13:37 -0400)]
bcachefs: Make an assertion more informative

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Sep 2022 18:42:38 +0000 (14:42 -0400)]
bcachefs: All held locks must be in a btree path

With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.

This changes the btree split path to allocate some btree_paths for the
new nodes, since until the final update is done we otherwise wouldn't
have a path referencing them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Sep 2022 18:36:24 +0000 (14:36 -0400)]
bcachefs: bch2_btree_path_upgrade() now emits transaction restart

Centralizing the transaction restart/tracepoint in
bch2_btree_path_upgrade() lets us improve the tracepoint - now it emits
old and new locks_want.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 17 Sep 2022 19:20:13 +0000 (15:20 -0400)]
bcachefs: Add a manual trigger for lock wakeups

Spotted a lockup once that appeared to be a lost wakeup. Adding a manual
trigger for lock wakeups will make it easy to tell if that's what it is
next time it occurs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 16 Sep 2022 22:39:01 +0000 (18:39 -0400)]
bcachefs: Fix sb_field_counters formatting

We have counters with longer names now, so adjust the tabstop - also,
make sure there's always a space printed between the name and the
number.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 4 Sep 2022 18:10:12 +0000 (14:10 -0400)]
bcachefs: Re-enable hash_redo_key()

When subvolumes & snapshots were rolled out, hash_redo_key() was
disabled due to some new complications - namely, bch2_hash_set() works
at the subvolume level, and fsck does not run in a defined subvolume,
instead working at the snapshot ID level.

This patch splits out bch2_hash_set_snapshot() from bch2_hash_set(), and
makes one small tweak for fsck:

 - Normally, bch2_hash_set() (and other dirent code) needs to know what
   subvolume we're in, because dirents that point to other subvolumes
   should only be visible in the subvolume they were created in, not
   other snapshots. We can't check that in fsck, so we just assume that
   all dirents are visible.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Sep 2022 06:22:47 +0000 (02:22 -0400)]
bcachefs: Kill journal_keys->journal_seq_base

This removes an optimization that didn't actually save us any memory,
due to alignment, but did make the code more complicated than it needed
to be. We were also seeing a bug where journal_seq_base wasn't getting
correctly initialized, so hopefully it'll fix that too.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 4 Sep 2022 05:28:51 +0000 (01:28 -0400)]
bcachefs: Fix redundant transaction restart

Little bit of tidying up, this makes the counters a little bit clearer
as to what's happening.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 4 Sep 2022 02:24:16 +0000 (22:24 -0400)]
bcachefs: Ensure intent locks are marked before taking write locks

Locks must be correctly marked for the cycle detector to work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 3 Sep 2022 02:59:39 +0000 (22:59 -0400)]
bcachefs: Avoid using btree_node_lock_nopath()

With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.

All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 4 Sep 2022 02:07:31 +0000 (22:07 -0400)]
bcachefs: Fix usage of six lock's percpu mode, key cache version

Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.

Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segregate percpu and non percpu objects, including when
on freelists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 4 Sep 2022 01:14:53 +0000 (21:14 -0400)]
bcachefs: Refactor bkey_cached_alloc() path

Clean up the arguments passed and make them more consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 4 Sep 2022 01:09:54 +0000 (21:09 -0400)]
bcachefs: Convert more locking code to btree_bkey_cached_common

Ideally, all the code in btree_locking.c should be converted, but then
we'd want to convert btree_path to point to btree_key_cached_common too,
and then we'd be in for a much bigger cleanup - but a bit of incremental
cleanup will still be helpful for the next patches.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 31 Aug 2022 22:53:42 +0000 (18:53 -0400)]
bcachefs: btree_bkey_cached_common->cached

Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 2 Sep 2022 02:05:16 +0000 (22:05 -0400)]
bcachefs: Fix six_lock_readers_add()

Have to be careful with bit fields - when subtracting, this was
overflowing into the write_locking bit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
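The bug class is easy to reproduce: if a reader count lives in the low bits
of a word with a flag bit directly above it, decrementing the raw word while
the count is zero borrows into the flag. A standalone demonstration - the
field layout and helper names are made up, not the six-lock layout:

```c
#include <assert.h>

/* four-bit reader count in bits 0-3, write_locking flag in bit 4 */
#define READ_MASK	0xfu
#define WRITE_LOCKING	(1u << 4)

/* buggy: subtracts from the whole word, so a zero count borrows
 * into the neighboring write_locking bit */
static unsigned buggy_drop_reader(unsigned v)
{
	return v - 1;
}

/* careful: operate on the field in isolation, then reassemble,
 * so any wraparound stays contained within READ_MASK */
static unsigned field_drop_reader(unsigned v)
{
	unsigned readers = (v & READ_MASK) - 1;

	return (v & ~READ_MASK) | (readers & READ_MASK);
}
```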
Kent Overstreet [Tue, 23 Aug 2022 03:39:23 +0000 (23:39 -0400)]
bcachefs: bch2_btree_node_lock_write_nofail()

With the new cycle detector, taking a write lock will be able to fail -
unless we pass it nofail, which is possible but not preferred.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 21 Aug 2022 18:29:43 +0000 (14:29 -0400)]
bcachefs: New locking functions

In the future, with the new deadlock cycle detector, we won't be using
bare six_lock_* anymore: lock wait entries will all be embedded in
btree_trans, and we will need a btree_trans context whenever locking a
btree node.

This patch plumbs a btree_trans to the few places that need it, and adds
two new locking functions
 - btree_node_lock_nopath, which may fail returning a transaction
   restart, and
 - btree_node_lock_nopath_nofail, to be used in places where we know we
   cannot deadlock (i.e. because we're holding no other locks).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Kent Overstreet [Fri, 26 Aug 2022 18:55:00 +0000 (14:55 -0400)]
bcachefs: Mark write locks before taking lock

six locks are unfair: while a thread is blocked trying to take a write
lock, new read locks will fail. The new deadlock cycle detector makes
use of our existing lock tracing, so we need to tell it we're holding a
write lock before we take the lock for it to work correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Delete time_stats for lock contended times
Kent Overstreet [Sat, 27 Aug 2022 21:47:27 +0000 (17:47 -0400)]
bcachefs: Delete time_stats for lock contended times

Since we've now got time_stats for lock hold times (per btree
transaction), we don't need this anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Don't leak lock pcpu counts memory
Kent Overstreet [Tue, 30 Aug 2022 15:40:03 +0000 (11:40 -0400)]
bcachefs: Don't leak lock pcpu counts memory

This fixes a small memory leak.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agosix locks: Delete six_lock_pcpu_free_rcu()
Kent Overstreet [Sat, 27 Aug 2022 19:00:59 +0000 (15:00 -0400)]
six locks: Delete six_lock_pcpu_free_rcu()

Didn't have any users, and wasn't a good idea to begin with - delete it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Add persistent counters for all tracepoints
Kent Overstreet [Sat, 27 Aug 2022 16:48:36 +0000 (12:48 -0400)]
bcachefs: Add persistent counters for all tracepoints

Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Fix bch2_btree_update_start() to return -BCH_ERR_journal_reclaim_would_deadlock
Kent Overstreet [Sat, 27 Aug 2022 16:37:05 +0000 (12:37 -0400)]
bcachefs: Fix bch2_btree_update_start() to return -BCH_ERR_journal_reclaim_would_deadlock

On failure to get a journal pre-reservation because we're called from
journal reclaim we're not supposed to return a transaction restart error
- this fixes a livelock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Improve bch2_btree_node_relock()
Kent Overstreet [Sat, 27 Aug 2022 16:28:09 +0000 (12:28 -0400)]
bcachefs: Improve bch2_btree_node_relock()

This moves the IS_ERR_OR_NULL() check to the inline part, since that's
the fast path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Improve trans_restart_journal_preres_get tracepoint
Kent Overstreet [Sat, 27 Aug 2022 16:23:38 +0000 (12:23 -0400)]
bcachefs: Improve trans_restart_journal_preres_get tracepoint

It now includes journal_flags.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Improve btree_node_relock_fail tracepoint
Kent Overstreet [Sat, 27 Aug 2022 16:11:18 +0000 (12:11 -0400)]
bcachefs: Improve btree_node_relock_fail tracepoint

It now prints the error name when the btree node is an error pointer;
also, failures are no longer traced when the btree node is
BCH_ERR_no_btree_node_up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Make more btree_paths available
Kent Overstreet [Sat, 27 Aug 2022 14:30:36 +0000 (10:30 -0400)]
bcachefs: Make more btree_paths available

 - Don't decrease BTREE_ITER_MAX when building with CONFIG_LOCKDEP
   anymore. The lockdep table sizes are configurable now, we don't need
   this anymore.
 - btree_trans_too_many_iters() is less conservative now: previously it
   forced a transaction restart once more than BTREE_ITER_MAX / 2 paths
   were in use; the threshold is now BTREE_ITER_MAX - 8.

This helps with excessive transaction restarts/livelocks in the bucket
allocator path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Correctly initialize bkey_cached->lock
Kent Overstreet [Fri, 26 Aug 2022 01:42:46 +0000 (21:42 -0400)]
bcachefs: Correctly initialize bkey_cached->lock

We need to use the right class for some assertions to work correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Track held write locks
Kent Overstreet [Tue, 23 Aug 2022 01:05:31 +0000 (21:05 -0400)]
bcachefs: Track held write locks

The upcoming lock cycle detection code will need to know precisely which
locks every btree_trans is holding, including write locks - this patch
updates btree_node_locked_type to include write locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Print lock counts in debugfs btree_transactions
Kent Overstreet [Tue, 23 Aug 2022 05:20:24 +0000 (01:20 -0400)]
bcachefs: Print lock counts in debugfs btree_transactions

Improve our debugfs output, to help in debugging deadlocks: this shows,
for every btree node we print, the current number of readers/intent
locks/write locks held.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Switch btree locking code to struct btree_bkey_cached_common
Kent Overstreet [Mon, 22 Aug 2022 17:21:10 +0000 (13:21 -0400)]
bcachefs: Switch btree locking code to struct btree_bkey_cached_common

This is just some type safety cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Track maximum transaction memory
Kent Overstreet [Tue, 23 Aug 2022 01:49:55 +0000 (21:49 -0400)]
bcachefs: Track maximum transaction memory

This patch
 - tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
 - makes it available in debugfs
 - switches bch2_trans_init() to using that for the amount of memory to
   preallocate, instead of the parameter passed in

This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agosix locks: Improve six_lock_count
Kent Overstreet [Mon, 22 Aug 2022 03:08:53 +0000 (23:08 -0400)]
six locks: Improve six_lock_count

six_lock_count now counts whether a write lock is held, and this patch
also correctly counts six_lock->intent_lock_recurse.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Kill nodes_intent_locked
Kent Overstreet [Sun, 21 Aug 2022 21:20:42 +0000 (17:20 -0400)]
bcachefs: Kill nodes_intent_locked

Previously, we used two different bit arrays for tracking held btree
node locks. This patch switches to an array of two bit integers, which
will let us track, in a future patch, when we hold a write lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Better use of locking helpers
Kent Overstreet [Sun, 21 Aug 2022 22:17:51 +0000 (18:17 -0400)]
bcachefs: Better use of locking helpers

Held btree locks are tracked in btree_path->nodes_locked and
btree_path->nodes_intent_locked. Upcoming patches are going to change
the representation in struct btree_path, so this patch switches to
proper helpers instead of direct access to these fields.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Reorganize btree_locking.[ch]
Kent Overstreet [Fri, 19 Aug 2022 23:50:18 +0000 (19:50 -0400)]
bcachefs: Reorganize btree_locking.[ch]

Tidy things up a bit before doing more work in this file.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: btree_locking.c
Kent Overstreet [Fri, 19 Aug 2022 19:35:34 +0000 (15:35 -0400)]
bcachefs: btree_locking.c

Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Fix adding a device with a label
Kent Overstreet [Thu, 18 Aug 2022 21:57:24 +0000 (17:57 -0400)]
bcachefs: Fix adding a device with a label

Device labels are represented as pointers in the member info section: we
need to get and then set the label for it to be kept correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: fsck: Another transaction restart handling fix
Kent Overstreet [Thu, 18 Aug 2022 21:00:12 +0000 (17:00 -0400)]
bcachefs: fsck: Another transaction restart handling fix

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: bch2_btree_delete_range_trans() now returns -BCH_ERR_transaction_restart_nested
Kent Overstreet [Thu, 18 Aug 2022 17:00:26 +0000 (13:00 -0400)]
bcachefs: bch2_btree_delete_range_trans() now returns -BCH_ERR_transaction_restart_nested

The new convention is that functions that handle transaction restarts
within an existing transaction context should return
-BCH_ERR_transaction_restart_nested when they did so, since they
invalidated the outer transaction context.

This also means bch2_btree_delete_range_trans() is changed to only call
bch2_trans_begin() after a transaction restart, not on every loop
iteration.

This is to fix a bug in fsck, in check_inode() when we truncate an inode
with BCH_INODE_I_SIZE_DIRTY set.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Minor transaction restart handling fix
Kent Overstreet [Thu, 18 Aug 2022 02:17:08 +0000 (22:17 -0400)]
bcachefs: Minor transaction restart handling fix

 - fsck_inode_rm() wasn't returning BCH_ERR_transaction_restart_nested
 - change bch2_trans_verify_not_restarted() to call panic() - we don't
   want these errors to be missed

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Fix bch2_btree_iter_peek_slot() error path
Kent Overstreet [Wed, 17 Aug 2022 21:49:12 +0000 (17:49 -0400)]
bcachefs: Fix bch2_btree_iter_peek_slot() error path

iter->k needs to be consistent with iter->pos - required for
bch2_btree_iter_(rewind|advance) to work correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Another should_be_locked fixup
Kent Overstreet [Tue, 16 Aug 2022 07:08:15 +0000 (03:08 -0400)]
bcachefs: Another should_be_locked fixup

When returning a key from the key cache, in BTREE_ITER_WITH_KEY_CACHE
mode, we don't want to set should_be_locked on iter->path: we're not
returning a key from that path, so we don't need to. Also, since we
traversed the key cache iterator before setting should_be_locked on that
path, it might be unlocked (if we unlocked, bch2_trans_relock() won't
have relocked it).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: bch2_bkey_packed_to_binary_text()
Kent Overstreet [Sun, 14 Aug 2022 18:44:17 +0000 (14:44 -0400)]
bcachefs: bch2_bkey_packed_to_binary_text()

For debugging the eytzinger search tree code, and low level bkey packing
code, it can be helpful to see things in binary: this patch improves our
helpers for doing so.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Add assertions for unexpected transaction restarts
Kent Overstreet [Thu, 7 Jul 2022 04:37:46 +0000 (00:37 -0400)]
bcachefs: Add assertions for unexpected transaction restarts

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: btree_path_down() optimization
Kent Overstreet [Mon, 15 Aug 2022 22:55:20 +0000 (18:55 -0400)]
bcachefs: btree_path_down() optimization

We should be calling btree_node_mem_ptr_set() before path_level_init(),
since we already touched the key that btree_node_mem_ptr_set() will
modify and path_level_init() will be doing the lookup in the child btree
node we're recursing to.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Always rebuild aux search trees when node boundaries change
Kent Overstreet [Wed, 17 Aug 2022 18:20:48 +0000 (14:20 -0400)]
bcachefs: Always rebuild aux search trees when node boundaries change

Topology repair may change btree node min/max keys: when it does so, we
need to always rebuild eytzinger search trees because nodes directly
depend on those values.

This fixes a bug found by the 'kill_btree_node' test, where we'd pop an
assertion in bch2_bset_search_linear().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
10 months agobcachefs: Add an overflow check in set_bkey_val_u64s()
Kent Overstreet [Mon, 15 Aug 2022 18:05:44 +0000 (14:05 -0400)]
bcachefs: Add an overflow check in set_bkey_val_u64s()

For now this is just a BUG_ON() - we may want to change this to return
an error in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>