linux-block.git
10 months ago bcachefs: Split out debug_check_btree_accounting
Kent Overstreet [Mon, 2 Nov 2020 23:36:08 +0000 (18:36 -0500)]
bcachefs: Split out debug_check_btree_accounting

This check is very expensive

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Drop sysfs interface to debug parameters
Kent Overstreet [Mon, 2 Nov 2020 23:20:44 +0000 (18:20 -0500)]
bcachefs: Drop sysfs interface to debug parameters

It's not used much anymore; the module parameter interface is better.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Minor journal reclaim improvement
Kent Overstreet [Mon, 2 Nov 2020 22:51:38 +0000 (17:51 -0500)]
bcachefs: Minor journal reclaim improvement

With the btree key cache code, journal reclaim now has a lot more work
to do. It could be the case that after journal reclaim has finished one
iteration there's already more work to do, so put it in a loop to check
for that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
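
The retry loop described in this commit can be sketched as follows. This is a user-space Python model, not the kernel code: `reclaim_once`, `journal_reclaim`, and the `budget` parameter are hypothetical names, and the deque merely stands in for pinned journal entries.

```python
from collections import deque


def reclaim_once(pending, budget):
    """Flush up to `budget` entries from `pending`; return how many were flushed."""
    flushed = 0
    while pending and flushed < budget:
        pending.popleft()
        flushed += 1
    return flushed


def journal_reclaim(pending, budget=4):
    """Run reclaim passes until a pass finds no work.

    Returns the number of passes that actually flushed something.
    """
    passes = 0
    while reclaim_once(pending, budget):
        passes += 1
    return passes
```

With ten pending entries and a budget of four per pass, the loop runs three productive passes and stops on the first pass that finds nothing to do.
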
10 months ago bcachefs: Inode create optimization
Kent Overstreet [Tue, 27 Oct 2020 22:56:21 +0000 (18:56 -0400)]
bcachefs: Inode create optimization

On workloads that do a lot of multithreaded creates all at once, lock
contention on the inodes btree turns out to still be an issue.

This patch adds a small buffer of inode numbers that are known to be
free, so that we can avoid touching the btree on every create. Also,
this changes inode creates to update via the btree key cache for the
initial create.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
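
A minimal sketch of the free-inode-number buffer, as a single-threaded Python model rather than the kernel's C code. The class name, the `batch` size, and the `scans` counter are assumptions made for illustration; the set `used` stands in for the inodes btree, and the real code also has to deal with concurrency, which this model ignores.

```python
class InodeAllocator:
    """Hands out inode numbers from a small buffer of numbers known to be free."""

    def __init__(self, used, batch=8):
        self.used = set(used)   # stands in for the inodes btree
        self.batch = batch      # how many free numbers to buffer per scan
        self.free = []          # buffered free inode numbers
        self.next_hint = 1      # where the next scan resumes
        self.scans = 0          # how often we had to walk the "btree"

    def _refill(self):
        self.scans += 1
        n = self.next_hint
        while len(self.free) < self.batch:
            if n not in self.used:
                self.free.append(n)
            n += 1
        self.next_hint = n

    def create(self):
        """Allocate one inode number, scanning only when the buffer is empty."""
        if not self.free:
            self._refill()
        inum = self.free.pop(0)
        self.used.add(inum)
        return inum
```

In this model, a batch of four means four creates cost a single scan; only the fifth create triggers another one.
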
10 months ago bcachefs: Improve check for when bios are physically contiguous
Kent Overstreet [Fri, 30 Oct 2020 21:29:38 +0000 (17:29 -0400)]
bcachefs: Improve check for when bios are physically contiguous

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix spurious transaction restarts
Kent Overstreet [Wed, 28 Oct 2020 18:18:18 +0000 (14:18 -0400)]
bcachefs: Fix spurious transaction restarts

The check for whether locking a btree node would deadlock was wrong - we
have to check that interior nodes are locked before descendants, but
this check was wrong when considering cached vs. non cached iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Improve tracing for transaction restarts
Kent Overstreet [Wed, 28 Oct 2020 18:17:46 +0000 (14:17 -0400)]
bcachefs: Improve tracing for transaction restarts

We have a bug where we can get stuck with a process spinning in
transaction restarts - need more information.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix stack corruption
Kent Overstreet [Tue, 27 Oct 2020 18:10:52 +0000 (14:10 -0400)]
bcachefs: Fix stack corruption

A bkey_on_stack_realloc() call was in the wrong place, and broken for
indirect extents

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Use cached iterators for inode updates
Kent Overstreet [Sun, 22 Sep 2019 23:10:21 +0000 (19:10 -0400)]
bcachefs: Use cached iterators for inode updates

This switches inode updates to use cached btree iterators - which should
be a nice performance boost, since lock contention on the inodes btree
can be a bottleneck on multithreaded workloads.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: fiemap fixes
Kent Overstreet [Mon, 26 Oct 2020 21:03:28 +0000 (17:03 -0400)]
bcachefs: fiemap fixes

 - fiemap didn't know about inline extents, fixed
 - advancing to the next extent after we'd chased a pointer to the
   reflink btree was wrong, fixed

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix btree updates when mixing cached and non cached iterators
Kent Overstreet [Mon, 26 Oct 2020 18:45:20 +0000 (14:45 -0400)]
bcachefs: Fix btree updates when mixing cached and non cached iterators

There was a bug where bch2_trans_update() would incorrectly delete a
pending update where the new update did not actually overwrite the
existing update, because we were incorrectly using BTREE_ITER_TYPE when
sorting pending btree updates.

This affects the pending patch to use cached iterators for inode
updates.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Add mode to bch2_inode_to_text
Kent Overstreet [Mon, 26 Oct 2020 18:54:55 +0000 (14:54 -0400)]
bcachefs: Add mode to bch2_inode_to_text

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Always write a journal entry when stopping journal
Kent Overstreet [Sun, 25 Oct 2020 05:08:28 +0000 (01:08 -0400)]
bcachefs: Always write a journal entry when stopping journal

This is to fix a (harmless) bug where the read clock hand in the
superblock doesn't match the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Drop alloc keys from journal when -o reconstruct_alloc
Kent Overstreet [Sun, 25 Oct 2020 01:20:16 +0000 (21:20 -0400)]
bcachefs: Drop alloc keys from journal when -o reconstruct_alloc

This fixes a bug where we'd pop an assertion due to replaying a key for
an interior btree node when that node no longer exists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Indirect inline data extents
Kent Overstreet [Sat, 24 Oct 2020 23:51:34 +0000 (19:51 -0400)]
bcachefs: Indirect inline data extents

When inline data extents were added, reflink was forgotten about - we
need indirect inline data extents for reflink + inline data to work
correctly.

This patch adds them, and a new feature bit that's flipped when they're
used.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix rare use after free in read path
Kent Overstreet [Sun, 25 Oct 2020 00:56:47 +0000 (20:56 -0400)]
bcachefs: Fix rare use after free in read path

If the bkey_on_stack_reassemble() call in __bch2_read_indirect_extent()
reallocates the buffer, k in bch2_read - which we pointed at the
bkey_on_stack buffer - will now point to a stale buffer. Whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Improve some error messages
Kent Overstreet [Sat, 24 Oct 2020 20:37:17 +0000 (16:37 -0400)]
bcachefs: Improve some error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix for passing target= opts as mount opts
Kent Overstreet [Sat, 24 Oct 2020 01:07:17 +0000 (21:07 -0400)]
bcachefs: Fix for passing target= opts as mount opts

Some options can't be parsed until the filesystem is initialized;
previously, passing these options to mount or remount would cause mount
to fail.

This changes the mount path so that we parse the options passed in
twice, and just ignore any options that can't be parsed the first time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
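
The two-pass parse can be sketched like this. It is a Python model under stated assumptions: `make_parsers` and `parse_opts` are hypothetical helpers, the option names are used only as examples, and the `labels` mapping stands in for whatever filesystem state a target option needs before it can be resolved.

```python
def make_parsers(labels=None):
    """Build option parsers; `labels` maps device labels to ids once the fs is up."""

    def parse_target(val):
        if labels is None:
            raise LookupError("device labels not available yet")
        return labels[val]  # an unknown label raises KeyError, a LookupError

    return {"errors": str, "foreground_target": parse_target}


def parse_opts(optstr, parsers, ignore_unparseable):
    """Parse 'k=v,k=v'. With ignore_unparseable, skip options that cannot be parsed yet."""
    parsed = {}
    for item in filter(None, optstr.split(",")):
        key, _, val = item.partition("=")
        try:
            parsed[key] = parsers[key](val)
        except LookupError:
            if not ignore_unparseable:
                raise
    return parsed
```

The first pass runs with `ignore_unparseable=True` before the filesystem is initialized and picks up only the simple options; the second pass runs with the labels available and `ignore_unparseable=False`, so a genuinely bad target still fails the mount.
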
10 months ago bcachefs: Fix bch2_mark_stripe()
Kent Overstreet [Fri, 23 Oct 2020 22:40:30 +0000 (18:40 -0400)]
bcachefs: Fix bch2_mark_stripe()

There's no reason not to always recalculate these fields

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Don't drop replicas when copygcing ec data
Kent Overstreet [Thu, 23 Jul 2020 03:11:48 +0000 (23:11 -0400)]
bcachefs: Don't drop replicas when copygcing ec data

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Account for stripe parity sectors separately
Kent Overstreet [Thu, 9 Jul 2020 22:31:51 +0000 (18:31 -0400)]
bcachefs: Account for stripe parity sectors separately

Instead of trying to charge EC parity to the data within the stripe
(which is subject to rounding errors), let's charge it to the stripe
itself. It should also make -ENOSPC issues easier to deal with if we
charge for parity blocks up front, and means we can also make more fine
grained accounting available to the user.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
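
A small worked example of the rounding problem, as a Python model. The formulas are assumptions chosen to illustrate the point (integer sector counts, a proportional share rounded down per extent); the function names are hypothetical and the kernel's actual accounting may differ.

```python
def parity_charged_per_extent(extent_sectors, nr_data, nr_parity):
    """Model of charging each extent its rounded-down share of parity."""
    return sum(s * nr_parity // nr_data for s in extent_sectors)


def parity_charged_per_stripe(bucket_sectors, nr_parity):
    """Model of charging the parity buckets to the stripe itself, up front."""
    return bucket_sectors * nr_parity
```

Take a stripe with 3 data buckets and 2 parity buckets of 8 sectors each, filled by extents of 5, 5, 5, 5 and 4 sectors (24 data sectors). Charging per extent gives 3 + 3 + 3 + 3 + 2 = 14 sectors, while the stripe really holds 16 parity sectors; charging the stripe directly gives exactly 16.
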
10 months ago bcachefs: Fix for bad stripe pointers
Kent Overstreet [Tue, 20 Oct 2020 02:36:24 +0000 (22:36 -0400)]
bcachefs: Fix for bad stripe pointers

The allocator usually doesn't increment bucket gens right away on
buckets that it's about to hand out (for reasons that need to be
documented), instead deferring that to whatever extent update first
references that bucket.

But stripe pointers reference buckets without changing bucket sector
counts, meaning we could end up with a pointer in a stripe with a gen
newer than the bucket it points to.

Fix this by adding a transactional trigger for KEY_TYPE_stripe that just
writes out the keys in the alloc btree for the buckets it points to.

Also - consolidate the code that checks pointer validity.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
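
The failure mode described here (a pointer whose gen is newer than the bucket's) can be illustrated with a small pointer-validity check. This is a Python sketch that assumes generation numbers are 8-bit counters that wrap around; `gen_cmp` and `check_ptr` are written for this example and are not quoted from the kernel source.

```python
def gen_cmp(a, b):
    """Compare two 8-bit generation numbers, tolerating wraparound.

    Returns a negative, zero, or positive number, like a signed 8-bit (a - b).
    """
    d = (a - b) & 0xFF
    return d - 256 if d >= 128 else d


def check_ptr(ptr_gen, bucket_gen):
    """Classify a pointer by comparing its gen with the bucket's current gen."""
    c = gen_cmp(ptr_gen, bucket_gen)
    if c > 0:
        return "invalid: pointer gen newer than bucket gen"
    if c < 0:
        return "stale"
    return "ok"
```

A pointer with an older gen than its bucket is merely stale; a pointer with a newer gen should never exist, which is the inconsistency that writing out the alloc keys from the stripe trigger is meant to prevent.
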
10 months ago bcachefs: Start/stop io clock hands in read/write paths
Kent Overstreet [Sat, 17 Oct 2020 20:44:27 +0000 (16:44 -0400)]
bcachefs: Start/stop io clock hands in read/write paths

This fixes a bug where the clock hands in the journal and superblock
didn't match, because we were still incrementing the read clock hand
while read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Improvements to writing alloc info
Kent Overstreet [Sat, 17 Oct 2020 01:36:26 +0000 (21:36 -0400)]
bcachefs: Improvements to writing alloc info

Now that we've got transactional alloc info updates (and have for
a while), we don't need to write it out on shutdown, and we don't need to
write it out on startup except when GC found errors - this is a big
improvement to mount/unmount performance.

This patch also fixes a few bugs where we weren't writing out alloc
info (on new filesystems, and new devices) and should have been.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix assertion popping in transaction commit path
Kent Overstreet [Wed, 16 Dec 2020 18:35:16 +0000 (13:35 -0500)]
bcachefs: Fix assertion popping in transaction commit path

We can't be holding read locks on btree nodes when we go to take write
locks: this would deadlock if another thread is holding an intent lock
on the node we have a read lock on, and it tries to commit and upgrade
to a write lock.

But instead of triggering an assertion, if this happens we can just
upgrade the read lock to an intent lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Perf improvements for bch_alloc_read()
Kent Overstreet [Sat, 17 Oct 2020 01:32:02 +0000 (21:32 -0400)]
bcachefs: Perf improvements for bch_alloc_read()

On large filesystems reading in the alloc info takes a significant
amount of time. But we don't need to be calling into the fully general
bch2_mark_key() path, just open code what we need in
bch2_alloc_read_fn().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix copygc dying on startup
Kent Overstreet [Fri, 16 Oct 2020 02:50:48 +0000 (22:50 -0400)]
bcachefs: Fix copygc dying on startup

The copygc thread errors out and makes the filesystem go RO if it ever
tries to run and discovers it has no reserve allocated - which is a
problem if it races with the allocator thread and its reserve hasn't
been filled yet.

The allocator thread doesn't start filling the copygc reserve until
after BCH_FS_STARTED has been set, so make sure to wake up the allocator
threads after setting that and before starting copygc.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix copygc of compressed data
Kent Overstreet [Fri, 16 Oct 2020 02:23:02 +0000 (22:23 -0400)]
bcachefs: Fix copygc of compressed data

The check for when we need to get a disk reservation was wrong.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix another lockdep splat
Kent Overstreet [Fri, 16 Oct 2020 01:48:58 +0000 (21:48 -0400)]
bcachefs: Fix another lockdep splat

vfree() can allocate memory, so we need to call memalloc_nofs_save().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix errors early in the fs init process
Kent Overstreet [Thu, 15 Oct 2020 19:58:36 +0000 (15:58 -0400)]
bcachefs: Fix errors early in the fs init process

At some point bch2_fs_alloc() was changed to always call bch2_fs_free()
in the error path, which means we need c->cl to always be initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Copy ptr->cached when migrating data
Kent Overstreet [Fri, 10 Jul 2020 23:49:34 +0000 (19:49 -0400)]
bcachefs: Copy ptr->cached when migrating data

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix gc of stale ptr gens
Kent Overstreet [Tue, 13 Oct 2020 07:58:50 +0000 (03:58 -0400)]
bcachefs: Fix gc of stale ptr gens

Awhile back, gcing of stale pointers was split out from full
mark-and-sweep gc - but, the bit to actually drop those stale pointers
wasn't implemented. Whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix off-by-one error in ptr gen check
Kent Overstreet [Tue, 13 Oct 2020 04:06:36 +0000 (00:06 -0400)]
bcachefs: Fix off-by-one error in ptr gen check

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix a lockdep splat
Kent Overstreet [Sun, 11 Oct 2020 20:33:49 +0000 (16:33 -0400)]
bcachefs: Fix a lockdep splat

We can't allocate memory with GFP_FS while holding the btree cache lock,
and vfree() can allocate memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix __bch2_truncate_page()
Kent Overstreet [Fri, 9 Oct 2020 04:09:20 +0000 (00:09 -0400)]
bcachefs: Fix __bch2_truncate_page()

__bch2_truncate_page() will mark some of the blocks in a page as
unallocated. But, if the page is mmapped (and writable), every block in
the page needs to be marked dirty, else those blocks won't be written by
__bch2_writepage().

The solution is to change those userspace mappings to RO, so that we
force bch2_page_mkwrite() to be called again.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix journal_seq_copy()
Kent Overstreet [Wed, 7 Oct 2020 02:18:21 +0000 (22:18 -0400)]
bcachefs: Fix journal_seq_copy()

We also need to update the journal's bloom filter of inode numbers that
each journal write has updates for - in case the inode gets evicted
before it gets fsynced.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
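
To illustrate the idea of a per-journal-write Bloom filter of inode numbers, here is a small Python model. Everything about it is an assumption made for the example: the filter size, the hash function, the class and function names, and in particular how a lookup would use the filters. The commit itself only says that such a filter exists and must be kept up to date.

```python
import hashlib


class InodeBloom:
    """Tiny Bloom filter over inode numbers (may give false positives, never false negatives)."""

    def __init__(self, nbits=1024, nhashes=2):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0

    def _positions(self, inum):
        for i in range(self.nhashes):
            h = hashlib.blake2b(f"{i}:{inum}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.nbits

    def add(self, inum):
        for p in self._positions(inum):
            self.bits |= 1 << p

    def may_contain(self, inum):
        return all((self.bits >> p) & 1 for p in self._positions(inum))


def last_seq_for_inode(writes, inum):
    """writes: {journal_seq: InodeBloom}. Newest seq that may have updates for inum."""
    hits = [seq for seq, bloom in writes.items() if bloom.may_contain(inum)]
    return max(hits, default=None)
```

The point of the fix is visible in the model: if a sequence number is copied to an inode without also calling `add()` on that journal write's filter, a later lookup for the evicted inode can miss the write it needs to flush.
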
10 months ago bcachefs: Fix unmount path
Kent Overstreet [Tue, 8 Sep 2020 22:30:32 +0000 (18:30 -0400)]
bcachefs: Fix unmount path

There was a long standing race in the mount/unmount code - the VFS
intends for mount/unmount synchronization to be handled by the list of
superblocks, but we were still holding devices open after tearing down
our superblock in the unmount path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Don't fail mount if device has been removed
Kent Overstreet [Mon, 7 Sep 2020 02:58:28 +0000 (22:58 -0400)]
bcachefs: Don't fail mount if device has been removed

Also - make sure to show the devices we actually have open in /proc

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Improvements to the journal read error paths
Kent Overstreet [Mon, 24 Aug 2020 19:58:26 +0000 (15:58 -0400)]
bcachefs: Improvements to the journal read error paths

 - Print out more information in error messages
 - On checksum error, keep the journal entry but mark it bad so that we
   can prefer entries from other devices that don't have bad checksums

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
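
The "keep it but mark it bad" policy can be sketched as a selection over candidate entries. This is a Python model with a hypothetical tuple layout `(seq, device, bad_csum)`; it is not the on-disk format or the kernel's data structure.

```python
def pick_journal_entries(candidates):
    """Pick one entry per sequence number, preferring copies with good checksums.

    candidates: iterable of (seq, device, bad_csum) tuples.
    A copy with a bad checksum is kept only if no good copy of that seq exists.
    """
    best = {}
    for seq, dev, bad in candidates:
        cur = best.get(seq)
        if cur is None or (cur[2] and not bad):
            best[seq] = (seq, dev, bad)
    return [best[s] for s in sorted(best)]
```

For sequence 1 read with a bad checksum from one device and a good checksum from another, the good copy wins; for a sequence that only exists as a bad copy, the entry is still returned, flagged as bad, instead of being dropped.
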
10 months ago bcachefs: Make sure to go rw if lazy in fsck
Kent Overstreet [Mon, 24 Aug 2020 19:16:32 +0000 (15:16 -0400)]
bcachefs: Make sure to go rw if lazy in fsck

The paths where we delete or truncate inodes don't pass commit flags for
BTREE_INSERT_LAZY_RW, so just go rw if necessary in the fsck code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Some project id fixes
Kent Overstreet [Mon, 24 Aug 2020 18:57:48 +0000 (14:57 -0400)]
bcachefs: Some project id fixes

Inode options that are accessible via the xattr interface are stored
with a +1 bias, so that a value of 0 means unset. We weren't handling
this consistently.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
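
The +1 bias is easy to get wrong in one direction only, so a minimal sketch may help. The helper names `opt_store` and `opt_load` are hypothetical; `None` models "unset".

```python
def opt_store(value):
    """Encode an option value for storage; None (unset) is stored as 0."""
    return 0 if value is None else value + 1


def opt_load(stored):
    """Decode a stored option value; 0 means the option is unset."""
    return None if stored == 0 else stored - 1
```

With this encoding, a project id of 0 is stored as 1 and stays distinguishable from an unset project id; applying the bias on only one of the two paths gives off-by-one ids.
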
10 months ago bcachefs: Don't report inodes to statfs
Kent Overstreet [Sun, 16 Aug 2020 02:41:35 +0000 (22:41 -0400)]
bcachefs: Don't report inodes to statfs

We don't have a limit on the number of inodes in a filesystem, so this
is apparently the right way to report that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Add a cond_resched() to bch2_alloc_write()
Kent Overstreet [Wed, 12 Aug 2020 19:08:17 +0000 (15:08 -0400)]
bcachefs: Add a cond_resched() to bch2_alloc_write()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix a couple null ptr derefs when no disk groups exist
Kent Overstreet [Thu, 6 Aug 2020 19:22:24 +0000 (15:22 -0400)]
bcachefs: Fix a couple null ptr derefs when no disk groups exist

Normally successfully parsing a target means disk groups should exist,
but we don't want a BUG() or null ptr deref if we end up with an invalid
target.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix disk groups not being updated when set via sysfs
Kent Overstreet [Wed, 12 Aug 2020 19:00:08 +0000 (15:00 -0400)]
bcachefs: Fix disk groups not being updated when set via sysfs

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Change copygc to consider bucket fragmentation
Kent Overstreet [Wed, 12 Aug 2020 17:49:09 +0000 (13:49 -0400)]
bcachefs: Change copygc to consider bucket fragmentation

When devices have different sized buckets this is more correct.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Don't block on allocations when only writing to specific device
Kent Overstreet [Wed, 12 Aug 2020 17:48:02 +0000 (13:48 -0400)]
bcachefs: Don't block on allocations when only writing to specific device

Since the copygc thread is now global and not per device, we're not
freeing up space on any one device in bounded time - and indeed we never
really were, since rebalance wasn't moving data around between devices
with that objective.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix a bug with the journal_seq_blacklist mechanism
Kent Overstreet [Wed, 5 Aug 2020 03:10:08 +0000 (23:10 -0400)]
bcachefs: Fix a bug with the journal_seq_blacklist mechanism

Previously, we would start doing btree updates before writing the first
journal entry; if this was after an unclean shutdown, this could cause
those btree updates to not be blacklisted.

Also, move some code to headers for userspace debug tools.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix bch2_new_stripes_to_text()
Kent Overstreet [Wed, 5 Aug 2020 03:12:49 +0000 (23:12 -0400)]
bcachefs: Fix bch2_new_stripes_to_text()

painful looking typo, fortunately difficult to hit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Don't disallow btree writes to RO devices
Kent Overstreet [Mon, 3 Aug 2020 17:58:36 +0000 (13:58 -0400)]
bcachefs: Don't disallow btree writes to RO devices

There's an inherent race with setting devices RO when they have dirty
btree nodes on them. We already check if a btree node is on an RO device
before we dirty it, so this patch just allows those writes so that we
don't have errors forcing the entire filesystem read only when trying to
remove a device.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix maximum btree node size
Kent Overstreet [Mon, 3 Aug 2020 17:37:11 +0000 (13:37 -0400)]
bcachefs: Fix maximum btree node size

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Convert various code to printbuf
Kent Overstreet [Sat, 25 Jul 2020 21:06:11 +0000 (17:06 -0400)]
bcachefs: Convert various code to printbuf

printbufs know how big the buffer is that was allocated, so we can get
rid of the random PAGE_SIZEs all over the place.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
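
The benefit of a buffer that knows its own size can be shown with a small Python model. `PrintBuf`, `pr`, and `remaining` are hypothetical names for this sketch; the real printbuf is a C structure whose fields and helpers are not reproduced here.

```python
class PrintBuf:
    """Append-only text buffer that carries its own capacity."""

    def __init__(self, size):
        self.size = size
        self.data = ""

    def remaining(self):
        return self.size - len(self.data)

    def pr(self, fmt, *args):
        """Append formatted text, truncating at the buffer's capacity."""
        text = fmt % args if args else fmt
        self.data += text[: self.remaining()]
        return self
```

Callers no longer pass a length such as PAGE_SIZE alongside the buffer; an 8-byte buffer given "gen 12" and then " overflow" simply ends up holding "gen 12 o".
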
10 months ago bcachefs: Remove some uses of PAGE_SIZE in the btree code
Kent Overstreet [Sat, 25 Jul 2020 19:07:37 +0000 (15:07 -0400)]
bcachefs: Remove some uses of PAGE_SIZE in the btree code

For portability to userspace, we should try to avoid working in kernel
pages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Ensure we wake up threads locking node when reusing it
Kent Overstreet [Sat, 25 Jul 2020 19:37:14 +0000 (15:37 -0400)]
bcachefs: Ensure we wake up threads locking node when reusing it

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix bch2_btree_node_insert_fits()
Kent Overstreet [Sat, 25 Jul 2020 18:19:37 +0000 (14:19 -0400)]
bcachefs: Fix bch2_btree_node_insert_fits()

It should be checking for the recently added flag
btree_node_needs_rewrite.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Ensure we only allocate one EC bucket per writepoint
Kent Overstreet [Thu, 23 Jul 2020 15:31:01 +0000 (11:31 -0400)]
bcachefs: Ensure we only allocate one EC bucket per writepoint

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix a race with BCH_WRITE_SKIP_CLOSURE_PUT
Kent Overstreet [Thu, 23 Jul 2020 02:40:32 +0000 (22:40 -0400)]
bcachefs: Fix a race with BCH_WRITE_SKIP_CLOSURE_PUT

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Don't let copygc buckets be stolen by other threads
Kent Overstreet [Tue, 21 Jul 2020 21:12:39 +0000 (17:12 -0400)]
bcachefs: Don't let copygc buckets be stolen by other threads

And assorted other copygc fixes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Delete unused arguments
Kent Overstreet [Wed, 22 Jul 2020 17:27:00 +0000 (13:27 -0400)]
bcachefs: Delete unused arguments

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix an error path
Kent Overstreet [Wed, 22 Jul 2020 22:26:04 +0000 (18:26 -0400)]
bcachefs: Fix an error path

We were missing a 'goto retry' and continuing on with an error pointer.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Refactor replicas code
Kent Overstreet [Fri, 10 Jul 2020 20:13:52 +0000 (16:13 -0400)]
bcachefs: Refactor replicas code

Awhile back the mechanism for garbage collecting unused replicas entries
was significantly improved, but some cleanup was missed - this patch
does that now.

This is also prep work for a patch to account for erasure coded parity
blocks separately - we need to consolidate the logic for
checking/marking the various replicas entries from one bkey into a
single function.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Don't restrict copygc writes to the same device
Kent Overstreet [Sat, 11 Jul 2020 22:52:14 +0000 (18:52 -0400)]
bcachefs: Don't restrict copygc writes to the same device

This no longer makes any sense, since copygc is now one thread per
filesystem, not per device, with a single write point.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Add bch2_blk_status_to_str()
Kent Overstreet [Tue, 21 Jul 2020 17:34:22 +0000 (13:34 -0400)]
bcachefs: Add bch2_blk_status_to_str()

We define our own BLK_STS_REMOVED, so we need our own to_str helper too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix a faulty assertion
Kent Overstreet [Tue, 21 Jul 2020 15:51:17 +0000 (11:51 -0400)]
bcachefs: Fix a faulty assertion

Now that updates to interior nodes are journalled, we shouldn't be
checking topology of interior nodes until we've finished replaying
updates to that node.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Wrap write path in memalloc_nofs_save()
Kent Overstreet [Mon, 20 Jul 2020 17:00:15 +0000 (13:00 -0400)]
bcachefs: Wrap write path in memalloc_nofs_save()

This fixes a lockdep splat where we're allocating memory with vmalloc in
the compression bounce path, which doesn't always obey GFP_NOFS.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Add an option for rebuilding the replicas section
Kent Overstreet [Mon, 20 Jul 2020 19:51:05 +0000 (15:51 -0400)]
bcachefs: Add an option for rebuilding the replicas section

There is a bug where we can end up clearing the data_has field in the
superblock members section, which causes us to skip reading the journal
and thus journal replay fails. This option tells the recovery path to
not trust those fields.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Make copygc thread global
Kent Overstreet [Sat, 11 Jul 2020 20:28:54 +0000 (16:28 -0400)]
bcachefs: Make copygc thread global

Per device copygc threads don't move data to different devices and they
make fragmentation worse - they don't make much sense anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Drop extra pointers when marking data as in a stripe
Kent Overstreet [Sat, 11 Jul 2020 17:23:17 +0000 (13:23 -0400)]
bcachefs: Drop extra pointers when marking data as in a stripe

We ideally want the buckets used for the extra initial replicas to be
reused right away.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix extent_ptr_durability() calculation for erasure coded data
Kent Overstreet [Sat, 11 Jul 2020 19:35:04 +0000 (15:35 -0400)]
bcachefs: Fix extent_ptr_durability() calculation for erasure coded data

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Use x-macros for data types
Kent Overstreet [Thu, 9 Jul 2020 22:28:11 +0000 (18:28 -0400)]
bcachefs: Use x-macros for data types

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Fix short buffered writes
Kent Overstreet [Thu, 9 Jul 2020 17:54:58 +0000 (13:54 -0400)]
bcachefs: Fix short buffered writes

In the buffered write path, we have to check for short writes that write
to the full page, where the page wasn't UpToDate; when this happens, the
page is partly garbage, so we have to zero it out and revert that part
of the write.

This check was wrong - we reverted total from copied, but didn't revert
the iov_iter, probably also leading to corrupted writes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months ago bcachefs: Allow existing stripes to be updated with new data buckets
Kent Overstreet [Tue, 30 Jun 2020 18:44:19 +0000 (14:44 -0400)]
bcachefs: Allow existing stripes to be updated with new data buckets

This solves internal fragmentation within stripes. We already have
copygc, which evacuates buckets that are partially or mostly empty, but
it's up to the ec code that manages stripes to deal with stripes that
have empty buckets in them.

This patch changes the path for creating new stripes to check if there's
existing stripes with empty buckets - and if so, update them with new
data buckets instead of creating new stripes.

TODO: improve the disk space accounting so that we only use this (more
expensive) path when we have too much fragmentation in existing
stripes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
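
The "reuse before create" policy can be sketched as below. This is a Python model with hypothetical helpers (`get_stripe`, `add_bucket`) and a plain dict per stripe, where `None` marks an empty data bucket slot; parity handling is left out entirely.

```python
def get_stripe(stripes, nr_data):
    """Return (stripe, reused).

    Prefer an existing stripe that has an empty data bucket slot;
    only create a new stripe when none is available.
    """
    for s in stripes:
        if any(b is None for b in s["data"]):
            return s, True
    s = {"data": [None] * nr_data}
    stripes.append(s)
    return s, False


def add_bucket(stripe, bucket):
    """Place a new data bucket into the first empty slot of the stripe."""
    i = stripe["data"].index(None)
    stripe["data"][i] = bucket
```

In the real filesystem, filling an empty slot in an existing stripe also means the parity blocks have to be brought up to date, which is why this is the more expensive path.
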
10 months ago bcachefs: Refactor stripe creation
Kent Overstreet [Tue, 7 Jul 2020 02:33:54 +0000 (22:33 -0400)]
bcachefs: Refactor stripe creation

Prep work for the patch to update existing stripes with new data blocks.
This moves allocating new stripes into ec.c, and also sets up the data
structures so that we can handle only allocating some of the blocks in a
stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Move stripe creation to workqueue
Kent Overstreet [Tue, 7 Jul 2020 00:59:46 +0000 (20:59 -0400)]
bcachefs: Move stripe creation to workqueue

This is mainly to solve a lock ordering issue, and also simplifies the
code a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Improve stripe triggers/heap code
Kent Overstreet [Tue, 7 Jul 2020 00:18:13 +0000 (20:18 -0400)]
bcachefs: Improve stripe triggers/heap code

Soon we'll be able to modify existing stripes - replacing empty blocks
with new blocks and new p/q blocks. This patch updates the trigger code
to handle pointers changing in an existing stripe; also, it
significantly improves how the stripes heap works, which means we can
get rid of the stripe creation/deletion lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Rework triggers interface
Kent Overstreet [Mon, 6 Jul 2020 23:16:25 +0000 (19:16 -0400)]
bcachefs: Rework triggers interface

The trigger for stripe keys is shortly going to need both the old and
the new key passed to the trigger - this patch does that rework.

For now, this just changes the in memory triggers, and this doesn't
change how extent triggers work.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
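To illustrate why a trigger might want both keys, here is a small sketch under stated assumptions. The struct and function names are hypothetical and the accounting is deliberately simplified; it only shows that a delta can be computed in one call when the old and the new key are both available.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stripe key: just enough to compute a size. */
struct stripe_key_model {
	unsigned	nr_blocks;
	unsigned	sectors_per_block;
};

static long stripe_sectors(const struct stripe_key_model *k)
{
	return k ? (long) k->nr_blocks * k->sectors_per_block : 0;
}

/*
 * Trigger taking both the old and the new key. Either may be NULL:
 * old == NULL models an insert, new == NULL models a delete.
 * Returns the change in accounted sectors.
 */
static long stripe_trigger_model(const struct stripe_key_model *old_k,
				 const struct stripe_key_model *new_k)
{
	assert(old_k || new_k);

	return stripe_sectors(new_k) - stripe_sectors(old_k);
}
```

With only the new key, the same update would need a separate "overwrite" call for the old key, and the two calls could not see how one key changed into the other.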
10 months agobcachefs: Kill BTREE_TRIGGER_NOOVERWRITES
Kent Overstreet [Mon, 6 Jul 2020 21:02:37 +0000 (17:02 -0400)]
bcachefs: Kill BTREE_TRIGGER_NOOVERWRITES

This is prep work for reworking the triggers machinery - we have
triggers that need to know both the old and the new key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Mark btree nodes as needing rewrite when not all replicas are RW
Kent Overstreet [Fri, 3 Jul 2020 20:32:00 +0000 (16:32 -0400)]
bcachefs: Mark btree nodes as needing rewrite when not all replicas are RW

This fixes a bug where recovery fails when one of the devices is read
only.

Also - consolidate the "must rewrite this node to insert it" behind a
new btree node flag.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Use blk_status_to_str()
Kent Overstreet [Thu, 2 Jul 2020 17:43:58 +0000 (13:43 -0400)]
bcachefs: Use blk_status_to_str()

Improved error messages are always a good thing

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Don't cap ios in dio write path at 2 MB
Kent Overstreet [Tue, 30 Jun 2020 14:12:45 +0000 (10:12 -0400)]
bcachefs: Don't cap ios in dio write path at 2 MB

It appears this was erroneous; a different bug was responsible.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Refactor dio write code to reinit bch_write_op
Kent Overstreet [Mon, 29 Jun 2020 22:22:06 +0000 (18:22 -0400)]
bcachefs: Refactor dio write code to reinit bch_write_op

This fixes a bug where BCH_WRITE_SKIP_CLOSURE_PUT was set incorrectly,
causing the completion to be delivered multiple times. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Fix bch2_extent_can_insert() not being called
Kent Overstreet [Sun, 28 Jun 2020 22:11:12 +0000 (18:11 -0400)]
bcachefs: Fix bch2_extent_can_insert() not being called

It's supposed to check whether we're splitting a compressed extent and
if so get a bigger disk reservation - hence this fixes a "disk usage
increased by x without a reservation" bug.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Fix a null ptr deref in bch2_btree_iter_traverse_one()
Kent Overstreet [Fri, 26 Jun 2020 17:56:21 +0000 (13:56 -0400)]
bcachefs: Fix a null ptr deref in bch2_btree_iter_traverse_one()

We use sentinel values that aren't NULL to indicate there's a btree node
at a higher level; occasionally, this may result in
btree_iter_up_until_good_node() stopping at one of those sentinel
values.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
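The hazard with non-NULL sentinels can be shown with a small sketch. Everything below is a hypothetical model (the sentinel names, SENTINEL_MAX, iter_model and the helper functions are invented for illustration); it is not the bcachefs iterator code. It assumes real objects never live at addresses below SENTINEL_MAX, which holds on common platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node_model {
	int		level;
};

/* Hypothetical sentinels: small non-NULL values that are never valid nodes. */
#define NO_NODE_UP	((struct node_model *) 1)
#define NO_NODE_DOWN	((struct node_model *) 2)
#define SENTINEL_MAX	16

#define MAX_DEPTH	4

struct iter_model {
	struct node_model	*l[MAX_DEPTH];
};

/* A plain "!= NULL" test would wrongly accept the sentinels. */
static int is_real_node(const struct node_model *n)
{
	return (uintptr_t) n >= SENTINEL_MAX;
}

/*
 * Walk up from @level until a real node is found. Returns that level, or
 * MAX_DEPTH if none exists, so callers never dereference a sentinel.
 */
static unsigned up_until_real_node(const struct iter_model *it, unsigned level)
{
	assert(it);

	while (level < MAX_DEPTH && !is_real_node(it->l[level]))
		level++;

	return level;
}
```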
10 months agobcachefs: Track sectors of erasure coded data
Kent Overstreet [Fri, 19 Jun 2020 01:06:42 +0000 (21:06 -0400)]
bcachefs: Track sectors of erasure coded data

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Use btree reserve when appropriate
Kent Overstreet [Thu, 18 Jun 2020 21:16:29 +0000 (17:16 -0400)]
bcachefs: Use btree reserve when appropriate

Whenever we're doing an update that has pointers, that generally means
we need to do the update in order to release open bucket references - so
we should be using the btree open bucket reserve.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Add a kthread_should_stop() check to allocator thread
Kent Overstreet [Wed, 17 Jun 2020 22:20:26 +0000 (18:20 -0400)]
bcachefs: Add a kthread_should_stop() check to allocator thread

Turns out it's possible during shutdown for the allocator to get stuck
spinning on bch2_invalidate_buckets() without hitting any of the other
checks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
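The shape of the fix described above can be modelled in userspace. This is a sketch under assumptions: stop_requested stands in for what kthread_should_stop() reports in the kernel, and try_invalidate_model() is a hypothetical step that never succeeds, which is the situation the commit describes during shutdown. It requests a stop on its third call only to make the example deterministic in a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the condition kthread_should_stop() would report. */
static atomic_bool stop_requested;
static unsigned invalidate_calls;

/*
 * Hypothetical work step that never makes progress. On the third call it
 * also simulates a concurrent shutdown request.
 */
static bool try_invalidate_model(void)
{
	if (++invalidate_calls == 3)
		atomic_store(&stop_requested, true);
	return false;
}

/* Returns 0 when the work completed, -1 when asked to stop. */
static int invalidate_loop_model(void)
{
	while (!try_invalidate_model()) {
		/* Without this check the loop would spin forever. */
		if (atomic_load(&stop_requested))
			return -1;

		/* Guard for the model itself: it must terminate quickly. */
		assert(invalidate_calls < 1000);
	}

	return 0;
}
```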
10 months agobcachefs: Change bch2_dump_bset() to also print key values
Kent Overstreet [Wed, 17 Jun 2020 21:33:53 +0000 (17:33 -0400)]
bcachefs: Change bch2_dump_bset() to also print key values

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Fix a deadlock in the RO path
Kent Overstreet [Wed, 17 Jun 2020 21:30:38 +0000 (17:30 -0400)]
bcachefs: Fix a deadlock in the RO path

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Fix incorrect gfp check
Kent Overstreet [Tue, 16 Jun 2020 00:18:02 +0000 (20:18 -0400)]
bcachefs: Fix incorrect gfp check

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Fix lock ordering with new btree cache code
Kent Overstreet [Mon, 15 Jun 2020 23:53:46 +0000 (19:53 -0400)]
bcachefs: Fix lock ordering with new btree cache code

The code that checks lock ordering was recently changed to go off of the
pos of the btree node, rather than the iterator, but the btree cache
code wasn't updated to handle iterators that point to cached bkeys. Oops.

Also, update various debug code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: delete a slightly faulty assertion
Kent Overstreet [Mon, 15 Jun 2020 21:59:09 +0000 (17:59 -0400)]
bcachefs: delete a slightly faulty assertion

state lock isn't held at startup

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Increase size of btree node reserve
Kent Overstreet [Mon, 15 Jun 2020 21:38:26 +0000 (17:38 -0400)]
bcachefs: Increase size of btree node reserve

Also tweak the allocator to be more aggressive about keeping it full.
The recent changes to make updates to interior nodes transactional (and
thus generate updates to the alloc btree) all put more stress on the
btree node reserves.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Give bkey_cached_key same attributes as bpos
Kent Overstreet [Mon, 15 Jun 2020 20:59:36 +0000 (16:59 -0400)]
bcachefs: Give bkey_cached_key same attributes as bpos

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Use cached iterators for alloc btree
Kent Overstreet [Sat, 5 Oct 2019 16:54:53 +0000 (12:54 -0400)]
bcachefs: Use cached iterators for alloc btree

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Btree key cache
Kent Overstreet [Fri, 8 Mar 2019 00:46:10 +0000 (19:46 -0500)]
bcachefs: Btree key cache

This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.

Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.

These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
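As a rough model of the write-cache behaviour described in the commit above: an update is journalled and applied to the cached entry, and the backing btree is only touched when the entry is flushed. All types and functions below are hypothetical, and the direct-mapped table with flush-on-collision is a simplification chosen for brevity, not the real hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS	8
#define BTREE_SLOTS	64

struct cached_key_model {
	bool		valid, dirty;
	uint64_t	pos;
	int		val;
};

struct fs_model {
	struct cached_key_model	cache[CACHE_SLOTS];
	int			btree[BTREE_SLOTS];	/* pos -> value */
	unsigned		journal_entries;
};

/* Write a dirty cached entry back to the "btree". */
static void flush_slot(struct fs_model *fs, struct cached_key_model *ck)
{
	if (ck->valid && ck->dirty) {
		fs->btree[ck->pos] = ck->val;
		ck->dirty = false;
	}
}

/* Look up @pos, filling the slot from the btree on a miss. */
static struct cached_key_model *cache_get(struct fs_model *fs, uint64_t pos)
{
	struct cached_key_model *ck = &fs->cache[pos % CACHE_SLOTS];

	assert(pos < BTREE_SLOTS);

	if (!ck->valid || ck->pos != pos) {
		flush_slot(fs, ck);	/* evict whatever was there */
		ck->valid = true;
		ck->dirty = false;
		ck->pos	  = pos;
		ck->val	  = fs->btree[pos];
	}
	return ck;
}

/* Update path: journal the change, update the cache, defer the btree. */
static void cached_update(struct fs_model *fs, uint64_t pos, int val)
{
	struct cached_key_model *ck = cache_get(fs, pos);

	fs->journal_entries++;
	ck->val	  = val;
	ck->dirty = true;
}

/* Models journal reclaim flushing dirty cached entries. */
static void journal_reclaim_model(struct fs_model *fs)
{
	unsigned i;

	for (i = 0; i < CACHE_SLOTS; i++)
		flush_slot(fs, &fs->cache[i]);
}
```

Note that a reader going straight to the btree between the update and the flush would see the stale value, which is the coherency caveat the commit message mentions.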
10 months agobcachefs: Implement a new gc that only recalcs oldest gen
Kent Overstreet [Mon, 15 Jun 2020 19:10:54 +0000 (15:10 -0400)]
bcachefs: Implement a new gc that only recalcs oldest gen

Full mark and sweep gc doesn't (yet?) work with the new btree key cache
code, but it also blocks updates to interior btree nodes for the
duration and isn't really necessary in practice; we aren't currently
attempting to repair errors in allocation info at runtime.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Turn c->state_lock into an rwsem
Kent Overstreet [Mon, 15 Jun 2020 18:58:47 +0000 (14:58 -0400)]
bcachefs: Turn c->state_lock into an rwsem

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Add an internal option for reading entire journal
Kent Overstreet [Sat, 13 Jun 2020 22:43:14 +0000 (18:43 -0400)]
bcachefs: Add an internal option for reading entire journal

To be used by the debug tool that dumps the contents of the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 months agobcachefs: Don't deadlock when btree node reuse changes lock ordering
Kent Overstreet [Sat, 13 Jun 2020 02:29:48 +0000 (22:29 -0400)]
bcachefs: Don't deadlock when btree node reuse changes lock ordering

Btree node lock ordering is based on the logical key. However, 'struct
btree' may be reused for a different btree node under memory pressure.
This patch uses the new six lock callback to check if a btree node is no
longer the node we wanted to lock before blocking.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
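The idea of consulting a callback before blocking can be sketched as below. This is a single-threaded model with invented names (node_model, lock_model, node_was_reused); it does not reproduce the six lock API. A return value of 1 stands for "a real lock would sleep here", which the model cannot actually do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node: the same struct may be reused for a different key. */
struct node_model {
	uint64_t	key;
	bool		locked;
};

/* Called only when the lock is contended; nonzero means "do not block". */
typedef int (*should_sleep_fn)(struct node_model *, void *);

/*
 * Returns 0 if the lock was taken, -1 if the callback vetoed blocking,
 * 1 if the caller would have to block (a real lock would sleep here).
 */
static int lock_model(struct node_model *n, should_sleep_fn fn, void *arg)
{
	assert(n);

	if (!n->locked) {
		n->locked = true;
		return 0;
	}

	if (fn && fn(n, arg))
		return -1;

	return 1;
}

/* Veto blocking if the node no longer holds the key we wanted. */
static int node_was_reused(struct node_model *n, void *arg)
{
	const uint64_t *wanted_key = arg;

	return n->key != *wanted_key;
}
```

On a veto the caller would presumably restart its traversal and look the node up again, rather than sleep on a lock whose place in the ordering has changed.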
10 months agobcachefs: Fix a deadlock
Kent Overstreet [Fri, 12 Jun 2020 18:58:07 +0000 (14:58 -0400)]
bcachefs: Fix a deadlock

__bch2_btree_node_lock() was incorrectly using iter->pos as a proxy for
btree node lock ordering; this caused an off-by-one error that was
triggered by bch2_btree_node_get_sibling() getting the previous node.

This refactors the code to compare against btree node keys directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>