Kent Overstreet [Sat, 20 Jul 2024 18:37:24 +0000 (14:37 -0400)]
bcachefs: More informative error message in reattach_inode()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 17 Jul 2024 15:56:05 +0000 (11:56 -0400)]
bcachefs: kill btree_trans_too_many_iters() in bch2_bucket_alloc_freelist()
When we're called via
trans commit -> btree split -> allocator
we may already have arbitrarily many btree_paths for the transaction
commit we're trying to do; when this happens, the
btree_trans_too_many_iters() call causes us to livelock.
Since the allocator calls btree_iter_dontneed to release paths as it
iterates, this shouldn't cause any problems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Tavian Barnes [Fri, 21 Jun 2024 20:38:44 +0000 (16:38 -0400)]
bcachefs: mean_and_variance: Avoid too-large shift amounts
Shifting a value by the width of its type or more is undefined.
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
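To illustrate the class of bug fixed in the mean_and_variance commit above, here is a minimal sketch of shift helpers that stay defined for any shift count. The names shr_u64_safe() and shl_u64_safe() are hypothetical and are not taken from the patch; the actual fix may be structured differently.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not the kernel's code.
 * In C, shifting a 64-bit value by 64 or more is undefined, so the
 * out-of-range case is handled explicitly: every bit is shifted out
 * and the result is 0.
 */
static inline uint64_t shr_u64_safe(uint64_t v, unsigned shift)
{
	return shift < 64 ? v >> shift : 0;
}

static inline uint64_t shl_u64_safe(uint64_t v, unsigned shift)
{
	return shift < 64 ? v << shift : 0;
}
```

Callers that compute the shift amount at runtime (for example from a bit width) can then use the helpers without having to special-case a full-width shift.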
Kent Overstreet [Thu, 18 Jul 2024 21:17:10 +0000 (17:17 -0400)]
lockdep: Add comments for lockdep_set_no{validate,track}_class()
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 17 Jul 2024 00:20:21 +0000 (20:20 -0400)]
bcachefs: Fix integer overflow on trans->nr_updates
We can't have more updates than paths, so btree_path_idx_t is the
correct type to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 16 Jul 2024 20:43:59 +0000 (16:43 -0400)]
bcachefs: silence silly kdoc warning
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 15 Jul 2024 23:03:17 +0000 (19:03 -0400)]
bcachefs: Fix fsck warning about btree_trans not passed to fsck error
If a btree_trans is in use it's supposed to be passed to fsck_err so
that it can be unlocked if we're waiting on userspace input; but the
btree IO paths do call fsck errors where a btree_trans exists on the
stack but it's not passed through.
But it's ok, because it's unlocked while doing IO.
Fixes: a850bde6498b ("bcachefs: fsck_err() may now take a btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 15 Jul 2024 20:30:44 +0000 (16:30 -0400)]
bcachefs: Add an error message for insufficient rw journal devs
This causes us to go read-only - need an error message saying why.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Tavian Barnes [Fri, 21 Jun 2024 20:39:58 +0000 (16:39 -0400)]
bcachefs: varint: Avoid left-shift of a negative value
Shifting a negative value left is undefined.
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
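A minimal sketch of the usual way to avoid the undefined behaviour described in the varint commit above: do the shift in the unsigned domain and convert back. The helper name shl_s64() is hypothetical and this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: multiply a signed value by 2^shift without ever
 * left-shifting a negative number.
 *
 * The shift is done on uint64_t, where it is well defined and wraps
 * modulo 2^64. Converting an out-of-range result back to int64_t is
 * implementation-defined before C23 (not undefined); GCC and Clang
 * wrap as two's complement.
 */
static inline int64_t shl_s64(int64_t v, unsigned shift)
{
	assert(shift < 64);	/* a full-width shift would still be undefined */
	return (int64_t)((uint64_t)v << shift);
}
```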
Tavian Barnes [Fri, 21 Jun 2024 20:29:32 +0000 (16:29 -0400)]
bcachefs: darray: Don't pass NULL to memcpy()
memcpy's second parameter must not be NULL, even if size is zero.
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
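The rule from the darray commit above can be sketched as a small wrapper that never hands a NULL source to memcpy(). memcpy_nullsafe() is a hypothetical name used only for illustration; the patch itself may simply guard the call site.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper: an empty array may legitimately have a NULL
 * data pointer, so skip the call entirely when there is nothing to
 * copy instead of passing NULL to memcpy().
 */
static inline void *memcpy_nullsafe(void *dst, const void *src, size_t n)
{
	assert(n == 0 || (dst && src));	/* non-empty copies need real buffers */
	if (n)
		memcpy(dst, src, n);
	return dst;
}
```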
Kent Overstreet [Sun, 14 Jul 2024 23:51:01 +0000 (19:51 -0400)]
bcachefs: Kill bch2_assert_btree_nodes_not_locked()
We no longer track individual btree node locks with lockdep, so this
will never be enabled.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 28 Aug 2023 20:13:18 +0000 (16:13 -0400)]
bcachefs: Rename BCH_WRITE_DONE -> BCH_WRITE_SUBMITTED
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 14 Jul 2024 20:32:11 +0000 (16:32 -0400)]
bcachefs: __bch2_read(): call trans_begin() on every loop iter
Perusal of /sys/kernel/debug/bcachefs/*/btree_transaction_stats shows
that the read path has been accumulating unneeded paths on the reflink
btree, which we don't want.
The solution is to call bch2_trans_begin(), which drops paths not used
on the previous loop iteration.
bch2_readahead:
Max mem used: 0
Transaction duration:
count: 194235
since mount recent
duration of events
min: 150 ns
max: 9 ms
total: 838 ms
mean: 4 us 6 us
stddev: 34 us 7 us
time between events
min: 10 ns
max: 15 h
mean: 2 s 12 s
stddev: 2 s 3 ms
Maximum allocated btree paths (193):
path: idx 2 ref 0:0 P btree=extents l=0 pos 270943112:392:U32_MAX locks 0
path: idx 3 ref 1:0 S btree=extents l=0 pos 270943112:24578:U32_MAX locks 1
path: idx 4 ref 0:0 P btree=reflink l=0 pos 0:24773509:0 locks 0
path: idx 5 ref 0:0 P S btree=reflink l=0 pos 0:24773631:0 locks 1
path: idx 6 ref 0:0 P S btree=reflink l=0 pos 0:24773759:0 locks 1
path: idx 7 ref 0:0 P S btree=reflink l=0 pos 0:24773887:0 locks 1
path: idx 8 ref 0:0 P S btree=reflink l=0 pos 0:24774015:0 locks 1
path: idx 9 ref 0:0 P S btree=reflink l=0 pos 0:24774143:0 locks 1
path: idx 10 ref 0:0 P S btree=reflink l=0 pos 0:24774271:0 locks 1
<many more reflink paths>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Fri, 12 Jul 2024 07:09:25 +0000 (15:09 +0800)]
bcachefs: show none if label is not set
If the label is not set, the Label tag in the superblock info shows '(none)'.
```
[Before]
Device index: 0
Label:
Version: 1.4: member_seq
[After]
Device index: 0
Label: (none)
Version: 1.4: member_seq
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 12 Jul 2024 18:35:46 +0000 (14:35 -0400)]
bcachefs: drop packed, aligned from bkey_inode_buf
Unnecessary here, and this broke the rust bindings:
error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29025:1
|
29025 | pub struct bkey_i_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: `bch_inode_v3` has a `#[repr(align)]` attribute
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
|
8949 | pub struct bch_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^
error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32826:1
|
32826 | pub struct bkey_inode_buf {
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: `bch_inode_v3` has a `#[repr(align)]` attribute
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
|
8949 | pub struct bch_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: `bkey_inode_buf` contains a field of type `bkey_i_inode_v3`
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32827:9
|
32827 | pub inode: bkey_i_inode_v3,
| ^^^^^
note: ...which contains a field of type `bch_inode_v3`
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29027:9
|
29027 | pub v: bch_inode_v3,
| ^
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 12 Jul 2024 18:16:01 +0000 (14:16 -0400)]
bcachefs: btree node scan: fall back to comparing by journal seq
Highly damaged filesystems, or filesystems that have been damaged,
repaired, and damaged again, may have sequence numbers we can't fully
trust - which in itself is something we need to debug.
Add a journal_seq fallback so that repair doesn't get stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 21 Dec 2023 23:54:09 +0000 (18:54 -0500)]
bcachefs: Add lockdep support for btree node locks
This adds lockdep tracking for held btree locks with a single dep_map in
btree_trans, i.e. tracking all held btree locks as one object.
This is more practical and more useful than having lockdep track held
btree locks individually, because
- we can take more locks than lockdep can track (unbounded, now that we
have dynamically resizable btree paths)
- there's no lock ordering between btree locks for lockdep to track (we
do cycle detection)
- and this makes it easy to teach lockdep that btree locks are not safe
to hold while invoking memory reclaim.
The last rule is one that lockdep would never learn, because we only do
trylock() from within shrinkers - but we very much do not want to be
invoking memory reclaim while holding btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 22 Dec 2023 01:34:17 +0000 (20:34 -0500)]
lockdep: lockdep_set_notrack_class()
Add a new helper to disable lockdep tracking entirely for a given class.
This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more sense given that we have
centralized lock management and a cycle detector.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Jun 2024 20:04:40 +0000 (16:04 -0400)]
bcachefs: Improve copygc_wait_to_text()
printing the raw values can occasionally be very useful
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Jun 2024 22:08:20 +0000 (18:08 -0400)]
bcachefs: Convert clock code to u64s
Eliminate possible integer truncation bugs on 32 bit
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 29 Jun 2024 15:43:23 +0000 (11:43 -0400)]
bcachefs: Improve startup message
We're not always mounting when we start the filesystem
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Jun 2024 17:28:30 +0000 (13:28 -0400)]
bcachefs: Self healing on read IO error
This repurposes the promote path, which already knows how to call
data_update() after a read: we now automatically rewrite bad data when
we get a read error and then successfully retry from a different
replica.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Jun 2024 22:10:47 +0000 (18:10 -0400)]
bcachefs: Make read_only a mount option again, but hidden
fsck passes read_only as a mount option, and it's required for
nochanges, which it also uses.
Usually read_only is handled by the VFS, but we need to be able to
handle it too; we just don't want to print it out twice, so mark it as a
hidden option.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Jun 2024 20:25:39 +0000 (16:25 -0400)]
bcachefs: bch2_extent_crc_unpacked_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Jun 2024 17:51:38 +0000 (13:51 -0400)]
bcachefs: Ratelimit checksum error messages
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 28 Jun 2024 17:36:00 +0000 (13:36 -0400)]
bcachefs: spelling fix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 21:49:11 +0000 (17:49 -0400)]
bcachefs: Simplify btree key cache fill path
Don't allocate the new bkey_cached until after we've done the btree
lookup; this means we can kill bkey_cached.valid.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 23 Jun 2024 06:13:44 +0000 (02:13 -0400)]
bcachefs: Improve "unable to allocate journal write" message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 23 Jun 2024 22:48:22 +0000 (18:48 -0400)]
bcachefs: Fix missing BTREE_TRIGGER_bucket_invalidate flag
This fixes an accounting mismatch for cached data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 10 Sep 2023 21:29:39 +0000 (17:29 -0400)]
bcachefs: Ensure buffered writes write as much as they can
This adds a new helper, bch2_folio_reservation_get_partial(), which
reserves as many blocks as possible and may return partial success.
__bch2_buffered_write() is switched to the new helper - this fixes
fstests generic/275, the write until -ENOSPC test.
generic/230 now fails: this appears to be a test bug, where xfs_io isn't
looping after a partial write to get the error code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
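The generic/230 note above comes down to how userspace sees a partial write: the short write() succeeds, and the error (for example ENOSPC) is only reported by the next call. A minimal userspace sketch of a loop that keeps going after a short write, assuming plain POSIX write(); write_fully() is a hypothetical helper, not part of xfs_io or fstests.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write until everything is written or a write()
 * fails. Returns the number of bytes actually written; when that is
 * short because of an error, errno is the value set by the failing
 * write() call.
 */
static ssize_t write_fully(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;

	assert(buf != NULL || len == 0);
	errno = 0;
	while (done < len) {
		ssize_t ret = write(fd, p + done, len - done);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before writing: retry */
			break;			/* real error, e.g. ENOSPC */
		}
		if (ret == 0)
			break;			/* no progress: give up rather than spin */
		done += (size_t)ret;		/* partial success: loop for the rest */
	}
	return (ssize_t)done;
}
```

A caller that stops after the first short write never issues the follow-up call, so it never observes the error code.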
Hongbo Li [Thu, 20 Jun 2024 13:21:12 +0000 (21:21 +0800)]
bcachefs: support STATX_DIOALIGN for statx file
Add support for STATX_DIOALIGN to bcachefs, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.
[Before]
```
./statx_test /mnt/bcachefs/test
statx(/mnt/bcachefs/test) = 0
dio mem align:0
dio offset align:0
```
[After]
```
./statx_test /mnt/bcachefs/test
statx(/mnt/bcachefs/test) = 0
dio mem align:1
dio offset align:512
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 19 Jun 2024 13:00:11 +0000 (09:00 -0400)]
bcachefs: split out lru_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 19:20:53 +0000 (15:20 -0400)]
bcachefs: bch2_btree_key_cache_drop() now evicts
As part of improving btree key cache coherency, the bkey_cached.valid
flag is going away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pankaj Raghav [Fri, 14 Jun 2024 10:50:31 +0000 (10:50 +0000)]
bcachefs: set fgf order hint before starting a buffered write
Set the preferred folio order in the fgp_flags by calling
fgf_set_order(). The page cache will try to allocate a large folio of
the preferred order whenever possible instead of allocating multiple
0 order folios.
This improves the buffered write performance up to 1.25x with default
mount options and up to 1.57x when mounted with no_data_io option with
the following fio workload:
fio --name=bcachefs --filename=/mnt/test --size=100G \
--ioengine=io_uring --iodepth=16 --rw=write --bs=128k
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pankaj Raghav [Fri, 14 Jun 2024 10:50:30 +0000 (10:50 +0000)]
bcachefs: use FGP_WRITEBEGIN instead of combining individual flags
Use FGP_WRITEBEGIN to avoid repeating the individual FGP flags before
starting a buffered write.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jun 2024 21:07:36 +0000 (17:07 -0400)]
bcachefs: Reduce the scope of gc_lock
gc_lock is now only for synchronization between check_alloc_info and
interior btree updates - nothing else
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 13 Jun 2024 18:11:48 +0000 (14:11 -0400)]
bcachefs: per_cpu_sum()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Mon, 10 Jun 2024 12:26:39 +0000 (08:26 -0400)]
MAINTAINERS: remove Brian Foster as a reviewer for bcachefs
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 20:46:58 +0000 (16:46 -0400)]
bcachefs: kill key cache arg to bch2_assert_pos_locked()
this is an internal implementation detail - and we're improving key
cache coherency
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 19:24:14 +0000 (15:24 -0400)]
bcachefs: btree_path_cached_set()
new helper - small refactoring
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 19:25:12 +0000 (15:25 -0400)]
bcachefs: btree_node_unlock() assert
we have a separate helper for releasing write locks
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 00:53:02 +0000 (20:53 -0400)]
bcachefs: bch2_gc_pos_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 7 Jun 2024 22:19:39 +0000 (18:19 -0400)]
bcachefs: bch2_btree_id_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 8 Jun 2024 00:51:57 +0000 (20:51 -0400)]
bcachefs: Kill gc_pos_btree_node()
gc_pos is now based on keys, not nodes, for invariance w.r.t. splits
and merges
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 6 Jun 2024 18:33:27 +0000 (14:33 -0400)]
bcachefs: Fix bch2_gc_accounting_done() locking
The transaction commit path takes mark_lock, so we shouldn't be holding
it; use a bpos as an iterator so that we can drop and retake.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 6 Jun 2024 17:48:54 +0000 (13:48 -0400)]
bcachefs: bch2_accounting_mem_gc()
Add a new helper to free zeroed out accounting entries, and use it in
bch2_replicas_gc2(); bch2_replicas_gc2() was killing superblock replicas
entries if their corresponding accounting counters were nonzero, but
that's incorrect - the superblock replicas entry needs to exist if the
accounting entry exists, not if it's nonzero, because we check and
create the replicas entry when creating the new accounting entry - we
don't know when it's becoming nonzero.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 6 Jun 2024 17:25:28 +0000 (13:25 -0400)]
bcachefs: Refactor disk accounting data structures
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replicas entries for the percpu allocator's max
practical size.
Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian Foster [Thu, 6 Jun 2024 13:58:26 +0000 (09:58 -0400)]
bcachefs: fix smatch data leak warning in fs usage ioctl
smatch warns that the copy of arg to userspace is a potential data
leak by virtue of arg.pad not being checked or zeroed. This was
introduced by the commit referenced below that switched arg from
being a zeroed runtime allocation to living on the stack. Fix by
simply zero initializing the structure.
Fixes: cde738a61e65 ("bcachefs: Convert bch2_ioctl_fs_usage() to new accounting")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
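A minimal sketch of the pattern behind the smatch fix above: an on-stack structure that is copied out wholesale must be zeroed first, or reserved fields leak stack contents. struct usage_arg and fill_usage() are hypothetical stand-ins, not the real ioctl structure, and memcpy() stands in for copy_to_user(). The sketch uses memset(), which also clears any implicit padding; the commit only says the structure is zero initialized.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ioctl argument, not the real bcachefs structure. */
struct usage_arg {
	uint64_t capacity;
	uint32_t nr_entries;
	uint32_t pad;		/* reserved: must not carry stack garbage */
};

static void fill_usage(struct usage_arg *out)
{
	struct usage_arg arg;

	assert(out != NULL);
	/* Zero the whole object, including pad and any implicit padding. */
	memset(&arg, 0, sizeof(arg));
	arg.capacity = 1024;
	arg.nr_entries = 3;

	/* Stand-in for copy_to_user(): every byte of arg is copied out. */
	memcpy(out, &arg, sizeof(arg));
}
```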
Kent Overstreet [Wed, 5 Jun 2024 16:35:48 +0000 (12:35 -0400)]
bcachefs: Fix race in bch2_accounting_mem_insert()
bch2_accounting_mem_insert() drops and retakes mark_lock; thus, we need
to check if the entry in question has already been inserted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ariel Miculas [Mon, 3 Jun 2024 20:47:31 +0000 (23:47 +0300)]
bcachefs: bch2_btree_insert() - add btree iter flags
The commit 65bd44239727 ("bcachefs: bch2_btree_insert_trans() no longer
specifies BTREE_ITER_cached") removes BTREE_ITER_cached from
bch2_btree_insert_trans, which causes the update_inode function from
bcachefs-tools to take a long time (~20s). Add an iter_flags parameter
to bch2_btree_insert, so the users can specify iter update trigger
flags, such as BTREE_ITER_cached.
Signed-off-by: Ariel Miculas <ariel.miculas@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 1 Mar 2024 23:43:39 +0000 (18:43 -0500)]
bcachefs: BCH_IOCTL_QUERY_ACCOUNTING
Add a new ioctl that can return the new accounting counter types; it
takes as input a bitmask of accounting types to return.
This will be used for returning e.g. compression accounting and
rebalance_work accounting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reed Riley [Sat, 11 May 2024 00:20:12 +0000 (00:20 +0000)]
bcachefs: support REMAP_FILE_DEDUP in bch2_remap_file_range
By removing the early-exit when REMAP_FILE_DEDUP is set, we should be
able to support the fideduperange ioctl, albeit less efficiently than if
we handled some of the extent locking and comparison logic inside
bcachefs. Extent comparison logic already exists inside of
`__generic_remap_file_range_prep`.
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Mon, 3 Jun 2024 13:26:20 +0000 (21:26 +0800)]
bcachefs: support FS_IOC_SETFSLABEL
Implement support for FS_IOC_SETFSLABEL ioctl to set filesystem
label.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hongbo Li [Mon, 3 Jun 2024 13:26:19 +0000 (21:26 +0800)]
bcachefs: support get fs label
Implement support for FS_IOC_GETFSLABEL ioctl to read filesystem
label.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
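From userspace, the label ioctls added in the two commits above are reached through the generic definitions in <linux/fs.h>. A minimal sketch of reading a label, assuming a Linux system whose headers provide FS_IOC_GETFSLABEL and FSLABEL_MAX; get_fs_label() is a hypothetical wrapper name.

```c
#include <linux/fs.h>	/* FS_IOC_GETFSLABEL, FSLABEL_MAX */
#include <sys/ioctl.h>

/*
 * Hypothetical wrapper: read the label of the filesystem that fd lives
 * on into label, which must hold FSLABEL_MAX bytes.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int get_fs_label(int fd, char label[FSLABEL_MAX])
{
	return ioctl(fd, FS_IOC_GETFSLABEL, label);
}
```

Setting a label works the same way with FS_IOC_SETFSLABEL and a NUL-terminated buffer of at most FSLABEL_MAX bytes.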
Hongbo Li [Mon, 3 Jun 2024 13:26:18 +0000 (21:26 +0800)]
bcachefs: implement FS_IOC_GETVERSION to support lsattr
In this patch we add the FS_IOC_GETVERSION ioctl for getting
i_generation from the inode; after that, users can list a file's
generation number by using "lsattr".
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 30 May 2024 01:14:40 +0000 (21:14 -0400)]
bcachefs: Unlock trans when waiting for user input in fsck
We can't hold locks while waiting for user input, that's a deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Youling Tang [Fri, 31 May 2024 02:35:09 +0000 (10:35 +0800)]
bcachefs: Add tracepoints for bch2_sync_fs() and bch2_fsync()
Add trace_bch2_sync_fs() and trace_bch2_fsync() implementations.
The output in trace is as follows:
sync-29779 [000] ..... 193.700935: bch2_sync_fs: dev 254,16 wait 1
<...>-40027 [002] ..... 342.535227: bch2_fsync: dev 254,32 ino 4099 parent 4096 datasync 1
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Youling Tang [Fri, 31 May 2024 02:31:15 +0000 (10:31 +0800)]
bcachefs: track writeback errors using the generic tracking infrastructure
We are already using mapping_set_error() in bch2_writepage_io_done(), so all
we need to do is to use file_check_and_advance_wb_err() when handling
fsync() requests in bch2_fsync().
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ariel Miculas [Thu, 30 May 2024 21:13:58 +0000 (00:13 +0300)]
bcachefs: bch2_dir_emit() - fix directory reads in the fuse driver
Commit 0c0cbfdb84725e9933a24ecf47c61bdeeda06ba2 dropped the ctx->pos
update before the call to dir_emit. This breaks the userspace
implementation, causing the directory reads to be stuck in an infinite
loop. This doesn't happen in the kernel because the vfs handles the
updates to ctx->pos, but in the fuse implementation nobody updates
it.
Signed-off-by: Ariel Miculas <ariel.miculas@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 30 May 2024 19:54:08 +0000 (15:54 -0400)]
bcachefs: twf: delete dead struct fields
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 30 May 2024 00:37:39 +0000 (20:37 -0400)]
bcachefs: bch2_stdio_redirect_readline_timeout()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 30 May 2024 00:34:48 +0000 (20:34 -0400)]
bcachefs: twf: convert bch2_stdio_redirect_readline() to darray
We now read the line from the buffer atomically, which means we have to
allow the buffer to grow past STDIO_REDIRECT_BUFSIZE if we're waiting
for a full line - this behaviour is necessary for
stdio_redirect_readline_timeout() in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 30 May 2024 02:06:00 +0000 (22:06 -0400)]
bcachefs: Plumb more logging through stdio redirect
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 9 Feb 2024 02:10:32 +0000 (21:10 -0500)]
bcachefs: fsck_err() may now take a btree_trans
fsck_err() now optionally takes a btree_trans; if the current thread has
one, it is required that it be passed.
The next patch will use this to unlock when waiting for user input.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 29 May 2024 23:37:29 +0000 (19:37 -0400)]
bcachefs: btree_types bitmask cleanups
Make things more consistent and ensure that we're using u64 bitfields -
key types and btree ids are already around 32 bits.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
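Why the bitmask cleanup above insists on u64: with ids approaching 32, an expression like 1 << id on a plain int stops being valid. A minimal sketch of 64-bit mask helpers; id_bit() and mask_has() are hypothetical names, not the macros used in bcachefs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: build and test a 64-bit mask of small ids. */
static inline uint64_t id_bit(unsigned id)
{
	assert(id < 64);		/* ids must fit in a 64-bit mask */
	return UINT64_C(1) << id;	/* 64-bit constant, so id >= 32 is fine */
}

static inline bool mask_has(uint64_t mask, unsigned id)
{
	return (mask & id_bit(id)) != 0;
}
```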
Kent Overstreet [Sun, 7 Apr 2024 03:58:01 +0000 (23:58 -0400)]
bcachefs: Delete old assertion for online fsck
the order in which btree_gc walks keys has changed, so we no longer
have the sort of issues with online fsck this assertion was warning
about.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 29 May 2024 22:54:39 +0000 (18:54 -0400)]
bcachefs: Initialize gc buckets in alloc trigger
Needed for online fsck; we need the trigger to initialize newly
allocated buckets and generation number changes while gc is running.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 29 May 2024 22:53:48 +0000 (18:53 -0400)]
bcachefs: Walk leaf to root in btree_gc
Next change will move gc_alloc_start initialization into the alloc
trigger, so we have to mark those first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 29 May 2024 21:54:46 +0000 (17:54 -0400)]
bcachefs: Don't block journal when finishing check_allocations()
Blocking the journal was needed to finish checking old style accounting,
but that code is gone and it's not needed in the alloc rewrite,
mark_lock is sufficient for synchronization.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 29 May 2024 17:55:49 +0000 (13:55 -0400)]
bcachefs: bch2_fs_get_tree() cleanup
- improve error paths
- call bch2_fs_start() separately, after applying late-parsed options
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 29 May 2024 17:38:06 +0000 (13:38 -0400)]
bcachefs: Kill bch2_mount()
Fold into bch2_fs_get_tree()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 27 Dec 2023 16:33:21 +0000 (11:33 -0500)]
bcachefs: Eytzinger accumulation for accounting keys
The btree write buffer takes as input keys from the journal, sorts them,
deduplicates them, and flushes them back to the btree in sorted order.
The disk space accounting rewrite is moving accounting to normal btree
keys, with updates (in this case deltas) accumulated in the write buffer
and then flushed to the btree; but this is going to increase the number
of keys handled by the write buffer by perhaps as much as a factor of
3x-5x.
The overhead from copying around and sorting this many keys would cause
a significant performance regression, but: there is huge locality in
updates to accounting keys that we can take advantage of.
Instead of appending accounting keys to the list of keys to be sorted,
this patch adds an eytzinger search tree of recently seen accounting
keys. We look up the accounting key in the eytzinger search tree and
apply the delta directly, adding it if it doesn't exist, and
periodically prune the eytzinger tree of unused entries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
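The lookup-and-accumulate idea in the commit above can be sketched with a generic 1-indexed eytzinger (breadth-first) layout. This is an illustration only: struct acct_entry, eytz_build(), eytz_find() and acct_accumulate() are hypothetical names, the keys are plain integers rather than accounting keys, and unlike the patch this sketch neither adds missing keys nor prunes unused ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct acct_entry {
	uint64_t key;	/* stand-in for an accounting key */
	int64_t  sum;	/* accumulated deltas */
};

/*
 * Fill eytz[1..n] (slot 0 unused) from sorted[0..n-1]. An in-order walk
 * of the implicit tree (children of k are 2k and 2k+1) visits slots in
 * ascending key order, so sorted keys are handed out as we go.
 * Call with i = 0, k = 1; returns the number of keys placed.
 */
static size_t eytz_build(const uint64_t *sorted, struct acct_entry *eytz,
			 size_t n, size_t i, size_t k)
{
	if (k <= n) {
		i = eytz_build(sorted, eytz, n, i, 2 * k);
		eytz[k].key = sorted[i++];
		eytz[k].sum = 0;
		i = eytz_build(sorted, eytz, n, i, 2 * k + 1);
	}
	return i;
}

/* Descend from the root; returns the entry for key, or NULL. */
static struct acct_entry *eytz_find(struct acct_entry *eytz, size_t n,
				    uint64_t key)
{
	size_t k = 1;

	assert(eytz != NULL);
	while (k <= n) {
		if (eytz[k].key == key)
			return &eytz[k];
		k = 2 * k + (eytz[k].key < key);	/* right child if key is larger */
	}
	return NULL;
}

/* Apply a delta in place instead of appending another key to sort. */
static int acct_accumulate(struct acct_entry *eytz, size_t n,
			   uint64_t key, int64_t delta)
{
	struct acct_entry *e = eytz_find(eytz, n, key);

	if (!e)
		return -1;	/* the patch adds the key here; this sketch does not */
	e->sum += delta;
	return 0;
}
```

Repeated updates to the same key then cost one tree descent each and never grow the list of keys that has to be sorted.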
Kent Overstreet [Tue, 19 Mar 2024 04:04:52 +0000 (00:04 -0400)]
bcachefs: bch_acct_rebalance_work
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 29 Feb 2024 03:37:21 +0000 (22:37 -0500)]
bcachefs: bch_acct_btree
Add counters for how much disk space we're using per btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 07:17:02 +0000 (02:17 -0500)]
bcachefs: bch_acct_snapshot
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 Feb 2024 22:23:41 +0000 (17:23 -0500)]
bcachefs: bch2_fs_usage_base_to_text()
Helper to show raw accounting in sysfs, mainly for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Feb 2024 00:58:07 +0000 (19:58 -0500)]
bcachefs: bch2_fs_accounting_to_text()
Helper to show raw accounting in sysfs, mainly for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Feb 2024 02:09:51 +0000 (21:09 -0500)]
bcachefs: Convert bch2_compression_stats_to_text() to new accounting
We no longer have to walk the whole btree to calculate compression
stats.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 7 Jan 2024 02:42:36 +0000 (21:42 -0500)]
bcachefs: bch_acct_compression
This adds per-compression-type accounting of compressed and uncompressed
size as well as number of extents - meaning we can now see compression
ratio (without walking the whole filesystem).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 18 Feb 2024 05:13:22 +0000 (00:13 -0500)]
bcachefs: bch2_verify_accounting_clean()
Verify that the in-memory accounting matches the on-disk accounting
after a clean shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 20:21:10 +0000 (15:21 -0500)]
bcachefs: Convert bch2_replicas_gc2() to new accounting
bch2_replicas_gc2() is used for garbage collecting superblock replicas
entries that are empty - this converts it to the new accounting scheme.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 12 Feb 2024 03:48:05 +0000 (22:48 -0500)]
bcachefs: Convert gc to new accounting
Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all triggers in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 2 Jan 2024 05:22:57 +0000 (00:22 -0500)]
bcachefs: Kill replicas_journal_res
More dead code deletion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 2 Jan 2024 05:15:16 +0000 (00:15 -0500)]
bcachefs: Kill fs_usage_online
More dead code deletion.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 25 Feb 2024 01:04:48 +0000 (20:04 -0500)]
bcachefs: Kill bch2_fs_usage_to_text()
Dead code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 28 Dec 2023 03:09:25 +0000 (22:09 -0500)]
bcachefs: Delete journal-buf-sharded old style accounting
More deletion of dead code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 1 Jan 2024 03:30:15 +0000 (22:30 -0500)]
bcachefs: Kill writing old accounting to journal
More ripping out of the old disk space accounting.
Note that the new disk space accounting is incompatible with the old,
and writing out old style disk space accounting with the new code is
infeasible.
This means upgrading and downgrading past this version requires
regenerating accounting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 2 Jan 2024 04:36:23 +0000 (23:36 -0500)]
bcachefs: kill bch2_fs_usage_read()
With bch2_ioctl_fs_usage(), this is now dead code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sun, 7 Jan 2024 01:29:25 +0000 (20:29 -0500)]
bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
This converts bch2_ioctl_fs_usage() to read from the new disk
accounting, via bch2_fs_replicas_usage_read().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 6 Jan 2024 02:23:07 +0000 (21:23 -0500)]
bcachefs: Kill bch2_fs_usage_initialize()
Deleting code for the old disk accounting scheme.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 2 Jan 2024 00:42:37 +0000 (19:42 -0500)]
bcachefs: dev_usage updated by new accounting
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.
This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a device
counter update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 4 Jun 2024 22:31:13 +0000 (18:31 -0400)]
bcachefs: Coalesce accounting keys before journal replay
This fixes a performance regression in journal replay; without
coalescing accounting keys we have multiple keys at the same position,
which means journal_keys_peek_upto() has to skip past many overwritten
keys - turning journal replay into an O(n^2) algorithm.
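
A minimal sketch of the coalescing idea, not the bcachefs implementation:
struct acct_key and coalesce_acct_keys() are hypothetical names, and a
real accounting key carries more than one position field and counter.
Sorting by position and summing runs of equal positions leaves at most
one key per position, so a later lookup never has to step over
overwritten duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplified key: one position field, one delta counter. */
struct acct_key {
	uint64_t pos;
	int64_t  delta;
};

static int acct_key_cmp(const void *l, const void *r)
{
	const struct acct_key *a = l, *b = r;

	return (a->pos > b->pos) - (a->pos < b->pos);
}

/*
 * Sort keys by position, then fold each run of equal positions into a
 * single key by summing the deltas in place. Returns the new key count.
 */
static size_t coalesce_acct_keys(struct acct_key *keys, size_t nr)
{
	size_t dst = 0;

	assert(keys || !nr);
	if (!nr)
		return 0;

	qsort(keys, nr, sizeof(keys[0]), acct_key_cmp);

	for (size_t src = 1; src < nr; src++) {
		if (keys[src].pos == keys[dst].pos)
			keys[dst].delta += keys[src].delta;
		else
			keys[++dst] = keys[src];
	}
	return dst + 1;
}
```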
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 9 Nov 2023 19:22:46 +0000 (14:22 -0500)]
bcachefs: Disk space accounting rewrite
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percpu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
since they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
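
For readers unfamiliar with the layout, here is a rough sketch of an
eytzinger (breadth-first) search tree stored in an array. This is an
illustration only, not the kernel's eytzinger helpers: eytz_build() and
eytz_find() are hypothetical names, the array is 1-based with slot 0
unused, and plain u64 keys stand in for accounting keys. The index
returned by the lookup is what would select a set of counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill eytz[1..n] from sorted[0..n-1] by an in-order walk of the
 * implicit tree in which node k has children 2k and 2k+1.
 * Call as eytz_build(sorted, eytz, n, 0, 1); returns elements consumed.
 */
static size_t eytz_build(const uint64_t *sorted, uint64_t *eytz,
			 size_t n, size_t i, size_t k)
{
	if (k <= n) {
		i = eytz_build(sorted, eytz, n, i, 2 * k);
		assert(i < n);
		eytz[k] = sorted[i++];
		i = eytz_build(sorted, eytz, n, i, 2 * k + 1);
	}
	return i;
}

/* Return the slot in eytz[1..n] holding key, or 0 if it is absent. */
static size_t eytz_find(const uint64_t *eytz, size_t n, uint64_t key)
{
	size_t k = 1;

	while (k <= n) {
		if (eytz[k] == key)
			return k;
		/* Descend left (2k) or right (2k+1). */
		k = 2 * k + (eytz[k] < key);
	}
	return 0;
}
```

The lookup touches one node per tree level, starting from the front of
the array, which is the usual reason to prefer this layout over binary
search on a sorted array.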
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 17 Nov 2023 05:23:07 +0000 (00:23 -0500)]
bcachefs: btree write buffer knows how to accumulate bch_accounting keys
Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.
Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.
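
The two rules above can be sketched as follows. This is a simplified
model under stated assumptions, not the write buffer code: struct
wb_key, wb_merge() and wb_may_flush() are hypothetical, and a single
integer stands in for the key's value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum wb_key_type { WB_KEY_NORMAL, WB_KEY_ACCOUNTING };

/* Hypothetical buffered update: type, position and a single value. */
struct wb_key {
	enum wb_key_type type;
	uint64_t	 pos;
	int64_t		 val;
};

/* Fold a newer buffered update into an older one at the same position. */
static void wb_merge(struct wb_key *older, const struct wb_key *newer)
{
	assert(older->pos == newer->pos && older->type == newer->type);

	if (newer->type == WB_KEY_ACCOUNTING)
		older->val += newer->val;	/* deltas are added together */
	else
		older->val = newer->val;	/* the newer version wins */
}

/*
 * Accounting keys are held back until journal replay has finished
 * flushing accounting; every other key may be flushed at any time.
 */
static bool wb_may_flush(const struct wb_key *k, bool accounting_replay_done)
{
	return k->type != WB_KEY_ACCOUNTING || accounting_replay_done;
}
```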
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 28 Dec 2023 01:59:01 +0000 (20:59 -0500)]
bcachefs: Accumulate accounting keys in journal replay
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.
Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.
That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.
There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.
To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.
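
A small sketch of the version check described above, assuming each
delta and each btree key carries a single monotonically increasing
version number. The struct and function names are hypothetical and the
real key layout differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical journal entry: a delta tagged with its version. */
struct acct_delta {
	uint64_t version;
	int64_t	 delta;
};

/* Hypothetical btree key: running total plus the newest version applied. */
struct acct_btree_key {
	uint64_t version;
	int64_t	 total;
};

/*
 * Apply a delta from the journal only if it is newer than what the
 * btree already holds. Returns true if the delta was applied.
 */
static bool replay_acct_delta(struct acct_btree_key *k,
			      const struct acct_delta *d)
{
	assert(k && d);

	if (d->version <= k->version)
		return false;	/* already reflected in the btree */

	k->total  += d->delta;
	k->version = d->version;
	return true;
}
```

In this model the hazard is easy to see: if a write buffer flush bumps
the btree version past deltas that are still waiting in the journal,
those deltas fail the comparison and are silently dropped.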
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 27 Dec 2023 23:31:46 +0000 (18:31 -0500)]
bcachefs: KEY_TYPE_accounting
New key type for the disk space accounting rewrite.
- Holds a variable sized array of u64s (may be more than one for
  accounting e.g. compressed and uncompressed size, or buckets and
  sectors for a given data type)
- Updates are deltas, not new versions of the key: this means updates
  to accounting can happen via the btree write buffer, which we'll be
  teaching to accumulate deltas.
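
As a rough illustration of delta semantics on a small array of
counters: struct acct_val, ACCT_MAX_COUNTERS and acct_accumulate() are
hypothetical, and a fixed-size array stands in for the variable sized
one. Because unsigned arithmetic in C wraps modulo 2^64, a u64 delta
can encode a decrement as well as an increment.

```c
#include <assert.h>
#include <stdint.h>

#define ACCT_MAX_COUNTERS 3	/* hypothetical upper bound for the sketch */

/* Hypothetical value: nr counters, e.g. { buckets, sectors }. */
struct acct_val {
	unsigned nr;
	uint64_t d[ACCT_MAX_COUNTERS];
};

/* Add each counter of the delta src onto the matching counter of dst. */
static void acct_accumulate(struct acct_val *dst, const struct acct_val *src)
{
	assert(dst->nr == src->nr && src->nr <= ACCT_MAX_COUNTERS);

	for (unsigned i = 0; i < src->nr; i++)
		dst->d[i] += src->d[i];	/* wraps mod 2^64, so -30 works */
}
```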
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Thomas Bertschinger [Tue, 28 May 2024 04:36:11 +0000 (22:36 -0600)]
bcachefs: use new mount API
This updates bcachefs to use the new mount API:
- Update the file_system_type to use the new init_fs_context()
  function.
- Define the new fs_context_operations functions.
- No longer register bch2_mount() and bch2_remount(); these are now
  called via the new fs_context functions.
- Define a new helper type, bch2_opts_parse that includes a struct
  bch_opts and additionally a printbuf used to save options that can't
  be parsed until after the FS is opened. This enables us to parse as
  many options as possible prior to opening the filesystem while saving
  those options that need the open FS for later parsing.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Thomas Bertschinger [Tue, 28 May 2024 04:36:10 +0000 (22:36 -0600)]
bcachefs: Add error code to defer option parsing
This introduces a new error code, option_needs_open_fs, which is used to
indicate that an attempt was made to parse a mount option prior to
opening a filesystem, when that mount option requires an open filesystem
in order to be validated.
Returning this error results in bch2_parse_one_mount_opt() saving that
option for later parsing, after the filesystem is opened.
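
The defer-and-save flow might look roughly like the sketch below. It is
a userspace model, not the bcachefs code: parse_one_opt(),
parse_or_defer() and the OPT_* codes are hypothetical, and a plain char
buffer stands in for the printbuf. Only the "metadata_target" option
name is taken from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { OPT_OK = 0, OPT_NEEDS_OPEN_FS = 1 };	/* hypothetical codes */

/*
 * Hypothetical validator: "metadata_target" names a device, so it can
 * only be checked once the filesystem is open.
 */
static int parse_one_opt(const char *name, bool fs_open)
{
	if (!strcmp(name, "metadata_target") && !fs_open)
		return OPT_NEEDS_OPEN_FS;
	return OPT_OK;
}

/*
 * Parse name=val before the filesystem is open. If the option needs an
 * open filesystem, append "name=val" to the comma-separated deferred
 * buffer instead. Returns 0 on success, -1 if the buffer is too small.
 */
static int parse_or_defer(const char *name, const char *val,
			  char *deferred, size_t size)
{
	size_t used;
	int n;

	if (parse_one_opt(name, false) != OPT_NEEDS_OPEN_FS)
		return 0;

	used = strlen(deferred);
	assert(used < size);

	n = snprintf(deferred + used, size - used, "%s%s=%s",
		     used ? "," : "", name, val);
	return (n < 0 || (size_t)n >= size - used) ? -1 : 0;
}
```

Once the filesystem is open, the saved string can be run through the
same parser again with fs_open set to true.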
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Thomas Bertschinger [Tue, 28 May 2024 04:36:09 +0000 (22:36 -0600)]
bcachefs: add printbuf arg to bch2_parse_mount_opts()
Mount options that take the name of a device that may be part of a
filesystem, for example "metadata_target", cannot be validated until
after the filesystem has been opened. However, an attempt to parse those
options may be made prior to the filesystem being opened.
This change adds a printbuf parameter to bch2_parse_mount_opts() which
will be used to save those mount options, when they are supplied prior
to the FS being opened, so that they can be parsed later.
This functionality is not currently needed, but will be used after
bcachefs starts using the new mount API to parse mount options. This is
because using the new mount API, we will process mount options prior to
opening the FS, but the new API doesn't provide a convenient way to
"replay" mount option parsing. So we save these options ourselves to
accomplish this.
This change also splits out the code to parse a single option into
bch2_parse_one_mount_opt(), which will be useful when using the new
mount API which deals with a single mount option at a time.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 26 Apr 2024 00:45:00 +0000 (20:45 -0400)]
bcachefs: metadata version bucket_stripe_sectors
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.
Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>