git.kernel.dk Git - linux-block.git/log

btrfs: change return type of btrfs_bio_csum() to int

The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
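
The split described in the commit above can be sketched in userspace: the checksum helper returns a plain int (0 or a negative errno), and the conversion to a block-layer status happens once at the boundary. All model_* names are hypothetical stand-ins; in the kernel the conversion helper is errno_to_blk_status().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the block layer status type. */
enum model_blk_status {
	MODEL_STS_OK = 0,
	MODEL_STS_RESOURCE,
	MODEL_STS_IOERR,
};

/* Internal helper: plain int, 0 on success or a negative errno. */
static int model_csum_one_bio(const unsigned char *data, size_t len,
			      unsigned int *csum)
{
	unsigned int sum = 0;

	if (!data || !csum)
		return -EINVAL;
	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + data[i];
	*csum = sum;
	return 0;
}

/* The conversion lives in one place, at the edge facing block layer code. */
static enum model_blk_status model_errno_to_status(int err)
{
	switch (err) {
	case 0:
		return MODEL_STS_OK;
	case -ENOMEM:
		return MODEL_STS_RESOURCE;
	default:
		return MODEL_STS_IOERR;
	}
}

/* Boundary function: the only caller that needs the block layer type. */
static enum model_blk_status model_bio_csum(const unsigned char *data,
					    size_t len, unsigned int *csum)
{
	return model_errno_to_status(model_csum_one_bio(data, len, csum));
}
```

With this shape, callers that never touch the block layer can propagate the int unchanged.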

btrfs: change return type of btree_csum_one_bio() to int

The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: change return type of btrfs_csum_one_bio() to int

The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: change return type of btrfs_lookup_bio_sums() to int

The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: drop redundant local variable in raid_wait_write_end_io()

The bio status is read only once, no variable needed for that.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: merge __setup_root() to btrfs_alloc_root()

There's only one caller of __setup_root() so merge it there.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use unsigned types for constants defined as bit shifts

The unsigned type is a recommended practice (CWE-190, CWE-194) for bit
shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in the btrfs codebase, follow the
recommendation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
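
A small userspace sketch of the sign extension problem the commit above refers to. The flag names are hypothetical, and INT32_MIN is used to spell out the bit pattern of a signed constant with only bit 31 set, because writing 1 << 31 with a signed int is itself undefined behavior in C.

```c
#include <stdint.h>

/* Hypothetical flag in bit 31, once as a signed and once as an unsigned value. */
#define MODEL_FLAG_SIGNED	INT32_MIN
#define MODEL_FLAG_UNSIGNED	(1U << 31)

/* Widening the signed value sign-extends into the upper 32 bits. */
static uint64_t model_widen_signed(void)
{
	return (uint64_t)MODEL_FLAG_SIGNED;
}

/* Widening the unsigned value zero-extends, which is what a bit mask wants. */
static uint64_t model_widen_unsigned(void)
{
	return (uint64_t)MODEL_FLAG_UNSIGNED;
}
```

The signed variant ends up with all upper bits set when stored in a 64-bit field, so a mask test against unrelated high bits would match unexpectedly.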

btrfs: remove unused btrfs_io_stripe::length

First added (but not effectively used) in 02c372e1f016e5 ("btrfs: add
support for inserting raid stripe extents"). The structure is
initialized to zeros so the only use in btrfs_insert_one_raid_extent()

    u64 length = bioc->stripes[i].length;
    struct btrfs_raid_stride *raid_stride = &stripe_extent->strides[i];

    if (length == 0)
            length = bioc->size;

the 'if' always happens.

Last use in 4016358e852861 ("btrfs: remove unused variable length in
btrfs_insert_one_raid_extent()") was an obvious cleanup. It seems to be
safe to remove, raid-stripe-tree works without using it since 6.6.

This was found by tool https://github.com/jirislaby/clang-struct .

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use list_first_entry() everywhere

Using the helper makes it a bit more clear that we're accessing the
first list entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
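
For readers unfamiliar with the helper named in the commit above, here is a userspace sketch of an intrusive list with a first-entry accessor. The model_* names are hypothetical and only mirror the shape of the kernel's list_entry() and list_first_entry() macros.

```c
#include <stddef.h>

struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define model_list_entry(ptr, type, member) \
	model_container_of(ptr, type, member)

/* First element of a non-empty list: the entry that embeds head->next. */
#define model_list_first_entry(head, type, member) \
	model_list_entry((head)->next, type, member)

static void model_list_init(struct model_list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void model_list_add_tail(struct model_list_head *item,
				struct model_list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Example payload with an embedded list node. */
struct model_work {
	int id;
	struct model_list_head list;
};
```

Both spellings return the same pointer; the first-entry form states the intent at the call site instead of leaving the reader to interpret head->next.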

btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN()

ASSERT(0) may be useful in some cases but acts more like a
notice for developers. Assertions can be compiled in independently so
convert it to a debugging helper.

The difference is that it's just a warning and will not end up in BUG().
The converted cases are in connection with proper error handling.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARN

Use the conditional warning instead of typing the whole condition.
Optional message is printed where it seems clear what could be the
problem.

Conversion is left out in btree_csum_one_bio() because of the additional
condition.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add debug build only WARN

Add conditional WARN() wrapper that's enabled only in debug build. It
should be used for unexpected conditions that should be noisy. Use it
instead of ASSERT(0). As it will not lead to BUG() make sure that
continuing is still possible, e.g. the error is handled anyway.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
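
A userspace sketch of the idea in the commit above: a warning macro that is noisy only in debug builds, placed on a path where the error is handled regardless. MODEL_DEBUG, MODEL_DEBUG_WARN() and model_check_type() are hypothetical names, not the btrfs definitions.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical build switch standing in for a debug config option. */
#define MODEL_DEBUG 1

/* Counts emitted warnings so the behavior can be observed. */
static int model_warn_count;

#if MODEL_DEBUG
#define MODEL_DEBUG_WARN(...)					\
	do {							\
		model_warn_count++;				\
		fprintf(stderr, "WARNING: " __VA_ARGS__);	\
		fputc('\n', stderr);				\
	} while (0)
#else
#define MODEL_DEBUG_WARN(...) do { } while (0)
#endif

/* The error is returned either way; the warning only adds noise in debug builds. */
static int model_check_type(int type)
{
	if (type < 0 || type > 2) {
		MODEL_DEBUG_WARN("unexpected type %d", type);
		return -EINVAL;
	}
	return 0;
}
```

Unlike an assertion that ends in a crash, execution continues after the warning, so the macro only fits places where the caller already copes with the failure.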

btrfs: use verbose ASSERT() in volumes.c

The file volumes.c has about 40 assertions and half of them are suitable
for ASSERT() with additional data.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: enhance ASSERT() to take optional format string

Currently ASSERT() prints the stringified condition without macro
expansion, so simple constants like BTRFS_MAX_METADATA_BLOCKSIZE remain
readable in the output.

There are expressions where we'd like to see the exact values but all we
get is something like:

assertion failed: em->start <= start && start < extent_map_end(em), in fs/btrfs/extent_map.c:613

It would be nice to be able to print any additional information to help
understand the problem. With some preprocessor magic and compile-time
optimizations we can enhance ASSERT to work like that as well:

ASSERT(value > limit, "value=%llu limit=%llu", value, limit);

with free-form printk arguments that will be part of the assertion
message.

Pros:
- helps debugging and understanding reported problems
- the optional format is verified at compile-time

Cons:
- increases the .ko size
- writing the assertion code is repetitive (condition, format, values)
- format and variable type must match (extra lookup)
- needs gcc 8.x and newer, otherwise it's the short format

Recommended use is for non-trivial expressions, so basic ASSERT(value) can be
used for pointers or sometimes integers.

The format has been slightly updated to also print the result of the
evaluation of the condition, appended to the stringified condition as
"condition :: <value>".

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
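
One way to get an assertion macro with an optional printf-style message, sketched in userspace. This is not the btrfs implementation: the model_* names are hypothetical, the failure is recorded in a buffer instead of ending in BUG(), and calling the macro without a format relies on compilers accepting an empty variadic argument list (standard only since C23, but accepted by GCC and Clang).

```c
#include <stdarg.h>
#include <stdio.h>

/* Last failure message, kept in a buffer so the sketch does not abort. */
static char model_assert_msg[256];

static void model_assert_fail(const char *expr, const char *file, int line,
			      const char *fmt, ...)
{
	va_list args;
	int n;

	n = snprintf(model_assert_msg, sizeof(model_assert_msg),
		     "assertion failed: %s, in %s:%d", expr, file, line);
	/* No extra message given, or no room left for it. */
	if (fmt[0] == '\0' || n < 0 || (size_t)n >= sizeof(model_assert_msg) - 1)
		return;
	model_assert_msg[n++] = ' ';
	va_start(args, fmt);
	vsnprintf(model_assert_msg + n, sizeof(model_assert_msg) - n, fmt, args);
	va_end(args);
}

/* The empty string literal concatenates with the optional format string. */
#define MODEL_ASSERT(cond, ...)						\
	do {								\
		if (!(cond))						\
			model_assert_fail(#cond, __FILE__, __LINE__,	\
					  "" __VA_ARGS__);		\
	} while (0)

/* Long form: the values are printed along with the condition. */
static void model_check_limit(unsigned long long value,
			      unsigned long long limit)
{
	MODEL_ASSERT(value <= limit, "value=%llu limit=%llu", value, limit);
}

/* Short form: enough for a plain pointer check. */
static void model_check_ptr(const void *ptr)
{
	MODEL_ASSERT(ptr);
}
```

The long form costs an extra format string per call site, which matches the size and repetition trade-offs listed in the commit message.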

btrfs: remove BTRFS_REF_LAST from enum btrfs_ref_type

Commit b28b1f0ce44c ("btrfs: delayed-ref: Introduce better documented
delayed ref structures") introduced BTRFS_REF_LAST, which can be used
for sanity checking, e.g. in switch/case or for loops.

In btrfs_ref_type() there is an assertion

ASSERT(ref->type == BTRFS_REF_DATA || ref->type == BTRFS_REF_METADATA);

to validate the values so we don't need the ending enum.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use bvec_kmap_local() in btrfs_decompress_buf2page()

This removes the last direct poke into bvec internals in btrfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: use virtual addresses directly

Instead of the old @page and @page_offset pair inside scrub, here we can
directly use the virtual address for a sector.

This has the following benefit:

- Simplified parameters
  A single @kaddr will replace @page and @page_offset.

- No more unnecessary kmap/kunmap calls
  Since all pages utilized by scrub are allocated by scrub, and no
  highmem is allowed, we do not need to do any kmap/kunmap.

  And add an ASSERT() inside the new scrub_stripe_get_kaddr() to
  catch any unexpected highmem page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: raid56: store a physical address in structure sector_ptr

Instead of using a @page + @pg_offset pair inside sector_ptr structure,
use a single physical address instead.

This allows us to grab both the page and offset from a single u64 value.
Although we still need an extra bool value, @has_paddr, to distinguish
if the sector is properly mapped (as the 0 physical address is totally
valid).

This change doesn't change the size of structure sector_ptr, but reduces
the parameters of several functions.

Note: the original idea and patch is from Christoph Hellwig
(https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/)
but the final implementation is different.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ Use physical addresses instead to handle highmem. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
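
To illustrate how one u64 carries both the page and the offset, as the commit above describes, here is a userspace sketch assuming 4K pages. The structure and helper names are hypothetical; only the @has_paddr flag and its rationale come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4K pages. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1ULL << MODEL_PAGE_SHIFT)

struct model_sector_ptr {
	uint64_t paddr;
	/* Needed because physical address 0 is a valid address. */
	bool has_paddr;
};

/* Page frame number: the upper bits of the physical address. */
static uint64_t model_paddr_to_pfn(uint64_t paddr)
{
	return paddr >> MODEL_PAGE_SHIFT;
}

/* Offset inside the page: the lower MODEL_PAGE_SHIFT bits. */
static uint32_t model_paddr_to_offset(uint64_t paddr)
{
	return (uint32_t)(paddr & (MODEL_PAGE_SIZE - 1));
}

static struct model_sector_ptr model_sector_set(uint64_t pfn, uint32_t offset)
{
	struct model_sector_ptr sector = {
		.paddr = (pfn << MODEL_PAGE_SHIFT) + offset,
		.has_paddr = true,
	};

	return sector;
}
```

A sector mapped at page frame 0, offset 0 has paddr == 0, which is why the flag and not the address decides whether the sector is mapped.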

btrfs: simplify bvec iteration in index_one_bio()

Flatten the two loops by open coding bio_for_each_segment() and advancing
the iterator one sector at a time.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Fix a bug that @offset is not increased. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move kmapping out of btrfs_check_sector_csum()

Move kmapping the page out of btrfs_check_sector_csum().

This allows using bvec_kmap_local() where suitable and reduces the number
of kmap*() calls in the raid56 code.

This also means btrfs_check_sector_csum() will only accept a properly
kmapped address.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass a physical address to btrfs_repair_io_failure()

Using physical address has the following advantages:

- All involved callers only need a single pointer
  Instead of the old @folio + @offset pair.

- No complex poking into the bio_vec structure
  As a bio_vec can be single or multiple paged, grabbing the real page
  can be quite complex if the bio_vec is a multi-page one.

  Instead bvec_phys() will always give a single physical address, and it
  can be easily converted to a page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: track the next file offset in struct btrfs_bio_ctrl

The bio implementation is not something we should really mess around with,
and we shouldn't recalculate the pos from the folio over and over.
Instead just track the end of the current bio in logical file offsets
in the btrfs_bio_ctrl, which is much simpler and easier to read.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
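
The bookkeeping described in the commit above can be sketched as follows, in userspace and with hypothetical names: the control structure remembers where the current bio ends in file offsets, and a new range is contiguous only if it starts exactly there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of a bio control structure. */
struct model_bio_ctrl {
	bool has_bio;
	/* File offset right after the last byte added to the current bio. */
	uint64_t next_file_offset;
};

/*
 * Add a range. Returns true if it extends the current bio, false if a new
 * bio has to be started. Either way the tracked end offset is updated.
 */
static bool model_submit_range(struct model_bio_ctrl *ctrl,
			       uint64_t file_offset, uint32_t len)
{
	bool contig = ctrl->has_bio && ctrl->next_file_offset == file_offset;

	ctrl->has_bio = true;
	ctrl->next_file_offset = file_offset + len;
	return contig;
}
```

The contiguity test is a single comparison and needs no knowledge of the folio or of the bio internals.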

btrfs: remove the alignment checks in end_bbio_data_read()

end_bbio_data_read() checks that each iterated folio fragment is aligned
and justifies that with block drivers advancing the bio. But block
drivers only advance bi_iter, while end_bbio_data_read() uses
bio_for_each_folio_all() to iterate the immutable bi_io_vec array that
can't be changed by drivers at all.

Furthermore btrfs has already done the alignment check of the file
offset inside submit_one_sector(), and the size is fixed to fs block
size, so there is no need to redo the alignment check inside the
endio function.

So just remove the unnecessary alignment check along with the incorrect
comment.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: update and correct description of btrfs_get_or_create_delayed_node()

The comment mistakenly says the function is returning PTR_ERR instead of
ERR_PTR. Fix it and update it so it's more descriptive.

Signed-off-by: Charles Han <hanchunchao@inspur.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Enhance the function comment. ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify return logic from btrfs_delayed_ref_init()

Make this simpler by returning directly when there's no other cleanup
needed.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: reuse exit helper for cleanup in btrfs_bioset_init()

Do not duplicate the cleanup after failed initialization
in btrfs_bioset_init() and reuse the exit function btrfs_bioset_exit().

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
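
The cleanup pattern from the commit above, sketched in userspace with hypothetical names. It assumes the exit helper tolerates a partially initialized state, which holds here because free(NULL) is a no-op; the fail_step parameter exists only to simulate an allocation failure.

```c
#include <errno.h>
#include <stdlib.h>

struct model_pools {
	void *a;
	void *b;
	void *c;
};

/* Safe on a partially initialized structure: free(NULL) does nothing. */
static void model_pools_exit(struct model_pools *p)
{
	free(p->c);
	free(p->b);
	free(p->a);
	p->a = p->b = p->c = NULL;
}

/* Allocation that can be told to fail at a given step, for demonstration. */
static void *model_alloc(int step, int fail_step)
{
	return step == fail_step ? NULL : malloc(16);
}

static int model_pools_init(struct model_pools *p, int fail_step)
{
	p->a = p->b = p->c = NULL;

	p->a = model_alloc(1, fail_step);
	if (!p->a)
		goto out;
	p->b = model_alloc(2, fail_step);
	if (!p->b)
		goto out;
	p->c = model_alloc(3, fail_step);
	if (!p->c)
		goto out;
	return 0;
out:
	/* One exit path reuses the regular teardown instead of duplicating it. */
	model_pools_exit(p);
	return -ENOMEM;
}
```

Every failure point shares a single label, so adding a fourth pool means touching the init sequence and the exit helper, not a chain of unwind labels.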

btrfs: rename iov_iter iterator parameter in btrfs_buffered_write()

Using 'i' for a parameter is confusing and does not conform to current
preferences, so rename it to 'iter'.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: enable large data folios support for defrag

Currently we reject large folios for defrag gracefully, but the
implementation itself is already mostly large folios compatible.

There are several parts of defrag in btrfs:

- Extent map checking
  Aka, defrag_collect_targets(), which prepares a list of target ranges
  that should be defragged.

  This part is completely folio unrelated, thus it doesn't care about
  the folio size.

- Target folio preparation
  Aka, defrag_prepare_one_folio(), which locks and reads (if needed) the
  target folio.

  Since folio read and lock are already supporting large folios, this
  part needs only minor changes.

- Redirty the target range of the folio
  This is already done in a way supporting large folios.

So it's pretty straightforward to enable large folios for defrag:

- Do not reject large folios for experimental builds
  This affects the large folio check inside defrag_prepare_one_folio().

- Wait for ordered extents of the whole folio in
  defrag_prepare_one_folio()

- Lock the whole extent range for all involved folios in
  defrag_one_range()

- Allow the folios[] array to be partially empty
  Since we can have large folios, folios[] will not always be full.

  This affects:
  * How to allocate folios in defrag_one_range()
    Now we cannot use page index, but use the end position of the folio
    as an iterator.

  * How to free the folios[] array
    If we hit an empty slot, it means we have large folios and already
    hit the end of the array.

  * How to mark the range dirty
    Instead of using page index directly, we have to go through each
    folio, and check if the folio covers the defrag target inside
    defrag_one_locked_target().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare compression paths for large data folios

None of the compression algorithms in btrfs support large folios
due to the following points:

- btrfs_calc_input_length() is assuming page sized folio

- kmap_local_folio() usages are using offset_in_page()

Prepare them to support large data folios by:

- Add a folio parameter to btrfs_calc_input_length()
  And use that folio parameter to calculate the correct length.

  Since we're here, also add extra ASSERT()s to make sure the parameter
  @cur is inside the folio range.

  This affects only zlib and zstd. Lzo compresses at most one block at a
  time, thus not affected.

- Use offset_in_folio() to calculate the kmap_local_folio() offset
  This affects all 3 algorithms.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
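
The difference between the two offset calculations named in the commit above can be shown with a userspace model. It assumes 4K pages and a folio whose size is a power of two and whose start is aligned to that size; the model_* names are hypothetical.

```c
#include <stdint.h>

/* Assumption for this sketch: 4K pages. */
#define MODEL_PAGE_SIZE	4096ULL

/* Reduced folio model: file position of its start and its size in bytes. */
struct model_folio {
	uint64_t pos;
	uint64_t size;
};

/* Offset inside the 4K page that contains @pos. */
static uint64_t model_offset_in_page(uint64_t pos)
{
	return pos & (MODEL_PAGE_SIZE - 1);
}

/* Offset inside the whole folio, correct for any power-of-two folio size. */
static uint64_t model_offset_in_folio(const struct model_folio *folio,
				      uint64_t pos)
{
	return pos & (folio->size - 1);
}
```

For an 8K folio starting at file offset 16K and a position of 22K, the page-based calculation yields 2K while the folio-based one yields 6K, and only the latter points at the right byte when the mapping starts at the beginning of the folio.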

btrfs: rename __tree_search() to remove double underscore prefix

There's no need to have a double underscore prefix as there's no variant
of the function without it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename __lookup_extent_mapping() to remove double underscore prefix

There's no need to have a double underscore prefix as there's no variant
of the function without it anymore.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename remaining exported extent map functions

Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename functions to allocate and free extent maps

These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename extent map functions to get block start, end and check if in tree

These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename exported extent map compression functions

These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: remove no longer used tracepoints for eb locking

There are several tracepoints for extent buffer locks that are not used
anymore:

  * btrfs_tree_read_unlock_blocking
  * btrfs_set_lock_blocking_read
  * btrfs_set_lock_blocking_write
  * btrfs_tree_read_lock_atomic

These stopped being used after we switched extent buffer locks from a
custom implementation to rw semaphores in commit 196d59ab9ccc
("btrfs: switch extent buffer tree lock to rw_semaphore").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add btrfs prefix to names where it's missing

Most of our tracepoints have the 'btrfs_' prefix in their names but a few
of them are missing, making it inconsistent. So add the prefix to the ones
that are missing it, creating consistency, making it clear for users these
are btrfs tracepoints and eventually avoid name collisions with other
tracepoints defined by other kernel subsystems.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_find_contiguous_extent_bit() return bool instead of int

The function needs only to return true or false, so there's no need to
return an integer. Currently it returns 0 when a range with the given
bits is found and 1 when not found, which is a bit counterintuitive too.
So change the function to return a bool instead, returning true when a
range is found and false otherwise. Update the function's documentation
to mention the return value too.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
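
A userspace sketch of the calling convention the commit above moves to: the search reports success as a bool and hands the range back through out parameters. The function below works on a plain array and is a hypothetical model, not the io tree code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Find the first contiguous run of set entries at or after @start.
 * Returns true and fills @start_ret/@end_ret (inclusive) if a run exists,
 * false otherwise, in which case the out parameters are left untouched.
 */
static bool model_find_contiguous_set(const bool *bits, size_t nbits,
				      size_t start, size_t *start_ret,
				      size_t *end_ret)
{
	size_t i = start;

	while (i < nbits && !bits[i])
		i++;
	if (i >= nbits)
		return false;

	*start_ret = i;
	while (i + 1 < nbits && bits[i + 1])
		i++;
	*end_ret = i;
	return true;
}
```

A caller can then write `if (model_find_contiguous_set(...))` and read it as "found", without remembering that 0 used to mean success.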

btrfs: remove double underscore prefix from __set_extent_bit()

Now that set_extent_bit() was renamed to btrfs_set_extent_bit(), there's
no need to have a __set_extent_bit() function, we can just remove the
double underscore prefix, which we try to avoid according to the coding
style conventions.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename remaining exported functions from extent-io-tree.h

Rename the remaining exported functions that don't have a 'btrfs_' prefix.
By convention exported functions should have such prefix to make it clear
they are btrfs specific and to avoid collisions with functions from
elsewhere in the kernel.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename free_extent_state() to include a btrfs prefix

This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.

Rename the function to add 'btrfs_' prefix to it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename the functions to count, test and get bit ranges in io trees

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their names to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename the functions to init and release an extent io tree

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: directly grab inode at __btrfs_debug_check_extent_io_range()

We've tested that we are dealing with io tree that is associated to an
inode (its owner is IO_TREE_INODE_IO), so there's no need to call
btrfs_extent_io_tree_to_inode() in a separate line and we just assign
tree->inode to the local inode variable when we declare it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename the functions to get inode and fs_info from an extent io tree

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs. Also remove the 'const' suffix from extent_io_tree_to_inode_const()
since there's no non-const variant anymore and makes the naming consistent
with extent_io_tree_to_fs_info() (no 'const' suffix and returns a const
pointer).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename the functions to search for bits in extent ranges

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.

So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename set_extent_bit() to include a btrfs prefix

This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.

So rename it to btrfs_set_extent_bit().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename the functions to clear bits for an extent range

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.

So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename __lock_extent() and __try_lock_extent()

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. Their double
underscore prefix is also discouraged.

So remove their double underscore prefix, add a 'btrfs_' prefix to their
name to make it clear they are from btrfs and a '_bits' suffix to avoid
collision with btrfs_lock_extent() and btrfs_try_lock_extent().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add btrfs prefix to dio lock and unlock extent functions

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add btrfs prefix to main lock, try lock and unlock extent functions

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add btrfs prefix to trace events for extent state alloc and free

These trace events don't have the 'btrfs_' prefix in their name, unlike
the other trace events from extent-io-tree.c. So add the prefix to make
them consistent and follow coding style conventions too.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove extent_io_tree_to_inode() and is_inode_io_tree()

These functions aren't used outside extent-io-tree.c, but yet one of them
(extent_io_tree_to_inode()) is unnecessarily exported in the header.

Furthermore their single use is in a pattern like this:

    if (is_inode_io_tree(tree))
        foo(extent_io_tree_to_inode(tree), ...);

So we're effectively unnecessarily adding more indirection, checking
twice if tree->owner == IO_TREE_INODE_IO before getting the inode and
doing a non-inline function call to get tree->inode.

Simplify this by removing these helper functions and instead doing
things like this:

   if (tree->owner == IO_TREE_INODE_IO)
       foo(tree->inode, ...);

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: more unlikely annotations

Add more unlikely annotations to branches that lead to EUCLEAN, overall
in the tree checker this helps to reorder instructions for the no-error
case.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
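
A sketch of what such an annotation looks like on a validation path, written for userspace with GCC or Clang. The model_unlikely() macro wraps the __builtin_expect() builtin; the item layout and the check itself are hypothetical and much simpler than the real tree checker.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, defined only if the libc lacks it */
#endif

/* Tell the compiler the condition is expected to be false. */
#define model_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical item descriptor: a byte range inside a leaf. */
struct model_item {
	uint32_t offset;
	uint32_t size;
};

/* Returns 0 for a sane item, -EUCLEAN if it is empty or exceeds the leaf. */
static int model_check_item(const struct model_item *item, uint32_t leaf_size)
{
	if (model_unlikely(item->size == 0))
		return -EUCLEAN;
	if (model_unlikely(item->offset > leaf_size ||
			   item->size > leaf_size - item->offset))
		return -EUCLEAN;
	return 0;
}
```

The hint does not change the result, it only lets the compiler lay out the no-error path as the straight-line case.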

btrfs: use folio_contains() for EOF detection

Currently we use the following pattern to detect if the folio contains
the end of a file:

    if (folio->index == end_index)
        folio_zero_range();

But that only works if the folio is page sized.

For the following case, it will not work and leave the range beyond EOF
uninitialized:

  The page size is 4K, and the fs block size is also 4K.

    16K        20K        24K
    |          |     |    |
                     |
                     EOF at 22K

And we have a large folio sized 8K at file offset 16K.

In that case, the old "folio->index == end_index" will not work, thus
the range [22K, 24K) will not be zeroed out.

Fix the following call sites which use the above pattern:

- add_ra_bio_pages()

- extent_writepage()

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
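
The scenario from the commit above (4K pages, an 8K folio at file offset 16K, EOF at 22K) can be replayed with a userspace model. The structure and both helpers are hypothetical; the containment test follows the usual "index minus first index is less than the number of pages" form.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4K pages. */
#define MODEL_PAGE_SHIFT	12

/* Reduced folio model: index of its first page and number of pages. */
struct model_folio {
	uint64_t index;
	uint64_t nr_pages;
};

/* Old pattern: only matches when the EOF page is the folio's first page. */
static bool model_old_eof_check(const struct model_folio *folio,
				uint64_t end_index)
{
	return folio->index == end_index;
}

/*
 * Containment test. The unsigned subtraction wraps around for indexes
 * below the folio start, so those compare as "not contained" as well.
 */
static bool model_folio_contains(const struct model_folio *folio,
				 uint64_t index)
{
	return index - folio->index < folio->nr_pages;
}
```

With the folio starting at page index 4 and EOF at page index 5 (22K >> 12), the old check misses the folio while the containment test finds it, so the tail [22K, 24K) gets zeroed.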

btrfs: remove unnecessary early exits in delalloc folio lock and unlock

Inside functions unlock_delalloc_folio() and lock_delalloc_folios(), we
have the following early exits:

    if (index == locked_folio->index && end_index == index)
        return;

This allows us to exit early if the range is inside the same locked
folio.

However the current check relies on page sized folios. If we have a large
folio that contains @index but does not start at @index, then the early
exit will no longer trigger.

Furthermore without the above early check, the existing code can handle it
well, as both __process_folios_contig() and lock_delalloc_folios() will
skip any folio page lock/unlock if it's on the locked folio.

Here we remove the early exits and let the existing code handle the
same index case, to make the code a little simpler.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: use btrfs_root_id() to get the id of a root

Instead of open coding btrfs_root_id() to get the ID of a root, use the
helper in the trace points, which also makes the code less verbose.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zlib: prepare copy_data_into_buffer() for large data folios

The function itself is already taking large folios into consideration,
just remove the ASSERT(!folio_test_large()) line.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: subpage: prepare for large data folios

The subpage handling code has two locations not supporting large folios:

- btrfs_attach_subpage()
  Which is doing a metadata specific ASSERT() check.

  But for the future large data folios support, that check is too
  generic.  Since it's metadata specific, only check the ASSERT() for
  metadata.

- btrfs_subpage_assert()
  Just remove the "ASSERT(folio_order(folio) == 0)" check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare end_bbio_data_write() for large data folios

The function is doing an ASSERT() checking the folio order, but all
later functions are handling large folios properly, thus we can safely
remove that ASSERT().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare prepare_one_folio() for large data folios

The only blockage is the ASSERT() rejecting large folios, just remove
it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare btrfs_page_mkwrite() for large data folios

The function btrfs_page_mkwrite() has an explicit ASSERT() checking the
folio order.

To make it support large data folios, we need to:

- Remove the ASSERT(folio_order(folio) == 0)

- Use folio_contains() to check if the folio covers the last page

Otherwise the code is already supporting large folios well.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: prepare put_file_data() for large data folios

Currently put_file_data() can only accept a page sized folio. However
the function itself is not that complex, it's just copying data from
filemap folio into the send buffer.

Make it support large data folios:

- Change the loop to use file offset instead of page index

- Calculate @pg_offset and @cur_len after getting the folio

- Remove the "WARN_ON(folio_order(folio));" line
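
The offset based loop can be sketched in userspace as below. This is
only a model under stated assumptions: folios are fixed-size and
naturally aligned, and toy_chunk, toy_chunk_in_folio() and
toy_count_copies() are hypothetical names, not the code in send.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: what can be copied from one folio starting at @cur. */
struct toy_chunk {
	size_t pg_offset;	/* offset of @cur inside the folio */
	size_t cur_len;		/* bytes to copy from this folio */
};

static struct toy_chunk toy_chunk_in_folio(uint64_t cur, uint64_t end,
					   uint64_t folio_pos, size_t folio_size)
{
	struct toy_chunk c;
	uint64_t folio_end = folio_pos + folio_size;

	assert(cur >= folio_pos && cur < folio_end && cur < end);
	/* Both values are only known once the folio (and its size) is known. */
	c.pg_offset = (size_t)(cur - folio_pos);
	c.cur_len = (size_t)((end < folio_end ? end : folio_end) - cur);
	return c;
}

/* Walk [start, end) by file offset and count the copy operations. */
static unsigned int toy_count_copies(uint64_t start, uint64_t end,
				     size_t folio_size)
{
	unsigned int copies = 0;
	uint64_t cur = start;

	while (cur < end) {
		uint64_t folio_pos = cur - (cur % folio_size);
		struct toy_chunk c = toy_chunk_in_folio(cur, end, folio_pos,
							folio_size);

		cur += c.cur_len;	/* advance by bytes, not by page index */
		copies++;
	}
	return copies;
}
```

The same 8K range that needs three copies with 4K folios needs a single
copy when one 16K folio covers it.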

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: remove the again label inside put_file_data()

The again label is here to retry to get the folio for the current index.
When triggering that label, there is no advance of the iterator.

So the jump can be replaced by a simple "continue" and the again label
removed.
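
A minimal sketch of the pattern, assuming a loop whose iterator only
advances after a successful lookup. toy_get_folio() and toy_process()
are hypothetical and only model the control flow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lookup that fails a fixed number of times before succeeding. */
static bool toy_get_folio(int *failures_left)
{
	if (*failures_left > 0) {
		(*failures_left)--;
		return false;
	}
	return true;
}

/* Returns the number of lookup attempts needed to process @nr items. */
static int toy_process(int nr, int failures)
{
	int attempts = 0;
	int cur = 0;

	assert(nr >= 0 && failures >= 0);
	while (cur < nr) {
		attempts++;
		if (!toy_get_folio(&failures))
			continue;	/* retry the same @cur, nothing advanced yet */
		cur++;			/* advance only after success */
	}
	return attempts;
}
```

Because the retry happens before the iterator moves, "continue" lands on
the same item again, which is what the label used to do.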

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_insert_inode_extref()

This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
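
For readers unfamiliar with the pattern, the following userspace sketch
shows the general idea using the GCC/Clang cleanup variable attribute.
All toy_* names and the TOY_PATH_AUTO_FREE macro are hypothetical
analogues, not the btrfs definitions.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_path {
	int slot;
};

static int toy_live_paths;	/* allocations not yet freed */

static struct toy_path *toy_alloc_path(void)
{
	struct toy_path *p = calloc(1, sizeof(*p));

	if (p)
		toy_live_paths++;
	return p;
}

/* Cleanup handler: receives a pointer to the annotated variable. */
static void toy_free_pathp(struct toy_path **pp)
{
	if (*pp) {
		free(*pp);
		toy_live_paths--;
	}
}

/* Hypothetical analogue of an auto-free declaration macro. */
#define TOY_PATH_AUTO_FREE(name) \
	struct toy_path *name __attribute__((cleanup(toy_free_pathp))) = NULL

static int toy_lookup(int key)
{
	TOY_PATH_AUTO_FREE(path);

	path = toy_alloc_path();
	if (!path)
		return -1;
	if (key < 0)
		return -2;	/* plain return, no "goto out" needed */
	path->slot = key;
	return path->slot;	/* value is read before the cleanup runs */
}
```

Every return path frees the path, so error exits can return directly
instead of jumping to a common label.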

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_del_inode_extref()

This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_encoded_read_inline()

This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use BTRFS_PATH_AUTO_FREE in can_nocow_extent()

This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_set_inode_index_count()

This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use BTRFS_PATH_AUTO_FREE in may_destroy_subvol()

This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: do more trivial BTRFS_PATH_AUTO_FREE conversions

The most trivial pattern for auto freeing is when the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost no goto -> return conversions and there's no other
function cleanup.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant record start offset check at test_range_bit()

It's pointless to check if the current record's start offset is greater
than the end offset, as before we just tested if it was greater than the
start offset - and if it's not it means it's less than or equal to the
start offset, so it can not be greater than the end offset, as our start
offset is always smaller than the end offset.

So remove that check and also add an assertion to verify the start offset
is smaller than the end offset.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify last record detection at test_range_bit()

The overflow detection for the start offset of the next record is not
really necessary, we can just stop iterating if the current record ends at
or after our end offset. This removes the need to test if the current
record end offset is (u64)-1 and to check if adding 1 to the current
end offset results in 0.

By testing only if the current record ends at or after the end offset, we
also don't need anymore to test the new start offset at the head of the
while loop.

This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).

Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47ff0c9 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").
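
The simplified termination can be sketched over a sorted array of
records. This is a simplified model, not the io tree code: toy_state and
toy_test_range_bit() are hypothetical, records are assumed sorted and
non-overlapping, and end offsets are inclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_state {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned int bits;
};

/* True if every byte of [start, end] is covered by records having @bit. */
static bool toy_test_range_bit(const struct toy_state *recs, size_t nr,
			       uint64_t start, uint64_t end, unsigned int bit)
{
	assert(start < end);
	for (size_t i = 0; i < nr; i++) {
		const struct toy_state *s = &recs[i];

		if (s->end < start)
			continue;	/* entirely before the range */
		if (s->start > start)
			return false;	/* hole in the range */
		if (!(s->bits & bit))
			return false;
		if (s->end >= end)
			return true;	/* last record, stop here */
		/* Cannot overflow: s->end < end <= UINT64_MAX. */
		start = s->end + 1;
	}
	return false;			/* ran out of records inside the range */
}
```

Since the loop exits before computing the next start whenever the record
reaches the end offset, a record ending at (u64)-1 needs no special
case.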

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant check at find_first_extent_bit_state()

The tree_search() function always returns an entry that either contains
the search offset or the first entry in the tree that starts after the
offset. So checking at find_first_extent_bit_state() if the returned
entry ends at or after the search offset is pointless. Remove the check.
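
The invariant can be seen in a small model of the search over a sorted
array. toy_rec and toy_tree_search() are hypothetical and only mirror
the two documented outcomes; records are assumed sorted and
non-overlapping with inclusive ends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_rec {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

static const struct toy_rec *toy_tree_search(const struct toy_rec *recs,
					     size_t nr, uint64_t offset)
{
	for (size_t i = 0; i < nr; i++) {
		assert(recs[i].start <= recs[i].end);
		if (recs[i].start > offset)
			return &recs[i];	/* first entry starting after @offset */
		if (recs[i].end >= offset)
			return &recs[i];	/* entry containing @offset */
	}
	return NULL;
}
```

In the first case end >= start > offset, in the second end >= offset by
definition, so a caller re-checking the end offset learns nothing new.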

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix documentation for tree_search_for_insert()

There are several things wrong with the documentation:

1) At the top it's only mentioned that we search for an entry containing
   the given offset, but when such entry does not exist we search for
   the first entry that starts and ends after that offset;

2) It mentions that @node_ret and @parent_ret aren't changed if the
   returned entry contains the given offset - that is true only if the
   returned entry starts exactly at @offset, otherwise those arguments
   are changed;

3) It mentions that if no entry containing offset is found then we return
   the first entry ending before the offset - that is not true, we return
   the first entry that starts and ends after that offset;

4) It also mentions that NULL is never returned. This is false as in case
   there's no entry containing offset or any entry that starts and ends
   after offset, NULL is returned.

So fix the documentation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify last record detection at test_range_bit_exists()

Instead of keeping track of the minimum start offset of the next record
and detecting overflow every time we update that offset to be the sum of
current record's end offset plus one, we can simply exit when the current
record ends at or beyond our end offset and forget about updating the
start offset on every iteration and testing for it at the top of the loop.
This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).

Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47ff0c9 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use clear_extent_bits() instead of clear_extent_bit() where possible

Several places are using clear_extent_bit() and passing a NULL value for
the 'cached' argument, which is pointless as they can use
clear_extent_bits() instead.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use clear_extent_bits() at chunk_map_device_clear_bits()

Instead of using __clear_extent_bit() we can use clear_extent_bits() since
we pass a NULL value for the cached and changeset arguments.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use clear_extent_bit() at try_release_extent_state()

Instead of using __clear_extent_bit() we can use clear_extent_bit() since
we pass a NULL value for the changeset argument.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()

[BUG WITH EXPERIMENTAL LARGE FOLIOS]
When testing the experimental large data folio support with compression,
there are several ASSERT()s triggered from btrfs_decompress_buf2page()
when running fsstress with compress=zstd mount option:

- ASSERT(copy_len) from btrfs_decompress_buf2page()
- VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page()

[CAUSE]
Inside btrfs_decompress_buf2page(), we need to grab the file offset from
the current bvec.bv_page, to check if we even need to copy data into the
bio.

And since we're using single page bvec, and no large folio, every page
inside the folio should have its index properly setup.

But when large folios are involved, only the first page (aka, the head
page) of a large folio has its index properly initialized.

The other pages inside the large folio will not have their indexes
properly initialized.

Thus the page_offset() call inside btrfs_decompress_buf2page() will
result in garbage, and completely screw up the @copy_len calculation.

[FIX]
Instead of using page->index directly, go with page_pgoff(), which can
handle non-head pages correctly.

So introduce a helper, file_offset_from_bvec(), to get the file offset
from a single page bio_vec, so the copy_len calculation can be done
correctly.
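
The idea behind the fix can be modelled as follows. This is only a
sketch: toy_folio and toy_file_offset() are hypothetical, a 4K page size
is assumed, and the page is identified by its position inside the folio
rather than by a struct page.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4K pages */

/* Hypothetical model: only the head page's index is trustworthy. */
struct toy_folio {
	uint64_t head_index;	/* page index of the first (head) page */
	unsigned int nr_pages;
};

/* File offset of byte @bv_offset in page number @page_nr of @folio. */
static uint64_t toy_file_offset(const struct toy_folio *folio,
				unsigned int page_nr, unsigned int bv_offset)
{
	assert(page_nr < folio->nr_pages);
	/* Derive the page's index from the head, not from a per-page field. */
	return ((folio->head_index + page_nr) << TOY_PAGE_SHIFT) + bv_offset;
}
```

For a 4-page folio whose head is at page index 16, byte 100 of its third
page is at file offset (18 << 12) + 100.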

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use rb_entry_safe() where possible to simplify code

Simplify conditionally reading an rb_entry(), there's the
rb_entry_safe() helper that checks the node pointer for NULL so we don't
have to write it explicitly.
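
A userspace sketch of the difference between the two helpers follows.
The toy_* macros are hypothetical look-alikes written with plain
offsetof(); unlike a production macro, toy_entry_safe() evaluates its
pointer argument twice.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *next;
};

struct toy_item {
	int value;
	struct toy_node node;	/* embedded link, like an rb_node */
};

/* container_of-style conversion from the embedded node to its owner. */
#define toy_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-tolerant variant: returns NULL instead of a bogus pointer. */
#define toy_entry_safe(ptr, type, member) \
	((ptr) ? toy_entry(ptr, type, member) : NULL)

/* Plain variant: the caller must guarantee a valid node. */
static int toy_value(struct toy_node *n)
{
	assert(n != NULL);
	return toy_entry(n, struct toy_item, node)->value;
}

/* Safe variant: the NULL check is folded into the conversion. */
static int toy_value_or(struct toy_node *n, int fallback)
{
	struct toy_item *item = toy_entry_safe(n, struct toy_item, node);

	return item ? item->value : fallback;
}
```

Callers that previously tested the node pointer before converting it can
test the converted entry instead.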

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass a pointer to get_range_bits() to cache first search result

Allow get_range_bits() to take an extent state pointer to pointer argument
so that we can cache the first extent state record in the target range, so
that a caller can use it for subsequent operations without doing a full
tree search. Currently the only user is try_release_extent_state(), which
then does a call to __clear_extent_bit() which can use such a cached state
record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: allow folios to be released while ordered extent is finishing

When the release_folio callback (from struct address_space_operations) is
invoked we don't allow the folio to be released if its range is currently
locked in the inode's io_tree, as it may indicate the folio may be needed
by the task that locked the range.

However if the range is locked because an ordered extent is finishing,
then we can safely allow the folio to be released because ordered extent
completion doesn't need to use the folio at all.

When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing the pages from the page cache
after writeback completes, however this often is not possible on btrfs
because:

  * Once the writeback completes we queue the ordered extent completion;

  * Once the ordered extent completion starts, we lock the range in the
    inode's io_tree (at btrfs_finish_one_ordered());

  * If the release_folio callback is called while the folio's range is
    locked in the inode's io_tree, we don't allow the folio to be
    released, so the kernel has to try to release memory elsewhere,
    which may result in triggering more writeback or releasing other
    pages from the page cache which may be more useful to have around
    for applications.

In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the range,
we allow the folio to be released, as well as when the release_folio
callback is invoked after ordered extent completion unlocks the range.

Improve on this by detecting if the range is locked for ordered extent
completion and if it is, allow the folio to be released. This detection
is achieved by adding a new extent flag in the io_tree that is set when
the range is locked during ordered extent completion.
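
The resulting decision can be sketched as below. The flag names are
hypothetical since this message does not name the new flag, and the
sketch ignores every other bit the real release path may check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names, for illustration only. */
#define TOY_EXTENT_LOCKED		(1U << 0)
#define TOY_EXTENT_FINISHING_ORDERED	(1U << 1)

/* May the folio be released, given the bits set on its range? */
static bool toy_can_release_folio(unsigned int range_bits)
{
	if (!(range_bits & TOY_EXTENT_LOCKED))
		return true;
	/* Locked only for ordered extent completion: the folio is not needed. */
	return (range_bits & TOY_EXTENT_FINISHING_ORDERED) != 0;
}
```

A range locked by any other task still blocks the release, as before.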

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: update comment for try_release_extent_state()

Drop reference to pages from the comment since the function is fully folio
aware and works regardless of how many pages are in the folio. Also while
at it, capitalize the first word and make it more explicit that
release_folio is a callback from struct address_space_operations.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare btrfs_punch_hole_lock_range() for large data folios

The function btrfs_punch_hole_lock_range() needs to make sure there is
no other folio in the range, thus it goes with filemap_range_has_page(),
which works pretty fine.

But if we have large folios, under the following case
filemap_range_has_page() will always return true, forcing
btrfs_punch_hole_lock_range() to do a very time consuming busy loop:

        start                            end
        |                                |
  |//|//|//|//|  |  |  |  |  |  |  |  |//|//|
   \         /                         \   /
    Folio A                            Folio B

In the above case, folio A and B contain our start/end indexes, and there
are no other folios in the range.  Thus we do not need to retry inside
btrfs_punch_hole_lock_range().

To prepare for large data folios, introduce a helper,
check_range_has_page(), which will:

- Shrink the search range towards page boundaries
  If the rounded down end (exclusive, otherwise it can underflow when @end
  is inside the folio at file offset 0) is no larger than the rounded up
  start, it means the range contains no other pages other than the ones
  covering @start and @end.

  Can return false directly in that case.

- Grab all the folios inside the range

- Skip any large folios that cover the start and end indexes

- If any other folios are found return true

- Otherwise return false

This new helper is going to handle both large folios and regular ones.
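
The range shrinking step alone can be sketched in userspace like this.
It assumes a 4K page size and an inclusive @end; the toy_* names are
hypothetical and the folio lookup that follows the shrink is not
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumption: 4K pages */

static uint64_t toy_round_up(uint64_t x)
{
	return (x + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
}

static uint64_t toy_round_down(uint64_t x)
{
	return x & ~(TOY_PAGE_SIZE - 1);
}

/*
 * True if [start, end] still contains whole pages after shrinking to page
 * boundaries, i.e. a folio search of the inner range is needed.
 */
static bool toy_range_needs_search(uint64_t start, uint64_t end)
{
	uint64_t search_start;
	uint64_t search_end_excl;

	assert(start <= end && end < UINT64_MAX);
	search_start = toy_round_up(start);
	/* Keep the end exclusive so it cannot underflow for the page at 0. */
	search_end_excl = toy_round_down(end + 1);
	return search_end_excl > search_start;
}
```

A range living inside one page, or straddling just two partial pages,
shrinks to nothing and needs no search at all.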

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare btrfs_buffered_write() for large data folios

This involves the following modifications:

- Set the order flags for __filemap_get_folio() inside
  prepare_one_folio()

  This will allow __filemap_get_folio() to create a large folio if the
  address space supports it.

- Limit the initial @write_bytes inside copy_one_range()
  If the largest folio boundary splits the initial write range, there is
  no way we can write beyond the largest folio boundary.

  This is done by a simple helper calc_write_bytes().

- Release excess reserved space if the folio is smaller than expected
  This is the same handling as when a short copy happens.

All the preparations should not change the behavior when the largest
folio order is 0.
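
The write length limit can be sketched as a small pure function. This
is an illustration only: toy_calc_write_bytes() is a hypothetical
stand-in that takes the maximum folio size directly and assumes it is a
power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes writable at @start without crossing the largest folio boundary. */
static size_t toy_calc_write_bytes(uint64_t start, size_t count,
				   size_t max_folio_size)
{
	size_t to_boundary;

	/* Assumption: the maximum folio size is a non-zero power of two. */
	assert(max_folio_size && !(max_folio_size & (max_folio_size - 1)));
	to_boundary = max_folio_size - (size_t)(start & (max_folio_size - 1));
	return count < to_boundary ? count : to_boundary;
}
```

When the maximum folio size equals the page size, the result is the
familiar "up to the end of the current page" limit.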

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: refactor how we handle reserved space inside copy_one_range()

There are several things not ideal in copy_one_range():

- Unnecessary temporary variables
  * block_offset
  * reserve_bytes
  * dirty_blocks
  * num_blocks
  * release_bytes
  These are utilized to handle short-copy cases.

- Inconsistent handling of btrfs_delalloc_release_extents()
  There is a hidden behavior that, after reserving metadata for X bytes
  of data write, we have to call btrfs_delalloc_release_extents() with X
  once and only once.

  Calling btrfs_delalloc_release_extents(X - 4K) and
  btrfs_delalloc_release_extents(4K) will cause outstanding extents
  accounting to go wrong.

  This is because the outstanding extents mechanism is not designed to
  handle shrinking of reserved space.

Improve above situations by:

- Use a single @reserved_start and @reserved_len pair
  Now we reserve space for the initial range, and if a short copy
  happened and we need to shrink the reserved space, we can easily
  calculate the new length, and update @reserved_len.

- Introduce helpers to shrink reserved data and metadata space
  This is done by two new helpers, shrink_reserved_space() and
  btrfs_delalloc_shrink_extents().

  The latter will do a better calculation if we need to modify the
  outstanding extents, and the first one will be utilized inside
  copy_one_range().

- Manually unlock, release reserved space and return if no byte is
  copied

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove EXTENT_UPTODATE io tree flag

The EXTENT_UPTODATE io tree flag is now used only to mark ranges in the
fs_info->excluded_extents as used by super blocks and not available for
extent allocation (to prevent adding those ranges as free space in the
in memory space caches). As we can use any flag for that purpose, and
we are using EXTENT_DIRTY for the pinned extents io tree for example,
remove the EXTENT_UPTODATE flag and use instead EXTENT_DIRTY for the
excluded extents io tree.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: stop searching for EXTENT_DIRTY bit in the excluded extents io tree

At btrfs_add_new_free_space() we keep searching for ranges in the excluded
extents io tree that have the EXTENT_DIRTY bit set, however we never ever
set that bit for ranges in that tree. That is a leftover from when that
function used the global freed extents trees (fs_info->freed_extents[2]),
where we used both the EXTENT_DIRTY and EXTENT_UPTODATE bits, but those
trees are gone with commit fe119a6eeb67 ("btrfs: switch to per-transaction
pinned extents"), which introduced the fs_info->excluded_extents io tree,
where only EXTENT_UPTODATE is set.

So remove the EXTENT_DIRTY bit search at btrfs_add_new_free_space().

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove leftover EXTENT_UPTODATE clear from an inode's io_tree

After commit 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE
state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's
io_tree anymore, but we still have some code attempting to clear that
bit from an inode's io_tree. Remove that code as it doesn't do anything
anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded
extents io_tree (fs_info->excluded_extents), which is used to track the
locations of super blocks, so that their ranges are never marked as free,
making them unavailable for extent allocation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix fsync of files with no hard links not persisting deletion

If a process has a file descriptor open on a file (or directory), the
file's last hard link is removed, and the process then does an fsync
against the file descriptor, the file still exists after replaying the
log following a power failure or crash.

This behaviour is incorrect since once an inode has no more hard links
it's not accessible anymore and we insert an orphan item into its
subvolume's tree so that the deletion of all its items is not missed in
case of a power failure or crash.

So after log replay the file shouldn't exist anymore, which is also the
behaviour on ext4, xfs, f2fs and other filesystems.

Fix this by not ignoring inodes with zero hard links at
btrfs_log_inode_parent() and by committing an inode's delayed inode when
we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or
BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This
last step is necessary because when removing the last hard link we don't
delete the corresponding ref (or extref) item, instead we record the
change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF
flag, so that when the delayed inode is committed we delete the ref/extref
item from the inode's subvolume tree - otherwise the logging code will log
the last hard link and therefore upon log replay the inode is not deleted.

The base code for a fstests test case that reproduces this bug is the
following:

   . ./common/dmflakey

   _require_scratch
   _require_dm_target flakey
   _require_mknod

   _scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed"
   _require_metadata_journaling $SCRATCH_DEV
   _init_flakey
   _mount_flakey

   touch $SCRATCH_MNT/foo

   # Commit the current transaction and persist the file.
   _scratch_sync

   # A fifo to communicate with a background xfs_io process that will
   # fsync the file after we deleted its hard link while it's open by
   # xfs_io.
   mkfifo $SCRATCH_MNT/fifo

   tail -f $SCRATCH_MNT/fifo | \
        $XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full &
   XFS_IO_PID=$!

   # Give some time for the xfs_io process to open a file descriptor for
   # the file.
   sleep 1

   # Now while the file is open by the xfs_io process, delete its only
   # hard link.
   rm -f $SCRATCH_MNT/foo

   # Now that it has no more hard links, make the xfs_io process fsync it.
   echo "fsync" > $SCRATCH_MNT/fifo

   # Terminate the xfs_io process so that we can unmount.
   echo "quit" > $SCRATCH_MNT/fifo
   wait $XFS_IO_PID
   unset XFS_IO_PID

   # Simulate a power failure and then mount again the filesystem to
   # replay the journal/log.
   _flakey_drop_and_remount

   # We don't expect the file to exist anymore, since it was fsynced when
   # it had no more hard links.
   [ -f $SCRATCH_MNT/foo ] && echo "file foo still exists"

   _unmount_flakey

   # success, all done
   echo "Silence is golden"
   status=0
   exit

A test case for fstests will be submitted soon.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix typo in space info explanation

There's an explanation of how space info works at the top of
fs/btrfs/space-info.c, which makes reference to a variable called
bytes_may_reserve. There's nothing called that in the code, and wasn't
at the time the comment was written; as far as I can tell this is a typo,
and it should actually be bytes_may_use.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused flag EXTENT_BUFFER_IN_TREE

This flag is set after inserting the eb to the buffer tree and cleared
on its removal. It was added in commit 34b41acec1ccc0 ("Btrfs: use a
bit to track if we're in the radix tree") and commit faa2dbf004e89e
("Btrfs: add sanity tests for new qgroup accounting code") wanted to
make use of it. Both are 10+ years old, we can remove the flag.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused flag EXTENT_BUFFER_CORRUPT

This flag is no longer being used. It was added by commit a826d6dcb32d
("Btrfs: check items for correctness as we search") but it's no longer
being used after commit f26c92386028 ("btrfs: remove reada
infrastructure").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused flag EXTENT_BUFFER_READAHEAD

This flag is no longer being used. It was added by commit ab0fff03055d
("btrfs: add READAHEAD extent buffer flag") and used in commits:

79fb65a1f6d9 ("Btrfs: don't call readahead hook until we have read the entire eb")
78e62c02abb9 ("btrfs: Remove extent_io_ops::readpage_io_failed_hook")
371cdc0700c7 ("btrfs: introduce subpage metadata validation check")

Finally all the code using it was removed by commit f26c92386028 ("btrfs: remove
reada infrastructure").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unused flag EXTENT_BUFFER_READ_ERR

This flag was added by commit 656f30dba7ab ("Btrfs: be aware of btree
inode write errors to avoid fs corruption") but it stopped being used
after commit 046b562b20a5 ("btrfs: use a separate end_io handler for
read_extent_buffer").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: factor out the main loop of btrfs_buffered_write() into a helper

Inside the main loop of btrfs_buffered_write() we are doing a lot of
heavy lifting inside a while() loop.

This makes it pretty hard to read, so factor out the content into a
helper, copy_one_range(), to do the work.

There is no functional change, only some minor variable renames,
e.g. rename all "sector" into "block".

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: factor out space reservation code from btrfs_buffered_write()

Inside the main loop of btrfs_buffered_write(), we have complex data
and metadata space reservation code, which tries to reserve space for
a COW write and, if that fails, falls back to checking if we can do a
NOCOW write.

Factor out that part of code into a dedicated helper, reserve_space(),
to make the main loop a little easier to read.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup the reserved space inside loop of btrfs_buffered_write()

Inside the main loop of btrfs_buffered_write(), if something goes wrong,
there is an out-of-loop cleanup path to release the reserved space.

This behavior saves some code lines, but makes it much harder to read,
as we need to check release_bytes to make sure when we need to do the
cleanup.

Factor out the cleanup part into a helper, release_reserved_space(), to
do the cleanup inside the main loop, so that we can move @release_bytes
inside the loop.

This will make later refactoring of the main loop much easier.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove force_page_uptodate variable from btrfs_buffered_write()

Commit c87c299776e4 ("btrfs: make buffered write to copy one page a
time") changed how the variable @force_page_uptodate was updated.

Before that commit the variable was only initialized to false at the
beginning of the function, and after hitting a short copy, the next
retry on the same folio would force the folio to be read from the disk.

But after the commit, the variable is always initialized to false at the
beginning of the loop's scope, causing prepare_one_folio() never to get a
true value passed in.
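
The scoping effect can be reproduced with a small model. It is only a
sketch of the control flow, with hypothetical names: toy_forced_calls()
counts how often the prepare step would see a true flag when the flag
lives inside or outside the loop body.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the "prepare" step sees force == true. */
static int toy_forced_calls(int short_copies, bool declare_inside_loop)
{
	bool force_outer = false;	/* initialized once, before the loop */
	int forced = 0;
	int remaining = short_copies;

	assert(short_copies >= 0);
	for (;;) {
		bool force_inner = false;	/* reset on every iteration */
		bool force = declare_inside_loop ? force_inner : force_outer;

		if (force)
			forced++;	/* prepare step would read from disk */
		if (remaining == 0)
			break;		/* full copy succeeded */
		remaining--;		/* short copy: retry the same folio */
		force_inner = true;	/* lost when the scope restarts */
		force_outer = true;	/* survives into the retry */
	}
	return forced;
}
```

With the flag declared inside the loop, no retry ever observes it as
true, which matches the behavior described above.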

The change in behavior is not a huge deal, it only makes a difference
on how we handle short copies:

Old: Allow the buffer to be split

     The first short copy will be rejected, that's the same for both
     cases.

     But for the next retry, we require the folio to be read from disk.

     Then even if we hit a short copy again, since the folio is already
     uptodate, we do not need to handle partial uptodate range, and can
     continue after marking the short copied range as dirty.

     This will split the buffer write into the folio as two buffered
     writes.

New: Do not allow the buffer to be split

     The first short copy will be rejected, that's the same for both
     cases.

     For the next retry, we do nothing special, thus if the short copy
     happened again, we reject it again, until either the short copy is
     gone, or we failed to fault in the buffer.

     This will mean the buffer write into the folio will either fail or
     succeed, no splitting will happen.

To me, either solution is fine, but the new one makes it simpler and
requires no special handling, so I prefer that solution.

And since @force_page_uptodate is always false when passed into
prepare_one_folio(), we can just remove the variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>