Btrfs: fix stale dir entries after unlink, inode eviction and fsync
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.
This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach but it always ensures
correctness with the trade off of ocassional full transaction commits
when an fsync is done against the inode in the same transaction where
it was evicted and reloaded when our inode is a directory and often
logging its parent unnecessarily when our inode is not a directory.
The following test case for fstests triggers the problem:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with 2 hard links.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Now remove one of the links, trigger inode eviction and then fsync
# our inode.
unlink $SCRATCH_MNT/testdir/bar
echo 2 > /proc/sys/vm/drop_caches
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
# Silently drop all writes on our scratch device to simulate a power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify our directory entries.
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
# If we remove our inode, its parent should become empty and therefore we should
# be able to remove the parent.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test failed on btrfs with:
generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
--- tests/generic/098.out 2015-07-23 18:01:12.
616175932 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.
924138308 +0100
@@ -1,3 +1,6 @@
QA output created by 098
Entries in testdir:
+bar
foo
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
(Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
$ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
Checking filesystem on /dev/sdc
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>