From: Filipe Manana Date: Fri, 15 Jan 2016 11:05:12 +0000 (+0000) Subject: Btrfs: fix deadlock running delayed iputs at transaction commit time X-Git-Tag: v4.5-rc1~21^2~12 X-Git-Url: https://git.kernel.dk/?a=commitdiff_plain;h=c2d6cb1636d235257086f939a8194ef0bf93af6e;p=linux-2.6-block.git Btrfs: fix deadlock running delayed iputs at transaction commit time While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [] dump_stack+0x4e/0x79 [ 886.403568] [] __lock_acquire+0xd42/0xf0b [ 886.403568] [] ? __module_address+0xdf/0x108 [ 886.403568] [] lock_acquire+0x10d/0x194 [ 886.403568] [] ? lock_acquire+0x10d/0x194 [ 886.403568] [] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [] down_read+0x3e/0x4d [ 886.489542] [] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [] flush_space+0x435/0x44a [btrfs] [ 886.489542] [] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [] evict+0xa7/0x15c [ 886.489542] [] iput+0x1d3/0x266 [ 886.489542] [] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [] ? signal_pending_state+0x31/0x31 [ 886.489542] [] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [] generic_file_direct_write+0xb3/0x128 [ 886.489542] [] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [] ? __lock_is_held+0x38/0x50 [ 886.489542] [] __vfs_write+0x7c/0xa5 [ 886.489542] [] vfs_write+0xa0/0xe4 [ 886.489542] [] SyS_write+0x50/0x7e [ 886.489542] [] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [] schedule+0x7f/0x97 [ 1081.883340] [] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [] down_write+0x43/0x57 [ 1081.903846] [] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [] ? mark_held_locks+0x56/0x6c [ 1081.910409] [] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [] ? __lock_is_held+0x38/0x50 [ 1081.923922] [] __vfs_write+0x7c/0xa5 [ 1081.925275] [] vfs_write+0xa0/0xe4 [ 1081.926584] [] SyS_write+0x50/0x7e [ 1081.927968] [] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [] schedule+0x7f/0x97 [ 1081.998017] [] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [] ? finish_wait+0x6d/0x76 [ 1082.000306] [] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [] down_write+0x43/0x57 [ 1082.005000] [] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [] vfs_fallocate+0x170/0x1ff [ 1082.013738] [] ioctl_preallocate+0x89/0x9b [ 1082.014778] [] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [] ? __fget_light+0x4d/0x71 [ 1082.017789] [] SyS_ioctl+0x57/0x79 [ 1082.018706] [] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana Signed-off-by: Chris Mason --- diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c5f40dc1f74f..e9c2f8895eab 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1614,7 +1614,7 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct rw_semaphore delayed_iput_sem; + struct mutex cleaner_delayed_iput_mutex; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 64f02c3d0dd0..be03f93ca257 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1801,7 +1801,10 @@ static int cleaner_kthread(void *arg) goto sleep; } + mutex_lock(&root->fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(root); + mutex_unlock(&root->fs_info->cleaner_delayed_iput_mutex); + again = btrfs_clean_one_deleted_snapshot(root); mutex_unlock(&root->fs_info->cleaner_mutex); @@ -2571,8 +2574,8 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); + mutex_init(&fs_info->cleaner_delayed_iput_mutex); seqlock_init(&fs_info->profiles_lock); - init_rwsem(&fs_info->delayed_iput_sem); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); INIT_LIST_HEAD(&fs_info->space_info); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 92843ab772a5..abcffa4b8231 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4153,11 +4153,12 @@ commit_trans: if (ret) return ret; /* - * make sure that all running delayed iput are - * done + * The cleaner kthread might still be doing iput + * operations. Wait for it to finish so that + * more space is released. */ - down_write(&root->fs_info->delayed_iput_sem); - up_write(&root->fs_info->delayed_iput_sem); + mutex_lock(&root->fs_info->cleaner_delayed_iput_mutex); + mutex_unlock(&root->fs_info->cleaner_delayed_iput_mutex); goto again; } else { btrfs_end_transaction(trans, root); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 85afe66955cf..8ad9e2200442 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3134,7 +3134,6 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root) { struct btrfs_fs_info *fs_info = root->fs_info; - down_read(&fs_info->delayed_iput_sem); spin_lock(&fs_info->delayed_iput_lock); while (!list_empty(&fs_info->delayed_iputs)) { struct btrfs_inode *inode; @@ -3153,7 +3152,6 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root) spin_lock(&fs_info->delayed_iput_lock); } spin_unlock(&fs_info->delayed_iput_lock); - up_read(&root->fs_info->delayed_iput_sem); } /*