AgeCommit message (Collapse)Author
2020-02-27io_uring: use poll driven retry for files that support itio_uring-task-pollJens Axboe
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it. This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete. The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well. Signed-off-by: Jens Axboe <>
2020-02-27io_uring: mark requests that we can do poll async in io_op_defsJens Axboe
Add a pollin/pollout field to the request table, and have commands that we can safely poll for properly marked. Signed-off-by: Jens Axboe <>
2020-02-27io_uring: add per-task callback handlerJens Axboe
For poll requests, it's not uncommon to link a read (or write) after the poll to execute immediately after the file is marked as ready. Since the poll completion is called inside the waitqueue wake up handler, we have to punt that linked request to async context. This slows down the processing, and actually means it's faster to not use a link for this use case. We also run into problems if the completion_lock is contended, as we're doing a different lock ordering than the issue side is. Hence we have to do trylock for completion, and if that fails, go async. Poll removal needs to go async as well, for the same reason. eventfd notification needs special case as well, to avoid stack blowing recursion or deadlocks. These are all deficiencies that were inherited from the aio poll implementation, but I think we can do better. When a poll completes, simply queue it up in the task poll list. When the task completes the list, we can run dependent links inline as well. This means we never have to go async, and we can remove a bunch of code associated with that, and optimizations to try and make that run faster. The diffstat speaks for itself. Signed-off-by: Jens Axboe <>
2020-02-27io_uring: store io_kiocb in wait->privateJens Axboe
Store the io_kiocb in the private field instead of the poll entry, this is in preparation for allowing multiple waitqueues. No functional changes in this patch. Signed-off-by: Jens Axboe <>
2020-02-27task_work_run: don't take ->pi_lock unconditionallyOleg Nesterov
As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg() if task->task_works == NULL && !PF_EXITING. And in fact the only reason why task_work_run() needs ->pi_lock is the possible race with task_work_cancel(), we can optimize this code and make the locking more clear. Signed-off-by: Oleg Nesterov <> Signed-off-by: Jens Axboe <>
2020-02-27io_uring: add splice(2) supportPavel Begunkov
Add support for splice(2). - output file is specified as sqe->fd, so it's handled by generic code - hash_reg_file handled by generic code as well - len is 32bit, but should be fine - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags). Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-27io_uring: add interface for getting filesPavel Begunkov
Preparation without functional changes. Adds io_get_file(), that allows to grab files not only into req->file. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-27splice: make do_splice publicPavel Begunkov
Make do_splice(), so other kernel parts can reuse it Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-27io_uring: remove req->in_asyncPavel Begunkov
req->in_async is not really needed, it only prevents propagation of @nxt for fast not-blocked submissions. Remove it. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-27io_uring: don't do full *prep_worker() from io-wqPavel Begunkov
io_prep_async_worker() called io_wq_assign_next() do many useless checks: io_req_work_grab_env() was already called during prep, and @do_hashed is not ever used. Add io_prep_next_work() -- simplified version, that can be called io-wq. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-27io_uring: don't call work.func from sync ctxPavel Begunkov
Many operations define custom work.func before getting into an io-wq. There are several points against: - it calls io_wq_assign_next() from outside io-wq, that may be confusing - sync context would go unnecessary through io_req_cancelled() - prototypes are quite different, so work!=old_work looks strange - makes async/sync responsibilities fuzzy - adds extra overhead Don't call generic path and io-wq handlers from each other, but use helpers instead Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-27io_uring: io_accept() should hold on to submit reference on retryJens Axboe
Don't drop an early reference, hang on to it and let the caller drop it. This makes it behave more like "regular" requests. Signed-off-by: Jens Axboe <>
2020-02-27io_uring: consider any io_read/write -EAGAIN as finalJens Axboe
If the -EAGAIN happens because of a static condition, then a poll or later retry won't fix it. We must call it again from blocking condition. Play it safe and ensure that any -EAGAIN condition from read or write must retry from async context. Signed-off-by: Jens Axboe <>
2020-02-27io_uring: fix 32-bit compatability with sendmsg/recvmsgio_uring-5.6-2020-02-28Jens Axboe
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise the iovec import for these commands will not do the right thing and fail the command with -EINVAL. Found by running the test suite compiled as 32-bit. Cc: Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()") Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()") Signed-off-by: Jens Axboe <>
2020-02-27io_uring: define and set show_fdinfo only if procfs is enabledTobias Klauser
Follow the pattern used with other *_show_fdinfo functions and only define and use io_uring_show_fdinfo and its helper functions if CONFIG_PROC_FS is set. Fixes: 87ce955b24c9 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor") Signed-off-by: Tobias Klauser <> Signed-off-by: Jens Axboe <>
2020-02-26io_uring: drop file set ref put/get on switchJens Axboe
Dan reports that he triggered a warning on ring exit doing some testing: percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Modules linked in: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83 RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292 RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000 RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045 R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0 R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> rcu_core+0x1e4/0x4d0 __do_softirq+0xdb/0x2f1 irq_exit+0xa0/0xb0 smp_apic_timer_interrupt+0x60/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x23/0x170 Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95 Turns out that this is due to percpu_ref_switch_to_atomic() only grabbing a reference to the percpu refcount if it's not already in atomic mode. io_uring drops a ref and re-gets it when switching back to percpu mode. We attempt to protect against this with the FFD_F_ATOMIC bit, but that isn't reliable. We don't actually need to juggle these refcounts between atomic and percpu switch, we can just do them when we've switched to atomic mode. This removes the need for FFD_F_ATOMIC, which wasn't reliable. Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Reported-by: Dan Melnic <> Signed-off-by: Jens Axboe <>
2020-02-26io_uring: import_single_range() returns 0/-ERRORJens Axboe
Unlike the other core import helpers, import_single_range() returns 0 on success, not the length imported. This means that links that depend on the result of non-vec based IORING_OP_{READ,WRITE} that were added for 5.5 get errored when they should not be. Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands") Signed-off-by: Jens Axboe <>
2020-02-26io_uring: pick up link work on submit reference dropJens Axboe
If work completes inline, then we should pick up a dependent link item in __io_queue_sqe() as well. If we don't do so, we're forced to go async with that item, which is suboptimal. This also fixes an issue with io_put_req_find_next(), which always looks up the next work item. That should only be done if we're dropping the last reference to the request, to prevent multiple lookups of the same work item. Outside of being a fix, this also enables a good cleanup series for 5.7, where we never have to pass 'nxt' around or into the work handlers. Reviewed-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-25io-wq: ensure work->task_pid is cleared on initJens Axboe
We use ->task_pid for exit cancellation, but we need to ensure it's cleared to zero for io_req_work_grab_env() to do the right thing. Take a suggestion from Bart and clear the whole thing, just setting the function passed in. This makes it more future proof as well. Fixes: 36282881a795 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid") Signed-off-by: Jens Axboe <>
2020-02-25io-wq: remove spin-for-work optimizationJens Axboe
Andres reports that buffered IO seems to suck up more cycles than we would like, and he narrowed it down to the fact that the io-wq workers will briefly spin for more work on completion of a work item. This was a win on the networking side, but apparently some other cases take a hit because of it. Remove the optimization to avoid burning more CPU than we have to for disk IO. Reported-by: Andres Freund <> Signed-off-by: Jens Axboe <>
2020-02-25io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLLXiaoguang Wang
After making ext4 support iopoll method: let ext4_file_operations's iopoll method be iomap_dio_iopoll(), we found fio can easily hang in fio_ioring_getevents() with below fio job: rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled. There are two issues that results in this hang, one reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter to get completed events, it relies on kernel io_sq_thread to poll for completed events. Another reason is that there is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, variable 'inflight' will record the number of submitted reqs, then io_sq_thread will poll for reqs which have been added to poll_list. But note, if some previous reqs have been punted to io worker, these reqs will won't be in poll_list timely. io_sq_thread() will only poll for a part of previous submitted reqs, and then find poll_list is empty, reset variable 'inflight' to be zero. If app just waits these deferred reqs and does not wake up io_sq_thread again, then hang happens. For app that entirely relies on io_sq_thread to poll completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding new element to poll_list, and when io_sq_thread prepares to sleep, check whether poll_list is empty again, if not empty, continue to poll. Signed-off-by: Xiaoguang Wang <> Signed-off-by: Jens Axboe <>
2020-02-24io_uring: fix personality idr leakJens Axboe
We somehow never free the idr, even though we init it for every ctx. Free it when the rest of the ring data is freed. Fixes: 071698e13ac6 ("io_uring: allow registering credentials") Reviewed-by: Stefano Garzarella <> Signed-off-by: Jens Axboe <>
2020-02-23io_uring: handle multiple personalities in link chainsJens Axboe
If we have a chain of requests and they don't all use the same credentials, then the head of the chain will be issued with the credentails of the tail of the chain. Ensure __io_queue_sqe() overrides the credentials, if they are different. Once we do that, we can clean up the creds handling as well, by only having io_submit_sqe() do the lookup of a personality. It doesn't need to assign it, since __io_queue_sqe() now always does the right thing. Fixes: 75c6a03904e0 ("io_uring: support using a registered personality for commands") Reported-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-02-23Linux 5.6-rc3Linus Torvalds
2020-02-23Merge tag 'for-5.6-rc2-tag' of ↵Linus Torvalds
git:// Pull btrfs fixes from David Sterba: "These are fixes that were found during testing with help of error injection, plus some other stable material. There's a fixup to patch added to rc1 causing locking in wrong context warnings, tests found one more deadlock scenario. The patches are tagged for stable, two of them now in the queue but we'd like all three released at the same time. I'm not happy about fixes to fixes in such a fast succession during rcs, but I hope we found all the fallouts of commit 28553fa992cb ('Btrfs: fix race between shrinking truncate and fiemap')" * tag 'for-5.6-rc2-tag' of git:// Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents btrfs: fix bytes_may_use underflow in prealloc error condtition btrfs: handle logged extent failure properly btrfs: do not check delayed items are empty for single transaction cleanup btrfs: reset fs_root to NULL on error in open_ctree btrfs: destroy qgroup extent records on transaction abort
2020-02-23Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git:// Pull ext4 fixes from Ted Ts'o: "More miscellaneous ext4 bug fixes (all stable fodder)" * tag 'ext4_for_linus_stable' of git:// ext4: fix mount failure with quota configured as module jbd2: fix ocfs2 corrupt when clearing block group bits ext4: fix race between writepages and enabling EXT4_EXTENTS_FL ext4: rename s_journal_flag_rwsem to s_writepages_rwsem ext4: fix potential race between s_flex_groups online resizing and access ext4: fix potential race between s_group_info online resizing and access ext4: fix potential race between online resizing and write operations ext4: add cond_resched() to __ext4_find_entry() ext4: fix a data race in EXT4_I(inode)->i_disksize
2020-02-23Merge tag 'csky-for-linus-5.6-rc3' of git:// Torvalds
Pull csky updates from Guo Ren: "Sorry, I missed 5.6-rc1 merge window, but in this pull request the most are the fixes and the rests are between fixes and features. The only outside modification is the MAINTAINERS file update with our mailing list. - cache flush implementation fixes - ftrace modify panic fix - CONFIG_SMP boot problem fix - fix pt_regs saving for atomic.S - fix fixaddr_init without highmem. - fix stack protector support - fix fake Tightly-Coupled Memory code compile and use - fix some typos and coding convention" * tag 'csky-for-linus-5.6-rc3' of git:// (23 commits) csky: Replace <linux/clk-provider.h> by <linux/of_clk.h> csky: Implement copy_thread_tls csky: Add PCI support csky: Minimize defconfig to support buildroot config.fragment csky: Add setup_initrd check code csky: Cleanup old Kconfig options arch/csky: fix some Kconfig typos csky: Fixup compile warning for three unimplemented syscalls csky: Remove unused cache implementation csky: Fixup ftrace modify panic csky: Add flush_icache_mm to defer flush icache all csky: Optimize abiv2 copy_to_user_page with VM_EXEC csky: Enable defer flush_dcache_page for abiv2 cpus (807/810/860) csky: Remove unnecessary flush_icache_* implementation csky: Support icache flush without specific instructions csky/Kconfig: Add Kconfig.platforms to support some drivers csky/smp: Fixup boot failed when CONFIG_SMP csky: Set regs->usp to kernel sp, when the exception is from kernel csky/mm: Fixup export invalid_pte_table symbol csky: Separate fixaddr_init from highmem ...
2020-02-23csky: Replace <linux/clk-provider.h> by <linux/of_clk.h>Geert Uytterhoeven
The C-Sky platform code is not a clock provider, and just needs to call of_clk_init(). Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>. Signed-off-by: Geert Uytterhoeven <> Signed-off-by: Guo Ren <>
2020-02-22Merge tag 'ras-urgent-2020-02-22' of ↵Linus Torvalds
git:// Pull RAS fixes from Thomas Gleixner: "Two fixes for the AMD MCE driver: - Populate the per CPU MCA bank descriptor pointer only after it has been completely set up to prevent a use-after-free in case that one of the subsequent initialization step fails - Implement a proper release function for the sysfs entries of MCA threshold controls instead of freeing the memory right in the CPU teardown code, which leads to another use-after-free when the associated sysfs file is opened and accessed" * tag 'ras-urgent-2020-02-22' of git:// x86/mce/amd: Fix kobject lifetime x86/mce/amd: Publish the bank pointer only after setup has succeeded
2020-02-22Merge tag 'irq-urgent-2020-02-22' of ↵Linus Torvalds
git:// Pull irq fixes from Thomas Gleixner: "Two fixes for the irq core code which are follow ups to the recent MSI fixes: - The WARN_ON which was put into the MSI setaffinity callback for paranoia reasons actually triggered via a callchain which escaped when all the possible ways to reach that code were analyzed. The proc/irq/$N/*affinity interfaces have a quirk which came in when ALPHA moved to the generic interface: In case that the written affinity mask does not contain any online CPU it calls into ALPHAs magic auto affinity setting code. A few years later this mechanism was also made available to x86 for no good reasons and in a way which circumvents all sanity checks for interrupts which cannot have their affinity set from process context on X86 due to the way the X86 interrupt delivery works. It would be possible to make this work properly, but there is no point in doing so. If the interrupt is not yet started then the affinity setting has no effect and if it is started already then it is already assigned to an online CPU so there is no point to randomly move it to some other CPU. Just return EINVAL as the code has done before that change forever. - The new MSI quirk bit in the irq domain flags turned out to be already occupied, which escaped the author and the reviewers because the already in use bits were 0,6,2,3,4,5 listed in that order. That bit 6 was simply overlooked because the ordering was straight forward linear otherwise. So the new bit ended up being a duplicate. Fix it up by switching the oddball 6 to the obvious 1" * tag 'irq-urgent-2020-02-22' of git:// genirq/irqdomain: Make sure all irq domain flags are distinct genirq/proc: Reject invalid affinity masks (again)
2020-02-22Merge tag 'x86-urgent-2020-02-22' of ↵Linus Torvalds
git:// Pull x86 fixes from Thomas Gleixner: "Two fixes for x86: - Remove the __force_oder definiton from the kaslr boot code as it is already defined in the page table code which makes GCC 10 builds fail because it changed the default to -fno-common. - Address the AMD erratum 1054 concerning the IRPERF capability and enable the Instructions Retired fixed counter on machines which are not affected by the erratum" * tag 'x86-urgent-2020-02-22' of git:// x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF x86/boot/compressed: Don't declare __force_order in kaslr_64.c
2020-02-22Merge tag 'zonefs-5.6-rc3' of ↵Linus Torvalds
git:// Pull zonefs fix from Damien Le Moal: "A single patch fixing typos in the documentation file" * tag 'zonefs-5.6-rc3' of git:// zonefs: fix documentation typos etc.
2020-02-22Merge tag 'io_uring-5.6-2020-02-22' of git:// Torvalds
Pull io_uring fixes from Jens Axboe: "Here's a small collection of fixes that were queued up: - Remove unnecessary NULL check (Dan) - Missing io_req_cancelled() call in fallocate (Pavel) - Put the cleanup check for aux data in the right spot (Pavel) - Two fixes for SQPOLL (Stefano, Xiaoguang)" * tag 'io_uring-5.6-2020-02-22' of git:// io_uring: fix __io_iopoll_check deadlock in io_sq_thread io_uring: prevent sq_thread from spinning when it should stop io_uring: fix use-after-free by io_cleanup_req() io_uring: remove unnecessary NULL checks io_uring: add missing io_req_cancelled()
2020-02-22Merge tag 'block-5.6-2020-02-22' of git:// Torvalds
Pull block fixes from Jens Axboe: "Just a set of NVMe fixes via Keith" * tag 'block-5.6-2020-02-22' of git:// nvme-multipath: Fix memory leak with ana_log_buf nvme: Fix uninitialized-variable warning nvme-pci: Use single IRQ vector for old Apple models nvme/pci: Add sleep quirk for Samsung and Toshiba drives
2020-02-22Merge tag 'scsi-fixes' of ↵Linus Torvalds
git:// Pull SCSI fixes from James Bottomley: "Four non-core fixes. Two are reverts of target fixes which turned out to have unwanted side effects, one is a revert of an RDMA fix with the same problem and the final one fixes an incorrect warning about memory allocation failures in megaraid_sas (the driver actually reduces the allocation size until it succeeds)" Signed-off-by: James E.J. Bottomley <> * tag 'scsi-fixes' of git:// scsi: Revert "target: iscsi: Wait for all commands to finish before freeing a session" scsi: Revert "RDMA/isert: Fix a recently introduced regression related to logout" scsi: megaraid_sas: silence a warning scsi: Revert "target/core: Inline transport_lun_remove_cmd()"
2020-02-22Merge tag 'hwmon-for-v5.6-rc3' of ↵Linus Torvalds
git:// Pull hwmon fixes from Guenter Roeck: - Fix crash in w83627ehf driver seen with W83627DHG-P - Fix lockdep splat in acpi_power_meter driver - Fix xdpe12284 documentation Sphinx warnings * tag 'hwmon-for-v5.6-rc3' of git:// hwmon: (w83627ehf) Fix crash seen with W83627DHG-P hwmon: (acpi_power_meter) Fix lockdep splat Documentation/hwmon: fix xdpe12284 Sphinx warnings
2020-02-22Merge tag 'devicetree-fixes-for-5.6-2' of ↵Linus Torvalds
git:// Pull devicetree fixes deom Rob Herring: "A handful of fixes in DT bindings for MDIO bus, Allwinner CSI, OMAP HSMMC, and Tegra124 EMC" * tag 'devicetree-fixes-for-5.6-2' of git:// dt-bindings: media: csi: Fix clocks description dt-bindings: media: csi: Add interconnects properties dt-bindings: net: mdio: remove compatible string from example dt-bindings: memory-controller: Update example for Tegra124 EMC dt-bindings: mmc: omap-hsmmc: Fix SDIO interrupt
2020-02-22Merge tag 's390-5.6-4' of ↵Linus Torvalds
git:// Pull s390 fixes from Vasily Gorbik: - Remove ieee_emulation_warnings sysctl which is a dead code. - Avoid triggering rebuild of the kernel during make install. - Enable protected virtualization guest support in default configs. - Fix cio_ignore seq_file .next function to increase position index. And use kobj_to_dev instead of container_of in cio code. - Fix storage block address lists to contain absolute addresses in qdio code. - Few clang warnings and spelling fixes. * tag 's390-5.6-4' of git:// s390/qdio: fill SBALEs with absolute addresses s390/qdio: fill SL with absolute addresses s390: remove obsolete ieee_emulation_warnings s390: make 'install' not depend on vmlinux s390/kaslr: Fix casts in get_random s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range s390/pkey/zcrypt: spelling s/crytp/crypt/ s390/cio: use kobj_to_dev() API s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST s390/cio: cio_ignore_proc_seq_next should increase position index
2020-02-22io_uring: fix __io_iopoll_check deadlock in io_sq_threadio_uring-5.6-2020-02-22Xiaoguang Wang
Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending"), if we already events pending, we won't enter poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has been terminated and don't reap pending events which are already in cq ring, and there are some reqs in poll_list, io_sq_thread will enter __io_iopoll_check(), and find pending events, then return, this loop will never have a chance to exit. I have seen this issue in fio stress tests, to fix this issue, let io_sq_thread call io_iopoll_getevents() with argument 'min' being zero, and remove __io_iopoll_check(). Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: Xiaoguang Wang <> Signed-off-by: Jens Axboe <>
2020-02-21ext4: fix mount failure with quota configured as moduleJan Kara
When CONFIG_QFMT_V2 is configured as a module, the test in ext4_feature_set_ok() fails and so mount of filesystems with quota or project features fails. Fix the test to use IS_ENABLED macro which works properly even for modules. Link: Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel") Signed-off-by: Jan Kara <> Signed-off-by: Theodore Ts'o <> Cc:
2020-02-21jbd2: fix ocfs2 corrupt when clearing block group bitswangyan
I found a NULL pointer dereference in ocfs2_block_group_clear_bits(). The running environment: kernel version: 4.19 A cluster with two nodes, 5 luns mounted on two nodes, and do some file operations like dd/fallocate/truncate/rm on every lun with storage network disconnection. The fallocate operation on dm-23-45 caused an null pointer dereference. The information of NULL pointer dereference as follows: [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45. [577992.878290] Aborting journal on device dm-23-45. ... [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46. [577992.890908] __journal_remove_journal_head: freeing b_committed_data [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30 [577992.890918] __journal_remove_journal_head: freeing b_committed_data [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30 [577992.890922] __journal_remove_journal_head: freeing b_committed_data [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30 [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30 [577992.890928] __journal_remove_journal_head: freeing b_committed_data [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30 [577992.890933] __journal_remove_journal_head: freeing b_committed_data [577992.890939] __journal_remove_journal_head: freeing b_committed_data [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [577992.890950] Mem abort info: [577992.890951] ESR = 0x96000004 [577992.890952] Exception class = DABT (current EL), IL = 32 bits [577992.890952] SET = 0, FnV = 0 [577992.890953] EA = 0, S1PTW = 0 [577992.890954] Data abort info: [577992.890955] ISV = 0, ISS = 0x00000004 [577992.890956] CM = 0, WnR = 0 [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9 [577992.890960] [0000000000000020] pgd=0000000000000000 [577992.890964] Internal error: Oops: 96000004 [#1] SMP [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd) [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1 [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019 [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO) [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2] [577992.891084] sp : ffff0000c8e2b810 [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000 [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70 [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2 [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30 [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000 [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0 [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174 [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565 [577992.891103] x11: 0000000000000038 x10: 0101010101010101 [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080 [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002 [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00 [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000 [577992.891116] Call trace: [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2] [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2] [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2] [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2] [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2] [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2] [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2] [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2] [577992.891316] vfs_fallocate+0x11c/0x250 [577992.891317] ksys_fallocate+0x54/0x88 [577992.891319] __arm64_sys_fallocate+0x28/0x38 [577992.891323] el0_svc_common+0x78/0x130 [577992.891325] el0_svc_handler+0x38/0x78 [577992.891327] el0_svc+0x8/0xc My analysis process as follows: ocfs2_fallocate __ocfs2_change_file_space ocfs2_allocate_extents ocfs2_extend_allocation ocfs2_add_inode_data ocfs2_add_clusters_in_btree ocfs2_insert_extent ocfs2_do_insert_extent ocfs2_rotate_tree_right ocfs2_extend_rotate_transaction ocfs2_extend_trans jbd2_journal_restart jbd2__journal_restart /* handle->h_transaction is NULL, * is_handle_aborted(handle) is true */ handle->h_transaction = NULL; start_this_handle return -EROFS; ocfs2_free_clusters _ocfs2_free_clusters _ocfs2_free_suballoc_bits ocfs2_block_group_clear_bits ocfs2_journal_access_gd __ocfs2_journal_access jbd2_journal_get_undo_access /* I think jbd2_write_access_granted() will * return true, because do_get_write_access() * will return -EROFS. */ if (jbd2_write_access_granted(...)) return 0; do_get_write_access /* handle->h_transaction is NULL, it will * return -EROFS here, so do_get_write_access() * was not called. */ if (is_handle_aborted(handle)) return -EROFS; /* bh2jh(group_bh) is NULL, caused NULL pointer dereference */ undo_bg = (struct ocfs2_group_desc *) bh2jh(group_bh)->b_committed_data; If handle->h_transaction == NULL, then jbd2_write_access_granted() does not really guarantee that journal_head will stay around, not even speaking of its b_committed_data. The bh2jh(group_bh) can be removed after ocfs2_journal_access_gd() and before call "bh2jh(group_bh)->b_committed_data". So, we should move is_handle_aborted() check from do_get_write_access() into jbd2_journal_get_undo_access() and jbd2_journal_get_write_access() before the call to jbd2_write_access_granted(). Link: Signed-off-by: Yan Wang <> Signed-off-by: Theodore Ts'o <> Reviewed-by: Jun Piao <> Reviewed-by: Jan Kara <> Cc:
2020-02-21ext4: fix race between writepages and enabling EXT4_EXTENTS_FLEric Biggers
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running on it, the following warning in ext4_add_complete_io() can be hit: WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120 Here's a minimal reproducer (not 100% reliable) (root isn't required): while true; do sync done & while true; do rm -f file touch file chattr -e file echo X >> file chattr +e file done The problem is that in ext4_writepages(), ext4_should_dioread_nolock() (which only returns true on extent-based files) is checked once to set the number of reserved journal credits, and also again later to select the flags for ext4_map_blocks() and copy the reserved journal handle to ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set, the first check can see dioread_nolock disabled while the later one can see it enabled, causing the reserved handle to unexpectedly be NULL. Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races related to doing so as well, fix this by synchronizing changing EXT4_EXTENTS_FL with ext4_writepages() via the existing s_writepages_rwsem (previously called s_journal_flag_rwsem). This was originally reported by syzbot without a reproducer at, but now that dioread_nolock is the default I also started seeing this when running syzkaller locally. Link: Reported-by: Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: Eric Biggers <> Signed-off-by: Theodore Ts'o <> Reviewed-by: Jan Kara <> Cc:
2020-02-21ext4: rename s_journal_flag_rwsem to s_writepages_rwsemEric Biggers
In preparation for making s_journal_flag_rwsem synchronize ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA flags (rather than just JOURNAL_DATA as it does currently), rename it to s_writepages_rwsem. Link: Signed-off-by: Eric Biggers <> Signed-off-by: Theodore Ts'o <> Reviewed-by: Jan Kara <> Cc:
2020-02-21ext4: fix potential race between s_flex_groups online resizing and accessSuraj Jitindar Singh
During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: Link: Signed-off-by: Suraj Jitindar Singh <> Signed-off-by: Theodore Ts'o <> Cc:
2020-02-21Merge tag 'for-linus-5.6-rc3-tag' of ↵Linus Torvalds
git:// Pull xen fixes from Juergen Gross: "Two small fixes for Xen: - a fix to avoid warnings with new gcc - a fix for incorrectly disabled interrupts when calling _cond_resched()" * tag 'for-linus-5.6-rc3-tag' of git:// xen: Enable interrupts when calling _cond_resched() x86/xen: Distribute switch variables for initialization
2020-02-21Merge tag 'arm64-fixes' of ↵Linus Torvalds
git:// Pull arm64 fixes from Will Deacon: "It's all straightforward apart from the changes to mmap()/mremap() in relation to their handling of address arguments from userspace with non-zero tag bits in the upper byte. The change to brk() is necessary to fix a nasty user-visible regression in malloc(), but we tightened up mmap() and mremap() at the same time because they also allow the user to create virtual aliases by accident. It's much less likely than brk() to matter in practice, but enforcing the principle of "don't permit the creation of mappings using tagged addresses" leads to a straightforward ABI without having to worry about the "but what if a crazy program did foo?" aspect of things. Summary: - Fix regression in malloc() caused by ignored address tags in brk() - Add missing brackets around argument to untagged_addr() macro - Fix clang build when using binutils assembler - Fix silly typo in virtual memory map documentation" * tag 'arm64-fixes' of git:// mm: Avoid creating virtual address aliases in brk()/mmap()/mremap() docs: arm64: fix trivial spelling enought to enough in memory.rst arm64: memory: Add missing brackets to untagged_addr() macro arm64: lse: Fix LSE atomics with LLVM
2020-02-21Merge tag 'powerpc-5.6-3' of ↵Linus Torvalds
git:// Pull powerpc fixes from Michael Ellerman: "Some more powerpc fixes for 5.6. This is two weeks worth as I was out sick last week: - Three fixes for the recently added VMAP_STACK on 32-bit. - Three fixes related to hugepages on 8xx (32-bit). - A fix for a bug in our transactional memory handling that could lead to a kernel crash if we saw a page fault during signal delivery. - A fix for a deadlock in our PCI EEH (Enhanced Error Handling) code. - A couple of other minor fixes. Thanks to: Christophe Leroy, Erhard F, Frederic Barrat, Gustavo Luiz Duarte, Larry Finger, Leonardo Bras, Oliver O'Halloran, Sam Bobroff" * tag 'powerpc-5.6-3' of git:// powerpc/entry: Fix an #if which should be an #ifdef in entry_32.S powerpc/xmon: Fix whitespace handling in getstring() powerpc/6xx: Fix power_save_ppc32_restore() with CONFIG_VMAP_STACK powerpc/chrp: Fix enter_rtas() with CONFIG_VMAP_STACK powerpc/32s: Fix DSI and ISI exceptions for CONFIG_VMAP_STACK powerpc/tm: Fix clearing MSR[TS] in current when reclaiming on signal delivery powerpc/8xx: Fix clearing of bits 20-23 in ITLB miss powerpc/hugetlb: Fix 8M hugepages on 8xx powerpc/hugetlb: Fix 512k hugepages on 8xx with 16k page size powerpc/eeh: Fix deadlock handling dead PHB
2020-02-21Merge tag 'linux-watchdog-5.6-rc3' of ↵Linus Torvalds
git:// Pull watchdog fixes from Wim Van Sebroeck: - mtk_wdt needs RESET_CONTROLLER to build - da9062 driver fixes: - fix power management ops - do not ping the hw during stop() - add dependency on I2C * tag 'linux-watchdog-5.6-rc3' of git:// watchdog: da9062: Add dependency on I2C watchdog: da9062: fix power management ops watchdog: da9062: do not ping the hw during stop() watchdog: fix mtk_wdt.c RESET_CONTROLLER build error
2020-02-21Merge tag 'char-misc-5.6-rc3' of ↵Linus Torvalds
git:// Pull char/misc driver fixes from Greg KH: "Here are some small char/misc driver fixes for 5.6-rc3. Also included in here are some updates for some documentation files that I seem to be maintaining these days. The driver fixes are: - small fixes for the habanalabs driver - fsi driver bugfix All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.6-rc3' of git:// Documentation/process: Swap out the ambassador for Canonical habanalabs: patched cb equals user cb in device memset habanalabs: do not halt CoreSight during hard reset habanalabs: halt the engines before hard-reset MAINTAINERS: remove unnecessary ':' characters fsi: aspeed: add unspecified HAS_IOMEM dependency COPYING: state that all contributions really are covered by this file Documentation/process: Change Microsoft contact for embargoed hardware issues embargoed-hardware-issues: drop Amazon contact as the email address now bounces Documentation/process: Add Arm contact for embargoed HW issues
2020-02-21Merge tag 'staging-5.6-rc3' of ↵Linus Torvalds
git:// Pull staging driver fixes from Greg KH: "Here are some small staging driver fixes for 5.6-rc3, along with the removal of an unused/unneeded driver as well. The android vsoc driver is not needed anymore by anyone, so it was removed. The other driver fixes are: - ashmem bugfixes - greybus audio driver bugfix - wireless driver bugfixes and tiny cleanups to error paths All of these have been in linux-next for a while now with no reported issues" * tag 'staging-5.6-rc3' of git:// staging: rtl8723bs: Remove unneeded goto statements staging: rtl8188eu: Remove some unneeded goto statements staging: rtl8723bs: Fix potential overuse of kernel memory staging: rtl8188eu: Fix potential overuse of kernel memory staging: rtl8723bs: Fix potential security hole staging: rtl8188eu: Fix potential security hole staging: greybus: use after free in gb_audio_manager_remove_all() staging: android: Delete the 'vsoc' driver staging: rtl8723bs: fix copy of overlapping memory staging: android: ashmem: Disallow ashmem memory from being remapped staging: vt6656: fix sign of rx_dbm to bb_pre_ed_rssi.