From: SeongJae Park Date: Thu, 10 Apr 2025 00:00:19 +0000 (-0700) Subject: mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() X-Git-Tag: v6.16-rc1~92^2~181 X-Git-Url: https://git.kernel.dk/?a=commitdiff_plain;h=066c770437835d2bd2072bd2c88a71fcbbd5ccb3;p=linux-block.git mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE", v3. When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE with multiple address ranges, tlb flushes happen for each of the given address ranges. Because such tlb flushes are for the same process, doing those in a batch is more efficient while still being safe. Modify process_madvise() entry level code path to do such batched tlb flushes, while the internal unmap logic do only gathering of the tlb entries to flush. In more detail, modify the entry functions to initialize an mmu_gather object and pass it to the internal logic. And make the internal logic do only gathering of the tlb entries to flush into the received mmu_gather object. After all internal function calls are done, the entry functions flush the gathered tlb entries at once. Because process_madvise() and madvise() share the internal unmap logic, make same change to madvise() entry code together, to make code consistent and cleaner. It is only for keeping the code clean, and shouldn't degrade madvise(). It could rather provide a potential tlb flushes reduction benefit for a case that there are multiple vmas for the given address range. It is only a side effect from an effort to keep code clean, so we don't measure it separately. Similar optimizations might be applicable to other madvise behavior such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this patch series, though. Patches Sequence ================ The first patch defines a new data structure for managing information that is required for batched tlb flushes (mmu_gather and behavior), and update code paths for MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logic to receive it. The second patch batches tlb flushes for MADV_FREE handling for both madvise() and process_madvise(). Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching. The third patch splits zap_page_range_single() for batching of MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for the hint using the sub-logic that the third patch split out, and the helpers for batched tlb flushes that introduced for the MADV_FREE case, by the second patch. Test Results ============ I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory using multiple process_madvise() calls. I apply the advice in 4 KiB sized regions granularity, but with varying batch size per process_madvise() call (vlen) from 1 to 1024. The source code for the measurement is available at GitHub[1]. To reduce measurement errors, I did the measurement five times. The measurement results are as below. 'sz_batch' column shows the batch size of process_madvise() calls. 'Before' and 'After' columns show the average of latencies in nanoseconds that measured five times on kernels that built without and with the tlb flushes batching of this series (patches 3 and 4), respectively. For the baseline, mm-new tree of 2025-04-09[2] has been used, after reverting the second version of this patch series and adding a temporal fix for !CONFIG_DEBUG_VM build failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency measurements standard deviation to average in percent for 'Before' and 'After', respectively. 'Latency_reduction' shows the reduction of the latency that the 'After' has achieved compared to 'Before', in percent. Higher 'Latency_reduction' values mean more efficiency improvements. sz_batch Before B-stdev After A-stdev Latency_reduction 1 146386348 2.78 111327360.6 3.13 23.95 2 108222130 1.54 72131173.6 2.39 33.35 4 93617846.8 2.76 51859294.4 2.50 44.61 8 80555150.4 2.38 44328790 1.58 44.97 16 77272777 1.62 37489433.2 1.16 51.48 32 76478465.2 2.75 33570506 3.48 56.10 64 75810266.6 1.15 27037652.6 1.61 64.34 128 73222748 3.86 25517629.4 3.30 65.15 256 72534970.8 2.31 25002180.4 0.94 65.53 512 71809392 5.12 24152285.4 2.41 66.37 1024 73281170.2 4.53 24183615 2.09 67.00 Unexpectedly the latency has reduced (improved) even with batch size one. I think some of compiler optimizations have affected that, like also observed with the first version of this patch series. So, please focus on the proportion between the improvement and the batch size. As expected, tlb flushes batching provides latency reduction that proportional to the batch size. The efficiency gain ranges from about 33 percent with batch size 2, and up to 67 percent with batch size 1,024. Please note that this is a very simple microbenchmark, so real efficiency gain on real workload could be very different. This patch (of 4): To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE, an mmu_gather object in addition to the behavior integer need to be passed to the internal logics. Using a struct can make it easy without increasing the number of parameters of all code paths towards the internal logic. Define a struct for the purpose and use it on the code path that starts from madvise_do_behavior() and ends on madvise_dontneed_free(). Note that this changes madvise_walk_vmas() visitor type signature, too. Specifically, it changes its 'arg' type from 'unsigned long' to the new struct pointer. Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Lorenzo Stoakes Cc: David Hildenbrand Cc: Liam R. Howlett Cc: Rik van Riel Cc: SeongJae Park Cc: Shakeel Butt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- diff --git a/mm/madvise.c b/mm/madvise.c index b17f684322ad..26fa868b41af 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -48,6 +48,11 @@ struct madvise_walk_private { bool pageout; }; +struct madvise_behavior { + int behavior; + struct mmu_gather *tlb; +}; + /* * Any behaviour which results in changes to the vma->vm_flags needs to * take mmap_lock for writing. Others, which simply traverse vmas, need @@ -893,8 +898,9 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, static long madvise_dontneed_free(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - int behavior) + struct madvise_behavior *madv_behavior) { + int behavior = madv_behavior->behavior; struct mm_struct *mm = vma->vm_mm; *prev = vma; @@ -1249,8 +1255,10 @@ static long madvise_guard_remove(struct vm_area_struct *vma, static int madvise_vma_behavior(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - unsigned long behavior) + void *behavior_arg) { + struct madvise_behavior *arg = behavior_arg; + int behavior = arg->behavior; int error; struct anon_vma_name *anon_name; unsigned long new_flags = vma->vm_flags; @@ -1270,7 +1278,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, case MADV_FREE: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: - return madvise_dontneed_free(vma, prev, start, end, behavior); + return madvise_dontneed_free(vma, prev, start, end, arg); case MADV_NORMAL: new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; break; @@ -1487,10 +1495,10 @@ static bool process_madvise_remote_valid(int behavior) */ static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, - unsigned long end, unsigned long arg, + unsigned long end, void *arg, int (*visit)(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, - unsigned long end, unsigned long arg)) + unsigned long end, void *arg)) { struct vm_area_struct *vma; struct vm_area_struct *prev; @@ -1548,7 +1556,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start, static int madvise_vma_anon_name(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, - unsigned long anon_name) + void *anon_name) { int error; @@ -1557,7 +1565,7 @@ static int madvise_vma_anon_name(struct vm_area_struct *vma, return -EBADF; error = madvise_update_vma(vma, prev, start, end, vma->vm_flags, - (struct anon_vma_name *)anon_name); + anon_name); /* * madvise() returns EAGAIN if kernel resources, such as @@ -1589,7 +1597,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, if (end == start) return 0; - return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name, + return madvise_walk_vmas(mm, start, end, anon_name, madvise_vma_anon_name); } #endif /* CONFIG_ANON_VMA_NAME */ @@ -1677,8 +1685,10 @@ static bool is_madvise_populate(int behavior) } static int madvise_do_behavior(struct mm_struct *mm, - unsigned long start, size_t len_in, int behavior) + unsigned long start, size_t len_in, + struct madvise_behavior *madv_behavior) { + int behavior = madv_behavior->behavior; struct blk_plug plug; unsigned long end; int error; @@ -1692,7 +1702,7 @@ static int madvise_do_behavior(struct mm_struct *mm, if (is_madvise_populate(behavior)) error = madvise_populate(mm, start, end, behavior); else - error = madvise_walk_vmas(mm, start, end, behavior, + error = madvise_walk_vmas(mm, start, end, madv_behavior, madvise_vma_behavior); blk_finish_plug(&plug); return error; @@ -1773,13 +1783,14 @@ static int madvise_do_behavior(struct mm_struct *mm, int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { int error; + struct madvise_behavior madv_behavior = {.behavior = behavior}; if (madvise_should_skip(start, len_in, behavior, &error)) return error; error = madvise_lock(mm, behavior); if (error) return error; - error = madvise_do_behavior(mm, start, len_in, behavior); + error = madvise_do_behavior(mm, start, len_in, &madv_behavior); madvise_unlock(mm, behavior); return error; @@ -1796,6 +1807,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, { ssize_t ret = 0; size_t total_len; + struct madvise_behavior madv_behavior = {.behavior = behavior}; total_len = iov_iter_count(iter); @@ -1811,7 +1823,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, if (madvise_should_skip(start, len_in, behavior, &error)) ret = error; else - ret = madvise_do_behavior(mm, start, len_in, behavior); + ret = madvise_do_behavior(mm, start, len_in, + &madv_behavior); /* * An madvise operation is attempting to restart the syscall, * but we cannot proceed as it would not be correct to repeat