linux-2.6-block.git
10 months agopowerpc/book3s64/vmemmap: switch radix to use a different vmemmap handling function
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:56 +0000 (00:37 +0530)]
powerpc/book3s64/vmemmap: switch radix to use a different vmemmap handling function

This is in preparation to update radix to implement vmemmap optimization
for devdax.  Below are the rules w.r.t radix vmemmap mapping

1. First try to map things using PMD (2M)
2. With altmap if altmap cross-boundary check returns true, fall back to
   PAGE_SIZE
3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to
   PAGE_SIZE

On removing vmemmap mapping, check if every subsection that is using the
vmemmap area is invalid.  If found to be invalid, that implies we can
safely free the vmemmap area.  We don't use the PAGE_UNUSED pattern used
by x86 because with 64K page size, we need to do the above check even at
the PAGE_SIZE granularity.

[aneesh.kumar@linux.ibm.com: fix section mismatch warning]
Link: https://lkml.kernel.org/r/87h6pqvu5g.fsf@linux.ibm.com
[aneesh.kumar@linux.ibm.com: fix kernel build error]
Link: https://lkml.kernel.org/r/877cqkwd20.fsf@linux.ibm.com
Link: https://lkml.kernel.org/r/20230724190759.483013-11-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agopowerpc/book3s64/mm: enable transparent pud hugepage
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:55 +0000 (00:37 +0530)]
powerpc/book3s64/mm: enable transparent pud hugepage

This is enabled only with radix translation and 1G hugepage size.  This
will be used with devdax device memory with a namespace alignment of 1G.

Anon transparent hugepage is not supported even though we do have helpers
checking pud_trans_huge().  We should never find that return true.  The
only expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.

Some of the helpers are never expected to get called on hash translation
and hence is marked to call BUG() in such a case.

Link: https://lkml.kernel.org/r/20230724190759.483013-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agopowerpc/mm/trace: convert trace event to trace event class
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:54 +0000 (00:37 +0530)]
powerpc/mm/trace: convert trace event to trace event class

A follow-up patch will add a pud variant for this same event.  Using event
class makes that addition simpler.

No functional change in this patch.

Link: https://lkml.kernel.org/r/20230724190759.483013-9-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/vmemmap optimization: split hugetlb and devdax vmemmap optimization
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:53 +0000 (00:37 +0530)]
mm/vmemmap optimization: split hugetlb and devdax vmemmap optimization

Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to
read-only) and the output address (pfn) of the vmemmap ptes.  That is not
supported without unmapping of pte(marking it invalid) by some
architectures.

With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb
vmemmap optimization disabled.  Hence split DAX optimization support into
a different config.

s390, loongarch and riscv don't have devdax support.  So the DAX config is
not enabled for them.  With this change, arm64 should be able to select
DAX optimization

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Link: https://lkml.kernel.org/r/20230724190759.483013-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/huge pud: use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:52 +0000 (00:37 +0530)]
mm/huge pud: use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE

pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled.  Similar to pmdp_set_wrprotect and
move_huge_pmd_helpers use architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set

Link: https://lkml.kernel.org/r/20230724190759.483013-7-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: add pud_same similar to __HAVE_ARCH_P4D_SAME
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:51 +0000 (00:37 +0530)]
mm: add pud_same similar to __HAVE_ARCH_P4D_SAME

This helps architectures to override pmd_same and pud_same independently.

Link: https://lkml.kernel.org/r/20230724190759.483013-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/vmemmap: allow architectures to override how vmemmap optimization works
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:50 +0000 (00:37 +0530)]
mm/vmemmap: allow architectures to override how vmemmap optimization works

Architectures like powerpc will like to use different page table
allocators and mapping mechanisms to implement vmemmap optimization.
Similar to vmemmap_populate allow architectures to implement
vmemap_populate_compound_pages

Link: https://lkml.kernel.org/r/20230724190759.483013-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/vmemmap: improve vmemmap_can_optimize and allow architectures to override
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:49 +0000 (00:37 +0530)]
mm/vmemmap: improve vmemmap_can_optimize and allow architectures to override

dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within
vmemmap such that tail page mapping can point to the second PAGE_SIZE
area.  Enforce that in vmemmap_can_optimize() function.

Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation).  Hence allow architecture
override.

Link: https://lkml.kernel.org/r/20230724190759.483013-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: change pudp_huge_get_and_clear_full take vm_area_struct as arg
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:48 +0000 (00:37 +0530)]
mm: change pudp_huge_get_and_clear_full take vm_area_struct as arg

We will use this in a later patch to do tlb flush when clearing pud
entries on powerpc.  This is similar to commit 93a98695f2f9 ("mm: change
pmdp_huge_get_and_clear_full take vm_area_struct as arg")

Link: https://lkml.kernel.org/r/20230724190759.483013-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/hugepage pud: allow arch-specific helper function to check huge page pud support
Aneesh Kumar K.V [Mon, 24 Jul 2023 19:07:47 +0000 (00:37 +0530)]
mm/hugepage pud: allow arch-specific helper function to check huge page pud support

Patch series "Add support for DAX vmemmap optimization for ppc64", v6.

This patch series implements changes required to support DAX vmemmap
optimization for ppc64.  The vmemmap optimization is only enabled with
radix MMU translation and 1GB PUD mapping with 64K page size.

The patch series also splits the hugetlb vmemmap optimization as a
separate Kconfig variable so that architectures can enable DAX vmemmap
optimization without enabling hugetlb vmemmap optimization.  This should
enable architectures like arm64 to enable DAX vmemmap optimization while
they can't enable hugetlb vmemmap optimization.  More details of the same
are in patch "mm/vmemmap optimization: Split hugetlb and devdax vmemmap
optimization".

With 64K page size for 16384 pages added (1G) we save 14 pages
With 4K page size for 262144 pages added (1G) we save 4094 pages
With 4K page size for 512 pages added (2M) we save 6 pages

This patch (of 13):

Architectures like powerpc would like to enable transparent huge page pud
support only with radix translation.  To support that add
has_transparent_pud_hugepage() helper that architectures can override.

[aneesh.kumar@linux.ibm.com: use the new has_transparent_pud_hugepage()]
Link: https://lkml.kernel.org/r/87tttrvtaj.fsf@linux.ibm.com
Link: https://lkml.kernel.org/r/20230724190759.483013-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20230724190759.483013-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: handle faults that merely update the accessed bit under the VMA lock
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:10 +0000 (19:54 +0100)]
mm: handle faults that merely update the accessed bit under the VMA lock

Move FAULT_FLAG_VMA_LOCK check out of handle_pte_fault().  This should
have a significant performance improvement for mmaped files.  Write faults
(on read-only shared pages) still take the mmap lock as we do not want to
audit all the implementations of ->pfn_mkwrite() and ->page_mkwrite().
However write-faults on private mappings are handled under the VMA lock.

[willy@infradead.org: address "suspicious RCU usage" warning]
Link: https://lkml.kernel.org/r/ZMK7jwpI4uD6tKrF@casper.infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: handle swap and NUMA PTE faults under the VMA lock
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:09 +0000 (19:54 +0100)]
mm: handle swap and NUMA PTE faults under the VMA lock

Move the FAULT_FLAG_VMA_LOCK check down in handle_pte_fault().  This is
probably not a huge win in its own right, but is a nicely separable bit
from the next patch.

Link: https://lkml.kernel.org/r/20230724185410.1124082-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: run the fault-around code under the VMA lock
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:08 +0000 (19:54 +0100)]
mm: run the fault-around code under the VMA lock

The map_pages fs method should be safe to run under the VMA lock instead
of the mmap lock.  This should have a measurable reduction in contention
on the mmap lock.

Link: https://lkml.kernel.org/r/20230724185410.1124082-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: move FAULT_FLAG_VMA_LOCK check down from do_fault()
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:07 +0000 (19:54 +0100)]
mm: move FAULT_FLAG_VMA_LOCK check down from do_fault()

Perform the check at the start of do_read_fault(), do_cow_fault() and
do_shared_fault() instead.  Should be no performance change from the last
commit.

Link: https://lkml.kernel.org/r/20230724185410.1124082-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:06 +0000 (19:54 +0100)]
mm: move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()

Call do_pte_missing() under the VMA lock ...  then immediately retry in
do_fault().

Link: https://lkml.kernel.org/r/20230724185410.1124082-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: handle some PMD faults under the VMA lock
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:05 +0000 (19:54 +0100)]
mm: handle some PMD faults under the VMA lock

Push the VMA_LOCK check down from __handle_mm_fault() to
handle_pte_fault().  Once again, we refuse to call ->huge_fault() with the
VMA lock held, but we will wait for a PMD migration entry with the VMA
lock held, handle NUMA migration and set the accessed bit.  We were
already doing this for anonymous VMAs, so it should be safe.

Link: https://lkml.kernel.org/r/20230724185410.1124082-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: handle PUD faults under the VMA lock
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:04 +0000 (19:54 +0100)]
mm: handle PUD faults under the VMA lock

Postpone checking the VMA_LOCK flag until we've attempted to handle faults
on PUDs.  There's a mild upside to this patch in that we'll allocate the
page tables while under the VMA lock rather than the mmap lock, reducing
the hold time on the mmap lock, since the retry will find the page tables
already populated.  The real purpose here is to make a commit that shows
we don't call ->huge_fault under the VMA lock.  We do now handle setting
the accessed bit on a PUD fault under the VMA lock, but that doesn't seem
likely to be a measurable difference.

Link: https://lkml.kernel.org/r/20230724185410.1124082-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:03 +0000 (19:54 +0100)]
mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()

Handle a little more of the page fault path outside the mmap sem.  The
hugetlb path doesn't need to check whether the VMA is anonymous; the
VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
performance change from the previous commit; this is simply a step to ease
bisection of any problems.

Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: allow per-VMA locks on file-backed VMAs
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:02 +0000 (19:54 +0100)]
mm: allow per-VMA locks on file-backed VMAs

Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
The fault path will immediately fail in handle_mm_fault().  There may be a
small performance reduction from this patch as a little unnecessary work
will be done on each page fault.  See later patches for the improvement.

Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: remove CONFIG_PER_VMA_LOCK ifdefs
Matthew Wilcox (Oracle) [Mon, 24 Jul 2023 18:54:01 +0000 (19:54 +0100)]
mm: remove CONFIG_PER_VMA_LOCK ifdefs

Patch series "Handle most file-backed faults under the VMA lock", v3.

This patchset adds the ability to handle page faults on parts of files
which are already in the page cache without taking the mmap lock.

This patch (of 10):

Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to
eliminate ifdefs in the users.

Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/mmap: change vma iteration order in do_vmi_align_munmap()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:57 +0000 (14:31 -0400)]
mm/mmap: change vma iteration order in do_vmi_align_munmap()

By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations.  It
also means that prev/next will be loaded closer to when they are used.

This requires changing the loop type when gathering the VMAs that will
be freed.

Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).

Using the vma_iter_prev_range() to walk back to the correct location in
the tree will, on the most part, mean walking within the CPU cache.
Usually, this is two steps vs a node reset and a tree re-walk.

Link: https://lkml.kernel.org/r/20230724183157.3939892-16-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: reduce resets during store setup
Liam R. Howlett [Mon, 24 Jul 2023 18:31:56 +0000 (14:31 -0400)]
maple_tree: reduce resets during store setup

mas_prealloc() may walk partially down the tree before finding that a
split or spanning store is needed.  When the write occurs, relax the
logic on resetting the walk so that partial walks will not restart, but
walks that have gone too far (a store that affects beyond the current
node) should be restarted.

Link: https://lkml.kernel.org/r/20230724183157.3939892-15-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: refine mas_preallocate() node calculations
Liam R. Howlett [Mon, 24 Jul 2023 18:31:55 +0000 (14:31 -0400)]
maple_tree: refine mas_preallocate() node calculations

Calculate the number of nodes based on the pending write action instead
of assuming the worst case.

This addresses a performance regression introduced in platforms that
have longer allocation timing.

Link: https://lkml.kernel.org/r/20230724183157.3939892-14-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: update mas_preallocate() testing
Liam R. Howlett [Mon, 24 Jul 2023 18:31:54 +0000 (14:31 -0400)]
maple_tree: update mas_preallocate() testing

Since the mas_preallocate() calculation has been updated to be more
precise, the testing must also be updated to check for what is expected.

Link: https://lkml.kernel.org/r/20230724183157.3939892-13-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: move mas_wr_end_piv() below mas_wr_extend_null()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:53 +0000 (14:31 -0400)]
maple_tree: move mas_wr_end_piv() below mas_wr_extend_null()

Relocate it and call mas_wr_extend_null() from within mas_wr_end_piv().
Extending the NULL may affect the end pivot value so call
mas_wr_endtend_null() from within mas_wr_end_piv() to keep it all
together.

Link: https://lkml.kernel.org/r/20230724183157.3939892-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: set up vma iterator for vma_iter_prealloc() calls
Liam R. Howlett [Mon, 24 Jul 2023 18:31:52 +0000 (14:31 -0400)]
mm: set up vma iterator for vma_iter_prealloc() calls

Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.

Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: use vma_iter_clear_gfp() in nommu
Liam R. Howlett [Mon, 24 Jul 2023 18:31:51 +0000 (14:31 -0400)]
mm: use vma_iter_clear_gfp() in nommu

Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so
it can be used in the nommu code.  This will reduce node preallocations
in nommu.

Link: https://lkml.kernel.org/r/20230724183157.3939892-10-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: adjust node allocation on mas_rebalance()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:50 +0000 (14:31 -0400)]
maple_tree: adjust node allocation on mas_rebalance()

mas_rebalance() is called to rebalance an insufficient node into a
single node or two sufficient nodes.  The preallocation estimate is
always too many in this case as the height of the tree will never grow
and there is no possibility to have a three way split in this case, so
revise the node allocation count.

Link: https://lkml.kernel.org/r/20230724183157.3939892-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: re-introduce entry to mas_preallocate() arguments
Liam R. Howlett [Mon, 24 Jul 2023 18:31:49 +0000 (14:31 -0400)]
maple_tree: re-introduce entry to mas_preallocate() arguments

The current preallocation strategy is to preallocate the absolute
worst-case allocation for a tree modification.  The entry (or NULL) is
needed to know how many nodes are needed to write to the tree.  Start by
adding the argument to the mas_preallocate() definition.

Link: https://lkml.kernel.org/r/20230724183157.3939892-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: remove re-walk from mmap_region()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:48 +0000 (14:31 -0400)]
mm: remove re-walk from mmap_region()

Using vma_iter_set() will reset the tree and cause a re-walk.  Use
vmi_iter_config() to set the write to a sub-set of the range.  Change
the file case to also use vmi_iter_config() so that the end is correctly
set.

Link: https://lkml.kernel.org/r/20230724183157.3939892-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: introduce __mas_set_range()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:47 +0000 (14:31 -0400)]
maple_tree: introduce __mas_set_range()

mas_set_range() resets the node to MAS_START, which will cause a re-walk
of the tree to the range.  This is unnecessary when the maple state is
already at the correct location of the write.  Add a function that only
sets the range to avoid unnecessary re-walking of the tree.

Link: https://lkml.kernel.org/r/20230724183157.3939892-6-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: remove prev check from do_vmi_align_munmap()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:46 +0000 (14:31 -0400)]
mm: remove prev check from do_vmi_align_munmap()

If the prev does not exist, the vma iterator will be set to MAS_NONE,
which will be treated as a MAS_START when the mas_next or mas_find is
used.  In this case, the next caller will be the vma iterator, which
uses mas_find() under the hood and will now do what the user expects.

Link: https://lkml.kernel.org/r/20230724183157.3939892-5-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: change do_vmi_align_munmap() tracking of VMAs to remove
Liam R. Howlett [Mon, 24 Jul 2023 18:31:45 +0000 (14:31 -0400)]
mm: change do_vmi_align_munmap() tracking of VMAs to remove

The majority of the calls to munmap a vm range is within a single vma.
The maple tree is able to store a single entry at 0, with a size of 1 as
a pointer and avoid any allocations.  Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count.  This
will leverage the ability to store the first entry without a node
allocation.

Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries.  Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.

Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls.  This happens in the
static unmap_region() and exit_mmap().

Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: add benchmarking for mas_prev()
Liam R. Howlett [Mon, 24 Jul 2023 18:31:44 +0000 (14:31 -0400)]
maple_tree: add benchmarking for mas_prev()

Add some benchmarking functions in testing for mas_prev().  This is
useful to ensure there are no regressions added during modifications.

Link: https://lkml.kernel.org/r/20230724183157.3939892-3-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: add benchmarking for mas_for_each
Liam R. Howlett [Mon, 24 Jul 2023 18:31:43 +0000 (14:31 -0400)]
maple_tree: add benchmarking for mas_for_each

Patch series "Reduce preallocations for maple tree", v3.

Initial work on preallocations showed no regression in performance during
testing, but recently some users (both on [1] and off [android] list) have
reported that preallocating the worst-case number of nodes has caused some
slow down.  This patch set addresses the number of allocations in a few
ways.

During munmap() most munmap() operations will remove a single VMA, so
leverage the fact that the maple tree can place a single pointer at range
0 - 0 without allocating.  This is done by changing the index of the VMAs
to be indexed by the count, starting at 0.

Re-introduce the entry argument to mas_preallocate() so that a more
intelligent guess of the node count can be made.

Implement the more intelligent guess of the node count, although there is
more work to be done.

During development of v2 of this patch set, I also noticed that the number
of nodes being allocated for a rebalance was beyond what could possibly be
needed.  This is addressed in patch 0008.

This patch (of 15):

Add a way to test the speed of mas_for_each() to the testing code.

Link: https://lkml.kernel.org/r/20230724183157.3939892-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230724183157.3939892-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: don't drop VMA locks in mm_drop_all_locks()
Jann Horn [Thu, 20 Jul 2023 19:34:36 +0000 (21:34 +0200)]
mm: don't drop VMA locks in mm_drop_all_locks()

Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap
lock is held write-locked by the caller, and the caller is responsible for
dropping the mmap lock at a later point (which will also release the VMA
locks).

Calling vma_end_write_all() here is dangerous because the caller might
have write-locked a VMA with the expectation that it will stay
write-locked until the mmap_lock is released, as usual.

This _almost_ becomes a problem in the following scenario:

An anonymous VMA A and an SGX VMA B are mapped adjacent to each other.
Userspace calls munmap() on a range starting at the start address of A and
ending in the middle of B.

Hypothetical call graph with additional notes in brackets:

do_vmi_align_munmap
  [begin first for_each_vma_range loop]
  vma_start_write [on VMA A]
  vma_mark_detached [on VMA A]
  __split_vma [on VMA B]
    sgx_vma_open [== new->vm_ops->open]
      sgx_encl_mm_add
        __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
          mm_take_all_locks
          mm_drop_all_locks
            vma_end_write_all [drops VMA lock taken on VMA A before]
  vma_start_write [on VMA B]
  vma_mark_detached [on VMA B]
  [end first for_each_vma_range loop]
  vma_iter_clear_gfp [removes VMAs from maple tree]
  mmap_write_downgrade
  unmap_region
  mmap_read_unlock

In this hypothetical scenario, while do_vmi_align_munmap() thinks it still
holds a VMA write lock on VMA A, the VMA write lock has actually been
invalidated inside __split_vma().

The call from sgx_encl_mm_add() to __mmu_notifier_register() can't
actually happen here, as far as I understand, because we are duplicating
an existing SGX VMA, but sgx_encl_mm_add() only calls
__mmu_notifier_register() for the first SGX VMA created in a given
process.  So this could only happen in fork(), not on munmap().  But in my
view it is just pure luck that this can't happen.

Also, we wouldn't actually have any bad consequences from this in
do_vmi_align_munmap(), because by the time the bug drops the lock on VMA
A, we've already marked VMA A as detached, which makes it completely
ineligible for any VMA-locked page faults.  But again, that's just pure
luck.

So remove the vma_end_write_all(), so that VMA write locks are only ever
released on mmap_write_unlock() or mmap_write_downgrade().

Also add comments to document the locking rules established by this patch.

Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com
Fixes: eeff9a5d47f8 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: convert bio_associate_blkg_from_page() to take in a folio
ZhangPeng [Fri, 21 Jul 2023 03:44:51 +0000 (11:44 +0800)]
mm/page_io: convert bio_associate_blkg_from_page() to take in a folio

Convert bio_associate_blkg_from_page() to take in a folio. We can remove
two implicit calls to compound_head() by taking in a folio.

Link: https://lkml.kernel.org/r/20230721034451.16412-11-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: convert count_swpout_vm_event() to take in a folio
ZhangPeng [Fri, 21 Jul 2023 03:44:50 +0000 (11:44 +0800)]
mm/page_io: convert count_swpout_vm_event() to take in a folio

Convert count_swpout_vm_event() to take in a folio. We can remove five
implicit calls to compound_head() by taking in a folio.

Link: https://lkml.kernel.org/r/20230721034451.16412-10-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: use a folio in swap_writepage_bdev_async()
ZhangPeng [Fri, 21 Jul 2023 03:44:49 +0000 (11:44 +0800)]
mm/page_io: use a folio in swap_writepage_bdev_async()

Saves one implicit call to compound_head().

Link: https://lkml.kernel.org/r/20230721034451.16412-9-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: use a folio in swap_writepage_bdev_sync()
ZhangPeng [Fri, 21 Jul 2023 03:44:48 +0000 (11:44 +0800)]
mm/page_io: use a folio in swap_writepage_bdev_sync()

Saves one implicit call to compound_head().

Link: https://lkml.kernel.org/r/20230721034451.16412-8-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: use a folio in sio_read_complete()
ZhangPeng [Fri, 21 Jul 2023 03:44:47 +0000 (11:44 +0800)]
mm/page_io: use a folio in sio_read_complete()

Saves one implicit call to compound_head().

Link: https://lkml.kernel.org/r/20230721034451.16412-7-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: use a folio in __end_swap_bio_read()
ZhangPeng [Fri, 21 Jul 2023 03:44:46 +0000 (11:44 +0800)]
mm/page_io: use a folio in __end_swap_bio_read()

Saves one implicit call to compound_head().

Link: https://lkml.kernel.org/r/20230721034451.16412-6-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: use a folio in __end_swap_bio_write()
ZhangPeng [Fri, 21 Jul 2023 03:44:45 +0000 (11:44 +0800)]
mm/page_io: use a folio in __end_swap_bio_write()

Saves two implicit call to compound_head().

Link: https://lkml.kernel.org/r/20230721034451.16412-5-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: introduce bio_first_folio_all()
ZhangPeng [Fri, 21 Jul 2023 03:44:44 +0000 (11:44 +0800)]
mm/page_io: introduce bio_first_folio_all()

Introduce bio_first_folio_all() to return a folio, which makes it easier
to use.

Link: https://lkml.kernel.org/r/20230721034451.16412-4-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: remove unneeded SetPageError()
ZhangPeng [Fri, 21 Jul 2023 03:44:43 +0000 (11:44 +0800)]
mm/page_io: remove unneeded SetPageError()

Nobody checks the PageError()/folio_test_error() for the page/folio in
__end_swap_bio_read/write() and sio_write_complete(). Therefore, we
don't need to set the error flag. Just drop it.

Link: https://lkml.kernel.org/r/20230721034451.16412-3-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/page_io: remove unneeded ClearPageUptodate()
ZhangPeng [Fri, 21 Jul 2023 03:44:42 +0000 (11:44 +0800)]
mm/page_io: remove unneeded ClearPageUptodate()

Patch series "Convert several functions in page_io.c to use a folio", v4.

Convert several functions in page_io.c to use a folio, which can remove
several implicit calls to compound_head().

This patch (of 10):

The VM_BUG_ON_FOLIO in swap_readpage() ensures that the page is already
!uptodate in __end_swap_bio_read() and sio_read_complete().  Just remove
unneeded ClearPageUptodate().

Link: https://lkml.kernel.org/r/20230721034451.16412-1-zhangpeng362@huawei.com
Link: https://lkml.kernel.org/r/20230721034451.16412-2-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/compaction: avoid unneeded pageblock_end_pfn when no_set_skip_hint is set
Kemeng Shi [Fri, 21 Jul 2023 15:09:57 +0000 (23:09 +0800)]
mm/compaction: avoid unneeded pageblock_end_pfn when no_set_skip_hint is set

Move pageblock_end_pfn after no_set_skip_hint check to avoid unneeded
pageblock_end_pfn if no_set_skip_hint is set.

Link: https://lkml.kernel.org/r/20230721150957.2058634-3-shikemeng@huawei.com
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/compaction: correct comment of candidate pfn in fast_isolate_freepages
Kemeng Shi [Fri, 21 Jul 2023 15:09:56 +0000 (23:09 +0800)]
mm/compaction: correct comment of candidate pfn in fast_isolate_freepages

Patch series "Two minor cleanups for compaction", v2.

This series contains two random cleanups for compaction.

This patch (of 2):

If no preferred one was not found, we will use candidate page with maximum
pfn > min_pfn which is saved in high_pfn.  Correct "minimum" to "maximum
candidate" in comment.

Link: https://lkml.kernel.org/r/20230721150957.2058634-1-shikemeng@huawei.com
Link: https://lkml.kernel.org/r/20230721150957.2058634-2-shikemeng@huawei.com
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/mprotect: fix obsolete function name in change_pte_range()
Miaohe Lin [Sun, 23 Jul 2023 03:31:14 +0000 (11:31 +0800)]
mm/mprotect: fix obsolete function name in change_pte_range()

Since commit 79a1971c5f14 ("mm: move the copy_one_pte() pte_present check
into the caller"), the explanation of preserving soft-dirtiness is moved
into copy_nonpresent_pte().  Update corresponding comment.

Link: https://lkml.kernel.org/r/20230723033114.3224409-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: run all tests from run_vmtests.sh
Ryan Roberts [Mon, 24 Jul 2023 08:25:22 +0000 (09:25 +0100)]
selftests/mm: run all tests from run_vmtests.sh

It is very unclear to me how one is supposed to run all the mm selftests
consistently and get clear results.

Most of the test programs are launched by both run_vmtests.sh and
run_kselftest.sh:

  hugepage-mmap
  hugepage-shm
  map_hugetlb
  hugepage-mremap
  hugepage-vmemmap
  hugetlb-madvise
  map_fixed_noreplace
  gup_test
  gup_longterm
  uffd-unit-tests
  uffd-stress
  compaction_test
  on-fault-limit
  map_populate
  mlock-random-test
  mlock2-tests
  mrelease_test
  mremap_test
  thuge-gen
  virtual_address_range
  va_high_addr_switch
  mremap_dontunmap
  hmm-tests
  madv_populate
  memfd_secret
  ksm_tests
  ksm_functional_tests
  soft-dirty
  cow

However, of this set, when launched by run_vmtests.sh, some of the
programs are invoked multiple times with different arguments. When
invoked by run_kselftest.sh, they are invoked without arguments (and as
a consequence, some fail immediately).

Some test programs are only launched by run_vmtests.sh:

  test_vmalloc.sh

And some test programs and only launched by run_kselftest.sh:

  khugepaged
  migration
  mkdirty
  transhuge-stress
  split_huge_page_test
  mdwe_test
  write_to_hugetlbfs

Furthermore, run_vmtests.sh is invoked by run_kselftest.sh, so in this
case all the test programs invoked by both scripts are run twice!

Needless to say, this is a bit of a mess. In the absence of fully
understanding the history here, it looks to me like the best solution is
to launch ALL test programs from run_vmtests.sh, and ONLY invoke
run_vmtests.sh from run_kselftest.sh. This way, we get full control over
the parameters, each program is only invoked the intended number of
times, and regardless of which script is used, the same tests get run in
the same way.

The only drawback is that if using run_kselftest.sh, it's top-level tap
result reporting reports only a single test and it fails if any of the
contained tests fail. I don't see this as a big deal though since we
still see all the nested reporting from multiple layers. The other issue
with this is that all of run_vmtests.sh must execute within a single
kselftest timeout period, so let's increase that to something more
suitable.

In the Makefile, TEST_GEN_PROGS will compile and install the tests and
will add them to the list of tests that run_kselftest.sh will run.
TEST_GEN_FILES will compile and install the tests but will not add them
to the test list. So let's move all the programs from TEST_GEN_PROGS to
TEST_GEN_FILES so that they are built but not executed by
run_kselftest.sh. Note that run_vmtests.sh is added to TEST_PROGS, which
means it ends up in the test list. (the lack of "_GEN" means it won't be
compiled, but simply copied).

Link: https://lkml.kernel.org/r/20230724082522.1202616-9-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: optionally pass duration to transhuge-stress
Ryan Roberts [Mon, 24 Jul 2023 08:25:21 +0000 (09:25 +0100)]
selftests/mm: optionally pass duration to transhuge-stress

Until now, transhuge-stress runs until its explicitly killed, so when
invoked by run_kselftest.sh, it would run until the test timeout, then it
would be killed and the test would be marked as failed.

Add a new, optional command line parameter that allows the user to specify
the duration in seconds that the program should run.  The program exits
after this duration with a success (0) exit code.  If the argument is
omitted the old behacvior remains.

On it's own, this doesn't quite solve our problem because run_kselftest.sh
does not allow passing parameters to the program under test.  But we will
shortly move this to run_vmtests.sh, which does allow parameter passing.

Link: https://lkml.kernel.org/r/20230724082522.1202616-8-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: make migration test robust to failure
Ryan Roberts [Mon, 24 Jul 2023 08:25:20 +0000 (09:25 +0100)]
selftests/mm: make migration test robust to failure

The `migration` test currently has a number of robustness problems that
cause it to hang and leak resources.

Timeout: There are 3 tests, which each previously ran for 60 seconds.
However, the timeout in mm/settings for a single test binary was set to 45
seconds.  So when run using run_kselftest.sh, the top level timeout would
trigger before the test binary was finished.  Solve this by meeting in the
middle; each of the 3 tests now runs for 20 seconds (for a total of 60),
and the top level timeout is set to 90 seconds.

Leaking child processes: the `shared_anon` test fork()s some children but
then an ASSERT() fires before the test kills those children.  The assert
causes immediate exit of the parent and leaking of the children.
Furthermore, if run using the run_kselftest.sh wrapper, the wrapper would
get stuck waiting for those children to exit, which never happens.  Solve
this by setting the "parent death signal" to SIGHUP in the child, so that
the child is killed automatically if the parent dies.

With these changes, the test binary now runs to completion on arm64, with
2 tests passing and the `shared_anon` test failing.

Link: https://lkml.kernel.org/r/20230724082522.1202616-7-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: va_high_addr_switch should skip unsupported arm64 configs
Ryan Roberts [Mon, 24 Jul 2023 08:25:19 +0000 (09:25 +0100)]
selftests/mm: va_high_addr_switch should skip unsupported arm64 configs

va_high_addr_switch has a mechanism to determine if the tests should be
run or skipped (supported_arch()).  This currently returns unconditionally
true for arm64.  However, va_high_addr_switch also requires a large
virtual address space for the tests to run, otherwise they spuriously
fail.

Since arm64 can only support VA > 48 bits when the page size is 64K, let's
decide whether we should skip the test suite based on the page size.  This
reduces noise when running on 4K and 16K kernels.

Link: https://lkml.kernel.org/r/20230724082522.1202616-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: fix thuge-gen test bugs
Ryan Roberts [Mon, 24 Jul 2023 08:25:18 +0000 (09:25 +0100)]
selftests/mm: fix thuge-gen test bugs

thuge-gen was previously only munmapping part of the mmapped buffer, which
caused us to run out of 1G huge pages for a later part of the test.  Fix
this by munmapping the whole buffer.  Based on the code, it looks like a
typo rather than an intention to keep some of the buffer mapped.

thuge-gen was also calling mmap with SHM_HUGETLB flag (bit 11 set), which
is actually MAP_DENYWRITE in mmap context.  The man page says this flag is
ignored in modern kernels.  I'm pretty sure from the context that the
author intended to pass the MAP_HUGETLB flag so I've fixed that up too.

Link: https://lkml.kernel.org/r/20230724082522.1202616-5-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: enable mrelease_test for arm64
Ryan Roberts [Mon, 24 Jul 2023 08:25:17 +0000 (09:25 +0100)]
selftests/mm: enable mrelease_test for arm64

mrelease_test defaults to defining __NR_pidfd_open and
__NR_process_mrelease syscall numbers to -1, if they are not defined
anywhere else, and the suite would then be marked as skipped as a result.

arm64 (at least the stock debian toolchain that I'm using) requires
including <sys/syscall.h> to pull in the defines for these syscalls.  So
let's add this header.  With this in place, the test is passing on arm64.

Link: https://lkml.kernel.org/r/20230724082522.1202616-4-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests/mm: skip soft-dirty tests on arm64
Ryan Roberts [Mon, 24 Jul 2023 08:25:16 +0000 (09:25 +0100)]
selftests/mm: skip soft-dirty tests on arm64

arm64 does not support the soft-dirty PTE bit.  However, the `soft-dirty`
test suite is currently run unconditionally and therefore generates
spurious test failures on arm64.  There are also some tests in
`madv_populate` which assume it is supported.

For `soft-dirty` lets disable the whole suite for arm64; it is no longer
built and run_vmtests.sh will skip it if its not present.

For `madv_populate`, we need a runtime mechanism so that the remaining
tests continue to be run.  Unfortunately, the only way to determine if the
soft-dirty dirty bit is supported is to write to a page, then see if the
bit is set in /proc/self/pagemap.  But the tests that we want to
conditionally execute are testing precicesly this.  So if we introduced
this feature check, we could accedentally turn a real failure (on a system
that claims to support soft-dirty) into a skip.  So instead, do the check
based on architecture; for arm64, we report that soft-dirty is not
supported.

Link: https://lkml.kernel.org/r/20230724082522.1202616-3-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoselftests: line buffer test program's stdout
Ryan Roberts [Mon, 24 Jul 2023 08:25:15 +0000 (09:25 +0100)]
selftests: line buffer test program's stdout

Patch series "selftests/mm fixes for arm64", v3.

Given my on-going work on large anon folios and contpte mappings, I
decided it would be a good idea to start running mm selftests to help
guard against regressions.  However, it soon became clear that I
couldn't get the suite to run cleanly on arm64 with a vanilla v6.5-rc1
kernel (perhaps I'm just doing it wrong??), so got stuck in a rabbit
hole trying to debug and fix all the issues.  Some were down to
misconfigurations, but I also found a number of issues with the tests
and even a couple of issues with the kernel.

This patch (of 8):

The selftests runner pipes the test program's stdout to tap_prefix.  The
presence of the pipe means that the test program sets its stdout to be
fully buffered (as aposed to line buffered when directly connected to the
terminal).  The block buffering means that there is often content in the
buffer at fork() time, which causes the output to end up duplicated.  This
was causing problems for mm:cow where test results were duplicated 20-30x.

Solve this by using `stdbuf`, when available to force the test program to
use line buffered mode.  This means previously printf'ed results are
flushed out of the program before any fork().

Additionally, explicitly set line buffer mode in ksft_print_header(),
which means that all test programs that use the ksft framework will
benefit even if stdbuf is not present on the system.

[ryan.roberts@arm.com: add setvbuf() to set buffering mode]
Link: https://lkml.kernel.org/r/20230726070655.2713530-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230724082522.1202616-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230724082522.1202616-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: fix obsolete function name above debug_pagealloc_enabled_static()
Miaohe Lin [Thu, 20 Jul 2023 11:28:06 +0000 (19:28 +0800)]
mm: fix obsolete function name above debug_pagealloc_enabled_static()

Since commit 04013513cc84 ("mm, page_alloc: do not rely on the order of
page_poison and init_on_alloc/free parameters"), init_debug_pagealloc() is
converted to init_mem_debugging_and_hardening().  Later it's renamed to
mem_debugging_and_hardening_init() via commit f2fc4b44ec2b ("mm: move
init_mem_debugging_and_hardening() to mm/mm_init.c").

Link: https://lkml.kernel.org/r/20230720112806.3851893-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agommu_notifiers: rename invalidate_range notifier
Alistair Popple [Tue, 25 Jul 2023 13:42:07 +0000 (23:42 +1000)]
mmu_notifiers: rename invalidate_range notifier

There are two main use cases for mmu notifiers.  One is by KVM which uses
mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries at
any time.  Hence using start/end() can lead to memory corruption as these
callbacks happen too soon/late during page unmap.

mmu notifier users should therefore either use the start()/end() callbacks
or the invalidate_range() callbacks.  To make this usage clearer rename
the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
update documention.

Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agommu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_ran...
Alistair Popple [Tue, 25 Jul 2023 13:42:06 +0000 (23:42 +1000)]
mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

Secondary TLBs are now invalidated from the architecture specific TLB
invalidation functions.  Therefore there is no need to explicitly notify
or invalidate as part of the range end functions.  This means we can
remove mmu_notifier_invalidate_range_end_only() and some of the
ptep_*_notify() functions.

Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agommu_notifiers: call invalidate_range() when invalidating TLBs
Alistair Popple [Tue, 25 Jul 2023 13:42:05 +0000 (23:42 +1000)]
mmu_notifiers: call invalidate_range() when invalidating TLBs

The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync
with the CPU page tables.  Currently it is called from separate code paths
to the main CPU TLB invalidations.  This can lead to a secondary TLB not
getting invalidated when required and makes it hard to reason about when
exactly the secondary TLB is invalidated.

To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs requiring
explicit software invalidations.

This fixes a SMMU bug on ARM64.  On ARM64 PTE permission upgrades require
a TLB invalidation.  This invalidation is done by the architecture
specific ptep_set_access_flags() which calls flush_tlb_page() if required.
However this doesn't call the notifier resulting in infinite faults being
generated by devices using the SMMU if it has previously cached a
read-only PTE in it's TLB.

Moving the invalidations into the TLB invalidation functions ensures all
invalidations happen at the same time as the CPU invalidation.  The
architecture specific flush_tlb_all() routines do not call the notifier as
none of the IOMMUs require this.

Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Tested-by: SeongJae Park <sj@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agommu_notifiers: fixup comment in mmu_interval_read_begin()
Alistair Popple [Tue, 25 Jul 2023 13:42:04 +0000 (23:42 +1000)]
mmu_notifiers: fixup comment in mmu_interval_read_begin()

The comment in mmu_interval_read_begin() refers to a function that doesn't
exist and uses the wrong call-back name.  The op for mmu interval
notifiers is mmu_interval_notifier_ops->invalidate() so fix the comment up
to reflect that.

Link: https://lkml.kernel.org/r/e7a09081b3ac82a03c189409f1262fc2df91071e.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoarm64/smmu: use TLBI ASID when invalidating entire range
Alistair Popple [Tue, 25 Jul 2023 13:42:03 +0000 (23:42 +1000)]
arm64/smmu: use TLBI ASID when invalidating entire range

Patch series "Invalidate secondary IOMMU TLB on permission upgrade", v4.

The main change is to move secondary TLB invalidation mmu notifier
callbacks into the architecture specific TLB flushing functions. This
makes secondary TLB invalidation mostly match CPU invalidation while
still allowing efficient range based invalidations based on the
existing TLB batching code.

This patch (of 5):

The ARM SMMU has a specific command for invalidating the TLB for an entire
ASID.  Currently this is used for the IO_PGTABLE API but not for ATS when
called from the MMU notifier.

The current implementation of notifiers does not attempt to invalidate
such a large address range, instead walking each VMA and invalidating each
range individually during mmap removal.  However in future SMMU TLB
invalidations are going to be sent as part of the normal flush_tlb_*()
kernel calls.  To better deal with that add handling to use TLBI ASID when
invalidating the entire address space.

Link: https://lkml.kernel.org/r/cover.1eca029b8603ef4eebe5b41eae51facfc5920c41.1690292440.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/ba5f0ec5fbc2ab188797524d3687e075e2412a2b.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: Be more strict about locking
Liam R. Howlett [Fri, 14 Jul 2023 19:55:51 +0000 (15:55 -0400)]
maple_tree: Be more strict about locking

Use lockdep to check the write path in the maple tree holds the lock in
write mode.

Introduce mt_write_lock_is_held() to check if the lock is held for
writing.  Update the necessary checks for rcu_dereference_protected() to
use the new write lock check.

Link: https://lkml.kernel.org/r/20230714195551.894800-5-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/mmap: change detached vma locking scheme
Liam R. Howlett [Wed, 5 Jul 2023 18:47:49 +0000 (14:47 -0400)]
mm/mmap: change detached vma locking scheme

Don't set the lock to the mm lock so that the detached VMA tree does not
complain about being unlocked when the mmap_lock is dropped prior to
freeing the tree.

Introduce mt_on_stack() for setting the external lock to NULL only when
LOCKDEP is used.

Move the destroying of the detached tree outside the mmap lock all
together.

Link: https://lkml.kernel.org/r/20230719183142.ktgcmuj2pnlr3h3s@revolver
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: relax lockdep checks for on-stack trees
Liam R. Howlett [Fri, 14 Jul 2023 19:55:49 +0000 (15:55 -0400)]
maple_tree: relax lockdep checks for on-stack trees

To support early release of the maple tree locks, do not lockdep check the
lock if it is set to NULL.  This is intended for the special case on-stack
use of tracking entries and not for general use.

Link: https://lkml.kernel.org/r/20230714195551.894800-3-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/mmap: clean up validate_mm() calls
Liam R. Howlett [Fri, 14 Jul 2023 19:55:48 +0000 (15:55 -0400)]
mm/mmap: clean up validate_mm() calls

Patch series "More strict maple tree lockdep", v2.

Linus asked for more strict maple tree lockdep checking [1] and for them
to resume the normal path through Andrews tree.

This series of patches adds checks to ensure the lock is held in write
mode during the write path of the maple tree instead of checking if it's
held at all.

It also reduces the validate_mm() calls by consolidating into commonly
used functions (patch 0001), and removes the necessity of holding the lock
on the detached tree during munmap() operations.

This patch (of 4):

validate_mm() calls are too spread out and duplicated in numerous
locations.  Also, now that the stack write is done under the write lock,
it is not necessary to validate the mm prior to write operations.

Add a validate_mm() to the stack expansions, and to vma_complete() so
that numerous others may be dropped.

Note that vma_link() (and also insert_vm_struct() by call path) already
call validate_mm().

vma_merge() also had an unnecessary call to vma_iter_free() since the
logic change to abort earlier if no merging is necessary.

Drop extra validate_mm() calls at the start of functions and error paths
which won't write to the tree.

Relocate the validate_mm() call in the do_brk_flags() to avoid
re-running the same test when vma_complete() is used.

The call within the error path of mmap_region() is left intentionally
because of the complexity of the function and the potential of drivers
modifying the tree.

Link: https://lkml.kernel.org/r/20230714195551.894800-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230714195551.894800-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/hugetlb: get rid of page_hstate()
Sidhartha Kumar [Wed, 19 Jul 2023 18:41:45 +0000 (11:41 -0700)]
mm/hugetlb: get rid of page_hstate()

Convert the last page_hstate() user to use folio_hstate() so page_hstate()
can be safely removed.

Link: https://lkml.kernel.org/r/20230719184145.301911-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/rmap: correct stale comment of rmap_walk_anon and rmap_walk_file
Kemeng Shi [Tue, 18 Jul 2023 09:21:36 +0000 (17:21 +0800)]
mm/rmap: correct stale comment of rmap_walk_anon and rmap_walk_file

1. update page to folio in comment
2. add comment of new added @locked

Link: https://lkml.kernel.org/r/20230718092136.1935789-1-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: kfence: allocate kfence_metadata at runtime
Peng Zhang [Tue, 18 Jul 2023 07:30:19 +0000 (15:30 +0800)]
mm: kfence: allocate kfence_metadata at runtime

kfence_metadata is currently a static array.  For the purpose of
allocating scalable __kfence_pool, we first change it to runtime
allocation of metadata.  Since the size of an object of kfence_metadata is
1160 bytes, we can save at least 72 pages (with default 256 objects)
without enabling kfence.

[akpm@linux-foundation.org: restore newline, per Marco]
Link: https://lkml.kernel.org/r/20230718073019.52513-1-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomemory tier: use helper macro __ATTR_RW()
Miaohe Lin [Sat, 15 Jul 2023 03:51:11 +0000 (11:51 +0800)]
memory tier: use helper macro __ATTR_RW()

Use helper macro __ATTR_RW to define numa demotion attributes.  Minor
readability improvement.

Link: https://lkml.kernel.org/r/20230715035111.2656784-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: mtree_insert: fix typo in kernel-doc description of GFP flags
Mike Rapoport (IBM) [Sat, 15 Jul 2023 08:40:38 +0000 (11:40 +0300)]
maple_tree: mtree_insert: fix typo in kernel-doc description of GFP flags

Replace FGP_FLAGS with GFP_FLAGS

Link: https://lkml.kernel.org/r/20230715084038.987955-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomaple_tree: mtree_insert*: fix typo in kernel-doc description
Mike Rapoport (IBM) [Sat, 15 Jul 2023 14:39:20 +0000 (17:39 +0300)]
maple_tree: mtree_insert*: fix typo in kernel-doc description

Replace "Insert and entry at a give index" with "Insert an entry at a
given index"

Link: https://lkml.kernel.org/r/20230715143920.994812-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agofs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a...
Zhu, Lipeng [Sun, 16 Jul 2023 14:56:54 +0000 (22:56 +0800)]
fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.

When running UnixBench/Shell Scripts, we observed high false sharing for
accessing i_mmap against i_mmap_rwsem.

UnixBench/Shell Scripts are typical load/execute command test scenarios,
which concurrently launch->execute->exit a lot of shell commands.  A lot
of processes invoke vma_interval_tree_remove which touch "i_mmap", the
call stack:

----vma_interval_tree_remove
    |----unlink_file_vma
    |    free_pgtables
    |    |----exit_mmap
    |    |    mmput
    |    |    |----begin_new_exec
    |    |    |    load_elf_binary
    |    |    |    bprm_execve

Meanwhile, there are a lot of processes touch 'i_mmap_rwsem' to acquire
the semaphore in order to access 'i_mmap'.  In existing 'address_space'
layout, 'i_mmap' and 'i_mmap_rwsem' are in the same cacheline.

The patch places the i_mmap and i_mmap_rwsem in separate cache lines to
avoid this false sharing problem.

With this patch, based on kernel v6.4.0, on Intel Sapphire Rapids
112C/224T platform, the score improves by ~5.3%.  And perf c2c tool shows
the false sharing is resolved as expected, the symbol
vma_interval_tree_remove disappeared in cache line 0 after this change.

Baseline:
=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0    3729     5791        0        0  0xff19b3818445c740
-------------------------------------------------------------
   3.27%    3.02%    0.00%    0.00%   0x18     0       1  0xffffffffa194403b       604       483       389      692       203  [k] vma_interval_tree_insert    [kernel.kallsyms]  vma_interval_tree_insert+75      0  1
   4.13%    3.63%    0.00%    0.00%   0x20     0       1  0xffffffffa19440a2       553       413       415      962       215  [k] vma_interval_tree_remove    [kernel.kallsyms]  vma_interval_tree_remove+18      0  1
   2.04%    1.35%    0.00%    0.00%   0x28     0       1  0xffffffffa219a1d6      1210       855       460     1229       222  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+678    0  1
   0.62%    1.85%    0.00%    0.00%   0x28     0       1  0xffffffffa219a1bf       762       329       577      527       198  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+655    0  1
   0.48%    0.31%    0.00%    0.00%   0x28     0       1  0xffffffffa219a58c      1677      1476       733     1544       224  [k] down_write                  [kernel.kallsyms]  down_write+28                    0  1
   0.05%    0.07%    0.00%    0.00%   0x28     0       1  0xffffffffa219a21d      1040       819       689       33        27  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+749    0  1
   0.00%    0.05%    0.00%    0.00%   0x28     0       1  0xffffffffa17707db         0      1005       786     1373       223  [k] up_write                    [kernel.kallsyms]  up_write+27                      0  1
   0.00%    0.02%    0.00%    0.00%   0x28     0       1  0xffffffffa219a064         0       233       778       32        30  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+308    0  1
  33.82%   34.10%    0.00%    0.00%   0x30     0       1  0xffffffffa1770945       779       495       534     6011       224  [k] rwsem_spin_on_owner         [kernel.kallsyms]  rwsem_spin_on_owner+53           0  1
  17.06%   15.28%    0.00%    0.00%   0x30     0       1  0xffffffffa1770915       593       438       468     2715       224  [k] rwsem_spin_on_owner         [kernel.kallsyms]  rwsem_spin_on_owner+5            0  1
   3.54%    3.52%    0.00%    0.00%   0x30     0       1  0xffffffffa2199f84       881       601       583     1421       223  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+84     0  1

With this change:
-------------------------------------------------------------
   0      556      838        0        0  0xff2780d7965d2780
-------------------------------------------------------------
    0.18%    0.60%    0.00%    0.00%    0x8     0       1  0xffffffffafff27b8       503       453       569       14        13  [k] do_dentry_open              [kernel.kallsyms]  do_dentry_open+456               0  1
    0.54%    0.12%    0.00%    0.00%    0x8     0       1  0xffffffffaffc51ac       510       199       428       15        12  [k] hugepage_vma_check          [kernel.kallsyms]  hugepage_vma_check+252           0  1
    1.80%    2.15%    0.00%    0.00%   0x18     0       1  0xffffffffb079a1d6      1778       799       343      215       136  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+678    0  1
    0.54%    1.31%    0.00%    0.00%   0x18     0       1  0xffffffffb079a1bf       547       296       528       91        71  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+655    0  1
    0.72%    0.72%    0.00%    0.00%   0x18     0       1  0xffffffffb079a58c      1479      1534       676      288       163  [k] down_write                  [kernel.kallsyms]  down_write+28                    0  1
    0.00%    0.12%    0.00%    0.00%   0x18     0       1  0xffffffffafd707db         0      2381       744      282       158  [k] up_write                    [kernel.kallsyms]  up_write+27                      0  1
    0.00%    0.12%    0.00%    0.00%   0x18     0       1  0xffffffffb079a064         0       239       518        6         6  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem_down_write_slowpath+308    0  1
   46.58%   47.02%    0.00%    0.00%   0x20     0       1  0xffffffffafd70945       704       403       499     1137       219  [k] rwsem_spin_on_owner         [kernel.kallsyms]  rwsem_spin_on_owner+53           0  1
   23.92%   25.78%    0.00%    0.00%   0x20     0       1  0xffffffffafd70915       558       413       500      542       185  [k] rwsem_spin_on_owner         [kernel.kallsyms]  rwsem_spin_on_owner+5            0  1

v1->v2: change padding to exchange fields.

Link: https://lkml.kernel.org/r/20230716145653.20122-1-lipeng.zhu@intel.com
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Yu Ma <yu.ma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/mm_init.c: drop node_start_pfn from adjust_zone_range_for_zone_movable()
Haifeng Xu [Mon, 17 Jul 2023 06:58:11 +0000 (06:58 +0000)]
mm/mm_init.c: drop node_start_pfn from adjust_zone_range_for_zone_movable()

node_start_pfn is not used in adjust_zone_range_for_zone_movable(), so it
is pointless to waste a function argument.  Drop the parameter.

Link: https://lkml.kernel.org/r/20230717065811.1262-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/memcg: minor cleanup for mc_handle_present_pte()
Miaohe Lin [Mon, 17 Jul 2023 11:36:44 +0000 (19:36 +0800)]
mm/memcg: minor cleanup for mc_handle_present_pte()

When pagetable lock is held, the page will always be page_mapped().  So
remove unneeded page_mapped() check.  Also the page can't be freed from
under us in this case.  So use get_page() to get extra page reference to
simplify the code.  No functional change intended.

Link: https://lkml.kernel.org/r/20230717113644.3026478-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoarm64: support batched/deferred tlb shootdown during page reclamation/migration
Barry Song [Mon, 17 Jul 2023 13:10:04 +0000 (21:10 +0800)]
arm64: support batched/deferred tlb shootdown during page reclamation/migration

On x86, batched and deferred tlb shootdown has lead to 90% performance
increase on tlb shootdown.  on arm64, HW can do tlb shootdown without
software IPI.  But sync tlbi is still quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out
a page mapped by only one process.  If the page is mapped by multiple
processes, typically, like more than 100 on a phone, the overhead would be
much higher as we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number of CPU cores due to
the bad scalability of tlb shootdown in HW, so those ARM64 servers should
expect much higher overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush.  This
provides us a very good chance to leverage the existing batched tlb in
kernel.  The minimum modification is that we only send async tlbi in the
first stage and we send dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to finish the
program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also tested with benchmark in the commit on Kunpeng920 arm64 server
and observed an improvement around 12.5% with command
`time ./swap_bench`.
        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform.

Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
100 times for 1G memory, this patch decrease the time used around 20%
(prev 18.338318910 sec after 13.981866350 sec) and saved the time used
by ptep_clear_flush().

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/tlbbatch: introduce arch_flush_tlb_batched_pending()
Yicong Yang [Mon, 17 Jul 2023 13:10:03 +0000 (21:10 +0800)]
mm/tlbbatch: introduce arch_flush_tlb_batched_pending()

Currently we'll flush the mm in flush_tlb_batched_pending() to avoid race
between reclaim unmaps pages by batched TLB flush and mprotect/munmap/etc.
Other architectures like arm64 may only need a synchronization
barrier(dsb) here rather than a full mm flush.  So add
arch_flush_tlb_batched_pending() to allow an arch-specific implementation
here.  This intends no functional changes on x86 since still a full mm
flush for x86.

Link: https://lkml.kernel.org/r/20230717131004.12662-4-yangyicong@huawei.com
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/tlbbatch: rename and extend some functions
Barry Song [Mon, 17 Jul 2023 13:10:02 +0000 (21:10 +0800)]
mm/tlbbatch: rename and extend some functions

This patch does some preparation works to extend batched TLB flush to
arm64. Including:
- Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
  to accept an additional argument for address, architectures
  like arm64 may need this for tlbi.
- Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
  to match its current function since we don't need to handle
  mm on architectures like arm64 and add_mm is not proper,
  add_pending will make sense to both as on x86 we're pending the
  TLB flush operations while on arm64 we're pending the synchronize
  operations.

This intends no functional changes on x86.

Link: https://lkml.kernel.org/r/20230717131004.12662-3-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/tlbbatch: introduce arch_tlbbatch_should_defer()
Anshuman Khandual [Mon, 17 Jul 2023 13:10:01 +0000 (21:10 +0800)]
mm/tlbbatch: introduce arch_tlbbatch_should_defer()

Patch series "arm64: support batched/deferred tlb shootdown during page
reclamation/migration", v11.

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free.  A simplest micro benchmark shows even on
snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is
huge even for paging out one page mapped by only one process: 5.36% a.out
[kernel.kallsyms] [k] ptep_clear_flush

While pages are mapped by multiple processes or HW has more CPUs, the cost
should become even higher due to the bad scalability of tlb shootdown.
The same benchmark can result in 16.99% CPU consumption on ARM64 server
with around 100 cores according to the test on patch 4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only send tlbi instructions in the first stage -
arch_tlbbatch_add_mm()
2. wait for the completion of tlbi by dsb while doing tlbbatch
sync in arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
the patchset.  The micro benchmark becomes 5% faster even for one page
mapped by single process on snapdragon 888.

Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset
does some renaming/extension for the current implementation first (Patch
1-3), then add the support on arm64 (Patch 4).

This patch (of 4):

The entire scheme of deferred TLB flush in reclaim path rests on the fact
that the cost to refill TLB entries is less than flushing out individual
entries by sending IPI to remote CPUs.  But architecture can have
different ways to evaluate that.  Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.

[yangyicong@hisilicon.com: rebase and fix incorrect return value type]
Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com
Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed
Baoquan He [Thu, 6 Jul 2023 15:45:20 +0000 (23:45 +0800)]
mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed

Now there are no users of ioremap_allowed and iounmap_allowed, clean
them up.

Link: https://lkml.kernel.org/r/20230706154520.11257-20-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoarm64 : mm: add wrapper function ioremap_prot()
Baoquan He [Thu, 6 Jul 2023 15:45:19 +0000 (23:45 +0800)]
arm64 : mm: add wrapper function ioremap_prot()

Since hook functions ioremap_allowed() and iounmap_allowed() will be
obsoleted, add wrapper function ioremap_prot() to contain the the specific
handling in addition to generic_ioremap_prot() invocation.

Link: https://lkml.kernel.org/r/20230706154520.11257-19-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agopowerpc: mm: convert to GENERIC_IOREMAP
Christophe Leroy [Thu, 6 Jul 2023 15:45:18 +0000 (23:45 +0800)]
powerpc: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap()
and iounmap() are all visible and available to arch. Arch needs to
provide wrapper functions to override the generic versions if there's
arch specific handling in its ioremap_prot(), ioremap() or iounmap().
This change will simplify implementation by removing duplicated code
with generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper functions ioremap_prot() and iounmap() for powerpc's
special operation when ioremap() and iounmap().

Link: https://lkml.kernel.org/r/20230706154520.11257-18-bhe@redhat.com
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: move is_ioremap_addr() into new header file
Baoquan He [Thu, 6 Jul 2023 15:45:17 +0000 (23:45 +0800)]
mm: move is_ioremap_addr() into new header file

Now is_ioremap_addr() is only used in kernel/iomem.c and gonna be used in
mm/ioremap.c.  Move it into its own new header file linux/ioremap.h.

Link: https://lkml.kernel.org/r/20230706154520.11257-17-bhe@redhat.com
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/ioremap: consider IOREMAP space in generic ioremap
Christophe Leroy [Thu, 6 Jul 2023 15:45:16 +0000 (23:45 +0800)]
mm/ioremap: consider IOREMAP space in generic ioremap

Architectures like powerpc have a dedicated space for IOREMAP mappings.

If so, use it in generic_ioremap_prot().

Link: https://lkml.kernel.org/r/20230706154520.11257-16-bhe@redhat.com
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoparisc: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:15 +0000 (23:45 +0800)]
parisc: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper function ioremap_prot() for parisc's special operation
when iounmap().

Link: https://lkml.kernel.org/r/20230706154520.11257-15-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoxtensa: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:14 +0000 (23:45 +0800)]
xtensa: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper functions ioremap_prot(), ioremap() and iounmap() for
xtensa's special operation when ioremap() and iounmap().

Link: https://lkml.kernel.org/r/20230706154520.11257-14-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agosh: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:13 +0000 (23:45 +0800)]
sh: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper functions ioremap_prot() and iounmap() for SuperH's
special operation when ioremap() and iounmap().

Link: https://lkml.kernel.org/r/20230706154520.11257-13-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agosh: add <asm-generic/io.h> including
Baoquan He [Thu, 6 Jul 2023 15:45:12 +0000 (23:45 +0800)]
sh: add <asm-generic/io.h> including

In <asm-generic/io.h>, it provides a generic implementation of all
I/O accessors.

For some port|mm io functions, SuperH has its own implementation in
arch/sh/kernel/iomap.c and arch/sh/include/asm/io_noioport.h.  These will
conflict with those in <asm-generic/io.h> and cause compiling error.
Hence add macro definitions to ensure that the SuperH version of them will
override the generic version.

[arnd@arndb.de: fix asm-generic/io.h inclusion]
Link: https://lkml.kernel.org/r/20230802141658.2064864-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20230706154520.11257-12-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agos390: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:11 +0000 (23:45 +0800)]
s390: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper functions ioremap_prot() and iounmap() for s390's
special operation when ioremap() and iounmap().

And also replace including <asm-generic/io.h> with <asm/io.h> in
arch/s390/kernel/perf_cpum_sf.c, otherwise building error will be seen
because macro defined in <asm/io.h> can't be seen in perf_cpum_sf.c.

Link: https://lkml.kernel.org/r/20230706154520.11257-11-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoopenrisc: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:10 +0000 (23:45 +0800)]
openrisc: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

For openrisc, the current ioremap() and iounmap() are the same as generic
version.  After taking GENERIC_IOREMAP way, the old ioremap() and
iounmap() can be completely removed.

Link: https://lkml.kernel.org/r/20230706154520.11257-10-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoia64: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:09 +0000 (23:45 +0800)]
ia64: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper functions ioremap_prot() and iounmap() for ia64's
special operation when ioremap() and iounmap().

Link: https://lkml.kernel.org/r/20230706154520.11257-9-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoarc: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:08 +0000 (23:45 +0800)]
arc: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic versions if there's arch
specific handling in its ioremap_prot(), ioremap() or iounmap().  This
change will simplify implementation by removing duplicated code with
generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.

Here, add wrapper functions ioremap_prot() and iounmap() for arc's special
operation when ioremap_prot() and iounmap().

Link: https://lkml.kernel.org/r/20230706154520.11257-8-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/ioremap: add slab availability checking in ioremap_prot
Baoquan He [Thu, 6 Jul 2023 15:45:07 +0000 (23:45 +0800)]
mm/ioremap: add slab availability checking in ioremap_prot

Several architectures has done checking if slab if available in
ioremap_prot().  In fact it should be done in generic ioremap_prot() since
on any architecutre, slab allocator must be available before
get_vm_area_caller() and vunmap() are used.

Add the checking into generic_ioremap_prot().

Link: https://lkml.kernel.org/r/20230706154520.11257-7-bhe@redhat.com
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm: ioremap: allow ARCH to have its own ioremap method definition
Baoquan He [Thu, 6 Jul 2023 15:45:06 +0000 (23:45 +0800)]
mm: ioremap: allow ARCH to have its own ioremap method definition

Architectures can be converted to GENERIC_IOREMAP, to take standard
ioremap_xxx() and iounmap() way.  But some ARCH-es could have specific
handling for ioremap_prot(), ioremap() and iounmap(), than standard
methods.

In oder to convert these ARCH-es to take GENERIC_IOREMAP method, allow
these architecutres to have their own ioremap_prot(), ioremap() and
iounmap() definitions.

Link: https://lkml.kernel.org/r/20230706154520.11257-6-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agomm/ioremap: define generic_ioremap_prot() and generic_iounmap()
Christophe Leroy [Thu, 6 Jul 2023 15:45:05 +0000 (23:45 +0800)]
mm/ioremap: define generic_ioremap_prot() and generic_iounmap()

Define a generic version of ioremap_prot() and iounmap() that
architectures can call after they have performed the necessary alteration
to parameters and/or necessary verifications.

Link: https://lkml.kernel.org/r/20230706154520.11257-5-bhe@redhat.com
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoopenrisc: mm: remove unneeded early ioremap code
Baoquan He [Thu, 6 Jul 2023 15:45:04 +0000 (23:45 +0800)]
openrisc: mm: remove unneeded early ioremap code

Under arch/openrisc, there isn't any place where ioremap() is called.  It
means that there isn't early ioremap handling needed in openrisc, So the
early ioremap handling code in ioremap() of arch/openrisc/mm/ioremap.c is
unnecessary and can be removed.

And also remove the special handling in iounmap() since no page is got
from fixmap pool along with early ioremap code removing in ioremap().

Link: https://lore.kernel.org/linux-mm/YwxfxKrTUtAuejKQ@oscomms1/
Link: https://lkml.kernel.org/r/20230706154520.11257-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Stafford Horne <shorne@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agohexagon: mm: convert to GENERIC_IOREMAP
Baoquan He [Thu, 6 Jul 2023 15:45:03 +0000 (23:45 +0800)]
hexagon: mm: convert to GENERIC_IOREMAP

By taking GENERIC_IOREMAP method, the generic ioremap_prot() and iounmap()
are visible and available to arch.  This change will simplify
implementation by removing duplicated code with generic ioremap_prot() and
iounmap(), and has the equivalent functioality.

For hexagon, the current ioremap() and iounmap() are the same as generic
version.  After taking GENERIC_IOREMAP way, the old ioremap() and
iounmap() can be completely removed.

Link: https://lkml.kernel.org/r/20230706154520.11257-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agoasm-generic/iomap.h: remove ARCH_HAS_IOREMAP_xx macros
Baoquan He [Thu, 6 Jul 2023 15:45:02 +0000 (23:45 +0800)]
asm-generic/iomap.h: remove ARCH_HAS_IOREMAP_xx macros

Patch series "mm: ioremap: Convert architectures to take GENERIC_IOREMAP
way", v8.

Motivation and implementation:
==============================
Currently, many architecutres have't taken the standard GENERIC_IOREMAP
way to implement ioremap_prot(), iounmap(), and ioremap_xx(), but make
these functions specifically under each arch's folder.  Those cause many
duplicated code of ioremap() and iounmap().

In this patchset, firstly introduce generic_ioremap_prot() and
generic_iounmap() to extract the generic code for GENERIC_IOREMAP.  By
taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch.  Arch needs to provide
wrapper functions to override the generic version if there's arch specific
handling in its corresponding ioremap_prot(), ioremap() or iounmap().
With these changes, duplicated ioremap/iounmap() code uder ARCH-es are
removed, and the equivalent functioality is kept as before.

Background info:
================

1) The converting more architectures to take GENERIC_IOREMAP way is
   suggested by Christoph in below discussion:
   https://lore.kernel.org/all/Yp7h0Jv6vpgt6xdZ@infradead.org/T/#u

2) In the previous v1 to v3, it's basically further action after arm64
   has converted to GENERIC_IOREMAP way in below patchset.  It's done by
   adding hook ioremap_allowed() and iounmap_allowed() in ARCH to add ARCH
   specific handling the middle of ioremap_prot() and iounmap().

[PATCH v5 0/6] arm64: Cleanup ioremap() and support ioremap_prot()
https://lore.kernel.org/all/20220607125027.44946-1-wangkefeng.wang@huawei.com/T/#u

Later, during v3 reviewing, Christophe Leroy suggested to introduce
generic_ioremap_prot() and generic_iounmap() to generic codes, and ARCH
can provide wrapper function ioremap_prot(), ioremap() or iounmap() if
needed.  Christophe made a RFC patchset as below to specially demonstrate
his idea.  This is what v4 and now v5 is doing.

[RFC PATCH 0/8] mm: ioremap: Convert architectures to take GENERIC_IOREMAP way
https://lore.kernel.org/all/cover.1665568707.git.christophe.leroy@csgroup.eu/T/#u

Testing:
========
In v8, I only applied this patchset onto the latest linus's tree to build
and run on arm64 and s390.

This patch (of 19):

Let's use '#define ioremap_xx' and "#ifdef ioremap_xx" instead.

To remove defined ARCH_HAS_IOREMAP_xx macros in <asm/io.h> of each ARCH,
the ARCH's own ioremap_wc|wt|np definition need be above "#include
<asm-generic/iomap.h>.  Otherwise the redefinition error would be seen
during compiling.  So the relevant adjustments are made to avoid compiling
error:

  loongarch:
  - doesn't include <asm-generic/iomap.h>, defining ARCH_HAS_IOREMAP_WC
    is redundant, so simply remove it.

  m68k:
  - selected GENERIC_IOMAP, <asm-generic/iomap.h> has been added in
    <asm-generic/io.h>, and <asm/kmap.h> is included above
    <asm-generic/iomap.h>, so simply remove ARCH_HAS_IOREMAP_WT defining.

  mips:
  - move "#include <asm-generic/iomap.h>" below ioremap_wc definition
    in <asm/io.h>

  powerpc:
  - remove "#include <asm-generic/iomap.h>" in <asm/io.h> because it's
    duplicated with the one in <asm-generic/io.h>, let's rely on the
    latter.

  x86:
  - selected GENERIC_IOMAP, remove #include <asm-generic/iomap.h> in
    the middle of <asm/io.h>. Let's rely on <asm-generic/io.h>.

Link: https://lkml.kernel.org/r/20230706154520.11257-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Helge Deller <deller@gmx.de>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 months agolib/test_meminit: allocate pages up to order MAX_ORDER
Andrew Donnellan [Fri, 14 Jul 2023 01:52:38 +0000 (11:52 +1000)]
lib/test_meminit: allocate pages up to order MAX_ORDER

test_pages() tests the page allocator by calling alloc_pages() with
different orders up to order 10.

However, different architectures and platforms support different maximum
contiguous allocation sizes.  The default maximum allocation order
(MAX_ORDER) is 10, but architectures can use CONFIG_ARCH_FORCE_MAX_ORDER
to override this.  On platforms where this is less than 10, test_meminit()
will blow up with a WARN().  This is expected, so let's not do that.

Replace the hardcoded "10" with the MAX_ORDER macro so that we test
allocations up to the expected platform limit.

Link: https://lkml.kernel.org/r/20230714015238.47931-1-ajd@linux.ibm.com
Fixes: 5015a300a522 ("lib: introduce test_meminit module")
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Xiaoke Wang <xkernel.wang@foxmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>