linux-block.git
2 months agomemcg: vmalloc: simplify MEMCG_VMALLOC updates
Shakeel Butt [Thu, 3 Apr 2025 05:33:26 +0000 (22:33 -0700)]
memcg: vmalloc: simplify MEMCG_VMALLOC updates

The vmalloc region can either be charged to a single memcg or none.  At
the moment kernel traverses all the pages backing the vmalloc region to
update the MEMCG_VMALLOC stat.  However there is no need to look at all
the pages as all those pages will be charged to a single memcg or none.
Simplify the MEMCG_VMALLOC update by just looking at the first page of the
vmalloc region.

[shakeel.butt@linux.dev: add comment]
Link: https://lkml.kernel.org/r/bmlkdbqgwboyqrnxyom7n52fjmo76ux77jhqw5odc6c6dfon3h@zdylwtmlywbt
Link: https://lkml.kernel.org/r/20250403053326.26860-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/compaction: reduce the difference between low and high watermarks
Michal Clapinski [Fri, 4 Apr 2025 11:11:03 +0000 (13:11 +0200)]
mm/compaction: reduce the difference between low and high watermarks

Reduce the diff between low and high watermarks when compaction
proactiveness is set to high.  This allows users who set the proactiveness
really high to have more stable fragmentation score over time.

Link: https://lkml.kernel.org/r/20250404111103.1994507-3-mclapinski@google.com
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/compaction: remove low watermark cap for proactive compaction
Michal Clapinski [Fri, 4 Apr 2025 11:11:02 +0000 (13:11 +0200)]
mm/compaction: remove low watermark cap for proactive compaction

Patch series "mm/compaction: allow more aggressive proactive compaction",
v4.

Our goal is to keep memory usage of a VM low on the host.  For that
reason, we use free page reporting which by default reports free pages of
order 9 and larger to the host to be freed.  The feature works well only
if the memory in the guest is not fragmented below pages of order 9.
Proactive compaction can be reused to achieve defragmentation after some
parameter tweaking.

When the fragmentation score (lower is better) gets larger than the high
watermark, proactive compaction kicks in.  Compaction stops when the score
goes below the low watermark (or no progress is made and backoff kicks
in).  Let's define the difference between high and low watermarks as
leeway.  Before these changes, the minimum possible value for low
watermark was 5 and the leeway was hardcoded to 10 (so minimum possible
value for high watermark was 15).

To test this, I created a VM with 19GB of memory and free page reporting
enabled.  The VM was ~idle.  I meassured the memory usage from inside the
guest (/proc/meminfo) and from the host (provided by the hypervisor).

Before:
https://drive.google.com/file/d/1Xw23lRry_PgEH3f6QRnSGvoHh2u9UHyI/view?usp=sharing
After:
https://drive.google.com/file/d/1wMhpIzepx6t44F70yCPA50n1S5V2AT-a/view?usp=sharing

This patch (of 2):

Previously a min cap of 5 has been set in the commit introducing proactive
compaction.  This was to make sure users don't hurt themselves by setting
the proactiveness to 100 and making their system unresponsive.  But the
compaction mechanism has a backoff mechanism that will sleep for 30s if no
progress is made, so I don't see a significant risk here.  My system (19GB
of memory) has been perfectly fine with both watermarks hardcoded to 0.

Link: https://lkml.kernel.org/r/20250404111103.1994507-1-mclapinski@google.com
Link: https://lkml.kernel.org/r/20250404111103.1994507-2-mclapinski@google.com
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/page_alloc: simplify free_page_is_bad by removing free_page_is_bad_report
Ye Liu [Fri, 28 Mar 2025 01:20:31 +0000 (09:20 +0800)]
mm/page_alloc: simplify free_page_is_bad by removing free_page_is_bad_report

Refactor free_page_is_bad() to call bad_page() directly, removing the
intermediate free_page_is_bad_report(). This reduces unnecessary
indirection, improving code clarity and maintainability without changing
functionality.

Link: https://lkml.kernel.org/r/20250328012031.1204993-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agozram: modernize writeback interface
Sergey Senozhatsky [Thu, 27 Mar 2025 01:58:09 +0000 (10:58 +0900)]
zram: modernize writeback interface

The writeback interface supports a page_index=N parameter which performs
writeback of the given page.  Since we rarely need to writeback just one
single page, the typical use case involves a number of writeback calls,
each performing writeback of one page:

  echo page_index=100 > zram0/writeback
  ...
  echo page_index=200 > zram0/writeback
  echo page_index=500 > zram0/writeback
  ...
  echo page_index=700 > zram0/writeback

One obvious downside of this is that it increases the number of syscalls.
Less obvious, but a significantly more important downside, is that when
given only one page to post-process zram cannot perform an optimal target
selection.  This becomes a critical limitation when writeback_limit is
enabled, because under writeback_limit we want to guarantee the highest
memory savings hence we first need to writeback pages that release the
highest amount of zsmalloc pool memory.

This patch adds page_indexes=LOW-HIGH parameter to the writeback
interface:

  echo page_indexes=100-200 page_indexes=500-700 > zram0/writeback

This gives zram a chance to apply an optimal target selection strategy on
each iteration of the writeback loop.

We also now permit multiple page_index parameters per call (previously
zram would recognize only one page_index) and a mix or single pages and
page ranges:

  echo page_index=42 page_index=99 page_indexes=100-200 \
       page_indexes=500-700 > zram0/writeback

Apart from that the patch also unifies parameters passing and resembles
other "modern" zram device attributes (e.g.  recompression), while the old
interface used a mixed scheme: values-less parameters for mode and a
key=value format for page_index.  We still support the "old" value-less
format for compatibility reasons.

[senozhatsky@chromium.org: simplify parse_page_index() range checks, per Brian]
  nk: https://lkml.kernel.org/r/20250404015327.2427684-1-senozhatsky@chromium.org
[sozhatsky@chromium.org: fix uninitialized variable in zram_writeback_slots(), per Dan]
  nk: https://lkml.kernel.org/r/20250409112611.1154282-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20250327015818.4148660-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoselftests/mm: convert page_size to unsigned long
Siddarth G [Thu, 3 Apr 2025 10:13:45 +0000 (15:43 +0530)]
selftests/mm: convert page_size to unsigned long

Cppcheck warning:
int result is assigned to long long variable. If the variable is long long
to avoid loss of information, then you have loss of information.

This patch changes the type of page_size from 'unsigned int' to
'unsigned long' instead of using ULL suffixes. Changing hpage_size to
'unsigned long' was considered, but since gethugepage() expects an int,
this change was avoided.

Link: https://lkml.kernel.org/r/20250403101345.29226-1-siddarthsgml@gmail.com
Signed-off-by: Siddarth G <siddarthsgml@gmail.com>
Reported-by: David Binderman <dcb314@hotmail.com>
Closes: https://lore.kernel.org/all/AS8PR02MB10217315060BBFDB21F19643E9CA62@AS8PR02MB10217.eurprd02.prod.outlook.com/
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/show_mem: optimize si_meminfo_node by reducing redundant code
Ye Liu [Tue, 25 Mar 2025 07:38:03 +0000 (15:38 +0800)]
mm/show_mem: optimize si_meminfo_node by reducing redundant code

Refactors the si_meminfo_node() function by reducing redundant code and
improving readability.

Moved the calculation of managed_pages inside the existing loop that
processes pgdat->node_zones, eliminating the need for a separate loop.

Simplified the logic by removing unnecessary preprocessor conditionals.

Ensured that both totalram, totalhigh, and other memory statistics are
consistently set without duplication.

This change results in cleaner and more efficient code without altering
functionality.

Link: https://lkml.kernel.org/r/20250325073803.852594-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: annotate data race in update_hiwater_rss
Ignacio Encinas [Mon, 31 Mar 2025 19:57:05 +0000 (21:57 +0200)]
mm: annotate data race in update_hiwater_rss

mm_struct.hiwater_rss can be accessed concurrently without proper
synchronization as reported by KCSAN.

This data race is benign as it only affects accounting information.
Annotate it with data_race() to make KCSAN happy.

Link: https://lkml.kernel.org/r/20250331-mm-maxrss-data-race-v2-1-cf958e6205bf@iencinas.com
Signed-off-by: Ignacio Encinas <ignacio@iencinas.com>
Reported-by: syzbot+419c4b42acc36c420ad3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67e3390c.050a0220.1ec46.0001.GAE@google.com/
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/compaction: use folio in hugetlb pathway
Vishal Moola (Oracle) [Tue, 1 Apr 2025 02:10:25 +0000 (19:10 -0700)]
mm/compaction: use folio in hugetlb pathway

Use a folio in the hugetlb pathway during the compaction migrate-able
pageblock scan.

This removes a call to compound_head().

Link: https://lkml.kernel.org/r/20250401021025.637333-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoacpi,srat: give memory block size advice based on CFMWS alignment
Gregory Price [Mon, 27 Jan 2025 15:34:05 +0000 (10:34 -0500)]
acpi,srat: give memory block size advice based on CFMWS alignment

Capacity is stranded when CFMWS regions are not aligned to block size.  On
x86, block size increases with capacity (2G blocks @ 64G capacity).

Use CFMWS base/size to report memory block size alignment advice.

Link: https://lkml.kernel.org/r/20250127153405.3379117-4-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Bruno Faccini <bfaccini@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rrichter@amd.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agox86: probe memory block size advisement value during mm init
Gregory Price [Mon, 27 Jan 2025 15:34:04 +0000 (10:34 -0500)]
x86: probe memory block size advisement value during mm init

Systems with hotplug may provide an advisement value on what the memblock
size should be.  Probe this value when the rest of the configuration
values are considered.

The new heuristic is as follows

1) set_memory_block_size_order value if already set (cmdline param)
2) minimum block size if memory is less than large block limit
3) if no hotplug advice: Max block size if system is bare-metal,
   otherwise use end of memory alignment.
4) if hotplug advice: lesser of advice and end of memory alignment.

Convert to cpu_feature_enabled() while at it.[1]

[1] https://lore.kernel.org/all/20241031103401.GBZyNdGQ-ZyXKyzC_z@fat_crate.local/

Link: https://lkml.kernel.org/r/20250127153405.3379117-3-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Borislav Petkov <bp@alien8.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Bruno Faccini <bfaccini@nvidia.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Robert Richter <rrichter@amd.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomemory: implement memory_block_advise/probe_max_size
Gregory Price [Mon, 27 Jan 2025 15:34:03 +0000 (10:34 -0500)]
memory: implement memory_block_advise/probe_max_size

Patch series "memory,x86,acpi: hotplug memory alignment advisement", v8.

When physical address regions are not aligned to memory block size, the
misaligned portion is lost (stranded capacity).

Block size (min/max/selected) is architecture defined.  Most architectures
tend to use the minimum block size or some simplistic heurist.  On x86,
memory block size increases up to 2GB, and is otherwise fitted to the
alignment of non-hotplug (i.e.  not special purpose memory).

CXL exposes its memory for management through the ACPI CEDT (CXL Early
Detection Table) in a field called the CXL Fixed Memory Window.  Per the
CXL specification, this memory must be aligned to at least 256MB.

When a CFMW aligns on a size less than the block size, this causes a loss
of up to 2GB per CFMW on x86.  It is not uncommon for CFMW to be allocated
per-device - though this behavior is BIOS defined.

This patch set provides 3 things:
 1) implement advise/query functions in driverse/base/memory.c to
    report/query architecture agnostic hotplug block alignment advice.
 2) update x86 memblock size logic to consider the hotplug advice
 3) add code in acpi/numa/srat.c to report CFMW alignment advice

The advisement interfaces are design to be called during arch_init code
prior to allocator and smp_init.  start_kernel will call these through
setup_arch() (via acpi and mm/init_64.c on x86), which occurs prior to
mm_core_init and smp_init - so no need for atomics.

There's an attempt to signal callers to advise() that query has already
occurred, but this is predicated on the notion that query actually occurs
(which presently only happens on the x86 arch).  This is to assist
debugging future users.  Otherwise, the advise() call has been marked
__init to help static discovery of bad call times.

Once query is called the first time, it will always return the same value.

Interfaces return -EBUSY and 0 respectively on systems without hotplug.

This patch (of 3):

Hotplug memory sources may have opinions on what the memblock size should
be - usually for alignment purposes.  For example, CXL memory extents can
be 256MB with a matching alignment.  If this size/alignment is smaller
than the block size, it can result in stranded capacity.

Implement memory_block_advise_max_size for use prior to allocator init,
for software to advise the system on the max block size.

Implement memory_block_probe_max_size for use by arch init code to
calculate the best block size.  Use of advice is architecture defined.

The probe value can never change after first probe.  Calls to advise after
probe will return -EBUSY to aid debugging.

On systems without hotplug, always return -ENODEV and 0 respectively.

Link: https://lkml.kernel.org/r/20250127153405.3379117-1-gourry@gourry.net
Link: https://lkml.kernel.org/r/20250127153405.3379117-2-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Bruno Faccini <bfaccini@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Robert Richter <rrichter@amd.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: page_alloc: remove redundant READ_ONCE
Songtang Liu [Wed, 2 Apr 2025 07:41:16 +0000 (03:41 -0400)]
mm: page_alloc: remove redundant READ_ONCE

In the current code, batch is a local variable, and it cannot be
concurrently modified.  It's unnecessary to use READ_ONCE here, so remove
it.

Link: https://lkml.kernel.org/r/CAA=HWd1kn01ym8YuVFuAqK2Ggq3itEGkqX8T6eCXs_C7tiv-Jw@mail.gmail.com
Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
Signed-off-by: Songtang Liu <liusongtang@bytedance.com>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomemcg, oom: do not bypass oom killer for dying tasks
Michal Hocko [Wed, 2 Apr 2025 09:01:17 +0000 (11:01 +0200)]
memcg, oom: do not bypass oom killer for dying tasks

7775face2079 ("memcg: killed threads should not invoke memcg OOM killer")
has added a bypass of the oom killer path for dying threads because a very
specific workload (described in the changelog) could hit "no killable
tasks" path.  This itself is not fatal condition but it could be annoying
if this was a common case.

On the other hand the bypass has some issues on its own.  Without
triggering oom killer we won't be able to trigger async oom reclaim
(oom_reaper) which can operate on killed tasks as well as long as they
still have their mm available.  This could be the case during futex
cleanup when the memory as pointed out by Johannes in [1].  The said case
is still not fully understood but let's drop this bypass that was mostly
driven by an artificial workload and allow dying tasks to go into oom
path.  This will make the code easier to reason about and also help corner
cases where oom_reaper could help to release memory.

Link: https://lore.kernel.org/all/20241212183012.GB1026@cmpxchg.org/T/#u
Link: https://lkml.kernel.org/r/20250402090117.130245-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agozsmalloc: prefer the the original page's node for compressed data
Nhat Pham [Wed, 2 Apr 2025 20:44:16 +0000 (13:44 -0700)]
zsmalloc: prefer the the original page's node for compressed data

Currently, zsmalloc, zswap's and zram's backend memory allocator, does not
enforce any policy for the allocation of memory for the compressed data,
instead just adopting the memory policy of the task entering reclaim, or
the default policy (prefer local node) if no such policy is specified.
This can lead to several pathological behaviors in multi-node NUMA
systems:

1. Systems with CXL-based memory tiering can encounter the following
   inversion with zswap/zram: the coldest pages demoted to the CXL tier
   can return to the high tier when they are reclaimed to compressed swap,
   creating memory pressure on the high tier.

2. Consider a direct reclaimer scanning nodes in order of allocation
   preference.  If it ventures into remote nodes, the memory it compresses
   there should stay there.  Trying to shift those contents over to the
   reclaiming thread's preferred node further *increases* its local
   pressure, and provoking more spills.  The remote node is also the most
   likely to refault this data again.  This undesirable behavior was
   pointed out by Johannes Weiner in [1].

3. For zswap writeback, the zswap entries are organized in
   node-specific LRUs, based on the node placement of the original pages,
   allowing for targeted zswap writeback for specific nodes.

   However, the compressed data of a zswap entry can be placed on a
   different node from the LRU it is placed on.  This means that reclaim
   targeted at one node might not free up memory used for zswap entries in
   that node, but instead reclaiming memory in a different node.

All of these issues will be resolved if the compressed data go to the same
node as the original page.  This patch encourages this behavior by having
zswap and zram pass the node of the original page to zsmalloc, and have
zsmalloc prefer the specified node if we need to allocate new (zs)pages
for the compressed data.

Note that we are not strictly binding the allocation to the preferred
node.  We still allow the allocation to fall back to other nodes when the
preferred node is full, or if we have zspages with slots available on a
different node.  This is OK, and still a strict improvement over the
status quo:

1. On a system with demotion enabled, we will generally prefer
   demotions over compressed swapping, and only swap when pages have
   already gone to the lowest tier.  This patch should achieve the desired
   effect for the most part.

2. If the preferred node is out of memory, letting the compressed data
   going to other nodes can be better than the alternative (OOMs, keeping
   cold memory unreclaimed, disk swapping, etc.).

3. If the allocation go to a separate node because we have a zspage
   with slots available, at least we're not creating extra immediate
   memory pressure (since the space is already allocated).

3. While there can be mixings, we generally reclaim pages in same-node
   batches, which encourage zspage grouping that is more likely to go to
   the right node.

4. A strict binding would require partitioning zsmalloc by node, which
   is more complicated, and more prone to regression, since it reduces the
   storage density of zsmalloc.  We need to evaluate the tradeoff and
   benchmark carefully before adopting such an involved solution.

[1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/

[senozhatsky@chromium.org: coding-style fixes]
Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge
Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram, zsmalloc]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> [zswap/zsmalloc]
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: delete thp_nr_pages()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:10 +0000 (22:06 +0100)]
mm: delete thp_nr_pages()

All callers now use folio_nr_pages().  Delete this wrapper.

Link: https://lkml.kernel.org/r/20250402210612.2444135-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agofilemap: remove readahead_page_batch()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:09 +0000 (22:06 +0100)]
filemap: remove readahead_page_batch()

This function has no more callers; delete it.

Link: https://lkml.kernel.org/r/20250402210612.2444135-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agofilemap: convert __readahead_batch() to use a folio
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:08 +0000 (22:06 +0100)]
filemap: convert __readahead_batch() to use a folio

Extract folios from i_mapping, not pages.  Removes a hidden call to
compound_head(), a use of thp_nr_pages() and an unnecessary assertion that
we didn't find a tail page in the page cache.

Link: https://lkml.kernel.org/r/20250402210612.2444135-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agofilemap: remove find_subpage()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:07 +0000 (22:06 +0100)]
filemap: remove find_subpage()

All users of this function now call folio_file_page() instead.  Delete it.

Link: https://lkml.kernel.org/r/20250402210612.2444135-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoiov_iter: convert iov_iter_extract_xarray_pages() to use folios
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:06 +0000 (22:06 +0100)]
iov_iter: convert iov_iter_extract_xarray_pages() to use folios

ITER_XARRAY is exclusively used with xarrays that contain folios, not
pages, so extract folio pointers from it, not page pointers.  Removes a
use of find_subpage().

Link: https://lkml.kernel.org/r/20250402210612.2444135-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoiov_iter: convert iter_xarray_populate_pages() to use folios
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:05 +0000 (22:06 +0100)]
iov_iter: convert iter_xarray_populate_pages() to use folios

ITER_XARRAY is exclusively used with xarrays that contain folios, not
pages, so extract folio pointers from it, not page pointers.  Removes a
hidden call to compound_head() and a use of find_subpage().

Link: https://lkml.kernel.org/r/20250402210612.2444135-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: remove offset_in_thp()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:04 +0000 (22:06 +0100)]
mm: remove offset_in_thp()

All callers have been converted to call offset_in_folio().

Link: https://lkml.kernel.org/r/20250402210612.2444135-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agofilemap: remove readahead_page()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 21:06:03 +0000 (22:06 +0100)]
filemap: remove readahead_page()

Patch series "Misc folio patches for 6.16".

Remove a few APIs that we've converted everybody from using.  I also found
a few places that extract a page pointer from i_pages, which will be an
invalid thing to do when we separate pages from folios.

This patch (of 8):

All filesystems have now been converted to call readahead_folio() so we
can delete this wrapper.

Link: https://lkml.kernel.org/r/20250402210612.2444135-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250402210612.2444135-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoarch: remove mk_pmd()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:17:05 +0000 (19:17 +0100)]
arch: remove mk_pmd()

There are now no callers of mk_huge_pmd() and mk_pmd().  Remove them.

Link: https://lkml.kernel.org/r/20250402181709.2386022-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: add folio_mk_pmd()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:17:04 +0000 (19:17 +0100)]
mm: add folio_mk_pmd()

Removes five conversions from folio to page.  Also removes both callers of
mk_pmd() that aren't part of mk_huge_pmd(), getting us a step closer to
removing the confusion between mk_pmd(), mk_huge_pmd() and pmd_mkhuge().

Link: https://lkml.kernel.org/r/20250402181709.2386022-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: remove mk_huge_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:17:03 +0000 (19:17 +0100)]
mm: remove mk_huge_pte()

The only remaining user of mk_huge_pte() is the debug code, so remove the
API and replace its use with pfn_pte() which lets us remove the conversion
to a page first.  We should always call arch_make_huge_pte() to turn this
PTE into a huge PTE before operating on it with huge_pte_mkdirty() etc.

Link: https://lkml.kernel.org/r/20250402181709.2386022-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agohugetlb: simplify make_huge_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:17:02 +0000 (19:17 +0100)]
hugetlb: simplify make_huge_pte()

mk_huge_pte() is a bad API.  Despite its name, it creates a normal PTE
which is later transformed into a huge PTE by arch_make_huge_pte().  So
replace the page argument with a folio argument and call folio_mk_pte()
instead.  Then, because we now know this is a regular PTE rather than a
huge one, use pte_mkdirty() instead of huge_pte_mkdirty() (and similar
functions).

Link: https://lkml.kernel.org/r/20250402181709.2386022-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: add folio_mk_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:17:01 +0000 (19:17 +0100)]
mm: add folio_mk_pte()

Remove a cast from folio to page in four callers of mk_pte().

Link: https://lkml.kernel.org/r/20250402181709.2386022-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: make mk_pte() definition unconditional
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:17:00 +0000 (19:17 +0100)]
mm: make mk_pte() definition unconditional

All architectures now use the common mk_pte() definition, so we can remove
the condition.

Link: https://lkml.kernel.org/r/20250402181709.2386022-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoum: remove custom definition of mk_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:16:59 +0000 (19:16 +0100)]
um: remove custom definition of mk_pte()

Move the pfn_pte() definitions from the 2level and 4level files to the
generic pgtable.h and delete the custom definition of mk_pte() so that we
use the central definition.

Link: https://lkml.kernel.org/r/20250402181709.2386022-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agox86: remove custom definition of mk_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:16:58 +0000 (19:16 +0100)]
x86: remove custom definition of mk_pte()

Move the shadow stack check to pfn_pte() which lets us use the common
definition of mk_pte().

Link: https://lkml.kernel.org/r/20250402181709.2386022-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agosparc32: remove custom definition of mk_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:16:57 +0000 (19:16 +0100)]
sparc32: remove custom definition of mk_pte()

Instead of defining pfn_pte() in terms of mk_pte(), make pfn_pte() the
base implementation.  That lets us use the generic definition of mk_pte().

Link: https://lkml.kernel.org/r/20250402181709.2386022-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: introduce a common definition of mk_pte()
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:16:56 +0000 (19:16 +0100)]
mm: introduce a common definition of mk_pte()

Most architectures simply call pfn_pte().  Centralise that as the normal
definition and remove the definition of mk_pte() from the architectures
which have either that exact definition or something similar.

Link: https://lkml.kernel.org/r/20250402181709.2386022-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390
Cc: Zi Yan <ziy@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: set the pte dirty if the folio is already dirty
Matthew Wilcox (Oracle) [Wed, 2 Apr 2025 18:16:55 +0000 (19:16 +0100)]
mm: set the pte dirty if the folio is already dirty

Patch series "Add folio_mk_pte()", v2.

Today if you have a folio and want to create a PTE that points to the
first page in it, you have to convert from a folio to a page.  That's
zero-cost today but will be more expensive in the future.

I didn't want to add folio_mk_pte() to each architecture, and I didn't
want to lose any optimisations that architectures have from their own
implementation of mk_pte().  Fortunately, most architectures have by now
turned their mk_pte() into a fairly bland variant of pfn_pte(), but s390
has a special optimisation that needs to be moved into generic code in the
first patch.

At the end of this patch set, we have mk_pte() and folio_mk_pte() in mm.h
and each architecture only has to implement pfn_pte().  We've also
eliminated mk_huge_pte(), mk_huge_pmd() and mk_pmd().

This patch (of 11):

If the first access to a folio is a read that is then followed by a write,
we can save a page fault.  s390 implemented this in their mk_pte() in
commit abf09bed3cce ("s390/mm: implement software dirty bits"), but other
architectures can also benefit from this.

Link: https://lkml.kernel.org/r/20250402181709.2386022-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250402181709.2386022-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> # for s390
Cc: Zi Yan <ziy@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: fix ratelimit_pages update error in dirty_ratio_handler()
Jinliang Zheng [Tue, 15 Apr 2025 09:02:32 +0000 (17:02 +0800)]
mm: fix ratelimit_pages update error in dirty_ratio_handler()

In dirty_ratio_handler(), vm_dirty_bytes must be set to zero before
calling writeback_set_ratelimit(), as global_dirty_limits() always
prioritizes the value of vm_dirty_bytes.

It's domain_dirty_limits() that's relevant here, not node_dirty_ok:

  dirty_ratio_handler
    writeback_set_ratelimit
      global_dirty_limits(&dirty_thresh)           <- ratelimit_pages based on dirty_thresh
        domain_dirty_limits
          if (bytes)                               <- bytes = vm_dirty_bytes <--------+
            thresh = f1(bytes)                     <- prioritizes vm_dirty_bytes      |
          else                                                                        |
            thresh = f2(ratio)                                                        |
      ratelimit_pages = f3(dirty_thresh)                                              |
    vm_dirty_bytes = 0                             <- it's late! ---------------------+

This causes ratelimit_pages to still use the value calculated based on
vm_dirty_bytes, which is wrong now.

The impact visible to userspace is difficult to capture directly because
there is no procfs/sysfs interface exported to user space.  However, it
will have a real impact on the balance of dirty pages.

For example:

1. On default, we have vm_dirty_ratio=40, vm_dirty_bytes=0

2. echo 8192 > dirty_bytes, then vm_dirty_bytes=8192,
   vm_dirty_ratio=0, and ratelimit_pages is calculated based on
   vm_dirty_bytes now.

3. echo 20 > dirty_ratio, then since vm_dirty_bytes is not reset to
   zero when writeback_set_ratelimit() -> global_dirty_limits() ->
   domain_dirty_limits() is called, reallimit_pages is still calculated
   based on vm_dirty_bytes instead of vm_dirty_ratio.  This does not
   conform to the actual intent of the user.

Link: https://lkml.kernel.org/r/20250415090232.7544-1-alexjlzheng@tencent.com
Fixes: 9d823e8f6b1b ("writeback: per task dirty rate limit")
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: MengEn Sun <mengensun@tencent.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Fenggaung Wu <fengguang.wu@intel.com>
Cc: Jinliang Zheng <alexjlzheng@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoMerge branch 'mm-hotfixes-stable' into mm-stable in order to successfully
Andrew Morton [Mon, 12 May 2025 00:45:12 +0000 (17:45 -0700)]
Merge branch 'mm-hotfixes-stable' into mm-stable in order to successfully
merge mm-unstable's "mm: add folio_mk_pte()",

2 months agomm: userfaultfd: correct dirty flags set for both present and swap pte
Barry Song [Thu, 8 May 2025 22:09:12 +0000 (10:09 +1200)]
mm: userfaultfd: correct dirty flags set for both present and swap pte

As David pointed out, what truly matters for mremap and userfaultfd move
operations is the soft dirty bit.  The current comment and
implementation—which always sets the dirty bit for present PTEs and
fails to set the soft dirty bit for swap PTEs—are incorrect.  This could
break features like Checkpoint-Restore in Userspace (CRIU).

This patch updates the behavior to correctly set the soft dirty bit for
both present and swap PTEs in accordance with mremap.

Link: https://lkml.kernel.org/r/20250508220912.7275-1-21cnbao@gmail.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reported-by: David Hildenbrand <david@redhat.com>
Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redhat.com/
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agozsmalloc: don't underflow size calculation in zs_obj_write()
Sergey Senozhatsky [Wed, 7 May 2025 05:42:24 +0000 (14:42 +0900)]
zsmalloc: don't underflow size calculation in zs_obj_write()

Do not mix class->size and object size during offsets/sizes calculation in
zs_obj_write().  Size classes can merge into clusters, based on
objects-per-zspage and pages-per-zspage characteristics, so some size
classes can store objects smaller than class->size.  This becomes
problematic when object size is much smaller than class->size.  zsmalloc
can falsely decide that object spans two physical pages, because a larger
class->size value is used for that check, while the actual object is much
smaller and fits the free space of the first physical page, so there is
nothing to write to the second page and memcpy() size calculation
underflows.

 Unable to handle kernel paging request at virtual address ffffc00081ff4000
 pc : __memcpy+0x10/0x24
 lr : zs_obj_write+0x1b0/0x1d0 [zsmalloc]
 Call trace:
  __memcpy+0x10/0x24 (P)
  zram_write_page+0x150/0x4fc [zram]
  zram_submit_bio+0x5e0/0x6a4 [zram]
  __submit_bio+0x168/0x220
  submit_bio_noacct_nocheck+0x128/0x2c8
  submit_bio_noacct+0x19c/0x2f8

This is mostly seen on system with larger page-sizes, because size class
cluters of such systems hold wider size ranges than on 4K PAGE_SIZE
systems.

Assume a 16K PAGE_SIZE system, a write of 820 bytes object to a 864-bytes
size class at offset 15560.  15560 + 864 is more than 16384 so zsmalloc
attempts to memcpy() it to two physical pages.  However, 16384 - 15560 =
824 which is more than 820, so the object in fact doesn't span two
physical pages, and there is no data to write to the second physical page.

We always know the exact size in bytes of the object that we are about to
write (store), so use it instead of class->size.

Link: https://lkml.kernel.org/r/20250507054312.4135983-1-senozhatsky@chromium.org
Fixes: 44f76413496e ("zsmalloc: introduce new object mapping API")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Igor Belousov <igor.b@beldev.am>
Tested-by: Igor Belousov <igor.b@beldev.am>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/page_alloc: fix race condition in unaccepted memory handling
Kirill A. Shutemov [Tue, 6 May 2025 13:32:07 +0000 (16:32 +0300)]
mm/page_alloc: fix race condition in unaccepted memory handling

The page allocator tracks the number of zones that have unaccepted memory
using static_branch_enc/dec() and uses that static branch in hot paths to
determine if it needs to deal with unaccepted memory.

Borislav and Thomas pointed out that the tracking is racy: operations on
static_branch are not serialized against adding/removing unaccepted pages
to/from the zone.

Sanity checks inside static_branch machinery detects it:

WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0

The comment around the WARN() explains the problem:

/*
 * Warn about the '-1' case though; since that means a
 * decrement is concurrent with a first (0->1) increment. IOW
 * people are trying to disable something that wasn't yet fully
 * enabled. This suggests an ordering problem on the user side.
 */

The effect of this static_branch optimization is only visible on
microbenchmark.

Instead of adding more complexity around it, remove it altogether.

Link: https://lkml.kernel.org/r/20250506133207.1009676-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Link: https://lore.kernel.org/all/20250506092445.GBaBnVXXyvnazly6iF@fat_crate.local
Reported-by: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory
Kirill A. Shutemov [Tue, 6 May 2025 11:25:08 +0000 (14:25 +0300)]
mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory

try_alloc_pages() will not attempt to allocate memory if the system has
*any* unaccepted memory.  Memory is accepted as needed and can remain in
the system indefinitely, causing the interface to always fail.

Rather than immediately giving up, attempt to use already accepted memory
on free lists.

Pass 'alloc_flags' to cond_accept_memory() and do not accept new memory
for ALLOC_TRYLOCK requests.

Found via code inspection - only BPF uses this at present and the
runtime effects are unclear.

Link: https://lkml.kernel.org/r/20250506112509.905147-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoMAINTAINERS: add mm GUP section
Lorenzo Stoakes [Tue, 6 May 2025 17:36:01 +0000 (18:36 +0100)]
MAINTAINERS: add mm GUP section

As part of the ongoing efforts to sub-divide memory management
maintainership and reviewership, establish a section for GUP (Get User
Pages) support and add appropriate maintainers and reviewers.

Link: https://lkml.kernel.org/r/20250506173601.97562-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/codetag: move tag retrieval back upfront in __free_pages()
David Wang [Mon, 5 May 2025 19:30:34 +0000 (03:30 +0800)]
mm/codetag: move tag retrieval back upfront in __free_pages()

Commit 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled()
 checks") introduces a possible use-after-free scenario, when page
is non-compound, page[0] could be released by other thread right
after put_page_testzero failed in current thread, pgalloc_tag_sub_pages
afterwards would manipulate an invalid page for accounting remaining
pages:

[timeline]   [thread1]                     [thread2]
  |          alloc_page non-compound
  V
  |                                        get_page, rf counter inc
  V
  |          in ___free_pages
  |          put_page_testzero fails
  V
  |                                        put_page, page released
  V
  |          in ___free_pages,
  |          pgalloc_tag_sub_pages
  |          manipulate an invalid page
  V

Restore __free_pages() to its state before, retrieve alloc tag
beforehand.

Link: https://lkml.kernel.org/r/20250505193034.91682-1-00107082@163.com
Fixes: 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled() checks")
Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/memory: fix mapcount / refcount sanity check for mTHP reuse
Kairui Song [Fri, 25 Apr 2025 07:43:25 +0000 (15:43 +0800)]
mm/memory: fix mapcount / refcount sanity check for mTHP reuse

The following WARNING was triggered during swap stress test with mTHP
enabled:

[ 6609.335758] ------------[ cut here ]------------
[ 6609.337758] WARNING: CPU: 82 PID: 755116 at mm/memory.c:3794 do_wp_page+0x1084/0x10e0
[ 6609.340922] Modules linked in: zram virtiofs
[ 6609.342699] CPU: 82 UID: 0 PID: 755116 Comm: sh Kdump: loaded Not tainted 6.15.0-rc1+ #1429 PREEMPT(voluntary)
[ 6609.347620] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 6609.349909] RIP: 0010:do_wp_page+0x1084/0x10e0
[ 6609.351532] Code: ff ff 48 c7 c6 80 ba 49 82 4c 89 ef e8 95 fd fe ff 0f 0b bd f5 ff ff ff e9 43 fb ff ff 41 83 a9 bc 12 00 00 01 e9 5c fb ff ff <0f> 0b e9 a6 fc ff ff 65 ff 00 f0 48 0f b
a 6d 00 1f 0f 83 82 fc ff
[ 6609.357959] RSP: 0000:ffffc90002273d40 EFLAGS: 00010287
[ 6609.359915] RAX: 000000000000000f RBX: 0000000000000000 RCX: 000fffffffe00000
[ 6609.362606] RDX: 0000000000000010 RSI: 000055a119ac1000 RDI: ffffea000ae6ec00
[ 6609.365143] RBP: ffffea000ae6ec68 R08: 84000002b9bb1025 R09: 000055a119ab6000
[ 6609.367569] R10: ffff8881caa2ad80 R11: 0000000000000000 R12: ffff8881caa2ad80
[ 6609.370255] R13: ffffea000ae6ec00 R14: 000055a119ac1c9c R15: ffffc90002273dd8
[ 6609.373007] FS:  00007f08e467f740(0000) GS:ffff88a07c214000(0000) knlGS:0000000000000000
[ 6609.375999] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6609.377946] CR2: 000055a119ac1c9c CR3: 00000001adfd6005 CR4: 0000000000770eb0
[ 6609.380376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6609.382853] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6609.385216] PKRU: 55555554
[ 6609.386141] Call Trace:
[ 6609.387017]  <TASK>
[ 6609.387718]  ? ___pte_offset_map+0x1b/0x110
[ 6609.389056]  __handle_mm_fault+0xa51/0xf00
[ 6609.390363]  ? exc_page_fault+0x6a/0x140
[ 6609.391629]  handle_mm_fault+0x13d/0x360
[ 6609.392856]  do_user_addr_fault+0x2f2/0x7f0
[ 6609.394160]  ? sigprocmask+0x77/0xa0
[ 6609.395375]  exc_page_fault+0x6a/0x140
[ 6609.396735]  asm_exc_page_fault+0x26/0x30
[ 6609.398224] RIP: 0033:0x55a1050bc18b
[ 6609.399567] Code: 8b 3f 4d 85 ff 74 40 41 39 5f 18 75 f2 49 8b 7f 08 44 38 27 75 e9 4c 89 c6 4c 89 45 c8 e8 bd 83 fa ff 4c 8b 45 c8 85 c0 75 d5 <41> 83 47 1c 01 48 83 c4 28 4c 89 f8 5b 4
1 5c 41 5d 41 5e 41 5f 5d
[ 6609.405971] RSP: 002b:00007ffcf5f37d90 EFLAGS: 00010246
[ 6609.407737] RAX: 0000000000000000 RBX: 00000000182768fa RCX: 0000000000000000
[ 6609.410151] RDX: 00000000000000fa RSI: 000055a105175c7b RDI: 000055a119ac1c60
[ 6609.412606] RBP: 00007ffcf5f37de0 R08: 000055a105175c7b R09: 0000000000000000
[ 6609.414998] R10: 000000004d2dfb5a R11: 0000000000000246 R12: 0000000000000050
[ 6609.417193] R13: 00000000000000fa R14: 000055a119abaf60 R15: 000055a119ac1c80
[ 6609.419268]  </TASK>
[ 6609.419928] ---[ end trace 0000000000000000 ]---

The WARN_ON here is simply incorrect.  The refcount here must be at least
the mapcount, not the opposite.  Each mapcount must have a corresponding
refcount, but the refcount may increase if other components grab the
folio, which is acceptable.  Meanwhile, having a mapcount larger than
refcount is a real problem.

So fix the WARN_ON condition.

Link: https://lkml.kernel.org/r/20250425074325.61833-1-ryncsn@gmail.com
Fixes: 1da190f4d0a6 ("mm: Copy-on-Write (COW) reuse support for PTE-mapped THP")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reported-by: Kairui Song <kasong@tencent.com>
Closes: https://lore.kernel.org/all/CAMgjq7D+ea3eg9gRCVvRnto3Sv3_H3WVhupX4e=k8T5QAfBHbw@mail.gmail.com/
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agokernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
David Hildenbrand [Tue, 22 Apr 2025 14:49:42 +0000 (16:49 +0200)]
kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()

Not intuitive, but vm_area_dup() located in kernel/fork.c is not only used
for duplicating VMAs during fork(), but also for duplicating VMAs when
splitting VMAs or when mremap()'ing them.

VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
size) and apparently also shrunk during mremap(), which implies
duplicating the VMA in __split_vma() first.

In case of ordinary mremap() (no change in size), we first duplicate the
VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
the old VMA: we effectively move the VM_PAT reservation.  So the
untrack_pfn_clear() call on the new VMA duplicating is wrong in that
context.

Splitting of VMAs seems problematic, because we don't duplicate/adjust the
reservation when splitting the VMA.  Instead, in memtype_erase() -- called
during zapping/munmap -- we shrink a reservation in case only the end
address matches: Assume we split a VMA into A and B, both would share a
reservation until B is unmapped.

So when unmapping B, the reservation would be updated to cover only A.
When unmapping A, we would properly remove the now-shrunk reservation.
That scenario describes the mremap() shrinking (old_size > new_size),
where we split + unmap B, and the untrack_pfn_clear() on the new VMA when
is wrong.

What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first?
It would be broken because we would never free the reservation.  Likely,
there are ways to trigger such a VMA split outside of mremap().

Affecting other VMA duplication was not intended, vm_area_dup() being used
outside of kernel/fork.c was an oversight.  So let's fix that for; how to
handle VMA splits better should be investigated separately.

With a simple reproducer that uses mprotect() to split such a VMA I can
trigger

x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]

Link: https://lkml.kernel.org/r/20250422144942.2871395-1-david@redhat.com
Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: hugetlb: fix incorrect fallback for subpool
Wupeng Ma [Thu, 10 Apr 2025 06:26:33 +0000 (14:26 +0800)]
mm: hugetlb: fix incorrect fallback for subpool

During our testing with hugetlb subpool enabled, we observe that
hstate->resv_huge_pages may underflow into negative values.  Root cause
analysis reveals a race condition in subpool reservation fallback handling
as follow:

hugetlb_reserve_pages()
    /* Attempt subpool reservation */
    gbl_reserve = hugepage_subpool_get_pages(spool, chg);

    /* Global reservation may fail after subpool allocation */
    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

out_put_pages:
    /* This incorrectly restores reservation to subpool */
    hugepage_subpool_put_pages(spool, chg);

When hugetlb_acct_memory() fails after subpool allocation, the current
implementation over-commits subpool reservations by returning the full
'chg' value instead of the actual allocated 'gbl_reserve' amount.  This
discrepancy propagates to global reservations during subsequent releases,
eventually causing resv_huge_pages underflow.

This problem can be trigger easily with the following steps:
1. reverse hugepage for hugeltb allocation
2. mount hugetlbfs with min_size to enable hugetlb subpool
3. alloc hugepages with two task(make sure the second will fail due to
   insufficient amount of hugepages)
4. with for a few seconds and repeat step 3 which will make
   hstate->resv_huge_pages to go below zero.

To fix this problem, return corrent amount of pages to subpool during the
fallback after hugepage_subpool_get_pages is called.

Link: https://lkml.kernel.org/r/20250410062633.3102457-1-mawupeng1@huawei.com
Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ma Wupeng <mawupeng1@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoLinux 6.15-rc6 v6.15-rc6
Linus Torvalds [Sun, 11 May 2025 21:54:11 +0000 (14:54 -0700)]
Linux 6.15-rc6

2 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 11 May 2025 18:30:13 +0000 (11:30 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Avoid use of uninitialized memcache pointer in user_mem_abort()

   - Always set HCR_EL2.xMO bits when running in VHE, allowing
     interrupts to be taken while TGE=0 and fixing an ugly bug on
     AmpereOne that occurs when taking an interrupt while clearing the
     xMO bits (AC03_CPU_36)

   - Prevent VMMs from hiding support for AArch64 at any EL virtualized
     by KVM

   - Save/restore the host value for HCRX_EL2 instead of restoring an
     incorrect fixed value

   - Make host_stage2_set_owner_locked() check that the entire requested
     range is memory rather than just the first page

  RISC-V:

   - Add missing reset of smstateen CSRs

  x86:

   - Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid
     causing problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to
     sanitize the VMCB as its state is undefined after SHUTDOWN,
     emulating INIT is the least awful choice).

   - Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
     KVM doesn't goof a sanity check in the future.

   - Free obsolete roots when (re)loading the MMU to fix a bug where
     pre-faulting memory can get stuck due to always encountering a
     stale root.

   - When dumping GHCB state, use KVM's snapshot instead of the raw GHCB
     page to print state, so that KVM doesn't print stale/wrong
     information.

   - When changing memory attributes (e.g. shared <=> private), add
     potential hugepage ranges to the mmu_invalidate_range_{start,end}
     set so that KVM doesn't create a shared/private hugepage when the
     the corresponding attributes will become mixed (the attributes are
     commited *after* KVM finishes the invalidation).

   - Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM
     has at least one active VM. Effectively BP_SPEC_REDUCE when KVM is
     loaded led to very measurable performance regressions for non-KVM
     workloads"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions
  KVM: arm64: Fix memory check in host_stage2_set_owner_locked()
  KVM: arm64: Kill HCRX_HOST_FLAGS
  KVM: arm64: Properly save/restore HCRX_EL2
  KVM: arm64: selftest: Don't try to disable AArch64 support
  KVM: arm64: Prevent userspace from disabling AArch64 support at any virtualisable EL
  KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
  KVM: arm64: Fix uninitialized memcache pointer in user_mem_abort()
  KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing
  KVM: SVM: Update dump_ghcb() to use the GHCB snapshot fields
  KVM: RISC-V: reset smstateen CSRs
  KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()
  KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run()
  KVM: SVM: Forcibly leave SMM mode on SHUTDOWN interception

2 months agoMerge tag 'mips-fixes_6.15_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips...
Linus Torvalds [Sun, 11 May 2025 18:19:52 +0000 (11:19 -0700)]
Merge tag 'mips-fixes_6.15_1' of git://git./linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:

 - Fix delayed timers

 - Fix NULL pointer deref

 - Fix wrong range check

* tag 'mips-fixes_6.15_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: Fix MAX_REG_OFFSET
  MIPS: CPS: Fix potential NULL pointer dereferences in cps_prepare_cpus()
  MIPS: rename rollback_handler with skipover_handler
  MIPS: Move r4k_wait() to .cpuidle.text section
  MIPS: Fix idle VS timer enqueue

2 months agoMerge tag 'x86-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 11 May 2025 18:08:55 +0000 (11:08 -0700)]
Merge tag 'x86-urgent-2025-05-11' of git://git./linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
 "Fix a boot regression on very old x86 CPUs without CPUID support"

* tag 'x86-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Consolidate the loader enablement checking

2 months agoMerge tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 11 May 2025 17:33:25 +0000 (10:33 -0700)]
Merge tag 'timers-urgent-2025-05-11' of git://git./linux/kernel/git/tip/tip

Pull misc timers fixes from Ingo Molnar:

 - Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks

 - Work around absolute relocations into vDSO code that GCC erroneously
   emits in certain arm64 build environments

 - Fix a false positive lockdep warning in the i8253 clocksource driver

* tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/i8253: Use raw_spinlock_irqsave() in clockevent_i8253_disable()
  arm64: vdso: Work around invalid absolute relocations from GCC
  timekeeping: Prevent coarse clocks going backwards

2 months agoMerge tag 'input-for-v6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 11 May 2025 17:29:29 +0000 (10:29 -0700)]
Merge tag 'input-for-v6.15-rc5' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

 - Synaptics touchpad on multiple laptops (Dynabook Portege X30L-G,
   Dynabook Portege X30-D, TUXEDO InfinityBook Pro 14 v5, Dell Precision
   M3800, HP Elitebook 850 G1) switched from PS/2 to SMBus mode

 - a number of new controllers added to xpad driver: HORI Drum
   controller, PowerA Fusion Pro 4, PowerA MOGA XP-Ultra controller,
   8BitDo Ultimate 2 Wireless Controller, 8BitDo Ultimate 3-mode
   Controller, Hyperkin DuchesS Xbox One controller

 - fixes to xpad driver to properly handle Mad Catz JOYTECH NEO SE
   Advanced and PDP Mirror's Edge Official controllers

 - fixes to xpad driver to properly handle "Share" button on some
   controllers

 - a fix for device initialization timing and for waking up the
   controller in cyttsp5 driver

 - a fix for hisi_powerkey driver to properly wake up from s2idle state

 - other assorted cleanups and fixes

* tag 'input-for-v6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: xpad - fix xpad_device sorting
  Input: xpad - add support for several more controllers
  Input: xpad - fix Share button on Xbox One controllers
  Input: xpad - fix two controller table values
  Input: hisi_powerkey - enable system-wakeup for s2idle
  Input: synaptics - enable InterTouch on Dell Precision M3800
  Input: synaptics - enable InterTouch on TUXEDO InfinityBook Pro 14 v5
  Input: synaptics - enable InterTouch on Dynabook Portege X30L-G
  Input: synaptics - enable InterTouch on Dynabook Portege X30-D
  Input: synaptics - enable SMBus for HP Elitebook 850 G1
  Input: mtk-pmic-keys - fix possible null pointer dereference
  Input: xpad - add support for 8BitDo Ultimate 2 Wireless Controller
  Input: cyttsp5 - fix power control issue on wakeup
  MAINTAINERS: .mailmap: update Mattijs Korpershoek's email address
  dt-bindings: mediatek,mt6779-keypad: Update Mattijs' email address
  Input: stmpe-ts - use module alias instead of device table
  Input: cyttsp5 - ensure minimum reset pulse width
  Input: sparcspkr - avoid unannotated fall-through
  input/joystick: magellan: Mark __nonstring look-up table

2 months agoMerge tag 'fixes-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt...
Linus Torvalds [Sun, 11 May 2025 17:23:53 +0000 (10:23 -0700)]
Merge tag 'fixes-2025-05-11' of git://git./linux/kernel/git/rppt/memblock

Pull memblock fixes from Mike Rapoport:

 - Mark set_high_memory() as __init to fix section mismatch

 - Accept memory allocated in memblock_double_array() to mitigate crash
   of SNP guests

* tag 'fixes-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: Accept allocated memory before use in memblock_double_array()
  mm,mm_init: Mark set_high_memory as __init

2 months agoInput: xpad - fix xpad_device sorting
Vicki Pfau [Sun, 11 May 2025 06:06:34 +0000 (23:06 -0700)]
Input: xpad - fix xpad_device sorting

A recent commit put one entry in the wrong place. This just moves it to the
right place.

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-5-vi@endrift.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2 months agoInput: xpad - add support for several more controllers
Vicki Pfau [Sun, 11 May 2025 06:00:10 +0000 (23:00 -0700)]
Input: xpad - add support for several more controllers

This adds support for several new controllers, all of which include
Share buttons:

- HORI Drum controller
- PowerA Fusion Pro 4
- 8BitDo Ultimate 3-mode Controller
- Hyperkin DuchesS Xbox One controller
- PowerA MOGA XP-Ultra controller

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-4-vi@endrift.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2 months agoInput: xpad - fix Share button on Xbox One controllers
Vicki Pfau [Sun, 11 May 2025 05:59:25 +0000 (22:59 -0700)]
Input: xpad - fix Share button on Xbox One controllers

The Share button, if present, is always one of two offsets from the end of the
file, depending on the presence of a specific interface. As we lack parsing for
the identify packet we can't automatically determine the presence of that
interface, but we can hardcode which of these offsets is correct for a given
controller.

More controllers are probably fixable by adding the MAP_SHARE_BUTTON in the
future, but for now I only added the ones that I have the ability to test
directly.

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-2-vi@endrift.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2 months agoInput: xpad - fix two controller table values
Vicki Pfau [Fri, 28 Mar 2025 23:43:36 +0000 (16:43 -0700)]
Input: xpad - fix two controller table values

Two controllers -- Mad Catz JOYTECH NEO SE Advanced and PDP Mirror's
Edge Official -- were missing the value of the mapping field, and thus
wouldn't detect properly.

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-1-vi@endrift.com
Fixes: 540602a43ae5 ("Input: xpad - add a few new VID/PID combinations")
Fixes: 3492321e2e60 ("Input: xpad - add multiple supported devices")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2 months agoInput: hisi_powerkey - enable system-wakeup for s2idle
Ulf Hansson [Thu, 6 Mar 2025 11:50:21 +0000 (12:50 +0100)]
Input: hisi_powerkey - enable system-wakeup for s2idle

To wake up the system from s2idle when pressing the power-button, let's
convert from using pm_wakeup_event() to pm_wakeup_dev_event(), as it allows
us to specify the "hard" in-parameter, which needs to be set for s2idle.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250306115021.797426-1-ulf.hansson@linaro.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2 months agoMerge tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 10 May 2025 22:50:56 +0000 (15:50 -0700)]
Merge tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git./linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
 "22 hotfixes. 13 are cc:stable and the remainder address post-6.14
  issues or aren't considered necessary for -stable kernels.

  About half are for MM. Five OCFS2 fixes and a few MAINTAINERS updates"

* tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  mm: fix folio_pte_batch() on XEN PV
  nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()
  mm/hugetlb: copy the CMA flag when demoting
  mm, swap: fix false warning for large allocation with !THP_SWAP
  selftests/mm: fix a build failure on powerpc
  selftests/mm: fix build break when compiling pkey_util.c
  mm: vmalloc: support more granular vrealloc() sizing
  tools/testing/selftests: fix guard region test tmpfs assumption
  ocfs2: stop quota recovery before disabling quotas
  ocfs2: implement handshaking with ocfs2 recovery thread
  ocfs2: switch osb->disable_recovery to enum
  mailmap: map Uwe's BayLibre addresses to a single one
  MAINTAINERS: add mm THP section
  mm/userfaultfd: fix uninitialized output field for -EAGAIN race
  selftests/mm: compaction_test: support platform with huge mount of memory
  MAINTAINERS: add core mm section
  ocfs2: fix panic in failed foilio allocation
  mm/huge_memory: fix dereferencing invalid pmd migration entry
  MAINTAINERS: add reverse mapping section
  x86: disable image size check for test builds
  ...

2 months agoMerge tag 'driver-core-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 10 May 2025 16:53:11 +0000 (09:53 -0700)]
Merge tag 'driver-core-6.15-rc6' of git://git./linux/kernel/git/driver-core/driver-core

Pull driver core fix from Greg KH:
 "Here is a single driver core fix for a regression for platform devices
  that is a regression from a change that went into 6.15-rc1 that
  affected Pixel devices. It has been in linux-next for over a week with
  no reported problems"

* tag 'driver-core-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  platform: Fix race condition during DMA configure at IOMMU probe time

2 months agoMerge tag 'usb-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sat, 10 May 2025 16:18:05 +0000 (09:18 -0700)]
Merge tag 'usb-6.15-rc6' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes for 6.15-rc6. Included in here
  are:

   - typec driver fixes

   - usbtmc ioctl fixes

   - xhci driver fixes

   - cdnsp driver fixes

   - some gadget driver fixes

  Nothing really major, just all little stuff that people have reported
  being issues. All of these have been in linux-next this week with no
  reported issues"

* tag 'usb-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  xhci: dbc: Avoid event polling busyloop if pending rx transfers are inactive.
  usb: xhci: Don't trust the EP Context cycle bit when moving HW dequeue
  usb: usbtmc: Fix erroneous generic_read ioctl return
  usb: usbtmc: Fix erroneous wait_srq ioctl return
  usb: usbtmc: Fix erroneous get_stb ioctl error returns
  usb: typec: tcpm: delay SNK_TRY_WAIT_DEBOUNCE to SRC_TRYWAIT transition
  USB: usbtmc: use interruptible sleep in usbtmc_read
  usb: cdnsp: fix L1 resume issue for RTL_REVISION_NEW_LPM version
  usb: typec: ucsi: displayport: Fix NULL pointer access
  usb: typec: ucsi: displayport: Fix deadlock
  usb: misc: onboard_usb_dev: fix support for Cypress HX3 hubs
  usb: uhci-platform: Make the clock really optional
  usb: dwc3: gadget: Make gadget_wakeup asynchronous
  usb: gadget: Use get_status callback to set remote wakeup capability
  usb: gadget: f_ecm: Add get_status callback
  usb: host: tegra: Prevent host controller crash when OTG port is used
  usb: cdnsp: Fix issue with resuming from L1
  usb: gadget: tegra-xudc: ACK ST_RC after clearing CTRL_RUN

2 months agoMerge tag 'staging-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sat, 10 May 2025 16:08:19 +0000 (09:08 -0700)]
Merge tag 'staging-6.15-rc6' of git://git./linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
 "Here are three small staging driver fixes for 6.15-rc6. These are:

   - bcm2835-camera driver fix

   - two axis-fifo driver fixes

  All of these have been in linux-next for a few weeks with no reported
  issues"

* tag 'staging-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: axis-fifo: Remove hardware resets for user errors
  staging: axis-fifo: Correct handling of tx_fifo_depth for size validation
  staging: bcm2835-camera: Initialise dev in v4l2_dev

2 months agoMerge tag 'char-misc-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sat, 10 May 2025 15:55:15 +0000 (08:55 -0700)]
Merge tag 'char-misc-6.15-rc6' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc/IIO driver fixes from Greg KH:
 "Here are a bunch of small driver fixes (mostly all IIO) for 6.15-rc6.
  Included in here are:

   - loads of tiny IIO driver fixes for reported issues

   - hyperv driver fix for a much-reported and worked on sysfs ring
     buffer creation bug

  All of these have been in linux-next for over a week (the IIO ones for
  many weeks now), with no reported issues"

* tag 'char-misc-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (30 commits)
  Drivers: hv: Make the sysfs node size for the ring buffer dynamic
  uio_hv_generic: Fix sysfs creation path for ring buffer
  iio: adis16201: Correct inclinometer channel resolution
  iio: adc: ad7606: fix serial register access
  iio: pressure: mprls0025pa: use aligned_s64 for timestamp
  iio: imu: adis16550: align buffers for timestamp
  staging: iio: adc: ad7816: Correct conditional logic for store mode
  iio: adc: ad7266: Fix potential timestamp alignment issue.
  iio: adc: ad7768-1: Fix insufficient alignment of timestamp.
  iio: adc: dln2: Use aligned_s64 for timestamp
  iio: accel: adxl355: Make timestamp 64-bit aligned using aligned_s64
  iio: temp: maxim-thermocouple: Fix potential lack of DMA safe buffer.
  iio: chemical: pms7003: use aligned_s64 for timestamp
  iio: chemical: sps30: use aligned_s64 for timestamp
  iio: imu: inv_mpu6050: align buffer for timestamp
  iio: imu: st_lsm6dsx: Fix wakeup source leaks on device unbind
  iio: adc: qcom-spmi-iadc: Fix wakeup source leaks on device unbind
  iio: accel: fxls8962af: Fix wakeup source leaks on device unbind
  iio: adc: ad7380: fix event threshold shift
  iio: hid-sensor-prox: Fix incorrect OFFSET calculation
  ...

2 months agoMerge tag 'i2c-for-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 10 May 2025 15:52:41 +0000 (08:52 -0700)]
Merge tag 'i2c-for-6.15-rc6' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

 - omap: use correct function to read from device tree

 - MAINTAINERS: remove Seth from ISMT maintainership

* tag 'i2c-for-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  MAINTAINERS: Remove entry for Seth Heasley
  i2c: omap: fix deprecated of_property_read_bool() use

2 months agoMerge tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 10 May 2025 15:44:36 +0000 (08:44 -0700)]
Merge tag 'for-linus-6.15a-rc6-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - A fix for the xenbus driver allowing to use a PVH Dom0 with
   Xenstore running in another domain

 - A fix for the xenbus driver addressing a rare race condition
   resulting in NULL dereferences and other problems

 - A fix for the xen-swiotlb driver fixing a problem seen on Arm
   platforms

* tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xenbus: Use kref to track req lifetime
  xenbus: Allow PVH dom0 a non-local xenstore
  xen: swiotlb: Use swiotlb bouncing if kmalloc allocation demands it

2 months agoMerge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sat, 10 May 2025 15:36:07 +0000 (08:36 -0700)]
Merge tag 'pull-fixes' of git://git./linux/kernel/git/viro/vfs

Pull mount fixes from Al Viro:
 "A couple of races around legalize_mnt vs umount (both fairly old and
  hard to hit) plus two bugs in move_mount(2) - both around 'move
  detached subtree in place' logics"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix IS_MNT_PROPAGATING uses
  do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  do_umount(): add missing barrier before refcount checks in sync case
  __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock

2 months agoMerge tag 'kvm-x86-fixes-6.15-rcN' of https://github.com/kvm-x86/linux into HEAD
Paolo Bonzini [Sat, 10 May 2025 15:11:06 +0000 (11:11 -0400)]
Merge tag 'kvm-x86-fixes-6.15-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.15-rcN

 - Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid causing
   problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to sanitize the
   VMCB as its state is undefined after SHUTDOWN, emulating INIT is the
   least awful choice).

 - Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
   KVM doesn't goof a sanity check in the future.

 - Free obsolete roots when (re)loading the MMU to fix a bug where
   pre-faulting memory can get stuck due to always encountering a stale
   root.

 - When dumping GHCB state, use KVM's snapshot instead of the raw GHCB page
   to print state, so that KVM doesn't print stale/wrong information.

 - When changing memory attributes (e.g. shared <=> private), add potential
   hugepage ranges to the mmu_invalidate_range_{start,end} set so that KVM
   doesn't create a shared/private hugepage when the the corresponding
   attributes will become mixed (the attributes are commited *after* KVM
   finishes the invalidation).

 - Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM has at
   least one active VM.  Effectively BP_SPEC_REDUCE when KVM is loaded led
   to very measurable performance regressions for non-KVM workloads.

2 months agoMerge tag 'kvmarm-fixes-6.15-3' of https://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Sat, 10 May 2025 15:10:02 +0000 (11:10 -0400)]
Merge tag 'kvmarm-fixes-6.15-3' of https://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.15, round #3

 - Avoid use of uninitialized memcache pointer in user_mem_abort()

 - Always set HCR_EL2.xMO bits when running in VHE, allowing interrupts
   to be taken while TGE=0 and fixing an ugly bug on AmpereOne that
   occurs when taking an interrupt while clearing the xMO bits
   (AC03_CPU_36)

 - Prevent VMMs from hiding support for AArch64 at any EL virtualized by
   KVM

 - Save/restore the host value for HCRX_EL2 instead of restoring an
   incorrect fixed value

 - Make host_stage2_set_owner_locked() check that the entire requested
   range is memory rather than just the first page

2 months agoMerge tag 'kvm-riscv-fixes-6.15-1' of https://github.com/kvm-riscv/linux into HEAD
Paolo Bonzini [Sat, 10 May 2025 15:09:26 +0000 (11:09 -0400)]
Merge tag 'kvm-riscv-fixes-6.15-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 6.15, take #1

- Add missing reset of smstateen CSRs

2 months agoMerge tag 'i2c-host-fixes-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Wolfram Sang [Sat, 10 May 2025 09:41:13 +0000 (11:41 +0200)]
Merge tag 'i2c-host-fixes-6.15-rc6' of git://git./linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host-fixes for v6.15-rc6

- omap: use correct function to read from device tree
- MAINTAINERS: remove Seth from ISMT maintainership

2 months agoMerge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Fri, 9 May 2025 23:45:21 +0000 (16:45 -0700)]
Merge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix dentry leak which can cause umount crash

 - Add warning for parse contexts error on compounded operation

* tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: Avoid race in open_cached_dir with lease breaks
  smb3 client: warn when parse contexts returns error on compounded operation

2 months agofix IS_MNT_PROPAGATING uses
Al Viro [Thu, 8 May 2025 19:35:51 +0000 (15:35 -0400)]
fix IS_MNT_PROPAGATING uses

propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself.  What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.

When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).

Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'.  Trying to move_mount() our detached tree on subdirectory
in A should have
* moved B' on that subdirectory in A
* skipped the corresponding subdirectory in B' itself
* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.

IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount().  The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2 months agodo_move_mount(): don't leak MNTNS_PROPAGATING on failures
Al Viro [Tue, 29 Apr 2025 01:43:23 +0000 (21:43 -0400)]
do_move_mount(): don't leak MNTNS_PROPAGATING on failures

as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2 months agodo_umount(): add missing barrier before refcount checks in sync case
Al Viro [Tue, 29 Apr 2025 03:56:14 +0000 (23:56 -0400)]
do_umount(): add missing barrier before refcount checks in sync case

do_umount() analogue of the race fixed in 119e1ef80ecf "fix
__legitimize_mnt()/mntput() race".  Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment.  Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit).  Still a bug...

Fixes: 48a066e72d97 ("RCU'd vfsmounts")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2 months ago__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
Al Viro [Sun, 27 Apr 2025 19:41:51 +0000 (15:41 -0400)]
__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock

... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().

Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2 months agoMerge tag 'rust-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda...
Linus Torvalds [Fri, 9 May 2025 21:06:34 +0000 (14:06 -0700)]
Merge tag 'rust-fixes-6.15-2' of git://git./linux/kernel/git/ojeda/linux

Pull rust fixes from Miguel Ojeda:

 - Make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88.0

 - Clean Rust (and Clippy) lints for the upcoming Rust 1.87.0 and 1.88.0
   releases

 - Clean objtool warning for the upcoming Rust 1.87.0 release by adding
   one more noreturn function

* tag 'rust-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  x86/Kconfig: make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88
  rust: clean Rust 1.88.0's `clippy::uninlined_format_args` lint
  rust: clean Rust 1.88.0's warning about `clippy::disallowed_macros` configuration
  rust: clean Rust 1.88.0's `unnecessary_transmutes` lint
  rust: allow Rust 1.87.0's `clippy::ptr_eq` lint
  objtool/rust: add one more `noreturn` Rust function for Rust 1.87.0

2 months agoMerge tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Fri, 9 May 2025 19:41:34 +0000 (12:41 -0700)]
Merge tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Weekly drm fixes, bit bigger than last week, but overall amdgpu/xe
  with some ivpu bits and a random few fixes, and dropping the
  ttm_backup struct which wrapped struct file and was recently
  frowned at.

  drm:
   - Fix overflow when generating wedged event

  ttm:
   - Fix documentation
   - Remove struct ttm_backup

  panel:
   - simple: Fix timings for AUO G101EVN010

  amdgpu:
   - DC FP fixes
   - Freesync fix
   - DMUB AUX fixes
   - VCN fix
   - Hibernation fixes
   - HDP fixes

  xe:
   - Prevent PF queue overflow
   - Hold all forcewake during mocs test
   - Remove GSC flush on reset path
   - Fix forcewake put on error path
   - Fix runtime warning when building without svm

  i915:
   - Fix oops on resume after disconnecting DP MST sinks during suspend
   - Fix SPLC num_waiters refcounting

  ivpu:
   - Increase timeouts
   - Fix deadlock in cmdq ioctl
   - Unlock mutices in correct order

  v3d:
   - Avoid memory leak in job handling"

* tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel: (32 commits)
  drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation
  drm/xe: Add config control for svm flush work
  drm/xe: Release force wake first then runtime power
  drm/xe/gsc: do not flush the GSC worker from the reset path
  drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs
  drm/xe: Add page queue multiplier
  drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush
  drm/amdgpu: fix pm notifier handling
  Revert "drm/amd: Stop evicting resources on APUs in suspend"
  drm/amdgpu/vcn: using separate VCN1_AON_SOC offset
  drm/amd/display: Fix wrong handling for AUX_DEFER case
  drm/amd/display: Copy AUX read reply data whenever length > 0
  drm/amd/display: Remove incorrect checking in dmub aux handler
  drm/amd/display: Fix the checking condition in dmub aux handling
  drm/amd/display: Shift DMUB AUX reply command if necessary
  drm/amd/display: Call FP Protect Before Mode Programming/Mode Support
  ...

2 months agoMerge tag 'drm-intel-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/i915...
Dave Airlie [Fri, 9 May 2025 19:07:17 +0000 (05:07 +1000)]
Merge tag 'drm-intel-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

drm/i915 fixes for v6.15-rc6:
- Fix oops on resume after disconnecting DP MST sinks during suspend
- Fix SPLC num_waiters refcounting

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://lore.kernel.org/r/87tt5umeaw.fsf@intel.com
2 months agoMerge tag 'drm-xe-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/xe/kernel...
Dave Airlie [Fri, 9 May 2025 19:02:38 +0000 (05:02 +1000)]
Merge tag 'drm-xe-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Prevent PF queue overflow
- Hold all forcewake during mocs test
- Remove GSC flush on reset path
- Fix forcewake put on error path
- Fix runtime warning when building without svm

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/jffqa56f2zp4i5ztz677cdspgxhnw7qfop3dd3l2epykfpfvza@q2nw6wapsphz
2 months agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 9 May 2025 18:30:26 +0000 (11:30 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "Move the arm64_use_ng_mappings variable from the .bss to the .data
  section as it is accessed very early during boot with the MMU off and
  before the .bss has been initialised.

  This could lead to incorrect idmap page table"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation

2 months agoMerge tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 9 May 2025 18:17:50 +0000 (11:17 -0700)]
Merge tag 'riscv-for-linus-6.15-rc6' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - The compressed half-word misaligned access instructions (c.lhu, c.lh,
   and c.sh) from the Zcb extension are now properly emulated

 - A series of fixes to properly emulate permissions while handling
   userspace misaligned accesses

 - A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the
   envcfg CSR on systems that don't support that CSR, and to report
   those failures up to userspace

 - The .rela.dyn section is no longer stripped from vmlinux, as it is
   necessary to relocate the kernel under some conditions (including
   kexec)

* tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
  scripts: Do not strip .rela.dyn section
  riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
  riscv: misaligned: use get_user() instead of __get_user()
  riscv: misaligned: enable IRQs while handling misaligned accesses
  riscv: misaligned: factorize trap handling
  riscv: misaligned: Add handling for ZCB instructions

2 months agoMerge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 9 May 2025 17:34:50 +0000 (10:34 -0700)]
Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix for a regression in this series for loop and read/write iterator
   handling

 - zone append block update tweak

 - remove a broken IO priority test

 - NVMe pull request via Christoph:
      - unblock ctrl state transition for firmware update (Daniel
        Wagner)

* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
  block: remove test of incorrect io priority level
  nvme: unblock ctrl state transition for firmware update
  block: only update request sector if needed
  loop: Add sanity check for read/write_iter

2 months agoMerge tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 9 May 2025 16:26:46 +0000 (09:26 -0700)]
Merge tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for linked timeouts arming and firing wrt prep and issue of the
   request being managed by the linked timeout

 - Fix for a CQE ordering issue between requests with multishot and
   using the same buffer group. This is a dumbed down version for this
   release and for stable, it'll get improved for v6.16

 - Tweak the SQPOLL submit batch size. A previous commit made SQPOLL
   manage its own task_work and chose a tiny batch size, bump it from 8
   to 32 to fix a performance regression due to that

* tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux:
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue

2 months agoMerge tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules...
Linus Torvalds [Fri, 9 May 2025 16:09:49 +0000 (09:09 -0700)]
Merge tag 'modules-6.15-rc6' of git://git./linux/kernel/git/modules/linux

Pull modules fix from Petr Pavlu:
 "A single fix to prevent use of an uninitialized completion pointer
  when releasing a module_kobject in specific situations.

  This addresses a latent bug exposed by commit f95bbfe18512 ("drivers:
  base: handle module_kobject creation")"

* tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: ensure that kobject_put() is safe for module type kobjects

2 months agox86/mm: Eliminate window where TLB flushes may be inadvertently skipped
Dave Hansen [Thu, 8 May 2025 22:41:32 +0000 (15:41 -0700)]
x86/mm: Eliminate window where TLB flushes may be inadvertently skipped

tl;dr: There is a window in the mm switching code where the new CR3 is
set and the CPU should be getting TLB flushes for the new mm.  But
should_flush_tlb() has a bug and suppresses the flush.  Fix it by
widening the window where should_flush_tlb() sends an IPI.

Long Version:

=== History ===

There were a few things leading up to this.

First, updating mm_cpumask() was observed to be too expensive, so it was
made lazier.  But being lazy caused too many unnecessary IPIs to CPUs
due to the now-lazy mm_cpumask().  So code was added to cull
mm_cpumask() periodically[2].  But that culling was a bit too aggressive
and skipped sending TLB flushes to CPUs that need them.  So here we are
again.

=== Problem ===

The too-aggressive code in should_flush_tlb() strikes in this window:

// Turn on IPIs for this CPU/mm combination, but only
// if should_flush_tlb() agrees:
cpumask_set_cpu(cpu, mm_cpumask(next));

next_tlb_gen = atomic64_read(&next->context.tlb_gen);
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
load_new_mm_cr3(need_flush);
// ^ After 'need_flush' is set to false, IPIs *MUST*
// be sent to this CPU and not be ignored.

        this_cpu_write(cpu_tlbstate.loaded_mm, next);
// ^ Not until this point does should_flush_tlb()
// become true!

should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3()
and writing to 'loaded_mm', which is a window where they should not be
suppressed.  Whoops.

=== Solution ===

Thankfully, the fuzzy "just about to write CR3" window is already marked
with loaded_mm==LOADED_MM_SWITCHING.  Simply checking for that state in
should_flush_tlb() is sufficient to ensure that the CPU is targeted with
an IPI.

This will cause more TLB flush IPIs.  But the window is relatively small
and I do not expect this to cause any kind of measurable performance
impact.

Update the comment where LOADED_MM_SWITCHING is written since it grew
yet another user.

Peter Z also raised a concern that should_flush_tlb() might not observe
'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off()
writes them.  Add a barrier to ensure that they are observed in the
order they are written.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/
Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2]
Reported-by: Stephen Dolan <sdolan@janestreet.com>
Cc: stable@vger.kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 months agoio_uring/sqpoll: Increase task_work submission batch size io_uring-6.15-20250509
Gabriel Krisman Bertazi [Thu, 8 May 2025 18:12:03 +0000 (14:12 -0400)]
io_uring/sqpoll: Increase task_work submission batch size

Our QA team reported a 10%-23%, throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IO queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing less ios per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profilings charts, and increased
submission latency measured by fio.

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput.  The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency is noticeably lowered too.  As measured by
fio:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug.
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32

Ideally, the cap size would at least be the deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size.  We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem.  Instead, let's just give it a more sensible value
that will allow for more efficient batching.  I've tested with different
cap values, and initially proposed to increase the cap to 1024.  Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agodrm/i915/dp: Fix determining SST/MST mode during MTP TU state computation
Imre Deak [Wed, 7 May 2025 15:19:53 +0000 (18:19 +0300)]
drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation

Determining the SST/MST mode during state computation must be done based
on the output type stored in the CRTC state, which in turn is set once
based on the modeset connector's SST vs. MST type and will not change as
long as the connector is using the CRTC. OTOH the MST mode indicated by
the given connector's intel_dp::is_mst flag can change independently of
the above output type, based on what sink is at any moment plugged to
the connector.

Fix the state computation accordingly.

Cc: Jani Nikula <jani.nikula@intel.com>
Fixes: f6971d7427c2 ("drm/i915/mst: adapt intel_dp_mtp_tu_compute_config() for 128b/132b SST")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4607
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://lore.kernel.org/r/20250507151953.251846-1-imre.deak@intel.com
(cherry picked from commit 0f45696ddb2b901fbf15cb8d2e89767be481d59f)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2 months agomemblock: Accept allocated memory before use in memblock_double_array()
Tom Lendacky [Thu, 8 May 2025 17:24:10 +0000 (12:24 -0500)]
memblock: Accept allocated memory before use in memblock_double_array()

When increasing the array size in memblock_double_array() and the slab
is not yet available, a call to memblock_find_in_range() is used to
reserve/allocate memory. However, the range returned may not have been
accepted, which can result in a crash when booting an SNP guest:

  RIP: 0010:memcpy_orig+0x68/0x130
  Code: ...
  RSP: 0000:ffffffff9cc03ce8 EFLAGS: 00010006
  RAX: ff11001ff83e5000 RBX: 0000000000000000 RCX: fffffffffffff000
  RDX: 0000000000000bc0 RSI: ffffffff9dba8860 RDI: ff11001ff83e5c00
  RBP: 0000000000002000 R08: 0000000000000000 R09: 0000000000002000
  R10: 000000207fffe000 R11: 0000040000000000 R12: ffffffff9d06ef78
  R13: ff11001ff83e5000 R14: ffffffff9dba7c60 R15: 0000000000000c00
  memblock_double_array+0xff/0x310
  memblock_add_range+0x1fb/0x2f0
  memblock_reserve+0x4f/0xa0
  memblock_alloc_range_nid+0xac/0x130
  memblock_alloc_internal+0x53/0xc0
  memblock_alloc_try_nid+0x3d/0xa0
  swiotlb_init_remap+0x149/0x2f0
  mem_init+0xb/0xb0
  mm_core_init+0x8f/0x350
  start_kernel+0x17e/0x5d0
  x86_64_start_reservations+0x14/0x30
  x86_64_start_kernel+0x92/0xa0
  secondary_startup_64_no_verify+0x194/0x19b

Mitigate this by calling accept_memory() on the memory range returned
before the slab is available.

Prior to v6.12, the accept_memory() interface used a 'start' and 'end'
parameter instead of 'start' and 'size', therefore the accept_memory()
call must be adjusted to specify 'start + size' for 'end' when applying
to kernels prior to v6.12.

Cc: stable@vger.kernel.org # see patch description, needs adjustments for <= 6.11
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/da1ac73bf4ded761e21b4e4bb5178382a580cd73.1746725050.git.thomas.lendacky@amd.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2 months agoMerge tag 'amd-drm-fixes-6.15-2025-05-08' of https://gitlab.freedesktop.org/agd5f...
Dave Airlie [Fri, 9 May 2025 01:10:41 +0000 (11:10 +1000)]
Merge tag 'amd-drm-fixes-6.15-2025-05-08' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.15-2025-05-08:

amdgpu:
- DC FP fixes
- Freesync fix
- DMUB AUX fixes
- VCN fix
- Hibernation fixes
- HDP fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250508194102.3242372-1-alexander.deucher@amd.com
2 months agoMerge tag 'drm-misc-fixes-2025-05-08' of https://gitlab.freedesktop.org/drm/misc...
Dave Airlie [Thu, 8 May 2025 22:51:57 +0000 (08:51 +1000)]
Merge tag 'drm-misc-fixes-2025-05-08' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

Short summary of fixes pull:

drm:
- Fix overflow when generating wedged event

ivpu:
- Increate timeouts
- Fix deadlock in cmdq ioctl
- Unlock mutices in correct order

panel:
- simple: Fix timings for AUO G101EVN010

ttm:
- Fix documentation
- Remove struct ttm_backup

v3d:
- Avoid memory leak in job handling

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20250508104939.GA76697@2a02-2454-fd5e-fd00-c110-cbf2-6528-c5be.dyn6.pyur.net
2 months agoMerge tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Thu, 8 May 2025 21:28:49 +0000 (14:28 -0700)]
Merge tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:

 - Some fixes to help with filesystem analysis: ensure superblock
   error count gets written if we go ERO, don't discard the journal
   aggressively (so it's available for list_journal -a)

 - Fix lost wakeup on arm causing us to get stuck when reading btree
   nodes

 - Fix fsck failing to exit on ctrl-c

 - An additional fix for filesystems with misaligned bucket sizes: we
   now ensure that allocations are properly aligned

 - Setting background target but not promote target will now leave that
   data cached on the foreground target, as it used to

 - Revert a change to when we allocate the VFS superblock, this was done
   for implementing blk_holder_ops but ended up not being needed, and
   allocating a superblock and not setting SB_BORN while we do recovery
   caused sync() calls and other things to hang

 - Assorted fixes for harmless error messages that caused concern to
   users

* tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs:
  bcachefs: Don't aggressively discard the journal
  bcachefs: Ensure superblock gets written when we go ERO
  bcachefs: Filter out harmless EROFS error messages
  bcachefs: journal_shutdown is EROFS, not EIO
  bcachefs: Call bch2_fs_start before getting vfs superblock
  bcachefs: fix hung task timeout in journal read
  bcachefs: Add missing barriers before wake_up_bit()
  bcachefs: Ensure proper write alignment
  bcachefs: Improve want_cached_ptr()
  bcachefs: thread_with_stdio: fix spinning instead of exiting

2 months agodrm/xe: Add config control for svm flush work
Shuicheng Lin [Fri, 2 May 2025 17:00:52 +0000 (17:00 +0000)]
drm/xe: Add config control for svm flush work

Without CONFIG_DRM_XE_GPUSVM set, GPU SVM is not initialized thus below
warning pops. Refine the flush work code to be controlled by the config
to avoid below warning:
"
[  453.132028] ------------[ cut here ]------------
[  453.132527] WARNING: CPU: 9 PID: 4491 at kernel/workqueue.c:4205 __flush_work+0x379/0x3a0
[  453.133355] Modules linked in: xe drm_ttm_helper ttm gpu_sched drm_buddy drm_suballoc_helper drm_gpuvm drm_exec
[  453.134352] CPU: 9 UID: 0 PID: 4491 Comm: xe_exec_mix_mod Tainted: G     U  W           6.15.0-rc3+ #7 PREEMPT(full)
[  453.135405] Tainted: [U]=USER, [W]=WARN
...
[  453.136921] RIP: 0010:__flush_work+0x379/0x3a0
[  453.137417] Code: 8b 45 00 48 8b 55 08 89 c7 48 c1 e8 04 83 e7 08 83 e0 0f 83 cf 02 89 c6 48 0f ba 6d 00 03 e9 d5 fe ff ff 0f 0b e9 db fd ff ff <0f> 0b 45 31 e4 e9 d1 fd ff ff 0f 0b e9 03 ff ff ff 0f 0b e9 d6 fe
[  453.139250] RSP: 0018:ffffc90000c67b18 EFLAGS: 00010246
[  453.139782] RAX: 0000000000000000 RBX: ffff888108a24000 RCX: 0000000000002000
[  453.140521] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8881016d61c8
[  453.141253] RBP: ffff8881016d61c8 R08: 0000000000000000 R09: 0000000000000000
[  453.141985] R10: 0000000000000000 R11: 0000000008a24000 R12: 0000000000000001
[  453.142709] R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107db8c00
[  453.143450] FS:  00007f44853d4c80(0000) GS:ffff8882f469b000(0000) knlGS:0000000000000000
[  453.144276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  453.144853] CR2: 00007f4487629228 CR3: 00000001016aa000 CR4: 00000000000406f0
[  453.145594] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  453.146320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  453.147061] Call Trace:
[  453.147336]  <TASK>
[  453.147579]  ? tick_nohz_tick_stopped+0xd/0x30
[  453.148067]  ? xas_load+0x9/0xb0
[  453.148435]  ? xa_load+0x6f/0xb0
[  453.148781]  __xe_vm_bind_ioctl+0xbd5/0x1500 [xe]
[  453.149338]  ? dev_printk_emit+0x48/0x70
[  453.149762]  ? _dev_printk+0x57/0x80
[  453.150148]  ? drm_ioctl+0x17c/0x440
[  453.150544]  ? __drm_dev_vprintk+0x36/0x90
[  453.150983]  ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[  453.151575]  ? drm_ioctl_kernel+0x9f/0xf0
[  453.151998]  ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[  453.152560]  drm_ioctl_kernel+0x9f/0xf0
[  453.152968]  drm_ioctl+0x20f/0x440
[  453.153332]  ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[  453.153893]  ? ioctl_has_perm.constprop.0.isra.0+0xae/0x100
[  453.154489]  ? memory_bm_test_bit+0x5/0x60
[  453.154935]  xe_drm_ioctl+0x47/0x70 [xe]
[  453.155419]  __x64_sys_ioctl+0x8d/0xc0
[  453.155824]  do_syscall_64+0x47/0x110
[  453.156228]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
"

v2 (Matt):
    refine commit message to have more details
    add Fixes tag
    move the code to xe_svm.h which already have the config
    remove a blank line per codestyle suggestion

Fixes: 63f6e480d115 ("drm/xe: Add SVM garbage collector")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250502170052.1787973-1-shuicheng.lin@intel.com
(cherry picked from commit 9d80698bcd97a5ad1088bcbb055e73fd068895e2)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2 months agodrm/xe: Release force wake first then runtime power
Shuicheng Lin [Wed, 7 May 2025 02:23:02 +0000 (02:23 +0000)]
drm/xe: Release force wake first then runtime power

xe_force_wake_get() is dependent on xe_pm_runtime_get(), so for
the release path, xe_force_wake_put() should be called first then
xe_pm_runtime_put().
Combine the error path and normal path together with goto.

Fixes: 85d547608ef5 ("drm/xe/xe_gt_debugfs: Update handling of xe_force_wake_get return")
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20250507022302.2187527-1-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 432cd94efdca06296cc5e76d673546f58aa90ee1)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2 months agodrm/xe/gsc: do not flush the GSC worker from the reset path
Daniele Ceraolo Spurio [Fri, 2 May 2025 15:51:04 +0000 (08:51 -0700)]
drm/xe/gsc: do not flush the GSC worker from the reset path

The workqueue used for the reset worker is marked as WQ_MEM_RECLAIM,
while the GSC one isn't (and can't be as we need to do memory
allocations in the gsc worker). Therefore, we can't flush the latter
from the former.

The reason why we had such a flush was to avoid interrupting either
the GSC FW load or in progress GSC proxy operations. GSC proxy
operations fall into 2 categories:

1) GSC proxy init: this only happens once immediately after GSC FW load
   and does not support being interrupted. The only way to recover from
   an interruption of the proxy init is to do an FLR and re-load the GSC.

2) GSC proxy request: this can happen in response to a request that
   the driver sends to the GSC. If this is interrupted, the GSC FW will
   timeout and the driver request will be failed, but overall the GSC
   will keep working fine.

Flushing the work allowed us to avoid interruption in both cases (unless
the hang came from the GSC engine itself, in which case we're toast
anyway). However, a failure on a proxy request is tolerable if we're in
a scenario where we're triggering a GT reset (i.e., something is already
gone pretty wrong), so what we really need to avoid is interrupting
the init flow, which we can do by polling on the register that reports
when the proxy init is complete (as that ensure us that all the load and
init operations have been completed).

Note that during suspend we still want to do a flush of the worker to
make sure it completes any operations involving the HW before the power
is cut.

v2: fix spelling in commit msg, rename waiter function (Julia)

Fixes: dd0e89e5edc2 ("drm/xe/gsc: GSC FW load")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4830
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250502155104.2201469-1-daniele.ceraolospurio@intel.com
(cherry picked from commit 12370bfcc4f0bdf70279ec5b570eb298963422b5)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2 months agodrm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs
Tejas Upadhyay [Mon, 28 Apr 2025 08:23:57 +0000 (13:53 +0530)]
drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs

LNCF registers report wrong values when XE_FORCEWAKE_GT
only is held. Holding XE_FORCEWAKE_ALL ensures correct
operations on LNCF regs.

V2(Himal):
 - Use xe_force_wake_ref_has_domain

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1999
Fixes: a6a4ea6d7d37 ("drm/xe: Add mocs kunit")
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250428082357.1730068-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
(cherry picked from commit 70a2585e582058e94fe4381a337be42dec800337)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2 months agodrm/xe: Add page queue multiplier
Matthew Brost [Tue, 8 Apr 2025 15:59:15 +0000 (08:59 -0700)]
drm/xe: Add page queue multiplier

For an unknown reason the math to determine the PF queue size does is
not correct - compute UMD applications are overflowing the PF queue
which is fatal. A multippier of 8 fixes the problem.

Fixes: 3338e4f90c14 ("drm/xe: Use topology to determine page fault queue size")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jagmeet Randhawa <jagmeet.randhawa@intel.com>
Link: https://lore.kernel.org/r/20250408155915.78770-1-matthew.brost@intel.com
(cherry picked from commit 29582e0ea75c95668d168b12406e3c56cf5a73c4)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2 months agoMerge tag 'vfio-v6.15-rc6' of https://github.com/awilliam/linux-vfio
Linus Torvalds [Thu, 8 May 2025 19:09:22 +0000 (12:09 -0700)]
Merge tag 'vfio-v6.15-rc6' of https://github.com/awilliam/linux-vfio

Pull vfio fix from Alex Williamson:

 - Fix an issue in vfio-pci huge_fault handling by aligning faults to
   the order, resulting in deterministic use of huge pages.  This
   avoids a race where simultaneous aligned and unaligned faults to
   the same PMD can result in a VM_FAULT_OOM and subsequent VM crash.
   (Alex Williamson)

* tag 'vfio-v6.15-rc6' of https://github.com/awilliam/linux-vfio:
  vfio/pci: Align huge faults to order

2 months agoMerge tag 'riscv-fixes-6.15-rc6' of ssh://gitolite.kernel.org/pub/scm/linux/kernel...
Palmer Dabbelt [Thu, 8 May 2025 16:40:21 +0000 (09:40 -0700)]
Merge tag 'riscv-fixes-6.15-rc6' of ssh://gitolite./linux/kernel/git/alexghiti/linux into fixes

riscv fixes for 6.15-rc6

- A fix to handle compressed halfword load/store instructions misaligned accesses
- A fix to allow user memory access while handling a misaligned access
- 2 fixes to return an error if the pointer masking extension is not implemented on the platform but userspace still tries to access it, which caused oops on some early platforms
- A fix to prevent the stripping of .rela.dyn so that a vmlinux loaded by kexec can successfully boot

* tag 'riscv-fixes-6.15-rc6' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux:
  riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
  scripts: Do not strip .rela.dyn section
  riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
  riscv: misaligned: use get_user() instead of __get_user()
  riscv: misaligned: enable IRQs while handling misaligned accesses
  riscv: misaligned: factorize trap handling
  riscv: misaligned: Add handling for ZCB instructions

2 months agodrm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
Alex Deucher [Wed, 30 Apr 2025 16:50:02 +0000 (12:50 -0400)]
drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush

Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: 689275140cb8 ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbc064adfcf9095e7d895bea87b2f75c1ab23236)
Cc: stable@vger.kernel.org
2 months agodrm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
Alex Deucher [Wed, 30 Apr 2025 16:48:51 +0000 (12:48 -0400)]
drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush

Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: abe1cbaec6cf ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 84141ff615951359c9a99696fd79a36c465ed847)
Cc: stable@vger.kernel.org
2 months agodrm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
Alex Deucher [Wed, 30 Apr 2025 16:47:37 +0000 (12:47 -0400)]
drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush

Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: f756dbac1ce1 ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4a89b7698e771914b4d5b571600c76e2fdcbe2a9)
Cc: stable@vger.kernel.org