Mathieu Desnoyers [Tue, 12 Nov 2024 15:28:26 +0000 (10:28 -0500)]
rseq: Validate read-only fields under DEBUG_RSEQ config
The rseq uapi requires cooperation between users of the rseq fields
to ensure that all libraries and applications using rseq within a
process do not interfere with each other.
This is especially important for fields which are meant to be read-only
from user-space, as documented in uapi/linux/rseq.h:
- cpu_id_start,
- cpu_id,
- node_id,
- mm_cid.
Storing to those fields from a user-space library prevents any sharing
of the rseq ABI with other libraries and applications, as other users
are not aware that the content of those fields has been altered by a
third-party library.
This is unfortunately the current behavior of tcmalloc: it purposefully
overlaps part of a cached value with the cpu_id_start upper bits to get
notified about preemption, because the kernel clears those upper bits
before returning to user-space. This behavior does not conform to the
rseq uapi header ABI.
This prevents tcmalloc from using rseq when rseq is registered by the
GNU C library 2.35+. It requires tcmalloc users to disable glibc rseq
registration with a glibc tunable, which is a sad state of affairs.
Considering that tcmalloc and the GNU C library are the two first
upstream projects using rseq, and that they are already incompatible due
to use of this hack, adding kernel-level validation of all read-only
fields content is necessary to ensure future users of rseq abide by the
rseq ABI requirements.
Validate that user-space does not corrupt the read-only fields and
conform to the rseq uapi header ABI when the kernel is built with
CONFIG_DEBUG_RSEQ=y. This is done by storing a copy of the read-only
fields in the task_struct, and validating the prior values present in
user-space before updating them. If the values do not match, print
a warning on the console (printk_ratelimited()).
This is a first step to identify misuses of the rseq ABI by printing
a warning on the console. After a giving some time to userspace to
correct its use of rseq, the plan is to eventually terminate offending
processes with SIGSEGV.
This change is expected to produce warnings for the upstream tcmalloc
implementation, but tcmalloc developers mentioned they were open to
adapt their implementation to kernel-level change.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/google/tcmalloc/issues/144
Peter Zijlstra [Fri, 29 Nov 2024 10:15:41 +0000 (11:15 +0100)]
sched/fair: Untangle NEXT_BUDDY and pick_next_task()
There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:
- yield_to_task()
- cgroup dequeue / pick optimization
However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.
Fixes:
147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
Andy Shevchenko [Mon, 2 Dec 2024 17:35:30 +0000 (19:35 +0200)]
sched/fair: Mark m*_vruntime() with __maybe_unused
When max_vruntime() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:
kernel/sched/fair.c:526:19: error: unused function 'max_vruntime' [-Werror,-Wunused-function]
526 | static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
| ^~~~~~~~~~~~
Fix this by marking them with __maybe_unused (all cases for the sake of
symmetry).
See also commit
6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241202173546.634433-1-andriy.shevchenko@linux.intel.com
Vincent Guittot [Mon, 2 Dec 2024 17:46:06 +0000 (18:46 +0100)]
sched/fair: Fix variable declaration position
Move variable declaration at the beginning of the function
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-12-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:46:05 +0000 (18:46 +0100)]
sched/fair: Do not try to migrate delayed dequeue task
Migrating a delayed dequeued task doesn't help in balancing the number
of runnable tasks in the system.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-11-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:46:04 +0000 (18:46 +0100)]
sched/fair: Rename cfs_rq.nr_running into nr_queued
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the
reality as the value includes both the ready to run tasks and the delayed
dequeue tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:46:03 +0000 (18:46 +0100)]
sched/fair: Remove unused cfs_rq.idle_nr_running
cfs_rq.idle_nr_running field is not used anywhere so we can remove the
useless associated computation. Last user went in commit
5e963f2bd465
("sched/fair: Commit to EEVDF").
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-9-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:46:02 +0000 (18:46 +0100)]
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
Use same naming convention as others starting with h_nr_* and rename
idle_h_nr_running into h_nr_idle.
The "running" is not correct anymore as it includes delayed dequeue tasks
as well.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:46:01 +0000 (18:46 +0100)]
sched/fair: Removed unsued cfs_rq.h_nr_delayed
h_nr_delayed is not used anymore. We now have:
- h_nr_runnable which tracks tasks ready to run
- h_nr_queued which tracks enqueued tasks either ready to run or
delayed dequeue
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:46:00 +0000 (18:46 +0100)]
sched/fair: Use the new cfs_rq.h_nr_runnable
Use the new h_nr_runnable that tracks only queued and runnable tasks in the
statistics that are used to balance the system:
- PELT runnable_avg
- deciding if a group is overloaded or has spare capacity
- numa stats
- reduced capacity management
- load balance
- nohz kick
It should be noticed that the rq->nr_running still counts the delayed
dequeued tasks as delayed dequeue is a fair feature that is meaningless
at core level.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:45:59 +0000 (18:45 +0100)]
sched/fair: Add new cfs_rq.h_nr_runnable
With delayed dequeued feature, a sleeping sched_entity remains queued in
the rq until its lag has elapsed. As a result, it stays also visible
in the statistics that are used to balance the system and in particular
the field cfs.h_nr_queued when the sched_entity is associated to a task.
Create a new h_nr_runnable that tracks only queued and runnable tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:45:58 +0000 (18:45 +0100)]
sched/fair: Rename h_nr_running into h_nr_queued
With delayed dequeued feature, a sleeping sched_entity remains queued
in the rq until its lag has elapsed but can't run.
Rename h_nr_running into h_nr_queued to reflect this new behavior.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
Peter Zijlstra [Mon, 9 Dec 2024 10:48:10 +0000 (11:48 +0100)]
Merge branch 'sched/urgent'
Sync with urgent bits as a base for further work.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Peter Zijlstra [Mon, 2 Dec 2024 17:45:57 +0000 (18:45 +0100)]
sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while
commit
fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.
Track h_nr_delayed such that we can discount those and adjust the
signal.
Fixes:
fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Closes: https://lore.kernel.org/lkml/
a9a45193-d0c6-4ba2-a822-
464ad30b550e@arm.com/
Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/
[ Fixes checkpatch warnings and rebased ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
Vincent Guittot [Mon, 2 Dec 2024 17:45:56 +0000 (18:45 +0100)]
sched/fair: Fix sched_can_stop_tick() for fair tasks
We can't stop the tick of a rq if there are at least 2 tasks enqueued in
the whole hierarchy and not only at the root cfs rq.
rq->cfs.nr_running tracks the number of sched_entity at one level
whereas rq->cfs.h_nr_running tracks all queued tasks in the
hierarchy.
Fixes:
11cc374f4643b ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
K Prateek Nayak [Thu, 28 Nov 2024 07:29:54 +0000 (12:59 +0530)]
sched/fair: Fix NEXT_BUDDY
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in
pick_next_entity().
Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.
Fixes:
152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Adam Li <adamli@os.amperecomputing.com>
Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
Waiman Long [Wed, 30 Oct 2024 17:52:53 +0000 (13:52 -0400)]
sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
As all the non-domain and non-managed_irq housekeeping types have been
unified to HK_TYPE_KERNEL_NOISE, replace all these references in the
scheduler to use HK_TYPE_KERNEL_NOISE.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com
Waiman Long [Wed, 30 Oct 2024 17:52:52 +0000 (13:52 -0400)]
sched/isolation: Consolidate housekeeping cpumasks that are always identical
The housekeeping cpumasks are only set by two boot commandline
parameters: "nohz_full" and "isolcpus". When there is more than one of
"nohz_full" or "isolcpus", the extra ones must have the same CPU list
or the setup will fail partially.
The HK_TYPE_DOMAIN and HK_TYPE_MANAGED_IRQ types are settable by
"isolcpus" only and their settings can be independent of the other
types. The other housekeeping types are all set by "nohz_full" or
"isolcpus=nohz" without a way to set them individually. So they all
have identical cpumasks.
There is actually no point in having different cpumasks for these
"nohz_full" only housekeeping types. Consolidate these types to use the
same cpumask by aliasing them to the same value. If there is a need to
set any of them independently in the future, we can break them out to
their own cpumasks again.
With this change, the number of cpumasks in the housekeeping structure
drops from 9 to 3. Other than that, there should be no other functional
change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-4-longman@redhat.com
Waiman Long [Wed, 30 Oct 2024 17:52:51 +0000 (13:52 -0400)]
sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"
The "isolcpus=nohz" boot parameter and flag were used to disable tick
when running a single task. Nowsdays, this "nohz" flag is seldomly used
as it is included as part of the "nohz_full" parameter. Extend this
flag to cover other kernel noises disabled by the "nohz_full" parameter
to make them equivalent. This also eliminates the need to use both the
"isolcpus" and the "nohz_full" parameters to fully isolated a given
set of CPUs.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-3-longman@redhat.com
Waiman Long [Wed, 30 Oct 2024 17:52:50 +0000 (13:52 -0400)]
sched/core: Remove HK_TYPE_SCHED
The HK_TYPE_SCHED housekeeping type is defined but not set anywhere. So
any code that try to use HK_TYPE_SCHED are essentially dead code. So
remove HK_TYPE_SCHED and any code that use it.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-2-longman@redhat.com
Valentin Schneider [Wed, 27 Nov 2024 16:55:01 +0000 (17:55 +0100)]
sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
Andy reported that clang gets upset with CONFIG_CFS_BANDWIDTH=n:
kernel/sched/fair.c:6580:20: error: unused function 'cfs_bandwidth_used' [-Werror,-Wunused-function]
6580 | static inline bool cfs_bandwidth_used(void)
| ^~~~~~~~~~~~~~~~~~
Indeed, cfs_bandwidth_used() is only used within functions defined under
CONFIG_CFS_BANDWIDTH=y. Remove its CONFIG_CFS_BANDWIDTH=n declaration &
definition.
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20241127165501.160004-1-vschneid@redhat.com
Wander Lairson Costa [Wed, 24 Jul 2024 14:22:48 +0000 (11:22 -0300)]
sched/deadline: Consolidate Timer Cancellation
After commit
b58652db66c9 ("sched/deadline: Fix task_struct reference
leak"), I identified additional calls to hrtimer_try_to_cancel that
might also require a dl_server check. It remains unclear whether this
omission was intentional or accidental in those contexts.
This patch consolidates the timer cancellation logic into dedicated
functions, ensuring consistent behavior across all calls.
Additionally, it reduces code duplication and improves overall code
cleanliness.
Note the use of the __always_inline keyword. In some instances, we
have a task_struct pointer, dereference the dl member, and then use
the container_of macro to retrieve the task_struct pointer again. By
inlining the code, the compiler can potentially optimize out this
redundant round trip.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-3-wander@redhat.com
Juri Lelli [Fri, 15 Nov 2024 11:48:29 +0000 (11:48 +0000)]
sched/deadline: Check bandwidth overflow earlier for hotplug
Currently we check for bandwidth overflow potentially due to hotplug
operations at the end of sched_cpu_deactivate(), after the cpu going
offline has already been removed from scheduling, active_mask, etc.
This can create issues for DEADLINE tasks, as there is a substantial
race window between the start of sched_cpu_deactivate() and the moment
we possibly decide to roll-back the operation if dl_bw_deactivate()
returns failure in cpuset_cpu_inactive(). An example is a throttled
task that sees its replenishment timer firing while the cpu it was
previously running on is considered offline, but before
dl_bw_deactivate() had a chance to say no and roll-back happened.
Fix this by directly calling dl_bw_deactivate() first thing in
sched_cpu_deactivate() and do the required calculation in the former
function considering the cpu passed as an argument as offline already.
By doing so we also simplify sched_cpu_deactivate(), as there is no need
anymore for any kind of roll-back if we fail early.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/Zzc1DfPhbvqDDIJR@jlelli-thinkpadt14gen4.remote.csb
Juri Lelli [Thu, 14 Nov 2024 14:28:10 +0000 (14:28 +0000)]
sched/deadline: Correctly account for allocated bandwidth during hotplug
For hotplug operations, DEADLINE needs to check that there is still enough
bandwidth left after removing the CPU that is going offline. We however
fail to do so currently.
Restore the correct behavior by restructuring dl_bw_manage() a bit, so
that overflow conditions (not enough bandwidth left) are properly
checked. Also account for dl_server bandwidth, i.e. discount such
bandwidth in the calculation since NORMAL tasks will be anyway moved
away from the CPU as a result of the hotplug operation.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com
Juri Lelli [Thu, 14 Nov 2024 14:28:09 +0000 (14:28 +0000)]
sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes
When root domain non-destructive changes (e.g., only modifying one of
the existing root domains while the rest is not touched) happen we still
need to clear DEADLINE bandwidth accounting so that it's then properly
restored, taking into account DEADLINE tasks associated to each cpuset
(associated to each root domain). After the introduction of dl_servers,
we fail to restore such servers contribution after non-destructive
changes (as they are only considered on destructive changes when
runqueues are attached to the new domains).
Fix this by making sure we iterate over the dl_servers attached to
domains that have not been destroyed and add their bandwidth
contribution back correctly.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-2-juri.lelli@redhat.com
Harshit Agarwal [Thu, 14 Nov 2024 21:08:11 +0000 (14:08 -0700)]
sched: add READ_ONCE to task_on_rq_queued
task_on_rq_queued read p->on_rq without READ_ONCE, though p->on_rq is
set with WRITE_ONCE in {activate|deactivate}_task and smp_store_release
in __block_task, and also read with READ_ONCE in task_on_rq_migrating.
Make all of these accesses pair together by adding READ_ONCE in the
task_on_rq_queued.
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20241114210812.1836587-1-jon@nutanix.com
Suleiman Souhlal [Mon, 18 Nov 2024 04:37:45 +0000 (13:37 +0900)]
sched: Don't try to catch up excess steal time.
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/
20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
John Stultz [Thu, 14 Nov 2024 19:00:47 +0000 (11:00 -0800)]
locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex
Anders had bisected a crash using PREEMPT_RT with linux-next and
isolated it down to commit
894d1b3db41c ("locking/mutex: Remove
wakeups from under mutex::wait_lock"), where it seemed the
wake_q structure was somehow getting corrupted causing a null
pointer traversal.
I was able to easily repoduce this with PREEMPT_RT and managed
to isolate down that through various call stacks we were
actually calling wake_up_q() twice on the same wake_q.
I found that in the problematic commit, I had added the
wake_up_q() call in task_blocks_on_rt_mutex() around
__ww_mutex_add_waiter(), following a similar pattern in
__mutex_lock_common().
However, its just wrong. We haven't dropped the lock->wait_lock,
so its contrary to the point of the original patch. And it
didn't match the __mutex_lock_common() logic of re-initializing
the wake_q after calling it midway in the stack.
Looking at it now, the wake_up_q() call is incorrect and should
just be removed. So drop the erronious logic I had added.
Fixes:
894d1b3db41c ("locking/mutex: Remove wakeups from under mutex::wait_lock")
Closes: https://lore.kernel.org/lkml/
6afb936f-17c7-43fa-90e0-
b9e780866097@app.fastmail.com/
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241114190051.552665-1-jstultz@google.com
Wander Lairson Costa [Wed, 24 Jul 2024 14:22:47 +0000 (11:22 -0300)]
sched/deadline: Fix warning in migrate_enable for boosted tasks
When running the following command:
while true; do
stress-ng --cyclic 30 --timeout 30s --minimize --quiet
done
a warning is eventually triggered:
WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794
setup_new_dl_entity+0x13e/0x180
...
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? enqueue_dl_entity+0x631/0x6e0
? setup_new_dl_entity+0x13e/0x180
? __warn+0x7e/0xd0
? report_bug+0x11a/0x1a0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
enqueue_dl_entity+0x631/0x6e0
enqueue_task_dl+0x7d/0x120
__do_set_cpus_allowed+0xe3/0x280
__set_cpus_allowed_ptr_locked+0x140/0x1d0
__set_cpus_allowed_ptr+0x54/0xa0
migrate_enable+0x7e/0x150
rt_spin_unlock+0x1c/0x90
group_send_sig_info+0xf7/0x1a0
? kill_pid_info+0x1f/0x1d0
kill_pid_info+0x78/0x1d0
kill_proc_info+0x5b/0x110
__x64_sys_kill+0x93/0xc0
do_syscall_64+0x5c/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f0dab31f92b
This warning occurs because set_cpus_allowed dequeues and enqueues tasks
with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning
is triggered. A boosted task already had its parameters set by
rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary,
hence the WARN_ON call.
Check if we are requeueing a boosted task and avoid calling
setup_new_dl_entity if that's the case.
Fixes:
295d6d5e3736 ("sched/deadline: Fix switching to -deadline")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com
Sebastian Andrzej Siewior [Fri, 22 Nov 2024 17:35:57 +0000 (18:35 +0100)]
sched/core: Update kernel boot parameters for LAZY preempt.
Update the documentation for the `preempt=' parameter which now also
accepts `lazy'.
Fixes:
7c70cb94d29cd ("sched: Add Lazy preemption model")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241122173557.MYOtT95Q@linutronix.de
K Prateek Nayak [Tue, 19 Nov 2024 05:44:32 +0000 (05:44 +0000)]
sched/core: Prevent wakeup of ksoftirqd during idle load balance
Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event on
from the IPI handler on the idle CPU. If the SMP function is invoked
from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd
because soft interrupts are handled before ksoftirqd get on the CPU.
Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
<idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED]
<idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
...
Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.
Following are the observations with the changes when enabling the same
set of events:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
<idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.
Fixes:
b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Closes: https://lore.kernel.org/lkml/
fcf823f-195e-6c9a-eac3-
25f870cb35ac@inria.fr/ [1]
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 19 Nov 2024 05:44:31 +0000 (05:44 +0000)]
sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
Commit
b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().
need_resched() check in _nohz_idle_balance() exists to bail out of load
balancing if another task has woken up on the CPU currently in-charge of
idle load balancing which is being processed in SCHED_SOFTIRQ context.
Since the optimization mentioned above overloads the interpretation of
TIF_NEED_RESCHED, check for idle_cpu() before going with the existing
need_resched() check which can catch a genuine task wakeup on an idle
CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as
well as the case where ksoftirqd needs to be preempted as a result of
new task wakeup or slice expiry.
In case of PREEMPT_RT or threadirqs, although the idle load balancing
may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd
is the only fair task going back to sleep will trigger a newidle balance
on the CPU which will alleviate some imbalance if it exists if idle
balance fails to do so.
Fixes:
b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 19 Nov 2024 05:44:30 +0000 (05:44 +0000)]
sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
The need_resched() check currently in nohz_csd_func() can be tracked
to have been added in scheduler_ipi() back in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")
Since then, it has travelled quite a bit but it seems like an idle_cpu()
check currently is sufficient to detect the need to bail out from an
idle load balancing. To justify this removal, consider all the following
case where an idle load balancing could race with a task wakeup:
o Since commit
f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
on wakelist if wakee cpu is idle") a target perceived to be idle
(target_rq->nr_running == 0) will return true for
ttwu_queue_cond(target) which will offload the task wakeup to the idle
target via an IPI.
In all such cases target_rq->ttwu_pending will be set to 1 before
queuing the wake function.
If an idle load balance races here, following scenarios are possible:
- The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
IPI is sent to the CPU to wake it out of idle. If the
nohz_csd_func() queues before sched_ttwu_pending(), the idle load
balance will bail out since idle_cpu(target) returns 0 since
target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
sched_ttwu_pending() it should see rq->nr_running to be non-zero and
bail out of idle load balancing.
- The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
the sender will simply set TIF_NEED_RESCHED for the target to put it
out of idle and flush_smp_call_function_queue() in do_idle() will
execute the call function. Depending on the ordering of the queuing
of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
target_rq->nr_running to be non-zero if there is a genuine task
wakeup racing with the idle load balance kick.
o The waker CPU perceives the target CPU to be busy
(targer_rq->nr_running != 0) but the CPU is in fact going idle and due
to a series of unfortunate events, the system reaches a case where the
waker CPU decides to perform the wakeup by itself in ttwu_queue() on
the target CPU but target is concurrently selected for idle load
balance (XXX: Can this happen? I'm not sure, but we'll consider the
mother of all coincidences to estimate the worst case scenario).
ttwu_do_activate() calls enqueue_task() which would increment
"rq->nr_running" post which it calls wakeup_preempt() which is
responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
thing to note in this case is that rq->nr_running is already non-zero
in case of a wakeup before TIF_NEED_RESCHED is set which would
lead to idle_cpu() check returning false.
In all cases, it seems that need_resched() check is unnecessary when
checking for idle_cpu() first since an impending wakeup racing with idle
load balancer will either set the "rq->ttwu_pending" or indicate a newly
woken task via "rq->nr_running".
Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the fist iteration of Suresh's
patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
sched_ttwu_do_pending(list);
if (unlikely((rq->idle == current) &&
rq->nohz_balance_kick &&
!need_resched()))
raise_softirq_irqoff(SCHED_SOFTIRQ);
Since the condition to raise the SCHED_SOFIRQ was preceded by
sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
the current upstream kernel, the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:
if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
raise_softirq_irqoff(SCHED_SOFTIRQ);
where idle_cpu() seems to have replaced "rq->idle == current" check.
Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit
b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.
Fixes:
b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
K Prateek Nayak [Tue, 19 Nov 2024 05:44:29 +0000 (05:44 +0000)]
softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel
do_softirq_post_smp_call_flush() on PREEMPT_RT kernels carries a
WARN_ON_ONCE() for any SOFTIRQ being raised from an SMP-call-function.
Since do_softirq_post_smp_call_flush() is called with preempt disabled,
raising a SOFTIRQ during flush_smp_call_function_queue() can lead to
longer preempt disabled sections.
Since commit
b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") IPIs to an idle CPU in
TIF_POLLING_NRFLAG mode can be optimized out by instead setting
TIF_NEED_RESCHED bit in idle task's thread_info and relying on the
flush_smp_call_function_queue() in the idle-exit path to run the
SMP-call-function.
To trigger an idle load balancing, the scheduler queues
nohz_csd_function() responsible for triggering an idle load balancing on
a target nohz idle CPU and sends an IPI. Only now, this IPI is optimized
out and the SMP-call-function is executed from
flush_smp_call_function_queue() in do_idle() which can raise a
SCHED_SOFTIRQ to trigger the balancing.
So far, this went undetected since, the need_resched() check in
nohz_csd_function() would make it bail out of idle load balancing early
as the idle thread does not clear TIF_POLLING_NRFLAG before calling
flush_smp_call_function_queue(). The need_resched() check was added with
the intent to catch a new task wakeup, however, it has recently
discovered to be unnecessary and will be removed in the subsequent
commit after which nohz_csd_function() can raise a SCHED_SOFTIRQ from
flush_smp_call_function_queue() to trigger an idle load balance on an
idle target in TIF_POLLING_NRFLAG mode.
nohz_csd_function() bails out early if "idle_cpu()" check for the
target CPU, and does not lock the target CPU's rq until the very end,
once it has found tasks to run on the CPU and will not inhibit the
wakeup of, or running of a newly woken up higher priority task. Account
for this and prevent a WARN_ON_ONCE() when SCHED_SOFTIRQ is raised from
flush_smp_call_function_queue().
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-2-kprateek.nayak@amd.com
Josh Don [Mon, 11 Nov 2024 18:27:38 +0000 (10:27 -0800)]
sched: fix warning in sched_setaffinity
Commit
8f9ea86fdf99b added some logic to sched_setaffinity that included
a WARN when a per-task affinity assignment races with a cpuset update.
Specifically, we can have a race where a cpuset update results in the
task affinity no longer being a subset of the cpuset. That's fine; we
have a fallback to instead use the cpuset mask. However, we have a WARN
set up that will trigger if the cpuset mask has no overlap at all with
the requested task affinity. This shouldn't be a warning condition; its
trivial to create this condition.
Reproduced the warning by the following setup:
- $PID inside a cpuset cgroup
- another thread repeatedly switching the cpuset cpus from 1-2 to just 1
- another thread repeatedly setting the $PID affinity (via taskset) to 2
Fixes:
8f9ea86fdf99b ("sched: Always preserve the user requested cpumask")
Signed-off-by: Josh Don <joshdon@google.com>
Acked-and-tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lkml.kernel.org/r/20241111182738.1832953-1-joshdon@google.com
Juri Lelli [Wed, 27 Nov 2024 06:37:40 +0000 (07:37 +0100)]
sched/deadline: Fix replenish_dl_new_period dl_server condition
The condition in replenish_dl_new_period() that checks if a reservation
(dl_server) is deferred and is not handling a starvation case is
obviously wrong.
Fix it.
Fixes:
a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
Linus Torvalds [Sun, 1 Dec 2024 22:28:56 +0000 (14:28 -0800)]
Linux 6.13-rc1
Linus Torvalds [Sun, 1 Dec 2024 21:38:24 +0000 (13:38 -0800)]
Merge tag 'i2c-for-6.13-rc1-part3' of git://git./linux/kernel/git/wsa/linux
Pull i2c component probing support from Wolfram Sang:
"Add OF component probing.
Some devices are designed and manufactured with some components having
multiple drop-in replacement options. These components are often
connected to the mainboard via ribbon cables, having the same signals
and pin assignments across all options. These may include the display
panel and touchscreen on laptops and tablets, and the trackpad on
laptops. Sometimes which component option is used in a particular
device can be detected by some firmware provided identifier, other
times that information is not available, and the kernel has to try to
probe each device.
Instead of a delicate dance between drivers and device tree quirks,
this change introduces a simple I2C component probe function. For a
given class of devices on the same I2C bus, it will go through all of
them, doing a simple I2C read transfer and see which one of them
responds. It will then enable the device that responds"
* tag 'i2c-for-6.13-rc1-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: fix typo in I2C OF COMPONENT PROBER
of: base: Document prefix argument for of_get_next_child_with_prefix()
i2c: Fix whitespace style issue
arm64: dts: mediatek: mt8173-elm-hana: Mark touchscreens and trackpads as fail
platform/chrome: Introduce device tree hardware prober
i2c: of-prober: Add GPIO support to simple helpers
i2c: of-prober: Add simple helpers for regulator support
i2c: Introduce OF component probe function
of: base: Add for_each_child_of_node_with_prefix()
of: dynamic: Add of_changeset_update_prop_string
Linus Torvalds [Sun, 1 Dec 2024 21:10:51 +0000 (13:10 -0800)]
Merge tag 'trace-printf-v6.13' of git://git./linux/kernel/git/trace/linux-trace
Pull bprintf() removal from Steven Rostedt:
- Remove unused bprintf() function, that was added with the rest of the
"bin-printf" functions.
These are functions that are used by trace_printk() that allows to
quickly save the format and arguments into the ring buffer without
the expensive processing of converting numbers to ASCII. Then on
output, at a much later time, the ring buffer is read and the string
processing occurs then. The bprintf() was added for consistency but
was never used. It can be safely removed.
* tag 'trace-printf-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
printf: Remove unused 'bprintf'
Linus Torvalds [Sun, 1 Dec 2024 20:41:21 +0000 (12:41 -0800)]
Merge tag 'timers_urgent_for_v6.13_rc1' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Fix a case where posix timers with a thread-group-wide target would
miss signals if some of the group's threads are exiting
- Fix a hang caused by ndelay() calling the wrong delay function
__udelay()
- Fix a wrong offset calculation in adjtimex(2) when using ADJ_MICRO
(microsecond resolution) and a negative offset
* tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timers: Target group sigqueue to current task only if not exiting
delay: Fix ndelay() spuriously treated as udelay()
ntp: Remove invalid cast in time offset math
Linus Torvalds [Sun, 1 Dec 2024 20:37:58 +0000 (12:37 -0800)]
Merge tag 'irq_urgent_for_v6.13_rc1' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Move the ->select callback to the correct ops structure in
irq-mvebu-sei to fix some Marvell Armada platforms
- Add a workaround for Hisilicon ITS erratum
162100801 which can cause
some virtual interrupts to get lost
- More platform_driver::remove() conversion
* tag 'irq_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: Switch back to struct platform_driver::remove()
irqchip/gicv3-its: Add workaround for hip09 ITS erratum
162100801
irqchip/irq-mvebu-sei: Move misplaced select() callback to SEI CP domain
Linus Torvalds [Sun, 1 Dec 2024 20:35:37 +0000 (12:35 -0800)]
Merge tag 'x86_urgent_for_v6.13_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a terminating zero end-element to the array describing AMD CPUs
affected by erratum 1386 so that the matching loop actually
terminates instead of going off into the weeds
- Update the boot protocol documentation to mention the fact that the
preferred address to load the kernel to is considered in the
relocatable kernel case too
- Flush the memory buffer containing the microcode patch after applying
microcode on AMD Zen1 and Zen2, to avoid unnecessary slowdowns
- Make sure the PPIN CPU feature flag is cleared on all CPUs if PPIN
has been disabled
* tag 'x86_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Terminate the erratum_1386_microcode array
x86/Documentation: Update algo in init_size description of boot protocol
x86/microcode/AMD: Flush patch buffer mapping after application
x86/mm: Carve out INVLPG inline asm for use by others
x86/cpu: Fix PPIN initialization
Linus Torvalds [Sun, 1 Dec 2024 17:23:33 +0000 (09:23 -0800)]
strscpy: write destination buffer only once
The point behind strscpy() was to once and for all avoid all the
problems with 'strncpy()' and later broken "fixed" versions like
strlcpy() that just made things worse.
So strscpy not only guarantees NUL-termination (unlike strncpy), it also
doesn't do unnecessary padding at the destination. But at the same time
also avoids byte-at-a-time reads and writes by _allowing_ some extra NUL
writes - within the size, of course - so that the whole copy can be done
with word operations.
It is also stable in the face of a mutable source string: it explicitly
does not read the source buffer multiple times (so an implementation
using "strnlen()+memcpy()" would be wrong), and does not read the source
buffer past the size (like the mis-design that is strlcpy does).
Finally, the return value is designed to be simple and unambiguous: if
the string cannot be copied fully, it returns an actual negative error,
making error handling clearer and simpler (and the caller already knows
the size of the buffer). Otherwise it returns the string length of the
result.
However, there was one final stability issue that can be important to
callers: the stability of the destination buffer.
In particular, the same way we shouldn't read the source buffer more
than once, we should avoid doing multiple writes to the destination
buffer: first writing a potentially non-terminated string, and then
terminating it with NUL at the end does not result in a stable result
buffer.
Yes, it gives the right result in the end, but if the rule for the
destination buffer was that it is _always_ NUL-terminated even when
accessed concurrently with updates, the final byte of the buffer needs
to always _stay_ as a NUL byte.
[ Note that "final byte is NUL" here is literally about the final byte
in the destination array, not the terminating NUL at the end of the
string itself. There is no attempt to try to make concurrent reads and
writes give any kind of consistent string length or contents, but we
do want to guarantee that there is always at least that final
terminating NUL character at the end of the destination array if it
existed before ]
This is relevant in the kernel for the tsk->comm[] array, for example.
Even without locking (for either readers or writers), we want to know
that while the buffer contents may be garbled, it is always a valid C
string and always has a NUL character at 'comm[TASK_COMM_LEN-1]' (and
never has any "out of thin air" data).
So avoid any "copy possibly non-terminated string, and terminate later"
behavior, and write the destination buffer only once.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dr. David Alan Gilbert [Wed, 2 Oct 2024 17:31:47 +0000 (18:31 +0100)]
printf: Remove unused 'bprintf'
bprintf() is unused. Remove it. It was added in the commit
4370aa4aa753
("vsprintf: add binary printf") but as far as I can see was never used,
unlike the other two functions in that patch.
Link: https://lore.kernel.org/20241002173147.210107-1-linux@treblig.org
Reviewed-by: Andy Shevchenko <andy@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Sun, 1 Dec 2024 02:30:22 +0000 (18:30 -0800)]
Merge tag 'turbostat-2024.11.30' of git://git./linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:
- assorted minor bug fixes
- assorted platform specific tweaks
- initial RAPL PSYS (SysWatt) support
* tag 'turbostat-2024.11.30' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: 2024.11.30
tools/power turbostat: Add RAPL psys as a built-in counter
tools/power turbostat: Fix child's argument forwarding
tools/power turbostat: Force --no-perf in --dump mode
tools/power turbostat: Add support for /sys/class/drm/card1
tools/power turbostat: Cache graphics sysfs file descriptors during probe
tools/power turbostat: Consolidate graphics sysfs access
tools/power turbostat: Remove unnecessary fflush() call
tools/power turbostat: Enhance platform divergence description
tools/power turbostat: Add initial support for GraniteRapids-D
tools/power turbostat: Remove PC3 support on Lunarlake
tools/power turbostat: Rename arl_features to lnl_features
tools/power turbostat: Add back PC8 support on Arrowlake
tools/power turbostat: Remove PC7/PC9 support on MTL
tools/power turbostat: Honor --show CPU, even when even when num_cpus=1
tools/power turbostat: Fix trailing '\n' parsing
tools/power turbostat: Allow using cpu device in perf counters on hybrid platforms
tools/power turbostat: Fix column printing for PMT xtal_time counters
tools/power turbostat: fix GCC9 build regression
Linus Torvalds [Sun, 1 Dec 2024 02:23:05 +0000 (18:23 -0800)]
Merge tag 'pci-v6.13-fixes-1' of git://git./linux/kernel/git/pci/pci
Pull PCI fix from Bjorn Helgaas:
- When removing a PCI device, only look up and remove a platform device
if there is an associated device node for which there could be a
platform device, to fix a merge window regression (Brian Norris)
* tag 'pci-v6.13-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/pwrctrl: Unregister platform device only if one actually exists
Linus Torvalds [Sun, 1 Dec 2024 02:14:56 +0000 (18:14 -0800)]
Merge tag 'lsm-pr-
20241129' of git://git./linux/kernel/git/pcmoore/lsm
Pull ima fix from Paul Moore:
"One small patch to fix a function parameter / local variable naming
snafu that went up to you in the current merge window"
* tag 'lsm-pr-
20241129' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
ima: uncover hidden variable in ima_match_rules()
Linus Torvalds [Sat, 30 Nov 2024 23:47:29 +0000 (15:47 -0800)]
Merge tag 'block-6.13-
20242901' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Use correct srcu list traversal (Breno)
- Scatter-gather support for metadata (Keith)
- Fabrics shutdown race condition fix (Nilay)
- Persistent reservations updates (Guixin)
- Add the required bits for MD atomic write support for raid0/1/10
- Correct return value for unknown opcode in ublk
- Fix deadlock with zone revalidation
- Fix for the io priority request vs bio cleanups
- Use the correct unsigned int type for various limit helpers
- Fix for a race in loop
- Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
it easier for actual humans to read
- Fix potential UAF when iterating tags
- A few fixes for bfq-iosched UAF issues
- Fix for brd discard not decrementing the allocated page count
- Various little fixes and cleanups
* tag 'block-6.13-
20242901' of git://git.kernel.dk/linux: (36 commits)
brd: decrease the number of allocated pages which discarded
block, bfq: fix bfqq uaf in bfq_limit_depth()
block: Don't allow an atomic write be truncated in blkdev_write_iter()
mq-deadline: don't call req_get_ioprio from the I/O completion handler
block: Prevent potential deadlock in blk_revalidate_disk_zones()
block: Remove extra part pointer NULLify in blk_rq_init()
nvme: tuning pr code by using defined structs and macros
nvme: introduce change ptpl and iekey definition
block: return bool from get_disk_ro and bdev_read_only
block: remove a duplicate definition for bdev_read_only
block: return bool from blk_rq_aligned
block: return unsigned int from blk_lim_dma_alignment_and_pad
block: return unsigned int from queue_dma_alignment
block: return unsigned int from bdev_io_opt
block: req->bio is always set in the merge code
block: don't bother checking the data direction for merges
block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
md/raid10: Atomic write support
md/raid1: Atomic write support
...
Linus Torvalds [Sat, 30 Nov 2024 23:43:02 +0000 (15:43 -0800)]
Merge tag 'io_uring-6.13-
20242901' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
- Remove a leftover struct from when the cqwait registered waiting was
transitioned to regions.
- Fix for an issue introduced in this merge window, where nop->fd might
be used uninitialized. Ensure it's always set.
- Add capping of the task_work run in local task_work mode, to prevent
bursty and long chains from adding too much latency.
- Work around xa_store() leaving ->head non-NULL if it encounters an
allocation error during storing. Just a debug trigger, and can go
away once xa_store() behaves in a more expected way for this
condition. Not a major thing as it basically requires fault injection
to trigger it.
- Fix a few mapping corner cases
- Fix KCSAN complaint on reading the table size post unlock. Again not
a "real" issue, but it's easy to silence by just keeping the reading
inside the lock that protects it.
* tag 'io_uring-6.13-
20242901' of git://git.kernel.dk/linux:
io_uring/tctx: work around xa_store() allocation error issue
io_uring: fix corner case forgetting to vunmap
io_uring: fix task_work cap overshooting
io_uring: check for overflows in io_pin_pages
io_uring/nop: ensure nop->fd is always initialized
io_uring: limit local tw done
io_uring: add io_local_work_pending()
io_uring/region: return negative -E2BIG in io_create_region()
io_uring: protect register tracing
io_uring: remove io_uring_cqwait_reg_arg
Linus Torvalds [Sat, 30 Nov 2024 23:36:17 +0000 (15:36 -0800)]
Merge tag 'dma-mapping-6.13-2024-11-30' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- fix physical address calculation for struct dma_debug_entry (Fedor
Pchelkin)
* tag 'dma-mapping-6.13-2024-11-30' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: fix physical address calculation for struct dma_debug_entry
Linus Torvalds [Sat, 30 Nov 2024 22:51:08 +0000 (14:51 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
- ARM fixes
- RISC-V Svade and Svadu (accessed and dirty bit) extension support for
host and guest
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: riscv: selftests: Add Svade and Svadu Extension to get-reg-list test
RISC-V: KVM: Add Svade and Svadu Extensions Support for Guest/VM
dt-bindings: riscv: Add Svade and Svadu Entries
RISC-V: Add Svade and Svadu Extensions Support
KVM: arm64: Use MDCR_EL2.HPME to evaluate overflow of hyp counters
KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
KVM: arm64: Mark set_sysreg_masks() as inline to avoid build failure
KVM: arm64: vgic-its: Add stronger type-checking to the ITS entry sizes
KVM: arm64: vgic: Kill VGIC_MAX_PRIVATE definition
KVM: arm64: vgic: Make vgic_get_irq() more robust
KVM: arm64: vgic-v3: Sanitise guest writes to GICR_INVLPIR
Linus Torvalds [Sat, 30 Nov 2024 22:45:29 +0000 (14:45 -0800)]
Merge tag 'sh-for-v6.13-tag1' of git://git./linux/kernel/git/glaubitz/sh-linux
Pull sh updates from John Paul Adrian Glaubitz:
"Two small fixes.
The first one by Huacai Chen addresses a runtime warning when
CONFIG_CPUMASK_OFFSTACK and CONFIG_DEBUG_PER_CPU_MAPS are selected
which occurs because the cpuinfo code on sh incorrectly uses NR_CPUS
when iterating CPUs instead of the runtime limit nr_cpu_ids.
A second fix by Dan Carpenter fixes a use-after-free bug in
register_intc_controller() which occurred as a result of improper
error handling in the interrupt controller driver code when
registering an interrupt controller during plat_irq_setup() on sh"
* tag 'sh-for-v6.13-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
sh: intc: Fix use-after-free bug in register_intc_controller()
sh: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
Linus Torvalds [Sat, 30 Nov 2024 22:33:44 +0000 (14:33 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Deselect ARCH_CORRECT_STACKTRACE_ON_KRETPROBE so that tests depending
on it don't run (and fail) on arm64
- Fix lockdep assert in the Arm SMMUv3 PMU driver
- Fix the port and device ID bits setting in the Arm CMN perf driver
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
perf/arm-cmn: Ensure port and device id bits are set properly
perf/arm-smmuv3: Fix lockdep assert in ->event_init()
arm64: disable ARCH_CORRECT_STACKTRACE_ON_KRETPROBE tests
Len Brown [Sat, 30 Nov 2024 21:22:00 +0000 (16:22 -0500)]
tools/power turbostat: 2024.11.30
since 2024.07.26:
assorted minor bug fixes
assorted platform specific tweaks
initial RAPL PSYS (SysWatt) support
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Wed, 2 Oct 2024 13:05:15 +0000 (15:05 +0200)]
tools/power turbostat: Add RAPL psys as a built-in counter
Introduce the counter as a part of global, platform counters structure.
We open the counter for only one cpu, but otherwise treat it as an
ordinary RAPL counter, allowing for grouped perf read.
The counter is disabled by default, because it's interpretation may
require additional, platform specific information, making it unsuitable
for general use.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Wed, 13 Nov 2024 14:48:22 +0000 (15:48 +0100)]
tools/power turbostat: Fix child's argument forwarding
Add '+' to optstring when early scanning for --no-msr and --no-perf.
It causes option processing to stop as soon as a nonoption argument is
encountered, effectively skipping child's arguments.
Fixes:
3e4048466c39 ("tools/power turbostat: Add --no-msr option")
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Thu, 24 Oct 2024 13:17:45 +0000 (15:17 +0200)]
tools/power turbostat: Force --no-perf in --dump mode
Force the --no-perf early to prevent using it as a source. User asks for
raw values, but perf returns them relative to the opening of the file
descriptor.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:46 +0000 (15:59 +0800)]
tools/power turbostat: Add support for /sys/class/drm/card1
On some machines, the graphics device is enumerated as
/sys/class/drm/card1 instead of /sys/class/drm/card0. The current
implementation does not handle this scenario, resulting in the loss of
graphics C6 residency and frequency information.
Add support for /sys/class/drm/card1, ensuring that turbostat can
retrieve and display the graphics columns for these platforms.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:45 +0000 (15:59 +0800)]
tools/power turbostat: Cache graphics sysfs file descriptors during probe
Snapshots of the graphics sysfs knobs are taken based on file
descriptors. To optimize this process, open the files and cache the file
descriptors during the graphics probe phase. As a result, the previously
cached pathnames become redundant and are removed.
This change aims to streamline the code without altering its functionality.
No functional change intended.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:44 +0000 (15:59 +0800)]
tools/power turbostat: Consolidate graphics sysfs access
Currently, there is an inconsistency in how graphics sysfs knobs are
accessed: graphics residency sysfs knobs are opened and closed for each
read, while graphics frequency sysfs knobs are opened once and remain
open until turbostat exits. This inconsistency is confusing and adds
unnecessary code complexity.
Consolidate the access method by opening the sysfs files once and
reusing the file pointers for subsequent accesses. This approach
simplifies the code and ensures a consistent method for accessing
graphics sysfs knobs.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:43 +0000 (15:59 +0800)]
tools/power turbostat: Remove unnecessary fflush() call
The graphics sysfs knobs are read-only, making the use of fflush()
before reading them redundant.
Remove the unnecessary fflush() call.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:42 +0000 (15:59 +0800)]
tools/power turbostat: Enhance platform divergence description
In various generations, platforms often share a majority of features,
diverging only in a few specific aspects. The current approach of using
hardcoded values in 'platform_features' structure fails to effectively
represent these divergences.
To improve the description of platform divergence:
1. Each newly introduced 'platform_features' structure must have a base,
typically derived from the previous generation.
2. Platform feature values should be inherited from the base structure
rather than being hardcoded.
This approach ensures a more accurate and maintainable representation of
platform-specific features across different generations.
Converts `adl_features` and `lnl_features` to follow this new scheme.
No functional change.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:41 +0000 (15:59 +0800)]
tools/power turbostat: Add initial support for GraniteRapids-D
Add initial support for GraniteRapids-D. It shares the same features
with SapphireRapids.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:40 +0000 (15:59 +0800)]
tools/power turbostat: Remove PC3 support on Lunarlake
Lunarlake supports CC1/CC6/CC7/PC2/PC6/PC10.
Remove PC3 support on Lunarlake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:39 +0000 (15:59 +0800)]
tools/power turbostat: Rename arl_features to lnl_features
As ARL shares the same features with ADL/RPL/MTL, now 'arl_features' is
used by Lunarlake platform only.
Rename 'arl_features' to 'lnl_features'.
No functional change.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:38 +0000 (15:59 +0800)]
tools/power turbostat: Add back PC8 support on Arrowlake
Similar to ADL/RPL/MTL, ARL supports CC1/CC6/CC7/PC2/PC3/PC6/PC8/PC10.
Add back PC8 support on Arrowlake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:37 +0000 (15:59 +0800)]
tools/power turbostat: Remove PC7/PC9 support on MTL
Similar to ADL/RPL, MTL support CC1/CC6/CC7/PC2/PC3/PC6/PC8/CP10.
Remove PC7/PC9 support on MTL.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Tue, 17 Sep 2024 20:33:26 +0000 (22:33 +0200)]
tools/power turbostat: Honor --show CPU, even when even when num_cpus=1
Honor --show CPU and --show Core when "topo.num_cpus == 1".
Previously turbostat assumed that on a 1-CPU system, these
columns should never appear.
Honoring these flags makes it easier for several programs
that parse turbostat output.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Tue, 27 Aug 2024 05:07:51 +0000 (13:07 +0800)]
tools/power turbostat: Fix trailing '\n' parsing
parse_cpu_string() parses the string input either from command line or
from /sys/fs/cgroup/cpuset.cpus.effective to get a list of CPUs that
turbostat can run with.
The cpu string returned by /sys/fs/cgroup/cpuset.cpus.effective contains
a trailing '\n', but strtoul() fails to treat this as an error.
That says, for the code below
val = ("\n", NULL, 10);
val returns 0, and errno is also not set.
As a result, CPU0 is erroneously considered as allowed CPU and this
causes failures when turbostat tries to run on CPU0.
get_counters: Could not migrate to CPU 0
...
turbostat: re-initialized with num_cpus 8, allowed_cpus 5
get_counters: Could not migrate to CPU 0
Add a check to return immediately if '\n' or '\0' is detected.
Fixes:
8c3dd2c9e542 ("tools/power/turbostat: Abstrct function for parsing cpu string")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Tue, 20 Aug 2024 16:47:59 +0000 (18:47 +0200)]
tools/power turbostat: Allow using cpu device in perf counters on hybrid platforms
Intel hybrid platforms expose different perf devices for P and E cores.
Instead of one, "/sys/bus/event_source/devices/cpu" device, there are
"/sys/bus/event_source/devices/{cpu_core,cpu_atom}".
This, however makes it more complicated for the user,
because most of the counters are available on both and had to be
handled manually.
This patch allows users to use "virtual" cpu device that is seemingly
translated to cpu_core and cpu_atom perf devices, depending on the type
of a CPU we are opening the counter for.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Wed, 7 Aug 2024 11:43:39 +0000 (13:43 +0200)]
tools/power turbostat: Fix column printing for PMT xtal_time counters
If the very first printed column was for a PMT counter of type xtal_time
we would misalign the column header, because we were always printing the
delimiter.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Todd Brandt [Wed, 31 Jul 2024 16:24:09 +0000 (12:24 -0400)]
tools/power turbostat: fix GCC9 build regression
Fix build regression seen when using old gcc-9 compiler.
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Linus Torvalds [Sat, 30 Nov 2024 21:41:50 +0000 (13:41 -0800)]
Merge tag 'kbuild-v6.13' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add generic support for built-in boot DTB files
- Enable TAB cycling for dialog buttons in nconfig
- Fix issues in streamline_config.pl
- Refactor Kconfig
- Add support for Clang's AutoFDO (Automatic Feedback-Directed
Optimization)
- Add support for Clang's Propeller, a profile-guided optimization.
- Change the working directory to the external module directory for M=
builds
- Support building external modules in a separate output directory
- Enable objtool for *.mod.o and additional kernel objects
- Use lz4 instead of deprecated lz4c
- Work around a performance issue with "git describe"
- Refactor modpost
* tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits)
kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms
gitignore: Don't ignore 'tags' directory
kbuild: add dependency from vmlinux to resolve_btfids
modpost: replace tdb_hash() with hash_str()
kbuild: deb-pkg: add python3:native to build dependency
genksyms: reduce indentation in export_symbol()
modpost: improve error messages in device_id_check()
modpost: rename alias symbol for MODULE_DEVICE_TABLE()
modpost: rename variables in handle_moddevtable()
modpost: move strstarts() to modpost.h
modpost: convert do_usb_table() to a generic handler
modpost: convert do_of_table() to a generic handler
modpost: convert do_pnp_device_entry() to a generic handler
modpost: convert do_pnp_card_entries() to a generic handler
modpost: call module_alias_printf() from all do_*_entry() functions
modpost: pass (struct module *) to do_*_entry() functions
modpost: remove DEF_FIELD_ADDR_VAR() macro
modpost: deduplicate MODULE_ALIAS() for all drivers
modpost: introduce module_alias_printf() helper
modpost: remove unnecessary check in do_acpi_entry()
...
Linus Torvalds [Sat, 30 Nov 2024 19:18:16 +0000 (11:18 -0800)]
Merge tag 'rtc-6.13' of git://git./linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"New drivers:
- Amlogic A4 and A5 RTC
- Marvell 88PM886 PMIC RTC
- Renesas RTCA-3 for Renesas RZ/G3S
Driver updates:
- ab-eoz9: fix temperature and alarm support
- cmos: improve locking behaviour
- isl12022: add alarm support
- m48t59: improve epoch handling
- mt6359: add range
- rzn1: fix BCD conversions and simplify driver"
* tag 'rtc-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (38 commits)
rtc: ab-eoz9: don't fail temperature reads on undervoltage notification
rtc: rzn1: reduce register access
rtc: rzn1: drop superfluous wday calculation
m68k: mvme147, mvme16x: Adopt rtc-m48t59 platform driver
rtc: brcmstb-waketimer: don't include 'pm_wakeup.h' directly
rtc: m48t59: Use platform_data struct for year offset value
rtc: ab-eoz9: fix abeoz9_rtc_read_alarm
rtc: rv3028: fix RV3028_TS_COUNT type
rtc: rzn1: update Michel's email
rtc: rzn1: fix BCD to rtc_time conversion errors
rtc: amlogic-a4: fix compile error
rtc: amlogic-a4: drop error messages
MAINTAINERS: Add an entry for Amlogic RTC driver
rtc: support for the Amlogic on-chip RTC
dt-bindings: rtc: Add Amlogic A4 and A5 RTC
rtc: add driver for Marvell 88PM886 PMIC RTC
rtc: check if __rtc_read_time was successful in rtc_timer_do_work()
rtc: pcf8563: Switch to regmap
rtc: pcf8563: Sort headers alphabetically
rtc: abx80x: Fix WDT bit position of the status register
...
Linus Torvalds [Sat, 30 Nov 2024 18:34:54 +0000 (10:34 -0800)]
Merge tag 'uml-for-linus-6.13-rc1' of git://git./linux/kernel/git/uml/linux
Pull UML updates from Richard Weinberger:
- Lots of cleanups, mostly from Benjamin Berg and Tiwei Bie
- Removal of unused code
- Fix for sparse warnings
- Cleanup around stub_exe()
* tag 'uml-for-linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (68 commits)
hostfs: Fix the NULL vs IS_ERR() bug for __filemap_get_folio()
um: move thread info into task
um: Always dump trace for specified task in show_stack
um: vector: Do not use drvdata in release
um: net: Do not use drvdata in release
um: ubd: Do not use drvdata in release
um: ubd: Initialize ubd's disk pointer in ubd_add
um: virtio_uml: query the number of vqs if supported
um: virtio_uml: fix call_fd IRQ allocation
um: virtio_uml: send SET_MEM_TABLE message with the exact size
um: remove broken double fault detection
um: remove duplicate UM_NSEC_PER_SEC definition
um: remove file sync for stub data
um: always include kconfig.h and compiler-version.h
um: set DONTDUMP and DONTFORK flags on KASAN shadow memory
um: fix sparse warnings in signal code
um: fix sparse warnings from regset refactor
um: Remove double zero check
um: fix stub exe build with CONFIG_GCOV
um: Use os_set_pdeathsig helper in winch thread/process
...
Linus Torvalds [Sat, 30 Nov 2024 18:32:47 +0000 (10:32 -0800)]
Merge tag 'ubifs-for-linus-6.13-rc1' of git://git./linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Bug fix for rtime compression
- Various cleanups
UBI:
- Cleanups for fastmap and wear leveling
UBIFS:
- Add support for FS_IOC_GETFSSYSFSPATH
- Remove dead ioctl code
- Fix UAF in ubifs_tnc_end_commit()"
* tag 'ubifs-for-linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: (25 commits)
ubifs: Fix uninitialized use of err in ubifs_jnl_write_inode()
jffs2: Prevent rtime decompress memory corruption
jffs2: remove redundant check on outpos > pos
fs: jffs2: Fix inconsistent indentation in jffs2_mark_node_obsolete
jffs2: Correct some typos in comments
jffs2: fix use of uninitialized variable
jffs2: Use str_yes_no() helper function
mtd: ubi: remove redundant check on bytes_left at end of function
mtd: ubi: fix unreleased fwnode_handle in find_volume_fwnode()
ubifs: authentication: Fix use-after-free in ubifs_tnc_end_commit
ubi: fastmap: Fix duplicate slab cache names while attaching
ubifs: xattr: remove unused anonymous enum
ubifs: Reduce kfree() calls in ubifs_purge_xattrs()
ubifs: Call iput(xino) only once in ubifs_purge_xattrs()
ubi: wl: Close down wear-leveling before nand is suspended
mtd: ubi: Rmove unused declaration in header file
ubifs: Correct the total block count by deducting journal reservation
ubifs: Convert to use ERR_CAST()
ubifs: add support for FS_IOC_GETFSSYSFSPATH
ubifs: remove unused ioctl flags GETFLAGS/SETFLAGS
...
Linus Torvalds [Sat, 30 Nov 2024 18:28:14 +0000 (10:28 -0800)]
Merge tag '9p-for-6.13-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- usbg: fix alloc failure handling & build-as-module
- xen: couple of fixes
- v9fs_cache_register/unregister code cleanup
* tag '9p-for-6.13-rc1' of https://github.com/martinetd/linux:
net/9p/usbg: allow building as standalone module
9p/xen: fix release of IRQ
9p/xen: fix init sequence
net/9p/usbg: fix handling of the failed kzalloc() memory allocation
fs/9p: replace functions v9fs_cache_{register|unregister} with direct calls
Linus Torvalds [Sat, 30 Nov 2024 18:22:38 +0000 (10:22 -0800)]
Merge tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A fix for the mount "device" string parser from Patrick and two cred
reference counting fixups from Max, marked for stable.
Also included a number of cleanups and a tweak to MAINTAINERS to avoid
unnecessarily CCing netdev list"
* tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client:
ceph: fix cred leak in ceph_mds_check_access()
ceph: pass cred pointer to ceph_mds_auth_match()
ceph: improve caps debugging output
ceph: correct ceph_mds_cap_peer field name
ceph: correct ceph_mds_cap_item field name
ceph: miscellaneous spelling fixes
ceph: Use strscpy() instead of strcpy() in __get_snap_name()
ceph: Use str_true_false() helper in status_show()
ceph: requalify some char pointers as const
ceph: extract entity name from device id
MAINTAINERS: exclude net/ceph from networking
ceph: Remove fs/ceph deadcode
libceph: Remove unused ceph_crypto_key_encode
libceph: Remove unused ceph_osdc_watch_check
libceph: Remove unused pagevec functions
libceph: Remove unused ceph_pagelist functions
Linus Torvalds [Sat, 30 Nov 2024 18:17:53 +0000 (10:17 -0800)]
Merge tag 'nfs-for-6.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- nfs/localio: fix for a memory corruption in nfs_local_read_done
- Revert "nfs: don't reuse partially completed requests in
nfs_lock_and_join_requests"
- nfsv4:
- ignore SB_RDONLY when mounting nfs
- Fix a use-after-free problem in open()
- sunrpc:
- clear XPRT_SOCK_UPD_TIMEOUT when reseting the transport
- timeout and cancel TLS handshake with -ETIMEDOUT
- fix one UAF issue caused by sunrpc kernel tcp socket
- Fix a hang in TLS sock_close if sk_write_pending
- pNFS/blocklayout: Fix device registration issues
Features and cleanups:
- localio cleanups from Mike Snitzer
- Clean up refcounting on the nfs version modules
- __counted_by() annotations
- nfs: make processes that are waiting for an I/O lock killable"
* tag 'nfs-for-6.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
fs/nfs/io: make nfs_start_io_*() killable
nfs/blocklayout: Limit repeat device registration on failure
nfs/blocklayout: Don't attempt unregister for invalid block device
sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket
SUNRPC: timeout and cancel TLS handshake with -ETIMEDOUT
sunrpc: clear XPRT_SOCK_UPD_TIMEOUT when reset transport
nfs: ignore SB_RDONLY when mounting nfs
Revert "nfs: don't reuse partially completed requests in nfs_lock_and_join_requests"
Revert "fs: nfs: fix missing refcnt by replacing folio_set_private by folio_attach_private"
nfs/localio: must clear res.replen in nfs_local_read_done
NFSv4.0: Fix a use-after-free problem in the asynchronous open()
NFSv4.0: Fix the wake up of the next waiter in nfs_release_seqid()
SUNRPC: Fix a hang in TLS sock_close if sk_write_pending
sunrpc: remove newlines from tracepoints
nfs: Annotate struct pnfs_commit_array with __counted_by()
nfs/localio: eliminate need for nfs_local_fsync_work forward declaration
nfs/localio: remove extra indirect nfs_to call to check {read,write}_iter
nfs/localio: eliminate unnecessary kref in nfs_local_fsync_ctx
nfs/localio: remove redundant suid/sgid handling
NFS: Implement get_nfs_version()
...
Linus Torvalds [Sat, 30 Nov 2024 18:14:42 +0000 (10:14 -0800)]
Merge tag '6.13-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- directory lease fixes
- password rotation fixes
- reconnect fix
- fix for SMB3.02 mounts
- DFS (global namespace) fixes
- fixes for special file handling (most relating to better handling
various types of symlinks)
- two minor cleanups
* tag '6.13-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (22 commits)
cifs: update internal version number
cifs: unlock on error in smb3_reconfigure()
cifs: during remount, make sure passwords are in sync
cifs: support mounting with alternate password to allow password rotation
smb: Initialize cfid->tcon before performing network ops
smb: During unmount, ensure all cached dir instances drop their dentry
smb: client: fix noisy message when mounting shares
smb: client: don't try following DFS links in cifs_tree_connect()
smb: client: allow reconnect when sending ioctl
smb: client: get rid of @nlsc param in cifs_tree_connect()
smb: client: allow more DFS referrals to be cached
cifs: Fix parsing reparse point with native symlink in SMB1 non-UNICODE session
cifs: Validate content of WSL reparse point buffers
cifs: Improve guard for excluding $LXDEV xattr
cifs: Add support for parsing WSL-style symlinks
cifs: Validate content of native symlink
cifs: Fix parsing native symlinks relative to the export
smb: client: fix NULL ptr deref in crypto_aead_setkey()
Update misleading comment in cifs_chan_update_iface
smb: client: change return value in open_cached_dir_by_dentry() if !cfids
...
Linus Torvalds [Sat, 30 Nov 2024 18:06:56 +0000 (10:06 -0800)]
Merge tag '6.13-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
- fix use after free due to race in ksmd workqueue handler
- debugging improvements
- fix incorrectly formatted response when client attempts SMB1
- improve memory allocation to reduce chance of OOM
- improve delays between retries when killing sessions
* tag '6.13-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix use-after-free in SMB request handling
ksmbd: add debug print for pending request during server shutdown
ksmbd: add netdev-up/down event debug print
ksmbd: add debug prints to know what smb2 requests were received
ksmbd: add debug print for rdma capable
ksmbd: use msleep instaed of schedule_timeout_interruptible()
ksmbd: use __GFP_RETRY_MAYFAIL
ksmbd: fix malformed unsupported smb1 negotiate response
Brian Norris [Tue, 26 Nov 2024 21:04:34 +0000 (13:04 -0800)]
PCI/pwrctrl: Unregister platform device only if one actually exists
If a PCI device has an associated device_node with power supplies,
pci_bus_add_device() creates platform devices for use by pwrctrl. When the
PCI device is removed, pci_stop_dev() uses of_find_device_by_node() to
locate the related platform device, then unregisters it.
But when we remove a PCI device with no associated device node,
dev_of_node(dev) is NULL, and of_find_device_by_node(NULL) returns the
first device with "dev->of_node == NULL". The result is that we (a)
mistakenly unregister a completely unrelated platform device, leading to
issues like the first trace below, and (b) dereference the NULL pointer
from dev_of_node() when clearing OF_POPULATED, as in the second trace.
Unregister a platform device only if there is one associated with this PCI
device. This resolves issues seen when doing:
# echo 1 > /sys/bus/pci/devices/.../remove
Sample issue from unregistering the wrong platform device:
WARNING: CPU: 0 PID: 5095 at drivers/regulator/core.c:5885 regulator_unregister+0x140/0x160
Call trace:
regulator_unregister+0x140/0x160
devm_rdev_release+0x1c/0x30
release_nodes+0x68/0x100
devres_release_all+0x98/0xf8
device_unbind_cleanup+0x20/0x70
device_release_driver_internal+0x1f4/0x240
device_release_driver+0x20/0x40
bus_remove_device+0xd8/0x170
device_del+0x154/0x380
device_unregister+0x28/0x88
of_device_unregister+0x1c/0x30
pci_stop_bus_device+0x154/0x1b0
pci_stop_and_remove_bus_device_locked+0x28/0x48
remove_store+0xa0/0xb8
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
Later NULL pointer dereference for of_node_clear_flag(NULL, OF_POPULATED):
Unable to handle kernel NULL pointer dereference at virtual address
00000000000000c0
Call trace:
pci_stop_bus_device+0x190/0x1b0
pci_stop_and_remove_bus_device_locked+0x28/0x48
remove_store+0xa0/0xb8
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
Link: https://lore.kernel.org/r/20241126210443.4052876-1-briannorris@chromium.org
Fixes:
681725afb6b9 ("PCI/pwrctl: Remove pwrctl device without iterating over all children of pwrctl parent")
Reported-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Closes: https://lore.kernel.org/r/
1732890621-19656-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Brian Norris <briannorris@chromium.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Linus Torvalds [Sat, 30 Nov 2024 17:03:16 +0000 (09:03 -0800)]
Merge tag 'tty-6.13-rc1' of git://git./linux/kernel/git/gregkh/tty
Pull tty / serial driver updates from Greg KH:
"Here is a small set of tty and serial driver updates for 6.13-rc1.
Nothing major at all this time, only some small changes:
- few device tree binding updates
- 8250_exar serial driver updates
- imx serial driver updates
- sprd_serial driver updates
- other tiny serial driver updates, full details in the shortlog
All of these have been in linux-next for a while with one reported
issue, but that commit has now been reverted"
* tag 'tty-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (37 commits)
Revert "serial: sh-sci: Clean sci_ports[0] after at earlycon exit"
serial: amba-pl011: fix build regression
dt-bindings: serial: Add a new compatible string for ums9632
serial: sprd: Add support for sc9632
tty/serial/altera_uart: unwrap error log string
tty/serial/altera_jtaguart: unwrap error log string
serial: amba-pl011: Fix RX stall when DMA is used
tty: ldsic: fix tty_ldisc_autoload sysctl's proc_handler
serial: 8250_fintek: Add support for
F81216E
serial: sh-sci: Clean sci_ports[0] after at earlycon exit
tty: atmel_serial: Fix typo retreives to retrieves
tty: atmel_serial: Use devm_platform_ioremap_resource()
serial: 8250: omap: Move pm_runtime_get_sync
tty: serial: samsung: Add Exynos8895 compatible
dt-bindings: serial: samsung: Add samsung,exynos8895-uart compatible
serial: 8250_dw: Add Sophgo SG2044 quirk
dt-bindings: serial: snps-dw-apb-uart: Add Sophgo SG2044 uarts
dt-bindings: serial: snps,dw-apb-uart: merge duplicate compatible entry.
altera_jtaguart: Use dev_err() to report error attaching IRQ
altera_uart: Use dev_err() to report error attaching IRQ handler
...
Greg Kroah-Hartman [Sat, 30 Nov 2024 15:55:56 +0000 (16:55 +0100)]
Revert "serial: sh-sci: Clean sci_ports[0] after at earlycon exit"
This reverts commit
3791ea69a4858b81e0277f695ca40f5aae40f312.
It was reported to cause boot-time issues, so revert it for now.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes:
3791ea69a485 ("serial: sh-sci: Clean sci_ports[0] after at earlycon exit")
Cc: stable <stable@kernel.org>
Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Wed, 23 Oct 2024 08:41:59 +0000 (11:41 +0300)]
sh: intc: Fix use-after-free bug in register_intc_controller()
In the error handling for this function, d is freed without ever
removing it from intc_list which would lead to a use after free.
To fix this, let's only add it to the list after everything has
succeeded.
Fixes:
2dcec7a988a1 ("sh: intc: set_irq_wake() support")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Huacai Chen [Thu, 14 Jul 2022 08:41:36 +0000 (16:41 +0800)]
sh: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
When CONFIG_CPUMASK_OFFSTACK and CONFIG_DEBUG_PER_CPU_MAPS are selected,
cpu_max_bits_warn() generates a runtime warning similar as below when
showing /proc/cpuinfo. Fix this by using nr_cpu_ids (the runtime limit)
instead of NR_CPUS to iterate CPUs.
[ 3.052463] ------------[ cut here ]------------
[ 3.059679] WARNING: CPU: 3 PID: 1 at include/linux/cpumask.h:108 show_cpuinfo+0x5e8/0x5f0
[ 3.070072] Modules linked in: efivarfs autofs4
[ 3.076257] CPU: 0 PID: 1 Comm: systemd Not tainted 5.19-rc5+ #1052
[ 3.099465] Stack :
9000000100157b08 9000000000f18530 9000000000cf846c 9000000100154000
[ 3.109127]
9000000100157a50 0000000000000000 9000000100157a58 9000000000ef7430
[ 3.118774]
90000001001578e8 0000000000000040 0000000000000020 ffffffffffffffff
[ 3.128412]
0000000000aaaaaa 1ab25f00eec96a37 900000010021de80 900000000101c890
[ 3.138056]
0000000000000000 0000000000000000 0000000000000000 0000000000aaaaaa
[ 3.147711]
ffff8000339dc220 0000000000000001 0000000006ab4000 0000000000000000
[ 3.157364]
900000000101c998 0000000000000004 9000000000ef7430 0000000000000000
[ 3.167012]
0000000000000009 000000000000006c 0000000000000000 0000000000000000
[ 3.176641]
9000000000d3de08 9000000001639390 90000000002086d8 00007ffff0080286
[ 3.186260]
00000000000000b0 0000000000000004 0000000000000000 0000000000071c1c
[ 3.195868] ...
[ 3.199917] Call Trace:
[ 3.203941] [<
90000000002086d8>] show_stack+0x38/0x14c
[ 3.210666] [<
9000000000cf846c>] dump_stack_lvl+0x60/0x88
[ 3.217625] [<
900000000023d268>] __warn+0xd0/0x100
[ 3.223958] [<
9000000000cf3c90>] warn_slowpath_fmt+0x7c/0xcc
[ 3.231150] [<
9000000000210220>] show_cpuinfo+0x5e8/0x5f0
[ 3.238080] [<
90000000004f578c>] seq_read_iter+0x354/0x4b4
[ 3.245098] [<
90000000004c2e90>] new_sync_read+0x17c/0x1c4
[ 3.252114] [<
90000000004c5174>] vfs_read+0x138/0x1d0
[ 3.258694] [<
90000000004c55f8>] ksys_read+0x70/0x100
[ 3.265265] [<
9000000000cfde9c>] do_syscall+0x7c/0x94
[ 3.271820] [<
9000000000202fe4>] handle_syscall+0xc4/0x160
[ 3.281824] ---[ end trace
8b484262b4b8c24c ]---
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Linus Torvalds [Fri, 29 Nov 2024 21:06:06 +0000 (13:06 -0800)]
Merge tag 'drm-next-2024-11-29' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Merge window fixes, mostly amdgpu and xe, with a few other minor ones,
all looks fairly normal,
i915:
- hdcp: Fix when the first read and write are retried
xe:
- Wake up waiters after wait condition set to true
- Mark the preempt fence workqueue as reclaim
- Update xe2 graphics name string
- Fix a couple of guc submit races
- Fix pat index usage in migrate
- Ensure non-cached migrate pagetable bo mappings
- Take a PM ref in the delayed snapshot capture worker
amdgpu:
- SMU 13.0.6 fixes
- XGMI fixes
- SMU 13.0.7 fixes
- Misc code cleanups
- Plane refcount fixes
- DCN 4.0.1 fixes
- DC power fixes
- DTO fixes
- NBIO 7.11 fixes
- SMU 14.0.x fixes
- Reset fixes
- Enable DC on LoongArch
- Sysfs hotplug warning fix
- Misc small fixes
- VCN 4.0.3 fix
- Slab usage fix
- Jpeg delayed work fix
amdkfd:
- wptr handling fixes
radeon:
- Use ttm_bo_move_null()
- Constify struct pci_device_id
- Fix spurious hotplug
- HPD fix
rockchip
- fix 32-bit build"
* tag 'drm-next-2024-11-29' of https://gitlab.freedesktop.org/drm/kernel: (48 commits)
drm/xe: Take PM ref in delayed snapshot capture worker
drm/xe/migrate: use XE_BO_FLAG_PAGETABLE
drm/xe/migrate: fix pat index usage
drm/xe/guc_submit: fix race around suspend_pending
drm/xe/guc_submit: fix race around pending_disable
drm/xe: Update xe2_graphics name string
drm/rockchip: avoid 64-bit division
Revert "drm/radeon: Delay Connector detecting when HPD singals is unstable"
drm/amdgpu/jpeg: cancel the jpeg worker
drm/amdgpu: fix usage slab after free
drm/amdgpu/vcn: reset fw_shared when VCPU buffers corrupted on vcn v4.0.3
drm/amdgpu: Fix sysfs warning when hotplugging
drm/amdgpu: Add sysfs interface for vcn reset mask
drm/amdgpu/gmc7: fix wait_for_idle callers
drm/amd/pm: Remove arcturus min power limit
drm/amd/pm: skip setting the power source on smu v14.0.2/3
drm/amd/pm: disable pcie speed switching on Intel platform for smu v14.0.2/3
drm/amdkfd: Use the correct wptr size
drm/xe: Mark preempt fence workqueue as reclaim
drm/xe/ufence: Wake up waiters after setting ufence->signalled
...
Linus Torvalds [Fri, 29 Nov 2024 21:01:05 +0000 (13:01 -0800)]
Merge tag 'sound-fix-6.13-rc1' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. Majority of changes are device-specific
fixes and quirks, while there are a few core fixes to address
regressions and corner cases spotted by fuzzers.
- Fix of spinlock range that wrongly covered kvfree() call in rawmidi
- Fix potential NULL dereference at PCM mmap
- Fix incorrectly advertised MIDI 2.0 UMP Function Block info
- Various ASoC AMD quirks and fixes
- ASoC SOF Intel, Mediatek, HDMI-codec fixes
- A few more quirks and TAS2781 codec fix for HD-audio
- A couple of fixes for USB-audio for malicious USB descriptors"
* tag 'sound-fix-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (30 commits)
ALSA: hda: improve bass speaker support for ASUS Zenbook UM5606WA
ALSA: hda/realtek: Apply quirk for Medion E15433
ASoC: amd: yc: Add a quirk for microfone on Lenovo ThinkPad P14s Gen 5 21MES00B00
ASoC: SOF: ipc3-topology: Convert the topology pin index to ALH dai index
ASoC: mediatek: Check num_codecs is not zero to avoid panic during probe
ASoC: amd: yc: Fix for enabling DMIC on acp6x via _DSD entry
ALSA: ump: Fix evaluation of MIDI 1.0 FB info
ALSA: core: Fix possible NULL dereference caused by kunit_kzalloc()
ALSA: hda: Show the codec quirk info at probing
ALSA: asihpi: Remove unused variable
ALSA: hda/realtek: Set PCBeep to default value for ALC274
ALSA: hda/tas2781: Add speaker id check for ASUS projects
ALSA: hda/realtek: Update ALC225 depop procedure
ALSA: hda/realtek: Enable speaker pins for Medion E15443 platform
ALSA: hda/realtek: fix mute/micmute LEDs don't work for EliteBook X G1i
ALSA: usb-audio: Fix out of bounds reads when finding clock sources
ALSA: rawmidi: Fix kvfree() call in spinlock
ALSA: hda/realtek: Fix Internal Speaker and Mic boost of Infinix Y4 Max
ASoC: amd: yc: Add quirk for microphone on Lenovo Thinkpad T14s Gen 6 21M1CTO1WW
ASoC: doc: dapm: Add location information for dapm-graph tool
...
Linus Torvalds [Fri, 29 Nov 2024 19:58:27 +0000 (11:58 -0800)]
Merge tag 'char-misc-6.13-rc1' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc/IIO/whatever driver subsystem updates from Greg KH:
"Here is the 'big and hairy' char/misc/iio and other small driver
subsystem updates for 6.13-rc1.
Loads of things in here, and even a fun merge conflict!
- rust misc driver bindings and other rust changes to make misc
drivers actually possible.
I think this is the tipping point, expect to see way more rust
drivers going forward now that these bindings are present. Next
merge window hopefully we will have pci and platform drivers
working, which will fully enable almost all driver subsystems to
start accepting (or at least getting) rust drivers.
This is the end result of a lot of work from a lot of people,
congrats to all of them for getting this far, you've proved many of
us wrong in the best way possible, working code :)
- IIO driver updates, too many to list individually, that subsystem
keeps growing and growing...
- Interconnect driver updates
- nvmem driver updates
- pwm driver updates
- platform_driver::remove() fixups, loads of them
- counter driver updates
- misc driver updates (keba?)
- binder driver updates and fixes
- loads of other small char/misc/etc driver updates and additions,
full details in the shortlog.
All of these have been in linux-next for a while, with no other
reported issues other than that merge conflict"
* tag 'char-misc-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (401 commits)
mei: vsc: Fix typo "maintstepping" -> "mainstepping"
firmware: Switch back to struct platform_driver::remove()
misc: isl29020: Fix the wrong format specifier
scripts/tags.sh: Don't tag usages of DEFINE_MUTEX
fpga: Switch back to struct platform_driver::remove()
mei: vsc: Improve error logging in vsc_identify_silicon()
mei: vsc: Do not re-enable interrupt from vsc_tp_reset()
dt-bindings: spmi: qcom,x1e80100-spmi-pmic-arb: Add SAR2130P compatible
dt-bindings: spmi: spmi-mtk-pmif: Add compatible for MT8188
spmi: pmic-arb: fix return path in for_each_available_child_of_node()
iio: Move __private marking before struct element priv in struct iio_dev
docs: iio: ad7380: add adaq4370-4 and adaq4380-4
iio: adc: ad7380: add support for adaq4370-4 and adaq4380-4
iio: adc: ad7380: use local dev variable to shorten long lines
iio: adc: ad7380: fix oversampling formula
dt-bindings: iio: adc: ad7380: add adaq4370-4 and adaq4380-4 compatible parts
bus: mhi: host: pci_generic: Use pcim_iomap_region() to request and map MHI BAR
bus: mhi: host: Switch trace_mhi_gen_tre fields to native endian
misc: atmel-ssc: Use of_property_present() for non-boolean properties
misc: keba: Add hardware dependency
...
Linus Torvalds [Fri, 29 Nov 2024 19:43:29 +0000 (11:43 -0800)]
Merge tag 'driver-core-6.13-rc1' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small set of driver core changes for 6.13-rc1.
Nothing major for this merge cycle, except for the two simple merge
conflicts are here just to make life interesting.
Included in here are:
- sysfs core changes and preparations for more sysfs api cleanups
that can come through all driver trees after -rc1 is out
- fw_devlink fixes based on many reports and debugging sessions
- list_for_each_reverse() removal, no one was using it!
- last-minute seq_printf() format string bug found and fixed in many
drivers all at once.
- minor bugfixes and changes full details in the shortlog"
* tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (35 commits)
Fix a potential abuse of seq_printf() format string in drivers
cpu: Remove spurious NULL in attribute_group definition
s390/con3215: Remove spurious NULL in attribute_group definition
perf: arm-ni: Remove spurious NULL in attribute_group definition
driver core: Constify bin_attribute definitions
sysfs: attribute_group: allow registration of const bin_attribute
firmware_loader: Fix possible resource leak in fw_log_firmware_info()
drivers: core: fw_devlink: Fix excess parameter description in docstring
driver core: class: Correct WARN() message in APIs class_(for_each|find)_device()
cacheinfo: Use of_property_present() for non-boolean properties
cdx: Fix cdx_mmap_resource() after constifying attr in ->mmap()
drivers: core: fw_devlink: Make the error message a bit more useful
phy: tegra: xusb: Set fwnode for xusb port devices
drm: display: Set fwnode for aux bus devices
driver core: fw_devlink: Stop trying to optimize cycle detection logic
driver core: Constify attribute arguments of binary attributes
sysfs: bin_attribute: add const read/write callback variants
sysfs: implement all BIN_ATTR_* macros in terms of __BIN_ATTR()
sysfs: treewide: constify attribute callback of bin_attribute::llseek()
sysfs: treewide: constify attribute callback of bin_attribute::mmap()
...
Linus Torvalds [Fri, 29 Nov 2024 19:36:13 +0000 (11:36 -0800)]
Merge tag 'staging-6.13-rc1' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver updates from Greg KH:
"Here is the big set of staging driver changes for 6.13-rc1.
Lots of changes this merge cycle, drivers removed and drivers added.
Highlights include:
- removals of the following staging drivers due to no forward
progress and no one having either the hardware or the time/energy
to deal with them anymore:
- fieldbus
- gdm724x
- olpc_dcon
- rtl8712
- rts5208
- vt6655
- vt6656
If anyone has this hardware and wants to work on the drivers, it
can be an easy revert to get them back.
- addition of the gpib driver subsystem. Lots of drivers for really
old and semi-old interfaces to lab equipments. We expect lots of
churn in these drivers as they get cleaned up to "working" order.
These were added at the request of a user and the maintainer/author
of them is helping out with the effort
- loads and loads of tiny coding style cleanups for almost all
staging drivers. Too many to list, see the shortlog for details.
All of these have been in linux-next for a very long time with no
reported issues"
* tag 'staging-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (216 commits)
Staging: gpib: gpib_os.c - Remove unnecessary OOM message
staging: gpib: avoid unintended sign extension
staging: vchiq_debugfs: Use forward declarations
staging: vchiq_core: Rectify header include for vchiq_dump_state()
staging: vc04_services: Cleanup TODO entry
staging: most: Remove TODO contact information
staging: rtl8723bs: Remove TODO contact information
staging: sm750fb: Remove TODO contact information
staging: iio: Remove TODO file
staging: greybus: uart: Fix atomicity violation in get_serial_info()
staging: rtl8723bs: Remove unused function Efuse_GetCurrentSize
staging: rtl8723bs: Remove unused function efuse_WordEnableDataRead
staging: rtl8723bs: Remove function hal_EfusePgPacketWrite1ByteHeader
staging: rtl8723bs: Remove function hal_EfusePgPacketWrite2ByteHeader
staging: rtl8723bs: Remove unused function hal_EfusePgCheckAvailableAddr
staging: rtl8723bs: Remove unused function hal_EfuseConstructPGPkt
staging: rtl8723bs: Remove unused function hal_EfusePartialWriteCheck
staging: rtl8723bs: Remove unused function hal_EfusePgPacketWriteHeader
staging: rtl8723bs: Remove unused function hal_EfusePgPacketWriteData
staging: rtl8723bs: Remove unused function Hal_EfusePgPacketWrite_BT
...
Linus Torvalds [Fri, 29 Nov 2024 19:19:31 +0000 (11:19 -0800)]
Merge tag 'usb-6.13-rc1' of git://git./linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt updates from Greg KH:
"Here is the big set of USB and Thunderbolt changes for 6.13-rc1.
Overall, a pretty slow development cycle, the majority of the work
going into the debugfs interface for the thunderbolt (i.e. USB4) code,
to help with debugging the myrad ways that hardware vendors get their
interfaces messed up. Other than that, here's the highlights:
- thunderbolt changes and additions to debugfs interfaces
- lots of device tree updates for new and old hardware
- UVC configfs gadget updates and new apis for features
- xhci driver updates and fixes
- dwc3 driver updates and fixes
- typec driver updates and fixes
- lots of other small updates and fixes, full details in the shortlog
All of these have been in linux-next for a while with no reported
problems"
* tag 'usb-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (148 commits)
usb: typec: tcpm: Add support for sink-bc12-completion-time-ms DT property
dt-bindings: usb: maxim,max33359: add usage of sink bc12 time property
dt-bindings: connector: Add time property for Sink BC12 detection completion
usb: dwc3: gadget: Remove dwc3_request->needs_extra_trb
usb: dwc3: gadget: Cleanup SG handling
usb: dwc3: gadget: Fix looping of queued SG entries
usb: dwc3: gadget: Fix checking for number of TRBs left
usb: dwc3: ep0: Don't clear ep0 DWC3_EP_TRANSFER_STARTED
Revert "usb: gadget: composite: fix OS descriptors w_value logic"
usb: ehci-spear: fix call balance of sehci clk handling routines
USB: make to_usb_device_driver() use container_of_const()
USB: make to_usb_driver() use container_of_const()
USB: properly lock dynamic id list when showing an id
USB: make single lock for all usb dynamic id lists
drivers/usb/storage: refactor min with min_t
drivers/usb/serial: refactor min with min_t
drivers/usb/musb: refactor min/max with min_t/max_t
drivers/usb/mon: refactor min with min_t
drivers/usb/misc: refactor min with min_t
drivers/usb/host: refactor min/max with min_t/max_t
...
Linus Torvalds [Fri, 29 Nov 2024 19:15:07 +0000 (11:15 -0800)]
Merge tag 'modules-6.13-rc1-v2' of git://git./linux/kernel/git/modules/linux
Pull modules fixes from Luis Chamberlain:
"Three fixes, the main one build that we build the kallsyms test
modules all over again if we just run make twice"
* tag 'modules-6.13-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
selftests: find_symbol: Actually use load_mod() parameter
selftests: kallsyms: fix and clarify current test boundaries
selftests: kallsyms: fix double build stupidity
Linus Torvalds [Fri, 29 Nov 2024 19:10:30 +0000 (11:10 -0800)]
Merge tag 'apparmor-pr-2024-11-27' of git://git./linux/kernel/git/jj/linux-apparmor
Pull apparmor updates from John Johansen:
"Features:
- extend next/check table to add support for 2^24 states to the state
machine.
- rework capability audit cache to use broader cred information
instead of just the profile. Also add a time stamp so old entries
can be aged out of the cache.
Bug Fixes:
- fix 'Do simple duplicate message elimination' to clear previous
state when updating in capability audit cache
- Fix memory leak for aa_unpack_strdup()
- properly handle cx/px lookup failure when in complain mode
- allocate xmatch for nullpdb inside aa_alloc_null fixing a NULL ptr
deref of tracking profiles in when in complain mode
Cleanups:
- Remove everything being reported as deadcode
- replace misleading 'scrubbing environment' phrase in debug print
- Remove unnecessary NULL check before kvfree()
- clean up duplicated parts of handle_onexec()
- Use IS_ERR_OR_NULL() helper function
- move new_profile declaration to top of block instead immediately
after label to remove C23 extension warning
Documentation:
- add comment to document capability.c:profile_capable ad ptr
parameter can not be NULL
- add comment to document first entry is in packed perms struct is
reserved for future planned expansion.
- Update LSM/apparmor.rst add blurb for DEFAULT_SECURITY_APPARMOR"
* tag 'apparmor-pr-2024-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
apparmor: lift new_profile declaration to remove C23 extension warning
apparmor: replace misleading 'scrubbing environment' phrase in debug print
parser: drop dead code for XXX_comb macros
apparmor: Remove unused parameter L1 in macro next_comb
Docs: Update LSM/apparmor.rst
apparmor: audit_cap dedup based on subj_cred instead of profile
apparmor: add a cache entry expiration time aging out capability audit cache
apparmor: document capability.c:profile_capable ad ptr not being NULL
apparmor: fix 'Do simple duplicate message elimination'
apparmor: document first entry is in packed perms struct is reserved
apparmor: test: Fix memory leak for aa_unpack_strdup()
apparmor: Remove deadcode
apparmor: Remove unnecessary NULL check before kvfree()
apparmor: domain: clean up duplicated parts of handle_onexec()
apparmor: Use IS_ERR_OR_NULL() helper function
apparmor: add support for 2^24 states to the dfa state machine.
apparmor: properly handle cx/px lookup failure for complain
apparmor: allocate xmatch for nullpdb inside aa_alloc_null
Linus Torvalds [Fri, 29 Nov 2024 18:40:52 +0000 (10:40 -0800)]
Merge tag 's390-6.13-2' of git://git./linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Add swap entry for hugetlbfs support
- Add PTE_MARKER support for hugetlbs mappings; this fixes a regression
(possible page fault loop) which was introduced when support for
UFFDIO_POISON for hugetlbfs was added
- Add ARCH_HAS_PREEMPT_LAZY and PREEMPT_DYNAMIC support
- Mark IRQ entries in entry code, so that stack tracers can filter out
the non-IRQ parts of stack traces. This fixes stack depot capacity
limit warnings, since without filtering the number of unique stack
traces is huge
- In PCI code fix leak of struct zpci_dev object, and fix potential
double remove of hotplug slot
- Fix pagefault_disable() / pagefault_enable() unbalance in
arch_stack_user_walk_common()
- A couple of inline assembly optimizations, more cmpxchg() to
try_cmpxchg() conversions, and removal of usages of xchg() and
cmpxchg() on one and two byte memory areas
- Various other small improvements and cleanups
* tag 's390-6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (27 commits)
Revert "s390/mm: Allow large pages for KASAN shadow mapping"
s390/spinlock: Use flag output constraint for arch_cmpxchg_niai8()
s390/spinlock: Use R constraint for arch_load_niai4()
s390/spinlock: Generate shorter code for arch_spin_unlock()
s390/spinlock: Remove condition code clobber from arch_spin_unlock()
s390/spinlock: Use symbolic names in inline assemblies
s390: Support PREEMPT_DYNAMIC
s390/pci: Fix potential double remove of hotplug slot
s390/pci: Fix leak of struct zpci_dev when zpci_add_device() fails
s390/mm/hugetlbfs: Add missing includes
s390/mm: Add PTE_MARKER support for hugetlbfs mappings
s390/mm: Introduce region-third and segment table swap entries
s390/mm: Introduce region-third and segment table entry present bits
s390/mm: Rearrange region-third and segment table entry SW bits
KVM: s390: Increase size of union sca_utility to four bytes
KVM: s390: Remove one byte cmpxchg() usage
KVM: s390: Use try_cmpxchg() instead of cmpxchg() loops
s390/ap: Replace xchg() with WRITE_ONCE()
s390/mm: Allow large pages for KASAN shadow mapping
s390: Add ARCH_HAS_PREEMPT_LAZY support
...
Linus Torvalds [Fri, 29 Nov 2024 18:36:01 +0000 (10:36 -0800)]
Merge tag 'mips_6.13_1' of git://git./linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- fix for loongson64 device tree
- add SPI nand to realtek device tree
- change clock tree for mobileye
* tag 'mips_6.13_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: Loongson64: DTS: Really fix PCIe port nodes for ls7a
mips: dts: realtek: Add SPI NAND controller
MIPS: mobileye: eyeq6h: add OLB nodes OLB and remove fixed clocks
MIPS: mobileye: eyeq5: use OLB as provider for fixed factor clocks
Linus Torvalds [Fri, 29 Nov 2024 18:31:18 +0000 (10:31 -0800)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rmk/linux
Pull ARM updates from Russell King:
- add dev_is_amba() function to allow conversions during the next cycle
- improve PREEMPT_RT performance with VFP
- KASAN fixes for vmap stack
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9431/1: mm: Pair atomic_set_release() with _read_acquire()
ARM: 9430/1: entry: Do a dummy read from VMAP shadow
ARM: 9429/1: ioremap: Sync PGDs for VMALLOC shadow
ARM: 9426/1: vfp: Move sending signals outside of vfp_state_hold()ed section.
ARM: 9425/1: vfp: Use vfp_state_hold() in vfp_support_entry().
ARM: 9424/1: vfp: Use vfp_state_hold() in vfp_sync_hwstate().
ARM: 9423/1: vfp: Provide vfp_state_hold() for VFP locking.
ARM: 9415/1: amba: Add dev_is_amba() function and export it for modules
Linus Torvalds [Fri, 29 Nov 2024 18:27:49 +0000 (10:27 -0800)]
Merge tag 'sparc-for-6.13-tag1' of git://git./linux/kernel/git/alarsson/linux-sparc
Pull sparc updates from Andreas Larsson:
- Make sparc64 compilable with clang
- Replace one-element array with flexible array member
* tag 'sparc-for-6.13-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc:
sparc/vdso: Add helper function for 64-bit right shift on 32-bit target
sparc: Replace one-element array with flexible array member
sparc/build: Add SPARC target flags for compiling with clang
sparc/build: Put usage of -fcall-used* flags behind cc-option
Linus Torvalds [Fri, 29 Nov 2024 18:25:44 +0000 (10:25 -0800)]
Merge tag 'powerpc-6.13-2' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Fix htmldocs errors in sysfs-bus-event_source-devices-vpa-pmu
- Fix warning due to missing #size-cells on powermac
Thanks to Michael Ellerman, Yang Li, Rob Herring, and Stephen Rothwell.
* tag 'powerpc-6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/prom_init: Fixup missing powermac #size-cells
docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Fix htmldocs errors
powerpc/machdep: Remove duplicated include in svm.c
Zhang Xianwei [Thu, 28 Nov 2024 09:00:56 +0000 (17:00 +0800)]
brd: decrease the number of allocated pages which discarded
The number of allocated pages which discarded will not decrease.
Fix it.
Fixes:
9ead7efc6f3f ("brd: implement discard support")
Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241128170056565nPKSz2vsP8K8X2uk2iaDG@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>