Alex Deucher [Thu, 10 Apr 2025 17:26:43 +0000 (13:26 -0400)]
drm/amdgpu/userq: add helpers to start/stop scheduling
This will be used to stop/start user queue scheduling for
example when switching between kernel and user queues when
enforce isolation is enabled.
v2: use idx
v3: only stop compute/gfx queues
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 11 Apr 2025 18:16:41 +0000 (14:16 -0400)]
drm/amdgpu/userq: track the xcp_id associated with the queue
Track this to align with KFD for enforce isolation
handling.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Emily Deng [Tue, 8 Apr 2025 12:25:43 +0000 (20:25 +0800)]
drm/amdgpu: Clear overflow for SRIOV
A VF does not have permission to clear the overflow bit, so clear it
by reset.
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 14 Apr 2025 18:18:03 +0000 (14:18 -0400)]
drm/amdgpu/userq: rework driver parameter
Replace disable_kq parameter with user_queue parameter.
The parameter has the following logic:
-1 = auto (ASIC specific default)
0 = user queues disabled
1 = user queues enabled and kernel queues enabled (if supported)
2 = user queues enabled and kernel queues disabled
The default behavior (-1) is currently the same as 0 for current
ASICs. To enable user queues (in addition to kernel queues) set
user_queue=1. To enable user queues and disable kernel queues
(to make all resources available to user queues), set user_queue=2.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
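The parameter values above can be read as a small decode step. The following is a minimal sketch only, not the driver's code: struct uq_config, decode_user_queue() and the asic_default argument are hypothetical names, and treating 0 as "kernel queues stay enabled" is an assumption drawn from the wording of the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type; not a structure from the driver. */
struct uq_config {
	bool user_queues;   /* user queues enabled */
	bool kernel_queues; /* kernel queues enabled (if the ASIC supports them) */
};

/*
 * Decode the user_queue parameter as described in the commit message.
 * asic_default stands in for the ASIC-specific choice used when the
 * parameter is -1 (auto); the message says this currently behaves like 0.
 */
static struct uq_config decode_user_queue(int user_queue, int asic_default)
{
	/* Assumption: with user queues disabled, kernel queues stay enabled. */
	struct uq_config cfg = { .user_queues = false, .kernel_queues = true };

	assert(asic_default >= 0 && asic_default <= 2);
	if (user_queue == -1)
		user_queue = asic_default;

	switch (user_queue) {
	case 1: /* user queues and kernel queues */
		cfg.user_queues = true;
		break;
	case 2: /* user queues only */
		cfg.user_queues = true;
		cfg.kernel_queues = false;
		break;
	default: /* 0, and (by assumption) any out-of-range value */
		break;
	}
	return cfg;
}
```

With asic_default set to 0, decode_user_queue(-1, 0) and decode_user_queue(0, 0) give the same result, which matches the statement that the default is currently the same as 0.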
Asad Kamal [Sat, 12 Apr 2025 03:14:32 +0000 (11:14 +0800)]
drm/amd/pm: Enable host limit metrics support
Enable host limit metrics support for smuv_13_0_12
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Sun, 13 Apr 2025 14:32:23 +0000 (10:32 -0400)]
drm/amdgpu/sdma7: properly reference trap interrupts for userqs
We need to take a reference to the interrupts to make
sure they stay enabled even if the kernel queues have
disabled them.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Sun, 13 Apr 2025 14:31:08 +0000 (10:31 -0400)]
drm/amdgpu/sdma6: properly reference trap interrupts for userqs
We need to take a reference to the interrupts to make
sure they stay enabled even if the kernel queues have
disabled them.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Sat, 12 Apr 2025 03:08:01 +0000 (11:08 +0800)]
drm/amd/pm: Enable host limit metrics support
Enable host limit metrics support for smuv_13_0_6
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sathishkumar S [Thu, 10 Apr 2025 06:28:16 +0000 (11:58 +0530)]
drm/amdgpu: Enable doorbell for JPEG5_0_1
Enable doorbell for JPEG5_0_1 and adjust index for VCN5_0_1.
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Shiwu Zhang [Thu, 10 Apr 2025 06:26:47 +0000 (11:56 +0530)]
drm/amdgpu: Update vcn doorbell range in NBIO 7.9
Increase vcn doorbell range for gfx950 to 11.
Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Sun, 13 Apr 2025 14:19:24 +0000 (10:19 -0400)]
drm/amdgpu/gfx12: properly reference EOP interrupts for userqs
Regardless of whether we disable kernel queues, we need
to take an extra reference to the pipe interrupts for
user queues to make sure they stay enabled in case we
disable them for kernel queues.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
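The reasoning behind the extra reference can be illustrated with a reference-counted interrupt source. This is a simplified model under stated assumptions, not the driver's IRQ code: struct irq_src, irq_get() and irq_put() are hypothetical names, and the enabled flag merely stands in for the hardware enable bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an interrupt source with a reference count. */
struct irq_src {
	unsigned int refcount;
	bool enabled; /* models the hardware enable bit */
};

/* Take a reference; the first reference enables the interrupt. */
static void irq_get(struct irq_src *src)
{
	if (src->refcount++ == 0)
		src->enabled = true;
}

/* Drop a reference; dropping the last one disables the interrupt. */
static void irq_put(struct irq_src *src)
{
	assert(src->refcount > 0);
	if (--src->refcount == 0)
		src->enabled = false;
}
```

In this model, if the kernel queues and the user queues each hold one reference, the kernel queues dropping theirs leaves the count at one, so the interrupt stays enabled for the user queues.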
Alex Deucher [Sun, 13 Apr 2025 14:16:58 +0000 (10:16 -0400)]
drm/amdgpu/gfx11: properly reference EOP interrupts for userqs
Regardless of whether we disable kernel queues, we need
to take an extra reference to the pipe interrupts for
user queues to make sure they stay enabled in case we
disable them for kernel queues.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Eric Huang [Mon, 14 Apr 2025 15:19:01 +0000 (11:19 -0400)]
drm/amdkfd: fix a bug of smi event for superuser
rocm-smi with superuser permission doesn't show some
of the smi events, e.g. page fault/migration, because the
condition "(events & all)" is false. A superuser
should be able to detect all events, and the condition
"(events & all)" seems redundant, so removing it
fixes the issue.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Sun, 13 Apr 2025 20:51:21 +0000 (16:51 -0400)]
drm/amdgpu: add missing DCE6 to dce_version_to_string()
DCE 6.0, 6.1 and 6.4 are missing and are identified as UNKNOWN. Fix this.
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Tue, 8 Apr 2025 02:11:00 +0000 (22:11 -0400)]
drm/amdgpu: fix typo in bios_parser.c
Probably a cut and paste error from using get_integrated_info_v8's comment.
This has to be get_integrated_info_v9
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Tue, 8 Apr 2025 02:10:59 +0000 (22:10 -0400)]
drm/amdgpu: fix duplicated value setting in dce100_resource_construct()
i2c_speed_in_khz was set twice with the same values. Looking at other DCE
versions, we probably wanted to set the value for i2c_speed_in_khz_hdcp.
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Tue, 8 Apr 2025 02:10:58 +0000 (22:10 -0400)]
drm/radeon: fix typo in atombios.h
"aligned" not "aligend"
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Tue, 8 Apr 2025 02:10:57 +0000 (22:10 -0400)]
drm/amdgpu: fix typo in atombios.h
"aligned" not "aligend"
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Tue, 8 Apr 2025 02:10:56 +0000 (22:10 -0400)]
drm/amdgpu: add missing parameter name in dce110_clk_src_construct() declaration
While not strictly necessary, all the other parameters have names but
this one.
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alexandre Demers [Tue, 8 Apr 2025 02:10:55 +0000 (22:10 -0400)]
drm/amdgpu: rename function to follow naming convention in dce110
The prefix dce110 is used on all functions except init_pipes() and
init_hw(). Under DCN, these same functions are prefixed.
Let's keep things coherent.
Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Dan Carpenter [Sat, 12 Apr 2025 14:39:43 +0000 (17:39 +0300)]
drm/amdgpu: Clean up error handling in amdgpu_userq_fence_driver_alloc()
1) Checkpatch complains if we print an error message for kzalloc()
failure. The kzalloc() failure already has its own error messages
built in. Also this allocation is small enough that it is guaranteed
to succeed.
2) Return directly instead of doing a goto free_fence_drv. The
"fence_drv" is already NULL so no cleanup is necessary.
Reviewed-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Dan Carpenter [Sat, 12 Apr 2025 14:39:32 +0000 (17:39 +0300)]
drm/amdgpu: Fix double free in amdgpu_userq_fence_driver_alloc()
The goto frees "fence_drv" so this is a double free bug. There is no
need to call amdgpu_seq64_free(adev, fence_drv->va) since the seq64
allocation failed, so change the goto to goto free_fence_drv. Also
propagate the error code from amdgpu_seq64_alloc() instead of hard coding
it to -ENOMEM.
Fixes: e7cf21fbb277 ("drm/amdgpu: Few optimization and fixes for userq fence driver")
Reviewed-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
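The two error-handling changes above (return directly when nothing has been allocated yet; on a later failure, free each object exactly once and propagate the callee's error code) follow a common unwinding pattern. The sketch below is a userspace model, assuming hypothetical names (struct fence_drv, seq_alloc(), fence_drv_alloc()) and malloc/free in place of the kernel allocators; it is not the driver function.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fence driver object. */
struct fence_drv {
	void *seq_slot;
};

/* Hypothetical stand-in for the seq64 allocation; fails on request. */
static int seq_alloc(void **slot, int fail_with)
{
	if (fail_with)
		return fail_with;
	*slot = malloc(8);
	return *slot ? 0 : -ENOMEM;
}

static int fence_drv_alloc(struct fence_drv **out, int seq_fail_with)
{
	struct fence_drv *drv;
	int r;

	assert(out != NULL);

	drv = calloc(1, sizeof(*drv));
	if (!drv)
		return -ENOMEM; /* nothing allocated yet: return directly */

	r = seq_alloc(&drv->seq_slot, seq_fail_with);
	if (r)
		goto free_drv; /* the slot was never allocated: skip its free */

	*out = drv;
	return 0;

free_drv:
	free(drv); /* freed exactly once */
	return r;  /* propagate the callee's error code */
}
```

A caller that sees a non-zero return can rely on *out being left untouched and on no memory remaining allocated.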
Alex Deucher [Fri, 11 Apr 2025 19:49:51 +0000 (15:49 -0400)]
drm/amdgpu/userq: move runpm handling into core userq code
Pull it out of the MES code and into the generic code.
It's not MES specific and needs to be applied to all user
queues regardless of the backend.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Eric Huang [Mon, 14 Apr 2025 14:45:12 +0000 (10:45 -0400)]
drm/amdkfd: fix NULL check mistake for process smi event
The mistake will lead to NULL kernel oops, so fix it.
Fixes: 4172b556fd5b ("drm/amdkfd: add smi events for process start and end")
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jesse.zhang@amd.com [Mon, 14 Apr 2025 07:25:16 +0000 (15:25 +0800)]
drm/amdgpu/sdma_v4: Register the new sdma function pointers
Register stop/start/soft_reset queue functions for sdma v4_4_2.
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jesse.zhang@amd.com [Fri, 11 Apr 2025 05:01:19 +0000 (13:01 +0800)]
drm/amdgpu: Add the new sdma function pointers for amdgpu_sdma.h
This patch introduces new function pointers in the amdgpu_sdma structure
to handle queue stop, start and soft reset operations. These will replace
the older callback mechanism.
The new functions are:
- stop_kernel_queue: Stops a specific SDMA queue
- start_kernel_queue: Starts/Restores a specific SDMA queue
- soft_reset_kernel_queue: Performs soft reset on a specific SDMA queue
v2: Update stop_queue/start_queue function parameters to use ring pointer instead of device/instance (Christian)
v3: move stop_queue/start_queue to struct amdgpu_sdma_instance and rename them. (Alex)
v4: rework the ordering a bit (Alex)
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 10 Apr 2025 17:17:08 +0000 (13:17 -0400)]
drm/amdgpu: don't swallow errors in amdgpu_userqueue_resume_all()
Since we loop through the queues, accumulate the errors with |= rather
than overwriting them.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
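The effect of using |= in the loop can be shown with a small model. This is a sketch under assumptions, not the driver function: resume_fn, resume_all() and resume_odd_fails() are hypothetical names, and the per-queue hook simply returns 0 or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-queue resume hook: returns 0 or a negative errno. */
typedef int (*resume_fn)(int queue_id);

/* Resume every queue; OR the results so an early failure is not lost. */
static int resume_all(resume_fn resume, const int *queue_ids, size_t n)
{
	int ret = 0;

	assert(resume != NULL);
	for (size_t i = 0; i < n; i++)
		ret |= resume(queue_ids[i]);

	return ret; /* non-zero if any queue failed */
}

/* Example hook for illustration: odd-numbered queues fail with -EIO. */
static int resume_odd_fails(int queue_id)
{
	return (queue_id % 2) ? -EIO : 0;
}
```

With plain assignment, resuming queues 1 and 2 in that order would report success because the last call succeeds. With |= the result is non-zero. Note that OR-ing several different negative errno values signals failure but does not necessarily yield one specific errno.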
Alex Deucher [Fri, 21 Feb 2025 17:16:34 +0000 (12:16 -0500)]
drm/amdgpu/userq: handle system suspend and resume
Unmap user queues on suspend and map them on resume.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 20 Feb 2025 21:31:40 +0000 (16:31 -0500)]
drm/amdgpu/userq: add suspend and resume helpers
Add helpers to unmap and map user queues on suspend and
resume.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 10 Apr 2025 17:54:15 +0000 (13:54 -0400)]
drm/amdgpu/userq: properly clean up userq fence driver on failure
If userq creation fails, we need to properly unwind and free the
user queue fence driver.
v2: free idr as well (Sunil)
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 10 Apr 2025 17:49:47 +0000 (13:49 -0400)]
drm/amdgpu/userq: move some code around
Move some userq fence handling code into amdgpu_userq_fence.c.
This matches the other code in that file.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 21 Feb 2025 19:47:00 +0000 (14:47 -0500)]
drm/amdgpu/userq: rework front end call sequence
Split out the queue map from the mqd create call and split
out the queue unmap from the mqd destroy call. This separates
the queue setup and teardown from the actual enablement
in the firmware.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 10 Apr 2025 16:29:37 +0000 (12:29 -0400)]
drm/amdgpu/userq: rename suspend/resume callbacks
Rename to map and unmap to better align with what is happening
at the firmware level and remove the extra level of indirection
in the MES userq code.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 27 Feb 2025 03:16:53 +0000 (22:16 -0500)]
drm/amdgpu/userq/mes: remove unused header
This is unused so remove it.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Shane Xiao [Thu, 10 Apr 2025 04:35:15 +0000 (12:35 +0800)]
drm/amdkfd: Add rec SDMA engines support with limited XGMI
This patch adds recommended SDMA engines with limited XGMI SDMA engines.
It will help improve overall performance for device to device copies
with this optimization.
v2: Update the formatting issues and data type
Signed-off-by: Shane Xiao <shane.xiao@amd.com>
Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Srinivasan Shanmugam [Fri, 11 Apr 2025 16:02:08 +0000 (21:32 +0530)]
drm/amdgpu: Enhance Cleaner Shader Handling in GFX v9.0 Architecture v2
This commit modifies the gfx_v9_0_ring_emit_cleaner_shader function
to use a switch statement for cleaner shader emission based on the
specific GFX IP version.
The function now distinguishes between different IP versions, using
PACKET3_RUN_CLEANER_SHADER_9_0 for the versions 9.0.1, 9.1.0,
9.2.1, 9.2.2, 9.3.0, and 9.4.0, while retaining
PACKET3_RUN_CLEANER_SHADER for version 9.4.2.
v2: Simplify logic (Alex).
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Srinivasan Shanmugam [Fri, 11 Apr 2025 15:45:41 +0000 (21:15 +0530)]
drm/amdgpu: Add PACKET3_RUN_CLEANER_SHADER_9_0 for Cleaner Shader execution
This commit introduces the PACKET3_RUN_CLEANER_SHADER_9_0 definition,
which is a command packet utilized to instruct the GPU to execute the
cleaner shader for the GFX9.0 graphics architecture.
The cleaner shader is a piece of GPU code that is responsible for
clearing or initializing essential GPU resources, such as Local Data
Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar
General Purpose Registers (SGPRs). Properly clearing these resources is
vital for ensuring data isolation and security between different
workloads executed on the GPU.
When the GPU receives this packet, it fetches and runs the cleaner
shader instructions from the location specified in the packet, thereby
preventing data leaks and ensuring that previous job states do not
interfere with subsequent workloads.
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jesse Zhang [Fri, 11 Apr 2025 08:22:27 +0000 (16:22 +0800)]
drm/amd/amdgpu: Fix out of bounds warning in amdgpu_hw_ip_info
Fix an array index out of bounds warning in the DMA IP case of
amdgpu_hw_ip_info() where it was incorrectly checking
adev->gfx.gfx_ring[i].no_user_submission instead of
adev->sdma.instance[i].ring.no_user_submission.
The mismatch caused UBSAN to report an array bounds violation since
it was accessing the GFX ring array with SDMA instance indices.
Fixes: 4310acd4464b ("drm/amdgpu: add ring flag for no user submissions")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
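The class of bug described above, indexing one array with a loop bound that belongs to another, can be reproduced in a few lines. The following is an illustrative model only: struct dev, its array sizes and count_user_sdma_rings() are hypothetical and far simpler than the real device structure.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_GFX_RINGS      2 /* hypothetical sizes for illustration */
#define NUM_SDMA_INSTANCES 4

struct ring {
	bool no_user_submission;
};

struct dev {
	struct ring gfx_ring[NUM_GFX_RINGS];
	struct ring sdma_ring[NUM_SDMA_INSTANCES];
	int num_sdma;
};

/* Count the SDMA rings that userspace may submit to. */
static int count_user_sdma_rings(const struct dev *d)
{
	int n = 0;

	assert(d->num_sdma >= 0 && d->num_sdma <= NUM_SDMA_INSTANCES);
	for (int i = 0; i < d->num_sdma; i++) {
		/*
		 * Index the SDMA array with the SDMA index. Checking
		 * d->gfx_ring[i] here would read past the end of the
		 * smaller array once i reaches NUM_GFX_RINGS.
		 */
		if (!d->sdma_ring[i].no_user_submission)
			n++;
	}
	return n;
}
```

Because the loop bound is the number of SDMA instances, only an array sized for SDMA instances is safe to index inside the loop.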
Eric Huang [Mon, 7 Apr 2025 19:32:33 +0000 (15:32 -0400)]
drm/amdkfd: add smi events for process start and end
Implement the feature that lets rocm-smi show the events for KFD
process start and end.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Lijo Lazar [Fri, 11 Apr 2025 12:10:26 +0000 (17:40 +0530)]
drm/amdgpu: Use the right function for hdp flush
A few prechecks are made before an HDP flush; for example, a flush is not
required on bare-metal APUs. Using the hdp callback directly bypasses those
checks. Use amdgpu_device_flush_hdp, which takes care of the prechecks.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ellen Pan [Fri, 11 Apr 2025 02:12:24 +0000 (22:12 -0400)]
drm/amdgpu: Direct ret in ras_reset_err_cnt on VF
With the added sriov_vf check, ras_reset_error_count returns EOPNOTSUPP
directly, as we should not do anything on a VF to reset the RAS error
count.
This also fixes the issue that loading guest driver causes register
violations.
Reviewed-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Signed-off-by: Ellen Pan <yunru.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Lijo Lazar [Fri, 11 Apr 2025 11:15:46 +0000 (16:45 +0530)]
drm/amdgpu: Use generic hdp flush function
All versions except HDP v5.2 use common logic for HDP flush. Use a generic
function. HDP v5.2 forces NO_KIQ logic; revisit it later.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Candice Li [Fri, 11 Apr 2025 05:12:06 +0000 (13:12 +0800)]
drm/amdgpu: Set RAS EEPROM table version to v3 for umc v12_5
Set RAS EEPROM table version to v3 for umc v12_5.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jesse.zhang@amd.com [Tue, 8 Apr 2025 06:23:45 +0000 (14:23 +0800)]
drm/amdgpu: Enable per-queue reset for SDMA v4.4.2 on IP v9.5.0
Add support for per-queue reset on SDMA v4.4.2 when running with:
1. MEC firmware version 17 or later
2. DPM indicates SDMA reset is supported
v2: Fixed supported firmware versions (Lijo)
Signed-off-by: Jesse.Zhang <Jesse.zhang@amd.com>
Reviewed-by: Tim Huang <tim.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Srinivasan Shanmugam [Thu, 10 Apr 2025 14:07:06 +0000 (19:37 +0530)]
drm/amdgpu/gfx11: Add Cleaner Shader Support for GFX11.5.2/11.5.3 GPUs
Enable the cleaner shader for additional GFX11.5.2/11.5.3 series GPUs to
ensure data isolation among GPU tasks. The cleaner shader is tasked with
clearing the Local Data Store (LDS), Vector General Purpose Registers
(VGPRs), and Scalar General Purpose Registers (SGPRs), which helps avoid
data leakage and guarantees the accuracy of computational results.
This update extends cleaner shader support to GFX11.5.2/11.5.3 GPUs,
previously available for GFX11.0.3. It enhances security by clearing GPU
memory between processes and maintains a consistent GPU state across KGD
and KFD workloads.
Cc: Mario Sopena-Novales <mario.novales@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 9 Apr 2025 01:27:15 +0000 (21:27 -0400)]
drm/amd/display/dml2: use vzalloc rather than kzalloc
The structures are large and they do not require contiguous
memory so use vzalloc.
Fixes: 70839da63605 ("drm/amd/display: Add new DCN401 sources")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4126
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Thu, 10 Apr 2025 17:23:06 +0000 (12:23 -0500)]
Documentation/amdgpu: Add Ryzen AI 350 series processors
These have been announced so add them to the table.
Link: https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-7-350.html
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Roman Li [Wed, 9 Apr 2025 16:03:00 +0000 (12:03 -0400)]
drm/amd/display: Add htmldocs description for fused_io interface
[Why]
htmldocs build warning: "Function parameter or struct member 'fused_io'
not described in 'amdgpu_display_manager'".
[How]
Add missing description.
Fixes: ce801e5d6c1b ("drm/amd/display: HDCP Locality check using DMUB Fused IO")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Roman Li <Roman.Li@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 8 Apr 2025 14:39:06 +0000 (10:39 -0400)]
drm/amdgpu: adjust enforce_isolation handling
Switch from a bool to an enum and allow more options
for enforce isolation. There are now 3 modes of operation:
- Disabled (0)
- Enabled (serialization and cleaner shader) (1)
- Enabled in legacy mode (no serialization or cleaner shader) (2)
This provides better flexibility for more use cases.
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
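The three modes listed above map naturally onto an enum. The sketch below only mirrors the numeric values given in the message. The identifier names, the helper functions and the choice to treat out-of-range values as disabled are assumptions made for illustration, not the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the numeric values follow the commit message. */
enum enforce_isolation_mode {
	ISOLATION_DISABLED = 0,
	ISOLATION_ENABLED = 1,        /* serialization and cleaner shader */
	ISOLATION_ENABLED_LEGACY = 2, /* no serialization or cleaner shader */
};

/* Map a raw parameter value to a mode (assumption: unknown values disable). */
static enum enforce_isolation_mode isolation_mode_from_param(int value)
{
	switch (value) {
	case 1:
		return ISOLATION_ENABLED;
	case 2:
		return ISOLATION_ENABLED_LEGACY;
	default:
		return ISOLATION_DISABLED;
	}
}

/* Only the non-legacy enabled mode serializes and runs the cleaner shader. */
static bool isolation_uses_cleaner_shader(enum enforce_isolation_mode mode)
{
	assert(mode <= ISOLATION_ENABLED_LEGACY);
	return mode == ISOLATION_ENABLED;
}
```

A bool could only express the first two rows. The third value is what lets legacy behavior be selected without serialization or the cleaner shader.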
Alex Deucher [Tue, 8 Apr 2025 14:47:24 +0000 (10:47 -0400)]
drm/amdgpu/mes12: use the device value for enforce isolation
Use the local setting rather than the global parameter.
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 8 Apr 2025 14:45:52 +0000 (10:45 -0400)]
drm/amdgpu/mes11: use the device value for enforce isolation
Use the local setting rather than the global parameter.
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
David Rosca [Mon, 7 Apr 2025 11:12:11 +0000 (13:12 +0200)]
drm/amdgpu: Add back JPEG to video caps for carrizo and newer
JPEG is unsupported on Vega only.
Fixes: 0a6e7b06bdbe ("drm/amdgpu: Remove JPEG from vega and carrizo video caps")
Signed-off-by: David Rosca <david.rosca@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Fri, 21 Feb 2025 12:34:32 +0000 (20:34 +0800)]
drm/amdgpu/gfx12: Implement the GFX12 KCQ pipe reset
Implement the GFX12 KCQ pipe reset, and disable the GFX12
kernel compute queue until the CPFW fully supports it.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ce Sun [Wed, 9 Apr 2025 11:53:11 +0000 (19:53 +0800)]
drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset
Checking hive is more readable.
This fixes the following smatch warning:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6820 amdgpu_pci_slot_reset()
warn: iterator used outside loop: 'tmp_adev'
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
ZhenGuo Yin [Tue, 8 Apr 2025 08:18:28 +0000 (16:18 +0800)]
drm/amdgpu: fix warning of drm_mm_clean
Kernel doorbell BOs need to be freed before ttm_fini.
Fixes: 54c30d2a8def ("drm/amdgpu: create kernel doorbell pages")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Kenneth Feng [Tue, 1 Apr 2025 08:04:41 +0000 (16:04 +0800)]
drm/amd/amdgpu: disable ASPM in some situations
Disable ASPM for some ASICs on some specific platforms, as
required by the PCIe controller owner.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Fri, 28 Mar 2025 10:33:09 +0000 (18:33 +0800)]
drm/amdgpu: remove the duplicated mes queue active state setting
The MES queue deactivation and active status are already set in
mes_userq_unmap|map(), so the caller needn't set the queue_active
bit again.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ruili Ji [Tue, 1 Apr 2025 00:55:09 +0000 (20:55 -0400)]
amd/amdgpu: Implement VCN queue reset for vcn 4.0.3
Add a function for VCN queue reset so that the driver can do a
fine-grained reset instead of a whole GPU reset.
Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Masha Grinman [Thu, 3 Apr 2025 19:08:17 +0000 (14:08 -0500)]
drm/amdgpu: Move read of snoop register from guest to host
The guest is reading/writing the snoop register, which is a security
violation. Move the code to the host driver, and add a validation on the
guest side to check whether it is a guest.
Signed-off-by: Masha Grinman <Masha.Grinman@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Tue, 8 Apr 2025 18:09:57 +0000 (13:09 -0500)]
drm/amd: Forbid suspending into non-default suspend states
On systems that default to 'deep' some userspace software likes
to try to suspend in 'deep' first. If there is a failure for any
reason (such as -ENOMEM) the failure is ignored and then it will
try to use 's2idle' as a fallback. This fails, but more importantly
it leads to graphical problems.
Forbid this behavior and only allow suspending in the last state
supported by the system.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4093
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250408180957.4027643-1-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Fri, 28 Mar 2025 17:58:17 +0000 (18:58 +0100)]
drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
Otherwise triggering sysfs multiple times without other submissions in
between only runs the shader once.
v2: add some comment
v3: re-add missing cast
v4: squash in semicolon fix
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 18:23:55 +0000 (13:23 -0500)]
drm/amdgpu/sdma7: add support for disable_kq
When the parameter is set, disable user submissions
to kernel queues.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 18:22:49 +0000 (13:22 -0500)]
drm/amdgpu/sdma6: add support for disable_kq
When the parameter is set, disable user submissions
to kernel queues.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 18:09:55 +0000 (13:09 -0500)]
drm/amdgpu/sdma: add flag for tracking disable_kq
For SDMA, we still need kernel queues for paging so
they need to be initialized, but we do not want to
accept submissions from userspace when disable_kq
is set.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 17:37:51 +0000 (12:37 -0500)]
drm/amdgpu/gfx12: add support for disable_kq
Plumb in support for disabling kernel queues.
v2: use ring counts per Felix' suggestion
v3: fix stream fault handler, enable EOP interrupts
v4: fix MEC interrupt offset (Sunil)
v5: clean up after removing extra sched.ready settings
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 17:16:26 +0000 (12:16 -0500)]
drm/amdgpu/gfx11: add support for disable_kq
Plumb in support for disabling kernel queues in
GFX11. We have to bring up a GFX queue briefly in
order to initialize the clear state. After that
we can disable it.
v2: use ring counts per Felix' suggestion
v3: fix stream fault handler, enable EOP interrupts
v4: fix MEC interrupt offset (Sunil)
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Feb 2025 17:51:30 +0000 (12:51 -0500)]
drm/amdgpu/mes: make more vmids available when disable_kq=1
If we don't have kernel queues, the vmids can be used by
the MES for user queues.
Acked-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Feb 2025 17:40:30 +0000 (12:40 -0500)]
drm/amdgpu/mes: update hqd masks when disable_kq is set
Make all resources available to user queues.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Suggested-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 17:07:48 +0000 (12:07 -0500)]
drm/amdgpu/gfx: add generic handling for disable_kq
Add proper checks for disable_kq functionality in
gfx helper functions. Add special logic for families
that require the clear state setup.
v2: use ring count as per Felix suggestion
v3: fix num_gfx_rings handling in amdgpu_gfx_graphics_queue_acquire()
v4: fix error code (Alex)
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 18:06:19 +0000 (13:06 -0500)]
drm/amdgpu: add ring flag for no user submissions
This would be set by IPs which only accept submissions
from the kernel, not userspace, such as when kernel
queues are disabled. Don't expose the rings to userspace
and reject any submissions in the CS IOCTL.
v2: fix error code (Alex)
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Tue, 18 Feb 2025 15:33:53 +0000 (10:33 -0500)]
drm/amdgpu: add parameter to disable kernel queues
On chips that support user queues, setting this option
disables kernel queues so that user queues can be
validated without kernel queues.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 20 Mar 2025 16:49:30 +0000 (12:49 -0400)]
drm/amdgpu/userq: prevent runtime pm when userqs are active
Similar to KFD, prevent runtime pm while user queues are active.
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 20 Feb 2025 20:56:24 +0000 (15:56 -0500)]
drm/amdgpu: store userq_managers in a list in adev
So we can iterate across them when we need to manage
all user queues.
v2: add uq_mgr to adev list in amdgpu_userq_mgr_init
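A rough sketch of the pattern, assuming hypothetical names throughout. The kernel would normally use struct list_head for this; a plain singly linked list is substituted here so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process user queue manager. */
struct sketch_userq_mgr {
	int suspended;
	struct sketch_userq_mgr *next;
};

/* Hypothetical device that keeps every manager on one list. */
struct sketch_device {
	struct sketch_userq_mgr *userq_mgrs;
};

/* Register the manager with the device at init time. */
static void sketch_userq_mgr_init(struct sketch_device *dev,
				  struct sketch_userq_mgr *mgr)
{
	assert(dev && mgr);
	mgr->suspended = 0;
	mgr->next = dev->userq_mgrs;
	dev->userq_mgrs = mgr;
}

/* Walk all managers, e.g. to act on every user queue on the device.
 * Returns the number of managers visited. */
static size_t sketch_suspend_all(struct sketch_device *dev)
{
	struct sketch_userq_mgr *mgr;
	size_t n = 0;

	for (mgr = dev->userq_mgrs; mgr; mgr = mgr->next) {
		mgr->suspended = 1;
		n++;
	}
	return n;
}
```

Real code would also need locking around the list and removal on manager teardown, which this sketch leaves out.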
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 24 Mar 2025 20:29:03 +0000 (16:29 -0400)]
drm/amdgpu: bump version for user queue IP support query
Add the user queue IP support query to the drm_amdgpu_info_device
query.
Cc: marek.olsak@amd.com
Cc: prike.liang@amd.com
Cc: sunil.khatri@amd.com
Cc: yogesh.mohanmarimuthu@amd.com
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 24 Mar 2025 20:26:00 +0000 (16:26 -0400)]
drm/amdgpu: add UAPI to query if user queues are supported
Add an INFO query to check if user queues are supported.
v2: switch to a mask of IPs (Marek)
v3: move to drm_amdgpu_info_device (Marek)
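How userspace might test such a mask can be sketched as follows. The enum values and the helper name are assumptions for illustration; the authoritative IP indices and the field that carries the mask are defined in the amdgpu UAPI header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical IP indices used only by this sketch. */
enum {
	SKETCH_HW_IP_GFX = 0,
	SKETCH_HW_IP_COMPUTE = 1,
	SKETCH_HW_IP_DMA = 2,
};

/* One bit per IP: bit N set means user queues are supported on IP N. */
static bool sketch_userq_supported(uint32_t ip_mask, unsigned int ip)
{
	assert(ip < 32);
	return (ip_mask & (UINT32_C(1) << ip)) != 0;
}
```

A mask lets a single query report support for several IPs at once, instead of one query per IP.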
Cc: marek.olsak@amd.com
Cc: prike.liang@amd.com
Cc: sunil.khatri@amd.com
Cc: yogesh.mohanmarimuthu@amd.com
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Mar 2025 16:21:22 +0000 (12:21 -0400)]
drm/amdgpu/gfx12: split userq setup to a separate switch
Add a separate switch statement for the userq callback
assignment so that we can assign the callbacks for each
asic as the firmware becomes available.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Mar 2025 16:09:12 +0000 (12:09 -0400)]
drm/amdgpu/gfx11: clean up and consolidate sw_init
With the ME details fixed, we can now consolidate
this state. Also split out the userq setup into a separate
switch statement so that we can set them per IP version
when the firmwares are ready.
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arvind Yadav [Tue, 18 Mar 2025 13:15:40 +0000 (18:45 +0530)]
drm/amdgpu: Fix display freezing issue when resizing apps
The display is freezing because amdgpu_userq_wait_ioctl()
is waiting for a non-user queue fence (specifically, the PT update fence).
Root cause:
The resume_work is initiated by both amdgpu_userq_suspend and
amdgpu_userqueue_ensure_ev_fence at the same time. The amdgpu_userq_suspend
signals a dma-fence and subsequently triggers the resume_work, which is
intended to replace the existing fence by creating new dma-fence. However,
following this, the amdgpu_userqueue_ensure_ev_fence schedules another
resume_work that generates a new dma-fence, thereby replacing the one
created by amdgpu_userq_suspend. Consequently, the original fence will
never be signaled.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 20 Mar 2025 14:18:58 +0000 (10:18 -0400)]
drm/amdgpu/mes: warn on unexpected pipe numbers
Warn if the number of pipes exceeds what the MES supports.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Feb 2025 17:31:46 +0000 (12:31 -0500)]
drm/amdgpu/mes: centralize gfx_hqd mask management
Move it to amdgpu_mes to align with the compute and
sdma hqd masks. No functional change.
v2: rebase on new changes
v3: misc optimizations
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 12 Mar 2025 17:47:33 +0000 (13:47 -0400)]
drm/amdgpu: remove is_mes_queue flag
This was leftover from MES bring up when we had MES
user queues in the kernel. It's no longer used so
remove it.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Feb 2025 20:39:02 +0000 (15:39 -0500)]
drm/amdgpu/mes: remove unused functions
Leftover from the MES self tests that were removed previously.
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 26 Feb 2025 21:31:57 +0000 (16:31 -0500)]
drm/amdgpu: validate user queue parameters
Make sure these are set properly to ensure compatibility if
we ever update the IOCTL interface.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arvind Yadav [Tue, 18 Feb 2025 13:26:25 +0000 (18:56 +0530)]
drm/amdgpu: fix the memleak caused by fence not released
A taint issue is encountered during the unloading of gpu_sched
because the fence is not released/put. In this context,
amdgpu_vm_clear_freed is responsible for creating a job to
update the page table (PT). It allocates kmem_cache for
drm_sched_fence and returns the finished fence associated
with job->base.s_fence. In the case of a usermode queue, this finished
fence is added to the timeline sync object through
amdgpu_gem_update_bo_mapping, which is utilized by user
space to ensure the completion of the PT update.
[ 508.900587] =============================================================================
[ 508.900605] BUG drm_sched_fence (Tainted: G N): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()
[ 508.900617] -----------------------------------------------------------------------------
[ 508.900627] Slab 0xffffe0cc04548780 objects=32 used=2 fp=0xffff8ea81521f000 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 508.900645] CPU: 3 UID: 0 PID: 2337 Comm: rmmod Tainted: G N 6.12.0+ #1
[ 508.900651] Tainted: [N]=TEST
[ 508.900653] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F34 06/10/2021
[ 508.900656] Call Trace:
[ 508.900659] <TASK>
[ 508.900665] dump_stack_lvl+0x70/0x90
[ 508.900674] dump_stack+0x14/0x20
[ 508.900678] slab_err+0xcb/0x110
[ 508.900687] ? srso_return_thunk+0x5/0x5f
[ 508.900692] ? try_to_grab_pending+0xd3/0x1d0
[ 508.900697] ? srso_return_thunk+0x5/0x5f
[ 508.900701] ? mutex_lock+0x17/0x50
[ 508.900708] __kmem_cache_shutdown+0x144/0x2d0
[ 508.900713] ? flush_rcu_work+0x50/0x60
[ 508.900719] kmem_cache_destroy+0x46/0x1f0
[ 508.900728] drm_sched_fence_slab_fini+0x19/0x970 [gpu_sched]
[ 508.900736] __do_sys_delete_module.constprop.0+0x184/0x320
[ 508.900744] ? srso_return_thunk+0x5/0x5f
[ 508.900747] ? debug_smp_processor_id+0x1b/0x30
[ 508.900754] __x64_sys_delete_module+0x16/0x20
[ 508.900758] x64_sys_call+0xdf/0x20d0
[ 508.900763] do_syscall_64+0x51/0x120
[ 508.900769] entry_SYSCALL_64_after_hwframe+0x76/0x7e
v2: call dma_fence_put in amdgpu_gem_va_update_vm
v3: Addressed review comments from Christian.
- calling amdgpu_gem_update_timeline_node before switch.
- putting a dma_fence in case of error or !timeline_syncobj.
v4: Addressed review comments from Christian.
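The reference handling described above can be modeled in a few lines. This is a simplified sketch with hypothetical types and names, not the driver code: the fence reference is either handed over to the timeline or dropped, so it is never left dangling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted fence. */
struct sketch_fence {
	int refcount;
};

static void sketch_fence_put(struct sketch_fence *f)
{
	assert(f && f->refcount > 0);
	f->refcount--;
}

/* The caller owns one reference to 'f'. If there is a timeline slot,
 * ownership of that reference moves to the slot; otherwise the
 * reference is dropped here so the fence does not leak. */
static void sketch_update_timeline(struct sketch_fence *f,
				   struct sketch_fence **timeline_slot)
{
	if (timeline_slot)
		*timeline_slot = f;
	else
		sketch_fence_put(f);
}
```

Without the put in the second branch, the fence object would still be allocated when its slab cache is destroyed, which is what the trace above reports.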
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Le Ma <le.ma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 28 Feb 2025 19:55:57 +0000 (14:55 -0500)]
drm/amdgpu/userq: move the header to amdgpu directory
To align with other headers.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 19 Feb 2025 21:46:52 +0000 (16:46 -0500)]
drm/amdgpu/userq: remove BROKEN from config
This can be enabled now. We have the firmware checks
in place.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 28 Feb 2025 19:50:11 +0000 (14:50 -0500)]
drm/amdgpu: add userq firmware version checks
Currently disabled until the firmwares are officially
released.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 28 Feb 2025 19:45:37 +0000 (14:45 -0500)]
drm/amdgpu/gfx11: fix config guard
s/CONFIG_DRM_AMD_USERQ_GFX/CONFIG_DRM_AMDGPU_NAVI3X_USERQ/
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 28 Feb 2025 19:37:31 +0000 (14:37 -0500)]
drm/amdgpu/Kconfig: fix wording of DRM_AMDGPU_NAVI3X_USERQ
The feature is not navi3x specific at this point.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 28 Feb 2025 19:14:35 +0000 (14:14 -0500)]
drm/amdgpu: return an error in the userq IOCTL when DRM_AMDGPU_NAVI3X_USERQ=n
I'd swear this was already fixed, but I guess the patch never
landed. Add it now.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 20 Feb 2025 14:44:39 +0000 (09:44 -0500)]
drm/amdgpu/userq: handle runtime pm
Take a reference when we create a queue and drop it
when we destroy the queue. We need to keep the device
active while user queues are active.
v2: squash in fix from Sunil
v3: squash in fix from Prike
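The counting scheme can be sketched as below, with hypothetical names. The kernel's runtime PM API (for example pm_runtime_get_sync() and pm_runtime_put_autosuspend()) provides this kind of usage counting; which calls the patch actually uses is not stated in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device with a runtime PM usage counter. */
struct sketch_pm_dev {
	int pm_usage;
};

/* Creating a queue takes a reference that keeps the device active. */
static void sketch_queue_create(struct sketch_pm_dev *dev)
{
	dev->pm_usage++;
}

/* Destroying a queue drops the reference taken at creation. */
static void sketch_queue_destroy(struct sketch_pm_dev *dev)
{
	assert(dev->pm_usage > 0);
	dev->pm_usage--;
}

/* The device may runtime suspend only when no queue holds a reference. */
static bool sketch_can_runtime_suspend(const struct sketch_pm_dev *dev)
{
	return dev->pm_usage == 0;
}
```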
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 20 Feb 2025 21:08:02 +0000 (16:08 -0500)]
drm/amdgpu/userq: fix hardcoded uq functions
Use the IP type to look up the userq functions rather
than hardcoding it.
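A table lookup of this kind might look like the following sketch. All names are hypothetical; the point is only that the IP type indexes a table of function pointers, and an empty entry means no user queue support is registered for that IP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IP types for the sketch. */
enum sketch_ip {
	SKETCH_IP_GFX,
	SKETCH_IP_COMPUTE,
	SKETCH_IP_DMA,
	SKETCH_IP_NUM,
};

/* Hypothetical per-IP callback set. */
struct sketch_userq_funcs {
	int (*create)(void);
};

static int sketch_gfx_create(void) { return 1; }
static int sketch_compute_create(void) { return 2; }

static const struct sketch_userq_funcs sketch_gfx_funcs = {
	.create = sketch_gfx_create,
};
static const struct sketch_userq_funcs sketch_compute_funcs = {
	.create = sketch_compute_create,
};

/* Entries left NULL have no user queue callbacks registered. */
static const struct sketch_userq_funcs *sketch_funcs[SKETCH_IP_NUM] = {
	[SKETCH_IP_GFX] = &sketch_gfx_funcs,
	[SKETCH_IP_COMPUTE] = &sketch_compute_funcs,
};

/* Look up the callbacks by IP type instead of hardcoding one set. */
static const struct sketch_userq_funcs *sketch_lookup(int ip_type)
{
	if (ip_type < 0 || ip_type >= SKETCH_IP_NUM)
		return NULL;
	return sketch_funcs[ip_type];
}
```

Callers check for NULL before use, so an IP without callbacks is rejected rather than silently routed to another IP's functions.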
Reviewed-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arvind Yadav [Mon, 27 Jan 2025 12:52:01 +0000 (18:22 +0530)]
drm/amdgpu: Fix display freeze lockup error
A deadlock situation has arisen between the userq
signal ioctl and the eviction fence. In this scenario,
the function amdgpu_userq_signal_ioctl() has acquired a reservation
lock on the read/write buffer object (BO) through drm_exec.
Subsequently, it calls amdgpu_userqueue_ensure_ev_fence(),
which is waiting for the userq resume work.
Meanwhile, the userq suspend worker has initiated the userq resume
work (amdgpu_userqueue_resume_worker). This userq resume work attempts
to validate the vm->done BO, leading to amdgpu_userqueue_validate_bos
also attempting to take the reservation lock on the same write BO that is
already locked by amdgpu_userq_signal_ioctl.
As a result, the resume work becomes stalled, causing
amdgpu_userqueue_ensure_ev_fence to remain in a waiting state.
Call Trace:
[ 242.836469] INFO: task gnome-shel:cs0:1288 blocked for more than 120 seconds.
[ 242.836486] Tainted: G OE 6.12.0-rc2rebased-oct-24+ #4
[ 242.836491] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.836494] task:gnome-shel:cs0 state:D stack:0 pid:1288 tgid:1282 ppid:1180 flags:0x00000002
[ 242.836503] Call Trace:
[ 242.836508] <TASK>
[ 242.836517] __schedule+0x3e0/0xb10
[ 242.836530] ? srso_return_thunk+0x5/0x5f
[ 242.836541] schedule+0x31/0x120
[ 242.836546] schedule_timeout+0x150/0x160
[ 242.836551] ? srso_return_thunk+0x5/0x5f
[ 242.836555] ? sysvec_call_function+0x69/0xd0
[ 242.836562] ? srso_return_thunk+0x5/0x5f
[ 242.836567] ? preempt_count_add+0x7f/0xd0
[ 242.836577] __wait_for_common+0x91/0x180
[ 242.836582] ? __pfx_schedule_timeout+0x10/0x10
[ 242.836590] wait_for_completion+0x28/0x30
[ 242.836595] __flush_work+0x16c/0x290
[ 242.836602] ? __pfx_wq_barrier_func+0x10/0x10
[ 242.836611] flush_delayed_work+0x3a/0x60
[ 242.836621] amdgpu_userqueue_ensure_ev_fence+0x2d/0xb0 [amdgpu]
[ 242.836966] amdgpu_userq_signal_ioctl+0x959/0xec0 [amdgpu]
[ 242.837171] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ 242.837365] drm_ioctl_kernel+0xae/0x100 [drm]
[ 242.837398] drm_ioctl+0x2a1/0x500 [drm]
[ 242.837420] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ 242.837622] ? srso_return_thunk+0x5/0x5f
[ 242.837627] ? srso_return_thunk+0x5/0x5f
[ 242.837630] ? _raw_spin_unlock_irqrestore+0x2b/0x50
[ 242.837635] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[ 242.837811] __x64_sys_ioctl+0x99/0xd0
[ 242.837820] x64_sys_call+0x1209/0x20d0
[ 242.837825] do_syscall_64+0x51/0x120
[ 242.837830] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 242.837835] RIP: 0033:0x7f2f33f1a94f
[ 242.837838] RSP: 002b:00007f2f24ffea30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 242.837842] RAX: ffffffffffffffda RBX: 00007f2f24ffebd0 RCX: 00007f2f33f1a94f
[ 242.837845] RDX: 00007f2f24ffebd0 RSI: 00000000c0306457 RDI: 000000000000000d
[ 242.837847] RBP: 00007f2f24ffeab0 R08: 0000000000000000 R09: 0000000000000000
[ 242.837849] R10: 00007f2f24ffecd0 R11: 0000000000000246 R12: 00007f2f25000640
[ 242.837851] R13: 00000000c0306457 R14: 000000000000000d R15: 00007fff3b39c1e0
[ 242.837858] </TASK>
[ 242.837865] INFO: task Xwayland:cs0:1517 blocked for more than 120 seconds.
[ 242.837869] Tainted: G OE 6.12.0-rc2rebased-oct-24+ #4
[ 242.837872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.837874] task:Xwayland:cs0 state:D stack:0 pid:1517 tgid:1338 ppid:1282 flags:0x00004002
[ 242.837878] Call Trace:
[ 242.837880] <TASK>
[ 242.837883] __schedule+0x3e0/0xb10
[ 242.837890] schedule+0x31/0x120
[ 242.837894] schedule_preempt_disabled+0x1c/0x30
[ 242.837897] __mutex_lock.constprop.0+0x386/0x6e0
[ 242.837902] ? srso_return_thunk+0x5/0x5f
[ 242.837905] ? __timer_delete_sync+0x81/0xe0
[ 242.837911] __mutex_lock_slowpath+0x13/0x20
[ 242.837915] mutex_lock+0x3b/0x50
[ 242.837919] amdgpu_userqueue_ensure_ev_fence+0x35/0xb0 [amdgpu]
[ 242.838138] amdgpu_userq_signal_ioctl+0x959/0xec0 [amdgpu]
[ 242.838340] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ 242.838531] drm_ioctl_kernel+0xae/0x100 [drm]
[ 242.838559] drm_ioctl+0x2a1/0x500 [drm]
[ 242.838580] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ 242.838778] ? srso_return_thunk+0x5/0x5f
[ 242.838783] ? srso_return_thunk+0x5/0x5f
[ 242.838786] ? _raw_spin_unlock_irqrestore+0x2b/0x50
[ 242.838791] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[ 242.838967] __x64_sys_ioctl+0x99/0xd0
[ 242.838972] x64_sys_call+0x1209/0x20d0
[ 242.838975] do_syscall_64+0x51/0x120
[ 242.838979] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 242.838982] RIP: 0033:0x7f9118b1a94f
[ 242.838985] RSP: 002b:00007f910cdff760 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 242.838989] RAX: ffffffffffffffda RBX: 00007f910cdff910 RCX: 00007f9118b1a94f
[ 242.838991] RDX: 00007f910cdff910 RSI: 00000000c0306457 RDI: 000000000000000c
[ 242.838993] RBP: 00007f910cdff7e0 R08: 0000000000000000 R09: 0000000000000001
[ 242.838995] R10: 00007f910cdff9d4 R11: 0000000000000246 R12: 00007f910ce00640
[ 242.838997] R13: 00000000c0306457 R14: 000000000000000c R15: 00007fff9dd11d10
[ 242.839004] </TASK>
v2: Addressed review comments from Christian.
v3/v4: Addressed review comments from Christian.
- Move drm_exec drm_exec loop after userq fence create.
- cleanup the newly created userq fence in case of error.
v5 - Addressed review comments from Christian.
- Create a new amdgpu_userq_fence_alloc() function for allocation.
- Calling dma_fence_put for cleanup procedure.
- make amdgpu_userq_fence_create() function static.
- drm_exec_init is called after mutex_unlock.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arunpravin Paneer Selvam [Mon, 10 Feb 2025 16:47:28 +0000 (22:17 +0530)]
drm/amdgpu: Modify the seq64 VM cache policy
The seq64 VM cache policy should be set to UC (Uncached) to
match the cache settings of the kernel-mapped memory for the
userqueue fence address.
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arunpravin Paneer Selvam [Wed, 1 Jan 2025 08:52:29 +0000 (14:22 +0530)]
drm/amdgpu: Fix out-of-bounds issue in user fence
Fix an out-of-bounds issue in userq fence create when
accessing the userq xa structure. Add a lock to
protect against the race condition.
v2:(Christian)
- Allocate memory with GFP_ATOMIC.
v3:
- Moved to 2 xa approach.
v4:(Christian)
- Lock the xa_for_each blocks and memory allocation part
as well to make sure that xa is not modified in between
the 2 xa_for_each blocks.
BUG: KASAN: slab-out-of-bounds in amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[ +0.000006] Call Trace:
[ +0.000005] <TASK>
[ +0.000005] dump_stack_lvl+0x6c/0x90
[ +0.000011] print_report+0xc4/0x5e0
[ +0.000009] ? srso_return_thunk+0x5/0x5f
[ +0.000008] ? kasan_complete_mode_report_info+0x26/0x1d0
[ +0.000007] ? amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[ +0.000405] kasan_report+0xdf/0x120
[ +0.000009] ? amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[ +0.000405] __asan_report_store8_noabort+0x17/0x20
[ +0.000007] amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[ +0.000406] ? __pfx_amdgpu_userq_fence_create+0x10/0x10 [amdgpu]
[ +0.000408] ? srso_return_thunk+0x5/0x5f
[ +0.000008] ? ttm_resource_move_to_lru_tail+0x235/0x4f0 [ttm]
[ +0.000013] ? srso_return_thunk+0x5/0x5f
[ +0.000008] amdgpu_userq_signal_ioctl+0xd29/0x1c70 [amdgpu]
[ +0.000412] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ +0.000404] ? try_to_wake_up+0x165/0x1840
[ +0.000010] ? __pfx_futex_wake_mark+0x10/0x10
[ +0.000011] drm_ioctl_kernel+0x178/0x2f0 [drm]
[ +0.000050] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ +0.000404] ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
[ +0.000043] ? __kasan_check_read+0x11/0x20
[ +0.000007] ? srso_return_thunk+0x5/0x5f
[ +0.000007] ? __kasan_check_write+0x14/0x20
[ +0.000008] drm_ioctl+0x513/0xd20 [drm]
[ +0.000040] ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[ +0.000407] ? __pfx_drm_ioctl+0x10/0x10 [drm]
[ +0.000044] ? srso_return_thunk+0x5/0x5f
[ +0.000007] ? _raw_spin_lock_irqsave+0x99/0x100
[ +0.000007] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ +0.000006] ? __rseq_handle_notify_resume+0x188/0xc30
[ +0.000008] ? srso_return_thunk+0x5/0x5f
[ +0.000008] ? srso_return_thunk+0x5/0x5f
[ +0.000006] ? _raw_spin_unlock_irqrestore+0x27/0x50
[ +0.000010] amdgpu_drm_ioctl+0xcd/0x1d0 [amdgpu]
[ +0.000388] __x64_sys_ioctl+0x135/0x1b0
[ +0.000009] x64_sys_call+0x1205/0x20d0
[ +0.000007] do_syscall_64+0x4d/0x120
[ +0.000008] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ +0.000007] RIP: 0033:0x7f7c3d31a94f
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Saleemkhan Jamadar [Mon, 6 Jan 2025 07:20:50 +0000 (12:50 +0530)]
drm/amdgpu: add db size and offset range for VCN and VPE
VCN and VPE have different offset ranges; update the doorbell
offset ranges respectively.
The doorbell size for VCN and VPE is 32 bits.
v1 : add gfx switch case and fix checkpatch warnings (Shashank)
Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Saleemkhan Jamadar [Fri, 3 Jan 2025 13:32:59 +0000 (19:02 +0530)]
drm/amdgpu: map doorbell for the requested userq
Introduce a db_info structure to populate the doorbell
information that is required to be mapped.
Make the doorbell mapping function more generic by moving
the parameters that vary based on IPs and/or use case
into the db_info structure.
v2 - Fix space alignment and checkpatch warnings(Shashank)
Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Fri, 20 Dec 2024 12:44:23 +0000 (13:44 +0100)]
drm/amdgpu: fix call to amdgpu_eviction_fence_detach
That needs to be done after grabbing the lock, not before.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arvind Yadav [Thu, 19 Dec 2024 14:13:54 +0000 (19:43 +0530)]
drm/amdgpu: Fix Illegal opcode in command stream Error
When an application closes, it triggers the drm_file_free
function, which subsequently releases all allocated buffer
objects. Concurrently, the resume_worker thread will attempt
to map the usermode queue. However, since the wptr buffer
object has already been deallocated, this will result in
an Illegal opcode error being raised in the command stream.
Replace drm_release() with a new function,
amdgpu_drm_release(). This function will set the flag to
prevent the scheduling of any new queue resume/map, stop
all queues and then call drm_release().
V2:
- Replace drm_release with amdgpu_drm_release(Christian).
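The ordering described above can be illustrated with a small model. This is only a sketch with hypothetical names and a single-threaded view; the real code also has to deal with concurrency between the release path and the worker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-file state. */
struct sketch_file {
	bool closing;       /* set first, blocks new resume/map work */
	int active_queues;  /* queues currently mapped */
	bool bos_released;  /* buffer objects freed by the generic release */
};

/* The resume worker refuses to schedule once the file is closing. */
static bool sketch_schedule_resume(const struct sketch_file *f)
{
	return !f->closing;
}

/* Release in the order given in the message: flag, stop queues, then
 * the generic release that frees the buffer objects. */
static void sketch_release(struct sketch_file *f)
{
	f->closing = true;
	f->active_queues = 0;
	assert(!sketch_schedule_resume(f));
	f->bos_released = true;
}
```

Because the flag is set and the queues are stopped before the buffer objects are freed, no queue can be mapped against a wptr buffer that no longer exists.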
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Arunpravin Paneer Selvam [Thu, 12 Dec 2024 14:06:16 +0000 (19:36 +0530)]
drm/amdgpu: Apply sign extension to seq64
Apply sign extension to the seq64 virtual address (VA).
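For readers unfamiliar with the operation, sign extension of an address can be sketched as follows. The helper name and the 48-bit width used in the example are assumptions for illustration; the commit does not state the address width.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 'va_bits' bits of 'va' to a full 64-bit value. */
static uint64_t sketch_sign_extend_va(uint64_t va, unsigned int va_bits)
{
	uint64_t sign, mask;

	assert(va_bits > 0 && va_bits < 64);
	sign = UINT64_C(1) << (va_bits - 1);
	mask = (UINT64_C(1) << va_bits) - 1;
	va &= mask;
	/* (va ^ sign) - sign copies bit (va_bits - 1) into all higher bits. */
	return (va ^ sign) - sign;
}
```

With a 48-bit width, 0x0000800000000000 becomes 0xffff800000000000, while 0x00007fffffffffff is returned unchanged because bit 47 is clear.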
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>