git.kernel.dk Git - linux-block.git/commit

author	Lijo Lazar <lijo.lazar@amd.com>
	Thu, 24 Oct 2024 05:31:57 +0000 (11:01 +0530)
committer	Alex Deucher <alexander.deucher@amd.com>
	Tue, 10 Dec 2024 15:26:46 +0000 (10:26 -0500)
commit	e1ee2111ca48169a9fdc5075f7863f5d4d591e2f
tree	487517237aa6a8c5587ef88e717accf3a72878b2
parent	0eecff79e49f8ce5475e1b4d968f26263587be66

drm/amdgpu: Prefer RAS recovery for scheduler hang

Before scheduling a recovery due to scheduler/job hang, check if a RAS
error is detected. If so, choose RAS recovery to handle the situation. A
scheduler/job hang could be the side effect of a RAS error. In such
cases, it is required to go through the RAS error recovery process. A
RAS error recovery process in certain cases could also avoid a full
device reset.
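
The preference described above can be sketched roughly as follows. This is
a minimal illustration, not the driver's code: the names recovery_path,
hang_context and choose_recovery are hypothetical, and the boolean field
stands in for whatever query the RAS context actually provides.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not the amdgpu API. */
enum recovery_path {
	RECOVERY_SCHED_RESET,	/* ordinary recovery for a hung job */
	RECOVERY_RAS,		/* defer to the RAS error recovery process */
};

struct hang_context {
	/* Assumption: stands in for a query against the RAS context. */
	bool ras_error_detected;
};

/*
 * Called when a scheduler/job hang is reported. If a RAS error is
 * pending, the hang may only be its side effect, so RAS recovery is
 * chosen instead of the ordinary hang recovery.
 */
static enum recovery_path choose_recovery(const struct hang_context *ctx)
{
	assert(ctx);

	if (ctx->ras_error_detected)
		return RECOVERY_RAS;

	return RECOVERY_SCHED_RESET;
}
```

The check runs before any recovery is scheduled, so a hang that coincides
with a RAS error never takes the ordinary hang recovery path in this sketch.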

An error state is maintained in the RAS context to detect the affected
block. The fatal error state uses an otherwise unused block id. Set the
block id when an error is detected. If the interrupt handler detected a
poison error, it is not required to look for a fatal error. Skip fatal
error checking in such cases.
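
One way to picture the error state is a bitmask indexed by block id, with
the fatal error state occupying an id that no real block uses. The sketch
below assumes that layout; the block ids, FATAL_ERROR_ID, and all function
names are hypothetical and are not taken from amdgpu_ras.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block ids; the driver defines its own enumeration. */
enum ras_block { BLOCK_GFX = 0, BLOCK_SDMA = 1, BLOCK_UMC = 2, BLOCK_COUNT = 3 };

/* Assumption: the fatal state reuses an id no real block occupies. */
#define FATAL_ERROR_ID 31u

struct ras_state {
	uint32_t error_mask;	/* one bit per block id */
};

/* Record that an error was detected for the given id. */
static void ras_set_error(struct ras_state *s, unsigned int id)
{
	assert(id < BLOCK_COUNT || id == FATAL_ERROR_ID);
	s->error_mask |= (uint32_t)1 << id;
}

/* True if any block, or the fatal state, has been flagged. */
static bool ras_has_error(const struct ras_state *s)
{
	return s->error_mask != 0;
}

/*
 * Interrupt-side bookkeeping. fatal_pending stands in for the result
 * of a fatal error query; it is ignored when poison was already found.
 */
static void ras_handle_interrupt(struct ras_state *s, unsigned int block,
				 bool poison_detected, bool fatal_pending)
{
	if (poison_detected) {
		ras_set_error(s, block);
		return;	/* poison found: skip the fatal error check */
	}

	if (fatal_pending)
		ras_set_error(s, FATAL_ERROR_ID);
}
```

A hang handler could then call ras_has_error() on this state to decide
whether RAS recovery should be preferred.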

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drivers/gpu/drm/amd/amdgpu/aldebaran.c
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c