drm/amdgpu: Avoid VF for RAS recovery source check
author Lijo Lazar <lijo.lazar@amd.com>
Mon, 9 Dec 2024 03:44:53 +0000 (09:14 +0530)
committer Alex Deucher <alexander.deucher@amd.com>
Wed, 11 Dec 2024 22:30:59 +0000 (17:30 -0500)
A VF device sets the RAS flag when mailbox data can't be read properly.
There is no conclusive way to tell whether the real source is a RAS error,
so the VF schedules a KFD-based reset, which doesn't set the RAS source.
Skip the RAS source check for any VF-scheduled recovery.
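
The resulting decision can be sketched as a standalone predicate. This is only an illustration of the condition the patch changes, not driver code: `struct fake_dev`, `enum reset_src`, and `should_yield_to_ras()` are hypothetical stand-ins for `struct amdgpu_device`, the `AMDGPU_RESET_SRC_*` values, and the inline check in `amdgpu_device_gpu_recover()`.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real driver uses its own types and enums. */
enum reset_src { RESET_SRC_UNKNOWN, RESET_SRC_RAS, RESET_SRC_KFD };

struct fake_dev {
	bool ras_err_state; /* models amdgpu_ras_is_err_state(adev, AMDGPU_RAS_BLOCK__ANY) */
	bool is_sriov_vf;   /* models amdgpu_sriov_vf(adev) */
};

/*
 * Returns true when a recovery request should back off and let RAS
 * recovery handle the error. With this patch, a VF never backs off,
 * because its RAS flag may only reflect unreadable mailbox data.
 */
static bool should_yield_to_ras(const struct fake_dev *dev, enum reset_src src)
{
	return dev->ras_err_state &&
	       !dev->is_sriov_vf &&
	       src != RESET_SRC_RAS;
}
```

Under this model, a bare-metal device in RAS error state still yields a KFD-sourced reset to RAS recovery, while a VF in the same state proceeds with the reset it scheduled.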

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reported-by: Vojislav Tomasevic <vojislav.tomasevic@amd.com>
Reviewed-by: Yiqing Yao <yiqing.yao@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
Fixes: e1ee2111ca48 ("drm/amdgpu: Prefer RAS recovery for scheduler hang")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 144295da9e4cc2bcf7fd3b68209fcf646314a735..e22fc7a8101f0cb55ca5483fa8e28aa5b6ae9083 100644
@@ -5866,6 +5866,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
         * detected at the same time, let RAS recovery take care of it.
         */
        if (amdgpu_ras_is_err_state(adev, AMDGPU_RAS_BLOCK__ANY) &&
+           !amdgpu_sriov_vf(adev) &&
            reset_context->src != AMDGPU_RESET_SRC_RAS) {
                dev_dbg(adev->dev,
                        "Gpu recovery from source: %d yielding to RAS error recovery handling",