drm/amd: Fail the suspend if resources can't be evicted
authorMario Limonciello <mario.limonciello@amd.com>
Wed, 26 Oct 2022 19:03:55 +0000 (14:03 -0500)
committerAlex Deucher <alexander.deucher@amd.com>
Wed, 2 Nov 2022 21:00:39 +0000 (17:00 -0400)
If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources.  Currently the suspend
carries on proceeding to reset the GPU:

```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```

At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored.  However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.

```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
  cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
  node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)

[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```

To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails.  This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.

Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index ddaecb2610c94009cc543ebceef2ac005a668c57..64510898eedd0609a3b62a68121a5ac883196fb6 100644 (file)
@@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
  * at suspend time.
  *
  */
-static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
+static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
 {
+       int ret;
+
        /* No need to evict vram on APUs for suspend to ram or s2idle */
        if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
-               return;
+               return 0;
 
-       if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
+       ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
+       if (ret)
                DRM_WARN("evicting device resources failed\n");
-
+       return ret;
 }
 
 /*
@@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
        if (!adev->in_s0ix)
                amdgpu_amdkfd_suspend(adev, adev->in_runpm);
 
-       amdgpu_device_evict_resources(adev);
+       r = amdgpu_device_evict_resources(adev);
+       if (r)
+               return r;
 
        amdgpu_fence_driver_hw_fini(adev);