powerpc/eeh: Set channel state after notifying the drivers
author Ganesh Goudar <ganeshgr@linux.ibm.com>
Thu, 9 Feb 2023 10:56:49 +0000 (16:26 +0530)
committer Michael Ellerman <mpe@ellerman.id.au>
Wed, 15 Feb 2023 11:41:11 +0000 (22:41 +1100)
When a PCI error is encountered for the 6th time in an hour,
we set the channel state to perm_failure and notify the
driver about the permanent failure.

However, after upstream commit 38ddc011478e ("powerpc/eeh:
Make permanently failed devices non-actionable"), the EEH
handler stops calling any driver routine once the device is
marked as permanently failed. Because the channel state is
set before the drivers are notified, the drivers never
receive the permanent failure notification. This can lead to
fatal consequences such as a kernel hang with certain PCI
devices.

The following logs were observed with the lpfc driver, with
and without this change. Without this change, the kernel
hangs if a PCI error is encountered 6 times for a device in
an hour.

Without the change

 EEH: Beginning: 'error_detected(permanent failure)'
 PCI 0132:60:00.0#600000: EEH: not actionable (1,1,1)
 PCI 0132:60:00.1#600000: EEH: not actionable (1,1,1)
 EEH: Finished:'error_detected(permanent failure)'

With the change

 EEH: Beginning: 'error_detected(permanent failure)'
 EEH: Invoking lpfc->error_detected(permanent failure)
 EEH: lpfc driver reports: 'disconnect'
 EEH: Invoking lpfc->error_detected(permanent failure)
 EEH: lpfc driver reports: 'disconnect'
 EEH: Finished:'error_detected(permanent failure)'

To fix the issue, set channel state to permanent failure after
notifying the drivers.
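
The ordering problem can be illustrated with a small user-space
sketch. This is a toy model only, not the kernel code: every
toy_* name below is hypothetical, and toy_dev_actionable() merely
assumes that a device already marked as permanently failed is
skipped, as the "not actionable" log lines above suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the channel states; not the kernel's enum. */
enum toy_channel_state { TOY_IO_NORMAL, TOY_IO_FROZEN, TOY_IO_PERM_FAILURE };

struct toy_dev {
	enum toy_channel_state state;
	int notified;	/* how many times the driver callback ran */
};

/* Assumed gating rule: a permanently failed device is not actionable. */
static int toy_dev_actionable(const struct toy_dev *dev)
{
	return dev->state != TOY_IO_PERM_FAILURE;
}

/* Stand-in for the report step: notify the driver only if actionable. */
void toy_report_failure(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (!toy_dev_actionable(dev)) {
		puts("toy: not actionable");
		return;
	}
	dev->notified++;
	puts("toy: driver notified of permanent failure");
}

/* Old order: mark the device first, so the notification is skipped. */
void toy_fail_old_order(struct toy_dev *dev)
{
	dev->state = TOY_IO_PERM_FAILURE;
	toy_report_failure(dev);
}

/* New order: notify first, then mark the device as permanently failed. */
void toy_fail_new_order(struct toy_dev *dev)
{
	toy_report_failure(dev);
	dev->state = TOY_IO_PERM_FAILURE;
}
```

Starting from a device in TOY_IO_FROZEN, toy_fail_old_order()
leaves notified at 0, while toy_fail_new_order() leaves it at 1.
Both end with the device in TOY_IO_PERM_FAILURE.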

Fixes: 38ddc011478e ("powerpc/eeh: Make permanently failed devices non-actionable")
Suggested-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230209105649.127707-1-ganeshgr@linux.ibm.com
arch/powerpc/kernel/eeh_driver.c

index f279295179bdfe6c8e48a4b505e5d3134d9e8ff2..438568a472d03bba2ac5db805ad5ae3e13bd1921 100644
@@ -1065,10 +1065,10 @@ recover_failed:
        eeh_slot_error_detail(pe, EEH_LOG_PERM);
 
        /* Notify all devices that they're about to go down. */
-       eeh_set_channel_state(pe, pci_channel_io_perm_failure);
        eeh_set_irq_state(pe, false);
        eeh_pe_report("error_detected(permanent failure)", pe,
                      eeh_report_failure, NULL);
+       eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 
        /* Mark the PE to be removed permanently */
        eeh_pe_state_mark(pe, EEH_PE_REMOVED);
@@ -1185,10 +1185,10 @@ void eeh_handle_special_event(void)
 
                        /* Notify all devices to be down */
                        eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
-                       eeh_set_channel_state(pe, pci_channel_io_perm_failure);
                        eeh_pe_report(
                                "error_detected(permanent failure)", pe,
                                eeh_report_failure, NULL);
+                       eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 
                        pci_lock_rescan_remove();
                        list_for_each_entry(hose, &hose_list, list_node) {