.. SPDX-License-Identifier: GPL-2.0

==================
PCI Error Recovery
==================


:Authors: - Linas Vepstas <linasvepstas@gmail.com>
          - Richard Lary <rlary@us.ibm.com>
          - Mike Mason <mmlnx@us.ibm.com>


Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
buses, as well as SERR and PERR errors. Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
pSeries boxes. A typical action taken is to disconnect the affected device,
halting all I/O to it. The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMAs
to "wild" addresses. Typically, a reconnection mechanism is also
offered, so that the affected PCI device(s) are reset and put back
into working condition. The reset phase requires coordination
between the affected device drivers and the PCI controller chip.
This document describes a generic API for notifying device drivers
of a bus disconnection, and then performing error recovery.
This API is currently implemented in the 2.6.16 and later kernels.

Reporting and recovery are performed in several steps. First, when
a PCI hardware error has resulted in a bus disconnect, that event
is reported as soon as possible to all affected device drivers,
including multiple instances of a device driver on multi-function
cards. This allows device drivers to avoid deadlocking in spinloops,
waiting for some I/O-space register to change, when it never will.
It also gives the drivers a chance to defer incoming I/O as
needed.

Next, recovery is performed in several stages. Most of the complexity
is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
or requesting a slot reset.

If any driver requests a slot reset, that is what will be done.

After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required. After these have all completed, a final
"resume normal operations" event is sent out.

The biggest reason for choosing a kernel-based implementation rather
than a user-space implementation was the need to deal with bus
disconnects of PCI devices attached to storage media, and, in particular,
disconnects from devices holding the root file system. If the root
file system is disconnected, a user-space mechanism would have to go
through a large number of contortions to complete recovery. Almost all
of the current Linux file systems are not tolerant of disconnection
from/reconnection to their underlying block device. By contrast,
bus errors are easy to manage in the device driver. Indeed, most
device drivers already handle very similar recovery procedures;
for example, the SCSI-generic layer already provides significant
mechanisms for dealing with SCSI bus errors and SCSI bus resets.


Detailed Design
===============

The design and implementation details below are based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005.

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent. The
arch/powerpc implementation will simulate a PCI hotplug remove/add.

This structure has the form::

    struct pci_error_handlers
    {
        int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
        void (*resume)(struct pci_dev *dev);
    };

The possible channel states are::

    enum pci_channel_state {
        pci_channel_io_normal,       /* I/O channel is in normal state */
        pci_channel_io_frozen,       /* I/O to channel is blocked */
        pci_channel_io_perm_failure, /* PCI card is dead */
    };

Possible return values are::

    enum pci_ers_result {
        PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
        PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
        PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
        PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
        PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
    };

A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected(). If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
a slot reset. Typically a driver will want to know about
a slot_reset().

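As a rough illustration of the structure above, the following userspace
sketch fills in a pci_error_handlers for a hypothetical driver "foo" that
cannot recover without a slot reset. The type definitions are local
stand-ins copied from the listings above rather than the kernel headers;
struct pci_dev with its quiesced field and all foo_* names are assumptions
made for this example only.

```c
#include <stddef.h>

/* Userspace stand-in; the real struct pci_dev is far larger. */
struct pci_dev {
    int quiesced;   /* hypothetical flag: driver has stopped issuing I/O */
};

enum pci_channel_state {
    pci_channel_io_normal,
    pci_channel_io_frozen,
    pci_channel_io_perm_failure,
};

enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

struct pci_error_handlers {
    int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
    int (*mmio_enabled)(struct pci_dev *dev);
    int (*slot_reset)(struct pci_dev *dev);
    void (*resume)(struct pci_dev *dev);
};

/* Mandatory callback: stop new I/O, then say what kind of recovery we want. */
static int foo_error_detected(struct pci_dev *dev, enum pci_channel_state state)
{
    dev->quiesced = 1;
    if (state == pci_channel_io_perm_failure)
        return PCI_ERS_RESULT_DISCONNECT;
    return PCI_ERS_RESULT_NEED_RESET;
}

/* Called after the platform has reset the slot. */
static int foo_slot_reset(struct pci_dev *dev)
{
    (void)dev;  /* a real driver would re-initialize the hardware here */
    return PCI_ERS_RESULT_RECOVERED;
}

/* Final step: normal I/O may start again. */
static void foo_resume(struct pci_dev *dev)
{
    dev->quiesced = 0;
}

static const struct pci_error_handlers foo_err_handlers = {
    .error_detected = foo_error_detected,
    .slot_reset     = foo_slot_reset,
    .resume         = foo_resume,
    /* .mmio_enabled stays NULL: no recovery attempt without a reset */
};
```

Leaving mmio_enabled unset mirrors the rule stated above: a callback that
is not implemented marks the corresponding feature as unsupported.
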
The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.

STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.


STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.

At this point, the device might not be accessible anymore, depending on
the platform (the slot will be isolated on powerpc). The driver may
already have "noticed" the error because of a failing I/O, but this
is the proper "synchronization point", that is, it gives the driver
a chance to clean up, waiting for pending work (timers and the like)
to complete; it can take semaphores, schedule, etc... everything but
touch the device. Within this function and after it returns, the driver
shouldn't do any new IOs. Called in task context. This is sort of a
"quiesce" point. See note about interrupts at the end of this doc.

All drivers participating in this system must implement this call.
The driver must return one of the following result codes:

  - PCI_ERS_RESULT_CAN_RECOVER
      Driver returns this if it thinks it might be able to recover
      the HW by just banging IOs or if it wants to be given
      a chance to extract some diagnostic information (see
      mmio_enabled, below).
  - PCI_ERS_RESULT_NEED_RESET
      Driver returns this if it can't recover without a
      slot reset.
  - PCI_ERS_RESULT_DISCONNECT
      Driver returns this if it doesn't want to recover at all.

The next step taken will depend on the result codes returned by the
drivers.

If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
then the platform should re-enable IOs on the slot (or do nothing in
particular, if the platform doesn't isolate slots), and recovery
proceeds to STEP 2 (MMIO Enabled).

If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
then recovery proceeds to STEP 4 (Slot Reset).

If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure).

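The decision described above can be sketched as a small function that
combines the result codes returned by every driver on the segment. This is
only an illustration of the stated rules, not the code of any platform:
enum recovery_step and next_step_after_notification() are hypothetical
names, and two choices are assumptions of this sketch: a single
PCI_ERS_RESULT_DISCONNECT sends the slot to STEP 6, and any code other
than CAN_RECOVER or DISCONNECT is treated like a reset request.

```c
#include <stddef.h>

enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical labels for the steps named in this document. */
enum recovery_step {
    STEP_MMIO_ENABLED = 2,
    STEP_SLOT_RESET   = 4,
    STEP_PERM_FAILURE = 6,
};

/* Pick the step that follows STEP 1 from the error_detected() results. */
static enum recovery_step
next_step_after_notification(const enum pci_ers_result *votes, size_t n)
{
    int need_reset = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        switch (votes[i]) {
        case PCI_ERS_RESULT_CAN_RECOVER:
            break;
        case PCI_ERS_RESULT_DISCONNECT:
            /* Assumption: one driver giving up fails the whole slot. */
            return STEP_PERM_FAILURE;
        default:
            /* NEED_RESET, or (assumption) any other code. */
            need_reset = 1;
            break;
        }
    }
    /* MMIO is re-enabled only if every driver said CAN_RECOVER. */
    return need_reset ? STEP_SLOT_RESET : STEP_MMIO_ENABLED;
}
```

A single PCI_ERS_RESULT_NEED_RESET is enough to select the slot reset,
which matches the "if any driver requests a slot reset" rule.
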
.. note::

   The current powerpc implementation assumes that a device driver will
   *not* schedule or semaphore in this routine; the current powerpc
   implementation uses one kernel thread to notify all devices;
   thus, if one device sleeps/schedules, all devices are affected.
   Doing better requires complex multi-threaded logic in the error
   recovery implementation (e.g. waiting for all notification threads
   to "join" before proceeding with recovery.) This seems excessively
   complex and not worth implementing.

   The current powerpc implementation doesn't much care if the device
   attempts I/O at this point, or not. I/Os will fail, returning
   a value of 0xff on read, and writes will be dropped. If more than
   EEH_MAX_FAILS I/Os are attempted to a frozen adapter, EEH
   assumes that the device driver has gone into an infinite loop
   and prints an error to syslog. A reboot is then required to
   get the device working again.

STEP 2: MMIO Enabled
--------------------
The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected
device drivers.

This is the "early recovery" call. IOs are allowed again, but DMA is
not, with some restrictions. This is NOT a callback for the driver to
start operations again, only to peek/poke at the device, extract diagnostic
information, if any, and eventually do things like trigger a device local
reset or some such, but not restart operations. This callback is made if
all drivers on a segment agree that they can try to recover and if no automatic
link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).

.. note::

   The following is proposed; no platform implements this yet:
   Proposal: All I/O's should be done _synchronously_ from within
   this callback, errors triggered by them will be returned via
   the normal pci_check_whatever() API, no new error_detected()
   callback will be issued due to an error happening here. However,
   such an error might cause IOs to be re-blocked for the whole
   segment, and thus invalidate the recovery that other devices
   on the same segment might have done, forcing the whole segment
   into one of the next states, that is, link reset or slot reset.

The driver should return one of the following result codes:

  - PCI_ERS_RESULT_RECOVERED
      Driver returns this if it thinks the device is fully
      functional and thinks it is ready to start
      normal driver operations again. There is no
      guarantee that the driver will actually be
      allowed to proceed, as another driver on the
      same segment might have failed and thus triggered a
      slot reset on platforms that support it.

  - PCI_ERS_RESULT_NEED_RESET
      Driver returns this if it thinks the device is not
      recoverable in its current state and it needs a slot
      reset to proceed.

  - PCI_ERS_RESULT_DISCONNECT
      Same as above. Total failure: no recovery even after
      reset; the driver is dead. (To be defined more precisely)

The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).

If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset).

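To make the "peek/poke, but do not restart" rule more concrete, here is a
userspace sketch of what an mmio_enabled() callback might look like. It is
an assumption-laden illustration: struct foo_dev, FOO_STATUS_REG and the
register layout are invented for the example, the callback takes the
driver's private structure rather than a struct pci_dev, and the all-ones
check relies on the powerpc behaviour described in STEP 0 (reads from an
isolated slot return 0xffffffff).

```c
#include <stdint.h>

enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

#define FOO_STATUS_REG 0    /* hypothetical register index */

/* Hypothetical per-device state; regs models the mapped MMIO window. */
struct foo_dev {
    const volatile uint32_t *regs;
    uint32_t last_status;   /* diagnostic value saved for later logging */
};

/* Early recovery: look at the device, but do not restart operations. */
static int foo_mmio_enabled(struct foo_dev *fd)
{
    uint32_t status = fd->regs[FOO_STATUS_REG];

    /* All-ones suggests the device is still unreachable. */
    if (status == 0xffffffffu)
        return PCI_ERS_RESULT_NEED_RESET;

    fd->last_status = status;
    return PCI_ERS_RESULT_RECOVERED;
}
```

The callback only reads a register and records it; queues, DMA and normal
request processing stay stopped until resume() is called.
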
STEP 3: Link Reset
------------------
The platform resets the link. This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
The actual steps taken by a platform to perform a slot reset
will be platform-dependent. Upon completion of slot reset, the
platform will call the device slot_reset() callback.

Powerpc platforms implement two levels of slot reset:
soft reset (default) and fundamental (optional) reset.

Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BARs and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
Soft reset is also known as hot-reset.

Powerpc fundamental reset is supported by PCI Express cards only
and results in the device's state machines, hardware logic, port states and
configuration registers being initialized to their default conditions.

For most PCI devices, a soft reset will be sufficient for recovery.
Optional fundamental reset is provided to support a limited number
of PCI Express devices for which a soft reset is not sufficient
for recovery.

If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.

It is important for the platform to restore the PCI config space
to the "fresh power-on" state, rather than the "last state". After
a slot reset, the device driver will almost always use its standard
device initialization routines, and an unusual config space setup
may result in hung devices, kernel panics, or silent data corruption.

This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
that the card is in a fresh state and is fully functional. The slot
is unfrozen and the driver has full access to PCI config space,
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
will also be available.

Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset. If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again. If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case. The
device will be considered "dead" in this case.

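The escalation described in the last paragraph (soft reset first, then a
hard reset, then giving up) might be modelled as below. This is a sketch
of the "might" in the text and not the behaviour of any particular
platform; struct slot, platform_reset(), recover_slot() and the three
sample driver callbacks are all hypothetical names.

```c
enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

enum reset_kind { RESET_SOFT, RESET_HARD };    /* hypothetical */

/* Hypothetical model of one slot with a single driver bound to it. */
struct slot {
    int (*slot_reset)(struct slot *s);  /* the driver's callback */
    enum reset_kind last_reset;
    int resets_done;
};

/* Stand-in for the platform-dependent reset; it only records the call. */
static void platform_reset(struct slot *s, enum reset_kind kind)
{
    s->last_reset = kind;
    s->resets_done++;
}

/* Returns 1 to continue with STEP 5, 0 to continue with STEP 6. */
static int recover_slot(struct slot *s)
{
    platform_reset(s, RESET_SOFT);
    if (s->slot_reset(s) == PCI_ERS_RESULT_RECOVERED)
        return 1;

    /* The driver reported failure: retry once with a power cycle. */
    platform_reset(s, RESET_HARD);
    return s->slot_reset(s) == PCI_ERS_RESULT_RECOVERED;
}

/* Sample driver callbacks used to exercise the sketch. */
static int drv_always_ok(struct slot *s)
{
    (void)s;
    return PCI_ERS_RESULT_RECOVERED;
}

static int drv_needs_hard_reset(struct slot *s)
{
    return s->last_reset == RESET_HARD ? PCI_ERS_RESULT_RECOVERED
                                       : PCI_ERS_RESULT_DISCONNECT;
}

static int drv_dead(struct slot *s)
{
    (void)s;
    return PCI_ERS_RESULT_DISCONNECT;
}
```

With drv_needs_hard_reset the slot is reset twice and recovers; with
drv_dead both attempts fail and the slot would be reported as a permanent
failure.
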
Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53c8xx_2
driver performs device init only from PCI function 0::

    if (PCI_FUNC(pdev->devfn) == 0)
        sym_reset_scsi_bus(np, 0);

Result codes:
  - PCI_ERS_RESULT_DISCONNECT
      Same as above.

Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types::

    /* Set EEH reset type to fundamental if required by hba */
    if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
        pdev->needs_freset = 1;

Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).

.. note::

   The current powerpc implementation does not try a power-cycle
   reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
   However, it probably should.


STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
The goal of this callback is to tell the driver to restart activity,
that everything is back and running. This callback does not return
a result code.

At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device. The platform will call error_detected() with a
pci_channel_state value of pci_channel_io_perm_failure.

The device driver should, at this point, assume the worst. It should
cancel all pending I/O, refuse all new I/O, returning -EIO to
higher layers. The device driver should then clean up all of its
memory and remove itself from kernel operations, much as it would
during system shutdown.

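A driver's handling of the permanent-failure notification could look
roughly like the following userspace sketch. The names struct foo_dev,
foo_error_detected() and foo_submit_io() are hypothetical, the pending
counter stands in for a real request queue, and the callback takes the
driver's private structure instead of a struct pci_dev to keep the example
self-contained.

```c
#include <errno.h>

enum pci_channel_state {
    pci_channel_io_normal,
    pci_channel_io_frozen,
    pci_channel_io_perm_failure,
};

enum pci_ers_result {
    PCI_ERS_RESULT_NONE,
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical driver state. */
struct foo_dev {
    int dead;       /* set once the device is declared permanently failed */
    int pending;    /* number of requests currently in flight */
};

static int foo_error_detected(struct foo_dev *fd, enum pci_channel_state state)
{
    if (state == pci_channel_io_perm_failure) {
        fd->dead = 1;
        fd->pending = 0;    /* cancel everything that was in flight */
        return PCI_ERS_RESULT_DISCONNECT;
    }
    /* Assumption: this driver always asks for a slot reset otherwise. */
    return PCI_ERS_RESULT_NEED_RESET;
}

/* Entry point used by higher layers to queue a request. */
static int foo_submit_io(struct foo_dev *fd)
{
    if (fd->dead)
        return -EIO;    /* refuse all new I/O after a permanent failure */
    fd->pending++;
    return 0;
}
```

Once the dead flag is set, every later submission fails with -EIO, which
is the behaviour the text asks for.
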
The platform will typically notify the system operator of the
permanent failure in some way. If the device is hotplug-capable,
the operator will probably want to remove and replace the device.
Note, however, not all failures are truly "permanent". Some are
caused by over-heating, some by a poorly seated card. Many
PCI error events are caused by software bugs, e.g. DMAs to
wild addresses or bogus split transactions due to programming
errors. See the discussion in Documentation/powerpc/eeh-pci-error-recovery.rst
for additional detail on real-life experience of the causes of
software errors.


Conclusion; General Remarks
---------------------------
The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)
The current policy is to turn this into a platform policy.
That is, the recovery API only requires that:

 - There is no guarantee that interrupt delivery can proceed from any
   device on the segment starting from the error detection and until the
   slot_reset callback is called, at which point interrupts are expected
   to be fully operational.

 - There is no guarantee that interrupt delivery is stopped, that is,
   a driver that gets an interrupt after detecting an error, or that detects
   an error within the interrupt handler such that it prevents proper
   ack'ing of the interrupt (and thus removal of the source) should just
   return IRQ_NONE. It's up to the platform to deal with that
   condition, typically by masking the IRQ source during the duration of
   the error handling. It is expected that the platform "knows" which
   interrupts are routed to error-management capable slots and can deal
   with temporarily disabling that IRQ number during error processing (this
   isn't terribly complex). That means some IRQ latency for other devices
   sharing the interrupt, but there is simply no other way. High end
   platforms aren't supposed to share interrupts between many devices
   anyway :)

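The second requirement can be illustrated with a userspace sketch of an
interrupt handler that declines the interrupt once an error is known or
suspected. The kernel's "not handled" return value is IRQ_NONE; it is
redefined locally here so the example compiles on its own. struct foo_dev,
FOO_INT_CAUSE and the register layout are assumptions, and the all-ones
test relies on the powerpc behaviour described in STEP 0.

```c
#include <stdint.h>

/* Local stand-ins for the kernel's irqreturn values. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

#define FOO_INT_CAUSE 0     /* hypothetical interrupt-cause register index */

/* Hypothetical per-device state; regs models the mapped MMIO window. */
struct foo_dev {
    const volatile uint32_t *regs;
    int in_error;       /* set once an error has been detected */
    unsigned handled;   /* count of interrupts actually serviced */
};

static enum irqreturn foo_irq(struct foo_dev *fd)
{
    uint32_t cause;

    /* Error recovery in progress: do not touch the device. */
    if (fd->in_error)
        return IRQ_NONE;

    cause = fd->regs[FOO_INT_CAUSE];

    /* All-ones read: the slot is probably isolated, nothing can be acked. */
    if (cause == 0xffffffffu) {
        fd->in_error = 1;
        return IRQ_NONE;
    }

    /* No cause bits set: the interrupt came from another device. */
    if (cause == 0)
        return IRQ_NONE;

    fd->handled++;
    return IRQ_HANDLED;
}
```

Masking the interrupt source while recovery runs is left to the platform,
as stated above; the handler only refuses to claim the interrupt.
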
.. note::

   Implementation details for the powerpc platform are discussed in
   the file Documentation/powerpc/eeh-pci-error-recovery.rst

   As of this writing, there is a growing list of device drivers with
   patches implementing error recovery. Not all of these patches are in
   mainline yet. These may be used as "examples":

   - drivers/scsi/ipr
   - drivers/scsi/sym53c8xx_2
   - drivers/scsi/qla2xxx
   - drivers/scsi/lpfc
   - drivers/net/bnx2.c
   - drivers/net/e100.c
   - drivers/net/e1000
   - drivers/net/e1000e
   - drivers/net/ixgb
   - drivers/net/ixgbe
   - drivers/net/cxgb3
   - drivers/net/s2io.c

The End
-------