Commit | Line | Data |
---|---|---|
4d2e26a3 MCC |
1 | ========================== |
2 | PCI Bus EEH Error Recovery | |
3 | ========================== | |
1da177e4 | 4 | |
4d2e26a3 | 5 | Linas Vepstas <linas@austin.ibm.com> |
1da177e4 | 6 | |
4d2e26a3 | 7 | 12 January 2005 |
1da177e4 LT |
8 | |
9 | ||
10 | Overview: | |
11 | --------- | |
12 | The IBM POWER-based pSeries and iSeries computers include PCI bus | |
13 | controller chips that have extended capabilities for detecting and | |
14 | reporting a large variety of PCI bus error conditions. These features | |
8ee26530 | 15 | go under the name of "EEH", for "Enhanced Error Handling". The EEH |
1da177e4 LT |
16 | hardware features allow PCI bus errors to be cleared and a PCI |
17 | card to be "rebooted", without also having to reboot the operating | |
18 | system. | |
19 | ||
20 | This is in contrast to traditional PCI error handling, where the | |
21 | PCI chip is wired directly to the CPU, and an error would cause | |
22 | a CPU machine-check/check-stop condition, halting the CPU entirely. | |
23 | Another "traditional" technique is to ignore such errors, which | |
24 | can lead to data corruption (of user data or of kernel data), | |
25 | hung/unresponsive adapters, or system crashes/lockups. Thus, | |
26 | the idea behind EEH is that the operating system can become more | |
27 | reliable and robust by protecting it from PCI errors, and giving | |
28 | the OS the ability to "reboot"/recover individual PCI devices. | |
29 | ||
30 | Future systems from other vendors, based on the PCI-E specification, | |
31 | may contain similar features. | |
32 | ||
33 | ||
34 | Causes of EEH Errors | |
35 | -------------------- | |
36 | EEH was originally designed to guard against hardware failure, such | |
37 | as PCI cards dying from heat, humidity, dust, vibration and bad | |
38 | electrical connections. The vast majority of EEH errors seen in | |
01dd2fbf ML |
39 | "real life" are due to either poorly seated PCI cards, or, |
40 | unfortunately quite commonly, due to device driver bugs, device firmware | |
1da177e4 LT |
41 | bugs, and sometimes PCI card hardware bugs. |
42 | ||
43 | The most common software bug is one that causes the device to | |
44 | attempt to DMA to a location in system memory that has not been | |
45 | reserved for DMA access for that card. Catching this is a powerful | |
46 | feature, as it prevents what would otherwise have been silent memory | |
47 | corruption caused by the bad DMA. A number of device driver | |
48 | bugs have been found and fixed in this way over the past few | |
49 | years. Other possible causes of EEH errors include data or | |
50 | address line parity errors (for example, due to poor electrical | |
51 | connectivity due to a poorly seated card), and PCI-X split-completion | |
52 | errors (due to software, device firmware, or device PCI hardware bugs). | |
53 | The vast majority of "true hardware failures" can be cured by | |
54 | physically removing and re-seating the PCI card. | |
55 | ||
56 | ||
57 | Detection and Recovery | |
58 | ---------------------- | |
59 | In the following discussion, a generic overview of how to detect | |
60 | and recover from EEH errors will be presented. This is followed | |
61 | by an overview of how the current implementation in the Linux | |
62 | kernel does it. The actual implementation is subject to change, | |
63 | and some of the finer points are still being debated. These | |
64 | may in turn be swayed if or when other architectures implement | |
65 | similar functionality. | |
66 | ||
67 | When a PCI Host Bridge (PHB, the bus controller connecting the | |
68 | PCI bus to the system CPU electronics complex) detects a PCI error | |
69 | condition, it will "isolate" the affected PCI card. Isolation | |
70 | will block all writes (either to the card from the system, or | |
71 | from the card to the system), and it will cause all reads to | |
72 | return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads). | |
73 | This value was chosen because it is the same value you would | |
74 | get if the device was physically unplugged from the slot. | |
75 | This includes access to PCI memory, I/O space, and PCI config | |
76 | space. Interrupts, however, will continue to be delivered. | |
77 | ||
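The isolation behavior described above can be modeled in a few lines of user-space C. This is only an illustrative sketch, not kernel code: `struct sim_slot`, `sim_write32()` and `sim_read32()` are hypothetical names invented here, and a single 32-bit register stands in for the whole device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a PCI slot; not a kernel structure. */
struct sim_slot {
    bool isolated;   /* set once the PHB has isolated the slot */
    uint32_t reg;    /* one device register, for illustration only */
};

/* Writes to an isolated slot are silently dropped. */
static void sim_write32(struct sim_slot *slot, uint32_t value)
{
    if (!slot->isolated)
        slot->reg = value;
}

/* Reads from an isolated slot return all ones (0xffffffff). */
static uint32_t sim_read32(const struct sim_slot *slot)
{
    return slot->isolated ? UINT32_MAX : slot->reg;
}
```

A caller cannot tell from the value alone whether the slot is isolated or the register really holds all ones, which is why a separate firmware query is needed.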
78 | Detection and recovery are performed with the aid of ppc64 | |
79 | firmware. The programming interfaces in the Linux kernel | |
80 | into the firmware are referred to as RTAS (Run-Time Abstraction | |
81 | Services). The Linux kernel does not (should not) access | |
82 | the EEH function in the PCI chipsets directly, primarily because | |
83 | there are a number of different chipsets out there, each with | |
84 | different interfaces and quirks. The firmware provides a | |
85 | uniform abstraction layer that will work with all pSeries | |
86 | and iSeries hardware (and be forwards-compatible). | |
87 | ||
88 | If the OS or device driver suspects that a PCI slot has been | |
89 | EEH-isolated, there is a firmware call it can make to determine if | |
90 | this is the case. If so, then the device driver should put itself | |
91 | into a consistent state (given that it won't be able to complete any | |
92 | pending work) and start recovery of the card. Recovery normally | |
d6bc8ac9 | 93 | would consist of resetting the PCI device (holding the PCI #RST |
1da177e4 LT |
94 | line high for two seconds), followed by setting up the device |
95 | config space (the base address registers (BARs), latency timer, | |
96 | cache line size, interrupt line, and so on). This is followed by a | |
97 | reinitialization of the device driver. In a worst-case scenario, | |
98 | the power to the card can be toggled, at least on hot-plug-capable | |
99 | slots. In principle, layers far above the device driver probably | |
100 | do not need to know that the PCI card has been "rebooted" in this | |
101 | way; ideally, there should be at most a pause in Ethernet/disk/USB | |
102 | I/O while the card is being reset. | |
103 | ||
104 | If the card cannot be recovered after three or four resets, the | |
105 | kernel/device driver should assume the worst-case scenario, that the | |
106 | card has died completely, and report this error to the sysadmin. | |
107 | In addition, error messages are reported through RTAS and also through | |
108 | syslogd (/var/log/messages) to alert the sysadmin of PCI resets. | |
109 | The correct way to deal with failed adapters is to use the standard | |
110 | PCI hotplug tools to remove and replace the dead card. | |
111 | ||
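The bounded-retry policy described above might look roughly like the following user-space sketch. All names are hypothetical, and the limit of three resets is an assumption taken from the "three or four" in the text.

```c
#include <stdbool.h>

#define SIM_MAX_RESETS 3   /* assumption: the text says "three or four" */

/* Hypothetical callback: performs one reset-and-reinitialize cycle and
 * returns true if the device came back. */
typedef bool (*sim_reset_fn)(void *ctx);

/* Returns the number of resets used on success, or -1 if the device
 * should be treated as dead and reported to the sysadmin. */
static int sim_recover(sim_reset_fn reset_once, void *ctx)
{
    for (int attempt = 1; attempt <= SIM_MAX_RESETS; attempt++) {
        if (reset_once(ctx))
            return attempt;
    }
    return -1;
}

/* Example callback: fails until the counter in ctx reaches zero. */
static bool sim_reset_after_n_failures(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

With a counter of 2, `sim_recover()` succeeds on the third attempt; with a counter larger than `SIM_MAX_RESETS - 1` it gives up and returns -1.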
112 | ||
113 | Current PPC64 Linux EEH Implementation | |
114 | -------------------------------------- | |
115 | At this time, a generic EEH recovery mechanism has been implemented, | |
116 | so that individual device drivers do not need to be modified to support | |
117 | EEH recovery. This generic mechanism piggy-backs on the PCI hotplug | |
312c004d | 118 | infrastructure, and percolates events up through the userspace/udev |
a2ffd275 | 119 | infrastructure. Following is a detailed description of how this is |
1da177e4 LT |
120 | accomplished. |
121 | ||
122 | EEH must be enabled in the PHBs very early during the boot process, | |
123 | and whenever a PCI slot is hot-plugged. The former is performed by | |
2ef9481e | 124 | eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by |
1da177e4 LT |
125 | drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code. |
126 | EEH must be enabled before a PCI scan of the device can proceed. | |
127 | Current Power5 hardware will not work unless EEH is enabled, | |
128 | although older Power4 can run with it disabled. Effectively, | |
129 | EEH can no longer be turned off. PCI devices *must* be | |
130 | registered with the EEH code; the EEH code needs to know about | |
131 | the I/O address ranges of the PCI device in order to detect an | |
132 | error. Given an arbitrary address, the routine | |
133 | pci_get_device_by_addr() will find the pci device associated | |
134 | with that address (if any). | |
135 | ||
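The address-to-device lookup mentioned above can be illustrated with a simple range table. This is a sketch under stated assumptions: `struct sim_range` and `sim_dev_by_addr()` are hypothetical, an integer id stands in for a device pointer, and the linear scan is chosen for brevity rather than to mirror how the kernel stores its ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one registered I/O range; not a kernel type. */
struct sim_range {
    uint64_t start;   /* first address of the range */
    uint64_t len;     /* length of the range in bytes */
    int dev_id;       /* stand-in for a pointer to the PCI device */
};

/* Returns the dev_id owning addr, or -1 if no registered range contains it. */
static int sim_dev_by_addr(const struct sim_range *ranges, size_t n,
                           uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        /* Written as a subtraction so start + len cannot overflow. */
        if (addr >= ranges[i].start &&
            addr - ranges[i].start < ranges[i].len)
            return ranges[i].dev_id;
    }
    return -1;
}
```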
b8b572e1 | 136 | The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(), |
d533f671 | 137 | etc. include a check to see if the i/o read returned all-0xff's. |
1da177e4 LT |
138 | If so, these make a call to eeh_dn_check_failure(), which in turn |
139 | asks the firmware if the all-ff's value is the sign of a true EEH | |
140 | error. If it is not, processing continues as normal. The grand | |
141 | total number of these false alarms or "false positives" can be | |
142 | seen in /proc/ppc64/eeh (subject to change). Normally, almost | |
143 | all of these occur during boot, when the PCI bus is scanned, where | |
144 | a large number of 0xff reads are part of the bus scan procedure. | |
145 | ||
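The check performed after each I/O read can be sketched as follows. The names are hypothetical and the firmware query is replaced by a callback; the point is only to show the fast path (no firmware call unless the value is all ones) and how false positives would be counted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters, loosely modeled on the kind of statistics the
 * text says are visible in /proc/ppc64/eeh. */
struct sim_stats {
    unsigned long checks;           /* times the firmware was consulted */
    unsigned long false_positives;  /* all-ones reads that were not errors */
};

/* Hypothetical stand-in for the firmware query. */
typedef bool (*sim_fw_query)(void *ctx);

/* Returns true if the value just read signals a real isolation event. */
static bool sim_check_read32(uint32_t val, sim_fw_query slot_frozen,
                             void *ctx, struct sim_stats *st)
{
    if (val != UINT32_MAX)
        return false;            /* fast path: no firmware call needed */

    st->checks++;
    if (slot_frozen(ctx))
        return true;             /* genuine EEH error */

    st->false_positives++;       /* the device really returned all ones */
    return false;
}

/* Example query: ctx points at a bool saying whether the slot is frozen. */
static bool sim_frozen_flag(void *ctx)
{
    return *(bool *)ctx;
}
```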
4d2e26a3 MCC |
146 | If a frozen slot is detected, code in |
147 | arch/powerpc/platforms/pseries/eeh.c will print a stack trace to | |
148 | syslog (/var/log/messages). This stack trace has proven to be very | |
149 | useful to device-driver authors for finding out at what point the EEH | |
150 | error was detected, as the error itself usually occurs slightly | |
2ef9481e | 151 | beforehand. |
1da177e4 LT |
152 | |
153 | Next, it uses the Linux kernel notifier chain/work queue mechanism to | |
154 | allow any interested parties to find out about the failure. Device | |
155 | drivers, or other parts of the kernel, can use | |
4d2e26a3 | 156 | `eeh_register_notifier(struct notifier_block *)` to find out about EEH |
1da177e4 LT |
157 | events. The event will include a pointer to the pci device, the |
158 | device node and some state info. Receivers of the event can "do as | |
159 | they wish"; the default handler will be described further in this | |
160 | section. | |
161 | ||
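For readers unfamiliar with notifier chains, the idea can be sketched in plain user-space C. This is a deliberately simplified model with hypothetical names; the kernel's struct notifier_block and its registration functions have different fields, return values and locking.

```c
#include <stddef.h>

/* Hypothetical, simplified notifier block. */
struct sim_notifier {
    void (*call)(struct sim_notifier *self, int event);
    struct sim_notifier *next;
};

static struct sim_notifier *sim_chain;

/* Push a listener onto the head of the chain. */
static void sim_register(struct sim_notifier *nb)
{
    nb->next = sim_chain;
    sim_chain = nb;
}

/* Deliver an event to every registered listener; returns how many ran. */
static int sim_notify(int event)
{
    int delivered = 0;

    for (struct sim_notifier *nb = sim_chain; nb != NULL; nb = nb->next) {
        nb->call(nb, event);
        delivered++;
    }
    return delivered;
}

/* Example listener that remembers the last event it saw. */
struct sim_listener {
    struct sim_notifier nb;   /* must be first so the cast below is valid */
    int last_event;
};

static void sim_listener_call(struct sim_notifier *self, int event)
{
    ((struct sim_listener *)self)->last_event = event;
}
```

Each interested party embeds a notifier in its own structure and registers it once; the code that detects the failure only needs to call `sim_notify()`.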
162 | To assist in the recovery of the device, eeh.c exports the | |
163 | following functions: | |
164 | ||
4d2e26a3 MCC |
165 | rtas_set_slot_reset() |
166 | assert the PCI #RST line for 1/8th of a second | |
167 | rtas_configure_bridge() | |
168 | ask firmware to configure any PCI bridges | |
1da177e4 | 169 | located topologically under the pci slot. |
4d2e26a3 MCC |
170 | eeh_save_bars() and eeh_restore_bars(): |
171 | save and restore the PCI | |
1da177e4 LT |
172 | config-space info for a device and any devices under it. |
173 | ||
174 | ||
175 | A handler for the EEH notifier_block events is implemented in | |
176 | drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events(). | |
177 | It saves the device BARs and then calls rpaphp_unconfig_pci_adapter(). | |
178 | This last call causes the device driver for the card to be stopped, | |
312c004d | 179 | which causes uevents to go out to user space. This triggers |
1da177e4 LT |
180 | user-space scripts that might issue commands such as "ifdown eth0" |
181 | for ethernet cards, and so on. This handler then sleeps for 5 seconds, | |
182 | hoping to give the user-space scripts enough time to complete. | |
183 | It then resets the PCI card, reconfigures the device BARs, and | |
184 | any bridges underneath. It then calls rpaphp_enable_pci_slot(), | |
185 | which restarts the device driver and triggers more user-space | |
186 | events (for example, calling "ifup eth0" for ethernet cards). | |
187 | ||
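The order of operations in the handler described above can be summarized as a fixed sequence of steps that stops at the first failure. The sketch below is a user-space illustration with hypothetical names; the step order follows the prose and is not taken from the handler's source.

```c
#include <stdbool.h>

/* Steps in the order the text describes them (hypothetical names). */
enum sim_step {
    SIM_SAVE_BARS,
    SIM_UNCONFIG_ADAPTER,   /* stops the driver, triggers uevents */
    SIM_WAIT_FOR_USERSPACE, /* the 5 second sleep */
    SIM_RESET,
    SIM_RESTORE_BARS,
    SIM_CONFIGURE_BRIDGES,
    SIM_ENABLE_SLOT,        /* restarts the driver, more uevents */
    SIM_NSTEPS
};

/* Hypothetical callback that performs one step; false means failure. */
typedef bool (*sim_step_fn)(enum sim_step step, void *ctx);

/* Runs the steps in order and returns how many completed. */
static int sim_handle_event(sim_step_fn do_step, void *ctx)
{
    int done = 0;

    for (int s = 0; s < SIM_NSTEPS; s++) {
        if (!do_step((enum sim_step)s, ctx))
            break;
        done++;
    }
    return done;
}

/* Example callback: records the steps seen and can fail at one of them. */
struct sim_trace {
    enum sim_step seen[SIM_NSTEPS];
    int count;
    int fail_at;   /* step to fail at, or -1 for none */
};

static bool sim_record_step(enum sim_step step, void *ctx)
{
    struct sim_trace *t = ctx;

    if ((int)step == t->fail_at)
        return false;
    t->seen[t->count++] = step;
    return true;
}
```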
188 | ||
189 | Device Shutdown and User-Space Events | |
190 | ------------------------------------- | |
191 | This section documents what happens when a pci slot is unconfigured, | |
192 | focusing on how the device driver gets shut down, and on how the | |
193 | events get delivered to user-space scripts. | |
194 | ||
195 | Following is an example sequence of events that cause a device driver | |
196 | close function to be called during the first phase of an EEH reset. | |
4d2e26a3 | 197 | The following sequence is an example of the pcnet32 device driver:: |
1da177e4 LT |
198 | |
199 | rpa_php_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c | |
200 | { | |
201 | calls | |
202 | pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c | |
203 | { | |
204 | calls | |
205 | pci_destroy_dev (struct pci_dev *) | |
206 | { | |
207 | calls | |
208 | device_unregister (&dev->dev) // in /drivers/base/core.c | |
209 | { | |
210 | calls | |
211 | device_del (struct device *) | |
212 | { | |
213 | calls | |
214 | bus_remove_device() // in /drivers/base/bus.c | |
215 | { | |
216 | calls | |
217 | device_release_driver() | |
218 | { | |
219 | calls | |
220 | struct device_driver->remove() which is just | |
221 | pci_device_remove() // in /drivers/pci/pci_driver.c | |
222 | { | |
223 | calls | |
224 | struct pci_driver->remove() which is just | |
225 | pcnet32_remove_one() // in /drivers/net/pcnet32.c | |
226 | { | |
227 | calls | |
228 | unregister_netdev() // in /net/core/dev.c | |
229 | { | |
230 | calls | |
231 | dev_close() // in /net/core/dev.c | |
232 | { | |
233 | calls dev->stop(); | |
234 | which is just pcnet32_close() // in pcnet32.c | |
235 | { | |
236 | which does what you wanted | |
237 | to stop the device | |
238 | } | |
239 | } | |
240 | } | |
241 | which | |
242 | frees pcnet32 device driver memory | |
243 | } | |
244 | }}}}}} | |
245 | ||
246 | ||
4d2e26a3 MCC |
247 | in drivers/pci/pci_driver.c, |
248 | struct device_driver->remove() is just pci_device_remove() | |
249 | which calls struct pci_driver->remove() which is pcnet32_remove_one() | |
250 | which calls unregister_netdev() (in net/core/dev.c) | |
251 | which calls dev_close() (in net/core/dev.c) | |
252 | which calls dev->stop() which is pcnet32_close() | |
253 | which then does the appropriate shutdown. | |
1da177e4 LT |
254 | |
255 | --- | |
4d2e26a3 | 256 | |
1da177e4 | 257 | Following is the analogous stack trace for events sent to user-space |
4d2e26a3 | 258 | when the pci device is unconfigured:: |
1da177e4 | 259 | |
4d2e26a3 | 260 | rpa_php_unconfig_pci_adapter() { // in rpaphp_pci.c |
1da177e4 | 261 | calls |
4d2e26a3 | 262 | pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c |
1da177e4 | 263 | calls |
4d2e26a3 | 264 | pci_destroy_dev (struct pci_dev *) { |
1da177e4 | 265 | calls |
4d2e26a3 | 266 | device_unregister (&dev->dev) { // in /drivers/base/core.c |
1da177e4 | 267 | calls |
4d2e26a3 | 268 | device_del(struct device * dev) { // in /drivers/base/core.c |
1da177e4 | 269 | calls |
4d2e26a3 | 270 | kobject_del() { // in /lib/kobject.c |
1da177e4 | 271 | calls |
4d2e26a3 | 272 | kobject_uevent() { // in /lib/kobject.c |
1da177e4 | 273 | calls |
4d2e26a3 | 274 | kset_uevent() { // in /lib/kobject.c |
1da177e4 | 275 | calls |
4d2e26a3 MCC |
276 | kset->uevent_ops->uevent() // which is really just |
277 | a call to | |
278 | dev_uevent() { // in /drivers/base/core.c | |
279 | calls | |
280 | dev->bus->uevent() which is really just a call to | |
281 | pci_uevent () { // in drivers/pci/hotplug.c | |
282 | which prints device name, etc.... | |
283 | } | |
1da177e4 | 284 | } |
4d2e26a3 MCC |
285 | then kobject_uevent() sends a netlink uevent to userspace |
286 | --> userspace uevent | |
287 | (during early boot, nobody listens to netlink events and | |
288 | kobject_uevent() executes uevent_helper[], which runs the | |
289 | event process /sbin/hotplug) | |
290 | } | |
1da177e4 | 291 | } |
4d2e26a3 MCC |
292 | kobject_del() then calls sysfs_remove_dir(), which would |
293 | trigger any user-space daemon that was watching /sysfs | |
294 | to notice the delete event. | |
1da177e4 LT |
295 | |
296 | ||
297 | Pros and Cons of the Current Design | |
298 | ----------------------------------- | |
299 | There are several issues with the current EEH software recovery design, | |
300 | which may be addressed in future revisions. But first, note that the | |
301 | big plus of the current design is that no changes need to be made to | |
302 | individual device drivers, so that the current design throws a wide net. | |
303 | The biggest negative of the design is that it potentially disturbs | |
304 | network daemons and file systems that didn't need to be disturbed. | |
305 | ||
4d2e26a3 | 306 | - A minor complaint is that resetting the network card causes |
1da177e4 LT |
307 | user-space back-to-back ifdown/ifup burps that potentially disturb |
308 | network daemons that did not even need to know that the PCI | |
309 | card was being rebooted. | |
310 | ||
4d2e26a3 | 311 | - A more serious concern is that the same reset, for SCSI devices, |
1da177e4 LT |
312 | causes havoc to mounted file systems. Scripts cannot post-facto |
313 | unmount a file system without flushing pending buffers, but this | |
314 | is impossible, because I/O has already been stopped. Thus, | |
315 | ideally, the reset should happen at or below the block layer, | |
316 | so that the file systems are not disturbed. | |
317 | ||
318 | Reiserfs does not tolerate errors returned from the block device. | |
319 | Ext3fs seems to be tolerant, retrying reads/writes until it does | |
320 | succeed. Both have been only lightly tested in this scenario. | |
321 | ||
322 | The SCSI-generic subsystem already has built-in code for performing | |
323 | SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter | |
324 | (HBA) resets. These are cascaded into a chain of attempted | |
325 | resets if a SCSI command fails. These are completely hidden | |
326 | from the block layer. It would be very natural to add an EEH | |
327 | reset into this chain of events. | |
328 | ||
4d2e26a3 | 329 | - If a SCSI error occurs for the root device, all is lost unless |
1da177e4 LT |
330 | the sysadmin had the foresight to run /bin, /sbin, /etc, /var |
331 | and so on, out of ramdisk/tmpfs. | |
332 | ||
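The cascaded SCSI resets mentioned in the second item above, with an EEH reset added to the chain, could be pictured as a simple escalation loop. This is a hypothetical sketch: the names are invented here, and placing the EEH reset last in the chain is an assumption, since the text does not say where it would go.

```c
#include <stdbool.h>

/* Escalation levels, mildest first. The position of the EEH reset is an
 * assumption made for this sketch. */
enum sim_level {
    SIM_DEVICE_RESET,
    SIM_BUS_RESET,
    SIM_HBA_RESET,
    SIM_EEH_RESET,
    SIM_NLEVELS
};

/* Hypothetical callback: attempts one kind of reset, true on success. */
typedef bool (*sim_try_fn)(enum sim_level level, void *ctx);

/* Tries each level in turn; returns the level that worked, or
 * SIM_NLEVELS if every attempt failed. */
static enum sim_level sim_escalate(sim_try_fn try_reset, void *ctx)
{
    for (int l = 0; l < SIM_NLEVELS; l++) {
        if (try_reset((enum sim_level)l, ctx))
            return (enum sim_level)l;
    }
    return SIM_NLEVELS;
}

/* Example callback: resets succeed only at or above the level in ctx. */
static bool sim_succeeds_at(enum sim_level level, void *ctx)
{
    return level >= *(enum sim_level *)ctx;
}
```

Because the escalation stays inside one loop, callers above it see only a single success or failure, which matches the goal of keeping the reset hidden from the block layer.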
333 | ||
334 | Conclusions | |
335 | ----------- | |
336 | There's forward progress ... |