.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

- Gathers comprehensive error information when errors occur.
- Reports errors to the users.
- Performs error recovery actions.

The AER driver only attaches to root ports which support the PCI Express
AER capability.


User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver has to be
compiled. Option CONFIG_PCIEAER supports this capability. It
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. If it is a correctable error, it is output as a warning.
Otherwise, it is printed as an error. Users can therefore choose a
log level that filters out correctable error messages.

Below is an example::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.
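
The fields in the example log can be decoded by hand. The following is a
minimal user-space sketch, not kernel code: it splits the Requester ID
``0500`` into bus, device and function numbers and finds the status bit
reported in ``00100000``. The function and struct names are hypothetical,
and the bit-name table lists only a handful of Uncorrectable Error Status
bits; consult the PCI Express specification for the full set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded Requester ID (hypothetical helper type for this sketch). */
struct requester_id {
	uint8_t bus;
	uint8_t dev;
	uint8_t fn;
};

/* Split a 16-bit Requester ID: bits 15:8 bus, 7:3 device, 2:0 function. */
struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r = {
		.bus = (uint8_t)(id >> 8),
		.dev = (uint8_t)((id >> 3) & 0x1f),
		.fn  = (uint8_t)(id & 0x07),
	};
	return r;
}

/* Names of a few Uncorrectable Error Status bits, by bit number. */
const char *uncor_bit_name(unsigned int bit)
{
	assert(bit < 32);
	switch (bit) {
	case 4:  return "Data Link Protocol Error";
	case 12: return "Poisoned TLP";
	case 14: return "Completion Timeout";
	case 15: return "Completer Abort";
	case 16: return "Unexpected Completion";
	case 18: return "Malformed TLP";
	case 20: return "Unsupported Request";
	default: return NULL;	/* not covered by this sketch */
	}
}

/* Return the lowest bit set in status and clear in mask, or -1 if none. */
int lowest_reported_bit(uint32_t status, uint32_t mask)
{
	uint32_t active = status & ~mask;

	for (int bit = 0; bit < 32; bit++)
		if (active & (UINT32_C(1) << bit))
			return bit;
	return -1;
}
```

With the values from the log, ``decode_requester_id(0x0500)`` yields bus
``05``, device ``00``, function ``0``, and
``lowest_reported_bit(0x00100000, 0x00000000)`` yields bit 20, which
matches the ``[20] Unsupported Request`` line.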

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented at
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
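
As an illustration of consuming such counters from user space, the sketch
below looks up one counter in the text of a statistics attribute. It
assumes that each line holds a counter name followed by a space and a
decimal count, and that a per-device attribute such as
``aer_dev_correctable`` exists; both are assumptions to be checked against
the ABI document above. The function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in the text of an AER statistics attribute.
 * Returns the count, or -1 if no line has the form "<name> <digits>".
 */
long aer_counter(const char *text, const char *name)
{
	const char *line = text;
	size_t len;

	assert(text != NULL && name != NULL);
	len = strlen(name);
	while (line && *line) {
		if (strncmp(line, name, len) == 0 && line[len] == ' ' &&
		    isdigit((unsigned char)line[len + 1]))
			return strtol(line + len + 1, NULL, 10);
		line = strchr(line, '\n');	/* advance to the next line */
		if (line)
			line++;
	}
	return -1;
}

/* Read an attribute file (at most 4095 bytes) and look up one counter. */
long aer_counter_from_file(const char *path, const char *name)
{
	char buf[4096];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return aer_counter(buf, name);
}
```

A caller would pass the path of one attribute under the device's sysfs
directory together with a counter name taken from the ABI document. A
return value of -1 means the file or the counter was not found.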

Developer Guide
===============

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER better, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Logging the error includes storing
the error reporting agent's requester ID in the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.
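
To make the Root Port logging concrete, here is a small user-space sketch
that interprets a Root Error Status value together with an Error Source
Identification value. It is a model under assumptions, not the AER
driver's code: the macro names are local to the example, and the bit
positions and the split of the source register (low 16 bits for the
ERR_COR source, high 16 bits for the ERR_FATAL/NONFATAL source) should be
verified against the PCI Express specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Root Error Status bits used here (local names, assumed bit positions). */
#define ROOT_COR_RCV	0x00000001u	/* ERR_COR received */
#define ROOT_UNCOR_RCV	0x00000004u	/* ERR_FATAL/NONFATAL received */
#define ROOT_FATAL_RCV	0x00000040u	/* fatal error message received */

/* What the Root Port logged, in a form convenient for a handler. */
struct root_event {
	bool correctable;	/* a correctable error was reported */
	uint16_t cor_source;	/* requester ID of the ERR_COR sender */
	bool uncorrectable;	/* an uncorrectable error was reported */
	bool fatal;		/* the uncorrectable error was fatal */
	uint16_t uncor_source;	/* requester ID of the ERR_FATAL/NONFATAL sender */
};

/* Decode the two register values into *out. */
void decode_root_event(uint32_t status, uint32_t source_id,
		       struct root_event *out)
{
	assert(out != NULL);
	out->correctable = (status & ROOT_COR_RCV) != 0;
	out->uncorrectable = (status & ROOT_UNCOR_RCV) != 0;
	out->fatal = out->uncorrectable && (status & ROOT_FATAL_RCV) != 0;
	out->cor_source = (uint16_t)(source_id & 0xffffu);
	out->uncor_source = (uint16_t)(source_id >> 16);
}
```

For instance, a status of ``0x00000044`` with a source value of
``0x05000000`` would decode as a fatal uncorrectable error reported by
requester ID ``0500``. These values are made up for illustration.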

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
the "helper functions" section.

Provide callbacks
-----------------

callback reset_link to reset pci express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications to reset the PCI Express link, so all
upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.
::

  pci_ers_result_t (*reset_link) (struct pci_dev *dev);

The "Non-correctable (non-fatal and fatal) errors" section provides
more detailed info on when to call reset_link.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.

Data struct pci_driver has a pointer, err_handler, to point to
pci_error_handlers, which consists of a couple of callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when to call the error callback functions.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover; the result determines whether the AER driver
calls mmio_enabled next.

If an error message indicates a fatal error, the kernel will broadcast
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function to reset the link. First, the kernel checks whether the upstream
component has an AER driver. If it has, the kernel uses the reset_link
callback of the AER driver. If the upstream component has no AER driver
and the port is a downstream port, the kernel performs a hot reset by
default by setting the Secondary Bus Reset bit of the Bridge Control
register associated with the downstream port. As for upstream ports,
they should provide their own AER service drivers with a reset_link
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.
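
The decision flow described in this section can be summarized in a small
user-space model. This is only a sketch of the rules stated above, not the
AER driver's implementation: the enums are local stand-ins for the
kernel's pci_ers_result_t and channel state values, all names are
hypothetical, and every outcome this guide does not spell out is folded
into ``STEP_OTHER`` (see pci-error-recovery.txt for those cases).

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the pci_ers_result_t values named in this guide. */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

enum severity { SEV_NONFATAL, SEV_FATAL };

/* Channel state passed to error_detected for each severity. */
enum channel_state { IO_NORMAL, IO_FROZEN };

enum next_step {
	STEP_MMIO_ENABLED,	/* continue with the mmio_enabled callback */
	STEP_OTHER,		/* outcome not spelled out in this guide */
};

/* Non-fatal errors report a normal channel, fatal errors a frozen one. */
enum channel_state state_for(enum severity sev)
{
	return sev == SEV_FATAL ? IO_FROZEN : IO_NORMAL;
}

/* Only fatal errors require a link reset upstream. */
bool needs_link_reset(enum severity sev)
{
	return sev == SEV_FATAL;
}

/*
 * Decide what follows error_detected. The reset_link result is only
 * consulted for fatal errors, where a link reset is performed.
 */
enum next_step after_error_detected(enum severity sev,
				    enum ers_result detected,
				    enum ers_result reset_link)
{
	assert(sev == SEV_NONFATAL || sev == SEV_FATAL);
	if (detected != ERS_CAN_RECOVER)
		return STEP_OTHER;
	if (needs_link_reset(sev) && reset_link != ERS_RECOVERED)
		return STEP_OTHER;
	return STEP_MMIO_ENABLED;
}
```

Keeping the severity-to-channel-state mapping and the link-reset decision
in separate functions mirrors how the text treats them as two independent
consequences of the error severity.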

helper functions
----------------
::

  int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

::

  int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.
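
At register level, enabling and disabling error reporting comes down to
the error-reporting enable bits of the PCI Express Device Control
register. The user-space model below shows that bit manipulation on a
plain 16-bit value. It is a sketch under assumptions: the macro and
function names are local to the example, and whether the kernel helpers
touch exactly these four bits should be checked against the kernel source
for the version in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device Control register error-reporting enable bits (local names). */
#define DEVCTL_CERE	0x0001u	/* Correctable Error Reporting Enable */
#define DEVCTL_NFERE	0x0002u	/* Non-Fatal Error Reporting Enable */
#define DEVCTL_FERE	0x0004u	/* Fatal Error Reporting Enable */
#define DEVCTL_URRE	0x0008u	/* Unsupported Request Reporting Enable */
#define DEVCTL_AER_FLAGS \
	(DEVCTL_CERE | DEVCTL_NFERE | DEVCTL_FERE | DEVCTL_URRE)

/* Model of enabling error reporting: set all four enable bits. */
void model_enable_reporting(uint16_t *devctl)
{
	assert(devctl != NULL);
	*devctl = (uint16_t)(*devctl | DEVCTL_AER_FLAGS);
}

/* Model of disabling error reporting: clear all four enable bits. */
void model_disable_reporting(uint16_t *devctl)
{
	assert(devctl != NULL);
	*devctl = (uint16_t)(*devctl & ~DEVCTL_AER_FLAGS);
}
```

Starting from a made-up register value of ``0x2810``, enabling gives
``0x281f`` and disabling restores ``0x2810``; the other bits of the
register are left untouched in both directions.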

::

  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);

pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print out warning messages. Please refer
  to the Developer Guide section for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports to which the service driver is attached.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the root
  port.

Q:
  What modifications will that driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clean up the uncorrectable status register. Please refer to the
  "helper functions" section.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the kernel
configuration, that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be obtained
from:

    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation that
comes with its source code.