Commit | Line | Data |
---|---|---|
6634fbb6 MCC |
1 | Error Detection And Correction (EDAC) Devices |
2 | ============================================= | |
3 | ||
6b1fb6f7 MCC |
4 | Main Concepts used in the EDAC subsystem |
5 | ---------------------------------------- | |
6 | ||
7 | There are several things to be aware of that aren't at all obvious, like | |
8 | *sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*, | |
9 | etc. | |
10 | ||
11 | These are some of the many terms that are thrown about that don't always | |
12 | mean what people think they mean (Inconceivable!). In the interest of | |
13 | creating a common ground for discussion, terms and their definitions | |
14 | will be established. | |
15 | ||
16 | * Memory devices | |
17 | ||
18 | The individual DRAM chips on a memory stick. These devices commonly | |
19 | output 4 or 8 bits each (x4, x8). Grouping several of these in parallel | |
20 | provides the number of bits that the memory controller expects: | |
21 | typically 72 bits, in order to provide 64 bits + 8 bits of ECC data. | |
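The grouping arithmetic above can be sketched in a few lines of C. This is only an illustration: the helper name `devices_per_bus` is hypothetical and is not part of the EDAC API.

```c
#include <assert.h>

/*
 * Number of DRAM devices that must be grouped in parallel to fill a bus
 * of bus_bits, given devices that each output device_bits (4 for x4,
 * 8 for x8). Returns -1 when the bus width is not a multiple of the
 * device width. Hypothetical helper, for illustration only.
 */
static int devices_per_bus(int bus_bits, int device_bits)
{
	assert(bus_bits > 0 && device_bits > 0);

	if (bus_bits % device_bits != 0)
		return -1;

	return bus_bits / device_bits;
}
```

Under these assumptions, a 72-bit bus (64 data bits + 8 ECC bits) needs 18 x4 devices or 9 x8 devices.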
22 | ||
23 | * Memory Stick | |
24 | ||
25 | A printed circuit board that aggregates multiple memory devices in | |
26 | parallel. In general, this is the Field Replaceable Unit (FRU) that | |
27 | gets replaced in case of excessive errors. It is most often called a | |
28 | DIMM (Dual Inline Memory Module). | |
29 | ||
30 | * Memory Socket | |
31 | ||
32 | A physical connector on the motherboard that accepts a single memory | |
33 | stick. Also called a "slot" in several datasheets. | |
34 | ||
35 | * Channel | |
36 | ||
37 | A memory controller channel, responsible for communicating with a group of | |
38 | DIMMs. Each channel has its own independent control (command) and data | |
39 | bus, and can be used independently or grouped with other channels. | |
40 | ||
41 | * Branch | |
42 | ||
43 | Typically the highest level of the hierarchy on a Fully-Buffered DIMM | |
44 | memory controller. It usually contains two channels. Two channels on | |
45 | the same branch can be used in single mode or in lockstep mode. When | |
46 | lockstep is enabled, the cacheline is doubled, but this generally brings | |
47 | some performance penalty. It is also generally not possible to point to | |
48 | just one memory stick when an error occurs, as the error correction code | |
49 | is calculated using two DIMMs instead of one. For the same reason, | |
50 | lockstep mode can correct more errors than single mode. | |
51 | ||
52 | * Single-channel | |
53 | ||
54 | The data accessed by the memory controller is contained in one DIMM | |
55 | only. E.g. if the data is 64 bits wide, the data flows to the CPU using | |
56 | one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3 | |
57 | memories. FB-DIMM and RAMBUS use a different concept for channel, so | |
58 | this concept doesn't apply there. | |
59 | ||
60 | * Double-channel | |
61 | ||
62 | The data accessed by the memory controller is interleaved across two | |
63 | DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72 | |
64 | bits with ECC), the data flows to the CPU using a 128-bit parallel | |
65 | access. | |
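As a rough sketch of the single-channel and double-channel widths described above, the access width can be computed from the number of DIMMs accessed in parallel. The struct and function names here are hypothetical, chosen only for this example.

```c
#include <assert.h>

/* Width of one parallel access, with and without the ECC bits. */
struct access_width {
	int data_bits;   /* bits of payload delivered to the CPU */
	int total_bits;  /* payload plus ECC bits on the bus */
};

/*
 * Width of an access that spans `dimms` DIMMs at the same time
 * (1 for single-channel, 2 for double-channel), where each DIMM
 * contributes dimm_data_bits of data and dimm_ecc_bits of ECC.
 * Hypothetical helper, for illustration only.
 */
static struct access_width parallel_access_width(int dimms,
						 int dimm_data_bits,
						 int dimm_ecc_bits)
{
	struct access_width w;

	assert(dimms > 0 && dimm_data_bits > 0 && dimm_ecc_bits >= 0);

	w.data_bits = dimms * dimm_data_bits;
	w.total_bits = dimms * (dimm_data_bits + dimm_ecc_bits);
	return w;
}
```

With 64-bit DIMMs carrying 8 ECC bits each, one DIMM gives a 64-bit access (72 bits with ECC) and two DIMMs give a 128-bit access (144 bits with ECC).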
66 | ||
67 | * Chip-select row | |
68 | ||
69 | This is the name of the DRAM signal used to select the DRAM ranks to be | |
70 | accessed. Common chip-select rows are 64 bits wide for single channel and | |
71 | 128 bits wide for dual channel. It may not be visible to the memory | |
72 | controller, as some DIMM types have a memory buffer that can hide direct | |
73 | access to it from the memory controller. | |
74 | ||
75 | * Single-Ranked stick | |
76 | ||
77 | A single-ranked stick has 1 chip-select row of memory. Motherboards | |
78 | commonly drive two chip-select pins to a memory stick. A single-ranked | |
79 | stick will occupy only one of those rows. The other will be unused. | |
80 | ||
81 | .. _doubleranked: | |
82 | ||
83 | * Double-Ranked stick | |
84 | ||
85 | A double-ranked stick has two chip-select rows which access different | |
86 | sets of memory devices. The two rows cannot be accessed concurrently. | |
87 | ||
88 | * Double-sided stick | |
89 | ||
90 | **DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`. | |
91 | ||
92 | A double-sided stick has two chip-select rows which access different sets | |
93 | of memory devices. The two rows cannot be accessed concurrently. | |
94 | "Double-sided" is irrespective of the memory devices being mounted on | |
95 | both sides of the memory stick. | |
96 | ||
97 | * Socket set | |
98 | ||
99 | All of the memory sticks that are required for a single memory access or | |
100 | all of the memory sticks spanned by a chip-select row. A single socket | |
101 | set has two chip-select rows and, if double-sided sticks are used, these | |
102 | will occupy those chip-select rows. | |
103 | ||
104 | * Bank | |
105 | ||
106 | This term is avoided because it is unclear when needing to distinguish | |
107 | between chip-select rows and socket sets. | |
108 | ||
109 | ||
6634fbb6 MCC |
110 | Memory Controllers |
111 | ------------------ | |
112 | ||
113 | Most of the EDAC core is focused on doing Memory Controller error detection. | |
114 | Memory controllers are allocated with :c:func:`edac_mc_alloc`, which | |
115 | internally uses the struct ``mem_ctl_info`` to describe them. That struct is | |
116 | opaque to the EDAC drivers: only the EDAC core is allowed to touch it. | |
117 | ||
118 | .. kernel-doc:: include/linux/edac.h | |
119 | ||
120 | .. kernel-doc:: drivers/edac/edac_mc.h | |
121 | ||
122 | PCI Controllers | |
123 | --------------- | |
124 | ||
125 | The EDAC subsystem provides a mechanism to handle PCI controllers by calling | |
126 | :c:func:`edac_pci_alloc_ctl_info`. It uses the struct | |
127 | :c:type:`edac_pci_ctl_info` to describe the PCI controllers. | |
128 | ||
129 | .. kernel-doc:: drivers/edac/edac_pci.h | |
130 | ||
131 | EDAC Blocks | |
132 | ----------- | |
133 | ||
134 | The EDAC subsystem also provides a generic mechanism to report errors on | |
135 | other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info` function. | |
136 | ||
137 | The structures :c:type:`edac_dev_sysfs_block_attribute`, | |
138 | :c:type:`edac_device_block`, :c:type:`edac_device_instance` and | |
139 | :c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device' | |
140 | representation in sysfs. | |
141 | ||
142 | This set of structures, and the code that implements their APIs, allows | |
143 | registering EDAC-type devices which are NOT standard memory or PCI, like: | |
144 | ||
145 | - CPU caches (L1 and L2) | |
146 | - DMA engines | |
147 | - Core CPU switches | |
148 | - Fabric switch units | |
149 | - PCIe interface controllers | |
150 | - other EDAC/ECC type devices that can be monitored for | |
151 | errors, etc. | |
152 | ||
153 | It allows for a two-level hierarchy. | |
154 | ||
155 | For example, a cache could be composed of L1, L2 and L3 levels of cache. | |
156 | Each CPU core would have its own L1 cache, while sharing L2 and maybe L3 | |
157 | caches. In such a case, those can be represented via the following sysfs | |
158 | nodes:: | |
159 | ||
160 | /sys/devices/system/edac/.. | |
161 | ||
162 | pci/ <existing pci directory (if available)> | |
163 | mc/ <existing memory device directory> | |
164 | cpu/cpu0/.. <L1 and L2 block directory> | |
165 | /L1-cache/ce_count | |
166 | /ue_count | |
167 | /L2-cache/ce_count | |
168 | /ue_count | |
169 | cpu/cpu1/.. <L1 and L2 block directory> | |
170 | /L1-cache/ce_count | |
171 | /ue_count | |
172 | /L2-cache/ce_count | |
173 | /ue_count | |
174 | ... | |
175 | ||
176 | The L1 and L2 directories would be "edac_device_block" entries. | |
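A userspace tool could read such counters as sketched below. This is a minimal sketch that assumes the example layout shown above; the actual instance and block names depend on the driver, and the helper names `edac_counter_path` and `read_counter` are hypothetical, not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>

/* Root of the example tree shown above. */
#define EDAC_SYSFS_ROOT "/sys/devices/system/edac"

/*
 * Build "<root>/<instance>/<block>/<counter>" into buf.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int edac_counter_path(char *buf, size_t len, const char *instance,
			     const char *block, const char *counter)
{
	int n;

	assert(buf && instance && block && counter);

	n = snprintf(buf, len, "%s/%s/%s/%s",
		     EDAC_SYSFS_ROOT, instance, block, counter);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one unsigned decimal counter (e.g. ce_count or ue_count) from
 * the file at path. Returns 0 on success, -1 if the file cannot be
 * opened or does not start with a number.
 */
static int read_counter(const char *path, unsigned long *value)
{
	FILE *f;
	int ok;

	assert(path && value);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = (fscanf(f, "%lu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

For instance, calling `edac_counter_path()` with the instance `"cpu/cpu0"`, the block `"L1-cache"` and the counter `"ce_count"` yields the path of the corrected-error count for the first block in the tree above, which can then be passed to `read_counter()`.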
177 | ||
178 | .. kernel-doc:: drivers/edac/edac_device.h |