Commit | Line | Data |
---|---|---|
f58ee00f AK |
1 | What is hwpoison? |
2 | ||
3 | Upcoming Intel CPUs have support for recovering from some memory errors | |
4 | (``MCA recovery''). This requires the OS to declare a page "poisoned", | |
5 | kill the processes associated with it and avoid using it in the future. | |
6 | ||
7 | This patchkit implements the necessary infrastructure in the VM. | |
8 | ||
9 | To quote the overview comment: | |
10 | ||
11 | * High level machine check handler. Handles pages reported by the | |
12 | * hardware as being corrupted usually due to a 2bit ECC memory or cache | |
13 | * failure. | |
14 | * | |
15 | * This focusses on pages detected as corrupted in the background. | |
16 | * When the current CPU tries to consume corruption the currently | |
17 | * running process can just be killed directly instead. This implies | |
18 | * that if the error cannot be handled for some reason it's safe to | |
19 | * just ignore it because no corruption has been consumed yet. Instead | |
20 | * when that happens another machine check will happen. | |
21 | * | |
22 | * Handles page cache pages in various states. The tricky part | |
23 | * here is that we can access any page asynchronous to other VM | |
24 | * users, because memory failures could happen anytime and anywhere, | |
25 | * possibly violating some of their assumptions. This is why this code | |
26 | * has to be extremely careful. Generally it tries to use normal locking | |
27 | * rules, as in get the standard locks, even if that means the | |
28 | * error handling takes potentially a long time. | |
29 | * | |
30 | * Some of the operations here are somewhat inefficient and have non | |
31 | * linear algorithmic complexity, because the data structures have not | |
32 | * been optimized for this case. This is in particular the case | |
33 | * for the mapping from a vma to a process. Since this case is expected | |
34 | * to be rare we hope we can get away with this. | |
35 | ||
36 | The code consists of the high level handler in mm/memory-failure.c, | |
37 | a new page poison bit and various checks in the VM to handle poisoned | |
38 | pages. | |
39 | ||
40 | The main target right now is KVM guests, but it works for all kinds | |
41 | of applications. KVM support requires a recent qemu-kvm release. | |
42 | ||
43 | For the KVM use there was a need for a new signal type so that | |
44 | KVM can inject the machine check into the guest with the proper | |
45 | address. This in theory allows other applications to handle | |
46 | memory failures too. The expectation is that nearly all applications | |
47 | won't do that, but some very specialized ones might. | |
48 | ||
49 | --- | |
50 | ||
51 | There are two (actually three) modes memory failure recovery can be in: | |
52 | ||
53 | vm.memory_failure_recovery sysctl set to zero: | |
54 | All memory failures cause a panic. Do not attempt recovery. | |
55 | (on x86 this can also be affected by the tolerant level of the | |
56 | MCE subsystem) | |
57 | ||
58 | early kill | |
59 | (can be controlled globally and per process) | |
60 | Send SIGBUS to the application as soon as the error is detected. | |
61 | This allows applications that can process memory errors in a gentle | |
62 | way (e.g. drop the affected object) to do so. | |
63 | This is the mode used by KVM qemu. | |
64 | ||
65 | late kill | |
66 | Send SIGBUS when the application runs into the corrupted page. | |
67 | This is best for applications unaware of memory errors and is the default. | |
68 | Note some pages are always handled as late kill. | |
69 | ||
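As a rough illustration of the early kill mode, an application that wants to handle memory errors itself could install a SIGBUS handler along the following lines. This is a sketch, not code from the patchkit. The names mce_kind, classify_sigbus, sigbus_handler and install_sigbus_handler are made up for the example. It also assumes that BUS_MCEERR_AR is the si_code reported when the process actually consumes the corruption, as opposed to BUS_MCEERR_AO for errors found in the background.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

/* Hypothetical classification, used only in this example. */
enum mce_kind { MCE_NONE, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

/* Last reported error; only written from the signal handler. */
static void *volatile last_poison_addr;
static volatile sig_atomic_t last_poison_kind = MCE_NONE;

/* Map the si_code of a SIGBUS to the kind of memory error it reports. */
enum mce_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:	/* found in the background (early kill) */
		return MCE_ACTION_OPTIONAL;
	case BUS_MCEERR_AR:	/* the process touched the poisoned page */
		return MCE_ACTION_REQUIRED;
	default:		/* ordinary SIGBUS, not a memory failure */
		return MCE_NONE;
	}
}

void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	last_poison_kind = classify_sigbus(si->si_code);
	last_poison_addr = si->si_addr;
	/*
	 * A real application would now drop or rebuild the object that
	 * contains si_addr. For anything other than MCE_ACTION_OPTIONAL
	 * it must not simply return, because the faulting access would
	 * be retried.
	 */
}

/* Returns 0 on success, -1 with errno set on failure. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

The handler only records the address. Keeping the recovery work outside the handler avoids calling functions that are not async-signal-safe.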
70 | --- | |
71 | ||
72 | User control: | |
73 | ||
74 | vm.memory_failure_recovery | |
75 | See sysctl.txt | |
76 | ||
77 | vm.memory_failure_early_kill | |
78 | Enable early kill mode globally | |
79 | ||
80 | PR_MCE_KILL | |
81 | Set early/late kill mode/revert to system default | |
82 | arg1: PR_MCE_KILL_CLEAR: Revert to system default | |
83 | arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode | |
84 | PR_MCE_KILL_EARLY: Early kill | |
85 | PR_MCE_KILL_LATE: Late kill | |
86 | PR_MCE_KILL_DEFAULT: Use system global default | |
3ba08129 NH |
87 | Note that if you want to have a dedicated thread which handles |
88 | the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should | |
89 | call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, | |
90 | the SIGBUS is sent to the main thread. | |
91 | ||
f58ee00f AK |
92 | PR_MCE_KILL_GET |
93 | Return the current mode | |
94 | ||
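The prctl controls above could be wrapped roughly as follows. This is a sketch under the assumption that the full call takes the form prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode, 0, 0) with the unused arguments set to zero, as described in prctl(2). The wrapper names are hypothetical.

```c
#include <sys/prctl.h>

/*
 * Set the policy for the calling thread. policy is one of
 * PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or PR_MCE_KILL_DEFAULT.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_mce_kill_policy(unsigned long policy)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, policy, 0, 0);
}

/* Revert the calling thread to the system default. */
int clear_mce_kill_policy(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0, 0, 0);
}

/* Returns the current PR_MCE_KILL_* policy, or -1 on failure. */
int get_mce_kill_policy(void)
{
	return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

A dedicated handler thread would call set_mce_kill_policy(PR_MCE_KILL_EARLY) itself, since the setting applies to the calling thread.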
95 | ||
96 | --- | |
97 | ||
98 | Testing: | |
99 | ||
fe194d3e | 100 | madvise(MADV_HWPOISON, ....) |
f58ee00f AK |
101 | (as root) |
102 | Poison a page in the process for testing | |
103 | ||
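A test program using this interface might look roughly like the sketch below. The helper names map_test_page and poison_page are made up for the example. Calling poison_page really does take the page out of service, so it should only be run as root on a test machine with a kernel built with hwpoison support. On other systems the call is expected to fail with an error.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous page and touch it so that it is backed by a
 * real page frame. Returns NULL on failure.
 */
void *map_test_page(size_t *len_out)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *p;

	if (page_size <= 0)
		return NULL;
	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memset(p, 0xaa, (size_t)page_size);	/* fault the page in */
	*len_out = (size_t)page_size;
	return p;
}

/*
 * Ask the kernel to treat the page(s) at addr as poisoned.
 * Returns 0 on success, -1 with errno set on failure.
 */
int poison_page(void *addr, size_t len)
{
	return madvise(addr, len, MADV_HWPOISON);
}
```

After a successful poison_page() the next access to the page is expected to raise SIGBUS in late kill mode, so a test would normally install a SIGBUS handler first.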
104 | ||
105 | hwpoison-inject module through debugfs | |
f58ee00f | 106 | |
847ce401 | 107 | /sys/kernel/debug/hwpoison/ |
f58ee00f | 108 | |
847ce401 WF |
109 | corrupt-pfn |
110 | ||
31d3d348 WF |
111 | Inject hwpoison fault at PFN echoed into this file. This does |
112 | some early filtering to avoid corrupting unintended pages in test suites. | |
847ce401 WF |
113 | |
114 | unpoison-pfn | |
115 | ||
116 | Software-unpoison page at PFN echoed into this file. This | |
117 | way a page can be reused again. | |
118 | This only works for Linux injected failures, not for real | |
119 | memory failures. | |
120 | ||
121 | Note these injection interfaces are not stable and might change between | |
122 | kernel versions. | |
f58ee00f | 123 | |
7c116f2b WF |
124 | corrupt-filter-dev-major |
125 | corrupt-filter-dev-minor | |
126 | ||
127 | Only handle memory failures to pages associated with the file system defined | |
128 | by block device major/minor. -1U is the wildcard value. | |
129 | This should only be used for testing with artificial injection. | |
130 | ||
4fd466eb AK |
131 | corrupt-filter-memcg |
132 | ||
133 | Limit injection to pages owned by a memory cgroup (memcg), specified by | |
134 | the inode number of the memcg. | |
135 | ||
136 | Example: | |
f6e07d38 | 137 | mkdir /sys/fs/cgroup/mem/hwpoison |
4fd466eb AK |
138 | |
139 | usemem -m 100 -s 1000 & | |
f6e07d38 | 140 | echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks |
4fd466eb | 141 | |
f6e07d38 | 142 | memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') |
4fd466eb AK |
143 | echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg |
144 | ||
145 | page-types -p `pidof init` --hwpoison # shall do nothing | |
146 | page-types -p `pidof usemem` --hwpoison # poison its pages | |
478c5ffc WF |
147 | |
148 | corrupt-filter-flags-mask | |
149 | corrupt-filter-flags-value | |
150 | ||
151 | When specified, only poison pages if ((page_flags & mask) == value). | |
152 | This allows stress testing of many kinds of pages. The page_flags | |
153 | are the same as in /proc/kpageflags. The flag bits are defined in | |
154 | include/linux/kernel-page-flags.h and documented in | |
155 | Documentation/vm/pagemap.txt | |
156 | ||
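The mask/value test can be pictured with the small sketch below. The function name hwpoison_filter_flags_match is hypothetical. The KPF_* bit numbers are a subset copied from include/linux/kernel-page-flags.h and should be checked against the header of the kernel under test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as in include/linux/kernel-page-flags.h (subset). */
#define KPF_DIRTY	4
#define KPF_LRU		5
#define KPF_ANON	12

/*
 * Same predicate as described for corrupt-filter-flags-mask and
 * corrupt-filter-flags-value: only the bits selected by mask are
 * compared, and they must equal value exactly.
 */
bool hwpoison_filter_flags_match(uint64_t page_flags, uint64_t mask,
				 uint64_t value)
{
	return (page_flags & mask) == value;
}
```

For example, mask = (1 << KPF_LRU) | (1 << KPF_DIRTY) with value = (1 << KPF_LRU) selects clean LRU pages. Flags outside the mask, such as KPF_ANON, are ignored.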
f58ee00f AK |
157 | Architecture specific MCE injector |
158 | ||
159 | x86 has mce-inject, mce-test | |
160 | ||
161 | Some portable hwpoison test programs in mce-test, see below. | |
162 | ||
163 | --- | |
164 | ||
165 | References: | |
166 | ||
167 | http://halobates.de/mce-lc09-2.pdf | |
168 | Overview presentation from LinuxCon 09 | |
169 | ||
170 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | |
171 | Test suite (hwpoison specific portable tests in tsrc) | |
172 | ||
173 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | |
174 | x86 specific injector | |
175 | ||
176 | ||
177 | --- | |
178 | ||
179 | Limitations: | |
180 | ||
181 | - Not all page types are supported and never will be. Most kernel internal | |
182 | objects cannot be recovered, only LRU pages for now. | |
183 | - Right now hugepage support is missing. | |
184 | ||
185 | --- | |
186 | Andi Kleen, Oct 2009 | |
187 |