Commit | Line | Data |
---|---|---|
f58ee00f AK |
1 | What is hwpoison? |
2 | ||
3 | Upcoming Intel CPUs have support for recovering from some memory errors | |
4 | (``MCA recovery''). This requires the OS to declare a page "poisoned", | |
5 | kill the processes associated with it and avoid using it in the future. | |
6 | ||
7 | This patchkit implements the necessary infrastructure in the VM. | |
8 | ||
9 | To quote the overview comment: | |
10 | ||
11 | * High level machine check handler. Handles pages reported by the | |
12 | * hardware as being corrupted usually due to a 2bit ECC memory or cache | |
13 | * failure. | |
14 | * | |
15 | * This focusses on pages detected as corrupted in the background. | |
16 | * When the current CPU tries to consume corruption the currently | |
17 | * running process can just be killed directly instead. This implies | |
18 | * that if the error cannot be handled for some reason it's safe to | |
19 | * just ignore it because no corruption has been consumed yet. Instead | |
20 | * when that happens another machine check will happen. | |
21 | * | |
22 | * Handles page cache pages in various states. The tricky part | |
23 | * here is that we can access any page asynchronous to other VM | |
24 | * users, because memory failures could happen anytime and anywhere, | |
25 | * possibly violating some of their assumptions. This is why this code | |
26 | * has to be extremely careful. Generally it tries to use normal locking | |
27 | * rules, as in get the standard locks, even if that means the | |
28 | * error handling takes potentially a long time. | |
29 | * | |
30 | * Some of the operations here are somewhat inefficient and have non | |
31 | * linear algorithmic complexity, because the data structures have not | |
32 | * been optimized for this case. This is in particular the case | |
33 | * for the mapping from a vma to a process. Since this case is expected | |
34 | * to be rare we hope we can get away with this. | |
35 | ||
36 | The code consists of the high level handler in mm/memory-failure.c, | |
37 | a new page poison bit and various checks in the VM to handle poisoned | |
38 | pages. | |
39 | ||
40 | The main target right now is KVM guests, but it works for all kinds | |
41 | of applications. KVM support requires a recent qemu-kvm release. | |
42 | ||
43 | For the KVM use there was need for a new signal type so that | |
44 | KVM can inject the machine check into the guest with the proper | |
45 | address. This in theory allows other applications to handle | |
46 | memory failures too. The expectation is that nearly all applications | |
47 | won't do that, but some very specialized ones might. | |
48 | ||
49 | --- | |
50 | ||
51 | There are two (actually three) modes memory failure recovery can be in: | |
52 | ||
53 | vm.memory_failure_recovery sysctl set to zero: | |
54 | All memory failures cause a panic. Do not attempt recovery. | |
55 | (on x86 this can also be affected by the tolerant level of the | |
56 | MCE subsystem) | |
57 | ||
58 | early kill | |
59 | (can be controlled globally and per process) | |
60 | Send SIGBUS to the application as soon as the error is detected. | |
61 | This allows applications that can process memory errors in a gentle | |
62 | way (e.g. drop the affected object) to do so. | |
63 | This is the mode used by KVM qemu. | |
64 | ||
65 | late kill | |
66 | Send SIGBUS when the application runs into the corrupted page. | |
67 | This is best for applications unaware of memory errors and is the default. | |
68 | Note that some pages are always handled as late kill. | |
69 | ||
70 | --- | |
71 | ||
72 | User control: | |
73 | ||
74 | vm.memory_failure_recovery | |
75 | See sysctl.txt | |
76 | ||
77 | vm.memory_failure_early_kill | |
78 | Enable early kill mode globally | |
79 | ||
80 | PR_MCE_KILL | |
81 | Set early/late kill mode/revert to system default | |
82 | arg1: PR_MCE_KILL_CLEAR: Revert to system default | |
83 | arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode | |
84 | PR_MCE_KILL_EARLY: Early kill | |
85 | PR_MCE_KILL_LATE: Late kill | |
86 | PR_MCE_KILL_DEFAULT: Use system global default | |
87 | PR_MCE_KILL_GET | |
88 | return current mode | |
89 | ||
90 | ||
91 | --- | |
92 | ||
93 | Testing: | |
94 | ||
fe194d3e | 95 | madvise(MADV_HWPOISON, ....) |
f58ee00f AK |
96 | (as root) |
97 | Poison a page in the process for testing | |
98 | ||
99 | ||
100 | hwpoison-inject module through debugfs | |
f58ee00f | 101 | |
847ce401 | 102 | /sys/kernel/debug/hwpoison/ |
f58ee00f | 103 | |
847ce401 WF |
104 | corrupt-pfn |
105 | ||
31d3d348 WF |
106 | Inject hwpoison fault at PFN echoed into this file. This does |
107 | some early filtering to avoid corrupting unintended pages in test suites. | |
847ce401 WF |
108 | |
109 | unpoison-pfn | |
110 | ||
111 | Software-unpoison page at PFN echoed into this file. This | |
112 | way a page can be reused again. | |
113 | This only works for Linux injected failures, not for real | |
114 | memory failures. | |
115 | ||
116 | Note these injection interfaces are not stable and might change between | |
117 | kernel versions. | |
f58ee00f | 118 | |
7c116f2b WF |
119 | corrupt-filter-dev-major |
120 | corrupt-filter-dev-minor | |
121 | ||
122 | Only handle memory failures to pages associated with the file system defined | |
123 | by block device major/minor. -1U is the wildcard value. | |
124 | This should be only used for testing with artificial injection. | |
125 | ||
4fd466eb AK |
126 | corrupt-filter-memcg |
127 | ||
128 | Limit injection to pages owned by a memory cgroup (memcg), specified | |
129 | by the inode number of the memcg. | |
130 | ||
131 | Example: | |
132 | mkdir /cgroup/hwpoison | |
133 | ||
134 | usemem -m 100 -s 1000 & | |
135 | echo `jobs -p` > /cgroup/hwpoison/tasks | |
136 | ||
137 | memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ') | |
138 | echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg | |
139 | ||
140 | page-types -p `pidof init` --hwpoison # shall do nothing | |
141 | page-types -p `pidof usemem` --hwpoison # poison its pages | |
478c5ffc WF |
142 | |
143 | corrupt-filter-flags-mask | |
144 | corrupt-filter-flags-value | |
145 | ||
146 | When specified, only poison pages if ((page_flags & mask) == value). | |
147 | This allows stress testing of many kinds of pages. The page_flags | |
148 | are the same as in /proc/kpageflags. The flag bits are defined in | |
149 | include/linux/kernel-page-flags.h and documented in | |
150 | Documentation/vm/pagemap.txt | |
151 | ||
f58ee00f AK |
152 | Architecture specific MCE injector |
153 | ||
154 | x86 has mce-inject, mce-test | |
155 | ||
156 | Some portable hwpoison test programs are in mce-test, see below. | |
157 | ||
158 | --- | |
159 | ||
160 | References: | |
161 | ||
162 | http://halobates.de/mce-lc09-2.pdf | |
163 | Overview presentation from LinuxCon 09 | |
164 | ||
165 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | |
166 | Test suite (hwpoison specific portable tests in tsrc) | |
167 | ||
168 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | |
169 | x86 specific injector | |
170 | ||
171 | ||
172 | --- | |
173 | ||
174 | Limitations: | |
175 | ||
176 | - Not all page types are supported and never will be. Most kernel internal | |
177 | objects cannot be recovered, only LRU pages for now. | |
178 | - Right now hugepage support is missing. | |
179 | ||
180 | --- | |
181 | Andi Kleen, Oct 2009 | |
182 |
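
The per-process policy described under "User control" can be tried out from user space. The following is a minimal sketch, not part of the original text, that calls prctl() through Python's ctypes. The constant values are copied from <linux/prctl.h>; the helper names (set_policy, get_policy, clear_policy) are made up for illustration. Note that prctl(2) counts the option itself as the first argument, so what the text above calls arg1 and arg2 are passed as the second and third arguments here. The sketch assumes a Linux system whose kernel accepts PR_MCE_KILL; elsewhere the calls fail and the demo only reports that.

```python
import ctypes
import os

# Values copied from <linux/prctl.h>.
PR_MCE_KILL = 33          # set or clear the per-process policy
PR_MCE_KILL_GET = 34      # query the per-process policy
PR_MCE_KILL_CLEAR = 0     # sub-command: revert to system default
PR_MCE_KILL_SET = 1       # sub-command: set the mode given in the next arg
PR_MCE_KILL_LATE = 0
PR_MCE_KILL_EARLY = 1
PR_MCE_KILL_DEFAULT = 2

POLICY_NAMES = {
    PR_MCE_KILL_LATE: "late kill",
    PR_MCE_KILL_EARLY: "early kill",
    PR_MCE_KILL_DEFAULT: "system default",
}


def _prctl(option, arg2=0, arg3=0):
    """Call prctl(option, arg2, arg3, 0, 0); raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.prctl(ctypes.c_int(option),
                     ctypes.c_ulong(arg2), ctypes.c_ulong(arg3),
                     ctypes.c_ulong(0), ctypes.c_ulong(0))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def set_policy(mode):
    """Set the caller's policy to one of the PR_MCE_KILL_* modes."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode)


def clear_policy():
    """Drop any per-process setting and fall back to the system default."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0)


def get_policy():
    """Return the current per-process policy (a PR_MCE_KILL_* mode)."""
    return _prctl(PR_MCE_KILL_GET)


if __name__ == "__main__":
    try:
        set_policy(PR_MCE_KILL_EARLY)
        print("policy now:", POLICY_NAMES[get_policy()])
        clear_policy()
        print("after clear:", POLICY_NAMES[get_policy()])
    except (OSError, AttributeError) as exc:
        # AttributeError: no prctl symbol (not Linux); OSError: kernel refused.
        print("PR_MCE_KILL not available here:", exc)
```

An application that wants to handle memory errors itself would typically request early kill this way at startup and install a SIGBUS handler; applications that do nothing keep the system default.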