Commit | Line | Data |
---|---|---|
b53ba588 MR |
1 | ======== |
2 | hwpoison | |
3 | ======== | |
4 | ||
f58ee00f | 5 | What is hwpoison? |
b53ba588 | 6 | ================= |
f58ee00f AK |
7 | |
8 | Upcoming Intel CPUs have support for recovering from some memory errors | |
b53ba588 | 9 | (``MCA recovery``). This requires the OS to declare a page "poisoned", |
f58ee00f AK |
10 | kill the processes associated with it and avoid using it in the future. |
11 | ||
12 | This patchkit implements the necessary infrastructure in the VM. | |
13 | ||
22aac857 VS |
14 | To quote the overview comment:: |
15 | ||
16 | High level machine check handler. Handles pages reported by the | |
17 | hardware as being corrupted usually due to a 2bit ECC memory or cache | |
18 | failure. | |
19 | ||
20 | This focusses on pages detected as corrupted in the background. | |
21 | When the current CPU tries to consume corruption the currently | |
22 | running process can just be killed directly instead. This implies | |
23 | that if the error cannot be handled for some reason it's safe to | |
24 | just ignore it because no corruption has been consumed yet. Instead | |
25 | when that happens another machine check will happen. | |
26 | ||
27 | Handles page cache pages in various states. The tricky part | |
28 | here is that we can access any page asynchronous to other VM | |
29 | users, because memory failures could happen anytime and anywhere, | |
30 | possibly violating some of their assumptions. This is why this code | |
31 | has to be extremely careful. Generally it tries to use normal locking | |
32 | rules, as in get the standard locks, even if that means the | |
33 | error handling takes potentially a long time. | |
34 | ||
35 | Some of the operations here are somewhat inefficient and have non | |
36 | linear algorithmic complexity, because the data structures have not | |
37 | been optimized for this case. This is in particular the case | |
38 | for the mapping from a vma to a process. Since this case is expected | |
39 | to be rare we hope we can get away with this. | |
f58ee00f AK |
40 | |
41 | The code consists of the high level handler in mm/memory-failure.c, | |
42 | a new page poison bit and various checks in the VM to handle poisoned | |
43 | pages. | |
44 | ||
45 | The main target right now is KVM guests, but it works for all kinds | |
46 | of applications. KVM support requires a recent qemu-kvm release. | |
47 | ||
48 | For the KVM use there was need for a new signal type so that | |
49 | KVM can inject the machine check into the guest with the proper | |
50 | address. This in theory allows other applications to handle | |
51 | memory failures too. The expectation is that nearly all applications | |
52 | won't do that, but some very specialized ones might. | |
53 | ||
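As a rough illustration of what such a specialized application might do, the
sketch below installs a SIGBUS handler with SA_SIGINFO and inspects ``si_code``,
``si_addr`` and ``si_addr_lsb``. The helper names (``mce_sigbus_kind``,
``mce_range_bytes``, ``install_sigbus_handler``) and the policy of recording the
address and carrying on are assumptions made for this example, not part of the
kernel interface; a real application needs its own logic to drop or rebuild the
affected object.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Last reported poisoned address and range; written only by the handler. */
static void *volatile last_poison_addr;
static volatile sig_atomic_t last_poison_lsb;

/* Hypothetical helper: describe a SIGBUS si_code. */
static const char *mce_sigbus_kind(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AO:
        return "action optional";
    case BUS_MCEERR_AR:
        return "action required";
    default:
        return "not a memory failure";
    }
}

/* si_addr_lsb is the least significant valid bit of si_addr,
 * so the affected range covers 1 << si_addr_lsb bytes. */
static size_t mce_range_bytes(int addr_lsb)
{
    return (size_t)1 << addr_lsb;
}

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    static const char msg[] = "fatal SIGBUS\n";
    ssize_t ignored;

    (void)sig;
    (void)ctx;

    if (info->si_code == BUS_MCEERR_AO) {
        /* Corruption not consumed yet: record it and let the
         * main loop discard the affected object later. */
        last_poison_addr = info->si_addr;
        last_poison_lsb = info->si_addr_lsb;
        return;
    }

    /* Consumed corruption or an unrelated SIGBUS: give up.
     * Only async-signal-safe calls are used here. */
    ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    _exit(1);
}

/* Returns 0 on success, -1 on failure (errno set by sigaction). */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call ``install_sigbus_handler()`` once at startup and check
``last_poison_addr`` from its main loop.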
b53ba588 MR |
54 | Failure recovery modes |
55 | ====================== | |
f58ee00f | 56 | |
b53ba588 | 57 | There are two (actually three) modes memory failure recovery can be in: |
f58ee00f AK |
58 | |
59 | vm.memory_failure_recovery sysctl set to zero: | |
60 | All memory failures cause a panic. Do not attempt recovery. | |
f58ee00f AK |
61 | |
62 | early kill | |
63 | (can be controlled globally and per process) | |
64 | Send SIGBUS to the application as soon as the error is detected. | |
65 | This allows applications that can process memory errors in a gentle | |
66 | way (e.g. drop the affected object) to do so. | |
67 | This is the mode used by KVM qemu. | |
68 | ||
69 | late kill | |
70 | Send SIGBUS when the application runs into the corrupted page. | |
71 | This is best for applications unaware of memory errors and is the default. | |
72 | Note that some pages are always handled as late kill. | |
73 | ||
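To make the relationship between the modes concrete, the following minimal
sketch reads the two sysctls through ``/proc/sys/vm/`` and reports the
system-wide default mode. The function names are invented for this example, and
it only reflects the global setting, not any per-process override.

```c
#include <stdio.h>

/* Read a single integer from a /proc/sys file; returns -1 if unreadable. */
static int read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Map the two sysctl values to the system-wide mode. */
static const char *global_mode(int recovery, int early_kill)
{
    if (recovery == 0)
        return "panic";
    return early_kill ? "early kill" : "late kill";
}

/* Report the system-wide mode, or "unknown" if the sysctls are absent
 * (for example on a kernel built without memory failure handling). */
static const char *current_global_mode(void)
{
    int recovery = read_sysctl_int("/proc/sys/vm/memory_failure_recovery");
    int early = read_sysctl_int("/proc/sys/vm/memory_failure_early_kill");

    if (recovery < 0 || early < 0)
        return "unknown";
    return global_mode(recovery, early);
}
```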
b53ba588 MR |
74 | User control |
75 | ============ | |
f58ee00f AK |
76 | |
77 | vm.memory_failure_recovery | |
78 | See sysctl.txt | |
79 | ||
80 | vm.memory_failure_early_kill | |
81 | Enable early kill mode globally | |
82 | ||
83 | PR_MCE_KILL | |
84 | Set early/late kill mode/revert to system default | |
b53ba588 MR |
85 | |
86 | arg1: PR_MCE_KILL_CLEAR: | |
87 | Revert to system default | |
88 | arg1: PR_MCE_KILL_SET: | |
89 | arg2 defines thread specific mode | |
90 | ||
91 | PR_MCE_KILL_EARLY: | |
92 | Early kill | |
93 | PR_MCE_KILL_LATE: | |
94 | Late kill | |
95 | PR_MCE_KILL_DEFAULT | |
96 | Use system global default | |
97 | ||
3ba08129 NH |
98 | Note that if you want to have a dedicated thread which handles |
99 | the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should | |
100 | call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) on the designated thread. Otherwise, | |
101 | the SIGBUS is sent to the main thread. | |
102 | ||
f58ee00f AK |
103 | PR_MCE_KILL_GET |
104 | return current mode | |
105 | ||
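The prctl interface above might be used roughly as follows. This is a sketch
under the assumption that ``<sys/prctl.h>`` exposes the ``PR_MCE_KILL*``
constants (as glibc does on Linux); the wrapper names are made up for the
example. Note that the unused trailing prctl arguments are passed as zero.

```c
#include <sys/prctl.h>

/* Hypothetical helper: map a PR_MCE_KILL_* policy value to a name. */
static const char *mce_policy_name(int policy)
{
    switch (policy) {
    case PR_MCE_KILL_LATE:
        return "late";
    case PR_MCE_KILL_EARLY:
        return "early";
    case PR_MCE_KILL_DEFAULT:
        return "default";
    default:
        return "unknown";
    }
}

/* Request early kill for the calling thread.
 * Returns 0 on success, -1 on failure (errno set by prctl). */
static int request_early_kill(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Return the calling thread's current policy (a PR_MCE_KILL_* value),
 * or -1 on failure. */
static int current_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

A dedicated handler thread would call ``request_early_kill()`` itself, since the
policy applies to the calling thread.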
b53ba588 MR |
106 | Testing |
107 | ======= | |
f58ee00f | 108 | |
b53ba588 MR |
109 | * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the |
110 | process for testing | |
f58ee00f | 111 | |
b53ba588 | 112 | * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` |
f58ee00f | 113 | |
b53ba588 MR |
114 | corrupt-pfn |
115 | Inject hwpoison fault at PFN echoed into this file. This does | |
116 | some early filtering to avoid corrupting unintended pages in test suites. | |
f58ee00f | 117 | |
b53ba588 MR |
118 | unpoison-pfn |
119 | Software-unpoison page at PFN echoed into this file. This way | |
120 | a page can be reused again. This only works for Linux | |
67f22ba7 | 121 | injected failures, not for real memory failures. Once any hardware |
122 | memory failure happens, this feature is disabled. | |
847ce401 | 123 | |
b53ba588 MR |
124 | Note these injection interfaces are not stable and might change between |
125 | kernel versions. | |
847ce401 | 126 | |
b53ba588 MR |
127 | corrupt-filter-dev-major, corrupt-filter-dev-minor |
128 | Only handle memory failures to pages associated with the file | |
129 | system defined by block device major/minor. -1U is the | |
130 | wildcard value. This should be only used for testing with | |
131 | artificial injection. | |
847ce401 | 132 | |
b53ba588 MR |
133 | corrupt-filter-memcg |
134 | Limit injection to pages owned by a memory cgroup (memcg), specified by | |
135 | the inode number of the memcg. | |
847ce401 | 136 | |
b53ba588 | 137 | Example:: |
f58ee00f | 138 | |
b53ba588 | 139 | mkdir /sys/fs/cgroup/mem/hwpoison |
7c116f2b | 140 | |
b53ba588 MR |
141 | usemem -m 100 -s 1000 & |
142 | echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks | |
7c116f2b | 143 | |
b53ba588 MR |
144 | memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') |
145 | echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg | |
4fd466eb | 146 | |
b53ba588 MR |
147 | page-types -p `pidof init` --hwpoison # shall do nothing |
148 | page-types -p `pidof usemem` --hwpoison # poison its pages | |
4fd466eb | 149 | |
b53ba588 MR |
150 | corrupt-filter-flags-mask, corrupt-filter-flags-value |
151 | When specified, only poison pages if ((page_flags & mask) == | |
152 | value). This allows stress testing of many kinds of | |
153 | pages. The page_flags are the same as in /proc/kpageflags. The | |
154 | flag bits are defined in include/linux/kernel-page-flags.h and | |
1ad1335d | 155 | documented in Documentation/admin-guide/mm/pagemap.rst |
4fd466eb | 156 | |
b53ba588 | 157 | * Architecture specific MCE injector |
4fd466eb | 158 | |
b53ba588 | 159 | x86 has mce-inject, mce-test |
4fd466eb | 160 | |
b53ba588 | 161 | Some portable hwpoison test programs in mce-test, see below. |
478c5ffc | 162 | |
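For the madvise(MADV_HWPOISON) route listed above, a test program might look
roughly like the sketch below. It assumes a kernel built with
CONFIG_MEMORY_FAILURE and a caller with CAP_SYS_ADMIN; without those the call is
expected to fail. The function names are invented for this example. Be aware
that a successful call really does poison the underlying page, so this belongs
on a test machine only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one anonymous page and write to it so that a physical page
 * is actually allocated. Returns NULL on failure. */
static void *map_test_page(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0xaa, len);
    return p;
}

/* Ask the kernel to treat the page at p as hardware-poisoned.
 * p must be page aligned. Returns 0 on success or a negative
 * errno value (e.g. -EPERM without privileges). */
static int poison_test_page(void *p)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    if (madvise(p, len, MADV_HWPOISON) != 0)
        return -errno;
    return 0;
}

/* Release the test mapping. */
static void unmap_test_page(void *p)
{
    munmap(p, (size_t)sysconf(_SC_PAGESIZE));
}
```

A test would call ``map_test_page()``, pass the result to
``poison_test_page()``, and then observe how the process is signalled when it
touches the page again.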
b53ba588 MR |
163 | References |
164 | ========== | |
f58ee00f AK |
165 | |
166 | http://halobates.de/mce-lc09-2.pdf | |
167 | Overview presentation from LinuxCon 09 | |
168 | ||
169 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | |
170 | Test suite (hwpoison specific portable tests in tsrc) | |
171 | ||
172 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | |
173 | x86 specific injector | |
174 | ||
175 | ||
b53ba588 MR |
176 | Limitations |
177 | =========== | |
f58ee00f | 178 | - Not all page types are supported and never will be. Most kernel internal |
b53ba588 | 179 | objects cannot be recovered, only LRU pages for now. |
f58ee00f AK |
180 | |
181 | --- | |
182 | Andi Kleen, Oct 2009 |