Commit | Line | Data |
---|---|---|
76b387bd MR |
1 | .. _frontswap: |
2 | ||
3 | ========= | |
4 | Frontswap | |
5 | ========= | |
6 | ||
27c6aec2 DM |
7 | Frontswap provides a "transcendent memory" interface for swap pages. |
8 | In some environments, dramatic performance savings may be obtained because | |
9 | swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. | |
10 | ||
76b387bd | 11 | .. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/ |
27c6aec2 DM |
12 | |
13 | Frontswap is so named because it can be thought of as the opposite of | |
14 | a "backing" store for a swap device. The storage is assumed to be | |
15 | a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming | |
16 | to the requirements of transcendent memory (such as Xen's "tmem", or | |
17 | in-kernel compressed memory, aka "zcache", or future RAM-like devices); | |
18 | this pseudo-RAM device is not directly accessible or addressable by the | |
19 | kernel and is of unknown and possibly time-varying size. The driver | |
20 | links itself to frontswap by calling frontswap_register_ops to set the | |
21 | frontswap_ops funcs appropriately and the functions it provides must | |
22 | conform to certain policies as follows: | |
23 | ||
24 | An "init" prepares the device to receive frontswap pages associated | |
165c8aed | 25 | with the specified swap device number (aka "type"). A "store" will |
27c6aec2 | 26 | copy the page to transcendent memory and associate it with the type and |
165c8aed | 27 | offset associated with the page. A "load" will copy the page, if found, |
27c6aec2 | 28 | from transcendent memory into kernel memory, but will NOT remove the page |
1d00015e | 29 | from transcendent memory. An "invalidate_page" will remove the page |
27c6aec2 DM |
30 | from transcendent memory and an "invalidate_area" will remove ALL pages |
31 | associated with the swap type (e.g., like swapoff) and notify the "device" | |
165c8aed | 32 | to refuse further stores with that swap type. |
27c6aec2 | 33 | |
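The policies above can be illustrated with a small user-space sketch. This is not the kernel's frontswap_ops structure: the ``toy_*`` names, the fixed capacities, and the linear slot search are all assumptions chosen to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_MAX_TYPES 4   /* hypothetical limit on swap devices */
#define TOY_MAX_PAGES 8   /* hypothetical per-device capacity */

struct toy_slot {
    bool used;
    unsigned long offset;
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_area {
    bool accepting;       /* set by init, cleared by invalidate_area */
    struct toy_slot slots[TOY_MAX_PAGES];
};

static struct toy_area toy_areas[TOY_MAX_TYPES];

static struct toy_slot *toy_find(unsigned type, unsigned long offset)
{
    for (size_t i = 0; i < TOY_MAX_PAGES; i++) {
        struct toy_slot *s = &toy_areas[type].slots[i];
        if (s->used && s->offset == offset)
            return s;
    }
    return NULL;
}

/* "init": prepare to receive pages for this swap type. */
void toy_init(unsigned type)
{
    memset(&toy_areas[type], 0, sizeof(toy_areas[type]));
    toy_areas[type].accepting = true;
}

/* "store": copy the page in; returns 0 on success, -1 if rejected. */
int toy_store(unsigned type, unsigned long offset, const void *page)
{
    struct toy_area *a = &toy_areas[type];
    struct toy_slot *s;

    if (!a->accepting)
        return -1;
    s = toy_find(type, offset);          /* reuse the slot on overwrite */
    for (size_t i = 0; !s && i < TOY_MAX_PAGES; i++)
        if (!a->slots[i].used)
            s = &a->slots[i];
    if (!s)
        return -1;                       /* no room: caller swaps to disk */
    s->used = true;
    s->offset = offset;
    memcpy(s->data, page, TOY_PAGE_SIZE);
    return 0;
}

/* "load": copy the page out if found; the stored copy is NOT removed. */
int toy_load(unsigned type, unsigned long offset, void *page)
{
    struct toy_slot *s = toy_find(type, offset);

    if (!s)
        return -1;
    memcpy(page, s->data, TOY_PAGE_SIZE);
    return 0;
}

/* "invalidate_page": forget one page. */
void toy_invalidate_page(unsigned type, unsigned long offset)
{
    struct toy_slot *s = toy_find(type, offset);

    if (s)
        s->used = false;
}

/* "invalidate_area": forget ALL pages of this type and refuse new stores. */
void toy_invalidate_area(unsigned type)
{
    memset(&toy_areas[type], 0, sizeof(toy_areas[type]));
}
```

A caller would run ``toy_init(type)`` at swapon time and ``toy_invalidate_area(type)`` at swapoff time; after the latter, every ``toy_store`` for that type fails until ``toy_init`` is called again.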
165c8aed | 34 | Once a page is successfully stored, a matching load on the page will normally |
27c6aec2 | 35 | succeed. So when the kernel finds itself in a situation where it needs |
165c8aed | 36 | to swap out a page, it first attempts to use frontswap. If the store returns |
27c6aec2 DM |
37 | success, the data has been successfully saved to transcendent memory and |
38 | a disk write and, if the data is later read back, a disk read are avoided. | |
165c8aed | 39 | If a store returns failure, transcendent memory has rejected the data, and the |
27c6aec2 DM |
40 | page can be written to swap as usual. |
41 | ||
165c8aed KRW |
42 | Note that if a page is stored and the page already exists in transcendent memory |
43 | (a "duplicate" store), either the store succeeds and the data is overwritten, | |
44 | or the store fails AND the page is invalidated. This ensures stale data may | |
27c6aec2 DM |
45 | never be obtained from frontswap. |
46 | ||
47 | If properly configured, monitoring of frontswap is done via debugfs in | |
76b387bd | 48 | the `/sys/kernel/debug/frontswap` directory. The effectiveness of |
27c6aec2 DM |
49 | frontswap can be measured (across all swap devices) with: |
50 | ||
76b387bd MR |
51 | ``failed_stores`` |
52 | how many store attempts have failed | |
53 | ||
54 | ``loads`` | |
55 | how many loads were attempted (all should succeed) | |
56 | ||
57 | ``succ_stores`` | |
58 | how many store attempts have succeeded | |
59 | ||
60 | ``invalidates`` | |
61 | how many invalidates were attempted | |
27c6aec2 DM |
62 | |
63 | A backend implementation may provide additional metrics. | |
64 | ||
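As a rough illustration of how these counters might be combined, the sketch below reads one counter from a file and computes the share of store attempts that were accepted. The helper names ``read_counter`` and ``store_hit_percent`` are hypothetical, and the sketch assumes each debugfs file holds a single unsigned decimal number.

```c
#include <stdio.h>

/* Read one unsigned decimal counter from a file; returns 0 on success,
 * -1 if the file cannot be opened or does not start with a number. */
int read_counter(const char *path, unsigned long long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Percentage of store attempts that succeeded; 0 if none were attempted. */
double store_hit_percent(unsigned long long succ_stores,
                         unsigned long long failed_stores)
{
    unsigned long long total = succ_stores + failed_stores;

    if (total == 0)
        return 0.0;
    return 100.0 * (double)succ_stores / (double)total;
}
```

For example, a program could call ``read_counter`` on the ``succ_stores`` and ``failed_stores`` files under the debugfs directory named above and pass both values to ``store_hit_percent``.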
65 | FAQ | |
76b387bd | 66 | === |
27c6aec2 | 67 | |
76b387bd | 68 | * Where's the value? |
27c6aec2 DM |
69 | |
70 | When a workload starts swapping, performance falls through the floor. | |
71 | Frontswap significantly increases performance in many such workloads by | |
72 | providing a clean, dynamic interface to read and write swap pages to | |
73 | "transcendent memory" that is otherwise not directly addressable to the kernel. | |
74 | This interface is ideal when data is transformed to a different form | |
75 | and size (such as with compression) or secretly moved (as might be | |
76 | useful for write-balancing for some RAM-like devices). Swap pages (and | |
77 | evicted page-cache pages) are a great use for this kind of slower-than-RAM- | |
0a4ee518 | 78 | but-much-faster-than-disk "pseudo-RAM device". |
27c6aec2 | 79 | |
0a4ee518 | 80 | Frontswap, with a fairly small impact on the kernel,
27c6aec2 DM |
81 | provides a huge amount of flexibility for more dynamic, flexible RAM |
82 | utilization in various system configurations: | |
83 | ||
84 | In the single kernel case, aka "zcache", pages are compressed and | |
85 | stored in local memory, thus increasing the total anonymous pages | |
86 | that can be safely kept in RAM. Zcache essentially trades off CPU | |
87 | cycles used in compression/decompression for better memory utilization. | |
88 | Benchmarks have shown little or no impact when memory pressure is | |
89 | low while providing a significant performance improvement (25%+) | |
90 | on some workloads under high memory pressure. | |
91 | ||
92 | "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory | |
93 | support for clustered systems. Frontswap pages are locally compressed | |
94 | as in zcache, but then "remotified" to another system's RAM. This | |
95 | allows RAM to be dynamically load-balanced back-and-forth as needed, | |
96 | i.e. when system A is overcommitted, it can swap to system B, and | |
97 | vice versa. RAMster can also be configured as a memory server so | |
98 | many servers in a cluster can swap, dynamically as needed, to a single | |
99 | server configured with a large amount of RAM... without pre-configuring | |
100 | how much of the RAM is available for each of the clients! | |
101 | ||
102 | In the virtual case, the whole point of virtualization is to statistically | |
1d00015e | 103 | multiplex physical resources across the varying demands of multiple |
27c6aec2 DM |
104 | virtual machines. This is really hard to do with RAM and efforts to do |
105 | it well with no kernel changes have essentially failed (except in some | |
106 | well-publicized special-case workloads). | |
107 | Specifically, the Xen Transcendent Memory backend allows otherwise | |
108 | "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple | |
109 | virtual machines, but the pages can be compressed and deduplicated to | |
110 | optimize RAM utilization. And when guest OSes are induced to surrender | |
111 | underutilized RAM (e.g. with "selfballooning"), sudden unexpected | |
112 | memory pressure may result in swapping; frontswap allows those pages | |
113 | to be swapped to and from hypervisor RAM (if overall host system memory | |
114 | conditions allow), thus mitigating the potentially awful performance impact | |
115 | of unplanned swapping. | |
116 | ||
117 | A KVM implementation is underway and has been RFC'ed to lkml. And, | |
118 | using frontswap, investigation is also underway on the use of NVM as | |
119 | a memory extension technology. | |
120 | ||
76b387bd MR |
121 | * Sure there may be performance advantages in some situations, but |
122 | what's the space/time overhead of frontswap? | |
27c6aec2 DM |
123 | |
124 | If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into | |
125 | nothingness and the only overhead is a few extra bytes per swapon'ed | |
126 | swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" | |
127 | registers, one extra global variable is compared to zero for | |
128 | every swap page read or written. If CONFIG_FRONTSWAP is enabled | |
165c8aed | 129 | AND a frontswap backend registers AND the backend fails every "store" |
27c6aec2 DM |
130 | request (i.e. provides no memory despite claiming it might), |
131 | CPU overhead is still negligible -- and since every frontswap fail | |
132 | precedes a swap page write-to-disk, the system is highly likely | |
133 | to be I/O bound and using a small fraction of a percent of a CPU | |
134 | will be irrelevant anyway. | |
135 | ||
136 | As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend | |
137 | registers, one bit is allocated for every swap page for every swap | |
138 | device that is swapon'd. This is added to the EIGHT bits (which | |
139 | was sixteen until about 2.6.34) that the kernel already allocates | |
140 | for every swap page for every swap device that is swapon'd. (Hugh | |
141 | Dickins has observed that frontswap could probably steal one of | |
142 | the existing eight bits, but let's worry about that minor optimization | |
143 | later.) For very large swap disks (which are rare) on a standard | |
144 | 4K pagesize, this is 1MB per 32GB swap. | |
145 | ||
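The 1MB-per-32GB figure follows from one bit per 4K page. A minimal sketch of the arithmetic, using a hypothetical helper name:

```c
#include <stdint.h>

/* Bytes needed for a one-bit-per-page map covering swap_bytes of swap,
 * rounded up to a whole byte. The helper name is illustrative only. */
uint64_t map_bytes_for_swap(uint64_t swap_bytes, uint64_t page_size)
{
    uint64_t pages = swap_bytes / page_size;

    return (pages + 7) / 8;
}
```

With 32GB of swap and 4K pages there are 2^35 / 2^12 = 2^23 pages, so the map needs 2^23 bits, which is 2^20 bytes, or 1MB.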
146 | When swap pages are stored in transcendent memory instead of written | |
147 | out to disk, there is a side effect that this may create more memory | |
148 | pressure that can potentially outweigh the other advantages. A | |
149 | backend, such as zcache, must implement policies to carefully (but | |
150 | dynamically) manage memory limits to ensure this doesn't happen. | |
151 | ||
76b387bd MR |
152 | * OK, how about a quick overview of what this frontswap patch does |
153 | in terms that a kernel hacker can grok? | |
27c6aec2 DM |
154 | |
155 | Let's assume that a frontswap "backend" has registered during | |
156 | kernel initialization; this registration indicates that this | |
157 | frontswap backend has access to some "memory" that is not directly | |
158 | accessible by the kernel. Exactly how much memory it provides is | |
159 | entirely dynamic and unpredictable. | |
160 | ||
161 | Whenever a swap-device is swapon'd, frontswap_init() is called, | |
162 | passing the swap device number (aka "type") as a parameter. | |
165c8aed | 163 | This notifies frontswap to expect attempts to "store" swap pages |
27c6aec2 DM |
164 | associated with that number. |
165 | ||
166 | Whenever the swap subsystem is readying a page to write to a swap | |
165c8aed | 167 | device (cf. swap_writepage()), frontswap_store() is called. Frontswap
27c6aec2 | 168 | consults with the frontswap backend and if the backend says it does NOT |
165c8aed | 169 | have room, frontswap_store returns -1 and the kernel swaps the page |
27c6aec2 DM |
170 | to the swap device as normal. Note that the response from the frontswap |
171 | backend is unpredictable to the kernel; it may choose to never accept a | |
172 | page, it could accept every ninth page, or it might accept every | |
173 | page. But if the backend does accept a page, the data from the page | |
174 | has already been copied and associated with the type and offset, | |
175 | and the backend guarantees the persistence of the data. In this case, | |
176 | frontswap sets a bit in the "frontswap_map" for the swap device | |
177 | corresponding to the page offset on the swap device to which it would | |
178 | otherwise have written the data. | |
179 | ||
180 | When the swap subsystem needs to swap-in a page (swap_readpage()), | |
165c8aed | 181 | it first calls frontswap_load() which checks the frontswap_map to |
27c6aec2 DM |
182 | see if the page was earlier accepted by the frontswap backend. If |
183 | it was, the page of data is filled from the frontswap backend and | |
184 | the swap-in is complete. If not, the normal swap-in code is | |
185 | executed to obtain the page of data from the real swap device. | |
186 | ||
187 | So every time the frontswap backend accepts a page, a swap device write | |
188 | and (potentially) a swap device read are replaced by a "frontswap backend | |
165c8aed | 189 | store" and (possibly) a "frontswap backend load", which are presumably much
27c6aec2 DM |
190 | faster. |
191 | ||
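The store and load paths described above can be sketched as a small user-space simulation. It is not the kernel code: the ``sim_*`` names, the fixed page count, and the counters standing in for real disk I/O are assumptions. The bitmap plays the role of the "frontswap_map", and the example backend uses the "every ninth page" policy mentioned in the text.

```c
#include <limits.h>
#include <stdbool.h>

#define SIM_PAGES 64
#define SIM_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SIM_MAP_WORDS ((SIM_PAGES + SIM_BITS_PER_WORD - 1) / SIM_BITS_PER_WORD)

struct sim_swap {
    unsigned long map[SIM_MAP_WORDS];            /* one bit per swap offset */
    bool (*backend_store)(unsigned long offset); /* true if page accepted */
    unsigned long disk_writes;
    unsigned long disk_reads;
};

static void sim_map_set(struct sim_swap *s, unsigned long off)
{
    s->map[off / SIM_BITS_PER_WORD] |= 1UL << (off % SIM_BITS_PER_WORD);
}

static void sim_map_clear(struct sim_swap *s, unsigned long off)
{
    s->map[off / SIM_BITS_PER_WORD] &= ~(1UL << (off % SIM_BITS_PER_WORD));
}

static bool sim_map_test(const struct sim_swap *s, unsigned long off)
{
    return (s->map[off / SIM_BITS_PER_WORD] >> (off % SIM_BITS_PER_WORD)) & 1UL;
}

/* Swap-out: offer the page to the backend first. Returns true if the
 * backend took it, false if it went to the (simulated) swap device. */
bool sim_swap_out(struct sim_swap *s, unsigned long off)
{
    if (s->backend_store && s->backend_store(off)) {
        sim_map_set(s, off);
        return true;
    }
    sim_map_clear(s, off);   /* a rejected store leaves no stale copy */
    s->disk_writes++;
    return false;
}

/* Swap-in: consult the map; only fall back to the device if the bit is clear. */
bool sim_swap_in(struct sim_swap *s, unsigned long off)
{
    if (sim_map_test(s, off))
        return true;         /* page filled from the backend */
    s->disk_reads++;
    return false;
}

/* Example backend policy: accept every ninth page. */
bool sim_accept_every_ninth(unsigned long off)
{
    return off % 9 == 0;
}
```

Swapping out all 64 offsets with ``sim_accept_every_ninth`` as the backend sends 8 pages to the backend and 56 to the simulated device; a later swap-in of any of those 8 offsets avoids the device read as well.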
76b387bd MR |
192 | * Can't frontswap be configured as a "special" swap device that is |
193 | just higher priority than any real swap device (e.g. like zswap, | |
194 | or maybe swap-over-nbd/NFS)? | |
27c6aec2 DM |
195 | |
196 | No. First, the existing swap subsystem doesn't allow for any kind of | |
4e79162a | 197 | swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, |
27c6aec2 DM |
198 | but this would require fairly drastic changes. Even if it were |
199 | rewritten, the existing swap subsystem uses the block I/O layer which | |
200 | assumes a swap device is fixed size and any page in it is linearly | |
201 | addressable. Frontswap barely touches the existing swap subsystem, | |
202 | and works around the constraints of the block I/O subsystem to provide | |
203 | a great deal of flexibility and dynamicity. | |
204 | ||
205 | For example, the acceptance of any swap page by the frontswap backend is | |
206 | entirely unpredictable. This is critical to the definition of frontswap | |
207 | backends because it grants completely dynamic discretion to the | |
208 | backend. In zcache, one cannot know a priori how compressible a page is. | |
209 | "Poorly" compressible pages can be rejected, and "poorly" can itself be | |
210 | defined dynamically depending on current memory constraints. | |
211 | ||
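One way a backend might define "poorly" dynamically is sketched below. The thresholds (75% of a page while at least half the pool is free, 50% otherwise) and the name ``policy_accept`` are invented for illustration and are not taken from zcache.

```c
#include <stdbool.h>
#include <stddef.h>

#define POLICY_PAGE_SIZE 4096

/* Decide whether to accept a page that compressed to compressed_len bytes.
 * The limit tightens once less than half of the pool remains free. */
bool policy_accept(size_t compressed_len, size_t free_bytes, size_t total_bytes)
{
    size_t limit = (free_bytes * 2 >= total_bytes)
                       ? POLICY_PAGE_SIZE * 3 / 4   /* relaxed: up to 3072 bytes */
                       : POLICY_PAGE_SIZE / 2;      /* strict: up to 2048 bytes */

    return compressed_len <= limit && compressed_len <= free_bytes;
}
```

Under this sketch the same 3000-byte compressed page is accepted while the pool is mostly empty and rejected once it is mostly full, which is exactly the kind of discretion a fixed-size block device cannot express.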
212 | Further, frontswap is entirely synchronous whereas a real swap | |
213 | device is, by definition, asynchronous and uses block I/O. The | |
214 | block I/O layer is not only unnecessary, but may perform "optimizations" | |
215 | that are inappropriate for a RAM-oriented device including delaying | |
216 | the write of some pages for a significant amount of time. Synchrony is | |
217 | required to ensure the dynamicity of the backend and to avoid thorny race | |
218 | conditions that would unnecessarily and greatly complicate frontswap | |
165c8aed KRW |
219 | and/or the block I/O subsystem. That said, only the initial "store" |
220 | and "load" operations need be synchronous. A separate asynchronous thread | |
27c6aec2 DM |
221 | is free to manipulate the pages stored by frontswap. For example, |
222 | the "remotification" thread in RAMster uses standard asynchronous | |
223 | kernel sockets to move compressed frontswap pages to a remote machine. | |
224 | Similarly, a KVM guest-side implementation could do in-guest compression | |
225 | and use "batched" hypercalls. | |
226 | ||
227 | In a virtualized environment, the dynamicity allows the hypervisor | |
228 | (or host OS) to do "intelligent overcommit". For example, it can | |
229 | choose to accept pages only until host-swapping might be imminent, | |
230 | then force guests to do their own swapping. | |
231 | ||
232 | There is a downside to the transcendent memory specifications for | |
165c8aed | 233 | frontswap: Since any "store" might fail, there must always be a real |
27c6aec2 DM |
234 | slot on a real swap device to swap the page. Thus frontswap must be |
235 | implemented as a "shadow" to every swapon'd device with the potential | |
236 | capability of holding every page that the swap device might have held | |
237 | and the possibility that it might hold no pages at all. This means | |
238 | that frontswap cannot contain more pages than the total of swapon'd | |
239 | swap devices. For example, if NO swap device is configured on some | |
240 | installation, frontswap is useless. Swapless portable devices | |
241 | can still use frontswap but a backend for such devices must configure | |
242 | some kind of "ghost" swap device and ensure that it is never used. | |
243 | ||
76b387bd MR |
244 | * Why this weird definition about "duplicate stores"? If a page |
245 | has been previously successfully stored, can't it always be | |
246 | successfully overwritten? | |
27c6aec2 DM |
247 | |
248 | Nearly always it can, but no, sometimes it cannot. Consider an example | |
249 | where data is compressed and the original 4K page has been compressed | |
250 | to 1K. Now an attempt is made to overwrite the page with data that | |
251 | is non-compressible and so would take the entire 4K. But the backend | |
165c8aed KRW |
252 | has no more space. In this case, the store must be rejected. Whenever |
253 | frontswap rejects a store that would overwrite, it also must invalidate | |
27c6aec2 DM |
254 | the old data and ensure that it is no longer accessible. Since the |
255 | swap subsystem then writes the new data to the real swap device, | |
256 | this is the correct course of action to ensure coherency. | |
257 | ||
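The duplicate-store rule can be sketched for a backend that tracks a single page and a byte budget. The ``dup_*`` names and the single-page model are assumptions made to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>

struct dup_pool {
    size_t capacity;    /* bytes the backend may hold */
    size_t used;        /* bytes currently held */
    bool present;       /* is a copy of the page stored? */
    size_t stored_len;  /* compressed size of that copy */
};

/* Store (or overwrite) the page with new_len compressed bytes.
 * Returns 0 on success. On failure returns -1 and also invalidates
 * any older copy, so stale data can never be loaded later. */
int dup_store(struct dup_pool *p, size_t new_len)
{
    size_t old_len = p->present ? p->stored_len : 0;
    size_t avail = p->capacity - p->used + old_len;

    if (new_len > avail) {
        p->used -= old_len;      /* reject AND drop the old copy */
        p->present = false;
        p->stored_len = 0;
        return -1;
    }
    p->used = p->used - old_len + new_len;
    p->present = true;
    p->stored_len = new_len;
    return 0;
}
```

With a 2048-byte pool, storing the page at 1024 bytes succeeds; overwriting it with an incompressible 4096-byte version fails, and afterwards the pool holds no copy at all, so the swap device's copy is the only one.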
76b387bd | 258 | * Why does the frontswap patch create the new include file swapfile.h? |
27c6aec2 DM |
259 | |
260 | The frontswap code depends on some swap-subsystem-internal data | |
261 | structures that have, over the years, moved back and forth between | |
262 | static and global. This seemed a reasonable compromise: Define | |
263 | them as global but declare them in a new include file that isn't | |
264 | included by the large number of source files that include swap.h. | |
265 | ||
266 | Dan Magenheimer, last updated April 9, 2012 |