Commit | Line | Data |
---|---|---|
00ac9ba0 NG |
1 | zram: Compressed RAM based block devices |
2 | ---------------------------------------- | |
47f9afb3 | 3 | |
47f9afb3 NG |
4 | * Introduction |
5 | ||
9b9913d8 NG |
6 | The zram module creates RAM based block devices named /dev/zram<id> |
7 | (<id> = 0, 1, ...). Pages written to these disks are compressed and stored | |
8 | in memory itself. These disks allow very fast I/O and compression provides | |
9 | good amounts of memory savings. Some of the use cases include /tmp storage, | |
10 | use as swap disks, various caches under /var and maybe many more :) | |
47f9afb3 | 11 | |
9b9913d8 NG |
12 | Statistics for individual zram devices are exported through sysfs nodes at |
13 | /sys/block/zram<id>/ | |
47f9afb3 NG |
14 | |
15 | * Usage | |
16 | ||
3657c20d SS |
17 | There are several ways to configure and manage zram device(-s): |
18 | a) using zram and zram_control sysfs attributes | |
19 | b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org). | |
20 | ||
21 | In this document we will describe only 'manual' zram configuration steps, | |
22 | IOW, zram and zram_control sysfs attributes. | |
23 | ||
24 | In order to get a better idea about zramctl please consult util-linux | |
25 | documentation, zramctl man-page or `zramctl --help'. Please be informed | |
26 | that zram maintainers do not develop/maintain util-linux or zramctl, should | |
27 | you have any questions please contact util-linux@vger.kernel.org | |
28 | ||
00ac9ba0 | 29 | Following shows a typical sequence of steps for using zram. |
47f9afb3 | 30 | |
3657c20d SS |
31 | WARNING |
32 | ======= | |
33 | For the sake of simplicity we skip error checking parts in most of the | |
34 | examples below. However, it is your sole responsibility to handle errors. | |
35 | ||
36 | zram sysfs attributes always return negative values in case of errors. | |
37 | The list of possible return codes: | |
38 | -EBUSY -- an attempt to modify an attribute that cannot be changed once | |
39 | the device has been initialised. Please reset device first; | |
40 | -ENOMEM -- zram was not able to allocate enough memory to fulfil your | |
41 | needs; | |
42 | -EINVAL -- invalid input has been provided. | |
43 | ||
44 | If you use 'echo', the returned value is translated by the 'echo' utility | |
45 | into its exit status, and, in the general case, something like: | |
46 | ||
47 | echo 3 > /sys/block/zram0/max_comp_streams | |
48 | if [ $? -ne 0 ]; then | |
49 | handle_error | |
50 | fi | |
51 | ||
52 | should suffice. | |
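A minimal sketch of the error check above, wrapped in a reusable function. The name zram_set is hypothetical (it is not part of zram or util-linux), and the sketch only assumes that a failed sysfs write makes the shell's write fail with a non-zero status.

```shell
# zram_set ATTR_PATH VALUE
# Hypothetical helper: write VALUE to a sysfs attribute and report failure.
# Returns 0 on success and 1 if the write was rejected.
zram_set() {
    attr=$1
    value=$2
    if ! printf '%s\n' "$value" > "$attr"; then
        echo "zram_set: failed to write '$value' to $attr" >&2
        return 1
    fi
}
```

With such a helper, a call like "zram_set /sys/block/zram0/max_comp_streams 3 || handle_error" replaces the explicit $? test.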
53 | ||
9b9913d8 | 54 | 1) Load Module: |
00ac9ba0 | 55 | modprobe zram num_devices=4 |
9b9913d8 | 56 | This creates 4 devices: /dev/zram{0,1,2,3} |
c3cdb40e SS |
57 | |
58 | num_devices parameter is optional and tells zram how many devices should be | |
59 | pre-created. Default: 1. | |
47f9afb3 | 60 | |
beca3ec7 | 61 | 2) Set max number of compression streams |
69a30a8d SS |
62 | Regardless of the value passed to this attribute, ZRAM will always |
63 | allocate multiple compression streams - one per online CPU - thus | |
64 | allowing several concurrent compression operations. The number of | |
65 | allocated compression streams goes down when some of the CPUs | |
66 | become offline. There is no single-compression-stream mode anymore, | |
67 | unless you are running a UP system or have only 1 CPU online. | |
68 | ||
69 | To find out how many streams are currently available: | |
beca3ec7 SS |
70 | cat /sys/block/zram0/max_comp_streams |
71 | ||
e46b8a03 | 72 | 3) Select compression algorithm |
69a30a8d SS |
73 | Using the comp_algorithm device attribute one can see the available and |
74 | currently selected (shown in square brackets) compression algorithms, | |
75 | and change the selected compression algorithm (once the device is | |
76 | initialised there is no way to change the compression algorithm). | |
e46b8a03 | 77 | |
69a30a8d | 78 | Examples: |
e46b8a03 SS |
79 | #show supported compression algorithms |
80 | cat /sys/block/zram0/comp_algorithm | |
81 | lzo [lz4] | |
82 | ||
83 | #select lzo compression algorithm | |
84 | echo lzo > /sys/block/zram0/comp_algorithm | |
85 | ||
69a30a8d SS |
86 | For the time being, the `comp_algorithm' content does not necessarily |
87 | show every compression algorithm supported by the kernel. We keep this | |
88 | list primarily to simplify device configuration and one can configure | |
89 | a new device with a compression algorithm that is not listed in | |
90 | `comp_algorithm'. The thing is that, internally, ZRAM uses Crypto API | |
91 | and, if some of the algorithms were built as modules, it's impossible | |
92 | to list all of them using, for instance, /proc/crypto or any other | |
93 | method. This, however, has an advantage of permitting the usage of | |
94 | custom crypto compression modules (implementing S/W or H/W compression). | |
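Since the currently selected algorithm is the one shown in square brackets, a script can extract it from the comp_algorithm output. The following is a sketch only; the function name selected_comp_algorithm is hypothetical, and it assumes the "lzo [lz4]" output format shown in the example above.

```shell
# selected_comp_algorithm
# Hypothetical helper: read comp_algorithm output on stdin and print
# the entry enclosed in square brackets (the selected algorithm).
selected_comp_algorithm() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}
```

It would be used as "selected_comp_algorithm < /sys/block/zram0/comp_algorithm". Nothing is printed if no bracketed entry is present.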
415403be | 95 | |
e46b8a03 | 96 | 4) Set Disksize |
69a30a8d SS |
97 | Set disk size by writing the value to sysfs node 'disksize'. |
98 | The value can be either in bytes or you can use mem suffixes. | |
99 | Examples: | |
100 | # Initialize /dev/zram0 with 50MB disksize | |
101 | echo $((50*1024*1024)) > /sys/block/zram0/disksize | |
0231c403 | 102 | |
69a30a8d SS |
103 | # Using mem suffixes |
104 | echo 256K > /sys/block/zram0/disksize | |
105 | echo 512M > /sys/block/zram0/disksize | |
106 | echo 1G > /sys/block/zram0/disksize | |
47f9afb3 | 107 | |
e64cd51d SS |
108 | Note: |
109 | There is little point creating a zram of greater than twice the size of memory | |
110 | since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the | |
111 | size of the disk when not in use so a huge zram is wasteful. | |
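The sizing note above amounts to two small calculations, sketched here as shell arithmetic. Both function names are hypothetical; the 2:1 ratio and the 0.1% figure are taken from the note and are estimates, not guarantees.

```shell
# zram_max_useful_disksize MEM_BYTES
# Upper bound suggested by the note: twice the size of memory,
# assuming the expected 2:1 compression ratio.
zram_max_useful_disksize() {
    echo $(( $1 * 2 ))
}

# zram_idle_overhead DISKSIZE_BYTES
# Approximate memory used by an unused device: about 0.1% of disksize
# (integer division, so the result is rounded down).
zram_idle_overhead() {
    echo $(( $1 / 1000 ))
}
```

For example, with 4GB of memory the first function gives 8589934592 bytes (8GB) and the second reports roughly 8.5MB of overhead for a device of that size.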
112 | ||
9ada9da9 | 113 | 5) Set memory limit: Optional |
69a30a8d SS |
114 | Set memory limit by writing the value to sysfs node 'mem_limit'. |
115 | The value can be either in bytes or you can use mem suffixes. | |
116 | In addition, you could change the value at runtime. | |
117 | Examples: | |
118 | # limit /dev/zram0 with 50MB memory | |
119 | echo $((50*1024*1024)) > /sys/block/zram0/mem_limit | |
120 | ||
121 | # Using mem suffixes | |
122 | echo 256K > /sys/block/zram0/mem_limit | |
123 | echo 512M > /sys/block/zram0/mem_limit | |
124 | echo 1G > /sys/block/zram0/mem_limit | |
125 | ||
126 | # To disable memory limit | |
127 | echo 0 > /sys/block/zram0/mem_limit | |
9ada9da9 MK |
128 | |
129 | 6) Activate: | |
00ac9ba0 NG |
130 | mkswap /dev/zram0 |
131 | swapon /dev/zram0 | |
132 | ||
133 | mkfs.ext4 /dev/zram1 | |
134 | mount /dev/zram1 /tmp | |
47f9afb3 | 135 | |
6566d1a3 SS |
136 | 7) Add/remove zram devices |
137 | ||
138 | zram provides a control interface, which enables dynamic (on-demand) device | |
139 | addition and removal. | |
140 | ||
141 | In order to add a new /dev/zramX device, perform read operation on hot_add | |
142 | attribute. This will return either new device's device id (meaning that you | |
143 | can use /dev/zram<id>) or error code. | |
144 | ||
145 | Example: | |
146 | cat /sys/class/zram-control/hot_add | |
147 | 1 | |
148 | ||
149 | To remove the existing /dev/zramX device (where X is a device id) | |
150 | execute | |
151 | echo X > /sys/class/zram-control/hot_remove | |
152 | ||
153 | 8) Stats: | |
77ba015f SS |
154 | Per-device statistics are exported as various nodes under /sys/block/zram<id>/ |
155 | ||
3657c20d | 156 | A brief description of exported device attributes. For more details please |
77ba015f SS |
157 | read Documentation/ABI/testing/sysfs-block-zram. |
158 | ||
1d69a3f8 MK |
159 | Name access description |
160 | ---- ------ ----------- | |
161 | disksize RW show and set the device's disk size | |
162 | initstate RO shows the initialization state of the device | |
163 | reset WO trigger device reset | |
164 | mem_used_max WO reset the `mem_used_max' counter (see later) | |
165 | mem_limit WO specifies the maximum amount of memory ZRAM can use | |
166 | to store the compressed data | |
167 | writeback_limit RW specifies the maximum amount of write IO zram can | |
168 | write out to backing device as 4KB unit | |
169 | writeback_limit_enable RW show and set writeback_limit feature | |
170 | max_comp_streams RW the number of possible concurrent compress operations | |
171 | comp_algorithm RW show and change the compression algorithm | |
172 | compact WO trigger memory compaction | |
173 | debug_stat RO this file is used for zram debugging purposes | |
174 | backing_dev RW set up backend storage for zram to write out | |
175 | idle WO mark allocated slot as idle | |
77ba015f | 176 | |
8f7d282c SS |
177 | |
178 | User space is advised to use the following files to read the device statistics. | |
179 | ||
77ba015f SS |
180 | File /sys/block/zram<id>/stat |
181 | ||
182 | Represents block layer statistics. Read Documentation/block/stat.txt for | |
183 | details. | |
47f9afb3 | 184 | |
2f6a3bed SS |
185 | File /sys/block/zram<id>/io_stat |
186 | ||
187 | The stat file represents device's I/O statistics not accounted by block | |
188 | layer and, thus, not available in zram<id>/stat file. It consists of a | |
189 | single line of text and contains the following stats separated by | |
190 | whitespace: | |
c87d1655 SS |
191 | failed_reads the number of failed reads |
192 | failed_writes the number of failed writes | |
193 | invalid_io the number of non-page-size-aligned I/O requests | |
194 | notify_free Depending on device usage scenario it may account | |
195 | a) the number of pages freed because of swap slot free | |
196 | notifications or b) the number of pages freed because of | |
9305455a | 197 | REQ_OP_DISCARD requests sent by bio. The former ones are |
c87d1655 SS |
198 | sent to a swap block device when a swap slot is freed, |
199 | which implies that this disk is being used as a swap disk. | |
200 | The latter ones are sent by filesystem mounted with | |
201 | discard option, whenever some data blocks are getting | |
202 | discarded. | |
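As an illustration of the single-line format described above, the four io_stat fields can be split with the shell's read builtin. This is a sketch; the function name parse_io_stat is hypothetical and the field order is the one listed above.

```shell
# parse_io_stat FILE
# Hypothetical helper: read the single line of an io_stat file and
# print each field as name=value, in the order documented above.
parse_io_stat() {
    read -r failed_reads failed_writes invalid_io notify_free < "$1" || return 1
    echo "failed_reads=$failed_reads"
    echo "failed_writes=$failed_writes"
    echo "invalid_io=$invalid_io"
    echo "notify_free=$notify_free"
}
```

It would be called as "parse_io_stat /sys/block/zram0/io_stat".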
2f6a3bed | 203 | |
4f2109f6 SS |
204 | File /sys/block/zram<id>/mm_stat |
205 | ||
206 | The stat file represents device's mm statistics. It consists of a single | |
207 | line of text and contains the following stats separated by whitespace: | |
c87d1655 | 208 | orig_data_size uncompressed size of data stored in this disk. |
8e19d540 | 209 | This excludes same-element-filled pages (same_pages) since |
210 | no memory is allocated for them. | |
c87d1655 SS |
211 | Unit: bytes |
212 | compr_data_size compressed size of data stored in this disk | |
213 | mem_used_total the amount of memory allocated for this disk. This | |
214 | includes allocator fragmentation and metadata overhead, | |
215 | allocated for this disk. So, allocator space efficiency | |
216 | can be calculated using compr_data_size and this statistic. | |
217 | Unit: bytes | |
218 | mem_limit the maximum amount of memory ZRAM can use to store | |
219 | the compressed data | |
220 | mem_used_max the maximum amount of memory zram has consumed to | |
221 | store the data | |
8e19d540 | 222 | same_pages the number of same element filled pages written to this disk. |
c87d1655 SS |
223 | No memory is allocated for such pages. |
224 | pages_compacted the number of pages freed during compaction | |
89e85bce | 225 | huge_pages the number of incompressible pages |
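The mm_stat fields can be combined into the two ratios mentioned above: the compression ratio (orig_data_size over compr_data_size) and the allocator space efficiency (compr_data_size over mem_used_total). The following awk-based sketch assumes the field order listed above; the function name mm_stat_ratios is hypothetical.

```shell
# mm_stat_ratios
# Hypothetical helper: read one mm_stat line on stdin and print
#   compression_ratio    = orig_data_size  / compr_data_size
#   allocator_efficiency = compr_data_size / mem_used_total
# A ratio is skipped when its denominator is zero.
mm_stat_ratios() {
    awk '{
        if ($2 > 0) printf "compression_ratio=%.2f\n", $1 / $2
        if ($3 > 0) printf "allocator_efficiency=%.2f\n", $2 / $3
    }'
}
```

It would be used as "mm_stat_ratios < /sys/block/zram0/mm_stat".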
4f2109f6 | 226 | |
23eddf39 MK |
227 | File /sys/block/zram<id>/bd_stat |
228 | ||
229 | The stat file represents device's backing device statistics. It consists of | |
230 | a single line of text and contains the following stats separated by whitespace: | |
231 | bd_count size of data written in backing device. | |
232 | Unit: 4K bytes | |
233 | bd_reads the number of reads from backing device | |
234 | Unit: 4K bytes | |
235 | bd_writes the number of writes to backing device | |
236 | Unit: 4K bytes | |
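Because all three bd_stat values are reported in 4K units, converting them to bytes is a multiplication by 4096. A small sketch follows; the function name bd_stat_bytes is hypothetical.

```shell
# bd_stat_bytes
# Hypothetical helper: read one bd_stat line on stdin (bd_count,
# bd_reads, bd_writes in 4K units) and print the same values in bytes.
bd_stat_bytes() {
    read -r bd_count bd_reads bd_writes || return 1
    echo "$((bd_count * 4096)) $((bd_reads * 4096)) $((bd_writes * 4096))"
}
```

It would be used as "bd_stat_bytes < /sys/block/zram0/bd_stat".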
237 | ||
6566d1a3 | 238 | 9) Deactivate: |
00ac9ba0 NG |
239 | swapoff /dev/zram0 |
240 | umount /dev/zram1 | |
47f9afb3 | 241 | |
6566d1a3 | 242 | 10) Reset: |
9b9913d8 NG |
243 | Write any positive value to 'reset' sysfs node |
244 | echo 1 > /sys/block/zram0/reset | |
245 | echo 1 > /sys/block/zram1/reset | |
246 | ||
0231c403 MK |
247 | This frees all the memory allocated for the given device and |
248 | resets the disksize to zero. You must set the disksize again | |
249 | before reusing the device. | |
47f9afb3 | 250 | |
5a47074f MK |
251 | * Optional Feature |
252 | ||
253 | = writeback | |
254 | ||
a939888e | 255 | With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages |
5a47074f | 256 | to backing storage rather than keeping them in memory. |
a939888e MK |
257 | To use the feature, admin should set up backing device via |
258 | ||
259 | "echo /dev/sda5 > /sys/block/zramX/backing_dev" | |
260 | ||
261 | before disksize setting. Only partitions are supported at this moment. | |
262 | If the admin wants to use incompressible page writeback, they could do so via | |
263 | ||
264 | "echo huge > /sys/block/zramX/writeback" | |
265 | ||
266 | To use idle page writeback, first, the user needs to declare zram pages | |
267 | as idle. | |
268 | ||
269 | "echo all > /sys/block/zramX/idle" | |
270 | ||
271 | From now on, any pages on zram are idle pages. The idle mark | |
272 | remains until someone requests access to the block. | |
273 | IOW, unless there is an access request, those pages are still idle pages. | |
274 | ||
275 | Admin can request writeback of those idle pages at the right time via | |
276 | ||
277 | "echo idle > /sys/block/zramX/writeback" | |
278 | ||
279 | With the command, zram writes back idle pages from memory to the storage. | |
5a47074f | 280 | |
bb416d18 MK |
281 | If there is a lot of write IO to a flash device, it can potentially cause |
282 | flash wearout, so the admin needs to design a write limitation | |
283 | to guarantee storage health for the entire product life. | |
1d69a3f8 MK |
284 | |
285 | To address this concern, zram supports the "writeback_limit" feature. | |
286 | The default value of "writeback_limit_enable" is 0, so it doesn't limit | |
287 | any writeback. IOW, if the admin wants to apply a writeback budget, he should | |
288 | enable writeback_limit_enable via | |
289 | ||
290 | $ echo 1 > /sys/block/zramX/writeback_limit_enable | |
291 | ||
292 | Once writeback_limit_enable is set, zram doesn't allow any writeback | |
293 | until the admin sets the budget via /sys/block/zramX/writeback_limit. | |
294 | ||
295 | (If the admin doesn't enable writeback_limit_enable, the value | |
296 | assigned via /sys/block/zramX/writeback_limit is meaningless.) | |
bb416d18 MK |
297 | |
298 | If the admin wants to limit writeback to 400M per day, he could do it | |
299 | like below. | |
300 | ||
1d69a3f8 MK |
301 | $ MB_SHIFT=20 |
302 | $ PAGE_4K_SHIFT=12 | |
303 | $ echo $((400<<MB_SHIFT>>PAGE_4K_SHIFT)) > \ | |
304 | /sys/block/zram0/writeback_limit | |
305 | $ echo 1 > /sys/block/zram0/writeback_limit_enable | |
bb416d18 | 306 | |
1d69a3f8 MK |
307 | If the admin wants to allow further writes once the budget is exhausted, |
308 | he could do it like below | |
bb416d18 | 309 | |
1d69a3f8 MK |
310 | $ echo $((400<<MB_SHIFT>>PAGE_4K_SHIFT)) > \ |
311 | /sys/block/zram0/writeback_limit | |
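The budget arithmetic used above (megabytes converted to 4KB units) can be sketched as a small helper. The function name mb_to_4k_units and its local variable names are hypothetical; a shell variable name must not begin with a digit, so the 4K shift is stored under a different name here.

```shell
# mb_to_4k_units MEGABYTES
# Hypothetical helper: convert a budget in MB to the number of 4KB
# units expected by writeback_limit: (MB << 20) >> 12.
mb_to_4k_units() {
    mb_shift=20
    page_4k_shift=12
    echo $(( ($1 << mb_shift) >> page_4k_shift ))
}
```

A 400M budget is then 102400 units, so "echo $(mb_to_4k_units 400) > /sys/block/zram0/writeback_limit" sets the same value as the example above.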
bb416d18 MK |
312 | |
313 | If the admin wants to see the remaining writeback budget, | |
314 | ||
1d69a3f8 MK |
315 | $ cat /sys/block/zramX/writeback_limit |
316 | ||
317 | If the admin wants to disable the writeback limit, he could do | |
318 | ||
319 | $ echo 0 > /sys/block/zramX/writeback_limit_enable | |
bb416d18 MK |
320 | |
321 | The writeback_limit count will reset whenever you reset zram (e.g., | |
322 | system reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how | |
323 | much writeback happened before the reset, in order to allocate extra | |
324 | writeback budget in the next setting, is the user's job. | |
325 | ||
1d69a3f8 MK |
326 | If the admin wants to measure the writeback count over a certain period, he |
327 | can read it from the 3rd column of /sys/block/zram0/bd_stat. | |
328 | ||
c0265342 MK |
329 | = memory tracking |
330 | ||
331 | With CONFIG_ZRAM_MEMORY_TRACKING, the user can get information about the | |
332 | zram blocks. It could be useful to catch cold or incompressible | |
333 | pages of a process with pagemap. | |
334 | If you enable the feature, you can see the block state via | |
335 | /sys/kernel/debug/zram/zram0/block_state. The output is as follows, | |
336 | ||
e82592c4 MK |
337 | 300 75.033841 .wh. |
338 | 301 63.806904 s... | |
339 | 302 63.806919 ..hi | |
c0265342 MK |
340 | |
341 | First column is zram's block index. | |
342 | Second column is access time since the system was booted | |
343 | Third column is state of the block. | |
344 | (s: same page | |
345 | w: written page to backing store | |
e82592c4 MK |
346 | h: huge page |
347 | i: idle page) | |
c0265342 MK |
348 | |
349 | The first line of the above example says the 300th block was accessed at | |
350 | 75.033841sec and the block's state is huge, so it was written back to the | |
351 | backing storage. It's a debugging feature, so no one should rely on it to | |
352 | work properly. | |
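To illustrate the three-column block_state format, the following sketch lists the indexes of blocks whose state column contains a given flag (s, w, h or i). The function name blocks_with_state is hypothetical, and the sketch assumes the exact column layout shown in the example output above.

```shell
# blocks_with_state FLAG
# Hypothetical helper: read block_state output on stdin and print the
# block index (1st column) of every line whose state (3rd column)
# contains FLAG, e.g. "h" for huge or "i" for idle.
blocks_with_state() {
    awk -v flag="$1" 'index($3, flag) { print $1 }'
}
```

It would be used as "blocks_with_state i < /sys/kernel/debug/zram/zram0/block_state". Since this is a debugging interface, the format should not be relied upon.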
353 | ||
47f9afb3 NG |
354 | Nitin Gupta |
355 | ngupta@vflare.org |