========
zsmalloc
========

This allocator is designed for use with zram. Thus, the allocator is
supposed to work well under low memory conditions. In particular, it
never attempts higher-order page allocations, which are very likely to
fail under memory pressure. On the other hand, if we just use single
(0-order) pages, it would suffer from very high fragmentation --
any object of size PAGE_SIZE/2 or larger would occupy an entire page.
This was one of the major issues with its predecessor (xvmalloc).

To overcome these issues, zsmalloc allocates a bunch of 0-order pages
and links them together using various 'struct page' fields. These linked
pages act as a single higher-order page, i.e. an object can span 0-order
page boundaries. The code refers to these linked pages as a single entity
called zspage.

For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
since this satisfies the requirements of all its current users (in the
worst case, the page is incompressible and is thus stored "as-is", i.e. in
uncompressed form). For allocation requests larger than this size, failure
is returned (see zs_malloc()).

Additionally, zs_malloc() does not return a dereferenceable pointer.
Instead, it returns an opaque handle (unsigned long) which encodes the actual
location of the allocated object. The reason for this indirection is that
zsmalloc does not keep zspages permanently mapped since that would cause
issues on 32-bit systems where the VA region for kernel space mappings
is very small. So, before using the allocated memory, the object has to
be mapped using zs_map_object() to get a usable pointer and subsequently
unmapped using zs_unmap_object().

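To illustrate what "an opaque handle which encodes the location" means, here is a toy user-space model that packs a page frame number and an object index into one unsigned long and unpacks them again. The names toy_encode() and toy_decode() and the 12-bit split are hypothetical, chosen only for this sketch; zsmalloc's real handle format is an internal detail and differs from this.

.. code-block:: c

    #include <assert.h>

    /*
     * Hypothetical layout, for illustration only: the low 12 bits hold
     * an object index, the remaining bits hold a page frame number.
     * This is NOT zsmalloc's actual encoding.
     */
    #define TOY_OBJ_INDEX_BITS 12
    #define TOY_OBJ_INDEX_MASK ((1UL << TOY_OBJ_INDEX_BITS) - 1)

    /* Pack a location into a single opaque value. */
    static unsigned long toy_encode(unsigned long pfn, unsigned int obj_idx)
    {
        assert(obj_idx <= TOY_OBJ_INDEX_MASK);
        return (pfn << TOY_OBJ_INDEX_BITS) | obj_idx;
    }

    /* Unpack it again; only after this could the page be mapped. */
    static void toy_decode(unsigned long handle, unsigned long *pfn,
                           unsigned int *obj_idx)
    {
        *pfn = handle >> TOY_OBJ_INDEX_BITS;
        *obj_idx = (unsigned int)(handle & TOY_OBJ_INDEX_MASK);
    }

The handle is never dereferenced by the caller; in zsmalloc, users only pass it back to the allocator (for example to zs_map_object() or zs_free()), which performs the translation internally.
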
stat
====

With CONFIG_ZSMALLOC_STAT, one can see zsmalloc internal information via
``/sys/kernel/debug/zsmalloc/<user name>``. Here is a sample of stat output::

  # cat /sys/kernel/debug/zsmalloc/zram0/classes

  class  size   10%   20%   30%   40%   50%   60%   70%   80%   90%   99%  100% obj_allocated  obj_used pages_used pages_per_zspage freeable
  ...
  ...
     30   512     0    12     4     1     0     1     0     0     1     0   414          3464      3346        433                1       14
     31   528     2     7     2     2     1     0     1     0     0     2   117          4154      3793        536                4       44
     32   544     6     3     4     1     2     1     0     0     0     1   260          4170      3965        556                2       26
  ...
  ...

class
	index
size
	size of the objects that the class's zspages store
10%
	the number of zspages with usage ratio less than 10% (see below)
20%
	the number of zspages with usage ratio between 10% and 20%
30%
	the number of zspages with usage ratio between 20% and 30%
40%
	the number of zspages with usage ratio between 30% and 40%
50%
	the number of zspages with usage ratio between 40% and 50%
60%
	the number of zspages with usage ratio between 50% and 60%
70%
	the number of zspages with usage ratio between 60% and 70%
80%
	the number of zspages with usage ratio between 70% and 80%
90%
	the number of zspages with usage ratio between 80% and 90%
99%
	the number of zspages with usage ratio between 90% and 99%
100%
	the number of zspages with usage ratio 100%
obj_allocated
	the number of objects allocated (object slots in the class's zspages, used or not)
obj_used
	the number of objects allocated to the user
pages_used
	the number of pages allocated for the class
pages_per_zspage
	the number of 0-order pages that make up a zspage
freeable
	the approximate number of pages that class compaction can free

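The obj_allocated, pages_used and freeable columns are tied together by the number of objects a zspage can hold. The sketch below assumes 4 KiB pages and recomputes pages_used and freeable for rows 30-32 of the sample output above. The freeable formula (unused object slots rounded down to whole zspages, then converted to pages) is our reading of how compaction estimates its gain, not a copy of kernel code; the helper names are hypothetical.

.. code-block:: c

    #include <assert.h>

    #define TOY_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

    /* Objects of @size bytes that fit in a zspage of @pages_per_zspage pages. */
    static unsigned long objs_per_zspage(unsigned long size,
                                         unsigned long pages_per_zspage)
    {
        assert(size > 0 && size <= TOY_PAGE_SIZE);
        return pages_per_zspage * TOY_PAGE_SIZE / size;
    }

    /* zspages are allocated whole, so obj_allocated is a multiple of objs. */
    static unsigned long pages_used(unsigned long size,
                                    unsigned long pages_per_zspage,
                                    unsigned long obj_allocated)
    {
        return obj_allocated / objs_per_zspage(size, pages_per_zspage) *
               pages_per_zspage;
    }

    /* Unused slots, rounded down to whole zspages, expressed in pages. */
    static unsigned long freeable(unsigned long size,
                                  unsigned long pages_per_zspage,
                                  unsigned long obj_allocated,
                                  unsigned long obj_used)
    {
        if (obj_allocated <= obj_used)
            return 0;
        return (obj_allocated - obj_used) /
               objs_per_zspage(size, pages_per_zspage) * pages_per_zspage;
    }

For class 30 (512-byte objects, one page per zspage), 8 objects fit per zspage: 3464 / 8 = 433 pages used, and (3464 - 3346) / 8 = 14 freeable pages, matching the sample.
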
Each zspage maintains an inuse counter which keeps track of the number of
objects stored in the zspage. The inuse counter determines the zspage's
"fullness group", which is calculated as the ratio of the "inuse" objects to
the total number of objects the zspage can hold (objs_per_zspage). The
closer the inuse counter is to objs_per_zspage, the better.

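The mapping from a zspage's inuse counter to the percentage columns of the classes file can be sketched as follows. It is modelled on our understanding of the kernel's fullness-group calculation (an integer percentage, then one bucket per 10%, with separate buckets for empty and completely full zspages); the function and array names are hypothetical.

.. code-block:: c

    #include <assert.h>

    /* Column labels of the classes file, indexed by fullness bucket. */
    static const char *const toy_fullness_label[] = {
        "0%", "10%", "20%", "30%", "40%", "50%",
        "60%", "70%", "80%", "90%", "99%", "100%",
    };

    /*
     * Bucket a zspage by how many of its objs_per_zspage slots are in use.
     * Bucket 0 (empty zspage) has no column in the stat output.
     */
    static int toy_fullness_bucket(int inuse, int objs_per_zspage)
    {
        int ratio;

        assert(objs_per_zspage > 0);
        assert(inuse >= 0 && inuse <= objs_per_zspage);
        if (inuse == 0)
            return 0;
        if (inuse == objs_per_zspage)
            return 11;
        ratio = 100 * inuse / objs_per_zspage;
        /* +1 keeps e.g. 1 of 127 objects (ratio 0) out of the empty bucket. */
        return ratio / 10 + 1;
    }

For example, a zspage of class 31 (31 objects per zspage) with 30 objects in use has a ratio of 96 and is counted in the "99%" column.
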
Internals
=========

zsmalloc has 255 size classes, each of which can hold a number of zspages.
Each zspage can contain up to ZSMALLOC_CHAIN_SIZE physical (0-order) pages.
The optimal zspage chain size for each size class is calculated during the
creation of the zsmalloc pool (see calculate_zspage_chain_size()).

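How a request size maps onto one of the 255 classes can be sketched as below. The constants are assumptions for a typical configuration with 4 KiB pages (smallest class 32 bytes, classes 16 bytes apart); they are used here because they reproduce the class numbers quoted in this document (#96 holds 1568-byte objects, #254 holds 4096-byte objects). The sketch ignores any per-object metadata zsmalloc may add and the class merging described next; the names are hypothetical.

.. code-block:: c

    #include <assert.h>

    /* Assumed values for 4 KiB pages; they are not read from a kernel. */
    #define TOY_PAGE_SIZE      4096
    #define TOY_MIN_ALLOC_SIZE 32
    #define TOY_CLASS_DELTA    16
    #define TOY_NR_CLASSES \
        ((TOY_PAGE_SIZE - TOY_MIN_ALLOC_SIZE) / TOY_CLASS_DELTA + 1)

    /* Smallest class whose object size is >= @size (rounding up). */
    static int toy_class_index(int size)
    {
        int idx = 0;

        assert(size > 0 && size <= TOY_PAGE_SIZE);
        if (size > TOY_MIN_ALLOC_SIZE)
            idx = (size - TOY_MIN_ALLOC_SIZE + TOY_CLASS_DELTA - 1) /
                  TOY_CLASS_DELTA;
        return idx;
    }

    /* Object size stored by class @idx. */
    static int toy_class_size(int idx)
    {
        return TOY_MIN_ALLOC_SIZE + idx * TOY_CLASS_DELTA;
    }

With these assumptions (4096 - 32) / 16 + 1 = 255 classes, and a 1570-byte request rounds up to class #97 (1584-byte slots).
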
As an optimization, zsmalloc merges size classes that have similar
characteristics in terms of the number of pages per zspage and the number
of objects that each zspage can store.

For instance, consider the following size classes::

  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
     94  1536         0   ....       0             0          0          0                3        0
    100  1632         0   ....       0             0          0          0                2        0
  ...

Size classes #95-99 are merged with size class #100. This means that when we
need to store an object of size, say, 1568 bytes, we end up using size class
#100 instead of size class #96. Size class #100 is meant for objects of size
1632 bytes, so each object of size 1568 bytes wastes 1632-1568=64 bytes.

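The merge rule can be sketched as a single pass from the largest class down to the smallest: a class is folded into the next larger surviving class when both use the same number of pages per zspage and store the same number of objects per zspage. The sketch reuses the assumed 4 KiB page, 32-byte minimum and 16-byte spacing, takes the chain limit as a parameter, and picks each class's chain length by least wasted bytes (the criterion traced for calculate_zspage_chain_size() below). All names are hypothetical.

.. code-block:: c

    #include <assert.h>

    #define TOY_PAGE_SIZE      4096
    #define TOY_MIN_ALLOC_SIZE 32
    #define TOY_CLASS_DELTA    16
    #define TOY_NR_CLASSES \
        ((TOY_PAGE_SIZE - TOY_MIN_ALLOC_SIZE) / TOY_CLASS_DELTA + 1)

    /* Chain length in [1, max_chain] that wastes the fewest bytes. */
    static int toy_chain_size(int size, int max_chain)
    {
        int best = 1, min_waste = TOY_PAGE_SIZE + 1;

        for (int i = 1; i <= max_chain; i++) {
            int waste = (i * TOY_PAGE_SIZE) % size;

            if (waste < min_waste) {  /* strict: ties keep the shorter chain */
                min_waste = waste;
                best = i;
            }
        }
        return best;
    }

    /*
     * Fill used[i] with the class that actually serves class i, merging a
     * class into the next larger surviving one when pages per zspage and
     * objects per zspage both match.
     */
    static void toy_merge_classes(int max_chain, int used[TOY_NR_CLASSES])
    {
        int prev = -1, prev_pages = 0, prev_objs = 0;

        for (int i = TOY_NR_CLASSES - 1; i >= 0; i--) {
            int size = TOY_MIN_ALLOC_SIZE + i * TOY_CLASS_DELTA;
            int pages = toy_chain_size(size, max_chain);
            int objs = pages * TOY_PAGE_SIZE / size;

            if (prev >= 0 && pages == prev_pages && objs == prev_objs) {
                used[i] = prev;  /* merged: reuse the larger class */
                continue;
            }
            used[i] = i;         /* a new, distinct size class */
            prev = i;
            prev_pages = pages;
            prev_objs = objs;
        }
    }

With a chain limit of 4, classes #95-100 all end up with 2 pages and 5 objects per zspage, so they resolve to class #100, while class #94 (3 pages, 8 objects) stays separate.
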
Size class #100 consists of zspages with 2 physical pages each, which can
hold a total of 5 objects. If we need to store 13 objects of size 1568, we
end up allocating three zspages, or 6 physical pages.

However, if we take a closer look at size class #96 (which is meant for
objects of size 1568 bytes) and trace `calculate_zspage_chain_size()`, we
find that the optimal zspage configuration for this class is a chain
of 5 physical pages::

  pages per zspage   wasted bytes   used%
                 1            960      76
                 2            352      95
                 3           1312      89
                 4            704      95
                 5             96      99

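The trace above can be reproduced with a short user-space sketch: for each candidate chain length it computes the bytes left over after packing as many whole objects as fit, and keeps the length with the least waste. It assumes 4 KiB pages and mirrors the selection rule as described here rather than copying calculate_zspage_chain_size(); the function names are hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>

    #define TOY_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

    /* Bytes left over when @pages pages are filled with @size-byte objects. */
    static int toy_waste(int size, int pages)
    {
        return (pages * TOY_PAGE_SIZE) % size;
    }

    /* Percentage of the chain occupied by whole objects, rounded down. */
    static int toy_used_pct(int size, int pages)
    {
        int total = pages * TOY_PAGE_SIZE;

        return (total - toy_waste(size, pages)) * 100 / total;
    }

    /* Chain length in [1, max_chain] with the least waste (ties: shorter). */
    static int toy_chain_size(int size, int max_chain)
    {
        int best = 1;

        assert(size > 0 && max_chain >= 1);
        for (int i = 2; i <= max_chain; i++)
            if (toy_waste(size, i) < toy_waste(size, best))
                best = i;
        return best;
    }

    /* Print a table in the same shape as the one above. */
    static void toy_print_trace(int size, int max_chain)
    {
        printf("pages per zspage  wasted bytes  used%%\n");
        for (int i = 1; i <= max_chain; i++)
            printf("%16d  %12d  %5d\n", i, toy_waste(size, i),
                   toy_used_pct(size, i));
    }

Calling toy_print_trace(1568, 5) prints the same wasted-bytes and used% values as the table above. With a limit of 4 pages the least-waste chain for 1568-byte objects is only 2 pages, which is why class #96 can merge into class #100 in that configuration.
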
This means that a class #96 configuration with 5 physical pages can store 13
objects of size 1568 in a single zspage, using a total of 5 physical pages.
This is more efficient than the class #100 configuration, which would use 6
physical pages to store the same number of objects.

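The page counts in this comparison follow from rounding the number of zspages up: a zspage holds pages_per_zspage * PAGE_SIZE / size objects, and zspages are allocated whole. A minimal sketch, assuming 4 KiB pages (the helper name is hypothetical):

.. code-block:: c

    #include <assert.h>

    #define TOY_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

    /* Physical pages needed to store @nr_objs objects in a class layout. */
    static unsigned long toy_pages_needed(unsigned long nr_objs,
                                          unsigned long class_size,
                                          unsigned long pages_per_zspage)
    {
        unsigned long objs, zspages;

        objs = pages_per_zspage * TOY_PAGE_SIZE / class_size;
        assert(objs > 0);
        zspages = (nr_objs + objs - 1) / objs;  /* round up to whole zspages */
        return zspages * pages_per_zspage;
    }

Class #100 (1632-byte slots, 2 pages per zspage) needs 6 pages for 13 objects, while class #96 with a 5-page chain needs 5; a 14th object would push the latter to a second zspage and 10 pages.
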
As the zspage chain size for class #96 increases, its key characteristics,
such as pages per zspage and objects per zspage, also change. This leads to
fewer class mergers, resulting in more distinct, better-fitting size classes,
which reduces memory wastage.

Let's take a closer look at the bottom of `/sys/kernel/debug/zsmalloc/zramX/classes`::

  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
    202  3264         0   ....       0             0          0          0                4        0
    254  4096         0   ....       0             0          0          0                1        0
  ...

Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages
per zspage. Any object larger than 3264 bytes is considered huge and belongs
to size class #254, which stores each object in its own physical page (objects
in huge classes do not share pages).

Increasing the zspage chain size also results in a higher watermark
for the huge size class and fewer huge classes overall. This allows for more
efficient storage of large objects.

For a zspage chain size of 8, the huge class watermark becomes 3632 bytes::

  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
    202  3264         0   ....       0             0          0          0                4        0
    211  3408         0   ....       0             0          0          0                5        0
    217  3504         0   ....       0             0          0          0                6        0
    222  3584         0   ....       0             0          0          0                7        0
    225  3632         0   ....       0             0          0          0                8        0
    254  4096         0   ....       0             0          0          0                1        0
  ...

For a zspage chain size of 16, the huge class watermark becomes 3840 bytes::

  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
    202  3264         0   ....       0             0          0          0                4        0
    206  3328         0   ....       0             0          0          0               13        0
    207  3344         0   ....       0             0          0          0                9        0
    208  3360         0   ....       0             0          0          0               14        0
    211  3408         0   ....       0             0          0          0                5        0
    212  3424         0   ....       0             0          0          0               16        0
    214  3456         0   ....       0             0          0          0               11        0
    217  3504         0   ....       0             0          0          0                6        0
    219  3536         0   ....       0             0          0          0               13        0
    222  3584         0   ....       0             0          0          0                7        0
    223  3600         0   ....       0             0          0          0               15        0
    225  3632         0   ....       0             0          0          0                8        0
    228  3680         0   ....       0             0          0          0                9        0
    230  3712         0   ....       0             0          0          0               10        0
    232  3744         0   ....       0             0          0          0               11        0
    234  3776         0   ....       0             0          0          0               12        0
    235  3792         0   ....       0             0          0          0               13        0
    236  3808         0   ....       0             0          0          0               14        0
    238  3840         0   ....       0             0          0          0               15        0
    254  4096         0   ....       0             0          0          0                1        0
  ...

Overall, the combined effect of the zspage chain size on the zsmalloc pool
configuration is::

  pages per zspage   number of size classes (clusters)   huge size class watermark
                 4                                  69                        3264
                 5                                  86                        3408
                 6                                  93                        3504
                 7                                 112                        3584
                 8                                 123                        3632
                 9                                 140                        3680
                10                                 143                        3712
                11                                 159                        3744
                12                                 164                        3776
                13                                 180                        3792
                14                                 183                        3808
                15                                 188                        3840
                16                                 191                        3840

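The huge class watermark column can be recomputed from the chain limit alone: it is the largest class size for which the least-waste chain still holds more than one object. The sketch assumes 4 KiB pages and the 32-byte/16-byte class spacing used throughout this document, and selects chain lengths by least waste; its names are hypothetical, and it does not attempt to reproduce the cluster-count column.

.. code-block:: c

    #include <assert.h>

    #define TOY_PAGE_SIZE      4096  /* assumption: 4 KiB pages */
    #define TOY_MIN_ALLOC_SIZE 32
    #define TOY_CLASS_DELTA    16

    /* Chain length in [1, max_chain] with the least waste (ties: shorter). */
    static int toy_chain_size(int size, int max_chain)
    {
        int best = 1;

        for (int i = 2; i <= max_chain; i++)
            if ((i * TOY_PAGE_SIZE) % size < (best * TOY_PAGE_SIZE) % size)
                best = i;
        return best;
    }

    /*
     * Largest class size whose zspage stores more than one object; anything
     * bigger ends up in a huge class (one object per physical page).
     */
    static int toy_huge_watermark(int max_chain)
    {
        for (int size = TOY_PAGE_SIZE; size >= TOY_MIN_ALLOC_SIZE;
             size -= TOY_CLASS_DELTA) {
            int pages = toy_chain_size(size, max_chain);

            if (pages * TOY_PAGE_SIZE / size > 1)
                return size;
        }
        return 0;
    }

For limits of 4, 8 and 16 pages this returns 3264, 3632 and 3840 respectively, the same watermarks as in the table.
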
A synthetic test
----------------

zram used as storage for build artifacts (Linux kernel compilation).

* `CONFIG_ZSMALLOC_CHAIN_SIZE=4`

zsmalloc classes stats::

  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
  Total              13   ....      51        413836     412973     159955                         3

zram mm_stat::

  1691783168 628083717 655175680 0 655175680 60 0 34048 34049

* `CONFIG_ZSMALLOC_CHAIN_SIZE=8`

zsmalloc classes stats::

  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
  Total              18   ....      87        414852     412978     156666                         0

zram mm_stat::

  1691803648 627793930 641703936 0 641703936 60 0 33591 33591

Using larger zspage chains may result in using fewer physical pages, as seen
in the example where the number of physical pages used decreased from 159955
to 156666; at the same time, the maximum zsmalloc pool memory usage
(mem_used_max in mm_stat) went down from 655175680 to 641703936 bytes.

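Both figures can be read off the raw output above. Assuming the mm_stat field order documented for zram (orig_data_size, compr_data_size, mem_used_total, mem_limit, mem_used_max, same_pages, pages_compacted, huge_pages, huge_pages_since), the fifth field is the peak pool size in bytes, and dividing it by a 4 KiB page size gives the page counts quoted here. The helper name below is hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>

    #define TOY_PAGE_SIZE 4096ULL  /* assumption: 4 KiB pages */

    /*
     * Extract mem_used_max (5th field) from one line of zram's mm_stat.
     * Returns 0 on success, -1 if the line has fewer than five fields.
     */
    static int toy_mem_used_max(const char *line, unsigned long long *bytes)
    {
        unsigned long long f[5];

        assert(line != NULL && bytes != NULL);
        if (sscanf(line, "%llu %llu %llu %llu %llu",
                   &f[0], &f[1], &f[2], &f[3], &f[4]) != 5)
            return -1;
        *bytes = f[4];
        return 0;
    }

Both values divide exactly by the page size: 655175680 / 4096 = 159955 and 641703936 / 4096 = 156666, a saving of 3289 pages (13471744 bytes).
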
However, this advantage may be offset by increased system memory pressure
(as some zspages have larger chain sizes) in cases where there is heavy
internal fragmentation and pool compaction is unable to relocate objects and
release zspages. In these cases, it is recommended to decrease the limit on
the size of the zspage chains (as specified by the CONFIG_ZSMALLOC_CHAIN_SIZE
option).

Functions
=========

.. kernel-doc:: mm/zsmalloc.c