Commit | Line | Data |
---|---|---|
c6b4fcba JT |
1 | Introduction |
2 | ============ | |
3 | ||
4 | dm-cache is a device mapper target written by Joe Thornber, Heinz | |
5 | Mauelshagen, and Mike Snitzer. | |
6 | ||
7 | It aims to improve performance of a block device (e.g., a spindle) by | |
8 | dynamically migrating some of its data to a faster, smaller device | |
9 | (e.g., an SSD). | |
10 | ||
11 | This device-mapper solution allows us to insert this caching at | |
12 | different levels of the dm stack, for instance above the data device for | |
13 | a thin-provisioning pool. Caching solutions that are integrated more | |
14 | closely with the virtual memory system should give better performance. | |
15 | ||
16 | The target reuses the metadata library used in the thin-provisioning | |
17 | target. | |
18 | ||
19 | The decision as to what data to migrate and when is left to a plug-in | |
20 | policy module. Several of these have been written as we experiment, | |
21 | and we hope other people will contribute others for specific I/O | |
22 | scenarios (e.g. a VM image server). | |
23 | ||
24 | Glossary | |
25 | ======== | |
26 | ||
27 | Migration - Movement of the primary copy of a logical block from one | |
28 | device to the other. | |
29 | Promotion - Migration from slow device to fast device. | |
30 | Demotion - Migration from fast device to slow device. | |
31 | ||
32 | The origin device always contains a copy of the logical block, which | |
33 | may be out of date or kept in sync with the copy on the cache device | |
34 | (depending on policy). | |
35 | ||
36 | Design | |
37 | ====== | |
38 | ||
39 | Sub-devices | |
40 | ----------- | |
41 | ||
42 | The target is constructed by passing three devices to it (along with | |
43 | other parameters detailed later): | |
44 | ||
45 | 1. An origin device - the big, slow one. | |
46 | ||
47 | 2. A cache device - the small, fast one. | |
48 | ||
49 | 3. A small metadata device - records which blocks are in the cache, | |
50 | which are dirty, and extra hints for use by the policy object. | |
51 | This information could be put on the cache device, but having it | |
52 | separate allows the volume manager to configure it differently, | |
66bb2644 MS |
53 | e.g. as a mirror for extra robustness. This metadata device may only |
54 | be used by a single cache device. | |
c6b4fcba JT |
55 | |
56 | Fixed block size | |
57 | ---------------- | |
58 | ||
59 | The origin is divided up into blocks of a fixed size. This block size | |
60 | is configurable when you first create the cache. Typically we've been | |
05473044 | 61 | using block sizes of 256KB - 1024KB. The block size must be between 64 |
1346638e | 62 | sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB). |
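The block size limits above can be checked before a table is loaded. The following is a minimal sketch only; the function name check_block_size and the constants are illustrative and not part of dm-cache, and 512-byte sectors are assumed.

```python
SECTOR_BYTES = 512            # assumption: device-mapper sectors are 512 bytes
MIN_BLOCK_SECTORS = 64        # 32KB, also the required multiple
MAX_BLOCK_SECTORS = 2097152   # 1GB


def check_block_size(sectors):
    """Return the block size in bytes, or raise ValueError if it breaks the limits."""
    if not MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS:
        raise ValueError("block size must be between 64 and 2097152 sectors")
    if sectors % MIN_BLOCK_SECTORS != 0:
        raise ValueError("block size must be a multiple of 64 sectors")
    return sectors * SECTOR_BYTES
```

For instance, check_block_size(512) accepts a 256KB block, while check_block_size(100) is rejected because 100 is not a multiple of 64.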
c6b4fcba JT |
63 | |
64 | Having a fixed block size simplifies the target a lot. But it is | |
65 | something of a compromise. For instance, a small part of a block may be | |
66 | getting hit a lot, yet the whole block will be promoted to the cache. | |
67 | So large block sizes are bad because they waste cache space. And small | |
68 | block sizes are bad because they increase the amount of metadata (both | |
69 | in core and on disk). | |
70 | ||
2ee57d58 JT |
71 | Cache operating modes |
72 | --------------------- | |
c6b4fcba | 73 | |
2ee57d58 JT |
74 | The cache has three operating modes: writeback, writethrough and |
75 | passthrough. | |
c6b4fcba JT |
76 | |
77 | If writeback, the default, is selected then a write to a block that is | |
78 | cached will go only to the cache and the block will be marked dirty in | |
79 | the metadata. | |
80 | ||
81 | If writethrough is selected then a write to a cached block will not | |
82 | complete until it has hit both the origin and cache devices. Clean | |
83 | blocks should remain clean. | |
84 | ||
2ee57d58 JT |
85 | If passthrough is selected, useful when the cache contents are not known |
86 | to be coherent with the origin device, then all reads are served from | |
87 | the origin device (all reads miss the cache) and all writes are | |
88 | forwarded to the origin device; additionally, write hits cause cache | |
7b6b2bc9 MS |
89 | block invalidates. To enable passthrough mode the cache must be clean. |
90 | Passthrough mode allows a cache device to be activated without having to | |
91 | worry about coherency. Coherency that exists is maintained, although | |
92 | the cache will gradually cool as writes take place. If the coherency of | |
93 | the cache can later be verified, or established through use of the | |
94 | "invalidate_cblocks" message, the cache device can be transitioned to | |
95 | writethrough or writeback mode while still warm. Otherwise, the cache | |
96 | contents can be discarded prior to transitioning to the desired | |
97 | operating mode. | |
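As a rough summary of the three modes, the toy model below records where a write to an already cached block ends up and what happens to that cache block. It is a sketch of the descriptions above, not of the kernel code; the function name and the returned dictionary are made up for illustration, and the block is assumed to be clean before the write.

```python
def write_to_cached_block(mode):
    """Model a write hit on a clean cached block in the given operating mode."""
    if mode == "writeback":
        # Only the cache is written; the block is marked dirty in the metadata.
        return {"written_to": ["cache"], "block_after": "dirty"}
    if mode == "writethrough":
        # The write completes only after hitting both devices; the block stays clean.
        return {"written_to": ["origin", "cache"], "block_after": "clean"}
    if mode == "passthrough":
        # The write is forwarded to the origin and the cache block is invalidated.
        return {"written_to": ["origin"], "block_after": "invalidated"}
    raise ValueError("unknown cache mode: " + mode)
```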
2ee57d58 | 98 | |
c6b4fcba | 99 | A simple cleaner policy is provided, which will clean (write back) all |
7b6b2bc9 MS |
100 | dirty blocks in a cache. This is useful for decommissioning a cache or when | |
101 | shrinking a cache. Shrinking the cache's fast device requires all cache | |
102 | blocks, in the area of the cache being removed, to be clean. If the | |
103 | area being removed from the cache still contains dirty blocks the resize | |
104 | will fail. Care must be taken to never reduce the volume used for the | |
105 | cache's fast device until the cache is clean. This is of particular | |
106 | importance if writeback mode is used. Writethrough and passthrough | |
107 | modes already maintain a clean cache. Future support to partially clean | |
108 | the cache, above a specified threshold, will allow for keeping the cache | |
109 | warm and in writeback mode during resize. | |
c6b4fcba JT |
110 | |
111 | Migration throttling | |
112 | -------------------- | |
113 | ||
114 | Migrating data between the origin and cache device uses bandwidth. | |
115 | The user can set a throttle to prevent more than a certain amount of | |
f884ab15 | 116 | migration occurring at any one time. Currently we're not taking any |
c6b4fcba JT |
117 | account of normal I/O traffic going to the devices. More work is needed | |
118 | here to avoid migrating during peak I/O periods. | |
119 | ||
120 | For the time being, a message "migration_threshold <#sectors>" | |
121 | can be used to set the maximum number of sectors being migrated, | |
9614e2ba | 122 | the default being 2048 sectors (1MB). |
c6b4fcba JT |
123 | |
124 | Updating on-disk metadata | |
125 | ------------------------- | |
126 | ||
07f2b6e0 MS |
127 | On-disk metadata is committed every time a FLUSH or FUA bio is written. |
128 | If no such requests are made then commits will occur every second. This | |
129 | means the cache behaves like a physical disk that has a volatile write | |
130 | cache. If power is lost you may lose some recent writes. The metadata | |
131 | should always be consistent in spite of any crash. | |
c6b4fcba JT |
132 | |
133 | The 'dirty' state for a cache block changes far too frequently for us | |
134 | to keep updating it on the fly. So we treat it as a hint. In normal | |
135 | operation it will be written when the dm device is suspended. If the | |
136 | system crashes, all cache blocks will be assumed dirty when restarted. | |
137 | ||
138 | Per-block policy hints | |
139 | ---------------------- | |
140 | ||
141 | Policy plug-ins can store a chunk of data per cache block. It's up to | |
142 | the policy how big this chunk is, but it should be kept small. Like the | |
143 | dirty flags, this data is lost if there's a crash, so a safe fallback | |
144 | value should always be possible. | |
145 | ||
c6b4fcba JT |
146 | Policy hints affect performance, not correctness. |
147 | ||
148 | Policy messaging | |
149 | ---------------- | |
150 | ||
151 | Policies will have different tunables, specific to each one, so we | |
152 | need a generic way of getting and setting these. Device-mapper | |
153 | messages are used. Refer to cache-policies.txt. | |
154 | ||
155 | Discard bitset resolution | |
156 | ------------------------- | |
157 | ||
158 | We can avoid copying data during migration if we know the block has | |
159 | been discarded. A prime example of this is when mkfs discards the | |
160 | whole block device. We store a bitset tracking the discard state of | |
161 | blocks. However, we allow this bitset to have a different block size | |
162 | from the cache blocks. This is because we need to track the discard | |
163 | state for all of the origin device (compare with the dirty bitset | |
164 | which is just for the smaller cache device). | |
165 | ||
166 | Target interface | |
167 | ================ | |
168 | ||
169 | Constructor | |
170 | ----------- | |
171 | ||
172 | cache <metadata dev> <cache dev> <origin dev> <block size> | |
173 | <#feature args> [<feature arg>]* | |
174 | <policy> <#policy args> [policy args]* | |
175 | ||
176 | metadata dev : fast device holding the persistent metadata | |
177 | cache dev : fast device holding cached data blocks | |
178 | origin dev : slow device holding original data blocks | |
179 | block size : cache unit size in sectors | |
180 | ||
181 | #feature args : number of feature arguments passed | |
7b6b2bc9 | 182 | feature args : writethrough, passthrough or metadata2 (The default mode is writeback.) |
c6b4fcba JT |
183 | |
184 | policy : the replacement policy to use | |
185 | #policy args : an even number of arguments corresponding to | |
186 | key/value pairs passed to the policy | |
187 | policy args : key/value pairs passed to the policy | |
188 | E.g. 'sequential_threshold 1024' | |
189 | See cache-policies.txt for details. | |
190 | ||
191 | Optional feature arguments are: | |
192 | writethrough : write through caching that prohibits cache block | |
193 | content from being different from origin block content. | |
194 | Without this argument, the default behaviour is to write | |
195 | back cache block contents later for performance reasons, | |
196 | so they may differ from the corresponding origin blocks. | |
197 | ||
7b6b2bc9 MS |
198 | passthrough : a degraded mode useful for various cache coherency |
199 | situations (e.g., rolling back snapshots of | |
200 | underlying storage). Reads and writes always go to | |
201 | the origin. If a write goes to a cached origin | |
202 | block, then the cache block is invalidated. | |
203 | To enable passthrough mode the cache must be clean. | |
204 | ||
629d0a8a JT |
205 | metadata2 : use version 2 of the metadata. This stores the dirty bits |
206 | in a separate btree, which improves speed of shutting | |
207 | down the cache. | |
208 | ||
c6b4fcba JT |
209 | A policy called 'default' is always registered. This is an alias for |
210 | the policy we currently think is giving best all round performance. | |
211 | ||
212 | As the default policy could vary between kernels, if you are relying on | |
213 | the characteristics of a specific policy, always request it by name. | |
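Because the constructor carries two counted argument lists, it is easy to get the counts wrong when writing a table by hand. The helper below is a hedged sketch of assembling the table string in the order given above; cache_table and its parameter names are hypothetical, and the result is only a string to pass to dmsetup, nothing is created.

```python
def cache_table(length, metadata_dev, cache_dev, origin_dev, block_size,
                features=(), policy="default", policy_args=None):
    """Build a dm-cache table line; length and block_size are in sectors."""
    flat_policy_args = []
    for key, value in (policy_args or {}).items():
        # Policy arguments are key/value pairs, so the count is always even.
        flat_policy_args += [key, str(value)]
    fields = [0, length, "cache", metadata_dev, cache_dev, origin_dev, block_size,
              len(features), *features,
              policy, len(flat_policy_args), *flat_policy_args]
    return " ".join(str(field) for field in fields)
```

Calling it with policy="mq" and policy_args={"sequential_threshold": 1024, "random_threshold": 8} yields a policy section of "mq 4 sequential_threshold 1024 random_threshold 8".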
214 | ||
215 | Status | |
216 | ------ | |
217 | ||
6a388618 MS |
218 | <metadata block size> <#used metadata blocks>/<#total metadata blocks> |
219 | <cache block size> <#used cache blocks>/<#total cache blocks> | |
220 | <#read hits> <#read misses> <#write hits> <#write misses> | |
221 | <#demotions> <#promotions> <#dirty> <#feature args> <feature args>* | |
2e68c4e6 | 222 | <#core args> <core args>* <policy name> <#policy args> <policy args>* |
028ae9f7 | 223 | <cache metadata mode> <needs_check> |
6a388618 MS |
224 | |
225 | metadata block size : Fixed block size for each metadata block in | |
226 | sectors | |
227 | #used metadata blocks : Number of metadata blocks used | |
228 | #total metadata blocks : Total number of metadata blocks | |
229 | cache block size : Configurable block size for the cache device | |
230 | in sectors | |
231 | #used cache blocks : Number of blocks resident in the cache | |
232 | #total cache blocks : Total number of cache blocks | |
233 | #read hits : Number of times a READ bio has been mapped | |
c6b4fcba | 234 | to the cache |
6a388618 | 235 | #read misses : Number of times a READ bio has been mapped |
c6b4fcba | 236 | to the origin |
6a388618 | 237 | #write hits : Number of times a WRITE bio has been mapped |
c6b4fcba | 238 | to the cache |
6a388618 | 239 | #write misses : Number of times a WRITE bio has been |
c6b4fcba | 240 | mapped to the origin |
6a388618 | 241 | #demotions : Number of times a block has been removed |
c6b4fcba | 242 | from the cache |
6a388618 | 243 | #promotions : Number of times a block has been moved to |
c6b4fcba | 244 | the cache |
6a388618 | 245 | #dirty : Number of blocks in the cache that differ |
c6b4fcba | 246 | from the origin |
6a388618 MS |
247 | #feature args : Number of feature args to follow |
248 | feature args : 'writethrough' (optional) | |
249 | #core args : Number of core arguments (must be even) | |
250 | core args : Key/value pairs for tuning the core | |
c6b4fcba | 251 | e.g. migration_threshold |
2e68c4e6 | 252 | policy name : Name of the policy |
6a388618 | 253 | #policy args : Number of policy arguments to follow (must be even) |
028ae9f7 JT |
254 | policy args : Key/value pairs e.g. sequential_threshold |
255 | cache metadata mode : ro if read-only, rw if read-write | |
256 | In serious cases where even a read-only mode is deemed unsafe | |
257 | no further I/O will be permitted and the status will just | |
258 | contain the string 'Fail'. The userspace recovery tools | |
259 | should then be used. | |
255eac20 MS |
260 | needs_check : 'needs_check' if set, '-' if not set |
261 | A metadata operation has failed, resulting in the needs_check | |
262 | flag being set in the metadata's superblock. The metadata | |
263 | device must be deactivated and checked/repaired before the | |
264 | cache can be made fully operational again. '-' indicates | |
265 | needs_check is not set. | |
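The status fields can be split mechanically because each variable-length group is preceded by its count. The parser below is a sketch under two assumptions: the input string starts at the metadata block size field (any leading start, length and target name have been removed), and the needs_check field may be absent. The function name, the dictionary keys and the sample line are illustrative, not real output.

```python
def parse_cache_status(fields_text):
    """Parse the dm-cache status fields into a dictionary."""
    tokens = fields_text.split()
    if tokens == ["Fail"]:
        return {"failed": True}
    pos = 0

    def take(count=1):
        nonlocal pos
        chunk = tokens[pos:pos + count]
        if len(chunk) != count:
            raise ValueError("status line is shorter than expected")
        pos += count
        return chunk

    def take_counted():
        # A count followed by that many tokens, e.g. "1 writethrough".
        return take(int(take()[0]))

    def take_ratio():
        used, total = take()[0].split("/")
        return int(used), int(total)

    def as_pairs(items):
        return dict(zip(items[0::2], items[1::2]))

    status = {"failed": False}
    status["metadata_block_size"] = int(take()[0])
    status["metadata_used"], status["metadata_total"] = take_ratio()
    status["cache_block_size"] = int(take()[0])
    status["cache_used"], status["cache_total"] = take_ratio()
    for name in ("read_hits", "read_misses", "write_hits", "write_misses",
                 "demotions", "promotions", "dirty"):
        status[name] = int(take()[0])
    status["features"] = take_counted()
    status["core_args"] = as_pairs(take_counted())
    status["policy"] = take()[0]
    status["policy_args"] = as_pairs(take_counted())
    status["metadata_mode"] = take()[0]
    # needs_check is read only if a token is left; '-' means it is not set.
    status["needs_check"] = pos < len(tokens) and take()[0] == "needs_check"
    return status


# Made-up sample line, for illustration only.
SAMPLE = ("8 72/4096 128 1024/16384 100 20 30 40 5 60 7 1 writethrough "
          "2 migration_threshold 2048 mq 4 sequential_threshold 1024 "
          "random_threshold 8 rw -")
```

With the sample line, parse_cache_status(SAMPLE)["dirty"] is 7 and the read hit ratio can be derived from read_hits and read_misses.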
c6b4fcba JT |
266 | |
267 | Messages | |
268 | -------- | |
269 | ||
270 | Policies will have different tunables, specific to each one, so we | |
271 | need a generic way of getting and setting these. Device-mapper | |
272 | messages are used. (A sysfs interface would also be possible.) | |
273 | ||
274 | The message format is: | |
275 | ||
276 | <key> <value> | |
277 | ||
278 | E.g. | |
279 | dmsetup message my_cache 0 sequential_threshold 1024 | |
280 | ||
65790ff9 JT |
281 | |
282 | Invalidation is removing an entry from the cache without writing it | |
283 | back. Cache blocks can be invalidated via the invalidate_cblocks | |
7b6b2bc9 | 284 | message, which takes an arbitrary number of cblock ranges. Each cblock |
83f539e1 MS |
285 | range's end value is "one past the end", meaning 5-10 expresses a range |
286 | of values from 5 to 9. Each cblock must be expressed as a decimal | |
287 | value. In the future, a variant message that takes cblock ranges | |
3f816bac | 288 | expressed in hexadecimal may be needed to better support efficient |
83f539e1 MS |
289 | invalidation of larger caches. The cache must be in passthrough mode |
290 | when invalidate_cblocks is used. | |
65790ff9 JT |
291 | |
292 | invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]* | |
293 | ||
294 | E.g. | |
295 | dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 | |
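The "one past the end" convention is a common source of off-by-one mistakes. The sketch below converts inclusive (first, last) cblock pairs into the argument string for invalidate_cblocks; the helper name format_cblock_ranges is hypothetical and it only builds the string, it does not send the message.

```python
def format_cblock_ranges(ranges):
    """Turn inclusive (first, last) cblock pairs into invalidate_cblocks arguments."""
    parts = []
    for first, last in ranges:
        if first < 0 or last < first:
            raise ValueError("invalid cblock range: %r-%r" % (first, last))
        if first == last:
            parts.append(str(first))                 # single cblock
        else:
            # The end value in the message is one past the last cblock.
            parts.append("%d-%d" % (first, last + 1))
    return " ".join(parts)
```

For example, format_cblock_ranges([(5, 9)]) gives "5-10", matching the range described above.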
296 | ||
c6b4fcba JT |
297 | Examples |
298 | ======== | |
299 | ||
300 | The test suite can be found here: | |
301 | ||
65790ff9 | 302 | https://github.com/jthornber/device-mapper-test-suite |
c6b4fcba JT |
303 | |
304 | dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ | |
305 | /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' | |
306 | dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ | |
307 | /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ | |
308 | mq 4 sequential_threshold 1024 random_threshold 8' |