Commit | Line | Data |
---|---|---|
c6ae4c04 | 1 | /* SPDX-License-Identifier: GPL-2.0-or-later */ |
b411b363 PR |
2 | /* |
3 | lru_cache.c | |
4 | ||
5 | This file is part of DRBD by Philipp Reisner and Lars Ellenberg. | |
6 | ||
7 | Copyright (C) 2003-2008, LINBIT Information Technologies GmbH. | |
8 | Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>. | |
9 | Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>. | |
10 | ||
b411b363 PR |
11 | |
12 | */ | |
13 | ||
14 | #ifndef LRU_CACHE_H | |
15 | #define LRU_CACHE_H | |
16 | ||
17 | #include <linux/list.h> | |
18 | #include <linux/slab.h> | |
19 | #include <linux/bitops.h> | |
20 | #include <linux/string.h> /* for memset */ | |
21 | #include <linux/seq_file.h> | |
22 | ||
23 | /* | |
24 | This header file (and its .c file; kernel-doc of functions see there) | |
25 | define a helper framework to easily keep track of index:label associations, | |
26 | and changes to an "active set" of objects, as well as pending transactions, | |
27 | to persistently record those changes. | |
28 | ||
29 | We use an LRU policy if it is necessary to "cool down" a region currently in | |
30 | the active set before we can "heat" a previously unused region. | |
31 | ||
32 | Because of this later property, it is called "lru_cache". | |
33 | As it actually Tracks Objects in an Active SeT, we could also call it | |
34 | toast (incidentally that is what may happen to the data on the | |
c23c8082 | 35 | backend storage upon next resync, if we don't get it right). |
b411b363 PR |
36 | |
37 | What for? | |
38 | ||
39 | We replicate IO (more or less synchronously) to local and remote disk. | |
40 | ||
41 | For crash recovery after replication node failure, | |
42 | we need to resync all regions that have been target of in-flight WRITE IO | |
48fc7f7e AB |
43 | (in use, or "hot", regions), as we don't know whether or not those WRITEs |
44 | have made it to stable storage. | |
b411b363 PR |
45 | |
46 | To avoid a "full resync", we need to persistently track these regions. | |
47 | ||
48 | This is known as "write intent log", and can be implemented as on-disk | |
49 | (coarse or fine grained) bitmap, or other meta data. | |
50 | ||
51 | To avoid the overhead of frequent extra writes to this meta data area, | |
52 | usually the condition is softened to regions that _may_ have been target of | |
53 | in-flight WRITE IO, e.g. by only lazily clearing the on-disk write-intent | |
54 | bitmap, trading frequency of meta data transactions against amount of | |
3ad2f3fb | 55 | (possibly unnecessary) resync traffic. |
b411b363 PR |
56 | |
57 | If we set a hard limit on the area that may be "hot" at any given time, we | |
58 | limit the amount of resync traffic needed for crash recovery. | |
59 | ||
60 | For recovery after replication link failure, | |
61 | we need to resync all blocks that have been changed on the other replica | |
62 | in the mean time, or, if both replica have been changed independently [*], | |
63 | all blocks that have been changed on either replica in the mean time. | |
64 | [*] usually as a result of a cluster split-brain and insufficient protection. | |
65 | but there are valid use cases to do this on purpose. | |
66 | ||
67 | Tracking those blocks can be implemented as "dirty bitmap". | |
68 | Having it fine-grained reduces the amount of resync traffic. | |
69 | It should also be persistent, to allow for reboots (or crashes) | |
70 | while the replication link is down. | |
71 | ||
72 | There are various possible implementations for persistently storing | |
73 | write intent log information, three of which are mentioned here. | |
74 | ||
75 | "Chunk dirtying" | |
76 | The on-disk "dirty bitmap" may be re-used as "write-intent" bitmap as well. | |
77 | To reduce the frequency of bitmap updates for write-intent log purposes, | |
78 | one could dirty "chunks" (of some size) at a time of the (fine grained) | |
79 | on-disk bitmap, while keeping the in-memory "dirty" bitmap as clean as | |
80 | possible, flushing it to disk again when a previously "hot" (and on-disk | |
81 | dirtied as full chunk) area "cools down" again (no IO in flight anymore, | |
82 | and none expected in the near future either). | |
83 | ||
84 | "Explicit (coarse) write intent bitmap" | |
85 | An other implementation could chose a (probably coarse) explicit bitmap, | |
86 | for write-intent log purposes, additionally to the fine grained dirty bitmap. | |
87 | ||
88 | "Activity log" | |
89 | Yet an other implementation may keep track of the hot regions, by starting | |
90 | with an empty set, and writing down a journal of region numbers that have | |
91 | become "hot", or have "cooled down" again. | |
92 | ||
93 | To be able to use a ring buffer for this journal of changes to the active | |
94 | set, we not only record the actual changes to that set, but also record the | |
95 | not changing members of the set in a round robin fashion. To do so, we use a | |
96 | fixed (but configurable) number of slots which we can identify by index, and | |
97 | associate region numbers (labels) with these indices. | |
98 | For each transaction recording a change to the active set, we record the | |
99 | change itself (index: -old_label, +new_label), and which index is associated | |
100 | with which label (index: current_label) within a certain sliding window that | |
101 | is moved further over the available indices with each such transaction. | |
102 | ||
103 | Thus, for crash recovery, if the ringbuffer is sufficiently large, we can | |
104 | accurately reconstruct the active set. | |
105 | ||
106 | Sufficiently large depends only on maximum number of active objects, and the | |
107 | size of the sliding window recording "index: current_label" associations within | |
108 | each transaction. | |
109 | ||
110 | This is what we call the "activity log". | |
111 | ||
112 | Currently we need one activity log transaction per single label change, which | |
113 | does not give much benefit over the "dirty chunks of bitmap" approach, other | |
114 | than potentially less seeks. | |
115 | ||
116 | We plan to change the transaction format to support multiple changes per | |
117 | transaction, which then would reduce several (disjoint, "random") updates to | |
118 | the bitmap into one transaction to the activity log ring buffer. | |
119 | */ | |
120 | ||
121 | /* this defines an element in a tracked set | |
122 | * .colision is for hash table lookup. | |
123 | * When we process a new IO request, we know its sector, thus can deduce the | |
124 | * region number (label) easily. To do the label -> object lookup without a | |
125 | * full list walk, we use a simple hash table. | |
126 | * | |
127 | * .list is on one of three lists: | |
128 | * in_use: currently in use (refcnt > 0, lc_number != LC_FREE) | |
129 | * lru: unused but ready to be reused or recycled | |
600942e0 | 130 | * (lc_refcnt == 0, lc_number != LC_FREE), |
b411b363 | 131 | * free: unused but ready to be recycled |
600942e0 | 132 | * (lc_refcnt == 0, lc_number == LC_FREE), |
b411b363 PR |
133 | * |
134 | * an element is said to be "in the active set", | |
135 | * if either on "in_use" or "lru", i.e. lc_number != LC_FREE. | |
136 | * | |
137 | * DRBD currently (May 2009) only uses 61 elements on the resync lru_cache | |
138 | * (total memory usage 2 pages), and up to 3833 elements on the act_log | |
25985edc | 139 | * lru_cache, totalling ~215 kB for 64bit architecture, ~53 pages. |
b411b363 PR |
140 | * |
141 | * We usually do not actually free these objects again, but only "recycle" | |
142 | * them, as the change "index: -old_label, +LC_FREE" would need a transaction | |
143 | * as well. Which also means that using a kmem_cache to allocate the objects | |
144 | * from wastes some resources. | |
145 | * But it avoids high order page allocations in kmalloc. | |
146 | */ | |
147 | struct lc_element { | |
148 | struct hlist_node colision; | |
149 | struct list_head list; /* LRU list or free list */ | |
150 | unsigned refcnt; | |
600942e0 LE |
151 | /* back "pointer" into lc_cache->element[index], |
152 | * for paranoia, and for "lc_element_to_index" */ | |
b411b363 PR |
153 | unsigned lc_index; |
154 | /* if we want to track a larger set of objects, | |
c23c8082 | 155 | * it needs to become an architecture independent u64 */ |
b411b363 | 156 | unsigned lc_number; |
b411b363 PR |
157 | /* special label when on free list */ |
158 | #define LC_FREE (~0U) | |
46a15bc3 LE |
159 | |
160 | /* for pending changes */ | |
161 | unsigned lc_new_number; | |
b411b363 PR |
162 | }; |
163 | ||
164 | struct lru_cache { | |
165 | /* the least recently used item is kept at lru->prev */ | |
166 | struct list_head lru; | |
167 | struct list_head free; | |
168 | struct list_head in_use; | |
46a15bc3 | 169 | struct list_head to_be_changed; |
b411b363 PR |
170 | |
171 | /* the pre-created kmem cache to allocate the objects from */ | |
172 | struct kmem_cache *lc_cache; | |
173 | ||
174 | /* size of tracked objects, used to memset(,0,) them in lc_reset */ | |
175 | size_t element_size; | |
176 | /* offset of struct lc_element member in the tracked object */ | |
177 | size_t element_off; | |
178 | ||
179 | /* number of elements (indices) */ | |
46a15bc3 | 180 | unsigned int nr_elements; |
b411b363 PR |
181 | /* Arbitrary limit on maximum tracked objects. Practical limit is much |
182 | * lower due to allocation failures, probably. For typical use cases, | |
183 | * nr_elements should be a few thousand at most. | |
600942e0 LE |
184 | * This also limits the maximum value of lc_element.lc_index, allowing the |
185 | * 8 high bits of .lc_index to be overloaded with flags in the future. */ | |
b411b363 PR |
186 | #define LC_MAX_ACTIVE (1<<24) |
187 | ||
46a15bc3 LE |
188 | /* allow to accumulate a few (index:label) changes, |
189 | * but no more than max_pending_changes */ | |
190 | unsigned int max_pending_changes; | |
191 | /* number of elements currently on to_be_changed list */ | |
192 | unsigned int pending_changes; | |
193 | ||
b411b363 | 194 | /* statistics */ |
46a15bc3 LE |
195 | unsigned used; /* number of elements currently on in_use list */ |
196 | unsigned long hits, misses, starving, locked, changed; | |
b411b363 PR |
197 | |
198 | /* see below: flag-bits for lru_cache */ | |
199 | unsigned long flags; | |
200 | ||
b411b363 | 201 | |
b411b363 PR |
202 | const char *name; |
203 | ||
204 | /* nr_elements there */ | |
205 | struct hlist_head *lc_slot; | |
206 | struct lc_element **lc_element; | |
207 | }; | |
208 | ||
209 | ||
210 | /* flag-bits for lru_cache */ | |
211 | enum { | |
212 | /* debugging aid, to catch concurrent access early. | |
213 | * user needs to guarantee exclusive access by proper locking! */ | |
214 | __LC_PARANOIA, | |
46a15bc3 LE |
215 | |
216 | /* annotate that the set is "dirty", possibly accumulating further | |
217 | * changes, until a transaction is finally triggered */ | |
b411b363 | 218 | __LC_DIRTY, |
46a15bc3 LE |
219 | |
220 | /* Locked, no further changes allowed. | |
221 | * Also used to serialize changing transactions. */ | |
222 | __LC_LOCKED, | |
223 | ||
b411b363 PR |
224 | /* if we need to change the set, but currently there is no free nor |
225 | * unused element available, we are "starving", and must not give out | |
226 | * further references, to guarantee that eventually some refcnt will | |
227 | * drop to zero and we will be able to make progress again, changing | |
228 | * the set, writing the transaction. | |
229 | * if the statistics say we are frequently starving, | |
230 | * nr_elements is too small. */ | |
231 | __LC_STARVING, | |
232 | }; | |
233 | #define LC_PARANOIA (1<<__LC_PARANOIA) | |
234 | #define LC_DIRTY (1<<__LC_DIRTY) | |
46a15bc3 | 235 | #define LC_LOCKED (1<<__LC_LOCKED) |
b411b363 PR |
236 | #define LC_STARVING (1<<__LC_STARVING) |
237 | ||
238 | extern struct lru_cache *lc_create(const char *name, struct kmem_cache *cache, | |
46a15bc3 | 239 | unsigned max_pending_changes, |
b411b363 PR |
240 | unsigned e_count, size_t e_size, size_t e_off); |
241 | extern void lc_reset(struct lru_cache *lc); | |
242 | extern void lc_destroy(struct lru_cache *lc); | |
b411b363 PR |
243 | extern void lc_del(struct lru_cache *lc, struct lc_element *element); |
244 | ||
cbe5e610 | 245 | extern struct lc_element *lc_get_cumulative(struct lru_cache *lc, unsigned int enr); |
b411b363 PR |
246 | extern struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr); |
247 | extern struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr); | |
248 | extern struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr); | |
249 | extern unsigned int lc_put(struct lru_cache *lc, struct lc_element *e); | |
46a15bc3 | 250 | extern void lc_committed(struct lru_cache *lc); |
b411b363 PR |
251 | |
252 | struct seq_file; | |
bb649b34 | 253 | extern void lc_seq_printf_stats(struct seq_file *seq, struct lru_cache *lc); |
b411b363 PR |
254 | |
255 | extern void lc_seq_dump_details(struct seq_file *seq, struct lru_cache *lc, char *utext, | |
256 | void (*detail) (struct seq_file *, struct lc_element *)); | |
257 | ||
258 | /** | |
46a15bc3 | 259 | * lc_try_lock_for_transaction - can be used to stop lc_get() from changing the tracked set |
b411b363 PR |
260 | * @lc: the lru cache to operate on |
261 | * | |
46a15bc3 LE |
262 | * Allows (expects) the set to be "dirty". Note that the reference counts and |
263 | * order on the active and lru lists may still change. Used to serialize | |
c23c8082 | 264 | * changing transactions. Returns true if we acquired the lock. |
b411b363 | 265 | */ |
46a15bc3 | 266 | static inline int lc_try_lock_for_transaction(struct lru_cache *lc) |
b411b363 | 267 | { |
46a15bc3 | 268 | return !test_and_set_bit(__LC_LOCKED, &lc->flags); |
b411b363 PR |
269 | } |
270 | ||
46a15bc3 LE |
271 | /** |
272 | * lc_try_lock - variant to stop lc_get() from changing the tracked set | |
273 | * @lc: the lru cache to operate on | |
274 | * | |
275 | * Note that the reference counts and order on the active and lru lists may | |
c23c8082 | 276 | * still change. Only works on a "clean" set. Returns true if we acquired the |
46a15bc3 LE |
277 | * lock, which means there are no pending changes, and any further attempt to |
278 | * change the set will not succeed until the next lc_unlock(). | |
279 | */ | |
280 | extern int lc_try_lock(struct lru_cache *lc); | |
281 | ||
b411b363 PR |
282 | /** |
283 | * lc_unlock - unlock @lc, allow lc_get() to change the set again | |
284 | * @lc: the lru cache to operate on | |
285 | */ | |
286 | static inline void lc_unlock(struct lru_cache *lc) | |
287 | { | |
288 | clear_bit(__LC_DIRTY, &lc->flags); | |
46a15bc3 | 289 | clear_bit_unlock(__LC_LOCKED, &lc->flags); |
b411b363 PR |
290 | } |
291 | ||
46a15bc3 | 292 | extern bool lc_is_used(struct lru_cache *lc, unsigned int enr); |
b411b363 PR |
293 | |
294 | #define lc_entry(ptr, type, member) \ | |
295 | container_of(ptr, type, member) | |
296 | ||
297 | extern struct lc_element *lc_element_by_index(struct lru_cache *lc, unsigned i); | |
b411b363 PR |
298 | |
299 | #endif |