Commit | Line | Data |
---|---|---|
c3d2f6cb MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ========================== | |
a9a745da | 4 | XFS Delayed Logging Design |
c3d2f6cb | 5 | ========================== |
a9a745da DC |
6 | |
7 | Introduction to Re-logging in XFS | |
c3d2f6cb | 8 | ================================= |
a9a745da DC |
9 | |
10 | XFS logging is a combination of logical and physical logging. Some objects, | |
11 | such as inodes and dquots, are logged in logical format where the details | |
12 | logged are made up of the changes to in-core structures rather than on-disk | |
13 | structures. Other objects - typically buffers - have their physical changes | |
14 | logged. The reason for these differences is to reduce the amount of log space | |
15 | required for objects that are frequently logged. Some parts of inodes are more | |
16 | frequently logged than others, and inodes are typically more frequently logged | |
17 | than any other object (except maybe the superblock buffer) so keeping the | |
18 | amount of metadata logged low is of prime importance. | |
19 | ||
20 | The reason that this is such a concern is that XFS allows multiple separate | |
21 | modifications to a single object to be carried in the log at any given time. | |
22 | This allows the log to avoid needing to flush each change to disk before | |
23 | recording a new change to the object. XFS does this via a method called | |
24 | "re-logging". Conceptually, this is quite simple - all it requires is that any | |
25 | new change to the object is recorded with a *new copy* of all the existing | |
26 | changes in the new transaction that is written to the log. | |
27 | ||
28 | That is, if we have a sequence of changes A through to F, and the object was | |
29 | written to disk after change D, we would see in the log the following series | |
30 | of transactions, their contents and the log sequence number (LSN) of the | |
c3d2f6cb | 31 | transaction:: |
a9a745da DC |
32 | |
33 | Transaction Contents LSN | |
34 | A A X | |
35 | B A+B X+n | |
36 | C A+B+C X+n+m | |
37 | D A+B+C+D X+n+m+o | |
38 | <object written to disk> | |
39 | E E Y (> X+n+m+o) | |
9d619443 | 40 | F E+F Y+p |
a9a745da DC |
41 | |
42 | In other words, each time an object is relogged, the new transaction contains | |
43 | the aggregation of all the previous changes currently held only in the log. | |
44 | ||
45 | This relogging technique also allows objects to be moved forward in the log so | |
46 | that an object being relogged does not prevent the tail of the log from ever | |
47 | moving forward. This can be seen in the table above by the changing | |
25985edc | 48 | (increasing) LSN of each subsequent transaction - the LSN is effectively a |
a9a745da DC |
49 | direct encoding of the location in the log of the transaction. |
50 | ||
51 | This relogging is also used to implement long-running, multiple-commit | |
52 | transactions. These transaction are known as rolling transactions, and require | |
53 | a special log reservation known as a permanent transaction reservation. A | |
54 | typical example of a rolling transaction is the removal of extents from an | |
55 | inode which can only be done at a rate of two extents per transaction because | |
56 | of reservation size limitations. Hence a rolling extent removal transaction | |
57 | keeps relogging the inode and btree buffers as they get modified in each | |
58 | removal operation. This keeps them moving forward in the log as the operation | |
59 | progresses, ensuring that current operation never gets blocked by itself if the | |
60 | log wraps around. | |
61 | ||
62 | Hence it can be seen that the relogging operation is fundamental to the correct | |
63 | working of the XFS journalling subsystem. From the above description, most | |
64 | people should be able to see why the XFS metadata operations writes so much to | |
65 | the log - repeated operations to the same objects write the same changes to | |
66 | the log over and over again. Worse is the fact that objects tend to get | |
67 | dirtier as they get relogged, so each subsequent transaction is writing more | |
68 | metadata into the log. | |
69 | ||
70 | Another feature of the XFS transaction subsystem is that most transactions are | |
71 | asynchronous. That is, they don't commit to disk until either a log buffer is | |
72 | filled (a log buffer can hold multiple transactions) or a synchronous operation | |
73 | forces the log buffers holding the transactions to disk. This means that XFS is | |
74 | doing aggregation of transactions in memory - batching them, if you like - to | |
75 | minimise the impact of the log IO on transaction throughput. | |
76 | ||
77 | The limitation on asynchronous transaction throughput is the number and size of | |
78 | log buffers made available by the log manager. By default there are 8 log | |
79 | buffers available and the size of each is 32kB - the size can be increased up | |
80 | to 256kB by use of a mount option. | |
81 | ||
82 | Effectively, this gives us the maximum bound of outstanding metadata changes | |
83 | that can be made to the filesystem at any point in time - if all the log | |
84 | buffers are full and under IO, then no more transactions can be committed until | |
85 | the current batch completes. It is now common for a single current CPU core to | |
86 | be to able to issue enough transactions to keep the log buffers full and under | |
87 | IO permanently. Hence the XFS journalling subsystem can be considered to be IO | |
88 | bound. | |
89 | ||
90 | Delayed Logging: Concepts | |
c3d2f6cb | 91 | ========================= |
a9a745da DC |
92 | |
93 | The key thing to note about the asynchronous logging combined with the | |
94 | relogging technique XFS uses is that we can be relogging changed objects | |
95 | multiple times before they are committed to disk in the log buffers. If we | |
96 | return to the previous relogging example, it is entirely possible that | |
97 | transactions A through D are committed to disk in the same log buffer. | |
98 | ||
99 | That is, a single log buffer may contain multiple copies of the same object, | |
100 | but only one of those copies needs to be there - the last one "D", as it | |
101 | contains all the changes from the previous changes. In other words, we have one | |
102 | necessary copy in the log buffer, and three stale copies that are simply | |
103 | wasting space. When we are doing repeated operations on the same set of | |
104 | objects, these "stale objects" can be over 90% of the space used in the log | |
105 | buffers. It is clear that reducing the number of stale objects written to the | |
106 | log would greatly reduce the amount of metadata we write to the log, and this | |
107 | is the fundamental goal of delayed logging. | |
108 | ||
109 | From a conceptual point of view, XFS is already doing relogging in memory (where | |
110 | memory == log buffer), only it is doing it extremely inefficiently. It is using | |
111 | logical to physical formatting to do the relogging because there is no | |
112 | infrastructure to keep track of logical changes in memory prior to physically | |
113 | formatting the changes in a transaction to the log buffer. Hence we cannot avoid | |
114 | accumulating stale objects in the log buffers. | |
115 | ||
116 | Delayed logging is the name we've given to keeping and tracking transactional | |
117 | changes to objects in memory outside the log buffer infrastructure. Because of | |
118 | the relogging concept fundamental to the XFS journalling subsystem, this is | |
119 | actually relatively easy to do - all the changes to logged items are already | |
120 | tracked in the current infrastructure. The big problem is how to accumulate | |
121 | them and get them to the log in a consistent, recoverable manner. | |
122 | Describing the problems and how they have been solved is the focus of this | |
123 | document. | |
124 | ||
125 | One of the key changes that delayed logging makes to the operation of the | |
126 | journalling subsystem is that it disassociates the amount of outstanding | |
127 | metadata changes from the size and number of log buffers available. In other | |
128 | words, instead of there only being a maximum of 2MB of transaction changes not | |
129 | written to the log at any point in time, there may be a much greater amount | |
130 | being accumulated in memory. Hence the potential for loss of metadata on a | |
131 | crash is much greater than for the existing logging mechanism. | |
132 | ||
133 | It should be noted that this does not change the guarantee that log recovery | |
134 | will result in a consistent filesystem. What it does mean is that as far as the | |
135 | recovered filesystem is concerned, there may be many thousands of transactions | |
136 | that simply did not occur as a result of the crash. This makes it even more | |
137 | important that applications that care about their data use fsync() where they | |
138 | need to ensure application level data integrity is maintained. | |
139 | ||
140 | It should be noted that delayed logging is not an innovative new concept that | |
141 | warrants rigorous proofs to determine whether it is correct or not. The method | |
142 | of accumulating changes in memory for some period before writing them to the | |
143 | log is used effectively in many filesystems including ext3 and ext4. Hence | |
144 | no time is spent in this document trying to convince the reader that the | |
145 | concept is sound. Instead it is simply considered a "solved problem" and as | |
146 | such implementing it in XFS is purely an exercise in software engineering. | |
147 | ||
148 | The fundamental requirements for delayed logging in XFS are simple: | |
149 | ||
150 | 1. Reduce the amount of metadata written to the log by at least | |
151 | an order of magnitude. | |
152 | 2. Supply sufficient statistics to validate Requirement #1. | |
153 | 3. Supply sufficient new tracing infrastructure to be able to debug | |
154 | problems with the new code. | |
155 | 4. No on-disk format change (metadata or log format). | |
156 | 5. Enable and disable with a mount option. | |
157 | 6. No performance regressions for synchronous transaction workloads. | |
158 | ||
159 | Delayed Logging: Design | |
c3d2f6cb | 160 | ======================= |
a9a745da DC |
161 | |
162 | Storing Changes | |
c3d2f6cb | 163 | --------------- |
a9a745da DC |
164 | |
165 | The problem with accumulating changes at a logical level (i.e. just using the | |
166 | existing log item dirty region tracking) is that when it comes to writing the | |
167 | changes to the log buffers, we need to ensure that the object we are formatting | |
168 | is not changing while we do this. This requires locking the object to prevent | |
169 | concurrent modification. Hence flushing the logical changes to the log would | |
170 | require us to lock every object, format them, and then unlock them again. | |
171 | ||
172 | This introduces lots of scope for deadlocks with transactions that are already | |
173 | running. For example, a transaction has object A locked and modified, but needs | |
174 | the delayed logging tracking lock to commit the transaction. However, the | |
175 | flushing thread has the delayed logging tracking lock already held, and is | |
176 | trying to get the lock on object A to flush it to the log buffer. This appears | |
177 | to be an unsolvable deadlock condition, and it was solving this problem that | |
178 | was the barrier to implementing delayed logging for so long. | |
179 | ||
180 | The solution is relatively simple - it just took a long time to recognise it. | |
181 | Put simply, the current logging code formats the changes to each item into an | |
182 | vector array that points to the changed regions in the item. The log write code | |
183 | simply copies the memory these vectors point to into the log buffer during | |
184 | transaction commit while the item is locked in the transaction. Instead of | |
185 | using the log buffer as the destination of the formatting code, we can use an | |
186 | allocated memory buffer big enough to fit the formatted vector. | |
187 | ||
188 | If we then copy the vector into the memory buffer and rewrite the vector to | |
189 | point to the memory buffer rather than the object itself, we now have a copy of | |
190 | the changes in a format that is compatible with the log buffer writing code. | |
191 | that does not require us to lock the item to access. This formatting and | |
192 | rewriting can all be done while the object is locked during transaction commit, | |
193 | resulting in a vector that is transactionally consistent and can be accessed | |
194 | without needing to lock the owning item. | |
195 | ||
196 | Hence we avoid the need to lock items when we need to flush outstanding | |
197 | asynchronous transactions to the log. The differences between the existing | |
198 | formatting method and the delayed logging formatting can be seen in the | |
199 | diagram below. | |
200 | ||
c3d2f6cb | 201 | Current format log vector:: |
a9a745da | 202 | |
c3d2f6cb MCC |
203 | Object +---------------------------------------------+ |
204 | Vector 1 +----+ | |
205 | Vector 2 +----+ | |
206 | Vector 3 +----------+ | |
a9a745da | 207 | |
c3d2f6cb | 208 | After formatting:: |
a9a745da | 209 | |
c3d2f6cb | 210 | Log Buffer +-V1-+-V2-+----V3----+ |
a9a745da | 211 | |
c3d2f6cb | 212 | Delayed logging vector:: |
a9a745da | 213 | |
c3d2f6cb MCC |
214 | Object +---------------------------------------------+ |
215 | Vector 1 +----+ | |
216 | Vector 2 +----+ | |
217 | Vector 3 +----------+ | |
a9a745da | 218 | |
c3d2f6cb | 219 | After formatting:: |
a9a745da | 220 | |
c3d2f6cb MCC |
221 | Memory Buffer +-V1-+-V2-+----V3----+ |
222 | Vector 1 +----+ | |
223 | Vector 2 +----+ | |
224 | Vector 3 +----------+ | |
a9a745da DC |
225 | |
226 | The memory buffer and associated vector need to be passed as a single object, | |
227 | but still need to be associated with the parent object so if the object is | |
228 | relogged we can replace the current memory buffer with a new memory buffer that | |
229 | contains the latest changes. | |
230 | ||
231 | The reason for keeping the vector around after we've formatted the memory | |
232 | buffer is to support splitting vectors across log buffer boundaries correctly. | |
233 | If we don't keep the vector around, we do not know where the region boundaries | |
234 | are in the item, so we'd need a new encapsulation method for regions in the log | |
235 | buffer writing (i.e. double encapsulation). This would be an on-disk format | |
236 | change and as such is not desirable. It also means we'd have to write the log | |
237 | region headers in the formatting stage, which is problematic as there is per | |
238 | region state that needs to be placed into the headers during the log write. | |
239 | ||
240 | Hence we need to keep the vector, but by attaching the memory buffer to it and | |
241 | rewriting the vector addresses to point at the memory buffer we end up with a | |
242 | self-describing object that can be passed to the log buffer write code to be | |
243 | handled in exactly the same manner as the existing log vectors are handled. | |
244 | Hence we avoid needing a new on-disk format to handle items that have been | |
245 | relogged in memory. | |
246 | ||
247 | ||
248 | Tracking Changes | |
c3d2f6cb | 249 | ---------------- |
a9a745da DC |
250 | |
251 | Now that we can record transactional changes in memory in a form that allows | |
252 | them to be used without limitations, we need to be able to track and accumulate | |
253 | them so that they can be written to the log at some later point in time. The | |
254 | log item is the natural place to store this vector and buffer, and also makes sense | |
255 | to be the object that is used to track committed objects as it will always | |
256 | exist once the object has been included in a transaction. | |
257 | ||
258 | The log item is already used to track the log items that have been written to | |
259 | the log but not yet written to disk. Such log items are considered "active" | |
260 | and as such are stored in the Active Item List (AIL) which is a LSN-ordered | |
261 | double linked list. Items are inserted into this list during log buffer IO | |
262 | completion, after which they are unpinned and can be written to disk. An object | |
263 | that is in the AIL can be relogged, which causes the object to be pinned again | |
264 | and then moved forward in the AIL when the log buffer IO completes for that | |
265 | transaction. | |
266 | ||
267 | Essentially, this shows that an item that is in the AIL can still be modified | |
268 | and relogged, so any tracking must be separate to the AIL infrastructure. As | |
269 | such, we cannot reuse the AIL list pointers for tracking committed items, nor | |
270 | can we store state in any field that is protected by the AIL lock. Hence the | |
271 | committed item tracking needs it's own locks, lists and state fields in the log | |
272 | item. | |
273 | ||
274 | Similar to the AIL, tracking of committed items is done through a new list | |
275 | called the Committed Item List (CIL). The list tracks log items that have been | |
276 | committed and have formatted memory buffers attached to them. It tracks objects | |
277 | in transaction commit order, so when an object is relogged it is removed from | |
278 | it's place in the list and re-inserted at the tail. This is entirely arbitrary | |
279 | and done to make it easy for debugging - the last items in the list are the | |
280 | ones that are most recently modified. Ordering of the CIL is not necessary for | |
281 | transactional integrity (as discussed in the next section) so the ordering is | |
282 | done for convenience/sanity of the developers. | |
283 | ||
284 | ||
285 | Delayed Logging: Checkpoints | |
c3d2f6cb | 286 | ---------------------------- |
a9a745da DC |
287 | |
288 | When we have a log synchronisation event, commonly known as a "log force", | |
289 | all the items in the CIL must be written into the log via the log buffers. | |
290 | We need to write these items in the order that they exist in the CIL, and they | |
291 | need to be written as an atomic transaction. The need for all the objects to be | |
292 | written as an atomic transaction comes from the requirements of relogging and | |
293 | log replay - all the changes in all the objects in a given transaction must | |
294 | either be completely replayed during log recovery, or not replayed at all. If | |
295 | a transaction is not replayed because it is not complete in the log, then | |
296 | no later transactions should be replayed, either. | |
297 | ||
298 | To fulfill this requirement, we need to write the entire CIL in a single log | |
299 | transaction. Fortunately, the XFS log code has no fixed limit on the size of a | |
300 | transaction, nor does the log replay code. The only fundamental limit is that | |
301 | the transaction cannot be larger than just under half the size of the log. The | |
302 | reason for this limit is that to find the head and tail of the log, there must | |
303 | be at least one complete transaction in the log at any given time. If a | |
304 | transaction is larger than half the log, then there is the possibility that a | |
305 | crash during the write of a such a transaction could partially overwrite the | |
306 | only complete previous transaction in the log. This will result in a recovery | |
307 | failure and an inconsistent filesystem and hence we must enforce the maximum | |
308 | size of a checkpoint to be slightly less than a half the log. | |
309 | ||
310 | Apart from this size requirement, a checkpoint transaction looks no different | |
311 | to any other transaction - it contains a transaction header, a series of | |
312 | formatted log items and a commit record at the tail. From a recovery | |
313 | perspective, the checkpoint transaction is also no different - just a lot | |
314 | bigger with a lot more items in it. The worst case effect of this is that we | |
315 | might need to tune the recovery transaction object hash size. | |
316 | ||
317 | Because the checkpoint is just another transaction and all the changes to log | |
318 | items are stored as log vectors, we can use the existing log buffer writing | |
319 | code to write the changes into the log. To do this efficiently, we need to | |
320 | minimise the time we hold the CIL locked while writing the checkpoint | |
321 | transaction. The current log write code enables us to do this easily with the | |
322 | way it separates the writing of the transaction contents (the log vectors) from | |
323 | the transaction commit record, but tracking this requires us to have a | |
324 | per-checkpoint context that travels through the log write process through to | |
325 | checkpoint completion. | |
326 | ||
327 | Hence a checkpoint has a context that tracks the state of the current | |
328 | checkpoint from initiation to checkpoint completion. A new context is initiated | |
329 | at the same time a checkpoint transaction is started. That is, when we remove | |
330 | all the current items from the CIL during a checkpoint operation, we move all | |
331 | those changes into the current checkpoint context. We then initialise a new | |
332 | context and attach that to the CIL for aggregation of new transactions. | |
333 | ||
334 | This allows us to unlock the CIL immediately after transfer of all the | |
335 | committed items and effectively allow new transactions to be issued while we | |
336 | are formatting the checkpoint into the log. It also allows concurrent | |
337 | checkpoints to be written into the log buffers in the case of log force heavy | |
338 | workloads, just like the existing transaction commit code does. This, however, | |
339 | requires that we strictly order the commit records in the log so that | |
340 | checkpoint sequence order is maintained during log replay. | |
341 | ||
342 | To ensure that we can be writing an item into a checkpoint transaction at | |
343 | the same time another transaction modifies the item and inserts the log item | |
344 | into the new CIL, then checkpoint transaction commit code cannot use log items | |
345 | to store the list of log vectors that need to be written into the transaction. | |
346 | Hence log vectors need to be able to be chained together to allow them to be | |
25985edc | 347 | detached from the log items. That is, when the CIL is flushed the memory |
a9a745da DC |
348 | buffer and log vector attached to each log item needs to be attached to the |
349 | checkpoint context so that the log item can be released. In diagrammatic form, | |
c3d2f6cb | 350 | the CIL would look like this before the flush:: |
a9a745da DC |
351 | |
352 | CIL Head | |
353 | | | |
354 | V | |
355 | Log Item <-> log vector 1 -> memory buffer | |
356 | | -> vector array | |
357 | V | |
358 | Log Item <-> log vector 2 -> memory buffer | |
359 | | -> vector array | |
360 | V | |
361 | ...... | |
362 | | | |
363 | V | |
364 | Log Item <-> log vector N-1 -> memory buffer | |
365 | | -> vector array | |
366 | V | |
367 | Log Item <-> log vector N -> memory buffer | |
368 | -> vector array | |
369 | ||
370 | And after the flush the CIL head is empty, and the checkpoint context log | |
c3d2f6cb | 371 | vector list would look like:: |
a9a745da DC |
372 | |
373 | Checkpoint Context | |
374 | | | |
375 | V | |
376 | log vector 1 -> memory buffer | |
377 | | -> vector array | |
378 | | -> Log Item | |
379 | V | |
380 | log vector 2 -> memory buffer | |
381 | | -> vector array | |
382 | | -> Log Item | |
383 | V | |
384 | ...... | |
385 | | | |
386 | V | |
387 | log vector N-1 -> memory buffer | |
388 | | -> vector array | |
389 | | -> Log Item | |
390 | V | |
391 | log vector N -> memory buffer | |
392 | -> vector array | |
393 | -> Log Item | |
394 | ||
395 | Once this transfer is done, the CIL can be unlocked and new transactions can | |
396 | start, while the checkpoint flush code works over the log vector chain to | |
397 | commit the checkpoint. | |
398 | ||
399 | Once the checkpoint is written into the log buffers, the checkpoint context is | |
400 | attached to the log buffer that the commit record was written to along with a | |
401 | completion callback. Log IO completion will call that callback, which can then | |
402 | run transaction committed processing for the log items (i.e. insert into AIL | |
403 | and unpin) in the log vector chain and then free the log vector chain and | |
404 | checkpoint context. | |
405 | ||
406 | Discussion Point: I am uncertain as to whether the log item is the most | |
407 | efficient way to track vectors, even though it seems like the natural way to do | |
408 | it. The fact that we walk the log items (in the CIL) just to chain the log | |
409 | vectors and break the link between the log item and the log vector means that | |
410 | we take a cache line hit for the log item list modification, then another for | |
411 | the log vector chaining. If we track by the log vectors, then we only need to | |
412 | break the link between the log item and the log vector, which means we should | |
413 | dirty only the log item cachelines. Normally I wouldn't be concerned about one | |
414 | vs two dirty cachelines except for the fact I've seen upwards of 80,000 log | |
415 | vectors in one checkpoint transaction. I'd guess this is a "measure and | |
416 | compare" situation that can be done after a working and reviewed implementation | |
417 | is in the dev tree.... | |
418 | ||
419 | Delayed Logging: Checkpoint Sequencing | |
c3d2f6cb | 420 | -------------------------------------- |
a9a745da DC |
421 | |
422 | One of the key aspects of the XFS transaction subsystem is that it tags | |
423 | committed transactions with the log sequence number of the transaction commit. | |
424 | This allows transactions to be issued asynchronously even though there may be | |
425 | future operations that cannot be completed until that transaction is fully | |
426 | committed to the log. In the rare case that a dependent operation occurs (e.g. | |
427 | re-using a freed metadata extent for a data extent), a special, optimised log | |
428 | force can be issued to force the dependent transaction to disk immediately. | |
429 | ||
430 | To do this, transactions need to record the LSN of the commit record of the | |
431 | transaction. This LSN comes directly from the log buffer the transaction is | |
432 | written into. While this works just fine for the existing transaction | |
433 | mechanism, it does not work for delayed logging because transactions are not | |
434 | written directly into the log buffers. Hence some other method of sequencing | |
435 | transactions is required. | |
436 | ||
437 | As discussed in the checkpoint section, delayed logging uses per-checkpoint | |
438 | contexts, and as such it is simple to assign a sequence number to each | |
439 | checkpoint. Because the switching of checkpoint contexts must be done | |
440 | atomically, it is simple to ensure that each new context has a monotonically | |
441 | increasing sequence number assigned to it without the need for an external | |
442 | atomic counter - we can just take the current context sequence number and add | |
443 | one to it for the new context. | |
444 | ||
445 | Then, instead of assigning a log buffer LSN to the transaction commit LSN | |
446 | during the commit, we can assign the current checkpoint sequence. This allows | |
447 | operations that track transactions that have not yet completed know what | |
448 | checkpoint sequence needs to be committed before they can continue. As a | |
449 | result, the code that forces the log to a specific LSN now needs to ensure that | |
450 | the log forces to a specific checkpoint. | |
451 | ||
452 | To ensure that we can do this, we need to track all the checkpoint contexts | |
453 | that are currently committing to the log. When we flush a checkpoint, the | |
454 | context gets added to a "committing" list which can be searched. When a | |
455 | checkpoint commit completes, it is removed from the committing list. Because | |
456 | the checkpoint context records the LSN of the commit record for the checkpoint, | |
457 | we can also wait on the log buffer that contains the commit record, thereby | |
458 | using the existing log force mechanisms to execute synchronous forces. | |
459 | ||
460 | It should be noted that the synchronous forces may need to be extended with | |
461 | mitigation algorithms similar to the current log buffer code to allow | |
462 | aggregation of multiple synchronous transactions if there are already | |
463 | synchronous transactions being flushed. Investigation of the performance of the | |
464 | current design is needed before making any decisions here. | |
465 | ||
466 | The main concern with log forces is to ensure that all the previous checkpoints | |
467 | are also committed to disk before the one we need to wait for. Therefore we | |
468 | need to check that all the prior contexts in the committing list are also | |
469 | complete before waiting on the one we need to complete. We do this | |
470 | synchronisation in the log force code so that we don't need to wait anywhere | |
471 | else for such serialisation - it only matters when we do a log force. | |
472 | ||
473 | The only remaining complexity is that a log force now also has to handle the | |
474 | case where the forcing sequence number is the same as the current context. That | |
475 | is, we need to flush the CIL and potentially wait for it to complete. This is a | |
476 | simple addition to the existing log forcing code to check the sequence numbers | |
477 | and push if required. Indeed, placing the current sequence checkpoint flush in | |
478 | the log force code enables the current mechanism for issuing synchronous | |
479 | transactions to remain untouched (i.e. commit an asynchronous transaction, then | |
480 | force the log at the LSN of that transaction) and so the higher level code | |
481 | behaves the same regardless of whether delayed logging is being used or not. | |
482 | ||
483 | Delayed Logging: Checkpoint Log Space Accounting | |
c3d2f6cb | 484 | ------------------------------------------------ |
a9a745da DC |
485 | |
486 | The big issue for a checkpoint transaction is the log space reservation for the | |
487 | transaction. We don't know how big a checkpoint transaction is going to be | |
488 | ahead of time, nor how many log buffers it will take to write out, nor the | |
489 | number of split log vector regions are going to be used. We can track the | |
490 | amount of log space required as we add items to the commit item list, but we | |
491 | still need to reserve the space in the log for the checkpoint. | |
492 | ||
493 | A typical transaction reserves enough space in the log for the worst case space | |
494 | usage of the transaction. The reservation accounts for log record headers, | |
495 | transaction and region headers, headers for split regions, buffer tail padding, | |
496 | etc. as well as the actual space for all the changed metadata in the | |
497 | transaction. While some of this is fixed overhead, much of it is dependent on | |
498 | the size of the transaction and the number of regions being logged (the number | |
499 | of log vectors in the transaction). | |
500 | ||
501 | An example of the differences would be logging directory changes versus logging | |
c3d2f6cb | 502 | inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then |
a9a745da DC |
503 | there are lots of transactions that only contain an inode core and an inode log |
504 | format structure. That is, two vectors totaling roughly 150 bytes. If we modify | |
505 | 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each | |
506 | vector is 12 bytes, so the total to be logged is approximately 1.75MB. In | |
507 | comparison, if we are logging full directory buffers, they are typically 4KB | |
508 | each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a | |
509 | buffer format structure for each buffer - roughly 800 vectors or 1.51MB total | |
510 | space. From this, it should be obvious that a static log space reservation is | |
511 | not particularly flexible and is difficult to select the "optimal value" for | |
512 | all workloads. | |
513 | ||
514 | Further, if we are going to use a static reservation, which bit of the entire | |
515 | reservation does it cover? We account for space used by the transaction | |
516 | reservation by tracking the space currently used by the object in the CIL and | |
517 | then calculating the increase or decrease in space used as the object is | |
518 | relogged. This allows for a checkpoint reservation to only have to account for | |
519 | log buffer metadata used such as log header records. | |
520 | ||
521 | However, even using a static reservation for just the log metadata is | |
522 | problematic. Typically log record headers use at least 16KB of log space per | |
523 | 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be | |
524 | large enough to handle arbitrary sized checkpoint transactions. This | |
525 | reservation needs to be made before the checkpoint is started, and we need to | |
526 | be able to reserve the space without sleeping. For a 8MB checkpoint, we need a | |
527 | reservation of around 150KB, which is a non-trivial amount of space. | |
528 | ||
529 | A static reservation needs to manipulate the log grant counters - we can take a | |
530 | permanent reservation on the space, but we still need to make sure we refresh | |
531 | the write reservation (the actual space available to the transaction) after | |
532 | every checkpoint transaction completion. Unfortunately, if this space is not | |
533 | available when required, then the regrant code will sleep waiting for it. | |
534 | ||
535 | The problem with this is that it can lead to deadlocks as we may need to commit | |
536 | checkpoints to be able to free up log space (refer back to the description of | |
537 | rolling transactions for an example of this). Hence we *must* always have | |
538 | space available in the log if we are to use static reservations, and that is | |
539 | very difficult and complex to arrange. It is possible to do, but there is a | |
540 | simpler way. | |
541 | ||
542 | The simpler way of doing this is tracking the entire log space used by the | |
543 | items in the CIL and using this to dynamically calculate the amount of log | |
544 | space required by the log metadata. If this log metadata space changes as a | |
545 | result of a transaction commit inserting a new memory buffer into the CIL, then | |
546 | the difference in space required is removed from the transaction that causes | |
547 | the change. Transactions at this level will *always* have enough space | |
548 | available in their reservation for this as they have already reserved the | |
549 | maximal amount of log metadata space they require, and such a delta reservation | |
550 | will always be less than or equal to the maximal amount in the reservation. | |
551 | ||
552 | Hence we can grow the checkpoint transaction reservation dynamically as items | |
553 | are added to the CIL and avoid the need for reserving and regranting log space | |
554 | up front. This avoids deadlocks and removes a blocking point from the | |
555 | checkpoint flush code. | |
556 | ||
557 | As mentioned early, transactions can't grow to more than half the size of the | |
558 | log. Hence as part of the reservation growing, we need to also check the size | |
559 | of the reservation against the maximum allowed transaction size. If we reach | |
560 | the maximum threshold, we need to push the CIL to the log. This is effectively | |
561 | a "background flush" and is done on demand. This is identical to | |
562 | a CIL push triggered by a log force, only that there is no waiting for the | |
563 | checkpoint commit to complete. This background push is checked and executed by | |
564 | transaction commit code. | |
565 | ||
566 | If the transaction subsystem goes idle while we still have items in the CIL, | |
567 | they will be flushed by the periodic log force issued by the xfssyncd. This log | |
568 | force will push the CIL to disk, and if the transaction subsystem stays idle, | |
569 | allow the idle log to be covered (effectively marked clean) in exactly the same | |
570 | manner that is done for the existing logging method. A discussion point is | |
571 | whether this log force needs to be done more frequently than the current rate | |
572 | which is once every 30s. | |
573 | ||
574 | ||
575 | Delayed Logging: Log Item Pinning | |
c3d2f6cb | 576 | --------------------------------- |
a9a745da DC |
577 | |
578 | Currently log items are pinned during transaction commit while the items are | |
579 | still locked. This happens just after the items are formatted, though it could | |
580 | be done any time before the items are unlocked. The result of this mechanism is | |
581 | that items get pinned once for every transaction that is committed to the log | |
582 | buffers. Hence items that are relogged in the log buffers will have a pin count | |
583 | for every outstanding transaction they were dirtied in. When each of these | |
584 | transactions is completed, they will unpin the item once. As a result, the item | |
585 | only becomes unpinned when all the transactions complete and there are no | |
586 | pending transactions. Thus the pinning and unpinning of a log item is symmetric | |
587 | as there is a 1:1 relationship with transaction commit and log item completion. | |
588 | ||
25985edc | 589 | For delayed logging, however, we have an asymmetric transaction commit to |
a9a745da DC |
590 | completion relationship. Every time an object is relogged in the CIL it goes |
591 | through the commit process without a corresponding completion being registered. | |
592 | That is, we now have a many-to-one relationship between transaction commit and | |
593 | log item completion. The result of this is that pinning and unpinning of the | |
594 | log items becomes unbalanced if we retain the "pin on transaction commit, unpin | |
595 | on transaction completion" model. | |
596 | ||
597 | To keep pin/unpin symmetry, the algorithm needs to change to a "pin on | |
598 | insertion into the CIL, unpin on checkpoint completion". In other words, the | |
599 | pinning and unpinning becomes symmetric around a checkpoint context. We have to | |
600 | pin the object the first time it is inserted into the CIL - if it is already in | |
601 | the CIL during a transaction commit, then we do not pin it again. Because there | |
602 | can be multiple outstanding checkpoint contexts, we can still see elevated pin | |
603 | counts, but as each checkpoint completes the pin count will retain the correct | |
604 | value according to it's context. | |
605 | ||
606 | Just to make matters more slightly more complex, this checkpoint level context | |
607 | for the pin count means that the pinning of an item must take place under the | |
608 | CIL commit/flush lock. If we pin the object outside this lock, we cannot | |
609 | guarantee which context the pin count is associated with. This is because of | |
610 | the fact pinning the item is dependent on whether the item is present in the | |
611 | current CIL or not. If we don't pin the CIL first before we check and pin the | |
612 | object, we have a race with CIL being flushed between the check and the pin | |
613 | (or not pinning, as the case may be). Hence we must hold the CIL flush/commit | |
614 | lock to guarantee that we pin the items correctly. | |
615 | ||
616 | Delayed Logging: Concurrent Scalability | |
c3d2f6cb | 617 | --------------------------------------- |
a9a745da DC |
618 | |
619 | A fundamental requirement for the CIL is that accesses through transaction | |
620 | commits must scale to many concurrent commits. The current transaction commit | |
621 | code does not break down even when there are transactions coming from 2048 | |
622 | processors at once. The current transaction code does not go any faster than if | |
623 | there was only one CPU using it, but it does not slow down either. | |
624 | ||
625 | As a result, the delayed logging transaction commit code needs to be designed | |
626 | for concurrency from the ground up. It is obvious that there are serialisation | |
627 | points in the design - the three important ones are: | |
628 | ||
629 | 1. Locking out new transaction commits while flushing the CIL | |
630 | 2. Adding items to the CIL and updating item space accounting | |
631 | 3. Checkpoint commit ordering | |
632 | ||
633 | Looking at the transaction commit and CIL flushing interactions, it is clear | |
634 | that we have a many-to-one interaction here. That is, the only restriction on | |
635 | the number of concurrent transactions that can be trying to commit at once is | |
636 | the amount of space available in the log for their reservations. The practical | |
637 | limit here is in the order of several hundred concurrent transactions for a | |
638 | 128MB log, which means that it is generally one per CPU in a machine. | |
639 | ||
640 | The amount of time a transaction commit needs to hold out a flush is a | |
641 | relatively long period of time - the pinning of log items needs to be done | |
642 | while we are holding out a CIL flush, so at the moment that means it is held | |
643 | across the formatting of the objects into memory buffers (i.e. while memcpy()s | |
644 | are in progress). Ultimately a two pass algorithm where the formatting is done | |
645 | separately to the pinning of objects could be used to reduce the hold time of | |
646 | the transaction commit side. | |
647 | ||
648 | Because of the number of potential transaction commit side holders, the lock | |
649 | really needs to be a sleeping lock - if the CIL flush takes the lock, we do not | |
650 | want every other CPU in the machine spinning on the CIL lock. Given that | |
651 | flushing the CIL could involve walking a list of tens of thousands of log | |
652 | items, it will get held for a significant time and so spin contention is a | |
653 | significant concern. Preventing lots of CPUs spinning doing nothing is the | |
654 | main reason for choosing a sleeping lock even though nothing in either the | |
655 | transaction commit or CIL flush side sleeps with the lock held. | |
656 | ||
657 | It should also be noted that CIL flushing is also a relatively rare operation | |
658 | compared to transaction commit for asynchronous transaction workloads - only | |
659 | time will tell if using a read-write semaphore for exclusion will limit | |
660 | transaction commit concurrency due to cache line bouncing of the lock on the | |
661 | read side. | |
662 | ||
663 | The second serialisation point is on the transaction commit side where items | |
664 | are inserted into the CIL. Because transactions can enter this code | |
665 | concurrently, the CIL needs to be protected separately from the above | |
666 | commit/flush exclusion. It also needs to be an exclusive lock but it is only | |
667 | held for a very short time and so a spin lock is appropriate here. It is | |
668 | possible that this lock will become a contention point, but given the short | |
669 | hold time once per transaction I think that contention is unlikely. | |
670 | ||
671 | The final serialisation point is the checkpoint commit record ordering code | |
672 | that is run as part of the checkpoint commit and log force sequencing. The code | |
673 | path that triggers a CIL flush (i.e. whatever triggers the log force) will enter | |
674 | an ordering loop after writing all the log vectors into the log buffers but | |
675 | before writing the commit record. This loop walks the list of committing | |
676 | checkpoints and needs to block waiting for checkpoints to complete their commit | |
677 | record write. As a result it needs a lock and a wait variable. Log force | |
678 | sequencing also requires the same lock, list walk, and blocking mechanism to | |
679 | ensure completion of checkpoints. | |
680 | ||
681 | These two sequencing operations can use the mechanism even though the | |
682 | events they are waiting for are different. The checkpoint commit record | |
683 | sequencing needs to wait until checkpoint contexts contain a commit LSN | |
684 | (obtained through completion of a commit record write) while log force | |
685 | sequencing needs to wait until previous checkpoint contexts are removed from | |
686 | the committing list (i.e. they've completed). A simple wait variable and | |
687 | broadcast wakeups (thundering herds) has been used to implement these two | |
688 | serialisation queues. They use the same lock as the CIL, too. If we see too | |
689 | much contention on the CIL lock, or too many context switches as a result of | |
690 | the broadcast wakeups these operations can be put under a new spinlock and | |
691 | given separate wait lists to reduce lock contention and the number of processes | |
692 | woken by the wrong event. | |
693 | ||
694 | ||
695 | Lifecycle Changes | |
c3d2f6cb | 696 | ----------------- |
a9a745da | 697 | |
c3d2f6cb | 698 | The existing log item life cycle is as follows:: |
a9a745da DC |
699 | |
700 | 1. Transaction allocate | |
701 | 2. Transaction reserve | |
702 | 3. Lock item | |
703 | 4. Join item to transaction | |
704 | If not already attached, | |
705 | Allocate log item | |
706 | Attach log item to owner item | |
707 | Attach log item to transaction | |
708 | 5. Modify item | |
709 | Record modifications in log item | |
710 | 6. Transaction commit | |
711 | Pin item in memory | |
712 | Format item into log buffer | |
713 | Write commit LSN into transaction | |
714 | Unlock item | |
715 | Attach transaction to log buffer | |
716 | ||
717 | <log buffer IO dispatched> | |
718 | <log buffer IO completes> | |
719 | ||
720 | 7. Transaction completion | |
721 | Mark log item committed | |
722 | Insert log item into AIL | |
723 | Write commit LSN into log item | |
724 | Unpin log item | |
725 | 8. AIL traversal | |
726 | Lock item | |
727 | Mark log item clean | |
728 | Flush item to disk | |
729 | ||
730 | <item IO completion> | |
731 | ||
732 | 9. Log item removed from AIL | |
733 | Moves log tail | |
734 | Item unlocked | |
735 | ||
736 | Essentially, steps 1-6 operate independently from step 7, which is also | |
737 | independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 | |
738 | at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur | |
739 | at the same time. If the log item is in the AIL or between steps 6 and 7 | |
740 | and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 | |
741 | are entered and completed is the object considered clean. | |
742 | ||
c3d2f6cb | 743 | With delayed logging, there are new steps inserted into the life cycle:: |
a9a745da DC |
744 | |
745 | 1. Transaction allocate | |
746 | 2. Transaction reserve | |
747 | 3. Lock item | |
748 | 4. Join item to transaction | |
749 | If not already attached, | |
750 | Allocate log item | |
751 | Attach log item to owner item | |
752 | Attach log item to transaction | |
753 | 5. Modify item | |
754 | Record modifications in log item | |
755 | 6. Transaction commit | |
756 | Pin item in memory if not pinned in CIL | |
757 | Format item into log vector + buffer | |
758 | Attach log vector and buffer to log item | |
759 | Insert log item into CIL | |
760 | Write CIL context sequence into transaction | |
761 | Unlock item | |
762 | ||
763 | <next log force> | |
764 | ||
765 | 7. CIL push | |
766 | lock CIL flush | |
767 | Chain log vectors and buffers together | |
768 | Remove items from CIL | |
769 | unlock CIL flush | |
770 | write log vectors into log | |
771 | sequence commit records | |
772 | attach checkpoint context to log buffer | |
773 | ||
774 | <log buffer IO dispatched> | |
775 | <log buffer IO completes> | |
776 | ||
777 | 8. Checkpoint completion | |
778 | Mark log item committed | |
779 | Insert item into AIL | |
780 | Write commit LSN into log item | |
781 | Unpin log item | |
782 | 9. AIL traversal | |
783 | Lock item | |
784 | Mark log item clean | |
785 | Flush item to disk | |
786 | <item IO completion> | |
787 | 10. Log item removed from AIL | |
788 | Moves log tail | |
789 | Item unlocked | |
790 | ||
791 | From this, it can be seen that the only life cycle differences between the two | |
792 | logging methods are in the middle of the life cycle - they still have the same | |
793 | beginning and end and execution constraints. The only differences are in the | |
25985edc | 794 | committing of the log items to the log itself and the completion processing. |
a9a745da DC |
795 | Hence delayed logging should not introduce any constraints on log item |
796 | behaviour, allocation or freeing that don't already exist. | |
797 | ||
798 | As a result of this zero-impact "insertion" of delayed logging infrastructure | |
799 | and the design of the internal structures to avoid on disk format changes, we | |
800 | can basically switch between delayed logging and the existing mechanism with a | |
801 | mount option. Fundamentally, there is no reason why the log manager would not | |
802 | be able to swap methods automatically and transparently depending on load | |
803 | characteristics, but this should not be necessary if delayed logging works as | |
804 | designed. |