Commit | Line | Data |
---|---|---|
e0bc9cf0 | 1 | .. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only |
f00c313b MCC |
2 | |
3 | =========================== | |
4 | Lockless Ring Buffer Design | |
5 | =========================== | |
8b2c70d1 SR |
6 | |
7 | Copyright 2009 Red Hat Inc. | |
f00c313b MCC |
8 | |
9 | :Author: Steven Rostedt <srostedt@redhat.com> | |
10 | :License: The GNU Free Documentation License, Version 1.2 | |
11 | (dual licensed under the GPL v2) | |
12 | :Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, | |
8b2c70d1 SR |
13 | and Frederic Weisbecker. |
14 | ||
15 | ||
16 | Written for: 2.6.31 | |
17 | ||
18 | Terminology used in this Document | |
19 | --------------------------------- | |
20 | ||
f00c313b MCC |
21 | tail |
22 | - where new writes happen in the ring buffer. | |
8b2c70d1 | 23 | |
f00c313b MCC |
24 | head |
25 | - where new reads happen in the ring buffer. | |
8b2c70d1 | 26 | |
f00c313b MCC |
27 | producer |
28 | - the task that writes into the ring buffer (same as writer) | |
8b2c70d1 | 29 | |
f00c313b MCC |
30 | writer |
31 | - same as producer | |
8b2c70d1 | 32 | |
f00c313b MCC |
33 | consumer |
34 | - the task that reads from the buffer (same as reader) | |
8b2c70d1 | 35 | |
f00c313b MCC |
36 | reader |
37 | - same as consumer. | |
8b2c70d1 | 38 | |
f00c313b MCC |
39 | reader_page |
40 | - A page outside the ring buffer used solely (for the most part) | |
41 | by the reader. | |
8b2c70d1 | 42 | |
f00c313b MCC |
43 | head_page |
44 | - a pointer to the page that the reader will use next | |
8b2c70d1 | 45 | |
f00c313b MCC |
46 | tail_page |
47 | - a pointer to the page that will be written to next | |
8b2c70d1 | 48 | |
f00c313b MCC |
49 | commit_page |
50 | - a pointer to the page with the last finished non-nested write. | |
8b2c70d1 | 51 | |
f00c313b MCC |
52 | cmpxchg |
53 | - hardware-assisted atomic transaction that performs the following:: | |
8b2c70d1 | 54 | |
f00c313b | 55 | A = B if previous A == C |
8b2c70d1 | 56 | |
f00c313b MCC |
57 | R = cmpxchg(A, C, B) is saying that we replace A with B if and only |
58 | if current A is equal to C, and we put the old (current) | |
59 | A into R | |
8b2c70d1 | 60 | |
f00c313b | 61 | R gets the previous A regardless if A is updated with B or not. |
8b2c70d1 | 62 | |
f00c313b MCC |
63 | To see if the update was successful a compare of ``R == C`` |
64 | may be used. | |
8b2c70d1 SR |
65 | |
66 | The Generic Ring Buffer | |
67 | ----------------------- | |
68 | ||
69 | The ring buffer can be used in either an overwrite mode or in | |
70 | producer/consumer mode. | |
71 | ||
006b4298 | 72 | Producer/consumer mode is where if the producer were to fill up the |
8b2c70d1 SR |
73 | buffer before the consumer could free up anything, the producer |
74 | will stop writing to the buffer. This will lose most recent events. | |
75 | ||
006b4298 | 76 | Overwrite mode is where if the producer were to fill up the buffer |
8b2c70d1 SR |
77 | before the consumer could free up anything, the producer will |
78 | overwrite the older data. This will lose the oldest events. | |
79 | ||
006b4298 | 80 | No two writers can write at the same time (on the same per-cpu buffer), |
8b2c70d1 SR |
81 | but a writer may interrupt another writer, but it must finish writing |
82 | before the previous writer may continue. This is very important to the | |
83 | algorithm. The writers act like a "stack". The way interrupts works | |
f00c313b | 84 | enforces this behavior:: |
8b2c70d1 SR |
85 | |
86 | ||
87 | writer1 start | |
88 | <preempted> writer2 start | |
89 | <preempted> writer3 start | |
90 | writer3 finishes | |
91 | writer2 finishes | |
92 | writer1 finishes | |
93 | ||
94 | This is very much like a writer being preempted by an interrupt and | |
95 | the interrupt doing a write as well. | |
96 | ||
97 | Readers can happen at any time. But no two readers may run at the | |
98 | same time, nor can a reader preempt/interrupt another reader. A reader | |
006b4298 | 99 | cannot preempt/interrupt a writer, but it may read/consume from the |
8b2c70d1 SR |
100 | buffer at the same time as a writer is writing, but the reader must be |
101 | on another processor to do so. A reader may read on its own processor | |
102 | and can be preempted by a writer. | |
103 | ||
006b4298 | 104 | A writer can preempt a reader, but a reader cannot preempt a writer. |
8b2c70d1 SR |
105 | But a reader can read the buffer at the same time (on another processor) |
106 | as a writer. | |
107 | ||
006b4298 | 108 | The ring buffer is made up of a list of pages held together by a linked list. |
8b2c70d1 SR |
109 | |
110 | At initialization a reader page is allocated for the reader that is not | |
111 | part of the ring buffer. | |
112 | ||
113 | The head_page, tail_page and commit_page are all initialized to point | |
114 | to the same page. | |
115 | ||
116 | The reader page is initialized to have its next pointer pointing to | |
117 | the head page, and its previous pointer pointing to a page before | |
118 | the head page. | |
119 | ||
120 | The reader has its own page to use. At start up time, this page is | |
121 | allocated but is not attached to the list. When the reader wants | |
006b4298 | 122 | to read from the buffer, if its page is empty (like it is on start-up), |
8b2c70d1 SR |
123 | it will swap its page with the head_page. The old reader page will |
124 | become part of the ring buffer and the head_page will be removed. | |
125 | The page after the inserted page (old reader_page) will become the | |
126 | new head page. | |
127 | ||
128 | Once the new page is given to the reader, the reader could do what | |
129 | it wants with it, as long as a writer has left that page. | |
130 | ||
131 | A sample of how the reader page is swapped: Note this does not | |
132 | show the head page in the buffer, it is for demonstrating a swap | |
133 | only. | |
134 | ||
f00c313b MCC |
135 | :: |
136 | ||
8b2c70d1 SR |
137 | +------+ |
138 | |reader| RING BUFFER | |
139 | |page | | |
140 | +------+ | |
141 | +---+ +---+ +---+ | |
142 | | |-->| |-->| | | |
143 | | |<--| |<--| | | |
144 | +---+ +---+ +---+ | |
145 | ^ | ^ | | |
146 | | +-------------+ | | |
147 | +-----------------+ | |
148 | ||
149 | ||
150 | +------+ | |
151 | |reader| RING BUFFER | |
152 | |page |-------------------+ | |
153 | +------+ v | |
154 | | +---+ +---+ +---+ | |
155 | | | |-->| |-->| | | |
156 | | | |<--| |<--| |<-+ | |
157 | | +---+ +---+ +---+ | | |
158 | | ^ | ^ | | | |
159 | | | +-------------+ | | | |
160 | | +-----------------+ | | |
161 | +------------------------------------+ | |
162 | ||
163 | +------+ | |
164 | |reader| RING BUFFER | |
165 | |page |-------------------+ | |
166 | +------+ <---------------+ v | |
167 | | ^ +---+ +---+ +---+ | |
168 | | | | |-->| |-->| | | |
169 | | | | | | |<--| |<-+ | |
170 | | | +---+ +---+ +---+ | | |
171 | | | | ^ | | | |
172 | | | +-------------+ | | | |
173 | | +-----------------------------+ | | |
174 | +------------------------------------+ | |
175 | ||
176 | +------+ | |
177 | |buffer| RING BUFFER | |
178 | |page |-------------------+ | |
179 | +------+ <---------------+ v | |
180 | | ^ +---+ +---+ +---+ | |
181 | | | | | | |-->| | | |
182 | | | New | | | |<--| |<-+ | |
183 | | | Reader +---+ +---+ +---+ | | |
184 | | | page ----^ | | | |
185 | | | | | | |
186 | | +-----------------------------+ | | |
187 | +------------------------------------+ | |
188 | ||
189 | ||
190 | ||
191 | It is possible that the page swapped is the commit page and the tail page, | |
192 | if what is in the ring buffer is less than what is held in a buffer page. | |
193 | ||
f00c313b MCC |
194 | :: |
195 | ||
196 | reader page commit page tail page | |
197 | | | | | |
198 | v | | | |
199 | +---+ | | | |
200 | | |<----------+ | | |
201 | | |<------------------------+ | |
202 | | |------+ | |
203 | +---+ | | |
204 | | | |
205 | v | |
206 | +---+ +---+ +---+ +---+ | |
207 | <---| |--->| |--->| |--->| |---> | |
208 | --->| |<---| |<---| |<---| |<--- | |
209 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
210 | |
211 | This case is still valid for this algorithm. | |
212 | When the writer leaves the page, it simply goes into the ring buffer | |
213 | since the reader page still points to the next location in the ring | |
214 | buffer. | |
215 | ||
216 | ||
217 | The main pointers: | |
218 | ||
f00c313b MCC |
219 | reader page |
220 | - The page used solely by the reader and is not part | |
221 | of the ring buffer (may be swapped in) | |
8b2c70d1 | 222 | |
f00c313b MCC |
223 | head page |
224 | - the next page in the ring buffer that will be swapped | |
8b2c70d1 SR |
225 | with the reader page. |
226 | ||
f00c313b MCC |
227 | tail page |
228 | - the page where the next write will take place. | |
8b2c70d1 | 229 | |
f00c313b MCC |
230 | commit page |
231 | - the page that last finished a write. | |
8b2c70d1 | 232 | |
006b4298 | 233 | The commit page only is updated by the outermost writer in the |
8b2c70d1 SR |
234 | writer stack. A writer that preempts another writer will not move the |
235 | commit page. | |
236 | ||
237 | When data is written into the ring buffer, a position is reserved | |
238 | in the ring buffer and passed back to the writer. When the writer | |
239 | is finished writing data into that position, it commits the write. | |
240 | ||
241 | Another write (or a read) may take place at anytime during this | |
242 | transaction. If another write happens it must finish before continuing | |
243 | with the previous write. | |
244 | ||
245 | ||
f00c313b | 246 | Write reserve:: |
8b2c70d1 SR |
247 | |
248 | Buffer page | |
249 | +---------+ | |
250 | |written | | |
251 | +---------+ <--- given back to writer (current commit) | |
252 | |reserved | | |
253 | +---------+ <--- tail pointer | |
254 | | empty | | |
255 | +---------+ | |
256 | ||
f00c313b | 257 | Write commit:: |
8b2c70d1 SR |
258 | |
259 | Buffer page | |
260 | +---------+ | |
261 | |written | | |
262 | +---------+ | |
263 | |written | | |
25985edc | 264 | +---------+ <--- next position for write (current commit) |
8b2c70d1 SR |
265 | | empty | |
266 | +---------+ | |
267 | ||
268 | ||
f00c313b | 269 | If a write happens after the first reserve:: |
8b2c70d1 SR |
270 | |
271 | Buffer page | |
272 | +---------+ | |
273 | |written | | |
274 | +---------+ <-- current commit | |
275 | |reserved | | |
276 | +---------+ <--- given back to second writer | |
277 | |reserved | | |
278 | +---------+ <--- tail pointer | |
279 | ||
f00c313b | 280 | After second writer commits:: |
8b2c70d1 SR |
281 | |
282 | ||
283 | Buffer page | |
284 | +---------+ | |
285 | |written | | |
286 | +---------+ <--(last full commit) | |
287 | |reserved | | |
288 | +---------+ | |
289 | |pending | | |
290 | |commit | | |
291 | +---------+ <--- tail pointer | |
292 | ||
f00c313b | 293 | When the first writer commits:: |
8b2c70d1 SR |
294 | |
295 | Buffer page | |
296 | +---------+ | |
297 | |written | | |
298 | +---------+ | |
299 | |written | | |
300 | +---------+ | |
301 | |written | | |
302 | +---------+ <--(last full commit and tail pointer) | |
303 | ||
304 | ||
305 | The commit pointer points to the last write location that was | |
306 | committed without preempting another write. When a write that | |
307 | preempted another write is committed, it only becomes a pending commit | |
006b4298 | 308 | and will not be a full commit until all writes have been committed. |
8b2c70d1 SR |
309 | |
310 | The commit page points to the page that has the last full commit. | |
311 | The tail page points to the page with the last write (before | |
312 | committing). | |
313 | ||
314 | The tail page is always equal to or after the commit page. It may | |
315 | be several pages ahead. If the tail page catches up to the commit | |
316 | page then no more writes may take place (regardless of the mode | |
317 | of the ring buffer: overwrite and produce/consumer). | |
318 | ||
f00c313b | 319 | The order of pages is:: |
8b2c70d1 SR |
320 | |
321 | head page | |
322 | commit page | |
323 | tail page | |
324 | ||
f00c313b MCC |
325 | Possible scenario:: |
326 | ||
327 | tail page | |
328 | head page commit page | | |
329 | | | | | |
330 | v v v | |
331 | +---+ +---+ +---+ +---+ | |
332 | <---| |--->| |--->| |--->| |---> | |
333 | --->| |<---| |<---| |<---| |<--- | |
334 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
335 | |
336 | There is a special case that the head page is after either the commit page | |
337 | and possibly the tail page. That is when the commit (and tail) page has been | |
338 | swapped with the reader page. This is because the head page is always | |
006b4298 | 339 | part of the ring buffer, but the reader page is not. Whenever there |
8b2c70d1 SR |
340 | has been less than a full page that has been committed inside the ring buffer, |
341 | and a reader swaps out a page, it will be swapping out the commit page. | |
342 | ||
f00c313b MCC |
343 | :: |
344 | ||
345 | reader page commit page tail page | |
346 | | | | | |
347 | v | | | |
348 | +---+ | | | |
349 | | |<----------+ | | |
350 | | |<------------------------+ | |
351 | | |------+ | |
352 | +---+ | | |
353 | | | |
354 | v | |
355 | +---+ +---+ +---+ +---+ | |
356 | <---| |--->| |--->| |--->| |---> | |
357 | --->| |<---| |<---| |<---| |<--- | |
358 | +---+ +---+ +---+ +---+ | |
359 | ^ | |
360 | | | |
361 | head page | |
8b2c70d1 SR |
362 | |
363 | ||
364 | In this case, the head page will not move when the tail and commit | |
365 | move back into the ring buffer. | |
366 | ||
006b4298 | 367 | The reader cannot swap a page into the ring buffer if the commit page |
8b2c70d1 SR |
368 | is still on that page. If the read meets the last commit (real commit |
369 | not pending or reserved), then there is nothing more to read. | |
370 | The buffer is considered empty until another full commit finishes. | |
371 | ||
372 | When the tail meets the head page, if the buffer is in overwrite mode, | |
373 | the head page will be pushed ahead one. If the buffer is in producer/consumer | |
374 | mode, the write will fail. | |
375 | ||
f00c313b MCC |
376 | Overwrite mode:: |
377 | ||
378 | tail page | |
379 | | | |
380 | v | |
381 | +---+ +---+ +---+ +---+ | |
382 | <---| |--->| |--->| |--->| |---> | |
383 | --->| |<---| |<---| |<---| |<--- | |
384 | +---+ +---+ +---+ +---+ | |
385 | ^ | |
386 | | | |
387 | head page | |
388 | ||
389 | ||
390 | tail page | |
391 | | | |
392 | v | |
393 | +---+ +---+ +---+ +---+ | |
394 | <---| |--->| |--->| |--->| |---> | |
395 | --->| |<---| |<---| |<---| |<--- | |
396 | +---+ +---+ +---+ +---+ | |
397 | ^ | |
398 | | | |
399 | head page | |
400 | ||
401 | ||
402 | tail page | |
403 | | | |
404 | v | |
405 | +---+ +---+ +---+ +---+ | |
406 | <---| |--->| |--->| |--->| |---> | |
407 | --->| |<---| |<---| |<---| |<--- | |
408 | +---+ +---+ +---+ +---+ | |
409 | ^ | |
410 | | | |
411 | head page | |
8b2c70d1 SR |
412 | |
413 | Note, the reader page will still point to the previous head page. | |
414 | But when a swap takes place, it will use the most recent head page. | |
415 | ||
416 | ||
417 | Making the Ring Buffer Lockless: | |
418 | -------------------------------- | |
419 | ||
420 | The main idea behind the lockless algorithm is to combine the moving | |
421 | of the head_page pointer with the swapping of pages with the reader. | |
422 | State flags are placed inside the pointer to the page. To do this, | |
423 | each page must be aligned in memory by 4 bytes. This will allow the 2 | |
006b4298 | 424 | least significant bits of the address to be used as flags, since |
8b2c70d1 | 425 | they will always be zero for the address. To get the address, |
f00c313b | 426 | simply mask out the flags:: |
8b2c70d1 SR |
427 | |
428 | MASK = ~3 | |
429 | ||
430 | address & MASK | |
431 | ||
432 | Two flags will be kept by these two bits: | |
433 | ||
f00c313b MCC |
434 | HEADER |
435 | - the page being pointed to is a head page | |
8b2c70d1 | 436 | |
f00c313b MCC |
437 | UPDATE |
438 | - the page being pointed to is being updated by a writer | |
8b2c70d1 SR |
439 | and was or is about to be a head page. |
440 | ||
f00c313b | 441 | :: |
8b2c70d1 | 442 | |
f00c313b MCC |
443 | reader page |
444 | | | |
445 | v | |
446 | +---+ | |
447 | | |------+ | |
448 | +---+ | | |
449 | | | |
450 | v | |
451 | +---+ +---+ +---+ +---+ | |
452 | <---| |--->| |-H->| |--->| |---> | |
453 | --->| |<---| |<---| |<---| |<--- | |
454 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
455 | |
456 | ||
457 | The above pointer "-H->" would have the HEADER flag set. That is | |
458 | the next page is the next page to be swapped out by the reader. | |
459 | This pointer means the next page is the head page. | |
460 | ||
461 | When the tail page meets the head pointer, it will use cmpxchg to | |
f00c313b | 462 | change the pointer to the UPDATE state:: |
8b2c70d1 SR |
463 | |
464 | ||
f00c313b MCC |
465 | tail page |
466 | | | |
467 | v | |
468 | +---+ +---+ +---+ +---+ | |
469 | <---| |--->| |-H->| |--->| |---> | |
470 | --->| |<---| |<---| |<---| |<--- | |
471 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 472 | |
f00c313b MCC |
473 | tail page |
474 | | | |
475 | v | |
476 | +---+ +---+ +---+ +---+ | |
477 | <---| |--->| |-U->| |--->| |---> | |
478 | --->| |<---| |<---| |<---| |<--- | |
479 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
480 | |
481 | "-U->" represents a pointer in the UPDATE state. | |
482 | ||
483 | Any access to the reader will need to take some sort of lock to serialize | |
484 | the readers. But the writers will never take a lock to write to the | |
485 | ring buffer. This means we only need to worry about a single reader, | |
486 | and writes only preempt in "stack" formation. | |
487 | ||
488 | When the reader tries to swap the page with the ring buffer, it | |
489 | will also use cmpxchg. If the flag bit in the pointer to the | |
490 | head page does not have the HEADER flag set, the compare will fail | |
491 | and the reader will need to look for the new head page and try again. | |
006b4298 | 492 | Note, the flags UPDATE and HEADER are never set at the same time. |
8b2c70d1 | 493 | |
f00c313b | 494 | The reader swaps the reader page as follows:: |
8b2c70d1 SR |
495 | |
496 | +------+ | |
497 | |reader| RING BUFFER | |
498 | |page | | |
499 | +------+ | |
500 | +---+ +---+ +---+ | |
501 | | |--->| |--->| | | |
502 | | |<---| |<---| | | |
503 | +---+ +---+ +---+ | |
504 | ^ | ^ | | |
505 | | +---------------+ | | |
506 | +-----H-------------+ | |
507 | ||
508 | The reader sets the reader page next pointer as HEADER to the page after | |
f00c313b | 509 | the head page:: |
8b2c70d1 SR |
510 | |
511 | ||
512 | +------+ | |
513 | |reader| RING BUFFER | |
514 | |page |-------H-----------+ | |
515 | +------+ v | |
516 | | +---+ +---+ +---+ | |
517 | | | |--->| |--->| | | |
518 | | | |<---| |<---| |<-+ | |
519 | | +---+ +---+ +---+ | | |
520 | | ^ | ^ | | | |
521 | | | +---------------+ | | | |
522 | | +-----H-------------+ | | |
523 | +--------------------------------------+ | |
524 | ||
525 | It does a cmpxchg with the pointer to the previous head page to make it | |
526 | point to the reader page. Note that the new pointer does not have the HEADER | |
f00c313b | 527 | flag set. This action atomically moves the head page forward:: |
8b2c70d1 SR |
528 | |
529 | +------+ | |
530 | |reader| RING BUFFER | |
531 | |page |-------H-----------+ | |
532 | +------+ v | |
533 | | ^ +---+ +---+ +---+ | |
534 | | | | |-->| |-->| | | |
535 | | | | |<--| |<--| |<-+ | |
536 | | | +---+ +---+ +---+ | | |
537 | | | | ^ | | | |
538 | | | +-------------+ | | | |
539 | | +-----------------------------+ | | |
540 | +------------------------------------+ | |
541 | ||
542 | After the new head page is set, the previous pointer of the head page is | |
f00c313b | 543 | updated to the reader page:: |
8b2c70d1 SR |
544 | |
545 | +------+ | |
546 | |reader| RING BUFFER | |
547 | |page |-------H-----------+ | |
548 | +------+ <---------------+ v | |
549 | | ^ +---+ +---+ +---+ | |
550 | | | | |-->| |-->| | | |
551 | | | | | | |<--| |<-+ | |
552 | | | +---+ +---+ +---+ | | |
553 | | | | ^ | | | |
554 | | | +-------------+ | | | |
555 | | +-----------------------------+ | | |
556 | +------------------------------------+ | |
557 | ||
558 | +------+ | |
559 | |buffer| RING BUFFER | |
560 | |page |-------H-----------+ <--- New head page | |
561 | +------+ <---------------+ v | |
562 | | ^ +---+ +---+ +---+ | |
563 | | | | | | |-->| | | |
564 | | | New | | | |<--| |<-+ | |
565 | | | Reader +---+ +---+ +---+ | | |
566 | | | page ----^ | | | |
567 | | | | | | |
568 | | +-----------------------------+ | | |
569 | +------------------------------------+ | |
570 | ||
006b4298 | 571 | Another important point: The page that the reader page points back to |
8b2c70d1 SR |
572 | by its previous pointer (the one that now points to the new head page) |
573 | never points back to the reader page. That is because the reader page is | |
574 | not part of the ring buffer. Traversing the ring buffer via the next pointers | |
575 | will always stay in the ring buffer. Traversing the ring buffer via the | |
576 | prev pointers may not. | |
577 | ||
578 | Note, the way to determine a reader page is simply by examining the previous | |
579 | pointer of the page. If the next pointer of the previous page does not | |
f00c313b | 580 | point back to the original page, then the original page is a reader page:: |
8b2c70d1 SR |
581 | |
582 | ||
583 | +--------+ | |
584 | | reader | next +----+ | |
585 | | page |-------->| |<====== (buffer page) | |
586 | +--------+ +----+ | |
587 | | | ^ | |
588 | | v | next | |
589 | prev | +----+ | |
590 | +------------->| | | |
591 | +----+ | |
592 | ||
593 | The way the head page moves forward: | |
594 | ||
595 | When the tail page meets the head page and the buffer is in overwrite mode | |
596 | and more writes take place, the head page must be moved forward before the | |
597 | writer may move the tail page. The way this is done is that the writer | |
598 | performs a cmpxchg to convert the pointer to the head page from the HEADER | |
599 | flag to have the UPDATE flag set. Once this is done, the reader will | |
600 | not be able to swap the head page from the buffer, nor will it be able to | |
601 | move the head page, until the writer is finished with the move. | |
602 | ||
603 | This eliminates any races that the reader can have on the writer. The reader | |
f00c313b MCC |
604 | must spin, and this is why the reader cannot preempt the writer:: |
605 | ||
606 | tail page | |
607 | | | |
608 | v | |
609 | +---+ +---+ +---+ +---+ | |
610 | <---| |--->| |-H->| |--->| |---> | |
611 | --->| |<---| |<---| |<---| |<--- | |
612 | +---+ +---+ +---+ +---+ | |
613 | ||
614 | tail page | |
615 | | | |
616 | v | |
617 | +---+ +---+ +---+ +---+ | |
618 | <---| |--->| |-U->| |--->| |---> | |
619 | --->| |<---| |<---| |<---| |<--- | |
620 | +---+ +---+ +---+ +---+ | |
621 | ||
622 | The following page will be made into the new head page:: | |
623 | ||
624 | tail page | |
625 | | | |
626 | v | |
627 | +---+ +---+ +---+ +---+ | |
628 | <---| |--->| |-U->| |-H->| |---> | |
629 | --->| |<---| |<---| |<---| |<--- | |
630 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
631 | |
632 | After the new head page has been set, we can set the old head page | |
f00c313b | 633 | pointer back to NORMAL:: |
8b2c70d1 | 634 | |
f00c313b MCC |
635 | tail page |
636 | | | |
637 | v | |
638 | +---+ +---+ +---+ +---+ | |
639 | <---| |--->| |--->| |-H->| |---> | |
640 | --->| |<---| |<---| |<---| |<--- | |
641 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 642 | |
f00c313b | 643 | After the head page has been moved, the tail page may now move forward:: |
8b2c70d1 | 644 | |
f00c313b MCC |
645 | tail page |
646 | | | |
647 | v | |
648 | +---+ +---+ +---+ +---+ | |
649 | <---| |--->| |--->| |-H->| |---> | |
650 | --->| |<---| |<---| |<---| |<--- | |
651 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
652 | |
653 | ||
654 | The above are the trivial updates. Now for the more complex scenarios. | |
655 | ||
656 | ||
657 | As stated before, if enough writes preempt the first write, the | |
658 | tail page may make it all the way around the buffer and meet the commit | |
659 | page. At this time, we must start dropping writes (usually with some kind | |
660 | of warning to the user). But what happens if the commit was still on the | |
661 | reader page? The commit page is not part of the ring buffer. The tail page | |
f00c313b MCC |
662 | must account for this:: |
663 | ||
664 | ||
665 | reader page commit page | |
666 | | | | |
667 | v | | |
668 | +---+ | | |
669 | | |<----------+ | |
670 | | | | |
671 | | |------+ | |
672 | +---+ | | |
673 | | | |
674 | v | |
675 | +---+ +---+ +---+ +---+ | |
676 | <---| |--->| |-H->| |--->| |---> | |
677 | --->| |<---| |<---| |<---| |<--- | |
678 | +---+ +---+ +---+ +---+ | |
679 | ^ | |
680 | | | |
681 | tail page | |
8b2c70d1 SR |
682 | |
683 | If the tail page were to simply push the head page forward, the commit when | |
684 | leaving the reader page would not be pointing to the correct page. | |
685 | ||
686 | The solution to this is to test if the commit page is on the reader page | |
687 | before pushing the head page. If it is, then it can be assumed that the | |
688 | tail page wrapped the buffer, and we must drop new writes. | |
689 | ||
690 | This is not a race condition, because the commit page can only be moved | |
006b4298 | 691 | by the outermost writer (the writer that was preempted). |
8b2c70d1 | 692 | This means that the commit will not move while a writer is moving the |
006b4298 | 693 | tail page. The reader cannot swap the reader page if it is also being |
8b2c70d1 SR |
694 | used as the commit page. The reader can simply check that the commit |
695 | is off the reader page. Once the commit page leaves the reader page | |
696 | it will never go back on it unless a reader does another swap with the | |
697 | buffer page that is also the commit page. | |
698 | ||
699 | ||
700 | Nested writes | |
701 | ------------- | |
702 | ||
703 | In the pushing forward of the tail page we must first push forward | |
704 | the head page if the head page is the next page. If the head page | |
705 | is not the next page, the tail page is simply updated with a cmpxchg. | |
706 | ||
707 | Only writers move the tail page. This must be done atomically to protect | |
f00c313b | 708 | against nested writers:: |
8b2c70d1 SR |
709 | |
710 | temp_page = tail_page | |
711 | next_page = temp_page->next | |
712 | cmpxchg(tail_page, temp_page, next_page) | |
713 | ||
714 | The above will update the tail page if it is still pointing to the expected | |
df5cbb27 | 715 | page. If this fails, a nested write pushed it forward, the current write |
f00c313b MCC |
716 | does not need to push it:: |
717 | ||
718 | ||
719 | temp page | |
720 | | | |
721 | v | |
722 | tail page | |
723 | | | |
724 | v | |
725 | +---+ +---+ +---+ +---+ | |
726 | <---| |--->| |--->| |--->| |---> | |
727 | --->| |<---| |<---| |<---| |<--- | |
728 | +---+ +---+ +---+ +---+ | |
729 | ||
730 | Nested write comes in and moves the tail page forward:: | |
731 | ||
732 | tail page (moved by nested writer) | |
733 | temp page | | |
734 | | | | |
735 | v v | |
736 | +---+ +---+ +---+ +---+ | |
737 | <---| |--->| |--->| |--->| |---> | |
738 | --->| |<---| |<---| |<---| |<--- | |
739 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
740 | |
741 | The above would fail the cmpxchg, but since the tail page has already | |
742 | been moved forward, the writer will just try again to reserve storage | |
743 | on the new tail page. | |
744 | ||
f00c313b | 745 | But the moving of the head page is a bit more complex:: |
8b2c70d1 | 746 | |
f00c313b MCC |
747 | tail page |
748 | | | |
749 | v | |
750 | +---+ +---+ +---+ +---+ | |
751 | <---| |--->| |-H->| |--->| |---> | |
752 | --->| |<---| |<---| |<---| |<--- | |
753 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 754 | |
f00c313b | 755 | The write converts the head page pointer to UPDATE:: |
8b2c70d1 | 756 | |
f00c313b MCC |
757 | tail page |
758 | | | |
759 | v | |
760 | +---+ +---+ +---+ +---+ | |
761 | <---| |--->| |-U->| |--->| |---> | |
762 | --->| |<---| |<---| |<---| |<--- | |
763 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 764 | |
006b4298 | 765 | But if a nested writer preempts here, it will see that the next |
8b2c70d1 SR |
766 | page is a head page, but it is also nested. It will detect that |
767 | it is nested and will save that information. The detection is the | |
768 | fact that it sees the UPDATE flag instead of a HEADER or NORMAL | |
769 | pointer. | |
770 | ||
f00c313b | 771 | The nested writer will set the new head page pointer:: |
8b2c70d1 | 772 | |
f00c313b MCC |
773 | tail page |
774 | | | |
775 | v | |
776 | +---+ +---+ +---+ +---+ | |
777 | <---| |--->| |-U->| |-H->| |---> | |
778 | --->| |<---| |<---| |<---| |<--- | |
779 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
780 | |
781 | But it will not reset the update back to normal. Only the writer | |
782 | that converted a pointer from HEAD to UPDATE will convert it back | |
f00c313b | 783 | to NORMAL:: |
8b2c70d1 | 784 | |
f00c313b MCC |
785 | tail page |
786 | | | |
787 | v | |
788 | +---+ +---+ +---+ +---+ | |
789 | <---| |--->| |-U->| |-H->| |---> | |
790 | --->| |<---| |<---| |<---| |<--- | |
791 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 792 | |
006b4298 | 793 | After the nested writer finishes, the outermost writer will convert |
f00c313b | 794 | the UPDATE pointer to NORMAL:: |
8b2c70d1 SR |
795 | |
796 | ||
f00c313b MCC |
797 | tail page |
798 | | | |
799 | v | |
800 | +---+ +---+ +---+ +---+ | |
801 | <---| |--->| |--->| |-H->| |---> | |
802 | --->| |<---| |<---| |<---| |<--- | |
803 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
804 | |
805 | ||
806 | It can be even more complex if several nested writes came in and moved | |
f00c313b | 807 | the tail page ahead several pages:: |
8b2c70d1 SR |
808 | |
809 | ||
f00c313b | 810 | (first writer) |
8b2c70d1 | 811 | |
f00c313b MCC |
812 | tail page |
813 | | | |
814 | v | |
815 | +---+ +---+ +---+ +---+ | |
816 | <---| |--->| |-H->| |--->| |---> | |
817 | --->| |<---| |<---| |<---| |<--- | |
818 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 819 | |
f00c313b | 820 | The write converts the head page pointer to UPDATE:: |
8b2c70d1 | 821 | |
f00c313b MCC |
822 | tail page |
823 | | | |
824 | v | |
825 | +---+ +---+ +---+ +---+ | |
826 | <---| |--->| |-U->| |--->| |---> | |
827 | --->| |<---| |<---| |<---| |<--- | |
828 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
829 | |
830 | Next writer comes in, and sees the update and sets up the new | |
f00c313b | 831 | head page:: |
8b2c70d1 | 832 | |
f00c313b | 833 | (second writer) |
8b2c70d1 | 834 | |
f00c313b MCC |
835 | tail page |
836 | | | |
837 | v | |
838 | +---+ +---+ +---+ +---+ | |
839 | <---| |--->| |-U->| |-H->| |---> | |
840 | --->| |<---| |<---| |<---| |<--- | |
841 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
842 | |
843 | The nested writer moves the tail page forward. But does not set the old | |
f00c313b | 844 | update page to NORMAL because it is not the outermost writer:: |
8b2c70d1 | 845 | |
f00c313b MCC |
846 | tail page |
847 | | | |
848 | v | |
849 | +---+ +---+ +---+ +---+ | |
850 | <---| |--->| |-U->| |-H->| |---> | |
851 | --->| |<---| |<---| |<---| |<--- | |
852 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
853 | |
854 | Another writer preempts and sees the page after the tail page is a head page. | |
f00c313b | 855 | It changes it from HEAD to UPDATE:: |
8b2c70d1 | 856 | |
f00c313b | 857 | (third writer) |
8b2c70d1 | 858 | |
f00c313b MCC |
859 | tail page |
860 | | | |
861 | v | |
862 | +---+ +---+ +---+ +---+ | |
863 | <---| |--->| |-U->| |-U->| |---> | |
864 | --->| |<---| |<---| |<---| |<--- | |
865 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 866 | |
f00c313b | 867 | The writer will move the head page forward:: |
8b2c70d1 SR |
868 | |
869 | ||
f00c313b | 870 | (third writer) |
8b2c70d1 | 871 | |
f00c313b MCC |
872 | tail page |
873 | | | |
874 | v | |
875 | +---+ +---+ +---+ +---+ | |
876 | <---| |--->| |-U->| |-U->| |-H-> | |
877 | --->| |<---| |<---| |<---| |<--- | |
878 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
879 | |
880 | But now that the third writer did change the HEAD flag to UPDATE it | |
f00c313b | 881 | will convert it to normal:: |
8b2c70d1 SR |
882 | |
883 | ||
f00c313b | 884 | (third writer) |
8b2c70d1 | 885 | |
f00c313b MCC |
886 | tail page |
887 | | | |
888 | v | |
889 | +---+ +---+ +---+ +---+ | |
890 | <---| |--->| |-U->| |--->| |-H-> | |
891 | --->| |<---| |<---| |<---| |<--- | |
892 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
893 | |
894 | ||
f00c313b | 895 | Then it will move the tail page, and return back to the second writer:: |
8b2c70d1 SR |
896 | |
897 | ||
f00c313b | 898 | (second writer) |
8b2c70d1 | 899 | |
f00c313b MCC |
900 | tail page |
901 | | | |
902 | v | |
903 | +---+ +---+ +---+ +---+ | |
904 | <---| |--->| |-U->| |--->| |-H-> | |
905 | --->| |<---| |<---| |<---| |<--- | |
906 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
907 | |
908 | ||
909 | The second writer will fail to move the tail page because it was already | |
910 | moved, so it will try again and add its data to the new tail page. | |
f00c313b | 911 | It will return to the first writer:: |
8b2c70d1 SR |
912 | |
913 | ||
f00c313b | 914 | (first writer) |
8b2c70d1 | 915 | |
f00c313b MCC |
916 | tail page |
917 | | | |
918 | v | |
919 | +---+ +---+ +---+ +---+ | |
920 | <---| |--->| |-U->| |--->| |-H-> | |
921 | --->| |<---| |<---| |<---| |<--- | |
922 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 923 | |
006b4298 | 924 | The first writer cannot know atomically if the tail page moved |
8b2c70d1 | 925 | while it updates the HEAD page. It will then update the head page to |
f00c313b | 926 | what it thinks is the new head page:: |
8b2c70d1 SR |
927 | |
928 | ||
f00c313b | 929 | (first writer) |
8b2c70d1 | 930 | |
f00c313b MCC |
931 | tail page |
932 | | | |
933 | v | |
934 | +---+ +---+ +---+ +---+ | |
935 | <---| |--->| |-U->| |-H->| |-H-> | |
936 | --->| |<---| |<---| |<---| |<--- | |
937 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
938 | |
939 | Since the cmpxchg returns the old value of the pointer the first writer | |
940 | will see it succeeded in updating the pointer from NORMAL to HEAD. | |
941 | But as we can see, this is not good enough. It must also check to see | |
f00c313b | 942 | if the tail page is either where it use to be or on the next page:: |
8b2c70d1 SR |
943 | |
944 | ||
f00c313b | 945 | (first writer) |
8b2c70d1 | 946 | |
f00c313b MCC |
947 | A B tail page |
948 | | | | | |
949 | v v v | |
950 | +---+ +---+ +---+ +---+ | |
951 | <---| |--->| |-U->| |-H->| |-H-> | |
952 | --->| |<---| |<---| |<---| |<--- | |
953 | +---+ +---+ +---+ +---+ | |
8b2c70d1 | 954 | |
006b4298 RD |
955 | If tail page != A and tail page != B, then it must reset the pointer |
956 | back to NORMAL. The fact that it only needs to worry about nested | |
f00c313b | 957 | writers means that it only needs to check this after setting the HEAD page:: |
8b2c70d1 SR |
958 | |
959 | ||
f00c313b | 960 | (first writer) |
8b2c70d1 | 961 | |
f00c313b MCC |
962 | A B tail page |
963 | | | | | |
964 | v v v | |
965 | +---+ +---+ +---+ +---+ | |
966 | <---| |--->| |-U->| |--->| |-H-> | |
967 | --->| |<---| |<---| |<---| |<--- | |
968 | +---+ +---+ +---+ +---+ | |
8b2c70d1 SR |
969 | |
970 | Now the writer can update the head page. This is also why the head page must | |
006b4298 | 971 | remain in UPDATE and only reset by the outermost writer. This prevents |
f00c313b | 972 | the reader from seeing the incorrect head page:: |
8b2c70d1 | 973 | |
8b2c70d1 | 974 | |
f00c313b | 975 | (first writer) |
8b2c70d1 | 976 | |
f00c313b MCC |
977 | A B tail page |
978 | | | | | |
979 | v v v | |
980 | +---+ +---+ +---+ +---+ | |
981 | <---| |--->| |--->| |--->| |-H-> | |
982 | --->| |<---| |<---| |<---| |<--- | |
983 | +---+ +---+ +---+ +---+ |