======================================
Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.rst first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers. Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on. And with
PRIME / dmabuf, they can even be shared across devices. So there are
a handful of situations where the driver needs to wait for buffers to
become ready. If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts. That is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock. The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple. For each group of buffers (execbuf) that need
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter. In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature, a reservation ticket is associated with a transaction,
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, it waits for the other
transaction. Hence Wound-Wait.
The two algorithms are both fair in that a transaction will eventually succeed.
However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.
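
A compact way to see the difference is the decision both algorithms make when
a transaction hits a lock held by another transaction. The sketch below is
purely illustrative and not the kernel implementation (struct transaction and
resolve_contention are made-up names); a lower stamp means an older
transaction::

    struct transaction {
        unsigned long stamp;    /* reservation ticket from a global counter */
        bool wounded;
    };

    /*
     * Returns 0 if the requesting transaction should (keep) waiting for the
     * lock, -EDEADLK if it must back off, drop its locks and restart.
     */
    static int resolve_contention(bool wait_die, struct transaction *self,
                                  struct transaction *holder)
    {
        bool self_is_older = self->stamp < holder->stamp;

        if (wait_die)
            return self_is_older ? 0 : -EDEADLK;    /* younger dies */

        /* Wound-Wait: an older requester wounds the younger holder ... */
        if (self_is_older)
            holder->wounded = true;
        /* ... and in either case the requester waits for the lock */
        return 0;
    }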

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:

  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

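For instance, besides the plain functions used in the examples below,
ww_mutex_lock_interruptible() and ww_mutex_lock_slow_interruptible() can return
-EINTR when the sleep is interrupted by a signal. A minimal sketch of the extra
error handling this adds (the surrounding backoff/retry logic stays exactly the
same as in the examples below)::

    ret = ww_mutex_lock_interruptible(&entry->obj->lock, ctx);
    if (ret == -EINTR) {
        /* a signal is pending: unlock everything acquired so far,
         * call ww_acquire_fini() and return -EINTR to the caller */
    }

    /* the backoff path can be interrupted as well: */
    ret = ww_mutex_lock_slow_interruptible(&contended_entry->obj->lock, ctx);
    if (ret == -EINTR) {
        /* same cleanup as above */
    }
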
Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.

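For example (the class names here are arbitrary)::

    /* reservation objects protected with the Wound-Wait algorithm */
    static DEFINE_WW_CLASS(reservation_ww_class);

    /* the same, but with Wait-Die semantics instead */
    static DEFINE_WD_CLASS(reservation_wd_class);
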
Three different ways to acquire locks within the same w/w class. Common
definitions for methods #1 and #2::

    static DEFINE_WW_CLASS(ww_class);

    struct obj {
        struct ww_mutex lock;
        /* obj data */
    };

    struct obj_entry {
        struct list_head head;
        struct obj *obj;
    };

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is twice on the list. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::

    int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
    {
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

    retry:
        list_for_each_entry (entry, list, head) {
            if (entry->obj == res_obj) {
                res_obj = NULL;
                continue;
            }
            ret = ww_mutex_lock(&entry->obj->lock, ctx);
            if (ret < 0) {
                contended_entry = entry;
                goto err;
            }
        }

        ww_acquire_done(ctx);
        return 0;

    err:
        list_for_each_entry_continue_reverse (entry, list, head)
            ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
            ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
            /* we lost out in a seqno race, lock and retry.. */
            ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
            res_obj = contended_entry->obj;
            goto retry;
        }
        ww_acquire_fini(ctx);

        return ret;
    }

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code::

    int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
    {
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry (entry, list, head) {
            ret = ww_mutex_lock(&entry->obj->lock, ctx);
            if (ret < 0) {
                entry2 = entry;

                list_for_each_entry_continue_reverse (entry2, list, head)
                    ww_mutex_unlock(&entry2->obj->lock);

                if (ret != -EDEADLK) {
                    ww_acquire_fini(ctx);
                    return ret;
                }

                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&entry->obj->lock, ctx);

                /*
                 * Move buf to head of the list, this will point
                 * buf->next to the first unlocked entry,
                 * restarting the for loop.
                 */
                list_del(&entry->head);
                list_add(&entry->head, list);
            }
        }

        ww_acquire_done(ctx);
        return 0;
    }

Unlocking works the same way for both methods #1 and #2::

    void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
    {
        struct obj_entry *entry;

        list_for_each_entry (entry, list, head)
            ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
    }

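A hypothetical caller then simply pairs the two helpers defined above
(execbuf_list stands in for whatever list the driver already tracks)::

    struct ww_acquire_ctx ctx;
    int ret;

    ret = lock_objs(execbuf_list, &ctx);
    if (ret)
        return ret;     /* e.g. -EALREADY for a duplicate entry */

    /* operate on the locked buffers */

    unlock_objs(execbuf_list, &ctx);
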
Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:

- They can handle lock-acquisition in any order which allows us to start walking
  a graph from a starting point and then iteratively discovering new edges and
  locking down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:

- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here::

    struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
    };

    static DEFINE_WW_CLASS(ww_class);

    void __unlock_objs(struct list_head *list)
    {
        struct obj *entry, *temp;

        list_for_each_entry_safe (entry, temp, list, locked_list) {
            /* need to do that before unlocking, since only the current
             * lock holder is allowed to use object */
            list_del(&entry->locked_list);
            ww_mutex_unlock(&entry->ww_mutex);
        }
    }

    void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
    {
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

    retry:
        /* re-init loop start state */
        loop {
            /* magic code which walks over a graph and decides which objects
             * to lock */

            ret = ww_mutex_lock(&obj->ww_mutex, ctx);
            if (ret == -EALREADY) {
                /* we have that one already, get to the next object */
                continue;
            }
            if (ret == -EDEADLK) {
                __unlock_objs(list);

                ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                list_add(&obj->locked_list, list);
                goto retry;
            }

            /* locked a new object, add it to the list */
            list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
    }

    void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
    {
        __unlock_objs(list);
        ww_acquire_fini(ctx);
    }

Method 4: Only lock one single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.

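A minimal sketch (obj reuses the struct from the earlier examples)::

    int ret;

    /* with a NULL context this cannot return -EDEADLK or -EALREADY */
    ret = ww_mutex_lock(&obj->lock, NULL);
    if (ret)
        return ret;

    /* ... access obj ... */

    ww_mutex_unlock(&obj->lock);
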
Implementation Details
----------------------

Design:
^^^^^^^

ww_mutex currently encapsulates a struct mutex; this means no extra overhead for
normal mutex locks, which are far more common. As such there is only a small
increase in code size if wait/wound mutexes are not used.

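Conceptually the data structure looks roughly like this (simplified sketch; the
authoritative definition, including additional debug fields, lives in
include/linux/ww_mutex.h)::

    struct ww_mutex {
        struct mutex base;
        struct ww_acquire_ctx *ctx;
    };
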
We maintain the following invariants for the wait list:

(1) Waiters with an acquire context are sorted by stamp order; waiters
    without an acquire context are interspersed in FIFO order.
(2) For Wait-Die, among waiters with contexts, only the first one can have
    other locks acquired already (ctx->acquired > 0). Note that this waiter
    may come after other waiters without contexts in the list.

The Wound-Wait preemption is implemented with a lazy-preemption scheme:
The wounded status of the transaction is checked only when there is
contention for a new lock and hence a true chance of deadlock. In that
situation, if the transaction is wounded, it backs off, clears the
wounded status and retries. A great benefit of implementing preemption in
this way is that the wounded transaction can identify a contending lock to
wait for before restarting the transaction. Just blindly restarting the
transaction would likely make the transaction end up in a situation where
it would have to back off again.

In general, not much contention is expected. The locks are typically used to
serialize access to resources for devices, and optimization focus should
therefore be directed towards the uncontended cases.

Lockdep:
^^^^^^^^

Special care has been taken to warn for as many cases of api abuse
as possible. Some common api abuses will be caught with
CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

Some of the errors which will be warned about:

- Forgetting to call ww_acquire_fini or ww_acquire_init.
- Attempting to lock more mutexes after ww_acquire_done.
- Attempting to lock the wrong mutex after -EDEADLK and
  unlocking all mutexes.
- Attempting to lock the right mutex after -EDEADLK,
  before unlocking all mutexes.
- Calling ww_mutex_lock_slow before -EDEADLK was returned.
- Unlocking mutexes with the wrong unlock function.
- Calling one of the ww_acquire_* twice on the same context.
- Using a different ww_class for the mutex than for the ww_acquire_ctx.
- Normal lockdep errors that can result in deadlocks.

Some of the lockdep errors that can result in deadlocks:

- Calling ww_acquire_init to initialize a second ww_acquire_ctx before
  having called ww_acquire_fini on the first.
- 'normal' deadlocks that can occur.

FIXME:
  Update this section once we have the TASK_DEADLOCK task state flag magic
  implemented.