[linux-2.6-block.git] / Documentation / locking / ww-mutex-design.rst

======================================
Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPU's do operations that commonly involve many buffers.  Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on.  And with
PRIME / dmabuf, they can even be shared across devices.  So there are
a handful of situations where the driver needs to wait for buffers to
become ready.  If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in a execbuf/batch in
the same order in all contexts.  That is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes.	Which results in the potential for deadlock.  The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple.  For each group of buffers (execbuf) that need
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter.  In case of deadlock while locking all the buffers
associated with a execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature, a reservation ticket is associated with a transaction.
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, it waits for the other
transaction. Hence Wound-Wait.
The two algorithms are both fair in that a transaction will eventually succeed.
However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important the a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context is representing a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different class of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:

  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutex have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die)
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.

Three different ways to acquire locks within the same w/w class. Common
definitions for methods #1 and #2::

  static DEFINE_WW_CLASS(ww_class);

  struct obj {
	struct ww_mutex lock;
	/* obj data */
  };

  struct obj_entry {
	struct list_head head;
	struct obj *obj;
  };

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can use propagate the -EALREADY return code back to
the caller as a signal that an object is twice on the list. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj *res_obj = NULL;
	struct obj_entry *contended_entry = NULL;
	struct obj_entry *entry;

	ww_acquire_init(ctx, &ww_class);

  retry:
	list_for_each_entry (entry, list, head) {
		if (entry->obj == res_obj) {
			res_obj = NULL;
			continue;
		}
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			contended_entry = entry;
			goto err;
		}
	}

	ww_acquire_done(ctx);
	return 0;

  err:
	list_for_each_entry_continue_reverse (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	if (res_obj)
		ww_mutex_unlock(&res_obj->lock);

	if (ret == -EDEADLK) {
		/* we lost out in a seqno race, lock and retry.. */
		ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
		res_obj = contended_entry->obj;
		goto retry;
	}
	ww_acquire_fini(ctx);

	return ret;
  }

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj_entry *entry, *entry2;

	ww_acquire_init(ctx, &ww_class);

	list_for_each_entry (entry, list, head) {
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			entry2 = entry;

			list_for_each_entry_continue_reverse (entry2, list, head)
				ww_mutex_unlock(&entry2->obj->lock);

			if (ret != -EDEADLK) {
				ww_acquire_fini(ctx);
				return ret;
			}

			/* we lost out in a seqno race, lock and retry.. */
			ww_mutex_lock_slow(&entry->obj->lock, ctx);

			/*
			 * Move buf to head of the list, this will point
			 * buf->next to the first unlocked entry,
			 * restarting the for loop.
			 */
			list_del(&entry->head);
			list_add(&entry->head, list);
		}
	}

	ww_acquire_done(ctx);
	return 0;
  }

Unlocking works the same way for both methods #1 and #2::

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj_entry *entry;

	list_for_each_entry (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	ww_acquire_fini(ctx);
  }

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:

- They can handle lock-acquisition in any order which allows us to start walking
  a graph from a starting point and then iteratively discovering new edges and
  locking down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given objects is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track off which looks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:

- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the -EALREADY return
  code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a
list of starting nodes (passed in from userspace) using one of the above
methods. And then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here::

  struct obj {
	struct ww_mutex ww_mutex;
	struct list_head locked_list;
  };

  static DEFINE_WW_CLASS(ww_class);

  void __unlock_objs(struct list_head *list)
  {
	struct obj *entry, *temp;

	list_for_each_entry_safe (entry, temp, list, locked_list) {
		/* need to do that before unlocking, since only the current lock holder is
		allowed to use object */
		list_del(&entry->locked_list);
		ww_mutex_unlock(entry->ww_mutex)
	}
  }

  void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj *obj;

	ww_acquire_init(ctx, &ww_class);

  retry:
	/* re-init loop start state */
	loop {
		/* magic code which walks over a graph and decides which objects
		 * to lock */

		ret = ww_mutex_lock(obj->ww_mutex, ctx);
		if (ret == -EALREADY) {
			/* we have that one already, get to the next object */
			continue;
		}
		if (ret == -EDEADLK) {
			__unlock_objs(list);

			ww_mutex_lock_slow(obj, ctx);
			list_add(&entry->locked_list, list);
			goto retry;
		}

		/* locked a new object, add it to the list */
		list_add_tail(&entry->locked_list, list);
	}

	ww_acquire_done(ctx);
	return 0;
  }

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	__unlock_objs(list);
	ww_acquire_fini(ctx);
  }

Method 4: Only lock one single objects. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.

Implementation Details
----------------------

Design:
^^^^^^^

  ww_mutex currently encapsulates a struct mutex, this means no extra overhead for
  normal mutex locks, which are far more common. As such there is only a small
  increase in code size if wait/wound mutexes are not used.

  We maintain the following invariants for the wait list:

  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  The wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
^^^^^^^^

  Special care has been taken to warn for as many cases of api abuse
  as possible. Some common api abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:
   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.

   - Calling ww_mutex_lock_slow before -EDEADLK was returned.

   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME:
  Update this section once we have the TASK_DEADLOCK task state flag magic
  implemented.
Commit	Line	Data
387b1468	1	======================================
08295b3b	2	Wound/Wait Deadlock-Proof Mutex Design
040a0a37 ML	3	======================================
	4
	5	Please read mutex-design.txt first, as it applies to wait/wound mutexes too.
	6
	7	Motivation for WW-Mutexes
	8	-------------------------
	9
	10	GPU's do operations that commonly involve many buffers. Those buffers
	11	can be shared across contexts/processes, exist in different memory
	12	domains (for example VRAM vs system memory), and so on. And with
	13	PRIME / dmabuf, they can even be shared across devices. So there are
	14	a handful of situations where the driver needs to wait for buffers to
	15	become ready. If you think about this in terms of waiting on a buffer
	16	mutex for it to become available, this presents a problem because
	17	there is no way to guarantee that buffers appear in a execbuf/batch in
	18	the same order in all contexts. That is directly under control of
	19	userspace, and a result of the sequence of GL calls that an application
	20	makes. Which results in the potential for deadlock. The problem gets
	21	more complex when you consider that the kernel may need to migrate the
	22	buffer(s) into VRAM before the GPU operates on the buffer(s), which
	23	may in turn require evicting some other buffers (and you don't want to
	24	evict other buffers which are already queued up to the GPU), but for a
	25	simplified understanding of the problem you can ignore this.
	26
	27	The algorithm that the TTM graphics subsystem came up with for dealing with
	28	this problem is quite simple. For each group of buffers (execbuf) that need
	29	to be locked, the caller would be assigned a unique reservation id/ticket,
	30	from a global counter. In case of deadlock while locking all the buffers
	31	associated with a execbuf, the one with the lowest reservation ticket (i.e.
	32	the oldest task) wins, and the one with the higher reservation id (i.e. the
	33	younger task) unlocks all of the buffers that it has already locked, and then
	34	tries again.
	35
08295b3b TH	36	In the RDBMS literature, a reservation ticket is associated with a transaction.
	37	and the deadlock handling approach is called Wait-Die. The name is based on
	38	the actions of a locking thread when it encounters an already locked mutex.
	39	If the transaction holding the lock is younger, the locking transaction waits.
	40	If the transaction holding the lock is older, the locking transaction backs off
	41	and dies. Hence Wait-Die.
	42	There is also another algorithm called Wound-Wait:
	43	If the transaction holding the lock is younger, the locking transaction
	44	wounds the transaction holding the lock, requesting it to die.
	45	If the transaction holding the lock is older, it waits for the other
	46	transaction. Hence Wound-Wait.
	47	The two algorithms are both fair in that a transaction will eventually succeed.
	48	However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
	49	compared to Wait-Die, but is, on the other hand, associated with more work than
	50	Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
	51	algorithm in that transactions are wounded by other transactions, and that
	52	requires a reliable way to pick up up the wounded condition and preempt the
	53	running transaction. Note that this is not the same as process preemption. A
	54	Wound-Wait transaction is considered preempted when it dies (returning
	55	-EDEADLK) following a wound.
040a0a37 ML	56
	57	Concepts
	58	--------
	59
	60	Compared to normal mutexes two additional concepts/objects show up in the lock
	61	interface for w/w mutexes:
	62
	63	Acquire context: To ensure eventual forward progress it is important the a task
	64	trying to acquire locks doesn't grab a new reservation id, but keeps the one it
	65	acquired when starting the lock acquisition. This ticket is stored in the
	66	acquire context. Furthermore the acquire context keeps track of debugging state
08295b3b TH	67	to catch w/w mutex interface abuse. An acquire context is representing a
08295b3b TH	68	transaction.
040a0a37 ML	69
040a0a37 ML	70	W/w class: In contrast to normal mutexes the lock class needs to be explicit for
08295b3b TH	71	w/w mutexes, since it is required to initialize the acquire context. The lock
08295b3b TH	72	class also specifies what algorithm to use, Wound-Wait or Wait-Die.
040a0a37 ML	73
	74	Furthermore there are three different class of w/w lock acquire functions:
	75
	76	* Normal lock acquisition with a context, using ww_mutex_lock.
	77
55f036ca PZ	78	* Slowpath lock acquisition on the contending lock, used by the task that just
	79	killed its transaction after having dropped all already acquired locks.
	80	These functions have the _slow postfix.
040a0a37 ML	81
	82	From a simple semantics point-of-view the _slow functions are not strictly
	83	required, since simply calling the normal ww_mutex_lock functions on the
	84	contending lock (after having dropped all other already acquired locks) will
	85	work correctly. After all if no other ww mutex has been acquired yet there's
	86	no deadlock potential and hence the ww_mutex_lock call will block and not
	87	prematurely return -EDEADLK. The advantage of the _slow functions is in
	88	interface safety:
387b1468	89
040a0a37 ML	90	- ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
	91	has a void return type. Note that since ww mutex code needs loops/retries
	92	anyway the __must_check doesn't result in spurious warnings, even though the
	93	very first lock operation can never fail.
	94	- When full debugging is enabled ww_mutex_lock_slow checks that all acquired
	95	ww mutex have been released (preventing deadlocks) and makes sure that we
	96	block on the contending lock (preventing spinning through the -EDEADLK
	97	slowpath until the contended lock can be acquired).
	98
	99	* Functions to only acquire a single w/w mutex, which results in the exact same
	100	semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
	101	context.
	102
	103	Again this is not strictly required. But often you only want to acquire a
	104	single lock in which case it's pointless to set up an acquire context (and so
	105	better to avoid grabbing a deadlock avoidance ticket).
	106
	107	Of course, all the usual variants for handling wake-ups due to signals are also
	108	provided.
	109
	110	Usage
	111	-----
	112
08295b3b TH	113	The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
	114	DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die)
	115	As a rough rule of thumb, use Wound-Wait iff you
	116	expect the number of simultaneous competing transactions to be typically small,
	117	and you want to reduce the number of rollbacks.
	118
040a0a37	119	Three different ways to acquire locks within the same w/w class. Common
387b1468	120	definitions for methods #1 and #2::
040a0a37	121
387b1468	122	static DEFINE_WW_CLASS(ww_class);
040a0a37	123
387b1468	124	struct obj {
040a0a37 ML	125	struct ww_mutex lock;
040a0a37 ML	126	/* obj data */
387b1468	127	};
040a0a37	128
387b1468	129	struct obj_entry {
040a0a37 ML	130	struct list_head head;
040a0a37 ML	131	struct obj *obj;
387b1468	132	};
040a0a37 ML	133
	134	Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
	135	This is useful if a list of required objects is already tracked somewhere.
	136	Furthermore the lock helper can use propagate the -EALREADY return code back to
	137	the caller as a signal that an object is twice on the list. This is useful if
	138	the list is constructed from userspace input and the ABI requires userspace to
387b1468	139	not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::
040a0a37	140
387b1468 MCC	141	int lock_objs(struct list_head list, struct ww_acquire_ctx ctx)
387b1468 MCC	142	{
040a0a37 ML	143	struct obj *res_obj = NULL;
	144	struct obj_entry *contended_entry = NULL;
	145	struct obj_entry *entry;
	146
	147	ww_acquire_init(ctx, &ww_class);
	148
387b1468	149	retry:
040a0a37 ML	150	list_for_each_entry (entry, list, head) {
	151	if (entry->obj == res_obj) {
	152	res_obj = NULL;
	153	continue;
	154	}
	155	ret = ww_mutex_lock(&entry->obj->lock, ctx);
	156	if (ret < 0) {
	157	contended_entry = entry;
	158	goto err;
	159	}
	160	}
	161
	162	ww_acquire_done(ctx);
	163	return 0;
	164
387b1468	165	err:
040a0a37 ML	166	list_for_each_entry_continue_reverse (entry, list, head)
	167	ww_mutex_unlock(&entry->obj->lock);
	168
	169	if (res_obj)
	170	ww_mutex_unlock(&res_obj->lock);
	171
	172	if (ret == -EDEADLK) {
	173	/* we lost out in a seqno race, lock and retry.. */
	174	ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
	175	res_obj = contended_entry->obj;
	176	goto retry;
	177	}
	178	ww_acquire_fini(ctx);
	179
	180	return ret;
387b1468	181	}
040a0a37 ML	182
	183	Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
	184	of duplicate entry detection using -EALREADY as method 1 above. But the
387b1468	185	list-reordering allows for a bit more idiomatic code::
040a0a37	186
387b1468 MCC	187	int lock_objs(struct list_head list, struct ww_acquire_ctx ctx)
387b1468 MCC	188	{
040a0a37 ML	189	struct obj_entry entry, entry2;
	190
	191	ww_acquire_init(ctx, &ww_class);
	192
	193	list_for_each_entry (entry, list, head) {
	194	ret = ww_mutex_lock(&entry->obj->lock, ctx);
	195	if (ret < 0) {
	196	entry2 = entry;
	197
	198	list_for_each_entry_continue_reverse (entry2, list, head)
	199	ww_mutex_unlock(&entry2->obj->lock);
	200
	201	if (ret != -EDEADLK) {
	202	ww_acquire_fini(ctx);
	203	return ret;
	204	}
	205
	206	/* we lost out in a seqno race, lock and retry.. */
	207	ww_mutex_lock_slow(&entry->obj->lock, ctx);
	208
	209	/*
	210	* Move buf to head of the list, this will point
	211	* buf->next to the first unlocked entry,
	212	* restarting the for loop.
	213	*/
	214	list_del(&entry->head);
	215	list_add(&entry->head, list);
	216	}
	217	}
	218
	219	ww_acquire_done(ctx);
	220	return 0;
387b1468	221	}
040a0a37	222
387b1468	223	Unlocking works the same way for both methods #1 and #2::
040a0a37	224
387b1468 MCC	225	void unlock_objs(struct list_head list, struct ww_acquire_ctx ctx)
387b1468 MCC	226	{
040a0a37 ML	227	struct obj_entry *entry;
	228
	229	list_for_each_entry (entry, list, head)
	230	ww_mutex_unlock(&entry->obj->lock);
	231
	232	ww_acquire_fini(ctx);
387b1468	233	}
040a0a37 ML	234
	235	Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
	236	e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
	237	and edges can only be changed when holding the locks of all involved nodes. w/w
	238	mutexes are a natural fit for such a case for two reasons:
387b1468	239
040a0a37 ML	240	- They can handle lock-acquisition in any order which allows us to start walking
	241	a graph from a starting point and then iteratively discovering new edges and
	242	locking down the nodes those edges connect to.
	243	- Due to the -EALREADY return code signalling that a given objects is already
	244	held there's no need for additional book-keeping to break cycles in the graph
	245	or keep track off which looks are already held (when using more than one node
	246	as a starting point).
	247
	248	Note that this approach differs in two important ways from the above methods:
387b1468	249
040a0a37	250	- Since the list of objects is dynamically constructed (and might very well be
55f036ca	251	different when retrying due to hitting the -EDEADLK die condition) there's
040a0a37 ML	252	no need to keep any object on a persistent list when it's not locked. We can
	253	therefore move the list_head into the object itself.
	254	- On the other hand the dynamic object list construction also means that the -EALREADY return
	255	code can't be propagated.
	256
	257	Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a
	258	list of starting nodes (passed in from userspace) using one of the above
	259	methods. And then lock any additional objects affected by the operations using
	260	method #3 below. The backoff/retry procedure will be a bit more involved, since
	261	when the dynamic locking step hits -EDEADLK we also need to unlock all the
	262	objects acquired with the fixed list. But the w/w mutex debug checks will catch
	263	any interface misuse for these cases.
	264
	265	Also, method 3 can't fail the lock acquisition step since it doesn't return
	266	-EALREADY. Of course this would be different when using the _interruptible
387b1468	267	variants, but that's outside of the scope of these examples here::
040a0a37	268
387b1468	269	struct obj {
040a0a37 ML	270	struct ww_mutex ww_mutex;
040a0a37 ML	271	struct list_head locked_list;
387b1468	272	};
040a0a37	273
387b1468	274	static DEFINE_WW_CLASS(ww_class);
040a0a37	275
387b1468 MCC	276	void __unlock_objs(struct list_head *list)
387b1468 MCC	277	{
040a0a37 ML	278	struct obj entry, temp;
	279
	280	list_for_each_entry_safe (entry, temp, list, locked_list) {
	281	/* need to do that before unlocking, since only the current lock holder is
	282	allowed to use object */
	283	list_del(&entry->locked_list);
	284	ww_mutex_unlock(entry->ww_mutex)
	285	}
387b1468	286	}
040a0a37	287
387b1468 MCC	288	void lock_objs(struct list_head list, struct ww_acquire_ctx ctx)
387b1468 MCC	289	{
040a0a37 ML	290	struct obj *obj;
	291
	292	ww_acquire_init(ctx, &ww_class);
	293
387b1468	294	retry:
040a0a37 ML	295	/* re-init loop start state */
	296	loop {
	297	/* magic code which walks over a graph and decides which objects
	298	* to lock */
	299
	300	ret = ww_mutex_lock(obj->ww_mutex, ctx);
	301	if (ret == -EALREADY) {
	302	/* we have that one already, get to the next object */
	303	continue;
	304	}
	305	if (ret == -EDEADLK) {
	306	__unlock_objs(list);
	307
	308	ww_mutex_lock_slow(obj, ctx);
	309	list_add(&entry->locked_list, list);
	310	goto retry;
	311	}
	312
	313	/* locked a new object, add it to the list */
	314	list_add_tail(&entry->locked_list, list);
	315	}
	316
	317	ww_acquire_done(ctx);
	318	return 0;
387b1468	319	}
040a0a37	320
387b1468 MCC	321	void unlock_objs(struct list_head list, struct ww_acquire_ctx ctx)
387b1468 MCC	322	{
040a0a37 ML	323	__unlock_objs(list);
040a0a37 ML	324	ww_acquire_fini(ctx);
387b1468	325	}
040a0a37 ML	326
	327	Method 4: Only lock one single objects. In that case deadlock detection and
	328	prevention is obviously overkill, since with grabbing just one lock you can't
	329	produce a deadlock within just one class. To simplify this case the w/w mutex
	330	api can be used with a NULL context.
	331
	332	Implementation Details
	333	----------------------
	334
	335	Design:
387b1468 MCC	336	^^^^^^^
387b1468 MCC	337
040a0a37 ML	338	ww_mutex currently encapsulates a struct mutex, this means no extra overhead for
	339	normal mutex locks, which are far more common. As such there is only a small
	340	increase in code size if wait/wound mutexes are not used.
	341
27bd57aa	342	We maintain the following invariants for the wait list:
387b1468	343
27bd57aa NH	344	(1) Waiters with an acquire context are sorted by stamp order; waiters
27bd57aa NH	345	without an acquire context are interspersed in FIFO order.
08295b3b TH	346	(2) For Wait-Die, among waiters with contexts, only the first one can have
	347	other locks acquired already (ctx->acquired > 0). Note that this waiter
	348	may come after other waiters without contexts in the list.
	349
	350	The Wound-Wait preemption is implemented with a lazy-preemption scheme:
	351	The wounded status of the transaction is checked only when there is
	352	contention for a new lock and hence a true chance of deadlock. In that
	353	situation, if the transaction is wounded, it backs off, clears the
	354	wounded status and retries. A great benefit of implementing preemption in
	355	this way is that the wounded transaction can identify a contending lock to
	356	wait for before restarting the transaction. Just blindly restarting the
	357	transaction would likely make the transaction end up in a situation where
	358	it would have to back off again.
27bd57aa	359
040a0a37	360	In general, not much contention is expected. The locks are typically used to
08295b3b TH	361	serialize access to resources for devices, and optimization focus should
08295b3b TH	362	therefore be directed towards the uncontended cases.
040a0a37 ML	363
040a0a37 ML	364	Lockdep:
387b1468 MCC	365	^^^^^^^^
387b1468 MCC	366
040a0a37 ML	367	Special care has been taken to warn for as many cases of api abuse
	368	as possible. Some common api abuses will be caught with
	369	CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.
	370
	371	Some of the errors which will be warned about:
	372	- Forgetting to call ww_acquire_fini or ww_acquire_init.
	373	- Attempting to lock more mutexes after ww_acquire_done.
	374	- Attempting to lock the wrong mutex after -EDEADLK and
	375	unlocking all mutexes.
	376	- Attempting to lock the right mutex after -EDEADLK,
	377	before unlocking all mutexes.
	378
	379	- Calling ww_mutex_lock_slow before -EDEADLK was returned.
	380
	381	- Unlocking mutexes with the wrong unlock function.
	382	- Calling one of the ww_acquire_* twice on the same context.
	383	- Using a different ww_class for the mutex than for the ww_acquire_ctx.
	384	- Normal lockdep errors that can result in deadlocks.
	385
	386	Some of the lockdep errors that can result in deadlocks:
	387	- Calling ww_acquire_init to initialize a second ww_acquire_ctx before
	388	having called ww_acquire_fini on the first.
	389	- 'normal' deadlocks that can occur.
	390
387b1468 MCC	391	FIXME:
	392	Update this section once we have the TASK_DEADLOCK task state flag magic
	393	implemented.