
==========
MD Cluster
==========

Cluster MD is a shared-device RAID for a cluster. It supports two
levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |
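
As a reading aid for the diagram, the following sketch computes where
each node's bitmap superblock would start. It assumes the layout is
exactly as drawn: 4k blocks, the first bitmap superblock at 8k, and two
blocks per node. The constants and the function name are illustrative
only; the real size of each node's bitmap depends on the array.

.. code-block:: python

    BLOCK = 4096                   # block size used in the diagram (4k)
    BITMAP_AREA_START = 2 * BLOCK  # "bm super [0]" starts at 8k
    BLOCKS_PER_NODE = 2            # assumption: two 4k blocks per node

    def bitmap_super_offset(slot):
        """Byte offset of the bitmap superblock for a zero-based slot."""
        if slot < 0:
            raise ValueError("slot must be non-negative")
        return BITMAP_AREA_START + slot * BLOCKS_PER_NODE * BLOCK

    # Offsets in KiB for the four slots shown in the diagram.
    offsets_kib = [bitmap_super_offset(s) // 1024 for s in range(4)]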

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.
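
The three steps above can be sketched as a small in-memory model. The
class and method names are hypothetical, the clear delay is an arbitrary
value, and the model does not cover flushing the bit to disk before the
data is written.

.. code-block:: python

    class WriteIntentBitmap:
        """Toy model of one node's write-intent bitmap."""

        def __init__(self, clear_delay=5.0):
            self.bits = set()       # chunks with a write possibly in flight
            self.clear_at = {}      # chunk -> time at which to clear its bit
            self.clear_delay = clear_delay

        def write(self, chunk, mirrors, data, now):
            # 1. set the appropriate bit (if not already set)
            self.bits.add(chunk)
            # 2. commit the write to all mirrors
            for mirror in mirrors:
                mirror[chunk] = data
            # 3. schedule the bit to be cleared after a timeout
            self.clear_at[chunk] = now + self.clear_delay

        def tick(self, now):
            """Clear every bit whose timeout has expired."""
            for chunk, deadline in list(self.clear_at.items()):
                if now >= deadline:
                    self.bits.discard(chunk)
                    del self.clear_at[chunk]

A later write to the same chunk simply pushes the deadline back, so the
bit stays set while the chunk is being written repeatedly.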

Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and it stays
 so during the lifetime the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.
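
 The slot-to-name mapping described above can be written as a one-line
 helper. The function name is hypothetical, and the three-digit padding
 is taken from the example names in this section; the kernel source may
 format the name differently.

 .. code-block:: python

     def bitmap_lock_name(dlm_slot):
         """Name of the bitmap lock resource for a one-based DLM slot."""
         if dlm_slot < 1:
             raise ValueError("DLM slot numbers start at 1")
         # DLM counts nodes from one, bitmap slots count from zero.
         return "bitmap%03d" % (dlm_slot - 1)

     first = bitmap_lock_name(1)
     second = bitmap_lock_name(2)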

 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.

2.2 Message passing locks
-------------------------

 Each node has to communicate with other nodes when starting or ending
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message", and "ack", together
 with the Lock Value Block (LVB) of the "message" lock.

2.3 new-device management
-------------------------

 A single lock: "no-new-dev" is used to coordinate the addition of
 new devices - this must be synchronized across the array.
 Normally all nodes hold a concurrent-read lock on this resource.

3. Communication
================

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types
-----------------

 There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

   Informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal device
   failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^

   Informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

   Informs other nodes that a device is being added to
   the array. Message contains an identifier for that device.  See
   below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

   A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

   A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

   If a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take ownership of
   the resync.

3.2 Communication mechanism
---------------------------

 The DLM LVB is used to communicate between nodes of the cluster. There
 are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^
   The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

3.2.2 message
^^^^^^^^^^^^^
   The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

   The resource whose acquisition means that the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving node that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have concurrent-reader lock on "ack"::

	sender                         receiver                 receiver
	"ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token",
    sender gets EX on "message"::

	sender                        receiver                 receiver
	"token":EX                    "ack":CR                 "ack":CR
	"message":EX
	"ack":CR

    Sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB

    sender down-converts "message" from EX to CW

    sender tries to get EX on "ack"

    ::

      [ wait until all receivers have *processed* the "message" ]

                                       [ triggered by bast of "ack" ]
                                       receiver get CR on "message"
                                       receiver read LVB
                                       receiver processes the message
                                       [ wait finish ]
                                       receiver releases "ack"
                                       receiver tries to get PR on "message"

     sender                         receiver                  receiver
     "token":EX                     "message":CR              "message":CR
     "message":CW
     "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed message)

    sender down-converts "ack" from EX to CR

    sender releases "message"

    sender releases "token"

    ::

                                 receiver upconvert to PR on "message"
                                 receiver get CR of "ack"
                                 receiver release "message"

     sender                      receiver                   receiver
     "ack":CR                    "ack":CR                   "ack":CR

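The waits in the algorithm above follow from the usual DLM lock-mode
compatibility rules. The sketch below encodes the standard six-mode
compatibility table and checks the requests made in steps 3 and 4. The
helper ``can_grant`` is hypothetical; this models mode compatibility
only, not the DLM API.

.. code-block:: python

    # For each mode, the set of modes it can coexist with on one resource.
    COMPAT = {
        "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
        "CR": {"NL", "CR", "CW", "PR", "PW"},
        "CW": {"NL", "CR", "CW"},
        "PR": {"NL", "CR", "PR"},
        "PW": {"NL", "CR"},
        "EX": {"NL"},
    }

    def can_grant(requested, held_by_others):
        """True if `requested` is compatible with every mode held by others."""
        return all(held in COMPAT[requested] for held in held_by_others)

    # Step 3: the sender's EX request on "ack" waits while receivers hold CR
    ack_ex_blocked = not can_grant("EX", ["CR", "CR"])
    # and is granted once every receiver has released "ack".
    ack_ex_granted = can_grant("EX", [])
    # Receivers can take CR on "message" while the sender holds CW,
    msg_cr_granted = can_grant("CR", ["CW"])
    # but their PR request waits until the sender releases "message".
    msg_pr_blocked = not can_grant("PR", ["CW", "CR"])
    msg_pr_granted = can_grant("PR", ["CR"])
    # Step 4: receivers regain CR on "ack" only after the sender has
    # down-converted from EX to CR.
    ack_cr_blocked = not can_grant("CR", ["EX"])
    ack_cr_granted = can_grant("CR", ["CR", "CR"])

Every flag above evaluates to true, which is why the grant of EX on
"ack" tells the sender that all receivers have processed the message.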

4. Handling Failures
====================

4.1 Node Failure
----------------

 When a node fails, the DLM informs the cluster with the slot
 number. The node starts a cluster recovery thread. The cluster
 recovery thread:

	- acquires the bitmap<number> lock of the failed node
	- opens the bitmap
	- reads the bitmap of the failed node
	- copies the set bitmap to local node
	- cleans the bitmap of the failed node
	- releases bitmap<number> lock of the failed node
	- initiates resync of the bitmap on the current node.
	  md_check_recovery is invoked within recover_bitmaps, and
	  md_check_recovery calls metadata_update_start/finish, which
	  locks the communication channel through lock_comm. As a
	  result, while one node is resyncing, all other nodes are
	  blocked from writing anywhere on the array.
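
 The bitmap steps of the recovery thread can be modelled as below.
 Bitmaps are represented as Python sets of dirty chunk numbers and the
 DLM locking is reduced to comments; both are simplifications, and the
 function name is hypothetical.

 .. code-block:: python

     def recover_failed_bitmap(local_bits, failed_bits):
         """Move the failed node's dirty chunks into the local bitmap.

         Returns the sorted chunks that the local node must now resync.
         """
         # (acquire the failed node's bitmap<number> lock here)
         copied = set(failed_bits)   # read the failed node's bitmap
         local_bits |= copied        # copy the set bits to the local node
         failed_bits.clear()         # clean the failed node's bitmap
         # (release the failed node's bitmap<number> lock here)
         return sorted(copied)

     local = {1, 2}
     failed = {2, 7}
     to_resync = recover_failed_bitmap(local, failed)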

 The resync process is the regular md resync. However, in a clustered
 environment when a resync is performed, it needs to tell other nodes
 of the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing resync finishes, it sends RESYNCING with an empty range to
 other nodes and other nodes remove the corresponding entry from the
 suspend_list.

 A helper function, ->area_resyncing() can be used to check if a
 particular I/O range should be suspended or not.
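
 A minimal sketch of the suspend_list handling described above. It
 assumes half-open sector ranges, treats any range with hi <= lo as the
 "empty range", and keys entries by the sender's slot so that a new
 RESYNCING message replaces the previous one from that node. The class
 is hypothetical and covers only the range check, not the special
 handling during node recovery.

 .. code-block:: python

     class SuspendList:
         """Per-node record of ranges that other nodes are resyncing."""

         def __init__(self):
             self.ranges = {}  # sending node's slot -> (lo, hi)

         def on_resyncing(self, slot, lo, hi):
             """Handle a RESYNCING message from the node in `slot`."""
             if hi <= lo:
                 # Empty range: that node has finished resyncing.
                 self.ranges.pop(slot, None)
             else:
                 # A new range replaces any earlier one from the same node.
                 self.ranges[slot] = (lo, hi)

         def area_resyncing(self, lo, hi):
             """True if [lo, hi) overlaps any suspended range."""
             return any(lo < s_hi and s_lo < hi
                        for s_lo, s_hi in self.ranges.values())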

4.2 Device Failure
------------------

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device
======================

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

   1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
   2.  Node 1 sends a NEWDISK message with uuid and slot number
   3.  Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
   4.  In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
   5.  Other nodes issue either of the following depending on whether
       the disk was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
       disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
   6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
   7.  Node 1 attempts EX lock on "no-new-dev"
   8.  If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
   9.  If not (get "no-new-dev" lock), it fails the operation and sends
       METADATA_UPDATED.
   10. Other nodes get the information whether a disk is added or not
       by the following METADATA_UPDATED.
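
 The outcome of steps 6 to 9 depends only on whether any other node
 still holds its CR lock on "no-new-dev". The sketch below models that
 decision; the function name and its boolean-list argument are
 illustrative assumptions, not part of the md or DLM interfaces.

 .. code-block:: python

     def add_new_disk(found_by_other_nodes):
         """Return True if the disk ends up added to the array.

         `found_by_other_nodes` holds one boolean per other node: whether
         that node found the new disk in userspace.
         """
         # Step 6: a node drops its CR lock only if it found the device.
         remaining_cr = [found for found in found_by_other_nodes if not found]
         # Step 7: EX conflicts with CR, so node 1 gets the lock only if
         # no other node is still holding CR.
         got_ex = len(remaining_cr) == 0
         # Steps 8 and 9: METADATA_UPDATED is sent in both cases; only the
         # success case leaves the disk added.
         return got_ex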

6. Module interface
===================

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()
---------------------------

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()
-----------------------------------

 These are called when resync/recovery/reshape starts or stops.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start() reports the whole
 array as resyncing; resync_finish() reports none of it.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
-------------------------------------------------------------------------------

 metadata_update_start() is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes; otherwise metadata_update_cancel()
 can be used to release the lock.
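
 The calling pattern for these three call-backs can be sketched with a
 stand-in object. ``FakeClusterOps`` and ``update_metadata`` are
 hypothetical names for illustration; the stand-in only records what the
 real call-backs are described as doing.

 .. code-block:: python

     class FakeClusterOps:
         """Stand-in for the cluster module, recording calls only."""

         def __init__(self):
             self.locked = False
             self.sent = []

         def metadata_update_start(self):
             if self.locked:
                 raise RuntimeError("metadata update already in progress")
             self.locked = True            # exclusive access gained

         def metadata_update_finish(self):
             self.sent.append("METADATA_UPDATED")  # tell all other nodes
             self.locked = False

         def metadata_update_cancel(self):
             self.locked = False           # release without sending

     def update_metadata(ops, change_still_needed):
         """Run one update cycle; return True if a message was sent."""
         ops.metadata_update_start()
         if change_still_needed():
             ops.metadata_update_finish()
             return True
         ops.metadata_update_cancel()
         return False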

6.6 area_resyncing()
--------------------

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID handling
 a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement to a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()
-----------------

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()
--------------------

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

 These are called when the bitmap is changed to none.  If a node plans
 to clear the cluster raid's bitmap, it needs to make sure that no other
 nodes are using the raid.  This is achieved by taking all bitmap
 locks within the cluster; the locks are released again
 afterwards.

7. Unsupported features
=======================

There are some things which are not supported by cluster MD yet.

- change array_sectors.