Commit | Line | Data |
---|---|---|
7ed44d59 MCC |
1 | ========== |
2 | MD Cluster | |
3 | ========== | |
4 | ||
f0e230ad GJ |
5 | The cluster MD is a shared-device RAID for a cluster; it supports |
6 | two levels: raid1 and raid10 (limited support). | |
b8d83448 GR |
7 | |
8 | ||
9 | 1. On-disk format | |
7ed44d59 | 10 | ================= |
b8d83448 | 11 | |
d323ef0f | 12 | Separate write-intent-bitmaps are used for each cluster node. |
b8d83448 | 13 | The bitmaps record all writes that may have been started on that node, |
7ed44d59 | 14 | and may not yet have finished. The on-disk layout is:: |
b8d83448 | 15 | |
7ed44d59 MCC |
16 | 0 4k 8k 12k |
17 | ------------------------------------------------------------------- | |
18 | | idle | md super | bm super [0] + bits | | |
19 | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | |
20 | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | |
21 | | bm bits [3, contd] | | | | |
b8d83448 | 22 | |
d323ef0f GJ |
23 | During "normal" functioning we assume the filesystem ensures that only |
24 | one node writes to any given block at a time, so a write request will | |
25 | ||
b8d83448 GR |
26 | - set the appropriate bit (if not already set) |
27 | - commit the write to all mirrors | |
28 | - schedule the bit to be cleared after a timeout. | |
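The three write steps above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name, the `chunk_sectors` and `clear_delay` parameters, and the use of dictionaries as "mirrors" are hypothetical and do not come from the md code.

```python
class WriteIntentBitmap:
    """Toy per-node write-intent bitmap: one bit per chunk of the array."""

    def __init__(self, chunk_sectors, clear_delay):
        self.chunk_sectors = chunk_sectors   # sectors covered by one bit (assumed)
        self.clear_delay = clear_delay       # timeout before a bit may be cleared
        self.bits = set()                    # chunks whose bit is currently set
        self.clear_at = {}                   # chunk -> earliest time to clear it

    def write(self, sector, data, mirrors, now):
        chunk = sector // self.chunk_sectors
        self.bits.add(chunk)                 # 1. set the bit (no-op if already set)
        for mirror in mirrors:               # 2. commit the write to all mirrors
            mirror[sector] = data
        self.clear_at[chunk] = now + self.clear_delay   # 3. schedule the clear

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        for chunk, deadline in list(self.clear_at.items()):
            if now >= deadline:
                self.bits.discard(chunk)
                del self.clear_at[chunk]
```

A bit that is still set after a crash marks a chunk whose mirrors may disagree, which is what the recovery path later relies on.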
29 | ||
d323ef0f GJ |
30 | Reads are just handled normally. It is up to the filesystem to ensure |
31 | one node doesn't read from a location where another node (or the same | |
b8d83448 GR |
32 | node) is writing. |
33 | ||
34 | ||
35 | 2. DLM Locks for management | |
7ed44d59 | 36 | =========================== |
b8d83448 | 37 | |
d323ef0f | 38 | There are three groups of locks for managing the device: |
b8d83448 GR |
39 | |
40 | 2.1 Bitmap lock resource (bm_lockres) | |
7ed44d59 | 41 | ------------------------------------- |
b8d83448 | 42 | |
d323ef0f GJ |
43 | The bm_lockres protects individual node bitmaps. They are named in |
44 | the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a | |
45 | node joins the cluster, it acquires the lock in PW mode and it stays | |
46 | so during the lifetime the node is part of the cluster. The lock | |
47 | resource number is based on the slot number returned by the DLM | |
48 | subsystem. Since DLM starts node count from one and bitmap slots | |
49 | start from zero, one is subtracted from the DLM slot number to arrive | |
50 | at the bitmap slot number. | |
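The slot arithmetic described above can be written out as a small sketch. The function names are hypothetical, and the three-digit zero padding is taken only from the example names in this document (bitmap000, bitmap001); it is an assumption, not a statement about the driver.

```python
def bitmap_slot(dlm_slot):
    """Map a DLM slot number (starting at one) to a bitmap slot (starting at zero)."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start at one")
    return dlm_slot - 1


def bitmap_lock_name(dlm_slot, width=3):
    """Build the lock resource name; `width` follows the examples in the text."""
    return "bitmap{:0{w}d}".format(bitmap_slot(dlm_slot), w=width)
```

For example, `bitmap_lock_name(1)` gives the name used for node 1 and `bitmap_lock_name(2)` the name used for node 2.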
51 | ||
52 | The LVB of the bitmap lock for a particular node records the range | |
53 | of sectors that are being re-synced by that node. No other | |
54 | node may write to those sectors. This is used when a new node | |
55 | joins the cluster. | |
56 | ||
57 | 2.2 Message passing locks | |
7ed44d59 | 58 | ------------------------- |
d323ef0f GJ |
59 | |
60 | Each node has to communicate with other nodes when starting or ending | |
61 | resync, and for metadata superblock updates. This communication is | |
62 | managed through three locks: "token", "message", and "ack", together | |
63 | with the Lock Value Block (LVB) of the "message" lock. | |
64 | ||
65 | 2.3 new-device management | |
7ed44d59 | 66 | ------------------------- |
d323ef0f | 67 | |
7852fe3a | 68 | A single lock: "no-new-dev" is used to coordinate the addition of |
d323ef0f GJ |
69 | new devices - this must be synchronized across the array. |
70 | Normally all nodes hold a concurrent-read lock on this device. | |
b8d83448 GR |
71 | |
72 | 3. Communication | |
7ed44d59 | 73 | ================ |
b8d83448 | 74 | |
d323ef0f GJ |
75 | Messages can be broadcast to all nodes, and the sender waits for all |
76 | other nodes to acknowledge the message before proceeding. Only one | |
77 | message can be processed at a time. | |
b8d83448 GR |
78 | |
79 | 3.1 Message Types | |
7ed44d59 | 80 | ----------------- |
b8d83448 | 81 | |
d323ef0f | 82 | There are six types of messages which are passed: |
b8d83448 | 83 | |
7ed44d59 MCC |
84 | 3.1.1 METADATA_UPDATED |
85 | ^^^^^^^^^^^^^^^^^^^^^^ | |
86 | ||
87 | informs other nodes that the metadata has | |
d323ef0f GJ |
88 | been updated, and the node must re-read the md superblock. This is |
89 | performed synchronously. It is primarily used to signal device | |
90 | failure. | |
b8d83448 | 91 | |
7ed44d59 MCC |
92 | 3.1.2 RESYNCING |
93 | ^^^^^^^^^^^^^^^ | |
94 | informs other nodes that a resync is initiated or | |
d323ef0f GJ |
95 | ended so that each node may suspend or resume the region. Each |
96 | RESYNCING message identifies a range of the devices that the | |
d7714952 | 97 | sending node is about to resync. This overrides any previous |
d323ef0f GJ |
98 | notification from that node: only one range can be resynced at a |
99 | time per-node. | |
100 | ||
7ed44d59 MCC |
101 | 3.1.3 NEWDISK |
102 | ^^^^^^^^^^^^^ | |
103 | ||
104 | informs other nodes that a device is being added to | |
d323ef0f GJ |
105 | the array. Message contains an identifier for that device. See |
106 | below for further details. | |
107 | ||
7ed44d59 MCC |
108 | 3.1.4 REMOVE |
109 | ^^^^^^^^^^^^ | |
110 | ||
111 | A failed or spare device is being removed from the | |
d323ef0f GJ |
112 | array. The slot-number of the device is included in the message. |
113 | ||
7ed44d59 MCC |
114 | 3.1.5 RE_ADD: |
115 | ||
116 | A failed device is being re-activated - the assumption | |
d323ef0f GJ |
117 | is that it has been determined to be working again. |
118 | ||
7ed44d59 MCC |
119 | 3.1.6 BITMAP_NEEDS_SYNC: |
120 | ||
121 | If a node is stopped locally but the bitmap | |
d323ef0f GJ |
122 | isn't clean, then another node is informed to take the ownership of |
123 | resync. | |
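As a compact summary, the six message types can be listed as an enumeration. This is a sketch only: the class name and the `describe` helper are hypothetical, and the numeric values are arbitrary rather than the values used on the wire.

```python
from enum import Enum, auto


class ClusterMsg(Enum):
    """The six message types named above; numeric values are arbitrary here."""
    METADATA_UPDATED = auto()   # peers must re-read the md superblock
    RESYNCING = auto()          # carries the range being resynced
    NEWDISK = auto()            # carries an identifier for the new device
    REMOVE = auto()             # carries the slot number of the device
    RE_ADD = auto()             # a failed device is being re-activated
    BITMAP_NEEDS_SYNC = auto()  # another node should take over the resync


def describe(msg):
    """Return a short printable label for a message type."""
    return "{} ({})".format(msg.name, msg.value)
```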
b8d83448 GR |
124 | |
125 | 3.2 Communication mechanism | |
7ed44d59 | 126 | --------------------------- |
b8d83448 GR |
127 | |
128 | The DLM LVB is used to communicate between nodes of the cluster. There | |
129 | are three resources used for the purpose: | |
130 | ||
7ed44d59 MCC |
131 | 3.2.1 token |
132 | ^^^^^^^^^^^ | |
133 | The resource which protects the entire communication | |
b8d83448 GR |
134 | system. The node having the token resource is allowed to |
135 | communicate. | |
136 | ||
7ed44d59 MCC |
137 | 3.2.2 message |
138 | ^^^^^^^^^^^^^ | |
139 | The lock resource which carries the data to communicate. | |
b8d83448 | 140 | |
7ed44d59 MCC |
141 | 3.2.3 ack |
142 | ^^^^^^^^^ | |
143 | ||
144 | The resource, acquiring which means the message has been | |
b8d83448 | 145 | acknowledged by all nodes in the cluster. The BAST of the resource |
d323ef0f GJ |
146 | is used to inform the receiving node that a node wants to |
147 | communicate. | |
b8d83448 GR |
148 | |
149 | The algorithm is: | |
150 | ||
7ed44d59 MCC |
151 | 1. receive status - all nodes have concurrent-reader lock on "ack":: |
152 | ||
153 | sender receiver receiver | |
154 | "ack":CR "ack":CR "ack":CR | |
b8d83448 | 155 | |
7ed44d59 MCC |
156 | 2. sender get EX on "token", |
157 | sender get EX on "message":: | |
b8d83448 | 158 | |
7ed44d59 MCC |
159 | sender receiver receiver |
160 | "token":EX "ack":CR "ack":CR | |
161 | "message":EX | |
162 | "ack":CR | |
b8d83448 | 163 | |
d323ef0f GJ |
164 | Sender checks that it still needs to send a message. Messages |
165 | received or other events that happened while waiting for the | |
166 | "token" may have made this message inappropriate or redundant. | |
b8d83448 | 167 | |
7ed44d59 MCC |
168 | 3. sender writes LVB |
169 | ||
d323ef0f | 170 | sender down-convert "message" from EX to CW |
7ed44d59 | 171 | |
d323ef0f | 172 | sender try to get EX of "ack" |
b8d83448 | 173 | |
7ed44d59 MCC |
174 | :: |
175 | ||
176 | [ wait until all receivers have *processed* the "message" ] | |
d323ef0f | 177 | |
7ed44d59 MCC |
178 | [ triggered by bast of "ack" ] |
179 | receiver get CR on "message" | |
180 | receiver read LVB | |
181 | receiver processes the message | |
182 | [ wait finish ] | |
183 | receiver releases "ack" | |
184 | receiver tries to get PR on "message" | |
185 | ||
186 | sender receiver receiver | |
187 | "token":EX "message":CR "message":CR | |
188 | "message":CW | |
189 | "ack":EX | |
d323ef0f GJ |
190 | |
191 | 4. triggered by grant of EX on "ack" (indicating all receivers | |
192 | have processed message) | |
7ed44d59 | 193 | |
d323ef0f | 194 | sender down-converts "ack" from EX to CR |
7ed44d59 | 195 | |
d323ef0f | 196 | sender releases "message" |
7ed44d59 | 197 | |
d323ef0f | 198 | sender releases "token" |
b8d83448 | 199 | |
7ed44d59 MCC |
200 | :: |
201 | ||
202 | receiver upconvert to PR on "message" | |
203 | receiver get CR of "ack" | |
204 | receiver release "message" | |
205 | ||
206 | sender receiver receiver | |
207 | "ack":CR "ack":CR "ack":CR | |
b8d83448 GR |
208 | |
209 | ||
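The four steps of the algorithm above can be walked through with a small single-process simulation. It is a sketch under assumptions: the `Resource` class and `broadcast` function are hypothetical, blocking lock requests are modelled as a `try_lock` that simply fails, and the BAST is modelled by running the receiver steps inline. The compatibility table is the usual DLM one for the modes CR, CW, PR, PW and EX.

```python
# COMPAT[a] lists the modes other nodes may hold while one node is granted a.
COMPAT = {
    "CR": {"CR", "CW", "PR", "PW"},
    "CW": {"CR", "CW"},
    "PR": {"CR", "PR"},
    "PW": {"CR"},
    "EX": set(),
}


class Resource:
    """Toy lock resource: per-node granted modes plus a shared LVB."""

    def __init__(self, name):
        self.name = name
        self.held = {}      # node -> granted mode
        self.lvb = None

    def try_lock(self, node, mode):
        """Grant (or convert to) `mode` if it is compatible with other holders."""
        others = [m for n, m in self.held.items() if n != node]
        if all(m in COMPAT[mode] for m in others):
            self.held[node] = mode
            return True
        return False

    def unlock(self, node):
        self.held.pop(node, None)


def broadcast(sender, receivers, payload):
    token, message, ack = Resource("token"), Resource("message"), Resource("ack")
    inbox = {}
    # Step 1: every node idles with CR on "ack".
    for node in [sender] + receivers:
        assert ack.try_lock(node, "CR")
    # Step 2: sender takes EX on "token" and on "message".
    assert token.try_lock(sender, "EX")
    assert message.try_lock(sender, "EX")
    # Step 3: sender writes the LVB, down-converts to CW, then asks for EX on "ack".
    message.lvb = payload
    assert message.try_lock(sender, "CW")
    blocked = not ack.try_lock(sender, "EX")    # receivers still hold CR
    for node in receivers:                      # stands in for the BAST on "ack"
        assert message.try_lock(node, "CR")
        inbox[node] = message.lvb               # read and process the message
        ack.unlock(node)
        assert not message.try_lock(node, "PR")  # waits while sender holds CW
    assert ack.try_lock(sender, "EX")           # all receivers have released "ack"
    # Step 4: sender returns to the idle state.
    assert ack.try_lock(sender, "CR")
    message.unlock(sender)
    token.unlock(sender)
    for node in receivers:
        assert message.try_lock(node, "PR")
        assert ack.try_lock(node, "CR")
        message.unlock(node)
    return blocked, inbox, dict(ack.held)
```

The model makes two points from the text visible: the sender's EX request on "ack" cannot be granted until every receiver has released its CR, and the receivers' PR request on "message" cannot be granted until the sender has released its CW.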
210 | 4. Handling Failures | |
7ed44d59 | 211 | ==================== |
b8d83448 GR |
212 | |
213 | 4.1 Node Failure | |
7ed44d59 | 214 | ---------------- |
d323ef0f GJ |
215 | |
216 | When a node fails, the DLM informs the cluster with the slot | |
217 | number. The node starts a cluster recovery thread. The cluster | |
218 | recovery thread: | |
219 | ||
b8d83448 GR |
220 | - acquires the bitmap<number> lock of the failed node |
221 | - opens the bitmap | |
222 | - reads the bitmap of the failed node | |
223 | - copies the set bitmap to local node | |
224 | - cleans the bitmap of the failed node | |
225 | - releases bitmap<number> lock of the failed node | |
226 | - initiates resync of the bitmap on the current node | |
7ed44d59 MCC |
227 | md_check_recovery is invoked within recover_bitmaps;
228 | md_check_recovery then calls metadata_update_start/finish, | |
229 | which locks the communication via lock_comm. | |
230 | This means that when one node is resyncing it blocks all | |
231 | other nodes from writing anywhere on the array. | |
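The bitmap hand-over performed by the recovery thread can be sketched as follows. This is an illustration only: bitmaps are modelled as Python sets of chunk numbers keyed by slot, the function name is hypothetical, and taking and releasing the bitmap<number> lock is reduced to comments.

```python
def recover_failed_node(bitmaps, failed_slot, local_slot):
    """Fold the failed node's set bits into the local bitmap.

    `bitmaps` maps a slot number to the set of chunks whose bit is set.
    Returns the sorted chunks the local node now has to resync.
    """
    # (acquire the bitmap lock of the failed node here)
    failed = bitmaps[failed_slot]               # open and read the failed bitmap
    bitmaps[local_slot] |= failed               # copy the set bits to the local node
    failed.clear()                              # clean the failed node's bitmap
    # (release the bitmap lock of the failed node here)
    return sorted(bitmaps[local_slot])          # input for the local resync
```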
b8d83448 | 232 | |
d323ef0f | 233 | The resync process is the regular md resync. However, in a clustered |
b8d83448 GR |
234 | environment when a resync is performed, it needs to tell other nodes |
235 | of the areas which are suspended. Before a resync starts, the node | |
d323ef0f GJ |
236 | sends out RESYNCING with the (lo,hi) range of the area which needs to |
237 | be suspended. Each node maintains a suspend_list, which contains the | |
238 | list of ranges which are currently suspended. On receiving RESYNCING, | |
239 | the node adds the range to the suspend_list. Similarly, when the node | |
240 | performing resync finishes, it sends RESYNCING with an empty range to | |
241 | other nodes and other nodes remove the corresponding entry from the | |
242 | suspend_list. | |
b8d83448 | 243 | |
d323ef0f GJ |
244 | A helper function, ->area_resyncing() can be used to check if a |
245 | particular I/O range should be suspended or not. | |
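The suspend_list bookkeeping and the range check can be sketched like this. The class is hypothetical; it assumes ranges are half-open `[lo, hi)` sector intervals, that an "empty range" is any range with `hi <= lo`, and that at most one range is tracked per sending node, as the RESYNCING description states.

```python
class SuspendList:
    """Toy suspend_list: at most one suspended range per sending node."""

    def __init__(self):
        self.ranges = {}    # sender slot -> (lo, hi), half-open

    def on_resyncing(self, slot, lo, hi):
        """Handle a RESYNCING message from `slot`."""
        if hi <= lo:                        # empty range: resync finished
            self.ranges.pop(slot, None)
        else:                               # overrides any earlier range
            self.ranges[slot] = (lo, hi)

    def area_resyncing(self, lo, hi):
        """True if [lo, hi) overlaps any suspended range."""
        return any(lo < r_hi and hi > r_lo
                   for r_lo, r_hi in self.ranges.values())
```

A caller would check `area_resyncing(lo, hi)` before issuing I/O and hold the request back while it returns true.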
b8d83448 GR |
246 | |
247 | 4.2 Device Failure | |
7ed44d59 | 248 | ------------------ |
d323ef0f | 249 | |
b8d83448 | 250 | Device failures are handled and communicated with the metadata update |
d323ef0f GJ |
251 | routine. When a node detects a device failure it does not allow |
252 | any further writes to that device until the failure has been | |
253 | acknowledged by all other nodes. | |
b8d83448 GR |
254 | |
255 | 5. Adding a new Device | |
7ed44d59 | 256 | ====================== |
d323ef0f GJ |
257 | |
258 | For adding a new device, it is necessary that all nodes "see" the new | |
259 | device to be added. For this, the following algorithm is used: | |
b8d83448 | 260 | |
7ed44d59 | 261 | 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues |
d323ef0f | 262 | ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) |
7ed44d59 MCC |
263 | 2. Node 1 sends a NEWDISK message with uuid and slot number |
264 | 3. Other nodes issue kobject_uevent_env with uuid and slot number | |
b8d83448 | 265 | (Steps 4,5 could be a udev rule) |
7ed44d59 | 266 | 4. In userspace, the node searches for the disk, perhaps |
b8d83448 | 267 | using blkid -t SUB_UUID="" |
7ed44d59 | 268 | 5. Other nodes issue either of the following depending on whether |
d323ef0f | 269 | the disk was found: |
b8d83448 | 270 | ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and |
7ed44d59 | 271 | disc.number set to slot number) |
b8d83448 | 272 | ioctl(CLUSTERED_DISK_NACK) |
7ed44d59 MCC |
273 | 6. Other nodes drop lock on "no-new-dev" (CR) if device is found |
274 | 7. Node 1 attempts EX lock on "no-new-dev" | |
275 | 8. If node 1 gets the lock, it sends METADATA_UPDATED after | |
d323ef0f | 276 | unmarking the disk as SpareLocal |
7ed44d59 | 277 | 9. If not (get "no-new-dev" lock), it fails the operation and sends |
d323ef0f GJ |
278 | METADATA_UPDATED. |
279 | 10. Other nodes get the information whether a disk is added or not | |
280 | by the following METADATA_UPDATED. | |
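The outcome of steps 6 to 9 depends only on which nodes found the disk, which the following sketch makes explicit. It is a simplification under assumptions: the function name and its argument are hypothetical, and timeouts and the actual ioctl and DLM calls are left out.

```python
def add_new_disk(found_by_node):
    """Decide the outcome of a cluster-wide device add.

    `found_by_node` maps each *other* node to True if it found the disk.
    Returns (outcome, nodes still holding CR on "no-new-dev").
    """
    # Nodes that did not find the disk keep their CR lock on "no-new-dev".
    still_holding_cr = sorted(n for n, found in found_by_node.items() if not found)
    # EX is incompatible with CR, so node 1 gets EX only if nobody holds CR.
    got_ex = not still_holding_cr
    outcome = "added" if got_ex else "failed"
    # In both cases node 1 then sends METADATA_UPDATED to the other nodes.
    return outcome, still_holding_cr
```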
281 | ||
7ed44d59 MCC |
282 | 6. Module interface |
283 | =================== | |
d323ef0f GJ |
284 | |
285 | There are 17 call-backs which the md core can make to the cluster | |
286 | module. Understanding these can give a good overview of the whole | |
287 | process. | |
288 | ||
289 | 6.1 join(nodes) and leave() | |
7ed44d59 | 290 | --------------------------- |
d323ef0f GJ |
291 | |
292 | These are called when an array is started with a clustered bitmap, | |
293 | and when the array is stopped. join() ensures the cluster is | |
294 | available and initializes the various resources. | |
295 | Only the first 'nodes' nodes in the cluster can use the array. | |
296 | ||
297 | 6.2 slot_number() | |
7ed44d59 | 298 | ----------------- |
d323ef0f GJ |
299 | |
300 | Reports the slot number advised by the cluster infrastructure. | |
301 | Range is from 0 to nodes-1. | |
302 | ||
303 | 6.3 resync_info_update() | |
7ed44d59 | 304 | ------------------------ |
d323ef0f GJ |
305 | |
306 | This updates the resync range that is stored in the bitmap lock. | |
307 | The starting point is updated as the resync progresses. The | |
308 | end point is always the end of the array. | |
309 | It does *not* send a RESYNCING message. | |
310 | ||
311 | 6.4 resync_start(), resync_finish() | |
7ed44d59 | 312 | ----------------------------------- |
d323ef0f GJ |
313 | |
314 | These are called when resync/recovery/reshape starts or stops. | |
315 | They update the resyncing range in the bitmap lock and also | |
316 | send a RESYNCING message. resync_start reports the whole | |
317 | array as resyncing, resync_finish reports none of it. | |
318 | ||
319 | resync_finish() also sends a BITMAP_NEEDS_SYNC message which | |
320 | allows some other node to take over. | |
321 | ||
7ed44d59 MCC |
322 | 6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel() |
323 | ------------------------------------------------------------------------------- | |
d323ef0f GJ |
324 | |
325 | metadata_update_start is used to get exclusive access to | |
326 | the metadata. If a change is still needed once that access is | |
327 | gained, metadata_update_finish() will send a METADATA_UPDATED | |
328 | message to all other nodes, otherwise metadata_update_cancel() | |
329 | can be used to release the lock. | |
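The calling pattern for these three call-backs can be sketched as below. This is a model, not the module's code: the `ClusterMetadata` class and the `update_if_needed` helper are hypothetical, and a `threading.Lock` stands in for the cluster-wide communication lock.

```python
import threading


class ClusterMetadata:
    """Toy stand-in for the cluster module's metadata update call-backs."""

    def __init__(self):
        self._comm = threading.Lock()   # stands in for the cluster-wide lock
        self.sent = []                  # messages "broadcast" so far

    def metadata_update_start(self):
        self._comm.acquire()            # gain exclusive access to the metadata

    def metadata_update_finish(self):
        self.sent.append("METADATA_UPDATED")
        self._comm.release()

    def metadata_update_cancel(self):
        self._comm.release()            # nothing changed, nothing to announce


def update_if_needed(md, still_needed):
    """Take the lock, re-check the condition, then finish or cancel."""
    md.metadata_update_start()
    if still_needed():
        md.metadata_update_finish()
        return True
    md.metadata_update_cancel()
    return False
```

Re-checking the condition after the lock is held mirrors the text: a message received while waiting may already have made the change unnecessary.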
330 | ||
331 | 6.6 area_resyncing() | |
7ed44d59 | 332 | -------------------- |
d323ef0f GJ |
333 | |
334 | This combines two elements of functionality. | |
335 | ||
336 | Firstly, it will check if any node is currently resyncing | |
337 | anything in a given range of sectors. If any resync is found, | |
338 | then the caller will avoid writing or read-balancing in that | |
339 | range. | |
340 | ||
341 | Secondly, while node recovery is happening it reports that | |
342 | all areas are resyncing for READ requests. This avoids races | |
343 | between the cluster-filesystem and the cluster-RAID handling | |
344 | a node failure. | |
345 | ||
346 | 6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack() | |
7ed44d59 | 347 | --------------------------------------------------------------- |
d323ef0f GJ |
348 | |
349 | These are used to manage the new-disk protocol described above. | |
350 | When a new device is added, add_new_disk_start() is called before | |
351 | it is bound to the array and, if that succeeds, add_new_disk_finish() | |
352 | is called once the device is fully added. | |
353 | ||
354 | When a device is added in acknowledgement to a previous | |
355 | request, or when the device is declared "unavailable", | |
356 | new_disk_ack() is called. | |
357 | ||
358 | 6.8 remove_disk() | |
7ed44d59 | 359 | ----------------- |
d323ef0f GJ |
360 | |
361 | This is called when a spare or failed device is removed from | |
362 | the array. It causes a REMOVE message to be sent to other nodes. | |
363 | ||
364 | 6.9 gather_bitmaps() | |
7ed44d59 | 365 | -------------------- |
d323ef0f GJ |
366 | |
367 | This sends a RE_ADD message to all other nodes and then | |
368 | gathers bitmap information from all bitmaps. This combined | |
369 | bitmap is then used to recover the re-added device. | |
370 | ||
371 | 6.10 lock_all_bitmaps() and unlock_all_bitmaps() | |
7ed44d59 | 372 | ------------------------------------------------ |
d323ef0f GJ |
373 | |
374 | These are called when the bitmap is changed to none. If a node plans | |
375 | to clear the cluster raid's bitmap, it needs to make sure no other | |
376 | nodes are using the raid, which is achieved by locking all bitmap | |
377 | locks within the cluster; those locks are unlocked again | |
378 | afterwards. | |
ab5a98b1 GJ |
379 | |
380 | 7. Unsupported features | |
7ed44d59 | 381 | ======================= |
ab5a98b1 GJ |
382 | |
383 | There are some things which are not supported by cluster MD yet. | |
384 | ||
818da59f | 385 | - change array_sectors. |