Commit | Line | Data |
---|---|---|
f0ba4377 | 1 | ========= |
9d0eb0ab JR | 2 | dm-switch |
3 | ========= | |
4 | ||
5 | The device-mapper switch target creates a device that supports an | |
6 | arbitrary mapping of fixed-size regions of I/O across a fixed set of | |
7 | paths. The path used for any specific region can be switched | |
8 | dynamically by sending the target a message. | |
9 | ||
10 | It maps I/O to underlying block devices efficiently when there is a large | |
11 | number of fixed-size address regions but there is no simple pattern | |
12 | that would allow for a compact representation of the mapping such as | |
13 | dm-stripe. | |
14 | ||
15 | Background | |
16 | ---------- | |
17 | ||
18 | Dell EqualLogic and some other iSCSI storage arrays use a distributed | |
19 | frameless architecture. In this architecture, the storage group | |
20 | consists of a number of distinct storage arrays ("members") each having | |
21 | independent controllers, disk storage and network adapters. When a LUN | |
22 | is created it is spread across multiple members. The details of the | |
23 | spreading are hidden from initiators connected to this storage system. | |
24 | The storage group exposes a single target discovery portal, no matter | |
25 | how many members are being used. When iSCSI sessions are created, each | |
26 | session is connected to an eth port on a single member. Data to a LUN | |
27 | can be sent on any iSCSI session, and if the blocks being accessed are | |
28 | stored on another member the I/O will be forwarded as required. This | |
29 | forwarding is invisible to the initiator. The storage layout is also | |
30 | dynamic, and the blocks stored on disk may be moved from member to | |
31 | member as needed to balance the load. | |
32 | ||
33 | This architecture simplifies the management and configuration of both | |
34 | the storage group and initiators. In a multipathing configuration, it | |
35 | is possible to set up multiple iSCSI sessions to use multiple network | |
36 | interfaces on both the host and target to take advantage of the | |
37 | increased network bandwidth. An initiator could use a simple round | |
38 | robin algorithm to send I/O across all paths and let the storage array | |
39 | members forward it as necessary, but there is a performance advantage to | |
40 | sending data directly to the correct member. | |
41 | ||
42 | A device-mapper table already lets you map different regions of a | |
43 | device onto different targets. However, in this architecture the LUN is | |
44 | spread with an address region size on the order of tens of MB, which | |
45 | means the resulting table could have more than a million entries and | |
46 | consume far too much memory. | |
47 | ||
48 | Using this device-mapper switch target we can now build a two-layer | |
49 | device hierarchy: | |
50 | ||
e73f6e8a MS | 51 | Upper Tier - Determine which array member the I/O should be sent to. |
52 | Lower Tier - Load balance amongst paths to a particular member. | |
9d0eb0ab JR | 53 | |
54 | The lower tier consists of a single dm multipath device for each member. | |
55 | Each of these multipath devices contains the set of paths directly to | |
56 | the array member in one priority group, and leverages existing path | |
57 | selectors to load balance amongst these paths. We also build a | |
58 | non-preferred priority group containing paths to other array members for | |
59 | failover reasons. | |
60 | ||
61 | The upper tier consists of a single dm-switch device. This device uses | |
62 | a bitmap to look up the location of the I/O and choose the appropriate | |
63 | lower tier device to route the I/O. By using a bitmap we are able to | |
64 | use 4 bits for each address range in a 16 member group (which is very | |
65 | large for us). This is a much denser representation than the dm table | |
66 | b-tree can achieve. | |
67 | ||
68 | Construction Parameters | |
69 | ======================= | |
70 | ||
f0ba4377 MCC | 71 | <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+ |
72 | <num_paths> | |
73 | The number of paths across which to distribute the I/O. | |
9d0eb0ab | 74 | |
f0ba4377 MCC | 75 | <region_size> |
76 | The number of 512-byte sectors in a region. Each region can be redirected | |
77 | to any of the available paths. | |
9d0eb0ab | 78 | |
f0ba4377 MCC | 79 | <num_optional_args> |
80 | The number of optional arguments. Currently, no optional arguments | |
81 | are supported and so this must be zero. | |
9d0eb0ab | 82 | |
f0ba4377 MCC | 83 | <dev_path> |
84 | The block device that represents a specific path to the device. | |
9d0eb0ab | 85 | |
f0ba4377 MCC | 86 | <offset> |
87 | The offset of the start of data on the specific <dev_path> (in units | |
88 | of 512-byte sectors). This number is added to the sector number when | |
89 | forwarding the request to the specific path. Typically it is zero. | |
9d0eb0ab JR | 90 | |
91 | Messages | |
92 | ======== | |
93 | ||
94 | set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>... | |
95 | ||
96 | Modify the region table by specifying which regions are redirected to | |
97 | which paths. | |
98 | ||
99 | <index> | |
100 | The region number (the region size is set by <region_size> in the construction parameters). | |
101 | If index is omitted, the next region (previous index + 1) is used. | |
102 | Expressed in hexadecimal (WITHOUT any prefix like 0x). | |
103 | ||
104 | <path_nr> | |
105 | The path number in the range 0 ... (<num_paths> - 1). | |
106 | Expressed in hexadecimal (WITHOUT any prefix like 0x). | |
107 | ||
56b1ebf2 MP | 108 | R<n>,<m> |
109 | This parameter allows repetitive patterns to be loaded quickly. <n> and <m> | |
110 | are hexadecimal numbers. The last <n> mappings are repeated in the next <m> | |
111 | slots. | |
112 | ||
9d0eb0ab JR | 113 | Status |
114 | ====== | |
115 | ||
116 | No status line is reported. | |
117 | ||
118 | Example | |
119 | ======= | |
120 | ||
121 | Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with | |
122 | the same size. | |
123 | ||
f0ba4377 MCC | 124 | Create a switch device with 64kB region size:: |
125 | ||
95f21c5c | 126 | dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` |
9d0eb0ab JR | 127 | switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0" |
128 | ||
129 | Set mappings for the first 7 entries to point to devices switch0, switch1, | |
f0ba4377 MCC | 130 | switch2, switch0, switch1, switch2, switch1:: |
131 | ||
9d0eb0ab | 132 | dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1 |
56b1ebf2 | 133 | |
f0ba4377 MCC | 134 | Set repetitive mapping. This command:: |
135 | ||
56b1ebf2 | 136 | dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10 |
f0ba4377 MCC | 137 | |
138 | is equivalent to:: | |
139 | ||
56b1ebf2 MP | 140 | dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \ |
141 | :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 |
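
The argument rules for set_region_mappings (explicit or omitted hexadecimal index, hexadecimal path number, and the R<n>,<m> repeat form) can be modelled with a short Python sketch. This is an illustration only and not the kernel implementation: the names `parse_hex` and `expand_region_mappings` are hypothetical, the region table is modelled as a plain dictionary, a first argument with an omitted index is rejected as a simplification, and R<n>,<m> is assumed to repeat only entries set earlier in the same message.

```python
import string


def parse_hex(text):
    """Parse a hexadecimal number written WITHOUT a prefix such as 0x."""
    if not text or any(ch not in string.hexdigits for ch in text):
        raise ValueError(f"not an unprefixed hex number: {text!r}")
    return int(text, 16)


def expand_region_mappings(args, num_paths):
    """Expand set_region_mappings arguments into {region_index: path_nr}.

    Simplified model (assumption): only regions named in this message are
    tracked, so R<n>,<m> can only repeat entries set earlier in `args`.
    """
    table = {}
    index = None  # no previous index yet

    for arg in args:
        if arg.startswith("R"):
            # R<n>,<m>: repeat the last <n> mappings in the next <m> slots.
            n_text, m_text = arg[1:].split(",")
            n, m = parse_hex(n_text), parse_hex(m_text)
            if index is None or n == 0:
                raise ValueError(f"nothing to repeat for {arg!r}")
            for _ in range(m):
                index += 1
                table[index] = table[index - n]
            continue

        # [<index>]:<path_nr>
        index_text, path_text = arg.split(":")
        if index_text:
            index = parse_hex(index_text)
        elif index is None:
            # Simplification in this sketch: require an explicit first index.
            raise ValueError("first argument needs an explicit index")
        else:
            index += 1  # omitted index means previous index + 1

        path_nr = parse_hex(path_text)
        if not 0 <= path_nr < num_paths:
            raise ValueError(f"path {path_nr:x} out of range in {arg!r}")
        table[index] = path_nr

    return table


# The two commands from the example above should yield the same mappings.
short_form = expand_region_mappings("1000:1 :2 R2,10".split(), num_paths=3)
long_form = expand_region_mappings(
    ("1000:1 :2 :1 :2 :1 :2 :1 :2 "
     ":1 :2 :1 :2 :1 :2 :1 :2 :1 :2").split(),
    num_paths=3,
)
print(short_form == long_form)
```

Because <m> is hexadecimal, `R2,10` fills 16 further slots, which is why the long form lists 18 mappings in total.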