.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
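
The reassignment can be reproduced with a minimal Python sketch. It assumes
equal weights and a deliberately small key space of 1000 hash values. The
helper name ``hash_threshold_select`` and the constant ``KEY_SPACE`` are
invented for this illustration and do not come from the kernel sources.

```python
def hash_threshold_select(next_hops, key, key_space):
    """Pick a next hop by splitting the key space into equal contiguous regions.

    Simplifying assumption: all next hops carry the same weight.
    """
    return next_hops[key * len(next_hops) // key_space]


KEY_SPACE = 1000  # illustrative; real hashes cover a much larger space
before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]  # next hop 3 removed

# Keys that did not use next hop 3, yet resolve differently after the removal.
moved = [
    key
    for key in range(KEY_SPACE)
    if hash_threshold_select(before, key, KEY_SPACE) != 3
    and hash_threshold_select(before, key, KEY_SPACE)
    != hash_threshold_select(after, key, KEY_SPACE)
]
print(len(moved))
```

Under these assumptions, 100 of the 800 keys that never used next hop 3
still change their next hop: keys 200 to 249 move from next hop 2 to 1, and
keys 750 to 799 move from next hop 4 to 5.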

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly start arriving at a server that does not expect them. This can
result in TCP connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB
hash to choose a hash table bucket, then reads the next hop that this
bucket contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next
hops::

            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             v v v v
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.
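
The same deletion can be sketched in Python with a 20-bucket table that
mirrors the diagram. The function name ``remove_next_hop`` is hypothetical,
and handing out replacement next hops round-robin is an assumption made
only to reproduce the picture, not a statement about how the kernel picks
them.

```python
from itertools import cycle


def remove_next_hop(table, removed, survivors):
    """Return a new bucket table in which only buckets of `removed` change.

    Assumption: replacement next hops are handed out round-robin.
    """
    replacements = cycle(survivors)
    return [next(replacements) if nh == removed else nh for nh in table]


table = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
new_table = remove_next_hop(table, removed=3, survivors=[1, 2, 4, 5])
changed = [i for i, (old, new) in enumerate(zip(table, new_table)) if old != new]
print(changed)
```

Only the four buckets that used to hold next hop 3 (indices 8 to 11) are
rewritten; every other bucket, and therefore every flow hashing to it,
keeps its next hop.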

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
continue to be forwarded to the same endpoints through the same paths as
before the next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number a "wants count" of a next hop. In case of an event that might
cause bucket allocation change, the wants counts for individual next hops
are updated.

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".
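
As a rough illustration, wants counts could be derived from weights as in
the Python sketch below. The function names ``wants_counts`` and
``classify`` are invented here, and the cumulative rounding scheme is an
assumption chosen so that the counts always add up to the table size; the
kernel sources remain the authority on the exact computation.

```python
def wants_counts(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights.

    Assumption: cumulative rounding, so that the counts always add up to
    num_buckets. `weights` maps next-hop ID to weight.
    """
    total = sum(weights.values())
    wants = {}
    cumulative = 0
    prev_bound = 0
    for nh, weight in weights.items():
        cumulative += weight
        # Integer round-to-nearest of num_buckets * cumulative / total.
        bound = (2 * num_buckets * cumulative + total) // (2 * total)
        wants[nh] = bound - prev_bound
        prev_bound = bound
    return wants


def classify(have, wants):
    """Return (underweight, overweight) lists of next-hop IDs."""
    under = [nh for nh in wants if have.get(nh, 0) < wants[nh]]
    over = [nh for nh in wants if have.get(nh, 0) > wants[nh]]
    return under, over


wants = wants_counts({1: 3, 2: 1}, num_buckets=8)
under, over = classify({1: 4, 2: 4}, wants)
print(wants, under, over)
```

With 8 buckets and weights 3 and 1, this yields wants counts of 6 and 2. A
group that currently holds 4 buckets per next hop therefore has next hop 1
underweight and next hop 2 overweight.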

Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.

After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and their next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for
a future time.

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer". This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.
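
The four conditions can be put together in a simplified Python model of a
single upkeep run. This is a sketch under stated assumptions, not the
kernel code: the ``Bucket`` class, the ``upkeep`` function and all of its
parameters are hypothetical, time is kept in plain seconds rather than
jiffies, and the first underweight next hop is always chosen as the
migration target.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    nh: Optional[int]   # next-hop ID, or None if unassigned
    last_used: float    # time of the last hit, in seconds


def upkeep(buckets, wants, now, idle_timer,
           unbalanced_timer=0.0, unbalanced_since=None):
    """Migrate eligible buckets to underweight next hops.

    Returns True if underweight next hops remain, that is, if another
    run would need to be scheduled.
    """
    have = {nh: 0 for nh in wants}
    for bucket in buckets:
        if bucket.nh in have:
            have[bucket.nh] += 1

    # Condition 4: the table has been out of balance for too long.
    force = (unbalanced_timer > 0 and unbalanced_since is not None
             and now - unbalanced_since > unbalanced_timer)

    def underweight():
        return [nh for nh in wants if have[nh] < wants[nh]]

    for bucket in buckets:
        candidates = underweight()
        if not candidates:
            break
        unassigned = bucket.nh is None                        # condition 1
        removed = not unassigned and bucket.nh not in wants   # condition 2
        overweight = bucket.nh in have and have[bucket.nh] > wants[bucket.nh]
        idle = now - bucket.last_used >= idle_timer           # condition 3
        if unassigned or removed or (overweight and (idle or force)):
            if bucket.nh in have:
                have[bucket.nh] -= 1
            bucket.nh = candidates[0]
            have[bucket.nh] += 1

    return bool(underweight())


buckets = [Bucket(1, 99.0) for _ in range(4)] + [
    Bucket(2, 10.0), Bucket(2, 95.0), Bucket(2, 20.0), Bucket(2, 99.0)]
pending = upkeep(buckets, {1: 6, 2: 2}, now=100.0, idle_timer=60.0)
print([b.nh for b in buckets], pending)
```

In this run, only the two idle buckets of the overweight next hop 2 are
migrated to next hop 1, and the busy ones are left alone. Had all buckets
of next hop 2 been busy, the table would stay out of balance until either
some of them went idle or a non-zero unbalanced timer expired.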

Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which are used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
  is sent before the group is replaced, and is a way for the driver to veto
  the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups. The following changes apply to the
attributes passed in the netlink message:

  =================== =========================================================
  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                      groups.
  =================== =========================================================

``NHA_RES_GROUP`` payload:

  =================================== =========================================
  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
  =================================== =========================================

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
``NHA_RES_GROUP`` payload will include the following attribute:

  =================================== =========================================
  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long has the resilient group been out
                                      of balance, in units of clock_t.
  =================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized at get requests
are:

  =================== =========================================================
  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
  ======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized at dump
requests are:

  =================== =========================================================
  ``NHA_ID``          If specified, limits the dump to just the next-hop group
                      with this ID.
  ``NHA_OIF``         If specified, limits the dump to buckets that contain
                      next hops that use the device with this ifindex.
  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
                      next hops that use a device in the VRF with this ifindex.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                           that contain the next hop with this ID.
  ======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

    # ip nexthop add id 1 via 192.0.2.2 dev eth0
    # ip nexthop add id 2 via 192.0.2.3 dev eth0
    # ip nexthop add id 10 group 1/2 type resilient \
            buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, and used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

    # ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

    # ip nexthop bucket show id 10
    id 10 index 0 idle_time 5.59 nhid 1
    id 10 index 1 idle_time 5.59 nhid 1
    id 10 index 2 idle_time 8.74 nhid 2
    id 10 index 3 idle_time 8.74 nhid 2
    id 10 index 4 idle_time 8.74 nhid 1
    id 10 index 5 idle_time 8.74 nhid 1
    id 10 index 6 idle_time 8.74 nhid 1
    id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.
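
When a table has more than a handful of buckets, tallying the output by
hand gets tedious. The following Python sketch counts buckets per next hop
from text in the format shown above. The function name
``buckets_per_next_hop`` is made up for this example, and the parser
assumes only that each bucket line contains the token ``nhid`` followed by
an integer; output from a real system may carry additional fields.

```python
from collections import Counter

SAMPLE = """\
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1
"""


def buckets_per_next_hop(text):
    """Count how many buckets reference each next-hop ID."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip lines that carry no "nhid <value>" pair.
        if "nhid" in fields[:-1]:
            counts[int(fields[fields.index("nhid") + 1])] += 1
    return counts


print(dict(buckets_per_next_hop(SAMPLE)))
```

For the sample above, next hop 1 holds 6 buckets and next hop 2 holds 2,
matching the 3:1 weight ratio requested by the replace command.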

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

    # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to configure that the
next attempt to migrate a bucket should fail::

    # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.