| 1 | .. SPDX-License-Identifier: GPL-2.0 |
| 2 | |
| 3 | ========================= |
| 4 | Resilient Next-hop Groups |
| 5 | ========================= |
| 6 | |
| 7 | Resilient groups are a type of next-hop group that is aimed at minimizing |
| 8 | disruption in flow routing across changes to the group composition and |
| 9 | weights of constituent next hops. |
| 10 | |
| 11 | The idea behind resilient hashing groups is best explained in contrast to |
| 12 | the legacy multipath next-hop group, which uses the hash-threshold |
| 13 | algorithm, described in RFC 2992. |
| 14 | |
| 15 | To select a next hop, hash-threshold algorithm first assigns a range of |
| 16 | hashes to each next hop in the group, and then selects the next hop by |
| 17 | comparing the SKB hash with the individual ranges. When a next hop is |
| 18 | removed from the group, the ranges are recomputed, which leads to |
| 19 | reassignment of parts of hash space from one next hop to another. RFC 2992 |
| 20 | illustrates it thus:: |
| 21 | |
| 22 | +-------+-------+-------+-------+-------+ |
| 23 | | 1 | 2 | 3 | 4 | 5 | |
| 24 | +-------+-+-----+---+---+-----+-+-------+ |
| 25 | | 1 | 2 | 4 | 5 | |
| 26 | +---------+---------+---------+---------+ |
| 27 | |
| 28 | Before and after deletion of next hop 3 |
| 29 | under the hash-threshold algorithm. |
| 30 | |
| 31 | Note how next hop 2 gave up part of the hash space in favor of next hop 1, |
| 32 | and 4 in favor of 5. While there will usually be some overlap between the |
| 33 | previous and the new distribution, some traffic flows change the next hop |
| 34 | that they resolve to. |
| 35 | |
| 36 | If a multipath group is used for load-balancing between multiple servers, |
| 37 | this hash space reassignment causes an issue that packets from a single |
| 38 | flow suddenly end up arriving at a server that does not expect them. This |
| 39 | can result in TCP connections being reset. |
| 40 | |
| 41 | If a multipath group is used for load-balancing among available paths to |
| 42 | the same server, the issue is that different latencies and reordering along |
| 43 | the way causes the packets to arrive in the wrong order, resulting in |
| 44 | degraded application performance. |
| 45 | |
| 46 | To mitigate the above-mentioned flow redirection, resilient next-hop groups |
| 47 | insert another layer of indirection between the hash space and its |
| 48 | constituent next hops: a hash table. The selection algorithm uses SKB hash |
| 49 | to choose a hash table bucket, then reads the next hop that this bucket |
| 50 | contains, and forwards traffic there. |
| 51 | |
| 52 | This indirection brings an important feature. In the hash-threshold |
| 53 | algorithm, the range of hashes associated with a next hop must be |
| 54 | continuous. With a hash table, mapping between the hash table buckets and |
| 55 | the individual next hops is arbitrary. Therefore when a next hop is deleted |
| 56 | the buckets that held it are simply reassigned to other next hops:: |
| 57 | |
| 58 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 59 | |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5| |
| 60 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 61 | v v v v |
| 62 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 63 | |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5| |
| 64 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 65 | |
| 66 | Before and after deletion of next hop 3 |
| 67 | under the resilient hashing algorithm. |
| 68 | |
| 69 | When weights of next hops in a group are altered, it may be possible to |
| 70 | choose a subset of buckets that are currently not used for forwarding |
| 71 | traffic, and use those to satisfy the new next-hop distribution demands, |
| 72 | keeping the "busy" buckets intact. This way, established flows are ideally |
| 73 | kept being forwarded to the same endpoints through the same paths as before |
| 74 | the next-hop group change. |
| 75 | |
| 76 | Algorithm |
| 77 | --------- |
| 78 | |
| 79 | In a nutshell, the algorithm works as follows. Each next hop deserves a |
| 80 | certain number of buckets, according to its weight and the number of |
| 81 | buckets in the hash table. In accordance with the source code, we will call |
| 82 | this number a "wants count" of a next hop. In case of an event that might |
| 83 | cause bucket allocation change, the wants counts for individual next hops |
| 84 | are updated. |
| 85 | |
| 86 | Next hops that have fewer buckets than their wants count, are called |
| 87 | "underweight". Those that have more are "overweight". If there are no |
| 88 | overweight (and therefore no underweight) next hops in the group, it is |
| 89 | said to be "balanced". |
| 90 | |
| 91 | Each bucket maintains a last-used timer. Every time a packet is forwarded |
| 92 | through a bucket, this timer is updated to current jiffies value. One |
| 93 | attribute of a resilient group is then the "idle timer", which is the |
| 94 | amount of time that a bucket must not be hit by traffic in order for it to |
| 95 | be considered "idle". Buckets that are not idle are busy. |
| 96 | |
| 97 | After assigning wants counts to next hops, an "upkeep" algorithm runs. For |
| 98 | buckets: |
| 99 | |
| 100 | 1) that have no assigned next hop, or |
| 101 | 2) whose next hop has been removed, or |
| 102 | 3) that are idle and their next hop is overweight, |
| 103 | |
| 104 | upkeep changes the next hop that the bucket references to one of the |
| 105 | underweight next hops. If, after considering all buckets in this manner, |
| 106 | there are still underweight next hops, another upkeep run is scheduled to a |
| 107 | future time. |
| 108 | |
| 109 | There may not be enough "idle" buckets to satisfy the updated wants counts |
| 110 | of all next hops. Another attribute of a resilient group is the "unbalanced |
| 111 | timer". This timer can be set to 0, in which case the table will stay out |
| 112 | of balance until idle buckets do appear, possibly never. If set to a |
| 113 | non-zero value, the value represents the period of time that the table is |
| 114 | permitted to stay out of balance. |
| 115 | |
| 116 | With this in mind, we update the above list of conditions with one more |
| 117 | item. Thus buckets: |
| 118 | |
| 119 | 4) whose next hop is overweight, and the amount of time that the table has |
| 120 | been out of balance exceeds the unbalanced timer, if that is non-zero, |
| 121 | |
| 122 | \... are migrated as well. |
| 123 | |
| 124 | Offloading & Driver Feedback |
| 125 | ---------------------------- |
| 126 | |
| 127 | When offloading resilient groups, the algorithm that distributes buckets |
| 128 | among next hops is still the one in SW. Drivers are notified of updates to |
| 129 | next hop groups in the following three ways: |
| 130 | |
| 131 | - Full group notification with the type |
| 132 | ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is |
| 133 | created and buckets populated for the first time. |
| 134 | |
| 135 | - Single-bucket notifications of the type |
| 136 | ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of |
| 137 | individual migrations within an already-established group. |
| 138 | |
| 139 | - Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This |
| 140 | is sent before the group is replaced, and is a way for the driver to veto |
| 141 | the group before committing anything to the HW. |
| 142 | |
| 143 | Some single-bucket notifications are forced, as indicated by the "force" |
| 144 | flag in the notification. Those are used for the cases where e.g. the next |
| 145 | hop associated with the bucket was removed, and the bucket really must be |
| 146 | migrated. |
| 147 | |
| 148 | Non-forced notifications can be overridden by the driver by returning an |
| 149 | error code. The use case for this is that the driver notifies the HW that a |
| 150 | bucket should be migrated, but the HW discovers that the bucket has in fact |
| 151 | been hit by traffic. |
| 152 | |
| 153 | A second way for the HW to report that a bucket is busy is through the |
| 154 | ``nexthop_res_grp_activity_update()`` API. The buckets identified this way |
| 155 | as busy are treated as if traffic hit them. |
| 156 | |
| 157 | Offloaded buckets should be flagged as either "offload" or "trap". This is |
| 158 | done through the ``nexthop_bucket_set_hw_flags()`` API. |
| 159 | |
| 160 | Netlink UAPI |
| 161 | ------------ |
| 162 | |
| 163 | Resilient Group Replacement |
| 164 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 165 | |
| 166 | Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the |
| 167 | same manner as other multipath groups. The following changes apply to the |
| 168 | attributes passed in the netlink message: |
| 169 | |
| 170 | =================== ========================================================= |
| 171 | ``NHA_GROUP_TYPE`` Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group. |
| 172 | ``NHA_RES_GROUP`` A nest that contains attributes specific to resilient |
| 173 | groups. |
| 174 | =================== ========================================================= |
| 175 | |
| 176 | ``NHA_RES_GROUP`` payload: |
| 177 | |
| 178 | =================================== ========================================= |
| 179 | ``NHA_RES_GROUP_BUCKETS`` Number of buckets in the hash table. |
| 180 | ``NHA_RES_GROUP_IDLE_TIMER`` Idle timer in units of clock_t. |
| 181 | ``NHA_RES_GROUP_UNBALANCED_TIMER`` Unbalanced timer in units of clock_t. |
| 182 | =================================== ========================================= |
| 183 | |
| 184 | Next Hop Get |
| 185 | ^^^^^^^^^^^^ |
| 186 | |
| 187 | Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP`` |
| 188 | message in exactly the same way as other next hop get requests. The |
| 189 | response attributes match the replacement attributes cited above, except |
| 190 | ``NHA_RES_GROUP`` payload will include the following attribute: |
| 191 | |
| 192 | =================================== ========================================= |
| 193 | ``NHA_RES_GROUP_UNBALANCED_TIME`` How long has the resilient group been out |
| 194 | of balance, in units of clock_t. |
| 195 | =================================== ========================================= |
| 196 | |
| 197 | Bucket Get |
| 198 | ^^^^^^^^^^ |
| 199 | |
| 200 | The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is |
| 201 | used to request a single bucket. The attributes recognized at get requests |
| 202 | are: |
| 203 | |
| 204 | =================== ========================================================= |
| 205 | ``NHA_ID`` ID of the next-hop group that the bucket belongs to. |
| 206 | ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket. |
| 207 | =================== ========================================================= |
| 208 | |
| 209 | ``NHA_RES_BUCKET`` payload: |
| 210 | |
| 211 | ======================== ==================================================== |
| 212 | ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table. |
| 213 | ======================== ==================================================== |
| 214 | |
| 215 | Bucket Dumps |
| 216 | ^^^^^^^^^^^^ |
| 217 | |
| 218 | The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used |
| 219 | to request a dump of matching buckets. The attributes recognized at dump |
| 220 | requests are: |
| 221 | |
| 222 | =================== ========================================================= |
| 223 | ``NHA_ID`` If specified, limits the dump to just the next-hop group |
| 224 | with this ID. |
| 225 | ``NHA_OIF`` If specified, limits the dump to buckets that contain |
| 226 | next hops that use the device with this ifindex. |
| 227 | ``NHA_MASTER`` If specified, limits the dump to buckets that contain |
| 228 | next hops that use a device in the VRF with this ifindex. |
| 229 | ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket. |
| 230 | =================== ========================================================= |
| 231 | |
| 232 | ``NHA_RES_BUCKET`` payload: |
| 233 | |
| 234 | ======================== ==================================================== |
| 235 | ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets |
| 236 | that contain the next hop with this ID. |
| 237 | ======================== ==================================================== |
| 238 | |
| 239 | Usage |
| 240 | ----- |
| 241 | |
| 242 | To illustrate the usage, consider the following commands:: |
| 243 | |
| 244 | # ip nexthop add id 1 via 192.0.2.2 dev eth0 |
| 245 | # ip nexthop add id 2 via 192.0.2.3 dev eth0 |
| 246 | # ip nexthop add id 10 group 1/2 type resilient \ |
| 247 | buckets 8 idle_timer 60 unbalanced_timer 300 |
| 248 | |
| 249 | The last command creates a resilient next-hop group. It will have 8 buckets |
| 250 | (which is unusually low number, and used here for demonstration purposes |
| 251 | only), each bucket will be considered idle when no traffic hits it for at |
| 252 | least 60 seconds, and if the table remains out of balance for 300 seconds, |
| 253 | it will be forcefully brought into balance. |
| 254 | |
| 255 | Changing next-hop weights leads to change in bucket allocation:: |
| 256 | |
| 257 | # ip nexthop replace id 10 group 1,3/2 type resilient |
| 258 | |
| 259 | This can be confirmed by looking at individual buckets:: |
| 260 | |
| 261 | # ip nexthop bucket show id 10 |
| 262 | id 10 index 0 idle_time 5.59 nhid 1 |
| 263 | id 10 index 1 idle_time 5.59 nhid 1 |
| 264 | id 10 index 2 idle_time 8.74 nhid 2 |
| 265 | id 10 index 3 idle_time 8.74 nhid 2 |
| 266 | id 10 index 4 idle_time 8.74 nhid 1 |
| 267 | id 10 index 5 idle_time 8.74 nhid 1 |
| 268 | id 10 index 6 idle_time 8.74 nhid 1 |
| 269 | id 10 index 7 idle_time 8.74 nhid 1 |
| 270 | |
| 271 | Note the two buckets that have a shorter idle time. Those are the ones that |
| 272 | were migrated after the next-hop replace command to satisfy the new demand |
| 273 | that next hop 1 be given 6 buckets instead of 4. |
| 274 | |
| 275 | Netdevsim |
| 276 | --------- |
| 277 | |
| 278 | The netdevsim driver implements a mock offload of resilient groups, and |
| 279 | exposes debugfs interface that allows marking individual buckets as busy. |
| 280 | For example, the following will mark bucket 23 in next-hop group 10 as |
| 281 | active:: |
| 282 | |
| 283 | # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity |
| 284 | |
| 285 | In addition, another debugfs interface can be used to configure that the |
| 286 | next attempt to migrate a bucket should fail:: |
| 287 | |
| 288 | # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace |
| 289 | |
| 290 | Besides serving as an example, the interfaces that netdevsim exposes are |
| 291 | useful in automated testing, and |
| 292 | ``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of |
| 293 | them to test the algorithm. |