.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
	      under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
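
The effect can be reproduced with a minimal sketch of hash-threshold
selection. It assumes equal-weight next hops and a deliberately tiny hash
space of 20 values; the function names are illustrative only and do not
correspond to kernel code.

.. code-block:: python

    def build_regions(next_hops, hash_space):
        """Split the hash space into equal contiguous regions.

        Returns a list of (exclusive_upper_bound, next_hop) pairs.
        """
        n = len(next_hops)
        return [((i + 1) * hash_space // n, nh)
                for i, nh in enumerate(next_hops)]


    def select(regions, skb_hash):
        """Return the next hop whose region contains skb_hash."""
        for upper, nh in regions:
            if skb_hash < upper:
                return nh
        raise ValueError("hash outside of hash space")


    before = build_regions([1, 2, 3, 4, 5], hash_space=20)
    after = build_regions([1, 2, 4, 5], hash_space=20)

    # Hashes that change next hop although their old next hop still exists.
    moved = [h for h in range(20)
             if select(before, h) != 3
             and select(before, h) != select(after, h)]

Here ``moved`` contains one hash that migrated from next hop 2 to next
hop 1, and one that migrated from next hop 4 to next hop 5, even though
none of those next hops was removed.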

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes packets from a single flow to suddenly
arrive at a server that does not expect them. This can result in TCP
connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB hash
to choose a hash table bucket, then reads the next hop that this bucket
contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops::

	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	    |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	                     v v v v
	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	    |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

	    Before and after deletion of next hop 3
	    under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally keep
being forwarded to the same endpoints through the same paths as before
the next-hop group change.
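
The bucket indirection can be sketched as follows. This is a simplified
model rather than the kernel implementation: it assumes that the bucket is
chosen as the hash modulo the table size, and the way replacement next hops
are picked (``replacements``) is purely illustrative.

.. code-block:: python

    def select(table, skb_hash):
        """Pick a bucket by hash and return the next hop stored in it."""
        return table[skb_hash % len(table)]


    def remove_next_hop(table, removed, replacements):
        """Return a new table in which only buckets holding `removed` change.

        `replacements` supplies the new next hops, one per vacated bucket.
        """
        it = iter(replacements)
        return [next(it) if nh == removed else nh for nh in table]


    before = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
    after = remove_next_hop(before, removed=3, replacements=[1, 2, 4, 5])

    # Indexes of the buckets whose next hop changed.
    changed = [i for i in range(len(before)) if before[i] != after[i]]

Only the four buckets that used to reference next hop 3 are rewritten.
Every hash that resolved to another next hop keeps resolving to it.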

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number the "wants count" of a next hop. When an event occurs that
might change bucket allocation, the wants counts of individual next hops
are updated.
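
As a rough illustration of how wants counts could follow from weights, the
sketch below splits the buckets using cumulative rounding, so that the
counts always sum to the table size. The function name and the rounding
scheme are assumptions of this sketch, not a statement of what the source
code does.

.. code-block:: python

    def wants_counts(weights, num_buckets):
        """Distribute num_buckets among next hops in proportion to weights."""
        total = sum(weights)
        counts = []
        cumulative = 0
        prev_upper = 0
        for w in weights:
            cumulative += w
            # Round the cumulative share to the nearest whole bucket.
            upper = (num_buckets * cumulative + total // 2) // total
            counts.append(upper - prev_upper)
            prev_upper = upper
        return counts


    even = wants_counts([1, 1], 8)      # two next hops of equal weight
    skewed = wants_counts([3, 1], 8)    # first next hop has weight 3

With 8 buckets, equal weights give each next hop 4 buckets, whereas weights
of 3 and 1 give 6 and 2 buckets, respectively.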

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".
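
These definitions translate directly into a small classification helper.
The table representation (a list of next hop IDs, one per bucket) and the
names used are assumptions made for the sake of the example.

.. code-block:: python

    from collections import Counter


    def classify(table, wants):
        """Return (underweight, overweight) lists of next hop IDs."""
        have = Counter(table)
        underweight = [nh for nh in wants if have[nh] < wants[nh]]
        overweight = [nh for nh in wants if have[nh] > wants[nh]]
        return underweight, overweight


    # Eight buckets split evenly, while next hop 1 now wants six of them.
    table = [1, 1, 1, 1, 2, 2, 2, 2]
    underweight, overweight = classify(table, wants={1: 6, 2: 2})
    balanced = not overweight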

Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.

After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and their next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for a
future time.
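
A single upkeep pass over the three conditions above might look like the
following sketch. It is a simplified model, not the kernel code: time is
kept in seconds instead of jiffies, the first underweight next hop is
always chosen as the migration target, and resetting the last-used time on
migration is an assumption of the sketch.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Bucket:
        nh: Optional[int]   # next hop ID, or None if unassigned
        last_used: float    # time of the last packet, in seconds


    def upkeep(buckets, wants, now, idle_timer):
        """Run one pass; return True if underweight next hops remain."""
        have = {nh: 0 for nh in wants}
        for b in buckets:
            if b.nh in have:
                have[b.nh] += 1

        for b in buckets:
            underweight = [nh for nh in wants if have[nh] < wants[nh]]
            if not underweight:
                break
            unassigned = b.nh is None                          # condition 1
            removed = b.nh is not None and b.nh not in wants   # condition 2
            idle = now - b.last_used >= idle_timer
            overweight = b.nh in have and have[b.nh] > wants[b.nh]
            if unassigned or removed or (idle and overweight): # condition 3
                if b.nh in have:
                    have[b.nh] -= 1
                b.nh = underweight[0]
                have[b.nh] += 1
                b.last_used = now

        return any(have[nh] < wants[nh] for nh in wants)


    # Next hop 2 holds four buckets but wants two; two of them are idle.
    buckets = [Bucket(1, 95.0) for _ in range(4)] + [
        Bucket(2, 10.0), Bucket(2, 10.0), Bucket(2, 90.0), Bucket(2, 90.0)]
    again = upkeep(buckets, wants={1: 6, 2: 2}, now=100.0, idle_timer=60.0)

In this example the two idle buckets of the overweight next hop 2 move to
next hop 1, the busy ones stay in place, and no further run is needed. Had
fewer buckets been idle, the function would return ``True`` to signal that
another run should be scheduled.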

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer". This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.
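
Taken together, the four conditions can be expressed as a single predicate.
The sketch below is only a restatement of the list in code. The parameter
names are hypothetical, and both time values are assumed to use the same
unit.

.. code-block:: python

    def should_migrate(assigned, nh_removed, idle, overweight,
                       unbalanced_for, unbalanced_timer):
        """Decide whether upkeep should migrate a bucket."""
        if not assigned or nh_removed:      # conditions 1 and 2
            return True
        if not overweight:
            return False
        if idle:                            # condition 3
            return True
        # Condition 4: a busy bucket of an overweight next hop is migrated
        # only once the table has been unbalanced for too long, and only if
        # the unbalanced timer is enabled (non-zero).
        return unbalanced_timer != 0 and unbalanced_for > unbalanced_timer


    # A busy bucket of an overweight next hop, table unbalanced for 301 s.
    forced_by_timer = should_migrate(True, False, False, True, 301, 300)
    timer_disabled = should_migrate(True, False, False, True, 301, 0)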

Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next-hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which are used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
  is sent before the group is replaced, and is a way for the driver to veto
  the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.
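
The difference between forced and non-forced notifications can be modeled
as below. This is a toy model in Python rather than driver code:
``notify_driver`` stands in for the single-bucket notification,
``busy_driver`` is a hypothetical driver, and letting a forced migration
proceed despite an error is an assumption of the sketch.

.. code-block:: python

    import errno


    def migrate_bucket(bucket, new_nh, force, notify_driver):
        """Try to point `bucket` at `new_nh`; return True if migrated.

        `notify_driver(bucket, new_nh, force)` returns 0 on success or a
        negative errno value on failure.
        """
        err = notify_driver(bucket, new_nh, force)
        if err and not force:
            # The driver vetoed a non-forced migration: leave the bucket.
            return False
        # Forced migrations go ahead regardless (assumption of the sketch).
        bucket["nh"] = new_nh
        return True


    def busy_driver(bucket, new_nh, force):
        """Hypothetical driver whose HW reports that the bucket saw traffic."""
        return -errno.EBUSY


    bucket = {"index": 5, "nh": 2}
    migrated = migrate_bucket(bucket, 1, force=False,
                              notify_driver=busy_driver)
    nh_after_veto = bucket["nh"]
    forced = migrate_bucket(bucket, 1, force=True,
                            notify_driver=busy_driver)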

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups. The following changes apply to the
attributes passed in the netlink message:

  =================== =========================================================
  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for a resilient group.
  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                      groups.
  =================== =========================================================

``NHA_RES_GROUP`` payload:

  =================================== =========================================
  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
  =================================== =========================================
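
Since both timers are expressed in ``clock_t`` units, user space needs to
scale values given in seconds. The helper below is a sketch that assumes
100 ticks per second, which is the common value on Linux. The actual value
can be queried with ``sysconf(_SC_CLK_TCK)``. The helper name is
illustrative.

.. code-block:: python

    def seconds_to_clock_t(seconds, ticks_per_second=100):
        """Convert a duration in seconds to clock_t ticks."""
        return round(seconds * ticks_per_second)


    idle_timer = seconds_to_clock_t(60)         # 60 s idle timer
    unbalanced_timer = seconds_to_clock_t(300)  # 300 s unbalanced timer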

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
that the ``NHA_RES_GROUP`` payload also includes the following attribute:

  =================================== =========================================
  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long the resilient group has been out
                                      of balance, in units of clock_t.
  =================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized in get requests
are:

  =================== =========================================================
  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to the bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
  ======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized in dump
requests are:

  =================== =========================================================
  ``NHA_ID``          If specified, limits the dump to just the next-hop group
                      with this ID.
  ``NHA_OIF``         If specified, limits the dump to buckets that contain
                      next hops that use the device with this ifindex.
  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
                      next hops that use a device in the VRF with this ifindex.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to the bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                           that contain the next hop with this ID.
  ======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

	# ip nexthop add id 1 via 192.0.2.2 dev eth0
	# ip nexthop add id 2 via 192.0.2.3 dev eth0
	# ip nexthop add id 10 group 1/2 type resilient \
		buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, and used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

	# ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

	# ip nexthop bucket show id 10
	id 10 index 0 idle_time 5.59 nhid 1
	id 10 index 1 idle_time 5.59 nhid 1
	id 10 index 2 idle_time 8.74 nhid 2
	id 10 index 3 idle_time 8.74 nhid 2
	id 10 index 4 idle_time 8.74 nhid 1
	id 10 index 5 idle_time 8.74 nhid 1
	id 10 index 6 idle_time 8.74 nhid 1
	id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.
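
When checking such output in a script, the per-next-hop bucket counts and
the recently migrated buckets can be extracted as in the following sketch.
It assumes that each line consists solely of the key/value pairs shown
above; lines carrying additional fields would need more careful parsing.

.. code-block:: python

    from collections import Counter

    OUTPUT = "\n".join([
        "id 10 index 0 idle_time 5.59 nhid 1",
        "id 10 index 1 idle_time 5.59 nhid 1",
        "id 10 index 2 idle_time 8.74 nhid 2",
        "id 10 index 3 idle_time 8.74 nhid 2",
        "id 10 index 4 idle_time 8.74 nhid 1",
        "id 10 index 5 idle_time 8.74 nhid 1",
        "id 10 index 6 idle_time 8.74 nhid 1",
        "id 10 index 7 idle_time 8.74 nhid 1",
    ])


    def parse_buckets(text):
        """Parse bucket lines into dicts with index, idle_time and nhid."""
        rows = []
        for line in text.splitlines():
            fields = line.split()
            if not fields:
                continue
            # Fields alternate between keys and values.
            row = dict(zip(fields[0::2], fields[1::2]))
            rows.append({
                "index": int(row["index"]),
                "idle_time": float(row["idle_time"]),
                "nhid": int(row["nhid"]),
            })
        return rows


    rows = parse_buckets(OUTPUT)
    per_nh = Counter(r["nhid"] for r in rows)
    newest = min(r["idle_time"] for r in rows)
    migrated = [r["index"] for r in rows if r["idle_time"] == newest]

For the output above, ``per_nh`` reports 6 buckets for next hop 1 and 2 for
next hop 2, and ``migrated`` lists buckets 0 and 1.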

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

	# echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to make the next
attempt to migrate a bucket fail::

	# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.