.. SPDX-License-Identifier: GPL-2.0

====================================
Virtual Routing and Forwarding (VRF)
====================================

The VRF Device
==============

The VRF device combined with ip rules provides the ability to create virtual
routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the
Linux network stack. One use case is the multi-tenancy problem where each
tenant has their own unique routing tables and at the very least needs
different default gateways.

Processes can be "VRF aware" by binding a socket to the VRF device. Packets
through the socket then use the routing table associated with the VRF
device. An important feature of the VRF device implementation is that it
impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected
(i.e., they do not need to be run in each VRF). The design also allows
the use of higher priority ip rules (Policy Based Routing, PBR) to take
precedence over the VRF device rules directing specific traffic as desired.

In addition, VRF devices allow VRFs to be nested within namespaces. For
example, network namespaces provide separation of network interfaces at the
device layer, VLANs on the interfaces within a namespace provide L2 separation
and then VRF devices provide L3 separation.

Design
------
A VRF device is created with an associated route table. Network interfaces
are then enslaved to a VRF device::

     +-----------------------------+
     |           vrf-blue          |  ===> route table 10
     +-----------------------------+
        |        |            |
     +------+ +------+     +-------------+
     | eth1 | | eth2 | ... |    bond1    |
     +------+ +------+     +-------------+
                              |       |
                          +------+ +------+
                          | eth8 | | eth9 |
                          +------+ +------+

Packets received on an enslaved device are switched to the VRF device
in the IPv4 and IPv6 processing stacks, giving the impression that packets
flow through the VRF device. Similarly, on egress, routing rules are used to
send packets to the VRF device driver before they are sent out the actual
interface. This allows tcpdump on a VRF device to capture all packets into
and out of the VRF as a whole\ [1]_. Similarly, netfilter\ [2]_ and tc rules
can be applied using the VRF device to specify rules that apply to the VRF
domain as a whole.

.. [1] Packets in the forwarded state do not flow through the device, so those
       packets are not seen by tcpdump. This limitation may be revisited in a
       future release.

.. [2] Iptables on ingress supports PREROUTING rules with skb->dev set to the
       real ingress device and both INPUT and PREROUTING rules with skb->dev
       set to the VRF device. For egress, POSTROUTING and OUTPUT rules can be
       written using either the VRF device or the real egress device.

Setup
-----
1. A VRF device is created with an association to a FIB table, for example::

       ip link add vrf-blue type vrf table 10
       ip link set dev vrf-blue up

2. An l3mdev FIB rule directs lookups to the table associated with the device.
   A single l3mdev rule is sufficient for all VRFs. The VRF device adds the
   l3mdev rule for IPv4 and IPv6 when the first device is created, with a
   default preference of 1000. Users may delete the rule if desired and add
   it back with a different priority, or install per-VRF rules.

   Prior to the v4.8 kernel, iif and oif rules are needed for each VRF device::

       ip ru add oif vrf-blue table 10
       ip ru add iif vrf-blue table 10

3. Set the default route for the table (and hence the default route for the
   VRF)::

       ip route add table 10 unreachable default metric 4278198272

   This high metric value ensures that the default unreachable route can
   be overridden by a routing protocol suite. FRRouting interprets
   kernel metrics as a combined admin distance (upper byte) and priority
   (lower 3 bytes). Thus the above metric translates to [255/8192].

4. Enslave L3 interfaces to a VRF device::

       ip link set dev eth1 master vrf-blue

   Local and connected routes for enslaved devices are automatically moved to
   the table associated with the VRF device. Any additional routes depending
   on the enslaved device are dropped and will need to be reinserted into the
   VRF FIB table following the enslavement.

   The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global
   addresses as VRF enslavement changes::

       sysctl -w net.ipv6.conf.all.keep_addr_on_down=1

5. Additional VRF routes are added to the associated table::

       ip route add table 10 ...
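
The metric arithmetic in step 3 can be checked with a few lines of C. The
following is a minimal sketch that assumes the FRRouting interpretation
described above (admin distance in the upper byte, priority in the lower
three bytes); the helper names are hypothetical and are not part of any
kernel or FRRouting API.

```c
#include <stdint.h>

/* Upper byte of a kernel route metric: the admin distance (FRRouting view). */
unsigned int metric_admin_distance(uint32_t metric)
{
	return metric >> 24;
}

/* Lower three bytes of a kernel route metric: the priority (FRRouting view). */
unsigned int metric_priority(uint32_t metric)
{
	return metric & 0x00ffffffu;
}

/* Inverse operation: combine an admin distance and a priority into a metric. */
uint32_t metric_encode(unsigned int distance, unsigned int priority)
{
	return ((uint32_t)(distance & 0xffu) << 24) | (priority & 0x00ffffffu);
}
```

Encoding [255/8192] with metric_encode() gives 4278198272 (0xff002000), the
metric used for the unreachable default route above.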


Applications
------------
Applications that are to work within a VRF need to bind their socket to the
VRF device::

    setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);

or to specify the output device using cmsg and IP_PKTINFO.

By default the scope of the port bindings for unbound sockets is
limited to the default VRF. That is, an unbound socket will not be matched
by packets arriving on interfaces enslaved to an l3mdev, and processes may
bind to the same port if they bind to an l3mdev.

TCP & UDP services running in the default VRF context (i.e., not bound
to any VRF device) can work across all VRF domains by enabling the
tcp_l3mdev_accept and udp_l3mdev_accept sysctl options::

    sysctl -w net.ipv4.tcp_l3mdev_accept=1
    sysctl -w net.ipv4.udp_l3mdev_accept=1
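
Applications that depend on cross-VRF listeners may want to check these
settings at startup. The sketch below reads an integer sysctl through procfs,
where the dots of the sysctl name become slashes in the path.
read_sysctl_int() is a hypothetical helper; it treats a missing file as "not
available", which is what one would expect on a kernel built without l3mdev
support.

```c
#include <stdio.h>

/*
 * Read an integer sysctl such as "net/ipv4/tcp_l3mdev_accept" from
 * /proc/sys. Returns 0 and stores the value on success, or -1 if the
 * file is missing, unreadable, or does not start with an integer.
 */
int read_sysctl_int(const char *name, int *value)
{
	char path[256];
	FILE *f;
	int ok;

	if (snprintf(path, sizeof(path), "/proc/sys/%s", name) >= (int)sizeof(path))
		return -1;	/* name too long for the buffer */

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, a return value of 0 with the value 1 for
"net/ipv4/tcp_l3mdev_accept" means that TCP services in the default VRF also
accept connections arriving in VRFs.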

These options are disabled by default so that a socket in a VRF is only
selected for packets in that VRF. There is a similar option for RAW
sockets, which is enabled by default for reasons of backwards compatibility:
it allows the output device to be specified with cmsg and IP_PKTINFO while
using a socket that is not bound to the corresponding VRF. This allows, e.g.,
older ping implementations to be run by specifying the device rather than by
executing them in the VRF. This option can be disabled so that packets
received in a VRF context are only handled by a raw socket bound to the VRF,
and packets in the default VRF are only handled by a socket not bound to any
VRF::

    sysctl -w net.ipv4.raw_l3mdev_accept=0
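
Specifying the output device per message with cmsg and IP_PKTINFO, as
described above for sockets that are not bound to a VRF, can be sketched as
follows. The code only prepares a struct msghdr for sendmsg(); the interface
index would typically come from if_nametoindex() on the VRF device name. The
names build_pktinfo_msg() and union pktinfo_cbuf are hypothetical, and the
example assumes glibc, which declares struct in_pktinfo only when _GNU_SOURCE
is defined.

```c
#define _GNU_SOURCE		/* struct in_pktinfo on glibc */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one IP_PKTINFO message. */
union pktinfo_cbuf {
	char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
	struct cmsghdr align;
};

/*
 * Fill msg so that sendmsg() sends len bytes of payload to dst out of the
 * device with index ifindex (for example the VRF device), without binding
 * the socket itself to that device.
 */
void build_pktinfo_msg(struct msghdr *msg, struct iovec *iov,
		       union pktinfo_cbuf *cbuf, struct sockaddr_in *dst,
		       void *payload, size_t len, int ifindex)
{
	struct in_pktinfo pi;
	struct cmsghdr *cmsg;

	iov->iov_base = payload;
	iov->iov_len = len;

	memset(msg, 0, sizeof(*msg));
	msg->msg_name = dst;
	msg->msg_namelen = sizeof(*dst);
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = cbuf->buf;
	msg->msg_controllen = sizeof(cbuf->buf);

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_PKTINFO;
	cmsg->cmsg_len = CMSG_LEN(sizeof(pi));

	/* Only the output interface is set; ipi_spec_dst and ipi_addr stay 0. */
	memset(&pi, 0, sizeof(pi));
	pi.ipi_ifindex = ifindex;
	memcpy(CMSG_DATA(cmsg), &pi, sizeof(pi));
}
```

The prepared message would then be passed to sendmsg(sd, &msg, 0) on the
unbound socket.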

netfilter rules on the VRF device can be used to limit access to services
running in the default VRF context as well.
Using VRF-aware applications (applications which simultaneously create sockets
outside and inside VRFs) in conjunction with ``net.ipv4.tcp_l3mdev_accept=1``
is possible but may lead to problems in some situations. With that sysctl
value, it is unspecified which listening socket will be selected to handle
connections for VRF traffic; i.e., either a socket bound to the VRF or an
unbound socket may be used to accept new connections from a VRF. This somewhat
unexpected behavior can lead to problems if sockets are configured with extra
options (e.g., TCP MD5 keys) with the expectation that VRF traffic will
exclusively be handled by sockets bound to VRFs, as would be the case with
``net.ipv4.tcp_l3mdev_accept=0``. Finally, and as a reminder, regardless of
which listening socket is selected, established sockets will be created in the
VRF based on the ingress interface, as documented earlier.
--------------------------------------------------------------------------------

Using iproute2 for VRFs
=======================
iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this
section lists both commands where appropriate: with the vrf keyword and in the
older form without it.

1. Create a VRF

   To instantiate a VRF device and associate it with a table::

       $ ip link add dev NAME type vrf table ID

   As of v4.8 the kernel supports the l3mdev FIB rule where a single rule
   covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 when the
   first VRF device is created.

2. List VRFs

   To list VRFs that have been created::

       $ ip [-d] link show type vrf

   NOTE: The -d option is needed to show the table ID.

   For example::

       $ ip -d link show type vrf
       11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
           link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0
           vrf table 1 addrgenmode eui64
       12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
           link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0
           vrf table 10 addrgenmode eui64
       13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
           link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0
           vrf table 66 addrgenmode eui64
       14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
           link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0
           vrf table 81 addrgenmode eui64

   Or in brief output::

       $ ip -br link show type vrf
       mgmt         UP             72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP>
       red          UP             b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP>
       blue         UP             36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP>
       green        UP             e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP>

3. Assign a Network Interface to a VRF

   Network interfaces are assigned to a VRF by enslaving the netdevice to a
   VRF device::

       $ ip link set dev NAME master NAME

   On enslavement, connected and local routes are automatically moved to the
   table associated with the VRF device.

   For example::

       $ ip link set dev eth0 master mgmt

4. Show Devices Assigned to a VRF

   To show devices that have been assigned to a specific VRF, add the vrf
   (or master) option to the ip command::

       $ ip link show vrf NAME
       $ ip link show master NAME

   For example::

       $ ip link show vrf red
       3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
           link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
       4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
           link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
       7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000
           link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff

   Or using the brief output::

       $ ip -br link show vrf red
       eth1             UP             02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
       eth2             UP             02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP>
       eth5             DOWN           02:00:00:00:02:06 <BROADCAST,MULTICAST>

5. Show Neighbor Entries for a VRF

   To list neighbor entries associated with devices enslaved to a VRF device,
   add the vrf (or master) option to the ip command::

       $ ip [-6] neigh show vrf NAME
       $ ip [-6] neigh show master NAME

   For example::

       $ ip neigh show vrf red
       10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
       10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE

       $ ip -6 neigh show vrf red
       2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE

6. Show Addresses for a VRF

   To show addresses for interfaces associated with a VRF, add the vrf
   (or master) option to the ip command::

       $ ip addr show vrf NAME
       $ ip addr show master NAME

   For example::

       $ ip addr show vrf red
       3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
           link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
           inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1
              valid_lft forever preferred_lft forever
           inet6 2002:1::2/120 scope global
              valid_lft forever preferred_lft forever
           inet6 fe80::ff:fe00:202/64 scope link
              valid_lft forever preferred_lft forever
       4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
           link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
           inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2
              valid_lft forever preferred_lft forever
           inet6 2002:2::2/120 scope global
              valid_lft forever preferred_lft forever
           inet6 fe80::ff:fe00:203/64 scope link
              valid_lft forever preferred_lft forever
       7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000
           link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff

   Or in brief format::

       $ ip -br addr show vrf red
       eth1             UP             10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64
       eth2             UP             10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64
       eth5             DOWN

7. Show Routes for a VRF

   To show routes for a VRF, use the ip command to display the table
   associated with the VRF device::

       $ ip [-6] route show vrf NAME
       $ ip [-6] route show table ID

   For example::

       $ ip route show vrf red
       unreachable default metric 4278198272
       broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
       10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
       local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
       broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
       broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2
       10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2
       local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2
       broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2

       $ ip -6 route show vrf red
       local 2002:1:: dev lo proto none metric 0 pref medium
       local 2002:1::2 dev lo proto none metric 0 pref medium
       2002:1::/120 dev eth1 proto kernel metric 256 pref medium
       local 2002:2:: dev lo proto none metric 0 pref medium
       local 2002:2::2 dev lo proto none metric 0 pref medium
       2002:2::/120 dev eth2 proto kernel metric 256 pref medium
       local fe80:: dev lo proto none metric 0 pref medium
       local fe80:: dev lo proto none metric 0 pref medium
       local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium
       local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium
       fe80::/64 dev eth1 proto kernel metric 256 pref medium
       fe80::/64 dev eth2 proto kernel metric 256 pref medium
       ff00::/8 dev red metric 256 pref medium
       ff00::/8 dev eth1 metric 256 pref medium
       ff00::/8 dev eth2 metric 256 pref medium
       unreachable default dev lo metric 4278198272 error -101 pref medium

8. Route Lookup for a VRF

   A test route lookup can be done for a VRF::

       $ ip [-6] route get vrf NAME ADDRESS
       $ ip [-6] route get oif NAME ADDRESS

   For example::

       $ ip route get 10.2.1.40 vrf red
       10.2.1.40 dev eth1 table red src 10.2.1.2
           cache

       $ ip -6 route get 2002:1::32 vrf red
       2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium

9. Removing a Network Interface from a VRF

   Network interfaces are removed from a VRF by breaking the enslavement to
   the VRF device::

       $ ip link set dev NAME nomaster

   Connected routes are moved back to the default table and local entries are
   moved to the local table.

   For example::

       $ ip link set dev eth0 nomaster
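
The iproute2 commands above query VRF membership over netlink. As a lighter
weight illustration, a program can also find the VRF an interface is enslaved
to by resolving the master symlink that the kernel exposes in sysfs for
enslaved devices. The sketch below assumes sysfs is mounted at /sys;
iface_master() is a hypothetical helper, and it reports any master device,
not only VRFs (a bond or bridge master would be returned the same way).

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Find the master device of interface ifname (for an enslaved interface,
 * its VRF) by resolving the /sys/class/net/<ifname>/master symlink.
 * Returns 0 and copies the master's name into out, or -1 if the interface
 * has no master or the link cannot be read.
 */
int iface_master(const char *ifname, char *out, size_t outlen)
{
	char path[256], target[256];
	const char *slash;
	ssize_t n;

	if (snprintf(path, sizeof(path), "/sys/class/net/%s/master", ifname) >= (int)sizeof(path))
		return -1;

	n = readlink(path, target, sizeof(target) - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';	/* readlink() does not NUL-terminate */

	/* The link points at a path ending in the master's name. */
	slash = strrchr(target, '/');
	snprintf(out, outlen, "%s", slash ? slash + 1 : target);
	return 0;
}
```

On the example system, iface_master("eth1", ...) would be expected to report
red, matching ip link show vrf red.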

--------------------------------------------------------------------------------

Commands used in this example::

    cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
    1 mgmt
    10 red
    66 blue
    81 green
    EOF

    function vrf_create
    {
        VRF=$1
        TBID=$2

        # create VRF device
        ip link add ${VRF} type vrf table ${TBID}

        if [ "${VRF}" != "mgmt" ]; then
            ip route add table ${TBID} unreachable default metric 4278198272
        fi
        ip link set dev ${VRF} up
    }

    vrf_create mgmt 1
    ip link set dev eth0 master mgmt

    vrf_create red 10
    ip link set dev eth1 master red
    ip link set dev eth2 master red
    ip link set dev eth5 master red

    vrf_create blue 66
    ip link set dev eth3 master blue

    vrf_create green 81
    ip link set dev eth4 master green


Interface addresses from /etc/network/interfaces::

    auto eth0
    iface eth0 inet static
          address 10.0.0.2
          netmask 255.255.255.0
          gateway 10.0.0.254

    iface eth0 inet6 static
          address 2000:1::2
          netmask 120

    auto eth1
    iface eth1 inet static
          address 10.2.1.2
          netmask 255.255.255.0

    iface eth1 inet6 static
          address 2002:1::2
          netmask 120

    auto eth2
    iface eth2 inet static
          address 10.2.2.2
          netmask 255.255.255.0

    iface eth2 inet6 static
          address 2002:2::2
          netmask 120

    auto eth3
    iface eth3 inet static
          address 10.2.3.2
          netmask 255.255.255.0

    iface eth3 inet6 static
          address 2002:3::2
          netmask 120

    auto eth4
    iface eth4 inet static
          address 10.2.4.2
          netmask 255.255.255.0

    iface eth4 inet6 static
          address 2002:4::2
          netmask 120
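
The rt_tables.d file written above maps table IDs to the names that iproute2
prints, such as "table red" in the route lookup output. A small sketch of the
same lookup follows, assuming the simple "ID name" per line format with "#"
comments used in vrf.conf; rt_table_name() is hypothetical and does not handle
every form that iproute2 itself accepts.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up the name of table tbid in the text of an rt_tables-style file.
 * On success, copies the name into out (outlen bytes) and returns 0;
 * returns -1 if no line maps tbid to a name.
 */
int rt_table_name(const char *text, unsigned long tbid, char *out, size_t outlen)
{
	const char *line = text;

	while (line && *line) {
		const char *next = strchr(line, '\n');
		size_t n = next ? (size_t)(next - line) : strlen(line);
		char buf[256], name[64];
		unsigned long id;

		/* Lines that do not fit in buf are skipped in this sketch. */
		if (n < sizeof(buf)) {
			memcpy(buf, line, n);
			buf[n] = '\0';
			/* Skip comments, then parse an "ID name" pair. */
			if (buf[0] != '#' &&
			    sscanf(buf, "%lu %63s", &id, name) == 2 && id == tbid) {
				snprintf(out, outlen, "%s", name);
				return 0;
			}
		}
		line = next ? next + 1 : NULL;
	}
	return -1;
}
```

With the contents of vrf.conf above, looking up table 10 yields "red" and
table 81 yields "green".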