Commit | Line | Data |
---|---|---|
97162a1e MCC |
1 | ================== |
2 | IP over InfiniBand | |
3 | ================== | |
1da177e4 LT |
4 | |
5 | The ib_ipoib driver is an implementation of the IP over InfiniBand | |
ac83cbaa RD |
6 | protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib |
7 | working group. It is a "native" implementation in the sense of | |
8 | setting the interface type to ARPHRD_INFINIBAND and the hardware | |
9 | address length to 20 (earlier proprietary implementations | |
1da177e4 LT |
10 | masqueraded to the kernel as Ethernet interfaces). |
11 | ||
12 | Partitions and P_Keys | |
97162a1e | 13 | ===================== |
1da177e4 LT |
14 | |
15 | When the IPoIB driver is loaded, it creates one interface for each | |
16 | port using the P_Key at index 0. To create an interface with a | |
17 | different P_Key, write the desired P_Key into the main interface's | |
97162a1e | 18 | /sys/class/net/<intf name>/create_child file. For example:: |
1da177e4 LT |
19 | |
20 | echo 0x8001 > /sys/class/net/ib0/create_child | |
21 | ||
22 | This will create an interface named ib0.8001 with P_Key 0x8001. To | |
97162a1e | 23 | remove a subinterface, use the "delete_child" file:: |
1da177e4 LT |
24 | |
25 | echo 0x8001 > /sys/class/net/ib0/delete_child | |
26 | ||
27 | The P_Key for any interface is given by the "pkey" file, and the | |
28 | main interface for a subinterface is in "parent". | |
29 | ||
9baa0b03 | 30 | Child interface create/delete can also be done using IPoIB's |
08559657 | 31 | rtnl_link_ops; children created either way behave the same.
9baa0b03 | 32 | |
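The sysfs files described above can be wrapped in a few small helpers. This is a minimal sketch, not part of the driver: the function names and the SYSFS_ROOT override are hypothetical, and the four-hex-digit child name is inferred from the ib0.8001 example above. The comment about ``ip link`` assumes an iproute2 build with ipoib link support.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override so the helpers can be tried
# against a scratch directory; on a real system it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Name expected for a child interface: <parent>.<P_Key as four hex digits>,
# following the ib0.8001 example (assumed naming pattern).
ipoib_child_name() {
    printf '%s.%04x\n' "$1" "$(($2))"
}

# Print the P_Key of an interface from its "pkey" file.
ipoib_pkey() {
    cat "$SYSFS_ROOT/class/net/$1/pkey"
}

# Print the main interface of a subinterface from its "parent" file.
ipoib_parent() {
    cat "$SYSFS_ROOT/class/net/$1/parent"
}

# The rtnl_link_ops path is normally driven through iproute2, e.g.
# (assuming ipoib link support is available):
#   ip link add link ib0 name ib0.8001 type ipoib pkey 0x8001
```

For example, ``ipoib_child_name ib0 0x8001`` prints ``ib0.8001``.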
6a3335b4 | 33 | Datagram vs Connected modes |
97162a1e | 34 | =========================== |
6a3335b4 OG |
35 | |
36 | The IPoIB driver supports two modes of operation: datagram and | |
37 | connected. The mode is set and read through an interface's | |
38 | /sys/class/net/<intf name>/mode file. | |
39 | ||
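Reading and setting the mode could look like the sketch below. The helper names and the SYSFS_ROOT override are hypothetical; the accepted values are assumed to be the strings ``datagram`` and ``connected``, matching the two mode names above.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for trying this without IB hardware.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the current mode of an interface.
ipoib_get_mode() {
    cat "$SYSFS_ROOT/class/net/$1/mode"
}

# Write a new mode, rejecting anything other than the two mode names.
ipoib_set_mode() {
    case "$2" in
        datagram|connected)
            echo "$2" > "$SYSFS_ROOT/class/net/$1/mode"
            ;;
        *)
            echo "unknown mode: $2" >&2
            return 1
            ;;
    esac
}
```

On a real system, ``ipoib_set_mode ib0 connected`` needs root privileges.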
40 | In datagram mode, the IB UD (Unreliable Datagram) transport is used | |
41 | and so the interface MTU is equal to the IB L2 MTU minus the | |
42 | IPoIB encapsulation header (4 bytes). For example, in a typical IB | |
43 | fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes. | |
44 | ||
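The datagram-mode arithmetic above generalizes to other fabric MTUs. A small sketch, with a hypothetical helper name:

```shell
#!/bin/sh
# Size of the IPoIB encapsulation header in bytes, as stated above.
IPOIB_ENCAP_LEN=4

# Print the datagram-mode IPoIB MTU for a given IB L2 MTU.
ipoib_ud_mtu() {
    echo "$(($1 - IPOIB_ENCAP_LEN))"
}
```

``ipoib_ud_mtu 2048`` prints 2044, matching the example above; a 4K fabric MTU would give 4092.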
45 | In connected mode, the IB RC (Reliable Connected) transport is used. | |
f7111821 BVA |
46 | Connected mode takes advantage of the connected nature of the IB |
47 | transport and allows an MTU up to the maximal IP packet size of 64K, | |
48 | which reduces the number of IP packets needed for handling large UDP | |
49 | datagrams, TCP segments, etc., and increases the performance for large | |
50 | messages. | |
6a3335b4 OG |
51 | |
52 | In connected mode, the interface's UD QP is still used for multicast | |
53 | and communication with peers that don't support connected mode. In | |
54 | this case, RX emulation of ICMP PMTU packets is used to cause the | |
55 | networking stack to use the smaller UD MTU for these neighbours. | |
56 | ||
57 | Stateless offloads | |
97162a1e | 58 | ================== |
6a3335b4 OG |
59 | |
60 | If the IB HW supports IPoIB stateless offloads, IPoIB advertises | |
61 | TCP/IP checksum and/or Large Send (LSO) offloading capability to the | |
62 | network stack. | |
63 | ||
64 | Large Receive (LRO) offloading is also implemented and may be turned | |
65 | on/off using ethtool calls. Currently LRO is supported only for | |
66 | checksum offload capable devices. | |
67 | ||
97162a1e | 68 | Stateless offloads are supported only in datagram mode. |
6a3335b4 OG |
69 | |
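The ethtool calls mentioned above might be wrapped as follows. This is a sketch: the helper names and the ETHTOOL override are hypothetical, and it relies on the standard ``ethtool -k`` (show) and ``ethtool -K`` (set) offload options.

```shell
#!/bin/sh
# ETHTOOL is a hypothetical override, e.g. ETHTOOL=echo for a dry run.
ETHTOOL="${ETHTOOL:-ethtool}"

# Show the offload settings currently in effect for an interface.
ipoib_show_offloads() {
    $ETHTOOL -k "$1"
}

# Turn LRO on or off for an interface.
ipoib_set_lro() {
    case "$2" in
        on|off)
            $ETHTOOL -K "$1" lro "$2"
            ;;
        *)
            echo "usage: ipoib_set_lro <intf> on|off" >&2
            return 1
            ;;
    esac
}
```

For example, ``ipoib_set_lro ib0 off`` runs ``ethtool -K ib0 lro off``.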
70 | Interrupt moderation | |
97162a1e | 71 | ==================== |
6a3335b4 OG |
72 | |
73 | If the underlying IB device supports CQ event moderation, one can | |
74 | use ethtool to set interrupt mitigation parameters and thus reduce | |
75 | the overhead incurred by handling interrupts. The main code path of | |
76 | IPoIB doesn't use events for TX completion signaling so only RX | |
77 | moderation is supported. | |
78 | ||
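RX moderation could be set with the standard ``ethtool -C`` coalescing options, as in the sketch below. The helper names and the ETHTOOL override are hypothetical, and which coalescing parameters a given device honours is an assumption to verify with ``ethtool -c``.

```shell
#!/bin/sh
# ETHTOOL is a hypothetical override, e.g. ETHTOOL=echo for a dry run.
ETHTOOL="${ETHTOOL:-ethtool}"

# Show the current coalescing settings of an interface.
ipoib_show_moderation() {
    $ETHTOOL -c "$1"
}

# Set RX moderation: $2 = microseconds to wait, $3 = frames to batch.
# Only RX parameters are set, since TX moderation is not supported.
ipoib_set_rx_moderation() {
    $ETHTOOL -C "$1" rx-usecs "$2" rx-frames "$3"
}
```

For example, ``ipoib_set_rx_moderation ib0 10 8`` runs ``ethtool -C ib0 rx-usecs 10 rx-frames 8``; the values here are purely illustrative.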
1da177e4 | 79 | Debugging Information |
97162a1e | 80 | ===================== |
1da177e4 LT |
81 | |
82 | By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set | |
83 | to 'y', tracing messages are compiled into the driver. They are | |
84 | turned on by setting the module parameters debug_level and | |
85 | mcast_debug_level to 1. These parameters can be controlled at | |
86 | runtime through files in /sys/module/ib_ipoib/. | |
87 | ||
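Toggling the tracing messages at runtime might look like this sketch. It assumes the usual sysfs layout in which module parameters appear under a ``parameters`` subdirectory of /sys/module/ib_ipoib/; the helper names and the SYSFS_ROOT override are hypothetical.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for trying this on a scratch tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Directory assumed to hold the ib_ipoib module parameters.
ipoib_param_dir() {
    echo "$SYSFS_ROOT/module/ib_ipoib/parameters"
}

# Set a debug parameter: $1 = debug_level or mcast_debug_level, $2 = 0 or 1.
ipoib_set_debug() {
    case "$1" in
        debug_level|mcast_debug_level)
            echo "$2" > "$(ipoib_param_dir)/$1"
            ;;
        *)
            echo "unknown parameter: $1" >&2
            return 1
            ;;
    esac
}

# Print the current value of a debug parameter.
ipoib_get_debug() {
    cat "$(ipoib_param_dir)/$1"
}
```

For example, ``ipoib_set_debug mcast_debug_level 1`` turns on multicast tracing; it needs root privileges on a real system.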
b1ed8dab | 88 | CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs |
97162a1e | 89 | virtual filesystem. By mounting this filesystem, for example with:: |
1da177e4 | 90 | |
b1ed8dab | 91 | mount -t debugfs none /sys/kernel/debug |
1da177e4 LT |
92 | |
93 | it is possible to get statistics about multicast groups from the | |
b1ed8dab | 94 | files /sys/kernel/debug/ipoib/ib0_mcg and so on. |
1da177e4 LT |
95 | |
96 | The performance impact of this option is negligible, so it | |
97 | is safe to enable this option with debug_level set to 0 for normal | |
98 | operation. | |
99 | ||
100 | CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in | |
101 | the data path when data_debug_level is set to 1. However, even with | |
102 | the output disabled, enabling this configuration option will affect | |
103 | performance, because it adds tests to the fast path. | |
104 | ||
105 | References | |
97162a1e | 106 | ========== |
1da177e4 | 107 | |
ac83cbaa | 108 | Transmission of IP over InfiniBand (IPoIB) (RFC 4391) |
97162a1e MCC |
109 | http://ietf.org/rfc/rfc4391.txt |
110 | ||
ac83cbaa | 111 | IP over InfiniBand (IPoIB) Architecture (RFC 4392) |
97162a1e MCC |
112 | http://ietf.org/rfc/rfc4392.txt |
113 | ||
6a3335b4 OG |
114 | IP over InfiniBand: Connected Mode (RFC 4755) |
115 | http://ietf.org/rfc/rfc4755.txt |