        The MSI Driver Guide HOWTO
        Tom L Nguyen tom.l.nguyen@intel.com
        10/03/2003
        Revised Feb 12, 2004 by Martine Silbermann
        email: Martine.Silbermann@hp.com
        Revised Jun 25, 2004 by Tom L Nguyen

1. About this guide

This guide describes the basics of Message Signaled Interrupts (MSI),
the advantages of using MSI over traditional interrupt mechanisms,
and how to enable your driver to use MSI or MSI-X. Also included is
a list of Frequently Asked Questions.

2. Copyright 2003 Intel Corporation

3. What is MSI/MSI-X?

Message Signaled Interrupt (MSI), as described in the PCI Local Bus
Specification Revision 2.3 or later, is an optional feature for PCI
devices and a required feature for PCI Express devices. MSI enables a
device function to request service by sending an Inbound Memory Write
on its PCI bus to the FSB as a Message Signaled Interrupt transaction.
Because MSI is generated in the form of a Memory Write, all transaction
conditions, such as a Retry, Master-Abort, Target-Abort or normal
completion, are supported.

A PCI device that supports MSI must also support the pin IRQ assertion
interrupt mechanism to provide backward compatibility for systems that
do not support MSI. In systems that support MSI, the bus driver is
responsible for initializing the message address and message data of
the device function's MSI/MSI-X capability structure during initial
device configuration.

An MSI-capable device function indicates MSI support by implementing
the MSI/MSI-X capability structure in its PCI capability list. The
device function may implement both the MSI capability structure and
the MSI-X capability structure; however, the bus driver should not
enable both.

The MSI capability structure contains the Message Control register,
the Message Address register and the Message Data register. These
registers provide the bus driver control over MSI. The Message Control
register indicates the MSI capability supported by the device. The
Message Address register specifies the target address and the Message
Data register specifies the characteristics of the message. To request
service, the device function writes the content of the Message Data
register to the target address. The device and its software driver
are prohibited from writing to these registers.
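
As an illustration of what the Message Control register conveys, the
userspace sketch below decodes a 16-bit Message Control value. The bit
positions follow the MSI capability layout in the PCI specification; the
macro, struct and function names are made up for this example and are
not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Bit fields of the 16-bit MSI Message Control register. */
#define MSI_CTRL_ENABLE   0x0001u /* bit 0: MSI enabled */
#define MSI_CTRL_MMC_MASK 0x000Eu /* bits 3:1: Multiple Message Capable */
#define MSI_CTRL_MME_MASK 0x0070u /* bits 6:4: Multiple Message Enable */
#define MSI_CTRL_64BIT    0x0080u /* bit 7: 64-bit message address */
#define MSI_CTRL_PVM      0x0100u /* bit 8: per-vector masking capable */

/* Hypothetical container for the decoded fields. */
struct msi_caps {
	int enabled;
	unsigned vectors_capable; /* 1, 2, 4, ... 32 (log2-encoded field) */
	unsigned vectors_enabled; /* same encoding as vectors_capable */
	int addr64;
	int per_vector_mask;
};

static struct msi_caps msi_decode_control(uint16_t ctrl)
{
	struct msi_caps c;

	c.enabled = (ctrl & MSI_CTRL_ENABLE) != 0;
	/* Both multi-message fields hold log2 of the vector count. */
	c.vectors_capable = 1u << ((ctrl & MSI_CTRL_MMC_MASK) >> 1);
	c.vectors_enabled = 1u << ((ctrl & MSI_CTRL_MME_MASK) >> 4);
	c.addr64 = (ctrl & MSI_CTRL_64BIT) != 0;
	c.per_vector_mask = (ctrl & MSI_CTRL_PVM) != 0;
	return c;
}
```

For example, msi_decode_control(0x018B) describes an enabled function
that is capable of 32 messages, has one message enabled, uses a 64-bit
address and supports per-vector masking.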

The MSI-X capability structure is an optional extension to MSI. It
uses an independent and separate capability structure. Implementing
the MSI-X capability structure has some key advantages over the MSI
capability structure, as described below.

 - Support a larger maximum number of vectors per function.

 - Provide the ability for system software to configure
 each vector with an independent message address and message
 data, specified by a table that resides in Memory Space.

 - MSI and MSI-X both support per-vector masking. Per-vector
 masking is an optional extension of MSI but a required
 feature for MSI-X. Per-vector masking provides the kernel
 the ability to mask/unmask MSI when servicing its software
 interrupt service routine handler. If per-vector masking is
 not supported, then the device driver should provide the
 hardware/software synchronization to ensure that the device
 generates MSI when the driver wants it to do so.
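
The per-vector address, data and mask described above can be pictured
with the following userspace sketch of one MSI-X table entry. The
16-byte entry layout and the 11-bit Table Size field (encoded as N-1)
are taken from the PCI specification, not from this guide; all names
are hypothetical. The code works on ordinary memory and only
illustrates the layout; as explained in section 5.3.1, a device driver
must not access the real MSI-X table itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One MSI-X table entry: four 32-bit words, 16 bytes in total. */
struct msix_table_entry {
	uint32_t msg_addr_lo; /* message address, lower 32 bits */
	uint32_t msg_addr_hi; /* message address, upper 32 bits */
	uint32_t msg_data;    /* message data */
	uint32_t vector_ctrl; /* bit 0: per-vector mask */
};

#define MSIX_CTRL_TABLE_SIZE_MASK 0x07FFu /* bits 10:0 of Message Control */
#define MSIX_VECTOR_CTRL_MASKBIT  0x1u

/* The Table Size field stores (number of entries - 1). */
static unsigned msix_table_size(uint16_t ctrl)
{
	return (unsigned)(ctrl & MSIX_CTRL_TABLE_SIZE_MASK) + 1u;
}

static void msix_set_mask(struct msix_table_entry *e, int masked)
{
	assert(e != NULL);
	if (masked)
		e->vector_ctrl |= MSIX_VECTOR_CTRL_MASKBIT;
	else
		e->vector_ctrl &= ~MSIX_VECTOR_CTRL_MASKBIT;
}

static int msix_is_masked(const struct msix_table_entry *e)
{
	assert(e != NULL);
	return (e->vector_ctrl & MSIX_VECTOR_CTRL_MASKBIT) != 0;
}
```

With all eleven Table Size bits set, msix_table_size() yields 2048,
which is the maximum number of entries an MSI-X table can hold.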

4. Why use MSI?

By simplifying board design, MSI allows board designers to remove
out-of-band interrupt routing. MSI is another step towards a
legacy-free environment.

Due to increasing pressure on chipset and processor packages to
reduce pin count, the need for interrupt pins is expected to
diminish over time. Devices, due to pin constraints, may implement
messages to increase performance.

PCI Express endpoints use INTx emulation (in-band messages) instead
of IRQ pin assertion. Using INTx emulation requires interrupt
sharing among devices connected to the same node (PCI bridge), while
MSI is unique (non-shared) and does not require BIOS configuration
support. As a result, the PCI Express technology requires MSI
support for better interrupt performance.

Using MSI enables the device functions to support two or more
vectors, which can be configured to target different CPUs to
increase scalability.

5. Configuring a driver to use MSI/MSI-X

By default, the kernel does not automatically enable MSI/MSI-X on
devices that support this capability. The CONFIG_PCI_MSI kernel option
must be selected to enable MSI/MSI-X support.

5.1 Including MSI/MSI-X support into the kernel

To allow MSI/MSI-X capable device drivers to selectively enable
MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
below), the VECTOR based scheme needs to be enabled by setting
CONFIG_PCI_MSI during kernel config.

Since the target of the inbound message is the local APIC,
CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI.

5.2 Configuring for MSI support

Due to the non-contiguous fashion in vector assignment of the
existing Linux kernel, this version does not support multiple
messages, regardless of whether a device function is capable of
supporting more than one vector. Enabling MSI on a device function's
MSI capability structure requires a device driver to call the function
pci_enable_msi() explicitly.

5.2.1 API pci_enable_msi

int pci_enable_msi(struct pci_dev *dev)

With this new API, any existing device driver that would like to have
MSI enabled on its device function must call this API to enable MSI.
A successful call will initialize the MSI capability structure
with ONE vector, regardless of whether a device function is
capable of supporting multiple messages. This vector replaces the
pre-assigned dev->irq with a new MSI vector. To avoid a conflict
between the newly assigned vector and the existing pre-assigned vector,
a device driver must call this API before calling request_irq().
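
The ordering rule above can be illustrated with a small userspace
model. Everything here is a stand-in: struct mock_pci_dev,
mock_pci_enable_msi(), mock_request_irq() and the vector number 193 are
hypothetical and only mimic the contract described in this section
(return 0 on success, dev->irq replaced by the MSI vector). It is a
sketch of the call order, not driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for illustration only; NOT the kernel's pci_dev. */
struct mock_pci_dev {
	int irq;         /* plays the role of dev->irq */
	int msi_capable; /* nonzero if the mock device supports MSI */
};

#define MOCK_MSI_VECTOR 193 /* made-up vector number */

/* Mimics the documented contract: 0 on success, dev->irq is replaced. */
static int mock_pci_enable_msi(struct mock_pci_dev *dev)
{
	assert(dev != NULL);
	if (!dev->msi_capable)
		return -1;
	dev->irq = MOCK_MSI_VECTOR;
	return 0;
}

/* Pretends to register a handler and reports the IRQ it was bound to. */
static int mock_request_irq(int irq)
{
	return irq;
}

static int driver_setup_irq(struct mock_pci_dev *dev)
{
	/* Try MSI first; on failure dev->irq still holds the legacy IRQ. */
	(void)mock_pci_enable_msi(dev);
	/* Read dev->irq only now, so the handler uses the final value. */
	return mock_request_irq(dev->irq);
}
```

In this model an MSI-capable device ends up with its handler on vector
193, while a device without MSI keeps its pre-assigned IRQ. Calling
mock_request_irq() before mock_pci_enable_msi() would bind the handler
to the stale pre-assigned value.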

5.2.2 API pci_disable_msi

void pci_disable_msi(struct pci_dev *dev)

This API should always be used to undo the effect of pci_enable_msi()
when a device driver is unloading. This API restores dev->irq with
the pre-assigned IOAPIC vector and switches a device's interrupt
mode to PCI pin-irq assertion/INTx emulation mode.

Note that a device driver should always call free_irq() on the MSI
vector on which it has done request_irq() before calling this API.
Failure to do so results in a BUG_ON(), and the device will be left
with MSI enabled and will leak its vector.

5.2.3 MSI mode vs. legacy mode diagram

The diagram below shows the events that switch the interrupt
mode on the MSI-capable device function between MSI mode and
PIN-IRQ assertion mode.

 ------------   pci_enable_msi   ------------------------
|            | <=============== |                        |
| MSI MODE   |                  | PIN-IRQ ASSERTION MODE |
|            | ===============> |                        |
 ------------  pci_disable_msi   ------------------------


Figure 1.0 MSI Mode vs. Legacy Mode

In Figure 1.0, a device operates by default in legacy mode. Legacy
in this context means PCI pin-irq assertion or PCI-Express INTx
emulation. A successful MSI request (using pci_enable_msi()) switches
a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
stored in dev->irq will be saved by the PCI subsystem and a newly
assigned MSI vector will replace dev->irq.

To return to its default mode, a device driver should always call
pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
device driver should always call free_irq() on the MSI vector on which
it has done request_irq() before calling pci_disable_msi(). Failure to
do so results in a BUG_ON(), and the device will be left with MSI
enabled and will leak its vector. Otherwise, the PCI subsystem restores
a device's dev->irq with the pre-assigned IOAPIC vector and marks the
released MSI vector as unused.

Once the vector is marked as unused, there is no guarantee that the PCI
subsystem will reserve this MSI vector for a device. Depending on
the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, this MSI vector may be
re-assigned.

If the PCI subsystem has re-assigned this MSI vector to another
driver, a request to switch back to MSI mode may result in being
assigned a different MSI vector, or in a failure if no more vectors
are available.
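
The transitions of Figure 1.0 can be modelled as a two-state machine.
The sketch below is a userspace model only: struct model_dev and the
model_* functions are hypothetical, and where the kernel would hit a
BUG_ON() the model simply returns an error so that the rule can be
exercised safely.

```c
#include <assert.h>
#include <stddef.h>

enum irq_mode { MODE_LEGACY, MODE_MSI };

/* Hypothetical model of the state the PCI subsystem tracks. */
struct model_dev {
	enum irq_mode mode;
	int irq;           /* plays the role of dev->irq */
	int saved_irq;     /* pre-assigned IRQ saved while in MSI mode */
	int irq_requested; /* nonzero while a handler is registered */
};

/* Legacy -> MSI: save the pre-assigned IRQ and install the new vector. */
static int model_enable_msi(struct model_dev *d, int new_vector)
{
	assert(d != NULL);
	if (d->mode == MODE_MSI)
		return -1;
	d->saved_irq = d->irq;
	d->irq = new_vector;
	d->mode = MODE_MSI;
	return 0;
}

/* MSI -> legacy: only valid once the handler has been freed. */
static int model_disable_msi(struct model_dev *d)
{
	assert(d != NULL);
	if (d->mode != MODE_MSI)
		return -1;
	if (d->irq_requested)
		return -1; /* the kernel would BUG_ON() here */
	d->irq = d->saved_irq;
	d->mode = MODE_LEGACY;
	return 0;
}
```

A driver-side teardown therefore clears irq_requested (the model's
equivalent of free_irq()) before calling model_disable_msi(), after
which irq again holds the pre-assigned value.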

5.3 Configuring for MSI-X support

Because system software can configure each vector of the MSI-X
capability structure with an independent message address and message
data, the non-contiguous fashion in vector assignment of the existing
Linux kernel has no impact on supporting multiple messages on an MSI-X
capable device function. Enabling MSI-X on a device function's MSI-X
capability structure requires its device driver to call the function
pci_enable_msix() explicitly.

The function pci_enable_msix(), once invoked, enables either
all or nothing, depending on the current availability of PCI vector
resources. If the PCI vector resources are available for the number
of vectors requested by a device driver, this function will configure
the MSI-X table of the MSI-X capability structure of a device with
the requested messages. For example, a device may be capable of
supporting a maximum of 32 vectors while its software driver may
usually request only 4 vectors. It is recommended that the device
driver call this function once during the initialization phase of the
device driver.

Unlike the function pci_enable_msi(), the function pci_enable_msix()
does not replace the pre-assigned IOAPIC dev->irq with a new MSI
vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
into the field vector of each element contained in the second argument.
Note that the pre-assigned IOAPIC dev->irq is valid only if the device
operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by the
device driver to use dev->irq to request interrupt service may result
in unpredictable behavior.

For each MSI-X vector granted, a device driver is responsible for
calling other functions like request_irq(), enable_irq(), etc. to
enable this vector with its corresponding interrupt service handler.
It is a device driver's choice to assign all vectors with the same
interrupt service handler or each vector with a unique interrupt
service handler.
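
A minimal userspace sketch of the per-vector registration step follows.
struct msix_entry mirrors the two-field layout shown in section 5.3.4
(with uint16_t standing in for u16); handler_table, mock_request_irq(),
mock_free_irq(), sample_handler() and register_vectors() are
hypothetical stand-ins for the kernel facilities. The point is that
each handler is bound to entries[i].vector, never to dev->irq, and that
a failure part-way through is rolled back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct msix_entry {
	uint16_t vector; /* written by the PCI subsystem */
	uint16_t entry;  /* chosen by the driver */
};

typedef int (*irq_handler_fn)(int irq);

#define MAX_IRQS 256
/* Stand-in for the kernel's handler registry. */
static irq_handler_fn handler_table[MAX_IRQS];

static int sample_handler(int irq)
{
	return irq;
}

static int mock_request_irq(unsigned irq, irq_handler_fn fn)
{
	if (irq >= MAX_IRQS || handler_table[irq] != NULL)
		return -1; /* out of range or already taken */
	handler_table[irq] = fn;
	return 0;
}

static void mock_free_irq(unsigned irq)
{
	if (irq < MAX_IRQS)
		handler_table[irq] = NULL;
}

/* Bind one handler per granted vector; undo everything on failure. */
static int register_vectors(const struct msix_entry *e, int nvec,
			    irq_handler_fn fn)
{
	int i;

	assert(e != NULL && fn != NULL);
	for (i = 0; i < nvec; i++) {
		if (mock_request_irq(e[i].vector, fn) != 0) {
			while (--i >= 0)
				mock_free_irq(e[i].vector);
			return -1;
		}
	}
	return 0;
}
```

Passing the same handler for every element corresponds to the "same
interrupt service handler" choice above; a driver that prefers unique
handlers would pass a different function per vector.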

5.3.1 Handling MMIO address space of MSI-X Table

The PCI 3.0 specification has implementation notes stating that the
MMIO address space for a device's MSI-X structure should be isolated
so that system software can set different pages for controlling
accesses to the MSI-X structure. The implementation of the MSI patch
requires the PCI subsystem, not a device driver, to maintain full
control of the MSI-X table/MSI-X PBA and the MMIO address space of the
MSI-X table/MSI-X PBA. A device driver is prohibited from requesting
the MMIO address space of the MSI-X table/MSI-X PBA. Otherwise, the
PCI subsystem will fail to enable MSI-X on its hardware device when it
calls the function pci_enable_msix().

5.3.2 Handling MSI-X allocation

The number of MSI-X vectors allocated to a function depends on the
number of MSI capable devices and MSI-X capable devices populated in
the system. The policy for allocating MSI-X vectors to a function is
defined as follows:

#of MSI-X vectors allocated to a function = (x - y)/z where

x = The number of available PCI vector resources by the time
    the device driver calls pci_enable_msix(). The PCI vector
    resources are the sum of the number of unassigned vectors
    (new) and the number of released vectors when any MSI/MSI-X
    device driver switches its hardware device back to a legacy
    mode or is hot-removed. The number of unassigned vectors
    may exclude some vectors reserved, as defined in parameter
    NR_HP_RESERVED_VECTORS, for the case where the system is
    capable of supporting hot-add/hot-remove operations. Users
    may change the value defined in NR_HP_RESERVED_VECTORS to
    meet their specific needs.

y = The number of MSI capable devices populated in the system.
    This policy ensures that each MSI capable device has its
    vector reserved to avoid the case where some MSI-X capable
    drivers may attempt to claim all available vector resources.

z = The number of MSI-X capable devices populated in the system.
    This policy ensures that maximum (x - y) is distributed
    evenly among MSI-X capable devices.

Note that the PCI subsystem scans y and z during a bus enumeration.
When the PCI subsystem completes configuring the MSI/MSI-X capability
structure of a device as requested by its device driver, y/z is
decremented accordingly.
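
The policy can be written as a one-line computation. The function name
below is hypothetical, and two details are assumptions of this sketch
rather than statements of the guide: the division truncates like C
integer division, and the result is 0 when no MSI-X capable device is
present (z is 0) or when x does not exceed y.

```c
#include <assert.h>

/*
 * x: available PCI vector resources
 * y: number of MSI capable devices
 * z: number of MSI-X capable devices
 */
static int msix_vectors_per_function(int x, int y, int z)
{
	assert(x >= 0 && y >= 0 && z >= 0);
	if (z == 0 || x <= y)
		return 0; /* nothing to distribute (assumed behavior) */
	return (x - y) / z;
}
```

For example, with 100 available vectors, 4 MSI capable devices and
3 MSI-X capable devices, each MSI-X function is allotted
(100 - 4)/3 = 32 vectors.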

5.3.3 Handling MSI-X shortages

If fewer MSI-X vectors are allocated to a function than requested,
the function pci_enable_msix() will return the maximum number of MSI-X
vectors available to the caller. A device driver may re-send its
request with a number of vectors equal to or fewer than the number
indicated in the return value. For example, if a device driver
requests 5 vectors, but only 3 vectors are available,
pci_enable_msix() will return a value of 3. A function could be
designed for its driver to use only 3 MSI-X table entries in
different combinations such as ABC--, A-B-C, A--CB, etc. Note that
this patch does not support multiple entries with the same vector. Any
attempt by a device driver to use 5 MSI-X table entries with 3 vectors
as ABBCC, AABCC, BCCBA, etc. will result in a failure of the function
pci_enable_msix(). Below are the reasons why supporting multiple
entries with the same vector is an undesirable solution.

 - The PCI subsystem cannot determine which entry generated
 the message in order to mask/unmask MSI while handling the
 software driver ISR. Attempting to walk through all MSI-X
 table entries (2048 max) to mask/unmask any matching vector
 is an undesirable solution.

 - Walking through all MSI-X table entries (2048 max) to handle
 SMP affinity of any matching vector is an undesirable solution.
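
The re-send strategy described above might look like the following
userspace sketch. mock_pci_enable_msix() is a hypothetical stand-in
that only reproduces the documented return convention (zero on success,
a positive count on shortage, a negative value on failure); it omits
the struct pci_dev argument of the real function, and the vector
numbers it hands out are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct msix_entry {
	uint16_t vector;
	uint16_t entry;
};

/* Hypothetical pool of free vectors in the mock PCI subsystem. */
static int mock_vectors_available = 3;

static int mock_pci_enable_msix(struct msix_entry *entries, int nvec)
{
	int i;

	if (nvec <= 0 || mock_vectors_available <= 0)
		return -1;                     /* failure */
	if (nvec > mock_vectors_available)
		return mock_vectors_available; /* shortage: all or nothing */
	for (i = 0; i < nvec; i++)
		entries[i].vector = (uint16_t)(100 + i); /* made-up vectors */
	mock_vectors_available -= nvec;
	return 0;
}

/* Returns the number of vectors granted, or -1 on failure. */
static int enable_msix_with_retry(struct msix_entry *entries, int wanted)
{
	int nvec = wanted;

	assert(entries != NULL);
	while (nvec > 0) {
		int rc = mock_pci_enable_msix(entries, nvec);

		if (rc == 0)
			return nvec;
		if (rc < 0 || rc >= nvec)
			return -1; /* failure, or a count that cannot help */
		nvec = rc;         /* re-send with the reported maximum */
	}
	return -1;
}
```

With 3 vectors free, a request for 5 is first answered with 3, and the
second attempt with 3 succeeds. The driver then uses only the first
3 elements of its entries array.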

5.3.4 API pci_enable_msix

int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)

This API enables a device driver to request that the PCI subsystem
enable MSI-X messages on its hardware device. Depending on
the availability of PCI vector resources, the PCI subsystem enables
either all or nothing.

Argument dev points to the device (pci_dev) structure.

Argument entries is a pointer of unsigned integer type. The number of
elements is indicated in argument nvec. The content of each element
will be mapped to the following struct defined in /driver/pci/msi.h.

struct msix_entry {
        u16     vector; /* kernel uses to write alloc vector */
        u16     entry;  /* driver uses to specify entry */
};

A device driver is responsible for initializing the field entry of
each element with a unique entry supported by the MSI-X table.
Otherwise, -EINVAL will be returned as a result. A successful return
of zero indicates that the PCI subsystem has completed initializing
each of the requested entries of the MSI-X table with message address
and message data. Last but not least, the PCI subsystem will write the
1:1 vector-to-entry mapping into the field vector of each element. A
device driver is responsible for keeping track of allocated MSI-X
vectors in its internal data structure.

Argument nvec is an integer indicating the number of messages
requested.

A return of zero indicates that the requested number of MSI-X vectors
was successfully allocated. A return of greater than zero indicates
an MSI-X vector shortage. A return of less than zero indicates
a failure. This failure may be a result of duplicate entries
specified in the second argument, or a result of no available vector,
or a result of failing to initialize MSI-X table entries.
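
A driver can check its own entries array before making the call. The
helper below is a hypothetical userspace sketch, not part of the API:
it rejects duplicate entry values with -EINVAL, matching the failure
case described above. Treating an entry index that is not smaller than
the table size as invalid is this sketch's reading of "supported by the
MSI-X table".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct msix_entry {
	uint16_t vector; /* kernel uses to write alloc vector */
	uint16_t entry;  /* driver uses to specify entry */
};

/* Returns 0 if every entry is in range and unique, else -EINVAL. */
static int check_msix_entries(const struct msix_entry *entries, int nvec,
			      unsigned table_size)
{
	int i, j;

	if (entries == NULL || nvec <= 0)
		return -EINVAL;
	for (i = 0; i < nvec; i++) {
		if (entries[i].entry >= table_size)
			return -EINVAL; /* beyond the device's table */
		for (j = i + 1; j < nvec; j++)
			if (entries[i].entry == entries[j].entry)
				return -EINVAL; /* duplicate entry */
	}
	return 0;
}
```

The quadratic scan is adequate for the handful of vectors a driver
typically requests.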

5.3.5 API pci_disable_msix

void pci_disable_msix(struct pci_dev *dev)

This API should always be used to undo the effect of pci_enable_msix()
when a device driver is unloading. Note that a device driver should
always call free_irq() on all MSI-X vectors on which it has done
request_irq() before calling this API. Failure to do so results in a
BUG_ON(), and the device will be left with MSI-X enabled and will leak
its vectors.

5.3.6 MSI-X mode vs. legacy mode diagram

The diagram below shows the events that switch the interrupt
mode on the MSI-X capable device function between MSI-X mode and
PIN-IRQ assertion mode (legacy).

 ------------  pci_enable_msix(,,n)  ------------------------
|            | <===============     |                        |
| MSI-X MODE |                      | PIN-IRQ ASSERTION MODE |
|            | ===============>     |                        |
 ------------  pci_disable_msix      ------------------------

Figure 2.0 MSI-X Mode vs. Legacy Mode

In Figure 2.0, a device operates by default in legacy mode. A
successful MSI-X request (using pci_enable_msix()) switches a
device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
stored in dev->irq will be saved by the PCI subsystem; however,
unlike MSI mode, the PCI subsystem will not replace dev->irq with the
assigned MSI-X vector because the PCI subsystem already writes the 1:1
vector-to-entry mapping into the field vector of each element
specified in the second argument.

To return to its default mode, a device driver should always call
pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
a device driver should always call free_irq() on all MSI-X vectors on
which it has done request_irq() before calling pci_disable_msix().
Failure to do so results in a BUG_ON(), and the device will be left
with MSI-X enabled and will leak its vectors. Otherwise, the PCI
subsystem switches a device function's interrupt mode from MSI-X mode
to legacy mode and marks all allocated MSI-X vectors as unused.

Once the vectors are marked as unused, there is no guarantee that the
PCI subsystem will reserve these MSI-X vectors for a device. Depending
on the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, these MSI-X vectors may be
re-assigned.

If the PCI subsystem has re-assigned these MSI-X vectors to another
driver, a request to switch back to MSI-X mode may result in being
assigned another set of MSI-X vectors, or in a failure if no more
vectors are available.

5.4 Handling function implementing both MSI and MSI-X capabilities

For the case where a function implements both MSI and MSI-X
capabilities, the PCI subsystem enables a device to run either in MSI
mode or MSI-X mode but not both. A device driver determines whether it
wants MSI or MSI-X enabled on its hardware device. Once a device
driver requests MSI, for example, it is prohibited from requesting
MSI-X; in other words, a device driver is not permitted to ping-pong
between MSI mode and MSI-X mode at run-time.
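
The either-or rule can be expressed as a small userspace model. All
names below are hypothetical, and the exact conditions under which the
PCI subsystem rejects a request are an assumption of this sketch: the
model simply refuses any new mode request while a message mode is
already active.

```c
#include <assert.h>
#include <stddef.h>

enum irq_mode { MODE_LEGACY, MODE_MSI, MODE_MSIX };

/* Hypothetical model of a function with both capabilities. */
struct model_fn {
	int has_msi;  /* MSI capability structure present */
	int has_msix; /* MSI-X capability structure present */
	enum irq_mode mode;
};

/* Returns 0 if the function switched to the wanted mode, else -1. */
static int model_request_mode(struct model_fn *f, enum irq_mode want)
{
	assert(f != NULL);
	if (f->mode != MODE_LEGACY)
		return -1; /* already in MSI or MSI-X mode: no ping-pong */
	if (want == MODE_MSI && !f->has_msi)
		return -1;
	if (want == MODE_MSIX && !f->has_msix)
		return -1;
	f->mode = want;
	return 0;
}
```

In this model a function that has entered MSI mode rejects a later
MSI-X request and stays in MSI mode.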

5.5 Hardware requirements for MSI/MSI-X support
MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.

5.5.1 System hardware support
Since the target of the MSI address is the local APIC of a CPU,
enabling MSI/MSI-X support in the Linux kernel depends on whether the
existing system hardware supports a local APIC. Users should verify
that their system runs with CONFIG_X86_LOCAL_APIC=y.

In an SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
however, in a UP environment, users must manually set
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
CONFIG_PCI_MSI enables the VECTOR based scheme and
the option for MSI-capable device drivers to selectively enable
MSI/MSI-X.

Note that the CONFIG_X86_IO_APIC setting is irrelevant because an
MSI/MSI-X vector is newly allocated at runtime and MSI/MSI-X support
does not depend on BIOS support. This key independence enables
MSI/MSI-X support on future IOxAPIC-free platforms.

5.5.2 Device hardware support
The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure on its PCI capability list. By
default, this capability structure will not be initialized by
the kernel to enable MSI during the system boot. In other words,
the device function is running in its default pin assertion mode.
Note that in many cases hardware supporting MSI has bugs,
which may result in a system hang. The software driver of specific
MSI-capable hardware is responsible for deciding whether to call
pci_enable_msi(). A return of zero indicates that the kernel
successfully initialized the MSI/MSI-X capability structure of the
device function. The device function is now running in MSI/MSI-X mode.

5.6 How to tell whether MSI/MSI-X is enabled on a device function

At the driver level, a return of zero from the function call of
pci_enable_msi()/pci_enable_msix() indicates to a device driver that
its device function is initialized successfully and ready to run in
MSI/MSI-X mode.

At the user level, users can use the command 'cat /proc/interrupts'
to display the vector allocated for a device and its interrupt
MSI/MSI-X mode ("PCI MSI"/"PCI MSIX"). The output below shows that MSI
mode is enabled on a SCSI Adaptec 39320D Ultra320.

           CPU0       CPU1
  0:     324639          0    IO-APIC-edge  timer
  1:       1186          0    IO-APIC-edge  i8042
  2:          0          0          XT-PIC  cascade
 12:       2797          0    IO-APIC-edge  i8042
 14:       6543          0    IO-APIC-edge  ide0
 15:          1          0    IO-APIC-edge  ide1
169:          0          0   IO-APIC-level  uhci-hcd
185:          0          0   IO-APIC-level  uhci-hcd
193:        138         10         PCI MSI  aic79xx
201:         30          0         PCI MSI  aic79xx
225:         30          0   IO-APIC-level  aic7xxx
233:         30          0   IO-APIC-level  aic7xxx
NMI:          0          0
LOC:     324553     325068
ERR:          0
MIS:          0
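
A script or test tool could recognize these lines with a simple
substring check. The sketch below is a hypothetical userspace helper;
it relies only on the two labels quoted above ("PCI MSI" and
"PCI MSIX") and assumes they appear verbatim in the line, which may
not hold for other kernel versions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum irq_kind { IRQ_OTHER, IRQ_MSI, IRQ_MSIX };

/* Classify one line of /proc/interrupts by its interrupt-type label. */
static enum irq_kind classify_irq_line(const char *line)
{
	assert(line != NULL);
	/* Test the longer label first: "PCI MSI" is a prefix of it. */
	if (strstr(line, "PCI MSIX") != NULL)
		return IRQ_MSIX;
	if (strstr(line, "PCI MSI") != NULL)
		return IRQ_MSI;
	return IRQ_OTHER;
}
```

Applied to the listing above, the two aic79xx lines (vectors 193 and
201) are classified as IRQ_MSI, while the aic7xxx lines on
IO-APIC-level interrupts are IRQ_OTHER.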

6. FAQ

Q1. Are there any limitations on using MSI?

A1. If the PCI device supports MSI and conforms to the
specification and the platform supports the APIC local bus,
then using MSI should work.

Q2. Will it work on all the Pentium processors (P3, P4, Xeon,
AMD processors)? In P3, IPIs are transmitted on the APIC local
bus and in P4 and Xeon they are transmitted on the system
bus. Are there any implications with this?

A2. MSI support enables a PCI device to send an inbound
memory write (0xfeexxxxx as target address) on its PCI bus
directly to the FSB. Since the message address has the
redirection hint bit cleared, it should work.
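
To make the 0xfeexxxxx target address and the redirection hint bit more
concrete, here is a userspace sketch that composes and inspects such an
address. The field positions (destination APIC ID in bits 19:12,
redirection hint in bit 3) are taken from Intel's architecture
documentation rather than from this guide, and the macro and function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_BASE       0xFEE00000u /* bits 31:20 are fixed at 0xFEE */
#define MSI_ADDR_BASE_MASK  0xFFF00000u
#define MSI_ADDR_DEST_SHIFT 12          /* destination APIC ID, bits 19:12 */
#define MSI_ADDR_RH         0x00000008u /* bit 3: redirection hint */

/* Build a message address for the given destination APIC ID. */
static uint32_t msi_compose_addr(uint8_t dest_apic_id, int redirection_hint)
{
	uint32_t addr = MSI_ADDR_BASE |
			((uint32_t)dest_apic_id << MSI_ADDR_DEST_SHIFT);

	if (redirection_hint)
		addr |= MSI_ADDR_RH;
	return addr;
}

/* True if the address falls in the 0xfeexxxxx window. */
static int msi_addr_in_window(uint32_t addr)
{
	return (addr & MSI_ADDR_BASE_MASK) == MSI_ADDR_BASE;
}

static int msi_addr_redirection_hint(uint32_t addr)
{
	return (addr & MSI_ADDR_RH) != 0;
}
```

An address composed with the hint cleared, as A2 describes, directs the
message to the single processor named by the destination ID.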

Q3. The target address 0xfeexxxxx will be translated by the
Host Bridge into an interrupt message. Are there any
limitations on the chipsets such as Intel 8xx, Intel e7xxx,
or VIA?

A3. If these chipsets support an inbound memory write with
the target address set as 0xfeexxxxx, in conformance with PCI
specification 2.3 or later, then it should work.

Q4. From the driver point of view, if the MSI is lost because
of errors that occur during the inbound memory write, then the
driver may wait forever. Is there a mechanism for it to recover?

A4. Since the target of the transaction is an inbound memory
write, all transaction termination conditions (Retry,
Master-Abort, Target-Abort, or normal completion) are
supported. A device sending an MSI must abide by all the PCI
rules and conditions regarding that inbound memory write. So,
if a retry is signaled it must retry, etc. We believe that
the recommendation for Abort is also a retry (refer to PCI
specification 2.3 or later).
499supported. A device sending an MSI must abide by all the PCI
500rules and conditions regarding that inbound memory write. So,
501if a retry is signaled it must retry, etc... We believe that
502the recommendation for Abort is also a retry (refer to PCI
503specification 2.3 or latest).