Documentation: Hardware tag matching
Tag matching logic

The MPI standard defines a set of rules, known as tag-matching, for matching
source send operations to destination receives. The following source and
destination parameters must match:
* Communicator
* User tag - wild card may be specified by the receiver
* Source rank - wild card may be specified by the receiver
* Destination rank - wild
The ordering rules require that when more than one pair of send and receive
message envelopes may match, the pair that includes the earliest posted-send
and the earliest posted-receive is the pair that must be used to satisfy the
matching operation. However, this doesn't imply that tags are consumed in
the order they are created; for example, a later-generated tag may be
consumed if earlier tags can't be used to satisfy the matching rules.
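The matching and ordering rules above can be sketched in a few lines of
Python. This is only an illustration: the names ANY_TAG, ANY_SOURCE, Envelope,
PostedReceive, matches and first_match are hypothetical, and the destination
rank is left out on the assumption that the list being searched already
belongs to the destination process.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY_TAG = -1      # hypothetical wildcard marker for the user tag
ANY_SOURCE = -1   # hypothetical wildcard marker for the source rank


@dataclass
class Envelope:
    comm: int     # communicator identifier
    source: int   # rank of the sender
    tag: int      # user tag


@dataclass
class PostedReceive:
    comm: int
    source: int   # may be ANY_SOURCE
    tag: int      # may be ANY_TAG


def matches(recv: PostedReceive, env: Envelope) -> bool:
    """Return True if the receive accepts the message envelope."""
    return (recv.comm == env.comm
            and recv.source in (ANY_SOURCE, env.source)
            and recv.tag in (ANY_TAG, env.tag))


def first_match(posted: List[PostedReceive], env: Envelope) -> Optional[int]:
    """Index of the earliest posted receive that matches env, or None."""
    for i, recv in enumerate(posted):
        if matches(recv, env):
            return i
    return None


# Receives in posting order; the earliest one cannot match tag 7.
posted = [PostedReceive(comm=0, source=1, tag=5),
          PostedReceive(comm=0, source=ANY_SOURCE, tag=7),
          PostedReceive(comm=0, source=1, tag=ANY_TAG)]
msg = Envelope(comm=0, source=1, tag=7)
idx = first_match(posted, msg)
```

Here the second receive is consumed even though an earlier one is still
pending, because the earlier receive cannot satisfy the matching rules for
this message.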

When a message is sent from the sender to the receiver, the communication
library may attempt to process the operation either after or before the
corresponding matching receive is posted. If a matching receive is already
posted, this is an expected message; otherwise it is called an unexpected
message. Implementations frequently use different matching schemes for these
two cases.

To keep the MPI library memory footprint down, MPI implementations typically
use two different protocols for this purpose:

1. The Eager protocol - the complete message is sent when the send is
processed by the sender. A send completion is received in the send_cq,
notifying the sender that the buffer can be reused.

2. The Rendezvous protocol - the sender sends the tag-matching header,
and perhaps a portion of the data, when first notifying the receiver. When the
corresponding buffer is posted, the responder uses the information from
the header to initiate an RDMA READ operation directly to the matching buffer.
A fin message needs to be received in order for the buffer to be reused.
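As a rough sketch of how a sender might choose between the two protocols,
the Python below picks one by message size. The threshold EAGER_LIMIT, the
function plan_send and the dictionary layout are assumptions made for
illustration and are not taken from any particular MPI library.

```python
EAGER_LIMIT = 8192  # assumed threshold in bytes, purely illustrative


def plan_send(payload: bytes, tag: int, eager_limit: int = EAGER_LIMIT) -> dict:
    """Describe what the sender would put on the wire first."""
    header = {"tag": tag, "length": len(payload)}
    if len(payload) <= eager_limit:
        # Eager: the complete message travels with the header.
        return {"protocol": "eager", "header": header, "data": payload}
    # Rendezvous: only the tag-matching header is sent up front; the
    # receiver is expected to fetch the payload later (e.g. RDMA READ).
    return {"protocol": "rendezvous", "header": header, "data": b""}


small = plan_send(b"x" * 10, tag=3)
large = plan_send(b"x" * 10000, tag=3)
```

In this sketch the rendezvous branch sends no payload at all; a real
implementation may also send a portion of the data along with the header.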

Tag matching implementation

There are two types of matching objects used, the posted receive list and the
unexpected message list. The application posts receive buffers to the posted
receive list through calls to the MPI receive routines, and posts send
messages using the MPI send routines. The head of the posted receive list may
be maintained by the hardware, with the software expected to shadow this list.

When a send is initiated and arrives at the receive side, if there is no
pre-posted receive for this arriving message, it is passed to the software and
placed in the unexpected message list. Otherwise the match is processed,
including rendezvous processing, if appropriate, delivering the data to the
specified receive buffer. This allows overlapping receive-side MPI tag
matching with computation.
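The arrival path described above can be modelled in software as follows.
This is a simplified sketch: the class ReceiveSide, the helper matches, the
wildcard value ANY and the dictionary fields are all hypothetical, and the
hardware offload and rendezvous processing are not modelled.

```python
ANY = -1  # hypothetical wildcard value for source rank and tag


def matches(recv, msg):
    """True if a posted receive accepts an arriving message envelope."""
    return (recv["comm"] == msg["comm"]
            and recv["source"] in (ANY, msg["source"])
            and recv["tag"] in (ANY, msg["tag"]))


class ReceiveSide:
    def __init__(self):
        self.posted = []       # posted receive list, in posting order
        self.unexpected = []   # unexpected message list, in arrival order

    def on_arrival(self, msg):
        """Handle an arriving send; return the matched receive or None."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]            # earliest matching receive wins
                recv["buffer"] = msg["data"]  # deliver into the receive buffer
                return recv
        self.unexpected.append(msg)           # no pre-posted receive: queue it
        return None


rx = ReceiveSide()
rx.posted.append({"comm": 0, "source": ANY, "tag": 7, "buffer": None})
expected = rx.on_arrival({"comm": 0, "source": 3, "tag": 7, "data": b"hello"})
unexpected = rx.on_arrival({"comm": 0, "source": 3, "tag": 9, "data": b"late"})
```

The first message finds a pre-posted receive and is delivered; the second
finds none and is placed in the unexpected message list.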

When a receive-message is posted, the communication library will first check
the software unexpected message list for a matching message. If a match is
found, data is delivered to the user buffer, using a software controlled
protocol. The UCX implementation uses either an eager or rendezvous protocol,
depending on data size. If no match is found, and the entire pre-posted
receive list is maintained by the hardware and there is space to add one more
pre-posted receive to this list, this receive is passed to the hardware.
Software is expected to shadow this list, to help with processing MPI cancel
operations. In addition, because hardware and software are not expected to be
tightly synchronized with respect to the tag-matching operation, this shadow
list is used to detect the case where a pre-posted receive is passed to the
hardware while the matching unexpected message is being passed from the
hardware to the software.
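A minimal sketch of the receive-posting path, under stated assumptions:
the class TagMatcher, its fields and the hw_capacity parameter are
hypothetical, the hardware list is represented only by its software shadow,
and the race between hardware and software is not modelled.

```python
ANY = -1  # hypothetical wildcard value for source rank and tag


def matches(recv, msg):
    """True if a posted receive accepts a message envelope."""
    return (recv["comm"] == msg["comm"]
            and recv["source"] in (ANY, msg["source"])
            and recv["tag"] in (ANY, msg["tag"]))


class TagMatcher:
    def __init__(self, hw_capacity):
        self.unexpected = []   # software unexpected message list
        self.shadow = []       # software shadow of receives given to hardware
        self.sw_posted = []    # receives that are kept in software only
        self.hw_capacity = hw_capacity  # assumed size of the hardware list

    def post_receive(self, recv):
        # 1. Check the unexpected message list first, in arrival order.
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return ("matched", msg)
        # 2. Offload only if the whole list is in hardware and has room.
        if not self.sw_posted and len(self.shadow) < self.hw_capacity:
            self.shadow.append(recv)
            return ("hardware", None)
        # 3. Otherwise the receive stays in software.
        self.sw_posted.append(recv)
        return ("software", None)


tm = TagMatcher(hw_capacity=1)
tm.unexpected.append({"comm": 0, "source": 1, "tag": 7, "data": b"hi"})
r1 = tm.post_receive({"comm": 0, "source": ANY, "tag": 7})
r2 = tm.post_receive({"comm": 0, "source": 1, "tag": 8})
r3 = tm.post_receive({"comm": 0, "source": 1, "tag": 9})
```

Keeping every later receive in software once one receive has stayed there is
one interpretation of "the entire pre-posted receive list is maintained by
the hardware"; it preserves posting order across the two lists.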