==================
Tag matching logic
==================

The MPI standard defines a set of rules, known as tag-matching, for matching
source send operations to destination receives. The following parameters must
match between the source send and the destination receive:

* Communicator
* User tag - a wildcard may be specified by the receiver
* Source rank - a wildcard may be specified by the receiver
* Destination rank - wild

The ordering rules require that when more than one pair of send and receive
message envelopes may match, the pair that includes the earliest posted-send
and the earliest posted-receive is the pair that must be used to satisfy the
matching operation. However, this does not imply that tags are consumed in
the order they are created. For example, a later-generated tag may be consumed
if earlier tags cannot be used to satisfy the matching rules.
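
The matching and ordering rules above can be sketched in a few lines of
Python. This is only an illustration, not code from any MPI library: the
``Envelope`` class, the ``ANY`` marker and the ``first_match`` helper are
hypothetical names, and the destination rank is left implicit on the
assumption that each process keeps its own list of posted receives.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY = -1  # hypothetical wildcard marker (stands in for a receiver-side wildcard)


@dataclass
class Envelope:
    comm: int    # communicator id
    source: int  # source rank (ANY is meaningful on the receive side only)
    tag: int     # user tag (ANY is meaningful on the receive side only)


def matches(recv: Envelope, send: Envelope) -> bool:
    """Return True if a posted receive can accept the given send envelope."""
    return (
        recv.comm == send.comm
        and recv.source in (ANY, send.source)
        and recv.tag in (ANY, send.tag)
    )


def first_match(posted: List[Envelope], send: Envelope) -> Optional[int]:
    """Index of the earliest posted receive that matches, or None."""
    for i, recv in enumerate(posted):  # the list is kept in posting order
        if matches(recv, send):
            return i
    return None


posted = [
    Envelope(comm=0, source=2, tag=7),      # posted first
    Envelope(comm=0, source=ANY, tag=ANY),  # posted second
    Envelope(comm=0, source=1, tag=5),      # posted third
]
# A send from rank 1 with tag 5 cannot use the first receive, so it takes
# the second one: the earliest posted receive that satisfies the rules.
idx = first_match(posted, Envelope(comm=0, source=1, tag=5))
```

Here the third receive also matches, but it was posted later, so the
wildcard receive is the one consumed.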

When a message is sent from the sender to the receiver, the communication
library may attempt to process the operation either before or after the
corresponding matching receive is posted. If a matching receive has already
been posted, the message is an expected message; otherwise it is an
unexpected message. Implementations frequently use different matching schemes
for these two cases.

To keep the MPI library memory footprint down, MPI implementations typically
use two different protocols for this purpose:

1. The Eager protocol - the complete message is sent when the send is
   processed by the sender. A send completion is received in the send_cq,
   notifying the sender that the buffer can be reused.

2. The Rendezvous protocol - the sender sends the tag-matching header,
   and perhaps a portion of the data, when first notifying the receiver. When
   the corresponding buffer is posted, the responder uses the information from
   the header to initiate an RDMA READ operation directly to the matching
   buffer. A fin message needs to be received in order for the buffer to be
   reused.
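
The difference between the two protocols shows up in the first packet the
sender emits. The sketch below is a simplified model, not a real wire format:
the ``EAGER_LIMIT`` threshold, the ``first_packet`` helper and the dictionary
layout are all assumptions made for illustration, and choosing the protocol by
payload size is assumed here as one common policy.

```python
EAGER_LIMIT = 1024  # hypothetical threshold in bytes, not taken from any library


def choose_protocol(nbytes: int, eager_limit: int = EAGER_LIMIT) -> str:
    """Pick a protocol by payload size (an assumed selection policy)."""
    return "eager" if nbytes <= eager_limit else "rendezvous"


def first_packet(payload: bytes, tag: int, eager_limit: int = EAGER_LIMIT) -> dict:
    """Model the first packet the sender puts on the wire."""
    if choose_protocol(len(payload), eager_limit) == "eager":
        # Eager: the complete message travels with the tag-matching header.
        return {"proto": "eager", "tag": tag, "data": payload}
    # Rendezvous: only the header goes out now; the receiver later pulls
    # the data (RDMA READ in the text above) and then sends a fin message.
    return {"proto": "rendezvous", "tag": tag, "length": len(payload), "data": b""}


small = first_packet(b"x" * 10, tag=3)
large = first_packet(b"x" * 4096, tag=3)
```

In this model the sender may reuse the buffer of ``small`` once its send
completion arrives, while the buffer behind ``large`` has to stay untouched
until the fin message is received.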

Tag matching implementation
===========================

Two types of matching objects are used: the posted receive list and the
unexpected message list. The application posts receive buffers to the posted
receive list through calls to the MPI receive routines, and posts send
messages using the MPI send routines. The head of the posted receive list may
be maintained by the hardware, with the software expected to shadow this list.

When a send is initiated and arrives at the receive side, if there is no
pre-posted receive for the arriving message, it is passed to the software and
placed in the unexpected message list. Otherwise the match is processed,
including rendezvous processing if appropriate, and the data is delivered to
the specified receive buffer. This allows overlapping receive-side MPI tag
matching with computation.
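
The arrival path described above can be modelled as follows. This is a
software-only sketch under stated assumptions: envelopes are plain
``(communicator, source, tag)`` tuples, and ``ReceiveSide``, ``on_arrival``
and ``ANY`` are hypothetical names, not part of any driver or library API.

```python
from collections import deque

ANY = -1  # hypothetical wildcard marker for source rank or tag


def matches(recv, msg):
    """Compare (communicator, source, tag) tuples, honoring receiver wildcards."""
    return (
        recv[0] == msg[0]
        and recv[1] in (ANY, msg[1])
        and recv[2] in (ANY, msg[2])
    )


class ReceiveSide:
    def __init__(self):
        self.posted = []           # posted receive list, in posting order
        self.unexpected = deque()  # unexpected message list, in arrival order

    def on_arrival(self, msg):
        """Handle an arriving send; return the matched receive or None."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]  # the receive is consumed by the match
                return recv
        self.unexpected.append(msg)  # no pre-posted receive: keep for later
        return None


side = ReceiveSide()
side.posted.append((0, ANY, 5))    # receive: any source, tag 5
hit = side.on_arrival((0, 3, 5))   # expected message, matches the receive
miss = side.on_arrival((0, 3, 9))  # unexpected message, queued in software
```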

When a receive-message is posted, the communication library will first check
the software unexpected message list for a matching message. If a match is
found, data is delivered to the user buffer, using a software controlled
protocol. The UCX implementation uses either an eager or rendezvous protocol,
depending on data size. If no match is found, and the entire pre-posted
receive list is maintained by the hardware and has space for one more
pre-posted receive, this receive is passed to the hardware.
Software is expected to shadow this list, to help with processing MPI cancel
operations. In addition, because hardware and software are not expected to be
tightly synchronized with respect to the tag-matching operation, this shadow
list is used to detect the case where a pre-posted receive is passed to the
hardware while the matching unexpected message is being passed from the
hardware to the software.
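
The receive-posting path can be sketched the same way. Everything below is an
assumption made for illustration: the hardware list is simulated by a Python
list with a hypothetical ``hw_capacity``, the ``TagMatcher`` class and its
method names are invented, and keeping overflow receives in a software list is
one possible policy that the text above does not spell out.

```python
ANY = -1  # hypothetical wildcard marker for source rank or tag


def matches(recv, msg):
    """Compare (communicator, source, tag) tuples, honoring receiver wildcards."""
    return (
        recv[0] == msg[0]
        and recv[1] in (ANY, msg[1])
        and recv[2] in (ANY, msg[2])
    )


class TagMatcher:
    def __init__(self, hw_capacity=2):
        self.hw_capacity = hw_capacity  # hypothetical hardware list size
        self.unexpected = []  # software unexpected message list
        self.hw_posted = []   # receives handed to the (simulated) hardware
        self.shadow = []      # software shadow of hw_posted
        self.sw_posted = []   # receives kept in software (assumed overflow policy)

    def post_receive(self, recv):
        # Step 1: check the unexpected message list first.
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return ("matched_unexpected", msg)
        # Step 2: pass to hardware only if the whole list lives there
        # and there is room for one more entry.
        if not self.sw_posted and len(self.hw_posted) < self.hw_capacity:
            self.hw_posted.append(recv)
            self.shadow.append(recv)  # software shadows the hardware list
            return ("posted_hw", None)
        self.sw_posted.append(recv)
        return ("posted_sw", None)

    def cancel(self, recv):
        """Cancel a posted receive, using the shadow list to locate it."""
        if recv in self.shadow:
            self.shadow.remove(recv)
            self.hw_posted.remove(recv)
            return True
        if recv in self.sw_posted:
            self.sw_posted.remove(recv)
            return True
        return False


tm = TagMatcher(hw_capacity=1)
tm.unexpected.append((0, 1, 5))
first = tm.post_receive((0, ANY, 5))  # satisfied from the unexpected list
second = tm.post_receive((0, 1, 6))   # fits in the simulated hardware list
third = tm.post_receive((0, 1, 7))    # hardware list is full, stays in software
```

The race mentioned above, where a receive reaches the hardware while its
matching message is on the way to software, is not modelled here; a real
implementation would have to reconcile the shadow list with the unexpected
list when that happens.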