.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. In other words, FOLL_LONGTERM is a
specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
page* array, and the function then pins pages by incrementing the refcount of
each by a special value, GUP_PIN_COUNTING_BIAS::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct page* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.
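
The flag handling described in this section can be modeled in a few lines of
userspace C. This is only a sketch: the flag values are placeholders chosen for
illustration (they are not the kernel's definitions), and model_pin_flags() and
model_get_flags() are hypothetical helpers, not kernel functions. It shows the
pin wrappers always adding FOLL_PIN, the get wrappers adding FOLL_GET only when
a pages array is supplied, and the two flags being rejected in combination.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bit values for illustration only; not the kernel's definitions. */
#define FOLL_GET      0x1u
#define FOLL_PIN      0x2u
#define FOLL_LONGTERM 0x4u

/*
 * Hypothetical model of a pin_user_pages*() wrapper: FOLL_PIN is always
 * OR'd in. A non-null pages array is required, and FOLL_GET is rejected
 * because FOLL_PIN and FOLL_GET are mutually exclusive for a single call.
 * Returns 0 and stores the effective flags in *out, or -1 on misuse.
 */
static int model_pin_flags(unsigned int caller_flags, const void *pages,
                           unsigned int *out)
{
    assert(out != NULL);
    if (pages == NULL || (caller_flags & FOLL_GET))
        return -1;
    *out = caller_flags | FOLL_PIN;
    return 0;
}

/*
 * Hypothetical model of a get_user_pages*() wrapper: FOLL_GET is added
 * only when the caller supplies a pages array. FOLL_PIN is internal to
 * gup, so a call site that passes it is treated as an error.
 */
static int model_get_flags(unsigned int caller_flags, const void *pages,
                           unsigned int *out)
{
    assert(out != NULL);
    if (caller_flags & FOLL_PIN)
        return -1;
    *out = pages ? (caller_flags | FOLL_GET) : caller_flags;
    return 0;
}
```

In this model a call site that wants a long-term pin passes only FOLL_LONGTERM
to the pin helper and gets back FOLL_PIN | FOLL_LONGTERM, which mirrors the rule
that FOLL_LONGTERM may appear at call sites while FOLL_PIN may not.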

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

  This also leads to limitations: there are only 31-10==21 bits available for a
  counter that increments 10 bits at a time.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.
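
The bias arithmetic in the list above can be tried out with a small userspace
model. This is a sketch under stated assumptions: struct model_page and the
model_*() helpers are hypothetical stand-ins, not kernel code, and the refcount
is a plain int rather than an atomic. Only the value 1024 for
GUP_PIN_COUNTING_BIAS is taken from the text.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024  /* value given in the text: 2^10 */

/* Simplified stand-in for a small page's refcount; not struct page. */
struct model_page {
    int refcount;
};

/* An ordinary reference, as get_page() would take: +1. */
static void model_get(struct model_page *p)
{
    p->refcount += 1;
}

/* A pin: one bias-sized step instead of +1. */
static void model_pin(struct model_page *p)
{
    p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Dropping a pin removes exactly one bias-sized step. */
static void model_unpin(struct model_page *p)
{
    assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
    p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* True for real pins, and also once 1024 plain references accumulate. */
static bool model_maybe_pinned(const struct model_page *p)
{
    return p->refcount >= GUP_PIN_COUNTING_BIAS;
}

/* Pins that fit in a positive int when each pin adds 1024. */
static int model_max_pins(void)
{
    return INT_MAX / GUP_PIN_COUNTING_BIAS;
}
```

A page with refcount 1 that then receives 1023 calls to model_get() reaches
1024 and reports "maybe pinned" with no pin taken, which is the accepted false
positive. With a 32-bit int, model_max_pins() evaluates to 2^21 - 1, matching
the 21 bits mentioned above.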

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving as DIO buffers. These
buffers are needed for a relatively short time (so they are not "long term").
No special synchronization with page_mkclean() or munmap() is provided.
Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
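
As a compact summary of the five cases, the mapping from case to flags can be
written as a lookup. This is an illustrative sketch only: enum gup_case,
model_flags_for_case() and model_use_pin_api() are hypothetical names
introduced here, and the flag values are placeholders rather than the kernel's
definitions.

```c
#include <assert.h>

/* Placeholder bit values for illustration only; not the kernel's definitions. */
#define FOLL_PIN      0x2u
#define FOLL_LONGTERM 0x4u

/* The five cases from this section (hypothetical enum, for this sketch only). */
enum gup_case {
    CASE_DIRECT_IO = 1,
    CASE_RDMA,
    CASE_MMU_NOTIFIER,
    CASE_STRUCT_PAGE_ONLY,
    CASE_WRITE_PAGE_DATA,
};

/* Flags that the text says each case needs at the gup level. */
static unsigned int model_flags_for_case(enum gup_case c)
{
    assert(c >= CASE_DIRECT_IO && c <= CASE_WRITE_PAGE_DATA);
    switch (c) {
    case CASE_DIRECT_IO:
        return FOLL_PIN;                  /* short-term pin */
    case CASE_RDMA:
        return FOLL_PIN | FOLL_LONGTERM;  /* long-term pin */
    case CASE_MMU_NOTIFIER:
    case CASE_STRUCT_PAGE_ONLY:
        return 0;                         /* neither flag is needed */
    case CASE_WRITE_PAGE_DATA:
        return FOLL_PIN;                  /* page data is written */
    }
    return 0; /* not reached for valid enum values */
}

/* Nonzero when the call site should use pin_user_pages*() instead of get_user_pages*(). */
static int model_use_pin_api(enum gup_case c)
{
    return (model_flags_for_case(c) & FOLL_PIN) != 0;
}
```

Since FOLL_PIN must not be set directly at call sites, a nonzero result from
model_use_pin_api() corresponds to choosing a pin_user_pages*() routine, with
FOLL_LONGTERM passed by the caller only in the RDMA case.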

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
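
To make the "maybe" in the name concrete, the query can be modeled in userspace
for both kinds of pages described in this document. This is a sketch, not the
kernel implementation: struct model_folio and its field names are assumptions
made for illustration, and the counters are plain ints rather than atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GUP_PIN_COUNTING_BIAS 1024  /* value given in the text */

/* Simplified stand-in for page/folio state; field names are assumptions. */
struct model_folio {
    bool large;    /* large folio: keeps a separate pincount */
    int refcount;  /* ordinary references, plus bias steps for small pages */
    int pincount;  /* meaningful only when large is true */
};

/*
 * Model of the "is this page DMA-pinned?" query.
 * Large folios answer exactly from the pincount. Small pages answer from
 * the biased refcount, so many plain references can look like a pin.
 */
static bool model_maybe_dma_pinned(const struct model_folio *f)
{
    assert(f != NULL);
    if (f->large)
        return f->pincount > 0;
    return f->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model a small page with 5000 plain references reports true without ever
being pinned, while a large folio with the same refcount and a pincount of zero
reports false. False negatives do not occur in either branch, which is the
property the design constraints require.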

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new entries in the /proc/vmstat file::

 nr_foll_pin_acquired
 nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

  ...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

  (...unless it was already out of balance due to a long-term RDMA pin being in
  place.)
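
The two counters can also be compared programmatically. The sketch below parses
text in the /proc/vmstat layout, which is one "name value" pair per line, and
reports how many pins are outstanding. vmstat_lookup() and
foll_pins_outstanding() are hypothetical helper names introduced for this
example, and the stream is read twice, so the two values are not sampled
atomically.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in vmstat-formatted text ("name value" per line).
 * Returns 0 and stores the value in *out if found, -1 otherwise.
 */
static int vmstat_lookup(FILE *f, const char *name, unsigned long long *out)
{
    char key[128];
    unsigned long long val;

    assert(f != NULL && name != NULL && out != NULL);
    rewind(f);  /* always scan from the start of the stream */
    while (fscanf(f, "%127s %llu", key, &val) == 2) {
        if (strcmp(key, name) == 0) {
            *out = val;
            return 0;
        }
    }
    return -1;
}

/*
 * Pins currently outstanding: acquired minus released.
 * Returns 0 on success, -1 if either counter is missing.
 */
static int foll_pins_outstanding(FILE *f, unsigned long long *out)
{
    unsigned long long acquired, released;

    if (vmstat_lookup(f, "nr_foll_pin_acquired", &acquired) != 0 ||
        vmstat_lookup(f, "nr_foll_pin_released", &released) != 0)
        return -1;
    *out = acquired - released;
    return 0;
}
```

On a running system the stream would come from fopen("/proc/vmstat", "r"). A
result of zero corresponds to the balanced state described above, and a nonzero
result is expected while long-term pins are held or pin/unpin operations are in
flight.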

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general. Specifically,
for large folios, the exact pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019