Commit | Line | Data |
---|---|---|
eddb1c22 JH |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ==================================================== | |
4 | pin_user_pages() and related calls | |
5 | ==================================================== | |
6 | ||
7 | .. contents:: :local: | |
8 | ||
9 | Overview | |
10 | ======== | |
11 | ||
12 | This document describes the following functions:: | |
13 | ||
14 | pin_user_pages() | |
15 | pin_user_pages_fast() | |
16 | pin_user_pages_remote() | |
17 | ||
18 | Basic description of FOLL_PIN | |
19 | ============================= | |
20 | ||
21 | FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() | |
22 | ("gup") family of functions. FOLL_PIN has significant interactions and | |
23 | interdependencies with FOLL_LONGTERM, so both are covered here. | |
24 | ||
25 | FOLL_PIN is internal to gup, meaning that it should not appear at the gup call | |
26 | sites. This allows the associated wrapper functions (pin_user_pages*() and | |
27 | others) to set the correct combination of these flags, and to check for problems | |
28 | as well. | |
29 | ||
30 | FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites. | |
31 | This is in order to avoid creating a large number of wrapper functions to cover | |
32 | all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the | |
33 | pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so | |
34 | that's a natural dividing line, and a good point to make separate wrapper calls. | |
35 | In other words, use pin_user_pages*() for DMA-pinned pages, and | |
f9e55970 | 36 | get_user_pages*() for other cases. There are five cases described later on in |
eddb1c22 JH |
37 | this document, to further clarify that concept. |
38 | ||
39 | FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, | |
40 | multiple threads and call sites are free to pin the same struct pages, via both | |
41 | FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the | |
42 | other, not the struct page(s). | |
43 | ||
44 | The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN | |
45 | uses a different reference counting technique. | |
46 | ||
47 | FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, | |
48 | FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN. | |
49 | ||
50 | Which flags are set by each wrapper | |
51 | =================================== | |
52 | ||
53 | For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup | |
54 | flags the caller provides. The caller is required to pass in a non-null struct | |
47e29d32 JH |
55 | pages* array, and the function then pins pages by incrementing the refcount of |
56 | each by a special value: GUP_PIN_COUNTING_BIAS. | |
57 | ||
5232c63f MWO |
58 | For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead, |
59 | an exact form of pin counting is achieved, by using the 2nd struct page | |
60 | in the compound page. A new struct page field, compound_pincount, has | |
61 | been added in order to support this. | |
47e29d32 JH |
62 | |
63 | This approach for compound pages avoids the counting upper limit problems that | |
64 | are discussed below. Those limitations would have been aggravated severely by | |
65 | huge pages, because each tail page adds a refcount to the head page. And in | |
5232c63f | 66 | fact, testing revealed that, without a separate compound_pincount field, |
47e29d32 JH |
67 | page overflows were seen in some huge page stress tests. |
68 | ||
5232c63f | 69 | This also means that huge pages and compound pages do not suffer |
47e29d32 | 70 | from the false positives problem that is mentioned below.:: |
eddb1c22 JH |
71 | |
72 | Function | |
73 | -------- | |
74 | pin_user_pages FOLL_PIN is always set internally by this function. | |
75 | pin_user_pages_fast FOLL_PIN is always set internally by this function. | |
76 | pin_user_pages_remote FOLL_PIN is always set internally by this function. | |
77 | ||
78 | For these get_user_pages*() functions, FOLL_GET might not even be specified. | |
79 | Behavior is a little more complex than above. If FOLL_GET was *not* specified, | |
80 | but the caller passed in a non-null struct pages* array, then the function | |
81 | sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount | |
82 | of each page by +1.:: | |
83 | ||
84 | Function | |
85 | -------- | |
86 | get_user_pages FOLL_GET is sometimes set internally by this function. | |
87 | get_user_pages_fast FOLL_GET is sometimes set internally by this function. | |
88 | get_user_pages_remote FOLL_GET is sometimes set internally by this function. | |
89 | ||
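The flag handling summarized in the two tables above can be modelled in a few lines of userspace C. This is only a sketch: the flag values and the helpers ``model_pin_flags()`` and ``model_get_flags()`` are hypothetical names invented for illustration, and the real kernel wrappers perform many more checks than shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only; not the kernel's. */
#define MODEL_FOLL_GET      0x1u
#define MODEL_FOLL_PIN      0x2u
#define MODEL_FOLL_LONGTERM 0x4u

/* pin_user_pages*(): FOLL_PIN is always OR'd in; pages must be non-NULL. */
static unsigned int model_pin_flags(unsigned int caller_flags, void **pages)
{
	/* Per the text, call sites pass neither FOLL_PIN nor FOLL_GET here. */
	assert(!(caller_flags & (MODEL_FOLL_PIN | MODEL_FOLL_GET)));
	assert(pages != NULL);
	return caller_flags | MODEL_FOLL_PIN;
}

/* get_user_pages*(): FOLL_GET is added only when a pages array is given. */
static unsigned int model_get_flags(unsigned int caller_flags, void **pages)
{
	/* FOLL_PIN is internal to gup and must not come from the caller. */
	assert(!(caller_flags & MODEL_FOLL_PIN));
	if (pages != NULL)
		caller_flags |= MODEL_FOLL_GET;
	return caller_flags;
}
```

In this model, FOLL_LONGTERM passes through unchanged from the call site, which matches the statement that it *is* allowed to be set by callers.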
90 | Tracking dma-pinned pages | |
91 | ========================= | |
92 | ||
93 | Some of the key design constraints, and solutions, for tracking dma-pinned | |
94 | pages: | |
95 | ||
96 | * An actual reference count, per struct page, is required. This is because | |
97 | multiple processes may pin and unpin a page. | |
98 | ||
99 | * False positives (reporting that a page is dma-pinned, when in fact it is not) | |
100 | are acceptable, but false negatives are not. | |
101 | ||
102 | * struct page may not be increased in size for this, and all fields are already | |
103 | used. | |
104 | ||
105 | * Given the above, we can overload the page->_refcount field by using, sort of, | |
106 | the upper bits in that field for a dma-pinned count. "Sort of", means that, | |
107 | rather than dividing page->_refcount into bit fields, we simply add a | |
108 | medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to | |
109 | page->_refcount. This provides fuzzy behavior: if a page has get_page() called | |
110 | on it 1024 times, then it will appear to have a single dma-pinned count. | |
111 | And again, that's acceptable. | |
112 | ||
113 | This also leads to limitations: there are only 31-10==21 bits available for a | |
114 | counter that increments 10 bits at a time. | |
115 | ||
eddb1c22 JH |
116 | * Callers must specifically request "dma-pinned tracking of pages". In other |
117 | words, just calling get_user_pages() will not suffice; a new set of functions, | |
118 | pin_user_pages() and related, must be used. | |
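The biased-refcount idea in the list above can be illustrated with a small userspace model. It is a sketch under stated assumptions: ``struct model_page`` and the ``model_*()`` helpers are hypothetical stand-ins rather than kernel code, the refcount is assumed to be a 32-bit signed integer, and the "maybe pinned" test is simply a comparison against the bias.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The bias value quoted in the text: 1024, i.e. 10 bits. */
#define MODEL_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct model_page {
	int32_t refcount;
};

/* An ordinary reference adds 1. */
static void model_get_page(struct model_page *p)
{
	p->refcount += 1;
}

/* A dma-pin adds the whole bias to the same counter. */
static void model_pin_page(struct model_page *p)
{
	assert(p->refcount <= INT32_MAX - MODEL_PIN_COUNTING_BIAS);
	p->refcount += MODEL_PIN_COUNTING_BIAS;
}

static void model_unpin_page(struct model_page *p)
{
	assert(p->refcount >= MODEL_PIN_COUNTING_BIAS);
	p->refcount -= MODEL_PIN_COUNTING_BIAS;
}

/* True for real pins, and also (false positive) for >= 1024 plain refs. */
static bool model_maybe_dma_pinned(const struct model_page *p)
{
	return p->refcount >= MODEL_PIN_COUNTING_BIAS;
}

/* Number of pins that fit in the 31 positive bits of the counter. */
static int32_t model_max_pins(void)
{
	return INT32_MAX / MODEL_PIN_COUNTING_BIAS;
}
```

With these definitions, 1024 calls to ``model_get_page()`` on a fresh page make ``model_maybe_dma_pinned()`` return true even though nothing was pinned, which is the acceptable false positive described above. No sequence of calls can produce a false negative while a pin is held.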
119 | ||
120 | FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags | |
121 | ========================================================== | |
122 | ||
123 | Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing | |
124 | these categories: | |
125 | ||
126 | CASE 1: Direct IO (DIO) | |
127 | ----------------------- | |
128 | There are GUP references to pages that are serving | |
129 | as DIO buffers. These buffers are needed for a relatively short time (so they | |
130 | are not "long term"). No special synchronization with page_mkclean() or | |
131 | munmap() is provided. Therefore, flags to set at the call site are: :: | |
132 | ||
133 | FOLL_PIN | |
134 | ||
135 | ...but rather than setting FOLL_PIN directly, call sites should use one of | |
136 | the pin_user_pages*() routines that set FOLL_PIN. | |
137 | ||
138 | CASE 2: RDMA | |
139 | ------------ | |
140 | There are GUP references to pages that are serving as DMA | |
141 | buffers. These buffers are needed for a long time ("long term"). No special | |
142 | synchronization with page_mkclean() or munmap() is provided. Therefore, flags | |
143 | to set at the call site are: :: | |
144 | ||
145 | FOLL_PIN | FOLL_LONGTERM | |
146 | ||
147 | NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's | |
148 | because DAX pages do not have a separate page cache, and so "pinning" implies | |
149 | locking down file system blocks, which is not (yet) supported in that way. | |
150 | ||
a8f80f53 JH |
151 | CASE 3: MMU notifier registration, with or without page faulting hardware |
152 | ------------------------------------------------------------------------- | |
153 | Device drivers can pin pages via get_user_pages*(), and register for mmu | |
154 | notifier callbacks for the memory range. Then, upon receiving a notifier | |
155 | "invalidate range" callback, stop the device from using the range, and unpin | |
156 | the pages. There may be other possible schemes, such as explicitly | |
157 | synchronizing against pending IO, that accomplish approximately the same thing. | |
158 | ||
159 | Or, if the hardware supports replayable page faults, then the device driver can | |
160 | avoid pinning entirely (this is ideal), as follows: register for mmu notifier | |
161 | callbacks as above, but instead of stopping the device and unpinning in the | |
162 | callback, simply remove the range from the device's page tables. | |
163 | ||
164 | Either way, as long as the driver unpins the pages upon mmu notifier callback, | |
165 | then there is proper synchronization with both filesystem and mm | |
166 | (page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. | |
eddb1c22 JH |
167 | |
168 | CASE 4: Pinning for struct page manipulation only | |
169 | ------------------------------------------------- | |
a8f80f53 JH |
170 | If only struct page data (as opposed to the actual memory contents that a page |
171 | is tracking) is affected, then normal GUP calls are sufficient, and neither flag | |
172 | needs to be set. | |
eddb1c22 | 173 | |
eaf4d22a JH |
174 | CASE 5: Pinning in order to write to the data within the page |
175 | ------------------------------------------------------------- | |
176 | Even though neither DMA nor Direct IO is involved, just a simple case of "pin, | |
177 | write to a page's data, unpin" can cause a problem. Case 5 may be considered a | |
178 | superset of Case 1, plus Case 2, plus anything that invokes that pattern. In | |
179 | other words, if the code is neither Case 1 nor Case 2, it may still require | |
180 | FOLL_PIN, for patterns like this: | |
181 | ||
182 | Correct (uses FOLL_PIN calls): | |
183 | pin_user_pages() | |
184 | write to the data within the pages | |
185 | unpin_user_pages() | |
186 | ||
187 | INCORRECT (uses FOLL_GET calls): | |
188 | get_user_pages() | |
189 | write to the data within the pages | |
190 | put_page() | |
191 | ||
3faa52c0 JH |
192 | page_maybe_dma_pinned(): the whole point of pinning |
193 | =================================================== | |
eddb1c22 JH |
194 | |
195 | The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able | |
196 | to query, "is this page DMA-pinned?" That allows code such as page_mkclean() | |
197 | (and file system writeback code in general) to make informed decisions about | |
198 | what to do when a page cannot be unmapped due to such pins. | |
199 | ||
200 | What to do in those cases is the subject of a years-long series of discussions | |
201 | and debates (see the References at the end of this document). It's a TODO item | |
202 | here: fill in the details once that's worked out. Meanwhile, it's safe to say | |
203 | that having this available: :: | |
204 | ||
3faa52c0 | 205 | static inline bool page_maybe_dma_pinned(struct page *page) |
eddb1c22 JH |
206 | |
207 | ...is a prerequisite to solving the long-running gup+DMA problem. | |
208 | ||
209 | Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM | |
210 | =================================================================== | |
211 | ||
212 | Another way of thinking about these flags is as a progression of restrictions: | |
213 | FOLL_GET is for struct page manipulation, without affecting the data that the | |
214 | struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for | |
215 | short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is | |
216 | a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more | |
217 | restrictive case that has FOLL_PIN as a prerequisite: this is for pages that | |
218 | will be pinned longterm, and whose data will be accessed. | |
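As a compact summary, the five cases described earlier can be written as a lookup from use case to API family. This is a minimal sketch: the enum, the struct and ``model_choose()`` are hypothetical names used only to restate the document's recommendations, not an interface that exists in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* The five cases from "when to use which flags", numbered as in the text. */
enum model_case {
	MODEL_CASE_DIO = 1,
	MODEL_CASE_RDMA,
	MODEL_CASE_MMU_NOTIFIER,
	MODEL_CASE_STRUCT_PAGE_ONLY,
	MODEL_CASE_WRITE_DATA,
};

struct model_choice {
	bool use_pin_api;	/* pin_user_pages*() rather than get_user_pages*() */
	bool longterm;		/* additionally pass FOLL_LONGTERM */
};

static struct model_choice model_choose(enum model_case c)
{
	struct model_choice choice = { false, false };

	assert(c >= MODEL_CASE_DIO && c <= MODEL_CASE_WRITE_DATA);
	switch (c) {
	case MODEL_CASE_DIO:
	case MODEL_CASE_WRITE_DATA:
		/* Short-term pin: page data is accessed. */
		choice.use_pin_api = true;
		break;
	case MODEL_CASE_RDMA:
		/* Long-term pin: FOLL_PIN is a prerequisite of FOLL_LONGTERM. */
		choice.use_pin_api = true;
		choice.longterm = true;
		break;
	case MODEL_CASE_MMU_NOTIFIER:
	case MODEL_CASE_STRUCT_PAGE_ONLY:
		/* Neither flag is needed; plain gup calls suffice. */
		break;
	}
	return choice;
}
```

Note that ``longterm`` is never set without ``use_pin_api``, which mirrors the progression of restrictions described above.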
219 | ||
220 | Unit testing | |
221 | ============ | |
222 | This file:: | |
223 | ||
baa489fa | 224 | tools/testing/selftests/mm/gup_test.c |
eddb1c22 JH |
225 | |
226 | has the following new calls to exercise the new pin*() wrapper functions: | |
227 | ||
9c84f229 | 228 | * PIN_FAST_BENCHMARK (./gup_test -a) |
a9bed1e1 | 229 | * PIN_BASIC_TEST (./gup_test -b) |
eddb1c22 JH |
230 | |
231 | You can monitor how many total dma-pinned pages have been acquired and released | |
232 | since the system was booted, via two new /proc/vmstat entries: :: | |
233 | ||
1970dc6f JH |
234 | nr_foll_pin_acquired |
235 | nr_foll_pin_released | |
eddb1c22 | 236 | |
1970dc6f JH |
237 | Under normal conditions, these two values will be equal unless there are any |
238 | long-term [R]DMA pins in place, or during pin/unpin transitions. | |
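One way to compare the two counters from userspace is sketched below. It assumes that /proc/vmstat is a single text file with one "name value" pair per line; the helper names ``vmstat_lookup()`` and ``vmstat_read()`` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in vmstat-formatted text.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *value)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Match the whole field name, followed by a single space. */
		if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ')
			return sscanf(line + name_len, "%llu", value) == 1 ? 0 : -1;
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read a vmstat-style file into buf (NUL-terminated); 0 on success. */
static int vmstat_read(const char *path, char *buf, size_t size)
{
	FILE *f = fopen(path, "r");
	size_t n;

	assert(size > 0);
	if (!f)
		return -1;
	n = fread(buf, 1, size - 1, f);
	buf[n] = '\0';
	fclose(f);
	return 0;
}
```

A caller would read "/proc/vmstat" with ``vmstat_read()`` into a buffer large enough for the whole file, then look up ``nr_foll_pin_acquired`` and ``nr_foll_pin_released`` and compare them. A lookup failure most likely means the running kernel does not export these counters.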
239 | ||
240 | * nr_foll_pin_acquired: This is the number of logical pins that have been | |
241 | acquired since the system was powered on. For huge pages, the head page is | |
242 | pinned once for each page (head page and each tail page) within the huge page. | |
243 | This follows the same sort of behavior that get_user_pages() uses for huge | |
244 | pages: the head page is refcounted once for each tail or head page in the huge | |
245 | page, when get_user_pages() is applied to a huge page. | |
246 | ||
247 | * nr_foll_pin_released: The number of logical pins that have been released since | |
248 | the system was powered on. Note that pages are released (unpinned) on a | |
249 | PAGE_SIZE granularity, even if the original pin was applied to a huge page. | |
250 | Because of the pin count behavior described above in "nr_foll_pin_acquired", | |
251 | the accounting balances out, so that after doing this:: | |
252 | ||
253 | pin_user_pages(huge_page); | |
254 | for (each page in huge_page) | |
255 | unpin_user_page(page); | |
256 | ||
257 | ...the following is expected:: | |
258 | ||
259 | nr_foll_pin_released == nr_foll_pin_acquired | |
260 | ||
261 | (...unless it was already out of balance due to a long-term RDMA pin being in | |
262 | place.) | |
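The balancing argument above can be checked with a tiny userspace model of the two counters. This is a sketch only: ``struct model_counters`` and both helpers are hypothetical, and the number of subpages in a huge page is whatever the caller passes in.

```c
#include <assert.h>

/* Hypothetical mirror of the two /proc/vmstat counters. */
struct model_counters {
	unsigned long acquired;
	unsigned long released;
};

/* Pinning a huge page counts one logical pin per subpage (head and tails). */
static void model_pin_huge_page(struct model_counters *c,
				unsigned long nr_subpages)
{
	c->acquired += nr_subpages;
}

/* Unpinning is done at PAGE_SIZE granularity: one logical pin per call. */
static void model_unpin_subpage(struct model_counters *c)
{
	/* Releasing more pins than were acquired would be a bug. */
	assert(c->released < c->acquired);
	c->released += 1;
}
```

For example, pinning a huge page made up of 512 subpages (an assumed size, as for a 2 MB huge page with 4 KB base pages) and then calling ``model_unpin_subpage()`` 512 times leaves ``acquired == released``.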
eddb1c22 | 263 | |
dc8fb2f2 JH |
264 | Other diagnostics |
265 | ================= | |
266 | ||
5232c63f MWO |
267 | dump_page() has been enhanced slightly, to handle these new counting |
268 | fields, and to better report on compound pages in general. Specifically, | |
269 | for compound pages, the exact (compound_pincount) pincount is reported. | |
dc8fb2f2 | 270 | |
eddb1c22 JH |
271 | References |
272 | ========== | |
273 | ||
274 | * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_ | |
275 | * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_ | |
276 | * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_ | |
47e29d32 | 277 | * `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_ |
eddb1c22 JH |
278 | |
279 | John Hubbard, October, 2019 |