.. hmm:

=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into the regular kernel path, with the
cornerstone of this being a specialized struct page for such memory (see
sections 5 to 7 of this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program address space coherently with
the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPUs, DSPs, or FPGAs are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
=====================================================

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because the device can only access memory allocated
through a device specific API. This implies that all memory objects in a
program are not equal from the device point of view, which complicates large
programs that rely on a wide set of libraries.

Concretely this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private,
mmap shared) and memory allocated through the device driver API (this still
ends up with an mmap, but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
but complex data sets (list, tree, ...) are hard to get right. Duplicating a
complex data set needs to re-map all the pointer relations between each of its
elements. This is error prone and programs get harder to debug because of the
duplicate data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library and thus each library
might have to duplicate its input data set using the device specific memory
allocator. Large projects suffer from this and waste resources because of the
various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only doable with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.


I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from the CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is
worse in the other direction: the CPU can only access a limited range of the
device memory and cannot perform atomic operations on it. Thus device memory
cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency. Access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations
the architecture supports. Sadly, not all platforms are following this trend
and some major architectures are left without hardware solutions to these
problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you
must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations, and
flush, ...). This cannot be done through common code for all devices. This is
why HMM provides helpers to factor out everything that can be factored out,
while leaving the hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of the device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms, and from the
CPU point of view everything looks as if the page were swapped out to disk.
Using a struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new
ZONE_DEVICE memory for the device memory and second to perform migration.
Policy decisions of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror process address
space and keep both CPU and device page tables synchronized, but also leverages
device memory by migrating the part of the data set that is actively being
used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the
registration of an hmm_mirror struct::

 int hmm_mirror_register(struct hmm_mirror *mirror,
                         struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
                                struct mm_struct *mm);


The locked variant is to be used when the driver is already holding the
mmap_sem of the mm in write mode. The mirror struct has a set of callbacks that
are used to propagate CPU page table updates::

 struct hmm_mirror_ops {
     /* update() - synchronize the device page tables
      *
      * @mirror: pointer to struct hmm_mirror
      * @action: the update action that occurred to the CPU page table
      * @start: virtual start address of the range to update
      * @end: virtual end address of the range to update
      *
      * This callback ultimately originates from mmu_notifiers when the CPU
      * page table is updated. The device driver must update its page table
      * in response to this callback. The action argument tells what action
      * to perform.
      *
      * The device driver must not return from this callback until the device
      * page tables are completely updated (TLBs flushed, etc.); this is a
      * synchronous call.
      */
     void (*update)(struct hmm_mirror *mirror,
                    enum hmm_update action,
                    unsigned long start,
                    unsigned long end);
 };

The device driver must perform the update action on the range (mark the range
read only, or fully unmap it, ...). The device must be done with the update
before the driver callback returns.

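For illustration, a driver's update() callback might look roughly like the
sketch below. The driver_mirror wrapper and the driver_invalidate_range() and
driver_wait_idle() helpers are hypothetical driver internals, not part of the
HMM API; the sketch only shows the behavior required of the callback: apply
the action to the whole range and do not return before the device page tables
(and device TLBs) are up to date::

 struct driver_mirror {
     struct hmm_mirror mirror;      /* embedded HMM mirror */
     struct mutex update_lock;      /* serializes device page table updates */
     /* ... device specific state ... */
 };

 static void driver_update(struct hmm_mirror *mirror,
                           enum hmm_update action,
                           unsigned long start,
                           unsigned long end)
 {
     struct driver_mirror *dmirror;

     dmirror = container_of(mirror, struct driver_mirror, mirror);

     mutex_lock(&dmirror->update_lock);
     /* Build and schedule the device commands that unmap or write protect
      * the range [start, end) in the device page table, depending on action.
      */
     driver_invalidate_range(dmirror, action, start, end);
     /* Synchronous: wait until the device reports the update (including its
      * TLB flush) as complete before returning to HMM.
      */
     driver_wait_idle(dmirror);
     mutex_unlock(&dmirror->update_lock);
 }

 static const struct hmm_mirror_ops driver_mirror_ops = {
     .update = driver_update,
 };

The driver would point its hmm_mirror at driver_mirror_ops before calling
hmm_mirror_register(). The update_lock above plays the role of the
driver->update lock used in the populate pattern shown later in this section.
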
When the device driver wants to populate a range of virtual addresses, it can
use either::

 int hmm_vma_get_pfns(struct vm_area_struct *vma,
                      struct hmm_range *range,
                      unsigned long start,
                      unsigned long end,
                      hmm_pfn_t *pfns);
 int hmm_vma_fault(struct vm_area_struct *vma,
                   struct hmm_range *range,
                   unsigned long start,
                   unsigned long end,
                   hmm_pfn_t *pfns,
                   bool write,
                   bool block);

The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
entries and will not trigger a page fault on missing or non-present entries.
The second one does trigger a page fault on missing or read-only entries if the
write parameter is true. Page faults use the generic mm page fault code path
just like a CPU page fault.

Both functions copy CPU page table entries into their pfns array argument. Each
entry in that array corresponds to an address in the virtual range. HMM
provides a set of flags to help the driver identify special CPU page table
entries.

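Purely as an illustration, the driver loop that consumes the pfns array could
look like the sketch below. The flag names (HMM_PFN_VALID, HMM_PFN_WRITE), the
hmm_pfn_t_to_page() helper, and driver_map_page() are used here as examples
only; the exact flag set and helpers live in include/linux/hmm.h and may differ
between kernel versions::

 unsigned long i, addr;

 for (i = 0, addr = start; addr < end; i++, addr += PAGE_SIZE) {
     /* Entry is not backed by a present page: leave the device mapping
      * empty for this address (or fault it in later).
      */
     if (!(pfns[i] & HMM_PFN_VALID))
         continue;

     /* Map the page into the device page table, read only unless the CPU
      * page table entry allows writes.
      */
     driver_map_page(dmirror, addr, hmm_pfn_t_to_page(pfns[i]),
                     pfns[i] & HMM_PFN_WRITE);
 }

Such a loop corresponds to the "Use pfns array content to update device page
table" step of the usage pattern below and must run under the driver lock,
after hmm_vma_range_done() has confirmed the snapshot is still valid.
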
Locking with the update() callback is the most important aspect the driver must
respect in order to keep things properly synchronized. The usage pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...
 again:
      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
      if (ret)
          return ret;
      take_lock(driver->update);
      if (!hmm_vma_range_done(vma, &range)) {
          release_lock(driver->update);
          goto again;
      }

      // Use pfns array content to update device page table

      release_lock(driver->update);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
update() callback. That lock must be held before calling hmm_vma_range_done()
to avoid any race with a concurrent CPU page table update.

HMM implements all this on top of the mmu_notifier API because we wanted a
simpler API and also to be able to perform optimizations later on, like doing
concurrent device updates in multi-device scenarios.

HMM also bridges the impedance mismatch between how CPU page table updates
are done (by the CPU writing to the page table and flushing TLBs) and how
devices update their own page tables. Device updates are a multi-step process.
First, appropriate commands are written to a buffer, then this buffer is
scheduled for execution on the device. It is only once the device has executed
the commands in the buffer that the update is done. Creating and scheduling the
update command buffer can happen concurrently for multiple devices. Waiting for
each device to report commands as executed is serialized (there is no point in
doing this concurrently).


Represent and manage device memory from core kernel point of view
==================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

HMM provides a set of helpers to register and hotplug device memory as a new
region needing a struct page. This is offered through a very simple API::

 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                   struct device *device,
                                   unsigned long size);
 void hmm_devmem_remove(struct hmm_devmem *devmem);

The hmm_devmem_ops is where most of the important things are::

 struct hmm_devmem_ops {
     void (*free)(struct hmm_devmem *devmem, struct page *page);
     int (*fault)(struct hmm_devmem *devmem,
                  struct vm_area_struct *vma,
                  unsigned long addr,
                  struct page *page,
                  unsigned flags,
                  pmd_t *pmdp);
 };

The first callback (free()) happens when the last reference on a device page is
dropped. This means the device page is now free and no longer used by anyone.
The second callback happens whenever the CPU tries to access a device page,
which it cannot do directly. This second callback must trigger a migration back
to system memory.
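
A minimal sketch of how a driver might wire these callbacks up at device
initialization time is shown below. driver_free_device_page(),
driver_migrate_to_ram(), and the pdev/resource_size names are hypothetical
driver details, and error handling is elided::

 static void driver_devmem_free(struct hmm_devmem *devmem, struct page *page)
 {
     /* Return the device page to the driver's own allocator. */
     driver_free_device_page(devmem, page);
 }

 static int driver_devmem_fault(struct hmm_devmem *devmem,
                                struct vm_area_struct *vma,
                                unsigned long addr,
                                struct page *page,
                                unsigned flags,
                                pmd_t *pmdp)
 {
     /* The CPU touched device memory it cannot access: migrate this page
      * back to system memory (for instance with the migrate_vma() helper
      * described in the next section).
      */
     return driver_migrate_to_ram(devmem, vma, addr, page);
 }

 static const struct hmm_devmem_ops driver_devmem_ops = {
     .free  = driver_devmem_free,
     .fault = driver_devmem_fault,
 };

 /* At device initialization, hotplug the device memory: */
 devmem = hmm_devmem_add(&driver_devmem_ops, &pdev->dev, resource_size);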


Migration to and from device memory
===================================

Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copy from and to device memory. For this we need a new
migration helper::

 int migrate_vma(const struct migrate_vma_ops *ops,
                 struct vm_area_struct *vma,
                 unsigned long mentries,
                 unsigned long start,
                 unsigned long end,
                 unsigned long *src,
                 unsigned long *dst,
                 void *private);

Unlike other migration functions, it works on a range of virtual addresses.
There are two reasons for that. First, device DMA copy has a high setup
overhead cost and thus batching multiple pages is needed, as otherwise the
migration overhead makes the whole exercise pointless. The second reason is
that the migration might be for a range of addresses the device is actively
accessing.

The migrate_vma_ops struct defines two callbacks. The first one
(alloc_and_copy()) controls destination memory allocation and the copy
operation. The second one is there to allow the device driver to perform
cleanup operations after migration::

 struct migrate_vma_ops {
     void (*alloc_and_copy)(struct vm_area_struct *vma,
                            const unsigned long *src,
                            unsigned long *dst,
                            unsigned long start,
                            unsigned long end,
                            void *private);
     void (*finalize_and_map)(struct vm_area_struct *vma,
                              const unsigned long *src,
                              const unsigned long *dst,
                              unsigned long start,
                              unsigned long end,
                              void *private);
 };

It is important to stress that these migration helpers allow for holes in the
virtual address range. Some pages in the range might not be migrated for all
the usual reasons (page is pinned, page is locked, ...). This helper does not
fail but just skips over those pages.

The alloc_and_copy() callback might decide not to migrate all pages in the
range (for reasons under the callback's control). For those, the callback just
has to leave the corresponding dst entry empty.

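As an illustration only, an alloc_and_copy() callback might be structured like
the sketch below. driver_alloc_device_page(), driver_dma_copy(),
driver_dma_wait(), and the driver_migrate private structure are hypothetical
driver routines; the MIGRATE_PFN_* flags and the migrate_pfn()/
migrate_pfn_to_page() helpers are shown as examples of the src/dst entry
encoding, so check your kernel's migrate_vma() documentation for the exact
encoding (device private destination pages may require additional flags)::

 static void driver_alloc_and_copy(struct vm_area_struct *vma,
                                   const unsigned long *src,
                                   unsigned long *dst,
                                   unsigned long start,
                                   unsigned long end,
                                   void *private)
 {
     struct driver_migrate *mig = private;
     unsigned long addr, i;

     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
         struct page *dpage;

         /* Core mm could not collect this page (hole, pinned page, ...):
          * skip it by leaving dst[i] empty.
          */
         if (!(src[i] & MIGRATE_PFN_MIGRATE))
             continue;

         /* Allocate a device page; on failure just skip this page too. */
         dpage = driver_alloc_device_page(mig);
         if (!dpage)
             continue;

         /* Queue a DMA copy from the source page to device memory. */
         driver_dma_copy(mig, migrate_pfn_to_page(src[i]), dpage);

         dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_VALID;
     }

     /* Wait for all queued DMA copies to complete before returning. */
     driver_dma_wait(mig);
 }
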
Finally, migration of the struct page might fail (for file backed pages) for
various reasons (failure to freeze the reference, or to update the page cache,
...). If that happens, then finalize_and_map() can catch any pages that were
not migrated. Note those pages were still copied to a new page and thus we
wasted bandwidth, but this is considered a rare event and a price that we are
willing to pay to keep all the code simpler.
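
Putting the pieces together, a driver might invoke the helper roughly as
follows, matching the prototype shown above. driver_migrate_range(),
driver_finalize_and_map(), and the fixed 64 page batch are purely illustrative,
and error handling is elided::

 static const struct migrate_vma_ops driver_migrate_ops = {
     .alloc_and_copy   = driver_alloc_and_copy,
     .finalize_and_map = driver_finalize_and_map,
 };

 static int driver_migrate_range(struct driver_migrate *mig,
                                 struct vm_area_struct *vma,
                                 unsigned long start,
                                 unsigned long end)
 {
     unsigned long src[64], dst[64];
     unsigned long npages = (end - start) >> PAGE_SHIFT;

     /* For simplicity, assume the range fits in a single 64 page batch. */
     if (npages > 64)
         return -EINVAL;

     return migrate_vma(&driver_migrate_ops, vma, npages,
                        start, end, src, dst, mig);
 }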


Memory cgroup (memcg) and rss accounting
========================================

For now device memory is accounted as any regular page in rss counters (either
anonymous if the device page is used for anonymous memory, file if the device
page is used for a file backed page, or shmem if the device page is used for
shared memory). This is a deliberate choice to keep existing applications, that
might start using device memory without knowing about it, running unimpacted.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory and thus not free much
system memory. We want to gather more real world experience on how applications
and systems react under memory pressure in the presence of device memory before
deciding to account device memory differently.


The same decision was made for memory cgroups. Device memory pages are
accounted against the same memory cgroup a regular page would be accounted to.
This does simplify migration to and from device memory. This also means that
migration back from device memory to regular memory cannot fail because it
would go above the memory cgroup limit. We might revisit this choice later on
once we get more experience in how device memory is used and its impact on
memory resource control.


Note that device memory can never be pinned by a device driver nor through GUP
and thus such memory is always freed upon process exit, or, in the case of
shared memory or file backed memory, when the last reference is dropped.