=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into regular kernel path, with the cornerstone
of this being specialized struct page for such memory (see sections 5 to 7 of
this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program addresses coherently with
the CPU meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
====================================================

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because devices can only access memory allocated
through a device specific API. This implies that all memory objects in a program
are not equal from the device point of view which complicates large programs
that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private, mmap
shared) and memory allocated through the device driver API (this still ends up
with an mmap but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
for complex data sets (list, tree, ...) it's hard to get right. Duplicating a
complex data set needs to re-map all the pointer relations between each of its
elements. This is error prone and programs get harder to debug because of the
duplicate data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library and thus each library
might have to duplicate its input data set using the device specific memory
allocator. Large projects suffer from this and waste resources because of the
various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only doable with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.


I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from a CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is worse
in the other direction: the CPU can only access a limited range of the device
memory and cannot perform atomic operations on it. Thus device memory cannot
be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency. Access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations the
architecture supports. Sadly, not all platforms are following this trend and
some major architectures are left without hardware solutions to these problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you must
allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations, and
flush, ...). This cannot be done through common code for all devices. This is
why HMM provides helpers to factor out everything that can be while leaving the
hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms and everything
looks like a page that is swapped out to disk from the CPU point of view. Using a
struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
memory for the device memory and second to perform migration. Policy decisions
of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror process address
space, keeping both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively being
used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the
registration of a mmu_interval_notifier::

 int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
                                  struct mm_struct *mm, unsigned long start,
                                  unsigned long length,
                                  const struct mmu_interval_notifier_ops *ops);

During the ops->invalidate() callback the device driver must perform the
update action to the range (mark range read only, or fully unmap, etc.). The
device must complete the update before the driver callback returns.
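
For illustration only, a minimal sketch of such an invalidate() callback
follows. ``struct driver_range``, ``take_lock()``, ``release_lock()``, and
``driver_update_device_ptes()`` are hypothetical driver constructs (matching
the pseudocode later in this section); mmu_notifier_range_blockable() and
mmu_interval_set_seq() are the kernel helpers a driver is expected to use at
this point::

 static bool driver_invalidate(struct mmu_interval_notifier *interval_sub,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
 {
      struct driver_range *dr =
           container_of(interval_sub, struct driver_range, notifier);

      /* A non-blockable invalidation must not sleep on driver locks. */
      if (!mmu_notifier_range_blockable(range))
           return false;

      take_lock(dr->driver->update);
      /* Publish the new sequence so concurrent hmm_range_fault() users
       * fail mmu_interval_read_retry() and retry. */
      mmu_interval_set_seq(interval_sub, cur_seq);
      /* Unmap [range->start, range->end) from the device page table. */
      driver_update_device_ptes(dr, range->start, range->end);
      release_lock(dr->driver->update);
      return true;
 }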

When the device driver wants to populate a range of virtual addresses, it can
use::

 int hmm_range_fault(struct hmm_range *range);

It will trigger a page fault on missing or read-only entries if write access is
requested (see below). Page faults use the generic mm page fault code path just
like a CPU page fault.

hmm_range_fault() copies CPU page table entries into its pfns array argument.
Each entry in that array corresponds to an address in the virtual range. HMM
provides a set of flags to help the driver identify special CPU page table
entries.

Locking within the ops->invalidate() callback is the most important aspect
the driver must respect in order to keep things properly synchronized. The
usage pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...

      range.notifier = &interval_sub;
      range.start = ...;
      range.end = ...;
      range.hmm_pfns = ...;

      if (!mmget_not_zero(interval_sub.mm))
          return -EFAULT;

 again:
      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
      mmap_read_lock(interval_sub.mm);
      ret = hmm_range_fault(&range);
      if (ret) {
          mmap_read_unlock(interval_sub.mm);
          if (ret == -EBUSY)
              goto again;
          return ret;
      }
      mmap_read_unlock(interval_sub.mm);

      take_lock(driver->update);
      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
          release_lock(driver->update);
          goto again;
      }

      /* Use pfns array content to update device page table,
       * under the update lock */

      release_lock(driver->update);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.

Leverage default_flags and pfn_flags_mask
=========================================

The hmm_range struct has two fields, default_flags and pfn_flags_mask, that
specify fault or snapshot policy for the whole range instead of having to set
them for each entry in the pfns array.

For instance if the device driver wants pages for a range with at least read
permission, it sets::

 range->default_flags = HMM_PFN_REQ_FAULT;
 range->pfn_flags_mask = 0;

and calls hmm_range_fault() as described above. This will fault in all pages
in the range with at least read permission.

Now let's say the driver wants to do the same except for one page in the range for
which it wants to have write permission. Now the driver sets::

 range->default_flags = HMM_PFN_REQ_FAULT;
 range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
 range->hmm_pfns[index_of_write] = HMM_PFN_REQ_WRITE;

With this, HMM will fault in all pages with at least read permission (i.e.,
valid) and for the address == range->start + (index_of_write << PAGE_SHIFT) it
will fault with write permission, i.e., if the CPU pte does not have write
permission set then HMM will call handle_mm_fault().

After hmm_range_fault() completes, the flag bits are set to the current state
of the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the page
is writable.
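
For illustration only, after a successful hmm_range_fault() a driver might turn
the returned array into device page table entries with something like the
following sketch, run under the driver->update lock once
mmu_interval_read_retry() confirms the result is still valid;
``driver_map_page()`` is a hypothetical device-specific helper::

 unsigned long i, npages = (range.end - range.start) >> PAGE_SHIFT;

 for (i = 0; i < npages; i++) {
      unsigned long entry = range.hmm_pfns[i];

      if (!(entry & HMM_PFN_VALID))
           continue;      /* hole: leave the device mapping empty */

      /* hmm_pfn_to_page() recovers the struct page behind the entry;
       * grant device write access only if the CPU pte is writable. */
      driver_map_page(range.start + (i << PAGE_SHIFT),
                      hmm_pfn_to_page(entry),
                      entry & HMM_PFN_WRITE);
 }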


Represent and manage device memory from core kernel point of view
=================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

Migration to and from device memory
===================================

Because the CPU cannot access device memory directly, the device driver must
use hardware DMA or device specific load/store instructions to migrate data.
The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
functions are designed to make drivers easier to write and to centralize common
code across drivers.

Before migrating pages to device private memory, special device private
``struct page`` pages need to be created. These will be used as special "swap"
page table entries so that a CPU process will fault if it tries to access
a page that has been migrated to device private memory.

These can be allocated and freed with::

 struct resource *res;
 struct dev_pagemap pagemap;

 res = request_free_mem_region(&iomem_resource, /* number of bytes */,
                               "name of driver resource");
 pagemap.type = MEMORY_DEVICE_PRIVATE;
 pagemap.range.start = res->start;
 pagemap.range.end = res->end;
 pagemap.nr_range = 1;
 pagemap.ops = &device_devmem_ops;
 memremap_pages(&pagemap, numa_node_id());

 memunmap_pages(&pagemap);
 release_mem_region(pagemap.range.start, range_len(&pagemap.range));

There are also devm_request_free_mem_region(), devm_memremap_pages(),
devm_memunmap_pages(), and devm_release_mem_region() when the resources can
be tied to a ``struct device``.
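
For example, a device-managed variant of the allocation above might look like
the following sketch, where ``device`` is the driver's ``struct device`` and
``pagemap`` must stay allocated for the lifetime of the device (error checking
elided)::

 struct resource *res;
 struct dev_pagemap pagemap;

 res = devm_request_free_mem_region(device, &iomem_resource,
                                    /* number of bytes */);
 pagemap.type = MEMORY_DEVICE_PRIVATE;
 pagemap.range.start = res->start;
 pagemap.range.end = res->end;
 pagemap.nr_range = 1;
 pagemap.ops = &device_devmem_ops;
 devm_memremap_pages(device, &pagemap);

 /* No explicit teardown needed: the pages and region are released
  * automatically when the device is unbound. */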

The overall migration steps are similar to migrating NUMA pages within system
memory (see Documentation/mm/page_migration.rst) but the steps are split
between device driver specific code and shared common code (a condensed sketch
of the whole sequence is shown after the numbered steps below):

1. ``mmap_read_lock()``

   The device driver has to pass a ``struct vm_area_struct`` to
   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
   be held for the duration of the migration.

2. ``migrate_vma_setup(struct migrate_vma *args)``

   The device driver initializes the ``struct migrate_vma`` fields and passes
   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
   filter which source pages should be migrated. For example, setting
   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
   device private memory. If the latter flag is set, the ``args->pgmap_owner``
   field is used to identify device private pages owned by the driver. This
   avoids trying to migrate device private pages residing in other devices.
   Currently only anonymous private VMA ranges can be migrated to or from
   system memory and device private memory.

   One of the first steps migrate_vma_setup() does is to invalidate other
   devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
   ``mmu_notifier_invalidate_range_end()`` calls around the page table
   walks to fill in the ``args->src`` array with PFNs to be migrated.
   The ``invalidate_range_start()`` callback is passed a
   ``struct mmu_notifier_range`` with the ``event`` field set to
   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
   allows the device driver to skip the invalidation callback and only
   invalidate device private MMU mappings that are actually migrating.
   This is explained more in the next section.

   While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
   entry results in a valid "zero" PFN stored in the ``args->src`` array.
   This lets the driver allocate device private memory and clear it instead
   of copying a page of zeros. Valid PTE entries to system memory or
   device private struct pages will be locked with ``lock_page()``, isolated
   from the LRU (if system memory since device private pages are not on
   the LRU), unmapped from the process, and a special migration PTE is
   inserted in place of the original PTE.
   migrate_vma_setup() also clears the ``args->dst`` array.

3. The device driver allocates destination pages and copies source pages to
   destination pages.

   The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE``
   bit is set and skips entries that are not migrating. The device driver
   can also choose to skip migrating a page by not filling in the ``dst``
   array for that page.

   The driver then allocates either a device private struct page or a
   system memory page, locks the page with ``lock_page()``, and fills in the
   ``dst`` array entry with::

     dst[i] = migrate_pfn(page_to_pfn(dpage));

   Now that the driver knows that this page is being migrated, it can
   invalidate device private MMU mappings and copy device private memory
   to system memory or another device private page. The core Linux kernel
   handles CPU page table invalidations so the device driver only has to
   invalidate its own MMU mappings.

   The driver can use ``migrate_pfn_to_page(src[i])`` to get the
   ``struct page`` of the source and either copy the source page to the
   destination or clear the destination device private memory if the pointer
   is ``NULL`` meaning the source page was not populated in system memory.

4. ``migrate_vma_pages()``

   This step is where the migration is actually "committed".

   If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this
   is where the newly allocated page is inserted into the CPU's page table.
   This can fail if a CPU thread faults on the same page. However, the page
   table is locked and only one of the new pages will be inserted.
   The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared
   if it loses the race.

   If the source page was locked, isolated, etc., the source ``struct page``
   information is now copied to the destination ``struct page``, finalizing
   the migration on the CPU side.

5. Device driver updates device MMU page tables for pages still migrating,
   rolling back pages not migrating.

   If the ``src`` entry still has the ``MIGRATE_PFN_MIGRATE`` bit set, the
   device driver can update the device MMU and set the write enable bit if the
   ``MIGRATE_PFN_WRITE`` bit is set.

6. ``migrate_vma_finalize()``

   This step replaces the special migration page table entry with the new
   page's page table entry and releases the reference to the source and
   destination ``struct page``.

7. ``mmap_read_unlock()``

   The lock can now be released.
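
As promised above, here is a condensed, illustrative sketch of the whole
sequence for migrating anonymous system memory to device private memory.
``struct driver``, ``driver_alloc_device_page()``, and
``driver_copy_to_device()`` are hypothetical driver constructs, and most error
handling is elided::

 static int driver_migrate_to_device(struct driver *driver,
                                     struct vm_area_struct *vma,
                                     unsigned long start, unsigned long end)
 {
      unsigned long npages = (end - start) >> PAGE_SHIFT;
      unsigned long *src_pfns, *dst_pfns;
      struct migrate_vma args = {
           .vma = vma,
           .start = start,
           .end = end,
           .flags = MIGRATE_VMA_SELECT_SYSTEM,
           .pgmap_owner = driver,   /* same owner as the dev_pagemap */
      };
      unsigned long i;
      int ret;

      src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
      dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
      args.src = src_pfns;
      args.dst = dst_pfns;

      /* Step 1: the caller already holds mmap_read_lock(vma->vm_mm). */

      ret = migrate_vma_setup(&args);               /* step 2 */
      if (ret)
           goto out;

      for (i = 0; i < npages; i++) {                /* step 3 */
           struct page *spage, *dpage;

           if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
                continue;
           dpage = driver_alloc_device_page(driver);
           lock_page(dpage);
           /* A NULL source page means an unpopulated ("zero") page:
            * clear the destination instead of copying. */
           spage = migrate_pfn_to_page(src_pfns[i]);
           driver_copy_to_device(driver, spage, dpage);
           dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
      }

      migrate_vma_pages(&args);                     /* step 4 */
      /* Step 5: update the device page table for entries that still
       * have MIGRATE_PFN_MIGRATE set in src_pfns[]. */
      migrate_vma_finalize(&args);                  /* step 6 */
      /* Step 7: the caller drops mmap_read_lock(). */
 out:
      kfree(src_pfns);
      kfree(dst_pfns);
      return ret;
 }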

Exclusive access memory
=======================

Some devices have features such as atomic PTE bits that can be used to implement
atomic access to system memory. To support atomic operations to a shared virtual
memory page such a device needs access to that page which is exclusive of any
userspace access from the CPU. The ``make_device_exclusive_range()`` function
can be used to make a memory range inaccessible from userspace.

This replaces all mappings for pages in the given range with special swap
entries. Any attempt to access the swap entry results in a fault which is
resolved by replacing the entry with the original mapping. A driver gets
notified that the mapping has been changed by MMU notifiers, after which point
it will no longer have exclusive access to the page. Exclusive access is
guaranteed to last until the driver drops the page lock and page reference, at
which point any CPU faults on the page may proceed as described.
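
For illustration only, acquiring exclusive access to a single page might look
like the following fragment, where ``mm``, ``addr``, and ``owner`` are assumed
to be set up by the caller (``owner`` being the same pointer the driver uses to
filter its MMU notifier callbacks) and ``driver_map_atomic_pte()`` is a
hypothetical device-specific helper::

 struct page *page = NULL;
 int npages;

 mmap_read_lock(mm);
 npages = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
                                      &page, owner);
 mmap_read_unlock(mm);
 if (npages != 1 || !page)
      return -EBUSY;      /* could not make the page exclusive */

 /* Program the device's atomic PTE while exclusive access is still
  * guaranteed, i.e. before dropping the page lock and reference. */
 driver_map_atomic_pte(addr, page);

 unlock_page(page);
 put_page(page);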

Memory cgroup (memcg) and rss accounting
========================================

For now, device memory is accounted as any regular page in rss counters (either
anonymous if the device page is used for anonymous memory, file if the device
page is used for file backed memory, or shmem if the device page is used for
shared memory). This is a deliberate choice to keep existing applications, which
might start using device memory without knowing about it, running unimpacted.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory and thus not freeing much
system memory. We want to gather more real world experience on how applications
and system react under memory pressure in the presence of device memory before
deciding to account device memory differently.


The same decision was made for memory cgroups. Device memory pages are accounted
against the same memory cgroup a regular page would be accounted to. This does
simplify migration to and from device memory. This also means that migration
back from device memory to regular memory cannot fail because it would
go above the memory cgroup limit. We might revisit this choice later on once we
get more experience in how device memory is used and its impact on memory
resource control.


Note that device memory can never be pinned by a device driver nor through GUP
and thus such memory is always freed upon process exit, or when the last
reference is dropped in the case of shared or file backed memory.