=======================
Direct Access for files
=======================

Motivation
----------

The page cache is usually used to buffer reads and writes to files.
It is also used to provide the pages which are mapped into userspace
by a call to mmap.

For block devices that are memory-like, the page cache pages would be
unnecessary copies of the original storage. The `DAX` code removes the
extra copy by performing reads and writes directly to the storage device.
For file mappings, the storage device is mapped directly into userspace.


Usage
-----

If you have a block device which supports `DAX`, you can make a filesystem
on it as usual. The `DAX` code currently only supports filesystems with a
block size equal to your kernel's `PAGE_SIZE`, so you may need to specify a
block size when creating the filesystem.

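As a rough sketch of the block-size requirement, the helper below prints a
``mkfs`` command line whose block size matches the running kernel's page size.
The function name ``dax_mkfs_cmd`` and the device ``/dev/pmem0`` are
placeholders chosen for illustration and are not defined anywhere in this
document. The helper only prints the command; it does not create a filesystem.

.. code-block:: shell

   # Print (do not run) a mkfs command whose block size equals PAGE_SIZE.
   # Usage: dax_mkfs_cmd <ext2|ext4|xfs> <device>
   # Hypothetical helper, for illustration only.
   dax_mkfs_cmd() (
       fstype=$1
       device=$2
       page_size=$(getconf PAGE_SIZE)
       case "$fstype" in
           ext2|ext4) echo "mkfs.$fstype -b $page_size $device" ;;
           xfs)       echo "mkfs.xfs -b size=$page_size $device" ;;
           *)         echo "unsupported filesystem: $fstype" >&2; exit 1 ;;
       esac
   )

   dax_mkfs_cmd ext4 /dev/pmem0   # /dev/pmem0 is an assumed device name
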
Currently 5 filesystems support `DAX`: ext2, ext4, xfs, virtiofs and erofs.
The way `DAX` is enabled differs between them.

Enabling DAX on ext2 and erofs
------------------------------

When mounting the filesystem, use the ``-o dax`` option on the command line or
add ``dax`` to the options in ``/etc/fstab``. This enables `DAX` on all files
within the filesystem. It is equivalent to the ``-o dax=always`` behavior below.


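For illustration, the two ways of passing the plain ``dax`` option to ext2 or
erofs might look like the output of the sketch below. The device
``/dev/pmem0``, the mount point ``/mnt/dax`` and the helper name
``dax_mount_lines`` are assumptions. The helper only prints text and does not
mount anything.

.. code-block:: shell

   # Print the mount command and the matching /etc/fstab line for a
   # filesystem that takes the plain "dax" option (ext2 or erofs).
   # Usage: dax_mount_lines <fstype> <device> <mountpoint>
   # Hypothetical helper, for illustration only.
   dax_mount_lines() (
       fstype=$1
       device=$2
       mnt=$3
       echo "mount -t $fstype -o dax $device $mnt"
       echo "$device $mnt $fstype dax 0 0"
   )

   dax_mount_lines ext2 /dev/pmem0 /mnt/dax
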
Enabling DAX on xfs and ext4
----------------------------

Summary
-------

1. There exists an in-kernel file access mode flag `S_DAX` that corresponds to
   the statx flag `STATX_ATTR_DAX`. See the manpage for statx(2) for details
   about this access mode.

2. There exists a persistent flag `FS_XFLAG_DAX` that can be applied to regular
   files and directories. This advisory flag can be set or cleared at any
   time, but doing so does not immediately affect the `S_DAX` state.

3. If the persistent `FS_XFLAG_DAX` flag is set on a directory, this flag will
   be inherited by all regular files and subdirectories that are subsequently
   created in this directory. Files and subdirectories that already exist when
   this flag is set or cleared on the parent directory are not affected by
   that change.

4. There exist dax mount options which can override `FS_XFLAG_DAX` in the
   setting of the `S_DAX` flag. Given underlying storage which supports `DAX`,
   the following hold:

   ``-o dax=inode`` means "follow `FS_XFLAG_DAX`" and is the default.

   ``-o dax=never`` means "never set `S_DAX`, ignore `FS_XFLAG_DAX`."

   ``-o dax=always`` means "always set `S_DAX`, ignore `FS_XFLAG_DAX`."

   ``-o dax`` is a legacy option which is an alias for ``dax=always``.

   .. warning::

      The option ``-o dax`` may be removed in the future so ``-o dax=always`` is
      the preferred method for specifying this behavior.

   .. note::

      Modifications to and the inheritance behavior of `FS_XFLAG_DAX` remain
      the same even when the filesystem is mounted with a dax option. However,
      in-core inode state (`S_DAX`) will be overridden until the filesystem is
      remounted with dax=inode and the inode is evicted from kernel memory.

5. The `S_DAX` policy can be changed via:

   a) Setting the parent directory `FS_XFLAG_DAX` as needed before files are
      created

   b) Setting the appropriate dax="foo" mount option

   c) Changing the `FS_XFLAG_DAX` flag on existing regular files and
      directories. This has runtime constraints and limitations that are
      described in 6) below.

6. When changing the `S_DAX` policy via toggling the persistent `FS_XFLAG_DAX`
   flag, the change to existing regular files won't take effect until the
   files are closed by all processes.


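The interaction between the mount option and the persistent flag described in
the summary can be sketched as a small decision function. This is a simplified
model, not kernel code. The name ``dax_effective`` and its arguments are
invented for illustration, and the "no dax" result for unsupported media is an
assumption: a filesystem may handle that case differently, for example by
rejecting the mount.

.. code-block:: shell

   # Predict whether S_DAX would be set for a regular file (simplified model).
   # Usage: dax_effective <mount-option> <flag> <media>
   #   mount-option: inode | never | always | dax (legacy alias of always)
   #   flag:         1 if FS_XFLAG_DAX is set on the file, else 0
   #   media:        1 if the underlying storage supports DAX, else 0
   # Hypothetical helper, for illustration only.
   dax_effective() (
       mode=$1
       flag=$2
       media=$3
       if [ "$media" != 1 ]; then
           # Assumption: without media support S_DAX is never set.
           echo "no dax"
           exit 0
       fi
       case "$mode" in
           always|dax) echo "dax" ;;
           never)      echo "no dax" ;;
           inode)      if [ "$flag" = 1 ]; then echo "dax"; else echo "no dax"; fi ;;
           *)          echo "unknown mount option: $mode" >&2; exit 1 ;;
       esac
   )

   dax_effective inode 1 1    # prints "dax"
   dax_effective never 1 1    # prints "no dax"
   dax_effective always 0 1   # prints "dax"
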
Details
-------

There are 2 per-file dax flags. One is a persistent inode setting (`FS_XFLAG_DAX`)
and the other is a volatile flag indicating the active state of the feature
(`S_DAX`).

`FS_XFLAG_DAX` is preserved within the filesystem. This persistent config
setting can be set, cleared and/or queried using the ``FS_IOC_FS[GS]ETXATTR`` ioctl
(see ioctl_xfs_fsgetxattr(2)) or a utility such as ``xfs_io``.

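A minimal sketch of driving the persistent flag from the shell with ``xfs_io``
follows. The wrapper name ``dax_flag`` and the path ``/mnt/dax/data`` are
assumptions. The ``chattr`` and ``lsattr`` commands of ``xfs_io`` are the only
real interfaces used, with ``x`` being the flag letter for DAX.

.. code-block:: shell

   # Set, clear or show FS_XFLAG_DAX with xfs_io.
   # Usage: dax_flag <set|clear|show> <path>
   # Hypothetical wrapper, for illustration only.
   dax_flag() (
       action=$1
       path=$2
       case "$action" in
           set)   xfs_io -c 'chattr +x' "$path" ;;
           clear) xfs_io -c 'chattr -x' "$path" ;;
           show)  xfs_io -c 'lsattr' "$path" ;;
           *)     echo "usage: dax_flag <set|clear|show> <path>" >&2; exit 2 ;;
       esac
   )

   # Example (needs xfs_io and a filesystem that stores the flag):
   #   dax_flag set /mnt/dax/data
   #   dax_flag show /mnt/dax/data
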
New files and directories automatically inherit `FS_XFLAG_DAX` from
their parent directory **when created**. Therefore, setting `FS_XFLAG_DAX` at
directory creation time can be used to set a default behavior for an entire
sub-tree.

To clarify inheritance, here are 3 examples:

Example A:

.. code-block:: shell

   mkdir -p a/b/c
   xfs_io -c 'chattr +x' a
   mkdir a/b/c/d
   mkdir a/e

   ------[outcome]------

   dax: a,e
   no dax: b,c,d

Example B:

.. code-block:: shell

   mkdir a
   xfs_io -c 'chattr +x' a
   mkdir -p a/b/c/d

   ------[outcome]------

   dax: a,b,c,d
   no dax:

Example C:

.. code-block:: shell

   mkdir -p a/b/c
   xfs_io -c 'chattr +x' a/b/c
   mkdir a/b/c/d

   ------[outcome]------

   dax: c,d
   no dax: a,b

The current enabled state (`S_DAX`) is set when a file inode is instantiated in
memory by the kernel. It is set based on the underlying media support, the
value of `FS_XFLAG_DAX` and the filesystem's dax mount option.

statx can be used to query `S_DAX`.

.. note::

   Only regular files will ever have `S_DAX` set, and therefore statx
   will never indicate that `S_DAX` is set on directories.

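As an illustration of what querying `S_DAX` through statx amounts to, the
sketch below tests the `STATX_ATTR_DAX` bit in an attributes value. How that
value is obtained, for example from a small statx(2) wrapper, is left out, and
the helper name ``has_dax_attr`` is hypothetical. The bit value is the one
defined in ``<linux/stat.h>``.

.. code-block:: shell

   # STATX_ATTR_DAX as defined in <linux/stat.h>.
   STATX_ATTR_DAX=0x00200000

   # Succeed if the given stx_attributes value has the DAX bit set.
   # Usage: has_dax_attr <attributes>   (decimal or 0x-prefixed hex)
   # Hypothetical helper, for illustration only.
   has_dax_attr() {
       [ $(( $1 & $STATX_ATTR_DAX )) -ne 0 ]
   }

   if has_dax_attr 0x200000; then echo "dax"; else echo "no dax"; fi   # prints "dax"
   if has_dax_attr 0x0; then echo "dax"; else echo "no dax"; fi        # prints "no dax"
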
Setting the `FS_XFLAG_DAX` flag (specifically or through inheritance) occurs even
if the underlying media does not support dax and/or the filesystem is
overridden with a mount option.


Enabling DAX on virtiofs
------------------------

The semantics of DAX on virtiofs are essentially the same as on ext4 and xfs,
except that when ``-o dax=inode`` is specified, the virtiofs client derives the
hint whether DAX shall be enabled from the virtiofs server through the FUSE
protocol, rather than from the persistent `FS_XFLAG_DAX` flag. That is, whether
DAX is enabled is determined entirely by the virtiofs server, and the server
itself may use various algorithms to make this decision, e.g. depending on the
persistent `FS_XFLAG_DAX` flag on the host.

Setting or clearing the persistent `FS_XFLAG_DAX` flag inside the guest is still
supported, but it is not guaranteed that DAX will then be enabled or disabled
for the corresponding file. Users inside the guest still need to call statx(2)
and check the statx flag `STATX_ATTR_DAX` to see if DAX is enabled for this file.


Implementation Tips for Block Driver Writers
--------------------------------------------

To support `DAX` in your block driver, implement the 'direct_access'
block device operation. It is used to translate the sector number
(expressed in units of 512-byte sectors) to a page frame number (pfn)
that identifies the physical page for the memory. It also returns a
kernel virtual address that can be used to access the memory.

The direct_access method takes a 'size' parameter that indicates the
number of bytes being requested. The function should return the number
of bytes that can be contiguously accessed at that offset. It may also
return a negative errno if an error occurs.

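The sector-to-pfn translation can be illustrated with plain arithmetic. The
sketch below is not driver code: it assumes a device whose memory is one
physically contiguous range starting at a base address supplied by the caller,
and the helper name ``sector_to_pfn`` is made up for this example.

.. code-block:: shell

   # Translate a 512-byte sector number into a byte offset and a pfn,
   # assuming the device memory starts at physical address <base>.
   # Usage: sector_to_pfn <base> <sector>
   # Hypothetical helper, for illustration only.
   sector_to_pfn() (
       base=$1
       sector=$2
       page_size=$(getconf PAGE_SIZE)
       offset=$(( $sector * 512 ))
       pfn=$(( ($base + $offset) / $page_size ))
       echo "offset=$offset pfn=$pfn"
   )

   # Sector 16 of a device assumed to start at the 4 GiB mark.
   sector_to_pfn 0x100000000 16

The pfn depends on the page size of the running kernel, so the same sector
maps to different pfns on systems with different page sizes.
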
In order to support this method, the storage must be byte-accessible by
the CPU at all times. If your device uses paging techniques to expose
a large amount of memory through a smaller window, then you cannot
implement direct_access. Equally, if your device can occasionally
stall the CPU for an extended period, you should also not attempt to
implement direct_access.

These block devices may be used for inspiration:

- brd: RAM backed block device driver
- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver


Implementation Tips for Filesystem Writers
------------------------------------------

Filesystem support consists of:

* Adding support to mark inodes as being `DAX` by setting the `S_DAX` flag in
  i_flags
* Implementing ->read_iter and ->write_iter operations which use
  :c:func:`dax_iomap_rw()` when inode has `S_DAX` flag set
* Implementing an mmap file operation for `DAX` files which sets the
  `VM_MIXEDMAP` and `VM_HUGEPAGE` flags on the `VMA`, and setting the vm_ops to
  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
  handlers should probably call :c:func:`dax_iomap_fault()` passing the
  appropriate fault size and iomap operations.
* Calling :c:func:`iomap_zero_range()` passing appropriate iomap operations
  instead of :c:func:`block_truncate_page()` for `DAX` files
* Ensuring that there is sufficient locking between reads, writes,
  truncates and page faults

The iomap handlers for allocating blocks must make sure that allocated blocks
are zeroed out and converted to written extents before being returned to avoid
exposure of uninitialized data through mmap.

These filesystems may be used for inspiration:

.. seealso::

   ext2: see Documentation/filesystems/ext2.rst

.. seealso::

   xfs: see Documentation/admin-guide/xfs.rst

.. seealso::

   ext4: see Documentation/filesystems/ext4/


Handling Media Errors
---------------------

The libnvdimm subsystem stores a record of known media error locations for
each pmem block device (in gendisk->badblocks). If we fault at such a location,
or one with a latent error not yet discovered, the application can expect
to receive a `SIGBUS`. Libnvdimm also allows clearing of these errors by simply
writing the affected sectors (through the pmem driver, and if the underlying
NVDIMM supports the clear_poison DSM defined by ACPI).

Since `DAX` IO normally doesn't go through the driver/bio path, applications or
sysadmins have an option to restore the lost data from a prior backup or
built-in redundancy in the following ways:

1. Delete the affected file, and restore from a backup (sysadmin route):
   This will free the filesystem blocks that were being used by the file,
   and the next time they're allocated, they will be zeroed first, which
   happens through the driver, and will clear bad sectors.

2. Truncate or hole-punch the part of the file that has a bad-block (at least
   an entire aligned sector has to be hole-punched, but not necessarily an
   entire filesystem block).

These are the two basic paths that allow `DAX` filesystems to continue operating
in the presence of media errors. More robust error recovery mechanisms can be
built on top of this in the future, for example, involving redundancy/mirroring
provided at the block layer through DM, or additionally, at the filesystem
level. These would have to rely on the above two tenets, that error clearing
can happen either by sending an IO through the driver, or zeroing (also through
the driver).


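The hole-punch route can be sketched with the ``fallocate`` utility. The helper
below only prints the command for a run of bad 512-byte sectors. It assumes the
caller has already translated the device-relative bad sectors into sector
positions inside the file, which this sketch does not attempt. The helper name
``punch_cmd`` and the path ``/mnt/dax/data`` are hypothetical.

.. code-block:: shell

   # Print (do not run) a fallocate command that punches a hole over a
   # run of bad 512-byte sectors inside a file.
   # Usage: punch_cmd <file> <first-sector-in-file> <sector-count>
   # Hypothetical helper, for illustration only.
   punch_cmd() (
       file=$1
       first=$2
       count=$3
       offset=$(( $first * 512 ))
       length=$(( $count * 512 ))
       echo "fallocate --punch-hole --offset $offset --length $length $file"
   )

   punch_cmd /mnt/dax/data 8 2
   # prints: fallocate --punch-hole --offset 4096 --length 1024 /mnt/dax/data

Working in whole sectors keeps the punched range sector-aligned, as the second
recovery path requires.
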
Shortcomings
------------

Even if the kernel or its modules are stored on a filesystem that supports
`DAX` on a block device that supports `DAX`, they will still be copied into RAM.

The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.

Calling :c:func:`get_user_pages()` on a range of user memory that has been
mmapped from a `DAX` file will fail when there are no 'struct page' to describe
those pages. This problem has been addressed in some device drivers
by adding optional struct page support for pages under the control of
the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of
how to do this). In the non struct page cases, `O_DIRECT` reads/writes to
those memory ranges from a non-`DAX` file will fail.

.. note::

   `O_DIRECT` reads/writes *of* a `DAX` file do work; it is the memory that
   is being accessed that is key here. Other things that will not work in
   the non struct page case include RDMA, :c:func:`sendfile()` and
   :c:func:`splice()`.
302 | .. note:: | |
303 | ||
304 | `O_DIRECT` reads/writes _of a `DAX` file do work, it is the memory that | |
305 | is being accessed that is key here). Other things that will not work in | |
306 | the non struct page case include RDMA, :c:func:`sendfile()` and | |
307 | :c:func:`splice()`. |