Commit | Line | Data |
---|---|---|
95ec8dab MW |
1 | Direct Access for files |
2 | ----------------------- | |
3 | ||
4 | Motivation | |
5 | ---------- | |
6 | ||
7 | The page cache is usually used to buffer reads and writes to files. | |
8 | It is also used to provide the pages which are mapped into userspace | |
9 | by a call to mmap. | |
10 | ||
11 | For block devices that are memory-like, the page cache pages would be | |
12 | unnecessary copies of the original storage. The DAX code removes the | |
13 | extra copy by performing reads and writes directly to the storage device. | |
14 | For file mappings, the storage device is mapped directly into userspace. | |
15 | ||
16 | ||
17 | Usage | |
18 | ----- | |
19 | ||
20 | If you have a block device which supports DAX, you can make a filesystem | |
44f4c054 MW |
21 | on it as usual. The DAX code currently only supports files with a block |
22 | size equal to your kernel's PAGE_SIZE, so you may need to specify a block | |
83d90886 IW |
23 | size when creating the filesystem. |
24 | ||
25 | Currently 3 filesystems support DAX: ext2, ext4 and xfs. The procedure for | |
26 | enabling DAX differs between ext2 and the other two. | |
27 | ||
15ee6567 | 28 | Enabling DAX on ext2 |
83d90886 IW |
29 | -------------------- |
30 | ||
31 | When mounting the filesystem, use the "-o dax" option on the command line or | |
32 | add 'dax' to the options in /etc/fstab. This works to enable DAX on all files | |
33 | within the filesystem. It is equivalent to the '-o dax=always' behavior below. | |
34 | ||
35 | ||
15ee6567 IW |
36 | Enabling DAX on xfs and ext4 |
37 | ---------------------------- | |
83d90886 IW |
38 | |
39 | Summary | |
40 | ------- | |
41 | ||
42 | 1. There exists an in-kernel file access mode flag S_DAX that corresponds to | |
43 | the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details | |
44 | about this access mode. | |
45 | ||
46 | 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular | |
47 | files and directories. This advisory flag can be set or cleared at any | |
48 | time, but doing so does not immediately affect the S_DAX state. | |
49 | ||
50 | 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will | |
51 | be inherited by all regular files and subdirectories that are subsequently | |
52 | created in this directory. Files and subdirectories that exist at the time | |
53 | this flag is set or cleared on the parent directory are not affected by | |
54 | the change. | |
55 | ||
56 | 4. There exist dax mount options which can override FS_XFLAG_DAX in the | |
57 | setting of the S_DAX flag. Given underlying storage which supports DAX the | |
58 | following hold: | |
59 | ||
60 | "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default. | |
61 | ||
62 | "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX." | |
63 | ||
64 | "-o dax=always" means "always set S_DAX, ignore FS_XFLAG_DAX." | |
65 | ||
66 | "-o dax" is a legacy option which is an alias for "dax=always". | |
67 | This may be removed in the future so "-o dax=always" is | |
68 | the preferred method for specifying this behavior. | |
69 | ||
70 | NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain | |
71 | the same even when the filesystem is mounted with a dax option. However, | |
72 | in-core inode state (S_DAX) will be overridden until the filesystem is | |
73 | remounted with dax=inode and the inode is evicted from kernel memory. | |
74 | ||
75 | 5. The S_DAX policy can be changed via: | |
76 | ||
77 | a) Setting the parent directory FS_XFLAG_DAX as needed before files are | |
78 | created | |
79 | ||
80 | b) Setting the appropriate dax="foo" mount option | |
81 | ||
82 | c) Changing the FS_XFLAG_DAX flag on existing regular files and | |
83 | directories. This has runtime constraints and limitations that are | |
84 | described in 6) below. | |
85 | ||
86 | 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag, | |
87 | the change in behaviour for existing regular files may not occur | |
88 | immediately. If the change must take effect immediately, the administrator | |
89 | needs to: | |
90 | ||
91 | a) stop the application so there are no active references to the data set | |
92 | the policy change will affect | |
93 | ||
94 | b) evict the data set from kernel caches so it will be re-instantiated when | |
95 | the application is restarted. This can be achieved by: | |
96 | ||
97 | i. drop-caches | |
98 | ii. a filesystem unmount and mount cycle | |
99 | iii. a system reboot | |
100 | ||
101 | ||
102 | Details | |
103 | ------- | |
104 | ||
105 | There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX) | |
106 | and the other is a volatile flag indicating the active state of the feature | |
107 | (S_DAX). | |
108 | ||
109 | FS_XFLAG_DAX is preserved within the filesystem. This persistent config | |
110 | setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl | |
111 | (see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'. | |
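
The ioctl route mentioned above can be sketched in userspace C as follows.
This is a minimal illustration, not the method any particular tool uses: the
helper names get_dax_xflag() and set_dax_xflag() are hypothetical, and the
fallback value for FS_XFLAG_DAX is an assumption for building against older
headers. As described in the summary, changing the flag does not immediately
change S_DAX.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR */

/* Assumed fallback for older headers (value as in include/uapi/linux/fs.h). */
#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000
#endif

/* Hypothetical helper: 1 if FS_XFLAG_DAX is set, 0 if clear, -1 on error. */
int get_dax_xflag(const char *path)
{
    struct fsxattr fsx;
    int fd, ret;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
    close(fd);
    if (ret != 0)
        return -1;
    return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}

/* Hypothetical helper: set (enable != 0) or clear the flag; 0 or -1. */
int set_dax_xflag(const char *path, int enable)
{
    struct fsxattr fsx;
    int fd, ret;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* Read-modify-write so that the other xflags are preserved. */
    ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
    if (ret == 0) {
        if (enable)
            fsx.fsx_xflags |= FS_XFLAG_DAX;
        else
            fsx.fsx_xflags &= ~FS_XFLAG_DAX;
        ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
    }
    close(fd);
    return ret == 0 ? 0 : -1;
}
```

Calling set_dax_xflag("a", 1) on a directory is intended to have the same
effect as the xfs_io 'chattr +x' command used in the examples in this
section. Both helpers return -1 on filesystems that do not implement these
ioctls.
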
112 | ||
113 | New files and directories automatically inherit FS_XFLAG_DAX from | |
114 | their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at | |
115 | directory creation time can be used to set a default behavior for an entire | |
116 | sub-tree. | |
117 | ||
118 | To clarify inheritance, here are 3 examples: | |
119 | ||
120 | Example A: | |
121 | ||
122 | mkdir -p a/b/c | |
123 | xfs_io -c 'chattr +x' a | |
124 | mkdir a/b/c/d | |
125 | mkdir a/e | |
126 | ||
127 | dax: a,e | |
128 | no dax: b,c,d | |
129 | ||
130 | Example B: | |
131 | ||
132 | mkdir a | |
133 | xfs_io -c 'chattr +x' a | |
134 | mkdir -p a/b/c/d | |
135 | ||
136 | dax: a,b,c,d | |
137 | no dax: | |
138 | ||
139 | Example C: | |
140 | ||
141 | mkdir -p a/b/c | |
142 | xfs_io -c 'chattr +x' a/b/c | |
143 | mkdir a/b/c/d | |
144 | ||
145 | dax: c,d | |
146 | no dax: a,b | |
147 | ||
148 | ||
149 | The current enabled state (S_DAX) is set when a file inode is instantiated in | |
150 | memory by the kernel. It is set based on the underlying media support, the | |
151 | value of FS_XFLAG_DAX and the filesystem's dax mount option. | |
152 | ||
153 | statx can be used to query S_DAX. NOTE that only regular files will ever have | |
154 | S_DAX set and therefore statx will never indicate that S_DAX is set on | |
155 | directories. | |
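
As a rough illustration of the statx query described above, the following C
sketch reports whether a file is currently in the DAX state. The helper name
file_is_dax() is hypothetical, the fallback definition of STATX_ATTR_DAX is
an assumption for older headers, and the statx() wrapper is assumed to be
provided by the C library (glibc 2.28 or later).

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <sys/stat.h>   /* statx(), struct statx, STATX_BASIC_STATS */

/* Assumed fallback for older headers (value as in include/uapi/linux/stat.h). */
#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000
#endif

/*
 * Hypothetical helper: returns 1 if the file is currently in the DAX state
 * (S_DAX set on the in-core inode), 0 if it is not, and -1 if statx() fails.
 * Kernels that predate STATX_ATTR_DAX never set the bit, so 0 is returned
 * there as well.
 */
int file_is_dax(const char *path)
{
    struct statx stx;

    if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
        return -1;
    return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}
```

Since only regular files ever have S_DAX set, this helper is expected to
return 0 for any directory, even one that carries FS_XFLAG_DAX.
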
156 | ||
157 | Setting the FS_XFLAG_DAX flag (explicitly or through inheritance) occurs even | |
158 | if the underlying media does not support dax and/or the flag is overridden | |
159 | by a dax mount option. | |
160 | ||
95ec8dab MW |
161 | |
162 | ||
163 | Implementation Tips for Block Driver Writers | |
164 | -------------------------------------------- | |
165 | ||
166 | To support DAX in your block driver, implement the 'direct_access' | |
167 | block device operation. It is used to translate the sector number | |
168 | (expressed in units of 512-byte sectors) to a page frame number (pfn) | |
169 | that identifies the physical page for the memory. It also returns a | |
170 | kernel virtual address that can be used to access the memory. | |
171 | ||
172 | The direct_access method takes a 'size' parameter that indicates the | |
173 | number of bytes being requested. The function should return the number | |
174 | of bytes that can be contiguously accessed at that offset. It may also | |
175 | return a negative errno if an error occurs. | |
176 | ||
177 | In order to support this method, the storage must be byte-accessible by | |
178 | the CPU at all times. If your device uses paging techniques to expose | |
179 | a large amount of memory through a smaller window, then you cannot | |
180 | implement direct_access. Equally, if your device can occasionally | |
181 | stall the CPU for an extended period, you should also not attempt to | |
182 | implement direct_access. | |
183 | ||
184 | These block devices may be used for inspiration: | |
95ec8dab MW |
185 | - brd: RAM backed block device driver |
186 | - dcssblk: s390 dcss block device driver | |
221c7dc8 | 187 | - pmem: NVDIMM persistent memory driver |
95ec8dab MW |
188 | |
189 | ||
190 | Implementation Tips for Filesystem Writers | |
191 | ------------------------------------------ | |
192 | ||
193 | Filesystem support consists of | |
194 | - adding support to mark inodes as being DAX by setting the S_DAX flag in | |
195 | i_flags | |
dd936e43 JK |
196 | - implementing ->read_iter and ->write_iter operations which use dax_iomap_rw() |
197 | when the inode has the S_DAX flag set | |
95ec8dab | 198 | - implementing an mmap file operation for DAX files which sets the |
844f35db | 199 | VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to |
dd936e43 | 200 | include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These |
91d25ba8 RZ |
201 | handlers should probably call dax_iomap_fault() passing the appropriate |
202 | fault size and iomap operations. | |
dd936e43 JK |
203 | - calling iomap_zero_range() passing appropriate iomap operations instead of |
204 | block_truncate_page() for DAX files | |
95ec8dab MW |
205 | - ensuring that there is sufficient locking between reads, writes, |
206 | truncates and page faults | |
207 | ||
dd936e43 JK |
208 | The iomap handlers for allocating blocks must make sure that allocated blocks |
209 | are zeroed out and converted to written extents before being returned to avoid | |
210 | exposure of uninitialized data through mmap. | |
95ec8dab MW |
211 | |
212 | These filesystems may be used for inspiration: | |
0c1bc6b8 | 213 | - ext2: see Documentation/filesystems/ext2.rst |
93fb7f19 | 214 | - ext4: see Documentation/filesystems/ext4/ |
89b408a6 | 215 | - xfs: see Documentation/admin-guide/xfs.rst |
95ec8dab MW |
216 | |
217 | ||
4b0228fa VV |
218 | Handling Media Errors |
219 | --------------------- | |
220 | ||
221 | The libnvdimm subsystem stores a record of known media error locations for | |
222 | each pmem block device (in gendisk->badblocks). If we fault at such a location, | |
223 | or one with a latent error not yet discovered, the application can expect | |
224 | to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply | |
225 | writing the affected sectors (through the pmem driver, and if the underlying | |
226 | NVDIMM supports the clear_poison DSM defined by ACPI). | |
227 | ||
228 | Since DAX IO normally doesn't go through the driver/bio path, applications or | |
229 | sysadmins have an option to restore the lost data from a prior backup/inbuilt | |
230 | redundancy in the following ways: | |
231 | ||
232 | 1. Delete the affected file, and restore from a backup (sysadmin route): | |
83d90886 | 233 | This will free the filesystem blocks that were being used by the file, |
4b0228fa VV |
234 | and the next time they're allocated, they will be zeroed first, which |
235 | happens through the driver, and will clear bad sectors. | |
236 | ||
237 | 2. Truncate or hole-punch the part of the file that has a bad block (at least | |
238 | an entire aligned sector has to be hole-punched, but not necessarily an | |
239 | entire filesystem block). | |
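
The hole-punch route in item 2 might look like the following from userspace.
This is a sketch under stated assumptions: the helper names sector_align()
and punch_bad_range() are hypothetical, a 512-byte sector size is assumed,
and the caller is assumed to already know the byte range of the bad block
within the file. The punched range is widened to whole aligned sectors, as
item 2 requires.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* open(), fallocate() */
#include <unistd.h>         /* close() */
#include <sys/types.h>      /* off_t */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */

#define SECTOR_SIZE 512     /* assumption: 512-byte sectors */

/* Hypothetical helper: widen [off, off + len) to whole aligned sectors. */
void sector_align(off_t off, off_t len, off_t *start, off_t *out_len)
{
    off_t end = off + len;

    *start = off - (off % SECTOR_SIZE);
    end = (end + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
    *out_len = end - *start;
}

/*
 * Hypothetical helper: punch a hole over the sectors covering the bad range
 * without changing the file size. Returns 0 on success, -1 on error.
 */
int punch_bad_range(const char *path, off_t off, off_t len)
{
    off_t start, plen;
    int fd, ret;

    if (off < 0 || len <= 0)
        return -1;
    sector_align(off, len, &start, &plen);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    start, plen);
    close(fd);
    return ret == 0 ? 0 : -1;
}
```

Note that all data in the punched sectors is discarded, including any bytes
that were still readable, so the range has to be rewritten from a backup or
other redundancy afterwards.
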
240 | ||
241 | These are the two basic paths that allow DAX filesystems to continue operating | |
242 | in the presence of media errors. More robust error recovery mechanisms can be | |
243 | built on top of this in the future, for example, involving redundancy/mirroring | |
244 | provided at the block layer through DM, or additionally, at the filesystem | |
245 | level. These would have to rely on the above two tenets, that error clearing | |
246 | can happen either by sending an IO through the driver, or zeroing (also through | |
247 | the driver). | |
248 | ||
249 | ||
95ec8dab MW |
250 | Shortcomings |
251 | ------------ | |
252 | ||
253 | Even if the kernel or its modules are stored on a filesystem that supports | |
254 | DAX on a block device that supports DAX, they will still be copied into RAM. | |
255 | ||
d92576f1 MW |
256 | The DAX code does not work correctly on architectures which have virtually |
257 | mapped caches such as ARM, MIPS and SPARC. | |
258 | ||
95ec8dab | 259 | Calling get_user_pages() on a range of user memory that has been mmapped
9ff2dc56 SB |
260 | from a DAX file will fail when there are no 'struct page' to describe |
261 | those pages. This problem has been addressed in some device drivers | |
262 | by adding optional struct page support for pages under the control of | |
263 | the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of | |
264 | how to do this). In the non struct page cases O_DIRECT reads/writes to | |
265 | those memory ranges from a non-DAX file will fail (note that O_DIRECT | |
266 | reads/writes _of a DAX file_ do work; it is the memory that is being | |
267 | accessed that is key here). Other things that will not work in the | |
268 | non struct page case include RDMA, sendfile() and splice(). |