Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | |
2 | The SGI XFS Filesystem | |
3 | ====================== | |
4 | ||
5 | XFS is a high performance journaling filesystem which originated | |
6 | on the SGI IRIX platform. It is completely multi-threaded, can | |
7 | support large files and large filesystems, extended attributes, | |
8 | variable block sizes, is extent based, and makes extensive use of | |
9 | Btrees (directories, extents, free space) to aid both performance | |
10 | and scalability. | |
11 | ||
a10c5d91 | 12 | Refer to the documentation at https://xfs.wiki.kernel.org/ |
1da177e4 LT |
13 | for further details. This implementation is on-disk compatible |
14 | with the IRIX version of XFS. | |
15 | ||
16 | ||
17 | Mount Options | |
18 | ============= | |
19 | ||
20 | When mounting an XFS filesystem, the following options are accepted. | |
3e5b7d8b DC |
21 | For boolean mount options, the name with the (*) suffix is the |
22 | default behaviour. | |
1da177e4 | 23 | |
fc97bbf3 NS |
24 | allocsize=size |
25 | Sets the buffered I/O end-of-file preallocation size when | |
26 | doing delayed allocation writeout (default size is 64KiB). | |
27 | Valid values for this option are page size (typically 4KiB) | |
28 | through to 1GiB, inclusive, in power-of-2 increments. | |
29 | ||
3e5b7d8b DC |
30 | The default behaviour is for dynamic end-of-file |
31 | preallocation size, which uses a set of heuristics to | |
32 | optimise the preallocation size based on the current | |
33 | allocation patterns within the file and the access patterns | |
34 | to the file. Specifying a fixed allocsize value turns off | |
35 | the dynamic behaviour. | |
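
The allocsize constraints above (a power of two between the page size and 1GiB, inclusive) can be checked with a small helper. This is only an illustrative sketch: the function name is hypothetical, and the 4KiB page size is the "typical" value quoted above, not something detected from the running system.

```python
def is_valid_allocsize(size_bytes: int, page_size: int = 4096) -> bool:
    """Return True if size_bytes is a power of two in [page_size, 1 GiB].

    page_size defaults to 4 KiB, which is an assumption; pass the real
    page size of the target machine if it differs.
    """
    one_gib = 1 << 30
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = size_bytes > 0 and (size_bytes & (size_bytes - 1)) == 0
    return is_power_of_two and page_size <= size_bytes <= one_gib


print(is_valid_allocsize(64 * 1024))  # 64 KiB is a power of two in range
print(is_valid_allocsize(96 * 1024))  # 96 KiB is not a power of two
print(is_valid_allocsize(2 << 30))    # 2 GiB is above the 1 GiB limit
```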
36 | ||
37 | attr2 | |
38 | noattr2 | |
39 | These options enable/disable an "opportunistic" improvement in | |
40 | the way inline extended attributes are stored on-disk. The | |
41 | first time the new form is used with attr2 selected (either | |
42 | when setting or removing extended attributes), the on-disk | |
43 | superblock feature bit field is updated to reflect this | |
44 | format being in use. | |
45 | ||
46 | The default behaviour is determined by the on-disk feature | |
47 | bit indicating that attr2 behaviour is active. If either | |
48 | mount option is set, then that becomes the new default used | |
49 | by the filesystem. | |
fc97bbf3 | 50 | |
d3eaace8 DC |
51 | CRC enabled filesystems always use the attr2 format, and so |
52 | will reject the noattr2 mount option if it is set. | |
53 | ||
e84661aa | 54 | discard |
3e5b7d8b DC |
55 | nodiscard (*) |
56 | Enable/disable the issuing of commands to let the block | |
57 | device reclaim space freed by the filesystem. This is | |
58 | useful for SSD devices, thinly provisioned LUNs and virtual | |
59 | machine images, but may have a performance impact. | |
60 | ||
61 | Note: It is currently recommended that you use the fstrim | |
62 | application to discard unused blocks rather than the discard | |
63 | mount option because the performance impact of this option | |
64 | is quite severe. | |
65 | ||
66 | grpid/bsdgroups | |
67 | nogrpid/sysvgroups (*) | |
68 | These options define what group ID a newly created file | |
69 | gets. When grpid is set, it takes the group ID of the | |
70 | directory in which it is created; otherwise it takes the | |
71 | fsgid of the current process, unless the directory has the | |
72 | setgid bit set, in which case it takes the gid from the | |
73 | parent directory, and also gets the setgid bit set if it is | |
74 | a directory itself. | |
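
The group ID rule described above can be written out as a small decision function. This is a simplified model of the text, not the kernel code; the function and parameter names are hypothetical, and it only covers the choice of gid, not the inheritance of the setgid bit.

```python
def new_file_gid(grpid: bool, dir_gid: int, dir_has_setgid: bool, fsgid: int) -> int:
    """Pick the group ID for a newly created file, following the text above.

    grpid          -- True when mounted with grpid/bsdgroups
    dir_gid        -- group ID of the directory the file is created in
    dir_has_setgid -- True when that directory has the setgid bit set
    fsgid          -- fsgid of the creating process
    """
    if grpid:
        # grpid/bsdgroups: always take the group of the parent directory.
        return dir_gid
    if dir_has_setgid:
        # nogrpid/sysvgroups, but the directory is setgid: parent's gid.
        return dir_gid
    # nogrpid/sysvgroups default: the fsgid of the current process.
    return fsgid


print(new_file_gid(grpid=False, dir_gid=100, dir_has_setgid=False, fsgid=1000))
print(new_file_gid(grpid=True, dir_gid=100, dir_has_setgid=False, fsgid=1000))
```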
75 | ||
76 | filestreams | |
77 | Make the data allocator use the filestreams allocation mode | |
78 | across the entire filesystem rather than just on directories | |
79 | configured to use it. | |
80 | ||
81 | ikeep | |
82 | noikeep (*) | |
83 | When ikeep is specified, XFS does not delete empty inode | |
84 | clusters and keeps them around on disk. When noikeep is | |
85 | specified, empty inode clusters are returned to the free | |
86 | space pool. | |
c99abb8f CM |
87 | |
88 | inode32 | |
3e5b7d8b DC |
89 | inode64 (*) |
90 | When inode32 is specified, it indicates that XFS limits | |
91 | inode creation to locations which will not result in inode | |
92 | numbers with more than 32 bits of significance. | |
93 | ||
94 | When inode64 is specified, it indicates that XFS is allowed | |
95 | to create inodes at any location in the filesystem, | |
96 | including those which will result in inode numbers occupying | |
97 | more than 32 bits of significance. | |
98 | ||
99 | inode32 is provided for backwards compatibility with older | |
100 | systems and applications, since 64-bit inode numbers might | |
101 | cause problems for some applications that cannot handle | |
102 | large inode numbers. If applications are in use which do | |
103 | not handle inode numbers bigger than 32 bits, the inode32 | |
104 | option should be specified. | |
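
To judge whether inode32 might be needed, it can help to see whether existing files already have inode numbers beyond 32 bits. The sketch below uses the standard os.stat() call; the helper names are hypothetical and any path passed to it is the caller's choice.

```python
import os


def inode_fits_in_32_bits(ino: int) -> bool:
    """Return True if the inode number has at most 32 significant bits."""
    return 0 <= ino <= 0xFFFFFFFF


def path_inode_fits_in_32_bits(path: str) -> bool:
    """Look up the inode number of an existing path and test its width."""
    return inode_fits_in_32_bits(os.stat(path).st_ino)


print(inode_fits_in_32_bits(2**32 - 1))  # largest 32-bit inode number
print(inode_fits_in_32_bits(2**32))      # needs 33 bits
```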
105 | ||
106 | ||
107 | largeio | |
108 | nolargeio (*) | |
fc97bbf3 | 109 | If "nolargeio" is specified, the optimal I/O size reported in |
3e5b7d8b DC |
110 | st_blksize by stat(2) will be as small as possible to allow |
111 | user applications to avoid inefficient read/modify/write | |
112 | I/O. This is typically the page size of the machine, as | |
113 | this is the granularity of the page cache. | |
114 | ||
115 | If "largeio" is specified, a filesystem that was created with a | |
116 | "swidth" specified will return the "swidth" value (in bytes) | |
117 | in st_blksize. If the filesystem does not have a "swidth" | |
118 | specified but does specify an "allocsize" then "allocsize" | |
119 | (in bytes) will be returned instead. Otherwise the behaviour | |
120 | is the same as if "nolargeio" was specified. | |
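
The st_blksize selection described above can be summarised as a short fallback chain. This is a sketch of the documented behaviour only: the names are hypothetical, None stands for "not specified", and the nolargeio case is modelled as the page size, which the text says is merely the typical value.

```python
from typing import Optional


def reported_blksize(
    largeio: bool,
    swidth_bytes: Optional[int],
    allocsize_bytes: Optional[int],
    page_size: int = 4096,
) -> int:
    """Model of the st_blksize value described for largeio/nolargeio."""
    if largeio:
        if swidth_bytes:
            # Filesystem created with a stripe width: report it in bytes.
            return swidth_bytes
        if allocsize_bytes:
            # No swidth, but an allocsize mount option was given.
            return allocsize_bytes
    # nolargeio, or largeio with neither swidth nor allocsize.
    return page_size


print(reported_blksize(True, 256 * 1024, 64 * 1024))
print(reported_blksize(True, None, 64 * 1024))
print(reported_blksize(False, 256 * 1024, 64 * 1024))
```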
fc97bbf3 | 121 | |
1da177e4 | 122 | logbufs=value |
3e5b7d8b DC |
123 | Set the number of in-memory log buffers. Valid numbers |
124 | range from 2-8 inclusive. | |
125 | ||
126 | The default value is 8 buffers. | |
127 | ||
128 | If the memory cost of 8 log buffers is too high on small | |
129 | systems, then it may be reduced at some cost to performance | |
130 | on metadata intensive workloads. The logbsize option below | |
9ed354b7 | 131 | controls the size of each buffer and so is also relevant to |
3e5b7d8b | 132 | this case. |
1da177e4 LT |
133 | |
134 | logbsize=value | |
3e5b7d8b DC |
135 | Set the size of each in-memory log buffer. The size may be |
136 | specified in bytes, or in kilobytes with a "k" suffix. | |
137 | Valid sizes for version 1 and version 2 logs are 16384 (16k) | |
138 | and 32768 (32k). Valid sizes for version 2 logs also | |
139 | include 65536 (64k), 131072 (128k) and 262144 (256k). The | |
140 | logbsize must be an integer multiple of the log | |
141 | stripe unit configured at mkfs time. | |
142 | ||
143 | The default value for version 1 logs is 32768, while the | |
144 | default value for version 2 logs is MAX(32768, log_sunit). | |
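
The logbsize rules above can be expressed as a validity check plus a default calculation. This is a minimal sketch under stated assumptions: the function names are hypothetical, log_sunit is taken to be in bytes, and a log_sunit of 0 is taken to mean that no log stripe unit was configured.

```python
V1_SIZES = {16384, 32768}
V2_SIZES = V1_SIZES | {65536, 131072, 262144}


def is_valid_logbsize(size: int, log_version: int, log_sunit: int = 0) -> bool:
    """Check a logbsize value (in bytes) against the rules listed above."""
    allowed = V2_SIZES if log_version == 2 else V1_SIZES
    if size not in allowed:
        return False
    # The buffer size must be an integer multiple of the log stripe unit.
    if log_sunit > 0 and size % log_sunit != 0:
        return False
    return True


def default_logbsize(log_version: int, log_sunit: int = 0) -> int:
    """Default from the text: 32768 for v1 logs, MAX(32768, log_sunit) for v2."""
    if log_version == 1:
        return 32768
    return max(32768, log_sunit)


print(is_valid_logbsize(262144, log_version=2, log_sunit=65536))
print(is_valid_logbsize(65536, log_version=1))
print(default_logbsize(2, log_sunit=131072))
```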
1da177e4 LT |
145 | |
146 | logdev=device and rtdev=device | |
147 | Use an external log (metadata journal) and/or real-time device. | |
148 | An XFS filesystem has up to three parts: a data section, a log | |
149 | section, and a real-time section. The real-time section is | |
150 | optional, and the log section can be separate from the data | |
151 | section or contained within it. | |
152 | ||
153 | noalign | |
3e5b7d8b DC |
154 | Data allocations will not be aligned at stripe unit |
155 | boundaries. This is only relevant to filesystems created | |
156 | with non-zero data alignment parameters (sunit, swidth) by | |
157 | mkfs. | |
1da177e4 LT |
158 | |
159 | norecovery | |
160 | The filesystem will be mounted without running log recovery. | |
161 | If the filesystem was not cleanly unmounted, it is likely to | |
162 | be inconsistent when mounted in "norecovery" mode. | |
163 | Some files or directories may not be accessible because of this. | |
164 | Filesystems mounted "norecovery" must be mounted read-only or | |
165 | the mount will fail. | |
166 | ||
167 | nouuid | |
3e5b7d8b DC |
168 | Don't check for double mounted file systems using the file |
169 | system uuid. This is useful to mount LVM snapshot volumes, | |
170 | and often used in combination with "norecovery" for mounting | |
171 | read-only snapshots. | |
172 | ||
173 | noquota | |
174 | Forcibly turns off all quota accounting and enforcement | |
175 | within the filesystem. | |
1da177e4 | 176 | |
fc97bbf3 | 177 | uquota/usrquota/uqnoenforce/quota |
1da177e4 | 178 | User disk quota accounting enabled, and limits (optionally) |
fc97bbf3 | 179 | enforced. Refer to xfs_quota(8) for further details. |
1da177e4 | 180 | |
fc97bbf3 | 181 | gquota/grpquota/gqnoenforce |
1da177e4 | 182 | Group disk quota accounting enabled and limits (optionally) |
fc97bbf3 NS |
183 | enforced. Refer to xfs_quota(8) for further details. |
184 | ||
185 | pquota/prjquota/pqnoenforce | |
186 | Project disk quota accounting enabled and limits (optionally) | |
187 | enforced. Refer to xfs_quota(8) for further details. | |
1da177e4 LT |
188 | |
189 | sunit=value and swidth=value | |
3e5b7d8b DC |
190 | Used to specify the stripe unit and width for a RAID device |
191 | or a stripe volume. "value" must be specified in 512-byte | |
192 | block units. These options are only relevant to filesystems | |
193 | that were created with non-zero data alignment parameters. | |
194 | ||
195 | The sunit and swidth parameters specified must be compatible | |
196 | with the existing filesystem alignment characteristics. In | |
197 | general, that means the only valid changes to sunit are | |
198 | increasing it by a power-of-2 multiple. Valid swidth values | |
199 | are any integer multiple of a valid sunit value. | |
200 | ||
201 | Typically the only time these mount options are necessary is | |
202 | after an underlying RAID device has had its geometry | |
203 | modified, such as adding a new disk to a RAID5 lun and | |
204 | reshaping it. | |
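
Because sunit and swidth are given in 512-byte units, converting from a RAID geometry takes a little arithmetic. The sketch below assumes the common convention that the stripe width equals the stripe unit multiplied by the number of data disks; that convention, the function name and the example geometry are illustrative assumptions, not taken from the text above.

```python
SECTOR_BYTES = 512  # sunit/swidth are specified in 512-byte block units


def stripe_mount_options(stripe_unit_bytes: int, data_disks: int) -> str:
    """Build a "sunit=...,swidth=..." option string from a RAID geometry."""
    if stripe_unit_bytes <= 0 or stripe_unit_bytes % SECTOR_BYTES != 0:
        raise ValueError("stripe unit must be a positive multiple of 512 bytes")
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sunit = stripe_unit_bytes // SECTOR_BYTES
    # swidth must be an integer multiple of sunit; here: one unit per data disk.
    swidth = sunit * data_disks
    return f"sunit={sunit},swidth={swidth}"


# Hypothetical array: 64 KiB stripe unit across 4 data disks.
print(stripe_mount_options(64 * 1024, 4))  # sunit=128,swidth=512
```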
1da177e4 | 205 | |
fc97bbf3 NS |
206 | swalloc |
207 | Data allocations will be rounded up to stripe width boundaries | |
208 | when the current end of file is being extended and the file | |
209 | size is larger than the stripe width size. | |
210 | ||
3e5b7d8b DC |
211 | wsync |
212 | When specified, all filesystem namespace operations are | |
213 | executed synchronously. This ensures that when the namespace | |
214 | operation (create, unlink, etc) completes, the change to the | |
215 | namespace is on stable storage. This is useful in HA setups | |
216 | where failover must not result in clients seeing | |
217 | inconsistent namespace presentation during or after a | |
218 | failover event. | |
219 | ||
220 | ||
221 | Deprecated Mount Options | |
222 | ======================== | |
223 | ||
4cf4573d DC |
224 | Name Removal Schedule |
225 | ---- ---------------- | |
3e5b7d8b | 226 | |
3e5b7d8b | 227 | |
444a7022 ES |
228 | Removed Mount Options |
229 | ===================== | |
3e5b7d8b | 230 | |
444a7022 ES |
231 | Name Removed |
232 | ---- ------- | |
4d66ea09 FL |
233 | delaylog/nodelaylog v4.0 |
234 | ihashsize v4.0 | |
235 | irixsgid v4.0 | |
236 | osyncisdsync/osyncisosync v4.0 | |
1c02d502 ES |
237 | barrier v4.19 |
238 | nobarrier v4.19 | |
3e5b7d8b | 239 | |
fc97bbf3 | 240 | |
1da177e4 LT |
241 | sysctls |
242 | ======= | |
243 | ||
244 | The following sysctls are available for the XFS filesystem: | |
245 | ||
246 | fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) | |
fc97bbf3 | 247 | Setting this to "1" clears accumulated XFS statistics |
1da177e4 | 248 | in /proc/fs/xfs/stat. It then immediately resets to "0". |
fc97bbf3 | 249 | |
1da177e4 | 250 | fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) |
3e5b7d8b DC |
251 | The interval at which the filesystem flushes metadata |
252 | out to disk and runs internal cache cleanup routines. | |
1da177e4 | 253 | |
3e5b7d8b DC |
254 | fs.xfs.filestream_centisecs (Min: 1 Default: 3000 Max: 360000) |
255 | The interval at which the filesystem ages filestreams cache | |
256 | references and returns timed-out AGs back to the free stream | |
257 | pool. | |
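
Several of these sysctls are expressed in centiseconds, which is easy to misread. The sketch below converts such values to seconds and maps a sysctl name to its file under /proc/sys (the standard procfs layout, where dots in the name become path separators). The helper names are hypothetical.

```python
def centisecs_to_seconds(centisecs: int) -> float:
    """Convert a *_centisecs sysctl value to seconds."""
    return centisecs / 100


def sysctl_path(name: str) -> str:
    """Map a sysctl name such as fs.xfs.xfssyncd_centisecs to its procfs file."""
    return "/proc/sys/" + name.replace(".", "/")


# The documented default of 3000 centiseconds is 30 seconds.
print(centisecs_to_seconds(3000))
print(sysctl_path("fs.xfs.xfssyncd_centisecs"))
```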
1da177e4 | 258 | |
3e5b7d8b DC |
259 | fs.xfs.speculative_prealloc_lifetime |
260 | (Units: seconds Min: 1 Default: 300 Max: 86400) | |
261 | The interval at which the background scanning for inodes | |
262 | with unused speculative preallocation runs. The scan | |
263 | removes unused preallocation from clean inodes and releases | |
264 | the unused space back to the free pool. | |
1da177e4 LT |
265 | |
266 | fs.xfs.error_level (Min: 0 Default: 3 Max: 11) | |
267 | A volume knob for error reporting when internal errors occur. | |
268 | This will generate detailed messages & backtraces for filesystem | |
269 | shutdowns, for example. Current threshold values are: | |
270 | ||
271 | XFS_ERRLEVEL_OFF: 0 | |
272 | XFS_ERRLEVEL_LOW: 1 | |
273 | XFS_ERRLEVEL_HIGH: 5 | |
274 | ||
d519da41 | 275 | fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256) |
fc97bbf3 | 276 | Causes certain error conditions to call BUG(). Value is a bitmask; |
de8bd0eb | 277 | OR together the tags which represent errors which should cause panics: |
fc97bbf3 | 278 | |
1da177e4 LT |
279 | XFS_NO_PTAG 0 |
280 | XFS_PTAG_IFLUSH 0x00000001 | |
281 | XFS_PTAG_LOGRES 0x00000002 | |
282 | XFS_PTAG_AILDELETE 0x00000004 | |
283 | XFS_PTAG_ERROR_REPORT 0x00000008 | |
284 | XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010 | |
285 | XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 | |
286 | XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 | |
de8bd0eb | 287 | XFS_PTAG_FSBLOCK_ZERO 0x00000080 |
d519da41 | 288 | XFS_PTAG_VERIFIER_ERROR 0x00000100 |
1da177e4 | 289 | |
fc97bbf3 | 290 | This option is intended for debugging only. |
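
Since panic_mask is a bitmask, the value to write is the bitwise OR of the wanted tags. The sketch below copies the tag values from the table above; the helper names are hypothetical, and it only computes and decodes the number without touching the sysctl.

```python
PTAGS = {
    "XFS_PTAG_IFLUSH": 0x00000001,
    "XFS_PTAG_LOGRES": 0x00000002,
    "XFS_PTAG_AILDELETE": 0x00000004,
    "XFS_PTAG_ERROR_REPORT": 0x00000008,
    "XFS_PTAG_SHUTDOWN_CORRUPT": 0x00000010,
    "XFS_PTAG_SHUTDOWN_IOERROR": 0x00000020,
    "XFS_PTAG_SHUTDOWN_LOGERROR": 0x00000040,
    "XFS_PTAG_FSBLOCK_ZERO": 0x00000080,
    "XFS_PTAG_VERIFIER_ERROR": 0x00000100,
}


def panic_mask(*tag_names: str) -> int:
    """OR together the named tags into a single panic_mask value."""
    mask = 0
    for name in tag_names:
        mask |= PTAGS[name]
    return mask


def decode_panic_mask(mask: int) -> list:
    """List the tag names whose bits are set in mask."""
    return [name for name, bit in PTAGS.items() if mask & bit]


mask = panic_mask("XFS_PTAG_SHUTDOWN_CORRUPT", "XFS_PTAG_SHUTDOWN_IOERROR")
print(mask)                     # 48, i.e. 0x10 | 0x20
print(decode_panic_mask(mask))
```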
1da177e4 LT |
291 | |
292 | fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) | |
293 | Controls whether symlinks are created with mode 0777 (default) | |
294 | or whether their mode is affected by the umask (irix mode). | |
295 | ||
296 | fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) | |
297 | Controls files created in SGID directories. | |
298 | If the group ID of the new file does not match the effective group | |
fc97bbf3 NS |
299 | ID or one of the supplementary group IDs of the parent dir, the |
300 | ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl | |
1da177e4 LT |
301 | is set. |
302 | ||
fc97bbf3 NS |
303 | fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) |
304 | Setting this to "1" will cause the "sync" flag set | |
305 | by the xfs_io(8) chattr command on a directory to be | |
1da177e4 LT |
306 | inherited by files in that directory. |
307 | ||
fc97bbf3 NS |
308 | fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) |
309 | Setting this to "1" will cause the "nodump" flag set | |
310 | by the xfs_io(8) chattr command on a directory to be | |
1da177e4 LT |
311 | inherited by files in that directory. |
312 | ||
fc97bbf3 NS |
313 | fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) |
314 | Setting this to "1" will cause the "noatime" flag set | |
315 | by the xfs_io(8) chattr command on a directory to be | |
1da177e4 | 316 | inherited by files in that directory. |
fc97bbf3 NS |
317 | |
318 | fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1) | |
319 | Setting this to "1" will cause the "nosymlinks" flag set | |
320 | by the xfs_io(8) chattr command on a directory to be | |
321 | inherited by files in that directory. | |
322 | ||
3e5b7d8b DC |
323 | fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1) |
324 | Setting this to "1" will cause the "nodefrag" flag set | |
325 | by the xfs_io(8) chattr command on a directory to be | |
326 | inherited by files in that directory. | |
327 | ||
fc97bbf3 NS |
328 | fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) |
329 | In "inode32" allocation mode, this option determines how many | |
330 | files the allocator attempts to allocate in the same allocation | |
331 | group before moving to the next allocation group. The intent | |
332 | is to control the rate at which the allocator moves between | |
333 | allocation groups when allocating extents for new files. | |
3e5b7d8b DC |
334 | |
335 | Deprecated Sysctls | |
336 | ================== | |
337 | ||
64af7a6e | 338 | None at present. |
3e5b7d8b | 339 | |
3e5b7d8b | 340 | |
64af7a6e DC |
341 | Removed Sysctls |
342 | =============== | |
3e5b7d8b | 343 | |
64af7a6e DC |
344 | Name Removed |
345 | ---- ------- | |
4d66ea09 FL |
346 | fs.xfs.xfsbufd_centisec v4.0 |
347 | fs.xfs.age_buffer_centisecs v4.0 | |
5694fe9a CM |
348 | |
349 | ||
350 | Error handling | |
351 | ============== | |
352 | ||
353 | XFS can act differently according to the type of error found during its | |
354 | operation. The implementation introduces the following concepts to the error | |
355 | handler: | |
356 | ||
357 | -failure speed: | |
358 | Defines how fast XFS should propagate an error upwards when a specific | |
359 | error is found during the filesystem operation. It can propagate | |
360 | immediately, after a defined number of retries, after a set time period, | |
361 | or simply retry forever. | |
362 | ||
363 | -error classes: | |
364 | Specifies the subsystem the error configuration will apply to, such as | |
365 | metadata IO or memory allocation. Different subsystems will have | |
366 | different error handlers for which behaviour can be configured. | |
367 | ||
368 | -error handlers: | |
369 | Defines the behavior for a specific error. | |
370 | ||
371 | The filesystem behavior during an error can be set via sysfs files. Each | |
372 | error handler works independently - the first condition met by an error handler | |
373 | for a specific class will cause the error to be propagated rather than reset and | |
374 | retried. | |
375 | ||
376 | The action taken by the filesystem when the error is propagated is context | |
377 | dependent - it may cause a shut down in the case of an unrecoverable error, | |
378 | it may be reported back to userspace, or it may even be ignored because | |
379 | there's nothing useful we can do with the error or anyone we can report it to (e.g. | |
380 | during unmount). | |
381 | ||
382 | The configuration files are organized into the following hierarchy for each | |
383 | mounted filesystem: | |
384 | ||
385 | /sys/fs/xfs/<dev>/error/<class>/<error>/ | |
386 | ||
387 | Where: | |
388 | <dev> | |
389 | The short device name of the mounted filesystem. This is the same device | |
390 | name that shows up in XFS kernel error messages as "XFS(<dev>): ..." | |
391 | ||
392 | <class> | |
393 | The subsystem the error configuration belongs to. As of 4.9, the defined | |
394 | classes are: | |
395 | ||
396 | - "metadata": applies to metadata buffer write IO | |
397 | ||
398 | <error> | |
399 | The individual error handler configurations. | |
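
The directory layout above can be assembled mechanically from its three components. The sketch below only builds the path and does not touch sysfs; the function name and the "sda1" device name are hypothetical, while "metadata", "default" and "ENODEV" are names that appear in this document.

```python
from pathlib import PurePosixPath


def error_config_dir(dev: str, error_class: str = "metadata",
                     error: str = "default") -> PurePosixPath:
    """Return /sys/fs/xfs/<dev>/error/<class>/<error> for the given names."""
    return PurePosixPath("/sys/fs/xfs") / dev / "error" / error_class / error


# "sda1" is a placeholder for the short device name of a mounted filesystem.
print(error_config_dir("sda1", "metadata", "ENODEV"))
# /sys/fs/xfs/sda1/error/metadata/ENODEV
```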
400 | ||
401 | ||
402 | Each filesystem has "global" error configuration options defined in its top | |
403 | level directory: | |
404 | ||
405 | /sys/fs/xfs/<dev>/error/ | |
406 | ||
407 | fail_at_unmount (Min: 0 Default: 1 Max: 1) | |
408 | Defines the filesystem error behavior at unmount time. | |
409 | ||
410 | If set to a value of 1, XFS will override all other error configurations | |
411 | during unmount and replace them with "immediate fail" characteristics. | |
412 | i.e. no retries, no retry timeout. This will always allow unmount to | |
413 | succeed when there are persistent errors present. | |
414 | ||
415 | If set to 0, the configured retry behaviour will continue until all | |
416 | retries and/or timeouts have been exhausted. This will delay unmount | |
417 | completion when there are persistent errors, and it may prevent the | |
418 | filesystem from ever unmounting fully in the case of "retry forever" | |
419 | handler configurations. | |
420 | ||
806654a9 | 421 | Note: there is no guarantee that fail_at_unmount can be set while an |
5694fe9a CM |
422 | unmount is in progress. It is possible that the sysfs entries are |
423 | removed by the unmounting filesystem before a "retry forever" error | |
424 | handler configuration causes unmount to hang, and hence the filesystem | |
425 | must be configured appropriately before unmount begins to prevent | |
426 | unmount hangs. | |
427 | ||
428 | Each filesystem has specific error class handlers that define the error | |
429 | propagation behaviour for specific errors. There is also a "default" error | |
430 | handler defined, which defines the behaviour for all errors that don't have | |
431 | specific handlers defined. Where multiple retry constraints are configured for | |
432 | a single error, the first retry configuration that expires will cause the error | |
433 | to be propagated. The handler configurations are found in the directory: | |
434 | ||
435 | /sys/fs/xfs/<dev>/error/<class>/<error>/ | |
436 | ||
437 | max_retries (Min: -1 Default: Varies Max: INTMAX) | |
438 | Defines the allowed number of retries of a specific error before | |
439 | the filesystem will propagate the error. The retry count for a given | |
440 | error context (e.g. a specific metadata buffer) is reset every time | |
441 | there is a successful completion of the operation. | |
442 | ||
443 | Setting the value to "-1" will cause XFS to retry forever for this | |
444 | specific error. | |
445 | ||
446 | Setting the value to "0" will cause XFS to fail immediately when the | |
447 | specific error is reported. | |
448 | ||
449 | Setting the value to "N" (where 0 < N < Max) will make XFS retry the | |
450 | operation "N" times before propagating the error. | |
451 | ||
452 | retry_timeout_seconds (Min: -1 Default: Varies Max: 1 day) | |
453 | Defines the amount of time (in seconds) that the filesystem is | |
454 | allowed to retry its operations when the specific error is | |
455 | found. | |
456 | ||
457 | Setting the value to "-1" will allow XFS to retry forever for this | |
458 | specific error. | |
459 | ||
460 | Setting the value to "0" will cause XFS to fail immediately when the | |
461 | specific error is reported. | |
462 | ||
463 | Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the | |
464 | operation for up to "N" seconds before propagating the error. | |
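
The interaction of max_retries and retry_timeout_seconds can be modelled as two independent limits, where the first one reached causes the error to be propagated. This is a simplified model of the behaviour described in this section, not the kernel implementation; the function and parameter names are hypothetical.

```python
RETRY_FOREVER = -1


def should_propagate(retries_done: int, seconds_elapsed: float,
                     max_retries: int, retry_timeout_seconds: int) -> bool:
    """Return True once either configured retry limit has been reached.

    retries_done    -- retries already attempted for this error context
    seconds_elapsed -- time spent retrying since the error was first seen
    """
    retries_exhausted = (max_retries != RETRY_FOREVER
                         and retries_done >= max_retries)
    timeout_expired = (retry_timeout_seconds != RETRY_FOREVER
                       and seconds_elapsed >= retry_timeout_seconds)
    # Whichever limit expires first causes the error to be propagated.
    return retries_exhausted or timeout_expired


# Both limits set to 0: fail immediately.
print(should_propagate(0, 0.0, max_retries=0, retry_timeout_seconds=0))
# Retry forever on both limits: never propagate.
print(should_propagate(10_000, 86_400.0, RETRY_FOREVER, RETRY_FOREVER))
# Retries remain, but the 10 second timeout has expired.
print(should_propagate(3, 10.0, max_retries=5, retry_timeout_seconds=10))
```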
465 | ||
466 | Note: The default behaviour for a specific error handler is dependent on both | |
467 | the class and error context. For example, the default values for | |
468 | "metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults | |
469 | to "fail immediately" behaviour. This is done because ENODEV is a fatal, | |
470 | unrecoverable error no matter how many times the metadata IO is retried. |