.. SPDX-License-Identifier: GPL-2.0

=================================
Network Filesystem Helper Library
=================================

.. Contents:

 - Overview.
 - Per-inode context.
   - Inode context helper functions.
 - Buffered read helpers.
   - Read helper functions.
   - Read helper structures.
   - Read helper operations.
   - Read helper procedure.
   - Read helper cache API.


Overview
========

The network filesystem helper library is a set of functions designed to aid a
network filesystem in implementing VM/VFS operations.  For the moment, that
just includes turning various VM buffered read operations into requests to read
from the server.  The helper library, however, can also interpose other
services, such as local caching or local data encryption.

Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.


Per-Inode Context
=================

The network filesystem helper library needs a place to store a bit of state for
its use on each netfs inode it is helping to manage.  To this end, a context
structure is defined::

	struct netfs_inode {
		struct inode inode;
		const struct netfs_request_ops *ops;
		struct fscache_cookie *cache;
	};

A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``.  This can be done in
a way similar to the following::

	struct my_inode {
		struct netfs_inode netfs; /* Netfslib context and vfs inode */
		...
	};

This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.
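
As a rough illustration of the ``container_of()`` arrangement described above,
the following userspace sketch recovers both the netfslib context and a
filesystem wrapper from a bare inode pointer.  The structures are cut-down
stand-ins rather than the kernel definitions, ``to_netfs_inode()`` and
``to_my_inode()`` are hypothetical names, and the macro omits the type checking
that the kernel's version performs.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; the real ones are far larger. */
struct inode {
	unsigned long i_ino;
};

struct netfs_request_ops;

struct netfs_inode {
	struct inode inode;			/* Embedded VFS inode */
	const struct netfs_request_ops *ops;
};

/* Hypothetical filesystem inode wrapper, as in the example above. */
struct my_inode {
	struct netfs_inode netfs;
	int my_private_state;
};

/* Simplified container_of(); the kernel version adds type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the netfslib context from a bare VFS inode pointer. */
static struct netfs_inode *to_netfs_inode(struct inode *inode)
{
	return container_of(inode, struct netfs_inode, inode);
}

/* Recover the filesystem's own wrapper from the same pointer. */
static struct my_inode *to_my_inode(struct inode *inode)
{
	return container_of(inode, struct my_inode, netfs.inode);
}
```

Because the VFS inode is embedded rather than pointed to, no lookup or extra
allocation is needed to get from one structure to the other.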

The structure contains the following fields:

 * ``inode``

   The VFS inode structure.

 * ``ops``

   The set of operations provided by the network filesystem to netfslib.

 * ``cache``

   Local caching cookie, or NULL if no caching is enabled.  This field does not
   exist if fscache is disabled.


Inode Context Helper Functions
------------------------------

To help deal with the per-inode context, a number of helper functions are
provided.  Firstly, a function to perform basic initialisation on a context and
set the operations table pointer::

	void netfs_inode_init(struct netfs_inode *ctx,
			      const struct netfs_request_ops *ops);

then a function to cast from the VFS inode structure to the netfs context::

	struct netfs_inode *netfs_inode(struct inode *inode);

and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::

	struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);


Buffered Read Helpers
=====================

The library provides a set of read helpers that handle the ->read_folio(),
->readahead() and much of the ->write_begin() VM operations and translate them
into a common call framework.

The following services are provided:

 * Handle folios that span multiple pages.

 * Insulate the netfs from VM interface changes.

 * Allow the netfs to arbitrarily split reads up into pieces, even ones that
   don't match folio sizes or folio alignments and that may cross folios.

 * Allow the netfs to expand a readahead request in both directions to meet its
   needs.

 * Allow the netfs to partially fulfil a read, which will then be resubmitted.

 * Handle local caching, allowing cached data and server-read data to be
   interleaved for a single request.

 * Handle clearing of bufferage that isn't on the server.

 * Handle retrying of reads that failed, switching reads from the cache to the
   server as necessary.

 * In the future, other services can be performed here, such as local
   encryption of data to be stored remotely or in the cache.

From the network filesystem, the helpers require a table of operations.  This
includes a mandatory method to issue a read operation along with a number of
optional methods.


Read Helper Functions
---------------------

Three read helpers are provided::

	void netfs_readahead(struct readahead_control *ractl);
	int netfs_read_folio(struct file *file,
			     struct folio *folio);
	int netfs_write_begin(struct netfs_inode *ctx,
			      struct file *file,
			      struct address_space *mapping,
			      loff_t pos,
			      unsigned int len,
			      struct folio **_folio,
			      void **_fsdata);

Each corresponds to a VM address space operation.  These operations use the
state in the per-inode context.

For ->readahead() and ->read_folio(), the network filesystem just points
directly at the corresponding read helper; whereas for ->write_begin(), it may
be a little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired folio if
an error occurs after calling the helper.

The helpers manage the read request, calling back into the network filesystem
through the supplied table of operations.  Waits will be performed as
necessary before returning for helpers that are meant to be synchronous.

If an error occurs, ->free_request() will be called to clean up the
netfs_io_request struct allocated.  If some parts of the request are in
progress when an error occurs, the request will get partially completed if
sufficient data is read.

Additionally, there is::

	void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
				     ssize_t transferred_or_error,
				     bool was_async);

which should be called to complete a read subrequest.  This is given the number
of bytes transferred or a negative error code, plus a flag indicating whether
the operation was asynchronous (i.e. whether the follow-on processing can be
done in the current context, given this may involve sleeping).
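
The ``transferred_or_error`` convention can be illustrated with a small
userspace model.  This is only a sketch of how a single value can carry either
a byte count or a negative error; ``struct model_subreq``,
``enum model_outcome`` and ``model_terminated()`` are hypothetical names, and
the kernel's own handling of completion is more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical, cut-down model of a subrequest; not the kernel struct. */
struct model_subreq {
	size_t len;		/* Length of the slice */
	size_t transferred;	/* Bytes completed so far */
	int error;		/* 0 or a negative errno-style code */
};

enum model_outcome { MODEL_DONE, MODEL_SHORT, MODEL_FAILED };

/*
 * Interpret a transferred_or_error value: a negative value is a failure,
 * anything else is a byte count to add to the running total.
 */
static enum model_outcome model_terminated(struct model_subreq *subreq,
					   ssize_t transferred_or_error)
{
	if (transferred_or_error < 0) {
		subreq->error = (int)transferred_or_error;
		return MODEL_FAILED;
	}

	subreq->transferred += (size_t)transferred_or_error;
	if (subreq->transferred > subreq->len)
		subreq->transferred = subreq->len;	/* Defensive clamp */

	return subreq->transferred == subreq->len ? MODEL_DONE : MODEL_SHORT;
}
```

A ``MODEL_SHORT`` result corresponds to the case where a read would be issued
again for the remainder of the slice.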


Read Helper Structures
----------------------

The read helpers make use of a couple of structures to maintain the state of
the read.  The first is a structure that manages a read request as a whole::

	struct netfs_io_request {
		struct inode *inode;
		struct address_space *mapping;
		struct netfs_cache_resources cache_resources;
		void *netfs_priv;
		loff_t start;
		size_t len;
		loff_t i_size;
		const struct netfs_request_ops *netfs_ops;
		unsigned int debug_id;
		...
	};

The above fields are the ones the netfs can use.  They are:

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from.  The mapping
   may or may not point to inode->i_data.

 * ``cache_resources``

   Resources for the local cache to use, if present.

 * ``netfs_priv``

   The network filesystem's private data.  The value for this can be passed in
   to the helper functions or set during the request.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length.  These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``netfs_ops``

   A pointer to the operation table.  The value for this is passed into the
   helper functions.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.


The second structure is used to manage individual slices of the overall read
request::

	struct netfs_io_subrequest {
		struct netfs_io_request *rreq;
		loff_t start;
		size_t len;
		size_t transferred;
		unsigned long flags;
		unsigned short debug_index;
		...
	};

Each subrequest is expected to access a single source, though the helpers will
handle falling back from one source type to another.  The members are:

 * ``rreq``

   A pointer to the read request.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far of the length of this slice.  The
   network filesystem or cache should start the operation this far into the
   slice.  If a short read occurs, the helpers will call again, having updated
   this to reflect the amount read so far.

 * ``flags``

   Flags pertaining to the read.  There are two of interest to the filesystem
   or cache:

   * ``NETFS_SREQ_CLEAR_TAIL``

     This can be set to indicate that the remainder of the slice, from
     transferred to len, should be cleared.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint to the cache that it might want to try skipping ahead to
     the next data (i.e. using SEEK_DATA).

 * ``debug_index``

   A number allocated to this slice that can be displayed in trace lines for
   reference.
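
The relationship between ``start``, ``len``, ``transferred`` and the clear-tail
flag described above can be sketched in userspace as follows.  The structure
is a cut-down stand-in, ``MODEL_SREQ_CLEAR_TAIL`` is a made-up flag value and
the ``model_*()`` helpers are hypothetical; the sketch only shows the
arithmetic, not how the kernel helpers are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag value and cut-down subrequest for illustration only. */
#define MODEL_SREQ_CLEAR_TAIL 0x1UL

struct model_subreq {
	long long start;	/* File position of the slice */
	size_t len;		/* Length of the slice */
	size_t transferred;	/* Bytes already read into the slice */
	unsigned long flags;
};

/* Where the next (re)issued read should begin in the file. */
static long long model_resume_pos(const struct model_subreq *subreq)
{
	return subreq->start + (long long)subreq->transferred;
}

/* How many bytes of the slice are still outstanding. */
static size_t model_remaining(const struct model_subreq *subreq)
{
	return subreq->len - subreq->transferred;
}

/*
 * If the clear-tail flag is set, zero the unread part of the slice's buffer
 * and treat the slice as complete.  Returns the number of bytes cleared.
 */
static size_t model_clear_tail(struct model_subreq *subreq, unsigned char *buf)
{
	size_t cleared = 0;

	if (subreq->flags & MODEL_SREQ_CLEAR_TAIL) {
		cleared = model_remaining(subreq);
		memset(buf + subreq->transferred, 0, cleared);
		subreq->transferred = subreq->len;
	}
	return cleared;
}
```

Here ``buf`` stands for the memory backing the whole slice, so offsets into it
match offsets from ``start``.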


Read Helper Operations
----------------------

The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::

	struct netfs_request_ops {
		void (*init_request)(struct netfs_io_request *rreq, struct file *file);
		void (*free_request)(struct netfs_io_request *rreq);
		int (*begin_cache_operation)(struct netfs_io_request *rreq);
		void (*expand_readahead)(struct netfs_io_request *rreq);
		bool (*clamp_length)(struct netfs_io_subrequest *subreq);
		void (*issue_read)(struct netfs_io_subrequest *subreq);
		bool (*is_still_valid)(struct netfs_io_request *rreq);
		int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
					 struct folio **foliop, void **_fsdata);
		void (*done)(struct netfs_io_request *rreq);
	};

The operations are as follows:

 * ``init_request()``

   [Optional] This is called to initialise the request structure.  It is given
   the file for reference.

 * ``free_request()``

   [Optional] This is called as the request is being deallocated so that the
   filesystem can clean up any state it has attached there.

 * ``begin_cache_operation()``

   [Optional] This is called to ask the network filesystem to call into the
   cache (if present) to initialise the caching state for this read.  The netfs
   library module cannot access the cache directly, so the network filesystem
   should call something like fscache_begin_read_operation() to do this.

   The cache gets to store its state in ->cache_resources and must set a table
   of operations of its own there (though of a different type).

   This should return 0 on success and an error code otherwise.  If an error is
   reported, the operation may proceed anyway, just without local caching (only
   out of memory and interruption errors cause failure here).

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead read request.  The filesystem gets to expand the request in both
   directions, though it's not permitted to reduce it as the numbers may
   represent an allocation already made.  If local caching is enabled, it gets
   to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure.  Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``clamp_length()``

   [Optional] This is called to allow the filesystem to reduce the size of a
   subrequest.  The filesystem can use this, for example, to chop up a request
   that has to be split across multiple servers or to put multiple reads in
   flight.

   This should return true on success and false on failure.

 * ``issue_read()``

   [Required] The helpers use this to dispatch a subrequest to the server for
   reading.  In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server.

   There is no return value; the netfs_subreq_terminated() function should be
   called to indicate whether or not the operation succeeded and how much data
   it transferred.  The filesystem also should not deal with setting folios
   uptodate, unlocking them or dropping their refs - the helpers need to deal
   with this as they have to coordinate with copying to the local cache.

   Note that the helpers have the folios locked, but not pinned.  It is
   possible to use the ITER_XARRAY iov iterator to refer to the range of the
   inode that is being operated upon without the need to allocate large bvec
   tables.

 * ``is_still_valid()``

   [Optional] This is called to find out if the data just read from the local
   cache is still valid.  It should return true if it is still valid and false
   if not.  If it's not still valid, it will be reread from the server.

 * ``check_write_begin()``

   [Optional] This is called from the netfs_write_begin() helper once it has
   allocated/grabbed the folio to be modified to allow the filesystem to flush
   conflicting state before allowing it to be modified.

   It may unlock and discard the folio it was given and set the caller's folio
   pointer to NULL.  It should return 0 if everything is now fine (``*foliop``
   left set) or if the op should be retried (``*foliop`` cleared), and any
   other error code to abort the operation.

 * ``done()``

   [Optional] This is called after the folios in the request have all been
   unlocked (and marked uptodate if applicable).
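
To make the ``expand_readahead()`` and ``clamp_length()`` descriptions above
more concrete, here is a userspace sketch of the arithmetic a filesystem might
perform.  The structures are cut-down stand-ins, the ``granule`` and ``stripe``
parameters are hypothetical filesystem-chosen sizes (the real methods take only
the request or subrequest pointer), and returning true for success is an
assumption made to match the ``bool`` return type in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the start/len fields of the real structures. */
struct model_request {
	long long start;
	size_t len;
};

struct model_subreq {
	long long start;
	size_t len;
};

/*
 * Expand a request outwards so that both ends fall on multiples of
 * 'granule' (non-zero).  The request is only ever grown, never shrunk,
 * and start is assumed to be non-negative.
 */
static void model_expand_readahead(struct model_request *rreq, size_t granule)
{
	long long g = (long long)granule;
	long long end = rreq->start + (long long)rreq->len;
	long long new_start = rreq->start - (rreq->start % g);
	long long new_end = ((end + g - 1) / g) * g;

	/* len grows by at least as much as start was reduced. */
	rreq->len = (size_t)(new_end - new_start);
	rreq->start = new_start;
}

/*
 * Shorten a subrequest so that it does not cross the next 'stripe'
 * boundary (non-zero), e.g. where the data moves to a different server.
 */
static bool model_clamp_length(struct model_subreq *subreq, size_t stripe)
{
	size_t offset = (size_t)(subreq->start % (long long)stripe);
	size_t to_boundary = stripe - offset;

	if (subreq->len > to_boundary)
		subreq->len = to_boundary;
	return true;
}
```

For instance, a 3000-byte request at offset 5000 expanded to 4096-byte granules
becomes a 4096-byte request at offset 4096: ``start`` drops by 904 while
``len`` grows by 1096.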


Read Helper Procedure
---------------------

The read helpers work by the following general procedure:

 * Set up the request.

 * For readahead, allow the local cache and then the network filesystem to
   propose expansions to the read request.  This is then proposed to the VM.
   If the VM cannot fully perform the expansion, a partially expanded read will
   be performed, though this may not get written to the cache in its entirety.

 * Loop around slicing chunks off of the request to form subrequests:

   * If a local cache is present, it gets to do the slicing, otherwise the
     helpers just try to generate maximal slices.

   * The network filesystem gets to clamp the size of each slice if it is to be
     the source.  This allows rsize and chunking to be implemented.

   * The helpers issue a read from the cache or a read from the server or just
     clear the slice as appropriate.

   * The next slice begins at the end of the last one.

   * As slices finish being read, they terminate.

 * When all the subrequests have terminated, the subrequests are assessed and
   any that are short or have failed are reissued:

   * Failed cache requests are issued against the server instead.

   * Failed server requests just fail.

   * Short reads against either source will be reissued against that source
     provided they have transferred some more data:

     * The cache may need to skip holes that it can't do DIO from.

     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
       end of the slice instead of reissuing.

 * Once the data is read, the folios that have been fully read/cleared:

   * Will be marked uptodate.

   * If a cache is present, will be marked with PG_fscache.

   * Will be unlocked.

 * Any folios that need writing to the cache will then have DIO writes issued.

 * Synchronous operations will wait for reading to be complete.

 * Writes to the cache will proceed asynchronously and the folios will have the
   PG_fscache mark removed when that completes.

 * The request structures will be cleaned up when everything has completed.
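
The slicing loop in the procedure above can be modelled in userspace along the
following lines.  This is a simplified sketch, not the helpers' actual code:
``struct model_slice`` and ``model_slice_request()`` are hypothetical, and a
single ``max_slice`` limit stands in for whatever clamping the cache and the
network filesystem would apply.

```c
#include <assert.h>
#include <stddef.h>

/* One slice of the overall request. */
struct model_slice {
	long long start;
	size_t len;
};

/*
 * Cut a request of 'len' bytes starting at 'start' into slices no larger
 * than 'max_slice' (standing in for an rsize-style clamp; must be non-zero).
 * At most 'max_slices' entries are written; the number produced is returned.
 */
static size_t model_slice_request(long long start, size_t len,
				  size_t max_slice,
				  struct model_slice *slices,
				  size_t max_slices)
{
	size_t n = 0;

	while (len > 0 && n < max_slices) {
		size_t part = len < max_slice ? len : max_slice;

		slices[n].start = start;
		slices[n].len = part;
		n++;

		/* The next slice begins at the end of the last one. */
		start += (long long)part;
		len -= part;
	}
	return n;
}
```

A 10000-byte request clamped to 4096 bytes per slice yields slices of 4096,
4096 and 1808 bytes at offsets 0, 4096 and 8192.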


Read Helper Cache API
---------------------

When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.

The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work.  If using fscache, for
example, the network filesystem would call::

	int fscache_begin_read_operation(struct netfs_io_request *rreq,
					 struct fscache_cookie *cookie);

passing in the request pointer and the cookie corresponding to the file.

The netfs_io_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops *ops;
		void *cache_priv;
		void *cache_priv2;
	};

This contains an operations table pointer and two private pointers.  The
operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);

		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);

		enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
						     loff_t i_size);

		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);

		int (*prepare_write)(struct netfs_cache_resources *cres,
				     loff_t *_start, size_t *_len, loff_t i_size,
				     bool no_space_allocated_yet);

		int (*write)(struct netfs_cache_resources *cres,
			     loff_t start_pos,
			     struct iov_iter *iter,
			     netfs_io_terminated_t term_func,
			     void *term_func_priv);

		int (*query_occupancy)(struct netfs_cache_resources *cres,
				       loff_t start, size_t len, size_t granularity,
				       loff_t *_data_start, size_t *_data_len);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);
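
The way a termination handler of this shape is driven can be sketched in
userspace as below.  Only the typedef is taken from the text; ``struct
model_result``, ``model_read_done()`` and ``model_cache_read()`` are
hypothetical, and the toy read works on a flat memory buffer rather than an
``iov_iter``.  It completes inline, so it passes false for ``was_async``.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Same shape as the typedef above. */
typedef void (*netfs_io_terminated_t)(void *priv,
				      ssize_t transferred_or_error,
				      bool was_async);

/* Hypothetical record of what the handler was told. */
struct model_result {
	ssize_t transferred_or_error;
	bool was_async;
	int calls;
};

/* A termination handler that just records its arguments. */
static void model_read_done(void *priv, ssize_t transferred_or_error,
			    bool was_async)
{
	struct model_result *res = priv;

	res->transferred_or_error = transferred_or_error;
	res->was_async = was_async;
	res->calls++;
}

/*
 * Toy synchronous "cache read": copy from an in-memory backing buffer and
 * report the outcome through term_func rather than the return value.
 */
static int model_cache_read(const char *backing, size_t backing_len,
			    size_t pos, char *dst, size_t len,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv)
{
	ssize_t ret;

	if (pos >= backing_len) {
		ret = -ENODATA;		/* Nothing stored at this position */
	} else {
		size_t avail = backing_len - pos;
		size_t n = len < avail ? len : avail;

		memcpy(dst, backing + pos, n);
		ret = (ssize_t)n;
	}

	if (term_func)
		term_func(term_func_priv, ret, false);	/* Completed inline */
	return 0;
}
```

The private pointer lets the caller route the completion back to whatever
object issued the read, which in the helpers' case would be the subrequest.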

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a netfs_readahead() operation to allow
   the cache to expand a request in either direction.  This allows the cache to
   size the request appropriately for the cache granularity.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request.  ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache.  The start file offset is given
   along with an iterator to read to, which gives the length also.  It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write()``

   [Required] Called to prepare a write to the cache to take place.  This
   involves checking to see whether the cache has sufficient space to honour
   the write.  ``*_start`` and ``*_len`` indicate the region to be written; the
   region can be shrunk or it can be expanded to a page boundary either way as
   necessary to align for direct I/O.  i_size holds the size of the object and
   is provided for reference.  no_space_allocated_yet is set to true if the
   caller is certain that no data has been written to that region - for example
   if it tried to do a read from there already.

 * ``write()``

   [Required] Called to write to the cache.  The start file offset is given
   along with an iterator to write from, which gives the length also.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``query_occupancy()``

   [Required] Called to find out where the next piece of data is within a
   particular region of the cache.  The start and length of the region to be
   queried are passed in, along with the granularity to which the answer needs
   to be aligned.  The function passes back the start and length of the data,
   if any, available within that region.  Note that there may be a hole at the
   front.

   It returns 0 if some data was found, -ENODATA if there was no usable data
   within the region or -ENOBUFS if there is no caching on this file.
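
As an illustration of what ``prepare_read()`` and ``query_occupancy()`` are
asked to decide, the following userspace sketch models a cache as one flag per
fixed-size granule.  Everything here is an assumption made for the example:
``struct model_cache``, ``struct model_subreq`` and the ``model_*()`` functions
are hypothetical, the enum reuses the names listed above with arbitrary values,
and the separate granularity argument of the real ``query_occupancy()`` is
left out by assuming all offsets are granule-aligned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Names follow the list above; the numeric values are arbitrary here. */
enum netfs_io_source {
	NETFS_FILL_WITH_ZEROES,
	NETFS_DOWNLOAD_FROM_SERVER,
	NETFS_READ_FROM_CACHE,
	NETFS_INVALID_READ,
};

/* Cut-down stand-in for the start/len fields of a subrequest. */
struct model_subreq {
	long long start;
	size_t len;
};

/* Hypothetical cache map: one flag per fixed-size granule of the file. */
struct model_cache {
	const unsigned char *present;	/* Non-zero if granule i is cached */
	size_t nr_granules;
	size_t granule;			/* Granule size in bytes, non-zero */
};

/* Trim the slice to the end of its granule and pick a source for it. */
static enum netfs_io_source model_prepare_read(const struct model_cache *cache,
					       struct model_subreq *subreq,
					       long long i_size)
{
	size_t idx, room;

	if (subreq->len == 0 || subreq->start < 0)
		return NETFS_INVALID_READ;
	if (subreq->start >= i_size)
		return NETFS_FILL_WITH_ZEROES;	/* Nothing to fetch past EOF */

	idx = (size_t)(subreq->start / (long long)cache->granule);
	room = cache->granule -
		(size_t)(subreq->start % (long long)cache->granule);
	if (subreq->len > room)
		subreq->len = room;

	if (idx < cache->nr_granules && cache->present[idx])
		return NETFS_READ_FROM_CACHE;
	return NETFS_DOWNLOAD_FROM_SERVER;
}

/*
 * Find the first run of cached data in [start, start + len), assuming both
 * are multiples of the granule size.  Returns 0 and fills in the outputs if
 * data was found, -ENODATA if not, or -ENOBUFS if there is no cache.
 */
static int model_query_occupancy(const struct model_cache *cache,
				 long long start, size_t len,
				 long long *_data_start, size_t *_data_len)
{
	size_t first, last, i, j;

	if (!cache)
		return -ENOBUFS;

	first = (size_t)(start / (long long)cache->granule);
	last = first + len / cache->granule;
	if (last > cache->nr_granules)
		last = cache->nr_granules;

	/* Skip any hole at the front of the region. */
	for (i = first; i < last && !cache->present[i]; i++)
		;
	if (i >= last)
		return -ENODATA;

	/* Measure the contiguous run of data that follows. */
	for (j = i; j < last && cache->present[j]; j++)
		;

	*_data_start = (long long)(i * cache->granule);
	*_data_len = (j - i) * cache->granule;
	return 0;
}
```

With granules ``{1, 0, 1, 1}`` of 4096 bytes each, querying the 12288 bytes
from offset 4096 skips the hole in granule 1 and reports 8192 bytes of data
starting at offset 8192.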

Note that these methods are passed a pointer to the cache resource structure,
not the read request structure as they could be used in other situations where
there isn't a read request structure as well, such as writing dirty data to the
cache.


API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
.. kernel-doc:: fs/netfs/io.c