.. SPDX-License-Identifier: GPL-2.0

=================================
Network Filesystem Helper Library
=================================

.. Contents:

 - Overview.
 - Buffered read helpers.
   - Read helper functions.
   - Read helper structures.
   - Read helper operations.
   - Read helper procedure.
   - Read helper cache API.


Overview
========

The network filesystem helper library is a set of functions designed to aid a
network filesystem in implementing VM/VFS operations. For the moment, that
just includes turning various VM buffered read operations into requests to read
from the server. The helper library, however, can also interpose other
services, such as local caching or local data encryption.

Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.


Buffered Read Helpers
=====================

The library provides a set of read helpers that handle the ->readpage(),
->readahead() and much of the ->write_begin() VM operations and translate them
into a common call framework.

The following services are provided:

 * Handle folios that span multiple pages.

 * Insulate the netfs from VM interface changes.

 * Allow the netfs to arbitrarily split reads up into pieces, even ones that
   don't match folio sizes or folio alignments and that may cross folios.

 * Allow the netfs to expand a readahead request in both directions to meet its
   needs.

 * Allow the netfs to partially fulfil a read, which will then be resubmitted.

 * Handle local caching, allowing cached data and server-read data to be
   interleaved for a single request.

 * Handle clearing of buffer regions that aren't on the server.

 * Handle retrying of reads that failed, switching reads from the cache to the
   server as necessary.

 * In the future, this is a place where other services could be performed, such
   as local encryption of data to be stored remotely or in the cache.

From the network filesystem, the helpers require a table of operations. This
includes a mandatory method to issue a read operation along with a number of
optional methods.


Read Helper Functions
---------------------

Three read helpers are provided::

	void netfs_readahead(struct readahead_control *ractl,
			     const struct netfs_read_request_ops *ops,
			     void *netfs_priv);
	int netfs_readpage(struct file *file,
			   struct folio *folio,
			   const struct netfs_read_request_ops *ops,
			   void *netfs_priv);
	int netfs_write_begin(struct file *file,
			      struct address_space *mapping,
			      loff_t pos,
			      unsigned int len,
			      unsigned int flags,
			      struct folio **_folio,
			      void **_fsdata,
			      const struct netfs_read_request_ops *ops,
			      void *netfs_priv);

Each corresponds to a VM operation, with the addition of a couple of parameters
for the use of the read helpers:

 * ``ops``

   A table of operations through which the helpers can talk to the filesystem.

 * ``netfs_priv``

   Filesystem private data (can be NULL).

Both of these values will be stored into the read request structure.

For ->readahead() and ->readpage(), the network filesystem should just jump
into the corresponding read helper; whereas for ->write_begin(), it may be a
little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired folio if
an error occurs after calling the helper.

The helpers manage the read request, calling back into the network filesystem
through the supplied table of operations. Waits will be performed as
necessary before returning for helpers that are meant to be synchronous.

If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to
deal with it. If some parts of the request are in progress when an error
occurs, the request will get partially completed if sufficient data is read.

Additionally, there is::

	void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
				     ssize_t transferred_or_error,
				     bool was_async);

which should be called to complete a read subrequest. This is given the number
of bytes transferred or a negative error code, plus a flag indicating whether
the operation was asynchronous (i.e. whether the follow-on processing can be
done in the current context, given this may involve sleeping).
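
The completion convention described above can be modelled in userspace. The
following is only a sketch under stated assumptions: ``model_subreq``,
``model_subreq_terminated()`` and ``model_issue_op()`` are hypothetical
stand-ins, not the kernel definitions, and the "server" is just a byte array
that returns at most ``max_reply`` bytes per call. It illustrates that the
issuing function has no return value and reports either a byte count or a
negative error code through the completion call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical stand-in for the subrequest fields used by this model. */
struct model_subreq {
	long long start;	/* file position of the slice */
	size_t len;		/* length of the slice */
	size_t transferred;	/* bytes already read into the slice */
	int error;		/* last error recorded by the model, 0 if none */
	char *buf;		/* hypothetical destination, at least len bytes */
};

/* Model of the completion call: record the bytes read or the error. */
static void model_subreq_terminated(struct model_subreq *subreq,
				    ssize_t transferred_or_error,
				    bool was_async)
{
	(void)was_async;	/* the model does all follow-on work inline */
	if (transferred_or_error < 0) {
		subreq->error = (int)transferred_or_error;
		return;
	}
	assert((size_t)transferred_or_error <= subreq->len - subreq->transferred);
	subreq->transferred += (size_t)transferred_or_error;
}

/*
 * Model of an issue function.  It resumes at start + transferred and reports
 * the outcome only through model_subreq_terminated().
 */
static void model_issue_op(struct model_subreq *subreq,
			   const char *server, size_t server_size,
			   size_t max_reply)
{
	long long pos = subreq->start + (long long)subreq->transferred;
	size_t want = subreq->len - subreq->transferred;
	size_t n;

	if (pos < 0 || (size_t)pos > server_size) {
		model_subreq_terminated(subreq, -EIO, false);
		return;
	}
	n = server_size - (size_t)pos;
	if (n > want)
		n = want;
	if (n > max_reply)
		n = max_reply;
	memcpy(subreq->buf + subreq->transferred, server + pos, n);
	model_subreq_terminated(subreq, (ssize_t)n, false);
}
```

Calling ``model_issue_op()`` a second time on the same subrequest continues
from where the first short reply stopped, because the read position is derived
from ``transferred`` rather than stored separately.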


Read Helper Structures
----------------------

The read helpers make use of a couple of structures to maintain the state of
the read. The first is a structure that manages a read request as a whole::

	struct netfs_read_request {
		struct inode		*inode;
		struct address_space	*mapping;
		struct netfs_cache_resources cache_resources;
		void			*netfs_priv;
		loff_t			start;
		size_t			len;
		loff_t			i_size;
		const struct netfs_read_request_ops *netfs_ops;
		unsigned int		debug_id;
		...
	};

The above fields are the ones the netfs can use. They are:

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from. The mapping
   may or may not point to inode->i_data.

 * ``cache_resources``

   Resources for the local cache to use, if present.

 * ``netfs_priv``

   The network filesystem's private data. The value for this can be passed in
   to the helper functions or set during the request. The ->cleanup() op will
   be called if this is non-NULL at the end.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length. These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``netfs_ops``

   A pointer to the operation table. The value for this is passed into the
   helper functions.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.


The second structure is used to manage individual slices of the overall read
request::

	struct netfs_read_subrequest {
		struct netfs_read_request *rreq;
		loff_t			start;
		size_t			len;
		size_t			transferred;
		unsigned long		flags;
		unsigned short		debug_index;
		...
	};

Each subrequest is expected to access a single source, though the helpers will
handle falling back from one source type to another. The members are:

 * ``rreq``

   A pointer to the read request.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far of the length of this slice. The
   network filesystem or cache should start the operation this far into the
   slice. If a short read occurs, the helpers will call again, having updated
   this to reflect the amount read so far.

 * ``flags``

   Flags pertaining to the read. There are two of interest to the filesystem
   or cache:

   * ``NETFS_SREQ_CLEAR_TAIL``

     This can be set to indicate that the remainder of the slice, from
     transferred to len, should be cleared.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint to the cache that it might want to try skipping ahead to
     the next data (i.e. using SEEK_DATA).

 * ``debug_index``

   A number allocated to this slice that can be displayed in trace lines for
   reference.
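
How ``transferred`` and the clear-tail flag interact after a short read can be
sketched as a small userspace model. Everything named ``model_*`` or
``MODEL_*`` here is a hypothetical stand-in rather than the kernel code, and
the decision rule is only the one stated in this document: a short read is
reissued if it made progress, unless the tail is to be cleared instead. The
model leaves ``transferred`` as the count of bytes actually read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SREQ_CLEAR_TAIL 0x1UL	/* stand-in for NETFS_SREQ_CLEAR_TAIL */

/* Hypothetical stand-in for the parts of a slice that this model needs. */
struct model_slice {
	size_t len;		/* length of the slice */
	size_t transferred;	/* bytes read so far */
	unsigned long flags;
	unsigned char *buf;	/* hypothetical buffer backing the slice */
};

enum model_next { MODEL_DONE, MODEL_REISSUE, MODEL_FAILED };

/* Decide what happens to a slice after a read of 'got' bytes has terminated. */
static enum model_next model_assess(struct model_slice *s, size_t got)
{
	assert(got <= s->len - s->transferred);
	s->transferred += got;

	if (s->transferred == s->len)
		return MODEL_DONE;	/* the slice is complete */

	if (s->flags & MODEL_SREQ_CLEAR_TAIL) {
		/* Clear from transferred to len rather than reading again. */
		memset(s->buf + s->transferred, 0, s->len - s->transferred);
		return MODEL_DONE;
	}

	/* A short read is only worth reissuing if it made some progress. */
	return got > 0 ? MODEL_REISSUE : MODEL_FAILED;
}
```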


Read Helper Operations
----------------------

The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::

	struct netfs_read_request_ops {
		void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
		bool (*is_cache_enabled)(struct inode *inode);
		int (*begin_cache_operation)(struct netfs_read_request *rreq);
		void (*expand_readahead)(struct netfs_read_request *rreq);
		bool (*clamp_length)(struct netfs_read_subrequest *subreq);
		void (*issue_op)(struct netfs_read_subrequest *subreq);
		bool (*is_still_valid)(struct netfs_read_request *rreq);
		int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
					 struct folio *folio, void **_fsdata);
		void (*done)(struct netfs_read_request *rreq);
		void (*cleanup)(struct address_space *mapping, void *netfs_priv);
	};

The operations are as follows:

 * ``init_rreq()``

   [Optional] This is called to initialise the request structure. It is given
   the file for reference and can modify the ->netfs_priv value.

 * ``is_cache_enabled()``

   [Required] This is called by netfs_write_begin() to ask if the file is being
   cached. It should return true if it is being cached and false otherwise.

 * ``begin_cache_operation()``

   [Optional] This is called to ask the network filesystem to call into the
   cache (if present) to initialise the caching state for this read. The netfs
   library module cannot access the cache directly, so the network filesystem
   should call something like fscache_begin_read_operation() to do this.

   The cache gets to store its state in ->cache_resources and must set a table
   of operations of its own there (though of a different type).

   This should return 0 on success and an error code otherwise. If an error is
   reported, the operation may proceed anyway, just without local caching (only
   out of memory and interruption errors cause failure here).

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead read request. The filesystem gets to expand the request in both
   directions, though it's not permitted to reduce it as the numbers may
   represent an allocation already made. If local caching is enabled, it gets
   to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure. Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``clamp_length()``

   [Optional] This is called to allow the filesystem to reduce the size of a
   subrequest. The filesystem can use this, for example, to chop up a request
   that has to be split across multiple servers or to put multiple reads in
   flight.

   In line with the bool return type in the table above, this should return
   true on success and false on error.

 * ``issue_op()``

   [Required] The helpers use this to dispatch a subrequest to the server for
   reading. In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server.

   There is no return value; the netfs_subreq_terminated() function should be
   called to indicate whether or not the operation succeeded and how much data
   it transferred. The filesystem also should not deal with setting folios
   uptodate, unlocking them or dropping their refs - the helpers need to deal
   with this as they have to coordinate with copying to the local cache.

   Note that the helpers have the folios locked, but not pinned. It is
   possible to use the ITER_XARRAY iov iterator to refer to the range of the
   inode that is being operated upon without the need to allocate large bvec
   tables.

 * ``is_still_valid()``

   [Optional] This is called to find out if the data just read from the local
   cache is still valid. It should return true if it is still valid and false
   if not. If it's not still valid, it will be reread from the server.

 * ``check_write_begin()``

   [Optional] This is called from the netfs_write_begin() helper once it has
   allocated/grabbed the folio to be modified to allow the filesystem to flush
   conflicting state before allowing it to be modified.

   It should return 0 if everything is now fine, -EAGAIN if the folio should be
   regrabbed and any other error code to abort the operation.

 * ``done()``

   [Optional] This is called after the folios in the request have all been
   unlocked (and marked uptodate if applicable).

 * ``cleanup()``

   [Optional] This is called as the request is being deallocated so that the
   filesystem can clean up ->netfs_priv.
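
The constraint on ->expand_readahead() (the request may grow in both
directions, and the length must increase by at least as much as the start is
reduced) can be checked with a little arithmetic. The following userspace
sketch is an illustration only: ``model_rreq``, ``model_expand_readahead()``,
``model_expansion_ok()`` and the granule size passed in are assumptions made
for the example, not part of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the start/len pair of a read request. */
struct model_rreq {
	long long start;	/* stands in for loff_t */
	size_t len;
};

/* Round the request outwards so that both ends sit on a granule boundary. */
static void model_expand_readahead(struct model_rreq *rreq, size_t granule)
{
	long long g = (long long)granule;
	long long end;

	assert(granule > 0 && rreq->start >= 0);
	end = rreq->start + (long long)rreq->len;
	rreq->start -= rreq->start % g;		/* move the start down */
	end = (end + g - 1) / g * g;		/* move the end up */
	rreq->len = (size_t)(end - rreq->start);
}

/*
 * Check the rule from the text: the start may only move down, and the length
 * must grow by at least the amount the start moved.
 */
static bool model_expansion_ok(const struct model_rreq *before,
			       const struct model_rreq *after)
{
	long long moved;

	if (after->start > before->start)
		return false;
	moved = before->start - after->start;
	return after->len >= before->len + (size_t)moved;
}
```

For instance, a request for 3000 bytes at position 5000 rounded out to a
4096-byte granule becomes 4096 bytes at position 4096: the start moved down by
904 and the length grew by 1096, so the original range is still covered.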


Read Helper Procedure
---------------------

The read helpers work by the following general procedure:

 * Set up the request.

 * For readahead, allow the local cache and then the network filesystem to
   propose expansions to the read request. This is then proposed to the VM.
   If the VM cannot fully perform the expansion, a partially expanded read will
   be performed, though this may not get written to the cache in its entirety.

 * Loop around slicing chunks off of the request to form subrequests:

   * If a local cache is present, it gets to do the slicing, otherwise the
     helpers just try to generate maximal slices.

   * The network filesystem gets to clamp the size of each slice if it is to be
     the source. This allows rsize and chunking to be implemented.

   * The helpers issue a read from the cache or a read from the server or just
     clear the slice as appropriate.

   * The next slice begins at the end of the last one.

   * As slices finish being read, they terminate.

 * When all the subrequests have terminated, the subrequests are assessed and
   any that are short or have failed are reissued:

   * Failed cache requests are issued against the server instead.

   * Failed server requests just fail.

   * Short reads against either source will be reissued against that source
     provided they have transferred some more data:

     * The cache may need to skip holes that it can't do DIO from.

     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
       end of the slice instead of reissuing.

 * Once the data is read, the folios that have been fully read/cleared:

   * Will be marked uptodate.

   * If a cache is present, will be marked with PG_fscache.

   * Will be unlocked.

 * Any folios that need writing to the cache will then have DIO writes issued.

 * Synchronous operations will wait for reading to be complete.

 * Writes to the cache will proceed asynchronously and the folios will have the
   PG_fscache mark removed when that completes.

 * The request structures will be cleaned up when everything has completed.
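
The slicing loop in the procedure above can be sketched in userspace for the
simple case with no cache, where each slice is clamped to a maximum read size.
This is a minimal model under stated assumptions: ``model_slice_desc``,
``model_slice_request()`` and the ``rsize`` parameter are hypothetical names
for illustration and do not reproduce the helpers' actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one slice: where it starts and how long it is. */
struct model_slice_desc {
	long long start;	/* stands in for loff_t */
	size_t len;
};

/*
 * Cut the range [start, start + len) into slices of at most rsize bytes, each
 * beginning where the previous one ended.  Returns the number of slices
 * written to out[], or -1 if out[] has room for fewer than are needed.
 */
static int model_slice_request(long long start, size_t len, size_t rsize,
			       struct model_slice_desc *out, int max_out)
{
	int n = 0;

	assert(rsize > 0);
	while (len > 0) {
		/* The "clamp" step: never take more than rsize at once. */
		size_t this_len = len < rsize ? len : rsize;

		if (n == max_out)
			return -1;
		out[n].start = start;
		out[n].len = this_len;
		n++;

		start += (long long)this_len;	/* next slice follows on */
		len -= this_len;
	}
	return n;
}
```

With a cache present, the length of each slice would instead be chosen by the
cache first and only then clamped by the filesystem, so the slices need not all
be the same size.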


Read Helper Cache API
---------------------

When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.

The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work. If using fscache, for
example, the network filesystem would call::

	int fscache_begin_read_operation(struct netfs_read_request *rreq,
					 struct fscache_cookie *cookie);

passing in the request pointer and the cookie corresponding to the file.

The netfs_read_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops	*ops;
		void				*cache_priv;
		void				*cache_priv2;
	};

This contains an operations table pointer and two private pointers. The
operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);

		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);

		enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq,
						       loff_t i_size);

		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);

		int (*prepare_write)(struct netfs_cache_resources *cres,
				     loff_t *_start, size_t *_len, loff_t i_size);

		int (*write)(struct netfs_cache_resources *cres,
			     loff_t start_pos,
			     struct iov_iter *iter,
			     netfs_io_terminated_t term_func,
			     void *term_func_priv);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a netfs_readahead() operation to allow
   the cache to expand a request in either direction. This allows the cache to
   size the request appropriately for the cache granularity.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request. ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache. The start file offset is given
   along with an iterator to read to, which gives the length also. It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write()``

   [Required] Called to adjust a write to the cache and check that there is
   sufficient space in the cache. The start and length values indicate the
   size of the write that netfslib is proposing, and this can be adjusted by
   the cache to respect DIO boundaries. The file size is passed for
   information.

 * ``write()``

   [Required] Called to write to the cache. The start file offset is given
   along with an iterator to write from, which gives the length also.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

Note that these methods are passed a pointer to the cache resource structure,
not the read request structure as they could be used in other situations where
there isn't a read request structure as well, such as writing dirty data to the
cache.
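
To make the role of the slice-preparation method more concrete, here is a
userspace sketch of how a cache might choose a source for the next slice and
shorten it to its own granularity. All of it is assumption for the sake of
illustration: the ``model_*`` names, the 4096-byte granule, the presence array
standing in for the cache's index, and the choice to reject misaligned slices
are invented for this example and say nothing about how any real cache backend
behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the four source values a cache can return for a slice. */
enum model_read_source {
	MODEL_FILL_WITH_ZEROES,
	MODEL_DOWNLOAD_FROM_SERVER,
	MODEL_READ_FROM_CACHE,
	MODEL_INVALID_READ,
};

#define MODEL_GRANULE 4096	/* hypothetical cache granularity in bytes */

struct model_cache {
	const bool *present;	/* present[i]: is granule i held in the cache? */
	size_t nr_granules;
};

struct model_subreq {
	long long start;	/* stands in for loff_t */
	size_t len;
};

static bool model_granule_cached(const struct model_cache *cache, size_t g)
{
	return g < cache->nr_granules && cache->present[g];
}

/* Pick a source for the slice and trim it to a run of like granules. */
static enum model_read_source
model_prepare_read(const struct model_cache *cache,
		   struct model_subreq *subreq, long long i_size)
{
	size_t first, max_granules, run = 1;
	bool cached;

	assert(cache && subreq);
	if (subreq->len == 0 || subreq->start < 0 ||
	    subreq->start % MODEL_GRANULE != 0)
		return MODEL_INVALID_READ;	/* model-only policy */
	if (subreq->start >= i_size)
		return MODEL_FILL_WITH_ZEROES;	/* nothing there to read */

	first = (size_t)(subreq->start / MODEL_GRANULE);
	max_granules = (subreq->len + MODEL_GRANULE - 1) / MODEL_GRANULE;
	cached = model_granule_cached(cache, first);

	/* Extend over the run of granules that share the first one's state. */
	while (run < max_granules &&
	       model_granule_cached(cache, first + run) == cached)
		run++;

	/* The slice may only be shortened, never lengthened. */
	if (run * MODEL_GRANULE < subreq->len)
		subreq->len = run * MODEL_GRANULE;
	return cached ? MODEL_READ_FROM_CACHE : MODEL_DOWNLOAD_FROM_SERVER;
}
```

Because the slice is cut where the cached state changes, a single request can
end up interleaving cache reads and server downloads, one slice at a time.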


API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/read_helper.c