Commit | Line | Data |
---|---|---|
14ebc28e | 1 | ===================== |
80aafd50 JL |
2 | The errseq_t datatype |
3 | ===================== | |
14ebc28e | 4 | |
80aafd50 JL |
5 | An errseq_t is a way of recording errors in one place, and allowing any |
6 | number of "subscribers" to tell whether it has changed since a previous | |
7 | point where it was sampled. | |
8 | ||
9 | The initial use case for this is tracking errors for file | |
10 | synchronization syscalls (fsync, fdatasync, msync and sync_file_range), | |
11 | but it may be usable in other situations. | |
12 | ||
13 | It's implemented as an unsigned 32-bit value. The low-order bits are | |
14 | designated to hold an error code (between 1 and MAX_ERRNO). The upper bits | |
15 | are used as a counter. This is done with atomics instead of locking so that | |
16 | these functions can be called from any context. | |
17 | ||
18 | Note that there is a risk of collisions if new errors are being recorded | |
19 | frequently, since we have so few bits to use as a counter. | |
20 | ||
21 | To mitigate this, the bit between the error value and counter is used as | |
22 | a flag to tell whether the value has been sampled since a new value was | |
23 | recorded. That allows us to avoid bumping the counter if no one has | |
24 | sampled it since the last time an error was recorded. | |
25 | ||
14ebc28e | 26 | Thus we end up with a value that looks something like this: |
80aafd50 | 27 | |
14ebc28e MW |
28 | +--------------------------------------+----+------------------------+ |
29 | | 31..13 | 12 | 11..0 | | |
30 | +--------------------------------------+----+------------------------+ | |
31 | | counter | SF | errno | | |
32 | +--------------------------------------+----+------------------------+ | |
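The layout above can be sketched in userspace C as a few masks plus a helper that applies the "only bump the counter if the old value was sampled" rule. The `MODEL_*` and `model_*` names are illustrative assumptions made up for this example, as is the 4095 limit (the largest value that fits the 12-bit errno field in the diagram). The helper is deliberately non-atomic; it only shows the bit manipulation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; assumes the 12-bit errno field shown in the diagram. */
#define MODEL_MAX_ERRNO 4095u        /* bits 11..0: errno */
#define MODEL_SEEN      (1u << 12)   /* bit 12: the "SF" flag */
#define MODEL_CTR_INC   (1u << 13)   /* lowest bit of the counter */

static_assert(MODEL_CTR_INC == MODEL_SEEN << 1,
	      "counter starts right above the flag");

static inline unsigned model_errno(uint32_t v)   { return v & MODEL_MAX_ERRNO; }
static inline int      model_seen(uint32_t v)    { return (v & MODEL_SEEN) != 0; }
static inline uint32_t model_counter(uint32_t v) { return v >> 13; }

/*
 * Record a (positive, 1..4095) error code: replace the errno bits, clear
 * the flag, and bump the counter only if the previous value had been
 * sampled. Not atomic; this only illustrates the encoding.
 */
static uint32_t model_record(uint32_t old, unsigned err)
{
	uint32_t next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | err;

	if (old & MODEL_SEEN)
		next += MODEL_CTR_INC;
	return next;
}
```

Recording twice in a row without an intervening sample leaves the counter untouched, which is how the flag conserves counter bits.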
80aafd50 JL |
33 | |
34 | The general idea is for "watchers" to sample an errseq_t value and keep | |
35 | it as a running cursor. That value can later be used to tell whether | |
36 | any new errors have occurred since that sampling was done, and atomically | |
37 | record the state at the time that it was checked. This allows us to | |
38 | record errors in one place, and then have a number of "watchers" that | |
39 | can tell whether the value has changed since they last checked it. | |
40 | ||
41 | A new errseq_t should always be zeroed out. An errseq_t value of all zeroes | |
42 | is the special (but common) case where there has never been an error. An all | |
43 | zero value thus serves as the "epoch" if one wishes to know whether there | |
44 | has ever been an error set since it was first initialized. | |
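As a rough sketch of the "epoch" idea, comparing against an all-zero cursor answers the question "has any error ever been recorded?". The `model_*` functions below are a simplified userspace stand-in written for this example (C11 atomics, errno bits only, no flag or counter handling); they are not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u

/* Simplified stand-in: stores only the errno bits of a negative errno. */
static void model_set(_Atomic uint32_t *eseq, int err)
{
	assert(err < 0 && -err <= (int)MODEL_MAX_ERRNO);
	atomic_store(eseq, (uint32_t)-err & MODEL_MAX_ERRNO);
}

/* Returns 0 if nothing changed since "since", else the negative errno. */
static int model_check(_Atomic uint32_t *eseq, uint32_t since)
{
	uint32_t cur = atomic_load(eseq);

	return cur == since ? 0 : -(int)(cur & MODEL_MAX_ERRNO);
}

/* Has any error ever been recorded? Compare against the all-zero epoch. */
static int model_ever_failed(_Atomic uint32_t *eseq)
{
	return model_check(eseq, 0) != 0;
}
```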
45 | ||
46 | API usage | |
47 | ========= | |
14ebc28e | 48 | |
80aafd50 JL |
49 | Let me tell you a story about a worker drone. Now, he's a good worker |
50 | overall, but the company is a little...management heavy. He has to | |
51 | report to 77 supervisors today, and tomorrow the "big boss" is coming in | |
52 | from out of town and he's sure to test the poor fellow too. | |
53 | ||
54 | They're all handing him work to do -- so much he can't keep track of who | |
55 | handed him what, but that's not really a big problem. The supervisors | |
56 | just want to know when he's finished all of the work they've handed him so | |
57 | far and whether he made any mistakes since they last asked. | |
58 | ||
59 | He might have made the mistake on work they didn't actually hand him, | |
60 | but he can't keep track of things at that level of detail; all he can | |
61 | remember is the most recent mistake that he made. | |
62 | ||
63 | Here's our worker_drone representation:: | |
64 | ||
65 | struct worker_drone { | |
66 | errseq_t wd_err; /* for recording errors */ | |
67 | }; | |
68 | ||
69 | Every day, the worker_drone starts out with a blank slate:: | |
70 | ||
71 | struct worker_drone wd; | |
72 | ||
73 | wd.wd_err = (errseq_t)0; | |
74 | ||
75 | The supervisors come in and get an initial read for the day. They | |
76 | don't care about anything that happened before their watch begins:: | |
77 | ||
78 | struct supervisor { | |
79 | errseq_t s_wd_err; /* private "cursor" for wd_err */ | |
80 | spinlock_t s_wd_err_lock; /* protects s_wd_err */ | |
81 | }; | |
82 | ||
83 | struct supervisor su; | |
84 | ||
85 | su.s_wd_err = errseq_sample(&wd.wd_err); | |
86 | spin_lock_init(&su.s_wd_err_lock); | |
87 | ||
88 | Now they start handing him tasks to do. Every few minutes they ask him to | |
89 | finish up all of the work they've handed him so far. Then they ask him | |
90 | whether he made any mistakes on any of it:: | |
91 | ||
92 | spin_lock(&su.s_wd_err_lock); | |
93 | err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); | |
94 | spin_unlock(&su.s_wd_err_lock); | |
95 | ||
96 | Up to this point, that just keeps returning 0. | |
97 | ||
98 | Now, the owners of this company are quite miserly and have given him | |
99 | substandard equipment with which to do his job. Occasionally it | |
100 | glitches and he makes a mistake. He sighs a heavy sigh, and marks it | |
101 | down:: | |
102 | ||
103 | errseq_set(&wd.wd_err, -EIO); | |
104 | ||
105 | ...and then gets back to work. The supervisors eventually poll again | |
106 | and they each get the error when they next check. Subsequent calls will | |
107 | return 0, until another error is recorded, at which point it's reported | |
108 | to each of them once. | |
109 | ||
110 | Note that the supervisors can't tell how many mistakes he made, only | |
111 | whether one was made since they last checked, and the latest value | |
112 | recorded. | |
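The "each supervisor hears about an error exactly once" behaviour can be tried outside the kernel with a small userspace model. Everything prefixed `model_` is a hypothetical stand-in built on C11 atomics for this example, following the bit layout described earlier; it is a sketch of the idea rather than the code in lib/errseq.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

typedef _Atomic uint32_t model_errseq_t;

/* Record a negative errno; bump the counter only if the old value was seen. */
static void model_set(model_errseq_t *eseq, int err)
{
	uint32_t old = atomic_load(eseq), next;

	assert(err < 0 && -err <= (int)MODEL_MAX_ERRNO);
	do {
		next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)-err;
		if (old & MODEL_SEEN)
			next += MODEL_CTR_INC;
	} while (!atomic_compare_exchange_weak(eseq, &old, next));
}

/* Take a starting cursor; an error nobody has seen yet still counts as new. */
static uint32_t model_sample(model_errseq_t *eseq)
{
	uint32_t cur = atomic_load(eseq);

	return (cur & MODEL_SEEN) ? cur : 0;
}

/* Report the latest error once per cursor and move the cursor forward. */
static int model_check_and_advance(model_errseq_t *eseq, uint32_t *since)
{
	uint32_t old = atomic_load(eseq), next;

	if (old == *since)
		return 0;
	next = old | MODEL_SEEN;
	if (next != old)
		atomic_compare_exchange_strong(eseq, &old, next);
	*since = next;
	return -(int)(next & MODEL_MAX_ERRNO);
}

/* Two supervisors poll one worker three times; count the error reports. */
static int model_demo_reports(void)
{
	model_errseq_t wd_err = 0;
	uint32_t a = model_sample(&wd_err), b = model_sample(&wd_err);
	int reports = 0;

	model_set(&wd_err, -EIO);
	for (int round = 0; round < 3; round++) {
		reports += model_check_and_advance(&wd_err, &a) == -EIO;
		reports += model_check_and_advance(&wd_err, &b) == -EIO;
	}
	return reports;
}
```

In this model the demo counts one report per supervisor, however many times they poll afterwards. Each cursor is touched by a single caller here, so no locking is shown.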
113 | ||
114 | Occasionally the big boss comes in for a spot check and asks the worker | |
115 | to do a one-off job for him. He's not really watching the worker | |
116 | full-time like the supervisors, but he does need to know whether a | |
117 | mistake occurred while his job was processing. | |
118 | ||
119 | He can just sample the current errseq_t in the worker, and then use that | |
120 | to tell whether an error has occurred later:: | |
121 | ||
122 | errseq_t since = errseq_sample(&wd.wd_err); | |
123 | /* submit some work and wait for it to complete */ | |
124 | err = errseq_check(&wd.wd_err, since); | |
125 | ||
126 | Since he's just going to discard "since" after that point, he doesn't | |
127 | need to advance it here. He also doesn't need any locking since it's | |
128 | not usable by anyone else. | |
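The big boss's one-off check can be sketched with the same kind of userspace model. The `model_*` names are hypothetical stand-ins written for this example with C11 atomics, and `job_err` is an assumed parameter that simulates whether the submitted work failed.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

typedef _Atomic uint32_t model_errseq_t;

/* Record a negative errno; bump the counter only if the old value was seen. */
static void model_set(model_errseq_t *eseq, int err)
{
	uint32_t old = atomic_load(eseq), next;

	assert(err < 0 && -err <= (int)MODEL_MAX_ERRNO);
	do {
		next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)-err;
		if (old & MODEL_SEEN)
			next += MODEL_CTR_INC;
	} while (!atomic_compare_exchange_weak(eseq, &old, next));
}

/* Take a starting cursor; an error nobody has seen yet still counts as new. */
static uint32_t model_sample(model_errseq_t *eseq)
{
	uint32_t cur = atomic_load(eseq);

	return (cur & MODEL_SEEN) ? cur : 0;
}

/* Compare without advancing: 0 if unchanged, else the latest negative errno. */
static int model_check(model_errseq_t *eseq, uint32_t since)
{
	uint32_t cur = atomic_load(eseq);

	return cur == since ? 0 : -(int)(cur & MODEL_MAX_ERRNO);
}

/* One-off check: sample, run the job, then compare. "since" is discarded. */
static int model_one_off(model_errseq_t *eseq, int job_err)
{
	uint32_t since = model_sample(eseq);

	if (job_err)	/* stand-in for "the submitted work failed" */
		model_set(eseq, job_err);
	return model_check(eseq, since);
}
```

Note that in this model an error that nobody has observed yet is still reported to a later one-off check, because sampling an unseen value yields 0.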
129 | ||
130 | Serializing errseq_t cursor updates | |
131 | =================================== | |
14ebc28e | 132 | |
80aafd50 JL |
133 | Note that the errseq_t API does not protect the errseq_t cursor during a |
134 | check_and_advance operation. Only the canonical error code is handled | |
135 | atomically. In a situation where more than one task might be using the | |
136 | same errseq_t cursor at the same time, it's important to serialize | |
137 | updates to that cursor. | |
138 | ||
139 | If that's not done, then it's possible for the cursor to go backward | |
140 | in which case the same error could be reported more than once. | |
141 | ||
142 | Because of this, it's often advantageous to first do an errseq_check to | |
143 | see if anything has changed, and only later do an | |
144 | errseq_check_and_advance after taking the lock. e.g.:: | |
145 | ||
146 | if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err))) { | |
147 | /* su.s_wd_err is protected by s_wd_err_lock */ | |
148 | spin_lock(&su.s_wd_err_lock); | |
149 | err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); | |
150 | spin_unlock(&su.s_wd_err_lock); | |
151 | } | |
152 | ||
153 | That avoids the spinlock in the common case where nothing has changed | |
154 | since the last time it was checked. | |
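The check-then-lock pattern can be exercised in userspace as well. In this sketch a C11 `atomic_flag` stands in for the spinlock and a relaxed atomic load stands in for `READ_ONCE()`; the `model_*` names and `struct model_supervisor` are assumptions made up for the example, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (1u << 12)
#define MODEL_CTR_INC   (1u << 13)

typedef _Atomic uint32_t model_errseq_t;

struct model_supervisor {
	_Atomic uint32_t cursor;	/* plays the role of s_wd_err */
	atomic_flag lock;		/* plays the role of s_wd_err_lock */
};

/* Record a negative errno; bump the counter only if the old value was seen. */
static void model_set(model_errseq_t *eseq, int err)
{
	uint32_t old = atomic_load(eseq), next;

	assert(err < 0 && -err <= (int)MODEL_MAX_ERRNO);
	do {
		next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)-err;
		if (old & MODEL_SEEN)
			next += MODEL_CTR_INC;
	} while (!atomic_compare_exchange_weak(eseq, &old, next));
}

/* Report the latest error once per cursor and move the cursor forward. */
static int model_check_and_advance(model_errseq_t *eseq, uint32_t *since)
{
	uint32_t old = atomic_load(eseq), next;

	if (old == *since)
		return 0;
	next = old | MODEL_SEEN;
	if (next != old)
		atomic_compare_exchange_strong(eseq, &old, next);
	*since = next;
	return -(int)(next & MODEL_MAX_ERRNO);
}

static void model_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void model_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Cheap lockless peek first; take the lock only if something changed. */
static int model_poll(model_errseq_t *eseq, struct model_supervisor *su)
{
	uint32_t since;
	int err;

	if (atomic_load(eseq) ==
	    atomic_load_explicit(&su->cursor, memory_order_relaxed))
		return 0;

	model_lock(&su->lock);
	since = atomic_load_explicit(&su->cursor, memory_order_relaxed);
	err = model_check_and_advance(eseq, &since);
	atomic_store_explicit(&su->cursor, since, memory_order_relaxed);
	model_unlock(&su->lock);
	return err;
}
```

The result of the unlocked peek is only a hint. The authoritative comparison and the cursor update both happen under the lock, so two tasks sharing one cursor cannot move it backward.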
14ebc28e MW |
155 | |
156 | Functions | |
157 | ========= | |
158 | ||
159 | .. kernel-doc:: lib/errseq.c |