1 | .. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) |
2 | .. See the bottom of this file for additional redistribution information. | |
3 | ||
4 | Handling regressions | |
5 | ++++++++++++++++++++ | |
6 | ||
7 | *We don't cause regressions* -- this document describes what this "first rule of | |
8 | Linux kernel development" means in practice for developers. It complements | |
9 | Documentation/admin-guide/reporting-regressions.rst, which covers the topic from a | |
10 | user's point of view; if you never read that text, go and at least skim over it | |
11 | before continuing here. | |
12 | ||
13 | The important bits (aka "The TL;DR") | |
14 | ==================================== | |
15 | ||
16 | #. Ensure subscribers of the `regression mailing list <https://lore.kernel.org/regressions/>`_ | |
17 | (regressions@lists.linux.dev) quickly become aware of any new regression | |
18 | report: | |
19 | ||
20 | * When receiving a mailed report that did not CC the list, bring it into the | |
21 | loop by immediately sending at least a brief "Reply-all" with the list | |
22 | CCed. | |
23 | ||
24 | * Forward or bounce any reports submitted in bug trackers to the list. | |
25 | ||
26 | #. Make the Linux kernel regression tracking bot "regzbot" track the issue (this | |
27 | is optional, but recommended): | |
28 | ||
29 | * For mailed reports, check if the reporter included a line like ``#regzbot | |
30 | introduced v5.13..v5.14-rc1``. If not, send a reply (with the regressions | |
31 | list in CC) containing a paragraph like the following, which tells regzbot | |
32 | when the issue started to happen:: | |
33 | ||
34 | #regzbot ^introduced 1f2e3d4c5b6a | |
35 | ||
36 | * When forwarding reports from a bug tracker to the regressions list (see | |
37 | above), include a paragraph like the following:: | |
38 | ||
39 | #regzbot introduced: v5.13..v5.14-rc1 | |
40 | #regzbot from: Some N. Ice Human <some.human@example.com> | |
41 | #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789 | |
42 | ||
43 | #. When submitting fixes for regressions, add "Link:" tags to the patch | |
44 | description pointing to all places where the issue was reported, as | |
45 | mandated by Documentation/process/submitting-patches.rst and | |
46 | :ref:`Documentation/process/5.Posting.rst <development_posting>`. | |
47 | ||
48 | #. Try to fix regressions quickly once the culprit has been identified; fixes |
49 | for most regressions should be merged within two weeks, but some need to be | |
50 | resolved within two or three days. | |
51 | ||
52 | |
53 | All the details on Linux kernel regressions relevant for developers | |
54 | =================================================================== | |
55 | ||
56 | ||
57 | The important basics in more detail | |
58 | ----------------------------------- | |
59 | ||
60 | ||
61 | What to do when receiving regression reports | |
62 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
63 | ||
64 | Ensure the Linux kernel's regression tracker and other subscribers of the | |
65 | `regression mailing list <https://lore.kernel.org/regressions/>`_ | |
66 | (regressions@lists.linux.dev) become aware of any newly reported regression: | |
67 | ||
68 | * When you receive a report by mail that did not CC the list, immediately bring | |
69 | it into the loop by sending at least a brief "Reply-all" with the list CCed; | |
70 | try to ensure it gets CCed again in case you reply to a reply that omitted | |
71 | the list. | |
72 | ||
73 | * If a report submitted in a bug tracker hits your Inbox, forward or bounce it | |
74 | to the list. Consider checking the list archives beforehand, if the reporter | |
75 | already forwarded the report as instructed by | |
76 | Documentation/admin-guide/reporting-issues.rst. | |
77 | ||
78 | When doing either, consider making the Linux kernel regression tracking bot | |
79 | "regzbot" immediately start tracking the issue: | |
80 | ||
81 | * For mailed reports, check if the reporter included a "regzbot command" like | |
82 | ``#regzbot introduced 1f2e3d4c5b6a``. If not, send a reply (with the | |
83 | regressions list in CC) with a paragraph like the following:: | |
84 | ||
85 | #regzbot ^introduced: v5.13..v5.14-rc1 | |
86 | ||
87 | This tells regzbot the version range in which the issue started to happen; | |
88 | you can specify a range using commit-ids as well or state a single commit-id | |
89 | in case the reporter bisected the culprit. | |
90 | ||
91 | Note the caret (^) before the "introduced": it tells regzbot to treat the | |
92 | parent mail (the one you reply to) as the initial report for the regression | |
93 | you want to see tracked; that's important, as regzbot will later look out | |
94 | for patches with "Link:" tags pointing to the report in the archives on | |
95 | lore.kernel.org. | |
96 | ||
97 | * When forwarding a regression reported to a bug tracker, include a paragraph | |
98 | with these regzbot commands:: | |
99 | ||
100 | #regzbot introduced: 1f2e3d4c5b6a | |
101 | #regzbot from: Some N. Ice Human <some.human@example.com> | |
102 | #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789 | |
103 | ||
104 | Regzbot will then automatically associate patches with the report that | |
105 | contain "Link:" tags pointing to your mail or the mentioned ticket. | |
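
The forwarding step above can be sketched in a few lines of Python. This is only an illustration: the helper name ``regzbot_forward_paragraph`` and its parameters are hypothetical and not part of regzbot; the three commands are the ones shown in the list above.

```python
def regzbot_forward_paragraph(introduced: str, reporter: str, ticket_url: str) -> str:
    """Build the regzbot paragraph for a report forwarded from a bug tracker.

    Hypothetical helper for illustration only; it is not part of regzbot.
    """
    lines = [
        f"#regzbot introduced: {introduced}",
        f"#regzbot from: {reporter}",
        f"#regzbot monitor: {ticket_url}",
    ]
    # Blank lines before and after keep the commands in a paragraph of
    # their own once the text is pasted into the forwarded mail.
    return "\n" + "\n".join(lines) + "\n"


print(regzbot_forward_paragraph(
    "v5.13..v5.14-rc1",
    "Some N. Ice Human <some.human@example.com>",
    "http://some.bugtracker.example.com/ticket?id=123456789",
))
```

The returned text is meant to be pasted into the forwarded mail, with the regressions list in CC.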
106 | ||
107 | What's important when fixing regressions | |
108 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
109 | ||
110 | You don't need to do anything special when submitting fixes for regressions, just | |
111 | remember to do what Documentation/process/submitting-patches.rst, | |
112 | :ref:`Documentation/process/5.Posting.rst <development_posting>`, and | |
113 | Documentation/process/stable-kernel-rules.rst already explain in more detail: | |
114 | ||
115 | * Point to all places where the issue was reported using "Link:" tags:: | |
116 | ||
117 | Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/ | |
118 | Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890 | |
119 | ||
120 | * Add a "Fixes:" tag to specify the commit causing the regression. | |
121 | ||
122 | * If the culprit was merged in an earlier development cycle, explicitly mark | |
123 | the fix for backporting using the ``Cc: stable@vger.kernel.org`` tag. | |
124 | ||
125 | All this is expected from you and important when it comes to regressions, as | |
126 | these tags are of great value for everyone (you included) that might be looking | |
127 | into the issue weeks, months, or years later. These tags are also crucial for | |
128 | tools and scripts used by other kernel developers or Linux distributions; one of | |
129 | these tools is regzbot, which heavily relies on the "Link:" tags to associate | |
130 | reports for regressions with changes resolving them. | |
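
A pre-submission check for the three tags described above might look like the following sketch. The function name ``missing_regression_tags`` and the regular expressions are assumptions made for illustration; they are not taken from checkpatch or any other kernel tool.

```python
import re
from typing import List


def missing_regression_tags(message: str, needs_backport: bool = False) -> List[str]:
    """Return the tags a regression fix's commit message still lacks.

    Hypothetical helper; the patterns are a rough approximation only.
    """
    missing = []
    # At least one "Link:" tag pointing to a report.
    if not re.search(r"^Link: \S+", message, re.MULTILINE):
        missing.append("Link:")
    # A "Fixes:" tag naming the culprit by (abbreviated) commit-id.
    if not re.search(r"^Fixes: [0-9a-f]{8,}", message, re.MULTILINE):
        missing.append("Fixes:")
    # Only needed if the culprit was merged in an earlier development cycle.
    if needs_backport and not re.search(
        r"^Cc: .*stable@vger\.kernel\.org", message, re.MULTILINE | re.IGNORECASE
    ):
        missing.append("Cc: stable@vger.kernel.org")
    return missing


example = """\
net: fix frobnication of foo

Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
Fixes: 1f2e3d4c5b6a ("net: add foo frobnication")
"""
print(missing_regression_tags(example, needs_backport=True))
```

For the example message, only the stable tag is reported as missing.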
131 | ||
132 | Prioritize work on fixing regressions |
133 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
134 | ||
135 | You should fix any reported regression as quickly as possible, to provide | |
136 | affected users with a solution in a timely manner and prevent more users from | |
137 | running into the issue; nevertheless developers need to take enough time and | |
138 | care to ensure regression fixes do not cause additional damage. | |
139 | ||
140 | In the end though, developers should give their best to prevent users from | |
141 | running into situations where a regression leaves them only three options: "run | |
142 | a kernel with a regression that seriously impacts usage", "continue running an | |
143 | outdated and thus potentially insecure kernel version for more than two weeks | |
144 | after a regression's culprit was identified", and "downgrade to a still | |
145 | supported kernel series that lacks required features". | |
146 | ||
147 | How to realize this depends a lot on the situation. Here are a few rules of | |
148 | thumb for you, in order of importance: | |
149 | ||
150 | * Prioritize work on handling regression reports and fixing regressions over all | |
151 | other Linux kernel work, unless the latter concerns acute security issues or | |
152 | bugs causing data loss or damage. | |
153 | ||
154 | * Always consider reverting the culprit commits and reapplying them later | |
155 | together with necessary fixes, as this might be the least dangerous and | |
156 | quickest way to fix a regression. | |
157 | ||
158 | * Developers should handle regressions in all supported kernel series, but are | |
159 | free to delegate the work to the stable team, if the issue probably at no | |
160 | point in time occurred with mainline. | |
161 | ||
162 | * Try to resolve any regressions introduced in the current development cycle before | |
163 | its end. If you fear a fix might be too risky to apply only days before a new | |
164 | mainline release, let Linus decide: submit the fix separately to him as soon | |
165 | as possible with an explanation of the situation. He then can make a call | |
166 | and postpone the release if necessary, for example if multiple such changes | |
167 | show up in his inbox. | |
168 | ||
169 | * Address regressions in stable, longterm, or proper mainline releases with | |
170 | more urgency than regressions in mainline pre-releases. That changes after | |
171 | the release of the fifth pre-release, aka "-rc5": mainline then becomes as | |
172 | important, to ensure all the improvements and fixes are ideally tested | |
173 | together for at least one week before Linus releases a new mainline version. | |
174 | ||
175 | * Fix regressions within two or three days, if they are critical for some | |
176 | reason -- for example, if the issue is likely to affect many users of the | |
177 | kernel series in question on all or certain architectures. Note, this | |
178 | includes mainline, as issues like compile errors otherwise might prevent many | |
179 | testers or continuous integration systems from testing the series. | |
180 | ||
181 | * Aim to fix regressions within one week after the culprit was identified, if | |
182 | the issue was introduced in either: | |
183 | ||
184 | * a recent stable/longterm release | |
185 | ||
186 | * the development cycle of the latest proper mainline release | |
187 | ||
188 | In the latter case (say Linux v5.14), try to address regressions even | |
189 | quicker, if the stable series for the predecessor (v5.13) will be abandoned | |
190 | soon or already was stamped "End-of-Life" (EOL) -- this usually happens about | |
191 | three to four weeks after a new mainline release. | |
192 | ||
193 | * Try to fix all other regressions within two weeks after the culprit was | |
194 | found. Two or three additional weeks are acceptable for performance | |
195 | regressions and other issues which are annoying, but don't prevent anyone | |
196 | from running Linux (unless it's an issue in the current development cycle, | |
197 | as those should ideally be addressed before the release). A few weeks in | |
198 | total are acceptable if a regression can only be fixed with a risky change | |
199 | and at the same time is affecting only a few users; as much time is | |
200 | also okay if the regression is already present in the second newest longterm | |
201 | kernel series. | |
202 | ||
203 | Note: The aforementioned time frames for resolving regressions are meant to | |
204 | include getting the fix tested, reviewed, and merged into mainline, ideally with | |
205 | the fix being in linux-next at least briefly. This leads to delays you need to | |
206 | account for. | |
207 | ||
208 | Subsystem maintainers are expected to assist in reaching those periods by doing | |
209 | timely reviews and quick handling of accepted patches. They thus might have to | |
210 | send git-pull requests earlier or more often than usual; depending on the fix, | |
211 | it might even be acceptable to skip testing in linux-next. Especially fixes for | |
212 | regressions in stable and longterm kernels need to be handled quickly, as fixes | |
213 | need to be merged in mainline before they can be backported to older series. | |
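
As a rough aid, the upper bounds from the rules of thumb above can be written down as a small lookup. This is a deliberately simplified sketch: the category names are made up for illustration, and the extra leeway mentioned above (for performance regressions, risky fixes, or issues affecting only a few users) is not modeled.

```python
from datetime import date, timedelta

# Upper bounds in days, a simplified reading of the rules of thumb above.
# The category names are hypothetical labels, not established terms.
DEADLINE_DAYS = {
    "critical": 3,        # "within two or three days"
    "recent_release": 7,  # recent stable/longterm or latest mainline release
    "other": 14,          # all other regressions
}


def fix_target_date(culprit_identified: date, category: str) -> date:
    """Return the date by which the fix should ideally be merged in mainline."""
    return culprit_identified + timedelta(days=DEADLINE_DAYS[category])


print(fix_target_date(date(2022, 2, 1), "recent_release"))
```

The clock starts when the culprit is identified, and the target includes testing, review, and merging, as the note above points out.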
214 | ||
215 | |
216 | More aspects regarding regressions developers should be aware of | |
217 | ---------------------------------------------------------------- | |
218 | ||
219 | ||
220 | How to deal with changes where a risk of regression is known | |
221 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
222 | ||
223 | Evaluate how big the risk of regressions is, for example by performing a code | |
224 | search in Linux distributions and Git forges. Also consider asking other | |
225 | developers or projects likely to be affected to evaluate or even test the | |
226 | proposed change; if problems surface, maybe some solution acceptable for all | |
227 | can be found. | |
228 | ||
229 | If the risk of regressions in the end seems to be relatively small, go ahead | |
230 | with the change, but let all involved parties know about the risk. Hence, make | |
231 | sure your patch description makes this aspect obvious. Once the change is | |
232 | merged, tell the Linux kernel's regression tracker and the regressions mailing | |
233 | list about the risk, so everyone has the change on the radar in case reports | |
234 | trickle in. Depending on the risk, you also might want to ask the subsystem | |
235 | maintainer to mention the issue in his mainline pull request. | |
236 | ||
237 | What else is there to know about regressions? | |
238 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
239 | ||
240 | Check out Documentation/admin-guide/reporting-regressions.rst; it covers a lot | |
241 | of other aspects you might want to be aware of: | |
242 | ||
243 | * the purpose of the "no regressions rule" | |
244 | ||
245 | * what issues actually qualify as regressions | |
246 | ||
247 | * who's in charge of finding the root cause of a regression | |
248 | ||
249 | * how to handle tricky situations, e.g. when a regression is caused by a | |
250 | security fix or when fixing a regression might cause another one | |
251 | ||
252 | Whom to ask for advice when it comes to regressions | |
253 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
254 | ||
255 | Send a mail to the regressions mailing list (regressions@lists.linux.dev) while | |
256 | CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the | |
257 | issue might better be dealt with in private, feel free to omit the list. | |
258 | ||
259 | ||
260 | More about regression tracking and regzbot | |
261 | ------------------------------------------ | |
262 | ||
263 | ||
264 | Why the Linux kernel has a regression tracker, and why is regzbot used? | |
265 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
266 | ||
267 | Rules like "no regressions" need someone to ensure they are followed, otherwise | |
268 | they are broken either accidentally or on purpose. History has shown this to be | |
269 | true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to | |
270 | keep an eye on things as the Linux kernel's regression tracker, who's | |
271 | occasionally helped by other people. None of them is paid to do this, | |
272 | which is why regression tracking is done on a best-effort basis. | |
273 | ||
274 | Earlier attempts to manually track regressions have shown it's exhausting and | |
275 | frustrating work, which is why they were abandoned after a while. To prevent | |
276 | this from happening again, Thorsten developed regzbot to facilitate the work, | |
277 | with the long term goal to automate regression tracking as much as possible for | |
278 | everyone involved. | |
279 | ||
280 | How does regression tracking work with regzbot? | |
281 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
282 | ||
283 | The bot watches for replies to reports of tracked regressions. Additionally, | |
284 | it's looking out for posted or committed patches referencing such reports | |
285 | with "Link:" tags; replies to such patch postings are tracked as well. | |
286 | Combined, this data provides good insight into the current state of the fixing | |
287 | process. | |
288 | ||
289 | Regzbot tries to do its job with as little overhead as possible for both | |
290 | reporters and developers. In fact, only reporters are burdened with an extra | |
291 | duty: they need to tell regzbot about the regression report using the ``#regzbot | |
292 | introduced`` command outlined above; if they don't do that, someone else can | |
293 | take care of that using ``#regzbot ^introduced``. | |
294 | ||
295 | For developers there normally is no extra work involved; they just need to make | |
296 | sure to do something that was expected long before regzbot came to light: add | |
297 | "Link:" tags to the patch description pointing to all reports about the issue | |
298 | fixed. | |
299 | ||
300 | Do I have to use regzbot? | |
301 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | |
302 | ||
303 | It's in the interest of everyone if you do, as kernel maintainers like Linus | |
304 | Torvalds partly rely on regzbot's tracking in their work -- for example when | |
305 | deciding to release a new version or extend the development phase. For this they | |
306 | need to be aware of all unfixed regressions; to do that, Linus is known to look | |
307 | into the weekly reports sent by regzbot. | |
308 | ||
309 | Do I have to tell regzbot about every regression I stumble upon? | |
310 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
311 | ||
312 | Ideally yes: we are all humans and easily forget problems when something more | |
313 | important unexpectedly comes up -- for example a bigger problem in the Linux | |
314 | kernel or something in real life that's keeping us away from keyboards for a | |
315 | while. Hence, it's best to tell regzbot about every regression, except when you | |
316 | immediately write a fix and commit it to a tree regularly merged to the affected | |
317 | kernel series. | |
318 | ||
319 | How to see which regressions regzbot tracks currently? | |
320 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
321 | ||
322 | Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_ | |
323 | for the latest info; alternatively, `search for the latest regression report | |
324 | <https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_, | |
325 | which regzbot normally sends out once a week on Sunday evening (UTC), a | |
326 | few hours before Linus usually publishes new (pre-)releases. | |
327 | ||
328 | What places is regzbot monitoring? | |
329 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
330 | ||
331 | Regzbot is watching the most important Linux mailing lists as well as the git | |
332 | repositories of linux-next, mainline, and stable/longterm. | |
333 | ||
334 | What kind of issues are supposed to be tracked by regzbot? | |
335 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
336 | ||
337 | The bot is meant to track regressions, hence please don't involve regzbot for | |
338 | regular issues. But it's okay for the Linux kernel's regression tracker if you | |
339 | use regzbot to track severe issues, like reports about hangs, corrupted data, | |
340 | or internal errors (Panic, Oops, BUG(), warning, ...). | |
341 | ||
342 | Can I add regressions found by CI systems to regzbot's tracking? | |
343 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
344 | ||
345 | Feel free to do so, if the particular regression likely has impact on practical | |
346 | use cases and thus might be noticed by users; hence, please don't involve | |
347 | regzbot for theoretical regressions unlikely to show themselves in real world | |
348 | usage. | |
349 | ||
350 | How to interact with regzbot? | |
351 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
352 | ||
353 | By using a 'regzbot command' in a direct or indirect reply to the mail with the | |
354 | regression report. These commands need to be in their own paragraph (IOW: they | |
355 | need to be separated from the rest of the mail using blank lines). | |
356 | ||
357 | One such command is ``#regzbot introduced <version or commit>``, which makes | |
358 | regzbot consider your mail as a regression report added to the tracking, as | |
359 | already described above; ``#regzbot ^introduced <version or commit>`` is another | |
360 | such command, which makes regzbot consider the parent mail as a report for a | |
361 | regression which it starts to track. | |
362 | ||
363 | Once one of those two commands has been utilized, other regzbot commands can be | |
364 | used in direct or indirect replies to the report. You can write them below one | |
365 | of the `introduced` commands or in replies to the mail that used one of them | |
366 | or itself is a reply to that mail: | |
367 | ||
368 | * Set or update the title:: | |
369 | ||
370 | #regzbot title: foo | |
371 | ||
372 | * Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of | |
373 | the issue or a fix are discussed -- for example the posting of a patch fixing | |
374 | the regression:: | |
375 | ||
376 | #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/ | |
377 | ||
378 | Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot | |
379 | will consider all messages in that thread or ticket as related to the fixing | |
380 | process. | |
381 | ||
382 | * Point to a place with further details of interest, like a mailing list post | |
383 | or a ticket in a bug tracker that are slightly related, but about a different | |
384 | topic:: | |
385 | ||
386 | #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789 | |
387 | ||
388 | * Mark a regression as fixed by a commit that is heading upstream or already | |
389 | landed:: | |
390 | ||
391 | #regzbot fixed-by: 1f2e3d4c5d | |
392 | ||
393 | * Mark a regression as a duplicate of another one already tracked by regzbot:: | |
394 | ||
395 | #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/ | |
396 | ||
397 | * Mark a regression as invalid:: | |
398 | ||
399 | #regzbot invalid: wasn't a regression, problem has always existed | |
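
To illustrate the "own paragraph" rule and the command syntax shown above, here is a minimal parsing sketch. It is an assumption about how such commands could be extracted from a mail body, not regzbot's actual parser; the names ``COMMAND_RE`` and ``extract_regzbot_commands`` are hypothetical.

```python
import re
from typing import List, Tuple

# "#regzbot", a command name (optionally prefixed with a caret),
# an optional colon, then the argument up to the end of the line.
COMMAND_RE = re.compile(r"^#regzbot\s+(\^?[\w-]+):?\s*(.*)$")


def extract_regzbot_commands(body: str) -> List[Tuple[str, str]]:
    """Return (command, argument) pairs found in command-only paragraphs."""
    commands = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", body.strip()):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        matches = [COMMAND_RE.match(line) for line in lines]
        # Accept only paragraphs that consist entirely of regzbot commands.
        if lines and all(matches):
            commands.extend((m.group(1), m.group(2)) for m in matches)
    return commands


mail = """\
Hi, thanks for the report.

#regzbot ^introduced: v5.13..v5.14-rc1
#regzbot title: foo

Please try the patch. #regzbot invalid: not parsed
"""
print(extract_regzbot_commands(mail))
```

In this sketch, the command embedded in the last prose line is ignored because it does not sit in a paragraph of its own.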
400 | ||
401 | Is there more to tell about regzbot and its commands? | |
402 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
403 | ||
404 | More detailed and up-to-date information about the Linux | |
405 | kernel's regression tracking bot can be found on its | |
406 | `project page <https://gitlab.com/knurd42/regzbot>`_, which among others | |
407 | contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ | |
408 | and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_ | |
409 | which both cover more details than the above section. | |
410 | ||
411 | Quotes from Linus about regression | |
412 | ---------------------------------- | |
413 | ||
414 | Find below a few real life examples of how Linus Torvalds expects regressions to | |
415 | be handled: | |
416 | ||
417 | * From `2017-10-26 (1/2) | |
418 | <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_:: | |
419 | ||
420 | If you break existing user space setups THAT IS A REGRESSION. | |
421 | ||
422 | It's not ok to say "but we'll fix the user space setup". | |
423 | ||
424 | Really. NOT OK. | |
425 | ||
426 | [...] | |
427 | ||
428 | The first rule is: | |
429 | ||
430 | - we don't cause regressions | |
431 | ||
432 | and the corollary is that when regressions *do* occur, we admit to | |
433 | them and fix them, instead of blaming user space. | |
434 | ||
435 | The fact that you have apparently been denying the regression now for | |
436 | three weeks means that I will revert, and I will stop pulling apparmor | |
437 | requests until the people involved understand how kernel development | |
438 | is done. | |
439 | ||
440 | * From `2017-10-26 (2/2) | |
441 | <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_:: | |
442 | ||
443 | People should basically always feel like they can update their kernel | |
444 | and simply not have to worry about it. | |
445 | ||
446 | I refuse to introduce "you can only update the kernel if you also | |
447 | update that other program" kind of limitations. If the kernel used to | |
448 | work for you, the rule is that it continues to work for you. | |
449 | ||
450 | There have been exceptions, but they are few and far between, and they | |
451 | generally have some major and fundamental reasons for having happened, | |
452 | that were basically entirely unavoidable, and people _tried_hard_ to | |
453 | avoid them. Maybe we can't practically support the hardware any more | |
454 | after it is decades old and nobody uses it with modern kernels any | |
455 | more. Maybe there's a serious security issue with how we did things, | |
456 | and people actually depended on that fundamentally broken model. Maybe | |
457 | there was some fundamental other breakage that just _had_ to have a | |
458 | flag day for very core and fundamental reasons. | |
459 | ||
460 | And notice that this is very much about *breaking* peoples environments. | |
461 | ||
462 | Behavioral changes happen, and maybe we don't even support some | |
463 | feature any more. There's a number of fields in /proc/<pid>/stat that | |
464 | are printed out as zeroes, simply because they don't even *exist* in | |
465 | the kernel any more, or because showing them was a mistake (typically | |
466 | an information leak). But the numbers got replaced by zeroes, so that | |
467 | the code that used to parse the fields still works. The user might not | |
468 | see everything they used to see, and so behavior is clearly different, | |
469 | but things still _work_, even if they might no longer show sensitive | |
470 | (or no longer relevant) information. | |
471 | ||
472 | But if something actually breaks, then the change must get fixed or | |
473 | reverted. And it gets fixed in the *kernel*. Not by saying "well, fix | |
474 | your user space then". It was a kernel change that exposed the | |
475 | problem, it needs to be the kernel that corrects for it, because we | |
476 | have a "upgrade in place" model. We don't have a "upgrade with new | |
477 | user space". | |
478 | ||
479 | And I seriously will refuse to take code from people who do not | |
480 | understand and honor this very simple rule. | |
481 | ||
482 | This rule is also not going to change. | |
483 | ||
484 | And yes, I realize that the kernel is "special" in this respect. I'm | |
485 | proud of it. | |
486 | ||
487 | I have seen, and can point to, lots of projects that go "We need to | |
488 | break that use case in order to make progress" or "you relied on | |
489 | undocumented behavior, it sucks to be you" or "there's a better way to | |
490 | do what you want to do, and you have to change to that new better | |
491 | way", and I simply don't think that's acceptable outside of very early | |
492 | alpha releases that have experimental users that know what they signed | |
493 | up for. The kernel hasn't been in that situation for the last two | |
494 | decades. | |
495 | ||
496 | We do API breakage _inside_ the kernel all the time. We will fix | |
497 | internal problems by saying "you now need to do XYZ", but then it's | |
498 | about internal kernel API's, and the people who do that then also | |
499 | obviously have to fix up all the in-kernel users of that API. Nobody | |
500 | can say "I now broke the API you used, and now _you_ need to fix it | |
501 | up". Whoever broke something gets to fix it too. | |
502 | ||
503 | And we simply do not break user space. | |
504 | ||
505 | * From `2020-05-21 | |
506 | <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_:: | |
507 | ||
508 | The rules about regressions have never been about any kind of | |
509 | documented behavior, or where the code lives. | |
510 | ||
511 | The rules about regressions are always about "breaks user workflow". | |
512 | ||
513 | Users are literally the _only_ thing that matters. | |
514 | ||
515 | No amount of "you shouldn't have used this" or "that behavior was | |
516 | undefined, it's your own fault your app broke" or "that used to work | |
517 | simply because of a kernel bug" is at all relevant. | |
518 | ||
519 | Now, reality is never entirely black-and-white. So we've had things | |
520 | like "serious security issue" etc that just forces us to make changes | |
521 | that may break user space. But even then the rule is that we don't | |
522 | really have other options that would allow things to continue. | |
523 | ||
524 | And obviously, if users take years to even notice that something | |
525 | broke, or if we have sane ways to work around the breakage that | |
526 | doesn't make for too much trouble for users (ie "ok, there are a | |
527 | handful of users, and they can use a kernel command line to work | |
528 | around it" kind of things) we've also been a bit less strict. | |
529 | ||
530 | But no, "that was documented to be broken" (whether it's because the | |
531 | code was in staging or because the man-page said something else) is | |
532 | irrelevant. If staging code is so useful that people end up using it, | |
533 | that means that it's basically regular kernel code with a flag saying | |
534 | "please clean this up". | |
535 | ||
536 | The other side of the coin is that people who talk about "API | |
537 | stability" are entirely wrong. API's don't matter either. You can make | |
538 | any changes to an API you like - as long as nobody notices. | |
539 | ||
540 | Again, the regression rule is not about documentation, not about | |
541 | API's, and not about the phase of the moon. | |
542 | ||
543 | It's entirely about "we caused problems for user space that used to work". | |
544 | ||
545 | * From `2017-11-05 | |
546 | <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_:: | |
547 | ||
548 | And our regression rule has never been "behavior doesn't change". | |
549 | That would mean that we could never make any changes at all. | |
550 | ||
551 | For example, we do things like add new error handling etc all the | |
552 | time, which we then sometimes even add tests for in our kselftest | |
553 | directory. | |
554 | ||
555 | So clearly behavior changes all the time and we don't consider that a | |
556 | regression per se. | |
557 | ||
558 | The rule for a regression for the kernel is that some real user | |
559 | workflow breaks. Not some test. Not a "look, I used to be able to do | |
560 | X, now I can't". | |
561 | ||
562 | * From `2018-08-03 | |
563 | <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_:: | |
564 | ||
565 | YOU ARE MISSING THE #1 KERNEL RULE. | |
566 | ||
567 | We do not regress, and we do not regress exactly because your are 100% wrong. | |
568 | ||
569 | And the reason you state for your opinion is in fact exactly *WHY* you | |
570 | are wrong. | |
571 | ||
572 | Your "good reasons" are pure and utter garbage. | |
573 | ||
574 | The whole point of "we do not regress" is so that people can upgrade | |
575 | the kernel and never have to worry about it. | |
576 | ||
577 | > Kernel had a bug which has been fixed | |
578 | ||
579 | That is *ENTIRELY* immaterial. | |
580 | ||
581 | Guys, whether something was buggy or not DOES NOT MATTER. | |
582 | ||
583 | Why? | |
584 | ||
585 | Bugs happen. That's a fact of life. Arguing that "we had to break | |
586 | something because we were fixing a bug" is completely insane. We fix | |
587 | tens of bugs every single day, thinking that "fixing a bug" means that | |
588 | we can break something is simply NOT TRUE. | |
589 | ||
590 | So bugs simply aren't even relevant to the discussion. They happen, | |
591 | they get found, they get fixed, and it has nothing to do with "we | |
592 | break users". | |
593 | ||
594 | Because the only thing that matters IS THE USER. | |
595 | ||
596 | How hard is that to understand? | |
597 | ||
598 | Anybody who uses "but it was buggy" as an argument is entirely missing | |
599 | the point. As far as the USER was concerned, it wasn't buggy - it | |
600 | worked for him/her. | |
601 | ||
602 | Maybe it worked *because* the user had taken the bug into account, | |
603 | maybe it worked because the user didn't notice - again, it doesn't | |
604 | matter. It worked for the user. | |
605 | ||
606 | Breaking a user workflow for a "bug" is absolutely the WORST reason | |
607 | for breakage you can imagine. | |
608 | ||
609 | It's basically saying "I took something that worked, and I broke it, | |
610 | but now it's better". Do you not see how f*cking insane that statement | |
611 | is? | |
612 | ||
613 | And without users, your program is not a program, it's a pointless | |
614 | piece of code that you might as well throw away. | |
615 | ||
616 | Seriously. This is *why* the #1 rule for kernel development is "we | |
617 | don't break users". Because "I fixed a bug" is absolutely NOT AN | |
618 | ARGUMENT if that bug fix broke a user setup. You actually introduced a | |
619 | MUCH BIGGER bug by "fixing" something that the user clearly didn't | |
620 | even care about. | |
621 | ||
622 | And dammit, we upgrade the kernel ALL THE TIME without upgrading any | |
623 | other programs at all. It is absolutely required, because flag-days | |
624 | and dependencies are horribly bad. | |
625 | ||
626 | And it is also required simply because I as a kernel developer do not | |
627 | upgrade random other tools that I don't even care about as I develop | |
628 | the kernel, and I want any of my users to feel safe doing the same | |
629 | time. | |
630 | ||
631 | So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel | |
632 | without upgrading some other random binary, then we have a problem. | |
633 | ||
634 | * From `2021-06-05 | |
635 | <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_:: | |
636 | ||
637 | THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS. | |
638 | ||
639 | Honestly, security people need to understand that "not working" is not | |
640 | a success case of security. It's a failure case. | |
641 | ||
642 | Yes, "not working" may be secure. But security in that case is *pointless*. | |
643 | ||
644 | * From `2011-05-06 (1/3) | |
645 | <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_:: | |
646 | ||
647 | Binary compatibility is more important. | |
648 | ||
649 | And if binaries don't use the interface to parse the format (or just | |
650 | parse it wrongly - see the fairly recent example of adding uuid's to | |
651 | /proc/self/mountinfo), then it's a regression. | |
652 | ||
653 | And regressions get reverted, unless there are security issues or | |
654 | similar that makes us go "Oh Gods, we really have to break things". | |
655 | ||
656 | I don't understand why this simple logic is so hard for some kernel | |
657 | developers to understand. Reality matters. Your personal wishes matter | |
658 | NOT AT ALL. | |
659 | ||
660 | If you made an interface that can be used without parsing the | |
661 | interface description, then we're stuck with the interface. Theory | |
662 | simply doesn't matter. | |
663 | ||
664 | You could help fix the tools, and try to avoid the compatibility | |
665 | issues that way. There aren't that many of them. | |
666 | ||
667 | From `2011-05-06 (2/3) | |
668 | <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_:: | |
669 | ||
670 | it's clearly NOT an internal tracepoint. By definition. It's being | |
671 | used by powertop. | |
672 | ||
673 | From `2011-05-06 (3/3) | |
674 | <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_:: | |
675 | ||
676 | We have programs that use that ABI and thus it's a regression if they break. | |
677 | ||
678 | * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_:: | |
679 | ||
680 | > Now this got me wondering if Debian _unstable_ actually qualifies as a | |
681 | > standard distro userspace. | |
682 | ||
683 | Oh, if the kernel breaks some standard user space, that counts. Tons | |
684 | of people run Debian unstable | |
685 | ||
686 | * From `2019-09-15 | |
687 | <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_:: | |
688 | ||
689 | One _particularly_ last-minute revert is the top-most commit (ignoring | |
690 | the version change itself) done just before the release, and while | |
691 | it's very annoying, it's perhaps also instructive. | |
692 | ||
693 | What's instructive about it is that I reverted a commit that wasn't | |
694 | actually buggy. In fact, it was doing exactly what it set out to do, | |
695 | and did it very well. In fact it did it _so_ well that the much | |
696 | improved IO patterns it caused then ended up revealing a user-visible | |
697 | regression due to a real bug in a completely unrelated area. | |
698 | ||
699 | The actual details of that regression are not the reason I point that | |
700 | revert out as instructive, though. It's more that it's an instructive | |
701 | example of what counts as a regression, and what the whole "no | |
702 | regressions" kernel rule means. The reverted commit didn't change any | |
703 | API's, and it didn't introduce any new bugs. But it ended up exposing | |
704 | another problem, and as such caused a kernel upgrade to fail for a | |
705 | user. So it got reverted. | |
706 | ||
707 | The point here being that we revert based on user-reported _behavior_, | |
708 | not based on some "it changes the ABI" or "it caused a bug" concept. | |
709 | The problem was really pre-existing, and it just didn't happen to | |
710 | trigger before. The better IO patterns introduced by the change just | |
711 | happened to expose an old bug, and people had grown to depend on the | |
712 | previously benign behavior of that old issue. | |
713 | ||
714 | And never fear, we'll re-introduce the fix that improved on the IO | |
715 | patterns once we've decided just how to handle the fact that we had a | |
716 | bad interaction with an interface that people had then just happened | |
717 | to rely on incidental behavior for before. It's just that we'll have | |
718 | to hash through how to do that (there are no less than three different | |
719 | patches by three different developers being discussed, and there might | |
720 | be more coming...). In the meantime, I reverted the thing that exposed | |
721 | the problem to users for this release, even if I hope it will be | |
722 | re-introduced (perhaps even backported as a stable patch) once we have | |
723 | consensus about the issue it exposed. | |
724 | ||
725 | Take-away from the whole thing: it's not about whether you change the | |
726 | kernel-userspace ABI, or fix a bug, or about whether the old code | |
727 | "should never have worked in the first place". It's about whether | |
728 | something breaks existing users' workflow. | |
729 | ||
730 | Anyway, that was my little aside on the whole regression thing. Since | |
731 | it's that "first rule of kernel programming", I felt it is perhaps | |
732 | worth just bringing it up every once in a while | |
733 | ||
734 | .. | |
735 | end-of-content | |
736 | .. | |
737 | This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top | |
738 | of the file. If you want to distribute this text under CC-BY-4.0 only, | |
739 | please use "The Linux kernel developers" for author attribution and link | |
740 | this as source: | |
741 | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/process/handling-regressions.rst | |
742 | .. | |
743 | Note: Only the content of this RST file as found in the Linux kernel sources | |
744 | is available under CC-BY-4.0, as versions of this text that were processed | |
745 | (for example by the kernel's build system) might contain content taken from | |
746 | files which use a more restrictive license. |