Bug hunting
+++++++++++

Last updated: 20 December 2005

Introduction
============

Always try the latest kernel from kernel.org and build from source. If you are
not confident in doing that, please report the bug to your distribution vendor
instead of to a kernel developer.

Finding bugs is not always easy. Have a go though. If you can't find it, don't
give up. Report as much as you have found to the relevant maintainer. See
MAINTAINERS for who that is for the subsystem you have worked on.

Before you submit a bug report, read
:ref:`Documentation/REPORTING-BUGS <reportingbugs>`.

Devices not appearing
=====================

Often this is caused by udev. Check that first before blaming it on the
kernel.

Finding patch that caused a bug
===============================

Finding using ``git-bisect``
----------------------------

Using the provided tools with ``git`` makes finding bugs easy, provided the bug
is reproducible.

Steps to do it:

- start using git for the kernel source
- read the man page for ``git-bisect``
- have fun

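When the bug can be detected without manual intervention, ``git bisect run`` can drive the search for you. It runs a command at each step and reads its exit status: 0 marks the revision good, 125 tells git to skip it, and any other status from 1 to 127 marks it bad. The following is a minimal sketch of such a test script. The build command and ``./reproduce-bug.sh`` are hypothetical placeholders, not something this document provides, and a bug that only shows up after booting the new kernel needs more machinery than this.

```python
import subprocess

# Exit statuses understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def classify(build_ok, bug_reproduced):
    """Map the outcome of one build-and-test cycle to a bisect exit status."""
    if not build_ok:
        # This revision cannot be tested, so let git pick a neighbour.
        return SKIP
    return BAD if bug_reproduced else GOOD


def command_succeeds(argv):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(argv).returncode == 0


def main():
    # Hypothetical commands: substitute your real build and reproducer.
    build_ok = command_succeeds(["make", "-j4"])
    # Assumption: ./reproduce-bug.sh exits 0 when the bug shows up.
    bug_reproduced = build_ok and command_succeeds(["./reproduce-bug.sh"])
    return classify(build_ok, bug_reproduced)
```

Saved as a script that ends with ``sys.exit(main())`` (after ``import sys``), this could be used as ``git bisect start <bad> <good>`` followed by ``git bisect run python3 <script>``, and ``git bisect reset`` when done.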
Finding it the old way
----------------------

[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]

This is how to track down a bug if you know nothing about kernel hacking.
It's a brute force approach but it works pretty well.

You need:

- A reproducible bug - it has to happen predictably (sorry)
- All the kernel tar files from a revision that worked to the
  revision that doesn't

You will then do:

- Rebuild a revision that you believe works, install, and verify that.
- Do a binary search over the kernels to figure out which one
  introduced the bug. For example, suppose 1.3.28 didn't have the bug, but
  you know that 1.3.69 does. Pick a kernel in the middle and build
  that, like 1.3.50. Build & test; if it works, pick the mid point
  between .50 and .69, else the mid point between .28 and .50.
- You'll narrow it down to the kernel that introduced the bug. You
  can probably do better than this but it gets tricky.

- Narrow it down to a subdirectory

  - Copy the kernel that works into "test". Let's say that 3.62 works,
    but 3.63 doesn't. So you ``diff -r`` those two kernels and come
    up with a list of directories that changed. For each of those
    directories:

    Copy the non-working directory next to the working directory
    as "dir.63".
    One directory at a time, try moving the working directory to
    "dir.62" and putting "dir.63" in its place::

        mv dir dir.62
        mv dir.63 dir
        find dir -name '*.[oa]' -print | xargs rm -f

    And then rebuild and retest. Assuming that all related
    changes were contained in the sub directory, this should
    isolate the change to a directory.

    Problems: changes in header files may have occurred; I've
    found in my case that they were self explanatory - you may
    or may not want to give up when that happens.

- Narrow it down to a file

  - You can apply the same technique to each file in the directory,
    hoping that the changes in that file are self contained.

- Narrow it down to a routine

  - You can take the old file and the new file and manually create
    a merged file that has::

        #ifdef VER62
        routine()
        {
                ...
        }
        #else
        routine()
        {
                ...
        }
        #endif

    And then walk through that file, one routine at a time and
    prefix it with::

        #define VER62
        /* both routines here */
        #undef VER62

    Then recompile, retest, move the ifdefs until you find the one
    that makes the difference.

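The binary search over kernel revisions described in the steps above can be sketched as follows. This is only an illustration: ``first_bad`` and ``is_bad`` are hypothetical names, and ``is_bad`` stands in for the manual work of building, installing, and testing a kernel. The sketch pretends that 1.3.57 introduced the bug.

```python
def first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, that revisions[0]
    is known to be good, and that revisions[-1] is known to be bad.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid      # the bug was introduced at or before mid
        else:
            good = mid     # the bug was introduced after mid
    return revisions[bad]


# Patch levels 1.3.28 (known good) through 1.3.69 (known bad).
revisions = ["1.3.%d" % n for n in range(28, 70)]
tested = []


def is_bad(rev):
    # Stand-in for "build, install, boot, and try to reproduce the bug".
    tested.append(rev)
    return int(rev.split(".")[2]) >= 57


culprit = first_bad(revisions, is_bad)
```

For these 42 revisions the search needs at most six build-and-test cycles, which is why the brute force approach is practical at all.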
Finally, you take all the info that you have, kernel revisions, bug
description, the extent to which you have narrowed it down, and pass
that off to whoever you believe is the maintainer of that section.
A post to linux.dev.kernel isn't such a bad idea if you've done some
work to narrow it down.

If you get it down to a routine, you'll probably get a fix in 24 hours.

My apologies to Linus and the other kernel hackers for describing this
brute force approach; it's hardly what a kernel hacker would do. However,
it does work and it lets non-hackers help fix bugs. And it is cool
because Linux snapshots will let you do this - something that you can't
do with vendor supplied releases.

Fixing the bug
==============

Nobody is going to tell you how to fix bugs. Seriously. You need to work it
out. But below are some hints on how to use the tools.

To debug a kernel, use objdump and look for the hex offset from the crash
output to find the valid line of code/assembler. Without debug symbols, you
will see the assembler code for the routine shown, but if your kernel has
debug symbols the C code will also be available. (Debug symbols can be enabled
in the kernel hacking menu of the menu configuration.) For example::

    objdump -r -S -l --disassemble net/dccp/ipv4.o

.. note::

   You need to be at the top level of the kernel tree for this to pick up
   your C files.

If you don't have access to the source code, you can still debug from a crash
dump, e.g. this crash dump output as shown by Dave Miller::

    EIP is at ip_queue_xmit+0x14/0x4c0
     ...
    Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
    00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
    <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85

Put the bytes into a "foo.s" file like this::

           .text
           .globl foo
    foo:
           .byte  .... /* bytes from Code: part of OOPS dump */

Compile it with ``gcc -c -o foo.o foo.s``, then look at the output of
``objdump --disassemble foo.o``.

Output::

    ip_queue_xmit:
        push       %ebp
        push       %edi
        push       %esi
        push       %ebx
        sub        $0xbc, %esp
        mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
        mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
        mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt

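Typing the ``.byte`` values by hand is error prone, so the conversion from the ``Code:`` line to "foo.s" can be scripted. The sketch below is one possible way to do it, assuming the x86 format shown above (space-separated hex bytes, with the byte at the faulting EIP wrapped in ``<`` and ``>``). The helper name ``code_to_asm`` is made up for this example.

```python
def code_to_asm(code_line, label="foo"):
    """Turn the hex bytes of an oops "Code:" line into assembler source.

    Returns the source text and the index of the byte that was marked
    with <..> (None if no byte was marked).
    """
    text = code_line.split("Code:", 1)[-1]
    values = []
    fault_index = None
    for token in text.split():
        if token.startswith("<") and token.endswith(">"):
            fault_index = len(values)   # byte at the faulting EIP
            token = token[1:-1]
        values.append(int(token, 16))

    lines = ["\t.text", "\t.globl %s" % label, "%s:" % label]
    for i in range(0, len(values), 8):
        chunk = ", ".join("0x%02x" % v for v in values[i:i + 8])
        lines.append("\t.byte %s" % chunk)
    return "\n".join(lines) + "\n", fault_index


oops_code = (
    "Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 "
    "00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 "
    "<8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85"
)
asm_source, fault_index = code_to_asm(oops_code)
```

Writing ``asm_source`` to "foo.s" gives a file that can be assembled as described above. Because the dump may begin in the middle of an instruction, the first few disassembled lines can be wrong. Subtracting the 0x14 offset reported in the oops from ``fault_index`` gives the position of the first byte of ``ip_queue_xmit``, which is a safer place to start reading.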
In addition, you can use GDB to figure out the exact file and line
number of the OOPS from the ``vmlinux`` file. If you have
``CONFIG_DEBUG_INFO`` enabled, you can simply copy the EIP value from the
OOPS::

    EIP:    0060:[<c021e50e>]    Not tainted VLI

And use GDB to translate that to human-readable form::

    gdb vmlinux
    (gdb) l *0xc021e50e

If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function
offset from the OOPS::

    EIP is at vt_ioctl+0xda8/0x1482

And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled::

    make vmlinux
    gdb vmlinux
    (gdb) p vt_ioctl
    (gdb) l *(0x<address of vt_ioctl> + 0xda8)

or, as one command::

    (gdb) l *(vt_ioctl + 0xda8)

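If you handle many oopses, building the GDB command from the ``symbol+offset/size`` notation can be automated. The following is a small sketch, with ``gdb_list_command`` being a hypothetical helper name. It only formats the command; it does not run GDB.

```python
import re


def gdb_list_command(oops_location):
    """Build a gdb "list" command from a symbol+offset/size location."""
    match = re.search(r"(\w+)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)",
                      oops_location)
    if match is None:
        raise ValueError("no symbol+offset/size in %r" % oops_location)
    symbol = match.group(1)
    offset = int(match.group(2), 16)
    size = int(match.group(3), 16)
    if offset >= size:
        # The offset must fall inside the function it is relative to.
        raise ValueError("offset lies outside the function")
    return "l *(%s + %#x)" % (symbol, offset)


command = gdb_list_command("EIP is at vt_ioctl+0xda8/0x1482")
```

The resulting string can be pasted at the ``(gdb)`` prompt. The same helper also accepts call trace entries that carry a module prefix, because it only looks at the function name in front of the ``+``.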
If you have a call trace, such as::

    Call Trace:
     [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
     [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
     [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
     ...

this shows the problem in the :jbd: module. You can load that module in gdb
and list the relevant code::

    gdb fs/jbd/jbd.ko
    (gdb) p log_wait_commit
    (gdb) l *(0x<address> + 0xa3)

or::

    (gdb) l *(log_wait_commit + 0xa3)

Another very useful option of the Kernel Hacking section in menuconfig is
Debug memory allocations. This will help you see whether data has been
initialised or is being used before it is set. To see the values that get
assigned with this, look at ``mm/slab.c`` and search for ``POISON_INUSE``.
When using this, an Oops will often show the poisoned data instead of zero,
which is the default.

Once you have worked out a fix, please submit it upstream. After all, open
source is about sharing what you do, and don't you want to be recognised for
your genius?

Please do read :ref:`Documentation/SubmittingPatches <submittingpatches>`
though to help your code get accepted.