Zen and the Art of Bug Hunting
As with so many things in life, with software bugs prevention is better than the cure
By: Vicky Larmour
Jun. 2, 2009 06:15 AM
Ask any software engineer about difficult bugs they've faced and they'll always have a story to tell you: the bug that took weeks to track down, the bug that was affecting millions of customers but could not be reproduced by any of the development team, or the bug that brought the entire system grinding to a halt that turned out to be due to a single misplaced comma.
Of course, some bugs - the vast majority - are easily tracked down and fixed. But as software systems get more advanced and complex, bug hunting becomes much more of a skill. Some software engineers enjoy the bug hunt, and become experts at chasing down the clues and collecting evidence before eventually determining the root cause of the problem and how to fix it; for many more, though, debugging is simply a chore - something that must be endured at the end of one exciting new development before they are allowed to move on to the next.
In this article I aim to collate and share some of the experiences my colleagues and I - software engineers focusing on complex wireless embedded systems - have gained in our years of bug hunting. The principles are mostly very broadly applicable to all software debugging, so regardless of your particular area of expertise, I hope you will find that the tools and techniques described will help to turn debugging from a frustrating and time-consuming chore into a structured process, leading to reliable results, much like software development itself.
Getting Accurate Information and Reproducing the Bug
Bug descriptions from non-technical users (and even from technical users) are often vague or misleading, so don't be afraid to ask for clarification. Simon Tatham, the author of the popular free terminal tool PuTTY, has written an excellent white paper about how to report a bug, which I recommend as a good set of guidelines to point people at if you have trouble getting useful bug reports.
At this stage you need to understand the circumstances: is there a particular set of user actions that produces the problem? Does it occur only (or far more often) in particular system circumstances, such as a busy network? Are there other factors that may not seem relevant to the user, such as other software installed on the PC or a different operating system from the one you have been testing with?
You also need to make sure you know what the erroneous behavior actually is. A report of "it doesn't work" may mean anything from "the device won't power up at all," right through to "it completes 99% of the requested action before failing with a very specific error message"!
Once you're confident you understand how to reproduce the problem and what the erroneous behavior is, you need to look at gathering as much information from the system and its interfaces as possible. What data is going over the air or on the wire? In an embedded device, what signals are on what pins, lines, or test pads on the board? Use scopes, sniffers and protocol analyzers to capture raw data, control signals, and timing information. Look at a known good case as well as the problem case and try to identify differences.
Log Files and Experimental Results
I normally create a new top-level folder relating to the bug and note where it is in my log book. When I try a particular experiment, I note the details of what I tried in my log book and scribble a test number next to it. I then make a new subfolder named the same as the test number and save all the logs and test output that I gather into that subfolder. I make a summarized note of the results in my log book and move on to the next test. Any special code builds or other scripts that are needed for this particular test also go into the subfolder.
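The folder-per-test discipline above is easy to automate. Here is a minimal C sketch; the `make_test_dir` helper and the `bug-1234/test-03` naming scheme are my own illustrative assumptions, not something prescribed by any tool:

```c
#include <stdio.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create (or reuse) a per-experiment folder such as "bug-1234/test-03"
   and return its path in `out`. Returns 0 on success, -1 on failure. */
int make_test_dir(const char *bug_id, int test_number,
                  char *out, size_t out_len) {
    /* Top-level folder for the bug; an existing folder is fine. */
    if (mkdir(bug_id, 0755) != 0 && errno != EEXIST)
        return -1;
    /* One subfolder per experiment, matching the test number in the log book. */
    snprintf(out, out_len, "%s/test-%02d", bug_id, test_number);
    if (mkdir(out, 0755) != 0 && errno != EEXIST)
        return -1;
    return 0;
}
```

All logs, special builds, and scripts for that experiment then get saved under the returned path.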
Later, when I'm reviewing what I know about the issue, the summary in my log book gives me a high-level view of what I have tried and the results; even if each experiment in isolation doesn't help, a pattern may emerge. If something looks suspicious and I want more information, I still have the original raw logs to go back to, and if I want to run a test again, perhaps looking at a different aspect of the output, I still have the code I used to run that particular test. Later, when you find a fix, you can go back and verify that it matches and explains all the behavior seen during this stage of testing.
During this part of the process you'll also inevitably find yourself using scripts to process log files, or set up particular preconditions for the test, for example. Again the golden rule applies: keep everything. A quick Perl script to filter out a particular set of information in the log file may seem too specific to be useful again, but I guarantee that if you throw it away you will end up rewriting it (or something very similar) in the future, and you can save yourself unnecessary effort simply by keeping the original tool around.
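As an example of the kind of throwaway-looking filter that is worth keeping, here is a minimal sketch in C (the article's original tool was a Perl script; the `filter_log` name and the line-oriented format are my own assumptions):

```c
#include <stdio.h>
#include <string.h>

/* Copy to `out` only the lines of `in` that contain `tag`
   (for example "ERROR" or a task name); return the match count. */
int filter_log(FILE *in, FILE *out, const char *tag) {
    char line[1024];
    int matches = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, tag) != NULL) {
            fputs(line, out);
            matches++;
        }
    }
    return matches;
}
```

Kept alongside the raw logs in the test subfolder, a filter like this costs nothing to store and saves a rewrite next time.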
Again, if you filter the logs, then keep the raw data as well. One problem I worked on showed up only on busy Ethernet networks and I could easily collect gigabytes of network traffic from a single test run, but disk space is cheap compared with the extra time and effort you will spend tracking a problem down if you don't have access to all the information you need.
Striking the Right Balance of Trust
On embedded devices or custom hardware in particular, don't necessarily trust the hardware. There may be individual board or connector issues, or a wider design issue. As with software documentation, the hardware datasheets may not be complete or accurate.
Don't necessarily trust the compiler / linker / toolchain; enabling optimization sometimes introduces bugs and, if it's possible to try with a lower optimization level, this can be helpful. As with anything else, the more custom parts of the toolchain will inevitably be less thoroughly tested, so focus on those first if there does seem to be an issue in the build process. Be prepared to step through assembly code in the debugger to rule out compiler or toolchain issues.
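One common case where "the optimizer broke my code" turns out to be a latent bug in the code itself is a flag shared with an interrupt handler but not declared `volatile`: the optimizer may legitimately cache the flag in a register and turn a polling loop into an infinite loop. A minimal sketch of the corrected pattern (names are illustrative):

```c
#include <stdbool.h>

/* Set from an interrupt handler. Without `volatile`, an optimizing
   compiler may read the flag once and never re-check it in the loop
   below - a bug that only appears at higher optimization levels. */
static volatile bool data_ready = false;

/* Would be installed as the interrupt handler in a real system. */
void isr_on_data(void) {
    data_ready = true;
}

/* Poll for the flag up to max_polls times; 0 = data arrived, -1 = timeout. */
int wait_for_data(int max_polls) {
    int polls = 0;
    while (!data_ready && polls < max_polls)
        polls++;
    return data_ready ? 0 : -1;
}
```

If lowering the optimization level "fixes" a problem like this, the real fix is the missing `volatile`, not the lower optimization level.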
Finally in this section, I would like to emphasize that this litany of "don't necessarily trust..." doesn't mean "blame the problem on someone else"! There's a balance to be struck here. If you think the problem really is with something beyond your control, try to prove it with a minimal test case, and follow the bug-reporting guidelines I mentioned earlier to give a clear problem report to the relevant person.
Complex Issues Can Show Complex Behavior
When an issue shows complex behavior, try to find a minimal test case that still demonstrates the problem, to rule out irrelevant areas. Look for "passive" ways of observing the system; for example, to study Ethernet traffic, run a network sniffer on a different PC connected via a hub (not a switch!).
Bear in mind that issues might turn out to be two separate bugs interacting; when you have found a fix, if all the erroneous behavior is not explained, you may have only solved half the problem.
It can be very helpful to brainstorm complex system issues with other people who are less familiar with the inner workings of the system; the act of explaining the issue to a newcomer can help you to clarify it in your own mind (the so-called "cardboard consultant" effect - even explaining the problem to a cardboard cut-out can help), or they may be able to cast a fresh eye on the problem and see something you have missed.
We have had extremely good results from a structured debugging process that draws on these techniques and combines them into a powerful debugging tool, which is described in detail later in this article.
Common Problem Areas
Dynamic memory management errors - such as writing past the end of an allocated block, or using memory after it has been freed - generally lead to memory corruption, and the symptoms can be just about anything, not necessarily showing up anywhere near the offending code; this makes dynamic memory management a key area to consider when debugging a complex issue. Embedded systems in particular often cannot afford the overhead of memory debugging features in normal use, though you may be able to turn some on (or add some) in your memory management library. These features can use guard blocks to protect the start and end of allocated memory, and can track and log allocations and frees. You may be able to monitor memory usage, either in real time or at set points in the code, using the debugger or shell commands or writing values out to a log file. Some debugging tools provide data breakpoints, allowing you to break execution if the value at a particular address changes, which can help to track down memory corruption that always occurs at the same place.
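The guard-block idea can be sketched in a few lines of C. This is a minimal illustration of the technique, not a production allocator; the function names are my own:

```c
#include <stdlib.h>
#include <string.h>

enum { GUARD_SIZE = 8 };
#define GUARD_PATTERN 0xAA

/* Allocate `size` bytes with pattern-filled guard blocks on either
   side of the payload. Layout: [size][front guard][payload][rear guard]. */
void *guarded_alloc(size_t size) {
    unsigned char *raw = malloc(sizeof(size_t) + 2 * GUARD_SIZE + size);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof(size_t));                 /* stash payload size */
    memset(raw + sizeof(size_t), GUARD_PATTERN, GUARD_SIZE);   /* front guard */
    memset(raw + sizeof(size_t) + GUARD_SIZE + size,
           GUARD_PATTERN, GUARD_SIZE);                         /* rear guard */
    return raw + sizeof(size_t) + GUARD_SIZE;
}

/* Returns 1 if both guards are intact, 0 if either has been overwritten. */
int guarded_check(const void *p) {
    const unsigned char *payload = p;
    const unsigned char *front = payload - GUARD_SIZE;
    size_t size;
    memcpy(&size, front - sizeof(size_t), sizeof(size_t));
    for (size_t i = 0; i < GUARD_SIZE; i++)
        if (front[i] != GUARD_PATTERN || payload[size + i] != GUARD_PATTERN)
            return 0;
    return 1;
}

void guarded_free(void *p) {
    free((unsigned char *)p - GUARD_SIZE - sizeof(size_t));
}
```

Calling `guarded_check` at free time (or periodically) narrows down when the corruption happens, even if it cannot tell you exactly who wrote out of bounds.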
Call stack overflow is another culprit that can cause just about any symptom, so it's worth checking early on. When this occurs the call stack (the per-task stack of function calls, associated return addresses, and context data) becomes too large for the space allocated for it and you may get execution jumping to a random address or trying to execute data rather than code. Many debugging tools will allow you to monitor stack usage, or you can log which tasks are using how much of their stack allocation and catch any overflows as they happen.
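A common way to monitor stack usage without debugger support is to "paint" each task's stack with a known pattern at startup and later scan for the high-water mark. A minimal sketch, using a simulated stack array and assuming the stack grows downward from the high end of the array:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_PAINT 0xDEADBEEFu

/* Fill the whole stack area with the paint pattern at task startup. */
void stack_paint(uint32_t *stack, size_t words) {
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_PAINT;
}

/* Scan up from the low end for the first word no longer painted; every
   word above it has been used at some point (the high-water mark). */
size_t stack_high_water(const uint32_t *stack, size_t words) {
    size_t untouched = 0;
    while (untouched < words && stack[untouched] == STACK_PAINT)
        untouched++;
    return words - untouched;
}
```

Logging `stack_high_water` per task at set points in the code shows which tasks are close to their allocation before an overflow ever happens.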
Race conditions occur when two events are handled by separate tasks and they have not been made properly thread-aware, so they "clash" and both may try to access the same data at the same time. A race condition is an especially likely cause if the problem goes away when you add debug logging in the relevant areas of code, as this slows down the execution of one of the code paths and allows the other code path to complete successfully first. When analyzing race conditions, ask yourself whether all shared data is properly mutex protected; and whether all state machines correctly handle events out of the expected order.
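The classic lost-update race is easiest to see written out as the unlucky interleaving of two tasks each executing `counter++`. The following is a deterministic, single-threaded simulation of that interleaving, not real concurrent code:

```c
/* Simulate the worst-case interleaving of `counter++` from two tasks:
   each ++ is really a load, an increment, and a store, and without a
   mutex around all three steps the tasks' steps can interleave. */
int lost_update_demo(void) {
    int counter = 0;
    int a = counter;   /* task A loads 0            */
    int b = counter;   /* task B loads 0 before A stores */
    counter = a + 1;   /* task A stores 1           */
    counter = b + 1;   /* task B stores 1: A's increment is lost */
    return counter;    /* 1, where two increments should give 2  */
}
```

Holding a mutex across the whole load-modify-store makes the interleaving impossible, which is what "all shared data is properly mutex protected" means in practice.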
Structured Debugging Process
With hard problems like these, it can be awfully tempting (especially for those of us who are very detail-oriented, which is the vast majority of software engineers) to spend time collecting more and more detailed evidence, following every potential lead to its bitter end.
However, this is a trap to be avoided. If you get stuck in it, you run the risk of going vastly over schedule, demoralizing yourself in the process, and potentially still not finding the bug.
What you really need at this point is to take a step back and look at the bigger picture. Just noticing that you've got to this point is a skill in itself, but when you've tried this process a few times and become convinced of its value, you'll soon learn to spot the signs. A good rule of thumb is that when you've investigated two or three completely different avenues, but they have all failed to turn up the bug, it's time to give it a go.
You will need to set up a meeting with a few fellow engineers. Ideally you should include engineers with experience in areas that seem to relate to the bug, as well as at least one person who has not been involved in this bug, or preferably this piece of software development at all so far. However, you don't want more than about eight people altogether or the meeting becomes unmanageable.
The best venue for this meeting is a conference room or lab with lots of whiteboards or flip charts - you need everybody to be concentrating on the bigger picture, not on their own notes.
The first stage is to agree on a single-sentence statement of the problem. At this point you are aiming for clarity ("it doesn't work" is not helpful) without getting so bogged down in details that you lose the big-picture aspect straight away. It may take five or ten minutes to agree on the problem statement, and when everyone is happy, you can write it up on the board and move on to the next stage.
The second stage is to state what is known about the problem, as an absolute certainty. At this point you are not interested in speculation, theories, ideas, suspicions, or anything other than pure hard facts. Encourage the group to challenge anything that may not be 100% factual. For example, an initial statement such as "The problem only shows up on revision G hardware" may be challenged by someone asking whether *all* other hardware revisions have been tested, and then revised to say "The problem shows up on revision G hardware but not D, E or F." All factual statements are listed on the whiteboard.
The third stage is the time for speculation. In line with general brainstorming techniques, at this stage any ideas, theories, and speculation are accepted (no criticism or "it can't be that because..." at this point) and added to a new list on the whiteboard. Get as many ideas as possible as to where the cause might be.
The fourth stage is to filter and prioritize the ideas. Some of the ideas might be testable with a very simple experiment; others might need more detailed investigation. As a group, pick the top few approaches according to what is considered most likely to be the cause and what is easy to test, and divide up the relevant actions between you.
Finally, schedule a follow-up meeting for the next day so everyone can report back. The follow-up meeting should follow roughly the same process, although if you can append to the notes from the previous day rather than starting again it will speed things up considerably.
In our experience of using this structured process, we have found that it brings results extremely quickly.
Fixing What You Have Found
Ideally, when fixing a complex problem, you should look beyond the immediate fix. Your regression testing should cover not just the specific bug you've fixed but any other related code, even if "it should work fine." In complex systems, things that "should work fine" have a nasty habit of becoming someone else's complex problem to debug later.
Prevention Is Better than the Cure