Zen and the Art of Bug Hunting
As with so many things in life, with software bugs prevention is better than the cure

Ask any software engineer about difficult bugs they've faced and they'll always have a story to tell you: the bug that took weeks to track down, the bug that was affecting millions of customers but could not be reproduced by any of the development team, or the bug that brought the entire system grinding to a halt that turned out to be due to a single misplaced comma.

Of course, some bugs - the vast majority - are easily tracked down and fixed. But as software systems get more advanced and complex, bug hunting becomes much more of a skill. Some software engineers enjoy the bug hunt, and become experts at chasing down the clues and collecting evidence before eventually determining the root cause of the problem and how to fix it; for many more, though, debugging is simply a chore - something that must be endured at the end of one exciting new development before they are allowed to move on to the next.

In this article I aim to collate and share some of the experiences my colleagues and I - software engineers focusing on complex wireless embedded systems - have gained in our years of bug hunting. The principles are broadly applicable to software debugging in general, so regardless of your particular area of expertise, I hope you will find that the tools and techniques described will help to turn debugging from a frustrating and time-consuming chore into a structured process that leads to reliable results, much like software development itself.

Getting Accurate Information and Reproducing the Bug
It may sound obvious, but the first step must always be to understand the problem and, if possible, reproduce the erroneous behavior reliably. This will help you to be sure, later, that you've fixed the problem the user is reporting - it's quite embarrassing to proclaim that you've fixed something only for the user to inform you that the problem is very much still in evidence!

Bug descriptions from non-technical users (and even from technical users) are often vague or misleading, so don't be afraid to ask for clarification. Simon Tatham, the author of the popular free terminal tool PuTTY, has written an excellent white paper about how to report a bug, which I recommend as a good set of guidelines to point people at if you have trouble getting useful bug reports.

At this stage you are aiming to understand the circumstances: is there a particular set of user actions that produces the problem? Does it occur only (or far more often) in particular system circumstances, such as a busy network? Are there other factors that may not seem relevant to the user, such as other software installed on the PC or a different operating system from the one you have been testing with?

You also need to make sure you know what the erroneous behavior actually is. A report of "it doesn't work" may mean anything from "the device won't power up at all," right through to "it completes 99% of the requested action before failing with a very specific error message"!

Once you're confident you understand how to reproduce the problem and what the erroneous behavior is, you need to look at gathering as much information from the system and its interfaces as possible. What data is going over the air or on the wire? In an embedded device, what signals are on what pins, lines, or test pads on the board? Use scopes, sniffers and protocol analyzers to capture raw data, control signals, and timing information. Look at a known good case as well as the problem case and try to identify differences.

Log Files and Experimental Results
Keep Everything!
At this stage you are likely to be accumulating logs and test results at a rate of knots. My golden rule of testing for debugging is always keep everything!

I normally create a new top-level folder relating to the bug and note where it is in my log book. When I try a particular experiment, I note the details of what I tried in my log book and scribble a test number next to it. I then make a new subfolder named the same as the test number and save all the logs and test output that I gather into that subfolder. I make a summarized note of the results in my log book and move on to the next test. Any special code builds or other scripts that are needed for this particular test also go into the subfolder.

Later, when I'm reviewing what I know about the issue, the summary in my log book gives me a high-level view of what I have tried and the results; even if each experiment in isolation doesn't help, a pattern may emerge. If something looks suspicious and I want more information, I still have the original raw logs to go back to, and if I want to run a test again, perhaps looking at a different aspect of the output, I still have the code I used to run that particular test. When you eventually find a fix, you can go back and verify that it matches and explains all the behavior seen during this stage of testing.

During this part of the process you'll also inevitably find yourself using scripts, for example to process log files or to set up particular preconditions for a test. Again the golden rule applies: keep everything. A quick Perl script to filter out a particular set of information in the log file may seem too specific to be useful again, but I guarantee that if you throw it away you will end up rewriting it (or something very similar) in the future, and you can save yourself unnecessary effort simply by keeping the original tool around.

Filtering Logs
In all these reams of logs and test data you're collecting, you may well end up unable to see the wood for the trees. Filtering the logs can be very useful for identifying and locating problem cases, but it can hide useful information too, so there is a balance to be struck here.

Again, if you filter the logs, then keep the raw data as well. One problem I worked on showed up only on busy Ethernet networks and I could easily collect gigabytes of network traffic from a single test run, but disk space is cheap compared with the extra time and effort you will spend tracking a problem down if you don't have access to all the information you need.

Striking the Right Balance of Trust
When searching through source code to look for a potential issue, don't necessarily trust the comments or any associated design documentation. In an ideal world, the comments and documentation will be up-to-date and accurate, but in any case the code is definitive, so it is useful to learn to read it. Sometimes insights can come from noticing where the code and comments diverge - is it simply a cut-and-paste slip from when the code was originally developed, or has a change made to fix an earlier bug introduced another one?

On embedded devices or custom hardware in particular, don't necessarily trust the hardware. There may be individual board or connector issues, or a wider design issue. As with software documentation, the hardware datasheets may not be complete or accurate.

Don't necessarily trust the compiler, linker, or toolchain; enabling optimization sometimes introduces or exposes bugs, and if it's possible to try a lower optimization level, this can be helpful. As with anything else, the more custom parts of the toolchain will inevitably be less thoroughly tested, so focus on those first if there does seem to be an issue in the build process. Be prepared to step through assembly code in the debugger to rule out compiler or toolchain issues.
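
Optimization can also expose latent assumptions in your own code rather than a genuine toolchain bug. As a purely illustrative C sketch (the flag and ISR names here are hypothetical), consider a flag polled by the main code and set by an interrupt handler; without the volatile qualifier, an optimizing compiler may legitimately compile the wait loop into an infinite loop, so the code appears to "work" only at low optimization levels:

    #include <stdint.h>

    /* Hypothetical flag set by an interrupt service routine when data
     * arrives. 'volatile' tells the compiler the value can change outside
     * the normal flow of control, so the loop below must re-read it. */
    static volatile uint8_t data_ready = 0;

    void uart_rx_isr(void)              /* hypothetical ISR */
    {
        data_ready = 1;
    }

    void wait_for_data(void)
    {
        /* With 'volatile' removed, this often works at -O0 and hangs at
         * higher optimization levels - a bug exposed, not caused, by the
         * compiler. */
        while (!data_ready) {
            /* spin (or sleep/yield on a real system) */
        }
        data_ready = 0;
    }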

Finally in this section, I would like to emphasize that this litany of "don't necessarily trust..." doesn't mean "blame the problem on someone else"! There's a balance to be struck here. If you think the problem really is with something beyond your control, try to prove it with a minimal test case, and follow the bug-reporting guidelines I mentioned earlier to give a clear problem report to the relevant person.

Complex Issues Can Show Complex Behavior
Some issues may go away (or change in nature) when you try to debug them. These problems are often due to timing issues and race conditions, or corruption of memory or of the call stack.

In this case, try to find a minimal test case that still demonstrates the problem, to rule out irrelevant areas. Look for "passive" ways of observing the system; for example, to study Ethernet traffic, run a network sniffer using a different PC connected on a hub (not a switch!).

Bear in mind that issues might turn out to be two separate bugs interacting; when you have found a fix, if all the erroneous behavior is not explained, you may have only solved half the problem.

It can be very helpful to brainstorm complex system issues with other people who are less familiar with the inner workings of the system; the act of explaining the issue to a newcomer can help you to clarify it in your own mind (the so-called "cardboard consultant" effect - even explaining the problem to a cardboard cut-out can help), or they may be able to cast a fresh eye on the problem and see something you have missed.

We have had extremely good results from a structured debugging process that draws on these techniques and combines them into a powerful debugging tool, which is described in detail later in this article.

Common Problem Areas
Dynamic memory allocation is a potential source of many bugs in complex C programs. Common errors are:

  • Memory leaks (allocating a block and then failing to free it before all references to the block have been lost or gone out of scope)
  • Reusing freed memory (either reading or writing a block that has already been freed, or attempting to free it again)
  • Buffer overruns (reading/writing past the end of an allocated block of memory)
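
To make these concrete, the following deliberately broken C sketch contains one instance of each error; the function and variable names are hypothetical and for illustration only:

    #include <stdlib.h>
    #include <string.h>

    void memory_bug_examples(void)
    {
        /* 1. Memory leak: the only pointer to the block goes out of scope
         *    without a matching free(). */
        char *leaked = malloc(64);
        (void)leaked;                /* reference lost when we return */

        /* 2. Reusing freed memory: writing through a pointer after free(),
         *    then freeing the same block twice. */
        char *stale = malloc(32);
        free(stale);
        stale[0] = 'x';              /* write to freed memory */
        free(stale);                 /* double free */

        /* 3. Buffer overrun: writing past the end of the allocated block. */
        char *buf = malloc(8);
        if (buf != NULL) {
            strcpy(buf, "much longer than eight bytes");  /* overrun */
            free(buf);
        }
    }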

These problems generally lead to memory corruption, and the symptoms can be just about anything, not necessarily showing up anywhere near the offending code, which makes dynamic memory management a key area to consider when debugging a complex issue. Embedded systems in particular often cannot afford the overhead of memory debugging features in normal use, but you may be able to turn them on (or add some) in your memory management library. These features can use guard blocks to protect the start and end of allocated memory, and can track and log allocations and frees. You may be able to monitor memory usage, either in real time or at set points in the code, using the debugger, shell commands, or values written out to a log file. Some debugging tools provide data breakpoints, allowing you to break execution if the value at a particular address changes, which can help to track down memory corruption that always occurs at the same place.
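
As an illustration of the guard-block idea, here is a minimal sketch of a wrapper around malloc/free; the wrapper names, magic value, and block layout are my own assumptions for the example, not any particular library's API:

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    #define GUARD_MAGIC  0xDEADBEEFu
    #define GUARD_BYTES  sizeof(unsigned int)

    /* Layout: [size_t size][front guard][user data][rear guard]. The guards
     * are checked on free so that overruns are caught close to the culprit. */
    void *my_malloc(size_t size)
    {
        unsigned char *raw = malloc(sizeof(size_t) + 2 * GUARD_BYTES + size);
        unsigned int magic = GUARD_MAGIC;
        if (raw == NULL) {
            return NULL;
        }
        memcpy(raw, &size, sizeof(size_t));                   /* remember size */
        memcpy(raw + sizeof(size_t), &magic, GUARD_BYTES);    /* front guard   */
        memcpy(raw + sizeof(size_t) + GUARD_BYTES + size,
               &magic, GUARD_BYTES);                          /* rear guard    */
        return raw + sizeof(size_t) + GUARD_BYTES;            /* user pointer  */
    }

    void my_free(void *ptr)
    {
        unsigned char *raw;
        size_t size;
        unsigned int front, rear;
        if (ptr == NULL) {
            return;
        }
        raw = (unsigned char *)ptr - GUARD_BYTES - sizeof(size_t);
        memcpy(&size,  raw, sizeof(size_t));
        memcpy(&front, raw + sizeof(size_t), GUARD_BYTES);
        memcpy(&rear,  raw + sizeof(size_t) + GUARD_BYTES + size, GUARD_BYTES);
        assert(front == GUARD_MAGIC && rear == GUARD_MAGIC);  /* corruption? */
        free(raw);
    }

A real implementation would typically also record the caller of each allocation so that leaks and double frees can be traced back to their source.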

Call stack overflow is another culprit that can cause just about any symptom, so it's worth checking early on. When this occurs, the call stack (the per-task stack of function calls, associated return addresses, and context data) becomes too large for the space allocated for it, and you may get execution jumping to a random address or trying to execute data rather than code. Many debugging tools will allow you to monitor stack usage, or you can log which tasks are using how much of their stack allocation and catch any overflows as they happen.
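
One common way to log stack usage is "stack painting": fill each task's stack with a known pattern when the task is created, then periodically measure how much of the pattern has been overwritten. A minimal C sketch, assuming 32-bit stack words and a stack that grows downwards (the pattern value and function names are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    #define STACK_FILL_PATTERN 0xA5A5A5A5u

    /* Call once, before the task starts running on this stack. */
    void stack_paint(uint32_t *stack_base, size_t stack_words)
    {
        size_t i;
        for (i = 0; i < stack_words; i++) {
            stack_base[i] = STACK_FILL_PATTERN;
        }
    }

    /* Peak number of words used so far, assuming the stack grows down from
     * stack_base + stack_words towards stack_base. */
    size_t stack_high_water_mark(const uint32_t *stack_base, size_t stack_words)
    {
        size_t untouched = 0;
        while (untouched < stack_words &&
               stack_base[untouched] == STACK_FILL_PATTERN) {
            untouched++;          /* still holds the original fill pattern */
        }
        return stack_words - untouched;
    }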

Race conditions occur when two events are handled by separate tasks and they have not been made properly thread-aware, so they "clash" and both may try to access the same data at the same time. A race condition is an especially likely cause if the problem goes away when you add debug logging in the relevant areas of code, as this slows down the execution of one of the code paths and allows the other code path to complete successfully first. When analyzing race conditions, ask yourself whether all shared data is properly mutex protected; and whether all state machines correctly handle events out of the expected order.
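
As a minimal illustration of "properly mutex protected", here is a pthreads sketch guarding a hypothetical shared counter; without the lock, the read-modify-write can interleave between threads and updates are silently lost:

    #include <pthread.h>

    static unsigned long packets_handled = 0;
    static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

    void record_packet(void)
    {
        pthread_mutex_lock(&stats_lock);
        packets_handled++;               /* protected read-modify-write */
        pthread_mutex_unlock(&stats_lock);
    }

    unsigned long get_packets_handled(void)
    {
        unsigned long value;
        pthread_mutex_lock(&stats_lock);
        value = packets_handled;
        pthread_mutex_unlock(&stats_lock);
        return value;
    }

On an embedded RTOS the primitive might be a semaphore or critical section rather than a pthread mutex, but the principle - every access to the shared data goes through the same lock - is the same.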

Structured Debugging Process
Sometimes a bug will be particularly hard to track down. Several promising leads may present themselves initially, but as you follow the evidence down a particular route, it will ultimately fade away and you will be back to square one. In cases like this the issue itself is usually complex, involving multiple different software modules, and tests or experiments that you have run may present seemingly conflicting or contradictory evidence.

With hard problems like these, it can be awfully tempting (especially for those of us who are very detail-oriented, which is the vast majority of software engineers) to spend time collecting more and more detailed evidence, following every potential lead to its bitter end.

However, this is a trap to be avoided. If you get stuck in it, you run the risk of going vastly over schedule, demoralizing yourself in the process, and potentially still not finding the bug.

What you really need at this point is to take a step back and look at the bigger picture. Just noticing that you've got to this point is a skill in itself, but when you've tried this process a few times and become convinced of its value, you'll soon learn to spot the signs. A good rule of thumb is that when you've investigated two or three completely different avenues, but they have all failed to turn up the bug, it's time to give it a go.

You will need to set up a meeting with a few fellow engineers. Ideally you should include engineers with experience in areas that seem to relate to the bug, as well as at least one person who has not been involved in this bug, or preferably this piece of software development at all so far. However, you don't want more than about eight people altogether or the meeting becomes unmanageable.

The best venue for this meeting is a conference room or lab with lots of whiteboards or flip charts - you need everybody to be concentrating on the bigger picture, not on their own notes.

The first stage is to agree on a single-sentence statement of the problem. At this point you are aiming for clarity ("it doesn't work" is not helpful) without getting so bogged down in details that you lose the big-picture aspect straight away. It may take five or ten minutes to agree on the problem statement, and when everyone is happy, you can write it up on the board and move on to the next stage.

The second stage is to state what is known about the problem with absolute certainty. At this point you are not interested in speculation, theories, ideas, suspicions, or anything other than pure hard facts. Encourage the group to challenge anything that may not be 100% factual. For example, an initial statement such as "The problem only shows up on revision G hardware" may be challenged by someone asking whether *all* other hardware revisions have been tested, and then revised to say "The problem shows up on revision G hardware but not D, E or F." All factual statements are listed on the whiteboard.

The third stage is the time for speculation. In line with general brainstorming techniques, at this stage any ideas, theories, and speculation are accepted (no criticism or "it can't be that because..." at this point) and added to a new list on the whiteboard. Get as many ideas as possible as to where the cause might lie.

The fourth and final stage is to filter and prioritize the ideas. Some of the ideas might be testable with a very simple experiment; others might need more detailed investigation. As a group, pick the top few approaches according to what is considered most likely to be the cause and what is easy to test, and divide up the relevant actions between you.

Finally, schedule a follow-up meeting for the next day so everyone can report back. The follow-up meeting should follow roughly the same process, although if you can append to the notes from the previous day rather than starting again it will speed things up considerably.

In our experience of using this structured process, we have found that it brings results extremely quickly.

Fixing What You Have Found
If you've found the cause of a problem and the fix is "obvious," that doesn't necessarily make it right. For example, if you've found a double free in the code path - is there another code path that was solely relying on the free you've just removed?

Ideally, when fixing a complex problem, you should look at:

  • All calls to the function you are changing, especially if it's a function used by several different and unrelated subsystems
  • Anything that uses any data affected by your change
  • Anything you have a vague gut feel might be related (intuition can be invaluable)

Your regression testing should cover not just the specific bug you've fixed but any other related code, even if "it should work fine." In complex systems, things that "should work fine" have a nasty habit of becoming someone else's* complex problem to debug later.
[*] (or more likely yours, long after you've forgotten the details of the original issue)

Prevention Better than the Cure
As with so many things in life, with software bugs prevention is better than the cure. So how can you limit the time you spend debugging? A little effort upfront can save a lot of time in the long run, and it's pretty well known that the earlier a problem is caught, the less it costs to fix. Here are some tips to help achieve that glorious nirvana where all software is developed bug-free from the start:

  • Take the time to understand the platform - its features, limitations, debugging facilities. Read the datasheets, talk to others who have used it about any particular quirks it has.
  • Keep comments up-to-date, accurate, and informative.
  • Always test return values from functions, and have a coherent strategy for handling memory/resource allocation failures (in general, propagate errors upwards until they can be handled - see the sketch after this list). Panicking versions of allocation functions are more appropriate in cases where even the most basic operation can't continue.
  • Build in monitoring tools (memory allocation, stack usage) and use them during development. Don't ignore strange behavior - investigate it and solve the problem straightaway. If you're really pressed for time, at least log it (in your daily log book or in your project's issue tracking tool) to be looked at later.
  • A nightly build can flush out lurking issues, especially if combined with a static analysis tool. Aim for warning-free builds and investigate any compiler warnings that are generated.
  • Code often evolves; if a module or function starts to feel too complicated, it probably is. Consider rewriting it from scratch, breaking it down further, or partitioning it differently.
  • Single-step code during development. Code may appear to pass tests, but watching it run can show up problems and improve confidence.
  • Monitor hardware signals when developing drivers. A driver may appear to work but inspection of the signals can show up problems such as:
    -Unhandled interrupts
    -Marginal timings
    -Excessive retries
    -Hardware bugs
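
Returning to the point above about testing return values and propagating errors upwards, here is a minimal C sketch of the idea; the status type and function names are hypothetical:

    #include <stdlib.h>
    #include <string.h>

    typedef enum { STATUS_OK = 0, STATUS_NO_MEMORY } status_t;

    /* Propagate an allocation failure to the caller rather than ignoring it;
     * the caller decides whether the failure is fatal. */
    status_t make_copy(const char *src, char **out_copy)
    {
        char *copy = malloc(strlen(src) + 1);
        if (copy == NULL) {
            *out_copy = NULL;
            return STATUS_NO_MEMORY;
        }
        strcpy(copy, src);
        *out_copy = copy;
        return STATUS_OK;
    }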
About Vicky Larmour
Vicky Larmour is a Principal Software Engineer in the Wireless Division at Cambridge Consultants, one of the largest independent wireless design teams in the world. After reading Computer Science at Cambridge University, Vicky started her career writing database applications. Since then, she has moved closer to the hardware and now works on embedded systems; recent projects have included a DECT-based cardiac monitoring system and a new generation of a satellite phone handset.
