Step 6: Narrow It Down
This step takes almost a day in my two-day course. It's beyond the scope of this website to cover it completely. If you can't take the course, you can learn all about this step in any of these books:
To see which book will best fit your needs, look here.
Mathematics tells us the fastest way to find a single element in an ordered set is binary search. Binary search is the process of repeatedly ruling out half the remaining search area until the element is found. What makes the system you're troubleshooting an ordered set is your knowledge of it, reinforced by manuals and documentation. It's that knowledge that allows you to devise tests to split the search area in half. Below is a simple diagram of a binary search finding the violet component with only six tests:
This technique really shows its power in systems of several hundred thousand components. For instance, binary search could find a single component in a system of 1,048,576 components (a moderate-sized automated system) using only 20 tests, because each test halves the search area and 2^20 = 1,048,576.
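The arithmetic above can be checked with a short Python sketch. It is only an illustration, not a model of any real system: the function name `narrow_down`, the callback `fault_in_range`, and the faulty index 777,777 are all hypothetical, and it assumes every test gives a reliable answer (no intermittence).

```python
import math


def narrow_down(n_components, fault_in_range):
    """Binary-search for one faulty component among n_components.

    fault_in_range(lo, mid) stands in for a real-world test: it must
    return True if the fault lies in components lo..mid-1, else False.
    Returns (index of faulty component, number of tests performed).
    """
    lo, hi = 0, n_components      # the fault is somewhere in [lo, hi)
    tests = 0
    while hi - lo > 1:            # stop when one component remains
        mid = (lo + hi) // 2      # split the remaining area in half
        tests += 1
        if fault_in_range(lo, mid):
            hi = mid              # rule out the upper half
        else:
            lo = mid              # rule out the lower half
    return lo, tests


# Hypothetical system: 1,048,576 components, one of them faulty.
FAULTY = 777_777
found, tests = narrow_down(1_048_576, lambda lo, mid: lo <= FAULTY < mid)

# The test count matches the base-2 logarithm of the system size.
expected_tests = math.ceil(math.log2(1_048_576))
```

Here `found` is 777,777 and both `tests` and `expected_tests` come out to 20. Doubling the system size adds only one more test, which is why the technique scales so well.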
NOTE: Implicit in all this is that if you keep narrowing it down, whether binary or not, as long as you don't repeatedly double back in areas you've already tested, it is a MATHEMATICAL CERTAINTY you'll eventually solve the problem.
Of course, life isn't this simple. If it were, we'd be getting minimum wage. Several factors combine to make troubleshooting a challenge, even to those employing binary search.
Intermittence invalidates most tests which could split the search area, resulting in backtracking. It thus renders binary search a useful but insufficient tool for troubleshooting. Intermittence eliminates the mathematical certainty of solution -- indeed many intermittents remain unsolved. There are several techniques to maximize your chance of solving an intermittent.
Remember the order comes from your knowledge of the system, and nobody knows everything, including the system documentation. The less complete your knowledge, the more trial and error is necessary. Nevertheless, in real life even a minimum of knowledge allows a reasonable approximation of binary search, so this isn't much of a limitation.
The best way to make your system knowledge into an ordered set is to organize it as a block diagram, which I often call a "Mental Model" because it can be carried in the head as well as on paper. The following example Mental Model organizes knowledge of the Daemontools software into an ordered set:
A more significant limitation is that troubleshooting tests are often time consuming and risky. The test which would most exactly split the remaining search area in half is often the toughest. Thus we temper our desire for even divisions with the reality that we need to minimize time and risk. Often our troubleshooting instinct tells us it's likely the problem resides in a tiny portion of the remaining search area. In that case it's perfectly permissible to test to prove or disprove it's in the tiny area, but it's never permissible just to assume it. If a test carries a credible risk of harm to the system, property or person, try to find a safer test, even if it's harder, targets a less likely area, and doesn't divide the remaining problem scope as evenly.
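One way to picture this tradeoff is the rough Python sketch below. It is my own illustration, not a formula from the course: it assumes you can estimate, for each candidate test, the probability that the problem lies in the area the test checks, plus the time the test takes. It scores each test by expected information (binary entropy, in bits) per minute, and considers a risky test only when no safe test exists. The class `CandidateTest`, the function `choose_test`, and all the sample numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CandidateTest:
    name: str
    p_fault_inside: float  # estimated chance the problem is in the tested area
    minutes: float         # estimated time to run the test (must be > 0)
    risky: bool            # credible risk of harm to system, property or person


def entropy_bits(p):
    """Expected information from a yes/no test; 1.0 for an even 50/50 split."""
    if p <= 0.0 or p >= 1.0:
        return 0.0         # a test whose outcome is certain tells us nothing
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def choose_test(candidates):
    """Pick the safe test with the most expected information per minute."""
    safe = [t for t in candidates if not t.risky]
    pool = safe or candidates   # fall back to risky tests only if none are safe
    return max(pool, key=lambda t: entropy_bits(t.p_fault_inside) / t.minutes)


# Hypothetical candidates for one narrowing step.
candidates = [
    CandidateTest("swap main board", 0.5, 120, risky=False),
    CandidateTest("check fuse", 0.1, 2, risky=False),
    CandidateTest("probe live high-voltage bus", 0.5, 5, risky=True),
]
best = choose_test(candidates)
```

With these made-up numbers `best` is the fuse check: it splits the search area unevenly, but it is quick and safe, so it is worth running before the slow even split. The risky probe is never considered while a safe test is available.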
The following diagram illustrates the actual mental process to narrow it down, influenced by the Mental Model and Quadruple Tradeoff:
The Divide and Conquer process can be thought of as continually forcing the problem into ever smaller boxes, until it's trapped. Some of the worst troubleshooting debacles I've seen involved the problem escaping the box. In other words, the troubleshooter thought he had proved it was in one area, when it was really in another. When that happens, tests become inconclusive and the troubleshooter starts to doubt himself. Whole days can be wasted. Take every precaution to avoid this -- don't skip steps.
The March 1998 issue of Troubleshooting Professional Magazine, themed "Bottleneck Analysis", is essential reading for narrowing problems in systems whose symptom description includes words like "too" or "insufficient". You can see it at http://www.troubleshooters.com/tpromag/9803.htm. The December 1998 TPM describes the narrowing process on intermittent problems, and can be read at http://www.troubleshooters.com/tpromag/9812.htm.