Copyright (C) 2005 by Steve Litt. All rights
Materials from guest authors copyrighted by them and licensed for
use to Troubleshooting Professional Magazine. All rights reserved to
copyright holder, except for items specifically marked otherwise
free software source code, GNU/GPL, etc.). All material herein provided
User assumes all risk and responsibility for any outcome.
Volume 9 Issue
“We now have an intermittent,
transient kind of failure, which is the worst kind of thing to
troubleshoot.” -- Wayne Hale
(NASA deputy manager of the shuttle program)
By Steve Litt
You are not alone!
All those customers, bosses and co-workers who unloaded on you because
you were slow to fix a problem -- you are not alone.
The frustration, the uncertainty, the no-win decisions when faced with
an intermittent -- you are not alone.
As I write this article on 7/24/2005, NASA is hoping to launch Space
Shuttle Discovery on 7/26. But an intermittent says otherwise. At
1:30pm on 7/13/2005, with launch scheduled for 3:51pm that day, the
mission was scrubbed because liquid hydrogen sensor No. 2 indicated
"full" when an empty condition was simulated. A "full" reading during
flight, when the tank is really empty, could cause an engine rupture
followed by tail damage and disaster.
What made this really ugly is they'd seen it before. A similar
problem occurred during a routine fueling test in April 2005.
Subsequent attempts to reproduce the symptom were unsuccessful. The
fuel tank was replaced with a newer version, and some other
possible causes were addressed. Some engineers wanted another fueling
test, but top management opted to test during the launch sequence.
At 1:30pm during the 7/13/2005 launch sequence, the intermittent once
again reared its ugly head.
In the time since 7/13, fourteen teams of engineers have constructed a
fault tree and determined a troubleshooting strategy. Wires were
wiggled, and three areas of bad grounds were sanded and made tight. The
electronic box that processes the fuel gauges' signals has been
investigated. Fuel sensors 2 and 4 were exchanged -- a classic doubleswap.
There's so much to investigate:
- The fuel sensor
- The wiring (including grounds)
- The electronic box that processes the sensor signals
- The gauge itself
As I write this article on 7/24/2005, NASA's plan is to launch the
morning of 7/26. The fuel gauges will be tested full and simulated
empty. My understanding is that if the fuel gauges test good, the
launch will proceed, and if the same fuel gauge tests bad in the same
way as on 7/13, and if the mode of failure is understood, they might
The stakes are high. If the launch doesn't occur in July, the next
launch window occurs in early September, which is the height of
hurricane season in a season that has produced (so far) 7 named storms in
June and July. Because there's an Atlantis launch planned for
September, Atlantis would be pushed back. If the launch does occur, and
the problem recurs, it could cause a fatal malfunction, killing the
astronauts and jeopardizing the shuttle program.
I'm glad I don't have to make the decision. And NASA seems a little
more human to me.
I somehow assumed that the NASA rocket scientists were so brilliant
that they didn't require the Universal Troubleshooting Process. Now,
reading about this classic intermittent's savage battle with 14 teams
of rocket scientists and the full financial and manpower resources of
NASA, it becomes clear that NASA is using the same intermittent busting
techniques I teach in the Universal Troubleshooting Process class, and
facing the same tough decisions about when the intermittent is declared
If you're a good Troubleshooter, you'll be assigned many intermittent
problems, with all the attendant difficulties. At some point you'll
probably doubt yourself. If so, remember you're not alone -- NASA has
the same problems. So kick back, relax,
and remember -- if you're a Troubleshooter, this is your magazine.
What Actually Happened
(Written 4pm 7/24-12:45pm 7/25)
By Steve Litt
This article attempts to summarize, in chronological order, the events
leading up to and including the discovery and handling of Space Shuttle
Discovery's fuel gauge intermittent.
The Soviet Union launched Sputnik I on October 4, 1957. The 183 pound
satellite orbited the earth about once every 96 minutes. The military
uses were obvious, so
the United States moved their space program to the highest priority.
For the next 12 years, no expense was spared in the U.S. space program,
culminating with the Apollo 11 moon landing on July 20, 1969. Five
moon landings followed, the last of which was Apollo 17 in December of
1972. As of today, no man has set foot on the moon since December 1972.
American priorities changed in the 1970's, with many questioning why we
spent so much on space when many Americans were poor and uneducated.
Space exploration became less frequent and unmanned. Then, in the
1980's, the U.S. began the shuttle missions. Columbia, our first
shuttle, became operational in November 1982. Challenger flew in 1983,
Discovery in 1984, Atlantis in 1985, and Endeavour in 1992. The
Challenger blew up January 28, 1986, slowing space exploration. The
next shuttle launch after the Challenger disaster was September 29,
1988. The shuttles had been grounded for more than 2.5 years.
Shuttles continued flying. These shuttles added enormously to our
knowledge, and the satellites they launched are responsible for our
electronic way of life today.
Space Shuttle Columbia shook
apart high over Texas on February 1, 2003. The
shuttle program was put on hold while NASA explored how to reduce the
likelihood of such disasters. Another factor was the economic meltdown
of the early 2000's, which put every federal dollar under increased
competition. More than ever, the space program was expected to be cost
Another factor was the age of the Space Shuttle fleet. Even though the
oldest two shuttles had blown up, the average shuttle age in 2005 is 18
years. Discovery is now 21 years old. When you read about "transistors"
in Discovery's fuel gauge electronics box, keep in mind that Discovery
was built in 1984.
Such was the situation in 2005, as NASA attempted to restart the
The Discovery Timeline
Space Shuttle Discovery was tapped to be the first shuttle into space
since the Columbia disaster. The following is a timeline as it relates
to the fuel gauge problem:
- April 2005: A routine fueling test on Discovery turns up an
inaccuracy in one or more fuel gauge(s). Subsequent attempts to
reproduce the problem failed, and it was labeled an "unexplained
anomaly". My research indicates that steps were taken to cure the
problem, including replacement of some cables, the electronic box for
the gauge, and the fuel tank. I'm not sure whether the additional
reproduction attempts were performed before or after the steps to cure
- July 13, 2005 at 12:00 noon: Discovery is on track for its
scheduled 3:51pm liftoff.
- July 13, 2005 at 1:30pm: A fuel gauge reads full when an empty
condition is simulated. The launch is scrubbed.
- July 13, 2005, afternoon: Launch team drains fuel tank and begins
- July 14, 2005, morning: The fuel sensor on the new empty tank
continues to read full, but later on reverts to the (accurate) reading
- July 14-15, 2005: Engineers enter the aft section of the orbiter
to investigate the electronics box and wiring.
- July 14-15, 2005: Engineers begin making a "troubleshooting plan".
- July 18, 2005: A NASA spokesman mentions that engineers are about
halfway through their troubleshooting plan, which included a fault
tree. Engineers prioritized least invasive first.
- July 21, 2005: NASA managers plan to launch on July 26, even
though they haven't identified the root cause of the fuel gauge
- July 23, 2005: Problems traced to electrical interference and
grounding problems. Three areas of poor grounds were fixed.
- July 23, 2005: Troubleshooting checks complete. One test was to
swap connectors for the #2 and #4 gauges -- a classic doubleswap.
- July 24, 2005: This article is written in Troubleshooting
I have no contacts at NASA. All info here was gleaned off the Internet.
Most of this information was corroborated on several websites, and also
seems to agree with what I've heard on radio and TV. The timeline
mentions that the problems were traced to electrical interference and
grounding problems. I've found no info on how such "tracing" took place
-- was it traced by valid troubleshooting, or was it traced simply by
following a cascade of possible faults?
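The reasoning behind a swap test like the one in the July 23 timeline entry can be sketched in a few lines of Python. This is only an illustration of the general technique, not NASA's procedure: the function name `interpret_swap`, its parameters, and the wording of the conclusions are all assumptions made for this sketch. It also assumes the connectors are swapped at a point where the sensor side moves and the downstream wiring and electronics stay put.

```python
from typing import Optional


def interpret_swap(bad_before: int, bad_after: Optional[int],
                   swapped_a: int, swapped_b: int) -> str:
    """Interpret a connector swap between two gauge channels (hypothetical).

    bad_before: channel showing the symptom before the swap.
    bad_after:  channel showing it after the swap, or None if it vanished.
    """
    if bad_before not in (swapped_a, swapped_b):
        return "swap does not involve the bad channel; no conclusion"
    # The channel on the other end of the swap.
    other = swapped_b if bad_before == swapped_a else swapped_a
    if bad_after is None:
        return "symptom gone; maybe a connection disturbed by the swap, or chance"
    if bad_after == other:
        return "symptom moved with the swapped part; suspect the sensor side"
    if bad_after == bad_before:
        return "symptom stayed put; suspect wiring or electronics on that channel"
    return "symptom appeared elsewhere; no conclusion from this swap"


if __name__ == "__main__":
    # Hypothetical outcome: symptom was on channel 2 and is now on channel 4.
    print(interpret_swap(2, 4, 2, 4))
```

Note the first branch on a vanished symptom: with an intermittent, a swap only tells you something if the symptom shows up again afterward.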
Fuel gauge anomalies were found during an April routine fueling test.
Some subsequent tests found the fuel gauges to be accurate, pointing to
an intermittent problem. Several possible causes were explored and
fixed, including cables, the electronic box for the gauge, and the fuel
tank. This is classic corrective maintenance (general maintenance).
During the July 13 launch sequence similar problems occurred, so the
launch was scrubbed. Some later fuel gauge tests showed the problem
still existed, but still later the symptom went away.
A deep exploration of possible causes was performed, and possible
causes were proactively repaired, such as faulty ground connections, of
which three were actually found.
It has now been decided to launch on July 26, and during that launch to
retest the fuel gauges. There has been some talk of launching with a
defective gauge reading if such defect is in the same gauge as the July
13 problem, and if the defect is well understood.
My investigations on the web, especially when reading between the
lines, tell me that there has been no positive identification of a root
cause, which of course is not uncommon in intermittent problems.
How NASA Coped (Written 4pm
By Steve Litt
The sparsest of all intermittents is an event, and that's just what
happened during an April 2005 fueling test. The gauge read full during
an empty tank simulation, but later (correctly) read empty. By
definition this was an intermittent, in that NASA knew of no way to
reproduce the symptom.
Corrective maintenance is a powerful weapon against intermittence. NASA
performed corrective maintenance by repairing or replacing the fuel
tank, some wiring, and the electronic box that handles the sensor's
Classic Universal Troubleshooting Process theory maintains that one
does not attempt corrective maintenance in safety critical situations
because it eliminates the opportunity to find the root cause. In an
ideal world with infinite funding for NASA, other intermittent busting
tactics would have been used.
However, the reality of the world is that there is always a tradeoff
between safety and economics. NASA felt that, after corrective
maintenance, they could postpone final testing until an actual launch
sequence on July 13.
On July 13, during routine fuel gauge testing 2 1/2 hours
before launch, the problem recurred. This is very fortunate, because if
it had occurred during flight instead of before launch, it might (or
might not) have been fatal.
They scrubbed the launch and began looking for the problem. NASA had
two options:
- Go into full troubleshooting mode
- Stay in pre-launch mode
Going into full troubleshooting mode would have enabled more detailed
testing for the root cause, but also would have involved more
disassembly and foreclosed any possibility of launching in July or
August, which in turn would have impinged on Atlantis' launch, which is
scheduled for the September window. Staying in pre-launch mode would
reduce the likelihood of finding the root cause, but would keep open
the possibility of launching in July. NASA chose to stay in
pre-launch mode while troubleshooting.
Direct from my troubleshooting course, here is a list of intermittent
busting tactics:
- Preventive maintenance
- Corrective maintenance
- Turn the intermittent against itself
- Convert the intermittent into a reproducible
- Logs and strip chart recorders
- Ignore it
NASA is famous for preventive maintenance, but in this case the
intermittent slipped through.
Corrective maintenance was exploited -- the fuel tank, electronic
box and wiring were addressed between April and July. Three defective
electronic grounds were found and corrected after July 13.
They certainly tried to turn the intermittent against itself. Here is a
quote from Shuttle program deputy manager Wayne Hale: "The repair that
might get us to Sunday would be if we go in and wiggle some of the
wires and find a loose connection". In that same news conference Hale
said "You laugh" ... "That probably is the first step in any
troubleshooting plan. Some technician is going to put his hand on the
wires and the connectors ... and start wiggling them."
Ignoring it is not an option in a safety critical situation, so of
course NASA didn't ignore it.
My investigation hasn't turned up any evidence of their trying
specifically to find a reproduction sequence (convert to a
reproducible), but I'm certain they did that.
I know of no use of logs, strip chart recorders or other
instrumentation that looks back in time, but then again, I wasn't
working there, so my information comes from news sources.
What really impressed me about NASA is a tactic not listed above --
fault tree analysis. Fault tree analysis is very expensive, so it would
never be used on consumer computers or the like. But in a situation
that's both safety critical and cost critical, creation of a fault tree
through cause and effect analysis of the system can provide an
exhaustive list of components on which to perform corrective
maintenance, thereby making the corrective maintenance much more likely
to be effective. If the corrective maintenance is truly effective, the
lack of identification of a root cause is less of a problem, although
it's still a problem.
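To make the idea concrete, here is a minimal sketch of a fault tree in Python. The tree below is an assumption built from the suspect areas this issue mentions (sensor, wiring and grounds, electronic box, gauge). It is not NASA's actual fault tree, and the names `Node` and `basic_events` are made up for illustration. Walking the tree down to its leaves yields the checklist of items on which to perform corrective maintenance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One event in a fault tree; a node with no children is a basic cause."""
    name: str
    children: List["Node"] = field(default_factory=list)


def basic_events(node: Node) -> List[str]:
    """Return every leaf (basic cause) under node, in left-to-right order."""
    if not node.children:
        return [node.name]
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(basic_events(child))
    return leaves


# Hypothetical tree: the top event is the symptom, the leaves are things to check.
tree = Node("gauge reads full when empty is simulated", [
    Node("fuel sensor fault"),
    Node("wiring fault", [
        Node("loose connector"),
        Node("bad ground"),
    ]),
    Node("electronic box fault"),
    Node("gauge fault"),
])

if __name__ == "__main__":
    for item in basic_events(tree):
        print("check:", item)
```

A real fault tree would also carry AND/OR gates and likelihoods. This sketch keeps only the part that matters for the argument here: an exhaustive list of leaves to inspect.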
Now that everything in the fault tree has been addressed, the plan is
to launch on July 26, and thoroughly test the fuel gauges during
pre-launch. If the symptom does not appear, the launch will take place.
There is some discussion of launching even in the face of symptom
occurrence, if the symptom is identical, affecting the identical gauge,
and it is understood.
As of noon on 7/25/2005, it appears that NASA definitely plans to
launch with only 3 sensors if the same sensor malfunctions in the same
way and they thoroughly understand the mode by which this malfunction
Critique of NASA's
Handling of this Intermittent (Written 4pm
By Steve Litt
Hey, this is NASA. Every one of their hundreds of engineers is smarter
than I am. These guys truly are rocket scientists.
This article was written between 4pm on July 24, 2005, and 12:45pm on
July 25, 2005, well
before the launch at 10:39am on July 26. I've deliberately stopped
writing this article before the launch to prevent myself from Monday
morning quarterbacking NASA. Hindsight is always 20/20, and for that
After the launch I'll write a separate article in which perhaps I'll
look with hindsight at not only NASA's actions but also my writings in
They are also under two tremendously conflicting pressures -- safety
and economics. The politics of the situation is momentous. It would be
silly to second guess the NASA engineers.
Monday Morning Quarterbacking is never appropriate, so I am rushing
this TPM issue to press before launch, so that by definition I cannot
be Monday Morning Quarterbacking (unless I have a working crystal ball
The above being said, I'd like to analyze my understanding of NASA's
actions from a Troubleshooting viewpoint.
First let me start with my one point of disagreement. Some have
mentioned that if the malfunction occurs on July 26, but it happens to
the same gauge in the same way and is thoroughly understood, the launch
should occur anyway. I STRONGLY disagree. There is currently a safety
policy that you do not launch without all 4 gauges working perfectly.
Safety policies should never be changed to accommodate an intermittent.
I fully applaud NASA's decision to scrub the July 13 launch upon seeing
this problem. You don't ignore intermittents in safety critical
situations. Another point of admiration is their use of a fault tree to
reveal, check, maintain and if need be correct possible root causes.
It's this kind of behavior that makes them true rocket scientists.
Against a brutal intermittent, in a safety critical situation, under
extreme time and budgetary pressure, they made a plan and carried it
out. They displayed The Attitude.
If it were my call, I'd have done more troubleshooting between April
and July 13. Ideally, I'd have pursued troubleshooting methods not
destructive of the root cause. With the frozen fuel still in the tanks,
I'd have pursued manipulation (wiggling etc), tried to find a
reproduction sequence, tried to do some detective work back in time,
perhaps involving logs, journals or strip chart recorders, and possibly
used a method such as Root Cause Analysis.
Slightly less ideally, I'd have pursued the fault tree in April,
corrected/maintained everything revealed, and then done at least one
full cryogenic load in May to try to verify the fix, so that we
wouldn't arrive in July with an intermittent and only one chance before
space to reproduce it.
The preceding two paragraphs outline what I'd do with endless
resources. I have no idea of the time and money constraints of NASA,
nor how many other events (unexplained anomalies) they regularly run
into. Although I'd have tried to handle it a little differently, I have
no beef with the way they handled it.
I'm concerned with the prospect of launching if the symptom doesn't
appear at 10:39 on July 26. If the intermittent has not been fixed by
the corrective maintenance, and chooses to rear its ugly head in space
rather than on the launchpad, things could get ugly. I don't know how
practical this would be, but I'd prefer perhaps a mock launch on July
25, followed by a real launch on July 27. This would give the symptom
two chances to occur on the ground instead of one. Here again, I'm not
privy to the economic, political and safety pressures on NASA, nor do I
have information on the practicality of performing a launch two days
after a trial launch.
I have some suggestions for the future. The fact that three areas of
bad grounding were found indicates a weakness in NASA's preventive
maintenance up to this point. Into the future, I'd like to see
procedures and policies for maintaining all ground connections at
intervals commensurate with the ease of such maintenance and its safety
ramifications. I'd then like to see an engineering group reformulate
all preventive maintenance procedures and policies for the maintenance
of this fleet that now includes craft that are 21, 20 and 13 years
Message to the Press: It's Not
a "Glitch" (Written 11:45am
By Steve Litt
The press missed an opportunity to help the public understand the
significance of intermittent problems. Had the press taken advantage of
this opportunity, John Q. Average could have understood why it takes so
long for his mechanic to fix a problem where "every few days the car bucks for a
few minutes and then goes back to normal".
They could have called it an intermittent problem. Instead they called
it a "glitch". They could have explained that in diagnosing
intermittent problems, one seeks to make the symptom reappear. Instead,
they glossed over it.
Launching rockets might be rocket science, but understanding
intermittent problems is not. An intermittent is simply an on again,
off again problem for which there is no known way to make it happen.
Therefore, diagnostic tests are of limited value, because you don't
know whether the symptom went away because of the diagnostic test, as
opposed to random chance.
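A little arithmetic shows why. As a rough sketch, assume (and this is only an assumption) that a fault which is still present shows its symptom independently on each test with probability p. Then the chance of n clean tests in a row, with the fault still there, is (1 - p)^n. The function name and the sample value of p below are made up for illustration.

```python
def chance_of_false_all_clear(p_symptom: float, n_tests: int) -> float:
    """Probability that a still-present intermittent stays hidden for n tests.

    Assumes each test independently shows the symptom with probability
    p_symptom, which is a simplification.
    """
    if not 0.0 <= p_symptom <= 1.0:
        raise ValueError("p_symptom must be between 0 and 1")
    if n_tests < 0:
        raise ValueError("n_tests must not be negative")
    return (1.0 - p_symptom) ** n_tests


if __name__ == "__main__":
    # Hypothetical: the symptom shows on 1 test in 4 when the fault is present.
    for n in (1, 2, 5, 10):
        print(n, "clean tests:", chance_of_false_all_clear(0.25, n))
```

With that hypothetical p of 0.25, one clean test leaves a 75% chance that the fault is merely hiding, and two clean tests still leave about 56%. A clean test after a repair is weak evidence that the repair worked.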
If the press had spoken to me, I could have explained this.
Instead, they called it a "glitch".
The best I can fathom from dictionary definitions is that "glitch"
means a sudden, unexpected change, often with the connotation of being
minor. There's nothing minor about a problem that could rupture a
shuttle's engine, and nowhere in that definition is it stated that the
glitch will probably reappear. It may seem a single, random event, but
given enough time, it will happen again unless fixed.
If you work for the press, please interview me. You owe it to every car
driver and computer user in the country.
No Further Symptoms (Written
Discovery launched at 10:39am on July 26, 2005. I saw the launch from
60 miles away -- it was beautiful. Extensive tests on launch day failed
to reproduce the fuel gauge symptom, so either this is a very sparse
intermittent or NASA's fault tree driven corrective maintenance fixed
the root cause. Although I would be skittish about launching with an
intermittent not thoroughly understood, the fact is that many times
that's just what we have to do.
I'm very glad they did NOT alter their launch policies and launch with
a known bad fuel gauge system. That, in my opinion, would have been the
wrong decision -- one does not alter a safety policy to accommodate an
intermittent, or for any reason other than proof that the safety
policy was unnecessary.
Every Troubleshooter in the world should take pride in the
troubleshooting job done by NASA's Engineers. Forced upon them was a
sparse intermittent on one of the world's most complex and technically
challenging systems, in a politically charged situation that had brewed
for 2 1/2 years (actually much longer).
Time constraints made non-destructive troubleshooting methods
impractical, so they went with the quickest effective weapon in the war
on intermittents -- corrective maintenance. But not just any corrective
maintenance -- they drove that corrective maintenance with a fault tree
derived from a mental model of the system. This is not easy. It really
is rocket science.
NASA -- you're outstanding!
Reason They Call it Rocket Science (Written 10:15pm
By Steve Litt
Discovery launched at 10:39am on July 26, 2005, ending a 2 1/2
year post-Columbia hiatus. The fuel gauge intermittent problem was
addressed, extensively tested for, and most likely fixed.
Some foam and insulation fell off during launch, creating a possible
safety problem on reentry. This is similar to what destroyed the
Columbia, and the thrust of the last 2 1/2 years
was to prevent future occurrence of falling insulation. It obviously
didn't work. What now?
For starters, future Shuttle launches are on hold, and some question
the future of our space program. That's not good.
During the last 2 1/2 years, many contingencies
have been put in place to address such an event. First, we launch only
in daylight so we can see it happen. We now photograph the launched
craft from many angles, including high altitude jets. If it happens,
we're much more likely to know about it.
Once we know about it, the Astronauts have been given materials and
training to fix many types of problems caused by falling insulation. If
that can't work, the Astronauts stay at the space station until a
rescue craft can be sent.
My point is this: When Columbia shook to pieces over Texas, we didn't
see it coming, and for several days we didn't know the cause. This time
we know it happened, know what to look for, know how to fix it in space
if it can be fixed in space, and have a plan if it can't be fixed.
Anybody saying NASA had 2 1/2 years to fix this
problem and failed doesn't understand the magnitude of what NASA has
Our shuttles are hugely complex because breaking free of Earth's
gravity is a monumental task. It's a challenge, and with challenge
comes failure. Asking for six sigma quality might be reasonable when
manufacturing ball bearings, but not when maintaining a space shuttle.
Space shuttles are extreme, so there are injuries.
I skateboard from point A to point B, and never leave the ground. My
worst injury was a little lost skin and a few bruises. Tony Hawk goes
yards in the air, skates vertical, goes 360 in pipes. Would you hold
him to the same safety record as me? Never.
There's a reason they call it rocket science!
We expect so much from our space program, but do we as a nation have
the commitment to support those expectations? Hiring and keeping the
best brains in the world isn't cheap. The preventive maintenance,
research and development necessary to consistently get up and come down
safely and successfully is expensive.
The fact that the Engineers found three bad grounds during their work
on the fuel gauge problem means they must vastly improve their
preventive maintenance. But is America willing to pay for it? Or are we
looking for performance on the cheap?
If we want a successful, safe space program, we need to pay for it,
even though it's very expensive. We need to get the money. Whether we
cut social programs, cut the Iraq war, cut the military in general,
break medical monopolies, raise taxes, or start selling our national
forests, we must pay for the performance we expect.
One could respond that NASA could work smarter and cheaper. That
rhetoric might work in some sectors, but few are smarter than those
employed by NASA.
Bottom line, we can either pay for a safe and successful space program,
or cede space leadership to China, Russia or the European Union.
Letters to the
All letters become the property of the publisher (Steve Litt), and
be edited for clarity or brevity. We especially welcome additions,
corrections or flames from vendors whose products have been reviewed in
magazine. We reserve the right to not publish letters we deem in
(bad language, obscenity, hate, lewd, violence, etc.).
Submit letters to the editor to Steve Litt's email address, and be
the subject reads "Letter to the Editor". We regret that we cannot
your letter, so please make a copy of it for future reference.
How to Submit an Article
We anticipate two to five articles per issue, with issues coming out
We look for articles that pertain to the Troubleshooting Process, or
on tools, equipment or systems with a Troubleshooting slant. This can
done as an essay, with humor, with a case study, or some other literary
A Troubleshooting poem would be nice. Submissions may mention a
but must be useful without the purchase of that product. Content must
overpower advertising. Submissions should be between 250 and 2000 words
Any article submitted to Troubleshooting Professional Magazine must
licensed with the Open Publication License, which you can view at
At your option you may elect the option to prohibit substantive
However, in order to publish your article in Troubleshooting
Magazine, you must decline the option to prohibit commercial use,
Troubleshooting Professional Magazine is a commercial publication.
Obviously, you must be the copyright holder and must be legally able
so license the article. We do not currently pay for articles.
Troubleshooters.Com reserves the right to edit any submission for
or brevity, within the scope of the Open Publication License. If you
to prohibit substantive modifications, we may elect to place editors
outside of your material, or reject the submission, or send it back for
Any published article will include a two sentence description of the
a hypertext link to his or her email, and a phone number if desired.
request, we will include a hypertext link, at the end of the magazine
to the author's website, providing that website meets the
criteria for links
and that the
website first links to Troubleshooters.Com. Authors: please understand
can't place hyperlinks inside articles. If we did, only the first
would be read, and we can't place every article first.
Submissions should be emailed to Steve Litt's email address, with
line Article Submission. The first paragraph of your message should
as follows (unless other arrangements are previously made in writing):
Copyright (c) 2001 by <your name>. This
may be distributed only subject to the terms and conditions set forth
the Open Publication License, version Draft v1.0, 8 June 1999
at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for
at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest
is presently available at http://www.opencontent.org/openpub/).
Open Publication License Option A [ is | is not]
so this document [may | may not] be modified. Option B is not elected,
this material may be published for commercial purposes.
After that paragraph, write the title, text of the article, and a
sentence description of the author.
Why not Draft v1.0, 8 June 1999 OR LATER
The Open Publication License recommends using the word "or later" to
the version of the license. That is unacceptable for Troubleshooting
Magazine because we do not know the provisions of that newer version,
it makes no sense to commit to it. We all hope later versions will be
but there's always a chance that leadership will change. We cannot take
chance that the disclaimer of warranty will be dropped in a later
All trademarks are the property of their respective owners.
(R) is a registered trademark of Steve Litt.