Troubleshooters.Com Presents

Troubleshooting Professional Magazine

November 2022

The October Debacle

Steve Litt is the author of the Universal Troubleshooting Process Courseware, which can be presented either by Steve or by your own trainers.

He is also the author of Troubleshooting Techniques of the Successful Technologist, and several other books on troubleshooting and other human performance factors.


The most valuable thing you can make is a mistake - you can't learn anything from being perfect. -- Adam Osborne

CONTENTS:

Editor's Desk

I call it the October Debacle. Troubleshooters.Com was down from October 13, 2022 to October 17, because my main computer was down. For a business, this is inexcusable, so this issue of Troubleshooting Professional Magazine serves as a Truth and Reconciliation document for the purpose of preventing future occurrence. The conclusions drawn might surprise you.

Just to review, the ten steps of the Universal Troubleshooting Process are as follows:

In the preceding, "Prepare" includes "Get The Attitude". The Attitude is a frame of mind necessary for troubleshooting. Oversimplifying, this frame of mind is a combination of "don't panic", "don't get angry", and "hunt the root cause like you're a cold, deadly predator". When the Troubleshooter's mindset departs from The Attitude, it's called an Attitude Violation, and the best thing is to stop until the Attitude Violation is resolved.

Although Make a Damage Control Plan is step 2, it must be revisited when things change. Remember this, because this was where I truly departed from effective troubleshooting.

This issue of Troubleshooting Professional Magazine tells a sad story that might help you avoid similar situations. So kick back, relax, and enjoy the read. And remember, if you're a Troubleshooter, this is your magazine.

Steve Litt is the author of "Troubleshooting Techniques of the Successful Technologist". Steve can be reached at Steve Litt's email address.

Genesis of a Problem

By Steve Litt

It began innocently enough. Over the weekend 10/8/2022 and 10/9/2022, and then Monday 10/10/2022, I rewrote my now glitchy, overpatched and almost unmaintainable Perl reminder program in Python. By Tuesday morning it was complete except for deployment, which is when things started getting dodgy.

Deployment consisted of hooking up my program to either cron or making it a daemon so that it could give reminders at certain times. Cron is great for things that run as root, but multiple versions and inadequate documentation make it difficult for things that run as a normal user, which is what I needed.

In researching making it a daemon, I began having some hideous problems daemonizing it with my Runit process supervisor, causing me to ask questions about Void Linux's Runit implementation, on the #voidlinux IRC channel, late on 10/11/2022. At 3:15 AM on 10/12 I uttered this prophetic sentence: "It felt like I was fighting an evil magician."

By 4:20 AM I'd been instructed to test the integrity of my package installations, which it turned out had several errors. Then the suggestion was made that I reinstall those packages by force. I believed this would be dangerous, possibly causing a no-boot or partial-boot situation. So I decided not to follow that suggestion until I had a good, solid data backup, and enough data to reinstall my OS.

By the way, I'd fixed most of my Runit supervision problems by this point, so my main concern now was fixing the package installation problems. At 5:11 AM on 10/12 I finally went to sleep.

I awoke at 10:45 on 10/12 and got right down to it. I should mention that I need quite a bit more than 8 hours of sleep at night, so I was starting to get sleep deprived. I switched to the #xbps IRC channel, which is devoted more to the package manager than to Void Linux.

At 4:16 PM on 10/12 my Evolution email client stopped working, aborting on a symbol lookup error. At about 4:20 PM someone on IRC hypothesized that the root cause of all my many spooky symptoms could be a failing disk, and talk turned to hard disk testing and use of the Smartctl disk testing software. A short Smartctl test yielded the scary sounding but unhelpful message "Error Information Log Entries: 255", with no way I could find of actually listing those errors. At this point I went online and ordered a new NVMe for about $200 USD.

Sleep deprived and agitated as I was, I asked questions which I'd normally answer with web searches. But my brain was operating on half power: I needed some hand-holding. At 4:34 PM on 10/12 came the post that added a new dimension to the process.

Steve Litt is the author of Twenty Eight Tales of Troubleshooting. Steve can be reached at Steve Litt's email address.

The Fateful Post

At 4:34 PM on 10/12/2022, an IRC inhabitant posted the following:

I'm a little disappointed, given you run http://troubleshooters.com

I get that a lot. Always have. Sometimes it's legitimate, sometimes it's an unrealistic expectation that I'm perfect, and sometimes it's a mixture of both. When I hear phrases like this, I always take some time, back away, and evaluate.

There's absolutely no doubt that many of my IRC questions were answerable with a few minutes of web research, and that I acted like a panicked newbie.

On the other hand, it's also true that I was sleep deprived, facing business disruption, dealing with one or more intermittent and hard to define problems, and that I'm not all that knowledgeable about either packaging systems or hardware problems. I needed quite a bit of help.

Slide 69 of my Universal Troubleshooting Process courseware is titled "Panic Remedies", and the third bullet item on that slide is "Ask for help". So I was pretty much doing what I teach. However, slide 70, titled "Panic Buster", has an item called "Time, attention, sleep or rest", which I obviously was disregarding. That being said, I get insomnia when I have problems, so in this case I felt the best course of action was to carefully continue to troubleshoot instead of trying unsuccessfully to sleep.

My course has an entire section on The Attitude, which is the attitude and mindset necessary for productive troubleshooting. I teach to watch out for Attitude Violations and solve them before continuing to troubleshoot. Although for days I was on the edge of several Attitude Violations (panic and anger), I recognized these upcoming Attitude Violations and kept them at bay.

Bottom line, except for lack of sleep, which is a physiological and psychological thing with me, I was doing pretty much what I teach others to do.

I'm a world-renowned expert on the process and mindset of troubleshooting, which is very different than being a subject matter expert on things like package managers and hard disk utilities. To troubleshoot something you need both the process and mindset of troubleshooting, and a complete Mental Model of the thing you're repairing, and good knowledge of the tools needed for the thing you're repairing. If I'd been troubleshooting a Python or C program I wouldn't have needed to ask so many questions, because I'm pretty knowledgeable about those.

Then there's the fact that I was just plain having a bad day (bad week actually). On January 28, 1970, Jimi Hendrix, the best rock guitar player of that era, played two songs at Madison Square Garden, then put his guitar down and walked off stage, never to finish his act. Everybody, no matter how expert in their field, has the occasional bad day.

So, as the guy who runs Troubleshooters.Com and has taught hundreds of students face to face on the process and mindset of Troubleshooting, I'd say that while my performance was a little rocky (and it got worse on following days), I kept following process and didn't let an attitude violation get the best of me.

One more thing. Remember the guy who said "I'm a little disappointed, given you run http://troubleshooters.com"? He gave me tons of help before and after saying that: I owe him a lot. I also owe him because he acted as a canary in a coal mine: He brought up my excessive questioning before everyone else got sick of it, allowing me to ask less frequent and more thoughtful questions.

Interestingly, his next sentence was this: "But I guess it will make up a good story for it afterwards." This turned out to be an understatement. Read on...

Steve Litt is the author of the Universal Troubleshooting Process courseware. Steve can be reached at Steve Litt's email address.

Bad RAM

By Steve Litt

I started a Memtest86 RAM test at about 2 AM on 10/13/2022. Even though I expected the test to take ten hours, within a few minutes I got errors. This was actually good news: There was now an explanation consistent with all the weird stuff that had been happening to my computer. At this point all I needed to do was run Memtest86 on each of my four 16GB RAM sticks, one at a time. I was about 16 hours from resolution of my problems. But instead I made a huge blunder...

Steve Litt is the author of the Recession Relief Package. You can email Steve here.

The Blunder

By Steve Litt

This part of the story is hard to tell. Let's just say it's not ideal resume material.

I'm kind of a slob and kind of a hoarder. Before I started working for myself, most of my performance reviews went something like "You do excellent work Steve, but you really need to improve your housekeeping and toolsmanship."

The correct move after discovering the memory errors would have been to bring my (full tower) computer to my work bench, take it apart, and test with one RAM stick at a time. But the tiny table that serves as my "work bench" was covered with all sorts of stuff, and on my desk the computer was socked in with all sorts of stuff, making removal difficult. It would have taken a few hours to do the right thing and bring the computer to my (clear) work bench. So instead, at 2:30 AM on 10/13, with little sleep the past two days, I messed with the BIOS settings in case they were what was at fault. Bzzzt, wrong!

I made every effort to reduce the frequency the RAM ran at, to no avail. Finally, I disabled the computer's fast boot, and at the same time set an exotic timing parameter to a very low value. When I rebooted, the fans spun, but nothing happened on the monitor, or anywhere else. By this time it was 4 or 5 AM. I slept fitfully for a few hours. My records show that at 11:24 AM on 10/13, using my laptop, I shut down Troubleshooters.Com to prevent new orders, which I couldn't fulfill. Then I got back to work.

I cleared off my work bench, not an easy task given there's very little room to temporarily store what was there. A lot got thrown out. Then I cleared enough off my desk to remove the computer, once again tossing a lot of stuff, and brought it to my work bench.

My experience in no-boot situations has been that you disconnect all disks and peripherals and pretty much everything but the power supply, RAM and video card. So that's what I did. I even tried it with each of my RAM sticks, one at a time. And still nothing on my monitor. By this time it was mid-afternoon, my attitude was shot, and I knew I was beat. I put my computer in my car and drove it to Refresh Computers to have them repair it. They'd probably know exactly which two pins to short to set the BIOS back to its default values.

Steve Litt teaches courses on troubleshooting. You can email Steve here.

Refresh Computers

By Steve Litt

In my opinion, Refresh Computers is currently the best computer repair outfit within 10 miles of me. I showed up with my computer at about 6pm on 10/13, but their technician, John, wasn't there and wouldn't be there on 10/14 either. I gave the salesman a very detailed symptom description, letting him know that when the tech got anything to show on the screen, his part of the repair was complete. I'd handle it from there. I went home, feeling relieved my computer was in good hands. That night, for the first time in a long time, I got a good night's sleep.

John, the tech at Refresh Computers, couldn't get to my computer Saturday, but he called me Sunday morning, 10/16/2022 with some unexpected and unwanted news: He couldn't get my computer to put anything on the screen. He'd taken out the battery and done some other things that would typically reset the BIOS, but nothing worked. He said there were no pins you could short to reset the BIOS.

With my heart thudding like a sequence of firecrackers, I drove to Refresh Computers to pick up my dead machine. All the way there, I contemplated the expense and time it would take to obtain and install a new motherboard, CPU and RAM. Troubleshooters.Com could be down a couple weeks, and that was a lot more scary than the thousand bucks I'd need to spend for the new parts.

During the ten foot trip from my car to the front door of Refresh Computers I regained a shred of The Attitude, and decided to ask John to try again with a known good video card.

I'm often asked how I can tell if somebody is a good Troubleshooter. The way I detect a good Troubleshooter is that I suggest a short and easy diagnostic test. Poor Troubleshooters, who comprise more than half of all people in repair positions, respond with a five minute dialog about how my diagnostic test couldn't possibly shed light on the situation. Good Troubleshooters say OK, let's try it. When I asked John to swap in a known good video card, John said "I doubt that will help, but sure, let's try it!". John's one of the good ones.

John swapped in a known good video card, and POP, the BIOS configuration screen came up on the monitor. And a big smile came to John's face. Then John put my original video card back in, and once again, the BIOS config screen came to the monitor.

Saaaaay whaaaat???

Nothing can be proven, because I'm unwilling to redo the entire incident to prove what really happened. But I have a strong suspicion what happened...

First of all, when John took out the battery and did his other tricks, he did succeed in resetting the BIOS back to its default settings. I know this because when I went back in, the unwise changes I had made were backed out.

Next, I have a suspicion that my video card kept some state information, powered up or not. However, once my video card was disconnected from the computer, it lost that state info and was once again able to communicate with a monitor.

I owe Refresh Computers a big debt of gratitude. When I completely and utterly lost The Attitude, they stepped in and got things back on track.

Steve Litt is the author of the Universal Troubleshooting Process courseware. Steve can be reached at Steve Litt's email address.

The Rest of the Story

By Steve Litt

So here I was, the afternoon of Sunday, 10/16, driving home with a computer that could put the BIOS config screen on a monitor. I'd also bought a small known good monitor from Refresh, because I have my suspicions about the monitor and HDMI cable I'd been using before. The original monitor had at a certain point grayed out a lot of its menu items, and that's just spooky.

On the way home I decided to work slowly, deliberately, and relax to the extent possible, making sure not to have another Attitude Violation. If Troubleshooters.Com stayed down a few more days, tough.

At home again, the computer wouldn't go past the BIOS config screen. An hour of trial and error solved that. By now it was night. A four pass Memtest86 run told me my first RAM stick was perfect. Early into testing, my second RAM stick failed, so I put it aside, kicked off a test on the third stick, and went to bed.

The next day was Monday, 10/17/2022. My third stick had tested perfect, so I tested the fourth stick, which took about 4.5 hours. It tested perfect. Then I put in all three good sticks in the proper memory slots, and ran a shorter Memtest86 on the 48GB to verify that the combination of them was OK. It was.

I downloaded my email, made sure I could send email, and made sure everything else was OK. By this time it was late at night. At 2:50 AM on 10/18/2022, I restored Troubleshooters.Com. The October Debacle was over. Troubleshooters.Com had been down from 11:24 AM on 10/13 to 2:50 AM on 10/18. Almost five days. This is not good for business.
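The "almost five days" figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the two timestamps are the ones given above, and the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps taken from the story; the variable names are illustrative.
site_down = datetime(2022, 10, 13, 11, 24)  # 11:24 AM on 10/13/2022
site_up = datetime(2022, 10, 18, 2, 50)     # 2:50 AM on 10/18/2022

downtime = site_up - site_down

# timedelta stores whole days separately from the leftover seconds.
hours, remainder = divmod(downtime.seconds, 3600)
minutes = remainder // 60

print(f"Down for {downtime.days} days, {hours} hours, {minutes} minutes")
```

That works out to 4 days, 15 hours and 26 minutes.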

Analysis

By Steve Litt

Step 9 of the Universal Troubleshooting Process is Take Pride. A big part of Take Pride is to answer the following two questions:

  1. In what ways did I do well?
  2. How can I do even better next time?

Question #2 is phrased to put less than optimal things into a positive light, which helps one be honest with one's self.

In what ways did I do well?

Most obviously, for days I was operating right on the edge of an Attitude Violation, mainly panic. But I think an analysis will show I never completely departed from The Attitude. Knowing when you've lost The Attitude, stopping troubleshooting until you can re-acquire The Attitude, and only then resuming Troubleshooting, is a skill. So is recognizing when an Attitude Violation is imminent, and backing off. I displayed both those skills.

Also, I was sleep deprived the whole time, and you shouldn't troubleshoot when sleep deprived. But what choice did I have? I couldn't sleep, so I could either toss and turn and be even more sleep deprived, or I could get up and troubleshoot. I did the latter. Regrettable, but the right decision given my physiology and psychology.

I don't feel particularly bad about asking newbie questions on the Voidlinux and XBPS IRC channels. I did some web searches before asking, but not enough. A Troubleshooter should ask for help with things they don't know. Considering my mental state, lack of sleep, and inadequate Mental Model of XBPS, hardware and cron, I think my questioning was acceptable.

I like the fact that I turned it over to the pros at Refresh Computers. When the time came, I recognized that further efforts by me would be futile, and did the right thing.

I like the idea that I did the disk tests on the NVMe. Somebody else suggested it, I realized such testing was low risk and perhaps high return, so I did it. Once given the idea of testing hardware, memory testing was obvious, and I did it.

Particularly important was my almost flawless data backup strategy, continuously improved since an unrecoverable data loss in 1987. No matter how bad things got, I knew eventually I'd get things put back together because I'd backed up all my data, leaving only the OS unbacked. Installing a new OS isn't particularly hard, it just takes a few hours.

How can I do even better next time?

First and foremost, from now on I'll refrain from messing with BIOS settings I know nothing about.

Also, I was much too close to the edge of an Attitude Violation the whole time. I need a way to take the pressure off when my main business machine goes down, and I should have done even more to get my attitude strong. Next time I'll remember that being close to an Attitude Violation is my problem, not the problem of my fellow IRC inhabitants.

Step 10: Prevent Future Occurrence

By Steve Litt

The last step of the Universal Troubleshooting Process, Step 10, is Prevent Future Occurrence. In any future incidents resembling the October Debacle, the most important thing to prevent is the business disruption. I've come up with a set of ideas:

Never again mess with a BIOS setting I know nothing about.

This is obvious. Messing with BIOS settings I didn't understand was a stupid move with a foreseeable result that set me back three or four days.

Have manuals for my motherboards.

Perhaps if I had a manual for my Daily Driver Desktop I could have looked up the meaning of the BIOS settings before changing them. Also, I possibly could have found how to reset the BIOS.

Keep my desk and work bench clear enough for a moment's action.

Moving my computer from my desk to my work bench should take ten minutes, not hours. The ten minute timeframe requires housekeeping action on at least a bi-weekly basis. It's not fun, it's not interesting, I'd rather be writing code, web pages, a book or a course, but after seeing the result of the mess, I think I'm finally in a position to clear up regularly.

Have two cardboard boxes ready to sweep my desk and workbench stuff into for a quick change of plans.

No matter how well I maintain housekeeping, there will be some stuff both on my desk and on my work bench. I need boxes to simply sweep all the stuff into so getting to work takes five minutes. Because I have so little room, these boxes should be stored broken down and assembled as boxes only when needed.

Maintain and keep up to date a shadow computer to switch over to.

Imagine how differently this story would have played out if I'd had an up-to-date, spare computer ready to fill in as my Daily Driver Desktop (DDD) while I was repairing my real DDD. Imagine the reduced stress.

I've already taken my old DDD (2014-2020) and turned it into the shadow computer. I just need procedures for keeping it up to date.

Create a script to sync the shadow computer with my Daily Driver Desktop (DDD)

This is tricky and must be done exactly right. This script, run on a regular basis, will rsync my DDD data to the shadow computer. It is vitally important this rsync never go the other direction, writing the shadow's older data to the DDD. Both my DDD and shadow will have static IP addresses, so one part of the solution is to fail with an error if run from the wrong computer or to the wrong computer. There must be other safeguards, which I haven't thought of yet. So this will take a while.
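The "fail if run from the wrong computer or to the wrong computer" safeguard might look something like the following Python sketch. It is not the finished script. The IP addresses, the function names and the directory paths are all hypothetical placeholders, and it only builds the rsync command rather than running it.

```python
import socket

# Hypothetical static addresses; substitute the real ones.
DDD_IP = "192.168.0.2"      # assumed address of the Daily Driver Desktop
SHADOW_IP = "192.168.0.3"   # assumed address of the shadow computer


def local_addresses():
    """Return the IPv4 addresses the local hostname resolves to.

    On many Linux systems the hostname resolves only to a loopback
    address, so a real script may need another way to find its address.
    """
    try:
        return set(socket.gethostbyname_ex(socket.gethostname())[2])
    except OSError:
        return set()


def build_sync_command(source_dir, dest_dir, dest_ip, my_addresses):
    """Return an rsync argument list, or raise if the direction is wrong."""
    if DDD_IP not in my_addresses:
        raise RuntimeError("Refusing to sync: not running on the DDD")
    if SHADOW_IP in my_addresses:
        raise RuntimeError("Refusing to sync: this machine is the shadow")
    if dest_ip != SHADOW_IP:
        raise RuntimeError("Refusing to sync: destination is not the shadow")
    # -a is rsync's archive mode; the trailing slash on the source means
    # "copy the contents of this directory".
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-a", source, f"{dest_ip}:{dest_dir}"]
```

A caller would pass `local_addresses()` as the last argument and hand the returned list to `subprocess.run()`. Adding rsync's `--dry-run` option for the first few runs would be one more cheap safeguard.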

Maintain a Crash Kit

At all times, I must keep a crash kit in a known place. This crash kit will consist of:

Simplify my filesystems

Check out the following:

[slitt@mydesk ~]$ mount | grep "/dev/sd" | wc -l
16
[slitt@mydesk ~]$

Stop the madness! Sixteen partitions, not counting the root partition (/dev/nvme0n1p1), is just crazy. Sixteen partitions that must be sized right: too little and things break, too much and every one of those partitions wastes space. Sixteen mount points to keep track of in /etc/fstab. Each mount point can stop the boot if missing. My shadow computer has one partition per hard disk, with the various mount points now being bind mounts that can expand or contract as needed. When I get a new Daily Driver Desktop, I'll lay out the disks with bind mounts.
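To make the bind mount idea concrete, here is a small Python sketch that prints the kind of /etc/fstab lines such a layout would need. The disk mount point and the directory names are invented for illustration, not the actual layout of the shadow computer.

```python
def bind_mount_lines(disk_mount, mount_points):
    """Build /etc/fstab bind-mount lines for directories on one partition."""
    lines = []
    for target in mount_points:
        # Each directory lives under the one big partition (for example
        # /disk1/home) and is bind-mounted where programs expect it.
        source = disk_mount.rstrip("/") + "/" + target.strip("/")
        lines.append(f"{source} {target} none bind 0 0")
    return lines


# Hypothetical layout: one partition mounted at /disk1.
for line in bind_mount_lines("/disk1", ["/home", "/scratch"]):
    print(line)
```

Because the directories share one partition's free space, none of them has to be sized in advance.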

Regularly test both computers' RAM and disks

No more surprises. The next disk or memory flaw I want to catch early. I'll be running memory tests every couple months, and disk tests (smartctl) every two to four weeks.

Wrapup

By Steve Litt

The October Debacle was a miserable time for me. I'm glad it's over. One colossal bungle changed it from something my customers never would have known about to a five day shutdown of Troubleshooters.Com. I also was on the edge of an Attitude Violation the whole time, and there were many instances of my asking questions like a newbie, which can be annoying.

Except for the colossal bungle (messing with BIOS settings I didn't understand), my conduct during the October Debacle was in line with what I teach in my courses. However, I'm taking steps to prevent future occurrence.

Steve Litt teaches courses on troubleshooting. You can email Steve here.


Trademarks

All trademarks are the property of their respective owners. Troubleshooters.Com (R) is a registered trademark of Steve Litt.
