Let me open with two questions. First, why does completely nonsensical crap happen so often around here? And second, why is it that every instance of such crap is roughly as bizarre and inexplicable as it possibly can be? It really seems sometimes as if I live in an alternate reality of some sort, one in which the laws of physics do not apply and concepts such as reliability and repeatability simply have no meaning.
What sets my "pen" to "paper" this time, as the title of the essay aptly suggests, is the death of the CPU in my server. That's no big deal, right? I mean, when a CPU dies it just goes up in smoke. It just stops working in such a way that it's clearly a problem and, in the grand scheme of things, it's a very simple problem to fix. All one need do to solve such a problem is buy another CPU and plug it in, right?
If you happen to be me, the answers to those and other such questions are a resounding "NO!"
To begin, I was actually using my server when it failed, and I don't mean in the trivial sense in which I'm using it any time I open a file over the network that resides physically on one of its hard drives. I mean that I had switched to my server, using the KVM switch that was so difficult to get working, and was using it to browse the web. My desktop machine was re-installing a rather large game to its hard drive, after receiving a replacement CD from the manufacturer (the original was corrupt, as they so often become once they enter the unreality field I generate), and I was in need of some information from a web site. Toggling over to the server and using it for a while seemed like the best way to avoid interfering with the install process.
Or at least it did until the server restarted. I was literally touching nothing when it happened. I was sitting there reading the text of a web page. I didn't even have my hands on the keyboard or mouse. I was leaning back in my chair, my hands locked behind my head, reading the text on the screen when the system simply restarted for no apparent reason. It didn't make any noise, it didn't beep at me, it didn't display an error message; it did nothing of the kind. The screen suddenly went completely black, and I could hear the server's drives power cycling.
I found this rather odd, of course, given that my server has worked fine since I wrote about the motherboard crapshoot nearly a year ago. At that time, the motherboard on my server failed and I was forced to replace it with an MSI K7N2 Delta-L. I have had no problems with it since that time until the server restarted yesterday. Still, my server is running a Microsoft operating system (Windows XP Professional), so there's no reason to expect any kind of trouble-free and/or consistent operation for long. I figured the thing had probably experienced yet another of those freaky, inexplicable errors, one that required a reboot to remove from the system.
Sure enough, the server started up again without incident, and I was soon back to reading. I got roughly one paragraph further before the system restarted itself again, also right out of the blue. I was beginning to think that maybe I was going to have to look into the situation at greater length when something really odd happened: the system would no longer even begin its power-on self test (POST) at startup. I've seen that happen before on rare occasion, though not with my server. Usually, such a thing is the result of a glitch so nasty that nothing but a completely cold boot will do. I powered down the computer completely, waited a ten-count, re-applied power, and... nothing. The computer still wouldn't POST.
I tried completely disconnecting the server from its power cable and letting it sit for a few minutes, thinking perhaps heat was the issue. It wasn't particularly hot in my office where the server is located. We're having a very mild summer out here in California, and the mercury hadn't risen much above 80° F—must be that global warming Al Gore yammers on about—but I figured it was probably the first thing to eliminate nevertheless. A few minutes later the system still wouldn't even POST. It was doing nothing, and I mean nothing: it wouldn't give me any error beeps, the video card wouldn't initialize the screen display, nothing. After fussing around with it for a few more minutes, trying a variety of other things in the interests of being thorough, I was convinced that something had gone wrong.
It was time to start the serious troubleshooting. Initially, I had but one suspect: the motherboard. I'm not sure why I doubted the motherboard so readily; maybe it's because I've had so many issues with them in the past. I figured the simplest way to determine whether it was a problem was to start disconnecting things from it. Were the problem due to a bad component, the system should at least POST once it was removed. So I disconnected all the drives from the motherboard and that made no difference. I pulled out the RAM and that made no difference. I pulled out the video card and that made no difference. Heck, I even removed the CPU and that made no difference. No matter what I did the problem remained the same. The system was a functional paperweight but useless as a computer.
I concluded that the problem was likely the motherboard. After all, functional motherboards gripe via a series of beeps about not having a CPU, memory, a video card, and so forth. I reasoned that I could expect my motherboard to complain about something I'd removed were it still functional. Still, having been through all kinds of hell with the motherboard hokey pokey and a subsequent motherboard crapshoot, I've learned to trust nothing. So when I went to Fry's Electronics in search of replacement components, I took almost everything with me. I walked in there with the server's motherboard, memory, CPU, and video card in the hope of avoiding Phil's Law: any non-trivial computer repair/upgrade will require a minimum of four trips to Fry's. Remember that.
I headed for the customer service counter and asked one of the technicians to test my motherboard. She gave me quite an unhappy look when I pulled out all my components, but she was very thorough nonetheless. The store I frequent has a convenient testing rig located right behind the customer service counter, which allows them to test pretty much anything in a known-good configuration. She tested my CPU and it worked just fine in another motherboard. She did the same thing for my memory and then my video card; both of those worked with my CPU in that other motherboard. So then she hooked up my motherboard with my CPU, memory, and video card.
I think the whole store heard the "thump" my jaw made when it dropped on the floor; everything worked just fine at Fry's. I didn't believe her, so I looked the whole thing over myself. Sure enough, closer inspection showed that she had indeed connected the server's motherboard, memory, CPU, and video card in exactly the same configuration that had failed so many times at home. I asked her if she did anything special, to which she replied that the only thing she had done was short the battery jumper. I didn't mention it before, but that was one of the things I had tried (three times, no less) while fussing around for the sake of being thorough. The K7N2 motherboard has a jumper that clears the BIOS settings when closed; it had done nothing for me at home, but it seemed to work for her in the store.
So I headed home. All I could figure was that something went wrong with the BIOS in such a way that I hadn't cleared it properly and she had. That's the only thing that made sense to me anyway, insofar as I saw all of the components connected and working on their testing rig. They hadn't worked at home roughly thirty minutes earlier, but they had clearly worked at Fry's. I didn't understand how it could have happened, but I didn't care much either. I just wanted to go home and get my server back up and running. I just wanted to be done with the whole problem before it could eat my entire evening.
While I was gone a friend and my wife had both arrived at my house for some Friday night socializing. My friend Jim found it just as strange as I did that all the server components would work just fine at Fry's after failing here at home. I was going to call it a night and work on it the next day, but I discovered when I checked the web that my gaming clan, Steel Maelstrom, was probably going to have to fight a battle in just a day or two at most, and I still had a fair amount of leg work to arrange the scheduling. Both of those things meant that I needed my server up and running post-haste, so after dinner I excused myself and quickly re-assembled the whole thing.
Naturally, it didn't work at all. Again, it wouldn't even POST. Yes, you read that right: the very same components in the very same configuration, which had worked just fine on the testing rig at Fry's roughly an hour and a half ago, would not work in my case at home. Because this was quickly becoming one of those frequent, ridiculous problems I so "enjoy", and because Fry's was going to close for the night in about half an hour, I headed to my car immediately for one more crack at buying some replacement components. I chided myself for not buying them when I was at Fry's earlier, but I did have every reason to believe that they would work at home. After all, they worked just fine at Fry's.
I made it to Fry's and purchased a new motherboard just before they closed. I figured I had every other component needed for testing at home. I could always pillage RAM, a video card, and even a CPU from my desktop system, or my wife's computer as needed, but I didn't have a spare motherboard sitting around. I was pleased to be able to purchase exactly the same make and model of motherboard because I knew that would let me avoid reinstalling Windows XP, something I recently had to do with my desktop machine. I headed home, confident that I could find the problem and that the new motherboard would likely do the trick.
But it didn't. Sure enough, I hooked up the new motherboard and the system still refused to POST. So I sat and thought about it. I had to believe the new motherboard was good. And for that matter I had reason to believe the old motherboard was good. I had the same reason to believe that my RAM and video card were also good. All of the server components had worked at Fry's earlier that afternoon; I had justification for the belief that they all were good. And now a new motherboard was giving me the same result, so either (1) my motherboard was likely good too, or (2) I was seeing two different problems with exactly the same symptoms.
So what was left? The only things I could think of were (1) the power supply had gone bad, or (2) the motherboard was shorting somehow against the case. I checked my desktop machine to make sure it was still functioning—one can never be too careful when living in the unreality field I generate—powered it down, pulled out its power supply, swapped it out with the power supply in the server, removed the motherboard from the server and placed it on a non-conducting mat, applied power, and... nothing. I still had the same problem.
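Let's recapitulate the salient facts for a moment, shall we? Every component from my server worked on the testing rig at Fry's. Those very same components, in the very same configuration, would not POST at home. A brand-new motherboard of the same make and model made no difference, and neither did swapping in the power supply from my working desktop or taking the motherboard out of the case.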
There are two conclusions supported by these facts: (1) all of the server's components are good, and (2) at least one of the server's components is bad. Those two are mutually exclusive, contradictory conclusions; i.e., they cannot both be the case. But they are exactly what the evidence suggested. Since reality was clearly on holiday, I decided to start swapping components. After all, something has to start making sense at some point, doesn't it?
So I swapped the power supplies again between my desktop and server, verified that my desktop was still working, shut it down again, and started swapping components in and out. The first and simplest thing I tried was to put my desktop's CPU in the server's motherboard, with the server's RAM and video card, but that didn't work at all. I tried several other combinations until I eventually hit upon a winner: the new motherboard with my desktop's CPU, server's RAM, and server's video card would actually POST. "Eureka," I thought to myself, "both the motherboard and the CPU must somehow have failed!" That thought didn't make any sense, of course, given that they both worked just fine at Fry's, but I wasn't letting rationality bother me anymore in my search for a working system.
I tried to validate that hunch by using the old motherboard instead, but I was all the more confused when I discovered that the server's original motherboard worked just fine with my desktop's CPU, the server's RAM, and the server's video card. Yes, you read that right too: my desktop's CPU in the server's motherboard, with the server's RAM and video card—the exact combination of components that I said didn't work in the previous paragraph—was now working. Why do I keep expecting reality to be consistent? Hmm? Can someone tell me why I expect things ruled by deterministic laws of physics to behave repeatably?
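For the programmers in the audience, that contradiction can be written down in a few lines. What follows is only a sketch in Python: the part labels and the `contradictions` helper are made up for illustration, and the log holds just the three trials described above. It flags any combination of parts that was seen both to POST and to fail, which is precisely what a deterministic machine is not supposed to do.

```python
from collections import defaultdict

# Hypothetical trial log: (parts tested together, did the system POST?).
# The labels are illustrative names for the parts in the story.
trials = [
    ({"old_board", "desktop_cpu", "server_ram", "server_video"}, False),
    ({"new_board", "desktop_cpu", "server_ram", "server_video"}, True),
    ({"old_board", "desktop_cpu", "server_ram", "server_video"}, True),
]

def contradictions(log):
    """Return each configuration that was observed both to POST and to fail."""
    outcomes = defaultdict(set)
    for parts, posted in log:
        # frozenset makes the configuration hashable and order-independent
        outcomes[frozenset(parts)].add(posted)
    return [sorted(parts) for parts, seen in outcomes.items()
            if seen == {True, False}]

for config in contradictions(trials):
    print("Same parts, different results:", ", ".join(config))
```

Run against this log, the only configuration it reports is the old motherboard with the desktop's CPU and the server's RAM and video card.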
I tried a couple of other tests, even going so far as to pull the CPU from my wife's machine and use that instead, and I was finally seeing repeatable behavior. As soon as I removed the server's CPU from the configuration, my server would POST with either of the other two CPUs I had available. I wasn't happy that I had wasted my whole night on such a ridiculous problem, and I couldn't reconcile what I was seeing with the fact that the server's CPU had worked at Fry's some hours earlier, but I couldn't ignore the evidence of my senses: my server was now making it through its POST with other CPUs.
After all the confusion and nastiness of the evening, I showered up, grabbed a beer, and relaxed watching part of a movie (The Lord of the Rings: The Return of the King) before bed. I got up in the morning, headed off to Fry's, failed to return the new motherboard (I had somehow mixed up the two, if you can believe it, and was trying to return the old motherboard), and bought a new Athlon XP 2400 CPU. The server was originally running an Athlon XP 2400, and I saw no reason to upgrade it. I'm disappointed to say that it cost me roughly twice what it would have cost me elsewhere ($80+ compared to the $40 pricing I'm seeing on the Internet), but Fry's is always there for me so I coughed up the extra money for the convenience of being able to return it easily should the need arise. I hoped my troubles were finally over.
I went home, put the new CPU into the server's old motherboard, added the server's original RAM and video card, and... the system would not POST. I was just about ready to lose my mind at this point. I knew that very same system had worked last night with a different CPU. I examined all the connections, reinstalled the CPU and fan/heatsink, and checked everything I could think to check, but the server still stubbornly refused to POST. I completely uninstalled and reinstalled the new CPU four times, but the problem would not go away. The new CPU seemed not to work; could I really have received a bad CPU? Is it really possible that one man can be that unlucky on a consistent basis?
In a fit of desperation, I swapped everything back into the new motherboard. I put the new CPU, along with the server's old RAM and video card, into the new motherboard, and you could have knocked me over with a feather when it worked. So again, maybe the old CPU and the old motherboard had somehow gone bad? Maybe I needed the new motherboard after all? I still couldn't explain why both the old CPU and old motherboard worked fine at Fry's the previous day, and why the old motherboard had worked just fine the previous evening as well, but only the new CPU and new motherboard were working together today.
Until I reconnected everything, that is. Just for a lark, I put the new CPU into the server's old motherboard, with the old RAM and video card, and it worked. It didn't work the four times I had tried it previously, mind you, but it did work on the fifth attempt. Thus, I uninstalled the new CPU and put the old CPU in there instead (on the off chance that I had hallucinated the events of the last day), but that didn't work. When I removed the old CPU and re-installed the new CPU it would again work. In summary, I'm sad to say that the new CPU doesn't work, but I'm also happy to report that the new CPU does work.
So what can I take away from this? The whole mess began for no apparent reason. When I took them in, all of the components in my server checked out just fine on the testing rig at Fry's. I saw it with my own eyes. Yet those very same components, in the very same configuration, failed completely in my server at home. The only remaining differences between Fry's testing rig and my server, the case and its power supply, were removed from the equation without changing a thing. Clearly, all of the components were good, and yet one (or more) clearly also had to be bad.
An absurd amount of frogging around seemed to indicate that, in fact, the CPU had somehow gone bad and needed to be replaced. That behavior, at least, was repeatable. But purchasing a new CPU and installing it this morning didn't fix anything either. Or at least, it didn't fix anything until everything was plugged into a new motherboard as well, after which those same components would then work just fine in the old motherboard—you know, the one in which they wouldn't work ten minutes earlier. My head hurts just reading this paragraph.
I've had to reduce my expectations over the years. I've come to the point whereat I expect that roughly 90% of everything is crap. That seems to be approximately the correct number in my experience. But you know what? Even with that belief, I'm still a radical optimist when it comes to technology. I'm still stupid enough to expect notions like cause and effect to have some traction. I still assume, idiotically it seems, that problems occur for some rational reason and, even more idiotically it seems, that one can expect problems with (supposedly) deterministic machines to be themselves deterministic and thus repeatable.
But that's just silly. Were I simply incompetent, I could dismiss what I've seen over the last couple of days with a simple "Oh, I must have screwed something up." Everybody makes mistakes, after all, and I've made plenty. But when it comes to matters like these I'm as precise, careful, and thorough as any human being I've met. I keep a log of every action I take, I make note of any differences I observe, etc. When I uninstall and re-install a CPU from a motherboard four times in a row, I follow exactly the same procedure every time. That it fails four times in a row provides every reason for me to believe that it will fail the fifth time as well, but following the exact same procedure for a fifth time made the new CPU work in my server's old motherboard.
My mother used to have a piece of paper taped to our refrigerator when I was young. It read: "Insanity is doing the same thing over and over and expecting different results." I used to agree. These days, however, I'd have to say it's nuts to expect the same thing to happen at all, at least where technology is concerned. I've been walking around the house today spouting mutually contradictory statements, just to see if I can get the hang of it. It's hard to give up on notions like cause and effect, rationality, reliability and repeatability, etc. But I would never get myself out of the situations that arise in my house if I didn't. When it comes to technology, I just have to try what makes sense as a prelude to trying what doesn't make sense.
In the final analysis, there's only one truth I can take from this episode and that's Phil's Law: any non-trivial computer repair/upgrade will require a minimum of four trips to Fry's. I'm putting off the fourth trip, to return the new motherboard, for a couple of days. After all, I clearly don't need it but I clearly do need it. God, my head hurts...
08/07/2004