Friday, March 27, 2009

Less than helpful tech support

Last night, unbeknown to me, a switch was moved in the Bookstore. I had been telling the Bookstore people for a little while that I needed to patch their servers, but I hadn't really gotten around to it quite yet. This morning I heard that they were going live with one of their machines tomorrow, so I planned to patch the server beforehand. Just before lunch I started patching, and while it was going I walked away from my computer. When I came back, the server was rebooting.

The server never came back up. This is not good. I went over to the bookstore with a department car, and the machine was really, really messed up, so I went back to the office to get my laptop and car, to be in it for the long haul I knew was ahead. (The guys in the office plan to send out a search party if I am not back in 3 weeks.)

This server had a number of shared libraries zeroed out. I didn't know this at first; I just knew that it was popping an error that I couldn't find an exact match on in the searches I did. The box didn't boot off the old kernel I had just been running, either. After a bit of trying, I decided it was time for my first software support call to HP in about 5 years or so. (We just haven't needed one in the interim. HPUX is a stable operating system, or our installations of HPUX are, at least...)

I ended up talking to someone with an average Latin accent. He tried to understand my problem, and he didn't understand why he couldn't find my exact error in his KB. (I was getting all zeros for the error number of this particular error, and the error was always supposed to have a whole number associated with it.) He eventually told me he would call me back after talking to some of his coworkers about the problem. I was hesitant to let him hang up on me, because of the high likelihood that it would take longer than the 15 minutes he estimated.

While waiting, I poked around for the source of the problem. This was not easy: because one of the shared libraries is called by just about every command in the OS, I had to stick to the limited commands reserved for the maintenance mode of the OS. It wasn't very fun, but eventually I noticed that the library in question was empty. Not gone, just empty. This is not a good thing.

I had patched one of the other Bookstore servers 2 nights ago, so I started downloading the shared library from that machine, so I could burn it to a disc, mount it on the sick server, and copy it over. (The sick machine would not talk to the network, and is old, and does not have a USB port.) I got the file copied over, and the OS started complaining about another shared library. It was at this point that I noticed there were about 30 shared libraries that were all zero length. This is not a good thing.
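For anyone wanting to check a box for the same symptom, a scan for zero-length shared libraries can be sketched roughly like this. It is only an illustration in Python, not the procedure used on the sick server, which could run little beyond the maintenance-mode commands. The function name, the `/usr/lib` starting directory, and the `.sl`/`.so` name patterns are assumptions made for the example.

```python
import os


def find_empty_libraries(root, markers=(".sl", ".so")):
    """Return sorted paths of zero-length library files under root.

    A file counts as a library if its name contains one of the markers
    (an assumed naming convention, e.g. "libc.sl" or "libc.so.1").
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not any(marker in name for marker in markers):
                continue
            path = os.path.join(dirpath, name)
            # Skip symlinks so each real file is reported only once.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # "/usr/lib" is an assumed starting point; adjust for the system at hand.
    for path in find_empty_libraries("/usr/lib"):
        print(path)
```

Running something like this on a healthy, already patched machine as well gives a baseline, since an empty file that shows up on only one of the two boxes is the suspicious one.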

Wireless access in the little computer room in the Bookstore is not really an achievable goal, so I had to keep taking my laptop out to the commons area of the Student Center to connect and download stuff. I ended up downloading about 400 shared libraries from the previously patched machine, just to make sure I got everything I needed. While I was waiting for this, I got an email from the HP support guy (who happened to be in Costa Rica) asking me to try to boot off the older kernel. He thought that might fix my problem. I decided to call him to break the news that I had already tried that, and to update him on my discovery.

The support guy told me that he thought the case should be escalated to a higher level of support, and he started sounding sad about it. (Why was it a letdown for him to have to escalate the call? Shouldn't a call always be escalated at the right time? Has HP put some sort of incentive in place sometime in the past 5 years where the front line support people get some sort of bonus if they close the call without escalating it?) Anyway, the guy he sent me to didn't really want to help. I explained what happened, and he tried for a minute to lecture me about backups for systems, but I didn't let him apply the lecture effectively at all. I explained what I was in the process of doing. His response was that it sounded like a good plan and I should update the case when I got through with that, but he was leaving to go home in 20 minutes, so he wouldn't be doing anything else for me. He let me know that 'someone' would be monitoring the issue if I was still stuck after doing what I was doing. Nice. That meant I would need to explain everything over again if I decided to call back in.

I got the machine functional again after a while; my download and copy from the other machine appears to have worked. When I got the machine going again, I had 5 more patches to apply that were sitting on the other server. I tried to copy them over, but the network was in bad shape between the two machines. I noticed errors on the network card for the one box, and called the network engineers to look at this new switch they put in the night before. The network engineer indicated that he hadn't yet gotten the remote access stuff for this switch in place, that he had been sleeping all day, and that he would call me back when he was ready to look at the ports on the switch.

While I was waiting, I decided to try to reinstall the patch set I had already done, to see if any patches got missed. By this point, I had realized that what very likely happened was that a network interruption while I was patching, caused by this misconfiguration, I guess, kept one of the patches from getting applied right. The retry of the patch set showed 3 patches that got missed; one of these ended up replacing the very shared libraries that were zero length before. Looks like a network problem got me again. (When I got home, Tara's comment was that network problems were always the cause of my problems. I'm not sure that is really the case, but it is a bad sign for the network engineers when the wives of the systems guys notice it is always the network...)

They couldn't get the remote management going, so someone was sent over. He corrected the network misconfiguration for these servers, and I was able to finish patching the server. It has a clean bill of health now. It only took me 6 hours to do a 20-minute job.

I should have just not called HP. Then I could still say it has been 5 years since I made a call. It's not like those two really did anything to help matters...

Sorry about the boring story. We'll get back to regular programming soon.
