Monday, December 17, 2007

If you look long enough, you can eventually find the pattern...

Warning: Geek talk contained below. Feel free to simply skim the post if you like.

We've been having problems lately with one of the servers that does our backups. It just eventually stops working, usually at night when it's supposed to be running backups. There were two types of outages: one where it occasionally stopped working but started back up on its own. That one was OK, just annoying; no backups failed from it. The other was where the thing just seized up completely. It never came back, and the service had to be restarted to get things going again. For that one, if someone didn't get in to fix it, backups just wouldn't run. It was getting kind of annoying.

A few weeks ago, I set gmail to forward the email generated by the event to my cell phone. Usually the text message would come at a bad time - during a Stake Presidency meeting, while we were out to dinner, while I was dead asleep and didn't hear it, that sort of thing.

About 10 days ago, I started tracing the network connections for the service. Until last night, the only thing I could tell was that during the 'soft' outages the port was still open, but during the 'hard' outages the service wasn't listening on the port anymore.
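The port check itself was nothing fancy. Here's a minimal sketch of that kind of probe, using a made-up host name and port since the real server details aren't anything I'd post here:

    import socket

    # Hypothetical host/port - stand-ins for the real backup server.
    HOST = "backup-server.example.com"
    PORT = 1556

    def port_is_listening(host, port, timeout=5):
        """Return True if something accepts a TCP connection on host:port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        if port_is_listening(HOST, PORT):
            print("Port is open - probably a 'soft' outage, or no outage at all.")
        else:
            print("Nothing listening - looks like a 'hard' outage; the service needs a restart.")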

That wasn't a whole lot of help, but it got the engineer for the backups to open an issue with IBM... Last night, we had a hard outage, and I got on, but couldn't get the service completely stopped. It acted like it was stuck waiting on IO. This morning (really afternoon, because I didn't get to work until 11:15), I looked at the system performance for the last 6 weeks and stared at it until I saw the patterns. (Like the dudes in The Matrix, or those people who stare at the static on their television trying to see if they can find messages being sent by extraterrestrials.) As far as I could tell, every once in a while one process would get stuck waiting on IO, and 9 or 10 others would just sit there waiting to run, but the server wouldn't actually be doing anything. Very annoying.
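If I were scripting that check instead of eyeballing graphs, something like this sketch would do it on a Linux box: tally up processes in uninterruptible sleep (state 'D', usually stuck waiting on IO) versus ones sitting in the run queue. The thresholds at the bottom are made up purely to flag the pattern I described, not anything from IBM:

    import os

    def process_states():
        """Tally process states by reading /proc (Linux only)."""
        counts = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/stat") as f:
                    # The state letter comes right after the parenthesized command name.
                    state = f.read().rsplit(")", 1)[1].split()[0]
            except OSError:
                continue  # process exited while we were looking
            counts[state] = counts.get(state, 0) + 1
        return counts

    if __name__ == "__main__":
        counts = process_states()
        stuck_on_io = counts.get("D", 0)  # uninterruptible sleep, usually disk/IO wait
        runnable = counts.get("R", 0)     # waiting for (or running on) a CPU
        print(f"IO-wait processes: {stuck_on_io}, runnable: {runnable}")
        # Made-up thresholds, just to illustrate spotting the stuck-on-IO pattern.
        if stuck_on_io >= 1 and runnable >= 9:
            print("Looks like the pattern from the performance graphs.")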

After looking at all that, I went poking around on the server to see what dates and times the machine had problems. I stumbled on an output file from one of the times the backup engineer restarted the service, and right in that file were a bunch of errors the server was popping out when it was having problems. The messages were very clear about what the problem was, and I knew what to do to fix it. Fifteen minutes later, I made a configuration change on the server and rebooted it.

Now, I don't know why we couldn't find these messages before; I found them seemingly by accident. Or by inspiration. (Which always comes after the perspiration.)

Tonight, the backup engineer finally found a document from IBM that talks about the thing I found and recommends that the setting be set higher "to avoid problems with the server". He thinks I hit the nail on the head. I'm not holding my breath at this point, but we at least fixed one problem. (I'm not ruling out that there are other problems.) Why didn't we read this before, while setting things up? Because it was the installation manual, and manuals are for sissies...

At least Tara doesn't have to listen to me complain anymore that I can't figure out what is wrong on that server.

1 comment:

jjp said...

I came to work today, and two new problems on other machines replaced this one problem. Both of the new problems defy logic at this point as well. I think it's a conspiracy.