Thursday, November 22, 2007

Systems support

I was at the doctor yesterday morning when one of the network interfaces on one of our main database servers stopped responding. The guys in the office didn't call because I was at the doctor being poked at; they tried to figure out the problem themselves instead. After a while, they sent someone into the data center to look at the thing. The Ethernet cable was disconnected from the server.

The story we heard was that the thing had just fallen out; the head wouldn't lock itself in when you inserted it. The NOC staff couldn't say for sure whether someone was in there around the time it went down. (Still not sure why they don't know, but they aren't really saying...) We figure it was either someone in there doing something in the rack (it's the only server in that rack, so I had hoped that wasn't the case), some sort of earthquake, or someone really big walking or running by. (Which shouldn't happen. They built that data center to stay completely in one piece in an earthquake. The racks are secured to the lower floor 8 feet below. They have tons of rebar in that thing. Someday we will have an earthquake, and this giant cement bunker will rise out of the ground. It is sitting on top of sand...)

Anyway, people were nervous about the plug coming out again, so I offered to glue it in with Gorilla Glue. This wasn't universally accepted, so I decided to hot glue the thing into the NIC instead. A couple of people I told (like the network guys) didn't think I was serious at first.

I talked the Operations Manager into running home to get her glue gun. When she got in, I headed to the data center to do the deed. When I got there, the Ethernet cable in the server looked like it was completely secured. I gave it a tug; it was locked in. It finally came out then that it was the head on the switch side that had the problem. The network engineers didn't really want me applying hot glue to their switch. (Tara wants to buy me the shirt that says "I void warranties".) They pulled a new line with new heads and configured a new port for the server. If the outage recurs, the NOC staff will be calling me in a panic.

I imagine when I explain on Monday that the problem isn't resolved, some people will get nervous and decide we need to take everything on campus down to fix the problem...
