The linux war stories: episode 1 - SCO...got to start somewhere

Posted 01-25-2015 at 10:11 PM by binary_pearl

While I should keep the company I worked for a secret out of respect, I will say it was a major retail chain in the US with about 7000 locations at the time.

The year was 2007. I had only been working for 2 years in my career, and I had already advanced to the infrastructure team. At that time we were still using SCO Unix 5.0.5 or 5.0.7 on all of the servers at each location.

Out of the normal chaos, a semi-urgent request came. Apparently there were certain server models that were known to reboot unexpectedly at random times. After asking around, I found out that this issue came up from time to time, but then went away.

So I started investigating it. Turns out, we had one particular server model that was kernel panicking. That server was the HP TC3100. This was the last server HP made before they merged with Compaq.

We had about 2700 of these TC3100s. That's a lot of servers that could be kernel panicking and rebooting. And sure enough, they were rebooting all right. But what was causing them to kernel panic, and why?

A couple of us started looking into it. I didn't realize it at the time, but I was being steered down a path that would lead to an epic fail. There are those who take pride in making others look bad. My naive self thought this senior person was helping me. How wrong I was.

My college background had us mainly working on mainframes, so a lot of troubleshooting was done through low-level dumps of memory. When I saw the SCO kernel panics, they looked very similar to the OS/360 dumps of our programs. So I started capturing as many dumps as I could and analyzed them. One of the sections in the dumps was the last 10 functions called before the panic. After googling those function names, they all pointed to a problem with the network. At least now I had something to go on.

In the meantime, two other engineers (one a good engineer, the other our antagonist, whom we shall call Anthony (not his real name)) tried brute-forcing their way to the problem. What they came up with was a stress test. They ftp'd a file to and from a TC3100 to see what would happen. It took a couple of days, but eventually they were able to reproduce a kernel panic. So from two different directions, we now had something to go on. It was an issue related to networking.
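I don't have their original test anymore, but conceptually it was something along these lines: a loop that pushes and pulls a file over FTP until the box falls over (or doesn't). Here is a minimal sh sketch; the host name, login, and file names are placeholders, not what they actually used.

[CODE]
#!/bin/sh
# Rough sketch of an FTP stress test: repeatedly push and pull a large
# file against the target server and see whether it kernel panics.
# TARGET, the login, and the file names are all made up for illustration.
TARGET=tc3100-lab
BIGFILE=/tmp/stress.dat

i=0
while [ $i -lt 10000 ]; do
    ftp -n "$TARGET" <<EOF
user tester secret
binary
put $BIGFILE /tmp/stress.up
get /tmp/stress.up /tmp/stress.down
bye
EOF
    i=`expr $i + 1`
done
[/CODE]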

So when I was on SCO's support site investigating this, I came across something interesting. There wasn't just one set of drivers for the TC3100's internal NIC. There were two. One was produced by Intel, as the NIC itself was made by Intel. The other was made by HP.

Turns out, we were using the HP drivers. The senior SCO person (another antagonist...a story for another day) said something cryptic about the Intel drivers having performance issues. He gave no facts to back that up.

By this time the other two engineers had gotten their technique down to where they could kernel panic a TC3100 in less than 5 minutes. This was good, as we now had a repeatable test. So we tried two different things. We acquired some old 3Com network cards. They worked flawlessly with the kernel panic test. Our other option was to use the Intel drivers. They weren't as good as the 3Com cards, but they didn't kernel panic. There was still a small number of errors generated, but at least they held up in the stress test.

Management didn't like the idea of buying 2700 new network cards. So at least we had a solution; we just needed to get the Intel drivers installed on 2700 servers. I don't see a problem with that, do you?

So we picked a local store as a pilot. We went to the store and replaced the HP drivers with the Intel drivers. During the day, interestingly enough. In this day and age I can't imagine that would be tolerated...but that is a story for another day. The server came up, all was good, and we had our first successful NIC driver replacement. Yay!

But now, how do we get this on all of the rest of the servers? I was told that sending out technicians to the stores was out of the question, as it would be too expensive. That meant an automated solution was the only acceptable route. At 27 years old and with only 2 years of experience, who was I to argue?

So I volunteered to write the automated solution. I knew the senior SCO person who said the Intel drivers had "performance issues" wouldn't be capable of doing it properly, and I certainly didn't want him messing it up. So I went back to figure out how to completely automate removing the HP drivers and replacing them with the Intel drivers.

One challenge I had was that the servers were all singly connected, meaning they only had one network cable. Had I known better, I would have refused to do an automated solution without dual NICs. But I wasn't experienced enough yet to say that.

The way to replace the NIC drivers was through a SCO utility (the name escapes me). If the utility could do it, it was just a matter of figuring out what it was doing, and then I could script it. I don't recall if I used shell or Perl at that point, but regardless, I was going to write a script to replace these drivers automatically; I was dead set on that.

I got about 90% of it figured out when I came to a roadblock. The man page of this SCO utility said something in big bold letters to the effect of 'THIS OPERATION MUST BE PERFORMED IN PERSON'. BS, I thought. If the utility can do it, then it can be automated. So I dug deep into the depths of the network configuration, and I eventually figured it out. I now had a fully automated solution! I could uninstall the HP drivers, install the Intel drivers, and have the server come back up on the network!
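For what it's worth, the general trick for driving an interactive "must be done in person" tool from a script is to pre-answer its prompts from a here-document. The sketch below is purely illustrative: 'nic_config_tool' and the answer sequence are hypothetical stand-ins for whatever the SCO utility actually asked.

[CODE]
#!/bin/sh
# Hypothetical sketch: answer an interactive configuration tool's prompts
# non-interactively by redirecting a here-document into its stdin.
# The tool name and the answers are made up; the real SCO utility asked
# its own series of questions.
nic_config_tool <<EOF
remove
hp_nic
y
add
intel_nic
y
EOF
[/CODE]

(Some tools insist on reading from a real terminal, in which case something like expect is the usual workaround.)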

I should mention that none of the servers had any type of iLO/remote management. If a server didn't come back up, I was screwed. And these servers were scattered across all 50 states.

So we started testing this automated solution one store at a time, then two at a time. Eventually we got about 70 stores converted. Success! Or so I thought. This is where the story starts to get more interesting.

At this point in my career, I wasn't familiar with the networking concepts of auto-negotiation versus forcing the speed and duplex settings. Would a senior help explain this stuff to me? Of course not! Why explain it to a hotshot junior who is destined to make a fool of himself?

I told my manager in passing that, oh yeah, the new drivers were set to auto-negotiation by default. He got very angry and told me that was wrong. He told me that the switches were set to force everything to 100/Full. I thought, big deal, I set it to auto-negotiation; what's the problem?

I learned the hard way that when you set one end of a connection to auto-negotiate and the other end is forced to 100/Full, auto-negotiation fails and the auto side falls back to half-duplex. So now one side is running full-duplex and the other half-duplex. This creates a very nasty situation in which network transfers rapidly slow down and often time out. There is much more to this topic (it's called a 'duplex mismatch', and it's another war story for another day).
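I don't remember the SCO-era commands for checking this, but on a modern Linux box the same mismatch is easy to spot and fix with ethtool (eth0 is just a placeholder interface name here):

[CODE]
# Show the current speed/duplex and whether auto-negotiation is on
ethtool eth0

# Option 1: force the NIC to match a switch port hard-coded to 100/Full
ethtool -s eth0 speed 100 duplex full autoneg off

# Option 2 (usually the better fix): turn auto-negotiation back on
# at BOTH ends so they can actually agree
ethtool -s eth0 autoneg on
[/CODE]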

By the time I realized this, I was in despair. My success with the 70 servers was now nullified. I now had 70 new issues to deal with. By this time the other good senior engineer had been taken off this issue to work on other things. This left me and Anthony. So we suspended converting any new stores until we got the duplex issues fixed.

Once we got that fixed and tested the new conversion script on a few stores, it was considered ready for mass deployment. I made one tiny mistake, though, that had disastrous consequences. What I haven't mentioned yet is that the server had to be rebooted for the driver changes to take effect. My manager asked me: can you run your script and just have the servers pick up the change on their next natural reboot (either planned or from a kernel panic)? I thought about it, and said yes.

Here is the mistake I made. When rebuilding the kernel, it would ask you two yes/no questions. One of these was "Do you want to rebuild the kernel environment now?" This part was easily scriptable, and I answered yes. Actually, it was just 'Y' or 'N'. And that's all it was. One little letter sent me down a path of hell that I wouldn't have wished upon my worst enemy. Or maybe now...yes...you mess with me like this, you get what you deserve.
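To give a feel for how small the difference was: the canned answers being fed to the kernel relink step differed by literally one character. The command name below is a placeholder, since I don't recall exactly which SCO tool the script was driving at that point.

[CODE]
# Hypothetical sketch; 'rebuild_kernel' stands in for the real SCO step,
# which asked two yes/no questions that the script pre-answered.

# What I shipped: rebuild the kernel environment immediately...
rebuild_kernel <<EOF
y
y
EOF

# ...when, on boxes that weren't going to reboot right away, the safe
# answer to that second question would have been:
rebuild_kernel <<EOF
y
n
EOF
[/CODE]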

The problem was that this was now known as an infrastructure issue. Infrastructure was not allowed to have issues. We were to be perfect. Not seen or heard. If you knew who we were, that was bad, because it meant there were issues. So there was immense pressure to get this TC3100 thing resolved.

August 22, 2007. The decision was made to send my automated conversion script out to all 2700 TC3100 servers and have it run. Over time, the servers would then fix themselves upon their next reboot. I had tickets to a Herbie Hancock show that night; I had to give them up because they decided to send the script out that night.

At first, it seemed fine. Everything went to plan, and it looked like my script ran successfully.

August 23, 2007. My birthday. Issues started coming in with converting stores to a new release. When the code distribution people found out what we had done, they were furious. They claimed that our NIC replacement script was causing the issues, and that they had to reboot the servers to fix them. I knew they were correct. But my senior and management claimed that wasn't the case. I remember talking to my mother that day. She had just taken a new job working with severely disabled children with major behavioral issues, and she had even been bitten by a student that day. So her day was just as bad as mine. Made for a great happy birthday conversation!

August 24, 2007. Yeah, it gets worse. In the morning it wasn't too bad, relatively speaking. Then I came back from lunch. MAJOR ISSUE! Remember how I said I selected 'Y' instead of 'N'? Well, now we found out the consequences. It turns out sar was spewing messages that said "I can't find the booted kernel!" Each server was sending three emails every 5 minutes. Multiply that by 2700 servers and that's a lot of email (roughly 1,600 emails a minute, well over two million a day). So much that it was spamming the production email servers. So if you were expecting an email saying your prescription was ready that weekend, sorry, that was my fault. Or so I blamed myself at the time.

By this time, nearly every person related to IT store operations was furious with me. Shaun's change did this! And Shaun's change did that! It was bad. It was the talk of the IT department.

The next week, they pulled in all of the 2nd level support staff to work extra hours to reboot the servers. That was the only way to stop all of this madness. Between the application releases not working and the email issue, I had gone and messed stuff up real good.

By this time, I was hiding in my cube. I didn't want to talk to anyone. My director came by and I thought this was it, I was going to be fired. He didn't even talk about it, but he asked me an unrelated question, and then walked away.

So is that it? No, of course not. We finally got almost all of the stores converted, but we had two issues left. One was that a small handful of servers didn't come back up. The other was how to handle new stores being built.

The servers that didn't come back up? I don't know what was wrong, but there were about 10-15 of them. I worked with the techs over the phone and just couldn't get them back up. So I had to order a re-clone. A re-clone wiped out all of the store's data, and there was no way to recover it. Ordering a re-clone was like sentencing someone to death. The store lost so much important data. It wasn't my fault they didn't have a proper backup solution, but at the time I felt it was.

The other issue was how to handle new stores. Now picture me, completely distraught over screwing up the entire chain. I had to fold these NIC driver changes into the build process for new stores, which involved updating a separate script that the techs ran when building new servers. This is where Anthony really started directing me. I had to rely on what he was saying, because, well, I wouldn't have known what to do otherwise.

Anthony gave me all sorts of weird restrictions and IP ranges, and it was an uber nightmare trying to modify this script. There were many conditionals depending upon the server model and whether it was on a depot or staging network...but I finally got it working...almost. It turns out I made one other mistake. The intention was to have a command that used '&&' and '||' to dictate what to do upon success or failure. But I accidentally typed '|' instead of '||' in one of the commands. That had the effect of building a pipeline waiting on standard input instead of doing a logical OR, which meant the script hung...and never finished. Now I had just prevented any new stores from being built. And that was bad, because stores had to open on tight deadlines. So I probably delayed a hundred or so stores from opening.
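For anyone who hasn't been bitten by this one: the two operators look similar but do completely different things. The commands below are placeholders, just to show the shape of the bug.

[CODE]
# Intended: run the fallback only if the first command fails (logical OR)
install_driver || log_failure

# The typo: a single '|' builds a pipeline instead. Both commands run,
# with install_driver's stdout wired into log_failure's stdin. If what
# ends up on the right-hand side sits there reading input or prompting,
# the pipeline never completes and the whole build script hangs with it.
install_driver | log_failure
[/CODE]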

So that's it. Now you might be wondering: you built up this Anthony character, and yet we haven't heard much evidence at this point to back up why he is so bad. The reason is, I didn't realize what was going on at the time. In future stories we will explore this character further, and then it will make sense why things went the way they did in this one.

But the end result is that I remotely replaced the NIC drivers on 2700 singly connected SCO servers. Despite the issues that surfaced, I believe this was one of the most insane deployments ever attempted, and by and large, it worked. Veni, vidi, vici.

--Shaun