The title of this week's column may sound like something of an oxymoron. At least a newcomer to the art of Unix might easily believe that a fatal error means that it's all over. End of story. The system's a goner. But those of us with years of experience managing Unix systems have seen, survived and frequently recovered from so many fatal errors that the word "fatal" doesn't suggest the end of the story any more than the word "urgent" in bold letters on a piece of mail means it's going to be worth opening (I think there's actually a negative correlation between how important a piece of mail is and how urgent it looks).
Some fatal errors, on the other hand, are significantly harder to recover from than others and, having just narrowly avoided a possible reinstall of a server, I figured my particular near disaster was worth mentioning.
The problem arose because a sysadmin (who prefers to remain anonymous) made one of those little mistakes that have consequences way out of proportion to the mistake -- one easy mistyping of a command and the system is nearly unusable. Like the accidental clobbering of root's shell in the /etc/passwd file, this mistake looked like one that was going to be very tough to resolve, and it happened at a time when taking the particular server offline for hours would have been a very bad move. What happened was this: Our anonymous sysadmin accidentally zeroed out a file named /usr/lib/libgen.so.1. This shared library provides routines for string pattern matching and pathname manipulation, and many basic commands depend on it. When you don't have a working copy of this shared library on a system, you will get errors such as this one:
ld.so.1: sed: fatal: /usr/lib/libgen.so.1: unknown file type
In fact, you will get errors like this for a wide variety of basic commands. If you try to edit a file:
ld.so.1: vi: fatal: libgen.so.1: open failed: No such file or directory
You won't be able to log in remotely. You won't be able to use scp or ftp to copy the file from another system.
To make matters worse, the affected system sits in an unusual position on its network, and only its user home directories are backed up.
Recovering from errors such as this is the type of work that earns sysadmins their stripes.
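The "unknown file type" complaint shown earlier is, as far as I can tell, what the runtime linker says when it is handed a file that is not a recognizable ELF object, and a zero-length file certainly qualifies. Here is a minimal sketch of a sanity check you could run against a library before and after touching it. The function name looks_like_elf is my own invention, and the check only looks at the file size and the four-byte ELF magic number, so treat it as a smoke test and not as proof that a library is intact.

```python
import os
import sys

ELF_MAGIC = b"\x7fELF"  # the first four bytes of any ELF object


def looks_like_elf(path):
    """Return True if path is a non-empty file that begins with the ELF magic.

    Hypothetical helper for illustration; it does not validate the rest
    of the file.
    """
    try:
        if os.path.getsize(path) == 0:
            return False  # a zeroed-out library, as in this incident
        with open(path, "rb") as f:
            return f.read(4) == ELF_MAGIC
    except OSError:
        return False  # a missing or unreadable file also fails the check


if __name__ == "__main__":
    # Usage: python check_libs.py /usr/lib/libgen.so.1 [more files...]
    for lib in sys.argv[1:]:
        status = "ok" if looks_like_elf(lib) else "SUSPECT"
        print(f"{status}: {lib}")
```

A check like this is only useful while the system is still healthy enough to run it. Once a library that half the system depends on is gone, the interpreter may not start either.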
In this case, we were lucky. We located another system of the same architecture and OS on another subnet. We copied this system's /usr/lib/libgen.so.1 file to my laptop and burned a CD. We then mounted the CD on the crippled system. Since we still had a single live ssh session, we were able to do this without dragging a monitor to the system. We then copied the missing file back to /usr/lib and breathed a sigh of relief. The system immediately recovered and we had a happy ending to our long week. Disaster averted with some quick thinking. The system was back to its normal condition roughly ten minutes from the start of the near disaster. Had we not had a system of the right type to supply a copy of the missing file, the process of getting the system back online would have been far more complicated and far more time consuming.
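We did the copy by hand, but the order of operations matters enough to spell out. The sketch below is one cautious way to put a replacement file in place, not a record of what we typed: it copies the donor file to a temporary name in the target directory, checks that the copy matches the source, and only then renames it over the damaged file, so the library is never left half written. The names sha256_of and install_replacement, the .restore- prefix and the default mode of 0o755 are all assumptions of mine; check the permissions and ownership on your donor system.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_replacement(source, target, mode=0o755):
    """Replace target with a verified copy of source (hypothetical helper).

    The copy is staged in the target's own directory so that the final
    rename stays on one filesystem, where os.replace is atomic on POSIX.
    """
    target_dir = os.path.dirname(os.path.abspath(target))
    fd, staged = tempfile.mkstemp(dir=target_dir, prefix=".restore-")
    os.close(fd)
    try:
        shutil.copyfile(source, staged)  # file contents only
        if sha256_of(staged) != sha256_of(source):
            raise OSError("staged copy does not match the source")
        os.chmod(staged, mode)  # assumed mode; ownership is not changed
        os.replace(staged, target)  # swap the good copy into place
    except BaseException:
        if os.path.exists(staged):
            os.remove(staged)  # never leave a stray temp file behind
        raise
```

In our case the call would have looked something like install_replacement("/mnt/libgen.so.1", "/usr/lib/libgen.so.1"), where /mnt is just a stand-in for wherever the CD happened to be mounted.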
Mistakes happen but, whenever you're working on a system for which you don't have a complete backup, you need to be extra careful. It's also a good idea to avoid having only one server running a particular OS. Having multiple systems of the same basic type and OS level means you have a spare copy of most system files within easy reach.