some (sys/web)admin hints

Just thought I’d regale ye with two errors from my past.

The first involves SSH. SSH rules. It is /so/ handy to be able to work on many machines simultaneously. However, there is a single fundamental flaw to this method of working.

[root@localhost ~]# poweroff

I turn my laptop off every night. As I usually have my fingers on the keyboard, I usually turn it off using the above command.

One night, I turned my machine off, then went upstairs to bed. I realised I’d forgotten to lock the back door, and came down to do so.

My laptop was still humming away.

I don’t think you can imagine the feeling of dread that came over me. A cold sweat welled up, and I could feel the hair on the back of my neck rising. I realised I must have typed poweroff in an active SSH connection. I checked, and it was our live production server, with a hundred or more active websites on it.

The next hour or two was frantic, as I tried to get in contact with the hosting provider (which has a supposed “24-7” phone line which is only ever answered from 9-5).

Luckily, it was Sunday, so no-one noticed before the provider’s support techies finally turned up for work in the morning and turned it back on.

As soon as I knew the machine was on, I logged into it and added this line to the ~/.bashrc file:

alias poweroff="echo No! Step away from the computer and think about what you\'re doing"
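The alias is blunt: it blocks poweroff everywhere, including on the laptop where I actually want it. A gentler variant is sketched below. It is only a sketch, and the function names in_ssh_session and guarded_poweroff are made up for illustration. It assumes the shell was started by OpenSSH's sshd, which sets SSH_CONNECTION and SSH_TTY; those variables can be lost if you become root through su or sudo, so don't treat it as bulletproof.

```shell
# Hypothetical helper: succeeds when this shell looks like an SSH session.
# Assumes OpenSSH, which exports SSH_CONNECTION (and SSH_TTY when a tty is allocated).
in_ssh_session() {
    [ -n "${SSH_CONNECTION:-}" ] || [ -n "${SSH_TTY:-}" ]
}

# Hypothetical wrapper for ~/.bashrc. On a local shell it simply powers off.
# Over SSH it refuses unless you type the machine's hostname first.
guarded_poweroff() {
    local host answer
    if in_ssh_session; then
        host=$(hostname)
        printf 'You are logged in over SSH to %s.\n' "$host"
        printf 'Type the hostname to really power it off: '
        read -r answer
        # An empty answer never counts as a match.
        if [ -z "$answer" ] || [ "$answer" != "$host" ]; then
            echo 'Hostname does not match; not powering off.'
            return 1
        fi
    fi
    # "command" bypasses the alias below and runs the real poweroff.
    command poweroff "$@"
}

alias poweroff=guarded_poweroff
```

With this in root's ~/.bashrc on every machine, a poweroff typed into the wrong window stops and asks a question that is hard to answer on autopilot.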

The second error is probably a common one.

It /sounds/ like a good idea to set up an error log which emails you if an error ever occurs on your server.

Don’t! Or at least, read on and find out how to do it properly.

I made the mistake of setting up an error catcher which would email me as soon as an error occurred. The reasoning was that I’m usually online, so I could catch the error quickly and fix it before the client even noticed the error was happening.

Unfortunately, one day I made an upgrade to one piece of code which adversely affected another piece that was almost unrelated (so it didn’t occur to me to check it – admit it, you’ve all made this mistake).

I went home and was offline for the rest of the day. The following day, I came in to find the office phone’s answering machine was blinking. Apparently, that client’s site had “stopped responding”.

I thought it would be something simple, so tried to log in. The system was /slow/. After a few minutes of painful testing (even SSH is very slow if your load is high enough!), I found the problem – about 5 million emails in the email queue, and qmail was going CPU mad trying to deal with it all.

It turns out that any time anyone accessed the client’s site, it would trigger a recursive chain of events which each caused an error. Before we could even start on solving the problem, we needed to turn off web access to the machine! Not a good thing, when your business is the web.

The next few hours (make that days) were spent clearing the emails and scouring for any code which sends an automatic error email…

The solution to the above? Log your errors to the syslog instead of sending them by email. If you still want an email sent out, then set up a cron job which emails the contents of the log file and then clears it. That way a runaway error costs you one email per cron run, not one per page hit.
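Here is roughly what that could look like in shell. Treat it as a sketch under assumptions, not what we actually ran: the log path, the address, the "webapp" tag and both function names are invented for the example, and it assumes your syslog daemon is configured to write that tag's messages to the file named in LOG_FILE. logger and mail are the standard commands; MAILER is only there so the mail command can be swapped out.

```shell
#!/bin/sh
# Assumed settings (hypothetical, adjust to your own setup).
LOG_FILE="${LOG_FILE:-/var/log/webapp-errors.log}"
MAIL_TO="${MAIL_TO:-admin@example.com}"
MAILER="${MAILER:-mail}"

# Send one error message to syslog instead of emailing it.
# -t sets the tag, -p sets facility.level.
log_error() {
    logger -t webapp -p user.err -- "$*"
}

# Meant to be run from cron: mail the accumulated errors, then clear the log.
send_error_digest() {
    # Nothing logged since the last run: send nothing.
    [ -s "$LOG_FILE" ] || return 0

    count=$(wc -l < "$LOG_FILE" | tr -d ' ')
    if "$MAILER" -s "[$(hostname)] $count error line(s)" "$MAIL_TO" < "$LOG_FILE"; then
        # Truncate in place so syslog can keep writing to the same file.
        # Lines written between the mail and the truncate are lost;
        # that is a trade-off of keeping this simple.
        : > "$LOG_FILE"
    else
        # Mail failed: keep the log so the next run can retry.
        return 1
    fi
}

send_error_digest
```

Saved as, say, /usr/local/bin/error-digest.sh (again a made-up path), a crontab line such as `*/15 * * * * /usr/local/bin/error-digest.sh` would cap the damage at four emails an hour, however many errors pile up.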

Moral of the story is – there is always a better way to do things. Usually, though, it will not occur to you until after the damage is done.

1 Comment.

  1. If you’re running Debian or Ubuntu, install the molly-guard package. It detects whether you’re in an SSH session, and if you attempt to run poweroff, halt, reboot, or a few other commands, it will ask you to type in the hostname of the server.

    Niall.
