Over the weekend, the network my server lives on went down. Apparently it was a planned outage that no one told me about. So I got a call Saturday night from my friend Daniel Reeves, who uses my server. I happened to be at a friend’s house and couldn’t do anything about it at that moment. When I got home, I started trying to reach the server. I used nmap to see what ports were open. Nothing. Then I scanned a whole range of IP addresses near mine, in case the IP address of my machine had changed. Nothing there either. So then I tried logging on to some other servers on the same network that I have access to. No go there. Now I knew that the problem was out of my hands. So I went to bed.
Sunday morning I tried again. Now the other servers on the network were reachable, but mine was not. I did not want to drive 35 minutes each way to see what the problem was, so I put it off until this morning. My suspicion was that there had been a power outage, so when I came in this morning and found my machine on, I was surprised. I soon discovered that its network connection was not working, though. So I restarted the network using:
and then things were working fine again. However, it was clear to me that there must be a better way of handling this. So I started searching the internet, and one of the very first pages I found described monit, a program that can monitor system services and respond to various conditions. It was quite easy to install, using the standard ./configure, make, and make install. The documentation is also quite good, and many examples are provided. It did take me a good two hours of testing to make sure I had things configured correctly, as there was no example of testing whether the network services were up. I think it will be worth it, though. The next time the network goes down, my machine should be back online 2 minutes after the network is back up. With time, I could configure monit to do even more. For example, it can monitor CPU and memory usage for processes, so if mysql is hogging everything, or someone has created an infinite loop, it can kill the offending process.
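For anyone wanting to try something similar, here is a rough sketch of what such a monit control file might look like. It is not my exact configuration. The gateway address (192.0.2.1), the restart script path, the mysql pidfile and init script locations, and the thresholds are all placeholders you would need to replace, and the 120-second polling interval is just one way to get the "back within about 2 minutes" behaviour described above.

```
# Poll every 120 seconds (assumed interval; adjust to taste).
set daemon 120

# Ping a host that should always be reachable, e.g. the network gateway.
# 192.0.2.1 is a placeholder address, not a real gateway.
check host gateway with address 192.0.2.1
    # restart-network.sh is a hypothetical script wrapping whatever
    # command restarts networking on your system.
    if failed icmp type echo count 3 with timeout 5 seconds
        then exec "/usr/local/sbin/restart-network.sh"

# Watch a process for runaway resource usage.
# The pidfile and init script paths are assumptions; check your system.
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
    start program = "/etc/init.d/mysql start"
    stop program = "/etc/init.d/mysql stop"
    if cpu > 90% for 5 cycles then restart
    if totalmem > 500 MB for 5 cycles then restart
```

Before relying on a file like this, it is worth running `monit -t` to check the syntax. Whether the exec action fires on every failed cycle or only once when the check first fails depends on the monit version, so consult the manual for the version you installed.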