If you own a software or hardware deployment that serves your customers, and the availability of that system matters, here are the 5 things you can’t afford to miss when monitoring your system’s health.
This is not a guide on how to set up monitoring, but it is a quick, crisp checklist of the bare minimum steps you MUST take to achieve reliable monitoring of your system.
- Monitor your system’s availability through external pings. For example, if you have an HTTP-based service, configure your monitoring system to hit your most important URLs at a fixed interval, check the response content, and make sure the response comes back within a given time limit. You should know which URLs represent your application’s availability, and they should be nearly mutually exclusive so that each check covers a different part of the system. This monitoring shows you how the external world perceives your application or server.
- Monitor your system’s internal health through monitoring agents running on your machines. This includes basic hardware metrics such as CPU, memory, I/O, disk space, network activity, and number of processes, as well as application-specific metrics such as error counts in your application logs, number of requests, and process health. The external pings may look green while something is cooking internally that will hamper external availability if it is not stopped.
- Monitor your monitoring system’s availability through external pings. This one is special, and people tend to miss it. What if your monitoring system goes down? Will you still get alert emails or SMSes if your application goes down while your monitoring system is crashed? If you don’t have external monitoring on the monitoring system itself, you are relying on luck. What you can do is deploy a secondary monitoring server that watches either the primary monitoring system or the application itself. If your primary monitoring system is down, your secondary one will most probably still be up.
- Clearly differentiate between critical and warning alerts. When you get an alert, you should know how severe it is. If you can afford to ignore the alert for the next 2-4 hours, it’s probably a warning. If you are losing (or may lose) business every second you ignore the alert, you can call it critical. This is just a high-level guideline; you should decide what’s critical for you and what’s not.
- Configure instant notifications on your mobile for critical alerts. Your system might go down at night when you are sleeping, on a Sunday, or at any other time when you are not checking your mailbox every 10 minutes. If you really want a solid, reliable monitoring system, this one is an absolute must.
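To make the first and fourth items a little more concrete, here is a minimal sketch in Python of an external check that looks at the status, the response content, and the response time, and then maps the outcome to a severity. Everything in it is an assumption made for illustration: the names `probe`, `evaluate`, and `severity` are hypothetical, and the rule "slow but correct is a warning, anything else is critical" is only one possible policy. You should decide what’s critical for you.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    status: Optional[int]  # HTTP status code, or None if no response arrived
    body: str
    elapsed_s: float


def probe(url: str, timeout_s: float = 10.0) -> ProbeResult:
    """Fetch the URL once and record status, body, and elapsed time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        status, body = err.code, ""  # the server answered, but with an error status
    except Exception:
        status, body = None, ""  # DNS failure, refused connection, timeout, etc.
    return ProbeResult(status, body, time.monotonic() - start)


def evaluate(result: ProbeResult, expected_text: str, max_seconds: float) -> List[str]:
    """Return a list of problem codes; an empty list means the check passed."""
    problems = []
    if result.status != 200:
        problems.append("bad_status")
    if expected_text not in result.body:
        problems.append("content_missing")
    if result.elapsed_s > max_seconds:
        problems.append("slow")
    return problems


def severity(problems: List[str]) -> str:
    """Assumed policy: slow-but-correct is a warning, everything else is critical."""
    if not problems:
        return "ok"
    if problems == ["slow"]:
        return "warning"
    return "critical"
```

You would run something like `severity(evaluate(probe(url), "OK", 2.0))` on a schedule from a machine outside your own network, send "critical" results to your phone, and send "warning" results to email. The URL, the expected text, and the 2-second limit here are placeholders, not recommendations.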
While the above list is meant for your applications and servers, you can easily apply the same strategy to monitoring your own self as a leader 🙂