originally published on bulldogjob.pl
Check what can be monitored, how to do it effectively, and how to set it up so that conclusions can be drawn easily.
IT monitoring is the systematic measurement and observation of an IT environment, giving a picture of how infrastructure, systems and applications work. The monitored parameters relate to a physically or logically separate environment.
Monitoring lets you track the state of your IT environment in real time. When a failure occurs, the monitoring system can point to its source and automatically notify the designated people via the selected channel. This makes it possible to take immediate action to minimize the losses caused by a malfunctioning system or device.
On the basis of the collected and correlated data, the monitoring system can forecast the behaviour of specific components. By identifying trends, it enables action aimed at preventing undesirable events in the future.
How to determine what should be monitored?
To determine which parameters to monitor, we should answer the following questions:
- Which resources do we have direct access to, and which require distributed access (e.g. connecting via a proxy, or gathering monitoring data from several remote nodes)?
- Which applications and systems require monitoring, and how often?
- Which parameters must be monitored under the agreement signed with the customer and its SLA (Service Level Agreement) levels?
The SLA is an agreement to maintain and systematically improve the service quality level agreed between the customer and the service provider, through a continuous cycle that includes:
- monitoring of the service
- performance review
The first step in implementing an SLA is to create a catalogue of the services provided. Services are combined into groups, and groups into larger ones, until a complete product – a service definition – emerges. Based on the defined service, the service parameters included in the SLA are specified.
If an application must be available 24/7 – like LIDO, an application supporting airlines in their daily operations, for which I was a long-time database administrator at my company – very frequent monitoring is necessary. Measurements should be taken every 5 minutes, sometimes even more often – every 2 minutes. The response time to a failure should also be very short – at most 15 minutes – and resolving the problem should not take longer than 30 minutes.
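To see what an availability target actually allows, it helps to translate the SLA percentage into a downtime budget. The helper below is a minimal illustrative sketch (not from any SLA tooling); the function name and the 30-day month are my own assumptions:

```python
# Hypothetical helper: converts an SLA availability target into an
# allowed-downtime budget for a given period. Illustrative only.

def allowed_downtime_minutes(sla_percent: float, period_minutes: int) -> float:
    """Return how many minutes of downtime the SLA permits in the period."""
    return period_minutes * (1 - sla_percent / 100)

if __name__ == "__main__":
    month = 30 * 24 * 60  # a 30-day month, in minutes
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target, month)
        print(f"{target}% availability -> {budget:.1f} min/month")
```

For example, a 99.9% target leaves only about 43 minutes of downtime per month – which is why a 15-minute response time and a 30-minute resolution time matter.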
It is also necessary to decide what to monitor, and how. Among system parameters I recommend monitoring:
- disk usage
- memory consumption
- the number of processes on the machine
- CPU usage
- network usage
- the number of zombie processes
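As a concrete illustration of the first item, here is a minimal sketch of a Nagios-style disk usage check in Python. The thresholds and the plugin itself are my own illustrative assumptions – real deployments often use the stock check_disk plugin instead:

```python
# Minimal sketch of a Nagios-style disk usage check. Thresholds are
# illustrative assumptions, not values recommended by the article.
import shutil

def check_disk(path: str, warn: float = 80.0, crit: float = 90.0):
    """Return (exit_code, message) using Nagios conventions:
    0 = OK, 1 = WARNING, 2 = CRITICAL."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    if percent >= crit:
        return 2, f"CRITICAL - {path} at {percent:.1f}% used"
    if percent >= warn:
        return 1, f"WARNING - {path} at {percent:.1f}% used"
    return 0, f"OK - {path} at {percent:.1f}% used"

if __name__ == "__main__":
    code, message = check_disk("/")
    print(message)
    # A real Nagios plugin would finish with: sys.exit(code)
```

Nagios interprets the plugin's exit code (0/1/2/3) and shows the first line of output in its console.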
Among database parameters:
- connection time to the database (database availability)
- execution time of a specific query
- disk usage (especially important are locations containing sort files, transaction logs, backups, daily snapshots or journal files)
- for MySQL, index usage parameters
- for Oracle, disk usage by tablespaces
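A database availability check usually measures the time needed to open a connection and run a trivial query. The sketch below uses the standard-library sqlite3 driver purely as a stand-in – for MySQL or Oracle you would swap in the appropriate client library – and the thresholds are illustrative assumptions:

```python
# Sketch of a database availability check that measures connection time.
# sqlite3 is used as a stand-in for a real MySQL/Oracle driver.
import sqlite3
import time

def check_db_connect(dsn: str = ":memory:", warn_s: float = 1.0, crit_s: float = 3.0):
    """Return (exit_code, elapsed_seconds) using Nagios conventions."""
    start = time.perf_counter()
    try:
        conn = sqlite3.connect(dsn, timeout=crit_s)
        conn.execute("SELECT 1")  # a round trip, not just a socket open
        conn.close()
    except sqlite3.Error:
        return 2, time.perf_counter() - start  # CRITICAL: database unreachable
    elapsed = time.perf_counter() - start
    if elapsed >= crit_s:
        return 2, elapsed
    if elapsed >= warn_s:
        return 1, elapsed
    return 0, elapsed

if __name__ == "__main__":
    code, elapsed = check_db_connect()
    print(f"connect time {elapsed * 1000:.1f} ms")
    # A real Nagios plugin would finish with: sys.exit(code)
```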
Among application parameters:
- availability of web servers (HTTP, Tomcat, WildFly, Node.js)
- availability of the web application
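A web application availability check can be sketched with only the standard library; in practice the stock check_http plugin does the same job. The URL and thresholds below are placeholders of my own:

```python
# Sketch of a web application availability check. The status-code
# thresholds are illustrative assumptions.
import time
import urllib.error
import urllib.request

def check_http(url: str, timeout_s: float = 10.0):
    """Return (exit_code, status, elapsed_seconds) using Nagios conventions."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code  # 4xx/5xx still means the server answered
    except urllib.error.URLError:
        return 2, None, time.perf_counter() - start  # CRITICAL: unreachable
    elapsed = time.perf_counter() - start
    if status >= 500:
        return 2, status, elapsed
    if status >= 400:
        return 1, status, elapsed
    return 0, status, elapsed
```

Called as, say, `check_http("https://example.com/")`, it returns the Nagios status code together with the HTTP status and the measured response time.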
In each case the monitoring looks a little different and is application- and server-specific. It is the task of the developer, tester and administrator to define the necessary metrics and to determine the corrective actions and response times to problems reported by the system.
It is also necessary to design the monitoring in such a way that the periodic collection of metrics does not cause a significant drop in the performance of the monitored application. It is good to collect metrics in groups that are checked during a single request sent by the monitoring server. It is also important that the script collecting the metrics executes as quickly as possible.
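The idea of grouping can be sketched as follows: one request triggers all the checks of a group at once, run in parallel, and returns a single packed result. The metric functions here are trivial placeholders of my own (the process count assumes a Linux /proc filesystem):

```python
# Sketch of collecting a group of metrics in parallel, so that one
# request from the monitoring server triggers all checks at once.
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def disk_percent():
    u = shutil.disk_usage("/")
    return "disk", round(u.used / u.total * 100, 1)

def load_average():
    return "load1", os.getloadavg()[0]  # 1-minute load average (Unix only)

def process_count():
    # Rough count of processes via /proc (Linux only); -1 elsewhere.
    if os.path.isdir("/proc"):
        return "procs", sum(1 for p in os.listdir("/proc") if p.isdigit())
    return "procs", -1

def collect_group(checks):
    """Run all checks concurrently and return a single packed result."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        return dict(pool.map(lambda fn: fn(), checks))

if __name__ == "__main__":
    # One call gathers the whole group; the result can then be packed
    # and sent back to the monitoring server in a single response.
    print(collect_group([disk_percent, load_average, process_count]))
```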
Using the Nagios monitoring system – my favourite, and the one I recommend – you can do this with the check_multi plugin. More information about this plugin can be found here. It collects the selected metrics and combines them into groups. A single process is then started on the monitored server, which uses threads to check the metrics in parallel. This makes checking several metrics very fast (e.g. 5 s). All the results from each monitoring script are collected, packed together and sent over the network to the monitoring server.
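As an illustration, a check_multi command file grouping a few standard plugins might look roughly like this (from memory of the plugin's documentation – verify the exact syntax against your version):

```
command [ disk  ] = check_disk -w 20% -c 10% -p /
command [ load  ] = check_load -w 5,4,3 -c 10,8,6
command [ procs ] = check_procs -w 250 -c 400
```

check_multi runs the listed commands in parallel and reports them back as one grouped result.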
Each measurement, apart from the measurement data itself, should also collect statistical data, on the basis of which graphs are drawn and server or application availability reports are created. These statistical data can also feed statistical functions such as covariance or prediction, to estimate when, and with what probability, a failure will occur again. Analyzing the memory consumption and storage space graphs also makes it possible to assess whether the server's resources should be increased.
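One simple form such prediction can take is a trend line: fit a least-squares line to the collected samples and extrapolate when it crosses a limit. The sketch below is illustrative – the sample data and the limit are invented, and real systems use richer models:

```python
# Sketch of trend extrapolation over collected metric samples:
# a least-squares line predicting how many measurement intervals
# remain before a resource limit is reached.

def fit_line(samples):
    """Least-squares fit y = slope * x + intercept over x = 0, 1, 2, ..."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) \
        / sum((x - mean_x) ** 2 for x in range(n))
    return slope, mean_y - slope * mean_x

def intervals_until(samples, limit):
    """Intervals until the trend crosses `limit`; None if flat or falling."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - (len(samples) - 1)

if __name__ == "__main__":
    memory_used_gb = [50, 52, 54, 56, 58]  # one sample per day, invented data
    print(intervals_until(memory_used_gb, limit=70))  # -> 6.0 (days)
```

With these invented samples the fitted trend grows by 2 GB per day, so the 70 GB limit would be hit about six days after the last sample – exactly the kind of signal that tells you to add resources before a failure occurs.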
Data from monitoring systems can be presented for analysis in many different ways, e.g. using PNP4Nagios, Grafana, Plesk or Splunk.