Configuration rules help you solve these problems and more. Whenever you extend monitoring to a new device, or want to target a configuration change at a particular set of devices, configuration rules come to your rescue. This document will show you how.
Before we dive head-first, see if this scenario sounds familiar.
You are trying to set up your new server monitoring tool. The plan is to configure it properly like:
Why does this sound painful? Because out-of-the-box monitoring tools do straightforward things, such as generating an alert when a server goes down. But how do you bend a tool to suit your needs without breaking it or paying a consultant (who could raise an invoice that would potentially bankrupt you)?
Let's say your SREs and sysadmins have found the time and patience to set up your monitoring tool to satisfy all the above conditions. Good job! But good luck replicating the same process for the remaining thousands of servers out there.
To end this dreadful process, Site24x7 has configuration rules. With configuration rules, any configuration change to a monitor can be pushed to only the server monitors that satisfy certain conditions; for example, only the monitors tagged as USWest1 or falling under a specific IP range.
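To make the idea concrete, here is a minimal sketch of condition-based targeting in plain Python. It is not Site24x7's API: the `Monitor` class, the `matches` and `apply_rule` functions, the `poll_interval_min` setting, and the sample IP addresses are all hypothetical names invented for illustration. It only shows the logic of pushing a change to monitors that carry a given tag or fall within a given IP range.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical stand-in for a server monitor (not a Site24x7 type)."""
    name: str
    ip: str
    tags: set = field(default_factory=set)
    settings: dict = field(default_factory=dict)


def matches(monitor, tag=None, cidr=None):
    """Return True if the monitor carries the tag or sits inside the CIDR range."""
    if tag is not None and tag in monitor.tags:
        return True
    if cidr is not None:
        return ipaddress.ip_address(monitor.ip) in ipaddress.ip_network(cidr)
    return False


def apply_rule(monitors, change, tag=None, cidr=None):
    """Apply `change` only to matching monitors and return their names."""
    updated = []
    for monitor in monitors:
        if matches(monitor, tag=tag, cidr=cidr):
            monitor.settings.update(change)
            updated.append(monitor.name)
    return updated


monitors = [
    Monitor("web-01", "10.1.0.15", {"USWest1"}),
    Monitor("db-01", "10.2.0.7", {"USEast1"}),
    Monitor("cache-01", "192.168.5.20"),
]

# Target monitors tagged USWest1 or located in 10.2.0.0/16 (assumed range).
updated = apply_rule(
    monitors, {"poll_interval_min": 1}, tag="USWest1", cidr="10.2.0.0/16"
)
print(updated)  # ['web-01', 'db-01']
```

The third monitor matches neither condition, so its settings stay untouched. That is the point of a rule: the change reaches only the monitors you describe, however many there are.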
The use cases are plentiful. But first, let us start with something simple.
A typical use case
Consider an example organization that has 20,000 servers across on-premises locations and all three major public cloud providers (AWS, Azure, and GCP). The servers in the cloud are spread across different locations. The databases are set to full recovery, so oversized log files are a known issue. The app servers are at risk of memory leaks. The VMs keep scaling up and down: every time a new VM is spawned, it has to be monitored, and when it is terminated, monitoring should stop.
So, an ideal monitoring setup for this environment should:
- Segregate the monitors with unique identifiers. For example, monitors for Azure VMs and monitors for AWS VMs should be grouped separately. Application servers and database servers should be tagged for identification. There should be an option to group and tag the monitors based on a variety of identifiers.
- Monitor each server according to its use case. Database and caching servers are to be monitored for connectivity and disk use, while application servers are to be monitored for CPU and memory utilization.
- Hold back alerts for momentary spikes on some servers.
- Route alerts to the appropriate team or person, not the entire sysadmin team, because alert fatigue is dangerous.
- Alert on database servers when the "mysqld" process is down, and on app servers when a critical Java process or any process in a particular path is down.
13,000+ organizations handle the above problems without breaking a sweat. How?
With ManageEngine Site24x7. Let's address the above scenario piece by piece.
Monitor groups and tags
Site24x7 lets you group your servers. For example, if you have 20,000 servers and 10,000 of them are in Azure and the remainder are in AWS, you can create two monitor groups named "Azure" and "AWS". The best part about this method is that with configuration rules, you don't need to create a monitor group every single time. Set a rule to create monitor groups, and Site24x7 will handle it for you. Site24x7 can group your server monitors by many parameters, including host name, IP address, and OS type.
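As a rough illustration of rule-driven grouping, the sketch below sorts servers into groups by host name and OS type. Everything in it is assumed for the example: the `GROUP_RULES` list, the `assign_groups` function, the "az-" and "aws-" host name prefixes, and the "Linux servers" group are hypothetical and are not taken from Site24x7.

```python
from collections import defaultdict

# Hypothetical rules: (server field, predicate on its value, target group).
GROUP_RULES = [
    ("hostname", lambda value: value.startswith("az-"), "Azure"),
    ("hostname", lambda value: value.startswith("aws-"), "AWS"),
    ("os", lambda value: value == "linux", "Linux servers"),
]


def assign_groups(servers, rules=GROUP_RULES):
    """Place each server into every group whose rule it satisfies."""
    groups = defaultdict(list)
    for server in servers:
        for field_name, predicate, group in rules:
            if predicate(server.get(field_name, "")):
                groups[group].append(server["hostname"])
    return dict(groups)


servers = [
    {"hostname": "az-app-01", "os": "windows"},
    {"hostname": "aws-db-01", "os": "linux"},
    {"hostname": "aws-app-02", "os": "linux"},
]

groups = assign_groups(servers)
print(groups["Azure"])          # ['az-app-01']
print(groups["AWS"])            # ['aws-db-01', 'aws-app-02']
print(groups["Linux servers"])  # ['aws-db-01', 'aws-app-02']
```

A server can land in more than one group, and a newly added server is grouped by the same rules with no manual step.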
Alerts
Servers are used for different purposes, so the alerts associated with each server should also be tailor-made; however, tailor-made does not necessarily mean made manually every time. Create a threshold profile once for each type of server you have, assign rules to dictate which profile is applied to which server, and you are sorted for life. Any new or existing monitor will comply with those rules. Configuration rules can never be bent or broken.
Want to understand threshold profiles better? Think of a threshold profile as a template containing the triggers for alerts. Set the thresholds for the performance and health metrics just once. The profile can then be associated with hundreds or thousands of servers according to their capacity and usage.
Here is an example of how alert thresholds are usually set. With Site24x7, there are three severity levels for alerts: Down, Critical, and Trouble. Let's say you want to create a threshold profile for a compute VM that is prone to getting memory utilization spikes but is critical to a business process. You would set these thresholds:
- "Trouble" alert at "90%" memory utilization.
- "Critical" alert at "95%" memory utilization.
- Poll frequency (i.e., data fetching frequency) set as "1 minute".
- Poll Value set as "2" polls, which gives enough leeway to filter out momentary spikes. This way, alerts are triggered only when a limit is breached for two consecutive data collection cycles (i.e., 2 minutes).
This is just one way to configure alerts for a server or VM. With Site24x7, you are armed with options to set alert limits for more than 80 health and performance metrics of your servers.
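The two-poll behavior in the memory utilization example can be sketched in a few lines of Python. This is an illustration of the logic only, under stated assumptions: `ThresholdProfile`, `severity`, and `evaluate` are hypothetical names, a reading that equals the limit is treated as a breach, and a mixed Trouble/Critical window reports the lower level. Site24x7's actual evaluation may differ on these points.

```python
from dataclasses import dataclass

# Ranking used to pick the lowest severity seen in a window of polls.
SEVERITY_RANK = {None: 0, "Trouble": 1, "Critical": 2}


@dataclass
class ThresholdProfile:
    """Hypothetical template mirroring the memory utilization example."""
    trouble_pct: float = 90.0
    critical_pct: float = 95.0
    poll_count: int = 2  # consecutive polls that must breach before alerting


def severity(value, profile):
    """Classify one reading (assumption: reaching the limit counts as a breach)."""
    if value >= profile.critical_pct:
        return "Critical"
    if value >= profile.trouble_pct:
        return "Trouble"
    return None


def evaluate(samples, profile):
    """Return the alert state after each poll.

    An alert is raised only when the last `poll_count` readings all breach
    a limit; the level reported is the lowest one shared by that window.
    """
    states = []
    for i in range(len(samples)):
        window = samples[max(0, i - profile.poll_count + 1): i + 1]
        if len(window) < profile.poll_count:
            states.append(None)  # not enough polls collected yet
            continue
        levels = [severity(value, profile) for value in window]
        states.append(min(levels, key=SEVERITY_RANK.get))
    return states


# One reading per minute; the lone 96% spike at minute 2 is ignored.
samples = [70, 96, 72, 91, 93, 96, 97]
states = evaluate(samples, ThresholdProfile())
print(states)
# [None, None, None, None, 'Trouble', 'Trouble', 'Critical']
```

The single spike to 96% never alerts because the readings on either side of it are below 90%, while the sustained climb at the end raises Trouble first and Critical once two consecutive readings exceed 95%.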