Server monitoring checklist

Do you ever look at the list of metrics you monitor and feel overwhelmed? That is a better problem to have than reworking your server performance KPIs because your server monitoring tool cannot track them. With Site24x7's server monitoring suite, it is easy to be spoiled for choice when deciding which metrics to monitor.

We analyzed the problems we solved for our customers and prepared this server monitoring checklist, which will help you implement a robust monitoring strategy.

What is wrong with having default thresholds?

Proper server monitoring is a mile deep and an inch wide. Consider this: of the thousands of servers you run, a few will be database servers. Deadlocks, slow queries, and backup failures are issues specific to databases, but default monitoring will not surface them. This is exactly why you need thresholds tailored to each server's purpose. It is time to revisit your thresholds with purpose-based monitoring.
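
As a rough illustration of purpose-based thresholds, the sketch below merges a default set of checks with checks that apply only to a given server role. The role names, check names, and numeric limits are all hypothetical placeholders chosen for the example; they are not Site24x7 settings.

```python
from typing import Dict

# Default checks applied to every server (placeholder values).
BASE_CHECKS: Dict[str, float] = {"cpu_percent": 90.0, "memory_percent": 90.0}

# Extra checks keyed by server role (hypothetical names and limits).
ROLE_CHECKS: Dict[str, Dict[str, float]] = {
    "database": {
        "deadlocks_per_min": 0.0,
        "slow_queries_per_min": 5.0,
        "failed_backups": 0.0,
    },
    "web": {"http_5xx_per_min": 10.0},
}


def thresholds_for(role: str) -> Dict[str, float]:
    """Return the default checks plus any checks specific to the role."""
    merged = dict(BASE_CHECKS)
    merged.update(ROLE_CHECKS.get(role, {}))
    return merged
```

A server with an unknown role simply falls back to the defaults, so adding a new role never removes the baseline checks.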

Server availability

Let's start with the basics. Before we move on to performance bottlenecks, we need to know whether any hosts are, or have been, offline.

Host down: If the agent cannot send the availability signal to the Site24x7 servers, either the server has shut down or there is a network outage.
Server or VM reboot: Unplanned server reboots can indicate an underlying threat. Site24x7 sends you an alert, an RCA, and the server restart report so that you are aware of any host reboots. Another tip is to monitor Event ID 41 via event log checks or event log monitoring.
Servers or VMs that require a reboot: A critical patch that is not applied (because a host has not been restarted, for example) can be harmful. Find out which hosts require a reboot by viewing the pending reboot for Windows Update report.

CPU performance metrics

One untested application update is all it takes to spike CPU utilization to 100% and bring down a server. Set alerts to occur if the CPU utilization crosses 90% (first alert) and then at 95% (second alert). You can monitor the following metrics as well.

Processor queue length: >2 per core
Processor time (%): >85%
Context switches per sec: >5,000
Interrupt time: >10%
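
The two-step alerting described above can be sketched as a small classifier. This is a minimal example, assuming that "crosses" means strictly greater than the limit; the function names are made up for illustration, and the per-core check scales the queue length limit of 2 by the number of cores.

```python
from typing import Optional


def alert_level(value: float, warning: float = 90.0,
                critical: float = 95.0) -> Optional[str]:
    """Classify a utilization percentage into a first or second alert."""
    if value > critical:
        return "critical"   # second alert
    if value > warning:
        return "warning"    # first alert
    return None             # within limits


def queue_length_exceeded(queue_length: float, cores: int,
                          per_core_limit: float = 2.0) -> bool:
    """True when the processor queue length exceeds the per-core limit."""
    return queue_length > per_core_limit * cores
```

For example, a queue length of 10 on a 4-core host exceeds the limit of 8, while the same queue length on an 8-core host does not.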

Memory (RAM) metrics

Like the CPU, sustained spikes in memory utilization can also bring your server to a standstill. Set alerts if the memory utilization of your server crosses 90% (first alert) and then 95% (second alert). Below are a few more metrics you should monitor.

Page faults per second: >1,000
Memory pages per second: >1,000
Page reads per second: >10

Disks and disk partitions

Ideally, your thresholds should be set to alert when you have only 20% (first alert) and 10% (second alert) of disk space left.

Consistently high disk queue length over a period of time (say 30 minutes) indicates that there are multiple read and write operations waiting to be processed by the disk. Set alerts for increased disk queue length over a period to step up your capacity plans.

Disk queue length: >2 per spindle

Average disk seconds per read: >10 ms
Average disk seconds per write: >10 ms
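
The disk checks above could look roughly like the following sketch. It assumes that queue length samples for the chosen window (for example, one per minute over 30 minutes) are collected elsewhere, and it treats the queue as "consistently high" only when every sample in the window is over the limit. The function names are illustrative.

```python
import shutil
from typing import Optional, Sequence


def free_space_level(free_bytes: int, total_bytes: int) -> Optional[str]:
    """First alert at 20% free space or less, second alert at 10% or less."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= 10.0:
        return "critical"
    if free_pct <= 20.0:
        return "warning"
    return None


def check_path(path: str) -> Optional[str]:
    """Apply the free-space check to the disk that holds the given path."""
    usage = shutil.disk_usage(path)
    return free_space_level(usage.free, usage.total)


def sustained_queue(samples: Sequence[float], spindles: int,
                    limit_per_spindle: float = 2.0) -> bool:
    """True only if every sample in the window exceeds the per-spindle limit."""
    limit = limit_per_spindle * spindles
    return bool(samples) and all(sample > limit for sample in samples)
```

Requiring every sample to be over the limit avoids alerting on a single burst of I/O; a stricter or looser rule, such as a percentile, would be an equally reasonable choice.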

Network metrics

If your server has been using either too much or too little bandwidth compared to the baseline, it could signal a misconfiguration or a problem. Some ports have to stay closed, and some have to stay open. Monitor the status of critical ports.

The status of your network interfaces needs to be monitored, especially if your hosts need to communicate within and outside your IT infrastructure.
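
One way to picture port status monitoring is a check that compares each port's observed state with the state you expect. The sketch below uses a plain TCP connection attempt from Python's standard library; the helper names and the shape of the `expected` mapping are assumptions made for this example, not a Site24x7 interface.

```python
import socket
from typing import Callable, Dict, List


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def port_violations(host: str, expected: Dict[int, bool],
                    probe: Callable[[str, int], bool] = port_is_open) -> List[int]:
    """Return the ports whose observed state differs from the expected one.

    expected maps a port number to True (must be open) or False (must be closed).
    """
    return [port for port, should_be_open in expected.items()
            if probe(host, port) != should_be_open]
```

Calling `port_violations("app01.example.com", {443: True, 23: False})` would, under these assumptions, report port 443 if it is unreachable and port 23 if it accepts connections. The host name here is a placeholder.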

Application health and performance

Enterprises have dedicated servers called application servers (app servers) with the sole intention of running business-critical applications on them. Let's see which parameters indicate the status of application health and security.

Many applications, services, and processes are involved. In addition to monitoring system-level health and performance, you have to monitor the building blocks: applications, services, and processes.

Application availability: Check the application's availability on a server first. Be it IIS, Active Directory, SharePoint, Exchange, Docker, Java, .NET, PHP, Ruby, Node.js, Python, or anything else, the application's status should be monitored.
Process and service availability: Monitor the status of business-critical services and processes with the flexibility to specify the path, start-up mode, and arguments to pinpoint the exact process or service that is important to you.
Resource usage by processes and services: Processes and services are important, but they should not take up all your server's resources. Learn how many server resources your services and processes are consuming, and set alerts for when any service or process uses more than the allowed CPU or memory.
Application freeze: Windows Event ID 1002 marks an application hang or freeze. Set an alert for Event ID 1002 to know which application stopped responding so that you can investigate what caused it. There are also event IDs that pinpoint suspicious application activity. You can also get this alert from the specific application's logs with log monitoring.

In addition to these components, monitor these metrics:

Process monitoring:

Process CPU time
Process memory usage
Handle count
Thread count

Service monitoring:

Service state
Service startup type
Service response time

Security monitoring

Site24x7 server monitoring (or any server monitoring tool) is not a replacement for your endpoint security tool. But Site24x7 can let you know whether there are suspicious events that need your attention. Here are some of them.

Windows Firewall status: You can use resource checks to know the status of Windows Firewall. You can also use event log monitoring to get alerts for Event ID 5025, which means that the Windows Firewall service has been stopped.
Security application's services and processes: Use Site24x7 to monitor the status of your security application's services and processes.
Failed log-in attempts: Set alerts for when Event IDs 4625, 4740, 644, and 4777 occur more than three times. The three-time cushion is to prevent alert fatigue in cases where a user types the wrong credentials by mistake.
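
The "more than three times" rule for failed log-ins can be sketched as a simple counter. This example assumes that events arrive as (event ID, account name) pairs for one evaluation window and that counts are kept per account; both are assumptions made for illustration, since the checklist does not specify how the occurrences are grouped.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Event IDs named in the checklist item above.
FAILED_LOGIN_EVENT_IDS = {4625, 4740, 644, 4777}


def accounts_to_alert(events: Iterable[Tuple[int, str]],
                      max_allowed: int = 3) -> List[str]:
    """Return accounts with more than max_allowed matching events in the window."""
    counts = Counter(account for event_id, account in events
                     if event_id in FAILED_LOGIN_EVENT_IDS)
    return sorted(account for account, n in counts.items() if n > max_allowed)
```

With this rule, an account with exactly three failed attempts stays silent, and the fourth attempt triggers the alert.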

To learn more about how Site24x7 can strengthen your security posture, read our solution article on detecting cyber-attacks with Site24x7 server monitoring.

Bottom line

Thresholds work only when they are set right. Utilize the checklist we have provided as a guideline so that you track the metrics that make a difference. If you would like to offload the threshold limits to AI, you can do so with our dynamic thresholds feature (powered by Zia AI).

Site24x7's server monitoring agent is your single-tab solution for all your datacenter needs. Be it on-premises servers, VMs spread over all major cloud service providers, containers, or even kiosks, our light-weight server monitoring agent keeps an AI-powered watchful eye on the health and performance of your servers. Take a spin with the 30-day, zero-restrictions trial and see the capabilities for yourself. Alternatively, you can let our product support team give you a demo, tailor-made to your business and IT infrastructure.

Topic Participants
Geoffrin Edwin