Yesterday, some of you might have noticed some strange behavior with the Site24x7 service. First of all, let me apologize for the inconvenience caused. I will do my best to explain what happened, why it happened, and how we are going to deal with this kind of problem in the future.
The Incident
Yesterday around 12:30 pm PST, we had about three outages at our IDC, the longest lasting 5 minutes. Our engineering team was able to get the server up and running within 20 minutes of receiving the downtime alert. Monitoring did not run normally during this period, however: there was some erratic behavior, such as being abruptly logged out of your account (if you were logged in during that period) and receiving downtime alerts even when your website was up.
Once the problem was rectified, our monitoring system resumed working without any issues.
The Problem
After meticulous troubleshooting, we found the root cause to be the failure of one of the routers serving our link at our ISP's (Internap) end. Unexpectedly, our redundant link was also homed on the same router. Our ISP is analyzing the failure of the redundant link.
The Result
This affected our monitoring because we were unable to connect to the monitored target servers. As a result, our system generated false alerts.
How to Delete the Downtime from your Report
You can delete these downtimes from your report by following the steps below:
This will delete the downtime for the monitor.
The Learning
In addition to the existing checks for avoiding false alerts, we will add a network connectivity check to avoid this kind of scenario in the future.
We will also refund the SMS credits to users who were affected by yesterday's downtime.
If you have any further questions or need clarification, please contact us at support@site24x7.com.
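To illustrate the idea behind such a connectivity check, here is a minimal Python sketch. It is not our actual implementation: the function names (`tcp_reachable`, `classify_failure`), the use of a plain TCP connection as the probe, and the reference endpoints are all assumptions made for the example. The point is that a failed check against a monitored site is reported as downtime only if the monitoring location can still reach some independent reference hosts; otherwise the result is treated as inconclusive and no alert is sent.

```python
import socket
from typing import Callable, Iterable, Tuple

Endpoint = Tuple[str, int]  # (hostname, port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = endpoint
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def classify_failure(
    target: Endpoint,
    references: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
    min_reachable: int = 1,
) -> str:
    """Classify a check as 'up', 'down', or 'inconclusive'.

    'inconclusive' means the monitoring location itself appears to have
    lost connectivity, so a downtime alert would likely be false.
    """
    if probe(target):
        return "up"
    # The target failed: verify our own connectivity before alerting.
    reachable = sum(1 for ref in references if probe(ref))
    if reachable >= min_reachable:
        return "down"
    return "inconclusive"
```

For example, `classify_failure(("www.example.com", 80), [("ref-a.example.net", 443), ("ref-b.example.net", 443)])` would return "down" only if the site is unreachable while at least one of the (hypothetical) reference hosts still answers. In an outage like yesterday's, where the monitoring location lost both of its links, every probe would fail and the result would be "inconclusive" instead of a false alert.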