An API returning 200 OK can still be broken—wrong payload, a four-second P95, an expired certificate quietly downgraded to HTTP. Postman's State of the API 2024 report found that 58% of API failures are first reported by end users. If your customers are your alerting mechanism, you're not monitoring correctly.
These 15 rules cover the full reliability surface: availability, performance, correctness, security, and operational maturity. Each includes what to measure, why it matters, and how to set it up.
A single monitoring location tells you the API is reachable from there. Nothing more.
Why it matters: BGP route leaks and regional network failures can make an API unreachable across large portions of the internet while it stays fully operational from other regions. Postman's State of the API 2024 report found that 67% of organizations serve API traffic from three or more regions, but multi-location monitoring remains unevenly adopted. An API that's healthy in us-east-1 and timing out in ap-southeast-1 looks perfectly fine from a single monitoring point.
What to measure: Response time and availability from every region where you have users.
How to set it up: Configure monitors across multiple global locations with per-location alerting thresholds. Alert on the location—not just the aggregate—so regional degradation doesn't average out.
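The difference between per-location and aggregate alerting can be sketched in a few lines of Python. This is a minimal illustration, not any monitoring tool's API: the `LocationResult` shape, the location names, and the 1,000 ms threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class LocationResult:
    location: str        # hypothetical probe location name
    available: bool      # did the check succeed from this location?
    response_ms: float   # observed response time in milliseconds


def failing_locations(results, threshold_ms):
    """Return locations that are down or slower than their threshold.

    `threshold_ms` maps location -> limit; locations without an entry
    fall back to the "default" key.
    """
    failing = []
    for r in results:
        limit = threshold_ms.get(r.location, threshold_ms["default"])
        if not r.available or r.response_ms > limit:
            failing.append(r.location)
    return failing


results = [
    LocationResult("us-east-1", True, 120.0),
    LocationResult("eu-west-1", True, 180.0),
    LocationResult("ap-southeast-1", True, 2400.0),  # regional degradation
]
thresholds = {"default": 1000.0}  # assumed threshold, for illustration only

aggregate_ms = mean(r.response_ms for r in results)
print(aggregate_ms)                            # 900.0, under the threshold
print(failing_locations(results, thresholds))  # ['ap-southeast-1']
```

The averaged value stays under the threshold while one region is clearly degraded, which is why the alert condition is evaluated per location.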
Five-minute check intervals create five-minute blind spots—five minutes of failed payments, broken logins, or silent data corruption before anyone knows.
Why it matters: Gartner® research shows that mean time to detect (MTTD) for API degradation drops from 15–30 minutes to one to three minutes with synthetic checks at one-minute intervals. For a payment API losing thousands of dollars per minute of downtime, the math is unambiguous. The right check interval doesn't just reduce MTTD. It determines whether your monitoring finds the problem or your users do.
What to measure: Availability and response time at the frequency your criticality tier demands.
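How to set it up: Tier your endpoints by business impact and match the check interval to the tier: one minute for Tier 1 endpoints (payment flows, authentication, and checkout), one to three minutes for core product APIs, five minutes for internal or admin endpoints, and 10 to 15 minutes for documentation or status pages.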
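One way to keep tiering explicit is to store it as configuration. The sketch below assumes hypothetical tier names and endpoint paths; the intervals follow the per-tier guidance given in this article's FAQ, using the upper bound where a range is given.

```python
# Check intervals in seconds per criticality tier. Tier names are illustrative.
CHECK_INTERVAL_SECONDS = {
    "tier1_critical": 60,    # payment flows, authentication, checkout
    "core_product": 180,     # core product APIs (one to three minutes)
    "internal_admin": 300,   # internal or admin endpoints
    "docs_status": 900,      # documentation or status pages (10 to 15 minutes)
}

# Hypothetical endpoint-to-tier assignment.
ENDPOINT_TIERS = {
    "/v1/payments": "tier1_critical",
    "/v1/orders": "core_product",
    "/admin/reports": "internal_admin",
}


def interval_for(endpoint: str) -> int:
    """Return the check interval; unknown endpoints default to the core tier."""
    tier = ENDPOINT_TIERS.get(endpoint, "core_product")
    return CHECK_INTERVAL_SECONDS[tier]


def worst_case_blind_spot_minutes(endpoint: str) -> float:
    """A failure that starts right after a check goes unseen for up to one interval."""
    return interval_for(endpoint) / 60


print(worst_case_blind_spot_minutes("/v1/payments"))    # 1.0
print(worst_case_blind_spot_minutes("/admin/reports"))  # 5.0
```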
A 200 OK response with an empty body, an error wrapped in a success envelope, or a partial dataset looks fine to a status-code-only check—and passes silently while users experience failures.
Why it matters: During major payment API degradations, APIs have returned 200 responses with partial or empty bodies before escalating to hard 500 errors. Status-code monitoring stays green while transactions fail.
What to measure: Status code and response body content. Assert that critical fields exist and contain expected values.
How to set it up: Use content-match assertions or JSONPath validation (for example, $.status == "active" or $.data.length > 0) on every critical monitor. A 200 that doesn't say what you expect is a failure.
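For teams scripting their own checks, the two example assertions can be expressed in plain Python. This is a sketch under the assumption that the response is a JSON object with a `status` string and a `data` array; the function name is hypothetical.

```python
import json


def validate_response(status_code: int, body_text: str) -> list[str]:
    """Return a list of failures; an empty list means the check passed."""
    failures = []
    if status_code != 200:
        failures.append(f"unexpected status {status_code}")
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return failures + ["body is not valid JSON"]
    if not isinstance(body, dict):
        return failures + ["body is not a JSON object"]
    # Equivalent of the JSONPath assertion $.status == "active"
    if body.get("status") != "active":
        failures.append("status field is not 'active'")
    # Equivalent of $.data.length > 0
    data = body.get("data")
    if not isinstance(data, list) or len(data) == 0:
        failures.append("data is missing or empty")
    return failures


print(validate_response(200, '{"status": "active", "data": [1]}'))  # []
print(validate_response(200, '{"status": "active", "data": []}'))   # ['data is missing or empty']
print(validate_response(200, ""))                                   # ['body is not valid JSON']
```

Both failing cases return 200 OK, so a status-code-only check would report them as healthy.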
A single 500ms threshold is too tight for search endpoints and too loose for health checks. One creates alert noise; the other creates blind spots.
Why it matters: Say a CRM platform's REST API degrades and response times spike to 10–30 times their normal values. The API still returns 200 OK but is effectively unusable. A single global threshold would miss the smaller spikes on fast endpoints while creating constant noise on the search endpoints.
What to measure: Response time against endpoint-specific thresholds:
| Endpoint type | P95 target | P99 target |
|---|---|---|
| Health check / ping | < 50ms | < 100ms |
| Simple CRUD | < 200ms | < 500ms |
| List / search with filtering | < 800ms | < 1,500ms |
| Aggregation / reporting | < 1,500ms | < 3,000ms |
How to set it up: Baseline each endpoint for two weeks, then set thresholds at P95 and P99 of normal behavior. Revisit every quarter—traffic patterns shift, and a threshold that was meaningful in Q1 becomes noise in Q3.
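The table above can be encoded directly as configuration. The sketch below assumes hypothetical endpoint-type keys and made-up measurements; it shows how the same numbers are judged differently per endpoint type than they would be against a single 500 ms threshold.

```python
# Starting-point targets from the table above, in milliseconds: (P95, P99).
TARGETS_MS = {
    "health_check": (50, 100),
    "simple_crud": (200, 500),
    "list_search": (800, 1500),
    "aggregation": (1500, 3000),
}


def breaches(endpoint_type: str, p95_ms: float, p99_ms: float) -> list[str]:
    """Return which percentile targets the measured values fail to stay under."""
    p95_limit, p99_limit = TARGETS_MS[endpoint_type]
    found = []
    if p95_ms >= p95_limit:
        found.append("p95")
    if p99_ms >= p99_limit:
        found.append("p99")
    return found


# A health check at 300 ms is badly degraded, yet a global 500 ms threshold stays silent.
print(breaches("health_check", 300, 450))  # ['p95', 'p99']
# A search endpoint at 650 ms is normal, yet a global 500 ms threshold would alert.
print(breaches("list_search", 650, 1200))  # []
```

Once a two-week baseline exists, the tuple for each endpoint type would be replaced with that endpoint's own observed P95 and P99.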
Average response time is a liar. If your P50 is 100ms and your P99 is 5,000ms, the average might show 200ms—perfectly healthy on a dashboard. Meanwhile, 1% of your users experience 50 times worse performance.
Why it matters: In API degradation incidents, P95 and P99 latency spikes precede error rate climbs by minutes. Tail latency is the early warning signal. Average latency is often still within bounds when degradation has already started—which means the alert fires late or not at all.
What to measure: P50 (median), P95 (early warning), and P99 (structural failure indicator).
How to set it up: Alert on P95 and P99 independently. P50 is your baseline. P95 is your canary. When P99 spikes, something is structurally wrong—and the sooner you know which endpoint, the faster you can isolate the cause.
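A small worked example makes the gap between the average and the tail concrete. The sample below is invented for illustration (98 fast requests, 2 very slow ones), and the percentile function uses the nearest-rank definition, which is one of several common conventions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 requests at 100 ms and 2 requests at 5,000 ms.
latencies_ms = [100] * 98 + [5000] * 2

print(mean(latencies_ms))            # 198, close to a "healthy" 200 ms average
print(percentile(latencies_ms, 50))  # 100
print(percentile(latencies_ms, 95))  # 100
print(percentile(latencies_ms, 99))  # 5000
```

In this sample P95 is still at 100 ms, so only the P99 condition fires. That is the reason to alert on the two percentiles independently instead of picking one.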
Total response time tells you something is slow. Phase decomposition tells you why.
Why it matters: A two-second response could stem from slow DNS resolution, a TCP handshake bottleneck, sluggish TLS negotiation, or slow server processing. Each has a different owner and a different fix. Without decomposition, you're checking every downstream service manually, with no indication of where to start. DNS slowdowns, TLS renegotiations, and slow time to first byte (TTFB) don't share a root cause—and they don't share a solution.
What to measure: DNS lookup, TCP connection, SSL/TLS handshake, TTFB, and content transfer—each independently.
How to set it up: Use a REST API monitor that breaks down response time into connection phases. When total response time spikes, drill into per-phase data to isolate the bottleneck before starting any downstream investigation.
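Once per-phase timings are available, isolating the bottleneck is a simple comparison against baseline. The sketch below assumes hypothetical phase names and invented timings; it does not measure anything itself.

```python
PHASES = ("dns", "tcp_connect", "tls_handshake", "ttfb", "transfer")


def slowest_regression(current_ms: dict, baseline_ms: dict) -> tuple[str, float]:
    """Return the phase whose time grew the most over baseline, and by how many ms."""
    deltas = {phase: current_ms[phase] - baseline_ms[phase] for phase in PHASES}
    worst = max(deltas, key=deltas.get)
    return worst, deltas[worst]


# Invented numbers: a normal request versus a two-second one.
baseline = {"dns": 20, "tcp_connect": 30, "tls_handshake": 60, "ttfb": 150, "transfer": 40}
current = {"dns": 1700, "tcp_connect": 32, "tls_handshake": 65, "ttfb": 160, "transfer": 43}

print(sum(current.values()))                  # 2000
print(slowest_regression(current, baseline))  # ('dns', 1680)
```

Here the total says "two seconds" while the breakdown says "DNS", which points the investigation at name resolution before any application logs are opened.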
Your continuous integration (CI) pipeline validates the schema at deploy time. What happens when a feature flag flips the schema in production? Or a gradual rollout changes the payload for 10% of requests? CI passes. Production breaks.
Why it matters: Phased deployments routinely change payload structures in production without triggering CI failures. Postman data shows teams deploying daily are three times more likely to experience production schema drift than those deploying weekly. By the time a client reports a parsing failure, the damage is already widespread—and it started before any alert fired.
What to measure: Response body structure against a defined schema on every check—field presence, correct types, and expected nesting.
How to set it up: Define JSONPath assertions that validate structural integrity alongside your content checks. When a field disappears or changes type, you want to know before your API consumers do.
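As an illustration of what a structural check does, here is a small hand-rolled validator in Python. It is a sketch, not a replacement for JSON Schema: the schema notation (a type, a dict of fields, or a one-item list for arrays) and the `ORDER_SCHEMA` fields are assumptions made for this example.

```python
def schema_errors(value, schema, path="$"):
    """Compare a decoded JSON value against a nested description of expected types.

    A schema is a Python type, a dict of field -> schema, or a one-item list
    giving the schema of every array element.
    """
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors = []
        for field, sub in schema.items():
            if field not in value:
                errors.append(f"{path}.{field}: missing")
            else:
                errors.extend(schema_errors(value[field], sub, f"{path}.{field}"))
        return errors
    if isinstance(schema, list):
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        errors = []
        for i, item in enumerate(value):
            errors.extend(schema_errors(item, schema[0], f"{path}[{i}]"))
        return errors
    # bool is a subclass of int in Python, so reject it explicitly for int fields.
    if schema is int and isinstance(value, bool):
        return [f"{path}: expected int, got bool"]
    if not isinstance(value, schema):
        return [f"{path}: expected {schema.__name__}, got {type(value).__name__}"]
    return []


# Hypothetical schema for an order payload.
ORDER_SCHEMA = {"id": int, "status": str, "items": [{"sku": str, "qty": int}]}

ok = {"id": 7, "status": "active", "items": [{"sku": "A1", "qty": 2}]}
drifted = {"id": "7", "items": [{"sku": "A1", "qty": 2}]}  # id changed type, status vanished

print(schema_errors(ok, ORDER_SCHEMA))       # []
print(schema_errors(drifted, ORDER_SCHEMA))  # ['$.id: expected int, got str', '$.status: missing']
```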
Monitoring only GET /health monitors one code path. A POST endpoint can fail while GET succeeds—different controllers, different database operations, different authorization logic.
Why it matters: API authentication failures have taken down authenticated endpoints while public GET endpoints stayed operational, meaning GET-only monitoring showed 100% availability during a real outage affecting every authenticated user. The operations your users depend on are rarely the ones easiest to monitor.
What to measure: The operations your users actually perform. If your API accepts POST, PUT, PATCH, and DELETE, monitor those flows.
How to set it up: Create synthetic transaction monitors that chain multi-step API flows: Authenticate, create a resource, read it, update it, delete it. Each step gets its own assertions and latency thresholds. If your monitoring wouldn't have caught your last incident, it isn't covering the right methods.
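The chained flow can be sketched as a function that threads the created resource's ID through each step. Everything named here is hypothetical: `send` stands in for an HTTP client, the `/items` paths and status codes describe an imagined API, and `FakeApi` exists only so the sketch runs without a network. The authentication step is omitted for brevity.

```python
def run_transaction(send):
    """Run create -> read -> update -> delete and return (step, passed) pairs.

    `send(method, path, body)` is a hypothetical transport returning
    (status_code, json_body); a real monitor would issue HTTP requests here.
    """
    results = []
    status, body = send("POST", "/items", {"name": "probe"})
    created = status == 201 and isinstance(body, dict) and "id" in body
    results.append(("create", created))
    if not created:
        return results  # later steps depend on the created id
    item_path = f"/items/{body['id']}"

    status, body = send("GET", item_path, None)
    results.append(("read", status == 200 and body.get("name") == "probe"))

    status, body = send("PUT", item_path, {"name": "probe-2"})
    results.append(("update", status == 200 and body.get("name") == "probe-2"))

    status, _ = send("DELETE", item_path, None)
    results.append(("delete", status == 204))
    return results


class FakeApi:
    """In-memory stand-in for the API under test."""

    def __init__(self, writes_broken=False):
        self.items, self.next_id, self.writes_broken = {}, 1, writes_broken

    def send(self, method, path, body):
        if method == "POST":
            if self.writes_broken:
                return 500, {"error": "write path down"}
            item_id, self.next_id = self.next_id, self.next_id + 1
            self.items[item_id] = dict(body)
            return 201, {"id": item_id, **body}
        item_id = int(path.rsplit("/", 1)[1])
        if item_id not in self.items:
            return 404, {}
        if method == "GET":
            return 200, self.items[item_id]
        if method == "PUT":
            self.items[item_id] = dict(body)
            return 200, self.items[item_id]
        if method == "DELETE":
            del self.items[item_id]
            return 204, {}
        return 405, {}


print(run_transaction(FakeApi().send))
print(run_transaction(FakeApi(writes_broken=True).send))  # [('create', False)]
```

With the write path broken, a GET-only health check against this fake API would still pass, while the transaction fails at its first step.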
If your Content-Type header disappears after a deploy, clients fail silently. A wrong Cache-Control directive serves stale data for hours. A misconfigured CORS header blocks your frontend entirely. None of these return an error status code.
Why it matters: Header misconfiguration is a class of silent failure that status-code monitoring never catches. OWASP's API Security Top 10 identifies security misconfiguration—including CORS and authentication header issues—as one of the most prevalent API vulnerability classes. Your monitoring should catch misconfigurations before anyone else does.
What to measure: Content-Type, Cache-Control, Access-Control-Allow-Origin, and any custom headers your clients depend on.
How to set it up: Add header assertions to your monitors. Validate that values match expectations and haven't changed unexpectedly. A header value that shifts silently is as much a failure as a 500 response.
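A header assertion is a case-insensitive comparison between what the response carries and what clients rely on. In the sketch below the expected values, including the `https://app.example.com` origin, are placeholders chosen for illustration.

```python
def header_failures(headers: dict, expected: dict) -> list[str]:
    """Compare response headers with expected values; header names are case-insensitive."""
    actual = {name.lower(): value for name, value in headers.items()}
    failures = []
    for name, want in expected.items():
        got = actual.get(name.lower())
        if got is None:
            failures.append(f"{name}: missing")
        elif got != want:
            failures.append(f"{name}: expected {want!r}, got {got!r}")
    return failures


# Placeholder expectations; substitute the values your clients depend on.
EXPECTED = {
    "Content-Type": "application/json",
    "Cache-Control": "no-store",
    "Access-Control-Allow-Origin": "https://app.example.com",
}

# A response after a hypothetical bad deploy: no Content-Type, wildcard CORS.
response_headers = {"cache-control": "no-store", "access-control-allow-origin": "*"}

for failure in header_failures(response_headers, EXPECTED):
    print(failure)
```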
An expired certificate isn't gradual degradation—it's an instant, total outage. Every HTTPS client refuses to connect. No retry helps. No fallback kicks in.
Why it matters: When a widely used root CA certificate expires, APIs whose certificates chain to that root can fail TLS handshakes across millions of devices simultaneously, with no warning. Certificate expiry is one of the most preventable outage causes in API operations, and it remains one of the most common.
What to measure: Days until expiry, certificate chain validity, OCSP revocation status, and protocol version.
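How to set it up: Configure tiered alerts that escalate as expiry approaches: informational at 60 days, warning at 30, critical at 14, and emergency at seven.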
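The tiering logic can be sketched with Python's standard library. The severity names and day counts follow the 60/30/14/seven-day structure described in this article; the function names are hypothetical, and the `not_after` string is assumed to be in the format that `ssl`'s `getpeercert()` reports for `notAfter`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` looks like 'Jan  5 09:34:43 2018 GMT'.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def expiry_severity(days_left: int):
    """Map days remaining to an alert tier, or None when no alert is due."""
    if days_left <= 7:
        return "emergency"
    if days_left <= 14:
        return "critical"
    if days_left <= 30:
        return "warning"
    if days_left <= 60:
        return "informational"
    return None


now = datetime(2017, 12, 26, 9, 34, 43, tzinfo=timezone.utc)
days_left = days_until_expiry("Jan  5 09:34:43 2018 GMT", now)
print(days_left, expiry_severity(days_left))  # 10 critical
```

Running the same check after an expected auto-renewal confirms that the renewal actually completed, since the days-remaining value should jump back above 60.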
Monitoring only public endpoints is like testing the front door of a building and assuming every room inside is fine.
Why it matters: Large-scale API authentication failures have caused valid API keys to return 403 Forbidden while public endpoints stayed operational. Any monitoring hitting only public endpoints would have shown zero issues during a live outage affecting every authenticated user.
What to measure: Full authentication flows—including OAuth 2.0 token grants, API key authentication, and JWT validation—using real credentials.
How to set it up: Configure monitors with your actual authentication methods and assert on both the authentication step and the authenticated response. If your users can't log in, your monitoring setup should know first.
An HTTPS endpoint silently redirecting to HTTP is a security incident. An unexpected 301 chain is either misconfiguration or compromise.
Why it matters: Protocol downgrades and unexpected redirect chains signal network-layer or configuration-layer problems that availability checks miss entirely. By the time a user reports that something feels off, the misconfiguration may have been active for hours.
What to measure: Redirect chains (hops and final destination), protocol downgrades (HTTPS to HTTP), and unexpected 301/302 responses on endpoints that should return 200.
How to set it up: Configure monitors to fail on unexpected redirects rather than following them silently. Assert on the final URL and protocol. A redirect your monitor swallows is a failure your monitoring hides.
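The assertions on a redirect chain can be sketched as a pure function. It assumes the HTTP client has been told not to follow redirects silently and instead records each hop as a `(status_code, url)` pair; the URLs are placeholders.

```python
from urllib.parse import urlsplit


def redirect_failures(chain, expected_final_url, allow_redirects=False):
    """Check a redirect chain given as [(status_code, url), ...] ending with the final response."""
    failures = []
    hops = chain[:-1]
    if hops and not allow_redirects:
        failures.append(f"unexpected redirect ({hops[0][0]}) from {hops[0][1]}")
    # Compare each hop with the next one to spot an HTTPS -> HTTP downgrade.
    for (_, src), (_, dst) in zip(chain, chain[1:]):
        if urlsplit(src).scheme == "https" and urlsplit(dst).scheme == "http":
            failures.append(f"protocol downgrade: {src} -> {dst}")
    _, final_url = chain[-1]
    if final_url != expected_final_url:
        failures.append(f"final URL is {final_url}")
    return failures


URL = "https://api.example.com/v1/users"  # placeholder endpoint

clean = [(200, URL)]
downgraded = [(301, URL), (200, "http://api.example.com/v1/users")]

print(redirect_failures(clean, URL))  # []
for failure in redirect_failures(downgraded, URL):
    print(failure)
```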
An alert sent to an unmonitored email inbox is not monitoring. It's an audit log nobody reads.
Why it matters: Google's SRE book establishes that teams receiving more than five non-actionable alerts per shift begin ignoring all alerts within two to three months. Alert fatigue doesn't just slow response times—it erases the value of everything you've built in Rules 1 through 12. A perfectly tuned threshold firing into a dead Slack channel is worthless.
What to measure: Alert delivery reliability—are alerts reaching the right person within the expected timeframe?
How to set it up: Route alerts to your on-call system (PagerDuty, Opsgenie, or equivalent) with multi-tier escalation. Use Slack or Teams for visibility, never as the sole channel. Target fewer than five actionable alerts per shift. If your false positive rate exceeds 20%, your thresholds were set during a traffic pattern that no longer exists—which is a signal to revisit Rule 15.
Your API's SLA is only as strong as its weakest upstream dependency. If your checkout calls a payment API and your email calls a transactional messaging service, those are part of your reliability surface whether you own them or not.
Why it matters: The average enterprise manages over 15,000 APIs (Gartner®, 2024), with monitoring coverage typically spanning fewer than 40% of production endpoints. Your blind spots are almost certainly in your dependencies, and when a critical third-party API degrades, you want to know before it cascades into your own SLA breach. A dependency failure that surfaces as your own outage, because you had no visibility into that dependency, is the worst kind to explain.
What to measure: Availability and latency of every third-party API your service depends on, tracked separately from your own endpoints.
How to set it up: Map your dependency tree. Create a dedicated monitor for each critical third-party API with its own thresholds and escalation policy. Configure alert suppression so that when a dependency goes down, dependent monitor alerts are automatically silenced, preventing a cascade of false positives from masking the real root cause.
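The suppression rule itself is small: an alert is kept only when none of that monitor's own dependencies is also down. The sketch below uses invented monitor names and is not any product's implementation.

```python
def alerts_to_send(down: set[str], depends_on: dict[str, set[str]]) -> set[str]:
    """Keep an alert only if none of the monitor's direct dependencies is also down."""
    return {m for m in down if not (depends_on.get(m, set()) & down)}


# Hypothetical dependency map: monitor -> monitors it depends on.
DEPENDS_ON = {
    "checkout-api": {"payment-provider"},
    "receipt-email": {"email-provider", "checkout-api"},
}

down = {"payment-provider", "checkout-api", "receipt-email"}
print(alerts_to_send(down, DEPENDS_ON))  # {'payment-provider'}
```

Only direct dependencies are checked, so a chain is suppressed correctly when every intermediate monitor is also down. If an intermediate monitor stays up, alerts from the monitors behind it go out as usual.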
Set-and-forget thresholds are a slow failure. Traffic patterns shift, features change endpoint behavior, and a threshold that was meaningful in Q1 becomes noise in Q3.
Why it matters: Alert fatigue kills monitoring faster than missing alerts. Stale thresholds are the primary source of non-actionable alert floods. Once engineers start dismissing alerts as noise, the next real incident goes unnoticed. The quarterly review isn't housekeeping—it's what keeps Rules 1 through 14 working.
What to measure: Alert volume per endpoint, false positive rate, threshold-to-baseline ratio, and traffic pattern changes.
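How to set it up: Schedule a quarterly review of alert volume, false positive rate, and threshold-to-baseline ratio for each endpoint, and document the threshold changes that result.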
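Part of the review can be automated by flagging thresholds that look stale. The sketch below is a rough heuristic under stated assumptions: the 20% false positive ceiling comes from this article, while the 2x ratio cutoff and the function name are hypothetical and should be tuned per team.

```python
def review_flags(alerts_fired: int, false_positives: int,
                 threshold_ms: float, current_p95_ms: float) -> list[str]:
    """Flag an endpoint's latency threshold for review based on last quarter's data."""
    flags = []
    if alerts_fired and false_positives / alerts_fired > 0.20:
        flags.append("false positive rate above 20%")
    ratio = threshold_ms / current_p95_ms  # threshold-to-baseline ratio
    if ratio < 1.0:
        flags.append("threshold is below current P95: fires on normal traffic")
    elif ratio > 2.0:  # hypothetical cutoff
        flags.append("threshold is more than 2x current P95: may miss degradation")
    return flags


# Traffic grew and P95 drifted from 400 ms to 650 ms, but the threshold stayed at 500 ms.
print(review_flags(alerts_fired=40, false_positives=12, threshold_ms=500, current_p95_ms=650))
# A threshold that still tracks its baseline produces no flags.
print(review_flags(alerts_fired=10, false_positives=1, threshold_ms=500, current_p95_ms=400))  # []
```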
| # | Rule | What to monitor | Action |
|---|---|---|---|
| 1 | Multi-location monitoring | Availability from all user regions | Monitor from three or more locations; alert per location. |
| 2 | One-minute check intervals | Availability and response time | Tier endpoints by criticality; one minute for Tier 1. |
| 3 | Status code and body validation | Response content, not just status | Add JSONPath assertions and content matching. |
| 4 | Per-endpoint latency thresholds | Response time per endpoint type | Baseline for two weeks; set P95 and P99 thresholds. |
| 5 | Percentile tracking (P50, P95, and P99) | Tail latency distribution | Alert on P95 and P99 independently. |
| 6 | Response phase decomposition | DNS, TCP, SSL, TTFB, transfer | Decompose total time; alert per phase. |
| 7 | Runtime schema validation | Response structure in production | Assert field presence, types, and nesting. |
| 8 | All HTTP methods | POST, PUT, PATCH, DELETE, not just GET | Create synthetic transactions for multi-step flows. |
| 9 | Response header assertions | Content-Type, Cache-Control, CORS | Fail on missing or unexpected headers. |
| 10 | SSL/TLS certificate monitoring | Days to expiry, OCSP revocation, chain validity | Set up tiered alerts: 60, 30, 14, and seven days. |
| 11 | Authenticated endpoint testing | OAuth 2.0, API key, JWT flows | Monitor with real credentials. |
| 12 | Redirect and downgrade detection | Protocol, redirect chains, final URL | Fail on unexpected redirects. |
| 13 | Alert pipeline integration | Alert delivery and acknowledgment | Route to PagerDuty/Opsgenie; fewer than five alerts per shift. |
| 14 | Dependency monitoring | Third-party API availability and latency | Dedicated monitors per critical dependency; configure alert suppression. |
| 15 | Quarterly threshold review | Alert volume, false positives, baselines | Schedule and document quarterly. |
Rules 1 through 12 map directly to Site24x7's REST API monitor and API transaction monitor:
Multi-location checks (Rule 1): Monitors run from more than 130 global locations with per-location alerting. A regional degradation fires an alert tied to that location—it doesn't disappear into an aggregate average.
Check intervals (Rule 2): Configurable down to one minute on standard plans, with 30-second intervals available on higher tiers. Pair with multi-location rechecks to eliminate false positives before an alert fires.
Response body and schema validation (Rules 3 and 7): JSONPath, XPath, and RegEx assertions run against the response body on every check. JSON schema validation catches structural drift—field type changes or missing required fields—that deploy-time tests won't catch in a phased rollout.
Per-endpoint latency and percentile thresholds (Rules 4 and 5): Thresholds are set per monitor, not globally. Configure independent P95 and P99 alert conditions per endpoint type.
Response phase decomposition (Rule 6): The REST API monitor breaks total response time into DNS, TCP connection, SSL/TLS handshake, TTFB, and content transfer—each reported independently. When total response time spikes, the per-phase breakdown tells you which layer is responsible before you open a single log.
All HTTP methods and authenticated flows (Rules 8 and 11): The API transaction monitor supports POST, PUT, PATCH, DELETE, and PROPFIND across multi-step sequences. Authentication covers OAuth 2.0, Basic/NTLM, client certificates, and web tokens. Extract a value from one step's response—an access token or a resource ID—and pass it as a variable into the next step's request. This is what makes it possible to monitor a real user flow, not just isolated endpoints.
Header assertions (Rule 9): Assert on any response header. Configure monitors to fail when a header is absent, has an unexpected value, or changes without a corresponding deployment.
SSL/TLS certificate monitoring (Rule 10): The SSL/TLS Certificate monitor checks expiry, OCSP revocation status, SHA-1 fingerprint integrity (to detect certificate tampering), and blocklisted certificate authorities from 130 global locations. SNI environments with multiple certificates on the same IP are supported. Configure tiered threshold alerts for 60, 30, 14, and seven days to expiry.
Redirect and downgrade detection (Rule 12): Configure monitors to fail on unexpected redirects rather than following them silently. Assert on final URL and protocol to catch HTTP downgrades.
For Rules 13 through 15, Site24x7 integrates natively with PagerDuty, Opsgenie, Slack, and Microsoft Teams, with multi-tier escalation policies and alert suppression controls. The dependent monitor feature (Rule 14) suppresses alerts on downstream monitors when an upstream dependency is already in a DOWN state, so a third-party API failure doesn't generate 40 simultaneous alerts across the services that depend on it.
The teams that find gaps in Rules 3, 7, and 14 most often are the ones whose monitoring was built around status codes and never extended to body validation or dependency visibility. Start a free trial of Site24x7 to run the checklist against your current setup.
Reliable APIs aren't lucky; they're monitored across every surface—correctness, performance, security, and operational health—in every region users depend on.
Start with the rules that close your biggest gaps. For most teams, that means multi-location monitoring, response body validation, and per-endpoint latency thresholds. The best monitoring setup isn't the one with the most checks. It's the one where every alert reaches someone who can act on it—and where that person finds out before the first user ticket lands.
What is the most important REST API monitoring metric?
P99 latency and response body validation, not uptime percentage. Uptime tells you the server is responding. P99 tells you whether the slowest 1% of requests are acceptable—and that 1% is usually where user-facing failures concentrate. Response body validation tells you whether the data that came back is actually correct. An API can have 99.9% uptime and still be silently delivering wrong or partial data on every request.
How do I set API latency thresholds?
Baseline each endpoint for two weeks, then set alert thresholds at P95 and P99 of normal behavior, not at a global number applied across all endpoints. A health check endpoint should alert at 50ms for P95. A search endpoint with aggregations might have a legitimate P95 of 800ms. The Rule 4 table in this article gives starting-point targets by endpoint type. Review and adjust every quarter as traffic patterns change.
Why does my API monitor show green when users report errors?
Almost always one of two reasons: status-code-only monitoring or a monitoring gap in the method being called. If your monitor only checks that the endpoint returns 200 OK, it will show green through any partial failure, empty response, or schema breakage. Fix it by adding JSONPath assertions on the response body—assert that $.errors is absent and that critical fields are present and non-null. If your monitor only hits GET endpoints, it will show green while POST or authenticated flows fail entirely.
How often should I check API availability?
One minute for Tier 1 endpoints—payment flows, authentication, and checkout. One to three minutes for core product APIs. Five minutes for internal or admin endpoints. Ten to 15 minutes for documentation or status pages. The interval only matters if the check itself is meaningful: A one-minute check that only validates a status code gives you one-minute visibility into server uptime, not API health. Pair short intervals with body validation and authentication for the checks that actually matter.
What happens when an SSL certificate expires?
Every HTTPS client that connects to the endpoint gets an immediate, hard failure—no graceful degradation and no retry that helps. Browser users see a certificate warning and can't proceed. API clients get a TLS handshake error before a single byte of application data is exchanged. The fix is a tiered alert structure: informational at 60 days, warning at 30, critical at 14, emergency at seven. Don't rely on auto-renewal alone—verify it completed by checking days remaining directly. Site24x7's SSL/TLS Certificate monitor also checks OCSP revocation status and blocklisted certificate authorities, catching problems that an expiry-only check misses.
How do I monitor third-party API dependencies?
Create a dedicated monitor for each critical dependency with its own thresholds and escalation policy, separate from your own endpoint monitors. Then configure alert suppression: When the dependency monitor is down, suppress alerts from all monitors that depend on it. Without suppression, a single third-party outage generates a cascade of alerts from every downstream service, which makes it harder to identify the actual root cause and accelerates alert fatigue. Map your full dependency tree before setting this up; most teams discover two or three undocumented dependencies in the process.
What is synthetic API transaction monitoring?
A synthetic transaction monitor chains multiple API calls in sequence—authenticate, create a resource, read it back, update it, verify the result—and runs the full sequence on a schedule from external monitoring locations. Unlike a simple endpoint check, it validates that real user workflows function end to end, not just that individual endpoints respond. Each step has independent assertions and configurable failure behavior. This is how you catch failures in POST flows, authenticated endpoints, and multi-step operations that a single GET check would never reach.
What is the best way to reduce false positive alerts in API monitoring?
Three controls in combination: multi-location rechecks, dependent monitor suppression, and quarterly threshold review. Multi-location rechecks confirm that a failure is real before alerting; a single-location timeout often reflects a transient network issue, not an API outage. Dependent monitor suppression prevents a single upstream failure from generating dozens of alerts across downstream services. Quarterly threshold review catches thresholds that have drifted out of alignment with current traffic patterns and are now firing on normal variance. Any one of these alone reduces noise. All three together keep the share of actionable alerts above 80%.
What is the difference between P95 and P99 latency, and which should I alert on?
P95 latency is the response time at or below which 95% of requests complete—it reflects the experience of most users and is the right threshold for early warnings. P99 latency is the response time at or below which 99% of requests complete—it reflects the tail of the distribution, where outliers concentrate, and is the right threshold for structural failure signals. Alert on both independently. P50 tells you the median experience. P95 is your canary—when it rises, something is degrading. P99 is your hard indicator—when it spikes, something is broken.
How do I prevent alert fatigue in API monitoring?
Tier your endpoints and match check frequency to business impact. Route alerts to an on-call system with multi-tier escalation, not to a shared inbox. Set a target of fewer than five actionable alerts per on-call shift—if you're exceeding it, your thresholds need tuning, not your alerting volume. Run a quarterly review to identify alerts that are firing on normal behavior. Alert fatigue doesn't announce itself; it builds gradually until engineers start treating all alerts as noise, at which point the monitoring has failed even if every rule is technically configured.