Reduce Databricks Downtime: Jobs Monitoring Built into Site24x7


Hi there!
We're excited to announce built-in Databricks jobs monitoring in Site24x7. This update gives engineering and SRE teams focused capabilities to observe, troubleshoot, and optimize Databricks job performance and reliability.

What’s included

The release covers five areas:
  1. End-to-end job visibility: Track job runs, stages, and tasks across the Databricks workspace.
  2. Key metrics: View execution time, success/failure rates, retries, and resource usage (CPU, memory).
  3. Real-time alerts: Receive alerts when thresholds are breached or anomalies are detected, covering job failures, excessive run times, and SLA breaches.
  4. Dashboards and reports: Create customizable views to track trends and SLA compliance and to support capacity planning.
  5. Integration: Combine these capabilities seamlessly with Site24x7 incident workflows, notification channels, and ITSM tools.
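
To make the key metrics above concrete, here is a minimal Python sketch of how success rate, average execution time, retries, and run-time threshold breaches can be derived from a list of job runs. The `JobRun` record, the `summarize` helper, and the sample values are all hypothetical and made up for illustration; this is not how Site24x7 collects or computes these metrics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRun:
    """Hypothetical record of one job run (not a Site24x7 or Databricks type)."""
    job_name: str
    succeeded: bool
    duration_s: float
    retries: int = 0


def summarize(runs, max_duration_s):
    """Summarize runs: success rate, mean duration, retries, and slow runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        # Fraction of runs that finished successfully (0.0 to 1.0).
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        # Average execution time in seconds.
        "mean_duration_s": mean(r.duration_s for r in runs),
        # Total retries across all runs.
        "total_retries": sum(r.retries for r in runs),
        # Runs that exceeded the assumed run-time threshold.
        "slow_runs": [r for r in runs if r.duration_s > max_duration_s],
    }


# Sample data, invented for this example.
runs = [
    JobRun("nightly_etl", True, 420.0),
    JobRun("nightly_etl", False, 95.0, retries=2),
    JobRun("nightly_etl", True, 1310.0, retries=1),
    JobRun("nightly_etl", True, 450.0),
]
summary = summarize(runs, max_duration_s=900.0)
```

A threshold such as `max_duration_s` is the kind of value you would tune per pipeline when deciding what counts as an excessive run time.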

Why it matters

Here's why it matters to your monitoring strategy:
  1. Faster incident resolution: Pinpoint job failures and problematic stages without context switching.
  2. Better reliability: Reduce SLA violations and missed downstream processes with proactive alerts.
  3. Actionable telemetry: Correlate Databricks job health with infrastructure and application signals already monitored in Site24x7.
For the setup guide and prerequisites, see our help document.

Questions?

If you’d like a walkthrough or suggested alert and playbook configurations for common pipeline patterns, reply here or open a support ticket and we’ll help.
Happy monitoring!