On-Call Health is an open-source tool specifically designed to proactively identify and address unsustainable on-call workloads within engineering teams before they escalate into burnout and retention issues. It serves engineering managers, heads of infrastructure, and team leads who are responsible for maintaining operational reliability while safeguarding the well-being of their on-call responders. The primary purpose is to transform anecdotal feedback and scattered operational data into a clear, data-backed narrative about team health, enabling leaders to quantify on-call burden and make informed, preemptive operational improvements. By providing objective signals and AI-powered analysis, it shifts the focus from reactive firefighting to proactive health management, ensuring that the people behind the systems are supported as diligently as the systems themselves.
In modern software engineering, maintaining system reliability through on-call rotations is a critical but often burdensome responsibility that can lead to significant human cost. The problem context revolves around the silent accumulation of stress and overload among engineers, which typically remains anecdotal and unquantified until it manifests as burnout, decreased productivity, or attrition. Teams lack visibility into the compounding factors of after-hours incidents, high-severity alerts, communication overload, and ticket workload that collectively drive exhaustion. Without objective data, engineering leaders struggle to tell a clear story about team health or make a compelling case for necessary changes like rebalancing rotations or adding automation, often only acting when the problem has already impacted morale and retention.
The first major feature group involves comprehensive signal integration from the tools engineering teams already use daily. On-Call Health connects to Rootly or PagerDuty to ingest detailed incident data, including frequency, severity, and resolution times. It integrates with Linear to capture ticket workload and project management context, and with GitHub to detect after-hours commits and deployment signals that indicate work extending beyond normal hours. Furthermore, it analyzes Slack for communication patterns, measuring the volume and timing of messages related to incidents to understand context-switching and cognitive load. This multi-source approach ensures a holistic view of an engineer's operational burden by pulling data from across the entire incident response and development workflow, transforming disparate data points into a unified health assessment.
The second major feature group centers on collecting direct, low-friction sentiment from the engineers themselves to complement the objective data. The tool periodically sends short, purpose-designed surveys via Slack, allowing responders to quickly share how they feel about their current workload and on-call experience. These check-ins are intentionally fast and simple to reduce stigma and encourage honest participation, ensuring that subjective well-being is captured alongside quantitative metrics. By combining this self-reported sentiment with the integrated tool signals, On-Call Health creates a more complete picture of risk, acknowledging that burnout is influenced by both measurable factors and personal perception, and providing a channel for engineers to voice concerns before they become critical.
A third critical capability is the generation of individual and team risk scores, along with AI-powered analysis to drive actionable insights. The system computes a risk score from 0 to 100, segmented into four bands: 0-24 for maintaining balance, 25-49 for monitoring risk, 50-74 for early intervention, and 75-100 for immediate action. Beyond the number itself, the AI analysis identifies what changed and which specific factors are driving score fluctuations, such as a spike in high-severity incidents or increased after-hours work. This helps managers understand not just that someone is at risk, but why, enabling them to make better-informed decisions about interventions like pausing non-urgent work or adjusting schedules.
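The band boundaries described above can be sketched as a simple lookup. This is a minimal illustration only: the function name `risk_band`, the `BANDS` table, and the restriction to integer scores are assumptions made for the example and are not taken from the On-Call Health codebase. Only the numeric ranges and their labels come from the description above.

```python
# Hypothetical sketch: maps a 0-100 risk score to the four bands named in the text.
# The names BANDS and risk_band are illustrative, not On-Call Health's actual API.

# (lower bound, label), ordered from the highest band to the lowest.
BANDS = [
    (75, "Immediate action"),
    (50, "Early intervention"),
    (25, "Monitoring risk"),
    (0, "Maintaining balance"),
]


def risk_band(score: int) -> str:
    """Return the band label for an integer risk score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Walk from the highest band down and return the first one the score reaches.
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the lowest band starts at 0")


if __name__ == "__main__":
    for example in (10, 25, 62, 75):
        print(example, "->", risk_band(example))
```

Keeping the bands in a small table rather than a chain of conditionals makes the boundaries easy to audit against the documented ranges.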
Technically, On-Call Health works by establishing personalized and team-specific baselines rather than applying fixed, one-size-fits-all thresholds. It tracks trends over time for each individual and team, comparing current activity against their own historical norms to identify meaningful deviations. This approach ensures fairness and accuracy, as it accounts for natural variations in workload tolerance and role expectations. The system continuously ingests data from connected integrations, correlates signals, and applies analytical models to compute risk, presenting everything through a centralized dashboard that visualizes trends, scores, and AI-generated summaries for quick comprehension.
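To make the idea of a personal baseline concrete, the sketch below compares a current value against one person's own history using a standard z-score. This is an assumption for illustration: the description above does not say which statistical method On-Call Health uses, and the function name `baseline_deviation` and the sample numbers are hypothetical.

```python
# Hypothetical sketch of a personal-baseline comparison using a z-score.
# The actual analytical model used by On-Call Health is not specified in the text.
import math
from statistics import mean, stdev


def baseline_deviation(history: list[float], current: float) -> float:
    """Return how many standard deviations `current` sits from the person's own mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation
    if spread == 0:
        # A perfectly flat history: any change is an unbounded deviation.
        if current == baseline:
            return 0.0
        return math.copysign(math.inf, current - baseline)
    return (current - baseline) / spread


if __name__ == "__main__":
    # Two engineers with the same current weekly after-hours incident count (6),
    # but different historical norms (illustrative numbers).
    usually_busy = [5, 6, 7, 6]
    usually_quiet = [2, 3, 2, 3]
    print(baseline_deviation(usually_busy, 6))   # at this person's own norm
    print(baseline_deviation(usually_quiet, 6))  # far above this person's own norm
```

The same raw count produces no deviation for one person and a large one for the other, which is the behavior a fixed, one-size-fits-all threshold cannot express.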
The benefits for users are both measurable and strategic. Engineering leaders gain data-backed evidence to quantify on-call burden and advocate for operational improvements with concrete metrics, moving discussions from subjective feelings to objective facts. Teams benefit from early detection of trend shifts, allowing for small, timely fixes like rebalancing rotations or adding automation before issues spiral. The tool fosters a culture of proactive care, aligning stakeholders around not just system health but people health, and turning weekly reviews into holistic conversations that can prevent burnout from becoming a retention problem.
Several workflow examples illustrate concrete use cases. An engineering manager can use the dashboard to see that a team member's risk score jumped into the 'Early Intervention' band, with the AI summary highlighting a recent increase in PagerDuty alerts combined with negative sentiment check-ins. This prompts a one-on-one conversation and a temporary adjustment of their on-call duties. During a resource planning meeting, a head of infrastructure can present trend graphs showing a steady climb in team-wide after-hours GitHub activity, making the case to hire an additional site reliability engineer. In post-incident reviews, teams can discuss not only the root cause of an outage but also the human impact, using data to justify investing in automation for recurring alert types.
The target users are primarily engineering managers, heads of infrastructure engineering, site reliability engineering (SRE) leads, and anyone responsible for on-call rotations and team well-being. On-Call Health integrates with popular tools in the modern tech stack, including Rootly and PagerDuty for incident management, Linear and Jira for ticketing, GitHub for development activity, and Slack for communication and surveys. As an open-source project under the Apache License 2.0, it is freely available for use and modification, and no pricing plans are mentioned, which emphasizes accessibility and community-driven improvement for organizations of all sizes.
In summary, On-Call Health provides the critical visibility and analytical depth needed to transform on-call management from a reactive necessity into a proactive practice of team care. By unifying objective tool signals with subjective sentiment and analyzing them against personal baselines, it empowers leaders to spot exhaustion before it becomes burnout. The ultimate takeaway is that protecting engineers from unsustainable workloads is not just an ethical imperative but a strategic one, and this tool offers the data-backed framework to make that protection measurable, actionable, and integrated into the daily workflow of reliable software operations.
The primary target audience is engineering leadership responsible for team well-being and operational reliability, including Engineering Managers, Heads of Infrastructure, Site Reliability Engineering (SRE) Leads, and DevOps team leads. These users need to maintain system uptime while preventing burnout among their on-call responders. They work in organizations that utilize modern incident management tools like PagerDuty or Rootly, ticketing systems like Linear or Jira, and communication platforms like Slack, and seek data-driven insights to make fair, proactive people decisions alongside technical ones.