Automating Incident Response

This guide discusses how to leverage PagerDuty, Trigger.dev, and Telemetry to automate your incident response workflows effectively. By integrating Trigger.dev’s event-driven automation with Telemetry’s powerful data APIs and PagerDuty’s industry-leading incident management, you can create a system that not only detects issues in real time but also takes immediate action to resolve them.

Why We Recommend PagerDuty and Trigger.dev

You may be curious why we recommend combining PagerDuty and Trigger.dev with Telemetry. While we may explore more integrated solutions within Telemetry in the future, we would rather pair Telemetry with tools that already handle these jobs well. PagerDuty offers a robust platform for managing incidents, and Trigger.dev provides a flexible way to automate workflows directly in your codebase. By combining these tools with Telemetry, you get the best of all worlds:

  • Automated Incident Workflows: Trigger.dev enables the creation of sophisticated workflows that respond to events in real time.

  • Real-Time Data Access: Telemetry’s real-time querying and robust API ensure that your incident detection is based on the latest data.

  • Reliable Incident Response: PagerDuty ensures that alerts reach the right people at the right time through its trusted incident management platform.

Setting Up Your Workflow

1. Fetching Data from Telemetry

First, you'll need to retrieve the necessary data to detect incidents using Telemetry’s Query API. The Query API supports SQL-like queries, allowing you to filter, aggregate, and analyze data in real time.

Example:

The fetchTelemetryData function below sends a query to the Query API and returns the result. In step 3 it is called with a query that counts the number of heartbeats received in the last 5 minutes:

// telemetry.ts

export async function fetchTelemetryData(query: string) {
  const API_KEY = 'YOUR_TELEMETRY_API_KEY';
  const ENDPOINT = 'https://api.telemetry.sh/query';

  const response = await fetch(ENDPOINT, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ query }),
  });

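  // Fail loudly on HTTP errors instead of parsing an error body as data.
  if (!response.ok) {
    throw new Error(`Telemetry query failed with status ${response.status}`);
  }

  const result = await response.json();
  return result.data;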
}
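
This guide does not show the shape of the value returned in result.data. The following is a minimal sketch, assuming the API returns either a single row object or an array of row objects that carry a heartbeat_count field. The helper name extractHeartbeatCount is hypothetical and not part of Telemetry's API.

```typescript
// Hypothetical helper: not part of Telemetry's API.
// Assumes `data` is either one row object or an array of row objects,
// each possibly carrying a numeric (or numeric-string) `heartbeat_count`.
export function extractHeartbeatCount(data: unknown): number | null {
  // Use the first row when the API returns a list of rows.
  const row: unknown = Array.isArray(data) ? data[0] : data;
  if (typeof row !== 'object' || row === null) return null;

  const value = (row as Record<string, unknown>).heartbeat_count;
  // Reject missing or empty values instead of treating them as zero.
  if (typeof value !== 'number' && typeof value !== 'string') return null;
  if (typeof value === 'string' && value.trim() === '') return null;

  const count = Number(value);
  return Number.isFinite(count) ? count : null;
}
```

A null result means the count could not be read. That case is probably worth handling explicitly, since comparing a missing value against a threshold would silently skip the alert.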

2. Sending Alerts with PagerDuty

Once you have the necessary data, you can use PagerDuty to send an alert if an incident is detected. PagerDuty manages the notification process, ensuring the appropriate teams are informed and can escalate the issue if necessary.

Example:

// pagerduty.ts

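export async function sendPagerDutyAlert(incidentDetails: { title: string; description: string }) {
  const API_KEY = 'YOUR_PAGERDUTY_API_KEY';
  const FROM_EMAIL = 'YOUR_PAGERDUTY_USER_EMAIL'; // Must belong to a valid PagerDuty user
  const ENDPOINT = 'https://api.pagerduty.com/incidents';

  const response = await fetch(ENDPOINT, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Accept': 'application/vnd.pagerduty+json;version=2',
      // PagerDuty REST API keys use the "Token token=" scheme, not "Bearer"
      'Authorization': `Token token=${API_KEY}`,
      // Creating an incident requires the email of the acting user
      'From': FROM_EMAIL,
    },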
    body: JSON.stringify({
      incident: {
        type: 'incident',
        title: incidentDetails.title,
        service: {
          id: 'YOUR_SERVICE_ID',
          type: 'service_reference',
        },
        body: {
          type: 'incident_body',
          details: incidentDetails.description,
        },
      },
    }),
  });

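  // Surface API errors (for example an invalid key or service ID)
  if (!response.ok) {
    throw new Error(`PagerDuty request failed with status ${response.status}`);
  }

  return await response.json();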
}
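
A check that runs on a schedule can fire repeatedly while the same problem persists, which would open a new incident on every run. PagerDuty's incident creation endpoint accepts an optional incident_key field, and as we understand its documentation, a request is rejected while an open incident with the same key exists on the service. The sketch below builds the request body with a stable key. The function name buildIncidentPayload and the key format are assumptions made for illustration.

```typescript
// Shape of the request body used by sendPagerDutyAlert, plus incident_key.
interface IncidentPayload {
  incident: {
    type: 'incident';
    title: string;
    incident_key: string;
    service: { id: string; type: 'service_reference' };
    body: { type: 'incident_body'; details: string };
  };
}

// Hypothetical helper: builds the body with a stable key per check.
export function buildIncidentPayload(
  serviceId: string,
  checkName: string,
  title: string,
  description: string,
): IncidentPayload {
  return {
    incident: {
      type: 'incident',
      title,
      // Same service + same check name => same key on every run.
      incident_key: `${serviceId}:${checkName}`,
      service: { id: serviceId, type: 'service_reference' },
      body: { type: 'incident_body', details: description },
    },
  };
}
```

The result can be passed to JSON.stringify and used as the request body in place of the inline object.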

3. Gluing It Together with Trigger.dev

Finally, you can use Trigger.dev to automate the entire process by creating a scheduled task that runs every 5 minutes. This task will fetch the telemetry data, check if the heartbeat count is below a threshold, and trigger a PagerDuty alert if necessary.

Example:

import { schedules } from "@trigger.dev/sdk/v3";
import { fetchTelemetryData } from './telemetry';
import { sendPagerDutyAlert } from './pagerduty';

export const incidentResponseTask = schedules.task({
  id: "incident-response-task",
  cron: "*/5 * * * *",  // Every 5 minutes (UTC timezone)
  run: async (payload) => {
    const query = `
      SELECT COUNT(*) AS heartbeat_count 
      FROM heartbeats 
      WHERE timestamp >= datetime('now', '-5 minutes')
    `;
    const data = await fetchTelemetryData(query);

    if (data.heartbeat_count < 1) {  // No heartbeats detected in the last 5 minutes
      await sendPagerDutyAlert({
        title: "No heartbeats detected",
        description: "No heartbeats have been detected in the last 5 minutes. Immediate attention is required.",
      });

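      // Optionally, restart a service.
      // restartService is not provided by any SDK used here; implement and import your own.
      await restartService("your-service-name");

      // Generate an incident report.
      // generateIncidentReport is likewise a placeholder you need to implement.
      await generateIncidentReport(data);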
    }
  },
});
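
The task above calls restartService and generateIncidentReport, which this guide does not define. The sketch below provides hypothetical stand-ins so the task can compile and run end to end. The function bodies, the report layout, and the extra formatIncidentReport helper are all assumptions, and you would replace them with calls to your own infrastructure.

```typescript
// Hypothetical stand-ins for the two helpers the task calls.
// Replace the bodies with calls to your own infrastructure.

export async function restartService(serviceName: string): Promise<void> {
  // Placeholder: a real version might call an orchestrator or deploy API.
  console.log(`[restartService] restart requested for ${serviceName}`);
}

// Pure formatting step, kept separate so it is easy to check in isolation.
export function formatIncidentReport(data: unknown, detectedAt: Date): string {
  return [
    'Incident report',
    `Detected at: ${detectedAt.toISOString()}`,
    `Query result: ${JSON.stringify(data)}`,
  ].join('\n');
}

export async function generateIncidentReport(data: unknown): Promise<string> {
  const report = formatIncidentReport(data, new Date());
  // Placeholder: a real version might store or email the report.
  console.log(report);
  return report;
}
```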

Automating Incident Response

Combining PagerDuty, Trigger.dev, and Telemetry allows you to automate the entire incident response process. For example, you can set up tasks that not only detect incidents but also take automatic actions such as:

  • Restarting Services: Automatically restart affected services when an incident is detected.

  • Escalation Policies: Use PagerDuty’s escalation policies to ensure that critical incidents are addressed by senior engineers swiftly.

  • Detailed Reporting: Automatically generate and distribute incident reports based on Telemetry data.
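
Automatic restarts can loop if the service keeps failing. One possible safeguard is a cooldown that allows at most one restart per time window. The sketch below is a hypothetical helper (createRestartGuard is not part of any of the three tools). It keeps its state in memory, so it only helps inside one long-lived process. If each scheduled run starts in a fresh process, the last restart time would need to be stored externally.

```typescript
// Hypothetical guard: allows at most one restart per cooldown window.
// `now` is injectable so the behaviour can be checked without waiting.
export function createRestartGuard(
  cooldownMs: number,
  now: () => number = Date.now,
): () => boolean {
  let lastRestartAt: number | null = null;

  return function shouldRestart(): boolean {
    const current = now();
    if (lastRestartAt !== null && current - lastRestartAt < cooldownMs) {
      return false; // Still inside the cooldown window.
    }
    lastRestartAt = current;
    return true;
  };
}
```

A caller would check the guard before restarting and skip the restart when it returns false, while still sending the alert.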

While we may consider more integrated automation solutions within Telemetry in the future, using PagerDuty and Trigger.dev with Telemetry currently provides the best experience. PagerDuty’s reliable incident management, combined with Trigger.dev’s powerful workflow automation and Telemetry’s real-time data, creates a seamless and effective incident response system.
