Notification & Escalation Procedure

Overview

Urgency: High 24/7
Service Notification Setting: High-priority PagerDuty Alert 24/7/365
Use When:
  • Issue is in Production
  • Or affects the applications/services and, in turn, the normal operation of the clinics
  • Or prevents clinic patients from interacting with the applications/services
Response:
  • Requires immediate human action
  • Escalate as needed
  • The engineer should be woken up

Urgency: High during support hours
Service Notification Setting: High-priority Slack Notifications during support hours
Use When:
  • Issue impacts development team productivity
  • Issue impacts the normal business operation
Response:
  • Requires immediate human action ONLY during business hours

Urgency: Low
Service Notification Setting: Low Priority Slack Notification
Use When:
  • Any issue, on any environment, that occurs during working hours
Response:
  • Requires human action at some point
  • Do not escalate
  • An engineer should not be woken up
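
The urgency levels above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not an existing implementation: the `Issue` field names and the `classify` function are hypothetical, and treating the two "High during support hours" criteria as alternatives (either one is enough) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    HIGH_24_7 = "High 24/7"
    HIGH_SUPPORT_HOURS = "High during support hours"
    LOW = "Low"


# Notification setting for each urgency level, as listed in the overview.
NOTIFICATION_SETTING = {
    Urgency.HIGH_24_7: "High-priority PagerDuty Alert 24/7/365",
    Urgency.HIGH_SUPPORT_HOURS: "High-priority Slack Notifications during support hours",
    Urgency.LOW: "Low Priority Slack Notification",
}


@dataclass
class Issue:
    # Hypothetical field names; each one mirrors a "Use When" criterion.
    in_production: bool = False
    affects_clinic_operation: bool = False
    blocks_patients: bool = False
    impacts_dev_productivity: bool = False
    impacts_business_operation: bool = False


def classify(issue: Issue) -> Urgency:
    """Return the first urgency level whose criteria match, highest first."""
    if issue.in_production or issue.affects_clinic_operation or issue.blocks_patients:
        return Urgency.HIGH_24_7
    # Assumption: either criterion alone is enough for this level.
    if issue.impacts_dev_productivity or issue.impacts_business_operation:
        return Urgency.HIGH_SUPPORT_HOURS
    return Urgency.LOW
```

For example, `NOTIFICATION_SETTING[classify(Issue(in_production=True))]` yields the PagerDuty setting, while an issue with no flags set falls through to the low-priority Slack notification.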

Service Notification Settings

Service Notification Setting Description
High-priority PagerDuty Alert 24/7/365
  • Notify on-call engineers: first via SMS/push, then via phone call if the alert has not been acknowledged after 10 minutes
  • Notify person X (this is a person who needs to be aware of any of these issues always)
  • Notify to Slack => engineering-urgent-alerts channel
High-priority Slack Notifications during support hours
  • Notify to Slack => engineering-alerts channel
Low Priority Slack Notification
  • Notify to Slack => engineering-alerts channel
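
The on-call contact sequence for the PagerDuty setting (SMS/push first, phone call after 10 unacknowledged minutes) can be sketched as follows. This is only an illustration of the timing rule: the function name, the method labels, and the choice to treat exactly 10 minutes as the point where the phone call starts are assumptions, and in practice PagerDuty applies these rules itself.

```python
from datetime import timedelta
from typing import List

# The 10-minute threshold comes from the notification settings above.
PHONE_ESCALATION_DELAY = timedelta(minutes=10)


def contact_methods(elapsed: timedelta, acknowledged: bool) -> List[str]:
    """Return the contact methods to use at this point in the alert's life."""
    if acknowledged:
        return []  # nothing more to send once someone has acknowledged
    if elapsed < PHONE_ESCALATION_DELAY:
        return ["sms", "push"]  # first step
    return ["phone"]  # still unacknowledged after the delay
```

A fresh alert yields `["sms", "push"]`; the same alert still unacknowledged at or after the 10-minute mark yields `["phone"]`.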

Alert Types

UpTimeRobot (black box)

Prometheus Alert Manager (black box, metrics-based)

  • http://prometheus.aws.domain.com/
  • Cluster issues (high resource usage on masters/nodes)
  • Instance issues (Pritunl VPN, Jenkins, Spinnaker, Grafana, Kibana, etc.)
  • Alerts from Prometheus Blackbox Exporter

Kibana ElastAlert (black box, logs-based)

  • Intended for applications/services logs
  • Applications/services issues (frontends, backend services)
  • Cluster component issues (nginx-ingress, cert-manager, linkerd, etc.)

PagerDuty

Implementation Reference Example

Slack

All alerts are sent to the #engineering-urgent-alerts channel, where members who are online can see them. AlertManager sends these alerts according to the rules defined here: TODO

Note: there is a channel named engineering-alerts, but it is used for GitHub notifications. Mixing real alerts with those did not make sense, which is why a new engineering-urgent-alerts channel was created. As a recommendation, GitHub notifications should be sent to a channel with a name like #engineering-notifications, leaving engineering-alerts for real alerts.

PagerDuty

AlertManager sends to PagerDuty only those alerts that are labeled severity: critical. PagerDuty is configured to turn these into incidents according to the settings defined here for the Prometheus Critical Alerts service. That service uses the HiPriorityAllYearRound escalation policy to define who gets notified and how.

Note: currently only the TechOwnership role gets notified, as we don’t have agreements or rules about on-call support, but this can easily be changed in the future to accommodate business decisions.
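
Taken together with the Slack rule described earlier, the routing amounts to: every alert goes to Slack, and alerts labeled severity: critical also go to PagerDuty. The sketch below models that behavior in plain Python. It is not AlertManager configuration syntax, and the receiver strings and function name are hypothetical labels chosen for illustration.

```python
from typing import Dict, List

# Hypothetical receiver identifiers; only the channel and service names
# come from this document.
SLACK_RECEIVER = "slack:#engineering-urgent-alerts"
PAGERDUTY_RECEIVER = "pagerduty:Prometheus Critical Alerts"


def receivers_for(labels: Dict[str, str]) -> List[str]:
    """Return the receivers that an alert with these labels is sent to."""
    receivers = [SLACK_RECEIVER]  # every alert is posted to Slack
    if labels.get("severity") == "critical":
        receivers.append(PAGERDUTY_RECEIVER)  # critical alerts also page
    return receivers
```

An alert labeled `{"severity": "warning"}` reaches only Slack, while `{"severity": "critical"}` reaches both Slack and the PagerDuty service.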

UpTimeRobot

We are doing basic HTTP monitoring on the following sites:

  • www.domain_1.com
  • www.domain_2.com
  • www.domain_3.com

Note: a personal account has been set up for this. As a recommendation, a new account should be created using an email account that belongs to your project.