Notification & Escalation Procedure
Overview
Urgency | Service Notification Setting | Use When | Response |
High 24/7 | High-priority PagerDuty Alert 24/7/365 | | |
High during support hours | High-priority Slack Notifications during support hours | | |
Low | Low Priority Slack Notification | | |
Service Notification Settings
Service Notification Setting | Description |
High-priority PagerDuty Alert 24/7/365 | |
High-priority Slack Notifications during support hours | |
Low Priority Slack Notification | |
Alert Types
UpTimeRobot (black box)
- https://uptimerobot.com/
- Sites or APIs are down
Prometheus Alert Manager (black box, metrics-based)
- http://prometheus.aws.domain.com/
- Cluster issues (high resource usage on masters/nodes)
- Instance issues (Pritunl VPN, Jenkins, Spinnaker, Grafana, Kibana, etc.)
- Alerts from Prometheus Blackbox Exporter
Kibana ElastAlert (black box, logs-based)
- Intended for application/service logs
- Application/service issues (frontends, backend services)
- Cluster component issues (nginx-ingress, cert-manager, linkerd, etc.)
PagerDuty
- https://domain.pagerduty.com/
- Incident management
Implementation Reference Example
Slack
All alerts are sent to the #engineering-urgent-alerts channel, where members who are online can see them. AlertManager sends these alerts according to the rules defined here: TODO
Note: there is a channel named engineering-alerts, but it is used for GitHub notifications. Mixing real alerts with those made little sense, so a new engineering-urgent-alerts channel was created. As a recommendation, GitHub notifications should be sent to a channel named something like #engineering-notifications, leaving engineering-alerts for real alerts.
PagerDuty
AlertManager sends to PagerDuty only those alerts labeled severity: critical. PagerDuty is configured to turn these into incidents according to the settings defined here for the Prometheus Critical Alerts service. That service uses the HiPriorityAllYearRound escalation policy to define who gets notified and how.
Note: currently only the TechOwnership role gets notified, as we don't have agreements or rules about on-call support, but this can easily be changed in the future to accommodate business decisions.
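The routing described above can be sketched in a few lines of Python. This is only an illustration of the rule (every alert goes to Slack, and alerts labeled severity: critical also go to PagerDuty), not the actual AlertManager configuration. The receiver names and the `receivers_for` function are hypothetical.

```python
from typing import Dict, List

# Hypothetical receiver names, chosen for illustration only.
SLACK_RECEIVER = "slack-engineering-urgent-alerts"
PAGERDUTY_RECEIVER = "pagerduty-prometheus-critical-alerts"


def receivers_for(labels: Dict[str, str]) -> List[str]:
    """Return the receivers that an alert with these labels would reach."""
    # Every alert is posted to the Slack channel.
    receivers = [SLACK_RECEIVER]
    # Only alerts labeled severity: critical are also sent to PagerDuty.
    if labels.get("severity") == "critical":
        receivers.append(PAGERDUTY_RECEIVER)
    return receivers
```

For example, `receivers_for({"severity": "warning"})` yields only the Slack receiver, while an alert carrying `severity: critical` yields both.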
UpTimeRobot
We are doing basic HTTP monitoring on the following sites:
- www.domain_1.com
- www.domain_2.com
- www.domain_3.com
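To make clear what a basic HTTP check involves, here is a minimal sketch using only the Python standard library. It is not how UpTimeRobot works internally. The `https://` scheme, the 10-second timeout, the rule that any 2xx or 3xx status counts as "up", and the function names are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List

# The placeholder sites listed above; the https:// scheme is an assumption.
SITES = [
    "https://www.domain_1.com",
    "https://www.domain_2.com",
    "https://www.domain_3.com",
]


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, etc. (URLError is an OSError).
        return 0


def is_up(url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """Treat any 2xx or 3xx status as 'up' (an assumed rule)."""
    return 200 <= fetch(url) < 400


def down_sites(sites: Iterable[str], fetch: Callable[[str], int] = fetch_status) -> List[str]:
    """Return the sites that did not pass the check."""
    return [site for site in sites if not is_up(site, fetch)]
```

Calling `down_sites(SITES)` performs one request per site and returns those that failed. The `fetch` parameter exists so the check can be exercised with a stub instead of real network calls.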
Note: a personal account has been set up for this. As a recommendation, a new account should be created using an email account that belongs to your project.