Infrastructure monitoring tooling and processes
Rule: Prometheus is used to track server metrics
Rationale: All servers should be monitored for disk usage, load, memory usage, etc.
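As an illustration of the kind of check this monitoring enables, the sketch below reads the JSON body of a Prometheus instant query and lists the instances whose value exceeds a threshold. It is a minimal sketch, not our actual configuration: the host names, the 90 percent threshold, and the function name are hypothetical, and the sample body only follows the general shape of a Prometheus instant-vector response.

```python
import json

# Hypothetical response body in the shape of a Prometheus instant-vector result.
# The instances and values are made up for illustration.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "web-1:9100"}, "value": [1700000000.0, "42.5"]},
            {"metric": {"instance": "db-1:9100"}, "value": [1700000000.0, "93.1"]},
        ],
    },
})


def instances_over_threshold(response_text: str, threshold: float) -> list:
    """Return the sorted instance labels whose sample value exceeds threshold."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    over = []
    for sample in data["result"]:
        # Each sample value is a [timestamp, "value-as-string"] pair.
        _timestamp, raw_value = sample["value"]
        if float(raw_value) > threshold:
            over.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(over)


if __name__ == "__main__":
    # Assumed threshold of 90 percent disk usage.
    print(instances_over_threshold(SAMPLE_RESPONSE, 90.0))
```

In practice the threshold would normally live in a Prometheus alerting rule rather than in a script; the function only shows how the returned samples are structured.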
Rule: All alerts must be forwarded to Opsgenie
Rationale: All active alerts should be visible in a single dashboard. Note: we intend to migrate from Opsgenie to GoAlert.
Rule: Custom, re-usable dashboards are created in Grafana
Rationale: Team members may desire custom dashboards focused on particular areas of interest. Grafana is connected to the Prometheus instance.
Priority level alert and response standards and processes
Rule: Priority 1 alerts should be distributed immediately upon detection, at any time of day.
Rationale: Alert level P1 is for "drop everything, 24/7, get out of bed" incidents, so alerts are sent immediately.
Rule: The support response for P1 incidents should be immediate.
Rationale: If critical services are down, an immediate response is warranted.
Rule: P2 alerts are sent during business hours only.
Rationale: If important (but not critical) services are down, a working-hours response is warranted.
Rule: P2 support response can occur within normal working hours.
Rationale: P2 is for "drop everything if within working hours" incidents.
Rule: Priority 3, 4, and 5 alerts trigger no notifications
Rationale: Alerts at these levels are not urgent enough to require a notification.
Rule: P1 and P2 alerts must be registered in the Incident Registry (Kaizen Issues)
Rationale: This ensures these incidents are available for post-mortem, root cause analysis, and future mitigation or avoidance. Review and resolution are a NEN/ISO requirement.
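The priority rules in this section can be summarised as a small decision function. This is a minimal sketch, not the configuration used in Opsgenie: the business hours (Monday to Friday, 09:00 to 17:00), the choice to hold out-of-hours P2 alerts until the next working period, and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours; this document does not define them.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


@dataclass(frozen=True)
class Handling:
    notify_now: bool   # send a notification right away
    hold: bool         # assumption: queue until business hours begin
    register: bool     # record in the Incident Registry (Kaizen Issues)


def handle_alert(priority: int, now: datetime) -> Handling:
    """Map an alert priority (1 to 5) to the handling described by the rules."""
    if priority not in (1, 2, 3, 4, 5):
        raise ValueError(f"unknown priority: {priority}")
    if priority == 1:
        # P1: notify immediately, 24/7, and always register.
        return Handling(notify_now=True, hold=False, register=True)
    if priority == 2:
        # P2: notify during business hours only, and always register.
        open_now = in_business_hours(now)
        return Handling(notify_now=open_now, hold=not open_now, register=True)
    # P3, P4, P5: no notification and no registry entry required.
    return Handling(notify_now=False, hold=False, register=False)
```

For example, a P2 alert raised at 03:00 on a Monday would be registered but held, while the same alert at 10:00 would be sent straight away.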
Infrastructure support tooling and processes
Rule: A self-hosted instance of Rundeck is used to trigger scripted routine tasks.
Rationale: Most routines are time-based and triggered daily. A few ad-hoc commands also help admins and the support team perform quick remediation, troubleshooting, or fixes for customers.
#infrastructure