LinkORB Engineering | Infrastructure

Infrastructure monitoring tooling and processes

Rule: Prometheus is used to track server metrics

Rationale: All servers should be monitored for disk usage, load, memory usage, and similar health metrics.
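
As a rough illustration of the metrics named above, the sketch below builds instant-query URLs for the Prometheus HTTP API. It assumes the servers run node_exporter, which this page does not state, and the host name used in the usage note is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: metrics come from node_exporter, whose standard metric names
# are used below. The policy does not say which exporter is deployed.
QUERIES = {
    "disk_used_ratio": (
        '1 - node_filesystem_avail_bytes{mountpoint="/"}'
        ' / node_filesystem_size_bytes{mountpoint="/"}'
    ),
    "load_1m": "node_load1",
    "memory_used_ratio": (
        "1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes"
    ),
}


def query_url(base_url: str, name: str) -> str:
    """Build an instant-query URL (/api/v1/query) for one named metric."""
    params = urlencode({"query": QUERIES[name]})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"
```

Calling query_url("http://prometheus.example:9090", "load_1m") yields a URL that can be fetched with any HTTP client. The thresholds at which these values should raise an alert are not defined here and would need to be agreed separately.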

Rule: All alerts must be forwarded to Opsgenie

Rationale: All active alerts should be visible in a single dashboard. Note: we intend to migrate from Opsgenie to GoAlert.

Rule: Custom, reusable dashboards are created in Grafana

Rationale: Team members may want custom dashboards focused on particular areas of interest. Grafana is connected to the Prometheus instance.

Priority level alert and response standards and processes

Rule: Priority 1 alerts should be distributed immediately upon detection.

Rationale: P1 is the "drop everything, 24/7, get out of bed" level, so these alerts are sent immediately at any hour.

Rule: The support response for P1 incidents should be immediate.

Rationale: If critical services are down, an immediate response is warranted.

Rule: P2 alerts are sent during business hours only.

Rationale: If important (but not critical) services are down, a working-hours response is warranted.

Rule: P2 support response can occur within normal working hours.

Rationale: P2 is for "drop everything if within working hours".

Rule: Priority 3, 4, and 5 alerts trigger no notifications.

Rationale: Notifications are not necessary for these lower-priority alerts.

Rule: P1 and P2 alerts must be registered in the Incident Registry (Kaizen Issues)

Rationale: This ensures these incidents are available for post-mortem, root cause analysis, and future mitigation or avoidance. Reviewing and resolving incidents is a NEN/ISO requirement.
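
The priority rules in this section can be sketched as a small decision function. This is only an illustration, not the actual Opsgenie configuration: the function names are hypothetical, P2 follows the business-hours rule, and business hours are assumed to be Monday to Friday, 09:00 to 17:00, which this page does not define.

```python
from datetime import datetime, time

# Assumption: business hours are Monday-Friday, 09:00-17:00 local time.
# The policy does not define them; adjust to the real schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(moment: datetime) -> bool:
    """Return True if `moment` is on a weekday within the assumed hours."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= moment.time() < BUSINESS_END


def should_notify(priority: int, moment: datetime) -> bool:
    """Decide whether an alert of this priority sends a notification now."""
    if priority not in range(1, 6):
        raise ValueError(f"unknown priority: {priority}")
    if priority == 1:
        return True  # P1: immediate, 24/7
    if priority == 2:
        return in_business_hours(moment)  # P2: business hours only
    return False  # P3, P4, P5: no notifications


def must_register(priority: int) -> bool:
    """P1 and P2 incidents must be recorded in the Incident Registry."""
    return priority in (1, 2)
```

For example, should_notify(2, datetime.now()) reports whether a P2 alert raised right now would notify anyone. What happens to a P2 alert raised outside business hours (held until the next working day, or only shown on the dashboard) is not specified here, so the sketch leaves that open.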

Infrastructure support tooling and processes

Rule: A self-hosted instance of Rundeck is used to trigger scripted routine tasks.

Rationale: Most routines are time-based and triggered daily. A few ad-hoc commands also help admins and the support team perform quick remediation, troubleshooting, or fixes for customers.

About Infrastructure

Name: Infrastructure (#infrastructure)
