Reliability Performance Roadmap

Features / Functionalities 🚀⏲📊

| Category | Tags / Labels | Feature / Functionality | Status | Doc |
|---|---|---|---|---|
| Monitoring Metrics & Alerting | leverage, monitoring-metrics-alerting, prometheus, grafana | Metrics: install and configure Prometheus (Node Exporter for EC2 / Blackbox Exporter / Alert Monitoring), install and configure Grafana (K8s plugin + Prometheus integration + CloudWatch integration) | | |
| Monitoring Metrics & Alerting | leverage, monitoring-metrics-alerting, grafana, cloudwatch | Metrics: Grafana + AWS CloudWatch integrations config (https://github.com/monitoringartist/grafana-aws-cloudwatch-dashboards) | 2021 Q2 | |
| Monitoring Metrics & Alerting | leverage, monitoring-metrics-alerting, apm | APM: review, analyze and implement (New Relic, Datadog, Elastic APM Agent/Server) | 2021 Q2 | |
| Monitoring Metrics & Alerting | leverage, monitoring-metrics-alerting, documentation | Define and document reference notification/escalation procedure | | |
| Monitoring Metrics & Alerting | leverage, monitoring-metrics-alerting | Alerting: configure Alertmanager, ElastAlert (optimized log rotation when using it from the Docker image), PagerDuty, Slack according to the procedure above | 2021 Q2 | |
| Monitoring Metrics & Alerting | leverage, monitoring-metrics-alerting, prometheus | Monitor infra tool instances (WebHook Proxy, Jenkins, Vault, Pritunl, Prometheus, Grafana, etc.) / implement monitoring via Prometheus + Grafana or another solution | | |
| Monitoring Distributed Tracing | leverage, monitoring-tracing, jaeger | Distributed Tracing Instrumentation: review, analyze and implement to detect and improve transaction performance and service dependency analysis (Jaeger, Instana, Lightstep, AWS X-Ray, etc.) | 2021 Q3 | |
| Monitoring Logging | leverage, monitoring-logs, efk | Logging / EFK: use separate indexes per K8s component and app/service for each cluster/env (segregating dev/stg from prd) + enable ES monitoring w/ X-Pack + configure Curator to rotate indices + tool to improve index management | 2021 Q2 | |
| Performance & Optimization | leverage, performance-optimization, ci-cd-pipeline | Load Testing: set up and run continuous load test pipelines (Jenkins) to determine and improve app/service capacity over time (Apache ab, Gatling, iperf, Locust, Taurus, BlazeMeter and https://github.com/loadimpact/k6) | 2021 Q3 | |
| Performance & Optimization | leverage, performance-optimization, ci-cd-pipeline | Performance Testing (stress, soak, spike, etc.): set up and run continuous performance test pipelines (Jenkins) to measure performance over time (Apache ab, Gatling, iperf, Locust, Taurus and BlazeMeter) | 2021 Q3 | |
| Performance & Optimization | leverage, performance-optimization, kubernetes | Tune K8s nodes (EC2 family type, size and AWS ASG -> K8s HPA + Cluster Autoscaler) | 2021 Q3 | |
| Performance & Optimization | leverage, performance-optimization, kubernetes | Tune K8s requests and limits per namespace (CPU and RAM) / https://github.com/FairwindsOps/goldilocks | 2021 Q2 | |
| Performance & Optimization | leverage, performance-optimization, s3 | S3: ensure each bucket uses the proper storage classes and persistence (automate moving objects into lower-cost storage tiers w/ Lifecycle Policies or w/ S3 Intelligent-Tiering) | | |
| Disaster Recovery | leverage, disaster-recovery, backup | AWS Backup Service: RDS, EC2 (AMI), EBS, DynamoDB, EFS, FSx, Storage Gateway | | |
| Disaster Recovery | leverage, disaster-recovery, backup | Replication: S3 (CRR cross-region replication or SRR same-region replication) | | |
| Disaster Recovery | leverage, disaster-recovery, backup | Replication: VPC / Compute / Database (CRR cross-region replication) | | |
| Disaster Recovery | leverage, disaster-recovery, backup, kubernetes | Back up and migrate Kubernetes applications and their persistent volumes w/ https://velero.io/ | 2021 Q3 | |
| Disaster Recovery | leverage, documentation, disaster-recovery | Review: disaster recovery plan, missing resources, RTO / RPO, level of automation | 2021 Q4 | |
| Disaster Recovery | leverage, documentation, disaster-recovery | Improve Plan: create a plan to improve the existing recovery plan and determine implementation phases | 2021 Q4 | |
| Disaster Recovery | leverage, documentation, disaster-recovery | Execute Plan: implement according to the plan, review/measure and iterate | 2021 Q4 | |
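
As an illustration of the S3 storage-tier item listed above, the following is a minimal sketch of a lifecycle configuration that moves objects to cheaper storage classes as they age. The rule ID, the prefix, the day thresholds and the helper name `build_lifecycle_configuration` are hypothetical and chosen only for the example; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, but the actual thresholds and storage classes would depend on each bucket's access pattern.

```python
from typing import Any, Dict


def build_lifecycle_configuration(
    rule_id: str,
    prefix: str = "",
    ia_after_days: int = 30,
    glacier_after_days: int = 90,
) -> Dict[str, Any]:
    """Build an S3 lifecycle configuration with two tiering transitions.

    All parameter values are illustrative assumptions, not project settings.
    """
    # S3 only allows a transition to STANDARD_IA once an object is at least 30 days old.
    if ia_after_days < 30:
        raise ValueError("ia_after_days must be at least 30")
    # The colder tier has to come later than the warmer one.
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")

    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # an empty prefix applies to the whole bucket
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    return {"Rules": [rule]}


if __name__ == "__main__":
    config = build_lifecycle_configuration("archive-logs", prefix="logs/")
    print(config)
    # Applying it would look roughly like this (requires boto3 and AWS credentials;
    # the bucket name is a placeholder):
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-example-bucket", LifecycleConfiguration=config
    #   )
```

Where access patterns are unknown or change over time, a single transition to `INTELLIGENT_TIERING` may be a simpler alternative to fixed day thresholds.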