Approaches to Achieving High SLOs on Large-Scale Kubernetes Clusters
Healthy Terminating Pod Number Daily Report Validation Housekeeping High Available Fast Recovery Display Board Alert Analysis Platform Weekly Report SLO: Indicate the cluster is healthy or there
• orphaned pod directories/volumes
• orphaned cgroups
• orphaned net devices
and so on; the node recovery system can clean up this dirty data or alert cluster admins to process it manually.
Unhealthy ntoller failedPodController Detector Strategy Unhealthy node list Fast Taint Weight Adjust Recovery Manual Handling Improve Auto Human experience Improve of strategy …… 1. Collect data from
0 码力 | 11 pages | 4.01 MB | 1 year ago
QCon Beijing 2017 / Intelligent Operations / Self Hosted Infrastructure: Automated Kubernetes Operations as an Example
member Kubelet Pods API Server Scheduler Controller Manager etcd operator etcd
Disaster Recovery: Node failure in HA deployments (Kubernetes) Partial loss of control plane components (Kubernetes) the entire control plane (Kubernetes) Permanent loss of control plane (External tool)
Disaster Recovery: Permanent loss of control plane
● Similar situation to initial node bootstrap, but utilizing start a temporary replacement api-server
○ Could be binary, static pod, new tool, bootkube, etc.
● Recovery once etcd+api is available can be done via kubectl (as seen previously)
Self-Driving Kubernetes
0 码力 | 73 pages | 1.58 MB | 1 year ago
Operator Pattern: Best Practices for Extending Kubernetes with Go
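Observability: collection and analysis of logs, system metrics, and so on; monitoring configuration and alerting; collection and analysis of performance metrics; and so on.
Backup & Restore: backup policies, backup methods, restore methods, backup management, and so on.
Disaster Recovery & High Availability: failover/switchover, multiple availability zones, data recovery, and so on.
Security & Compliance: access control, auditing, secure connections, encrypted storage, and so on.
0 码力 | 21 pages | 3.06 MB | 9 months ago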
3 results in total