Kubernetes - 监控

Table of Contents

暴露的 Kubernetes metric:

1. 容器监控

以下表达式适用于 K8s v1.19,不同版本暴露的 metrics 的 label 可能略有差异。

1.1. CPU 利用率

(sum(rate(container_cpu_usage_seconds_total{namespace != "", container != ""}[5m])) by (namespace, pod)) / (sum(container_spec_cpu_quota{namespace!="", container!=""}/container_spec_cpu_period{namespace!="", container!=""}) by (namespace, pod)) * 100

注意:一定要指定 container!="" 否则,数据会计算成多份。比如说查询表达式为: container_cpu_usage_seconds_total{pod="client-boss-6f86b6f7f-ssrmt"}

container_cpu_usage_seconds_total{container="POD",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod63cecba9_3bd5_47ed_89f5_cd98e4de7dee.slice/docker-c5079f13184284f9d1c71344f6bb240529177f4a942ac48f0c9fcbba95fca8c7.scope",image="dockerhub.test.wacai.info/google_containers/pause-amd64:3.1",instance="172.30.1.3",job="kubernetes-nodes-cadvisor",name="k8s_POD_client-boss-6f86b6f7f-ssrmt_cloud_63cecba9-3bd5-47ed-89f5-cd98e4de7dee_0",namespace="cloud",node="172.30.1.3",pod="client-boss-6f86b6f7f-ssrmt"}	7.855229224
container_cpu_usage_seconds_total{container="client-boss",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod63cecba9_3bd5_47ed_89f5_cd98e4de7dee.slice/docker-3e7af8aa1ecf321f08b6966a0a406ec0db6839831629bfc228d3d2b1505e4e20.scope",image="sha256:3a9858a23eb393f2458cc677a88f5be0e569cde6cfd4ed5506d1ffe431f26be6",instance="172.30.1.3",job="kubernetes-nodes-cadvisor",name="k8s_client-boss_client-boss-6f86b6f7f-ssrmt_cloud_63cecba9-3bd5-47ed-89f5-cd98e4de7dee_0",namespace="cloud",node="172.30.1.3",pod="client-boss-6f86b6f7f-ssrmt"}	942.895084514
container_cpu_usage_seconds_total{cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod63cecba9_3bd5_47ed_89f5_cd98e4de7dee.slice",instance="172.30.1.3",job="kubernetes-nodes-cadvisor",namespace="cloud",node="172.30.1.3",pod="client-boss-6f86b6f7f-ssrmt"}	950.755770628

可以看到最后一条记录是前两条记录的加和(Pause 和 业务容器)。

1.2. 内存利用率

(sum(container_memory_usage_bytes{container != "POD", container != "", namespace !="kube-system"} - container_memory_cache{container != "POD", container != "", namespace !="kube-system"}) by (pod, namespace, container)) / (sum(container_spec_memory_limit_bytes{container != "POD",container != "",namespace !="kube-system"}) by (pod, namespace, container)) * 100

1.3. 磁盘占用

(sum by (namespace, container, pod) (container_fs_usage_bytes{container!="POD", container!=""}) / (1024 * 1024 * 1024))

1.4. 空间 CPU 配额

sum(kube_resourcequota{type="hard", resource="limits.cpu"}) by (namespace) - sum(kube_resourcequota{type="used", resource="limits.cpu"}) by (namespace)

1.5. 空间内存配额不足

(sum(kube_resourcequota{type="hard", resource="limits.memory"}) by (namespace) - sum(kube_resourcequota{type="used", resource="limits.memory"}) by (namespace) ) / 1024 / 1024 / 1024

1.6. Ingress QPS

sum by (kubernetes_node) (rate(nginx_ingress_controller_nginx_process_requests_total{job="kubernetes-service-endpoints"}[1m]))

1.7. 打开 fd 数量

sum by (kubernetes_namespace, instance, kubernetes_io_hostname) (process_open_fds)

2. 节点监控

2.1. CPU 利用率

100 - (avg by (kubernetes_node) (irate(node_cpu_seconds_total{mode="idle", kubernetes_node!=""}[2m]))) * 100

2.2. 内存利用率

sum by (kubernetes_node) (100 - ((node_memory_MemFree_bytes{kubernetes_node!=""}+node_memory_Cached_bytes{kubernetes_node!=""}+node_memory_Buffers_bytes{kubernetes_node!=""}+node_memory_Slab_bytes{kubernetes_node!=""})/node_memory_MemTotal_bytes{kubernetes_node!=""}) * 100)

注意,一定要计算 node_memory_Buffers_bytes ,否则 Cache/Buffer 计算的值要比实际小很多1

2.3. 磁盘利用率

100 - (sum by (kubernetes_node) (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",kubernetes_node!=""})) / (sum by (kubernetes_node) (node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",kubernetes_node!=""})) * 100

2.4. 网络

sum by (kubernetes_node) (irate(node_network_transmit_bytes_total{kubernetes_node!="", device=~"eth[0-9]"}[2m])) / (1024*1024) # 流出
sum by (kubernetes_node) (irate(node_network_transmit_bytes_total{kubernetes_node!="", device=~"eth[0-9]"}[2m])) / (1024*1024) # 流入

具体 device 怎么匹配,要跟运维的同学咨询一下。

Footnotes:

First created: 2020-09-28 16:21:42
Last updated: 2022-12-11 Sun 12:49
Power by Emacs 29.0.91 (Org mode 9.6.6)