Deploying Prometheus with Docker to monitor servers and containers and send alerts | chris'wang
> This article was transcoded by [简悦 SimpRead](http://ksria.com/simpread/); original at [chriswsq.github.io](https://chriswsq.github.io/post/docker-bu-shu-prometheus-jian-kong-fu-wu-qi-ji-rong-qi-bing-fa-song-gao-jing/)

These are notes on using Prometheus to monitor server metrics and container services, and to send alerts by email.

Basic principle
----

Prometheus periodically scrapes the state of monitored components over HTTP. Any component that exposes a suitable HTTP endpoint can be monitored, with no SDK or other integration work. This makes it a good fit for monitoring virtualized environments such as VMs, Docker, and Kubernetes. The HTTP endpoint that exposes a component's metrics is called an exporter. Ready-made exporters exist for most components commonly used by internet companies, such as Varnish, Haproxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, and so on).

Components
----

* Prometheus: the Prometheus daemon scrapes metrics from its targets on a schedule; each target must expose an HTTP endpoint for it to scrape.
* Grafana: reads the Prometheus data and displays the monitoring information graphically.
* Node-exporter: collects host hardware and operating system data. It runs as a container on every host.
* cAdvisor: collects container data. It runs as a container on every host.
* Alertmanager: the alert manager, used to send alert notifications.

<table>
<thead>
<tr><th>Hostname</th><th>IP</th><th>Services</th></tr>
</thead>
<tbody>
<tr><td>test-1</td><td></td><td>cadvisor, node-exporter, grafana, prometheus, alertmanager</td></tr>
<tr><td>test-2</td><td></td><td>node-exporter, cadvisor</td></tr>
<tr><td>test-3</td><td></td><td>node-exporter, cadvisor</td></tr>
</tbody>
</table>

Installing docker and docker-compose
------------------------

### Installing docker

```
# Install dependencies
yum install -y yum-utils device-mapper-persistent-data lvm2
# Add the Docker package repository
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# Install Docker CE
yum install docker-ce -y
# Start Docker
systemctl start docker
# Start Docker on boot
systemctl enable docker
# Show Docker information
docker info
```

### Installing docker-compose

```
curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
```

Adding configuration files
------

```
mkdir -p /usr/local/src/config
cd /usr/local/src/config
```

### Adding the prometheus.yml configuration file

vim prometheus.yml

```
# my global config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            -   # Alertmanager host:port (left blank in the original)

rule_files:
  - "/etc/prometheus/config/rule/*rule.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['']   # target addresses are left blank in the original; fill in host:port
  - job_name: 'cadvisor'
    scrape_interval: 5s
    static_configs:
      - targets: ['', '', '']
  - job_name: 'node-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['']
        labels:
          instance: test-1
          service: node-service
      - targets: ['']
        labels:
          instance: test-2
          service: node-service
      - targets: ['']
        labels:
          instance: test-3
          service: node-service
```

### Adding the email alert configuration file

Add the configuration file alertmanager.yml and set the sending and receiving mailboxes.

vim alertmanager.yml

```
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxxxxxxxx@163.com'
  smtp_auth_username: 'xxxxxxxxxxxxxxx@163.com'
  smtp_auth_password: 'xxxxxxxxxxxxx'

templates:
  - '/etc/alertmanager/default-monitor.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'mail'

receivers:
  - name: 'mail'
    email_configs:
      - to: 'xxxxxxxxxxxxxxxx@163.com, xxxxxxxxxxxxxx@qq.com'
        send_resolved: true   # also notify when an alert is resolved
        html: '{{ template "default-monitor.html" . }}'   # which template to use
        # Mail subject. If headers is omitted, the subject can be defined in a
        # template instead; the default one loaded is email.default.subject.
        headers: { Subject: "[WARN] Alert mail" }
```

### Adding alert rules

```
mkdir -p /usr/local/src/config/rule
cd /usr/local/src/config/rule
```

Create two files, node-exporter-record-rule.yml and node-exporter-alert-rule.yml. The first holds the recording rules and the second holds the alerting rules.

Because prometheus.yml already references every file whose name ends in rule.yml, there is no need to modify prometheus.yml again.

Create node-exporter-record-rule.yml

```
groups:
  - name: node-exporter-record
    rules:
      - expr: up{job=~"node-exporter"}
        record: node_exporter:up
        labels:
          desc: "Whether the node is online: 1 online, 0 offline"
          unit: " "
          job: "node-exporter"
      - expr: time() - node_boot_time_seconds{}
        record: node_exporter:node_uptime
        labels:
          desc: "Node uptime"
          unit: "s"
          job: "node-exporter"
      ##########################################################################
      # cpu #
      - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))) * 100
        record: node_exporter:cpu:total:percent
        labels:
          desc: "Total CPU usage percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))) * 100
        record: node_exporter:cpu:idle:percent
        labels:
          desc: "CPU idle percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="iowait"}[5m]))) * 100
        record: node_exporter:cpu:iowait:percent
        labels:
          desc: "CPU iowait percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="system"}[5m]))) * 100
        record: node_exporter:cpu:system:percent
        labels:
          desc: "CPU system percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="user"}[5m]))) * 100
        record: node_exporter:cpu:user:percent
        labels:
          desc: "CPU user percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
        record: node_exporter:cpu:other:percent
        labels:
          desc: "CPU percentage of the node spent in other modes"
          unit: "%"
          job: "node-exporter"
      ##########################################################################
      # memory #
      - expr: node_memory_MemTotal_bytes{job="node-exporter"}
        record: node_exporter:memory:total
        labels:
          desc: "Total memory of the node"
          unit: byte
          job: "node-exporter"
      - expr: node_memory_MemFree_bytes{job="node-exporter"}
        record: node_exporter:memory:free
        labels:
          desc: "Free memory of the node"
          unit: byte
          job: "node-exporter"
      - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"}
        record: node_exporter:memory:used
        labels:
          desc: "Used memory of the node"
          unit: byte
          job: "node-exporter"
      - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"}
        record: node_exporter:memory:actualused
        labels:
          desc: "Memory actually in use by user processes on the node"
          unit: byte
          job: "node-exporter"
      - expr: (1-(node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
        record: node_exporter:memory:used:percent
        labels:
          desc: "Memory usage percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: ((node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
        record: node_exporter:memory:free:percent
        labels:
          desc: "Available memory percentage of the node"
          unit: "%"
          job: "node-exporter"
      ##########################################################################
      # load #
      - expr: sum by (instance) (node_load1{job="node-exporter"})
        record: node_exporter:load:load1
        labels:
          desc: "1-minute system load"
          unit: " "
          job: "node-exporter"
      - expr: sum by (instance) (node_load5{job="node-exporter"})
        record: node_exporter:load:load5
        labels:
          desc: "5-minute system load"
          unit: " "
          job: "node-exporter"
      - expr: sum by (instance) (node_load15{job="node-exporter"})
        record: node_exporter:load:load15
        labels:
          desc: "15-minute system load"
          unit: " "
          job: "node-exporter"
      ##########################################################################
      # disk #
      - expr: node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"}
        record: node_exporter:disk:usage:total
        labels:
          desc: "Total disk size of the node"
          unit: byte
          job: "node-exporter"
      - expr: node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
        record: node_exporter:disk:usage:free
        labels:
          desc: "Free disk space of the node"
          unit: byte
          job: "node-exporter"
      - expr: node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
        record: node_exporter:disk:usage:used
        labels:
          desc: "Used disk space of the node"
          unit: byte
          job: "node-exporter"
      - expr: (1 - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"}) * 100
        record: node_exporter:disk:used:percent
        labels:
          desc: "Disk usage percentage of the node"
          unit: "%"
          job: "node-exporter"
      - expr: irate(node_disk_reads_completed_total{job="node-exporter"}[1m])
        record: node_exporter:disk:read:count:rate
        labels:
          desc: "Disk read operations per second on the node"
          unit: "ops/s"
          job: "node-exporter"
      - expr: irate(node_disk_writes_completed_total{job="node-exporter"}[1m])
        record: node_exporter:disk:write:count:rate
        labels:
          desc: "Disk write operations per second on the node"
          unit: "ops/s"
          job: "node-exporter"
      # The read and write byte metrics were swapped in the original rules.
      - expr: (irate(node_disk_read_bytes_total{job="node-exporter"}[1m]))/1024/1024
        record: node_exporter:disk:read:mb:rate
        labels:
          desc: "Device read rate of the node in MB"
          unit: "MB/s"
          job: "node-exporter"
      - expr: (irate(node_disk_written_bytes_total{job="node-exporter"}[1m]))/1024/1024
        record: node_exporter:disk:write:mb:rate
        labels:
          desc: "Device write rate of the node in MB"
          unit: "MB/s"
          job: "node-exporter"
      ##########################################################################
      # filesystem #
      - expr: (1 - node_filesystem_files_free{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_files{job="node-exporter",fstype=~"ext4|xfs"}) * 100
        record: node_exporter:filesystem:used:percent
        labels:
          desc: "Inode usage percentage of the node"
          unit: "%"
          job: "node-exporter"
      ##########################################################################
      # filefd #
      - expr: node_filefd_allocated{job="node-exporter"}
        record: node_exporter:filefd_allocated:count
        labels:
          desc: "Number of open file descriptors on the node"
          unit: "count"
          job: "node-exporter"
      - expr: node_filefd_allocated{job="node-exporter"}/node_filefd_maximum{job="node-exporter"} * 100
        record: node_exporter:filefd_allocated:percent
        labels:
          desc: "Percentage of file descriptors in use on the node"
          unit: "%"
          job: "node-exporter"
      ##########################################################################
      # network #
      # The source metrics count bytes, so multiply by 8 to record bits per second.
      - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: node_exporter:network:netin:bit:rate
        labels:
          desc: "Bits received per second on each NIC of the node"
          unit: "bit/s"
          job: "node-exporter"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: node_exporter:network:netout:bit:rate
        labels:
          desc: "Bits sent per second on each NIC of the node"
          unit: "bit/s"
          job: "node-exporter"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: node_exporter:network:netin:packet:rate
        labels:
          desc: "Packets received per second on each NIC of the node"
          unit: "packets/s"
          job: "node-exporter"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: node_exporter:network:netout:packet:rate
        labels:
          desc: "Packets sent per second on each NIC of the node"
          unit: "packets/s"
          job: "node-exporter"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: node_exporter:network:netin:error:rate
        labels:
          desc: "Receive errors per second detected by the device driver"
          unit: "packets/s"
          job: "node-exporter"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: node_exporter:network:netout:error:rate
        labels:
          desc: "Transmit errors per second detected by the device driver"
          unit: "packets/s"
          job: "node-exporter"
      - expr: node_tcp_connection_states{job="node-exporter", state="established"}
        record: node_exporter:network:tcp:established:count
        labels:
          desc: "Number of established TCP connections on the node"
          unit: "count"
          job: "node-exporter"
      - expr: node_tcp_connection_states{job="node-exporter", state="time_wait"}
        record: node_exporter:network:tcp:timewait:count
        labels:
          desc: "Number of TCP connections in time_wait on the node"
          unit: "count"
          job: "node-exporter"
      - expr: sum by (environment,instance) (node_tcp_connection_states{job="node-exporter"})
        record: node_exporter:network:tcp:total:count
        labels:
          desc: "Total number of TCP connections on the node"
          unit: "count"
          job: "node-exporter"
      ##########################################################################
      # process #
      - expr: node_processes_state{state="Z"}
        record: node_exporter:process:zoom:total:count
        labels:
          desc: "Number of zombie processes on the node"
          unit: "count"
          job: "node-exporter"
      ##########################################################################
      # other #
      - expr: abs(node_timex_offset_seconds{job="node-exporter"})
        record: node_exporter:time:offset
        labels:
          desc: "Clock offset of the node"
          unit: "s"
          job: "node-exporter"
      ##########################################################################
      - expr: count by (instance) ( count by (instance,cpu) (node_cpu_seconds_total{ mode='system'}) )
        record: node_exporter:cpu:count
```

Create node-exporter-alert-rule.yml

```
groups:
  - name: node-exporter-alert
    rules:
      - alert: node-exporter-down
        expr: node_exporter:up == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-cpu-high
        expr: node_exporter:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage is high: {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has been above 80% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-cpu-iowait-high
        expr: node_exporter:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait is high: {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has been at or above 12% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      # The two recorded series carry different label sets, so match on instance only.
      - alert: node-exporter-load-load1-high
        expr: node_exporter:load:load1 > on (instance) node_exporter:cpu:count * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-memory-high
        expr: node_exporter:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} memory usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-disk-high
        expr: node_exporter:disk:used:percent > 88
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-disk-read-count-high
        expr: node_exporter:disk:read:count:rate > 3000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-disk-write-count-high
        expr: node_exporter:disk:write:count:rate > 3000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-disk-read-mb-high
        expr: node_exporter:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk read throughput is high: {{ $value }}"
          description: ""
          instance: "{{ $labels.instance }}"
          value: "{{ $value }}"
      - alert: node-exporter-disk-write-mb-high
        expr: node_exporter:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk write throughput is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-filefd-allocated-percent-high
        expr: node_exporter:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptors are high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-network-netin-error-rate-high
        expr: node_exporter:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-network-netin-packet-rate-high
        expr: node_exporter:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-network-netout-packet-rate-high
        expr: node_exporter:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-network-tcp-total-count-high
        expr: node_exporter:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-process-zoom-total-count-high
        expr: node_exporter:process:zoom:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: node-exporter-time-offset-high
        expr: node_exporter:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
```

### Adding the alert template

```
mkdir -p /usr/local/src/config/template
vim /usr/local/src/config/template/default-monitor.tmpl
```

```
{{ define "default-monitor.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
```

Writing the docker-compose file
--------------------

vim docker-compose-monitor.yml

```
version: '2'

networks:
  monitor:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:v2.16.0
    container_name: prometheus
    restart: always
    ports:
      - 9090:9090
    volumes:
      - /bsn/prometheus/prometheus:/prometheus
      - /bsn/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /bsn/prometheus/alert/alert.rules:/usr/local/prometheus/rules/alert.rules
      - /etc/localtime:/etc/localtime
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus/'
      - '--storage.tsdb.retention.time=90d'
    depends_on:
      - alertmanager

  grafana:
    image: grafana/grafana:6.4.2
    container_name: grafana
    restart: always
    volumes:
      - /bsn/prometheus/grafana:/var/lib/grafana
      - /bsn/prometheus/grafana/grafana.ini:/etc/grafana/grafana.ini
      - /etc/localtime:/etc/localtime
    ports:
      - 3000:3000
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.21.0-rc.0
    container_name: alertmanager
    volumes:
      - /bsn/prometheus/alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /bsn/prometheus/alert/email.tmpl:/etc/alertmanager/template/email.tmpl
      - /etc/localtime:/etc/localtime
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - 9093:9093
    restart: always

  node-exporter:
    image: quay.io/prometheus/node-exporter
    container_name: node-exporter
    hostname: $HOSTNAME
    restart: always
    ports:
      - "9100:9100"
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
```

Note: the host paths under `/bsn/prometheus` in this compose file differ from the `/usr/local/src/config` directory used above. Whichever host paths you use, adjust the volumes so that the rule files end up where `rule_files` looks for them (`/etc/prometheus/config/rule/`) and the template ends up where `templates` points (`/etc/alertmanager/default-monitor.tmpl`).

Starting docker-compose
-----------------

```
# Start the containers:
docker-compose -f /usr/local/src/config/docker-compose-monitor.yml up -d
# Stop and remove the containers:
docker-compose -f /usr/local/src/config/docker-compose-monitor.yml down
# Restart a container:
docker restart <container-id>
```

On each of the other nodes, start the cadvisor and node-exporter containers:

```
version: '3'

services:
  node-exporter:
    image: quay.io/prometheus/node-exporter
    container_name: node-exporter
    hostname: $HOSTNAME
    restart: always
    ports:
      - "9100:9100"
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
```

**The running containers look like this:**

![](https://chriswsq.github.io/post-images/1599121054830.png)

**The Prometheus targets page looks like this:**

![](https://chriswsq.github.io/post-images/1599121262766.png)

Note: if State is Down, a firewall is the most likely cause. Make sure the exporter ports (9100 for node-exporter, 8080 for cadvisor) are reachable from the Prometheus host.

**The Prometheus page looks like this:**

![](https://chriswsq.github.io/post-images/1599121408543.png)

Note: if no data appears, synchronize the clocks on the hosts.

Configuring grafana
----------

### Adding the Prometheus data source

![](https://chriswsq.github.io/post-images/1599121507210.png)

### Configuring dashboards

**Note: you can use the built-in templates, or download a suitable one from https://grafana.com/dashboards.**

Add a server monitoring template. The template id used here is 8919; 1860 also works.

![](https://chriswsq.github.io/post-images/1599121616143.png)

![](https://chriswsq.github.io/post-images/1599121699226.png)

![](https://chriswsq.github.io/post-images/1599121761143.png)

Add a container monitoring template. 893 is used here; 8321 also works.

![](https://chriswsq.github.io/post-images/1599122084044.png)

The combined template 9276 can also be used.

Alerting
--

Stop the cadvisor and node-exporter containers on one of the hosts.

![](https://chriswsq.github.io/post-images/1599122289244.png)

The alert mail arrives:

![](https://chriswsq.github.io/post-images/1599122330270.png)

Note: Go templates format times against the fixed reference time `2006-01-02 15:04:05`, so the layout string must use exactly those values; any other date in the layout produces garbled output. `StartsAt` is normally in UTC, so if the time in the mail is off by your timezone, change the trigger-time line in the alert template to `{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>`, which shifts it by 8 hours (UTC+8).
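
After the containers are stopped, an alert first sits in the pending state for the duration of its `for:` clause and only then fires and reaches Alertmanager. While waiting for the mail, the state can be checked through the Prometheus HTTP API endpoint `/api/v1/alerts`. The following is a minimal Python sketch, not part of the original setup: the function names and the `http://localhost:9090` address in the usage note are assumptions for illustration, and the sample payload is made up to mirror the API's response shape.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_alerts(base_url):
    """GET <base_url>/api/v1/alerts and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


def summarize_alerts(payload):
    """Group (alertname, instance) pairs by alert state, e.g. pending or firing."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    by_state = defaultdict(list)
    for alert in payload["data"]["alerts"]:
        labels = alert.get("labels", {})
        by_state[alert["state"]].append(
            (labels.get("alertname", "?"), labels.get("instance", "?"))
        )
    return {state: sorted(items) for state, items in by_state.items()}


if __name__ == "__main__":
    # Made-up sample payload in the shape returned by /api/v1/alerts,
    # so the sketch runs without a live Prometheus server.
    sample = {
        "status": "success",
        "data": {
            "alerts": [
                {
                    "labels": {"alertname": "node-exporter-down", "instance": "test-2"},
                    "state": "firing",
                },
                {
                    "labels": {"alertname": "node-exporter-cpu-high", "instance": "test-1"},
                    "state": "pending",
                },
            ]
        },
    }
    for state, alerts in summarize_alerts(sample).items():
        print(state, alerts)
```

Against a running server, `summarize_alerts(fetch_alerts("http://localhost:9090"))` would give the same grouping, assuming Prometheus is reachable at that address. An alert that stays pending and never fires usually means the condition cleared before the `for:` duration elapsed.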
