Prometheus is an open-source monitoring and alerting system and time-series database (TSDB) originally developed at SoundCloud. It is written in Go and is an open-source counterpart of Google's BorgMon monitoring system.
In 2016 the Cloud Native Computing Foundation (CNCF), launched by Google under the Linux Foundation, accepted Prometheus as its second hosted project.
Prometheus currently has a very active open-source community.
Compared with Heapster (a Kubernetes subproject for collecting cluster performance data), Prometheus is more complete and comprehensive, and its performance is sufficient for clusters of more than ten thousand nodes.
Multi-dimensional data model
Efficient and flexible query language (PromQL; see the example below)
No reliance on distributed storage; single server nodes are autonomous
Time-series data collected over HTTP using a pull model
Time-series data can also be pushed via an intermediary gateway
Targets discovered via service discovery or static configuration
Support for many kinds of graphs and dashboards, e.g. Grafana
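A minimal sketch of the query language in action, assuming a Prometheus server is already reachable on localhost:9090; the /api/v1/query endpoint and the prometheus_http_requests_total self-metric are standard, the rest is illustrative:

```bash
# Instant query: per-second rate of Prometheus's own HTTP requests over the last 5 minutes.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_http_requests_total[5m])'
```

The same expression can be pasted into the Graph tab of the web UI or into a Grafana panel.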
prometheus server: retrieves and stores time-series data.
exporters: act as agents that collect metrics and expose them for the prometheus server to scrape; different targets are handled by different exporters, e.g. node-exporter for host metrics and redis-exporter for Redis.
For more exporters, see EXPORTERS AND INTEGRATIONS.
For their default ports, see Default port allocations · prometheus/prometheus Wiki · GitHub.
pushgateway: allows short-lived and batch jobs to push their data towards prometheus; because such jobs may not live long enough to be scraped, they push their metrics to the pushgateway, which prometheus then scrapes (see the example below).
alertmanager: implements alerting for prometheus.
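A minimal sketch of how a short-lived job hands its metrics to the Pushgateway, assuming one is running on localhost:9091 (its default port); the /metrics/job/<job_name> push path is the standard Pushgateway API, while the job and metric names here are made up for illustration:

```bash
# Push one gauge sample for the job "db_backup"; a later push to the same
# path replaces it, and Prometheus scrapes the Pushgateway as usual.
echo "db_backup_last_duration_seconds 42" | \
  curl --data-binary @- http://localhost:9091/metrics/job/db_backup
```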
Visualization: fast and flexible client-side graphs; panel plugins provide many different ways to visualize metrics and logs, and the official library ships rich dashboard panels such as heatmaps, line charts and other chart types.
Data sources: Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, KairosDB and more (see the sketch below).
Alerting: define alert rules visually for your most important metrics; Grafana evaluates them continuously and sends notifications via Slack, PagerDuty, etc. when data crosses a threshold.
Mixed display: mix different data sources in the same graph; the data source can be specified per query, and even custom data sources are supported.
Annotations: annotate graphs with rich events from different data sources; hovering over an event shows its full metadata and tags.
Filters: ad-hoc filters allow new key/value filters to be created dynamically and applied automatically to all queries that use that data source.
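As a sketch of how the data-source integration looks in practice, Prometheus can be registered through Grafana's HTTP API; this assumes Grafana on localhost:3000 still using the default admin/admin credentials, which should be changed in any real deployment:

```bash
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://localhost:9090",
        "access": "proxy",
        "isDefault": true
      }'
```

The same data source can of course be added interactively under Configuration → Data Sources.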
wget https://github.com/prometheus/prometheus/releases/download/v*/prometheus-*.*-amd64.tar.gz
tar xvf prometheus-*.*-amd64.tar.gz
cd prometheus-*
nohup ./prometheus --config.file=./prometheus.yml &
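To confirm the server came up, Prometheus exposes built-in health and readiness endpoints; a quick check, assuming it listens on the default port 9090 of the same host:

```bash
curl http://localhost:9090/-/healthy   # HTTP 200 once the process is healthy
curl http://localhost:9090/-/ready     # HTTP 200 once it is ready to serve queries
```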
wget https://dl.grafana.com/oss/release/grafana-7.5.6-1.x86_64.rpm
sudo yum install grafana-7.5.6-1.x86_64.rpm
cd /opt
mkdir -p prometheus/config/
mkdir -p grafana/data
chmod 777 grafana/data
mkdir -p /data/prometheus
chmod 777 /data/prometheus
Create the prometheus.yml configuration file:
cd /opt/prometheus/config/
touch prometheus.yml
The prometheus.yml configuration file is as follows:
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
#scrape_timeout is set to the global default (10s).
#Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
#- alertmanager:9093
#Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
#- "first_rules.yml"
#- "second_rules.yml"
#A scrape configuration containing exactly one endpoint to scrape:
#Here it's Prometheus itself.
scrape_configs:
#The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
#metrics_path defaults to '/metrics'
#scheme defaults to 'http'.
static_configs:
- targets: ['192.168.9.140:9090']
- job_name: "node"
static_configs:
- targets: ["192.168.9.140:9100"]
- job_name: "qianmingyanqian"
static_configs:
- targets: ["11.12.108.226:9100","11.12.108.225:9100"]
## config for the multiple Redis targets that the exporter will scrape
- job_name: "redis_exporter_targets"
scrape_interval: 5s
static_configs:
- targets:
- redis://192.168.9.140:6379
- redis://192.168.9.140:7001
- redis://192.168.9.140:7004
metrics_path: /scrape
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 192.168.9.140:9121
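Before (re)starting Prometheus it is worth validating the file with promtool, which ships in the same tarball as the prometheus binary; a sketch assuming you run it from the extracted prometheus-* directory:

```bash
./promtool check config /opt/prometheus/config/prometheus.yml
# SUCCESS plus a list of rule files means the configuration parses cleanly.
```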
Create the docker-compose_prometheus_grafana.yml file:
cd /opt
mkdir docker-compose
cd docker-compose
touch docker-compose_prometheus_grafana.yml
The docker-compose_prometheus_grafana.yml file is as follows:
version: '2'
networks:
monitor:
driver: bridge
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
hostname: prometheus
restart: always
volumes:
- /opt/prometheus/config:/etc/prometheus
- /data/prometheus:/prometheus
ports:
- "9090:9090"
expose:
- "8086"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--log.level=info'
- '--web.listen-address=0.0.0.0:9090'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention=15d'
- '--query.max-concurrency=50'
networks:
- monitor
grafana:
image: grafana/grafana:latest
container_name: grafana
hostname: grafana
restart: always
volumes:
- /opt/grafana/data:/var/lib/grafana
ports:
- "3000:3000"
- "26:26"
networks:
- monitor
depends_on:
- prometheus
docker-compose -p prometheus_grafana -f docker-compose_prometheus_grafana.yml up -d
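A quick way to confirm both containers started before opening the web UIs, using the same project name and compose file as above:

```bash
docker-compose -p prometheus_grafana -f docker-compose_prometheus_grafana.yml ps
# Both the prometheus and grafana services should be in the "Up" state.
```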
Visit http://192.168.9.140:9090/ to access the Prometheus service.
Visit http://192.168.9.140:9090/config to view the Prometheus configuration.
Install node_exporter from the binary release:
wget https://github.com/prometheus/node_exporter/releases/download/v*/node_exporter-*.*-amd64.tar.gz
tar xvfz node_exporter-*.*-amd64.tar.gz
cd node_exporter-*.*-amd64
nohup ./node_exporter &
Alternatively, run node_exporter with Docker Compose. Create the docker-compose_node-exporter.yml file:
cd /opt/docker-compose
touch docker-compose_node-exporter.yml
The docker-compose_node-exporter.yml file is as follows:
---
version: '3.8'
services:
node_exporter:
image: quay.io/prometheus/node-exporter:latest
container_name: node_exporter
command:
- '--path.rootfs=/host'
network_mode: host
pid: host
restart: unless-stopped
volumes:
- '/:/host:ro,rslave'
docker-compose -p node_exporter -f docker-compose_node-exporter.yml up -d
curl http://192.168.9.140:9100/metrics
# HELP node_xfs_read_calls_total Number of read(2) system calls made to files in a filesystem.
# TYPE node_xfs_read_calls_total counter
node_xfs_read_calls_total{device="dm-1"} 10196
node_xfs_read_calls_total{device="dm-2"} 17401
node_xfs_read_calls_total{device="dm-3"} 970
node_xfs_read_calls_total{device="dm-4"} 10
node_xfs_read_calls_total{device="dm-5"} 19
node_xfs_read_calls_total{device="dm-6"} 132
node_xfs_read_calls_total{device="sda2"} 16378
node_xfs_read_calls_total{device="sda3"} 2.67817784e+09
node_xfs_read_calls_total{device="sda6"} 1.053587e+06
Append the following to prometheus.yml:
- job_name: "node"
static_configs:
- targets: ["192.168.9.140:9100"]
Restart the prometheus container (docker restart <CONTAINER ID>).
In Grafana, add a Prometheus data source with the URL http://192.168.9.140:9090, as shown in the figure, and save it.
Import the dashboard https://grafana.com/grafana/dashboards/1860 and load it.
Install redis_exporter from the binary release:
wget https://github.com/oliver006/redis_exporter/releases/download/v1.23.1/redis_exporter-v1.23.1.linux-386.tar.gz
tar zxvf redis_exporter-v1.23.1.linux-386.tar.gz
nohup ./redis_exporter -redis.addr 192.168.9.140:6379 -redis.password 111111 &
docker run -d --name redis_exporter -p 9121:9121 oliver006/redis_exporter --redis.addr=192.168.9.140:6379 --redis.password=111111
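Once the exporter is running, its multi-target /scrape endpoint can be exercised directly; this is the same path and target parameter that the relabel_configs in the cluster job below rely on (a sketch assuming the exporter listens on 192.168.9.140:9121):

```bash
curl -s "http://192.168.9.140:9121/scrape?target=redis://192.168.9.140:6379" | grep redis_up
# redis_up 1 means the exporter could connect to (and authenticate against) that instance.
```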
- For a single Redis instance, append the following to `prometheus.yml`:
```yaml
## config for scraping the exporter itself
- job_name: 'redis_exporter'
  scrape_interval: 5s
  static_configs:
    - targets: ['192.168.9.140:9121']
```
- For a Redis cluster, append the following to `prometheus.yml`:
```yaml
## config for the multiple Redis targets that the exporter will scrape
- job_name: "redis_exporter_targets"
scrape_interval: 5s
static_configs:
- targets:
- redis://192.168.9.140:6379
- redis://192.168.9.140:7001
- redis://192.168.9.140:7004
metrics_path: /scrape
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 192.168.9.140:9121
```
Restart the prometheus container (docker restart <CONTAINER ID>).
Import the dashboard https://grafana.com/grafana/dashboards/11835 and load it.
Create a MySQL account for mysqld_exporter:
root@localhost 14:43: [(none)]>CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'mysql_exporter';
Query OK, 0 rows affected (0.04 sec)
root@localhost 14:43: [(none)]>GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Query OK, 0 rows affected (0.03 sec)
Create the .my.cnf configuration file:
cd /opt
touch .my.cnf
vim .my.cnf
[client]
user = exporter
password = mysql_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v*/mysqld_exporter-*.*-amd64.tar.gz
tar xvfz mysqld_exporter-*.*-amd64.tar.gz
cd mysqld_exporter-*.*-amd64
nohup ./mysqld_exporter --config.my-cnf=/opt/.my.cnf &
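A quick check that the exporter can reach MySQL, assuming it listens on its default port 9104; mysql_up is a standard mysqld_exporter metric:

```bash
curl -s http://192.168.9.140:9104/metrics | grep '^mysql_up'
# mysql_up 1 means the exporter connected to MySQL successfully.
```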
Alternatively, run mysqld_exporter with Docker Compose. Create the docker-compose_mysqld-exporter.yml file:
cd /opt/docker-compose
touch docker-compose_mysqld-exporter.yml
The docker-compose_mysqld-exporter.yml file is as follows:
version: '2'
services:
mysql-exporter:
image: prom/mysqld-exporter
container_name: mysql-exporter
hostname: mysql-exporter
restart: always
ports:
- "9104:9104"
networks:
- my-mysql-network
environment:
DATA_SOURCE_NAME: "exporter:mysql_exporter@(192.168.9.140:3306)/"
networks:
my-mysql-network:
driver: bridge
docker-compose -p mysql_exporter -f docker-compose_mysqld-exporter.yml up -d
Append the following to prometheus.yml:
- job_name: 'mysql'
static_configs:
- targets: ['192.168.9.140:9104']
labels:
instance: mysql
Restart the prometheus container (docker restart <CONTAINER ID>).
Import the dashboard https://grafana.com/grafana/dashboards/11323 and load it.
If you only need to monitor MySQL or MongoDB, consider PMM (Percona Monitoring and Management) instead; it adds extra features such as slow-query collection.
Install cAdvisor with Docker Compose. Create the docker-compose_cadvisor.yml file:
cd /opt/docker-compose
touch docker-compose_cadvisor.yml
The docker-compose_cadvisor.yml file is as follows:
version: '3.2'
services:
cadvisor:
image: google/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
ports:
- '18080:8080'
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
docker-compose -p cadvisor -f docker-compose_cadvisor.yml up -d
http://11.12.110.38:18080/containers/
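cAdvisor also exposes Prometheus-format metrics on the same port under /metrics, which is what the scrape job below targets; a quick check, assuming the 18080:8080 mapping above and adjusting the host to wherever the container runs:

```bash
curl -s http://192.168.9.140:18080/metrics | grep -m 5 '^container_'
# Seeing container_* samples confirms the endpoint Prometheus will scrape.
```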
Append the following to prometheus.yml:
- job_name: 'cadvisor'
scrape_interval: 5s
static_configs:
- targets: ['192.168.9.140:18080']
Restart the prometheus container (docker restart <CONTAINER ID>).
Import the dashboard https://grafana.com/grafana/dashboards/8321 and load it.
If the Prometheus web UI shows the warning "Error fetching server time: Detected 785.6099998950958 seconds time difference between your browser and the server.", synchronize the server clock:
ntpdate time3.aliyun.com
An incorrect prometheus.yml configuration may produce the error "too many redis instances".
After redis_exporter starts, the scrape status on the Prometheus web UI under /targets may also show the following error. This usually means the password was configured incorrectly when redis_exporter was started, e.g. a password was supplied for Redis instances that do not require one, or a required password was omitted:
- redis_exporter_last_scrape_error{err="dial redis: unknown network redis"} 1
If MySQL reports the following error when creating the exporter account, the server was started with --skip-grant-tables:
ERROR 1290 (HY000): The MySQL server is running with the --skip-grant-tables option so it cannot execute this statement
flush privileges;
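flush privileges; reloads the grant tables that --skip-grant-tables left unloaded, after which account-management statements are accepted again; a sketch of the full recovery sequence, reusing the account and grants from the MySQL section above:

```sql
-- Reload the grant tables, then recreate the monitoring account for mysqld_exporter.
FLUSH PRIVILEGES;
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'mysql_exporter';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
```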