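ECS specifications:

| Instance | Flavor | OS | System disk | Count | Role |
| --- | --- | --- | --- | --- | --- |
| **ECS-Monitor** | 2 vCPU / 4 GiB (s6.medium.2) | Ubuntu 22.04 | 40 GiB SSD | 1 | Runs Prometheus + Grafana + Alertmanager |
| **ECS-Target** | 2 vCPU / 2 GiB (s6.small.2) | Ubuntu 22.04 | 40 GiB SSD | 1 to N | Monitored node, runs Node Exporter |

Network plan:

| Resource | Setting |
| --- | --- |
| VPC | CIDR 192.168.0.0/16 |
| Subnet | CIDR 192.168.0.0/24 |
| EIP | Must be bound to ECS-Monitor; used to reach the Grafana web page |

Inbound security group rules:

| Port | Protocol | Source | Purpose |
| --- | --- | --- | --- |
| **22** | TCP | 0.0.0.0/0 or your own IP | SSH login |
| **3000** | TCP | Your local IP | Grafana Web UI |
| **9090** | TCP | Your local IP | Prometheus Web UI |
| **9093** | TCP | Your local IP | Alertmanager Web UI |
| **9100** | TCP | Private IP of ECS-Monitor | Node Exporter metrics endpoint |
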
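I am using Huawei Cloud ECS instances here, with MobaXterm as the SSH client.

After logging in with the password you set, create a dedicated user and directories: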
```bash
# Create the prometheus system user (no home directory, no login shell)
sudo useradd --no-create-home --shell /bin/false prometheus
# Create the config and data directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
# Set directory ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
```

Download and install Prometheus:

```bash
# Refresh the package index
sudo apt update
# Download Prometheus (binary tarball)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
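```

The download can take a long time, so be patient.

My connection was far too slow, so I installed Prometheus with apt instead. In case anyone needs them, here are the commands:

```bash
sudo apt update
sudo apt install -y prometheus
```

Once the installation finishes, open the Prometheus web UI in a browser at `http://<public-IP>:9090`. The IP address I use is the public IP of my own VM.

Step 3: Install Alertmanager

```bash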
sudo apt install -y prometheus-alertmanager
sudo systemctl status prometheus-alertmanager
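```

Configure alert notifications (email example):

```bash
# Note: with the apt package the config file may instead live at
# /etc/prometheus/alertmanager.yml; edit the file your service actually loads
sudo vim /etc/alertmanager/alertmanager.yml
```

Enter the following:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'your_email@qq.com'
  smtp_auth_username: 'your_email@qq.com'
  smtp_auth_password: 'your_auth_code'  # QQ Mail authorization code, not the account password!
  smtp_require_tls: false
route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'target_email@qq.com'
        send_resolved: true  # also send a notification when the alert resolves
```

Replace the email addresses above with your own.

If you installed via apt like I did, the file already contains a default configuration. Delete all of it, paste the configuration above, and add one more block:

```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```

Restart the services (the apt package names the unit `prometheus-alertmanager`):

```bash
sudo systemctl restart prometheus
sudo systemctl restart prometheus-alertmanager
```

Install Grafana

Add the official Grafana repository:

```bash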
# Install dependencies
sudo apt install -y wget software-properties-common apt-transport-https
# Add the GPG key
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
# Add the repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
# Refresh the package index
sudo apt update
```

Install and start Grafana:

```bash
sudo apt install -y grafana
# Start the service and enable it at boot
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server
```
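At this point I found that port 3000 was blocked in my environment, so I switched Grafana to port 3030. First check the current setting:

```bash
grep -n "http_port" /etc/grafana/grafana.ini
```

If the output shows `;http_port = 3000` or `http_port = 3000`, run this command:

```bash
sudo sed -i 's/^[;#]*[[:space:]]*http_port[[:space:]]*=[[:space:]]*3000/http_port = 3030/' /etc/grafana/grafana.ini
```

Then restart Grafana and confirm it is listening on the new port:

```bash
sudo systemctl restart grafana-server
sudo ss -tlnp | grep 3030
```

If the port shows up, Grafana is reachable. If you change the port like this, the security group must allow TCP 3030 instead of 3000.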
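To see why that single sed expression covers both the commented and the uncommented form, here is a small sketch that applies the same substitution to a scratch file. The function name `set_grafana_port` and the temporary file are my own, for illustration only; nothing here touches the real `/etc/grafana/grafana.ini`, and it assumes GNU sed (as shipped with Ubuntu).

```bash
# Hypothetical helper: rewrite http_port in an ini file, whether or not the
# line is commented out with ';' or '#'. Same regex as the sed command above,
# with the old and new port passed as arguments.
set_grafana_port() {
    file=$1; old=$2; new=$3
    sed -i "s/^[;#]*[[:space:]]*http_port[[:space:]]*=[[:space:]]*${old}/http_port = ${new}/" "$file"
}

# Try it on a scratch file that mimics the default (commented-out) setting
demo=$(mktemp)
printf '[server]\n;http_port = 3000\n' > "$demo"
set_grafana_port "$demo" 3000 3030
grep -n 'http_port' "$demo"   # prints 2:http_port = 3030
```

The leading `[;#]*` is what removes the comment marker, so the setting ends up active regardless of how the line looked before.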
Log in to Grafana with the default credentials:

- Username: admin
- Password: admin
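Install Node Exporter on the monitored node

Run the following on **every ECS-Target** (I only use one here). Creating another VM on Huawei Cloud works the same way as before, so I won't walk through it again.

Log in to the other VM:

```bash
sudo apt update
# Download the latest Node Exporter release
cd /tmp
curl -s https://api.github.com/repos/prometheus/node_exporter/releases/latest \
  | grep browser_download_url \
  | grep linux-amd64 \
  | cut -d '"' -f 4 \
  | wget -qi -
```

I installed it through apt instead (slow network again):

```bash
sudo apt update
sudo apt install -y prometheus-node-exporter
```

Start the service. The apt package normally starts `prometheus-node-exporter` on its own; if it did not, run `sudo systemctl enable --now prometheus-node-exporter`. Once port 9100 shows up as listening, the installation has succeeded.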
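One way to check for port 9100 without eyeballing the output is to filter `ss -tln`. The sketch below is only an illustration: `port_listening` is a made-up helper name, and the sample line stands in for real `ss` output so the snippet runs anywhere.

```bash
# Hypothetical helper: succeed if the `ss -tln` output on stdin contains a
# local address ending in the given port.
port_listening() {
    grep -Eq ":$1[[:space:]]"
}

# Demo against a canned line shaped like `ss -tln` output
sample='LISTEN 0      4096               *:9100            *:*'
printf '%s\n' "$sample" | port_listening 9100 && echo "9100 is listening"
```

On ECS-Target the real check would be `ss -tln | port_listening 9100 && echo ok`. Fetching `http://localhost:9100/metrics` with curl is another quick way to confirm that Node Exporter is answering.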
Configure Prometheus scraping on ECS-Monitor

```bash
sudo vim /etc/prometheus/prometheus.yml
```

Replace the entire contents with:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'example'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - /etc/prometheus/rules.yml
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    static_configs:
      - targets: ['localhost:9100', '192.168.x.x:9100']
```

Replace `192.168.x.x` with the private IP of the server being monitored.
Create the alert rules file:

```bash
sudo tee /etc/prometheus/rules.yml > /dev/null << 'EOF'
groups:
  - name: instance-down
    interval: 15s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"
  - name: resource-usage
    interval: 15s
    rules:
      - alert: CPUHigh
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is too high"
          description: "CPU usage on instance {{ $labels.instance }} has been above 80% for 5 minutes"
      - alert: MemoryHigh
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage is too high"
          description: "Memory usage on instance {{ $labels.instance }} has been above 85% for 5 minutes"
      - alert: DiskHigh
        expr: (1 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage is too high"
          description: "Usage of disk {{ $labels.mountpoint }} on instance {{ $labels.instance }} is above 80%"
EOF
```
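For some intuition about the MemoryHigh expression, the same arithmetic can be reproduced by hand from `/proc/meminfo`, which is where Node Exporter takes MemTotal and MemAvailable from on Linux. This is a minimal sketch; the function name `mem_used_pct` is my own and not part of any tool.

```bash
# Hypothetical helper mirroring the MemoryHigh expression:
#   (1 - MemAvailable / MemTotal) * 100
# Arguments: total and available memory, both in the same unit (e.g. kB).
mem_used_pct() {
    awk -v total="$1" -v avail="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Worked example: 4 GiB total, 512 MiB available (values in kB)
mem_used_pct 4194304 524288   # prints 87.5, which is above the 85% threshold

# On a Linux host, feed in the live values from /proc/meminfo
if [ -r /proc/meminfo ]; then
    total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    mem_used_pct "$total" "$avail"
fi
```

The rule only fires once the value has stayed above 85 for the full `for: 5m` window, so a short spike will not send an alert.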
After pasting it as is, validate the configuration:

```bash
promtool check config /etc/prometheus/prometheus.yml
```

Then restart the services:

```bash
sudo systemctl restart prometheus
sudo systemctl restart prometheus-alertmanager
```
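To check the email path without waiting for a real outage, you can push a hand-made alert straight into Alertmanager through its v2 HTTP API. The sketch below only builds and prints the payload; the helper name `build_test_alert` and the label values are made up for the test, and the commented curl call assumes Alertmanager is listening on its default port 9093 on ECS-Monitor.

```bash
# Hypothetical helper: print a one-alert JSON payload in the shape the
# Alertmanager v2 API accepts (a list of objects with labels and annotations).
# Arguments: alertname, instance, severity.
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2" "$3"
}

# Show the payload that would be sent
build_test_alert ManualTest localhost:9100 warning

# To actually send it, run this on ECS-Monitor:
#   build_test_alert ManualTest localhost:9100 warning \
#     | curl -s -X POST -H 'Content-Type: application/json' \
#            --data @- http://localhost:9093/api/v2/alerts
```

With the routing shown earlier (`group_wait: 10s`), the email should go out shortly after the POST. If nothing arrives, `sudo journalctl -u prometheus-alertmanager` is the first place to look for SMTP errors.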
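Go into Grafana, open "Add new connection", search for Prometheus, and enter `http://localhost:9090` as the server URL.

Then import a dashboard. Open this address directly in the browser:

http://116.204.78.22:3030/dashboard/import

In the "Import via grafana.com" field, enter 1860 (the Node Exporter Full dashboard) and click Import.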
On ECS-Monitor, edit the Prometheus configuration again:

```bash
sudo vim /etc/prometheus/prometheus.yml
```

Find:

```yaml
  - job_name: node
    static_configs:
      - targets: ['localhost:9100', '192.168.0.15:9100']
```

Change it to:

```yaml
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100', '192.168.0.15:9100']
```

After saving:

```bash
promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus
```

The whole Prometheus + Node Exporter + Grafana monitoring chain is now up and running.
Everyday access URLs:
- Grafana: http://116.204.78.22:3030
- Prometheus targets: http://116.204.78.22:9090/targets
- Alerts: http://116.204.78.22:9090/alerts
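
The information on the targets page is also available from the Prometheus HTTP API at `/api/v1/targets`, which is handy for a quick check over SSH. The sketch below counts healthy and unhealthy targets from that JSON. `count_target_health` is a made-up helper name, the sample JSON is heavily trimmed, and the parsing is a plain text match rather than a real JSON parser, so treat it as a rough check only.

```bash
# Hypothetical helper: read the /api/v1/targets JSON on stdin and print how
# many active targets report health "up" and how many report "down".
count_target_health() {
    awk '{
        up   += gsub(/"health":"up"/, "&")
        down += gsub(/"health":"down"/, "&")
    } END { printf "up=%d down=%d\n", up, down }'
}

# Demo with a trimmed sample response
sample='{"status":"success","data":{"activeTargets":[{"labels":{"job":"node_exporter"},"health":"up"},{"labels":{"job":"node_exporter"},"health":"down"}]}}'
printf '%s\n' "$sample" | count_target_health   # prints up=1 down=1
```

On ECS-Monitor the live version would be `curl -s http://localhost:9090/api/v1/targets | count_target_health`.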