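Preface
Monitoring is the eyes of operations. Prometheus plus Grafana is one of the most widely used open-source monitoring stacks, and this article builds a complete server monitoring setup with it from scratch.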
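I. Architecture Overview
- Prometheus: time-series database and metrics scraping engine
- Node Exporter: host metrics (CPU, memory, disk, network)
- cAdvisor: container metrics
- Grafana: visualization dashboards
- Alertmanager: alert notifications (email natively; DingTalk and Feishu through a webhook)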
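II. Deploying with Docker Compose

    # docker-compose.yml
    version: '3.8'
    services:
      prometheus:
        image: prom/prometheus:v2.50.0
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          # Mounted so the rule_files entry in prometheus.yml resolves
          - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
          - prometheus_data:/prometheus
      grafana:
        image: grafana/grafana:10.3.0
        ports:
          - "3001:3000"   # Grafana UI on host port 3001
        environment:
          # Demo password only; change it before exposing Grafana
          - GF_SECURITY_ADMIN_PASSWORD=admin123
      node-exporter:
        image: prom/node-exporter:v1.7.0
        ports:
          - "9100:9100"
        # Read the host's filesystems instead of the container's own
        volumes:
          - /:/host:ro,rslave
        command:
          - '--path.rootfs=/host'
    # prometheus.yml also targets cadvisor:8080 and alertmanager:9093;
    # those services must be defined under services: as well.
    volumes:
      prometheus_data: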
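The Prometheus configuration in this article scrapes cadvisor:8080 and sends alerts to alertmanager:9093, but the Compose file above defines neither service. A minimal sketch of the two missing entries follows. The image tags, the host mounts, and the ./alertmanager.yml path are assumptions rather than part of the original setup, so check them against your environment.

```yaml
# Sketch only: merge these entries into the existing services: map
# of docker-compose.yml. Tags and mounts are assumptions.
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    # cAdvisor listens on 8080 inside the Compose network, which is
    # all Prometheus needs; no host port is published here.
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      # Assumed local file; the image reads its config from this path
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
```

Because all services share the default Compose network, Prometheus can reach them by service name, which is why the targets are written as cadvisor:8080 and alertmanager:9093.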
III. Prometheus Configuration

    # prometheus.yml
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['node-exporter:9100']
      - job_name: 'docker'
        static_configs:
          - targets: ['cadvisor:8080']
    rule_files:
      - "alert_rules.yml"
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
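The Prometheus Stats dashboard recommended in the Grafana section charts Prometheus's own metrics, which the configuration above does not scrape. One way to cover that, sketched here as an assumption rather than as part of the original setup, is a self-scrape job; the job name prometheus is a common convention, not a requirement.

```yaml
# Sketch: append this job to the existing scrape_configs list
# in prometheus.yml.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      # Prometheus exposes its own metrics on its web port
      - targets: ['localhost:9090']
```

After a reload, the Status → Targets page in the Prometheus UI should list the new job next to node and docker.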
IV. Alert Rules

    # alert_rules.yml
    groups:
      - name: server_alerts
        rules:
          - alert: HighCPUUsage
            # 100 minus the average idle percentage per instance over 5 minutes
            expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "CPU usage above 80%"
          - alert: HighMemoryUsage
            expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "Memory usage above 90%"
          - alert: DiskSpaceLow
            # Pseudo filesystems are excluded so they do not trigger false alerts
            expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Disk usage above 85%"
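Firing alerts are only forwarded to Alertmanager, which needs its own routing configuration before anything is delivered. The sketch below routes by the severity label used in the rules above. The receiver names, the timing values, and the webhook-adapter URLs are hypothetical. Alertmanager's webhook payload is not in the format DingTalk or Feishu bots accept, so an adapter service of your choosing is assumed to sit in between.

```yaml
# alertmanager.yml (sketch; names, timings, and URLs are assumptions)
route:
  receiver: default-webhook          # fallback for warning-level alerts
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: critical-webhook
      repeat_interval: 1h            # remind more often for critical alerts
receivers:
  - name: default-webhook
    webhook_configs:
      - url: 'http://webhook-adapter:8060/alerts'            # hypothetical adapter
  - name: critical-webhook
    webhook_configs:
      - url: 'http://webhook-adapter:8060/alerts-critical'   # hypothetical adapter
```

Grouping by alertname and instance collapses repeated firings from one host into a single notification, which helps keep the volume of messages manageable.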
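V. Grafana Dashboards
The following community dashboard templates are a good starting point:
- Node Exporter Full (ID: 1860): full host overview
- Docker Container (ID: 893): container monitoring
- Prometheus Stats (ID: 2): Prometheus's own status
To import one, first add Prometheus as a data source (http://prometheus:9090 from inside the Compose network), then go to Dashboards → Import and enter the dashboard ID.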
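Imported dashboards need a Prometheus data source to query. Instead of adding it by hand in the UI, Grafana can load it from a provisioning file at startup. This is a minimal sketch: the host-side file path is hypothetical, and it assumes the file is mounted into the Grafana container under /etc/grafana/provisioning/datasources/.

```yaml
# provisioning/datasources/prometheus.yml (hypothetical host path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend performs the queries
    url: http://prometheus:9090    # service name on the Compose network
    isDefault: true
```

With the data source marked as default, imported dashboards that ask for a Prometheus source can select it directly.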
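Once the monitoring stack is running, review the alert rules about once a week and tune the thresholds against what you actually observe, so that noisy alerts do not lead to alert fatigue.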