Metric format: metric name {label name=label value} sample value
<metric name>{<label name>=<label value>, ...} <sample>
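The metric name defines what the sample measures. A name may contain only ASCII letters, digits, underscores, and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*
Labels describe the dimensions of a sample. Prometheus uses these dimensions to filter and aggregate sample data. A label name may contain only ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*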
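The two naming rules can be checked mechanically. The following is a minimal sketch that applies the regular expressions quoted above; the function names are illustrative and not part of any Prometheus library.

```python
import re

# Patterns copied from the two regular expressions quoted above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole string matches the metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """Return True if the whole string matches the label-name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical candidates: the last two break the rules (leading digit, hyphen).
    for candidate in ("node_cpu_seconds_total", "job:cpu:rate5m", "9lives", "cpu-usage"):
        print(candidate, is_valid_metric_name(candidate))
```

Note that a colon is accepted in a metric name but not in a label name, which is the only difference between the two patterns.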
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 16761.9
node_cpu_seconds_total{cpu="0",mode="iowait"} 2.91
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 5.76
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 440.28
node_cpu_seconds_total{cpu="0",mode="user"} 135.58
node_cpu_seconds_total{cpu="1",mode="idle"} 16851.16
node_cpu_seconds_total{cpu="1",mode="iowait"} 1.81
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0
node_cpu_seconds_total{cpu="1",mode="softirq"} 1.33
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 440.52
node_cpu_seconds_total{cpu="1",mode="user"} 125.7
node_cpu_seconds_total{cpu="2",mode="idle"} 16792.57
node_cpu_seconds_total{cpu="2",mode="iowait"} 2.52
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 0
node_cpu_seconds_total{cpu="2",mode="softirq"} 1.36
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 445.29
node_cpu_seconds_total{cpu="2",mode="user"} 129.73
node_cpu_seconds_total{cpu="3",mode="idle"} 16844.57
node_cpu_seconds_total{cpu="3",mode="iowait"} 1.16
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 0
node_cpu_seconds_total{cpu="3",mode="softirq"} 1.24
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 430.82
node_cpu_seconds_total{cpu="3",mode="user"} 135.15
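To make the line format concrete, here is a rough sketch of how one line of the output above could be split into name, labels, and value. It is a simplified parser, not the official one: it assumes label values contain no escaped quotes and that lines carry no trailing timestamp, which holds for the output shown. The function name parse_sample is made up for this example.

```python
import re

# Simplified pattern: metric name, optional {labels}, then a single sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair; assumes the value contains no escaped double quotes.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str):
    """Parse one sample line; return (name, labels, value), or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # HELP/TYPE comments and empty lines carry no sample
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


if __name__ == "__main__":
    print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 16761.9'))
```

Python's float() also accepts the NaN and +Inf spellings that appear in exposition output, so those values need no special handling in this sketch.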
Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.
Counter example
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 16761.9
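A Gauge metric reflects the current state of the system, so its sample value can go up as well as down. A typical use is monitoring memory capacity.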
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9100/metrics | grep node_memory_MemFree
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 2.933243904e+09
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9090/metrics | grep prometheus_tsdb_compaction_chunk_range
# HELP prometheus_tsdb_compaction_chunk_range_seconds Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range_seconds histogram
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="100"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="400"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1600"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6400"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="25600"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="102400"} 3
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="409600"} 1506
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1.6384e+06"} 1558
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6.5536e+06"} 4564
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="2.62144e+07"} 4564
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="+Inf"} 4564
prometheus_tsdb_compaction_chunk_range_seconds_sum 5.85524936e+09
prometheus_tsdb_compaction_chunk_range_seconds_count 4564
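The bucket counts in a histogram are cumulative: each le bucket counts every observation less than or equal to its upper bound, which is why the +Inf bucket equals the _count value. The sketch below uses the numbers from the output above to recover the per-bucket counts and the mean observation. The helper names are illustrative only.

```python
# Cumulative bucket counts copied from the histogram output above: (upper bound "le", count).
CUMULATIVE_BUCKETS = [
    (100.0, 2), (400.0, 2), (1600.0, 2), (6400.0, 2), (25600.0, 2),
    (102400.0, 3), (409600.0, 1506), (1.6384e06, 1558),
    (6.5536e06, 4564), (2.62144e07, 4564), (float("inf"), 4564),
]
SAMPLE_SUM = 5.85524936e09   # value of the _sum series
SAMPLE_COUNT = 4564          # value of the _count series


def per_bucket_counts(buckets):
    """Convert cumulative counts into the number of observations that fall in each bucket."""
    counts = []
    previous = 0
    for upper_bound, cumulative in buckets:
        counts.append((upper_bound, cumulative - previous))
        previous = cumulative
    return counts


def mean_observation(sample_sum, sample_count):
    """Average observed value, derived from the _sum and _count series."""
    return sample_sum / sample_count


if __name__ == "__main__":
    for upper_bound, count in per_bucket_counts(CUMULATIVE_BUCKETS):
        print(f"le={upper_bound}: {count}")
    print("mean:", mean_observation(SAMPLE_SUM, SAMPLE_COUNT))
```

Most observations land in the buckets ending at 409600 and 6.5536e+06, which is hard to see from the cumulative numbers alone.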
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9090/metrics | grep prometheus_tsdb_wal_fsync_duration_seconds
# HELP prometheus_tsdb_wal_fsync_duration_seconds Duration of WAL fsync.
# TYPE prometheus_tsdb_wal_fsync_duration_seconds summary
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.5"} NaN
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.9"} NaN
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.99"} NaN
prometheus_tsdb_wal_fsync_duration_seconds_sum 1.63e-05
prometheus_tsdb_wal_fsync_duration_seconds_count 1
Query by instance="node-exporter:9100"
node_cpu_seconds_total{instance="node-exporter:9100"}
mode!="irq" excludes irq
node_cpu_seconds_total{mode!="irq"}
Query every series with mode="user"
{mode="user"}
Regular expression match
node_cpu_seconds_total{mode=~"user|system|nice"}
restful_api_requests_total{environment=~"staging|testing|development",method!="GET"}
{instance =~"n.*"}
Regular expression exclusion
node_cpu_seconds_total{mode!~"steal|softirq|irq|iowait|idle"}
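The four matcher operators used above (=, !=, =~, !~) can be sketched in a few lines. This is an illustration of the matching rules, not Prometheus code: Prometheus uses RE2 syntax while this sketch uses Python's re module, and the series list is invented sample data. The one detail worth noting is that regex matchers are fully anchored, so irq does not match softirq.

```python
import re


def matcher_ok(value: str, op: str, pattern: str) -> bool:
    """Evaluate one matcher against a label value (a missing label is passed in as '')."""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # Regex matchers are fully anchored: the pattern must match the whole value.
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


def select(series, matchers):
    """Keep the label sets that satisfy every (label, op, pattern) matcher."""
    return [
        labels for labels in series
        if all(matcher_ok(labels.get(name, ""), op, pattern) for name, op, pattern in matchers)
    ]


# Hypothetical label sets; the metric name is stored in the __name__ label.
SERIES = [
    {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "user"},
    {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "irq"},
    {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "softirq"},
    {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "idle"},
]

if __name__ == "__main__":
    print(select(SERIES, [("mode", "!~", "steal|softirq|irq|iowait|idle")]))
```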
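The PromQL range selector accepts the following time units: ms (milliseconds), s (seconds), m (minutes), h (hours), d (days), w (weeks), and y (years).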
The following expression returns all samples from the last 5 minutes of each matching time series:
node_memory_MemAvailable_bytes{}[5m]
The offset modifier shifts the query back in time:
node_memory_MemAvailable_bytes{} offset 5m
rate(node_cpu_seconds_total{}[5m] offset 1m)
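How a range such as [5m] and an offset such as 1m combine into a time window can be sketched as follows. This is an approximation under stated assumptions: only single-unit durations are parsed, and the window is treated as excluding its start and including its end, which may differ between Prometheus versions. All function names are made up for this example.

```python
import re

# Seconds per unit; d, w and y use the fixed lengths 24h, 7d and 365d.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
DURATION_RE = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text: str) -> float:
    """Convert a single-unit duration such as '5m' or '2h' to seconds."""
    match = DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def range_window(eval_time: float, range_text: str, offset_text: str = "0s"):
    """Return (start, end) of the window selected by [range] offset <offset>."""
    end = eval_time - parse_duration(offset_text)
    return end - parse_duration(range_text), end


def select_samples(samples, eval_time, range_text, offset_text="0s"):
    """Keep (timestamp, value) pairs inside the window; the start-exclusive bound is an assumption."""
    start, end = range_window(eval_time, range_text, offset_text)
    return [(ts, value) for ts, value in samples if start < ts <= end]


if __name__ == "__main__":
    # Evaluating at t=1000 with [5m] offset 1m selects the window from t=640 to t=940.
    print(range_window(1000, "5m", "1m"))
```

The offset moves the end of the window into the past first, and the range then extends backwards from that shifted end.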
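PromQL supports arithmetic operators, comparison (Boolean) operators, and logical (set) operators.
Operator precedence in PromQL, from highest to lowest:
^
*, /, %, atan2
+, -
==, !=, <=, <, >=, >
and, unless
or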
Example: converting bytes to MB
node_memory_MemFree_bytes / (1024 * 1024)
Total disk reads and writes, in MB
(node_disk_read_bytes_total{device="vda"} + node_disk_written_bytes_total{device="vda"}) / (1024 * 1024)
Memory usage
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100
# Nodes whose memory usage exceeds 80%
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes > 0.8
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
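The arithmetic behind these queries is easy to check by hand. The sketch below repeats the byte conversion and the usage formula in plain Python. The MemFree value is the sample shown earlier on this page; the 8 GiB MemTotal is a hypothetical figure chosen only for illustration.

```python
BYTES_PER_MB = 1024 * 1024  # same divisor as in the queries above


def bytes_to_mb(value_bytes: float) -> float:
    """Convert a byte count to MB using the 1024 * 1024 divisor."""
    return value_bytes / BYTES_PER_MB


def memory_usage_percent(total_bytes: float, free_bytes: float) -> float:
    """(total - free) / total * 100, mirroring the PromQL expression."""
    return (total_bytes - free_bytes) / total_bytes * 100


# MemFree comes from the node_exporter output shown earlier; MemTotal is hypothetical.
mem_free = 2.933243904e09
mem_total = 8 * 1024 ** 3

if __name__ == "__main__":
    print("free MB:", bytes_to_mb(mem_free))
    print("usage %:", memory_usage_percent(mem_total, mem_free))
    print("above 80%:", memory_usage_percent(mem_total, mem_free) > 80)
```

In PromQL the comparison acts as a filter: series for which the condition is false are dropped from the result rather than returned as false.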
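PromQL's built-in aggregation operators and functions let users analyze this data further.
The built-in delta() function returns how much a series changed over a time range. For example, the difference in CPU temperature over two hours: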
delta(cpu_temp_celsius{host="zeus"}[2h])
delta() is intended for Gauge metrics.
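As a rough illustration of what delta() measures, the sketch below takes the difference between the last and the first sample in a window. It is a simplification: Prometheus additionally extrapolates towards the window boundaries, so real results can differ slightly. The function name and the temperature readings are invented.

```python
def simple_delta(samples):
    """Difference between the last and first value in a window of (timestamp, value) pairs.

    Simplification: no extrapolation to the window edges is performed.
    """
    if len(samples) < 2:
        return None  # a change cannot be computed from fewer than two samples
    ordered = sorted(samples)  # order by timestamp
    return ordered[-1][1] - ordered[0][1]


if __name__ == "__main__":
    # Hypothetical CPU temperature readings over two hours: (seconds, degrees Celsius).
    readings = [(0, 50.0), (3600, 55.0), (7200, 48.0)]
    print(simple_delta(readings))
```

Because a gauge can fall as well as rise, the result may be negative, as it is for the readings above.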
Use predict_linear() to forecast where a trend is heading. For example, predict how much free disk space will remain 4 hours from now:
predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4 * 3600)
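predict_linear() is based on simple linear regression over the samples in the range. A minimal sketch of that idea, using ordinary least squares and invented data, might look like this. The function name predict_linear_sketch and the sample values are assumptions for illustration, and details of the real implementation may differ.

```python
def predict_linear_sketch(samples, eval_time, horizon_seconds):
    """Fit value = slope * t + intercept by least squares, then evaluate at eval_time + horizon."""
    n = len(samples)
    if n < 2:
        return None  # a line needs at least two points
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (eval_time + horizon_seconds) + intercept


if __name__ == "__main__":
    # Hypothetical free space in GB, sampled every 10 minutes for one hour.
    free_space = [(0, 100.0), (600, 94.0), (1200, 88.0), (1800, 82.0),
                  (2400, 76.0), (3000, 70.0), (3600, 64.0)]
    print(predict_linear_sketch(free_space, 3600, 4 * 3600))
```

A negative prediction means the trend reaches zero before the horizon, which is why alert rules often compare the result of predict_linear() with 0.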
Sum
sum(node_cpu_seconds_total)
sum(node_cpu_seconds_total) by (mode)
Element            Value
{mode="steal"}     0
{mode="system"}    2632.2400000000002
{mode="user"}      768.49
{mode="idle"}      93899.19
{mode="iowait"}    8.85
{mode="irq"}       0
{mode="nice"}      0
{mode="softirq"}   13.35
sum(node_cpu_seconds_total) without (instance)
sum(node_cpu_seconds_total) by (mode,cpu)
# Share of non-idle CPU time per instance
sum(irate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) / sum(irate(node_cpu_seconds_total[5m])) by (instance)
Average
avg(node_cpu_seconds_total) by (mode)
Element            Value
{mode="nice"}      0
{mode="softirq"}   3.3374999999999995
{mode="steal"}     0
{mode="system"}    658.06
{mode="user"}      192.1225
{mode="idle"}      23474.7975
{mode="iowait"}    2.2125
{mode="irq"}       0
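The grouping behaviour of sum and avg with by and without can be sketched as follows. This is an illustrative model, not Prometheus code: by keeps only the listed labels, without drops the listed labels, and with neither clause all series collapse into a single group. The series values are made up.

```python
from collections import defaultdict


def aggregate(series, op, by=None, without=None):
    """Group (labels, value) pairs and apply 'sum' or 'avg' to each group.

    With `without`, the listed labels are dropped; otherwise only the labels in `by` are kept.
    Passing neither puts every series into one group with an empty label set.
    """
    if op not in ("sum", "avg"):
        raise ValueError(f"unsupported aggregation: {op}")
    groups = defaultdict(list)
    for labels, value in series:
        if without is not None:
            key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        else:
            key = tuple(sorted((k, v) for k, v in labels.items() if k in (by or ())))
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        total = sum(values)
        result[key] = total if op == "sum" else total / len(values)
    return result


# Hypothetical samples for two CPUs and two modes.
SERIES = [
    ({"cpu": "0", "mode": "user"}, 10.0),
    ({"cpu": "1", "mode": "user"}, 20.0),
    ({"cpu": "0", "mode": "system"}, 4.0),
    ({"cpu": "1", "mode": "system"}, 6.0),
]

if __name__ == "__main__":
    print(aggregate(SERIES, "sum", by=["mode"]))
    print(aggregate(SERIES, "avg", by=["mode"]))
```

On a four-core host the per-mode average is simply the per-mode sum divided by four, since each mode has one series per CPU.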