PromQL for system metrics and guest OS performance counters

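This article provides guidance for using PromQL in Azure Monitor to query OpenTelemetry system metrics and guest operating system performance counters. It covers cases where the Azure Monitor Agent or the OpenTelemetry Collector collects system-level telemetry.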

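Tip

System metric queries work in both workspace-scoped and resource-scoped modes. For resource-scoped queries, filter and group by "Microsoft.resourceid" rather than generic identifiers such as instance or host.name, so that results are scoped precisely to your resources.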
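As a rough illustration of the tip above, the following Python sketch builds a resource-scoped selector string that uses the quoted form for both the dotted metric name and the "Microsoft.resourceid" label. The helper names and the resource ID are hypothetical and exist only for this example; they are not part of any Azure SDK.

```python
def quote_promql_string(value: str) -> str:
    """Escape backslashes and double quotes for use inside a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def resource_scoped_selector(metric: str, resource_id: str) -> str:
    """Build a selector that filters one metric to a single Azure resource.

    Dotted names such as system.cpu.utilization are quoted inside the braces,
    and so is the "Microsoft.resourceid" label name.
    """
    return '{"%s", "Microsoft.resourceid"="%s"}' % (
        quote_promql_string(metric),
        quote_promql_string(resource_id),
    )


# Hypothetical resource ID, used only for illustration.
vm_id = (
    "/subscriptions/0000/resourceGroups/rg-demo"
    "/providers/Microsoft.Compute/virtualMachines/vm-demo"
)
selector = resource_scoped_selector("system.cpu.utilization", vm_id)

# Group by the resource ID instead of a generic identifier such as instance.
query = 'avg by ("Microsoft.resourceid") (avg_over_time(%s[5m]))' % selector
print(query)
```

Building the selector in one place keeps the quoting rules consistent across queries; whether this fits a given tool chain is a design choice, not a requirement.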

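Prerequisites

System metrics overview

OpenTelemetry system metrics use standardized metric names and semantic conventions to provide comprehensive insight into guest operating system performance.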

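Core system metric categories

| Category | OpenTelemetry metrics | Legacy equivalent |
|---|---|---|
| CPU | system.cpu.utilization | % Processor Time |
| Memory | system.memory.usage, system.memory.utilization | Available Memory, Memory Usage |
| Disk | system.disk.io.bytes, system.disk.operations | Disk Bytes/sec, Disk Operations/sec |
| Network | system.network.io.bytes, system.network.packets | Network Bytes/sec, Packets/sec |
| Process | process.cpu.utilization, process.memory.usage | Process-specific counters |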

CPU metrics and queries

System-wide CPU utilization

# Overall CPU utilization across all cores
avg({"system.cpu.utilization"}) by ("Microsoft.resourceid")

# CPU utilization by state (user, system, idle, etc.)
avg by ("Microsoft.resourceid", state) ({"system.cpu.utilization"})

# High CPU utilization detection
avg_over_time({"system.cpu.utilization"}[5m]) > 0.8

Per-core CPU analysis

# CPU utilization per core
avg by ("Microsoft.resourceid", cpu) ({"system.cpu.utilization"})

# Identify CPU hotspots
topk(5,
  avg by (cpu) (avg_over_time({"system.cpu.utilization"}[5m]))
)

# CPU load distribution (load average is a gauge, so take a quantile over time)
max by ("Microsoft.resourceid") (
  quantile_over_time(0.95, {"system.cpu.load_average"}[5m])
)

Process-level CPU monitoring

# Top 5 CPU-consuming processes by command
topk(5, sum by ("process.command") ({"process.cpu.utilization"}))

# Process CPU time accumulation (for cumulative metrics)
sum by ("process.command") (rate({"process.cpu.time"}[5m]))

# Identify CPU-bound processes
{"process.cpu.utilization"} > 0.5

Memory metrics and queries

System memory analysis

# Memory utilization percentage
({"system.memory.usage"} / {"system.memory.limit"}) * 100

# Available memory
{"system.memory.usage", state="available"}

# Memory pressure indicators
{"system.memory.utilization"} > 0.9

Memory usage by type

# Memory usage breakdown
sum by ("Microsoft.resourceid", state) ({"system.memory.usage"})

# Swap usage monitoring (ignore the differing state label when matching series)
{"system.memory.usage", state="swap_used"} / ignoring(state)
{"system.memory.usage", state="swap_total"}

# Cache and buffer utilization
{"system.memory.usage", state=~"cache|buffers"}

Process memory monitoring

# Top 5 memory-consuming processes by command (percentage)
topk(5, 100 * sum by ("process.command") ({"process.memory.usage"}))

# Process memory growth rate (for delta metrics)
sum by ("process.command") (increase({"process.memory.usage"}[10m]))

# Memory leak detection (for cumulative metrics)
rate({"process.memory.usage"}[10m]) > 1000000  # 1MB/sec growth

Disk I/O metrics and queries

Disk performance monitoring

# Disk I/O rate (bytes per second)
sum by ("Microsoft.resourceid", device, direction) (rate({"system.disk.io.bytes"}[5m]))

# Disk operations per second
sum by ("Microsoft.resourceid", device, direction) (rate({"system.disk.operations"}[5m]))

# Disk utilization percentage
avg by ("Microsoft.resourceid", device) ({"system.disk.utilization"})

Disk throughput analysis

# Top 5 processes by disk operations (read and write)
topk(5, sum by ("process.command", "direction") (rate({"process.disk.operations"}[2m])))

# Read share of total disk throughput
sum(rate({"system.disk.io.bytes", direction="read"}[5m])) by ("Microsoft.resourceid") /
sum(rate({"system.disk.io.bytes"}[5m])) by ("Microsoft.resourceid")

# High disk activity detection
rate({"system.disk.operations"}[5m]) > 1000

# Disk queue length monitoring
avg by ("Microsoft.resourceid", device) ({"system.disk.pending_operations"})

Storage capacity monitoring

# Disk space utilization
avg by (device, mountpoint) (
  ({"system.filesystem.usage"} / {"system.filesystem.limit"}) * 100
)

# Available disk space
sum by (device, mountpoint) ({"system.filesystem.usage", state="available"})

# Low disk space alerts
({"system.filesystem.usage"} / {"system.filesystem.limit"}) > 0.9

Network metrics and queries

Network throughput monitoring

# Network I/O bytes per second
sum by ("Microsoft.resourceid", device, direction) (rate({"system.network.io.bytes"}[5m]))

# Network packets per second
sum by ("Microsoft.resourceid", device, direction) (rate({"system.network.packets"}[5m]))

# Network throughput by interface (both directions combined)
sum by ("Microsoft.resourceid", device) (rate({"system.network.io.bytes"}[5m]))

Network performance analysis

# Network error rates
sum by ("Microsoft.resourceid", device, direction) (rate({"system.network.errors"}[5m]))

# Dropped packet detection
sum by ("Microsoft.resourceid", device, direction) (rate({"system.network.dropped"}[5m]))

# Network saturation indicators
rate({"system.network.io.bytes"}[5m]) / {"system.network.bandwidth"} > 0.8

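System health dashboards

Golden signals for infrastructure

Latency (disk I/O latency):

# Average disk operation time over the last 5 minutes
rate({"system.disk.operation.time"}[5m]) / rate({"system.disk.operations"}[5m])

# 95th percentile disk latency (the le label must be kept for histogram_quantile)
histogram_quantile(0.95,
  sum by (device, le) (rate({"system.disk.operation.time_bucket"}[5m]))
)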
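To make the percentile query above more concrete, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It approximates the general approach used by histogram_quantile and is not the exact Prometheus or Azure Monitor implementation; the function name and the sample bucket counts are invented for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The bucket list must include a +Inf bucket holding the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since nothing better is known.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up disk latency buckets in seconds: (le, cumulative count).
latency_buckets = [(0.01, 10), (0.05, 60), (0.1, 90), (math.inf, 100)]
print(estimate_quantile(0.5, latency_buckets))
print(estimate_quantile(0.95, latency_buckets))
```

The second call lands in the +Inf bucket, which is why bucket boundaries should extend past the latency threshold you plan to alert on.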

Traffic (system throughput):

# Combined network and disk throughput
sum(rate({"system.network.io.bytes"}[5m])) by ("Microsoft.resourceid") +
sum(rate({"system.disk.io.bytes"}[5m])) by ("Microsoft.resourceid")

Errors (system errors):

# System error rate
sum(rate({"system.network.errors"}[5m])) by ("Microsoft.resourceid") +
sum(rate({"system.disk.errors"}[5m])) by ("Microsoft.resourceid")

Saturation (resource utilization):

# Overall system saturation score
(
  avg by ("Microsoft.resourceid") ({"system.cpu.utilization"}) +
  avg by ("Microsoft.resourceid") ({"system.memory.utilization"}) +
  avg by ("Microsoft.resourceid") ({"system.disk.utilization"})
) / 3

Performance counter mapping

Windows performance counters to OpenTelemetry

| Windows counter | OpenTelemetry equivalent | PromQL query |
|---|---|---|
| \Processor(_Total)\% Processor Time | system.cpu.utilization | avg({"system.cpu.utilization"}) by (cpu) |
| \Memory\Available Bytes | system.memory.usage{state="available"} | {"system.memory.usage", state="available"} |
| \PhysicalDisk(_Total)\Disk Bytes/sec | system.disk.io.bytes | rate({"system.disk.io.bytes"}[5m]) |
| \Network Interface(*)\Bytes Total/sec | system.network.io.bytes | rate({"system.network.io.bytes"}[5m]) |

Linux metrics to OpenTelemetry

| Linux source | OpenTelemetry equivalent | PromQL query |
|---|---|---|
| /proc/stat (CPU) | system.cpu.utilization | avg by (state) ({"system.cpu.utilization"}) |
| /proc/meminfo | system.memory.usage | sum by (state) ({"system.memory.usage"}) |
| /proc/diskstats | system.disk.operations | rate({"system.disk.operations"}[5m]) |
| /proc/net/dev | system.network.io.bytes | rate({"system.network.io.bytes"}[5m]) |

Process uptime monitoring

# Count process restarts over time by command
sum by ("process.command") (count_over_time({"process.uptime", "process.command"!="__empty"}[2m]))

# Process availability over time
avg by ("process.command") (avg_over_time({"process.uptime"}[5m]))

# Detect process crashes (absence of uptime metric)
absent_over_time({"process.uptime"}[5m])

Data collection rule (DCR) integration

Configure system metric collection

When using the Azure Monitor Agent with a DCR for OpenTelemetry metrics:

# Verify data collection is working
up{job="azure-monitor-agent"}

# Check metric collection frequency (samples received in the last minute)
count_over_time({"system.cpu.utilization"}[1m])

Troubleshoot data collection issues

# Missing metrics detection
absent({"system.cpu.utilization"} offset 5m)

# Data freshness check
(time() - timestamp({"system.cpu.utilization"})) > 300  # 5 minutes old
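The freshness check above compares the current time with the timestamp of the latest sample. The same logic can be sketched in Python as follows; the function name and the 300-second default are illustrative assumptions that mirror the query, not an Azure Monitor API.

```python
import time


def is_stale(last_sample_ts, now=None, max_age_seconds=300):
    """Return True when the newest sample is older than max_age_seconds.

    Timestamps are Unix seconds, as returned by PromQL's timestamp().
    """
    if now is None:
        now = time.time()
    return (now - last_sample_ts) > max_age_seconds


# A sample that arrived 6 minutes before the evaluation time is stale.
evaluation_time = 1_700_000_360
print(is_stale(1_700_000_000, now=evaluation_time))
```

Passing the evaluation time explicitly keeps the check deterministic, which helps when replaying historical data.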

Alerting patterns for system metrics

Critical system alerts

High CPU utilization:

avg_over_time({"system.cpu.utilization"}[10m]) > 0.9

Memory exhaustion:

{"system.memory.utilization"} > 0.95

Critically low disk space:

({"system.filesystem.usage"} / {"system.filesystem.limit"}) > 0.95

High disk I/O latency:

histogram_quantile(0.95,
  rate({"system.disk.operation.time_bucket"}[5m])
) > 0.1  # 100ms

Predictive alerts

Memory leak detection:

# Memory usage growing over 1 hour
(
  {"system.memory.utilization"} - 
  {"system.memory.utilization"} offset 1h
) > 0.1  # 10% increase

Disk space trend:

# Predict disk full in 24 hours
predict_linear({"system.filesystem.usage"}[2h], 24*3600) > 
{"system.filesystem.limit"} * 0.95
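The disk trend alert relies on fitting a straight line to recent samples and extrapolating it. The Python sketch below shows that idea with an ordinary least-squares fit. It is a simplified approximation: it extrapolates from the most recent sample, whereas Prometheus evaluates predict_linear relative to the query evaluation time. The function name and all sample values are made up for illustration.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and return the
    predicted value seconds_ahead after the most recent sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = max(t for t, _ in samples)
    return slope * (last_t + seconds_ahead) + intercept


# Made-up filesystem usage in bytes, sampled hourly over a 2h window.
usage = [(0, 100e9), (3600, 101e9), (7200, 102e9)]
limit = 130e9  # hypothetical filesystem size

predicted = extrapolate_linear(usage, 24 * 3600)
# Mirrors the alert condition: predicted usage above 95% of the limit.
print(predicted > limit * 0.95)
```

A short lookback window reacts quickly but is sensitive to bursts, so the 2h range in the query is a trade-off rather than a fixed rule.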

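Best practices for system metrics

  1. Set appropriate time windows - use longer ranges (5-15 minutes) for system metrics to smooth out noise
  2. Monitor both utilization and saturation - track percentages as well as absolute values
  3. Use resource-aware aggregation - group by logical units (instance, device, mount point)
  4. Implement tiered alerting - warn at 80% utilization and raise a critical alert at 95%
  5. Account for seasonality - system usage patterns can vary by time of day and day of week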
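The tiered alerting practice in the list above can be sketched as a small classification function. This is only an illustration: the function name is hypothetical, and treating a value exactly at a threshold as having reached it is an assumption, since the alert queries in this article use a strict greater-than comparison.

```python
def classify_utilization(value, warning=0.80, critical=0.95):
    """Map a utilization ratio (0.0 to 1.0) to an alert tier.

    Assumption: a value equal to a threshold counts as reaching it.
    """
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


for ratio in (0.42, 0.85, 0.97):
    print(ratio, classify_utilization(ratio))
```

Checking the critical tier first ensures a value above both thresholds is reported once, at the higher severity.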

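Troubleshoot system metric queries

Common issues

Metrics not appearing:

  • Verify that the Azure Monitor Agent is configured for OpenTelemetry metrics
  • Check that the DCR includes system metric collection
  • Make sure dotted metric names use the proper UTF-8 quoting

Inconsistent values:

  • Verify metric temporality (cumulative vs. delta)
  • Use the appropriate aggregation function
  • Check the collection interval and the query time range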
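Temporality is a frequent source of inconsistent values, so a small Python sketch may help show the difference. For a cumulative series the increase comes from differences between consecutive samples, with a drop treated as a counter reset; for a delta series the samples are simply summed. This is a simplified model (it skips the window extrapolation that PromQL's increase() applies), and the function names and sample values are invented for illustration.

```python
def cumulative_increase(values):
    """Total increase of a cumulative counter, tolerating resets.

    A sample lower than its predecessor is treated as a restart from zero,
    so the new value itself is counted as the increase for that step.
    """
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter reset
    return total


def delta_increase(values):
    """Total increase of a delta series: each sample is already a difference."""
    return float(sum(values))


# The same activity expressed with both temporalities (made-up numbers).
cumulative_samples = [10, 15, 3, 8]   # reset between 15 and 3
delta_samples = [5, 3, 5]

print(cumulative_increase(cumulative_samples))
print(delta_increase(delta_samples))
```

Applying the cumulative logic to a delta series (or the reverse) produces numbers that look plausible but are wrong, which is why confirming temporality comes first in the list above.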

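High cardinality warnings:

  • Limit queries by instance or device first
  • Use recording rules for expensive aggregations
  • Avoid querying all dimensions at once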