Prometheus Remote Storage Configuration Guide: Thanos and Cortex in Practice

2025/8/26 01:43:21

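Prometheus is the de facto standard for cloud-native monitoring, and its data collection and alerting capabilities have made it popular with developers and operators alike. Its local storage, however, is bounded by a single node's disk and is not designed for long-term retention of monitoring data. The usual remedy is to pair Prometheus with remote storage: monitoring data is shipped to an external storage system, which provides durability and room to scale.

This article walks through configuring remote storage for Prometheus, using two popular solutions, Thanos and Cortex, as examples, with configuration steps and practical notes for each.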

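Why use remote storage?

Before getting into configuration details, here are the main advantages of remote storage for Prometheus:

Durable storage: Prometheus writes data to local disk by default. If that disk or node is lost, the data on it is lost too. Remote storage keeps the data outside the Prometheus instance, which removes that risk.
Scalability: A single Prometheus instance can only store as much as its local disk allows, so local storage becomes a bottleneck as monitoring grows. Remote storage can scale horizontally to handle large deployments.
Global view: Sending data from multiple Prometheus instances to the same remote storage gives a single view across all of them, which simplifies unified analysis and alerting.
Long-term retention: Prometheus keeps only recent data by default (15 days unless --storage.tsdb.retention.time is changed). Remote storage can hold historical data for much longer, for trend analysis and capacity planning.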

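Thanos vs Cortex: which one to choose?

Thanos and Cortex are both CNCF incubating projects that provide scalable long-term storage for Prometheus. Each has trade-offs and suits different scenarios:

Thanos: Thanos integrates with Prometheus through a sidecar that uploads Prometheus TSDB blocks to object storage (such as AWS S3, Google Cloud Storage, or Azure Blob Storage). Its components are loosely coupled and can be added one at a time on top of existing Prometheus servers, which makes it comparatively easy to deploy and maintain. It is a natural fit for small and medium deployments, although it is also run at large scale.
Cortex: Cortex is a multi-tenant, highly available Prometheus-as-a-service platform. Prometheus pushes data to it through remote_write. Current Cortex releases use blocks storage backed by object storage (such as S3, GCS, or Azure Blob Storage); the older chunks storage engine, which relied on Cassandra, DynamoDB, or Google Bigtable, was deprecated and later removed. Cortex is a centralized service made up of several components, so it takes more configuration and management, and it suits large-scale, multi-tenant monitoring.

When choosing between Thanos and Cortex, weigh the following factors:

Monitoring scale: For a smaller setup, Thanos is a good choice. For a very large setup, or one that must support multiple tenants, Cortex is the better fit.
Operational complexity: Thanos is relatively simple to deploy and maintain, while Cortex calls for more operational experience.
Budget: Cortex runs more always-on components (Distributors, Ingesters, Queriers, and so on), so its infrastructure cost is usually higher.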

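Configuring remote storage with Thanos

The steps for setting up Thanos with Prometheus are as follows:

Deploy the Thanos Sidecar: Run the Thanos Sidecar as an additional container in the same Pod as Prometheus. The sidecar does not discover Prometheus on its own: it talks to the instance given by --prometheus.url, reads TSDB blocks from the data directory it shares with Prometheus, and uploads those blocks to object storage.

Below is an example Kubernetes Deployment for the Thanos Sidecar: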

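apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-sidecar
  labels:
    app: thanos-sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-sidecar
  template:
    metadata:
      labels:
        app: thanos-sidecar
    spec:
      containers:
        # Prometheus runs in the same Pod, so the sidecar can reach it on
        # localhost and read its TSDB blocks from the shared volume.
        # The image tag is an assumption; use the version you run. The
        # image's default prometheus.yml has no external_labels, which the
        # sidecar requires, so mount your own configuration in practice.
        - name: prometheus
          image: prom/prometheus:v2.45.0
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            # Equal min/max block durations disable local compaction,
            # which the sidecar needs when it uploads blocks.
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
          ports:
            - containerPort: 9090
              name: web
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
        - name: thanos-sidecar
          image: quay.io/thanos/thanos:v0.30.2
          args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090
            - --objstore.config-file=/etc/thanos/objstore.yml
          ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 10902
              name: http
          volumeMounts:
            # Same volume as the Prometheus container above.
            - name: prometheus-storage
              mountPath: /prometheus
            - name: objstore-config
              mountPath: /etc/thanos
      volumes:
        # emptyDir is lost when the Pod is rescheduled; use a
        # PersistentVolumeClaim for anything beyond a demo.
        - name: prometheus-storage
          emptyDir: {}
        - name: objstore-config
          secret:
            secretName: thanos-objstore-config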

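Note the following:

--tsdb.path points to the Prometheus data directory, which must be the same volume that Prometheus writes to.
--prometheus.url is the HTTP address of the Prometheus instance.
--objstore.config-file is the path to the object storage configuration file.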
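
The Querier step later in this guide reaches the sidecar at thanos-sidecar:10901, which presupposes a Kubernetes Service with that name in front of the sidecar's gRPC port. A minimal sketch of such a Service follows; it is an assumption that the Querier runs in the same namespace, and the selector relies on the app: thanos-sidecar label used in the Deployment.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-sidecar
  labels:
    app: thanos-sidecar
spec:
  type: ClusterIP
  selector:
    app: thanos-sidecar      # must match the Pod labels of the Deployment
  ports:
    - name: grpc
      port: 10901            # port the Querier dials
      targetPort: grpc       # named containerPort 10901 on the sidecar
```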

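Configure object storage: Create an object storage bucket (for example an AWS S3 bucket) and write the object storage configuration file for the Thanos Sidecar. The file must contain the bucket name, the endpoint and region, and the access credentials.

Below is an example object storage configuration file for AWS S3: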
```
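type: S3
config:
  bucket: your-s3-bucket-name
  # Thanos requires the S3 endpoint; for AWS it has the form
  # s3.<region>.amazonaws.com.
  endpoint: s3.your-s3-region.amazonaws.com
  region: your-s3-region
  # Prefer IAM roles over static keys where your platform supports them.
  access_key: your-s3-access-key
  secret_key: your-s3-secret-key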
```
Save this file as objstore.yml and mount it into the Thanos Sidecar container as a Kubernetes Secret.
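
One way to package the file is a Secret manifest whose key is the file name, so that mounting the Secret at /etc/thanos yields /etc/thanos/objstore.yml. The sketch below assumes the Secret name thanos-objstore-config used by the Deployment examples; the endpoint line is included because the Thanos S3 client expects one, and all values are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
type: Opaque
stringData:
  # The key becomes the file name inside the mounted directory.
  objstore.yml: |
    type: S3
    config:
      bucket: your-s3-bucket-name
      endpoint: s3.your-s3-region.amazonaws.com
      region: your-s3-region
      access_key: your-s3-access-key
      secret_key: your-s3-secret-key
```
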
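Configure Prometheus: No remote_write configuration is needed, because the sidecar uploads blocks directly from the Prometheus data directory. Prometheus does, however, need global.external_labels that uniquely identify the instance, and --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration should be set to the same value (2h) so that Prometheus does not compact blocks locally before they are uploaded.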
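
As an illustration of per-instance external labels, which Thanos uses to tell data from different Prometheus servers apart, here is a minimal sketch of the relevant part of prometheus.yml. The label names and values (cluster, replica, prod-eu-1, prometheus-0) are assumptions chosen for the example; what matters is that the combination is unique per Prometheus instance.

```yaml
global:
  scrape_interval: 30s
  external_labels:
    cluster: prod-eu-1      # hypothetical: which cluster this Prometheus watches
    replica: prometheus-0   # hypothetical: distinguishes HA replicas of the same Prometheus
```

If two replicas scrape the same targets, the Querier can deduplicate their series when it is started with --query.replica-label=replica.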

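Deploy the Thanos Querier: Deploy the Thanos Querier to serve queries. The Querier fans out to the StoreAPI endpoints it is given, such as the sidecars (for recent data) and, if deployed, a Thanos Store Gateway (for blocks already in object storage), and merges the results into a global view across Prometheus instances.

Below is an example Kubernetes Deployment for the Thanos Querier: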

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  labels:
    app: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
        - name: thanos-querier
          image: quay.io/thanos/thanos:v0.30.2
          args:
            - query
            - --log.level=info
            - --query.timeout=5m
            # The Querier listens for HTTP on 10902 by default; bind it to
            # 9090 so that it matches the containerPort below.
            - --http-address=0.0.0.0:9090
            # gRPC address of a StoreAPI endpoint (here, the sidecar).
            # Newer Thanos releases prefer --endpoint over --store.
            - --store=thanos-sidecar:10901
            # Add more --store flags for each sidecar.
            # The Querier does not read object storage itself, so it takes
            # no --objstore.config-file flag and needs no Secret mount.
          ports:
            - containerPort: 9090
              name: http

Note the --store flag, which gives the gRPC address of a Thanos Sidecar. Add one --store flag for each sidecar (and for any other StoreAPI endpoint, such as a Store Gateway).
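
Sidecars only serve the data still held by their Prometheus. To query blocks that have already been uploaded to object storage, Thanos provides the Store Gateway component (thanos store). Below is a minimal sketch that reuses the thanos-objstore-config Secret and image tag from the examples in this guide; the name thanos-store, the /var/thanos/store cache directory, and the emptyDir volume are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-store            # hypothetical name
  labels:
    app: thanos-store
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
    spec:
      containers:
        - name: thanos-store
          image: quay.io/thanos/thanos:v0.30.2
          args:
            - store
            # Local cache for block indexes; safe to lose, rebuilt on start.
            - --data-dir=/var/thanos/store
            - --objstore.config-file=/etc/thanos/objstore.yml
          ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 10902
              name: http
          volumeMounts:
            - name: store-cache
              mountPath: /var/thanos/store
            - name: objstore-config
              mountPath: /etc/thanos
      volumes:
        - name: store-cache
          emptyDir: {}
        - name: objstore-config
          secret:
            secretName: thanos-objstore-config
```

With a Service named thanos-store in front of port 10901, the Querier would take an additional --store=thanos-store:10901 flag.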

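Access the Thanos Querier: Expose the Querier's HTTP port through a Kubernetes Service. The Querier serves a Prometheus-compatible query API and a web UI similar to Prometheus's own, so you can query it from a browser or add it to Grafana as a Prometheus data source.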
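
A minimal sketch of such a Service is shown below. It assumes the app: thanos-querier label and the container port named http from the Querier Deployment; the Service type (ClusterIP here) is a choice that depends on how you expose services in your cluster.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-querier
  labels:
    app: thanos-querier
spec:
  type: ClusterIP
  selector:
    app: thanos-querier
  ports:
    - name: http
      port: 9090
      targetPort: http   # resolves to the container port named "http"
```

For a quick look from a workstation, kubectl port-forward svc/thanos-querier 9090:9090 makes the UI available at http://localhost:9090.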

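Configuring remote storage with Cortex

The steps for using Cortex as remote storage for Prometheus are as follows:

Deploy Cortex: Deploy Cortex in a Kubernetes cluster. This is more involved than Thanos because several components have to be configured, including:
- Ingester: receives samples from the Distributor, holds recent data in memory, and writes it to long-term storage.
- Distributor: receives the data pushed by Prometheus and routes it to the Ingesters.
- Querier: runs queries over recent data held by the Ingesters and historical data in long-term storage.
- Store Gateway: reads blocks from long-term storage on behalf of the Querier.
- Compactor: merges small blocks into larger ones, which improves query performance.
The Cortex project provides a Helm chart that simplifies deployment. See the official Cortex documentation for the detailed steps.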
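
Whichever deployment method is used, the Ingesters, Store Gateways, and Compactor all need to know where long-term storage lives. The fragment below is a rough sketch of that part of the Cortex configuration file for an S3 bucket. The bucket name, endpoint, credentials, and local directories are placeholders, and the key names should be checked against the configuration reference for the Cortex version you deploy.

```yaml
blocks_storage:
  backend: s3
  s3:
    endpoint: s3.your-s3-region.amazonaws.com
    bucket_name: your-cortex-blocks-bucket     # placeholder
    access_key_id: your-s3-access-key          # placeholder
    secret_access_key: your-s3-secret-key      # placeholder
  tsdb:
    # Local directory where Ingesters keep their in-progress TSDB.
    dir: /data/tsdb
  bucket_store:
    # Local directory where Store Gateways cache block metadata.
    sync_dir: /data/tsdb-sync
```
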
Configure Prometheus: Edit the Prometheus configuration file and add a remote_write section that pushes monitoring data to the Cortex Distributor.

Below is an example remote_write configuration for Prometheus:
```
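remote_write:
  # Use the Distributor's HTTP port here. 9095 is Cortex's default gRPC
  # port; 8080 is an assumption and depends on how Cortex is deployed.
  - url: http://cortex-distributor:8080/api/v1/push
    remote_timeout: 30s
    queue_config:
      # Samples buffered per shard before reading from the WAL blocks.
      capacity: 10000
      # Maximum number of samples in a single push request.
      max_samples_per_send: 1000
      # Send a partial batch after this long even if it is not full.
      batch_send_deadline: 5s
      # Retry backoff bounds for failed pushes.
      min_backoff: 30ms
      max_backoff: 5s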
```
Note the url field, which must point at the HTTP address of the Cortex Distributor.
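
Because Cortex is multi-tenant, each push also has to be attributed to a tenant when authentication is enabled on the Cortex side. Cortex reads the tenant ID from the X-Scope-OrgID HTTP header, and Prometheus can attach custom headers to remote_write requests. A minimal sketch, assuming a hypothetical tenant named team-a and the same Distributor address and port caveat as above:

```yaml
remote_write:
  - url: http://cortex-distributor:8080/api/v1/push   # port depends on your deployment
    headers:
      X-Scope-OrgID: team-a   # hypothetical tenant ID
```

In production the header is often injected by an authenticating reverse proxy in front of Cortex instead, so that tenants cannot impersonate each other.
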
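Access Cortex: Expose the HTTP address of the Cortex Querier (or of the Query Frontend, if deployed) through a Kubernetes Service. Cortex does not ship a query UI of its own; it serves a Prometheus-compatible query API, so the usual way to explore the data is to add it to Grafana as a Prometheus data source.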
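
As a sketch of the Grafana side, the provisioning file below registers Cortex as a Prometheus data source. The Service name cortex-querier, port 8080, and tenant team-a are assumptions, and /prometheus is the default prefix under which Cortex serves the Prometheus API, which your deployment may override.

```yaml
apiVersion: 1
datasources:
  - name: Cortex
    type: prometheus
    access: proxy
    url: http://cortex-querier:8080/prometheus   # hypothetical Service name and port
    jsonData:
      # Custom header carrying the tenant ID expected by Cortex.
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: team-a                   # hypothetical tenant ID
```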

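Common problems and solutions

You may run into the following problems when configuring remote storage for Prometheus:

Data loss: Check that the network path between Prometheus and the remote storage is healthy, and confirm that data is actually arriving: for Cortex, that remote_write requests succeed; for Thanos, that the sidecar is uploading blocks.
Query performance: If queries are slow, tune the remote storage configuration. With Cortex, for example, add Ingesters or Queriers, or adjust the Compactor settings.
Storage cost: Remote storage can become expensive. Choose a storage tier that matches your actual needs, and set a retention policy so that expired data is cleaned up regularly.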
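
For the data-loss case, it usually helps to alert on the shipping path rather than discover gaps later. The rule file below is a sketch under the assumption that Prometheus scrapes both itself and the Thanos Sidecar; the alert names, thresholds, and durations are arbitrary, and the metric names can differ between versions, so verify them on the /metrics endpoints before relying on them.

```yaml
groups:
  - name: remote-storage
    rules:
      # remote_write path (Cortex): samples that could not be sent.
      - alert: RemoteWriteFailing
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is failing to push samples to remote storage"
      # Sidecar path (Thanos): block uploads to object storage failing.
      - alert: ThanosSidecarUploadFailing
        expr: increase(thanos_shipper_upload_failures_total[1h]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Sidecar is failing to upload blocks to object storage"
```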

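Summary

Remote storage is key to making Prometheus monitoring data durable and scalable. Thanos and Cortex are both solid options, and the right choice depends on your requirements. Hopefully the configuration steps and notes in this article help you set up remote storage for Prometheus and build a stable, reliable monitoring system.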

References:

Thanos official website: https://thanos.io/
Cortex official website: https://cortexmetrics.io/

监控大师兄 Prometheus Thanos Cortex