
Cloudflare Monitoring Cloudflare Analytics with Grafana, Prometheus, InfluxDB v2, and Thanos

Discussion in 'System Administration' started by eva2000, Oct 29, 2024 at 3:47 PM.

  1. eva2000

    eva2000 Administrator Staff Member

    53,811
    12,159
    113
    May 24, 2014
    Brisbane, Australia
    Ratings:
    +18,711
    Local Time:
    10:34 PM
    Nginx 1.27.x
    MariaDB 10.x/11.4+
    I have a monitoring system that uses Grafana, Prometheus, an older version of InfluxDB, and Telegraf to create charts and display metrics for my K6 benchmarking analytics. Recently, though, I was inspired to revisit my setup and step up my Cloudflare analytics monitoring game. So I've been expanding my existing setup to support Cloudflare HTTP traffic, WAF Firewall, Workers, R2, and Tunnel analytics for my domain zones. This includes upgrading from the older InfluxDB to InfluxDB v2's bucket format and adding Thanos so that Cloudflare R2 S3-compatible buckets can hold long-term Prometheus metrics data :D

    For Cloudflare Analytics support in Grafana, I've been using this Cloudflare Exporter from a fellow Cloudflare MVP, and for Cloudflare Tunnel metrics in Grafana I followed the guide at Monitor Cloudflare Tunnel with Grafana | Cloudflare Zero Trust docs :)

    So now I have a single setup that supports my K6 benchmarking with Telegraf/InfluxDB and also Cloudflare analytics for multiple Cloudflare domain zones :cool:

    Architecture diagram

    grafana-k6-cloudflare-architecture-01.png

    grafana-dashboard-cloudflare-03b.png
    1. Prometheus:
      • Role: Prometheus is the main monitoring and alerting toolkit in my setup. It scrapes metrics from various exporters and services, storing the time-series data in its local storage.
      • Functionality: Prometheus operates by pulling metrics at regular intervals from "exporters" (such as Node Exporter and Cloudflare Exporter) and HTTP endpoints. It supports flexible queries and alerting rules, allowing you to set conditions for alert generation.
      • Storage and Retention: Prometheus has a short-term storage design, which is ideal for recent data but less suited for long-term historical data storage. In my setup, Thanos extends Prometheus to provide a durable, scalable solution for longer retention.
    2. Thanos Sidecar:
      • Role: The Thanos Sidecar runs alongside Prometheus, extending its capabilities by shipping Prometheus' metrics data to object storage (Cloudflare R2 in my case) for long-term retention.
      • Functionality: The sidecar component monitors Prometheus and uploads data to object storage once a block is complete (every 2 hours by default in Prometheus). It also provides a gRPC API, which allows the Thanos Query component to access data from the Prometheus instance.
      • Benefits: With Thanos Sidecar, you achieve scalable and cost-effective long-term storage without altering Prometheus. It also facilitates data federation, allowing multiple Prometheus instances to be queried together, which is particularly valuable in large, distributed environments.
    3. Thanos Store Gateway:
      • Role: Thanos Store Gateway serves the historical data stored in Cloudflare R2 (or other object storage), acting as the query gateway in front of the bucket.
      • Functionality: The Store Gateway keeps a small amount of information about the blocks in object storage on local disk and fetches the series data it needs on demand, which helps read performance for repeated queries on historical data. It responds to query requests made through the Thanos Query component, providing data access even after the data has aged out of Prometheus' local storage.
      • Use Case: This component is useful for environments where a lot of historical data needs to be queried without impacting Prometheus' performance. It enables efficient access to old data without overwhelming object storage or Prometheus.
    4. Thanos Query:
      • Role: Thanos Query acts as a centralized querying layer across multiple Prometheus instances and Thanos components, providing a unified view of both real-time and historical data.
      • Functionality: It connects to various Thanos components (like Sidecar and Store Gateway) and aggregates data across all connected sources. This allows users to query data from multiple Prometheus instances as if they were a single dataset.
      • Benefits: Thanos Query enables high availability and redundancy in metric querying. If one Prometheus instance is down, Thanos Query can still fetch data from other instances, ensuring continuity in monitoring and alerting.
    5. Thanos Compactor:
      • Role: The Thanos Compactor is responsible for compacting, deduplicating, and downsampling the data stored in object storage.
      • Functionality: The Compactor periodically processes the stored metrics data, combining smaller data blocks into larger ones (compaction) and creating lower-resolution copies of older data (downsampling). By default, Thanos produces 5-minute resolution data for blocks older than 40 hours and 1-hour resolution data for blocks older than 10 days.
      • Benefits: This component keeps the number of blocks manageable, improves query performance over long time ranges, and applies the configured retention policies. Downsampled data still preserves trends; note that downsampling adds blocks next to the raw ones, so any storage savings come from the retention settings for raw data rather than from downsampling itself.
    6. InfluxDB:
      • Role: InfluxDB is a time-series database optimized for high-write loads, typically used in my setup for storing k6 test results and performance metrics. Example at k6-benchmarking/bench-ramping-vus.sh-3.md at master · centminmod/k6-benchmarking
      • Functionality: It supports flexible schemas, fast write and query speeds, and retention policies, making it ideal for high-precision time-series data like performance metrics from load testing tools.
      • Use Case: In my case, InfluxDB stores results from load tests conducted with k6, separate from Prometheus, which is more suited for infrastructure metrics. This segregation allows you to optimize data handling for both types of metrics.
    7. Telegraf:
      • Role: Telegraf is an agent that collects metrics from various sources and sends them to InfluxDB, acting as a bridge between data sources and InfluxDB.
      • Functionality: Telegraf has a broad range of plugins for collecting system metrics (CPU, memory, disk usage), application metrics, and custom metrics from k6. It can also parse data in various formats and apply transformations before writing to InfluxDB.
      • Advantages: By using Telegraf, you can consolidate multiple data sources into InfluxDB, allowing you to analyze load test results alongside other metrics collected from different systems.
    8. Grafana:
      • Role: Grafana is a visualization tool used to create dashboards and graphs for monitoring and analysis. It aggregates data from Prometheus and InfluxDB, giving you a centralized view of all metrics.
      • Functionality: Grafana allows for customizable dashboards with support for querying, alerting, and real-time data updates. It supports both PromQL (Prometheus Query Language) and Flux queries for InfluxDB, giving you flexible querying capabilities across both data sources.
      • Use Case: With Grafana, you can create unified dashboards that visualize infrastructure metrics from Prometheus, performance metrics from InfluxDB, and data from external sources like Cloudflare, all in one interface.
    9. Node Exporter:
      • Role: Node Exporter is a Prometheus exporter that exposes hardware and operating system metrics for Linux systems.
      • Functionality: It collects low-level system metrics such as CPU usage, memory, disk I/O, filesystem statistics, and network usage, exposing them in a format that Prometheus can scrape.
      • Use Case: Node Exporter provides essential insights into server health, enabling you to track system performance and identify resource bottlenecks or anomalies in real time.
    10. Cloudflare Exporter:
      • Role: The Cloudflare Exporter collects metrics from Cloudflare's API, exposing them in a format that Prometheus can scrape.
      • Functionality: This exporter retrieves data on Cloudflare’s service metrics, such as HTTP requests, cache usage, threats detected, and other Cloudflare-specific metrics that are useful for understanding application performance and security at the edge.
      • Advantages: Integrating Cloudflare metrics provides visibility into web traffic and security metrics, enabling me to monitor the health and performance of my Cloudflare-protected assets alongside other infrastructure components.
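
To make the Thanos Compactor's downsampling (item 5) a bit more concrete, here is a minimal Python sketch of the general idea: raw samples are grouped into fixed time windows and each window keeps a few aggregates instead of every point. This is a simplified illustration only, not Thanos's actual block format or code. The `downsample` function, the `Aggregate` type, the 5-minute window and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Aggregate(NamedTuple):
    """Per-window summary kept in place of the raw samples (illustrative)."""
    count: int
    total: float
    minimum: float
    maximum: float


def downsample(samples: List[Tuple[int, float]], window_s: int = 300) -> Dict[int, Aggregate]:
    """Group (unix_timestamp, value) samples into fixed windows of window_s
    seconds and keep count/sum/min/max for each window."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_s].append(value)
    return {
        start: Aggregate(len(vals), sum(vals), min(vals), max(vals))
        for start, vals in sorted(buckets.items())
    }


# Hypothetical gauge scraped every 15 seconds for 10 minutes (40 samples).
raw = [(t, float(t // 15)) for t in range(0, 600, 15)]
downsampled = downsample(raw)

for window_start, agg in downsampled.items():
    print(window_start, agg, "avg =", agg.total / agg.count)
```

Keeping count and sum (rather than a single average) means averages over longer ranges can still be computed correctly later, while min and max preserve spikes that a plain average would hide.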

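
Since Thanos Query (item 4) speaks the same HTTP query API as Prometheus, the same range query works against either one. Below is a rough Python sketch that only builds the request URL for the `/api/v1/query_range` endpoint, without sending it. The base URL, the port, the time range and the metric name `example_cloudflare_requests_total` are placeholders I made up for illustration; the real metric names depend on which exporter you run.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url: str, promql: str, start: int, end: int, step: str) -> str:
    """Build a URL for the Prometheus-compatible range query endpoint.

    start and end are Unix timestamps in seconds; step is the query
    resolution, given here as a duration string such as "5m".
    """
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Assumption: a Thanos Query (or Prometheus) HTTP endpoint at this address.
QUERY_BASE_URL = "http://localhost:10902"

# Hypothetical metric name, used only to show the shape of a PromQL query.
url = build_range_query_url(
    QUERY_BASE_URL,
    "sum(rate(example_cloudflare_requests_total[5m]))",
    start=1730160000,
    end=1730246400,
    step="5m",
)
print(url)
```

Pointing `QUERY_BASE_URL` at Thanos Query instead of a single Prometheus instance is what would let one query span both recent data (via the Sidecar) and older data in object storage (via the Store Gateway).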

     
  2. eva2000

    So anyone else using Grafana? What are you monitoring? :D
     