r/grafana

▲ 38 r/grafana+4 crossposts

I built a repo of ready-to-run OpenTelemetry Collector configs (Prometheus, Jaeger, Dynatrace, Datadog, Loki, k8s), feedback welcome

I just open-sourced a collection of ready-to-run OpenTelemetry Collector configurations, because finding complete, working configs for your specific backend always takes hours of trial and error.

It now includes examples for:

  • Prometheus
  • Jaeger
  • Grafana Loki
  • Dynatrace
  • Datadog
  • Kubernetes Operator
  • Kubernetes Pod Annotation Scraping (with full relabeling)
  • Debug (no backend needed, perfect for local dev)

Each example includes Docker Compose so you can run it in 60 seconds.
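
For a taste, the debug example boils down to something like this (a sketch from memory, not the repo's exact file): OTLP in, batch, debug exporter out.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
exporters:
  debug:
    verbosity: detailed        # prints every item to stdout; local dev only
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]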

The k8s pod annotation scraping example includes relabeling for the prometheus.io/scrape, prometheus.io/port, and prometheus.io/path annotations: the config everyone googles when setting up k8s monitoring.
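
The heart of that pattern, sketched here from memory rather than copied from the repo, is a scrape job that keeps only annotated pods and rewrites the metrics path and port:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # scrape only pods annotated prometheus.io/scrape: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            # honor a custom path from prometheus.io/path
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            # point the scrape address at the port from prometheus.io/port
            # ($$ escapes a literal $ in the Collector's config expansion)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              target_label: __address__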

I also actively contribute to the OpenTelemetry open source project: I recently got PRs merged into open-telemetry/otel-arrow and have PRs open in opentelemetry-android, opentelemetry-helm-charts, and opentelemetry-dotnet-instrumentation.

https://github.com/Cloud-Architect-Emma/opentelemetry-collector-examples

Feedback and contributions welcome! ⭐ if it's useful.

#OpenTelemetry #DevOps #Observability #Kubernetes #SRE #Monitoring #CloudNative #OpenSource

u/EmmaOpu — 3 days ago
▲ 21 r/grafana

Grafana dashboard for Claude Code CLI metrics on a Prometheus-compatible backend

Hi! I'm an SRE who got pretty excited when Claude Code added the ability to emit OpenTelemetry metrics. Felt like that capability landed pretty quietly out there, so I built a Grafana dashboard on top.

It works with Claude Code's OTLP metrics stored in any Prometheus-compatible backend (Prometheus, VictoriaMetrics, Mimir, Thanos); all queries are PromQL.

https://preview.redd.it/91di760hoo0h1.png?width=1840&format=png&auto=webp&s=4f36834f24ff6f38c840ed23d37add196557e2dd

Panels: cost by model/project/user, cache hit ratio, active time, edit-decision breakdowns, leaderboards. Custom labels for per-team / per-project views via OTEL_RESOURCE_ATTRIBUTES.
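
If you're wiring this up, the per-team labels ride in on the standard OTel resource-attributes env var. A hypothetical compose-style fragment (the variable names follow Claude Code's telemetry docs, but verify against your version; the endpoint and attribute values are placeholders):

environment:
  CLAUDE_CODE_ENABLE_TELEMETRY: "1"                            # turn on OTel emission
  OTEL_METRICS_EXPORTER: "otlp"
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"    # your collector
  OTEL_RESOURCE_ATTRIBUTES: "team=platform,project=checkout"   # become metric labels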

This is a parallel implementation of dashboard 25052 by 1w2w3y, which targets Azure Application Insights via KQL; every panel is rewritten in PromQL for the OSS metrics stack. Credit to that author for the original concept.

https://preview.redd.it/8bzzqlikoo0h1.png?width=1833&format=png&auto=webp&s=0343f83bb6e092c5e6ed8e4a25496d48b07e1c90

Direct download: https://grafana.com/grafana/dashboards/25255-claude-code-metrics-prometheus/

Article: https://rockdarko.dev/posts/grafana-dashboard-for-claude-code-on-prometheus/

Repo (MIT, PRs welcome): https://github.com/rockdarko/claude-code-metrics-prometheus

Happy to answer questions about the panel queries or extend with what people want.

u/rockdarko — 1 day ago
▲ 16 r/grafana+4 crossposts

paradedb/benchmarker: a workload-agnostic, multi-backend benchmarking tool

Hi r/postgresql!

We just open-sourced ParadeDB Benchmarker, a multi-backend benchmarking framework built on top of the excellent Grafana k6 (blog post).

One of the goals was avoiding a shared query abstraction layer. PostgreSQL queries stay PostgreSQL queries, with their own driver and native SQL.

Supports PostgreSQL, Elasticsearch, OpenSearch, ClickHouse, MongoDB, and ParadeDB with:

  • mixed read/write workloads
  • support for docker-compose profiles per backend (sketched below)
  • dataset loader
  • config and setup capture
  • live metrics + exported reports
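
On the profiles point: these are plain Compose profiles, one per backend. A generic illustration, not the repo's actual file (image tags are placeholders):

services:
  postgres:
    image: postgres:16            # started only when --profile postgres is given
    profiles: ["postgres"]
  elasticsearch:
    image: elasticsearch:8.14.0   # started only when --profile elasticsearch is given
    profiles: ["elasticsearch"]

Running docker compose --profile postgres up -d then brings up just the Postgres backend.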

One of the aha moments I had building this was using the pgx Go driver in anger for the first time. I'm a Rust guy, but I'm seriously impressed with pgx and what it can do.

Any comments welcome! We'll be using this to benchmark ParadeDB, but you can write your own datasets and workloads that have nothing to do with full-text search.

u/jamesgresql — 1 day ago

Greetings. I just started setting up the LGTM stack on my k8s cluster using Alloy, and I'm using the mimir-distributed Helm chart for HA. The small sample values target 1M+ series, though, and request a lot of memory (60GB+), so I tried reducing the overall requests. I wonder if I'm missing anything else or if something will break eventually. These are my Helm values; I haven't touched any of Mimir's parameters, only set the storage backend to S3.

alertmanager:
  persistentVolume:
    enabled: true
  replicas: 2
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi
  statefulSet:
    enabled: true

compactor:
  persistentVolume:
    size: 5Gi
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 512Mi

distributor:
  replicas: 2
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi

ingester:
  persistentVolume:
    size: 10Gi
  replicas: 3
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 200m
      memory: 512Mi
  topologySpreadConstraints: {}
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - ingester
          topologyKey: "kubernetes.io/hostname"

  zoneAwareReplication:
    topologyKey: "kubernetes.io/hostname"

# all memcached caches disabled to keep the memory footprint down
chunks-cache:
  enabled: false

index-cache:
  enabled: false

metadata-cache:
  enabled: false

results-cache:
  enabled: false

minio:
  enabled: false

overrides_exporter:
  replicas: 1
  resources:
    limits:
      memory: 128Mi
    requests:
      cpu: 50m
      memory: 64Mi

querier:
  replicas: 2
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi

query_frontend:
  replicas: 2
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi

ruler:
  replicas: 2
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi

store_gateway:
  persistentVolume:
    size: 10Gi
  replicas: 3
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
  topologySpreadConstraints: {}
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - store-gateway
          topologyKey: "kubernetes.io/hostname"
  zoneAwareReplication:
    topologyKey: "kubernetes.io/hostname"

gateway:
  replicas: 1
  resources:
    limits:
      memory: 731Mi
    requests:
      cpu: 1
      memory: 512Mi

I think I should also increase Kafka's replica count, as it's currently a single point of failure, or disable it altogether and just use gRPC.

Help would be appreciated.

u/csantve — 9 days ago

I've been an early user of Grafana Assistant, but honestly it's getting very frustrating how slow and dumb it is at times. So I decided to try Claude Code instead.

I gave Claude Code context on our stack (dashboards, data sources, etc.) and made a skill out of it. With that knowledge, I ask Opus to generate the dashboard JSON directly. It one-shots it (it can't validate visually) and does better and faster than Grafana Assistant (yes, with no validation).

The only frustrating part is that you need to reimport the JSON after every edit (or make the edits manually), but honestly I'm starting to prefer it over Assistant.

Not sure if it's the model difference (Opus vs Sonnet) or just Grafana Assistant drowning in context that's mostly irrelevant.

u/Initial-Detail-7159 — 8 days ago
▲ 34 r/grafana+3 crossposts

As a Linux DevOps engineer, part of my job is making sense of complex systems through monitoring. Over the years I have worked with many different monitoring tools. About 10 years ago, I came across Prometheus, Alertmanager, Node Exporter, and Grafana. Later on I got to know Loki, Tempo, and lately Pyroscope. In my organization these tools form the core of our monitoring stack, and we run them on Kubernetes (OpenShift), managed by operators.

We use them to monitor not just OpenShift, but Linux, Oracle, Postgres, WebLogic, VMware, and Windows as well. The cool part of these tools is that, although they do quite different things, they are built with a similar philosophy: they work with labels and label filtering, integrate tightly with each other, and use Grafana as a single pane of glass.

This shared philosophy fundamentally changed the approach to observability. Instead of relying on rigid, hierarchical data structures, this entire ecosystem revolves around a multi-dimensional, label-based architecture. Whether it's a time-series metric in Prometheus, a log stream in Loki, a request trace in Tempo, or a code profile in Pyroscope, everything is tagged with the exact same key-value pairs (like app="my-service" or env="production").

Furthermore, they share a design principle of lightweight, cost-effective storage. Tools like Loki and Tempo were explicitly designed to avoid the heavy full-text indexing required by older logging and tracing backends. Instead, they only index the metadata (the labels) and push the compressed raw data into scalable object storage, like AWS S3 or MinIO.

This unified labeling taxonomy is what allows Grafana to tie everything together seamlessly. It enables seamless, contextual drill-down: you can spot a latency spike on a Prometheus metric dashboard, jump directly to the correlated Loki logs for that specific timeframe, pivot to the exact Tempo trace to see the request lifecycle, and finally open a Pyroscope Flame Graph to pinpoint the exact function or line of code causing the bottleneck. You get complete root-cause analysis without ever having to manually copy-paste trace IDs or synchronize timestamps between different tools.
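
To make that concrete with a hypothetical service (metric and label names invented for illustration), the exact same label pair selects data in both systems:

PromQL (Prometheus):
  rate(http_requests_total{app="my-service", env="production"}[5m])

LogQL (Loki):
  {app="my-service", env="production"} |= "error"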

Sometimes it's handy to have such a monitoring stack available on your Fedora workstation, to test applications locally and try out new things. I’ve put together a complete monitoring stack designed specifically to run smoothly on Fedora using rootless podman compose. I'm sharing it here so that interested folks can try out this stack as well. It’s a fully automated, production-like observability lab, all up-and-running within 10 minutes, depending on how fast your internet connection can download 19 container images ;-).

The Stack in a Nutshell:

  • Grafana: The frontend for it all, with over a dozen pre-built dashboards.
  • Prometheus & Loki: For scraping metrics and aggregating logs, with dozens of pre-configured Prometheus and Loki rules.
  • Tempo & Pyroscope: For distributed tracing and continuous profiling.
  • OpenTelemetry Collector & Grafana Alloy: For flexible, modern data ingestion.
  • Blackbox Exporter, Podman Exporter, and Node Exporter: For exposing service, node, and container stats.
  • Alerting: Alertmanager for routing alerts, webhook-tester for debugging alerts, and finally Karma and KeepHQ for visualizing alerts.
  • Traefik: As a reverse proxy with automatic TLS.
  • MinIO: Providing local, S3-compatible storage for logs, traces and profiles.
  • NGINX: Serving a static landing page.
  • Everything as Code: The entire setup, from the landing page (with a self-signed cert) to every dashboard and alert rule, is configured through code.
  • An automated validation script to verify the health of all individual components and validate the end-to-end data flows across the entire observability pipeline.

I’d love for you to check it out. You can find the repo here: https://github.com/tedsluis/monitoring
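
Getting it running should be roughly this (assuming the compose file sits in the repo root; check the README in case it ships its own launcher script):

git clone https://github.com/tedsluis/monitoring
cd monitoring
podman-compose up -d   # rootless; no sudo required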

This stack is by no means perfect or completely finished, but it works really well as a solid foundation. I’m looking for feedback from fellow Fedora users. What would you improve or do differently? Let me know what you think!

u/ted-sluis — 13 days ago
▲ 28 r/grafana+7 crossposts

I added dedicated OpenShift support to KubeShark.

Mini recap:

KubeShark is my Kubernetes skill for Claude Code and Codex.

It helps AI agents generate, review, and refactor Kubernetes manifests without falling into the usual LLM traps: missing security contexts, deprecated API versions, broken selectors, wildcard RBAC, unsafe probes, missing resource requests, and rollout configs that look okay but fail under real traffic.

The important part is that KubeShark is failure-mode-first. It does not just tell the model “write good Kubernetes”. It forces the model to reason about what can go wrong before it generates YAML, and then return validation and rollback guidance as part of the answer.

That matters a lot with Kubernetes, because many bad manifests are accepted by the API server and only fail later at runtime.

Repo: https://github.com/LukasNiessen/kubernetes-skill

---

Now what’s new:

KubeShark now has dedicated OpenShift support.

When the task involves OpenShift, OKD, ROSA, ARO, Routes, SCCs, OLM, ImageStreams, or oc, KubeShark switches into OpenShift-aware guidance.

This matters because OpenShift is Kubernetes, but with important platform behavior that generic Kubernetes YAML often ignores.

Common LLM mistakes include:

  • hardcoding runAsUser: 1000
  • assuming root-capable images will run
  • telling users to edit default SCCs
  • granting anyuid or privileged too broadly
  • using Ingress-controller annotations on OpenShift Routes
  • forgetting to validate with oc

Example guidance KubeShark now keeps in mind:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: app
spec:
  to:
    kind: Service
    name: app
  tls:
    termination: edge   # TLS ends at the router; the pod serves plain HTTP
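
In the same spirit, here is a sketch of the pod security settings it steers toward under the default restricted SCC (names and image are placeholders; note the absence of a hardcoded runAsUser, since OpenShift assigns an arbitrary UID):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0   # must tolerate a non-root, arbitrary UID
          securityContext:
            runAsNonRoot: true                  # no runAsUser; OpenShift injects one
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            seccompProfile:
              type: RuntimeDefault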

It also knows to treat OpenShift Routes, SCCs, arbitrary UID containers, and OLM-managed resources as first-class concerns.

So instead of generic Kubernetes advice, you get OpenShift-aware manifest generation and review.

u/trolleid — 11 days ago

Hello,

I'm using a few MCP servers for various areas of the business, and I've been asked to build or adopt an existing MCP server for Loki logs and, if possible, for the metrics I store in Grafana.

If you are using one for Loki, which one would you recommend? We use Grafana v13 and Loki, all in-house (OSS).

Thanks

u/Hammerfist1990 — 10 days ago
▲ 17 r/grafana

I have a cluster with 2 nodes: a control plane and a worker node.

The top 2 entries are both alloy-logs pods, each taking a whopping 420-450 MB, one per node. Is this normal? It's just a small homelab; I can't afford to lose ~1 GB to logging alone.

In total the observability namespace is using around 1.4 GB, and I thought I was being smart by skipping kube-prometheus-stack and cutting memory with a more 'lightweight' OpenTelemetry collector setup, letting Grafana Cloud carry me.

I'm using the grafana-k8s-monitoring chart:

helmCharts:
  - name: k8s-monitoring
    repo: https://grafana.github.io/helm-charts
    releaseName: grafana-k8s-monitoring
    namespace: observability
    version: "^3"
    valuesFile: values.yaml
    includeCRDs: true

with the following values.yaml:

cluster:
  name: homelab

destinations:
  - name: grafana-cloud-metrics
    type: prometheus
    url: https://<some-prod-somewhere>.grafana.net/api/prom/push
    auth:
      type: basic
      usernameKey: metrics-username
      passwordKey: token
    secret:
      create: false
      name: grafana-cloud-credentials
      namespace: observability
  - name: grafana-cloud-logs
    type: loki
    url: https://<a-log-id>.grafana.net/loki/api/v1/push
    auth:
      type: basic
      usernameKey: logs-username
      passwordKey: token
    secret:
      create: false
      name: grafana-cloud-credentials
      namespace: observability

clusterMetrics:
  enabled: true

clusterEvents:
  enabled: true

podLogs:
  enabled: true
  namespaces:
    - auth
    - traefik
    - argocd
    - whoami
    - external-secrets
    - cert-manager
    - observability

integrations:
  alloy:
    instances:
      - name: alloy
        labelSelectors:
          app.kubernetes.io/name:
            - alloy-metrics
            - alloy-singleton
            - alloy-logs

alloy-metrics:
  enabled: true
  alloy:
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        memory: 512Mi
  configReloader:
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
      limits:
        memory: 128Mi

alloy-singleton:
  enabled: true
  alloy:
    resources:
      requests:
        cpu: 25m
        memory: 128Mi
      limits:
        memory: 512Mi
  configReloader:
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
      limits:
        memory: 128Mi

alloy-logs:
  enabled: true
  alloy:
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        memory: 512Mi
  configReloader:
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
      limits:
        memory: 128Mi

Any help or suggestions would be appreciated.

u/FormationHeaven — 13 days ago
▲ 10 r/grafana

Hello all! I'm trying to set up Grafana to eventually embed it in my homepage. I've got Prometheus and Grafana installed, and everything works perfectly fine internally (see screenshot). However, when I share it externally, it shows "No Data". Any help? Thanks

  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    restart: unless-stopped
    user: "472:472"
    ports:
      - "3002:3000"
    volumes:
      - ./grafana-config/grafana:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_SECURITY_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD}
      - GF_SECURITY_ALLOW_EMBEDDING=${GF_SECURITY_ALLOW_EMBEDDING}
      - GF_AUTH_ANONYMOUS_ENABLED=${GF_AUTH_ANONYMOUS_ENABLED}
      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GF_AUTH_ANONYMOUS_ORG_ROLE}
      - GF_SECURITY_COOKIE_SAMESITE=${GF_SECURITY_COOKIE_SAMESITE}
      - GF_SECURITY_COOKIE_SECURE=${GF_SECURITY_COOKIE_SECURE}
      - GF_AUTH_ANONYMOUS_ORG_NAME=${GF_AUTH_ANONYMOUS_ORG_NAME}
      - GF_SERVER_ROOT_URL=${GF_SERVER_ROOT_URL}

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./grafana-config/prometheus/config:/etc/prometheus
      - ./grafana-config/prometheus/data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    network_mode: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"

GF_SECURITY_ADMIN_USER=---
GF_SECURITY_ADMIN_PASSWORD=---
GF_SERVER_ROOT_URL=https://xxx.xxx.com

GF_SECURITY_ALLOW_EMBEDDING=true
GF_SECURITY_COOKIE_SAMESITE=disabled
GF_SECURITY_COOKIE_SECURE=false

GF_AUTH_ANONYMOUS_ENABLED=true
GF_AUTH_ANONYMOUS_ORG_NAME="Main Org."
GF_AUTH_ANONYMOUS_ORG_ROLE="Viewer"

u/StarchyStarky — 13 days ago

Before I go away and make this a reality, I want to know if anyone has built a Grafana dashboard viewer for Apple TV, i.e. something that loads the dashboards as you'd see them online.

I don't need any interaction, other than maybe switching dashboards.

u/CatLumpy9152 — 10 days ago
▲ 4 r/grafana+1 crossposts

Hey all, I asked the other day about viewing Grafana dashboards on Apple TV.

Since then I've kind of made it work; see the linked YouTube Short where I show it off.

https://youtube.com/shorts/zW5y66pUQ5U?si=7dXIYngaJ2gkurYu

I wanted to gauge demand and desired features, e.g. do you want a dashboard picker, or should it be fixed to one dashboard? I'm going to build what I want, but it might be easy to add some of this stuff while I'm building, so I'd love to know what you'd like.

If there is some demand, I'll put my code on GitHub for everyone when I have something!

u/CatLumpy9152 — 9 days ago