
r/grafana

I built a repo of ready-to-run OpenTelemetry Collector configs (Prometheus, Jaeger, Dynatrace, Datadog, Loki, k8s), feedback welcome
I just open-sourced a collection of ready-to-run OpenTelemetry
Collector configurations, because finding complete, working configs
for your specific backend always takes hours of trial and error.
It now includes examples for:
- Prometheus
- Jaeger
- Grafana Loki
- Dynatrace
- Datadog
- Kubernetes Operator
- Kubernetes Pod Annotation Scraping (with full relabeling)
- Debug (no backend needed, perfect for local dev)
Each example includes Docker Compose so you can run it in 60 seconds.
The k8s pod annotation scraping example includes relabeling for the
prometheus.io/scrape, prometheus.io/port, and prometheus.io/path
annotations: the config everyone ends up googling when setting up k8s monitoring.
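For anyone searching for it, that annotation-driven relabeling typically looks something like the following inside the Collector's Prometheus receiver. This is a minimal sketch based on the standard annotation conventions; the job name and scrape settings are placeholders, so check the repo for the actual config. Note the `$$` escaping, since the Collector reserves single `$` for its own expansion:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Only scrape pods annotated with prometheus.io/scrape: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            # Override the metrics path from prometheus.io/path, if set
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            # Override the scrape port from prometheus.io/port, if set
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              target_label: __address__
```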
I also actively contribute to the OpenTelemetry open-source project: I
recently got PRs merged into open-telemetry/otel-arrow and have PRs
open in opentelemetry-android, opentelemetry-helm-charts, and
opentelemetry-dotnet-instrumentation.
https://github.com/Cloud-Architect-Emma/opentelemetry-collector-examples
Feedback and contributions welcome! ⭐ if it's useful.
#OpenTelemetry #DevOps #Observability #Kubernetes #SRE #Monitoring #CloudNative #OpenSource
Grafana dashboard for Claude Code CLI metrics on a Prometheus-compatible backend
Hi! I'm an SRE who got pretty excited when Claude Code added the ability to emit OpenTelemetry metrics. Felt like that capability landed pretty quietly out there, so I built a Grafana dashboard on top.
It consumes Claude Code's OTLP metrics on Prometheus-compatible backends (Prometheus, VictoriaMetrics, Mimir, Thanos), all queries in PromQL.
Panels: cost by model/project/user, cache hit ratio, active time, edit-decision breakdowns, leaderboards. Custom labels for per-team / per-project views via OTEL_RESOURCE_ATTRIBUTES.
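For a sense of what the panel queries look like, here are two sketches in the spirit of the dashboard. The metric and label names below are illustrative, not copied from the dashboard; the exact names depend on your collector and exporter settings, so check the dashboard JSON for the real ones:

```promql
# Total cost per model over the dashboard window
# (metric/label names illustrative)
sum by (model) (increase(claude_code_cost_usage_USD_total[$__range]))

# Cache hit ratio: cache-read tokens as a share of all non-output tokens
  sum(increase(claude_code_token_usage_tokens_total{type="cacheRead"}[$__range]))
/ sum(increase(claude_code_token_usage_tokens_total{type!="output"}[$__range]))
```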
Parallel implementation of dashboard 25052 by 1w2w3y, which targets Azure Application Insights via KQL. Every panel rewritten in PromQL for the OSS metrics stack. Credit to that author for the original concept.
Direct download: https://grafana.com/grafana/dashboards/25255-claude-code-metrics-prometheus/
Article: https://rockdarko.dev/posts/grafana-dashboard-for-claude-code-on-prometheus/
Repo (MIT, PRs welcome): https://github.com/rockdarko/claude-code-metrics-prometheus
Happy to answer questions about the panel queries or extend with what people want.
paradedb/benchmarker: a workload agnostic, multi-backend benchmarking tool.
Hi r/postgresql!
We just open sourced ParadeDB Benchmarker, a multi-backend benchmarking framework built on top of the excellent Grafana k6 (blog post).
One of the goals was avoiding a shared query abstraction layer. PostgreSQL queries stay PostgreSQL queries, with their own driver and native SQL.
Supports PostgreSQL, Elasticsearch, OpenSearch, ClickHouse, MongoDB, and ParadeDB with:
- mixed read/write workloads
- support for docker-compose profiles per backend
- dataset loader
- config and setup capture
- live metrics + exported reports
One of the aha moments I had building this was using the pgx Go driver in anger for the first time. I'm a Rust guy, but I'm seriously impressed with pgx and what it can do.
Any comments welcome. We will be using this to benchmark ParadeDB, but you can write your own datasets and workloads that have nothing to do with full-text search.
Greetings. I just started setting up the LGTM stack on my k8s cluster using Alloy. I'm using the mimir-distributed Helm chart for HA, but the small sample values are sized for 1M+ series and request a lot of memory (60GB+), so I tried reducing the overall requests. I wonder if I'm missing anything else or if something will break eventually. These are my Helm values. I haven't touched any of Mimir's parameters, only set the storage backend to S3.
```yaml
alertmanager:
  persistentVolume:
    enabled: true
  replicas: 2
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 50m
      memory: 128Mi
  statefulSet:
    enabled: true
compactor:
  persistentVolume:
    size: 5Gi
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 512Mi
distributor:
  replicas: 2
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
ingester:
  persistentVolume:
    size: 10Gi
  replicas: 3
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 200m
      memory: 512Mi
  topologySpreadConstraints: {}
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - ingester
          topologyKey: "kubernetes.io/hostname"
  zoneAwareReplication:
    topologyKey: "kubernetes.io/hostname"
chunks-cache:
  enabled: false
index-cache:
  enabled: false
metadata-cache:
  enabled: false
results-cache:
  enabled: false
minio:
  enabled: false
overrides_exporter:
  replicas: 1
  resources:
    limits:
      memory: 128Mi
    requests:
      cpu: 50m
      memory: 64Mi
querier:
  replicas: 2
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
query_frontend:
  replicas: 2
  resources:
    limits:
      memory: 256Mi
    requests:
      cpu: 100m
      memory: 128Mi
ruler:
  replicas: 2
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
store_gateway:
  persistentVolume:
    size: 10Gi
  replicas: 3
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
  topologySpreadConstraints: {}
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - store-gateway
          topologyKey: "kubernetes.io/hostname"
  zoneAwareReplication:
    topologyKey: "kubernetes.io/hostname"
gateway:
  replicas: 1
  resources:
    limits:
      memory: 731Mi
    requests:
      cpu: 1
      memory: 512Mi
```
I think I should also increase Kafka's replica count, as it is currently a single point of failure, or disable it altogether and just use gRPC.
Help would be appreciated.
I’ve been an early user of Grafana Assistant, but honestly it’s getting very frustrating with how slow and dumb it is at times. So, I decided to try and use Claude Code.
I gave Claude Code context on our stack (dashboards, data sources, etc) and made a skill. With that knowledge, I ask Opus to generate the dashboard JSON directly.
It's one shot (it can't validate visually), and it still gets the dashboard right better and faster than Grafana Assistant (yes, with no validation).
The only frustrating thing is that you need to reimport the JSON after every edit (or make the edits manually), but honestly I'm starting to prefer it over Assistant.
Not sure if it's the model difference (Opus vs Sonnet) or just Grafana Assistant drowning in context that's mostly irrelevant.
As a Linux DevOps engineer, part of my job is making sense of complex systems through monitoring. Over the years I have worked with many different monitoring tools. About 10 years ago, I came across Prometheus, Alertmanager, Node Exporter and Grafana. Later on I got to know Loki, Tempo, and lately Pyroscope. In my organization these tools form the core of our monitoring stack and we run them on Kubernetes (OpenShift), managed by operators. We use them to monitor not just OpenShift, but Linux, Oracle, Postgres, WebLogic, VMware and Windows as well. The cool part of these monitoring tools is that, although they do quite different things, they are built with a similar philosophy, work with labels and label filtering, integrate tightly with each other, and use Grafana as a single pane of glass.
This shared philosophy fundamentally changed the approach to observability. Instead of relying on rigid, hierarchical data structures, this entire ecosystem revolves around a multi-dimensional, label-based architecture. Whether it's a time-series metric in Prometheus, a log stream in Loki, a request trace in Tempo, or a code profile in Pyroscope, everything is tagged with the exact same key-value pairs (like app="my-service" or env="production").
Furthermore, they share a design principle of lightweight, cost-effective storage. Tools like Loki and Tempo were explicitly designed to avoid the heavy full-text indexing required by older logging and tracing backends. Instead, they only index the metadata (the labels) and push the compressed raw data into scalable object storage, like AWS S3 or MinIO.
This unified labeling taxonomy is what allows Grafana to tie everything together seamlessly. It enables seamless, contextual drill-down: you can spot a latency spike on a Prometheus metric dashboard, jump directly to the correlated Loki logs for that specific timeframe, pivot to the exact Tempo trace to see the request lifecycle, and finally open a Pyroscope Flame Graph to pinpoint the exact function or line of code causing the bottleneck. You get complete root-cause analysis without ever having to manually copy-paste trace IDs or synchronize timestamps between different tools.
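As a concrete illustration of that logs-to-traces pivot, Grafana can be provisioned with a derived field on a Loki datasource that turns a trace ID in a log line into a Tempo link. This is a minimal sketch, not the repo's actual config; the regex, URL, and datasource UID are assumptions you would adapt to your own setup:

```yaml
# Hypothetical datasource provisioning file (e.g. provisioning/datasources/loki.yaml)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract a trace ID from log lines like "... traceID=abc123 ..."
        # and link it to the Tempo datasource with uid "tempo".
        - name: TraceID
          matcherRegex: 'traceID=(\w+)'
          url: '$${__value.raw}'   # $$ escapes Grafana's env interpolation
          datasourceUid: tempo
```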
Sometimes it's handy to have such a monitoring stack available on your Fedora workstation, to test applications locally and try out new things. I’ve put together a complete monitoring stack designed specifically to run smoothly on Fedora using rootless podman compose. I'm sharing it here so that interested folks can try out this stack as well. It’s a fully automated, production-like observability lab, all up-and-running within 10 minutes, depending on how fast your internet connection can download 19 container images ;-).
The Stack in a Nutshell:
- Grafana: The frontend for it all, with over a dozen pre-built dashboards.
- Prometheus & Loki: For scraping metrics and aggregating logs, with dozens of pre-configured Prometheus and Loki rules.
- Tempo & Pyroscope: For distributed tracing and continuous profiling.
- OpenTelemetry Collector & Grafana Alloy: For flexible, modern data ingestion.
- Metrics Blackbox Exporter, Podman Exporter and Node Exporter: For exposing service, node and container stats.
- Alerting: Alertmanager for routing alerts, webhook-tester for debugging alerts, and finally Karma and KeepHQ for visualizing alerts.
- Traefik: As a reverse proxy with automatic TLS.
- MinIO: Providing local, S3-compatible storage for logs, traces and profiles.
- NGINX: Serving a static landing page.
- Everything as Code: The entire setup, from the landing page (with a self-signed cert) to every dashboard and alert rule, is configured through code.
- An automated validation script to verify the health of all individual components and validate the end-to-end data flows across the entire observability pipeline.
I’d love for you to check it out. You can find the repo here: https://github.com/tedsluis/monitoring
This stack is by no means perfect or completely finished, but it works really well as a solid foundation. I’m looking for feedback from fellow Fedora users. What would you improve or do differently? Let me know what you think!
I added dedicated OpenShift support to KubeShark.
Mini recap:
KubeShark is my Kubernetes skill for Claude Code and Codex.
It helps AI agents generate, review, and refactor Kubernetes manifests without falling into the usual LLM traps: missing security contexts, deprecated API versions, broken selectors, wildcard RBAC, unsafe probes, missing resource requests, and rollout configs that look okay but fail under real traffic.
The important part is that KubeShark is failure-mode-first. It does not just tell the model “write good Kubernetes”. It forces the model to reason about what can go wrong before it generates YAML, and then return validation and rollback guidance as part of the answer.
That matters a lot with Kubernetes, because many bad manifests are accepted by the API server and only fail later at runtime.
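A classic illustration of that failure mode (my own minimal example, not taken from the skill): a Service whose selector doesn't match any pod labels is accepted by the API server without complaint, but it never gets endpoints, so traffic fails only at runtime:

```yaml
# Accepted by the API server, broken at runtime: the selector below
# matches no pods (they are labeled app: myapp, not app: my-app),
# so the Service has zero endpoints and requests silently fail.
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```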
Repo: https://github.com/LukasNiessen/kubernetes-skill
---
Now what’s new:
KubeShark now has special dedicated OpenShift support.
When the task involves OpenShift, OKD, ROSA, ARO, Routes, SCCs, OLM, ImageStreams, or oc, KubeShark switches into OpenShift-aware guidance.
This matters because OpenShift is Kubernetes, but with important platform behavior that generic Kubernetes YAML often ignores.
Common LLM mistakes include:
- hardcoding `runAsUser: 1000`
- assuming root-capable images will run
- telling users to edit default SCCs
- granting `anyuid` or `privileged` too broadly
- using Ingress-controller annotations on OpenShift Routes
- forgetting to validate with `oc`
Example guidance KubeShark now keeps in mind:
```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: app
spec:
  to:
    kind: Service
    name: app
  tls:
    termination: edge
```
It also knows to treat OpenShift Routes, SCCs, arbitrary UID containers, and OLM-managed resources as first-class concerns.
So instead of generic Kubernetes advice, you get OpenShift-aware manifest generation and review.
Hello,
I'm using a few MCP servers for various areas of the business, and I've been asked to build, or adopt an existing, MCP server for Loki logs and, if possible, for the metrics I store in Grafana.
If you are using one for Loki, which one would you recommend? We use Grafana v13 and Loki, all in-house (OSS).
Thanks
I have a cluster with 2 nodes: a control plane and a worker node.
The top 2 lines are both alloy-logs pods, each taking a whopping 420-450 MB (one per node). Is this normal? It's just a small homelab; I can't afford ~1 GB just for logging.
In total the observability namespace is using around 1.4 GB, and I thought I was being smart by skipping kube-prometheus-stack, trying to cut memory usage with a more 'lightweight' OpenTelemetry collector setup and letting Grafana Cloud carry me.
I'm using the grafana-k8s-monitoring chart:
```yaml
helmCharts:
  - name: k8s-monitoring
    repo: https://grafana.github.io/helm-charts
    releaseName: grafana-k8s-monitoring
    namespace: observability
    version: "^3"
    valuesFile: values.yaml
    includeCRDs: true
```
with the following values.yaml:

```yaml
cluster:
  name: homelab
destinations:
  - name: grafana-cloud-metrics
    type: prometheus
    url: https://<some-prod-somewhere>.grafana.net/api/prom/push
    auth:
      type: basic
      usernameKey: metrics-username
      passwordKey: token
    secret:
      create: false
      name: grafana-cloud-credentials
      namespace: observability
  - name: grafana-cloud-logs
    type: loki
    url: https://<a-log-id>.grafana.net/loki/api/v1/push
    auth:
      type: basic
      usernameKey: logs-username
      passwordKey: token
    secret:
      create: false
      name: grafana-cloud-credentials
      namespace: observability
clusterMetrics:
  enabled: true
clusterEvents:
  enabled: true
podLogs:
  enabled: true
  namespaces:
    - auth
    - traefik
    - argocd
    - whoami
    - external-secrets
    - cert-manager
    - observability
integrations:
  alloy:
    instances:
      - name: alloy
        labelSelectors:
          app.kubernetes.io/name:
            - alloy-metrics
            - alloy-singleton
            - alloy-logs
alloy-metrics:
  enabled: true
  alloy:
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        memory: 512Mi
  configReloader:
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
      limits:
        memory: 128Mi
alloy-singleton:
  enabled: true
  alloy:
    resources:
      requests:
        cpu: 25m
        memory: 128Mi
      limits:
        memory: 512Mi
  configReloader:
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
      limits:
        memory: 128Mi
alloy-logs:
  enabled: true
  alloy:
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        memory: 512Mi
  configReloader:
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
      limits:
        memory: 128Mi
```
Any help or suggestions would be appreciated.
Hello all! I'm trying to set up Grafana to ultimately integrate into my homepage. I have Prometheus and Grafana installed, and everything works perfectly fine internally (see screenshot). However, when I share it externally, it shows "No Data". Any help? Thanks
```yaml
grafana:
  image: grafana/grafana-oss:latest
  container_name: grafana
  restart: unless-stopped
  user: "472:472"
  ports:
    - "3002:3000"
  volumes:
    - ./grafana-config/grafana:/var/lib/grafana
  environment:
    - GF_SECURITY_ADMIN_USER=${GF_SECURITY_ADMIN_USER}
    - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD}
    - GF_SECURITY_ALLOW_EMBEDDING=${GF_SECURITY_ALLOW_EMBEDDING}
    - GF_AUTH_ANONYMOUS_ENABLED=${GF_AUTH_ANONYMOUS_ENABLED}
    - GF_AUTH_ANONYMOUS_ORG_ROLE=${GF_AUTH_ANONYMOUS_ORG_ROLE}
    - GF_SECURITY_COOKIE_SAMESITE=${GF_SECURITY_COOKIE_SAMESITE}
    - GF_SECURITY_COOKIE_SECURE=${GF_SECURITY_COOKIE_SECURE}
    - GF_AUTH_ANONYMOUS_ORG_NAME=${GF_AUTH_ANONYMOUS_ORG_NAME}
    - GF_SERVER_ROOT_URL=${GF_SERVER_ROOT_URL}
prometheus:
  image: prom/prometheus:latest
  container_name: prometheus
  restart: unless-stopped
  ports:
    - "9090:9090"
  volumes:
    - ./grafana-config/prometheus/config:/etc/prometheus
    - ./grafana-config/prometheus/data:/prometheus
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
node-exporter:
  image: prom/node-exporter:latest
  container_name: node-exporter
  restart: unless-stopped
  pid: host
  network_mode: host
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - "--path.procfs=/host/proc"
    - "--path.sysfs=/host/sys"
    - "--path.rootfs=/rootfs"
    - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
```
```
GF_SECURITY_ADMIN_USER=---
GF_SECURITY_ADMIN_PASSWORD=---
GF_SERVER_ROOT_URL=https://xxx.xxx.com
GF_SECURITY_ALLOW_EMBEDDING=true
GF_SECURITY_COOKIE_SAMESITE=disabled
GF_SECURITY_COOKIE_SECURE=false
GF_AUTH_ANONYMOUS_ENABLED=true
GF_AUTH_ANONYMOUS_ORG_NAME="Main Org."
GF_AUTH_ANONYMOUS_ORG_ROLE="Viewer"
```
Before I go away and make this a reality, I want to know if anyone has built a Grafana dashboard view for Apple TV, i.e. one that loads the dashboards as you'd see them online.
I don't need any interaction other than maybe changing the dashboard.
Hey all, I asked the other day about the ability to view Grafana dashboards on Apple TV.
Since then I've kinda made it work; see the linked YouTube short where I show it off.
https://youtube.com/shorts/zW5y66pUQ5U?si=7dXIYngaJ2gkurYu
I wanted to gauge demand and feature requirements, e.g. do you want a dashboard picker, or do you want it fixed to one dashboard? I'm going to build what I want, but it might be easy to add some of this stuff while I'm at it, so I'd love to know what you'd like.
If there is some demand, I'll put my code on GitHub for everyone when I have something!