
We built an agentless CLI tool to instantly diagnose Proxmox VE for "silent timebombs" and best practices - cv4pve-diag
Hi again all!
A huge thanks for all the great feedback on the cv4pve-report tool we put out last Monday! We absolutely loved it (the HTML export feature was actually inspired by a user on here), and we're also adding a JSON export option so you can easily compare old and new reports and see exactly what changed.
Today, for Community Showcase Day, we're presenting the second part of our internal Proxmox toolkit. While the report tool is great for documenting what is running in our clusters, we were always missing a way to run a deep health check on our setups before things actually broke.
While managing clusters, we kept running into the same problems, but they were so "silent" that we didn't notice them until it was too late: thin-provisioned pools filling up to dangerous levels without anyone knowing, stale snapshots holding VMs hostage, live migrations failing because VMs had been left on CPU type 'host' since the dawn of time, disk caches left on 'unsafe', and so on. To automate finding these problems, we built cv4pve-diag.
This is NOT a continuous monitoring daemon. cv4pve-diag is a lightweight, agentless CLI tool (Windows/Linux/macOS) that you run manually: it connects to the PVE API, performs a full configuration and health audit of your infrastructure in a few seconds, and exits. What you get is a one-time snapshot of the current state of your cluster.
What kind of checks does it perform?
- Storage and Snapshots: Detect LVM-thin/ZFS overcommits, dangling snapshots, or "lost" virtual disks no longer used by any VMs or CTs.
- VM/CT best practices: Identify inconsistent VM/CT CPU types across nodes (which can break live migration), unsafe disk caching, or missing keyctl for LXC nested virtualization.
- Cluster configuration and health: Inspect Corosync configs, cluster quorum, and network mismatches.
- Output options: Assign a health score to your nodes and VMs and output all warnings and critical messages as Text, JSON, HTML, Markdown, or Excel.
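
To give a rough idea of what two of the VM checks above boil down to, here is a minimal Python sketch. It is not cv4pve-diag's actual code. It assumes you already have a VM's config as a dict in the shape the PVE API returns for a QEMU guest (for example `cpu: host` and `scsi0: local-lvm:vm-100-disk-0,cache=unsafe,size=32G`). The function name `audit_vm_config`, the severity labels, and the message texts are made up for illustration.

```python
import re

# Config keys that describe VM disks (scsi0, virtio1, sata0, ide2, ...).
DISK_KEY = re.compile(r"^(scsi|virtio|sata|ide)\d+$")


def audit_vm_config(vmid, config):
    """Return a list of (vmid, severity, message) findings for one VM config.

    Hypothetical helper: severities and messages are illustrative only.
    """
    findings = []

    # The cpu value may look like "host", "host,flags=+aes" or "cputype=host".
    cpu_type = str(config.get("cpu", "")).split(",")[0]
    if cpu_type.startswith("cputype="):
        cpu_type = cpu_type[len("cputype="):]
    if cpu_type == "host":
        findings.append(
            (vmid, "warning",
             "CPU type 'host' may break live migration between different CPUs")
        )

    for key, value in config.items():
        if not DISK_KEY.match(key):
            continue
        # The first comma-separated part is the volume; the rest are key=value options.
        parts = str(value).split(",")[1:]
        options = dict(p.split("=", 1) for p in parts if "=" in p)
        if options.get("cache") == "unsafe":
            findings.append(
                (vmid, "critical", f"{key}: disk cache mode is 'unsafe'")
            )

    return findings


if __name__ == "__main__":
    sample = {
        "cpu": "host",
        "scsi0": "local-lvm:vm-100-disk-0,cache=unsafe,size=32G",
        "ide2": "none,media=cdrom",
        "memory": 2048,
    }
    for finding in audit_vm_config(100, sample):
        print(finding)
```

A real audit would loop this over every VM on every node and add the storage, snapshot, and cluster checks on top; the sketch only shows the flavor of a single config rule.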
Discussion: how do you perform configuration auditing beyond the standard continuous CPU/RAM monitoring in Grafana? What are the worst "silent timebombs" or gotchas in your Proxmox infrastructure a point-in-time diagnostics tool should discover?
The GitHub repo is here if you want to audit your cluster: https://github.com/Corsinvest/cv4pve-diag
Thank you again for your support!