u/Franklupog

Hi again all!

A huge thanks for all the great feedback on the cv4pve-report tool we released last Monday! We absolutely loved it (the HTML export feature was actually inspired by a user on here), and we're also adding a JSON export option so you can easily compare old and new reports to see exactly what changed.

Today, for Community Showcase Day, we're presenting the second part of our internal Proxmox toolkit. While the report tool is excellent for documenting what is running in our clusters, we were always missing something that could run a deep health check on our setups before things actually broke.

While managing clusters, we kept running into the same problems, but they were so "silent" that we didn't notice them until it was too late: thin-provisioned pools filling up to dangerous levels without anyone noticing, stale snapshots holding VMs hostage, live migrations failing because VMs had been left on CPU type 'host' since the dawn of time, disk caches left on 'unsafe', and so on. To automate finding these problems, we built cv4pve-diag.

This is NOT a continuous monitoring daemon. It provides a one-time snapshot of the current state of your cluster. cv4pve-diag is a lightweight, agentless CLI tool (Windows/Linux/macOS) that you run manually: it connects to the PVE API, audits the configuration and health of your infrastructure as it is right now, and exits, all within a few seconds.
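For anyone wondering what "agentless" means in practice, here is a minimal Python sketch of talking to the PVE REST API with an API token. It is an illustration only, not cv4pve-diag's code: the function name `build_pve_request`, the host and the token values are placeholders made up for this example, and the request is only built, never sent.

```python
import urllib.request


def build_pve_request(host, token_id, token_secret, path):
    """Build (but do not send) a GET request for the PVE REST API.

    token_id has the form USER@REALM!TOKENNAME; all values used below
    are placeholders, not real credentials.
    """
    url = f"https://{host}:8006/api2/json/{path.lstrip('/')}"
    req = urllib.request.Request(url, method="GET")
    # PVE API tokens are passed in the Authorization header as
    # PVEAPIToken=USER@REALM!TOKENNAME=SECRET
    req.add_header("Authorization", f"PVEAPIToken={token_id}={token_secret}")
    return req


# Example: list all guests known to the cluster (nothing is sent here).
req = build_pve_request(
    "pve.example.com",
    "audit@pve!diag",
    "00000000-0000-0000-0000-000000000000",
    "cluster/resources?type=vm",
)
```

Sending the request with `urllib.request.urlopen(req)` and decoding the JSON body would return the payload under a top-level `data` key. Nothing has to be installed on the nodes themselves, which is the whole point of the agentless approach.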

What kinds of checks does it perform?

Storage and Snapshots: Detect LVM-thin/ZFS overcommits, dangling snapshots, or "lost" virtual disks no longer used by any VMs or CTs.
VM/CT best practices: Identify inconsistent VM/CT CPU types across nodes (which can break live migration), unsafe disk caching, or a missing keyctl option for nested virtualization in LXC.
Cluster configuration and health: Inspect Corosync configs, cluster quorum, and network mismatches.
Output options: Assign your nodes and VMs a health score and output all warnings and critical messages as Text, JSON, HTML, Markdown or Excel.
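To give a rough idea of what two of these checks boil down to, here is a small Python sketch that inspects a VM config dictionary of the kind the PVE API returns for a guest. This is an assumption-laden illustration, not the tool's implementation: the function names `audit_vm_config` and `overcommit_ratio`, the severity labels and the sample values are all made up for the example.

```python
import re

# Config keys that describe VM disks, e.g. scsi0, virtio1, sata0, ide2.
DISK_KEY = re.compile(r"(scsi|virtio|sata|ide)\d+")


def audit_vm_config(vmid, config):
    """Return a list of (severity, message) findings for one VM config."""
    findings = []

    # The cpu value looks like "host", "cputype=host" or "host,flags=+aes".
    cpu_type = str(config.get("cpu", "")).split(",")[0]
    if cpu_type.startswith("cputype="):
        cpu_type = cpu_type[len("cputype="):]
    if cpu_type == "host":
        findings.append(
            ("warning", f"VM {vmid}: CPU type 'host' can break live migration")
        )

    # Disk values look like "storage:volume,cache=unsafe,size=32G".
    for key, value in config.items():
        if DISK_KEY.fullmatch(key) and "cache=unsafe" in str(value).split(","):
            findings.append(("critical", f"VM {vmid}: {key} uses cache=unsafe"))

    return findings


def overcommit_ratio(provisioned_bytes, pool_bytes):
    """Ratio of space promised to guests versus real thin-pool capacity."""
    return sum(provisioned_bytes) / pool_bytes


sample = {
    "cpu": "host",
    "scsi0": "local-lvm:vm-100-disk-0,cache=unsafe,size=32G",
    "net0": "virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0",
}
findings = audit_vm_config(100, sample)
```

With the sample config above, `findings` holds one warning for the CPU type and one critical entry for `scsi0`. The `net0` line is ignored because only disk keys are inspected. A ratio above 1.0 from `overcommit_ratio` means the pool has promised more space than it physically has.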

Discussion: How do you perform configuration auditing beyond the standard continuous CPU/RAM monitoring in Grafana? What are the worst "silent timebombs" or gotchas in your Proxmox infrastructure that a point-in-time diagnostics tool should discover?

The GitHub repo is here if you want to audit your cluster: https://github.com/Corsinvest/cv4pve-diag

Thank you again for your support!

Hello all! For Community Showcase Day, we wanted to share an open-source tool we've been building to solve a recurring problem we've faced when administering Proxmox clusters at scale.

We often hear from sysadmins moving from VMware to Proxmox VE that they miss having a global inventory extractor for capacity management and compliance audits, similar to RVTools. With this in mind, we built cv4pve-report.

It's entirely agentless: it reads from the native PVE REST API, then combines everything into a structured, cross-linked spreadsheet (or, if you want a huge CSV dump to feed into big-data processing, plain CSV is also possible).

Some of the main technical points we aimed for:

Deep extraction: Not only simple CPU/RAM figures, but also RRD historical metrics, SDN configuration and rules, firewall rules, existing snapshots (with retained RAM state), and even physical disk SMART telemetry.
API load management: As you might imagine, querying very dense clusters through the PVE API hits the PVE daemon pretty hard, so we implemented three scanning modes (Fast, Standard, Full) that trade scan depth against API load on PVE.
Network topology: Generates an SVG vector diagram depicting the whole network setup within the DC, from the physical NICs up to the leaf VMs, including bonds, bridges, etc.
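To illustrate the depth-versus-load trade-off behind the scanning modes, here is a small Python sketch that plans per-guest API calls for each mode. The mode names come from the tool, but everything else is an assumption for illustration: which endpoints each mode touches is invented here and is not taken from cv4pve-report, and `plan_requests` is a hypothetical helper.

```python
from enum import Enum


class ScanMode(Enum):
    FAST = "fast"
    STANDARD = "standard"
    FULL = "full"


# Hypothetical mapping: deeper modes add more per-guest endpoints.
# The real tool may split its modes along entirely different lines.
PER_GUEST_ENDPOINTS = {
    ScanMode.FAST: ["config"],
    ScanMode.STANDARD: ["config", "snapshot"],
    ScanMode.FULL: ["config", "snapshot", "rrddata"],
}


def plan_requests(mode, guests):
    """Return the API paths to query for (node, kind, vmid) tuples.

    kind is "qemu" for VMs or "lxc" for containers.
    """
    return [
        f"nodes/{node}/{kind}/{vmid}/{endpoint}"
        for node, kind, vmid in guests
        for endpoint in PER_GUEST_ENDPOINTS[mode]
    ]


guests = [("pve1", "qemu", 100), ("pve2", "lxc", 200)]
fast_plan = plan_requests(ScanMode.FAST, guests)
full_plan = plan_requests(ScanMode.FULL, guests)
```

Under these assumptions, the number of calls grows linearly with both guest count and scan depth: one call per guest in the fast plan, three per guest in the full plan. On a cluster with thousands of guests, that multiplier is what makes a lighter mode worth having.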

Our question to you all: Let us know what you think! Any comments or feedback are greatly appreciated. How are you currently handling documentation, compliance audits and capacity planning in your PVE clusters? Which metrics or edge cases are painful to extract natively in PVE and would make a useful feature?

Here is the GitHub repo so you can grab it and play around with it: https://github.com/Corsinvest/cv4pve-report

Many thanks in advance!

We built an agentless CLI tool to instantly diagnose Proxmox VE for "silent timebombs" and best practices - cv4pve-diag