Not promoting anything yet, just trying to understand if this is a real pain point.
I’m building a small tool for Google Colab users who train models for long sessions.
The idea: a Python package you add to your notebook. It monitors your training run, tracks the latest checkpoint, backs it up to Drive, and shows a small dashboard with runtime status. If the runtime stops sending heartbeats, it marks the run as possibly disconnected or likely lost, so you know whether to reconnect or resume from the latest checkpoint.
It would not try to bypass Colab limits or auto-click anything. The goal is just to avoid losing work.
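To make the idea concrete, here's a rough sketch of the core monitoring logic I have in mind. Everything here is hypothetical (the class name, thresholds, and `.ckpt` glob are placeholders, and "Drive" is just the mounted backup path in Colab), so treat it as a shape, not an API:

```python
import shutil
import time
from pathlib import Path


class RunMonitor:
    """Hypothetical sketch: heartbeat tracking plus checkpoint backup.

    In Colab, backup_dir would point somewhere under the mounted
    /content/drive; here it's just any writable directory.
    """

    def __init__(self, ckpt_dir, backup_dir, stale_after=120.0):
        self.ckpt_dir = Path(ckpt_dir)
        self.backup_dir = Path(backup_dir)
        self.backup_dir.mkdir(parents=True, exist_ok=True)
        self.stale_after = stale_after  # seconds of silence before we worry
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # Called periodically from the training loop (e.g. every N steps).
        self.last_beat = time.monotonic()

    def status(self):
        # "alive" -> recent heartbeat; escalate as silence grows.
        age = time.monotonic() - self.last_beat
        if age < self.stale_after:
            return "alive"
        if age < 2 * self.stale_after:
            return "possibly disconnected"
        return "likely lost"

    def latest_checkpoint(self):
        # Newest checkpoint by modification time; None if there are none yet.
        ckpts = sorted(self.ckpt_dir.glob("*.ckpt"),
                       key=lambda p: p.stat().st_mtime)
        return ckpts[-1] if ckpts else None

    def backup_latest(self):
        # Copy the newest checkpoint to the backup dir, preserving metadata.
        ckpt = self.latest_checkpoint()
        if ckpt is not None:
            shutil.copy2(ckpt, self.backup_dir / ckpt.name)
        return ckpt
```

The dashboard part would just poll `status()` and the backup timestamp; the two-tier "possibly disconnected" vs "likely lost" split is meant to separate "maybe reconnect" from "resume from checkpoint".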
Would this be useful to you? What would you want it to alert on or check first: checkpoints, Drive backup, runtime disconnects, GPU/RAM, or something else?