how do you scale infrastructure for ai agents on a budget?
we're running an agentic pipeline that does multi-modal file processing - large files, often hundreds of mb per request. The actual agent logic works fine. but the infrastructure is not.
during peaks the queue backs up fast. But staying provisioned at peak capacity 24/7 would eat our runway during the slow periods. Standard cpu/memory-based autoscaling is the wrong signal here - gpu utilization under inference workloads doesn't behave the way normal compute does. you can have a node that looks underutilized on conventional metrics while your queue is actually backing up.
how others have handled this?