u/Clouded_Leopard17

I trained a CLIP model from scratch on CC3M (~2.9M image-text pairs) using 2× NVIDIA A5000 GPUs. Training took around 20 hours, and I was able to fit a batch size of 160×2 (×2 for gradient accumulation). It got 47.68% zero-shot and 78.76% linear-probe accuracy on CIFAR-10.
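
For anyone checking the batch-size arithmetic, here is a rough sketch. It assumes 160 is the per-GPU batch, the first ×2 is the two GPUs, and the second ×2 is two gradient-accumulation steps; the post does not spell this out, so treat that reading as an assumption. The function names are made up for illustration, and 2.9M is the approximate pair count quoted above.

```python
import math


def effective_batch_size(per_gpu_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Samples contributing to one optimizer update."""
    return per_gpu_batch * num_gpus * accum_steps


def optimizer_steps_per_epoch(num_samples: int, effective_batch: int) -> int:
    """Optimizer updates needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / effective_batch)


# Assumed reading of "160x2 (x2 for gradient accumulation)".
eff = effective_batch_size(per_gpu_batch=160, num_gpus=2, accum_steps=2)

# ~2.9M pairs is the approximate CC3M size quoted in the post.
steps = optimizer_steps_per_epoch(num_samples=2_900_000, effective_batch=eff)

print(f"effective batch size: {eff}")
print(f"optimizer steps per epoch: {steps}")
```

Under that reading, each optimizer update sees 640 pairs, which works out to roughly 4.5k updates per pass over the data.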

u/Clouded_Leopard17 — 18 days ago