
▲ 31 r/deeplearning
I trained a CLIP model from scratch on CC3M (~2.9M image-text pairs) using 2× NVIDIA A5000 GPUs. Training took around 20 hours, and I was able to fit a batch size of 160x2 (x2 for gradient accumulation). It reached 47.68% zero-shot and 78.76% linear-probe accuracy on CIFAR-10.
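
For anyone unfamiliar with the "x2 for gradient accumulation" part, here is a minimal, framework-free sketch of the idea. It is not the training code from this run. It assumes "160x2" means two micro-batches of 160 accumulated into one optimizer step, and it uses a toy one-parameter MSE model (the names `grad_mse` and `accumulated_grad` are made up for illustration). With a plain mean loss like this, averaging the two micro-batch gradients gives the same gradient as one batch of 320. With a contrastive loss like CLIP's, each micro-batch only sees its own in-batch negatives, so accumulation is not exactly the same as one larger batch.

```python
import random


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    num_steps = len(starts)
    total = 0.0
    for s in starts:
        mb_x = xs[s:s + micro_batch_size]
        mb_y = ys[s:s + micro_batch_size]
        # Scaling keeps the accumulated gradient a mean rather than a sum.
        total += grad_mse(w, mb_x, mb_y) / num_steps
    return total


random.seed(0)
# Toy data: 320 samples, i.e. 2 micro-batches of 160 (assumed reading of "160x2").
xs = [random.uniform(-1.0, 1.0) for _ in range(320)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

w = 0.5
full = grad_mse(w, xs, ys)                                  # one batch of 320
accum = accumulated_grad(w, xs, ys, micro_batch_size=160)   # 2 x 160
print(abs(full - accum))  # difference is only floating-point rounding
```

In a real training loop the same scaling is usually done by dividing the loss by the number of accumulation steps before calling backward, then stepping the optimizer once per accumulated group.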
u/Clouded_Leopard17 — 18 days ago