
Bit of an unusual post here: this is not about running the latest and greatest open-weight models on cutting-edge hardware. Apologies for the click-baity title, but it is 100% true.
This is a success story of using OpenAI Whisper. I teach software development at a university and it is assessment marking season. Of course we cannot run everybody's code to verify their claimed features, so students were asked to submit demo videos to showcase features and discuss design decisions.
We have very limited time to mark each submission, and some students' videos are really hard to understand, whether due to mumbling, talking really quickly (there is a time limit), poor recording quality, background noise, or heavy non-native English accents. This posed a challenge for me, as I had to focus 120% on the video just to understand what they were saying so that I could mark them fairly. I struggled to write notes and follow what they were saying simultaneously, so sometimes I had to rewind. This process was very time consuming.
To give every student a fair chance, I decided to generate subtitles for each of their videos. I did some research (asked Claude for help) and landed on OpenAI Whisper. It uses compact models (ranging from 39M to 1.5B params) to perform speech-to-text. I chose the "small" 244M model, which takes up around 2GB of VRAM. This fits comfortably in my old 2020 Razer Blade Stealth's GTX 1650 Ti (4GB).
Installation was super simple, just a single pip command. I then ran the following on each of the student submission videos using a script:

```
whisper <path-to-video> --model small.en --output_format srt --language English --fp16 False --output_dir <path-to-output-dir>
```
I did this in WSL as I already had Python installed there and I prefer working with Linux paths. To my surprise, no further setup was required for the NVIDIA GPU to be utilised correctly. On average, a 20 minute video took around 5 minutes for the subtitles to be generated on this severely power-limited laptop GPU.
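The batch script itself is just a loop over the submission folder. Here is a minimal sketch of what it might look like in Python, mirroring the CLI flags above; the directory names and the .mp4 extension are illustrative, and it assumes the `whisper` CLI from openai-whisper is on PATH:

```python
"""Batch-generate .srt subtitles for every video in a folder using the Whisper CLI."""
from pathlib import Path
import subprocess

def whisper_cmd(video: Path, out_dir: Path) -> list[str]:
    # Same flags as the command shown above: English-only "small" model,
    # SRT output, fp16 disabled.
    return [
        "whisper", str(video),
        "--model", "small.en",
        "--output_format", "srt",
        "--language", "English",
        "--fp16", "False",
        "--output_dir", str(out_dir),
    ]

def transcribe_all(video_dir: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for video in sorted(video_dir.glob("*.mp4")):  # adjust extension as needed
        srt = out_dir / (video.stem + ".srt")
        if srt.exists():
            continue  # skip videos already transcribed on a previous run
        subprocess.run(whisper_cmd(video, out_dir), check=True)

# Usage (hypothetical paths):
#   transcribe_all(Path("submissions"), Path("subtitles"))
```

The skip-if-exists check makes the script safe to re-run if it gets interrupted partway through the pile of submissions.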
The accuracy was very impressive. Even with rather thick accents, I estimate the accuracy rate to be around 80-90%.
I now have subtitles to read while I watch students' submission videos which honestly helps so much!