u/AndromedaGambler

Hi all!

Shubham and Aryan here, putting out our first open source video language model release.

Story time: we were building video editing agents for social-media content and were using Gemini-2.5-Flash to analyse IG reels and find events in them. It worked, but at around a thousand clips/day the cost adds up, and we kept hitting content-policy refusals on perfectly fine social media clips.

We had a couple of H100s sitting around, so we put them on solving this as a side project. We kept the scope deliberately narrow: not a general VLM you can chat with, just two operations we needed in production. We're releasing it because it seems generally useful for anyone building structured-video pipelines.

The interesting work wasn't the training loop, it was the data curation. We expected to lean on the public annotated video corpora (Tarsier-Recap, ActivityNet, Charades-Ego, LSMDC, etc.) but were disappointed: in practice most of them have one-line captions and rough timestamps, and aren't annotated event by event at second-level precision.

So we wrote a teacher + pooling + human-review pipeline with Gemini-3-Flash in thinking mode and re-annotated ~400K clips from publicly available dataset mixes with fine-grained temporal captions. We then ran SFT + SimPO post-training to make the model really good at dense captioning and temporal grounding. Honestly, most of the project was making sure this data pipeline was high-quality and free of hallucinations.
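For readers who haven't met SimPO: it is a reference-free preference objective that scores each response by its length-normalised log-probability and asks the preferred response to beat the rejected one by a margin. Below is a minimal sketch of the per-pair loss in plain Python. The function name and the `beta` / `gamma` defaults are illustrative placeholders, not the values or code used to train Marlin.

```python
import math

def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO loss for a single preference pair.

    Inputs are per-token log-probabilities of the chosen and rejected
    responses under the policy being trained. beta and gamma here are
    placeholder values (assumption), not Marlin's training settings.
    """
    # Length-normalised (average) log-probability of each response.
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)

    # Scaled reward gap minus the target margin gamma.
    z = beta * (avg_chosen - avg_rejected) - gamma

    # -log(sigmoid(z)) written as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss shrinks as the chosen response becomes more likely per token than the rejected one, and no frozen reference model is needed.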

The result: Marlin is a 2B video VLM tuned for the two questions developers actually want to ask of their videos: what is happening, and when? It produces structured Scene + Event captions with second-level timestamps, and resolves natural-language queries to grounded (start, end) spans in the video. At 2B params, it's the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), and competitive with Gemini-2.5 at a fraction of the cost. We'll also release our training recipe and a new benchmark for video captioning and grounding soon.

Marlin-2B is open-sourced and comes with vLLM inference and two modes:

marlin.caption() gives a structured output of scene description and time-grounded events from a video.
marlin.find() gives (start, end) timestamps for a natural-language query over a video.
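
If you want to sanity-check returned (start, end) spans against a few hand-labelled ones, temporal IoU is a common way to do it. This is a minimal sketch only: `Span` and `temporal_iou` are hypothetical helper names for illustration, not part of the Marlin API, and the exact return type of `marlin.find()` is not assumed here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A time range in seconds (hypothetical helper type, not Marlin's)."""
    start: float
    end: float

def temporal_iou(pred: Span, ref: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(pred.end, ref.end) - max(pred.start, ref.start))
    union = (pred.end - pred.start) + (ref.end - ref.start) - inter
    return inter / union if union > 0 else 0.0
```

For example, a predicted span of 0 to 10 seconds against a labelled span of 5 to 15 seconds overlaps for 5 seconds out of a 15-second union, so the score is one third.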

Weights are open and free to use on HF. If you find it useful, or have ideas on what capabilities we should improve next for real-world use cases, we would love to hear them!!

We want to build more small, task-specific video language models like this one to enable more open-ended video analytics use cases.

This is what our results look like:

https://preview.redd.it/nowpwlotyy1h1.jpg?width=1170&format=pjpg&auto=webp&s=aa68fdde3886b8a4dfd895b6f0e0e1e1d397a282

https://preview.redd.it/stfnnkotyy1h1.jpg?width=3370&format=pjpg&auto=webp&s=2323f4dc7c4a79e54db85bf1fd940a54e353d103

https://preview.redd.it/7ifpzjotyy1h1.jpg?width=1170&format=pjpg&auto=webp&s=c721ce9e253ef628e21b0a254798a0149e6444b7

Marlin-2B: a tiny video language model to extract structured information from videos