u/Numerous-Campaign844

r/LocalLLaMA

Smaller models are getting scary good.

I am still processing this lol.

I had Gemini 3 Pro Deepthink try to solve a complex security puzzle (which was secretly an unwinnable paradox). It spat out an incredibly professional-looking, highly structured answer after about 15 minutes of reasoning. Just for fun, I passed its solution over to Gemma 4 31B (with tools enabled).

Gemma completely tore it apart. It caught a hard physical-constraint violation and a fabricated equation that Gemini had slipped in to force an answer. It explicitly called out the fatal logic flaw and told Gemini it had been "blinded by the professionalism of the output." Brutal.

The craziest part? I fed the 31B's arguments back to Deepthink... and it immediately folded, acknowledging that its internal verification failed and its logic was broken.
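If you want to reproduce the setup, the whole thing is basically three calls. Quick sketch below (Python, illustrative only; `ask()` and the model IDs are placeholders for whatever API client or local runtime you actually use):

```python
# Minimal sketch of the three-step debate loop (illustrative only;
# `ask()` is a hypothetical stand-in, not a real library call).

def ask(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return its reply."""
    raise NotImplementedError("wire this up to your API client or local runtime")

puzzle = "...the (secretly unwinnable) security puzzle..."

# 1) The frontier model proposes a solution.
solution = ask("gemini-3-pro-deepthink", puzzle)

# 2) The small model plays reviewer (tools enabled on its side).
critique = ask(
    "gemma-4-31b",
    "Peer-review this solution. Verify every physical constraint and "
    f"every equation. Be blunt.\n\nPUZZLE:\n{puzzle}\n\nSOLUTION:\n{solution}",
)

# 3) Feed the critique back and see whether the frontier model concedes.
rebuttal = ask(
    "gemini-3-pro-deepthink",
    f"A reviewer raised these objections to your solution:\n{critique}\n\n"
    "Defend your solution or concede the flaws.",
)
print(rebuttal)
```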

I've attached the HTML log so you guys can read the whole debate. The fact that a 31B open-weight model can run an agentic peer review and bully a frontier MoE model into submission is insane to me. Check out the file.

Full conversation

TIL: A bigger model isn't smarter... well, at least not all the time.
