u/Express_Quail_1493

500k context on 48gb VRAM!! - 21tok/s (coding)

I found this model hiding in the corner of huggingface: https://huggingface.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF

Looks to be tuned specifically for math, but I thought I'd give it a try since I can't run the full 120B Nemotron Super, and it holds up like a champ in agentic coding for some odd reason. I've been using it to code all my projects for a week now and it's amazing. I never would have dreamed of fitting 500k tokens of context on my potato dual TITAN RTX setup.
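
If anyone wants to try a similar setup, here's a rough sketch of how you could load a GGUF like this with llama-cpp-python and push the context up. The quant filename, context size, and tensor split below are placeholders for illustration, not my exact config, so adjust them for your own cards:

```python
# Rough sketch, not an exact setup: loading a large-context GGUF with
# llama-cpp-python (pip install llama-cpp-python). The filename, context
# size, and tensor split are placeholders -- tune them for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron-3-Super-64B-A12B-Math-REAP-Q4_K_M.gguf",  # hypothetical quant name
    n_ctx=131072,             # start here, push toward 500k if VRAM allows
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # split weights evenly across two 24GB cards
)

# Quick smoke test before pointing an agentic coding tool at it.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that parses a CSV line."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```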

If you do happen to try it, drop a comment on your experience with it: where did it break, what use case did you use it for, etc.

u/Express_Quail_1493 — 3 days ago

Aren't these single-file LLM coding tests like browserOS pretty much redundant now that most 2026 LLMs can easily handle them? In what other ways can we stress test these models on novel coding problems?

u/Express_Quail_1493 — 25 days ago

Honestly, I'm thinking that past the 70B mark most of the improvements are slim.

From 4B -> 8B is wide

8B -> 14B is still wide

14B -> 30B is nice-to-have territory

30B -> 80B is negligible

80B -> 300B or 900B is barely noticeable

What are your thoughts?

u/Express_Quail_1493 — 2 months ago