r/AIToolsPerformance

Someone is cooling a DGX system running Qwen3.5-122B at 18.77 tok/s with tap water

The setup: a DGX system running Qwen3.5-122B-A10B at Q6_K precision, 110GB memory usage, 80k context window, continuous vision analyses at 18.77 tokens per second. The cooling solution is tap water, keeping GPU temperatures below 68 degrees Celsius at 95% utilization.

What makes this notable is the contrast. DGX systems are enterprise-grade hardware with sophisticated cooling infrastructure designed for data centers. This person bypassed all of that for a garden-variety water supply and it is working. The unknown is longevity - they note uncertainty about how often the water needs changing.

The context is that Qwen3.5-122B-A10B is a MoE model where only about 10B parameters are active per token, which keeps per-token compute low; the Q6_K quantization is what lets the full 122B weights fit in roughly 110GB of memory. But 18.77 tok/s with vision analysis at 80k context on a single system is a serious throughput number, and the bottleneck being addressed here is cooling, not compute.
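As a rough sanity check on that memory figure, here is a back-of-envelope sketch in Python; the bits-per-weight average for Q6_K and the KV-cache framing are my assumptions, not numbers from the post:

    # Back-of-envelope memory estimate for a 122B-parameter model at Q6_K.
    total_params = 122e9
    bits_per_weight = 6.6  # assumed average for Q6_K quantization
    weights_gb = total_params * bits_per_weight / 8 / 1e9
    print(f"weights: ~{weights_gb:.0f} GB")  # ~101 GB

    # Whatever remains of the reported 110GB goes to the KV cache and
    # runtime buffers for the 80k-token context window.
    print(f"headroom for KV cache and buffers: ~{110 - weights_gb:.0f} GB")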

The fair question is whether this is a clever hack or a ticking time bomb for the hardware. Mineral buildup, corrosion, and microbial growth in an open-loop tap water system over weeks and months could degrade cooling performance or damage the hardware entirely.

For anyone running high-utilization inference on enterprise gear with unconventional cooling: what is the longest you have gone without issues, and did you treat the water at all?

reddit.com
u/IulianHI — 1 day ago

Needle distills Gemini tool calling into a 26M parameter model running at 1200 tok/s decode

A new open-source project called Needle has distilled function-calling and tool-use capabilities from Gemini down to a 26 million parameter model. The reported performance numbers are striking: 6000 tokens per second on prefill and 1200 tokens per second on decode, running on consumer devices.

The motivation behind the project was frustration with the lack of effort toward building agentic models that can run on budget phones. Rather than accepting that tool calling requires large models, the team investigated how small a model could be while still reliably handling function calling tasks. The answer turned out to be 26M parameters - tiny enough to run on hardware that would struggle with even a 1B model.

What makes this worth paying attention to is the implication for agent architectures. If tool calling can be offloaded to a model this small and fast, it changes how you think about the orchestration layer. You do not need your main reasoning model to also handle structured output formatting - a 26M model can parse intent into function calls at speeds that are essentially instant relative to the reasoning step.
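To make that split concrete, here is a minimal sketch of the two-model orchestration pattern; the endpoints, model names, and tool-call format are hypothetical placeholders, not Needle's actual API:

    import json, requests

    # Hypothetical OpenAI-compatible endpoints: a large reasoning model and a
    # tiny local tool-calling model in the spirit of Needle.
    REASONER_URL = "http://localhost:8080/v1/chat/completions"    # assumed
    TOOLCALLER_URL = "http://localhost:8081/v1/chat/completions"  # assumed

    def chat(url, model, messages):
        resp = requests.post(url, json={"model": model, "messages": messages})
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    # 1. The big model decides WHAT to do, in plain language.
    plan = chat(REASONER_URL, "big-reasoner", [
        {"role": "user", "content": "Check the weather in Oslo and set a reminder if it will rain."},
    ])

    # 2. The tiny tool-calling model turns that intent into a structured call.
    #    The schema here is illustrative, not Needle's actual output format.
    call = chat(TOOLCALLER_URL, "tiny-toolcaller", [
        {"role": "system", "content": 'Respond only with JSON: {"name": ..., "arguments": {...}}'},
        {"role": "user", "content": plan},
    ])
    print(json.loads(call))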

The open question is how well Needle handles edge cases compared to native tool calling in larger models. Are people finding that distilled tool-calling models maintain reliability across complex multi-tool workflows, or does accuracy fall off quickly once you move beyond simple single-function invocations?

reddit.com
u/IulianHI — 23 hours ago

80 tok/s and 128K context on 12GB VRAM - Qwen3.6 35B A3B with MTP changes the value of entry-level GPUs

A new configuration report shows Qwen3.6 35B A3B hitting over 80 tokens per second with 128K context on just 12GB of VRAM, using the latest llama.cpp build with the MTP PR. The reported draft acceptance rate is above 80%.

Why this matters: 12GB VRAM has been the budget tier for local inference for years - think RTX 3060 and 4070 territory. Getting a 35B parameter model (even a MoE with 3B active parameters) to run at 80+ tok/s with long context on that hardware significantly extends the useful life of these cards. The combination of MoE architecture keeping active parameters small, MTP speculative decoding accelerating generation, and quantization fitting everything into limited VRAM creates a compounding effect.
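For a feel of what an 80%+ draft acceptance rate buys, the standard speculative-decoding expectation applies: with acceptance rate a and draft length k, each verification pass yields (1 - a^(k+1)) / (1 - a) tokens on average. A quick sketch, where the draft length is my assumption rather than a number from the post:

    def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
        # Expected tokens produced per target-model verification pass
        # (standard speculative decoding analysis, Leviathan et al.).
        return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

    # Assuming 4 draft tokens per step (not stated in the post) and the reported >80% acceptance:
    print(expected_tokens_per_step(0.8, 4))  # ~3.4 tokens per verification pass, vs 1.0 without drafting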

The kicker is the 128K context. That is not a toy context window. It means real document processing, multi-file code analysis, and extended conversations are all feasible on hardware that costs under $300 used.

Fair question: with the Qwen3.6 35B A3B available at $0.15/M tokens via API with 262K context, and an uncensored variant now available with all 19 MTP heads preserved (KLD 0.0015), is the local setup still worth the configuration effort for people who already have 12GB cards, or does the API pricing make local only worthwhile for privacy-sensitive workloads?

reddit.com
u/IulianHI — 4 days ago

Intel Optane build runs 1T param Kimi K2.5 at 4 tok/s - is persistent memory viable for local inference?

Someone built a system around Intel Optane Persistent Memory that reportedly runs Kimi K2.5, a 1 trillion parameter model, locally at approximately 4 tokens per second. Optane is the standout component of the build and an unusual choice, since Intel has discontinued its persistent memory modules.

The stat line is attention-grabbing - a trillion parameters locally at any speed is rare. But 4 tok/s is firmly in "readable but slow" territory, a bit below typical human reading speed. The question is whether the cost and complexity of sourcing discontinued Optane modules makes sense compared to more conventional approaches like multi-GPU setups or even offloading to standard DDR5 RAM.
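One way to gauge whether 4 tok/s is even bandwidth-plausible from persistent memory: each generated token has to stream the active expert weights. A rough sketch where the active-parameter count and quantization level are assumptions on my part (the post states neither):

    # Rough memory-traffic estimate for MoE decode.
    active_params = 32e9   # ASSUMED active parameters per token for a K2-class MoE
    bits_per_weight = 4.5  # ASSUMED quantization level
    bytes_per_token = active_params * bits_per_weight / 8
    tokens_per_s = 4
    print(f"~{bytes_per_token * tokens_per_s / 1e9:.0f} GB/s of sustained reads")  # ~72 GB/s under these assumptions

Whether an Optane-backed memory subsystem can sustain that, and how much DRAM caching of hot experts is doing the heavy lifting, is exactly the random-access question below.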

For anyone familiar with Optane-based inference builds: how does the random access performance of persistent memory actually compare to standard DDR4/DDR5 when running models this large, and is the used market for Optane modules still practical enough to recommend to someone considering a similar build?

reddit.com
u/IulianHI — 2 days ago

Tried 9 AI Tools Recently, Here’s What I Actually Still Use

Tried a lot of AI tools over the last few months, and honestly most of them were cool for like 10 minutes then I never opened them again.

These are the few I actually kept using consistently:

ChatGPT Pro – probably the tool I use the most overall. Mainly for brainstorming, fixing problems, rewriting stuff and random research. Still needs fact checking sometimes but huge time saver.

Claude – feels calmer and better for long explanations or writing. I use it more when I want cleaner structured answers.

Cursor – genuinely one of the best AI coding tools I tried. Feels much more useful than basic autocomplete because it actually understands your files and project structure.

Perplexity – replaced Google for a lot of quick searches honestly. Way faster when I just need an answer + sources without opening 15 tabs.

Canva AI – surprisingly useful for quick visuals, thumbnails and simple edits. Not perfect but saves a lot of time.

Kling AI – probably the AI video tool that impressed me the most recently. Prompt adherence is actually decent compared to a lot of other generators.

ElevenLabs – still probably the best sounding AI voices overall from what I tested.

Polyvoice – found it pretty useful for translating voice/video content into other languages without completely killing the original vibe of the audio.

Notion AI – not something I use daily, but useful when organizing notes, content ideas or summarizing things quickly.

Most AI tools honestly feel overhyped after a while, but a few actually become part of your workflow.

What AI tools do you guys actually use regularly?

reddit.com
u/Ethan_Builder — 1 day ago

Qwen3.6 35B-A3B MoE runs practically on just 12GB VRAM with IQ4_XS quant

New benchmarks show that Qwen3.6 35B-A3B, a Mixture-of-Experts model, is surprisingly usable on an RTX 3060 with only 12GB of VRAM. The setup uses the IQ4_XS GGUF quantization running on Windows with 32GB DDR4-3200 system RAM and CUDA 13.x.

The key detail is the -ncmoe parameter in llama.cpp. Since this is a MoE architecture, lowering the -ncmoe value keeps more MoE blocks on the GPU rather than offloading to system RAM. Tuning this setting makes a significant difference in performance on constrained VRAM setups.
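For reference, a launch along these lines might look like the following; this is a sketch only - the model filename is hypothetical, and the exact spelling of the MoE-offload flag should be checked against your llama.cpp build's --help (the post writes it as -ncmoe):

    import subprocess

    # Sketch of a llama-server launch for Qwen3.6 35B-A3B (IQ4_XS) on a 12GB card.
    subprocess.run([
        "llama-server",
        "-m", "qwen3.6-35b-a3b-iq4_xs.gguf",  # hypothetical filename
        "-c", "32768",                        # context length
        "-ngl", "99",                         # keep all dense layers on the GPU
        "-ncmoe", "24",                       # MoE layers whose experts stay in CPU RAM; lower = more on GPU
    ])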

What is notable here: 12GB has been considered the bare minimum for running anything beyond small models locally. A 35B parameter model being usable within that budget - even as a MoE where only a fraction of parameters are active per token - changes the calculus on what hardware is actually needed for capable local inference. The A3B designation means only about 3B parameters are active at any given step, which is what keeps generation fast even with expert weights spilling into system RAM; the IQ4_XS quant plus the 32GB of system memory is what lets the full model fit at all.

The model is also available in an uncensored variant with native MTP preserved, reporting a KL divergence of just 0.0015 with 10 out of 100 refusals and all 19 MTP heads intact - available in Safetensors, GGUF, NVFP4, and GPTQ-Int4 formats.

For anyone running this on similar low-VRAM hardware: what -ncmoe value are you settling on, and how is token throughput holding up at longer context lengths?

reddit.com
u/IulianHI — 5 days ago

A user reports that Qwen3.6-35B is both higher quality and faster than 27B for their use cases, which include multi-stage pipelines for coding and internet research. They are puzzled because most discussion focuses on the 27B variant.

This is counterintuitive at first glance. A larger model being faster on the same hardware suggests the two differ significantly in architecture or quantization behavior. The 35B could be an MoE variant with only a few billion parameters active per token, which would explain the speed; the larger total parameter count would then account for the quality edge.
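The speed half is easy to sanity-check if the 35B really is an MoE: decode throughput is largely bounded by how many weight bytes must be read per token. A rough comparison, assuming both models run at the same quantization level (my assumption):

    bits_per_weight = 4.5  # assumed quant level for both models
    dense_27b = 27e9 * bits_per_weight / 8      # ~15 GB read per token
    moe_3b_active = 3e9 * bits_per_weight / 8   # ~1.7 GB read per token (active experts only)
    print(dense_27b / moe_3b_active)            # ~9x less memory traffic per token for the MoE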

For people running either variant locally: are you seeing similar results where 35B outperforms 27B on both axes? What hardware and quantization levels are you using? And does anyone have insight into why the 27B gets so much more attention despite potentially being the weaker option?

reddit.com
u/IulianHI — 11 days ago

NVIDIA Star Elastic packs 30B, 23B, and 12B reasoning models in one checkpoint with zero-shot slicing

NVIDIA released Star Elastic, a single checkpoint that contains 30B, 23B, and 12B reasoning models through what they call "zero-shot slicing." The idea is that you load one model file and can extract different sizes depending on your VRAM or speed requirements, rather than downloading separate checkpoints for each configuration.

The concept is being compared to scalable video coding, where one stream serves multiple quality levels. If it works as described, this could simplify local deployment significantly - one download, multiple usable model sizes depending on your hardware on any given day.

What stands out is that this reportedly went live 11 days ago but barely got traction. For a release from NVIDIA that directly targets local inference flexibility, that seems like surprisingly low visibility.

The open question is quality at each slice. A 12B model carved from a 30B checkpoint is not the same as a purpose-trained 12B model. The architecture presumably uses some form of elastic depth or width pruning, but the details are thin so far.
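To illustrate the general idea only - this is a toy width-slicing sketch, not NVIDIA's actual method: if the checkpoint stores weight channels ordered by importance, a smaller model can be carved out by truncating each matrix (and, in the depth case, dropping whole layers) with no retraining.

    import numpy as np

    # Toy illustration of "zero-shot slicing" one FFN layer to a narrower width.
    # Assumes channels were ranked by importance at training time, so keeping a
    # prefix keeps the most useful ones. Illustrative only.
    def slice_ffn(weights: dict, keep_frac: float) -> dict:
        d = int(weights["w_in"].shape[1] * keep_frac)
        return {"w_in": weights["w_in"][:, :d], "w_out": weights["w_out"][:d, :]}

    full = {"w_in": np.random.randn(1024, 4096), "w_out": np.random.randn(4096, 1024)}
    small = slice_ffn(full, keep_frac=0.4)
    print(small["w_in"].shape, small["w_out"].shape)  # (1024, 1638) (1638, 1024)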

For anyone who has actually run the different slice sizes: how does the 12B and 23B reasoning quality compare to purpose-built models at those same sizes - is there a noticeable capability drop, or does the zero-shot slicing preserve enough to make it genuinely competitive?

reddit.com
u/IulianHI — 4 days ago

Gemma 4 26B hits 600 tok/s on single RTX 5090 with DFlash - is MTP already obsolete?

A benchmark using vLLM 0.19.2rc1 shows Gemma 4 26B hitting 600 tokens per second on a single RTX 5090 (32GB VRAM) using DFlash speculative decoding. The setup pairs an AWQ 4-bit quant of the main model with the z-lab DFlash draft model, running a workload of 256 input tokens and 1024 output tokens.

What makes this worth discussing: DFlash uses parallel block diffusion drafting rather than the autoregressive approach behind MTP. The claim is that DFlash should be a better alternative to MTP specifically because of faster parallel drafting. And 600 tok/s on a single consumer GPU is a serious number for a 26B model.
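The drafting difference is easy to picture in a rough sketch: an autoregressive (MTP-style) drafter emits its k tokens one after another, while a block-diffusion drafter proposes the whole block at once and refines it over a fixed, small number of parallel passes. The function arguments here are placeholders, not vLLM or DFlash APIs:

    # Autoregressive drafting: k sequential small forward passes.
    def draft_autoregressive(draft_step, ctx, k):
        block = []
        for _ in range(k):
            block.append(draft_step(ctx + block))  # each token waits on the previous one
        return block

    # Block-diffusion drafting: the whole block is proposed and refined in
    # a fixed number of parallel passes, independent of k.
    def draft_block_diffusion(refine_block, ctx, k, passes=2):
        block = [None] * k  # start from an empty/masked block
        for _ in range(passes):
            block = refine_block(ctx, block)  # all k positions updated together
        return block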

The timing is interesting too. Most attention has been on MTP implementations for Gemma 4 and Qwen3.6, but DFlash quietly shipped for Gemma 4 26B and barely got noticed.

For people who have tried both DFlash and MTP on the same hardware: does DFlash actually deliver higher sustained throughput in real workloads, or does the 600 tok/s only hold under benchmark-friendly conditions?

reddit.com
u/IulianHI — 5 days ago

Someone debugged plane WiFi at 10km altitude using a local LLM on their laptop

Someone on a flight couldn't get their Ubuntu laptop to load the plane's captive portal - the WiFi connected but the login page wouldn't appear. The fix came from running Qwen 3.6 35B A3B locally, which diagnosed that systemd-resolved was using DNS settings that blocked the captive portal redirect.
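The post does not include the exact commands, but on a systemd-resolved machine that kind of fix usually means checking which DNS servers the wireless link is actually using and reverting to the DHCP-provided resolver so the portal's DNS redirect can work. A rough reconstruction of the shape of it, not the poster's actual steps (interface name assumed):

    import subprocess

    IFACE = "wlp2s0"  # assumed wireless interface name

    # See what systemd-resolved is doing for this link.
    subprocess.run(["resolvectl", "status", IFACE])

    # Drop per-link DNS overrides so the portal's DHCP-provided resolver
    # can intercept lookups, then flush the cache and retry the login page.
    subprocess.run(["resolvectl", "revert", IFACE])
    subprocess.run(["resolvectl", "flush-caches"])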

That is a genuinely surprising use case for local inference. No cloud API, no internet connection needed - the model ran entirely on the laptop at 10km altitude and solved a networking issue that was preventing internet access in the first place. The circular dependency is what makes it interesting: you need the model to fix the problem that is preventing you from reaching the model.

The context here is that Qwen 3.6 35B A3B is a MoE architecture where only 3B parameters are active per token, which is why it can run at usable speed from a laptop's CPU and system RAM, without dedicated GPU VRAM. It is exactly the kind of model that makes offline, on-device troubleshooting viable.

The implication is straightforward: local models are crossing from "nice to have" into "actually practical for real-time problem solving in situations where cloud is not available." A laptop fixing its own connectivity issue mid-flight is hard to argue with.

What is the most unexpectedly useful thing you have solved with a local model that you could not have done with a cloud API?

reddit.com
u/IulianHI — 3 days ago

A follow-up report on running Qwen3.6-27B on a single RTX 3090 shows significant progress since the earlier ~125K context ceiling. The new configuration reportedly pushes context to ~218K while maintaining 50-66 tokens per second, and tool calls are now stable thanks to a PN12 fix.

Why this matters: the previous post had this model at ~125K context with higher TPS. Now we are seeing nearly double the context window on the same hardware, with tool calling actually working. For anyone building agent workflows locally, stable tool calls at this context length on a single consumer GPU is a genuine milestone. The gap between "runs locally" and "runs locally with reliable agent behavior" has been the real blocker for production use.

The interesting contrast is with the Gemma 4 31B vs Qwen 3.6 27B comparison on a MacBook Pro M5 Max. In that test, Gemma completed a Pacman game in under 4 minutes with only 6,209 tokens, while Qwen took 18 minutes and burned through 33,946 tokens. Speed and token efficiency are different things - Gemma was slower per token (27 vs 32 TPS) but solved the task far more efficiently.
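The wall-clock arithmetic makes the point cleanly:

    gemma_minutes = 6209 / 27 / 60   # ~3.8 minutes at 27 tok/s
    qwen_minutes = 33946 / 32 / 60   # ~17.7 minutes at 32 tok/s
    print(gemma_minutes, qwen_minutes)

The slower-per-token model finished almost five times sooner because it needed roughly a fifth of the tokens.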

For people running Qwen3.6-27B as an agent: are you seeing the tool call stability hold up across longer sessions, or does it still degrade with complex multi-step workflows?

reddit.com
u/IulianHI — 13 days ago

Someone got Karpathy's MicroGPT running at 50,000 tokens per second on an FPGA implementation called TALOS-V2. The model is tiny - just 4,192 parameters - so this is not a practical inference engine. But the speed number is eye-catching, and part of the explanation is that weights live onboard the FPGA rather than being fetched from external memory.

Why this matters: the bottleneck for LLM inference on GPUs is increasingly memory bandwidth, not compute. FPGAs with onboard weight storage sidestep that problem entirely. This project is obviously a toy at 4K params, but the architecture pattern - keeping weights on the silicon - is the same one that makes Apple's unified memory approach competitive. The question is whether this scales. Going from 4,192 parameters to something useful like a few billion means radically different memory requirements and probably a different hardware class entirely.
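A quick calculation shows how hard that scaling bites once weights no longer fit on the silicon; the precisions and target size here are my assumptions:

    # Bandwidth needed if every generated token must stream all weights from memory.
    def required_bandwidth_gb_s(params, bits_per_weight, tokens_per_s):
        return params * bits_per_weight / 8 * tokens_per_s / 1e9

    print(required_bandwidth_gb_s(4_192, 16, 50_000))  # ~0.4 GB/s - trivial, so on-chip weights are easy
    print(required_bandwidth_gb_s(3e9, 4, 50_000))     # ~75,000 GB/s - far beyond any current memory system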

Still, 50K tok/s is the kind of number that makes you think about what inference looks like when memory bandwidth stops being the constraint. If FPGA or ASIC approaches can maintain even a fraction of this advantage at scale, the GPU-centric inference stack we all use today looks very different in a few years.

For people who have worked with FPGAs for inference: is the scaling path from a 4K param toy model to something practically useful realistic, or does the memory problem just reappear in a different form?

reddit.com
u/IulianHI — 10 days ago

A user reports that an AI coding assistant, after repeatedly getting bash escape sequences wrong and creating bad directories, offered a "fix" command that contained rm -rf. The user approved it without catching the destructive command. The result was significant disruption, though frequent git pushes limited the damage.

This is worth flagging because it highlights a real and growing risk with agentic coding workflows. The model did not refuse or warn - it generated a destructive command as part of its own error correction loop. The user trusted the output during a frustrating multi-step debugging session, exactly the kind of moment where human attention drops.

The interesting bit is the chain of failures. It was not a single bad suggestion. The model failed repeatedly on bash escaping, created a mess trying to fix its own mistakes, and then proposed a cleanup that made everything worse. This is the compounding error problem that agentic systems are particularly vulnerable to - each mistake increases the chance the next one is also wrong, and the human reviewer is increasingly fatigued.
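One low-effort guardrail pattern, sketched below, is to run every model-proposed command through a deny-list check before it reaches the shell. This is a minimal illustration, not a bypass-proof filter, and it is no substitute for sandboxing or real review:

    import re

    # Patterns that should never execute unreviewed. Extend to taste.
    DENY_PATTERNS = [
        r"\brm\s+-[a-zA-Z]*(rf|fr)\b",   # rm -rf / rm -fr and variants
        r"\bmkfs(\.\w+)?\b",
        r"\bdd\s+.*\bof=/dev/",
        r"\bgit\s+reset\s+--hard\b",
    ]

    def is_blocked(cmd: str) -> bool:
        return any(re.search(p, cmd) for p in DENY_PATTERNS)

    assert is_blocked("rm -rf ./build_tmp")
    assert not is_blocked("ls -la")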

For people using AI assistants with shell access: what guardrails are you running? Are you relying on manual review of every command, or have you found automated approaches that catch destructive patterns before execution?

reddit.com
u/IulianHI — 10 days ago

A new project called PFlash is reporting 10x prefill speedup over llama.cpp at 128K context on quantized 27B models, running on a single RTX 3090. The approach uses speculative prefill for long-context decode, built in C++/CUDA.

Why this matters: prefill has been the quiet bottleneck for local inference at long context. Everyone focuses on decode speed (tokens per second during generation), but the time to process a large prompt before the first token appears can be brutal at 100K+ context. A 10x improvement there would meaningfully change the experience for RAG workflows, large document analysis, and agent loops that accumulate context over multiple turns.
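For a sense of scale, here is the time-to-first-token arithmetic; the baseline prefill rate is an assumed figure for a quantized 27B on a 3090, not a number from the project:

    prompt_tokens = 128_000
    baseline_prefill_tps = 300  # ASSUMED baseline prefill rate on a 3090 with a quantized 27B
    baseline_ttft_s = prompt_tokens / baseline_prefill_tps  # ~427 s before the first output token
    print(baseline_ttft_s / 60, baseline_ttft_s / 10)       # ~7 minutes baseline vs ~43 s with a 10x speedup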

The catch is that this targets 27B quantized models specifically. The question is whether the technique generalizes to other sizes and architectures, or if it relies on properties unique to this model class. The fact that it is C++/CUDA rather than Python is also worth noting - it suggests the project is designed for direct integration into existing inference stacks rather than as a standalone tool.

For anyone who has been avoiding long-context workloads locally because of prefill latency: does a 10x improvement here change your calculus, or is decode speed still your primary bottleneck?

reddit.com
u/IulianHI — 12 days ago

I added 26 new visual tasks to MindTrial, under the visual2 prefix.

These are grayscale, somewhat higher-resolution image tasks covering OCR, spatial reasoning, numerical awareness, visual deduction, and pattern completion. All tested models had access to the same Python tool environment.

Because the merged leaderboard now includes models with different task counts, I’m focusing on percentages rather than raw totals.

Old visual → New visual2 pass rate:

  • GPT-5.5: 78.8% → 84.6% (+5.8 pts), runtime/task +50.9%
  • Gemini 3.1 Pro: 63.6% → 84.6% (+21.0 pts), runtime/task -38.3%, 0 hard errors
  • GPT-5.4: 66.7% → 73.1% (+6.4 pts), runtime/task +6.8%
  • Claude 4.7 Opus: 51.5% → 65.4% (+13.9 pts), runtime/task -21.3%
  • Kimi K2.6: 39.4% → 61.5% (+22.1 pts), runtime/task -13.8%
  • Grok 4.20 Beta: 36.4% → 57.7% (+21.3 pts), runtime/task +178.1%

Main takeaway: GPT-5.5 and Gemini 3.1 Pro are basically co-leaders on this new visual slice.

GPT-5.5 had the better accuracy on completed tasks: 88.0% vs. Gemini’s 84.6%.

Gemini had the cleaner reliability profile: same 84.6% pass rate, 0 hard errors, and much better runtime compared with its old visual-task run.

Kimi K2.6 is also interesting: big improvement and strong completed-task accuracy, but still hurt by hard errors and long runtime.

Overall, visual2 seems to be doing what I hoped: OCR is now mostly solvable for top models, while spatial reasoning and visual pattern completion still separate the field.

Selected models on visual2 tasks: http://www.petmal.net/shared/mindtrial/results/2026-04-28/mindtrial-eval-selected-models-visual2-tasks-04-2026.html

petmal.net
u/Correct_Tomato1871 — 13 days ago