u/lordhiggsboson

We benchmarked our in-browser WebGPU inference engine against leading libraries and beat them all across key metrics

I'm part of Noumena Labs, a research group working on local inference improvements for running LLMs in the browser through WebGPU acceleration.

We are in the process of open-sourcing our library for embedding LLMs inside web applications, and we recently ran benchmarks against both HuggingFace's Transformers.js and MLC WebLLM. Across all the metrics we tested, we are either on par with or ahead of them in TTFT and decode speed.

Unlike other leading libraries, which use ONNX, TVM, or similar as their backend, we are building on top of GGML/llama.cpp. This gives us finer control over shader, memory, GPU, and CPU utilization. Recently, we have been contributing back to the WebGPU backend as part of our research, but the core results seen here come from our internal version of llama.cpp, which is ahead of upstream, plus a lot of scaffolding around it.

It's still early days, but the results are looking promising. Even though we have yet to open-source the code, an alpha version of the NPM package is available to play around with:

https://www.npmjs.com/package/cogentlm

If you have a chance to try it, we would love to hear feedback on your experience. If you'd like access to the code to help contribute, we are also open to fielding questions about that pre-release.

Below are the results for the Long Input and Long Output (LILIO) tests over 9 runs with 1 warmup.

| Engine | Runs | TTFT Mean | E2E Latency | Decode | TPOT Mean | 4G Repeat |
|---|---|---|---|---|---|---|
| CogentLM (Baseline) | 9 | 35.5 ms | 6,975.1 ms | 78.31 tok/s | 13.61 ms | 0.0462 |
| Transformers.js | 9 | 754.5 ms | 32,023.7 ms | 16.35 tok/s | 61.19 ms | 0.0505 |
| WebLLM | 9 | 464.9 ms | 37,294.6 ms | 14.02 tok/s | 72.79 ms | 0.3828 |
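
For anyone who wants to reproduce numbers like these, here is a rough sketch of how TTFT, E2E latency, TPOT, and decode speed are commonly derived from per-token timestamps, with warmup runs dropped before averaging. The `RunTiming` shape and the function names are made up for illustration, and this is not necessarily how the benchmark above computed its columns. The 4G Repeat column is not covered.

```typescript
// Hypothetical per-run record; all timestamps are in milliseconds.
interface RunTiming {
  requestStart: number; // when the prompt was submitted
  tokenTimes: number[]; // arrival time of each generated token
}

interface RunMetrics {
  ttftMs: number;          // time to first token
  e2eMs: number;           // request start to last token
  tpotMs: number;          // time per output token during decode
  decodeTokPerSec: number; // decode throughput
}

function computeMetrics(run: RunTiming): RunMetrics {
  const n = run.tokenTimes.length;
  if (n < 2) throw new Error("need at least two tokens to measure decode speed");
  const first = run.tokenTimes[0];
  const last = run.tokenTimes[n - 1];
  // The decode phase covers the n - 1 tokens after the first one.
  const tpotMs = (last - first) / (n - 1);
  return {
    ttftMs: first - run.requestStart,
    e2eMs: last - run.requestStart,
    tpotMs,
    decodeTokPerSec: 1000 / tpotMs,
  };
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Drop the first `warmup` runs, then average each metric over the rest.
function summarize(runs: RunTiming[], warmup = 1): RunMetrics {
  const measured = runs.slice(warmup).map(computeMetrics);
  if (measured.length === 0) throw new Error("no measured runs");
  return {
    ttftMs: mean(measured.map((m) => m.ttftMs)),
    e2eMs: mean(measured.map((m) => m.e2eMs)),
    tpotMs: mean(measured.map((m) => m.tpotMs)),
    decodeTokPerSec: mean(measured.map((m) => m.decodeTokPerSec)),
  };
}
```

When each column is averaged over runs independently like this, mean decode speed and mean TPOT need not be exact reciprocals of each other.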
u/lordhiggsboson — 7 days ago
▲ 64 r/aigamedev+2 crossposts

I vibe coded a mini wizard arena game where your prompts generate the spells (Open Source)

Hey all! I created a tiny wizard battle arena game where the gimmick is using prompts to generate the spells dynamically. The in-game prompts are processed locally by an LLM running through WebGPU. Here is my stack:

- IDE: VSCode
- Agent Harness: OpenCode
- AI Model: Opus 4.7 and GPT 5.5
- 3D assets: all created with three.js primitives and shaders
- Audio: procedural synth and ElevenLabs
- Multiplayer: peerjs
- Local AI: cogentlm
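
To give a feel for the prompt-to-spell step, here is a minimal sketch of turning a model's text reply into a bounded spell object. Everything in it is an assumption made for illustration: the `Spell` fields, the element list, and the clamping ranges are hypothetical and not taken from the game's code.

```typescript
// Hypothetical spell shape; the real game's schema may differ.
type Element = "fire" | "ice" | "lightning" | "arcane";
const ELEMENTS: Element[] = ["fire", "ice", "lightning", "arcane"];

interface Spell {
  name: string;
  element: Element;
  damage: number; // clamped to 1..100
  speed: number;  // clamped to 1..10
}

// Keep numeric fields inside a safe range; use the fallback for non-numbers.
function clamp(value: unknown, min: number, max: number, fallback: number): number {
  return typeof value === "number" && Number.isFinite(value)
    ? Math.min(max, Math.max(min, value))
    : fallback;
}

function parseSpell(raw: string): Spell | null {
  // Models often wrap JSON in prose, so take everything from the
  // first "{" to the last "}" and try to parse that.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: Record<string, unknown>;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON: the caller can retry or use a default spell
  }

  const name =
    typeof data.name === "string" && data.name.trim().length > 0
      ? data.name.trim().slice(0, 32)
      : "Unnamed Spell";
  const element = ELEMENTS.includes(data.element as Element)
    ? (data.element as Element)
    : "arcane";

  return {
    name,
    element,
    damage: clamp(data.damage, 1, 100, 10),
    speed: clamp(data.speed, 1, 10, 5),
  };
}
```

Clamping on the game side means a creative prompt can change how a spell looks and feels without letting the model hand out unlimited damage.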

You can play the game solo or with others via P2P multiplayer. Would love to get feedback and hear your thoughts. Here are the links if interested!

Play the game

See the Code (GitHub)

Happy to answer any workflow-related questions as well.

u/lordhiggsboson — 6 days ago
▲ 6 r/threejs+1 crossposts

I built a proof-of-concept web app where a small LLM is used to power a dynamic AI character.

Tools:

- OpenCode as coding harness

- GitHub Copilot (GPT 5.5/Opus 4.7) for LLMs

- VS Code for IDE

Some of the things I'm testing out are:

- Can a local LLM drive interesting behavior and allow for dynamic decision making?

- Can we make WebGPU inference good enough to drive interactions in real time?

Current Approach:

- Adapted llama.cpp, with updated WebGPU backend (WASM + C++)

- Developed a harness for the agent's thinking and decision-making loop (WASM + TS)

- Built with three.js and Next.js

- Using LFM 2.5 1.2B as the primary model to drive the character AI
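
As a rough illustration of what one step of a decision loop like this can look like, here is a sketch that builds a prompt from world state, asks a model for an action, and falls back to a safe default if the reply is slow or unusable. The action set, the `WorldState` fields, and the `Generate` callback are all hypothetical stand-ins, not the demo's actual harness or any library's API.

```typescript
// Hypothetical action set for the character.
type Action = "approach" | "speak" | "wander" | "idle";
const ACTIONS: Action[] = ["approach", "speak", "wander", "idle"];

interface WorldState {
  playerDistance: number;           // meters
  lastPlayerMessage: string | null; // null if the player said nothing
}

// Stand-in for whatever local inference call the app uses.
type Generate = (prompt: string) => Promise<string>;

function buildPrompt(state: WorldState): string {
  const heard = state.lastPlayerMessage ?? "nothing";
  return [
    `The player is ${state.playerDistance.toFixed(1)} meters away and said: ${heard}.`,
    `Reply with exactly one of: ${ACTIONS.join(", ")}.`,
  ].join("\n");
}

// Pick the first entry in ACTIONS that appears in the reply; default to idle.
function pickAction(reply: string): Action {
  const lower = reply.toLowerCase();
  return ACTIONS.find((a) => lower.includes(a)) ?? "idle";
}

// One decision step. If the model takes longer than timeoutMs, the empty
// timeout reply wins the race and the character idles for this tick.
async function decide(
  state: WorldState,
  generate: Generate,
  timeoutMs = 500
): Promise<Action> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<string>((resolve) => {
    timer = setTimeout(() => resolve(""), timeoutMs);
  });
  try {
    const reply = await Promise.race([generate(buildPrompt(state)), timeout]);
    return pickAction(reply);
  } finally {
    clearTimeout(timer);
  }
}
```

Restricting the model to a small fixed action list keeps a 1.2B-class model's output easy to validate, and the timeout keeps the render loop from stalling on a slow generation.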

Known limitations:

- Only works in Chrome, and no mobile support yet (mostly due to how we are handling threading)

- No multi-GPU support
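
Given those limitations, a page embedding something like this would probably want a capability check before loading a model. Below is a minimal sketch using the standard `navigator.gpu.requestAdapter()` call. The threading check is an assumption: WASM threads generally need `SharedArrayBuffer` and a cross-origin isolated page, but whether that is the exact constraint behind the Chrome-only limitation here is a guess. `EnvLike` and `checkSupport` are names invented for this example.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

interface EnvLike {
  navigator?: { gpu?: GpuLike };
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

interface SupportReport {
  webgpu: boolean;  // a WebGPU adapter was actually returned
  threads: boolean; // shared memory for WASM threads looks available
}

async function checkSupport(env: EnvLike): Promise<SupportReport> {
  let webgpu = false;
  const gpu = env.navigator?.gpu;
  if (gpu) {
    try {
      // requestAdapter() resolves to null when no suitable adapter exists.
      webgpu = (await gpu.requestAdapter()) !== null;
    } catch {
      webgpu = false;
    }
  }
  // Assumption: threading depends on SharedArrayBuffer, which browsers only
  // expose to cross-origin isolated pages.
  const threads =
    env.crossOriginIsolated === true &&
    typeof env.SharedArrayBuffer !== "undefined";
  return { webgpu, threads };
}

// In a browser: checkSupport(globalThis as unknown as EnvLike)
```

Passing the environment in as an argument keeps the check testable outside a browser and makes it easy to show a fallback message on unsupported devices.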

Live Demo: https://www.noumenalabs.ai/0xF3A3B1

u/lordhiggsboson — 14 days ago