Kesha Voice Kit — fully local STT + TTS for agent stacks
I'd been annoyed for a while by the friction of plugging voice into agent workflows without round-tripping to the cloud. So I built kesha-voice-kit: a local voice toolkit written for Bun and optimized for Apple Silicon.
This CLI gets invoked by LLM agents (OpenClaw routes voice messages through it) and from shell scripts, and every kesha audio.ogg invocation pays the cold-start tax. Bun's JS startup is noticeably faster than Node's, and when an agent fires off five tool calls in parallel, those milliseconds compound. No scientific numbers here, but Bun felt instant from day one; Node felt sluggish.
The whole app is a subprocess wrapper around kesha-engine (a Rust binary): twelve Bun.* calls across six files (Bun.spawn, Bun.file, Bun.write, Bun.which). No async/sync ceremony, no pipe-handling weirdness; it's pipe-friendly by default. Writing Bun.file(path).json() feels like it should've always been this way.
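The wrapper pattern looks roughly like this. A minimal sketch, not the shipped code: the function names and the kesha-engine subcommands/flags here are assumptions for illustration.

```typescript
// Sketch of a subprocess wrapper around a Rust engine binary.
// engineArgs/runEngine and the engine's CLI surface are illustrative, not the real API.
declare const Bun: any; // ambient declaration so this sketch type-checks outside the Bun runtime

type Mode =
  | { kind: "stt"; input: string }  // transcribe an audio file
  | { kind: "tts"; text: string };  // synthesize speech from text

// Build the argv for the engine binary (subcommands assumed for illustration).
export function engineArgs(mode: Mode): string[] {
  return mode.kind === "stt"
    ? ["kesha-engine", "transcribe", mode.input]
    : ["kesha-engine", "synthesize", "--text", mode.text];
}

// Spawn the engine and collect its stdout; under Bun this is one call
// plus a Response wrapper, with no manual stream plumbing.
export async function runEngine(mode: Mode): Promise<string> {
  const proc = Bun.spawn(engineArgs(mode), { stdout: "pipe" });
  return await new Response(proc.stdout).text();
}
```

The nice part is that argv construction stays a pure, testable function, and the Bun-specific surface is confined to one spawn call.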
Voice in: NVIDIA Parakeet TDT 0.6B for speech-to-text (25 languages, not Whisper).
Voice out: Kokoro-82M for English, Piper for Russian. Auto-routed by detected text language — just kesha say "Привет" and it picks Piper automatically.
Fully on-device — no cloud, no API keys, no telemetry. Ships as an npm package + a ~20 MB Rust engine binary; first-class on macOS arm64 (CoreML via FluidAudio), also runs on Linux and Windows x64 (ONNX).
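The auto-routing between engines can be as small as a script check on the text. This is a hypothetical sketch assuming a Unicode-script heuristic; kesha's actual language detector may work differently.

```typescript
// Hypothetical sketch of TTS engine routing by detected text language.
// The Cyrillic heuristic below is an assumption, not kesha's real detector.
type TtsEngine = "kokoro" | "piper";

export function pickEngine(text: string): TtsEngine {
  // Route text containing Cyrillic to Piper (Russian voice),
  // everything else to Kokoro (English voice).
  return /\p{Script=Cyrillic}/u.test(text) ? "piper" : "kokoro";
}
```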
Numbers (M3 Pro)
Compared against whisper large-v3-turbo:
- ~15× faster on M3 Pro (CoreML / Apple Neural Engine)
- ~2.5× faster on CPU
- Real-time factor small enough for live dictation and responsive voice UX
Full methodology, fixtures, and exact commands in BENCHMARK.md.
OpenClaw agents receive voice on Telegram/WhatsApp/Slack today but can only reply in text. Kesha closes that loop:
```shell
bun install -g @drakulavich/kesha-voice-kit
brew install espeak-ng
kesha install --tts                  # one-time, opt-in (~390 MB)
kesha voice.ogg                      # transcribe a Russian voice message
kesha say "Hello World" > reply.wav  # and talk back
```
The existing OpenClaw plugin path already hooks into tools.media.audio.models for input; the output side is a matter of a few lines of TS.
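Those few lines might look something like this. Hypothetical plugin glue: the helper names are assumptions, not OpenClaw's actual plugin surface; only the kesha say CLI shape comes from the usage above.

```typescript
// Hypothetical output-side glue: shell out to `kesha say` and hand the
// resulting WAV back to the messaging layer. Names are illustrative.
import { spawnSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Build the `kesha say` invocation for a text reply.
export function sayCommand(text: string): string[] {
  return ["kesha", "say", text];
}

// Run it, capture the WAV from stdout, and write a file the plugin can
// attach as a voice reply.
export function synthesizeReply(text: string, outPath: string): string {
  const [cmd, ...args] = sayCommand(text);
  const res = spawnSync(cmd, args, { encoding: "buffer" });
  if (res.status !== 0) throw new Error(`kesha exited with status ${res.status}`);
  writeFileSync(outPath, res.stdout);
  return outPath;
}
```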
- npm: https://www.npmjs.com/package/@drakulavich/kesha-voice-kit
- GitHub: https://github.com/drakulavich/kesha-voice-kit
Happy to share more detailed numbers, tweak the API for real use cases, or walk through how the bidirectional voice pipeline is wired up.