Voicebox: The Free, Open Source AI Voice Studio That Replaces ElevenLabs and WisprFlow in One App

Voicebox is a local-first, open source AI voice studio — clone any voice, generate speech in 23 languages, dictate into any app, and give Claude or Cursor a voice via MCP. Seven TTS engines, full privacy, runs on macOS, Windows, and Linux. Here is the full setup and feature breakdown.

9 min read · Yash Thakker
Tags: Voice AI, Open Source, MCP, Claude, Text to Speech, Voice Cloning, Privacy

Two of the most useful AI voice tools of the last few years live on opposite ends of the same loop. ElevenLabs handles output: clone a voice, generate speech, export audio. WisprFlow handles input: hold a hotkey, speak, text appears in whatever app you're in.

Both are paid cloud services. Both send your voice data to their servers.

Voicebox does both, locally, for free.

It is an open source AI voice studio built by Jamie Pine (also the creator of Spacedrive). Voicebox clones voices from a few seconds of audio, generates speech across 23 languages with seven different TTS engines, provides a global dictation hotkey that pastes into any app, runs a local LLM for personality rewrites, and exposes an MCP server so Claude, Cursor, and any other AI agent can speak to you in a voice you've cloned.

31,000+ GitHub stars. MIT license. Runs entirely on your machine. Nothing leaves without your permission.


The Problem It Solves

Voice AI tooling in 2026 has a fragmentation problem. You need one subscription for voice cloning, another for dictation, a separate API for your agents, and a different integration for each. You are paying per character of speech, per minute of transcription, and per API call — with your voice data living on someone else's infrastructure.

Voicebox's design answer is vertical integration, locally. One app, one model cache, one GPU footprint, covering:

  • Voice cloning — zero-shot cloning from a reference audio sample
  • TTS — 7 engines, 23 languages, post-processing effects
  • STT — Whisper (all sizes, plus Turbo) for transcription
  • Dictation — global hotkey, pastes into any text field on macOS
  • Local LLM — Qwen3 for personality rewrites and dictation cleanup
  • MCP server — agents can speak and transcribe via standard tool protocol
  • REST API — everything accessible programmatically

Voice Cloning: How It Works

Voicebox supports zero-shot voice cloning — you provide a reference audio sample (a few seconds of clean speech), and Voicebox creates a voice profile that matches it. No training, no fine-tuning, no wait time. The cloning happens at inference time: the TTS engine receives your reference audio and conditions its output to match the speaker.

To clone a voice:

  1. Open Voicebox → Voices tab
  2. Click "New Profile"
  3. Record or import a reference audio clip (10–30 seconds of clean speech works best)
  4. Name the profile — this is what you pass to voicebox.speak over MCP

Multi-sample support is available: upload several clips from the same speaker for higher quality cloning.

Profiles are exportable and importable, so you can share voice profiles or move them between machines.


The Seven TTS Engines

Voicebox ships seven TTS engines, each with different strengths. You can switch engines per generation — the UI lets you select which engine to use before generating.

| Engine | Languages | Strengths |
|---|---|---|
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| Qwen CustomVoice | 10 | 9 preset voices with natural-language delivery control, no reference audio needed |
| LuxTTS | English + 10 more | Lightweight (~1GB VRAM), 48kHz output, 150x real-time on Apple Silicon; best for Finnish, Greek, Hebrew, Hindi, Norwegian, Polish, Swahili |
| Chatterbox Turbo | English | Fast 350M model, supports paralinguistic emotion tags like [laugh] [sigh] [gasp] |
| HumeAI TADA (1B / 3B) | 10 | 700s+ of coherent continuous audio, text-acoustic dual alignment |
| Kokoro | 8 | 82M model, 50 curated preset voices, fast CPU inference; the lightweight default |
| Chatterbox Multilingual | Multiple | Multilingual Chatterbox base model |

The paralinguistic tag support in Chatterbox Turbo deserves a callout. Type / in the text input to open the tag inserter and add:

"That's [laugh] actually quite funny, you know. [sigh] But here we are."

The model speaks the text with the indicated emotional cues inline. ElevenLabs offers something similar in its "Expressive" tier; Voicebox's implementation is free and local.
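If you generate tagged text from a script, it can help to catch typos in tags before sending the text to the engine. The sketch below is a plain-Python check, not part of Voicebox. The tag list contains only the three tags named in this post and is assumed to be incomplete.

```python
import re

# Only the tags named in this post; the engine may accept more (assumption).
KNOWN_TAGS = {"laugh", "sigh", "gasp"}

# A tag is a lowercase word in square brackets, e.g. [laugh].
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> list:
    """Return the bracketed tag names in the order they appear."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list:
    """Return tags that are not in KNOWN_TAGS, so typos surface before synthesis."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


line = "That's [laugh] actually quite funny, you know. [sigh] But here we are."
print(find_tags(line))                              # ['laugh', 'sigh']
print(unknown_tags("Well [lagh] that happened."))   # ['lagh']
```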


Global Dictation: Voice Input Anywhere

The input half of Voicebox is a global dictation system backed by OpenAI Whisper (running locally). Hold a hotkey anywhere on your system, speak, and the transcript pastes directly into the focused text field — terminal, editor, browser, any app.

Setup on macOS:

  1. Open Voicebox → Settings → Dictation
  2. Set your push-to-talk chord (default: hold Fn)
  3. Grant Accessibility and Input Monitoring permissions (Voicebox walks you through this with deep-links to System Settings)
  4. Hold the hotkey → speak → release → text appears

The paste implementation on macOS is accessibility-verified: it uses the Accessibility API to inject text into the focused element, not the clipboard, so your clipboard is not overwritten. The clipboard save/restore is atomic regardless.

Two modes:

  • Push-to-talk — hold chord to record, release to transcribe and paste
  • Toggle — tap chord to start, tap again to stop. Hold the push-to-talk chord and tap Space mid-hold to upgrade to toggle without a gap in recording.
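The two modes, and the mid-hold upgrade between them, can be read as a small state machine. The sketch below is one way to model that behavior. The state and event names are hypothetical and are not taken from Voicebox's source.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    HOLDING = "holding"   # push-to-talk: recording while the chord is held
    LATCHED = "latched"   # toggle: recording until the chord is pressed again


def step(state, event, toggle_mode=False):
    """Return (next_state, stop_recording) for one input event.

    Events (hypothetical names): "chord_down", "chord_up", "space".
    """
    if state is State.IDLE and event == "chord_down":
        return (State.LATCHED if toggle_mode else State.HOLDING), False
    if state is State.HOLDING:
        if event == "chord_up":
            return State.IDLE, True        # release ends push-to-talk
        if event == "space":
            return State.LATCHED, False    # upgrade; recording continues
    if state is State.LATCHED and event == "chord_down":
        return State.IDLE, True            # second press stops a toggle
    return state, False                    # everything else is ignored


def run(events, toggle_mode=False):
    """Feed a sequence of events and count how many recordings were finished."""
    state, stops = State.IDLE, 0
    for event in events:
        state, stopped = step(state, event, toggle_mode)
        stops += int(stopped)
    return state, stops


# Hold, tap Space, release: still recording, nothing transcribed yet.
print(run(["chord_down", "space", "chord_up"]))
```

Because the upgrade only changes the state, the recording itself is never restarted, which is what "without a gap" implies.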

LLM refinement is optional: before paste, Voicebox's bundled Qwen3 LLM cleans up filler words, stutters, and false starts. Toggle this per-dictation or set it as a default in Settings.

On-screen pill: A floating overlay shows recording, transcribing, refining, and speaking states — the same pill agents use when they speak to you, so there is one mental model for both directions of the voice loop.


Connecting Claude (and Other Agents) via MCP

This is the feature that most sets Voicebox apart from typical TTS tools. When Voicebox is running, it exposes an MCP server at http://127.0.0.1:17493/mcp. Any MCP-aware agent can use four tools:

  • voicebox.speak — synthesize text in a cloned voice
  • voicebox.transcribe — convert an audio file to text
  • voicebox.list_captures — browse your recording history
  • voicebox.list_profiles — browse your voice profiles

Connect Claude Code

claude mcp add --transport http voicebox \
  http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

Connect Cursor / Windsurf / VS Code

{
  "mcpServers": {
    "voicebox": {
      "url": "http://127.0.0.1:17493/mcp",
      "headers": { "X-Voicebox-Client-Id": "cursor" }
    }
  }
}

Connect Claude Desktop (stdio fallback)

{
  "mcpServers": {
    "voicebox": {
      "command": "/Applications/Voicebox.app/Contents/MacOS/voicebox-mcp",
      "env": { "VOICEBOX_CLIENT_ID": "claude-desktop" }
    }
  }
}

Using the speak tool in an agent session

Once connected, Claude can call:

voicebox.speak({
  text: "Tests passing. Ready to merge.",
  profile: "Morgan"
})

Claude speaks in Morgan's voice. You hear it through your speakers. The on-screen pill shows "Speaking" with the profile name.
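Under the hood, an MCP tool call is a JSON-RPC 2.0 request with the method tools/call. The sketch below builds the message an agent would send for the call above. The argument names (text, profile) come from this post. A real client also performs the MCP initialize handshake first, which agents like Claude handle for you.

```python
import json


def tools_call(name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` message as used by MCP."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


payload = tools_call(
    "voicebox.speak",
    {"text": "Tests passing. Ready to merge.", "profile": "Morgan"},
)
print(payload)
```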

Per-client voice binding: In Voicebox → Settings → MCP, pin each connected agent to a specific voice. Claude Code → Morgan. Cursor → Scarlett. When those agents call voicebox.speak without specifying a profile, Voicebox uses the bound voice. This means you can tell which agent is talking without looking.

Voice personalities over MCP: Add a persona description to any voice profile ("a calm senior engineer who explains things simply"). Then call:

voicebox.speak({
  text: "The build failed on line 42.",
  profile: "Morgan",
  personality: true
})

The text passes through the local Qwen3 LLM, which rewrites it in Morgan's character before TTS, so the agent's output matches Morgan in personality as well as in sound.


The REST API

Everything in Voicebox is accessible via REST at http://127.0.0.1:17493. Full docs at http://127.0.0.1:17493/docs when the app is running.

# Generate speech — returns audio file
curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'

# Agent voice output
curl -X POST http://127.0.0.1:17493/speak \
  -H "Content-Type: application/json" \
  -H "X-Voicebox-Client-Id: my-script" \
  -d '{"text": "Deploy complete.", "profile": "Morgan"}'

# Transcribe audio (the form field name "file" and the .wav path are assumed placeholders; confirm at /docs)
curl -X POST http://127.0.0.1:17493/transcribe \
  -F "file=@recording.wav" \
  -F "model=whisper-turbo"

# List voice profiles
curl http://127.0.0.1:17493/profiles
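The same /speak call can be made from Python with only the standard library. This sketch mirrors the curl example above: the endpoint, header and JSON fields are the ones shown in this post, and the helper name build_speak_request is made up for illustration. It only builds the request, so it runs even when Voicebox is not open.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:17493"


def build_speak_request(text: str, profile: Optional[str] = None,
                        client_id: str = "my-script") -> urllib.request.Request:
    """Build (but do not send) a POST /speak request."""
    body = {"text": text}
    if profile is not None:
        body["profile"] = profile  # omit to fall back to the bound or default voice
    return urllib.request.Request(
        f"{BASE_URL}/speak",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Voicebox-Client-Id": client_id,
        },
        method="POST",
    )


req = build_speak_request("Deploy complete.", "Morgan")
print(req.get_method(), req.full_url)

# To actually send it while Voicebox is running:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.status)
```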

The /speak endpoint accepts profile as a name (case-insensitive) or ID, resolving in the same order as the MCP tool: explicit argument → per-client binding → default capture voice.
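That resolution order is easy to express in code. The sketch below illustrates the described behavior and is not Voicebox's implementation. The bindings dictionary and the profile records (with id and name keys) are hypothetical shapes chosen for the example.

```python
from typing import Optional


def resolve_profile(explicit: Optional[str], client_id: Optional[str],
                    bindings: dict, default: str) -> str:
    """Explicit argument first, then the per-client binding, then the default voice."""
    if explicit:
        return explicit
    if client_id is not None and client_id in bindings:
        return bindings[client_id]
    return default


def find_profile(name_or_id: str, profiles: list) -> Optional[dict]:
    """Match a profile by exact ID or by case-insensitive name."""
    wanted = name_or_id.casefold()
    for profile in profiles:
        if profile["id"] == name_or_id or profile["name"].casefold() == wanted:
            return profile
    return None


bindings = {"claude-code": "Morgan", "cursor": "Scarlett"}
print(resolve_profile(None, "cursor", bindings, "Default"))       # Scarlett
print(resolve_profile("Morgan", "cursor", bindings, "Default"))   # Morgan
```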


Post-Processing Effects

After generating speech, apply audio effects non-destructively. Preview in real time, build presets per voice profile.

| Effect | Options |
|---|---|
| Pitch Shift | ±12 semitones |
| Reverb | Room size, damping, wet/dry |
| Delay | Time, feedback, mix |
| Chorus / Flanger | Modulated delay |
| Compressor | Dynamic range |
| Gain | -40 to +40 dB |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |

Four built-in presets: Robotic, Radio, Echo Chamber, Deep Voice. Custom presets are saveable per profile.
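To get a feel for the pitch and gain ranges, it helps to convert them to plain multipliers. The sketch below uses the standard equal-temperament and decibel formulas. It assumes the gain value is an amplitude gain (20·log10), which this post does not state explicitly.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift: 12 semitones is one octave (x2)."""
    return 2.0 ** (semitones / 12.0)


def db_to_amplitude(db: float) -> float:
    """Linear amplitude factor for a gain in dB (assumes amplitude, not power)."""
    return 10.0 ** (db / 20.0)


for semitones in (-12, -4, 0, 12):
    print(f"{semitones:+d} semitones -> x{semitones_to_ratio(semitones):.3f}")

for db in (-40, -6, 0, 40):
    print(f"{db:+d} dB -> x{db_to_amplitude(db):.3f}")
```

At the extremes, ±12 semitones halves or doubles the frequency, and ±40 dB scales the amplitude by a factor of 100 in either direction.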

Each generation creates a version — original TTS output is always preserved. Effects versions branch from the original, so you can apply different chains without losing the clean source.
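One way to picture this versioning model is a list where the first entry is always the untouched original and every effects version stores its chain relative to that original. The class, field and effect names below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    label: str
    effects: tuple  # effect chain applied to the original; empty for the clean take


@dataclass
class Generation:
    text: str
    versions: list = field(default_factory=lambda: [Version("original", ())])

    def add_effects_version(self, label: str, effects) -> Version:
        # Each version records its chain against the original, so the
        # clean source at index 0 is never modified or replaced.
        version = Version(label, tuple(effects))
        self.versions.append(version)
        return version


gen = Generation("Deploy complete.")
gen.add_effects_version("radio", ["high_pass", "compressor"])
gen.add_effects_version("deep", ["pitch_shift:-4"])
print([v.label for v in gen.versions])   # ['original', 'radio', 'deep']
```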


Download and Install

macOS (Apple Silicon or Intel): Download the DMG from github.com/jamiepine/voicebox/releases. Drag to Applications, open it, grant the Accessibility and Microphone permissions when prompted.

Windows: Download the MSI installer from the same Releases page.

Docker:

docker compose up

Linux: Pre-built binaries are not yet published. Build from source — see voicebox.sh/linux-install for instructions.

On first launch, Voicebox downloads the base models (Whisper, Kokoro as default TTS, Qwen3 LLM for dictation refinement). Storage requirements depend on which models you download: Kokoro is the lightest (~82M parameters), TADA the heaviest (1B or 3B parameters).
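For a rough sense of scale, the size of a model's weights is about its parameter count times the bytes per parameter. The sketch below does that arithmetic for the model sizes mentioned here. It is a back-of-the-envelope estimate only: the precision each engine ships in is an assumption, and real downloads include extra files such as tokenizers and vocoders.

```python
# Bytes used per parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_gb(params: float, dtype: str = "fp16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as quoted in this post; fp16 is an assumed precision.
for name, params in [("Kokoro", 82e6), ("TADA 1B", 1e9), ("TADA 3B", 3e9)]:
    print(f"{name}: ~{weights_size_gb(params):.2f} GB at fp16")
```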


GPU Support

Voicebox auto-detects and uses the best available inference backend for your hardware:

| Platform | Backend | Notes |
|---|---|---|
| macOS Apple Silicon | MLX (Metal) | 4–5x faster; MLX runs on the GPU via Metal |
| Windows / Linux NVIDIA | PyTorch (CUDA) | Auto-downloads from within the app |
| Linux AMD | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, slower |
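The table can be read as a decision procedure. The sketch below encodes it as a pure function. The precedence between overlapping rows (for example CUDA before DirectML on a Windows machine with an NVIDIA card) is an assumption, and the function and argument names are hypothetical rather than Voicebox's own detection code.

```python
from typing import Optional


def pick_backend(os_name: str, gpu: Optional[str] = None,
                 apple_silicon: bool = False) -> str:
    """Map a platform description to a backend name from the table.

    os_name: "macos", "windows" or "linux"
    gpu: "nvidia", "amd", "intel_arc", any other vendor string, or None
    """
    os_name = os_name.lower()
    if os_name == "macos" and apple_silicon:
        return "MLX (Metal)"
    if gpu == "nvidia" and os_name in ("windows", "linux"):
        return "PyTorch (CUDA)"
    if gpu == "amd" and os_name == "linux":
        return "PyTorch (ROCm)"
    if gpu == "intel_arc":
        return "IPEX/XPU"
    if os_name == "windows" and gpu is not None:
        return "DirectML"   # generic Windows GPU path
    return "CPU"            # always available, slower


print(pick_backend("macos", apple_silicon=True))   # MLX (Metal)
print(pick_backend("windows", "amd"))              # DirectML
print(pick_backend("linux"))                       # CPU
```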

Privacy: The Non-Negotiable Default

Every part of Voicebox runs locally:

  • Voice cloning models run on your GPU/CPU
  • Generated audio never leaves your machine
  • Whisper STT runs locally — your speech is not sent to any API
  • The local Qwen3 LLM for personality and dictation cleanup runs on-device
  • The MCP server and REST API are localhost-only

There are no usage metrics, no voice data collection, and no cloud processing unless you explicitly configure an external integration.
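If you script against Voicebox, a cheap safeguard is to confirm that the URL you are about to send audio or text to really is a loopback address. The sketch below is a generic standard-library check, not a Voicebox feature, and the helper name is made up for illustration.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP (127.0.0.0/8, ::1)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Some other DNS name; it is not resolved here, so treat it as remote.
        return False


print(is_loopback_url("http://127.0.0.1:17493/mcp"))     # True
print(is_loopback_url("http://192.168.1.10:17493/mcp"))  # False
```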


Quick-Start Summary

  1. Download: github.com/jamiepine/voicebox/releases → macOS DMG or Windows MSI
  2. Clone a voice: Voices tab → New Profile → record or upload reference audio
  3. Enable dictation: Settings → Dictation → set hotkey → grant permissions
  4. Connect Claude Code: claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
  5. Ask Claude to speak: voicebox.speak({ text: "...", profile: "YourVoice" })
