Voicebox is a free, open source AI voice studio built with Tauri (Rust) and Python. It combines voice cloning (ElevenLabs-style output) with global dictation (WisprFlow-style input) in one local app. Seven TTS engines, 23 languages, Whisper STT, a local LLM for personality rewrites, an MCP server so AI agents can speak, and a REST API — all running entirely on your machine with no data sent to any cloud.

Is Voicebox really free?

Yes. Voicebox is MIT-licensed, and every part of it is free: the app, voice cloning, dictation, the MCP server, and the API. All processing runs on your local hardware, so there is no subscription and no usage limit. You provide the compute; Voicebox provides the software.

How does Voicebox compare to ElevenLabs?

ElevenLabs is a cloud service with usage-based pricing. Voicebox is local and free. ElevenLabs has more polished production tooling and a larger pre-made voice library. Voicebox offers complete privacy (nothing leaves your machine), zero per-character cost, and MCP server integration for AI agents — features ElevenLabs does not offer in the same package.

How does the MCP integration work?

When Voicebox is running, it exposes an MCP server at http://127.0.0.1:17493/mcp. Add it to Claude Code with: claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp. Claude can then call voicebox.speak to synthesize text in any cloned voice, voicebox.transcribe to convert audio to text, voicebox.list_profiles to browse your voices, and voicebox.list_captures to access your recordings.

What TTS engines does Voicebox include?

Seven engines: Qwen3-TTS (0.6B / 1.7B, high-quality multilingual cloning), Qwen CustomVoice (9 preset voices with delivery control), LuxTTS (lightweight, 48kHz, great for Finnish/Greek/Hebrew/Hindi), Chatterbox Turbo (fast, with paralinguistic emotion tags like [laugh] and [sigh]), HumeAI TADA (1B/3B, 700s+ coherent audio), Kokoro (82M model, 50 preset voices, fast CPU), and Chatterbox Multilingual.

What platforms does Voicebox run on?

macOS (Apple Silicon and Intel) and Windows have pre-built installers, and Docker is also available. Linux pre-built binaries are planned but not yet published; for now, build from source using the instructions in the GitHub repo.

Voicebox: Free Open Source ElevenLabs + WisprFlow Alternative with MCP (2026) | explainx.ai Blog

Two of the most useful AI voice tools of the last few years live on opposite ends of the same loop. ElevenLabs handles output: clone a voice, generate speech, export audio. WisprFlow handles input: hold a hotkey, speak, text appears in whatever app you're in.

Both are paid cloud services. Both send your voice data to their servers.

Voicebox does both, locally, for free.

It is an open source AI voice studio built by Jamie Pine (also the creator of Spacedrive). Voicebox clones voices from a few seconds of audio, generates speech across 23 languages with seven different TTS engines, provides a global dictation hotkey that pastes into any app, runs a local LLM for personality rewrites, and exposes an MCP server so Claude, Cursor, and any other AI agent can speak to you in a voice you've cloned.

31,000+ GitHub stars. MIT license. Runs entirely on your machine. Nothing leaves without your permission.

The Problem It Solves

Voice AI tooling in 2026 has a fragmentation problem. You need one subscription for voice cloning, another for dictation, a separate API for your agents, and a different integration for each. You are paying per character of speech, per minute of transcription, and per API call — with your voice data living on someone else's infrastructure.

Voicebox's design answer is vertical integration, locally. One app, one model cache, one GPU footprint, covering:

Voice cloning — zero-shot cloning from a reference audio sample
TTS — 7 engines, 23 languages, post-processing effects
STT — Whisper (all sizes, plus Turbo) for transcription
Dictation — global hotkey, pastes into any text field on macOS
Local LLM — Qwen3 for personality rewrites and dictation cleanup
MCP server — agents can speak and transcribe via standard tool protocol
REST API — everything accessible programmatically

Voice Cloning: How It Works

Voicebox supports zero-shot voice cloning — you provide a reference audio sample (a few seconds of clean speech), and Voicebox creates a voice profile that matches it. No training, no fine-tuning, no wait time. The cloning happens at inference time: the TTS engine receives your reference audio and conditions its output to match the speaker.

To clone a voice:

Open Voicebox → Voices tab
Click "New Profile"
Record or import a reference audio clip (10–30 seconds of clean speech works best)
Name the profile — this is what you pass to voicebox.speak over MCP

Multi-sample support is available: upload several clips from the same speaker for higher quality cloning.

Profiles are exportable and importable, so you can share voice profiles or move them between machines.

The Seven TTS Engines

Voicebox ships seven TTS engines, each with different strengths. You can switch engines per generation — the UI lets you select which engine to use before generating.

Engine	Languages	Strengths
Qwen3-TTS (0.6B / 1.7B)	10	High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper")
Qwen CustomVoice	10	9 preset voices with natural-language delivery control, no reference audio needed
LuxTTS	English + 10 more	Lightweight (~1GB VRAM), 48kHz output, 150x real-time on Apple Silicon; best for Finnish, Greek, Hebrew, Hindi, Norwegian, Polish, Swahili
Chatterbox Turbo	English	Fast 350M model, supports paralinguistic emotion tags like `[laugh]` `[sigh]` `[gasp]`
HumeAI TADA (1B / 3B)	10	700s+ of coherent continuous audio, text-acoustic dual alignment
Kokoro	8	82M model, 50 curated preset voices, fast CPU inference — the lightweight default
Chatterbox Multilingual	Multiple	Multilingual Chatterbox base model

The paralinguistic tag support in Chatterbox Turbo deserves a callout. Type / in the text input to open the tag inserter and add:

snippet

"That's [laugh] actually quite funny, you know. [sigh] But here we are."

The model speaks the text with the indicated emotional cues inline. ElevenLabs offers something similar in its "Expressive" tier; Voicebox's implementation is free and local.
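
If you generate tagged text programmatically, it can help to check the tags before sending it to the engine. The following is a minimal sketch, not part of Voicebox: the allow-list contains only the three tags named above (the full set Chatterbox Turbo accepts is not listed here), and the function names are hypothetical.

```python
import re

# Tags named in the article. The complete set the engine accepts is not
# documented here, so treat this allow-list as illustrative only.
KNOWN_TAGS = {"laugh", "sigh", "gasp"}

# Matches a lowercase word wrapped in square brackets, e.g. [laugh].
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> list[str]:
    """Return the bracketed tag names in the order they appear."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list[str]:
    """Return tags that are not in the illustrative allow-list."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


line = "That's [laugh] actually quite funny, you know. [sigh] But here we are."
print(find_tags(line))     # ['laugh', 'sigh']
print(unknown_tags(line))  # []
```

A check like this catches typos such as [laught] before they are read aloud as literal text.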

Global Dictation: Voice Input Anywhere

The input half of Voicebox is a global dictation system backed by OpenAI Whisper (running locally). Hold a hotkey anywhere on your system, speak, and the transcript pastes directly into the focused text field — terminal, editor, browser, any app.

Setup on macOS:

Open Voicebox → Settings → Dictation
Set your push-to-talk chord (default: hold Fn)
Grant Accessibility and Input Monitoring permissions (Voicebox walks you through this with deep-links to System Settings)
Hold the hotkey → speak → release → text appears

The paste implementation on macOS is accessibility-verified: it uses the Accessibility API to inject text into the focused element rather than pasting through the clipboard, so your clipboard is not overwritten. Where the clipboard is touched at all, its save/restore is atomic.

Two modes:

Push-to-talk — hold chord to record, release to transcribe and paste
Toggle — tap chord to start, tap again to stop. Hold the push-to-talk chord and tap Space mid-hold to upgrade to toggle without a gap in recording.
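
One way to picture how the two modes relate is as a small state machine. This is a sketch of the behavior described above, not Voicebox's actual implementation; the state names and event names are made up for illustration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    HOLD_RECORDING = auto()    # push-to-talk: the chord is held down
    TOGGLE_RECORDING = auto()  # toggle: recording continues hands-free
    TRANSCRIBING = auto()


# Hypothetical event names: 'chord_down', 'chord_up', 'chord_tap', 'space_tap'.
TRANSITIONS = {
    (State.IDLE, "chord_down"): State.HOLD_RECORDING,
    (State.IDLE, "chord_tap"): State.TOGGLE_RECORDING,
    (State.HOLD_RECORDING, "chord_up"): State.TRANSCRIBING,
    # Tapping Space mid-hold upgrades to toggle without stopping the recording.
    (State.HOLD_RECORDING, "space_tap"): State.TOGGLE_RECORDING,
    (State.TOGGLE_RECORDING, "chord_tap"): State.TRANSCRIBING,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; events with no transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = State.IDLE
for event in ["chord_down", "space_tap", "chord_up", "chord_tap"]:
    state = next_state(state, event)
print(state.name)  # TRANSCRIBING
```

In this model, releasing the chord after the upgrade is ignored, which is why the recording has no gap: it only ends on the next tap.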

LLM refinement is optional: before paste, Voicebox's bundled Qwen3 LLM cleans up filler words, stutters, and false starts. Toggle this per-dictation or set it as a default in Settings.

On-screen pill: A floating overlay shows recording, transcribing, refining, and speaking states — the same pill agents use when they speak to you, so there is one mental model for both directions of the voice loop.

Connecting Claude (and Other Agents) via MCP

This is the feature that most clearly sets Voicebox apart from other TTS tools. When Voicebox is running, it exposes an MCP server at http://127.0.0.1:17493/mcp. Any MCP-aware agent can use four tools:

voicebox.speak — synthesize text in a cloned voice
voicebox.transcribe — convert an audio file to text
voicebox.list_captures — browse your recording history
voicebox.list_profiles — browse your voice profiles

Connect Claude Code

bash

claude mcp add --transport http voicebox \
  http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

Connect Cursor / Windsurf / VS Code

json

{
  "mcpServers": {
    "voicebox": {
      "url": "http://127.0.0.1:17493/mcp",
      "headers": { "X-Voicebox-Client-Id": "cursor" }
    }
  }
}

Connect Claude Desktop (stdio fallback)

json

{
  "mcpServers": {
    "voicebox": {
      "command": "/Applications/Voicebox.app/Contents/MacOS/voicebox-mcp",
      "env": { "VOICEBOX_CLIENT_ID": "claude-desktop" }
    }
  }
}

Using the speak tool in an agent session

Once connected, Claude can call:

snippet

voicebox.speak({
  text: "Tests passing. Ready to merge.",
  profile: "Morgan"
})

Claude speaks in Morgan's voice. You hear it through your speakers. The on-screen pill shows "Speaking" with the profile name.
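
Under the hood, an MCP client sends a tool call as a JSON-RPC 2.0 message with the method tools/call. The sketch below only builds that message for the voicebox.speak call shown above. The helper name is hypothetical, and a real session also performs an initialization handshake first, which your agent handles for you.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


payload = tool_call(
    1,
    "voicebox.speak",
    {"text": "Tests passing. Ready to merge.", "profile": "Morgan"},
)
print(payload)
```

You rarely need to write this by hand, but seeing the message is useful when debugging a connection with a proxy or a log viewer.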

Per-client voice binding: In Voicebox → Settings → MCP, pin each connected agent to a specific voice. Claude Code → Morgan. Cursor → Scarlett. When those agents call voicebox.speak without specifying a profile, Voicebox uses the bound voice. This means you can tell which agent is talking without looking.
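
The lookup order (explicit profile, then the per-client binding, then a default voice) can be sketched as a small function. This illustrates the described behavior and is not Voicebox's code; the function name, the bindings dictionary, and the "Default" profile name are assumptions.

```python
from typing import Optional


def resolve_profile(
    explicit: Optional[str],
    client_id: Optional[str],
    bindings: dict[str, str],
    default_profile: str,
) -> str:
    """Pick a voice: explicit argument, then per-client binding, then default."""
    if explicit:
        return explicit
    if client_id and client_id in bindings:
        return bindings[client_id]
    return default_profile


# Hypothetical bindings mirroring the example above.
bindings = {"claude-code": "Morgan", "cursor": "Scarlett"}

print(resolve_profile(None, "cursor", bindings, "Default"))      # Scarlett
print(resolve_profile("Morgan", "cursor", bindings, "Default"))  # Morgan
print(resolve_profile(None, "my-script", bindings, "Default"))   # Default
```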

Voice personalities over MCP: Add a persona description to any voice profile ("a calm senior engineer who explains things simply"). Then call:

snippet

voicebox.speak({
  text: "The build failed on line 42.",
  profile: "Morgan",
  personality: true
})

The text passes through the local Qwen3 LLM which rewrites it in Morgan's character before TTS. The agent's output becomes Morgan's voice in personality as well as sound.

The REST API

Everything in Voicebox is accessible via REST at http://127.0.0.1:17493. Full docs at http://127.0.0.1:17493/docs when the app is running.

bash

# Generate speech — returns audio file
curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'

# Agent voice output
curl -X POST http://127.0.0.1:17493/speak \
  -H "Content-Type: application/json" \
  -H "X-Voicebox-Client-Id: my-script" \
  -d '{"text": "Deploy complete.", "profile": "Morgan"}'

# Transcribe audio (the upload field and file name here are placeholders; check /docs for the exact form field)
curl -X POST http://127.0.0.1:17493/transcribe \
  -F "file=@recording.wav" \
  -F "model=whisper-turbo"

# List voice profiles
curl http://127.0.0.1:17493/profiles

The /speak endpoint accepts profile as a name (case-insensitive) or ID, resolving in the same order as the MCP tool: explicit argument → per-client binding → default capture voice.
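
The same /speak call can be made from Python with only the standard library. This is a minimal sketch based on the curl example above: the endpoint, header, and JSON fields come from that example, while the helper name is hypothetical and the response format is not assumed.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:17493"


def build_speak_request(text: str, profile: str, client_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /speak request."""
    body = json.dumps({"text": text, "profile": profile}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/speak",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Voicebox-Client-Id": client_id,
        },
    )


request = build_speak_request("Deploy complete.", "Morgan", "my-script")
print(request.get_method(), request.full_url)

# Sending requires Voicebox to be running locally:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes the script easy to test without the app running.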

Post-Processing Effects

After generating speech, apply audio effects non-destructively. Preview in real time, build presets per voice profile.

Effect	Options
Pitch Shift	±12 semitones
Reverb	Room size, damping, wet/dry
Delay	Time, feedback, mix
Chorus / Flanger	Modulated delay
Compressor	Dynamic range
Gain	-40 to +40 dB
High-Pass Filter	Remove low frequencies
Low-Pass Filter	Remove high frequencies

Four built-in presets: Robotic, Radio, Echo Chamber, Deep Voice. Custom presets are saveable per profile.

Each generation creates a version — original TTS output is always preserved. Effects versions branch from the original, so you can apply different chains without losing the clean source.
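
To get a feel for the ranges in the effects table, it helps to convert them into plain multipliers. The sketch below uses the standard equal-temperament and decibel formulas. Whether Voicebox applies gain as an amplitude factor in exactly this way is an assumption.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift: 12 semitones is one octave (2x)."""
    return 2.0 ** (semitones / 12.0)


def db_to_amplitude(db: float) -> float:
    """Linear amplitude factor for a gain in decibels (20 dB is 10x)."""
    return 10.0 ** (db / 20.0)


print(semitones_to_ratio(12))          # 2.0
print(semitones_to_ratio(-12))         # 0.5
print(round(db_to_amplitude(-40), 4))  # 0.01
print(round(db_to_amplitude(40), 1))   # 100.0
print(round(db_to_amplitude(6), 3))    # 1.995
```

So the ±12 semitone range spans one octave down to one octave up, and the gain range spans a hundredth to a hundred times the original amplitude.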

Download and Install

macOS (Apple Silicon or Intel): Download the DMG from github.com/jamiepine/voicebox/releases. Drag to Applications, open it, grant the Accessibility and Microphone permissions when prompted.

Windows: Download the MSI installer from the same Releases page.

Docker:

bash

docker compose up

Linux: Pre-built binaries are not yet published. Build from source — see voicebox.sh/linux-install for instructions.

On first launch, Voicebox downloads the base models (Whisper, Kokoro as default TTS, Qwen3 LLM for dictation refinement). Storage requirements vary with the models you choose to download: Kokoro is the lightest (82M parameters), and TADA is the heaviest (1B or 3B parameters).

GPU Support

Voicebox auto-detects and uses the best available inference backend for your hardware:

Platform	Backend	Notes
macOS Apple Silicon	MLX (Metal)	4–5x faster, GPU-accelerated through Metal
Windows / Linux NVIDIA	PyTorch (CUDA)	Auto-downloads from within the app
Linux AMD	PyTorch (ROCm)	Auto-configures HSA_OVERRIDE_GFX_VERSION
Windows (any GPU)	DirectML	Universal Windows GPU support
Intel Arc	IPEX/XPU	Intel discrete GPU acceleration
Any	CPU	Works everywhere, slower

Privacy: The Non-Negotiable Default

Every part of Voicebox runs locally:

Voice cloning models run on your GPU/CPU
Generated audio never leaves your machine
Whisper STT runs locally — your speech is not sent to any API
The local Qwen3 LLM for personality and dictation cleanup runs on-device
The MCP server and REST API are localhost-only

There are no usage metrics, no voice data collection, and no cloud processing unless you explicitly configure an external integration.
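
If you script against Voicebox, you can enforce the same local-only expectation on the client side by refusing to send audio to anything other than a loopback address. This is a hedged sketch with a hypothetical helper name; it checks the URL only and does not resolve hostnames.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; other hostnames are not resolved here.
        return False


print(is_loopback_url("http://127.0.0.1:17493/mcp"))     # True
print(is_loopback_url("http://192.168.1.20:17493/mcp"))  # False
```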

Quick-Start Summary

Download: github.com/jamiepine/voicebox/releases → macOS DMG or Windows MSI
Clone a voice: Voices tab → New Profile → record or upload reference audio
Enable dictation: Settings → Dictation → set hotkey → grant permissions
Connect Claude Code: claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
Ask Claude to speak: voicebox.speak({ text: "...", profile: "YourVoice" })

