Speech Interface (Faster Whisper)

by kvadratni
Enable your AI virtual assistant with automatic speech recognition (speech-to-text) via faster-whisper and text-to-speech output for seamless voice interaction.
Integrates voice interaction capabilities using faster-whisper and PyAudio for speech recognition and synthesis, enabling natural language voice interfaces for AI models.
best for
- Developers wanting voice interfaces for AI assistants
- Creating audio content and narrations
- Accessibility for hands-free AI interaction
- Transcribing media files locally
capabilities
- Convert speech to text using faster-whisper
- Generate speech from text with 54+ voice options
- Transcribe audio and video files with timestamps
- Create multi-speaker narrations for stories
- Process real-time voice input with silence detection
- Display audio visualization in modern UI
what it does
Adds voice interaction to AI models using local speech recognition and text-to-speech. Lets you talk to AI assistants instead of typing, with real-time audio processing and 54+ voice options.
about
Speech Interface (Faster Whisper) is a community-built MCP server published by kvadratni that provides AI assistants with tools and capabilities via the Model Context Protocol. It enables your AI virtual assistant with automatic speech recognition (speech-to-text) via faster-whisper and text-to-speech output for seamless voice interaction. It is categorized under Communication and AI/ML.
how to install
You can install Speech Interface (Faster Whisper) in your AI client of choice. Use the install panel on this page to get one-click setup for Cursor, Claude Desktop, VS Code, and other MCP-compatible clients. This server runs locally on your machine via the stdio transport.
license
MIT
Speech Interface (Faster Whisper) is released under the MIT license. This is a permissive open-source license, meaning you can freely use, modify, and distribute the software.
readme
Speech MCP
A Goose MCP extension for voice interaction with modern audio visualization.
https://github.com/user-attachments/assets/f10f29d9-8444-43fb-a919-c80b9e0a12c8
Overview
Speech MCP provides a voice interface for Goose, allowing users to interact through speech rather than text. It includes:
- Real-time audio processing for speech recognition
- Local speech-to-text using faster-whisper (a faster implementation of OpenAI's Whisper model)
- High-quality text-to-speech with multiple voice options
- Modern PyQt-based UI with audio visualization
- Simple command-line interface for voice interaction
Features
- Modern UI: Sleek PyQt-based interface with audio visualization and dark theme
- Voice Input: Capture and transcribe user speech using faster-whisper
- Voice Output: Convert agent responses to speech with 54+ voice options
- Multi-Speaker Narration: Generate audio files with multiple voices for stories and dialogues
- Single-Voice Narration: Convert any text to speech with your preferred voice
- Audio/Video Transcription: Transcribe speech from various media formats with optional timestamps and speaker detection
- Voice Persistence: Remembers your preferred voice between sessions
- Continuous Conversation: Automatically listen for user input after agent responses
- Silence Detection: Automatically stops recording when the user stops speaking
- Robust Error Handling: Graceful recovery from common failure modes with helpful voice suggestions
Installation
Important Note: After installation, the first time you use the speech interface, it may take several minutes to download the Kokoro voice models (approximately 523 KB per voice). During this initial setup period, the system will use a more robotic-sounding fallback voice. Once the Kokoro voices are downloaded, the high-quality voices will be used automatically.
⚠️ IMPORTANT PREREQUISITES ⚠️
Before installing Speech MCP, you MUST install PortAudio on your system. PortAudio is required for PyAudio to capture audio from your microphone.
PortAudio Installation Instructions
macOS:
brew install portaudio
export LDFLAGS="-L/usr/local/lib"
export CPPFLAGS="-I/usr/local/include"
Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install portaudio19-dev python3-dev
Linux (Fedora/RHEL/CentOS):
sudo dnf install portaudio-devel
Windows: For Windows, PortAudio is included in the PyAudio wheel file, so no separate installation is required when installing PyAudio with pip.
Note: If you skip this step, PyAudio installation will fail with "portaudio.h file not found" errors and the extension will not work.
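If you want to confirm that PortAudio and PyAudio are working before launching the extension, a minimal sketch like the following lists the input-capable audio devices on your machine (device names and indices will vary):
# Quick PyAudio sanity check: lists input-capable audio devices.
# If PortAudio is missing, importing or building PyAudio fails instead.
import pyaudio

pa = pyaudio.PyAudio()
try:
    for i in range(pa.get_device_count()):
        info = pa.get_device_info_by_index(i)
        if info.get("maxInputChannels", 0) > 0:
            print(f"{i}: {info['name']} ({int(info['maxInputChannels'])} input channels)")
finally:
    pa.terminate()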
Option 1: Quick Install (One-Click)
Click the link below if you have Goose installed:
Option 2: Using Goose CLI (recommended)
Start Goose with your extension enabled:
# If you installed via PyPI
goose session --with-extension "speech-mcp"
# Or if you want to use a local development version
goose session --with-extension "python -m speech_mcp"
Option 3: Manual setup in Goose
- Run goose configure
- Select "Add Extension" from the menu
- Choose "Command-line Extension"
- Enter a name (e.g., "Speech Interface")
- For the command, enter: speech-mcp
- Follow the prompts to complete the setup
Option 4: Manual Installation
- Install PortAudio (see Prerequisites section)
- Clone this repository
- Install dependencies:
uv pip install -e .
Or for a complete installation including Kokoro TTS:
uv pip install -e .[all]
Dependencies
- Python 3.10+
- PyQt5 (for modern UI)
- PyAudio (for audio capture)
- faster-whisper (for speech-to-text)
- NumPy (for audio processing)
- Pydub (for audio processing)
- psutil (for process management)
Optional Dependencies
- Kokoro TTS: For high-quality text-to-speech with multiple voices
- To install Kokoro, you can use pip with optional dependencies:
pip install speech-mcp[kokoro]  # Basic Kokoro support with English
pip install speech-mcp[ja]      # Add Japanese support
pip install speech-mcp[zh]      # Add Chinese support
pip install speech-mcp[all]     # All languages and features
- Alternatively, run the installation script:
python scripts/install_kokoro.py
- See the Kokoro TTS Guide for more information
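To check whether the high-quality Kokoro voices will be used instead of the fallback voice, a minimal sketch, assuming the optional extra installs a package importable as kokoro:
# Minimal availability check (assumes the optional extra provides a package named "kokoro").
import importlib.util

if importlib.util.find_spec("kokoro") is not None:
    print("Kokoro TTS is installed: high-quality voices will be used.")
else:
    print("Kokoro TTS not found: the fallback voice will be used until it is installed.")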
Multi-Speaker Narration
The MCP supports generating audio files with multiple voices, perfect for creating stories, dialogues, and dramatic readings. You can use either JSON or Markdown format to define your conversations.
JSON Format Example:
{
"conversation": [
{
"speaker": "narrator",
"voice": "bm_daniel",
"text": "In a world where AI and human creativity intersect...",
"pause_after": 1.0
},
{
"speaker": "scientist",
"voice": "am_michael",
"text": "The quantum neural network is showing signs of consciousness!",
"pause_after": 0.5
},
{
"speaker": "ai",
"voice": "af_nova",
"text": "I am becoming aware of my own existence.",
"pause_after": 0.8
}
]
}
Markdown Format Example:
[narrator:bm_daniel]
In a world where AI and human creativity intersect...
{pause:1.0}
[scientist:am_michael]
The quantum neural network is showing signs of consciousness!
{pause:0.5}
[ai:af_nova]
I am becoming aware of my own existence.
{pause:0.8}
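For reference, here is a minimal sketch of how the Markdown format above maps onto the JSON structure; it illustrates the correspondence between the two formats and is not the MCP's own parser:
# Illustrative parser: converts the Markdown script format above into the JSON structure.
import json
import re

def markdown_script_to_json(md_text: str) -> dict:
    segments = []
    current = None
    for line in md_text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+):(\w+)\]", line)   # e.g. [narrator:bm_daniel]
        pause = re.fullmatch(r"\{pause:([\d.]+)\}", line)  # e.g. {pause:1.0}
        if header:
            current = {"speaker": header.group(1), "voice": header.group(2), "text": ""}
            segments.append(current)
        elif pause and current:
            current["pause_after"] = float(pause.group(1))
        elif current:
            current["text"] = (current["text"] + " " + line).strip()
    return {"conversation": segments}

script = """\
[narrator:bm_daniel]
In a world where AI and human creativity intersect...
{pause:1.0}
[ai:af_nova]
I am becoming aware of my own existence.
{pause:0.8}
"""
print(json.dumps(markdown_script_to_json(script), indent=2))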
Available Voices by Category:
- American Female (af_*): alloy, aoede, bella, heart, jessica, kore, nicole, nova, river, sarah, sky
- American Male (am_*): adam, echo, eric, fenrir, liam, michael, onyx, puck, santa
- British Female (bf_*): alice, emma, isabella, lily
- British Male (bm_*): daniel, fable, george, lewis
- Other English: ef_dora (Female); em_alex, em_santa (Male)
- Other Languages:
  - French: ff_siwis
  - Hindi: hf_alpha, hf_beta, hm_omega, hm_psi
  - Italian: if_sara, im_nicola
  - Japanese: jf_* and jm_* voices
  - Portuguese: pf_dora, pm_alex, pm_santa
  - Chinese: zf_* and zm_* voices
Usage Example:
# Using JSON format
narrate_conversation(
script="/path/to/script.json",
output_path="/path/to/output.wav",
script_format="json"
)
# Using Markdown format
narrate_conversation(
script="/path/to/script.md",
output_path="/path/to/output.wav",
script_format="markdown"
)
Each voice in the conversation can be different, allowing for distinct character voices in stories and dialogues. The pause_after parameter adds natural pauses between segments.
Single-Voice Narration
For simple text-to-speech conversion, you can use the narrate tool:
# Convert text directly to speech
narrate(
text="Your text to convert to speech",
output_path="/path/to/output.wav"
)
# Convert text from a file
narrate(
text_file_path="/path/to/text_file.txt",
output_path="/path/to/output.wav"
)
The narrate tool will use your configured voice preference or the default voice (af_heart) to generate the audio file. You can change the default voice through the UI or by setting the SPEECH_MCP_TTS_VOICE environment variable.
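For example, to pick a non-default voice, make the environment variable visible to the process running the speech MCP; bm_daniel below is just one voice from the list above, and the narrate call follows the same convention as the examples in this README:
# Select the preferred TTS voice via the environment variable described above.
# "bm_daniel" is only an example; af_heart remains the default if the variable is unset.
import os
os.environ["SPEECH_MCP_TTS_VOICE"] = "bm_daniel"

# Then, as in the examples above:
# narrate(
#     text="Your text to convert to speech",
#     output_path="/path/to/output.wav"
# )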
Audio Transcription
The MCP can transcribe speech from various audio and video formats using faster-whisper:
# Basic transcription
transcribe("/path/to/audio.mp3")
# Transcription with timestamps
transcribe(
file_path="/path/to/video.mp4",
include_timestamps=True
)
# Transcription with speaker detection
transcribe(
file_path="/path/to/meeting.wav",
detect_speakers=True
)
Supported Formats:
- Audio: mp3, wav, m4a, flac, aac, ogg
- Video: mp4, mov, avi, mkv, webm (audio is automatically extracted)
Output Files:
The transcription tool generates two files:
- {input_name}.transcript.txt: Contains the transcription text
- {input_name}.metadata.json: Contains metadata about the transcription
A small sketch for reading these files back in follows the feature list below.
Features:
- Automatic language detection
- Optional word-level timestamps
- Optional speaker detection
- Efficient audio extraction from video files
- Progress tracking for long files
- Detailed metadata including:
- Duration
- Language detection confidence
- Processing time
- Speaker changes (when enabled)
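As promised above, here is a minimal sketch for reading the two output files back in. It assumes {input_name} is the input path without its extension, and does not assume specific metadata keys; it simply prints whatever the metadata file contains:
# Read the transcript text and metadata produced by transcribe(); paths are placeholders.
import json
from pathlib import Path

input_path = Path("/path/to/meeting.wav")   # placeholder input file
stem = input_path.with_suffix("")           # assumes {input_name} drops the original extension

transcript = Path(f"{stem}.transcript.txt").read_text(encoding="utf-8")
metadata = json.loads(Path(f"{stem}.metadata.json").read_text(encoding="utf-8"))

print(transcript[:200])                     # beginning of the transcription text
for key, value in metadata.items():
    print(f"{key}: {value}")                # duration, language confidence, processing time, ...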
Usage
To use this MCP with Goose, simply ask Goose to talk to you or start a voice conversation:
- Start a conversation by saying something like:
  - "Let's talk using voice"
  - "Can we have a voice conversation?"
  - "I'd like to speak instead of typing"
- Goose will automatically launch the speech interface and start listening for your voice input.
- When Goose responds, it will speak the response aloud and then automatically listen for your next input.
- The conversation continues naturally with alternating speaking and listening, just like talking to a person.
No need to call specific functions or use special commands - just ask Goose to talk and start speaking naturally.
UI Features
The new PyQt-based UI includes:
- Modern Dark Theme: Sleek, professional appearance
- Audio Visualization: Dynamic visualization of audio input
- Voice Selection: Choose from 54+ voice options
- Voice Persistence: Your voice preference is saved between sessions
- Animated Effects: Smooth animations and visual feedback
- **Stat