Speech-to-Text Skill
File Organization: Split structure. See references/ for detailed implementations.
1. Overview
Risk Level: MEDIUM - Processes audio input, potential privacy concerns, resource-intensive
You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.
You excel at:
- Faster Whisper deployment and optimization
- Audio preprocessing and noise reduction
- Real-time streaming transcription
- Privacy-preserving voice processing
- Multi-language and accent handling
Primary Use Cases:
- JARVIS voice command recognition
- Real-time transcription with low latency
- Offline speech recognition (no cloud dependency)
- Multi-language support for accessibility
2. Core Principles
- TDD First - Write tests before implementation; verify accuracy metrics
- Performance Aware - Optimize latency, memory, and throughput for real-time use
- Privacy First - Process locally, delete immediately, never log content
- Security Conscious - Validate inputs, secure temp files, filter PII
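The "never log content" and "filter PII" principles can be illustrated with a small sketch. The function names `redact_pii` and `log_fields` and both regular expressions are hypothetical and chosen only for illustration; real PII coverage would need a broader, tested pattern set.

```python
import re

# Hypothetical patterns for illustration only; they catch common email and
# phone shapes but are not an exhaustive PII filter.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _PHONE.sub("[PHONE]", text)


def log_fields(transcript: str, duration_s: float) -> dict:
    """Build log metadata that describes a transcription without its content."""
    return {"duration_s": round(duration_s, 2), "chars": len(transcript)}
```

Logging only the output of `log_fields` keeps the transcript itself out of the logs; `redact_pii` is for the cases where some text must be shown.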
3. Core Responsibilities
3.1 Privacy-First Audio Processing
When implementing STT, you will:
- Process locally - No audio sent to external services
- Minimize retention - Delete audio after transcription
- Secure temp files - Use encrypted temporary storage
- Log carefully - Never log audio content or transcriptions with PII
- Validate audio - Check format and size before processing
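The "Validate audio" step above might look like the following sketch. It uses only the standard-library `wave` module, so it accepts PCM WAV input only. The names `validate_wav` and `AudioValidationError` and both limits are assumptions rather than values taken from this skill.

```python
import os
import wave

MAX_BYTES = 50 * 1024 * 1024  # assumed size limit
MAX_SECONDS = 30.0            # assumed duration limit


class AudioValidationError(ValueError):
    """Raised when an audio file fails a pre-transcription check."""


def validate_wav(path: str, max_bytes: int = MAX_BYTES,
                 max_seconds: float = MAX_SECONDS) -> float:
    """Check file size and WAV header; return the clip duration in seconds."""
    size = os.path.getsize(path)
    if size == 0 or size > max_bytes:
        raise AudioValidationError(f"unacceptable file size: {size} bytes")
    try:
        with wave.open(path, "rb") as wav:
            frames, rate = wav.getnframes(), wav.getframerate()
    except (wave.Error, EOFError) as exc:
        raise AudioValidationError("not a readable PCM WAV file") from exc
    if rate <= 0:
        raise AudioValidationError("invalid sample rate in header")
    duration = frames / rate
    if duration > max_seconds:
        raise AudioValidationError(f"clip too long: {duration:.1f}s")
    return duration
```

Checking the size before opening the file means an oversized upload is rejected without being parsed.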
3.2 Performance Optimization
- Optimize model selection for hardware (GPU/CPU)
- Implement voice activity detection (VAD)
- Use streaming for real-time feedback
- Minimize latency for responsive voice assistant
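As a rough sketch of the VAD step, the helper below splits 16-bit mono PCM into fixed-length frames and merges consecutive speech frames into regions, so that only those regions need to reach the model. `frames` and `speech_regions` are hypothetical names, and the speech detector is passed in as a callable; the trailing comment shows how `webrtcvad` could plausibly be plugged in.

```python
from typing import Callable, Iterator, List, Tuple

SAMPLE_RATE = 16000   # Hz
FRAME_MS = 30         # webrtcvad accepts 10, 20, or 30 ms frames
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def frames(pcm: bytes, sample_rate: int = SAMPLE_RATE,
           frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield fixed-length frames, dropping a trailing partial frame."""
    step = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - step + 1, step):
        yield pcm[start:start + step]


def speech_regions(pcm: bytes, is_speech: Callable[[bytes, int], bool],
                   sample_rate: int = SAMPLE_RATE,
                   frame_ms: int = FRAME_MS) -> List[Tuple[int, int]]:
    """Merge consecutive speech frames into (start_ms, end_ms) regions."""
    regions: List[Tuple[int, int]] = []
    start = None
    end = 0
    for i, frame in enumerate(frames(pcm, sample_rate, frame_ms)):
        t0 = i * frame_ms
        end = t0 + frame_ms
        if is_speech(frame, sample_rate):
            if start is None:
                start = t0  # a speech region opens here
        elif start is not None:
            regions.append((start, t0))  # region closes at the first silent frame
            start = None
    if start is not None:
        regions.append((start, end))  # audio ended while still in speech
    return regions


# With webrtcvad installed, the detector could be supplied like this:
#   import webrtcvad
#   vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)
#   regions = speech_regions(pcm, vad.is_speech)
```

Passing the detector as a callable keeps the framing logic testable without the native `webrtcvad` dependency.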
4. Technical Foundation
4.1 Core Technologies
Faster Whisper
| Use Case | Version | Notes |
|---|---|---|
| Production | faster-whisper>=1.0.0 | CTranslate2 optimized |
| Minimum | faster-whisper>=0.9.0 | Stable API |
Supporting Libraries
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10
pydub>=0.25.0
structlog>=23.0
4.2 Model Selection Guide
| Model | Parameters | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39M | Fastest | Low | Testing |
| base | 74M | Fast | Medium | Quick responses |
| small | 244M | Medium | Good | General use |
| medium | 769M | Slow | Better | Complex audio |
| large-v3 | 1550M | Slowest | Best | Maximum accuracy |
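One way to turn the table into code is a small chooser like the sketch below. `ModelChoice` and `choose_model` are hypothetical names, and the mapping from hardware and latency needs to a model size is an assumption to tune against measurements on the target machine, not a rule from this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model_size: str
    device: str
    compute_type: str


def choose_model(has_gpu: bool, realtime: bool) -> ModelChoice:
    """Pick a table row from available hardware and latency needs (assumed mapping)."""
    if has_gpu:
        # float16 is a common compute type on CUDA devices.
        return ModelChoice("small" if realtime else "large-v3", "cuda", "float16")
    # int8 quantization keeps CPU inference fast and memory use low.
    return ModelChoice("base" if realtime else "small", "cpu", "int8")
```

The result can be passed straight to the constructor, for example `WhisperModel(c.model_size, device=c.device, compute_type=c.compute_type)`.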
5. Implementation Workflow (TDD)
Step 1: Write Failing Test First
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf


class TestSTTEngine:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        # One second of silence at 16 kHz
        audio = np.zeros(16000, dtype=np.float32)
        path = tmp_path / "test.wav"
        sf.write(path, audio, 16000)
        assert isinstance(engine.transcribe(str(path)), str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        path = tmp_path / "test.wav"
        sf.write(path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(path))
        assert not path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        # 51 MB of filler bytes; only the size matters here
        large_file = tmp_path / "large.wav"
        large_file.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(large_file))


class TestSTTPerformance:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        # One second of low-amplitude noise; model loading happens in the fixture
        audio = np.random.randn(16000).astype(np.float32) * 0.1
        path = tmp_path / "short.wav"
        sf.write(path, audio, 16000)
        start = time.perf_counter()
        engine.transcribe(str(path))
        assert (time.perf_counter() - start) * 1000 < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        initial = tracemalloc.get_traced_memory()[0]
        for i in range(10):
            path = tmp_path / f"test_{i}.wav"
            sf.write(path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(path))
        # Growth in Python-level allocations, in MB
        growth = (tracemalloc.get_traced_memory()[0] - initial) / 1024 / 1024
        tracemalloc.stop()
        assert growth < 50, f"Memory grew {growth:.1f}MB"
Step 2: Implement Minimum to Pass
from faster_whisper import WhisperModel


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        # segments is a generator; transcription runs as it is consumed
        segments, _ = self.model.transcribe(audio_path)
        # Segment text usually carries leading whitespace, so strip before joining
        return " ".join(s.text.strip() for s in segments)
Step 3: Refactor with Full Implementation
Add validation, security, cleanup, and optimizations from Pattern 1.
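A minimal sketch of what the refactored engine could look like, assuming a 50 MB limit (the Step 1 test only requires that a 51 MB file is rejected). `AudioTooLargeError`, `MAX_BYTES`, and the optional `model` parameter are hypothetical additions; the last one lets tests inject a stub in place of a real model.

```python
import os
from typing import Any, Optional

MAX_BYTES = 50 * 1024 * 1024  # assumed limit


class AudioTooLargeError(ValueError):
    """Raised when an input file exceeds the size limit."""


class SecureSTTEngine:
    def __init__(self, model_size: str = "base", device: str = "cpu",
                 compute_type: str = "int8", model: Optional[Any] = None):
        if model is None:
            # Imported lazily so a stub model can be injected without the dependency.
            from faster_whisper import WhisperModel
            model = WhisperModel(model_size, device=device, compute_type=compute_type)
        self.model = model

    def transcribe(self, audio_path: str) -> str:
        try:
            if os.path.getsize(audio_path) > MAX_BYTES:
                raise AudioTooLargeError("audio file exceeds size limit")
            segments, _ = self.model.transcribe(audio_path)
            # Joining here consumes the lazy generator inside the try block.
            return " ".join(s.text.strip() for s in segments)
        finally:
            # Minimize retention: remove the audio on success, failure, or rejection.
            if os.path.exists(audio_path):
                os.remove(audio_path)
```

Putting the deletion in `finally` means a rejected or failed transcription still removes the audio, which is what `test_audio_deleted_after_transcription` and the privacy requirements expect.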
Step 4: Run Full Verification
pytest tests/test_stt_engine.py -v --tb=short
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing
pytest tests/test_stt_engine.py -k "performance" -v
6. Performance Patterns
Pattern 1: Streaming Transcription (Low Latency)