Speech synthesis authoring for Eleven-class models

Treat speech synthesis like signal processing for language: ambiguity in writing becomes chaos in waveform space.

Pauses & rhythm

Older SSML-compatible stacks accept explicit break tags such as <break time="1.5s" />; newer expressive models rely on narrative punctuation and bracketed delivery tags instead. Place pauses at punctuation rather than at arbitrary mid-clause commas, unless irony demands it; too many breaks destabilize some voices.
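When a script written for an SSML stack has to run on a model that ignores break tags, one option is to turn each break into punctuation. The sketch below is only an illustration: the helper name breaks_to_punctuation and the 1.0 s threshold are assumptions, not part of any vendor API.

```python
import re

# Matches <break time="1.5s" /> or <break time="500ms"/>, plus surrounding whitespace.
BREAK_RE = re.compile(r'\s*<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/>\s*')


def breaks_to_punctuation(text: str, long_pause_s: float = 1.0) -> str:
    """Replace SSML break tags with punctuation (hypothetical helper).

    Long pauses become an ellipsis, short ones a comma. If the text already
    has punctuation right before the tag, the tag is simply dropped.
    """
    def repl(m: re.Match) -> str:
        if m.start() > 0 and m.string[m.start() - 1] in ".,;:!?":
            return " "
        value = float(m.group(1))
        seconds = value / 1000 if m.group(2) == "ms" else value
        return "... " if seconds >= long_pause_s else ", "

    return BREAK_RE.sub(repl, text).strip()


print(breaks_to_punctuation('Wait <break time="1.5s" /> now.'))
```

Tune the threshold by ear per voice; the mapping from seconds to punctuation is a judgment call rather than a fixed rule.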

Normalization playbook

Expand currencies, ordinals, phone numbers, and ambiguous decimals into the words a listener should hear; aim for conversational clarity, not spreadsheet fidelity.
Convert keyboard shortcuts (Cmd/Alt/Ctrl combos) into spoken phrases instead of glyphs.
For URLs, either spell out the full path or use a short brand reference; do not leave raw runs of slashes for the model to guess at.
When an upstream LLM drafts the copy, prepend an instruction block that mirrors the ExplainX normalization recipes (cardinal vs. ordinal distinctions, “St.” as Saint vs. Street).
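A minimal sketch of the first two rules, assuming US-dollar amounts and a small set of modifier keys. The function names and the 0 to 999 spelling range are illustrative choices; a production pipeline would likely use a fuller number-to-words library.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Assumed spoken forms for modifier keys.
KEYS = {"cmd": "Command", "ctrl": "Control", "alt": "Alt", "shift": "Shift"}


def say_number(n: int) -> str:
    """Spell out 0..999; anything else is returned as digits."""
    if n < 0 or n > 999:
        return str(n)
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + say_number(rest) if rest else "")


def say_currency(m: re.Match) -> str:
    dollars = int(m.group(1))
    cents = int(m.group(2)) if m.group(2) else 0
    parts = [f"{say_number(dollars)} dollar{'' if dollars == 1 else 's'}"]
    if cents:
        parts.append(f"{say_number(cents)} cent{'' if cents == 1 else 's'}")
    return " and ".join(parts)


def say_shortcut(m: re.Match) -> str:
    return " ".join(KEYS.get(k.lower(), k) for k in m.group(0).split("+"))


def normalize(text: str) -> str:
    """Expand $ amounts and modifier-key shortcuts into spoken phrases."""
    text = re.sub(r"\$(\d+)(?:\.(\d{2}))?", say_currency, text)
    text = re.sub(r"\b(?:Cmd|Ctrl|Alt|Shift)(?:\+\w+)+", say_shortcut, text,
                  flags=re.IGNORECASE)
    return text


print(normalize("Press Cmd+Shift+S to save $12.50."))
```

Context-dependent cases such as “St.” are better left to the upstream instruction block, since a regex cannot tell a saint from a street.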

Pronunciation controls

Phoneme tags work best on supported English flash models; verify compatibility before committing to SSML-heavy scripts. Alias grapheme→phoneme substitutions apply project-wide inside pronunciation dictionaries; entries are matched case-sensitively, so check casing during bulk imports.
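As an illustration, a dictionary can be assembled as a W3C Pronunciation Lexicon Specification (PLS) document. This sketch assumes the target platform imports PLS files with IPA phonemes; the function names are hypothetical, and case_collisions is only a pre-import sanity check for the case-sensitivity issue noted above.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def case_collisions(graphemes):
    """Return groups of entries that differ only by letter case."""
    groups = defaultdict(set)
    for g in graphemes:
        groups[g.lower()].add(g)
    return [sorted(v) for v in groups.values() if len(v) > 1]


def build_pls(aliases: dict[str, str], phonemes: dict[str, str],
              lang: str = "en-US") -> str:
    """Build a PLS lexicon string from alias and IPA phoneme mappings."""
    root = ET.Element("lexicon", {
        "version": "1.0",
        "xmlns": PLS_NS,
        "alphabet": "ipa",
        "xml:lang": lang,
    })
    for grapheme, alias in aliases.items():
        lexeme = ET.SubElement(root, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = grapheme
        ET.SubElement(lexeme, "alias").text = alias
    for grapheme, ipa in phonemes.items():
        lexeme = ET.SubElement(root, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = grapheme
        ET.SubElement(lexeme, "phoneme").text = ipa
    return ET.tostring(root, encoding="unicode")


pls = build_pls({"UN": "United Nations"}, {"tomato": "təˈmeɪtoʊ"})
print(pls)
print(case_collisions(["UN", "un", "tomato"]))
```

Running the collision check before a bulk import surfaces entries such as "UN" and "un" that a case-sensitive matcher would treat as separate words.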

Eleven “v3” expressive tags

Use bracket tags such as [whispers], [laughs], and [sighs] sparingly: they steer delivery, but they clash when the acoustic priors in the voice corpus do not match the requested style. Compose dialogue cinematically; prune tags downstream if audible artifacts creep in.
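Downstream pruning can be a small text pass. The sketch below assumes tags are lowercase words in square brackets, as in the examples above; the helper names are hypothetical.

```python
import re

# Lowercase bracket tags such as [whispers] or [laughs softly].
TAG_RE = re.compile(r"\[([a-z][a-z ]*)\]")


def find_tags(text: str) -> list[str]:
    """List the delivery tags used in a script, in order."""
    return TAG_RE.findall(text)


def prune_tags(text: str, keep=()) -> str:
    """Drop every bracket tag not named in `keep`, then tidy whitespace."""
    def repl(m: re.Match) -> str:
        return m.group(0) if m.group(1) in keep else ""

    return re.sub(r"\s{2,}", " ", TAG_RE.sub(repl, text)).strip()


line = "[whispers] I know. [laughs] Really."
print(find_tags(line))
print(prune_tags(line, keep={"whispers"}))
```

Keeping an allow-list per voice makes it easy to re-render a line with fewer tags when a particular tag produces artifacts.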

Multi-pass composition

Stitch complex beds (ambience loops, narration, sfx) externally when timing must be frame-accurate; single-shot prose blobs rarely outperform layered stems in dense productions.
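To make the layering idea concrete, here is a toy mixer over mono float samples in the range -1.0 to 1.0. It is a sketch under simplifying assumptions (one shared sample rate, plain Python lists, hard clipping); real productions would normally do this in a DAW or an audio library.

```python
def mix_stems(stems, sample_rate: int) -> list[float]:
    """Sum stems onto one timeline.

    `stems` is a list of (start_seconds, gain, samples) tuples, where
    samples is a list of floats. Start times are converted to sample
    offsets so placement is exact to the sample.
    """
    placed = [(round(start * sample_rate), gain, samples)
              for start, gain, samples in stems]
    length = max((offset + len(s) for offset, _, s in placed), default=0)
    out = [0.0] * length
    for offset, gain, samples in placed:
        for i, value in enumerate(samples):
            out[offset + i] += gain * value
    # Hard-clip to the valid range; a real mix would use a limiter instead.
    return [max(-1.0, min(1.0, v)) for v in out]


ambience = (0.0, 0.5, [1.0, 1.0, 1.0, 1.0])   # bed under the whole clip
narration = (0.5, 1.0, [0.5, 0.5])            # enters half a second in
print(mix_stems([ambience, narration], sample_rate=4))
```

Because each stem keeps its own start time and gain, a single line of narration can be re-rendered and dropped back in without touching the ambience or effects.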
