Text-to-speech Bare addon backed by the qvac-tts.cpp
GGML library. Currently ships the Chatterbox Turbo English model;
additional engines will land under the same package as the upstream
library grows.
Runs in-process with a persistent native engine — the GGUFs, the S3Gen
preload, the ggml backend, and any voice-conditioning tensors are
loaded once and reused across every synthesis call. GPU acceleration
(Metal on macOS/iOS, Vulkan / OpenCL on Linux/Windows)
is opt-in via config: { useGPU: true }; the default is CPU. On
Android useGPU flows through to tts-cpp, which picks the GPU
backend per its own per-vendor allowlist (Supertonic on Adreno/OpenCL,
Xclipse/Vulkan, Mali/Vulkan; Chatterbox on Adreno/Xclipse, declined to
CPU on Mali) (see
Backends & GPU acceleration).
- run({ input }) → single PCM buffer.
- runStreaming(asyncIterable): yields one audio chunk per input sentence.
- Set streamChunkTokens and audio flows out of the C++ engine chunk-by-chunk as T3 tokens produce S3Gen+HiFT output; sub-second first-audio-out inside a single utterance.
- Opt in to GPU with config.useGPU: true on GPU-capable hosts, including Android, where tts-cpp selects the GPU backend per its per-vendor allowlist (see Backends & GPU acceleration).
- Backend .so files ship under prebuilds/<bare-target>/qvac__tts-ggml/ and are picked up at runtime via the new backendsDir option (see Backends & GPU acceleration).
- model.cancel(): stops T3 decode on the next token; an in-flight S3Gen chunk runs to completion.

npm install @qvac/tts-ggml
Requires Bare >=1.19.0.
Prebuilds are published for darwin-arm64, android-arm64, ios-arm64;
Linux x64 / Windows prebuilds coming as demand warrants. If your
platform has no prebuild the package falls back to a local build via
bare-make + cmake-vcpkg (see Build from source).
Two engines are wrapped, each with its own GGUF layout under models/:
# Chatterbox turbo (English)
chatterbox-t3-turbo.gguf (~742 MB) — T3 GPT-2 Medium + BPE + VoiceEncoder
chatterbox-s3gen.gguf (~1.0 GB) — S3Gen encoder/CFM + HiFT + CAMPPlus + S3TokenizerV2
# Chatterbox multilingual (en/es/fr/de/pt/it/zh/ja/ko/...)
chatterbox-t3-mtl.gguf (~1.0 GB)
chatterbox-s3gen-mtl.gguf (~1.0 GB)
# Supertonic English (Supertone/supertonic; 44.1 kHz, voice baked in)
supertonic.gguf (~263 MB)
# Supertonic multilingual (Supertone/supertonic-2; en/ko/es/pt/fr)
supertonic2.gguf (~263 MB)
The package converts these from upstream Resemble Chatterbox / Supertone checkpoints via a Python venv pipeline:
npm run setup-models # creates ./venv, installs requirements.txt, runs convert-models.sh
Or step-by-step:
npm run setup:venv
npm run convert-models
Point the addon at a custom location via files.modelDir (engine
auto-detected from the gguf filenames present), or pass explicit
files.t3Model + files.s3genModel (Chatterbox) /
files.supertonicModel (Supertonic).
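Before calling load() it can help to check which GGUF layouts a directory actually contains. The sketch below is a hypothetical pre-flight helper, not the addon's own auto-detection logic: `detectEngines` and the layout labels are names invented for this example, and it only matches the filenames listed above.

```javascript
// Hypothetical pre-flight check: which engine layouts are complete in a
// directory listing? Filenames are the ones listed in this README; the
// layout labels on the left are made up for this example.
const LAYOUTS = {
  'chatterbox-turbo': ['chatterbox-t3-turbo.gguf', 'chatterbox-s3gen.gguf'],
  'chatterbox-mtl': ['chatterbox-t3-mtl.gguf', 'chatterbox-s3gen-mtl.gguf'],
  supertonic: ['supertonic.gguf'],
  supertonic2: ['supertonic2.gguf']
}

// Returns the labels of every layout whose files are all present.
function detectEngines (filenames) {
  const present = new Set(filenames)
  return Object.keys(LAYOUTS).filter(name =>
    LAYOUTS[name].every(file => present.has(file))
  )
}
```

Feed it a directory listing of your models folder; an empty result means load() would most likely fail with a "model not found" error.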
const TTSGgml = require('@qvac/tts-ggml')
const model = new TTSGgml({
files: { modelDir: './models' }, // contains chatterbox-{t3-turbo,s3gen}.gguf
config: { language: 'en' },
opts: { stats: true }
})
await model.load()
const response = await model.run({
type: 'text',
input: 'Hello from qvac tts ggml.'
})
let pcm = []
await response
.onUpdate(data => {
if (data && data.outputArray) pcm = pcm.concat(Array.from(data.outputArray))
})
.await()
// pcm is Int16 mono @ 24 kHz
await model.unload()
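If you want a playable file rather than raw samples, the collected chunks can be joined and wrapped in a WAV header. This is a minimal sketch that assumes 16-bit mono PCM as described above; `concatInt16` and `pcmToWav` are helper names made up for this example and are not part of the package.

```javascript
// Concatenate Int16Array chunks without going through plain JS arrays.
function concatInt16 (chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0)
  const out = new Int16Array(total)
  let offset = 0
  for (const c of chunks) {
    out.set(c, offset)
    offset += c.length
  }
  return out
}

// Wrap mono 16-bit PCM in a canonical 44-byte WAV header.
function pcmToWav (samples, sampleRate = 24000) {
  const dataBytes = samples.length * 2
  const bytes = new Uint8Array(44 + dataBytes)
  const view = new DataView(bytes.buffer)
  const ascii = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
  }
  ascii(0, 'RIFF')
  view.setUint32(4, 36 + dataBytes, true) // RIFF chunk size
  ascii(8, 'WAVE')
  ascii(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size
  view.setUint16(20, 1, true) // format 1 = integer PCM
  view.setUint16(22, 1, true) // mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * 2, true) // byte rate
  view.setUint16(32, 2, true) // block align
  view.setUint16(34, 16, true) // bits per sample
  ascii(36, 'data')
  view.setUint32(40, dataBytes, true)
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(44 + i * 2, samples[i], true)
  }
  return bytes
}
```

In the onUpdate handler you would push each data.outputArray into an array, then call pcmToWav(concatInt16(chunks)) once the response has resolved and write the resulting bytes to disk.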
Use runStreaming(asyncIter) when your text arrives as discrete sentences (e.g. buffered LLM
output) and you want the audio to flow sentence-by-sentence. Each
input yield produces one onUpdate event.
async function * sentencesOverTime () {
yield 'First sentence.'
await new Promise(r => setTimeout(r, 200))
yield 'The second arrives shortly after.'
}
const response = await model.runStreaming(sentencesOverTime())
await response.onUpdate(data => {
// data.outputArray — Int16 PCM for this sentence's audio
// data.chunkIndex — 0-based index of the yielded sentence
// data.sentenceChunk — the sentence text that produced this audio
}).await()
Full runnable demo (with streaming playback):
bare examples/chatterbox-sentence-stream-tts.js
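LLM output usually arrives as token fragments rather than whole sentences, so something has to buffer it before it reaches runStreaming(). The sketch below is one possible adapter; `SentenceBuffer` and `sentencesFrom` are hypothetical names, and the boundary rule (terminal punctuation followed by whitespace) is a deliberate simplification that will mis-split abbreviations such as "e.g. ".

```javascript
// Accumulates streamed text and releases complete sentences.
// ASSUMPTION: a sentence ends at ., ! or ? followed by whitespace.
class SentenceBuffer {
  constructor () {
    this.pending = ''
  }

  // Add a text fragment; returns any sentences it completed.
  push (fragment) {
    this.pending += fragment
    const sentences = []
    const boundary = /[.!?]+\s+/g
    let start = 0
    let match
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length
      sentences.push(this.pending.slice(start, end).trim())
      start = end
    }
    this.pending = this.pending.slice(start)
    return sentences
  }

  // Return whatever is left once the source has ended.
  flush () {
    const rest = this.pending.trim()
    this.pending = ''
    return rest ? [rest] : []
  }
}

// Adapts an async iterable of text fragments into an async iterable of sentences.
async function * sentencesFrom (fragments) {
  const buffer = new SentenceBuffer()
  for await (const fragment of fragments) yield * buffer.push(fragment)
  yield * buffer.flush()
}
```

With this in place, model.runStreaming(sentencesFrom(tokenStream)) would synthesize each sentence as soon as it is complete, where tokenStream stands for whatever async iterable your LLM client exposes.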
Use streamChunkTokens when you want the fastest possible first-audio-out within a
single utterance. The C++ engine splits each synthesis into chunks
of streamChunkTokens speech tokens (25 ≈ 1 s of audio) and emits
audio per chunk, keeping HiFT's source cache phase-continuous across
seams so the joins are inaudible.
const model = new TTSGgml({
files: { modelDir: './models' },
referenceAudio: './voices/jfk.wav', // optional
streamChunkTokens: 25, // ~1 s of audio per chunk
streamFirstChunkTokens: 10, // smaller first chunk = faster first-audio-out
cfmSteps: 1, // 1-step meanflow: halves CFM cost
config: { language: 'en' }
})
await model.load()
const response = await model.run({ input: 'A long sentence produces many chunks...' })
await response.onUpdate(data => {
if (data && data.outputArray) playPcmChunk(data.outputArray)
}).await()
Full runnable demo (with gapless playback via sox or ffplay):
bare examples/chatterbox-chunk-stream-tts.js
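To get a feel for how streamChunkTokens and streamFirstChunkTokens trade latency against chunk count, the arithmetic can be sketched as follows. It relies only on the "25 tokens ≈ 1 s" figure above; the split itself (one first chunk, then fixed-size chunks, remainder last) is an assumption for illustration and the engine's real chunk boundaries may differ. `tokensToSeconds` and `planChunks` are hypothetical helpers.

```javascript
const TOKENS_PER_SECOND = 25 // from "25 ≈ 1 s of audio" above

// Approximate audio duration (seconds) covered by a number of speech tokens.
function tokensToSeconds (tokens) {
  return tokens / TOKENS_PER_SECOND
}

// Illustrative split of an utterance into chunk sizes.
// ASSUMPTION: first chunk, then fixed-size chunks, remainder in the last one.
function planChunks (totalTokens, streamChunkTokens, streamFirstChunkTokens = streamChunkTokens) {
  if (!(streamChunkTokens > 0) || !(streamFirstChunkTokens > 0)) {
    throw new RangeError('chunk sizes must be positive')
  }
  const sizes = []
  let remaining = totalTokens
  let next = streamFirstChunkTokens
  while (remaining > 0) {
    const size = Math.min(next, remaining)
    sizes.push(size)
    remaining -= size
    next = streamChunkTokens
  }
  return sizes
}
```

For a 60-token utterance with the settings shown above, planChunks(60, 25, 10) gives a first chunk of 10 tokens (about 0.4 s of audio) followed by two 25-token chunks.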
Pass a mono wav ≥ 5 s of clean speech — the engine does the loudness
normalisation (−27 LUFS), resampling, and all conditioning (VoiceEncoder,
CAMPPlus, S3TokenizerV2, mel extraction) natively at load() time:
const model = new TTSGgml({
files: { modelDir: './models' },
referenceAudio: './voices/me.wav',
config: { language: 'en' }
})
Alternatively point at a pre-baked profile directory produced by the
upstream CLI's --save-voice DIR (loads .npy tensors; skips the
preprocessing entirely):
new TTSGgml({
files: { modelDir: './models' },
voiceDir: './voices/me/',
})
When both are supplied, missing tensors in voiceDir are backfilled
from referenceAudio.
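Because the engine expects a mono wav with at least 5 s of speech, it may be worth validating the file before passing it as referenceAudio. The following is a rough sketch that reads only the RIFF/WAVE header; `inspectWav` and `checkReference` are hypothetical helpers, and the check says nothing about how clean the speech is.

```javascript
// Reads channel count and duration from a RIFF/WAVE header.
// `bytes` is a Uint8Array holding the file (or at least its header chunks).
function inspectWav (bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const tag = (offset) => String.fromCharCode(
    bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3])
  if (bytes.length < 12 || tag(0) !== 'RIFF' || tag(8) !== 'WAVE') {
    throw new Error('not a RIFF/WAVE file')
  }
  let channels, sampleRate, blockAlign, dataBytes
  let offset = 12
  while (offset + 8 <= bytes.length) {
    const id = tag(offset)
    const size = view.getUint32(offset + 4, true)
    if (id === 'fmt ') {
      channels = view.getUint16(offset + 10, true)
      sampleRate = view.getUint32(offset + 12, true)
      blockAlign = view.getUint16(offset + 20, true)
    } else if (id === 'data') {
      dataBytes = size
      break
    }
    offset += 8 + size + (size % 2) // chunks are padded to an even length
  }
  if (!sampleRate || !blockAlign || dataBytes === undefined) {
    throw new Error('missing fmt or data chunk')
  }
  return { channels, sampleRate, seconds: dataBytes / blockAlign / sampleRate }
}

// Returns a list of human-readable problems; empty means the file looks usable.
function checkReference (bytes, minSeconds = 5) {
  const info = inspectWav(bytes)
  const problems = []
  if (info.channels !== 1) problems.push(`expected mono, got ${info.channels} channels`)
  if (info.seconds < minSeconds) {
    problems.push(`only ${info.seconds.toFixed(1)} s, need at least ${minSeconds} s`)
  }
  return problems
}
```

Read the wav into a Uint8Array with your runtime's file API and call checkReference(bytes) before constructing the model.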
The addon delegates backend selection to tts-cpp's registry-only
init path. At load() time the engine walks the ggml-backend registry
once and picks the first available accelerator that matches the
host's policy:
| Platform | Default backend when useGPU: true |
|---|---|
| macOS / iOS | Metal |
| Linux / Windows | Vulkan |
| Android — Adreno 700+ | OpenCL |
| Android — Mali / others | Vulkan |
| Everything else / CPU-only build | CPU |
Chatterbox on ARM Mali is the one exception to the table:
tts-cpp declines Mali for the Chatterbox / S3Gen graph (allow_arm_mali=false) and runs it on CPU there (reported via stats.gpuUnsupported). Supertonic runs on Mali via Vulkan.
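The policy above can be restated as a small lookup, which is handy for asserting in tests that a device ended up on the backend you expected. This is only a paraphrase of the table and the Mali exception, not tts-cpp's actual selection code; `expectedBackend` and its `platform`, `gpu` and `engine` labels are invented for this example.

```javascript
// Restates the policy table as a lookup.
// ASSUMPTION: the string labels are chosen for this sketch; the real
// selection happens inside tts-cpp and depends on what the build ships.
function expectedBackend ({ platform, gpu = 'other', engine = 'chatterbox', useGPU = false }) {
  if (!useGPU) return 'CPU' // GPU is opt-in
  switch (platform) {
    case 'darwin':
    case 'ios':
      return 'Metal'
    case 'linux':
    case 'win32':
      return 'Vulkan'
    case 'android':
      // Chatterbox is declined on ARM Mali and stays on CPU.
      if (gpu === 'mali' && engine === 'chatterbox') return 'CPU'
      return gpu === 'adreno-700+' ? 'OpenCL' : 'Vulkan'
    default:
      return 'CPU'
  }
}
```

Comparing the result with the backend reported in response.stats after a run is one way to notice a silent CPU fallback.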
Android prebuilds enable GGML_BACKEND_DL=ON and ship per-arch
backend .so files under
prebuilds/<bare-target>/qvac__tts-ggml/.
The engine dlopen()s the highest-tier CPU variant the device's
HWCAPs support and one of the GPU .so files based on the policy
table above. Hosts must pass backendsDir: path.join(__dirname, 'prebuilds') (or rely on the default fallback the package ships)
so the runtime knows where to look. openclCacheDir is also
Android-specific; setting it to a writable path lets the OpenCL
backend persist its compiled program cache across launches.
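Putting the Android-specific options together might look like the sketch below. `withAndroidBackends`, `addonDir` and `cacheDir` are placeholder names for this example; the host app has to supply real paths, and the snippet assumes a `path` module is resolvable (under Bare that would typically be `bare-path`).

```javascript
const path = require('path') // assumption: under Bare, bare-path offers the same join()

// Adds the Android-specific options described above to a base options object.
// `addonDir` and `cacheDir` are placeholders the host app must supply.
function withAndroidBackends (options, { addonDir, cacheDir }) {
  return {
    ...options,
    backendsDir: path.join(addonDir, 'prebuilds'), // where the backend .so files live
    openclCacheDir: cacheDir, // must be writable by the app
    config: { ...options.config, useGPU: true }
  }
}
```

The returned object can be passed straight to the constructor, e.g. new TTSGgml(withAndroidBackends(baseOptions, { addonDir, cacheDir })).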
new TTSGgml(options)

| Option | Type | Default | Notes |
|---|---|---|---|
| files.modelDir | string | - | Dir containing the two GGUFs |
| files.t3Model | string | - | Overrides modelDir for T3 |
| files.s3genModel | string | - | Overrides modelDir for S3Gen |
| referenceAudio | string | - | Mono wav ≥ 5 s for voice cloning |
| voiceDir | string | - | Pre-baked voice profile |
| seed | number | 42 | RNG seed (CFM noise + sampling) |
| nGpuLayers | number | 0 | Layers offloaded to GPU (mirrors useGPU; pass 99 to offload all) |
| nCtx | number | 4096 | Cap on the T3 context (prompt + generated speech tokens; 25 tokens ≈ 1 s of audio). The KV cache is allocated up-front at this length, so it directly bounds memory: the Turbo GGUF's native n_ctx=8196 would cost ~1.6 GB of f32 KV vs ~390 MB at the defaults (4096 + f16). Pass 0 to use the GGUF's full context |
| kvCacheType | string | f16 | T3 KV-cache dtype: f32 \| f16 \| q8_0. f16 (~50% of f32) is the safe cross-backend default. q8_0 stores the cache at ~27% of f32 and decodes 20-30% faster on Metal, but only works on backends with a q8_0 CONT op (CPU, CUDA); it hard-aborts the multilingual model on Metal, so it is opt-in. Turbo greedy decoding is byte-identical across all three (upstream-validated). Pass f32 for bit-exact pre-quantisation behaviour |
| threads | number | hw.concurrency capped at 4 | |
| streamChunkTokens | number | 0 | >0 enables native chunk streaming |
| streamFirstChunkTokens | number | = streamChunkTokens | Smaller first chunk for low first-audio-out |
| cfmSteps | number | 2 | 1 = faster (halved CFM cost) |
| backendsDir | string | path.join(__dirname, 'prebuilds') | Root dir the addon scans for dynamically-loaded ggml backend .so files. Required on Android (host should pass path.join(__dirname, 'prebuilds')); ignored on platforms that statically link the backend |
| openclCacheDir | string | unset | Android-only: directory where the OpenCL backend persists its compiled program-binary cache. Setting it across runs avoids re-JITing the kernels on every fresh process |
| config.language | string | "en" | Chatterbox MTL accepts es/fr/de/pt/it/zh/ja/ko/...; turbo & Supertonic are English |
| config.useGPU | boolean | false | Set to true to route through Metal / Vulkan / CUDA / OpenCL if available. Honored for both engines on GPU-capable hosts, including Android, where tts-cpp selects the GPU backend per its per-vendor allowlist (Chatterbox falls back to CPU on Mali) |
| config.outputSampleRate | number | 24000 | Resample native 24 kHz output |
| opts.stats | boolean | false | Populate response.stats with RTF, backendDevice (0=CPU, 1=GPU), backendId (0=CPU, 1=Metal, 3=Vulkan, 4=OpenCL, 99=other) etc. |
| opts.exclusiveRun | boolean | false | Serialize overlapping streaming runs |
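The memory figures quoted for nCtx and kvCacheType can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the standard GPT-2 Medium dimensions (24 layers, 1024-wide hidden state) and ggml's q8_0 layout of 34 bytes per 32 values; `kvCacheBytes` is a hypothetical helper, and the engine's real allocation may include padding or other buffers.

```javascript
// Bytes per cached element for each kvCacheType. q8_0 packs 32 values
// into a 34-byte block (32 int8 values plus one f16 scale).
const BYTES_PER_ELEMENT = { f32: 4, f16: 2, q8_0: 34 / 32 }

// ASSUMPTION: GPT-2 Medium dimensions (24 layers, hidden size 1024);
// the engine's actual allocation may differ.
function kvCacheBytes ({ nCtx = 4096, kvCacheType = 'f16', layers = 24, hidden = 1024 } = {}) {
  const bytes = BYTES_PER_ELEMENT[kvCacheType]
  if (bytes === undefined) throw new RangeError(`unknown kvCacheType: ${kvCacheType}`)
  return 2 * layers * nCtx * hidden * bytes // 2 = one K and one V tensor per layer
}
```

Under these assumptions the defaults come to 402,653,184 bytes (384 MiB), in line with the ~390 MB quoted above, while nCtx=8196 with f32 comes to about 1.61 GB. q8_0 works out to 26.6% of the f32 size at the same context length.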
- await model.load(): construct the native engine (loads T3, preloads S3Gen, bakes voice conditioning). Subsequent run() calls reuse all of it.
- await model.unload(): release everything. Idempotent.
- await model.reload(newConfig): re-create the engine with a new config (language, useGPU, outputSampleRate, …).
- await model.destroy(): unload() + mark this instance dead.
- await model.cancel(): best-effort cancel of any in-flight run.
- model.run({ input, type: 'text' }) → QvacResponse.
- model.run({ input, streamOutput: true }) → sentence-chunked synthesis driven by the JS-side sentence splitter (see lib/textChunker.js). Equivalent to runStream(input).
- model.runStream(text, { locale?, maxChunkScalars? }) → same as above, but the options read more naturally for the "split this long string" use case.
- model.runStreaming(textStream, opts) → streaming input + streaming output (see Sentence streaming).

All run* methods return a QvacResponse (from @qvac/infer-base):
response.onUpdate(data => {
data.outputArray // Int16Array — 24 kHz mono PCM
data.sampleRate // 24000
data.chunkIndex // present on sentence-streaming events only
data.sentenceChunk // present on sentence-streaming events only
})
await response.await()
// response.stats — only when constructor had `opts: { stats: true }`
response.stats.totalTime // seconds
response.stats.realTimeFactor // synthesis time / audio duration
response.stats.audioDurationMs
response.stats.totalSamples
response.stats.tokensPerSecond
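The stats fields are easier to read once reduced to a short summary. The sketch below recomputes the real-time factor from totalTime and totalSamples as a cross-check and maps backendId to the names listed in the options table; `summariseStats` is a hypothetical helper, and it assumes totalTime is in seconds as noted above.

```javascript
// backendId values as listed in the opts.stats row of the options table.
const BACKEND_NAMES = { 0: 'CPU', 1: 'Metal', 3: 'Vulkan', 4: 'OpenCL', 99: 'other' }

// Summarise a stats object; RTF is recomputed from the raw fields.
function summariseStats (stats, sampleRate = 24000) {
  const audioSeconds = stats.totalSamples / sampleRate
  return {
    backend: BACKEND_NAMES[stats.backendId] ?? `unknown (${stats.backendId})`,
    audioSeconds,
    realTimeFactor: stats.totalTime / audioSeconds, // synthesis time / audio duration
    fasterThanRealTime: stats.totalTime < audioSeconds
  }
}
```

A realTimeFactor below 1 means synthesis finished faster than the audio takes to play back. Pass the configured outputSampleRate as the second argument if you changed it from 24000.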
Runnable demos under examples/:
| Script | Demonstrates |
|---|---|
| chatterbox-tts.js | Batch synth + wav dump. bare examples/chatterbox-tts.js "Hello" |
| chatterbox-sentence-stream-tts.js | runStreaming() over an async iterator of sentences, with gapless streaming playback |
| chatterbox-chunk-stream-tts.js | Native per-chunk PCM streaming via streamChunkTokens, with gapless streaming playback |
The two streaming examples feed PCM into a single long-running
sox play / ffplay process so chunks play back-to-back without any
per-chunk spawn gaps — install one of them (brew install sox or
brew install ffmpeg on macOS) to enable playback. Absent a player
the demos still run and write the concatenated wav.
npm run test:unit # mocked binding; fast
npm run test:integration # spins up the real engine; needs models
npm run test # both
Integration tests scan a few candidate models/ directories for the
required GGUFs (see test/utils/downloadModel.js) and skip cleanly when
files are absent. They cover, across both engines:
- sentence streaming (runStream / run({ streamOutput: true }) / runStreaming over async iterators),
- native chunk streaming (streamChunkTokens),
- response.stats.backendDevice + backendId (set NO_GPU=true to skip on CPU-only runners, QVAC_TTS_GPU_SMOKE_RELAX=1 to downgrade the strict gate to a warning),
- chatterbox-mtl.test.js.

To stress-test long inputs, set INPUT_SENTENCES=medium (or long)
and re-run the integration suite; addon.test.js reads the env var to
pick its sentence corpus from test/data/sentences-{medium,long}.js.
Prerequisites: clang with C++20 support, CMake ≥ 3.25,
vcpkg (set VCPKG_ROOT), bare-make.
npm install
npx bare-make generate # configures + fetches the tts-cpp port
npx bare-make build
npx bare-make install # copies the .bare into prebuilds/<triple>/
The vcpkg port is hosted in
tetherto/qvac-registry-vcpkg and pulls
qvac-tts.cpp at a pinned REF. See
vcpkg-configuration.json for the
baseline commit.
GPU backends are controlled by the tts-cpp port's vcpkg features:
metal (default on osx/ios), vulkan (default on
linux/windows/android), opencl (default on android).
On Android the port is configured with
GGML_BACKEND_DL=ON + GGML_CPU_ALL_VARIANTS=ON, so the build
produces per-arch CPU + Vulkan + OpenCL .so files alongside the
.bare module instead of statically linking; the resulting prebuilds
layout is what the backendsDir option expects (see
Backends & GPU acceleration).
- t3 model not found / supertonic model not found: the paths in files are wrong or the GGUFs weren't generated. Run npm run setup-models (creates the Python venv and converts the upstream checkpoints into the four / five expected GGUF files).
- VoiceEncoder forward failed when passing referenceAudio: the reference wav is likely < 5 s of clean speech. Make it longer (10–15 s gives the best similarity).
- Crash on process exit with Metal's [rsets->data count] == 0 assertion: you're running on a build before the s3gen_unload() teardown fix; bump the tts-cpp port to >= 2026-04-21 port-version.
- Slower-than-expected RTF on darwin: set config: { useGPU: true } (the default is now CPU; see the constructor options) and check that the build includes the metal feature. Also confirm your reference wav's mel was baked (Using C++ VoiceEncoder / C++ S3TokenizerV2 messages in the log); if voice conditioning falls back to CPU, a chunk of the first-call overhead is visible in RTF.
- Slow-but-otherwise-fine RTF on Android: set config: { useGPU: true } (the default is CPU; see Backends & GPU acceleration) and confirm your device's GPU is on tts-cpp's per-vendor allowlist. Chatterbox is declined to CPU on ARM Mali, so on a Mali device that engine stays on CPU regardless; Supertonic runs on the GPU there.
Apache-2.0. See LICENSE.