@qvac/tts-ggml

Text-to-speech Bare addon backed by the qvac-tts.cpp GGML library. Currently ships the Chatterbox Turbo English model; additional engines will land under the same package as the upstream library grows.

Runs in-process with a persistent native engine — the GGUFs, the S3Gen preload, the ggml backend, and any voice-conditioning tensors are loaded once and reused across every synthesis call. GPU acceleration (Metal on macOS/iOS, Vulkan / OpenCL on Linux/Windows) is opt-in via config: { useGPU: true }; the default is CPU. On Android useGPU flows through to tts-cpp, which picks the GPU backend per its own per-vendor allowlist (Supertonic on Adreno/OpenCL, Xclipse/Vulkan, Mali/Vulkan; Chatterbox on Adreno/Xclipse, declined to CPU on Mali) (see Backends & GPU acceleration).

Features

Batch synthesis (run({ input }) → single PCM buffer).
Sentence-granularity streaming — runStreaming(asyncIterable): yields one audio chunk per input sentence.
Native per-chunk streaming — set streamChunkTokens and audio flows out of the C++ engine chunk-by-chunk as T3 tokens produce S3Gen+HiFT output; sub-second first-audio-out inside a single utterance.
Voice cloning from a reference wav (or a pre-baked profile dir).
CPU by default, GPU (Metal / Vulkan / OpenCL) opt-in via config.useGPU: true on GPU-capable hosts — including Android, where tts-cpp selects the GPU backend per its per-vendor allowlist (see Backends & GPU acceleration).
Dynamic backend loading on Android — per-arch CPU + Vulkan + OpenCL .so files ship under prebuilds/<bare-target>/qvac__tts-ggml/ and are picked up at runtime via the new backendsDir option (see Backends & GPU acceleration).
Cancellation via model.cancel() — stops T3 decode on the next token; in-flight S3Gen chunk runs to completion.

Install

npm install @qvac/tts-ggml

Requires Bare >=1.19.0. Prebuilds are published for darwin-arm64, android-arm64, ios-arm64; Linux x64 / Windows prebuilds coming as demand warrants. If your platform has no prebuild the package falls back to a local build via bare-make + cmake-vcpkg (see Build from source).

Model files

Two engines are wrapped, each with its own GGUF layout under models/:

# Chatterbox turbo (English)
chatterbox-t3-turbo.gguf   (~742 MB) — T3 GPT-2 Medium + BPE + VoiceEncoder
chatterbox-s3gen.gguf      (~1.0 GB) — S3Gen encoder/CFM + HiFT + CAMPPlus + S3TokenizerV2

# Chatterbox multilingual (en/es/fr/de/pt/it/zh/ja/ko/...)
chatterbox-t3-mtl.gguf     (~1.0 GB)
chatterbox-s3gen-mtl.gguf  (~1.0 GB)

# Supertonic English (Supertone/supertonic; 44.1 kHz, voice baked in)
supertonic.gguf            (~263 MB)

# Supertonic multilingual (Supertone/supertonic-2; en/ko/es/pt/fr)
supertonic2.gguf           (~263 MB)

The package converts these from upstream Resemble Chatterbox / Supertone checkpoints via a Python venv pipeline:

npm run setup-models   # creates ./venv, installs requirements.txt, runs convert-models.sh

Or step-by-step:

npm run setup:venv
npm run convert-models

Point the addon at a custom location via files.modelDir (engine auto-detected from the gguf filenames present), or pass explicit files.t3Model + files.s3genModel (Chatterbox) / files.supertonicModel (Supertonic).

Quick start

const TTSGgml = require('@qvac/tts-ggml')

const model = new TTSGgml({
  files: { modelDir: './models' }, // contains chatterbox-{t3-turbo,s3gen}.gguf
  config: { language: 'en' },
  opts: { stats: true }
})

await model.load()

const response = await model.run({
  type: 'text',
  input: 'Hello from qvac tts ggml.'
})

let pcm = []
await response
  .onUpdate(data => {
    if (data && data.outputArray) pcm = pcm.concat(Array.from(data.outputArray))
  })
  .await()

// pcm is Int16 mono @ 24 kHz
await model.unload()

Streaming

Sentence streaming — `runStreaming(asyncIter)`

Use when your text arrives as discrete sentences (e.g. buffered LLM output) and you want the audio to flow sentence-by-sentence. One onUpdate event per input yield.

async function * sentencesOverTime () {
  yield 'First sentence.'
  await new Promise(r => setTimeout(r, 200))
  yield 'The second arrives shortly after.'
}

const response = await model.runStreaming(sentencesOverTime())
await response.onUpdate(data => {
  // data.outputArray    — Int16 PCM for this sentence's audio
  // data.chunkIndex     — 0-based index of the yielded sentence
  // data.sentenceChunk  — the sentence text that produced this audio
}).await()

Full runnable demo (with streaming playback): bare examples/chatterbox-sentence-stream-tts.js

Chunk streaming — `streamChunkTokens`

Use when you want the fastest possible first-audio-out within a single utterance. The C++ engine splits each synthesis into chunks of streamChunkTokens speech tokens (25 ≈ 1 s of audio) and emits audio per chunk, keeping HiFT's source cache phase-continuous across seams so the joins are inaudible.

const model = new TTSGgml({
  files: { modelDir: './models' },
  referenceAudio: './voices/jfk.wav', // optional
  streamChunkTokens: 25,              // ~1 s of audio per chunk
  streamFirstChunkTokens: 10,         // smaller first chunk = faster first-audio-out
  cfmSteps: 1,                        // 1-step meanflow: halves CFM cost
  config: { language: 'en' }
})

await model.load()

const response = await model.run({ input: 'A long sentence produces many chunks...' })
await response.onUpdate(data => {
  if (data && data.outputArray) playPcmChunk(data.outputArray)
}).await()

Full runnable demo (with gapless playback via sox or ffplay): bare examples/chatterbox-chunk-stream-tts.js

Voice cloning

Pass a mono wav ≥ 5 s of clean speech — the engine does the loudness normalisation (−27 LUFS), resampling, and all conditioning (VoiceEncoder, CAMPPlus, S3TokenizerV2, mel extraction) natively at load() time:

const model = new TTSGgml({
  files: { modelDir: './models' },
  referenceAudio: './voices/me.wav',
  config: { language: 'en' }
})

Alternatively point at a pre-baked profile directory produced by the upstream CLI's --save-voice DIR (loads .npy tensors; skips the preprocessing entirely):

new TTSGgml({
  files: { modelDir: './models' },
  voiceDir: './voices/me/',
})

When both are supplied, missing tensors in voiceDir are backfilled from referenceAudio.

Backends & GPU acceleration

The addon delegates backend selection to tts-cpp's registry-only init path. At load() time the engine walks the ggml-backend registry once and picks the first available accelerator that matches the host's policy:

Platform	Default backend when `useGPU: true`
macOS / iOS	Metal
Linux / Windows	Vulkan
Android — Adreno 700+	OpenCL
Android — Mali / others	Vulkan
Everything else / CPU-only build	CPU

Chatterbox on ARM Mali is the one exception to the table: tts-cpp declines Mali for the Chatterbox / S3Gen graph (allow_arm_mali=false) and runs it on CPU there (reported via stats.gpuUnsupported). Supertonic runs on Mali via Vulkan.

Android: dynamic backend loading

Android prebuilds enable GGML_BACKEND_DL=ON and ship per-arch backend .so files under prebuilds/<bare-target>/qvac__tts-ggml/.

The engine dlopen()s the highest-tier CPU variant the device's HWCAPs support and one of the GPU .so files based on the policy table above. Hosts must pass backendsDir: path.join(__dirname, 'prebuilds') (or rely on the default fallback the package ships) so the runtime knows where to look. openclCacheDir is also Android-specific; setting it to a writable path lets the OpenCL backend persist its compiled program cache across launches.

API overview

Constructor — `new TTSGgml(options)`

Option	Type	Default	Notes
`files.modelDir`	string	—	Dir containing the two GGUFs
`files.t3Model`	string	—	Overrides `modelDir` for T3
`files.s3genModel`	string	—	Overrides `modelDir` for S3Gen
`referenceAudio`	string	—	Mono wav ≥ 5 s for voice cloning
`voiceDir`	string	—	Pre-baked voice profile
`seed`	number	42	RNG seed (CFM noise + sampling)
`nGpuLayers`	number	0	Layers offloaded to GPU (mirrors `useGPU`; pass `99` to offload all)
`nCtx`	number	4096	Cap on the T3 context (prompt + generated speech tokens; 25 tokens ≈ 1 s of audio). The KV cache is allocated up-front at this length, so it directly bounds memory: the Turbo GGUF's native `n_ctx=8196` would cost ~1.6 GB of f32 KV vs ~390 MB at the defaults (4096 + `f16`). Pass `0` to use the GGUF's full context
`kvCacheType`	string	`f16`	T3 KV-cache dtype: `f32` \| `f16` \| `q8_0`. `f16` (~50% of f32) is the safe cross-backend default. `q8_0` stores the cache at ~27% of f32 and decodes 20-30% faster on Metal, but only works on backends with a q8_0 CONT op (CPU, CUDA) — it hard-aborts the multilingual model on Metal, so it is opt-in. Turbo greedy decoding is byte-identical across all three (upstream-validated). Pass `f32` for bit-exact pre-quantisation behaviour
`threads`	number	hw.concurrency capped at 4
`streamChunkTokens`	number	0	>0 enables native chunk streaming
`streamFirstChunkTokens`	number	= streamChunkTokens	Smaller first chunk for low first-audio-out
`cfmSteps`	number	2	1 = faster (halved CFM cost)
`backendsDir`	string	`path.join(__dirname, 'prebuilds')`	Root dir the addon scans for dynamically-loaded ggml backend `.so` files. Required on Android (host should pass `path.join(__dirname, 'prebuilds')`); ignored on platforms that statically link the backend
`openclCacheDir`	string	unset	Android-only: directory where the OpenCL backend persists its compiled program-binary cache. Setting it across runs avoids re-JITing the kernels on every fresh process
`config.language`	string	`"en"`	Chatterbox MTL accepts `es/fr/de/pt/it/zh/ja/ko/...`; turbo & Supertonic are English
`config.useGPU`	boolean	`false`	Set to `true` to route through Metal / Vulkan / CUDA / OpenCL if available. Honored for both engines on GPU-capable hosts, including Android, where `tts-cpp` selects the GPU backend per its per-vendor allowlist (Chatterbox falls back to CPU on Mali)
`config.outputSampleRate`	number	24000	Resample native 24 kHz output
`opts.stats`	boolean	`false`	Populate `response.stats` with RTF, `backendDevice` (0=CPU, 1=GPU), `backendId` (0=CPU, 1=Metal, 3=Vulkan, 4=OpenCL, 99=other) etc.
`opts.exclusiveRun`	boolean	`false`	Serialize overlapping streaming runs

Methods

await model.load() — construct the native engine (loads T3, preloads S3Gen, bakes voice conditioning). Subsequent run() calls reuse all of it.
await model.unload() — release everything. Idempotent.
await model.reload(newConfig) — re-create the engine with a new config (language, useGPU, outputSampleRate, …).
await model.destroy() — unload() + mark this instance dead.
await model.cancel() — best-effort cancel of any in-flight run.
model.run({ input, type: 'text' }) → QvacResponse.
model.run({ input, streamOutput: true }) → sentence-chunked synthesis driven by the JS-side sentence splitter (see lib/textChunker.js). Equivalent to runStream(input).
model.runStream(text, { locale?, maxChunkScalars? }) → same as above, but the options read more naturally for the "split this long string" use case.
model.runStreaming(textStream, opts) → streaming input + streaming output (see Sentence streaming).

Response shape

All run* methods return a QvacResponse (from @qvac/infer-base):

response.onUpdate(data => {
  data.outputArray   // Int16Array — 24 kHz mono PCM
  data.sampleRate    // 24000
  data.chunkIndex    // present on sentence-streaming events only
  data.sentenceChunk // present on sentence-streaming events only
})
await response.await()

// response.stats — only when constructor had `opts: { stats: true }`
response.stats.totalTime         // seconds
response.stats.realTimeFactor    // synthesis time / audio duration
response.stats.audioDurationMs
response.stats.totalSamples
response.stats.tokensPerSecond

Examples

Runnable demos under examples/:

Script	Demonstrates
`chatterbox-tts.js`	Batch synth + wav dump. `bare examples/chatterbox-tts.js "Hello"`
`chatterbox-sentence-stream-tts.js`	`runStreaming()` over an async iterator of sentences, with gapless streaming playback
`chatterbox-chunk-stream-tts.js`	Native per-chunk PCM streaming via `streamChunkTokens`, with gapless streaming playback

The two streaming examples feed PCM into a single long-running sox play / ffplay process so chunks play back-to-back without any per-chunk spawn gaps — install one of them (brew install sox or brew install ffmpeg on macOS) to enable playback. Absent a player the demos still run and write the concatenated wav.

Testing

npm run test:unit          # mocked binding; fast
npm run test:integration   # spins up the real engine; needs models
npm run test               # both

Integration tests scan a few candidate models/ directories for the required GGUFs (see test/utils/downloadModel.js) and skip cleanly when files are absent. They cover, across both engines:

batch synthesis with full RuntimeStats,
sentence-level streaming (runStream / run({ streamOutput: true }) / runStreaming over async iterators),
native sub-sentence chunk streaming (Chatterbox-only via streamChunkTokens),
sequential-run / fresh-instance / reload-stability behaviour,
strict GPU-backend assertion via response.stats.backendDevice + backendId (set NO_GPU=true to skip on CPU-only runners, QVAC_TTS_GPU_SMOKE_RELAX=1 to downgrade the strict gate to a warning),
multilingual Chatterbox sweep (es/fr/de/pt) via chatterbox-mtl.test.js,
on darwin the Chatterbox English batch path is additionally verified for WER against the synthesized audio (whisper-small).

To stress-test long inputs, set INPUT_SENTENCES=medium (or long) and re-run the integration suite — addon.test.js reads the env var to pick its sentence corpus from test/data/sentences-{medium,long}.js.

Build from source

Prerequisites: clang with C++20 support, CMake ≥ 3.25, vcpkg (set VCPKG_ROOT), bare-make.

npm install
npx bare-make generate      # configures + fetches the tts-cpp port
npx bare-make build
npx bare-make install       # copies the .bare into prebuilds/<triple>/

The vcpkg port is hosted in tetherto/qvac-registry-vcpkg and pulls qvac-tts.cpp at a pinned REF. See vcpkg-configuration.json for the baseline commit.

GPU backends are controlled by the tts-cpp port's vcpkg features: metal (default on osx/ios), vulkan (default on linux/windows/android), opencl (default on android). On Android the port is configured with GGML_BACKEND_DL=ON + GGML_CPU_ALL_VARIANTS=ON, so the build produces per-arch CPU + Vulkan + OpenCL .so files alongside the .bare module instead of statically linking; the resulting prebuilds layout is what the backendsDir option expects (see Backends & GPU acceleration).

Troubleshooting

t3 model not found / supertonic model not found — the paths in files are wrong or the GGUFs weren't generated. Run npm run setup-models (creates the Python venv and converts the upstream checkpoints into the four / five expected GGUF files).

VoiceEncoder forward failed when passing referenceAudio** — the reference wav is likely < 5 s of clean speech. Make it longer (10–15 s gives the best similarity).

Crash on process exit with Metal's [rsets->data count] == 0 assertion — you're running on a build before the s3gen_unload() teardown fix; bump the tts-cpp port to >= 2026-04-21 port-version.

Slower-than-expected RTF on darwin — set config: { useGPU: true } (the default is now CPU; see Constructor

Backends & GPU acceleration) and confirm the port was built with the metal feature. Also confirm your reference wav's mel was baked (Using C++ VoiceEncoder / C++ S3TokenizerV2 messages in the log) — if voice conditioning falls back to CPU, a chunk of the first-call overhead is visible in RTF.

Slow-but-otherwise-fine RTF on Android — set config: { useGPU: true } (the default is CPU; see Backends & GPU acceleration) and confirm your device's GPU is on tts-cpp's per-vendor allowlist. Chatterbox is declined to CPU on ARM Mali, so on a Mali device that engine stays on CPU regardless; Supertonic runs on the GPU there.

License

Apache-2.0. See LICENSE.

@qvac/tts-ggml

0.3.6

@GitHub Actions

@qvac/tts-ggml

Features

Install

Model files

Quick start

Streaming

Sentence streaming — `runStreaming(asyncIter)`

Chunk streaming — `streamChunkTokens`

Voice cloning

Backends & GPU acceleration

Android: dynamic backend loading

API overview

Constructor — `new TTSGgml(options)`

Methods

Response shape

Examples

Testing

Build from source

Troubleshooting

License

@qvac/tts-ggml

0.3.6

@GitHub Actions

@qvac/tts-ggml

Features

Install

Model files

Quick start

Streaming

Sentence streaming — runStreaming(asyncIter)

Chunk streaming — streamChunkTokens

Voice cloning

Backends & GPU acceleration

Android: dynamic backend loading

API overview

Constructor — new TTSGgml(options)

Methods

Response shape

Examples

Testing

Build from source

Troubleshooting

License

Sentence streaming — `runStreaming(asyncIter)`

Chunk streaming — `streamChunkTokens`

Constructor — `new TTSGgml(options)`