qvac-lib-infer-llamacpp-embed

This native C++ addon, built on the Bare Runtime, runs text embedding models to generate high-quality contextual text embeddings efficiently. It provides a simple interface for loading, executing, and managing embedding model instances.

Table of Contents

  • Supported platforms
  • Installation
  • Building from Source
  • Usage
    • 1. Import the Model Class
    • 2. Create a Data Loader
    • 3. Create the args object
    • 4. Create config
    • 5. Instantiate the model
    • 6. Load the model
    • 7. Generate embeddings for input sequence
    • 8. Release Resources
  • API behavior by state
  • Quickstart Example
  • Other Examples
  • Benchmarking
  • Tests
  • Glossary
  • License

Supported platforms

| Platform | Architecture | Min Version | Status | GPU Support |
|----------|--------------|-------------|--------|-------------|
| macOS | arm64, x64 | 14.0+ | ✅ Tier 1 | Metal |
| iOS | arm64 | 17.0+ | ✅ Tier 1 | Metal |
| Linux | arm64, x64 | Ubuntu 22+ | ✅ Tier 1 | Vulkan |
| Android | arm64 | 12+ | ✅ Tier 1 | Vulkan, OpenCL (Adreno 700+) |
| Windows | x64 | 10+ | ✅ Tier 1 | Vulkan |

Dependencies:

  • qvac-lib-inference-addon-cpp (≥1.1.2): C++ addon framework
  • qvac-fabric-llm.cpp (≥7248.2.1): Inference engine
  • Bare Runtime (≥1.24.0): JavaScript runtime
  • Linux requires Clang/LLVM 19 with libc++

Installation

Prerequisites

Ensure that the Bare Runtime is installed globally on your system. If it's not already installed, you can install it using:

npm install -g bare@latest

Installing the Package

npm install @qvac/embed-llamacpp@latest

Building from Source

See build.md for detailed instructions on how to build the addon from source.

Usage

1. Import the Model Class

const GGMLBert = require('@qvac/embed-llamacpp')

2. Create a Data Loader

Data Loaders abstract the way model files are accessed. Use a FileSystemDataLoader to load model files from your local file system. Models can be downloaded directly from HuggingFace.

const FilesystemDL = require('@qvac/dl-filesystem')

// Download model from HuggingFace (see examples/utils.js for downloadModel helper)
const [modelName, dirPath] = await downloadModel(
  'https://huggingface.co/ChristianAzinn/gte-large-gguf/resolve/main/gte-large_fp16.gguf',
  'gte-large_fp16.gguf'
)

const fsDL = new FilesystemDL({ dirPath })

3. Create the args object

const args = {
  loader: fsDL,
  logger: console,
  opts: { stats: true },
  diskPath: dirPath,
  modelName
}

The args object contains the following properties:

  • loader: The Data Loader instance from which the model file will be streamed.
  • logger: Used to create a QvacLogger instance, which handles all logging functionality.
  • opts.stats: Whether to calculate inference stats.
  • diskPath: The local directory where the model file will be downloaded.
  • modelName: The name of the model file in the Data Loader.

4. Create config

The config is a dictionary (object) consisting of hyper-parameters which can be used to tweak the behaviour of the model.
All parameter values should be strings.

const config = {
  device: 'gpu',
  gpu_layers: '99',
  batch_size: '1024',
  ctx_size: '512'
}
| Parameter | Range / Type | Default | Description |
|-----------|--------------|---------|-------------|
| -dev | "gpu" or "cpu" | "gpu" | Device to run inference on |
| -ngl | integer | 0 | Number of model layers to offload to GPU |
| --batch-size | integer | 2048 | Tokens for processing multiple prompts together |
| --pooling | {none,mean,cls,last,rank} | model default | Pooling type for embeddings |
| --attention | {causal,non-causal} | model default | Attention type for embeddings |
| --embd-normalize | integer | 2 | Embedding normalization (-1=none, 0=max abs int16, 1=taxicab, 2=euclidean, >2=p-norm) |
| -fa | "on", "off", or "auto" | "auto" | Enable/disable flash attention |
| --main-gpu | integer, "integrated", or "dedicated" | — | GPU selection for multi-GPU systems |
| verbosity | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
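
A minimal sketch of an extended config covering more of the parameters above. Note the key names beyond those in the earlier example (pooling, attention, embd_normalize) are assumptions inferred from the flag names; verify them against the addon's documentation before use.

```javascript
// Hypothetical extended config. Key names marked below are assumptions
// derived from the CLI flag names in the table, mirroring how the earlier
// example maps device ↔ -dev, gpu_layers ↔ -ngl, batch_size ↔ --batch-size.
const config = {
  device: 'gpu',
  gpu_layers: '99',       // -ngl
  batch_size: '1024',     // --batch-size
  ctx_size: '512',
  pooling: 'mean',        // --pooling (assumed key name)
  embd_normalize: '2'     // --embd-normalize, euclidean (assumed key name)
}
// Per the note above, all parameter values are strings.
```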

iGPU/GPU selection logic:

| Scenario | main-gpu not specified | main-gpu: "dedicated" | main-gpu: "integrated" |
|----------|------------------------|-----------------------|------------------------|
| Devices considered | All GPUs (dedicated + integrated) | Only dedicated GPUs | Only integrated GPUs |
| System with iGPU only | ✅ Uses iGPU | ❌ Falls back to CPU | ✅ Uses iGPU |
| System with dedicated GPU only | ✅ Uses dedicated GPU | ✅ Uses dedicated GPU | ❌ Falls back to CPU |
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |

5. Instantiate the model

const model = new GGMLBert(args, config)

6. Load the model

await model.load()

Optionally, you can pass the following parameters to tweak the loading behaviour:

  • close?: Whether to close the Data Loader after loading. Defaults to true.
  • reportProgressCallback?: A callback invoked periodically with progress updates; useful for displaying an overall progress percentage.

For example:

await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))

Progress Callback Data

The progress callback receives an object with the following properties:

| Property | Type | Description |
|----------|------|-------------|
| action | string | Current operation being performed |
| totalSize | number | Total bytes to be loaded |
| totalFiles | number | Total number of files to process |
| filesProcessed | number | Number of files completed so far |
| currentFile | string | Name of file currently being processed |
| currentFileProgress | string | Percentage progress on current file |
| overallProgress | string | Overall loading progress percentage |
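
The fields above can be combined into a richer status line than the one-liner shown earlier. A small formatter (the sample object is illustrative, not real addon output):

```javascript
// Format one progress update using the callback fields documented above.
function formatProgress (p) {
  return `${p.action}: ${p.filesProcessed}/${p.totalFiles} files, ` +
    `${p.currentFile} ${p.currentFileProgress}% (overall ${p.overallProgress}%)`
}

// Illustrative sample payload with the documented property types.
const line = formatProgress({
  action: 'download',
  totalSize: 1048576,
  totalFiles: 2,
  filesProcessed: 1,
  currentFile: 'gte-large_fp16.gguf',
  currentFileProgress: '50.0',
  overallProgress: '75.0'
})
console.log(line)
```

You could pass `p => process.stdout.write('\r' + formatProgress(p))` as the second argument to `model.load()`.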

7. Generate embeddings for input sequence

The model outputs a vector for the input sequence.

const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
const response = await model.run(query)
const embeddings = await response.await()

When opts.stats is enabled, response.stats includes runtime metrics such as total_tokens, total_time_ms, tokens_per_second, and backendDevice ("cpu" or "gpu"). backendDevice reflects the resolved device used at runtime after backend selection/fallback logic, not only the requested config.
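
Returned embeddings are typically compared with cosine similarity. A minimal helper in plain JavaScript (independent of the addon); note that with --embd-normalize 2 (euclidean, the default) vectors are unit-length, so the dot product alone already equals the cosine similarity:

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity (a, b) {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

// Sketch of usage with the model, assuming the run()/await() API shown above:
// const e1 = await (await model.run('What games suit a toddler?')).await()
// const e2 = await (await model.run('Fun activities for a 1 year old')).await()
// console.log(cosineSimilarity(e1, e2))
```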

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}

API behavior by state

The following table describes the expected behavior of run and cancel depending on the current state (idle vs a job running). cancel can be called on the model (model.cancel()) or on the response (response.cancel()); both target the same underlying job.

| Current state | Action called | What happens |
|---------------|---------------|--------------|
| idle | run | Allowed — starts inference, returns QvacResponse |
| idle | cancel | Allowed — no-op (no job to cancel); Promise resolves |
| run | run | Throws — second run() throws "a job is already set or being processed" (may briefly wait for the previous job to complete) |
| run | cancel | Allowed — cancels current job; Promise resolves when the job has stopped |

When run() is called while another job is active, the implementation first waits briefly for the previous job to settle. This preserves single-job behavior while still failing fast when the instance is busy. If the second run cannot be accepted (timeout or addon busy rejection), it throws:

  • "Cannot set new job: a job is already set or being processed"

Cancellation API: Prefer cancelling from the model: await model.cancel(). This cancels the current job and the Promise resolves when the job has actually stopped (future-based in C++). You can also call await response.cancel() on the value returned by run(); it is equivalent and targets the same job. Both are no-op when idle.
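The single-job semantics above can be sketched with a toy guard. This is illustrative only: the real addon implements the guard in C++ and additionally waits briefly for the previous job to settle before rejecting.

```javascript
// Toy single-job guard mirroring the state table above (not the real addon).
class JobGuard {
  constructor () { this.busy = false }

  // Rejects immediately if a job is active, otherwise runs fn to completion.
  async run (fn) {
    if (this.busy) {
      throw new Error('Cannot set new job: a job is already set or being processed')
    }
    this.busy = true
    try { return await fn() } finally { this.busy = false }
  }

  // Per the table: no-op when idle, resolves once the job is marked stopped.
  async cancel () { this.busy = false }
}
```

Calling run() twice while the first job is still pending reproduces the "a job is already set or being processed" rejection from the table.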

Quickstart Example

Clone the repository and navigate to it:

cd qvac-lib-infer-llamacpp-embed

Install dependencies:

npm install

Run the quickstart example (uses examples/quickstart.js):

npm run quickstart

Other Examples

  • Batch Inference – Demonstrates running multiple prompts at once using batch inference.
  • Native Logging – Demonstrates C++ addon logging integration.

Benchmarking

We conduct rigorous benchmarking of our embedding models to evaluate their retrieval effectiveness and computational efficiency across diverse tasks and datasets. Our evaluation framework incorporates standard information retrieval metrics and system performance indicators to provide a holistic view of model quality.

Running Benchmarks

For instructions on running benchmarks yourself, see the Benchmark Runner Documentation.

The benchmarking covers:

  • Retrieval Quality:

    • nDCG@k: Quality of ranked results based on relevance and position
    • MRR@k: Position of the first relevant result per query
    • Recall@k: Coverage of relevant results in the top k
    • Precision@k: Proportion of top k results that are relevant

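The retrieval metrics above have compact reference implementations for a single query. A sketch using binary relevance, where ranked is the ranked list of document ids and relevant is a Set of relevant ids:

```javascript
// Precision@k: fraction of the top k results that are relevant.
function precisionAtK (ranked, relevant, k) {
  return ranked.slice(0, k).filter(d => relevant.has(d)).length / k
}

// Recall@k: fraction of all relevant documents found in the top k.
function recallAtK (ranked, relevant, k) {
  return ranked.slice(0, k).filter(d => relevant.has(d)).length / relevant.size
}

// MRR@k: reciprocal rank of the first relevant result (0 if none in top k).
function mrrAtK (ranked, relevant, k) {
  for (let i = 0; i < Math.min(k, ranked.length); i++) {
    if (relevant.has(ranked[i])) return 1 / (i + 1)
  }
  return 0
}

// nDCG@k with binary gains: log2 position discount, normalized by the ideal DCG.
function ndcgAtK (ranked, relevant, k) {
  let dcg = 0
  for (let i = 0; i < Math.min(k, ranked.length); i++) {
    if (relevant.has(ranked[i])) dcg += 1 / Math.log2(i + 2)
  }
  let idcg = 0
  for (let i = 0; i < Math.min(k, relevant.size); i++) idcg += 1 / Math.log2(i + 2)
  return idcg === 0 ? 0 : dcg / idcg
}
```

These per-query values are averaged over all queries in a dataset to produce the reported scores.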
Results are continuously updated with new releases to ensure up-to-date performance insights.

Tests

Integration tests are located in test/integration/ and cover core functionality including model loading, inference, tool calling, multimodal capabilities, and configuration parameters.
These tests help prevent regressions and ensure the library remains stable as contributions are made to the project.

Unit tests are located in test/unit/ and test the C++ addon components at a lower level, including backend selection, cache management, chat templates, context handling, and UTF8 token processing.
These tests validate the native implementation and help catch issues early in development.

Glossary

  • Bare Runtime - Small and modular JavaScript runtime for desktop and mobile. Learn more.

License

This project is licensed under the Apache-2.0 License – see the LICENSE file for details.

For questions or issues, please open an issue on the GitHub repository.