simhash-vocabulary

Vocabulary-based SimHash implementation for similarity detection.

Installation

npm install simhash-vocabulary

Usage

const { SimHash } = require('simhash-vocabulary')

// Define your vocabulary
const vocabulary = ['cat', 'dog', 'bird', 'fish', 'tree', 'house']

const simhash = new SimHash(vocabulary)

// Hash token arrays to 256-bit (32-byte) buffers
const hash1 = simhash.hash(['cat', 'dog', 'bird'])
const hash2 = simhash.hash(['cat', 'dog', 'fish'])
const hash3 = simhash.hash(['tree', 'house'])

// Compare similarity via Hamming distance
console.log(SimHash.hammingDistance(hash1, hash2)) // small distance (similar)
console.log(SimHash.hammingDistance(hash1, hash3)) // larger distance (different)

API

`new SimHash(vocabulary)`

Create a SimHash instance with a fixed vocabulary. Each token gets a deterministic 256-bit vector derived from its SHA-256 hash.
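As a rough illustration of what "deterministic 256-bit vector derived from its SHA-256 hash" can mean, the sketch below maps a token to its raw SHA-256 digest using Node's built-in `crypto` module. The helper name `tokenVector` is hypothetical and not part of this package's API, and the package may derive its vectors differently (for example by salting or re-encoding the digest).

```javascript
const crypto = require('crypto')

// Hypothetical helper (not part of the package's API): map a token to a
// deterministic 32-byte (256-bit) vector by taking its SHA-256 digest.
function tokenVector (token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest()
}

const a = tokenVector('cat')
const b = tokenVector('cat')
const c = tokenVector('dog')

console.log(a.length)    // 32
console.log(a.equals(b)) // true: same token, same vector
console.log(a.equals(c)) // false: different tokens, different vectors
```

Because the vector depends only on the token text, two instances built from the same vocabulary would assign identical vectors under this assumption.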

`simhash.hash(tokens)`

Compute a 32-byte SimHash buffer from an array of tokens. Tokens not in the vocabulary are ignored with a warning.
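The "ignored with a warning" behavior could look something like the sketch below. The helper `filterKnown`, the use of `console.warn`, and the warning text are all assumptions made for illustration; the package's actual warning mechanism and message are not documented here.

```javascript
// Hypothetical helper (not part of the package's API): keep only tokens
// that appear in the vocabulary and warn about the rest.
function filterKnown (tokens, vocabulary) {
  const known = new Set(vocabulary)
  const kept = []
  for (const token of tokens) {
    if (known.has(token)) {
      kept.push(token)
    } else {
      // Assumed warning channel and wording
      console.warn(`ignoring token not in vocabulary: ${token}`)
    }
  }
  return kept
}

console.log(filterKnown(['cat', 'zebra', 'dog'], ['cat', 'dog', 'bird']))
// [ 'cat', 'dog' ]
```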

`SimHash.hammingDistance(buf1, buf2)`

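Calculate the Hamming distance between two buffers, i.e. the number of bit positions at which they differ. Lower values indicate higher similarity; for two 32-byte hashes the result ranges from 0 (identical) to 256 (every bit differs).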
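For readers who want to see the bit counting spelled out, here is a minimal stand-alone sketch of a Hamming distance over two Node `Buffer`s. It is a hypothetical stand-in, not the package's source; in particular, throwing on a length mismatch is an assumption, since the behavior of `SimHash.hammingDistance` for buffers of different lengths is not documented above.

```javascript
// Hypothetical stand-in for SimHash.hammingDistance (not the package's code).
function hammingDistance (buf1, buf2) {
  // Assumption: mismatched lengths are treated as an error
  if (buf1.length !== buf2.length) {
    throw new RangeError('buffers must have the same length')
  }
  let distance = 0
  for (let i = 0; i < buf1.length; i++) {
    // XOR leaves a 1 in every bit position where the two bytes differ
    let diff = buf1[i] ^ buf2[i]
    while (diff !== 0) {
      diff &= diff - 1 // clear the lowest set bit
      distance++
    }
  }
  return distance
}

console.log(hammingDistance(Buffer.from([0b1010]), Buffer.from([0b0110]))) // 2
console.log(hammingDistance(Buffer.alloc(32), Buffer.alloc(32, 0xff)))     // 256
```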

How it works

SimHash converts a list of tokens into a fixed-size fingerprint such that similar inputs produce fingerprints with a small Hamming distance. For each of the 256 bit positions, the algorithm accumulates a weighted vote from every token's bit vector, then thresholds each total to produce the corresponding bit of the final hash.
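The accumulate-then-threshold idea can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `tokenVector` and `simhashSketch` are hypothetical names, each token's vector is assumed to be its raw SHA-256 digest, every occurrence of a token is given a weight of 1, bits are read most-significant first, and a tie at a bit position resolves to 0.

```javascript
const crypto = require('crypto')

const BITS = 256

// Assumed per-token vector: the token's SHA-256 digest (32 bytes).
function tokenVector (token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest()
}

// Hypothetical sketch of a SimHash over token digests.
function simhashSketch (tokens) {
  // One signed counter per bit position
  const counts = new Int32Array(BITS)
  for (const token of tokens) {
    const vec = tokenVector(token)
    for (let bit = 0; bit < BITS; bit++) {
      const isSet = (vec[bit >> 3] >> (7 - (bit & 7))) & 1
      // A 1 bit votes +1, a 0 bit votes -1 (weight 1 per occurrence)
      counts[bit] += isSet ? 1 : -1
    }
  }
  // Threshold: the output bit is 1 where the votes are net positive
  const out = Buffer.alloc(BITS / 8)
  for (let bit = 0; bit < BITS; bit++) {
    if (counts[bit] > 0) out[bit >> 3] |= 1 << (7 - (bit & 7))
  }
  return out
}

const h = simhashSketch(['cat', 'dog', 'bird'])
console.log(h.length) // 32
// Votes are summed, so token order does not matter
console.log(h.equals(simhashSketch(['bird', 'cat', 'dog']))) // true
```

Inputs that share most of their tokens cast mostly the same votes, so few bit positions flip across the threshold, which is why their fingerprints end up close in Hamming distance.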

License

Apache-2.0

simhash-vocabulary

1.0.2

@d_cassidy
