Vocabulary-based SimHash implementation for similarity detection.
npm install simhash-vocabulary
const { SimHash } = require('simhash-vocabulary')
// Define your vocabulary
const vocabulary = ['cat', 'dog', 'bird', 'fish', 'tree', 'house']
const simhash = new SimHash(vocabulary)
// Hash token arrays to 256-bit (32-byte) buffers
const hash1 = simhash.hash(['cat', 'dog', 'bird'])
const hash2 = simhash.hash(['cat', 'dog', 'fish'])
const hash3 = simhash.hash(['tree', 'house'])
// Compare similarity via Hamming distance
console.log(SimHash.hammingDistance(hash1, hash2)) // small distance (similar)
console.log(SimHash.hammingDistance(hash1, hash3)) // larger distance (different)
new SimHash(vocabulary)
Create a SimHash instance with a fixed vocabulary. Each token gets a deterministic 256-bit vector derived from its SHA-256 hash.
simhash.hash(tokens)
Compute a 32-byte SimHash buffer from an array of tokens. Tokens not in the vocabulary are ignored with a warning.
SimHash.hammingDistance(buf1, buf2)
Calculate the Hamming distance between two buffers (number of differing bits). Lower values indicate higher similarity.
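To make "number of differing bits" concrete, here is a minimal sketch of how a Hamming distance over two buffers can be computed. The function name `hammingDistanceSketch` is hypothetical and this is not the package's own code. It assumes both buffers have the same length.

```javascript
// Hypothetical helper for illustration only; not part of simhash-vocabulary.
// Counts the bits that differ between two equal-length buffers.
function hammingDistanceSketch(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('buffers must have the same length')
  }
  let distance = 0
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i] // bits set exactly where the two bytes differ
    while (diff !== 0) {
      diff &= diff - 1 // clear the lowest set bit
      distance++
    }
  }
  return distance
}

const x = Buffer.from([0b11110000, 0b00000001])
const y = Buffer.from([0b11110011, 0b00000000])
console.log(hammingDistanceSketch(x, y)) // 3
```

For 32-byte hashes the result ranges from 0 (identical) to 256 (every bit differs).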
SimHash converts a set of tokens into a fixed-size fingerprint such that similar inputs produce fingerprints with a small Hamming distance. The algorithm accumulates the weighted bit vectors of all tokens position by position, then thresholds each position to produce the corresponding bit of the final hash.
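The accumulate-then-threshold idea can be sketched in a few lines. This is an illustration under stated assumptions, not the package's actual implementation: the names `tokenVector` and `simhashSketch` are hypothetical, every token is given a weight of 1, bits are read most-significant first, and vocabulary filtering is left out.

```javascript
const crypto = require('crypto')

// Assumption: a token's 256-bit vector is simply its SHA-256 digest (32 bytes).
function tokenVector(token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest()
}

// Hypothetical sketch of the SimHash idea with unit weights.
function simhashSketch(tokens) {
  const BITS = 256
  const counts = new Int32Array(BITS)

  // Accumulate: a set bit votes +1 for its position, a clear bit votes -1.
  for (const token of tokens) {
    const vec = tokenVector(token)
    for (let bit = 0; bit < BITS; bit++) {
      const isSet = (vec[bit >> 3] >> (7 - (bit & 7))) & 1
      counts[bit] += isSet ? 1 : -1
    }
  }

  // Threshold: an output bit is 1 only where the votes sum to a positive value.
  const out = Buffer.alloc(BITS / 8)
  for (let bit = 0; bit < BITS; bit++) {
    if (counts[bit] > 0) out[bit >> 3] |= 1 << (7 - (bit & 7))
  }
  return out
}

const fp = simhashSketch(['cat', 'dog', 'bird'])
console.log(fp.length) // 32
```

In this sketch a tie (a sum of exactly zero) yields a 0 bit. How the real package weights tokens and breaks ties is not stated here, so treat those details as assumptions.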
Apache-2.0