simhash-vocabulary

1.0.2

@d_cassidy

Downloads: 354

simhash-vocabulary

Vocabulary-based SimHash implementation for similarity detection.

Installation

npm install simhash-vocabulary

Usage

const { SimHash } = require('simhash-vocabulary')

// Define your vocabulary
const vocabulary = ['cat', 'dog', 'bird', 'fish', 'tree', 'house']

const simhash = new SimHash(vocabulary)

// Hash token arrays to 256-bit (32-byte) buffers
const hash1 = simhash.hash(['cat', 'dog', 'bird'])
const hash2 = simhash.hash(['cat', 'dog', 'fish'])
const hash3 = simhash.hash(['tree', 'house'])

// Compare similarity via Hamming distance
console.log(SimHash.hammingDistance(hash1, hash2)) // small distance (similar)
console.log(SimHash.hammingDistance(hash1, hash3)) // larger distance (different)

API

new SimHash(vocabulary)

Create a SimHash instance with a fixed vocabulary. Each token gets a deterministic 256-bit vector derived from its SHA-256 hash.
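As a rough illustration of what "a deterministic 256-bit vector derived from its SHA-256 hash" can look like, here is a minimal sketch using Node's built-in crypto module. The helper names tokenVector and bitAt are hypothetical and not part of this package's API, and the bit ordering (most significant bit first) is an assumption; the library may derive or store its vectors differently.

```javascript
const crypto = require('crypto')

// Hypothetical helper (not part of simhash-vocabulary):
// map a token to a deterministic 32-byte (256-bit) vector via SHA-256.
function tokenVector(token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest()
}

// Hypothetical helper: read bit i (0..255) of a buffer.
// Assumption: bit 0 is the most significant bit of byte 0.
function bitAt(buf, i) {
  return (buf[i >> 3] >> (7 - (i & 7))) & 1
}

const vec = tokenVector('cat')
console.log(vec.length)                     // 32
console.log(tokenVector('cat').equals(vec)) // true (same token, same vector)
console.log(bitAt(vec, 0))                  // first bit of the vector, 0 or 1
```

Because the vector depends only on the token string, two instances built from the same vocabulary would produce identical vectors under this scheme.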

simhash.hash(tokens)

Compute a 32-byte SimHash buffer from an array of tokens. Tokens not in the vocabulary are ignored with a warning.

SimHash.hammingDistance(buf1, buf2)

Calculate the Hamming distance between two buffers (number of differing bits). Lower values indicate higher similarity.
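To make "number of differing bits" concrete, the following is a minimal sketch of a Hamming distance over two equal-length buffers. It is a stand-in written for illustration, not the package's actual implementation; the similarity helper and the equal-length check are assumptions added here.

```javascript
// Illustrative stand-in for SimHash.hammingDistance (not the package's code):
// XOR each byte pair and count the set bits in the result.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('buffers must have equal length')
  }
  let distance = 0
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]
    while (x) {
      x &= x - 1 // clear the lowest set bit
      distance++
    }
  }
  return distance
}

// Hypothetical convenience: scale the distance to a similarity in [0, 1].
function similarity(a, b) {
  return 1 - hammingDistance(a, b) / (a.length * 8)
}

const a = Buffer.from([0b11110000, 0b00000000])
const b = Buffer.from([0b11111111, 0b00000001])
console.log(hammingDistance(a, b)) // 5
console.log(similarity(a, b))      // 0.6875
```

For the 32-byte hashes described above, the distance ranges from 0 (identical) to 256 (every bit differs); hashes of unrelated inputs tend to land near 128.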

How it works

SimHash converts a set of tokens into a fixed-size fingerprint such that similar inputs produce fingerprints that differ in only a few bits. The algorithm accumulates a weighted bit vector for each token, then thresholds each bit position of the accumulated total to produce the final hash.
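The accumulate-then-threshold idea can be sketched as follows. This is a generic SimHash written under stated assumptions, not this package's source: every token has weight 1, token vectors are raw SHA-256 digests, ties resolve to a 0 bit, and there is no vocabulary check. The names simhashSketch and tokenVector are hypothetical.

```javascript
const crypto = require('crypto')

const BITS = 256

// Assumption: a token's bit vector is its SHA-256 digest.
function tokenVector(token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest()
}

// Generic SimHash sketch with unit weights (not the package's implementation).
function simhashSketch(tokens) {
  // One signed counter per bit position.
  const counts = new Int32Array(BITS)
  for (const token of tokens) {
    const vec = tokenVector(token)
    for (let i = 0; i < BITS; i++) {
      const bit = (vec[i >> 3] >> (7 - (i & 7))) & 1
      counts[i] += bit ? 1 : -1 // a 1 bit votes up, a 0 bit votes down
    }
  }
  // Threshold: the output bit is 1 where the votes are strictly positive.
  const out = Buffer.alloc(BITS / 8)
  for (let i = 0; i < BITS; i++) {
    if (counts[i] > 0) out[i >> 3] |= 1 << (7 - (i & 7))
  }
  return out
}

const h = simhashSketch(['cat', 'dog', 'bird'])
console.log(h.length) // 32
console.log(h.equals(simhashSketch(['bird', 'cat', 'dog']))) // true (order does not matter)
```

In this sketch each token only nudges the per-bit counters, so swapping one token out of several changes relatively few output bits. That is the property that makes Hamming distance a useful similarity measure for these hashes.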

License

Apache-2.0