About Milvus
Get Started
Concepts
User Guide
- Database
- Collections
- Schema & Data Fields
- Insert & Delete
- Indexes
- Search
- Function & Model Inference
  BM25 Function
  Model-based Embedding Functions
  MinHash Function
  Rerank Functions
- Storage Optimization
- Snapshots
Data Import
AI Tools
Administration Guide
Tools
Integrations
Tutorials
FAQs
API Reference

Home
Docs
User Guide
Function & Model Inference
MinHash Function

MinHash FunctionCompatible with Milvus 3.0.x

The MinHash function converts raw text into binary vectors that approximate Jaccard similarity between documents. It applies text shingling and multiple hash functions to produce fixed-length signature vectors, enabling fast near-duplicate detection and document deduplication at scale.

As a built-in function, MinHash runs within Milvus and does not require external model inference or preprocessing. You insert raw text, and Milvus generates the MinHash signature vectors automatically.

Limits

The output field must be a BINARY_VECTOR with a dimension that satisfies dim % 32 == 0, because each MinHash signature is a 32-bit hash value.
The dim of the binary vector field must equal 32 * num_hashes. A mismatch causes an error.
When using MINHASH_LSH index with MinHash function output, mh_element_bit_width must be set to 32.

How MinHash works

Expand to see how it works

MinHash is a locality-sensitive hashing technique that estimates Jaccard similarity between sets. In Milvus, the MinHash function follows this pipeline: you provide raw text as input, and Milvus produces a binary vector as output — handling all intermediate steps internally.

The overall workflow consists of a shared text processing pipeline used by both document ingestion and query processing, followed by phase-specific operations for storage and retrieval.

Iaqkbfeh8oqggsx6nsocfosondo

Shared text processing pipeline

Both document ingestion and query processing pass raw text through the same four-stage transformation:

Text analysis: The text is processed by an analyzer (when token_level is "word") or used directly (when token_level is "char"). Word-level tokenization applies the analyzer configured on the input field to segment text into terms — for example, "milvus is vector db" becomes ["milvus", "is", "vector", "db"].
Shingling: The tokens are split into overlapping n-grams (shingles) of size shingle_size. For example, with 3-grams at word level, the tokens ["information", "retrieval", "is", "a", "field"] become shingles like ["information retrieval is", "retrieval is a", "is a field"].
MinHash signature generation: Multiple hash functions (H1, H2, …, Hn, where n = num_hashes) are applied to the shingle set. For each hash function, the minimum hash value across all shingles is selected. The collection of these minimum values forms the MinHash signature — a fixed-length representation that approximates the Jaccard similarity of the original document.
Binary vector encoding: Each signature value is a 32-bit hash, and the full signature is packed into a BINARY_VECTOR of dimension 32 * num_hashes.

Document ingestion

During insertion, the binary vector produced by the shared pipeline is stored in the MINHASH_LSH index. The index maintains an LSH (Locality-Sensitive Hashing) table that groups similar signatures into the same buckets, enabling fast candidate retrieval at query time.

Query processing

During search, the query text goes through the same shared pipeline to produce a binary vector. This vector is used to perform an LSH lookup in the MINHASH_LSH index, which quickly identifies candidate pairs that are likely similar. Candidates are then ranked by estimated Jaccard similarity and the top-K results are returned.

Because both paths share the same transformation logic, two documents with highly overlapping content produce similar MinHash signatures. This makes the function effective for finding near-duplicates even when documents differ in word order, formatting, or minor phrasing.

Before you start

Before using the MinHash function, plan your collection schema to include the following:

A text field for raw content

Your collection must include a VARCHAR field to store raw text. This field serves as the input to the MinHash function.
An analyzer for the text field (when using word-level tokenization)

If token_level is set to "word" (default), the text field must have an analyzer enabled. The analyzer defines how text is tokenized before shingling. By default, Milvus uses the standard analyzer. To configure a different analyzer, refer to Choose the Right Analyzer for Your Use Case.
A binary vector field for MinHash output

Your collection must include a BINARY_VECTOR field to store the binary vectors generated by the MinHash function. The dimension must equal 32 * num_hashes.

Step 1: Create a collection with a MinHash function

To use the MinHash function, define it when creating the collection. The function becomes part of the collection schema and is applied automatically during data insertion and search.

Define schema fields

Your collection schema must include at least three fields:

Primary field: Uniquely identifies each entity in the collection.
Text field (VARCHAR): Stores raw text documents. Set enable_analyzer=True so Milvus can process the text for MinHash signature generation. By default, Milvus uses the standard analyzer for text analysis. To configure a different analyzer, refer to Choose the Right Analyzer for Your Use Case.
Binary vector field (BINARY_VECTOR): Stores binary vectors automatically generated by the MinHash function. The dimension must equal 32 * num_hashes.

Python Java NodeJS Go cURL

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530", token="root:Milvus")

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="document_content", datatype=DataType.VARCHAR, max_length=9000, enable_analyzer=True)
schema.add_field(field_name="binary_vector", datatype=DataType.BINARY_VECTOR, dim=8192)

// java

// nodejs

// go

# restful

Define the MinHash function

The MinHash function converts analyzed text into binary vectors that approximate Jaccard similarity between documents.

Define the function and add it to your schema:

Python Java NodeJS Go cURL

minhash_function = Function(
    name="minhash_function",
    input_field_names=["document_content"], # Name of the VARCHAR field containing raw text
    output_field_names=["binary_vector"], # Name of the BINARY_VECTOR field for generated signatures
    function_type=FunctionType.MINHASH,
    params={
        "num_hashes": 256, # Number of hash functions; produces dim = 32 * 256 = 8192
        "shingle_size": 3, # N-gram size for shingling
    }
)

schema.add_function(minhash_function)

// java

// nodejs

// go

# restful

Configuration options

The params dictionary of the MinHash function accepts the following parameters. All parameter names are case-insensitive.

Parameter	Type	Default	Description
`num_hashes`	int	Derived from `dim / 32`	Number of hash functions for signature generation. The output binary vector dimension equals `32 * num_hashes`. Higher values reduce variance in similarity estimation but increase computation. Recommended: `256` (dim = 8192).
`shingle_size`	int	`3`	N-gram size for shingling. Word-level: 1-3 is typical. Character-level: 2-6 is typical.
`hash_function`	str	`"xxhash"`	Hash function to use. Options: `"xxhash"` (fast) `"sha1"` (slower, higher collision resistance).
`token_level`	str	`"word"`	Tokenization level. Options: `"word"`: uses the field's analyzer for tokenization, then applies n-gram shingling. `"char"` / `"character"`: applies n-gram shingling directly on raw characters (no analyzer). Word-level provides stronger semantics and higher efficiency but depends on language-specific tokenization. Character-level is language-agnostic but produces higher-dimensional shingles with weaker semantics.
`seed`	int	`1234`	Random seed for MinHash function initialization.

Configure the index

The recommended index type for MinHash binary vectors is MINHASH_LSH, with metric type MHJACCARD.

Python Java NodeJS Go cURL

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="binary_vector",
    index_type="MINHASH_LSH",
    metric_type="MHJACCARD",
    params={
        "mh_lsh_band": 128,
        "mh_element_bit_width": 32,
        "with_raw_data": True,
    },
)

// java

// nodejs

// go

# restful

Create the collection

Create the collection using the schema and index parameters defined above:

Python Java NodeJS Go cURL

client.create_collection(
    collection_name="dedup_collection",
    schema=schema,
    index_params=index_params,
)

// java

// nodejs

// go

# restful

Step 2: Insert documents

After setting up your collection, insert text data. You only need to provide the raw text — the MinHash function automatically generates the binary vector for each document.

Python Java NodeJS Go cURL

client.insert(
    "dedup_collection",
    [
        {"document_content": "information retrieval is a field of study that helps users find relevant information in large datasets"},
        {"document_content": "information retrieval is a research field focused on helping users find relevant data in large collections"},
        {"document_content": "information retrieval is a field of research helping users search for relevant information in large datasets"},
    ],
)

// java

// nodejs

// go

# restful

Step 3: Search with MinHash

Once you have inserted data, search for near-duplicate documents by providing raw text queries. Milvus automatically converts your query text into a MinHash binary vector and retrieves the most similar documents using estimated Jaccard similarity.

Python Java NodeJS Go cURL

search_params = {
    "metric_type": "MHJACCARD",
    "params": {},
}

results = client.search(
    collection_name="dedup_collection",
    data=["information retrieval is a research field focused on helping users find relevant data in large collections"],
    anns_field="binary_vector",
    limit=3,
    output_fields=["document_content"],
    search_params=search_params,
)

for hits in results:
    for hit in hits:
        print(f"ID: {hit['id']}, Distance: {hit['distance']}")
        print(f"Document: {hit['entity']['document_content']}")

// java

// nodejs

// go

# restful

What’s next

Full Text Search: Use BM25 for lexical relevance ranking instead of near-duplicate detection.
Analyzer Overview: Configure custom analyzers for text tokenization.
MINHASH_LSH Index: Learn about tuning LSH parameters for recall and performance.

MinHash Function
Limits
How MinHash works
Shared text processing pipeline
Document ingestion
Query processing
Before you start
Step 1: Create a collection with a MinHash function
Define schema fields
Define the MinHash function
Configure the index
Create the collection
Step 2: Insert documents
Step 3: Search with MinHash
What's next

Try Managed Milvus for Free

Zilliz Cloud is hassle-free, powered by Milvus and 10x faster.

Get Started

Feedback

Was this page helpful?

MinHash FunctionCompatible with Milvus 3.0.x

Limits

How MinHash works

Shared text processing pipeline

Document ingestion

Query processing

Before you start

Step 1: Create a collection with a MinHash function

Define schema fields

Define the MinHash function

Configure the index

Create the collection

Step 2: Insert documents

Step 3: Search with MinHash

What’s next

Table of contents

Try Managed Milvus for Free

Feedback