Full Text Search
Full text search is a feature that retrieves documents containing specific terms or phrases in text datasets, then ranks the results based on relevance. This feature overcomes the limitations of semantic search, which might overlook precise terms, ensuring you receive the most accurate and contextually relevant results. Additionally, it simplifies vector searches by accepting raw text input, automatically converting your text data into sparse embeddings without the need to manually generate vector embeddings.
Using the BM25 algorithm for relevance scoring, this feature is particularly valuable in retrieval-augmented generation (RAG) scenarios, where it prioritizes documents that closely match specific search terms.
- By integrating full text search with semantic-based dense vector search, you can enhance the accuracy and relevance of search results. For more information, refer to Hybrid Search.
- Full text search is available in Milvus Standalone and Milvus Distributed but not Milvus Lite, although adding it to Milvus Lite is on the roadmap.
Overview
Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:
Text input: You insert raw text documents or provide query text without needing to manually embed them.
Text analysis: Milvus uses an analyzer to tokenize the input text into individual, searchable terms. For more information on analyzers, refer to Analyzer Overview.
Function processing: The built-in function receives tokenized terms and converts them into sparse vector representations.
Collection store: Milvus stores these sparse embeddings in a collection for efficient retrieval.
BM25 scoring: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on their relevance to the query text.
Full text search
To use full text search, follow these main steps:
Create a collection: Set up a collection with necessary fields and define a function to convert raw text into sparse embeddings.
Insert data: Ingest your raw text documents to the collection.
Perform searches: Use query texts to search through your collection and retrieve relevant results.
Create a collection for full text search
To enable full text search, create a collection with a specific schema. This schema must include three necessary fields:
The primary field that uniquely identifies each entity in a collection.
A
VARCHAR
field that stores raw text documents, with theenable_analyzer
attribute set toTrue
. This allows Milvus to tokenize text into specific terms for function processing.A
SPARSE_FLOAT_VECTOR
field reserved to store sparse embeddings that Milvus will automatically generate for theVARCHAR
field.
Define the collection schema
First, create the schema and add the necessary fields:
from pymilvus import MilvusClient, DataType, Function, FunctionType
client = MilvusClient(uri="http://localhost:19530")
schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
import io.milvus.v2.common.DataType;
import io.milvus.v2.service.collection.request.AddFieldReq;
import io.milvus.v2.service.collection.request.CreateCollectionReq;
CreateCollectionReq.CollectionSchema schema = CreateCollectionReq.CollectionSchema.builder()
.build();
schema.addField(AddFieldReq.builder()
.fieldName("id")
.dataType(DataType.Int64)
.isPrimaryKey(true)
.autoID(true)
.build());
schema.addField(AddFieldReq.builder()
.fieldName("text")
.dataType(DataType.VarChar)
.maxLength(1000)
.enableAnalyzer(true)
.build());
schema.addField(AddFieldReq.builder()
.fieldName("sparse")
.dataType(DataType.SparseFloatVector)
.build());
import { MilvusClient, DataType } from "@zilliz/milvus2-sdk-node";
const address = "http://localhost:19530";
const token = "root:Milvus";
const client = new MilvusClient({address, token});
const schema = [
{
name: "id",
data_type: DataType.Int64,
is_primary_key: true,
},
{
name: "text",
data_type: "VarChar",
enable_analyzer: true,
enable_match: true,
max_length: 1000,
},
{
name: "sparse",
data_type: DataType.SparseFloatVector,
},
];
console.log(res.results)
export schema='{
"autoId": true,
"enabledDynamicField": false,
"fields": [
{
"fieldName": "id",
"dataType": "Int64",
"isPrimary": true
},
{
"fieldName": "text",
"dataType": "VarChar",
"elementTypeParams": {
"max_length": 1000,
"enable_analyzer": true
}
},
{
"fieldName": "sparse",
"dataType": "SparseFloatVector"
}
]
}'
In this configuration,
id
: serves as the primary key and is automatically generated withauto_id=True
.text
: stores your raw text data for full text search operations. The data type must beVARCHAR
, asVARCHAR
is Milvus’ string data type for text storage. Setenable_analyzer=True
to allow Milvus to tokenize the text. By default, Milvus uses the standard analyzer for text analysis. To configure a different analyzer, refer to Overview.sparse
: a vector field reserved to store internally generated sparse embeddings for full text search operations. The data type must beSPARSE_FLOAT_VECTOR
.
Now, define a function that will convert your text into sparse vector representations and then add it to the schema:
bm25_function = Function(
name="text_bm25_emb", # Function name
input_field_names=["text"], # Name of the VARCHAR field containing raw text data
output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
function_type=FunctionType.BM25,
)
schema.add_function(bm25_function)
import io.milvus.common.clientenum.FunctionType;
import io.milvus.v2.service.collection.request.CreateCollectionReq.Function;
import java.util.*;
schema.addFunction(Function.builder()
.functionType(FunctionType.BM25)
.name("text_bm25_emb")
.inputFieldNames(Collections.singletonList("text"))
.outputFieldNames(Collections.singletonList("vector"))
.build());
const functions = [
{
name: 'text_bm25_emb',
description: 'bm25 function',
type: FunctionType.BM25,
input_field_names: ['text'],
output_field_names: ['vector'],
params: {},
},
];
export schema='{
"autoId": true,
"enabledDynamicField": false,
"fields": [
{
"fieldName": "id",
"dataType": "Int64",
"isPrimary": true
},
{
"fieldName": "text",
"dataType": "VarChar",
"elementTypeParams": {
"max_length": 1000,
"enable_analyzer": true
}
},
{
"fieldName": "sparse",
"dataType": "SparseFloatVector"
}
],
"functions": [
{
"name": "text_bm25_emb",
"type": "BM25",
"inputFieldNames": ["text"],
"outputFieldNames": ["sparse"],
"params": {}
}
]
}'
Parameter | Description |
---|---|
| The name of the function. This function converts your raw text from the |
| The name of the |
| The name of the field where the internally generated sparse vectors will be stored. For |
| The type of the function to use. Set the value to |
For collections with multiple VARCHAR
fields requiring text-to-sparse-vector conversion, add separate functions to the collection schema, ensuring each function has a unique name and output_field_names
value.
Configure the index
After defining the schema with necessary fields and the built-in function, set up the index for your collection. To simplify this process, use AUTOINDEX
as the index_type
, an option that allows Milvus to choose and configure the most suitable index type based on the structure of your data.
index_params = client.prepare_index_params()
index_params.add_index(
field_name="sparse",
index_type="AUTOINDEX",
metric_type="BM25"
)
import io.milvus.v2.common.IndexParam;
List<IndexParam> indexes = new ArrayList<>();
indexes.add(IndexParam.builder()
.fieldName("sparse")
.indexType(IndexParam.IndexType.SPARSE_INVERTED_INDEX)
.metricType(IndexParam.MetricType.BM25)
.build());
const index_params = [
{
field_name: "sparse",
metric_type: "BM25",
index_type: "AUTOINDEX",
},
];
export indexParams='[
{
"fieldName": "sparse",
"metricType": "BM25",
"indexType": "AUTOINDEX"
}
]'
Parameter | Description |
---|---|
| The name of the vector field to index. For full text search, this should be the field that stores the generated sparse vectors. In this example, set the value to |
| The type of the index to create. |
| The value for this parameter must be set to |
Create the collection
Now create the collection using the schema and index parameters defined.
client.create_collection(
collection_name='demo',
schema=schema,
index_params=index_params
)
import io.milvus.v2.service.collection.request.CreateCollectionReq;
CreateCollectionReq requestCreate = CreateCollectionReq.builder()
.collectionName("demo")
.collectionSchema(schema)
.indexParams(indexes)
.build();
client.createCollection(requestCreate);
await client.create_collection(
collection_name: 'demo',
schema: schema,
index_params: index_params,
functions: functions
);
export CLUSTER_ENDPOINT="http://localhost:19530"
export TOKEN="root:Milvus"
curl --request POST \
--url "${CLUSTER_ENDPOINT}/v2/vectordb/collections/create" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d "{
\"collectionName\": \"demo\",
\"schema\": $schema,
\"indexParams\": $indexParams
}"
Insert text data
After setting up your collection and index, you’re ready to insert text data. In this process, you need only to provide the raw text. The built-in function we defined earlier automatically generates the corresponding sparse vector for each text entry.
client.insert('demo', [
{'text': 'information retrieval is a field of study.'},
{'text': 'information retrieval focuses on finding relevant information in large datasets.'},
{'text': 'data mining and information retrieval overlap in research.'},
])
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import io.milvus.v2.service.vector.request.InsertReq;
Gson gson = new Gson();
List<JsonObject> rows = Arrays.asList(
gson.fromJson("{\"text\": \"information retrieval is a field of study.\"}", JsonObject.class),
gson.fromJson("{\"text\": \"information retrieval focuses on finding relevant information in large datasets.\"}", JsonObject.class),
gson.fromJson("{\"text\": \"data mining and information retrieval overlap in research.\"}", JsonObject.class)
);
client.insert(InsertReq.builder()
.collectionName("demo")
.data(rows)
.build());
await client.insert({
collection_name: 'demo',
data: [
{'text': 'information retrieval is a field of study.'},
{'text': 'information retrieval focuses on finding relevant information in large datasets.'},
{'text': 'data mining and information retrieval overlap in research.'},
]);
curl --request POST \
--url "${CLUSTER_ENDPOINT}/v2/vectordb/entities/insert" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
"data": [
{"text": "information retrieval is a field of study."},
{"text": "information retrieval focuses on finding relevant information in large datasets."},
{"text": "data mining and information retrieval overlap in research."}
],
"collectionName": "demo"
}'
Perform full text search
Once you’ve inserted data into your collection, you can perform full text searches using raw text queries. Milvus automatically converts your query into a sparse vector and ranks the matched search results using the BM25 algorithm, and then returns the topK (limit
) results.
search_params = {
'params': {'drop_ratio_search': 0.2},
}
client.search(
collection_name='demo',
data=['whats the focus of information retrieval?'],
anns_field='sparse',
limit=3,
search_params=search_params
)
import io.milvus.v2.service.vector.request.SearchReq;
import io.milvus.v2.service.vector.request.data.EmbeddedText;
import io.milvus.v2.service.vector.response.SearchResp;
Map<String,Object> searchParams = new HashMap<>();
searchParams.put("drop_ratio_search", 0.2);
SearchResp searchResp = client.search(SearchReq.builder()
.collectionName("demo")
.data(Collections.singletonList(new EmbeddedText("whats the focus of information retrieval?")))
.annsField("sparse")
.topK(3)
.searchParams(searchParams)
.outputFields(Collections.singletonList("text"))
.build());
await client.search(
collection_name: 'demo',
data: ['whats the focus of information retrieval?'],
anns_field: 'sparse',
limit: 3,
params: {'drop_ratio_search': 0.2},
)
curl --request POST \
--url "${CLUSTER_ENDPOINT}/v2/vectordb/entities/search" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
--data-raw '{
"collectionName": "demo",
"data": [
"whats the focus of information retrieval?"
],
"annsField": "sparse",
"limit": 3,
"outputFields": [
"text"
],
"searchParams":{
"params":{
"drop_ratio_search":0.2
}
}
}'
Parameter | Description |
---|---|
| A dictionary containing search parameters. |
| Proportion of low-frequency terms to ignore during search. For details, refer to Sparse Vector. |
| The raw query text. |
| The name of the field that contains internally generated sparse vectors. |
| Maximum number of top matches to return. |