Synonym
The synonym filter rewrites tokens according to a synonym dictionary, so that related terms match during search. It supports two modes of operation and two ways of supplying the dictionary:
Operation modes — expand mode (expand: true) preserves the original token and emits additional synonyms alongside it; normalization mode (expand: false) rewrites tokens to a canonical form.
Dictionary sources — small dictionaries can be inlined into the filter configuration via the synonyms array; large dictionaries should be stored as a file resource and referenced via synonyms_file.
Dictionary format
A synonym dictionary is a plain-text document (or inline array) in which each line defines one rule. Two rule forms are supported.
Mapping rule
fast, quick => speedy
The tokens on the left (fast, quick) rewrite to the tokens on the right (speedy). Multiple targets are allowed:
small, little => tiny, compact
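To make the mapping-rule semantics concrete, the following pure-Python sketch simulates how a multi-target rule such as "small, little => tiny, compact" rewrites a token stream in each mode. This is an illustration of the rule semantics only, not Milvus's implementation; the function name and structure are invented for this example.

```python
# Illustration only: mapping-rule semantics, not Milvus's implementation.
# Rule: "small, little => tiny, compact"
def apply_mapping(tokens, sources, targets, expand):
    out = []
    for tok in tokens:
        if tok in sources:
            if expand:
                out.append(tok)   # expand: true keeps the original token
            out.extend(targets)   # emit every target token
        else:
            out.append(tok)       # tokens outside the rule pass through
    return out

sources = {"small", "little"}
targets = ["tiny", "compact"]

print(apply_mapping(["a", "small", "dog"], sources, targets, expand=True))
# → ['a', 'small', 'tiny', 'compact', 'dog']
print(apply_mapping(["a", "small", "dog"], sources, targets, expand=False))
# → ['a', 'tiny', 'compact', 'dog']
```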
With expand: true, the original tokens are kept alongside the targets:
Input fast with expand: true → fast, speedy
Input fast with expand: false → speedy
Equivalence group
happy, joyful, cheerful
All listed tokens are considered equivalent:
With expand: true, any occurrence of any token in the group emits every token in the group. Input happy → happy, joyful, cheerful.
With expand: false, every occurrence is rewritten to the first token in the group. Input joyful → happy; input happy is already the first token and is unchanged.
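The equivalence-group behavior described above can be sketched in plain Python. Again, this is only an illustration of the semantics, with an invented function name, not Milvus's internal logic.

```python
# Illustration only: equivalence-group semantics, not Milvus's implementation.
# Group: "happy, joyful, cheerful"
def apply_group(tokens, group, expand):
    out = []
    for tok in tokens:
        if tok in group:
            if expand:
                out.extend(group)       # emit every token in the group
            else:
                out.append(group[0])    # rewrite to the first (canonical) token
        else:
            out.append(tok)
    return out

group = ["happy", "joyful", "cheerful"]

print(apply_group(["i", "am", "joyful"], group, expand=True))
# → ['i', 'am', 'happy', 'joyful', 'cheerful']
print(apply_group(["i", "am", "joyful"], group, expand=False))
# → ['i', 'am', 'happy']
```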
Configuration
The synonym filter is a custom filter. Specify "type": "synonym" along with at least one of synonyms (inline) or synonyms_file (external), plus an expand flag.
analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        {
            "type": "synonym",
            "synonyms": [  # inline rules (optional)
                "fast, quick => speedy",
                "happy, joyful, cheerful",
            ],
            "synonyms_file": {  # external rules (optional)
                "type": "remote",
                "resource_name": "en_synonyms",
                "file_name": "synonyms.txt",
            },
            "expand": True,
        }
    ],
}
The synonym filter accepts the following parameters.
| Parameter | Description | Default |
|---|---|---|
| synonyms | An inline array of rule strings. Each string uses the dictionary format described above. Suitable for small dictionaries (up to a few dozen rules). | — |
| synonyms_file | A reference to a file resource that stores synonym rules, one per line. Use for larger dictionaries. See External dictionary file below. | — |
| expand | A boolean flag that controls how rules apply. true preserves the original token and emits synonyms alongside it; false rewrites tokens to their canonical form (the right-hand side of a mapping, or the first token of an equivalence group). | false |
You can specify synonyms, synonyms_file, or both. When both are present, the filter merges the two sources. The filter operates on tokens produced by the tokenizer; it must therefore be combined with a tokenizer such as the standard tokenizer.
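To tie the dictionary format to the two modes, the following sketch parses rule strings (both mapping rules and equivalence groups, as a merged rule set) and applies them to a token stream. It illustrates the documented semantics only; the parsing code and function names are invented for this example and say nothing about how Milvus implements the filter.

```python
# Illustration only: parse dictionary-format rules and apply them to tokens.
# Not Milvus's implementation; function names are invented for this sketch.
def parse_rules(rule_lines):
    rules = {}  # source token -> ("mapping" | "group", replacement tokens)
    for line in rule_lines:
        if "=>" in line:  # mapping rule
            lhs, rhs = line.split("=>")
            targets = [t.strip() for t in rhs.split(",")]
            for src in (t.strip() for t in lhs.split(",")):
                rules[src] = ("mapping", targets)
        else:  # equivalence group
            group = [t.strip() for t in line.split(",")]
            for src in group:
                rules[src] = ("group", group)
    return rules

def apply_rules(tokens, rules, expand):
    out = []
    for tok in tokens:
        if tok not in rules:
            out.append(tok)
            continue
        kind, targets = rules[tok]
        if kind == "mapping":
            if expand:
                out.append(tok)       # keep the original alongside targets
            out.extend(targets)
        else:                         # equivalence group
            if expand:
                out.extend(targets)   # emit the whole group
            else:
                out.append(targets[0])  # rewrite to the first token

    return out

rules = parse_rules(["fast, quick => speedy", "happy, joyful, cheerful"])
print(apply_rules(["a", "fast", "car"], rules, expand=True))
# → ['a', 'fast', 'speedy', 'car']
print(apply_rules(["i", "am", "joyful"], rules, expand=False))
# → ['i', 'am', 'happy']
```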
External dictionary file
For production-sized dictionaries, register the file as a remote file resource and reference it from synonyms_file.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Register the file once, then reference it from any analyzer that needs it.
client.add_file_resource(
    name="en_synonyms",
    path="file/synonyms.txt",  # full S3 object key, including rootPath
)

analyzer_params = {
    "tokenizer": "standard",
    "filter": [{
        "type": "synonym",
        "synonyms_file": {
            "type": "remote",
            "resource_name": "en_synonyms",
            "file_name": "synonyms.txt",
        },
        "expand": True,
    }],
}
See Manage File Resources for the full workflow (upload, register, list, remove) and for the alternative "type": "local" form.
Examples
Before applying the analyzer to a collection schema, verify its behavior with run_analyzer. The following examples use the inline synonyms array for brevity; replace with synonyms_file for larger dictionaries.
expand: true — keep the original, add synonyms
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

analyzer_params = {
    "tokenizer": "standard",
    "filter": [{
        "type": "synonym",
        "synonyms": [
            "fast, quick => speedy",
            "happy, joyful, cheerful",
        ],
        "expand": True,
    }],
}
print(client.run_analyzer(["a fast car"], analyzer_params))
# → [['a', 'fast', 'speedy', 'car']]
print(client.run_analyzer(["i am happy today"], analyzer_params))
# → [['i', 'am', 'happy', 'joyful', 'cheerful', 'today']]
Both fast and happy are preserved; their synonyms are emitted alongside.
expand: false — rewrite to canonical form
analyzer_params_norm = {
    "tokenizer": "standard",
    "filter": [{
        "type": "synonym",
        "synonyms": [
            "fast, quick => speedy",
            "happy, joyful, cheerful",
        ],
        "expand": False,
    }],
}
print(client.run_analyzer(["a fast car"], analyzer_params_norm))
# → [['a', 'speedy', 'car']]
print(client.run_analyzer(["i am happy today"], analyzer_params_norm))
# → [['i', 'am', 'happy', 'today']]
The mapping rule rewrites fast to speedy. The equivalence group leaves happy unchanged because it is the first token of the group; an input containing joyful or cheerful would have been rewritten to happy.