How do I build a test set for semantic search evaluation?

To build a test set for evaluating a semantic search system, start by defining the use cases and gathering representative queries. Semantic search aims to understand user intent and context, so your test set must reflect real-world scenarios. First, identify common user intents your system should handle. For example, if you’re building search for an e-commerce site, include queries like “affordable winter jackets” or “waterproof hiking boots size 10.” Collect both straightforward queries (“blue jeans”) and ambiguous ones (“light jackets,” which could refer to weight, color, or material). Include synonyms (“sneakers” vs. “athletic shoes”) and phrasing variations (“best laptop for coding” vs. “good computers for software development”). Edge cases, like rare terms or complex phrasing, should also be included to test robustness.
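To make this concrete, here is a minimal sketch of how such a query set might be organized. The list-of-dicts format, field names, and category labels are illustrative assumptions, not a required schema.

```python
# Hypothetical test-query set covering common intents, ambiguity, synonyms,
# phrasing variations, and edge cases. Field names are illustrative only.
test_queries = [
    {"id": "q1", "text": "affordable winter jackets", "type": "common_intent"},
    {"id": "q2", "text": "blue jeans", "type": "straightforward"},
    {"id": "q3", "text": "light jackets", "type": "ambiguous"},  # weight, color, or material?
    {"id": "q4", "text": "sneakers", "type": "synonym", "variant_of": "athletic shoes"},
    {"id": "q5", "text": "best laptop for coding", "type": "phrasing_variation",
     "variant_of": "good computers for software development"},
    {"id": "q6", "text": "gore-tex shell w/ pit zips sz M", "type": "edge_case"},  # rare terms, complex phrasing
]
```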

Next, create a set of documents (e.g., product descriptions, articles, or support tickets) and map them to the queries. Each query should have a set of “correct” documents that address the intent, ideally annotated by humans. For instance, if a query is “How to reset a forgotten password,” the correct document might be a support article titled “Account Recovery Steps.” Use relevance scoring (e.g., on a scale of 0–3, where 3 is a perfect match) to quantify how well documents align with queries. Avoid relying solely on keyword overlap—focus on semantic relevance. If you’re retrofitting an existing system, use logs of past user queries and clicked results to infer relevance. For new systems, simulate user behavior with domain experts or crowdsourcing. Include negative examples (documents that should not match a query) to test precision. For example, a query for “wireless headphones” should exclude wired models, even if they share keywords like “noise-canceling.”
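One common way to record these judgments is a qrels-style mapping from query to document to graded relevance. The sketch below assumes hypothetical query and document IDs; a score of 0 records an explicit negative example (a document that should not match despite keyword overlap).

```python
# Hypothetical graded relevance judgments: query ID -> {document ID: score 0-3}.
# 3 = perfect match for the intent, 0 = explicit negative.
qrels = {
    "q_reset_password": {
        "doc_account_recovery_steps": 3,   # answers the intent despite different wording
        "doc_change_email_address": 1,     # only tangentially related
    },
    "q_wireless_headphones": {
        "doc_bt_over_ear_anc": 3,          # wireless, noise-canceling
        "doc_wired_anc_headphones": 0,     # shares "noise-canceling" but wrong intent
    },
}
```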

Finally, structure the test set for reproducibility. Split it into validation and test subsets: use validation to tune model parameters (like embedding dimensions or ranking thresholds) and reserve the test set for final evaluation. Ensure the test set covers diverse scenarios, such as short vs. long queries, single vs. multi-intent queries, and domain-specific jargon. Track metrics like recall@k (the fraction of a query’s relevant documents that appear in the top k results), Mean Reciprocal Rank (MRR), or Normalized Discounted Cumulative Gain (NDCG) to measure performance. For example, if your system retrieves 3 relevant documents in the top 10 results for a query, recall@10 would be 3 divided by the total number of relevant documents for that query. Regularly update the test set as user needs evolve—for instance, adding queries about new product lines or emerging terminology (like “USB-C cables” replacing “USB 3.0 cables”). This ensures your evaluation remains aligned with real-world usage.
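As a minimal sketch of how recall@k and MRR could be computed over judgments like the `qrels` above, assuming any document scored above 0 counts as relevant and that `ranked` is the system’s ranked list of document IDs for each query:

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of a query's relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def mean_reciprocal_rank(results, qrels):
    """Average of 1/rank of the first relevant document, across all queries."""
    reciprocal_ranks = []
    for query_id, ranked in results.items():
        relevant = {d for d, score in qrels.get(query_id, {}).items() if score > 0}
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Example: 3 of a query's 5 relevant documents appear in the top 10 -> recall@10 = 0.6
```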
