Keep answers concise by combining hard limits (token caps) with clear output contracts (structure and length rules). The most reliable control is max_tokens: if you cap output to 400–800 tokens, the model physically can’t ramble. Then add a system instruction like: “Answer in 5 bullets max” or “Use two short paragraphs, no extra background.” This works best when the prompt is also focused. If you ask a broad question without constraints, you’ll get a broad answer. If you ask a specific question with a strict format, you’ll get a tight response.
A production pattern that works well is “short first, expand on demand.” For example:
Request 1: “Give a 6-bullet answer, each bullet ≤ 16 words.”
Request 2 (only if user clicks): “Expand bullet #3 with an example and edge cases.”
This reduces average cost and improves UX. For developer content, require a consistent structure so “concise” doesn’t mean “missing important details.” A good concise template is:
Direct answer (1–2 sentences)
How to implement (3–5 bullets)
Gotchas (2–4 bullets)
If you’re generating code, require a diff rather than a full file; if you’re generating JSON, require schema-only output. Also consider adding a “verbosity” parameter in your product and map it to token caps and templates.
RAG helps concision because it prevents the model from pulling in unrelated knowledge. Retrieve only what is needed from Milvus or Zilliz Cloud, then instruct Opus 4.6 to answer using only those chunks. This naturally limits scope and reduces the temptation to add general background. You can also require a “Sources” list that includes only retrieved chunk IDs, which discourages invented details. In short: concision comes from budgeting, formatting, and grounding—not from asking “please be concise” and hoping it complies.