🚀 Try Zilliz Cloud, the fully managed Milvus, for free—experience 10x faster performance! Try Now>>

Milvus
Zilliz
  • Home
  • Blog
  • Parsing is Hard: Solving Semantic Understanding with Mistral OCR and Milvus

Parsing is Hard: Solving Semantic Understanding with Mistral OCR and Milvus

  • Engineering
April 03, 2025
Stephen Batifol

Let’s face it: parsing documents is hard—really hard. PDFs, images, reports, tables, messy handwriting; they’re packed with valuable information that your users want to search for, but extracting that information and express that accurately in your search index is like solving a puzzle where the pieces keep changing shape: you thought you’ve solved it with an extra line of code but tomorrow a new doc gets ingested and you find another corner case to deal with.

In this post, we’ll tackle this challenge head-on using the powerful combo of Mistral OCR and Milvus Vector DB, turning your doc parsing nightmares into a calm dream with searchable, semantically meaningful vector embeddings.

Why Rule-based Parsing Just Won’t Cut It

If you’ve ever struggled with standard OCR tools, you probably know that they have all sorts of issues:

  • Complex layouts: Tables, lists, multi-column formats – they can break or pose issues to most parsers.
  • Semantic ambiguity: Keywords alone don’t tell you if “apple” means fruit or company.
  • The chanllege of scale and cost: Processing thousands of documents becomes painfully slow.

We need a smarter, more systematic approach that doesn’t just extract text—it understands the content. And that’s exactly where Mistral OCR and Milvus come in.

Meet Your Dream Team

Mistral OCR: More than just text extraction

Mistral OCR isn’t your average OCR tool. It’s designed to tackle a wide range of documents.

  • Deep Understanding of Complex Documents: Whether it’s embedded images, mathematical equations, or tables, it can understand it all with a very high accuracy.
  • Keeps original layouts: Not only does it understand the different layouts in the documents, it also keeps the original layouts and structure intact. On top of that, it’s also capable of parsing multi-page documents.
  • Multilingual and Multimodal Mastery: From English to Hindi to Arabic, Mistral OCR can comprehend documents across thousands of languages and scripts, making it invaluable for applications targeting a global user base.

Milvus: Your Vector Database Built for Scale

  • Billion+ Scale: Milvus can scale to billions of vectors, making it perfect for storing large-scale documents.
  • Full-Text Search: In addition to supporting dense vector embeddings, Milvus also supports Full Text Search. Making it easy to run queries using text and get better results for your RAG system.

Examples:

Let’s take this handwritten note in English, for example. Using a regular OCR tool to extract this text would be a very hard task.

A handwritten note in English A handwritten note in English

We process it with Mistral OCR

api_key = os.getenv("MISTRAL_API_KEY")
client = Mistral(api_key=api_key)

url = "https://preview.redd.it/ocr-for-handwritten-documents-v0-os036yiv9xod1.png?width=640&format=png&auto=webp&s=29461b68383534a3c1bf76cc9e36a2ba4de13c86"
result = client.ocr.process(
                model=ocr_model, document={"type": "image_url", "image_url": url}
            )
print(f"Result: {result.pages[0].markdown}")

And we get the following output. It can recognize handwritten text well. We can see that it even keeps the capitalized format of the words "FORCED AND UNNATURAL"!

Today is Thursday, October 20th - But it definitely feels like a Friday. I'm already considering making a second cup of coffee - and I haven't even finished my first. Do I have a problem?
Sometimes I'll fly through older notes I've taken, and my handwriting is unrecamptable. Perhaps it depends on the type of pen I use. I've tried writing in all cups but it looks so FORCED AND UNNATURAL.
Often times, I'll just take notes on my lapten, but I still seem to ermittelt forward pen and paper. Any advice on what to
improve? I already feel stressed at looking back at what I've just written - it looks like I different people wrote this!

Now we can then insert the text into Milvus for semantic search.

from pymilvus import MilvusClient 

COLLECTION_NAME = "document_ocr"

milvus_client = MilvusClient(uri='http://localhost:19530')
"""
This is where you would define the index, create a collection etc. For the sake of this example. I am skipping it. 

schema = CollectionSchema(...)

milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    )

"""

milvus_client.insert(collection_name=COLLECTION_NAME, data=[result.pages[0].markdown])

But Mistral can also understand documents in different languages or in more complex format, for example let’s try this invoice in German that combines some item names in English.

An Invoice in German An Invoice in German

Mistral OCR is still capable of extracting all the information you have and it even creates the table structure in Markdown that represents the table from the scanned image.

Rechnungsadresse:

Jähn Jessel GmbH a. Co. KG Marianne Scheibe Karla-Löffler-Weg 2 66522 Wismar

Lieferadresse:

Jähn Jessel GmbH a. Co. KG Marianne Scheibe Karla-Löffler-Weg 2 66522 Wismar

Rechnungsinformationen:

Bestelldatum: 2004-10-20
Bezahit: Ja
Expressversand: Nein
Rechnungsnummer: 4652

Rechnungsübersicht

| Pos. | Produkt | Preis <br> (Netto) | Menge | Steuersatz | Summe <br> Brutto |
| :--: | :--: | :--: | :--: | :--: | :--: |
| 1 | Grundig CH 7280w Multi-Zerkleinerer (Gourmet, 400 Watt, 11 Glasbehälter), weiß | 183.49 C | 2 | $0 \%$ | 366.98 C |
| 2 | Planet K | 349.9 C | 2 | $19.0 \%$ | 832.76 C |
| 3 | The Cabin in the Woods (Blu-ray) | 159.1 C | 2 | $7.0 \%$ | 340.47 C |
| 4 | Schenkung auf Italienisch Taschenbuch - 30. | 274.33 C | 4 | $19.0 \%$ | 1305.81 C |
| 5 | Xbox 360 - Razer 0N2A Controller Tournament Edition | 227.6 C | 2 | $7.0 \%$ | 487.06 C |
| 6 | Philips LED-Lampe ersetzt 25Watt E27 2700 Kelvin - warm-weiß, 2.7 Watt, 250 Lumen IEnergieklasse A++I | 347.57 C | 3 | $7.0 \%$ | 1115.7 C |
| 7 | Spannende Abenteuer Die verschollene Grabkammer | 242.8 C | 6 | $0 \%$ | 1456.8 C |
| Zw. summe |  | 1784.79 C |  |  |  |
| Zzgl. Mwst. 7\% |  | 51.4 C |  |  |  |
| Zzgl. Mwst. 19\% |  | 118.6 C |  |  |  |
| Gesamtbetrag C inkl. MwSt. |  | 1954.79 C |  |  |  |

Real-World Usage: A Case Study

Now that we’ve seen that Mistral OCR can work on different documents, we could imagine how a legal firm that is drowning in case files and contracts leverage this tool. By implementing a RAG system with Mistral OCR and Milvus, what once took a paralegal countless hours, like manually scanning for specific clauses or comparing past cases, now is done by AI in only a couple of minutes.

Next Steps

Ready to extract all your content? Head over to the notebook on GitHub for the full example, join our Discord to chat with the community, and start building today! You can also check out Mistral documentation about their OCR model

Say goodbye to parsing chaos, and hello to intelligent, scalable document understanding.

Try Managed Milvus for Free

Zilliz Cloud is hassle-free, powered by Milvus and 10x faster.

Get Started

Like the article? Spread the word

Keep Reading