This module provides batch OCR processing for PDF documents using Mistral’s OCR API. It handles uploading PDFs, creating and submitting batch jobs, monitoring their completion, and saving the results as markdown files with extracted images. The main entry point is the ocr function, which processes a single PDF or an entire folder.
Setup
To use this module, you’ll need:
- A Mistral API key, set in your environment as MISTRAL_API_KEY or passed to functions
- PDF files to process
- The mistralai Python package installed
Load your API key from a .env file or set it directly:
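A minimal sketch of both options, using only the standard library. The helper name load_api_key and its parameters are hypothetical and not part of this module; it checks an explicit argument first, then the environment, then a simple KEY=VALUE .env file.

```python
import os

def load_api_key(key=None, env_var="MISTRAL_API_KEY", env_file=".env"):
    """Return an API key from an argument, the environment, or a .env file."""
    if key:                              # an explicitly passed key wins
        return key
    if os.environ.get(env_var):          # already exported in the shell
        return os.environ[env_var]
    # Fall back to a very small KEY=VALUE parser for the .env file.
    if os.path.exists(env_file):
        with open(env_file, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                name, value = line.split("=", 1)
                if name.strip() == env_var:
                    return value.strip().strip("'\"")
    raise RuntimeError(f"{env_var} is not set")
```

If you already use a package such as python-dotenv, its loader can replace the hand-written parser here.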
The pipeline works in stages: first we upload PDFs to Mistral’s servers and get signed URLs, then we create batch entries for each PDF, submit them as a job, monitor completion, and finally download and save the results as markdown files with extracted images.
Each PDF needs a batch entry that tells Mistral where to find it and what options to use. The custom ID (cid) lets you match results back to their PDFs later; by default it is the PDF filename.
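As an illustration, a batch entry can be pictured as a plain dictionary. The sketch below is an assumption about the entry layout (a custom_id plus a body holding the document URL and an image flag), and make_batch_entry is a hypothetical name rather than this module's function. It also assumes the default ID is the filename without its extension.

```python
from pathlib import Path

def make_batch_entry(pdf_path, signed_url, cid=None, include_images=True):
    """Build one batch entry (a plain dict) for an already-uploaded PDF."""
    return {
        # Assumed default: the filename without the .pdf extension.
        "custom_id": cid or Path(pdf_path).stem,
        "body": {
            # Assumed request shape: a document referenced by its signed URL.
            "document": {"type": "document_url", "document_url": signed_url},
            "include_image_base64": include_images,
        },
    }
```

Passing an explicit cid is useful when two PDFs in different folders share a filename.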
Once you have batch entries for all your PDFs, submit them as a single job. Mistral processes them asynchronously, so you’ll need to poll for completion.
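Batch jobs of this kind are typically submitted as a JSONL file, one entry per line. The sketch below shows only that serialization step and assumes the entries are plain dictionaries; entries_to_jsonl is a hypothetical helper. The upload and job-creation calls are left out because they depend on the SDK version you have installed.

```python
import json

def entries_to_jsonl(entries):
    """Serialize batch entries to JSONL bytes: one JSON object per line."""
    return "\n".join(json.dumps(e) for e in entries).encode("utf-8")
```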
Jobs can take anywhere from seconds to minutes, depending on PDF size and queue length. The wait_for_job function polls every 10 seconds by default; pass a different poll_interval to change this.
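The polling pattern itself is simple. Here is a rough sketch, not the module's wait_for_job: poll_until_done is a hypothetical name, get_status stands in for whatever call returns the job's current status, and the terminal status names are assumptions to check against your SDK.

```python
import time

# Assumed terminal status names; verify against what your SDK reports.
TERMINAL = {"SUCCESS", "FAILED", "TIMEOUT_EXCEEDED", "CANCELLED"}

def poll_until_done(get_status, poll_interval=10, timeout=None, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"job still {status} after {waited}s")
        sleep(poll_interval)          # wait before asking again
        waited += poll_interval
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.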
Each result contains the custom ID you provided, the OCR response with pages and markdown content, and any errors. Images are embedded as base64 strings and need to be decoded before saving.
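Decoding an embedded image takes one standard-library call. This sketch assumes the string may or may not carry a data-URI prefix such as data:image/jpeg;base64, and handles both cases; decode_image is a hypothetical helper name.

```python
import base64

def decode_image(b64_data):
    """Decode a base64 image string, dropping a data-URI prefix if present."""
    if b64_data.startswith("data:") and "," in b64_data:
        b64_data = b64_data.split(",", 1)[1]   # keep only the payload
    return base64.b64decode(b64_data)
```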
Results Processing
Results contain the OCR’d markdown for each page, plus any extracted images as base64-encoded data. We save each page as a separate markdown file in a folder named after the PDF, with images in an img subfolder.
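The layout described above can be sketched as follows. This is an illustration under assumptions, not the module's own code: save_pages is a hypothetical name, pages are assumed to be dictionaries with markdown and images keys, each image is assumed to carry id and image_base64 fields, and the page_N.md naming is a guess.

```python
import base64
from pathlib import Path

def save_pages(pages, pdf_name, dst="md"):
    """Write pages to <dst>/<pdf stem>/page_N.md and images to an img subfolder."""
    out_dir = Path(dst) / Path(pdf_name).stem
    img_dir = out_dir / "img"
    img_dir.mkdir(parents=True, exist_ok=True)
    for n, page in enumerate(pages, start=1):
        (out_dir / f"page_{n}.md").write_text(page["markdown"], encoding="utf-8")
        for img in page.get("images", []):
            # Drop a data-URI prefix if there is one, then decode.
            data = img["image_base64"].split(",", 1)[-1]
            (img_dir / img["id"]).write_bytes(base64.b64decode(data))
    return out_dir
```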
The ocr function below handles the entire pipeline: uploading PDFs, creating batch entries, submitting the job, waiting for completion, and saving results. This makes it suitable for processing single files or entire folders.
The helper functions below work with OCR results and prepare PDFs for processing.
Reading OCR Output
After running ocr_pdf, use read_pgs to load the markdown content. Set join=True (default) to get a single string suitable for LLM processing, or join=False to get a list of pages for individual page analysis.
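To make the join behavior concrete, here is a rough sketch of what such a reader might do. It is not read_pgs itself: read_pages_sketch is a hypothetical name, and the page_N.md file naming and the blank-line separator are assumptions.

```python
import re
from pathlib import Path

def read_pages_sketch(folder, join=True):
    """Read page_N.md files in numeric order; return one string or a list."""
    def page_no(p):
        # Sort by the number in the filename so page_10 follows page_9.
        return int(re.search(r"(\d+)", p.stem).group(1))
    files = sorted(Path(folder).glob("page_*.md"), key=page_no)
    pages = [f.read_text(encoding="utf-8") for f in files]
    return "\n\n".join(pages) if join else pages
```

Sorting numerically matters because a plain alphabetical sort would place page_10.md before page_2.md.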
If you need to OCR just a subset of pages (e.g., to test on a few pages before processing a large document), use subset_pdf to extract a page range first.