This module provides batch OCR processing for PDF documents using Mistral’s OCR API. It handles uploading PDFs, creating and submitting batch jobs, monitoring their completion, and saving the results as markdown files with extracted images. The main entry point is the ocr_pdf function, which processes single PDFs or entire folders.
Setup
To use this module, you’ll need:
- A Mistral API key, set in your environment as MISTRAL_API_KEY or passed to functions
- PDF files to process
- The mistralai Python package installed
Load your API key from a .env file or set it directly:
get_api_key
def get_api_key(
    key:str=None, # Mistral API key
):
Get Mistral API key from parameter or environment
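The lookup order described here (explicit parameter first, then the MISTRAL_API_KEY environment variable) can be sketched as follows. This is a minimal illustration, not the module’s actual implementation: the name resolve_key and the choice to raise ValueError when no key is found are assumptions.

```python
import os

def resolve_key(key=None, env_var="MISTRAL_API_KEY"):
    """Hypothetical helper: prefer an explicit key, else fall back to the environment."""
    key = key or os.environ.get(env_var)
    if not key:
        # Assumption: failing early is preferable to a later authentication error.
        raise ValueError(f"No API key found; pass one or set {env_var}")
    return key
```

Passing the key explicitly is convenient in notebooks, while the environment variable keeps secrets out of source files.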
The pipeline works in stages: first we upload PDFs to Mistral’s servers and get signed URLs, then we create batch entries for each PDF, submit them as a job, monitor completion, and finally download and save the results as markdown files with extracted images.
PDF Upload
upload_pdf
def upload_pdf(
    path:str, # Path to PDF file
    key:str=None, # Mistral API key
)->tuple: # Mistral pdf signed url and client
Upload PDF to Mistral and return signed URL
fname = 'files/test/attention-is-all-you-need.pdf'
url, c = upload_pdf(fname)
url, c
Each PDF needs a batch entry that tells Mistral where to find it and what options to use. The custom ID (cid) lets you match results back to their source PDF later; by default it is the PDF file name without its extension.
create_batch_entry
def create_batch_entry(
    path:str, # Path to PDF file
    url:str, # Mistral signed URL
    cid:str=None, # Custom ID (by default using the file name without extension)
    inc_img:bool=True, # Include image in response
    extract_header:bool=True, # Extract headers from document
    extract_footer:bool=True, # Extract footers from document
)->dict: # Batch entry dict
Create a batch entry dict for OCR
Note
When extract_header and extract_footer are True, headers and footers are extracted as separate page properties and excluded from the markdown output. Mistral defaults both to False, which includes them in the markdown.
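As a rough illustration of what such an entry might contain, the sketch below builds a dict from the same parameters. The function name make_entry is hypothetical, and the field names (custom_id, body, document_url, include_image_base64, extract_header, extract_footer) reflect an assumed request layout that should be checked against Mistral’s API reference; the module’s real create_batch_entry may differ.

```python
from pathlib import Path

def make_entry(path, url, cid=None, inc_img=True,
               extract_header=True, extract_footer=True):
    """Hypothetical stand-in for create_batch_entry; field names are assumptions."""
    return {
        # Default custom ID: the file name without its extension
        "custom_id": cid or Path(path).stem,
        "body": {
            "document": {"type": "document_url", "document_url": url},
            "include_image_base64": inc_img,
            "extract_header": extract_header,
            "extract_footer": extract_footer,
        },
    }

# The URL below is a placeholder, not a real signed URL.
entry = make_entry("files/test/attention-is-all-you-need.pdf",
                   "https://example.com/signed-url")
print(entry["custom_id"])  # attention-is-all-you-need
```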
def prep_pdf_batch(
    path:str, # Path to PDF file
    cid:str=None, # Custom ID (by default using the file name without extension)
    inc_img:bool=True, # Include image in response
    key:NoneType=None, # API key
)->dict: # Batch entry dict
Once you have batch entries for all your PDFs, submit them as a single job. Mistral processes them asynchronously, so you’ll need to poll for completion.
submit_batch
def submit_batch(
    entries:list, # List of batch entries
    c:Mistral=None, # Mistral client
    model:str='mistral-ocr-latest', # Model name
    endpoint:str='/v1/ocr', # Endpoint name
)->dict: # Job dict
Jobs can take from seconds to minutes depending on PDF size and queue length. The wait_for_job function polls every second by default (poll_interval=1); increase poll_interval to check less often.
wait_for_job
def wait_for_job(
    job:dict, # Batch job from submit_batch
    c:Mistral=None, # Mistral client
    poll_interval:int=1, # Seconds between status checks
    queued_timeout:int=300, # Max seconds in QUEUED before timeout
)->dict: # Completed job dict
Poll job until completion and return final job status
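The polling behaviour can be sketched with a generic loop. This is an illustration under assumptions rather than the module’s code: poll_until_done and get_status are hypothetical names, and apart from QUEUED the status strings (RUNNING, SUCCESS) are assumed. The sleep and clock arguments are injectable only so the loop can be exercised without real waiting.

```python
import time

def poll_until_done(get_status, poll_interval=1, queued_timeout=300,
                    sleep=time.sleep, clock=time.monotonic):
    """Hypothetical polling loop; get_status is any callable returning a status string."""
    start = clock()
    while True:
        status = get_status()
        # Give up if the job has not left the queue within the timeout.
        if status == "QUEUED" and clock() - start > queued_timeout:
            raise TimeoutError(f"Job still QUEUED after {queued_timeout}s")
        if status not in ("QUEUED", "RUNNING"):
            return status  # any other status is treated as terminal
        sleep(poll_interval)
```

A separate timeout for the queued state keeps a job that never starts from blocking forever, while still allowing long-running jobs to finish.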
Each result contains the custom ID you provided, the OCR response with pages and markdown content, and any errors. Images are embedded as base64 strings and need to be decoded before saving.
Results Processing
Results contain the OCR’d markdown for each page, plus any extracted images as base64-encoded data. We save each page as a separate markdown file in a folder named after the PDF, with images in an img subfolder.
save_images
def save_images(
    page:dict, # Page dict
    img_dir:str='img', # Directory to save images
)->None:
Save all images from a page to directory
save_page
def save_page(
    page:dict, # Page dict
    dst:str, # Directory to save page
    img_dir:str='img', # Directory to save images
)->None:
Save single page markdown and images
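Decoding and saving can be sketched as below. The page dict keys (index, markdown, images, id, image_base64), the page_<index>.md file name, and the function names are all assumptions made for illustration; the module’s save_images and save_page may organise things differently.

```python
import base64
from pathlib import Path

def decode_image(data):
    """Decode a base64 string, dropping a 'data:...;base64,' prefix if present."""
    if data.startswith("data:"):
        data = data.split(",", 1)[1]
    return base64.b64decode(data)

def save_page_sketch(page, dst, img_dir="img"):
    """Hypothetical version of save_page; the page dict keys are assumptions."""
    dst = Path(dst)
    (dst / img_dir).mkdir(parents=True, exist_ok=True)
    # Assumed file name pattern: page_<index>.md
    (dst / f"page_{page['index']}.md").write_text(page["markdown"], encoding="utf-8")
    for img in page.get("images", []):
        # Each image is written as raw bytes under its assumed ID
        (dst / img_dir / img["id"]).write_bytes(decode_image(img["image_base64"]))
```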
For a PDF named paper.pdf, the output is a folder named paper inside the destination directory, holding one markdown file per page and an img subfolder with the extracted images.
The ocr_pdf function below handles the entire pipeline: uploading PDFs, creating batch entries, submitting the job, waiting for completion, and saving results. It works for single files as well as entire folders.
High-level Interface
ocr_pdf
def ocr_pdf(
    path:str, # Path to PDF file or folder
    dst:str='md', # Directory to save markdown pages
    inc_img:bool=True, # Include image in response
    key:str=None, # API key
    poll_interval:int=2, # Poll interval in seconds
)->list: # List of output directories
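Because path may be a single file or a folder, the first step is presumably to expand it into a list of PDFs. The helper below is a hypothetical sketch of that step; the name find_pdfs and the non-recursive folder scan are assumptions, not documented behaviour of ocr_pdf.

```python
from pathlib import Path

def find_pdfs(path):
    """Hypothetical helper: return the PDFs that a file-or-folder path refers to."""
    p = Path(path)
    if p.is_dir():
        # Assumption: only the top level of the folder is scanned
        return sorted(f for f in p.iterdir() if f.suffix.lower() == ".pdf")
    return [p]
```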
Helper functions for working with OCR results and preparing PDFs for processing.
Reading OCR Output
After running ocr_pdf, use read_pgs to load the markdown content. Set join=True (default) to get a single string suitable for LLM processing, or join=False to get a list of pages for individual page analysis.
read_pgs
def read_pgs(
    path:str, # OCR output directory
    join:bool=True, # Join pages into single string
)->str|list[str]: # Joined string or list of page contents
Read specific page or all pages from OCR output directory
For instance, to read all pages:
text = read_pgs('files/test/md_all/resnet')
print(text[:500])
# Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
# Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreference
Or a list of markdown pages (that you can further index):
(12,
'# Deep Residual Learning for Image Recognition\n\nKaiming He\n\nXiangyu Zhang\n\nShaoqing Ren\n\nJian Sun\n\nMicrosoft Research\n\n{kahe, v-xiangz, v-shren, jiansun}@microsoft.com\n\n# Abstract\n\nDeeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreference')
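The join behaviour can be sketched as follows. This is not read_pgs itself: the name read_pages_sketch, the assumption that each page file name contains its page number, and the blank-line separator between pages are all illustrative choices.

```python
import re
from pathlib import Path

def read_pages_sketch(path, join=True):
    """Hypothetical reader: loads *.md files in numeric page order."""
    def page_no(f):
        m = re.search(r"(\d+)", f.stem)
        return int(m.group(1)) if m else 0
    # Sort by the number in the name so that page 10 follows page 9, not page 1
    files = sorted(Path(path).glob("*.md"), key=page_no)
    pages = [f.read_text(encoding="utf-8") for f in files]
    return "\n\n".join(pages) if join else pages
```

Sorting numerically rather than alphabetically matters once a document has ten or more pages.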
PDF Preparation
If you need to OCR just a subset of pages (e.g., to test on a few pages before processing a large document), use subset_pdf to extract a page range first.
subset_pdf
def subset_pdf(
    path:str, # Path to PDF file
    start:int=1, # Start page (1-based)
    end:int=None, # End page (1-based, inclusive)
    dst:str='.', # Output directory
)->Path: # Path to subset PDF
Extract page range from PDF and save with range suffix
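Since start and end are 1-based and inclusive, they need converting to the zero-based indices most PDF libraries use. The helper below sketches that conversion only; page_indices is a hypothetical name, and clamping end to the page count and rejecting empty ranges are assumptions about sensible behaviour.

```python
def page_indices(n_pages, start=1, end=None):
    """Hypothetical helper: 1-based inclusive range -> zero-based page indices."""
    # end=None means "through the last page"; larger values are clamped
    end = n_pages if end is None else min(end, n_pages)
    if start < 1 or start > end:
        raise ValueError(f"Invalid page range {start}-{end} for {n_pages} pages")
    return list(range(start - 1, end))

print(page_indices(12, start=3, end=5))  # [2, 3, 4]
```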