Core

Batch OCR for PDFs using Mistral API

This module provides batch OCR processing for PDF documents using Mistral’s OCR API. It handles uploading PDFs, creating and submitting batch jobs, monitoring their completion, and saving the results as markdown files with extracted images. The main entry point is the ocr_pdf function, which processes a single PDF or an entire folder.

Setup

To use this module, you’ll need:

- A Mistral API key, set in your environment as MISTRAL_API_KEY or passed to functions
- PDF files to process
- The mistralai Python package installed

Load your API key from a .env file or set it directly:


get_api_key


def get_api_key(
    key:str=None, # Mistral API key
):

Get Mistral API key from parameter or environment
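The parameter-or-environment lookup can be sketched as follows. This is a minimal illustration, not the module's actual implementation: resolve_api_key is a hypothetical name, and raising ValueError when no key is found is an assumption.

```python
import os

def resolve_api_key(key=None, env_var="MISTRAL_API_KEY"):
    """Return the explicit key if given, otherwise the value of env_var."""
    # An explicitly passed key takes precedence over the environment.
    key = key or os.environ.get(env_var)
    if not key:
        # Assumption: fail loudly rather than return None.
        raise ValueError(f"No API key found: pass key=... or set {env_var}")
    return key
```

With MISTRAL_API_KEY exported in the shell, resolve_api_key() returns its value; resolve_api_key('my-key') ignores the environment.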

The pipeline works in stages: first we upload PDFs to Mistral’s servers and get signed URLs, then we create batch entries for each PDF, submit them as a job, monitor completion, and finally download and save the results as markdown files with extracted images.

PDF Upload


upload_pdf


def upload_pdf(
    path:str, # Path to PDF file
    key:str=None, # Mistral API key
)->tuple: # Mistral pdf signed url and client

Upload PDF to Mistral and return signed URL

fname = 'files/test/attention-is-all-you-need.pdf'
url, c = upload_pdf(fname)
url, c
('https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/3ff15e6900254f5b882c4d49d22dadb8.pdf?se=2026-02-04T11%3A54%3A16Z&sp=r&sv=2025-01-05&sr=b&sig=OZQUFCJ4wcHAU4VfiEpUZgIyZJfI8AYhxIIeYUGNn2k%3D',
 <mistralai.sdk.Mistral>)
test_url, test_c = upload_pdf('files/test/attention-is-all-you-need.pdf')
assert test_url.startswith('https://')

Batch Entry Creation

Each PDF needs a batch entry that tells Mistral where to find it and which options to use. The custom ID (cid) lets you match results to their source PDF later; by default it is the PDF filename without its extension.


create_batch_entry


def create_batch_entry(
    path:str, # Path to PDF file,
    url:str, # Mistral signed URL
    cid:str=None, # Custom ID (by default using the file name without extension)
    inc_img:bool=True, # Include image in response
    extract_header:bool=True, # Extract headers from document
    extract_footer:bool=True, # Extract footers from document
)->dict: # Batch entry dict

Create a batch entry dict for OCR

Note

When extract_header and extract_footer are True, headers and footers are extracted as separate page properties and excluded from the markdown output. Mistral defaults both to False, which includes them in the markdown.

entry = create_batch_entry(fname, url)
entry
{'custom_id': 'attention-is-all-you-need',
 'body': {'document': {'type': 'document_url',
   'document_url': 'https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/3ff15e6900254f5b882c4d49d22dadb8.pdf?se=2026-02-04T11%3A54%3A16Z&sp=r&sv=2025-01-05&sr=b&sig=OZQUFCJ4wcHAU4VfiEpUZgIyZJfI8AYhxIIeYUGNn2k%3D'},
  'include_image_base64': True,
  'extract_header': True,
  'extract_footer': True}}
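Judging from the output above, an entry is a plain dict that could be built roughly as below. This is a sketch that mirrors the printed structure; sketch_batch_entry is a hypothetical name and the real function may differ in detail.

```python
from pathlib import Path

def sketch_batch_entry(path, url, cid=None, inc_img=True,
                       extract_header=True, extract_footer=True):
    """Build a batch entry dict shaped like the example output."""
    return {
        # Default custom ID: file name without its extension.
        "custom_id": cid or Path(path).stem,
        "body": {
            "document": {"type": "document_url", "document_url": url},
            "include_image_base64": inc_img,
            "extract_header": extract_header,
            "extract_footer": extract_footer,
        },
    }
```

Passing cid explicitly is useful when several PDFs in different folders share the same file name.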

prep_pdf_batch


def prep_pdf_batch(
    path:str, # Path to PDF file,
    cid:str=None, # Custom ID (by default using the file name without extension)
    inc_img:bool=True, # Include image in response
    key:str=None, # API key
)->tuple: # Batch entry dict and Mistral client

Upload PDF and create batch entry in one step

entry, c = prep_pdf_batch(fname)
entry
{'custom_id': 'attention-is-all-you-need',
 'body': {'document': {'type': 'document_url',
   'document_url': 'https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/3ff15e6900254f5b882c4d49d22dadb8.pdf?se=2026-02-03T17%3A01%3A57Z&sp=r&sv=2025-01-05&sr=b&sig=YDpOZlSl9UKSLr7LGPQXBTOOuk1NCDqmO6suS4A/HIc%3D'},
  'include_image_base64': True,
  'extract_header': False,
  'extract_footer': False}}

Batch Submission

Once you have batch entries for all your PDFs, submit them as a single job. Mistral processes them asynchronously, so you’ll need to poll for completion.


submit_batch


def submit_batch(
    entries:list, # List of batch entries,
    c:Mistral=None, # Mistral client,
    model:str='mistral-ocr-latest', # Model name,
    endpoint:str='/v1/ocr', # Endpoint name
)->dict: # Job dict

Submit batch entries and return job

job = submit_batch([entry], c)
job
BatchJobOut(id='df55316d-b94b-4fa1-ba55-fc6bf376ce67', input_files=['a2e34e90-97a6-48c1-ab0e-627b95b7045a'], endpoint='/v1/ocr', errors=[], status='QUEUED', created_at=1770051720, total_requests=0, completed_requests=0, succeeded_requests=0, failed_requests=0, object='batch', metadata=None, model='mistral-ocr-latest', agent_id=None, output_file=None, error_file=None, outputs=None, started_at=None, completed_at=None)
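Mistral's batch API takes its input as an uploaded JSON Lines file, with one request object per line. How submit_batch builds that file is not shown here, so the serialization step below is only a sketch; entries_to_jsonl is a hypothetical helper.

```python
import json

def entries_to_jsonl(entries):
    """Serialize batch entries as JSON Lines: one JSON object per line."""
    # json.dumps emits no raw newlines, so each entry stays on a single line.
    return "\n".join(json.dumps(e) for e in entries).encode("utf-8")
```

The resulting bytes would be uploaded as the batch input file, and the returned file ID is what appears under input_files in the job object.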

Jobs can take from seconds to minutes depending on PDF size and queue length. The wait_for_job function polls every second by default; adjust this with the poll_interval parameter if needed. The queued_timeout parameter (300 seconds by default) caps how long a job may stay in the QUEUED state.


wait_for_job


def wait_for_job(
    job:dict, # Batch job from submit_batch
    c:Mistral=None, # Mistral client
    poll_interval:int=1, # Seconds between status checks
    queued_timeout:int=300, # Max seconds in QUEUED before timeout
)->dict: # Completed job dict

Poll job until completion and return final job status

final_job = wait_for_job(job, c)
final_job.status, final_job.succeeded_requests, final_job.total_requests
__main__ - INFO - Waiting for batch job df55316d-b94b-4fa1-ba55-fc6bf376ce67 (initial status: QUEUED)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: QUEUED (elapsed: 0s)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: QUEUED (elapsed: 1s)
[... 137 similar QUEUED lines (elapsed: 2s to 138s) omitted ...]
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: QUEUED (elapsed: 139s)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: RUNNING (elapsed: 140s)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: RUNNING (elapsed: 140s)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: RUNNING (elapsed: 140s)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: RUNNING (elapsed: 140s)
__main__ - DEBUG - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 status: RUNNING (elapsed: 140s)
__main__ - INFO - Job df55316d-b94b-4fa1-ba55-fc6bf376ce67 completed with status: SUCCESS
('SUCCESS', 1, 1)
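The polling pattern can be sketched independently of the Mistral client. In the sketch below, get_status is a hypothetical callable that returns the current job status string; the set of terminal statuses is an assumption, since only QUEUED, RUNNING and SUCCESS appear in the log above.

```python
import time

# Assumed terminal statuses; adjust to what the API actually returns.
TERMINAL = {"SUCCESS", "FAILED", "TIMEOUT_EXCEEDED", "CANCELLED"}

def poll_until_done(get_status, poll_interval=1, queued_timeout=300,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal status."""
    start = clock()
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        # Give up only if the job never leaves the queue.
        if status == "QUEUED" and clock() - start > queued_timeout:
            raise TimeoutError(f"Job still QUEUED after {queued_timeout}s")
        sleep(poll_interval)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without real waiting.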

download_results


def download_results(
    job:dict, # Job dict,
    c:Mistral=None, # Mistral client
)->list: # List of results

Download and parse batch job results

results = download_results(final_job, c)
len(results), results[0].keys()
(1, dict_keys(['id', 'custom_id', 'response', 'error']))

Each result contains the custom ID you provided, the OCR response with pages and markdown content, and any errors. Images are embedded as base64 strings and need to be decoded before saving.
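Decoding an embedded image might look like the sketch below. It assumes the base64 string may carry a data-URI prefix such as data:image/jpeg;base64, and strips it if present; decode_image is a hypothetical helper, not part of this module.

```python
import base64

def decode_image(b64):
    """Decode a base64 image string, tolerating a 'data:...;base64,' prefix."""
    if b64.startswith("data:") and "," in b64:
        # Keep only the payload after the first comma.
        b64 = b64.split(",", 1)[1]
    return base64.b64decode(b64)
```

The returned bytes can be written directly to a file opened in binary mode.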

Results Processing

Results contain the OCR’d markdown for each page, plus any extracted images as base64-encoded data. We save each page as a separate markdown file in a folder named after the PDF, with images in an img subfolder.


save_images


def save_images(
    page:dict, # Page dict,
    img_dir:str='img', # Directory to save images
)->None:

Save all images from a page to directory


save_page


def save_page(
    page:dict, # Page dict,
    dst:str, # Directory to save page
    img_dir:str='img', # Directory to save images
)->None:

Save single page markdown and images

For a PDF named paper.pdf, the output structure will be:

md/
  paper/
    page_1.md
    page_2.md
    img/
      img-0.jpeg
      img-1.jpeg

Each page is saved as a separate markdown file, making it easy to process or view individual pages.
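The layout above could be produced by something like the following sketch, which writes only the markdown files and creates the image folder. It assumes each page dict has a 0-based index key and a markdown key; write_pages is a hypothetical name and image writing is left out for brevity.

```python
from pathlib import Path

def write_pages(pages, dst, cid, img_dir="img"):
    """Write each page's markdown to <dst>/<cid>/page_<n>.md."""
    out = Path(dst) / cid
    (out / img_dir).mkdir(parents=True, exist_ok=True)
    for page in pages:
        # Assumption: 'index' is 0-based, so index 0 becomes page_1.md.
        name = f"page_{page['index'] + 1}.md"
        (out / name).write_text(page["markdown"], encoding="utf-8")
    return out
```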


save_pages


def save_pages(
    ocr_resp:dict, # OCR response,
    dst:str, # Directory to save pages,
    cid:str, # Custom ID
)->Path: # Output directory

Save markdown pages and images from OCR response to output directory

ocr_resp = results[0]['response']['body']
save_pages(ocr_resp, 'files/test/md', results[0]['custom_id'])
Path('files/test/md/attention-is-all-you-need')
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
files/test/md:
attention-is-all-you-need/  resnet/

files/test/md/attention-is-all-you-need:
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md/attention-is-all-you-need/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

files/test/md/resnet:
img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md

files/test/md/resnet/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

The ocr_pdf function below handles the entire pipeline: uploading PDFs, creating batch entries, submitting the job, waiting for completion, and saving results. It works on a single file or an entire folder.

High-level Interface


ocr_pdf


def ocr_pdf(
    path:str, # Path to PDF file or folder,
    dst:str='md', # Directory to save markdown pages,
    inc_img:bool=True, # Include image in response,
    key:str=None, # API key,
    poll_interval:int=2, # Poll interval in seconds
)->list: # List of output directories

OCR a PDF file or folder of PDFs and save results

mds = ocr_pdf('files/test', 'files/test/md_all')
mds
files/test/md_all:
attention-is-all-you-need/  iom-hoa-short/  resnet/

files/test/md_all/attention-is-all-you-need:
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md_all/attention-is-all-you-need/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

files/test/md_all/iom-hoa-short:
img/        page_14.md  page_2.md   page_25.md  page_30.md  page_8.md
page_1.md   page_15.md  page_20.md  page_26.md  page_31.md  page_9.md
page_10.md  page_16.md  page_21.md  page_27.md  page_4.md
page_11.md  page_17.md  page_22.md  page_28.md  page_5.md
page_12.md  page_18.md  page_23.md  page_29.md  page_6.md
page_13.md  page_19.md  page_24.md  page_3.md   page_7.md

files/test/md_all/iom-hoa-short/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

files/test/md_all/resnet:
img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md

files/test/md_all/resnet/img:
img-0.jpeg  img-10.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg  img-8.jpeg
img-1.jpeg  img-11.jpeg  img-3.jpeg  img-5.jpeg  img-7.jpeg  img-9.jpeg
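Accepting either a file or a folder comes down to normalizing the input into a list of PDF paths before upload. The sketch below shows one way to do that; collect_pdfs is a hypothetical helper, and scanning only the top level of the folder (no recursion) is an assumption.

```python
from pathlib import Path

def collect_pdfs(path):
    """Return a sorted list of PDF paths for a file or a folder."""
    p = Path(path)
    if p.is_dir():
        # Assumption: top-level PDFs only, subfolders are ignored.
        return sorted(p.glob("*.pdf"))
    if p.suffix.lower() == ".pdf":
        return [p]
    raise ValueError(f"Not a PDF file or folder: {path}")
```

Each returned path would then go through upload, batch entry creation, and a single job submission.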

Utilities

Helper functions for working with OCR results and preparing PDFs for processing.

Reading OCR Output

After running ocr_pdf, use read_pgs to load the markdown content. Set join=True (default) to get a single string suitable for LLM processing, or join=False to get a list of pages for individual page analysis.


read_pgs


def read_pgs(
    path:str, # OCR output directory,
    join:bool=True, # Join pages into single string
)->str | list[str]: # Joined string or list of page contents

Read specific page or all pages from OCR output directory

For instance, to read all pages as a single string:

text = read_pgs('files/test/md_all/resnet')
print(text[:500])
# Deep Residual Learning for Image Recognition

Kaiming He

Xiangyu Zhang

Shaoqing Ren

Jian Sun

Microsoft Research

{kahe, v-xiangz, v-shren, jiansun}@microsoft.com

# Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreference

Or, to get a list of markdown pages that you can index individually:

md_pgs = read_pgs('files/test/md_all/resnet', join=False)
len(md_pgs), md_pgs[0][:500]
(12,
 '# Deep Residual Learning for Image Recognition\n\nKaiming He\n\nXiangyu Zhang\n\nShaoqing Ren\n\nJian Sun\n\nMicrosoft Research\n\n{kahe, v-xiangz, v-shren, jiansun}@microsoft.com\n\n# Abstract\n\nDeeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreference')
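One detail worth noting when loading pages is ordering: a plain alphabetical sort puts page_10.md before page_2.md, as the directory listings above show. The sketch below sorts by page number instead; read_pages_sketch is a hypothetical stand-in for read_pgs, and the blank-line separator used when joining is an assumption.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.md")

def read_pages_sketch(path, join=True, sep="\n\n"):
    """Read page_<n>.md files in numeric order, joined or as a list."""
    numbered = []
    for f in Path(path).iterdir():
        m = PAGE_RE.fullmatch(f.name)
        if m:
            numbered.append((int(m.group(1)), f))
    # Sort by the integer page number, not by file name.
    pages = [f.read_text(encoding="utf-8") for _, f in sorted(numbered)]
    return sep.join(pages) if join else pages
```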

PDF Preparation

If you need to OCR just a subset of pages (e.g., to test on a few pages before processing a large document), use subset_pdf to extract a page range first.


subset_pdf


def subset_pdf(
    path:str, # Path to PDF file
    start:int=1, # Start page (1-based)
    end:int=None, # End page (1-based, inclusive)
    dst:str='.', # Output directory
)->Path: # Path to subset PDF

Extract page range from PDF and save with range suffix

fname = 'files/test/attention-is-all-you-need.pdf'
subset_pdf(fname, 5, 10, 'files/test/xtra')
Path('files/test/xtra/attention-is-all-you-need_p5-10.pdf')
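The page-range arithmetic and the output name can be sketched without a PDF library. In the sketch below, subset_plan is a hypothetical helper: it converts the 1-based inclusive range into 0-based page indices and builds the file name following the pattern seen in the output above. Treating end=None as "through the last page" is an assumption.

```python
from pathlib import Path

def subset_plan(path, n_pages, start=1, end=None, dst="."):
    """Return (0-based page indices, output path) for a 1-based inclusive range."""
    end = n_pages if end is None else end
    if not 1 <= start <= end <= n_pages:
        raise ValueError(f"Invalid page range {start}-{end} for {n_pages} pages")
    # 1-based inclusive [start, end] maps to 0-based half-open [start-1, end).
    indices = range(start - 1, end)
    out = Path(dst) / f"{Path(path).stem}_p{start}-{end}.pdf"
    return indices, out
```

The indices would then be passed to whichever PDF library copies the pages into the new file.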