Core

OCR for PDFs using Mistral API

This module provides OCR processing for PDF documents using Mistral’s OCR API. A single PDF is processed with direct (synchronous) OCR for immediate results; multiple PDFs go through the batch API for cost savings. Results are saved as markdown files with extracted images. The main entry point is ocr_pdf, which picks the mode automatically from the input.

Setup

To use this module, you’ll need:

- A Mistral API key, set in your environment as MISTRAL_API_KEY or passed to functions
- PDF files to process
- The mistralai Python package installed

Load your API key from a .env file or set MISTRAL_API_KEY directly in your environment.


get_api_key


def get_api_key(
    key:str=None, # Mistral API key
):

Get Mistral API key from parameter or environment
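
As an illustration of the "parameter or environment" lookup, here is a minimal sketch. The name resolve_api_key and the decision to raise ValueError when no key is found are assumptions made for this example, not the module's actual implementation.

```python
import os

def resolve_api_key(key=None, env_var="MISTRAL_API_KEY"):
    """Return the explicit key if given, otherwise the value of env_var."""
    # An explicit argument takes precedence over the environment.
    resolved = key or os.environ.get(env_var)
    if not resolved:
        # Assumption: fail loudly rather than send an empty key to the API.
        raise ValueError(f"No API key found: pass key=... or set {env_var}")
    return resolved
```

Passing the key explicitly is handy in tests; relying on the environment variable keeps secrets out of notebooks.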

Single PDF OCR

For processing a single PDF with immediate results (no batch queue), we use direct OCR via base64 encoding.


run_single


def run_single(
    path, # Path to PDF
    inc_img:bool=True, # Include image in response
    extract_header:bool=True, # Extract header
    extract_footer:bool=True, # Extract footer
    key:NoneType=None, # Mistral API key
):

Run direct OCR on a single PDF (no batch queue)

result = run_single('files/test/attention-is-all-you-need.pdf')
len(result.pages), result.pages[0].markdown[:200]
(15,
 '# Attention Is All You Need\n\nAshish Vaswani*\n\nGoogle Brain\n\navaswani@google.com\n\nNoam Shazeer*\n\nGoogle Brain\n\nnoam@google.com\n\nNiki Parmar*\n\nGoogle Research\n\nnikip@google.com\n\nJakob Uszkoreit*\n\nGoogle')
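
The base64 step mentioned above can be sketched as follows. This assumes the PDF is sent inline as a data URL with the application/pdf media type; the helper name pdf_bytes_to_data_url is hypothetical and run_single may build its request differently.

```python
import base64

def pdf_bytes_to_data_url(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a base64 data URL."""
    # b64encode returns bytes; decode to get a plain ASCII string.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"
```

To encode a file on disk, read it first with Path(path).read_bytes() and pass the result in.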

Batch PDF OCR

The pipeline works in stages: first we upload PDFs to Mistral’s servers and get signed URLs, then we create batch entries for each PDF, submit them as a job, monitor completion, and finally download and save the results as markdown files with extracted images.

PDF Upload


upload_pdf


def upload_pdf(
    path:str, # Path to PDF file
    key:str=None, # Mistral API key
)->tuple: # Mistral PDF signed URL and client

Upload PDF to Mistral and return the signed URL and client

fname = 'files/test/attention-is-all-you-need.pdf'
url, c = upload_pdf(fname)
url, c
('https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/b115c933188b4371a6a11793c2a59ace.pdf?se=2026-02-07T12%3A23%3A37Z&sp=r&sv=2026-02-06&sr=b&sig=rtxkyBXX6O3eyLE0Z8Zw3LuS6AM%2BWqpoAnegHDeSVsM%3D',
 <mistralai.sdk.Mistral>)
test_url, test_c = upload_pdf('files/test/attention-is-all-you-need.pdf')
assert test_url.startswith('https://')

Batch Entry Creation

Each PDF needs a batch entry that tells Mistral where to find it and which options to use. The custom ID (cid) lets you match results to their source PDF later; by default it is the PDF filename without its extension.


create_batch_entry


def create_batch_entry(
    path:str, # Path to PDF file
    url:str, # Mistral signed URL
    cid:str=None, # Custom ID (by default using the file name without extension)
    inc_img:bool=True, # Include image in response
    extract_header:bool=True, # Extract headers from document
    extract_footer:bool=True, # Extract footers from document
)->dict: # Batch entry dict

Create a batch entry dict for OCR

Note

When extract_header and extract_footer are True, headers and footers are extracted as separate page properties and excluded from the markdown output. Mistral defaults both to False, which includes them in the markdown.

entry = create_batch_entry(fname, url)
entry
{'custom_id': 'attention-is-all-you-need',
 'body': {'document': {'type': 'document_url',
   'document_url': 'https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/b115c933188b4371a6a11793c2a59ace.pdf?se=2026-02-07T12%3A23%3A37Z&sp=r&sv=2026-02-06&sr=b&sig=rtxkyBXX6O3eyLE0Z8Zw3LuS6AM%2BWqpoAnegHDeSVsM%3D'},
  'include_image_base64': True,
  'extract_header': True,
  'extract_footer': True}}
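
The shape of the entry shown above can be reproduced with a few lines of plain Python. This is a sketch that mirrors the printed output; make_entry is a hypothetical name, and the real create_batch_entry may validate or normalize its inputs differently.

```python
from pathlib import Path

def make_entry(path, url, cid=None, inc_img=True,
               extract_header=True, extract_footer=True):
    """Build a batch entry dict shaped like the output shown above."""
    return {
        # Default custom ID: the file name without its extension.
        "custom_id": cid or Path(path).stem,
        "body": {
            "document": {"type": "document_url", "document_url": url},
            "include_image_base64": inc_img,
            "extract_header": extract_header,
            "extract_footer": extract_footer,
        },
    }
```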

prep_pdf_batch


def prep_pdf_batch(
    path:str, # Path to PDF file
    cid:str=None, # Custom ID (by default using the file name without extension)
    inc_img:bool=True, # Include image in response
    key:NoneType=None, # API key
)->dict: # Batch entry dict

Upload PDF and create batch entry in one step

entry, c = prep_pdf_batch(fname)
entry
{'custom_id': 'attention-is-all-you-need',
 'body': {'document': {'type': 'document_url',
   'document_url': 'https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/b115c933188b4371a6a11793c2a59ace.pdf?se=2026-02-07T15%3A38%3A48Z&sp=r&sv=2026-02-06&sr=b&sig=U%2Bzo%2BntyDgW4BKNG2T7hP/OYDsSaUFtZMyydxugOXio%3D'},
  'include_image_base64': True,
  'extract_header': True,
  'extract_footer': True}}

Batch Submission

Once you have batch entries for all your PDFs, submit them as a single job. Mistral processes them asynchronously, so you’ll need to poll for completion.


submit_batch


def submit_batch(
    entries:list, # List of batch entries
    c:Mistral=None, # Mistral client
    model:str='mistral-ocr-latest', # Model name
    endpoint:str='/v1/ocr', # Endpoint name
)->dict: # Job dict

Submit batch entries and return job

job = submit_batch([entry], c)
job
BatchJobOut(id='482ba9ee-559f-4b11-8f60-8aabe739ad05', input_files=['ce74af1e-a2b4-4d8e-9256-fab329c3d2de'], endpoint='/v1/ocr', errors=[], status='QUEUED', created_at=1770393138, total_requests=0, completed_requests=0, succeeded_requests=0, failed_requests=0, object='batch', metadata=None, model='mistral-ocr-latest', agent_id=None, output_file=None, error_file=None, outputs=None, started_at=None, completed_at=None)
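
Batch jobs of this kind are typically fed a JSONL file with one request per line. Assuming submit_batch serializes its entries that way before uploading them (an assumption, since the function body is not shown here), the serialization step might look like this sketch. The helper name entries_to_jsonl is hypothetical.

```python
import json

def entries_to_jsonl(entries):
    """Serialize batch entries as JSONL bytes, one JSON object per line."""
    # json.dumps never emits raw newlines, so each entry stays on one line.
    return "\n".join(json.dumps(e) for e in entries).encode("utf-8")
```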

wait_for_job


def wait_for_job(
    job, # Batch job from submit_batch
    c:NoneType=None, # Mistral client
    poll_interval:int=10, # Seconds between status checks
    timeout:NoneType=None, # Max total seconds before timeout
)->dict: # Job dict

Poll job until completion, printing only on status change
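
The polling behaviour described here (check at a fixed interval, print only when the status changes, stop on timeout) can be sketched without any API calls. In this example get_status is a hypothetical callable standing in for a job lookup, and the set of terminal statuses is an assumption.

```python
import time

# Assumption: statuses after which a job will not change any more.
TERMINAL = {"SUCCESS", "FAILED", "TIMEOUT_EXCEEDED", "CANCELLED"}

def poll_until_done(get_status, poll_interval=10, timeout=None, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or time runs out."""
    start = time.monotonic()
    last = None
    while True:
        status = get_status()
        if status != last:
            # Print only on a change to keep long waits quiet.
            print(f"status: {status}")
            last = status
        if status in TERMINAL:
            return status
        if timeout is not None and time.monotonic() - start >= timeout:
            raise TimeoutError(f"job still {status} after {timeout}s")
        sleep(poll_interval)
```

Injecting sleep as a parameter makes the loop easy to exercise in tests without real waiting.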

final_job = wait_for_job(job, c, timeout=1200)
final_job.status, final_job.succeeded_requests, final_job.total_requests
jobs = c.batch.jobs.list()
for j in jobs.data[:10]: print(j.id, j.status, j.created_at)
482ba9ee-559f-4b11-8f60-8aabe739ad05 CANCELLATION_REQUESTED 1770393138
40e30b98-dc12-412a-a5ce-52818f3a097a CANCELLATION_REQUESTED 1770392814
39e42cd1-b2e4-423a-8d63-ca17697efac8 CANCELLED 1770392339
23450f30-0d63-43be-83ce-206102e47c6d CANCELLATION_REQUESTED 1770390898
60e1ce21-6af8-4ad3-862d-6df52bd34cc7 CANCELLATION_REQUESTED 1770389326
ed556e64-884e-4349-9579-ed37af5e6920 CANCELLED 1770373917
2fa17c1d-959a-46ae-b0b2-7d966eb2c56a CANCELLED 1770371001
e7d64ad4-8394-4c9a-a3f4-d36a1681afcd CANCELLED 1770310312
9683cafb-a3a4-4346-934b-e6b06865657e CANCELLED 1770307850
944d4d93-271e-4994-8c72-dd1cf54d1182 CANCELLED 1770307352
stuck_job = c.batch.jobs.cancel(job_id='39e42cd1-b2e4-423a-8d63-ca17697efac8')
stuck_job.status, stuck_job.errors
('CANCELLED', [])

download_results


def download_results(
    job:dict, # Job dict
    c:Mistral=None, # Mistral client
)->list: # List of results

Download and parse batch job results

results = download_results(final_job, c)
len(results), results[0].keys()
(1, dict_keys(['id', 'custom_id', 'response', 'error']))

Each result contains the custom ID you provided, the OCR response with pages and markdown content, and any errors. Images are embedded as base64 strings and need to be decoded before saving.
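
Decoding an embedded image might look like the sketch below. It assumes the base64 string may or may not carry a data-URL header such as data:image/jpeg;base64, and handles both cases; decode_image is a hypothetical helper, not part of this module.

```python
import base64

def decode_image(data: str) -> bytes:
    """Decode a base64 image string, with or without a data-URL header."""
    # Drop a header such as "data:image/jpeg;base64," if present.
    if data.startswith("data:") and "," in data:
        data = data.split(",", 1)[1]
    return base64.b64decode(data)
```

The returned bytes can be written straight to disk with Path(...).write_bytes(...).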

Batch Orchestration


prep_batch


def prep_batch(
    pdfs:list, # List of PDF paths
    inc_img:bool=True, # Include images in response
    key:str=None, # Mistral API key
)->tuple: # Batch entries and Mistral client

Upload PDFs and prepare batch entries


run_batch


def run_batch(
    entries:list, # Batch entries from prep_batch
    c:Mistral, # Mistral client
    poll_interval:int=2, # Seconds between status checks
)->list: # List of OCR results

Submit batch, wait for completion, and download results

Saving Results

OCR results contain markdown for each page, plus any extracted images as base64-encoded data. The functions below save each page as a separate markdown file in a folder named after the PDF, with images in an img subfolder.


single_to_batch_result


def single_to_batch_result(
    resp, # OCRResponse from run_single
    cid, # Custom ID (typically PDF filename stem)
)->list: # List containing single batch-format result

Adapt single OCR response to batch result format for uniform downstream processing
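
A sketch of the adaptation, based on the result keys shown earlier (id, custom_id, response, error) and on the response body being read from result['response']['body']. The helper name wrap_single and the None placeholders for id and error are assumptions.

```python
def wrap_single(resp_body: dict, cid: str) -> list:
    """Wrap one OCR response body in a one-element batch-style result list."""
    return [{
        "id": None,                       # assumption: no batch request ID exists
        "custom_id": cid,
        "response": {"body": resp_body},  # same nesting as a batch result
        "error": None,
    }]
```

With this shape, code written for batch results can consume a single-PDF response unchanged.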


save_images


def save_images(
    page:dict, # Page dict
    img_dir:str='img', # Directory to save images
)->None:

Save all images from a page to directory


save_page


def save_page(
    page:dict, # Page dict
    dst:str, # Directory to save page
    img_dir:str='img', # Directory to save images
)->None:

Save single page markdown and images

For a PDF named paper.pdf, the output structure will be:

md/
  paper/
    page_1.md
    page_2.md
    img/
      img-0.jpeg
      img-1.jpeg

Each page is saved as a separate markdown file, making it easy to process or view individual pages.
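
The layout above can be produced with a short loop. This sketch assumes each page is a dict with a 0-based index and a markdown string, and it leaves image handling out; write_pages is a hypothetical name.

```python
from pathlib import Path

def write_pages(pages, dst, cid):
    """Write one page_N.md file per page under dst/cid and return that folder."""
    out = Path(dst) / cid
    out.mkdir(parents=True, exist_ok=True)
    for page in pages:
        # Assumption: page["index"] is 0-based, file names are 1-based.
        n = page["index"] + 1
        (out / f"page_{n}.md").write_text(page["markdown"], encoding="utf-8")
    return out
```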


save_pages


def save_pages(
    ocr_resp:dict, # OCR response
    dst:str, # Directory to save pages
    cid:str, # Custom ID
)->Path: # Output directory

Save markdown pages and images from OCR response to output directory

ocr_resp = results[0]['response']['body']
save_pages(ocr_resp, 'files/test/md', results[0]['custom_id'])
Path('files/test/md/attention-is-all-you-need')
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
files/test/md:
attention-is-all-you-need/  resnet/

files/test/md/attention-is-all-you-need:
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md/attention-is-all-you-need/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

files/test/md/resnet:
img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md

files/test/md/resnet/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

High-level Interface

ocr_pdf is the main entry point. It auto-detects the best mode: single PDFs use direct OCR for immediate results; multiple PDFs use the batch API for cost savings.


get_paths


def get_paths(
    path:str, # Path to file or folder
)->list: # List of PDF paths

Get list of PDFs from file or folder

For instance:

get_paths('files/test/attention-is-all-you-need.pdf'), get_paths('files/test')
([Path('files/test/attention-is-all-you-need.pdf')],
 [Path('files/test/attention-is-all-you-need.pdf'), Path('files/test/resnet.pdf')])
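
A minimal sketch of this file-or-folder behaviour using pathlib. Whether the real get_paths recurses into subfolders or sorts its output is not shown here, so the non-recursive, sorted listing below is an assumption, as is the name list_pdfs.

```python
from pathlib import Path

def list_pdfs(path):
    """Return [path] for a single file, or the PDFs directly inside a folder."""
    p = Path(path)
    if p.is_file():
        return [p]
    # Assumption: non-recursive, case-insensitive on the extension, sorted by name.
    return sorted(f for f in p.iterdir() if f.suffix.lower() == ".pdf")
```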

ocr_pdf


def ocr_pdf(
    path:str, # Path to PDF file or folder
    dst:str='md', # Directory to save markdown pages
    inc_img:bool=True, # Include image in response
    key:str=None, # API key
    poll_interval:int=2, # Poll interval in seconds
)->list: # List of output directories

OCR PDF(s) and save results. Single PDF uses direct mode; multiple uses batch
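
The mode selection can be pictured as a simple dispatch on the number of PDFs found. The sketch below only returns a label; choose_mode is a hypothetical helper, and raising on an empty list is an assumption about how the edge case could be handled.

```python
def choose_mode(paths):
    """Return 'single' for exactly one PDF and 'batch' for several."""
    if not paths:
        # Assumption: an empty input is treated as an error.
        raise ValueError("no PDF files found")
    return "single" if len(paths) == 1 else "batch"
```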

md = ocr_pdf('files/test/attention-is-all-you-need.pdf', 'files/test/md_single')
md
Path('files/test/md_single/attention-is-all-you-need')
!ls -R files/test/md_single/attention-is-all-you-need
files/test/md_single/attention-is-all-you-need:
img     page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md_single/attention-is-all-you-need/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg
mds = ocr_pdf('files/test', 'files/test/md_all')
mds
files/test/md_all:
attention-is-all-you-need/  iom-hoa-short/  resnet/

files/test/md_all/attention-is-all-you-need:
img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md_all/attention-is-all-you-need/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

files/test/md_all/iom-hoa-short:
img/        page_14.md  page_2.md   page_25.md  page_30.md  page_8.md
page_1.md   page_15.md  page_20.md  page_26.md  page_31.md  page_9.md
page_10.md  page_16.md  page_21.md  page_27.md  page_4.md
page_11.md  page_17.md  page_22.md  page_28.md  page_5.md
page_12.md  page_18.md  page_23.md  page_29.md  page_6.md
page_13.md  page_19.md  page_24.md  page_3.md   page_7.md

files/test/md_all/iom-hoa-short/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg

files/test/md_all/resnet:
img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md

files/test/md_all/resnet/img:
img-0.jpeg  img-10.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg  img-8.jpeg
img-1.jpeg  img-11.jpeg  img-3.jpeg  img-5.jpeg  img-7.jpeg  img-9.jpeg

Utilities

Helper functions for working with OCR results and preparing PDFs for processing.

Reading OCR Output

After running ocr_pdf, use read_pgs to load the markdown content. Set join=True (default) to get a single string suitable for LLM processing, or join=False to get a list of pages for individual page analysis.


read_pgs


def read_pgs(
    path:str, # OCR output directory
    join:bool=True, # Join pages into single string
)->str | list[str]: # Joined string or list of page contents

Read specific page or all pages from OCR output directory

For instance, to read all pages:

text = read_pgs('files/test/md_all/resnet')
print(text[:500])
# Deep Residual Learning for Image Recognition

Kaiming He

Xiangyu Zhang

Shaoqing Ren

Jian Sun

Microsoft Research

{kahe, v-xiangz, v-shren, jiansun}@microsoft.com

# Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreference

Or get a list of markdown pages that you can index further:

md_pgs = read_pgs('files/test/md_all/resnet', join=False)
len(md_pgs), md_pgs[0][:500]
(12,
 '# Deep Residual Learning for Image Recognition\n\nKaiming He\n\nXiangyu Zhang\n\nShaoqing Ren\n\nJian Sun\n\nMicrosoft Research\n\n{kahe, v-xiangz, v-shren, jiansun}@microsoft.com\n\n# Abstract\n\nDeeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreference')
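
One detail worth illustrating when reading pages back is ordering: a plain alphabetical sort puts page_10.md before page_2.md, so the page number has to be parsed. The sketch below shows one way to do that; read_pages is a hypothetical name and the blank-line separator used when joining is an assumption.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.md")

def read_pages(path, join=True, sep="\n\n"):
    """Read page_N.md files in numeric order, joined or as a list."""
    files = [f for f in Path(path).iterdir() if PAGE_RE.fullmatch(f.name)]
    # Sort by the integer page number, not by the file name string.
    files.sort(key=lambda f: int(PAGE_RE.fullmatch(f.name).group(1)))
    texts = [f.read_text(encoding="utf-8") for f in files]
    return sep.join(texts) if join else texts
```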

PDF Preparation

If you need to OCR just a subset of pages (e.g., to test on a few pages before processing a large document), use subset_pdf to extract a page range first.


subset_pdf


def subset_pdf(
    path:str, # Path to PDF file
    start:int=1, # Start page (1-based)
    end:int=None, # End page (1-based, inclusive)
    dst:str='.', # Output directory
)->Path: # Path to subset PDF

Extract page range from PDF and save with range suffix

fname = 'files/test/attention-is-all-you-need.pdf'
subset_pdf(fname, 5, 10, 'files/test/xtra')
Path('files/test/xtra/attention-is-all-you-need_p5-10.pdf')
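
The 1-based, inclusive page range and the _p5-10 suffix shown above can be worked out without a PDF library. The sketch below computes the output path and the 0-based page indices to copy; the actual page extraction is left out, and subset_plan and its n_pages parameter are hypothetical.

```python
from pathlib import Path

def subset_plan(path, start=1, end=None, n_pages=None, dst="."):
    """Return (output path, 0-based page indices) for a 1-based inclusive range."""
    if end is None:
        if n_pages is None:
            raise ValueError("pass end or n_pages")
        end = n_pages  # default: through the last page
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    src = Path(path)
    out = Path(dst) / f"{src.stem}_p{start}-{end}{src.suffix}"
    # Pages 5..10 (1-based, inclusive) are indices 4..9 (0-based).
    return out, range(start - 1, end)
```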