Core

Batch OCR for PDFs using Mistral API

This module provides batch OCR processing for PDF documents using Mistral’s OCR API. It handles uploading PDFs, creating and submitting batch jobs, monitoring their completion, and saving the results as markdown files with extracted images. The main entry point is the ocr_pdf function, which processes a single PDF or an entire folder.

Setup

To use this module, you’ll need:

- A Mistral API key, set in your environment as MISTRAL_API_KEY or passed to functions
- PDF files to process
- The mistralai Python package installed

Load your API key from a .env file or set it directly:
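
The setup code itself is not shown here. A minimal sketch, assuming a plain KEY=VALUE .env file and a hypothetical load_env_file helper (the python-dotenv package's load_dotenv is a common alternative), might look like this:

```python
import os
from pathlib import Path

def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE lines from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return False
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Variables already set in the environment take precedence
        os.environ.setdefault(name.strip(), value.strip().strip("'\""))
    return True

# Option 1: read MISTRAL_API_KEY from a .env file in the working directory
load_env_file(".env")

# Option 2: set it directly for the current process (placeholder value)
# os.environ["MISTRAL_API_KEY"] = "your-api-key"
```

Either way, the key ends up in the MISTRAL_API_KEY environment variable, which is where the module looks when no key argument is passed.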



get_api_key

 get_api_key (key:str=None)

Get Mistral API key from parameter or environment

| | Type | Default | Details |
|---|---|---|---|
| key | str | None | Mistral API key |

The pipeline works in stages: first we upload PDFs to Mistral’s servers and get signed URLs, then we create batch entries for each PDF, submit them as a job, monitor completion, and finally download and save the results as markdown files with extracted images.

PDF Upload



upload_pdf

 upload_pdf (path:str, key:str=None)

Upload PDF to Mistral and return signed URL

| | Type | Default | Details |
|---|---|---|---|
| path | str | | Path to PDF file |
| key | str | None | Mistral API key |
| Returns | tuple | | Signed URL of the uploaded PDF and the Mistral client |

fname = 'files/test/attention-is-all-you-need.pdf'
url, c = upload_pdf(fname)
url, c
('https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/3ff15e6900254f5b882c4d49d22dadb8.pdf?se=2025-12-17T10%3A14%3A10Z&sp=r&sv=2025-01-05&sr=b&sig=81bCJn02NmqFYAXruUCpu5pTnH0Wk0%2BJC4Izxw69ClY%3D',
 <mistralai.sdk.Mistral>)
test_url, test_c = upload_pdf('files/test/attention-is-all-you-need.pdf')
assert test_url.startswith('https://')

Batch Entry Creation

Each PDF needs a batch entry that tells Mistral where to find it and what options to use. The custom ID, cid, lets you match results to their source PDF later; by default it is the PDF file name without its extension.



create_batch_entry

 create_batch_entry (path:str, url:str, cid:str=None, inc_img:bool=True)

Create a batch entry dict for OCR

| | Type | Default | Details |
|---|---|---|---|
| path | str | | Path to PDF file |
| url | str | | Mistral signed URL |
| cid | str | None | Custom ID (defaults to the file name without extension) |
| inc_img | bool | True | Include images in response |
| Returns | dict | | Batch entry dict |

entry = create_batch_entry(fname, url)
entry
{'custom_id': 'attention-is-all-you-need',
 'body': {'document': {'type': 'document_url',
   'document_url': 'https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/3ff15e6900254f5b882c4d49d22dadb8.pdf?se=2025-12-17T10%3A14%3A10Z&sp=r&sv=2025-01-05&sr=b&sig=81bCJn02NmqFYAXruUCpu5pTnH0Wk0%2BJC4Izxw69ClY%3D'},
  'include_image_base64': True}}
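
To make the shape of an entry concrete, here is a minimal sketch that builds the same structure as the output above. The function name make_batch_entry and the example URL are hypothetical; this is not the module's actual implementation.

```python
from pathlib import Path

def make_batch_entry(path, url, cid=None, inc_img=True):
    """Hypothetical re-implementation mirroring the entry printed above."""
    return {
        # Default custom ID: the file name without its extension
        "custom_id": cid or Path(path).stem,
        "body": {
            "document": {"type": "document_url", "document_url": url},
            "include_image_base64": inc_img,
        },
    }

entry = make_batch_entry(
    "files/test/attention-is-all-you-need.pdf",
    "https://example.com/signed.pdf",  # placeholder, not a real signed URL
)
print(entry["custom_id"])  # attention-is-all-you-need
```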


prep_pdf_batch

 prep_pdf_batch (path:str, cid:str=None, inc_img:bool=True, key=None)

Upload PDF and create batch entry in one step

| | Type | Default | Details |
|---|---|---|---|
| path | str | | Path to PDF file |
| cid | str | None | Custom ID (defaults to the file name without extension) |
| inc_img | bool | True | Include images in response |
| key | NoneType | None | API key |
| Returns | tuple | | Batch entry dict and the Mistral client |

entry, c = prep_pdf_batch(fname)
entry
{'custom_id': 'attention-is-all-you-need',
 'body': {'document': {'type': 'document_url',
   'document_url': 'https://mistralaifilesapiprodswe.blob.core.windows.net/fine-tune/d3c29c82-04d2-4c77-98b6-d2820e375e1f/75695561-4561-4dc1-aeab-b0f34f40f753/3ff15e6900254f5b882c4d49d22dadb8.pdf?se=2025-12-17T10%3A14%3A11Z&sp=r&sv=2025-01-05&sr=b&sig=C18T8be97uzpFrCQcF0czGwDj8Sqi4kFEfj6s/uE8es%3D'},
  'include_image_base64': True}}

Batch Submission

Once you have batch entries for all your PDFs, submit them as a single job. Mistral processes them asynchronously, so you’ll need to poll for completion.



submit_batch

 submit_batch (entries:list[dict], c:mistralai.sdk.Mistral=None,
               model:str='mistral-ocr-latest', endpoint:str='/v1/ocr')

Submit batch entries and return job

| | Type | Default | Details |
|---|---|---|---|
| entries | list | | List of batch entries |
| c | Mistral | None | Mistral client |
| model | str | mistral-ocr-latest | Model name |
| endpoint | str | /v1/ocr | Endpoint name |
| Returns | dict | | Submitted batch job |

job = submit_batch([entry], c)
job
BatchJobOut(id='6a2a3cb8-7197-4246-86d0-b18dab00bf5d', input_files=['465e4701-80c8-4b50-9935-a99784e0445c'], endpoint='/v1/ocr', errors=[], status='QUEUED', created_at=1765880052, total_requests=0, completed_requests=0, succeeded_requests=0, failed_requests=0, object='batch', metadata=None, model='mistral-ocr-latest', agent_id=None, output_file=None, error_file=None, started_at=None, completed_at=None)
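
Mistral's batch API takes its requests as a JSON Lines file, with one request per line. How submit_batch prepares that file is not shown here; the sketch below only illustrates the serialization step, and entries_to_jsonl is a hypothetical name.

```python
import json

def entries_to_jsonl(entries):
    """Hypothetical helper: serialize batch entries as UTF-8 JSON Lines bytes."""
    # One compact JSON object per line, in the order the entries were given
    return "\n".join(json.dumps(e) for e in entries).encode("utf-8")

entries = [
    {"custom_id": "paper-a", "body": {"document": {"type": "document_url", "document_url": "https://example.com/a.pdf"}}},
    {"custom_id": "paper-b", "body": {"document": {"type": "document_url", "document_url": "https://example.com/b.pdf"}}},
]
payload = entries_to_jsonl(entries)
print(len(payload.splitlines()))  # 2
```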

Jobs can take from seconds to minutes depending on PDF size and queue length. The wait_for_job function polls every second by default; adjust this with the poll_interval parameter. A job that stays in QUEUED for longer than queued_timeout seconds (300 by default) is treated as timed out.



wait_for_job

 wait_for_job (job:dict, c:mistralai.sdk.Mistral=None,
               poll_interval:int=1, queued_timeout:int=300)

Poll job until completion and return final job status

| | Type | Default | Details |
|---|---|---|---|
| job | dict | | Batch job from submit_batch |
| c | Mistral | None | Mistral client |
| poll_interval | int | 1 | Seconds between status checks |
| queued_timeout | int | 300 | Max seconds in QUEUED before timeout |
| Returns | dict | | Completed job dict |

final_job = wait_for_job(job, c)
final_job.status, final_job.succeeded_requests, final_job.total_requests
__main__ - INFO - Waiting for batch job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d (initial status: QUEUED)
__main__ - DEBUG - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d status: QUEUED (elapsed: 0s)
__main__ - DEBUG - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d status: RUNNING (elapsed: 1s)
__main__ - DEBUG - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d status: RUNNING (elapsed: 1s)
__main__ - DEBUG - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d status: RUNNING (elapsed: 1s)
__main__ - DEBUG - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d status: RUNNING (elapsed: 1s)
__main__ - DEBUG - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d status: RUNNING (elapsed: 1s)
__main__ - INFO - Job 6a2a3cb8-7197-4246-86d0-b18dab00bf5d completed with status: SUCCESS
('SUCCESS', 1, 1)
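
The polling pattern behind the log above can be sketched independently of the Mistral client. This is an illustration under stated assumptions, not the module's code: poll_until_done is a hypothetical name, the status source is any callable, any status other than QUEUED or RUNNING is treated as final, and the queued timeout is measured from the first check.

```python
import time

def poll_until_done(get_status, poll_interval=1, queued_timeout=300,
                    sleep=time.sleep, clock=time.monotonic):
    """Hypothetical sketch: call get_status() until it leaves QUEUED/RUNNING."""
    start = clock()
    while True:
        status = get_status()
        if status not in ("QUEUED", "RUNNING"):
            return status  # any other status is treated as final
        # Give up only if the job never leaves the queue
        if status == "QUEUED" and clock() - start > queued_timeout:
            raise TimeoutError(f"still QUEUED after {queued_timeout}s")
        sleep(poll_interval)

# Usage with a canned status sequence instead of a real client
statuses = iter(["QUEUED", "RUNNING", "RUNNING", "SUCCESS"])
final = poll_until_done(lambda: next(statuses), sleep=lambda s: None)
print(final)  # SUCCESS
```

Passing sleep and clock as parameters keeps the loop testable without real waiting.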


download_results

 download_results (job:dict, c:mistralai.sdk.Mistral=None)

Download and parse batch job results

| | Type | Default | Details |
|---|---|---|---|
| job | dict | | Job dict |
| c | Mistral | None | Mistral client |
| Returns | list | | List of results |

results = download_results(final_job, c)
len(results), results[0].keys()

Each result contains the custom ID you provided, the OCR response with pages and markdown content, and any errors. Images are embedded as base64 strings and need to be decoded before saving.
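
Decoding a base64 image string takes only a few lines. The sketch below assumes the string may arrive either as a bare base64 payload or as a data URI such as "data:image/jpeg;base64,..." and handles both; decode_image is a hypothetical name, not part of this module.

```python
import base64

def decode_image(data):
    """Decode a base64 image string, with or without a data-URI prefix."""
    # Assumption: a data URI looks like "data:image/jpeg;base64,<payload>"
    if data.startswith("data:") and "," in data:
        data = data.split(",", 1)[1]
    return base64.b64decode(data)

# Round trip with a few fake JPEG header bytes
payload = base64.b64encode(b"\xff\xd8\xff").decode("ascii")
raw = decode_image("data:image/jpeg;base64," + payload)
print(raw == b"\xff\xd8\xff")  # True
```

The returned bytes can be written to disk directly, for example with Path.write_bytes.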

Results Processing

Results contain the OCR’d markdown for each page, plus any extracted images as base64-encoded data. We save each page as a separate markdown file in a folder named after the PDF, with images in an img subfolder.



save_images

 save_images (page:dict, img_dir:str='img')

Save all images from a page to directory

| | Type | Default | Details |
|---|---|---|---|
| page | dict | | Page dict |
| img_dir | str | img | Directory to save images |
| Returns | None | | |


save_page

 save_page (page:dict, dst:str, img_dir:str='img')

Save single page markdown and images

| | Type | Default | Details |
|---|---|---|---|
| page | dict | | Page dict |
| dst | str | | Directory to save page |
| img_dir | str | img | Directory to save images |
| Returns | None | | |

For a PDF named paper.pdf, the output structure will be:

md/
  paper/
    page_1.md
    page_2.md
    img/
      img-0.jpeg
      img-1.jpeg

Each page is saved as a separate markdown file, making it easy to process or view individual pages.
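
The layout above could be produced along these lines. This is a sketch only: write_pages is a hypothetical name, each page dict is assumed to carry its text under a "markdown" key, and image saving is left out.

```python
from pathlib import Path

def write_pages(pages, dst, cid):
    """Hypothetical sketch: write one markdown file per page under dst/cid."""
    out = Path(dst) / cid
    # Create the per-document folder and its img subfolder in one go
    (out / "img").mkdir(parents=True, exist_ok=True)
    for n, page in enumerate(pages, start=1):
        # Assumption: the page text lives under the "markdown" key
        (out / f"page_{n}.md").write_text(page["markdown"], encoding="utf-8")
    return out
```

Calling write_pages(pages, 'md', 'paper') with two pages would create md/paper/page_1.md, md/paper/page_2.md, and an empty md/paper/img folder.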



save_pages

 save_pages (ocr_resp:dict, dst:str, cid:str)

Save markdown pages and images from OCR response to output directory

| | Type | Details |
|---|---|---|
| ocr_resp | dict | OCR response |
| dst | str | Directory to save pages |
| cid | str | Custom ID |
| Returns | Path | Output directory |

ocr_resp = results[0]['response']['body']
save_pages(ocr_resp, 'files/test/md', results[0]['custom_id'])

The ocr_pdf function below handles the entire pipeline: uploading PDFs, creating batch entries, submitting the job, waiting for completion, and saving results. It works on a single file or an entire folder.

High-level Interface



ocr_pdf

 ocr_pdf (path:str, dst:str='md', inc_img:bool=True, key:str=None,
          poll_interval:int=2)

OCR a PDF file or folder of PDFs and save results

| | Type | Default | Details |
|---|---|---|---|
| path | str | | Path to PDF file or folder |
| dst | str | md | Directory to save markdown pages |
| inc_img | bool | True | Include images in response |
| key | str | None | API key |
| poll_interval | int | 2 | Poll interval in seconds |
| Returns | list | | List of output directories |

mds = ocr_pdf('files/test', 'files/test/md_all')
mds
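
Accepting either a file or a folder comes down to normalizing the input into a list of PDF paths first. A minimal sketch, assuming a non-recursive search and a hypothetical helper name collect_pdfs (how ocr_pdf does this internally is not shown here):

```python
from pathlib import Path

def collect_pdfs(path):
    """Hypothetical helper: return the PDFs to process for a file or folder path."""
    p = Path(path)
    if p.is_dir():
        # Assumption: only PDFs directly inside the folder, in sorted order
        return sorted(p.glob("*.pdf"))
    return [p]
```

Each collected path would then get its own batch entry, and all entries would be submitted together as one job.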

Utilities

Helper functions for working with OCR results and preparing PDFs for processing.

Reading OCR Output

After running ocr_pdf, use read_pgs to load the markdown content. Set join=True (default) to get a single string suitable for LLM processing, or join=False to get a list of pages for individual page analysis.



read_pgs

 read_pgs (path:str, join:bool=True)

Read all pages from OCR output directory

| | Type | Default | Details |
|---|---|---|---|
| path | str | | OCR output directory |
| join | bool | True | Join pages into single string |
| Returns | str \| list[str] | | Joined string or list of page contents |

For instance, to read all pages as a single string:

text = read_pgs('files/test/md_all/resnet')
print(text[:500])

Or a list of markdown pages (that you can further index):

md_pgs = read_pgs('files/test/md_all/resnet', join=False)
len(md_pgs), md_pgs[0][:500]
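
One detail worth care when reading page files back is ordering: a plain alphabetical sort puts page_10.md before page_2.md. The sketch below sorts by page number instead. It is an illustration, not the module's code: read_pages is a hypothetical name, and the blank-line separator used when joining is an assumption.

```python
import re
from pathlib import Path

def read_pages(path, join=True):
    """Hypothetical sketch: read page_N.md files in numeric page order."""
    files = sorted(
        Path(path).glob("page_*.md"),
        # Sort on the page number, not on the file name string
        key=lambda f: int(re.search(r"\d+", f.stem).group()),
    )
    pages = [f.read_text(encoding="utf-8") for f in files]
    # Assumption: pages are joined with a blank line between them
    return "\n\n".join(pages) if join else pages
```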

PDF Preparation

If you need to OCR just a subset of pages (e.g., to test on a few pages before processing a large document), use subset_pdf to extract a page range first.



subset_pdf

 subset_pdf (path:str, start:int=1, end:int=None, dst:str='.')

Extract page range from PDF and save with range suffix

| | Type | Default | Details |
|---|---|---|---|
| path | str | | Path to PDF file |
| start | int | 1 | Start page (1-based) |
| end | int | None | End page (1-based, inclusive) |
| dst | str | . | Output directory |
| Returns | Path | | Path to subset PDF |

fname = 'files/test/attention-is-all-you-need.pdf'
subset_pdf(fname, 5, 10, 'files/test/xtra')
Path('files/test/xtra/attention-is-all-you-need_p5-10.pdf')
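
The 1-based, inclusive page range and the range suffix in the output name can be illustrated without any PDF library. In this sketch subset_plan is a hypothetical name, end is required (handling end=None would need the document's page count), and the actual page extraction is left out.

```python
from pathlib import Path

def subset_plan(path, start, end, dst="."):
    """Return (output path, zero-based page indices) for a 1-based inclusive range."""
    src = Path(path)
    # Output name: original stem plus a _p<start>-<end> suffix
    out = Path(dst) / f"{src.stem}_p{start}-{end}{src.suffix}"
    # Pages 5..10 (1-based, inclusive) are indices 4..9 (0-based)
    return out, list(range(start - 1, end))

out, indices = subset_plan("files/test/attention-is-all-you-need.pdf", 5, 10, "files/test/xtra")
print(out.name)  # attention-is-all-you-need_p5-10.pdf
print(indices)   # [4, 5, 6, 7, 8, 9]
```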