Pipeline

End-to-End Pipeline: PDF OCR, Markdown Heading Correction, and AI Image Descriptions

PDF to markdown pipeline with OCR, heading fixes, and AI-generated image descriptions.


source

pdf_to_md

 pdf_to_md (pdf_path:str, dst:str, ocr_dst:str=None,
            model:str='claude-sonnet-4-5', add_img_desc:bool=True,
            progress:bool=True, img_folder:str='img', semaphore:int=2,
            delay:float=1, force:bool=False)

Convert PDF to markdown with OCR, fixed heading hierarchy, and optional image descriptions

|  | Type | Default | Details |
|---|---|---|---|
| pdf_path | str |  | Path to input PDF file |
| dst | str |  | Destination directory for output markdown |
| ocr_dst | str | None | Optional OCR output directory |
| model | str | claude-sonnet-4-5 | Model to use for heading fixes and image descriptions |
| add_img_desc | bool | True | Whether to add image descriptions |
| progress | bool | True | Whether to show progress messages |
| img_folder | str | img | Name of folder containing images |
| semaphore | int | 2 | Max concurrent API requests |
| delay | float | 1 | Delay in seconds between API calls |
| force | bool | False | Force regeneration even if cache exists |
await pdf_to_md('files/test/attention-is-all-you-need.pdf', 'files/test/md_test')
__main__ - INFO - Step 1/3: Running OCR on files/test/attention-is-all-you-need.pdf...
mistocr.core - INFO - Waiting for batch job 9dc9cc8c-f84d-430f-9435-45d167431f93 (initial status: QUEUED)
mistocr.core - DEBUG - Job 9dc9cc8c-f84d-430f-9435-45d167431f93 status: QUEUED (elapsed: 0s)
mistocr.core - DEBUG - Job 9dc9cc8c-f84d-430f-9435-45d167431f93 status: RUNNING (elapsed: 2s)
mistocr.core - DEBUG - Job 9dc9cc8c-f84d-430f-9435-45d167431f93 status: RUNNING (elapsed: 2s)
mistocr.core - DEBUG - Job 9dc9cc8c-f84d-430f-9435-45d167431f93 status: RUNNING (elapsed: 2s)
mistocr.core - INFO - Job 9dc9cc8c-f84d-430f-9435-45d167431f93 completed with status: SUCCESS
__main__ - INFO - Step 2/3: Fixing heading hierarchy...
__main__ - INFO - Step 3/3: Adding image descriptions...
Describing 7 images...
__main__ - INFO - Done!
Saved descriptions to /tmp/tmp1z_5vpmb/attention-is-all-you-need/img_descriptions.json
Adding descriptions to 15 pages...
Done! Enriched pages saved to files/test/md_test
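The call above uses the defaults. For repeated runs it can help to keep the raw OCR output around and throttle the model calls; the sketch below is an illustrative variant using the documented `ocr_dst`, `semaphore`, `delay`, and `force` parameters. The paths and values are placeholders, and the exact caching behaviour of `ocr_dst` is assumed from the parameter table.

```python
# Illustrative re-run: keep OCR results in their own directory and slow down
# the API calls. Paths and values are placeholders, not part of the test files.
await pdf_to_md(
    'files/test/attention-is-all-you-need.pdf',
    'files/test/md_test',
    ocr_dst='files/test/ocr_cache',   # keep raw OCR output for later runs
    semaphore=1,                      # at most one concurrent API request
    delay=2,                          # wait 2 seconds between API calls
    force=False,                      # reuse cached results if they exist
)
```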
!ls -R files/test/md_test
files/test/md_test:
img     page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md_test/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg
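The output directory holds one `page_N.md` file per page plus the extracted images under `img/`. If you process these files outside of `read_pgs`, note that plain lexicographic sorting puts `page_10.md` before `page_2.md`; a minimal sketch of numeric ordering with the standard library (assuming only the naming scheme shown above):

```python
# Illustrative only: list the generated pages in numeric order.
# read_pgs presumably handles ordering for you; this just shows the layout.
from pathlib import Path
import re

pages = sorted(
    Path('files/test/md_test').glob('page_*.md'),
    key=lambda p: int(re.search(r'\d+', p.stem).group()),  # page_2 before page_10
)
print([p.name for p in pages])
```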
md = read_pgs('files/test/md_test', join=True)
print(md[5000:8000])
d of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks *[34]*.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as *[17, 18]* and *[9]*.

## 3 Model Architecture ... page 2

Most competitive neural sequence transduction models have an encoder-decoder structure *[5, 2, 35]*. Here, the encoder maps an input sequence of symbol representations $(x_{1},...,x_{n})$ to a sequence of continuous representations $\mathbf{z}=(z_{1},...,z_{n})$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_{1},...,y_{m})$ of symbols one element at a time. At each step the model is auto-regressive *[10]*, consuming the previously generated symbols as additional input when generating the next.

![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
This is an architectural diagram of a Transformer model, showing the encoder-decoder structure. The left side shows the encoder with N× stacked layers, each containing Multi-Head Attention and Feed Forward sublayers with Add & Norm operations. The right side shows the decoder with N× stacked layers, featuring Masked Multi-Head Attention, Multi-Head Attention (for encoder-decoder attention), and Feed Forward sublayers, also with Add & Norm operations. Both sides include Positional Encoding added to Input Embedding (encoder) and Output Embedding (decoder). The decoder processes outputs shifted right. The architecture flows upward through Linear and Softmax layers to produce Output Probabilities. Residual connections are indicated by arrows wrapping around each sublayer block. This diagram represents the standard Transformer architecture used in natural language processing and sequence-to-sequence tasks.
___
Figure 1: The Transformer - model architecture.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

### 3.1 Encoder and Decoder Stacks ... page 3

Encoder: The encoder is composed of a stack of  $N = 6$  identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm  $(x + \text{Sublayer}(x))$ , where Sublayer  $(x)$  is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs
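The excerpt above is just the `md[5000:8000]` slice, so it starts and ends mid-sentence. Since `read_pgs(..., join=True)` returns the pages as a single string, writing the whole document to one consolidated file is a one-liner; the output path below is illustrative.

```python
# A minimal sketch: save the joined markdown from read_pgs as a single file.
from pathlib import Path

Path('files/test/md_test/full.md').write_text(md)
```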