Pipeline

End-to-End Pipeline: PDF OCR, Markdown Heading Correction, and AI Image Descriptions

PDF to markdown pipeline with OCR, heading fixes, and AI-generated image descriptions.

```python
from cachy import enable_cachy
enable_cachy()
```

pdf_to_md


```python
def pdf_to_md(
    pdf_path:str, # Path to input PDF file
    dst:str, # Destination directory for output markdown
    ocr_dst:str=None, # Optional OCR output directory
    model:str='claude-sonnet-4-5', # Model to use for heading fixes and image descriptions
    add_img_desc:bool=True, # Whether to add image descriptions
    progress:bool=True, # Whether to show progress messages
    fix_kwargs:dict=None, # Extra kwargs for fix_hdgs (e.g. prompt, max_tokens)
    desc_kwargs:dict=None, # Extra kwargs for add_img_descs (e.g. prompt, batch_sz, max_conc)
):
```

Convert a single PDF to markdown with OCR, a corrected heading hierarchy, and optional AI-generated image descriptions. A batch version is planned. See `fix_hdgs` and `add_img_descs` for the kwargs accepted via `fix_kwargs` and `desc_kwargs`.
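A sketch of how extra options might be threaded through to the two refinement steps. The key names below come from the parameter comments in the signature above; the values are illustrative, not defaults:

```python
# Illustrative kwargs for the heading-fix and image-description steps.
# max_tokens, batch_sz, and max_conc are named in the docment comments above;
# consult fix_hdgs and add_img_descs for the full set of supported options.
fix_kwargs  = dict(max_tokens=4096)
desc_kwargs = dict(batch_sz=4, max_conc=8)

# await pdf_to_md('files/test/attention-is-all-you-need.pdf', 'files/test/md_test',
#                 fix_kwargs=fix_kwargs, desc_kwargs=desc_kwargs)
```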

```python
await pdf_to_md('files/test/attention-is-all-you-need.pdf', 'files/test/md_test')
```

```
__main__ - INFO - Step 1/3: Running OCR on files/test/attention-is-all-you-need.pdf...
__main__ - INFO - Step 2/3: Fixing heading hierarchy...
__main__ - INFO - Step 3/3: Adding image descriptions...
mistocr.refine - INFO - Describing 7 images...
mistocr.refine - INFO - Saved descriptions to /tmp/tmp3owej3gz/attention-is-all-you-need/img_descriptions.json
mistocr.refine - INFO - Adding descriptions to 15 pages...
mistocr.refine - INFO - Done! Enriched pages saved to files/test/md_test
__main__ - INFO - Done!
```
```python
!ls -R files/test/md_test
```

```
files/test/md_test:
img     page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

files/test/md_test/img:
img-0.jpeg  img-10.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg  img-8.jpeg
img-1.jpeg  img-11.jpeg  img-3.jpeg  img-5.jpeg  img-7.jpeg  img-9.jpeg
```
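Note that `ls` lists the pages lexicographically (`page_10.md` before `page_2.md`). `read_pgs` joins the pages in their correct order for you; if you ever reassemble the files yourself, sort by the numeric suffix. A stdlib-only sketch (`page_num` is a hypothetical helper, not part of the library):

```python
import re

def page_num(name):
    "Numeric suffix of a page_N.md filename, for natural ordering."
    m = re.search(r'page_(\d+)\.md$', name)
    return int(m.group(1)) if m else float('inf')

files = ['page_1.md', 'page_10.md', 'page_11.md', 'page_2.md']  # ls order
print(sorted(files, key=page_num))
# → ['page_1.md', 'page_2.md', 'page_10.md', 'page_11.md']
```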
```python
md = read_pgs('files/test/md_test', join=True)
print(md[5000:8000])
```

```markdown
estion answering and language modeling tasks *[34]*.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as *[17, 18]* and *[9]*.

## 3 Model Architecture ... page 2

Most competitive neural sequence transduction models have an encoder-decoder structure *[5, 2, 35]*. Here, the encoder maps an input sequence of symbol representations $(x_{1},...,x_{n})$ to a sequence of continuous representations $\mathbf{z}=(z_{1},...,z_{n})$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_{1},...,y_{m})$ of symbols one element at a time. At each step the model is auto-regressive *[10]*, consuming the previously generated symbols as additional input when generating the next.

![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
This is an architectural diagram of a Transformer model, showing the encoder-decoder structure. The left side shows the encoder stack with N× repeated blocks, each containing Multi-Head Attention, Add & Norm layers, and Feed Forward components with residual connections. The right side shows the decoder stack, also with N× blocks, featuring Masked Multi-Head Attention and Multi-Head Attention layers, each followed by Add & Norm and Feed Forward components. Input flows from the bottom through Input Embedding with Positional Encoding on the encoder side, and Output Embedding (shifted right) with Positional Encoding on the decoder side. The architecture culminates in a Linear layer followed by Softmax to produce Output Probabilities. Arrows indicate the flow of information through the network, with residual connections shown as curved arrows bypassing certain components. This diagram represents the foundational architecture for transformer-based models used in natural language processing and other sequence-to-sequence tasks.
___
Figure 1: The Transformer - model architecture.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

### 3.1 Encoder and Decoder Stacks ... page 3

Encoder: The encoder is composed of a stack of  $N = 6$  identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm  $(x + \text{Sublayer}(x))$ , where Sublayer  $(x)$  is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the em
```