from cachy import enable_cachy
enable_cachy()

Pipeline
End-to-End Pipeline: PDF OCR, Markdown Heading Correction, and AI Image Descriptions
PDF to markdown pipeline with OCR, heading fixes, and AI-generated image descriptions.
pdf_to_md
def pdf_to_md(
pdf_path:str, # Path to input PDF file
dst:str, # Destination directory for output markdown
ocr_dst:str=None, # Optional OCR output directory
model:str='claude-sonnet-4-5', # Model to use for heading fixes and image descriptions
add_img_desc:bool=True, # Whether to add image descriptions
progress:bool=True, # Whether to show progress messages
fix_kwargs:dict=None, # Extra kwargs for fix_hdgs (e.g. prompt, max_tokens)
desc_kwargs:dict=None, # Extra kwargs for add_img_descs (e.g. prompt, batch_sz, max_conc)
):
Convert a single PDF to markdown with OCR, fixed heading hierarchy, and optional image descriptions. Batch version planned. See fix_hdgs and add_img_descs for available kwargs.
await pdf_to_md('files/test/attention-is-all-you-need.pdf', 'files/test/md_test')

__main__ - INFO - Step 1/3: Running OCR on files/test/attention-is-all-you-need.pdf...
__main__ - INFO - Step 2/3: Fixing heading hierarchy...
__main__ - INFO - Step 3/3: Adding image descriptions...
mistocr.refine - INFO - Describing 7 images...
mistocr.refine - INFO - Saved descriptions to /tmp/tmp3owej3gz/attention-is-all-you-need/img_descriptions.json
mistocr.refine - INFO - Adding descriptions to 15 pages...
mistocr.refine - INFO - Done! Enriched pages saved to files/test/md_test
__main__ - INFO - Done!
!ls -R files/test/md_test

files/test/md_test:
img page_11.md page_14.md page_3.md page_6.md page_9.md
page_1.md page_12.md page_15.md page_4.md page_7.md
page_10.md page_13.md page_2.md page_5.md page_8.md
files/test/md_test/img:
img-0.jpeg img-10.jpeg img-2.jpeg img-4.jpeg img-6.jpeg img-8.jpeg
img-1.jpeg img-11.jpeg img-3.jpeg img-5.jpeg img-7.jpeg img-9.jpeg
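Note that `ls` lists the pages lexicographically (`page_10.md` before `page_2.md`), while `read_pgs` below joins them in true page order, which needs a numeric sort on the filename. A minimal sketch of that kind of natural sort — a hypothetical helper, not the library's own code:

```python
import re

def numeric_page_sort(names: list[str]) -> list[str]:
    "Sort page_N.md filenames by the integer N, not lexicographically."
    return sorted(names, key=lambda n: int(re.search(r'\d+', n).group()))

files = ['page_1.md', 'page_10.md', 'page_2.md', 'page_15.md']
print(numeric_page_sort(files))  # ['page_1.md', 'page_2.md', 'page_10.md', 'page_15.md']
```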
md = read_pgs('files/test/md_test', join=True)
print(md[5000:8000])

estion answering and language modeling tasks *[34]*.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as *[17, 18]* and *[9]*.
## 3 Model Architecture ... page 2
Most competitive neural sequence transduction models have an encoder-decoder structure *[5, 2, 35]*. Here, the encoder maps an input sequence of symbol representations $(x_{1},...,x_{n})$ to a sequence of continuous representations $\mathbf{z}=(z_{1},...,z_{n})$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_{1},...,y_{m})$ of symbols one element at a time. At each step the model is auto-regressive *[10]*, consuming the previously generated symbols as additional input when generating the next.

AI-generated image description:
___
This is an architectural diagram of a Transformer model, showing the encoder-decoder structure. The left side shows the encoder stack with N× repeated blocks, each containing Multi-Head Attention, Add & Norm layers, and Feed Forward components with residual connections. The right side shows the decoder stack, also with N× blocks, featuring Masked Multi-Head Attention and Multi-Head Attention layers, each followed by Add & Norm and Feed Forward components. Input flows from the bottom through Input Embedding with Positional Encoding on the encoder side, and Output Embedding (shifted right) with Positional Encoding on the decoder side. The architecture culminates in a Linear layer followed by Softmax to produce Output Probabilities. Arrows indicate the flow of information through the network, with residual connections shown as curved arrows bypassing certain components. This diagram represents the foundational architecture for transformer-based models used in natural language processing and other sequence-to-sequence tasks.
___
Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
### 3.1 Encoder and Decoder Stacks ... page 3
Encoder: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm $(x + \text{Sublayer}(x))$ , where Sublayer $(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the em