Fix heading hierarchy and describe images in OCR’d markdown documents
This module refines markdown documents extracted from OCR’d PDFs through two main processes:
Heading Hierarchy Correction: OCR often corrupts document structure, creating inconsistent heading levels (e.g., jumping from H1 to H4, or using H2 for both sections and subsections). We use LLMs to analyze the full heading structure and automatically fix these issues, ensuring proper hierarchical relationships. Page numbers can optionally be added to headings for easier navigation.
Image Description: We classify images as informative (charts, diagrams, tables) or decorative (logos, backgrounds), then generate detailed descriptions for informative images using vision LLMs. These descriptions are inserted directly into the markdown, making visual content searchable and accessible for RAG systems and accessibility tools.
Both processes work incrementally on page-by-page markdown files, caching results to avoid redundant API calls.
Heading Hierarchy
Functions for detecting and fixing markdown heading levels
OCR’d PDF files often have corrupted heading hierarchies - headings may jump levels incorrectly (e.g., from H1 to H4) or use inconsistent levels for sections at the same depth. This section provides tools to automatically detect and fix these issues using LLMs, while also optionally adding page numbers for easier navigation.
The first step is extracting all headings from a markdown document so we can analyze their structure.
get_hdgs
```python
def get_hdgs(
    md:str,  # Markdown file string
)->L:        # L of strings
```
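For example (a minimal usage sketch; the snippet assumes standard `#` heading lines):

```python
md = "# Title\n\nIntro text.\n\n## Background\n\nMore text."
get_hdgs(md)
# (#2) ['# Title','## Background']
```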
The add_pg_hdgs function serves two important purposes:
1. Creating unique heading identifiers
When fixing heading hierarchies across an entire document, we need a way to distinguish between headings that have the same text but appear in different locations. For example, a document might have multiple “Introduction” or “Conclusion” headings in different chapters. By appending the page number to each heading, we create unique identifiers that allow us to build a lookup table mapping each specific heading instance to its corrected version. This assumes (reasonably) that the same heading text won’t appear twice on a single page.
2. Providing spatial context for LLMs
Adding page numbers gives LLMs valuable positional information when analyzing the document structure. The page number helps the model understand:
- where a heading sits in the overall document flow,
- the relative distance between sections, and
- whether headings that seem related are actually close together or far apart.
This spatial awareness can significantly improve the LLM’s ability to infer the correct hierarchical relationships between headings, especially in long documents where similar section names might appear at different structural levels.
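The transformation itself is simple. Here is a minimal sketch of what `add_pg_hdgs` does, assuming the `... page N` suffix format shown in the example below (the real implementation may differ):

```python
import re

def add_pg_hdgs_sketch(pg_md:str, pg_num:int)->str:
    "Append ' ... page N' to every markdown heading (hypothetical re-implementation)"
    out = []
    for ln in pg_md.splitlines():
        if re.match(r'^#{1,6} ', ln): ln = f'{ln} ... page {pg_num}'
        out.append(ln)
    return '\n'.join(out)
```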
Before:
--------------------------------------------------------------------------------
# Attention Is All You Need
Ashish Vaswani*
Google Brain
avaswani@google.com
Noam Shazeer*
Google Brain
noam@google.com
Niki Parmar*
Google Research
nikip@google.com
Jakob Uszkoreit*
Google Research
usz@google.com
Llion Jones*
Google Research
llion@google.com
Aidan N. Gomez*†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser*
Google Brain
lukaszkaiser@google.com
Illia Polosukhin*‡
illia.polosukhin@gmail.com
# Abstract
The dominant sequence transduction models are
After:
--------------------------------------------------------------------------------
# Attention Is All You Need ... page 1
Ashish Vaswani*
Google Brain
avaswani@google.com
Noam Shazeer*
Google Brain
noam@google.com
Niki Parmar*
Google Research
nikip@google.com
Jakob Uszkoreit*
Google Research
usz@google.com
Llion Jones*
Google Research
llion@google.com
Aidan N. Gomez*†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser*
Google Brain
lukaszkaiser@google.com
Illia Polosukhin*‡
illia.polosukhin@gmail.com
# Abstract ... page 1
The dominant sequence tr
read_pgs_pg
```python
def read_pgs_pg(
    path:str,  # Path to the markdown file
)->L:          # List of markdown pages
```
Read all pages of a markdown file and add page numbers to all headings
```python
pgs = read_pgs_pg('files/test/md_all/attention-is-all-you-need')
hdgs = L([get_hdgs(pg) for pg in pgs]).concat()
hdgs
```

```
(#25) ['# Attention Is All You Need ... page 1','# Abstract ... page 1','## 2 Background ... page 2','## 3 Model Architecture ... page 2','# 3.1 Encoder and Decoder Stacks ... page 3','# 3.2 Attention ... page 3','# 3.2.1 Scaled Dot-Product Attention ... page 4','# 3.2.2 Multi-Head Attention ... page 4','#### 3.2.3 Applications of Attention in our Model ... page 5','### 3.3 Position-wise Feed-Forward Networks ... page 5','### 3.4 Embeddings and Softmax ... page 5','# 3.5 Positional Encoding ... page 6','# 4 Why Self-Attention ... page 6','## 5 Training ... page 7','### 5.1 Training Data and Batching ... page 7','### 5.2 Hardware and Schedule ... page 7','### 5.3 Optimizer ... page 7','### 5.4 Regularization ... page 7','# 6 Results ... page 8','# 6.1 Machine Translation ... page 8'...]
```
To make it easier for an LLM to reference specific headings when suggesting fixes, we format them with index numbers.
fmt_hdgs_idx
```python
def fmt_hdgs_idx(
    hdgs:list,  # List of markdown headings
)->str:         # Formatted string with index
```
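The exact format isn't shown here, but a minimal sketch could number each heading like this (the index scheme is an assumption):

```python
def fmt_hdgs_idx_sketch(hdgs:list)->str:
    "Prefix each heading with its index so the LLM can reference it unambiguously"
    return '\n'.join(f'{i}: {h}' for i,h in enumerate(hdgs))

print(fmt_hdgs_idx_sketch(['# Abstract ... page 1', '## 2 Background ... page 2']))
# 0: # Abstract ... page 1
# 1: ## 2 Background ... page 2
```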
We use a Pydantic model to ensure the LLM returns corrections in a structured format: a list of corrections, each mapping a heading index to its corrected version.
HeadingCorrection
```python
def HeadingCorrection(
    data:Any
)->None:
```
A single heading correction mapping an index to its corrected markdown heading
HeadingCorrections
```python
def HeadingCorrections(
    data:Any
)->None:
```
Collection of heading corrections returned by the LLM
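The field names below are assumptions based on how the corrections are used later (string indices mapped to corrected headings); the real models may differ:

```python
from pydantic import BaseModel

class HeadingCorrection(BaseModel):
    "A single heading correction (hypothetical field names)"
    index:str    # index of the heading in the formatted list, as a string
    heading:str  # corrected markdown heading, e.g. '### 3.1 Encoder and Decoder Stacks ... page 3'

class HeadingCorrections(BaseModel):
    "Collection of heading corrections returned by the LLM"
    corrections:list[HeadingCorrection]
```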
This prompt instructs the LLM on what types of heading hierarchy errors to fix while preserving the document's intended structure. It focuses on three rules:
- fix level jumps that skip intermediate levels,
- fix numbering inconsistencies where subsection depth doesn't match the heading level, and
- preserve decreasing levels (moving back up the hierarchy), which are legitimate.
Now we can use an LLM to automatically detect and fix these heading hierarchy issues. The function uses litellm (wrapped by the Lisette package) to send the formatted headings to a language model along with our correction rules. The LLM analyzes the structure and returns only the headings that need fixing, mapped by their index numbers.
fix_hdg_hierarchy
```python
def fix_hdg_hierarchy(
    hdgs:list,                      # List of markdown headings
    prompt:str=None,                # Prompt to use
    model:str='claude-sonnet-4-5',  # Model to use
    api_key:str=None,               # API key
)->dict:                            # Dictionary of index → corrected heading
```
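A hedged usage sketch, continuing from the Attention headings above (the exact corrections depend on the model):

```python
fixes = fix_hdg_hierarchy(hdgs)
fixes
# e.g. {'4': '### 3.1 Encoder and Decoder Stacks ... page 3',
#       '5': '### 3.2 Attention ... page 3', ...}
```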
The corrections come back as string indices, but we need to map the actual heading text to its corrected version for easy replacement in the document.
mk_fixes_lut
```python
def mk_fixes_lut(
    hdgs:list,                      # List of markdown headings
    model:str='claude-sonnet-4-5',  # Model to use
    api_key:str=None,               # API key
    prompt:str=None,                # Prompt to use
)->dict:                            # Dictionary of old → new heading
```
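Conceptually, `mk_fixes_lut` just resolves those string indices back to heading text. A minimal sketch under that assumption:

```python
def mk_fixes_lut_sketch(hdgs:list, **kwargs)->dict:
    "Map each original heading to its corrected version (hypothetical re-implementation)"
    fixes = fix_hdg_hierarchy(hdgs, **kwargs)  # {'4': '### 3.1 ...', ...}
    return {hdgs[int(i)]: new for i,new in fixes.items()}
```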
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framewor
Finally, we tie everything together in a single function that processes an entire document directory, fixing all heading hierarchy issues and optionally adding page numbers.
fix_hdgs
```python
def fix_hdgs(
    src:str,                        # Path to source markdown directory
    model:str='claude-sonnet-4-5',  # Model to use
    dst:str=None,                   # Destination directory
    img_folder:str='img',           # Name of folder containing images
    api_key:str=None,               # API key
    prompt:str=None,                # Prompt to use
):
```
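A hedged example invocation (the source directory is an assumption; the destination matches the paths in the logs further down):

```python
fix_hdgs('files/test/md_all/resnet', dst='files/test/md_fixed/resnet')
```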
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
Image Description
Tools for classifying and describing images in OCR'd markdown documents
The two-field ImgDescription Pydantic model lets us filter out decorative images via its is_informative flag, while its description field provides rich context for downstream RAG systems.
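Based on the JSON fields the prompt requests, the model plausibly looks like this (a sketch; the class name comes from `describe_img`'s return annotation):

```python
from pydantic import BaseModel

class ImgDescription(BaseModel):
    is_informative:bool  # False for logos, backgrounds, and other decorative images
    description:str      # detailed description if informative, brief label otherwise
```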
The prompt uses a two-step approach: first classify the image as informative or decorative, then provide an appropriate level of detail in the description based on that classification.
describe_img
```python
def describe_img(
    img_path:Path,                  # Path to the image file
    model:str='claude-sonnet-4-5',  # Model to use
    prompt:str=(                    # Prompt for description
        'Analyze this image from an academic/technical document.\n\n'
        'Step 1: Determine if this image is informative for understanding the document content.\n'
        '- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams\n'
        '- Non-informative: logos, decorative images, generic photos, page backgrounds\n\n'
        'Step 2: \n'
        '- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.\n'
        '- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")\n\n'
        "Return your response as JSON with 'is_informative' (boolean) and 'description' (string) fields."
    ),
)->ImgDescription:
```
Describe a single image using AsyncChat
We process images asynchronously using AsyncChat to handle multiple images efficiently while respecting API rate limits.
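Since `describe_img` is async, it is awaited (a hedged usage sketch; the image path is hypothetical):

```python
from pathlib import Path

desc = await describe_img(Path('files/test/md_fixed/resnet/img/img-0.jpeg'))
desc
```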
{"is_informative": true, "description": "Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets."}
parse_r
```python
def parse_r(
    result,  # ModelResponse object from API call
):
```
Extract and parse JSON content from model response
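A minimal sketch, assuming the response follows the standard litellm ModelResponse shape:

```python
import json

def parse_r_sketch(result):
    "Extract the JSON payload from the first choice of a ModelResponse"
    return json.loads(result.choices[0].message.content)
```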
The describe_imgs function orchestrates parallel image processing: it creates async tasks for each image, limits concurrency with a semaphore, adds delays between requests to avoid rate limits, and returns a dictionary mapping filenames to their descriptions for easy lookup during markdown enrichment.
describe_imgs
```python
def describe_imgs(
    imgs:list,                      # List of image file paths to describe
    model:str='claude-sonnet-4-5',  # Model to use for image description
    prompt:str=...,                 # Prompt template for description (same default as describe_img above)
    semaphore:int=10,               # Max concurrent API requests
    delay:float=0.1,                # Delay in seconds between requests
)->dict:                            # Dict mapping filename to parsed description
```
Describe multiple images in parallel with rate limiting
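A hedged usage sketch (the image directory is hypothetical; `describe_imgs` is async, so it is awaited):

```python
from pathlib import Path

imgs = list(Path('files/test/md_fixed/resnet/img').glob('*.jpeg'))
descs = await describe_imgs(imgs, semaphore=2, delay=1)
descs
```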
{'img-5.jpeg': {'is_informative': True,
'description': 'Three line graphs showing training error (%) versus iterations (1e4) for different neural network architectures. Left panel compares plain networks (plain-20, plain-32, plain-44, plain-56) showing error rates between 0-20%, with deeper networks performing worse. Middle panel shows ResNet architectures (ResNet-20, ResNet-32, ResNet-44, ResNet-56, ResNet-110) with error rates 0-20%, demonstrating that deeper ResNets achieve lower error rates, with 56-layer and 20-layer performance labeled. Right panel compares residual-110 and residual-1202 models showing error rates 0-20%, with both converging to similar performance around 5-7% error. The graphs illustrate the effectiveness of residual connections in enabling training of deeper networks compared to plain architectures.'},
'img-0.jpeg': {'is_informative': True,
'description': 'Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.'}}
save_img_descs
```python
def save_img_descs(
    descs:dict,      # Dictionary of image descriptions
    dst_fname:Path,  # Path to save the JSON file
)->None:
```
Save image descriptions to JSON file
We save descriptions to a JSON file so they can be reused without re-processing images, which saves API costs and time. The add_img_descs function checks for this cache file first.
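For example (the path matches the cache file mentioned in the logs below):

```python
from pathlib import Path

save_img_descs(descs, Path('files/test/md_fixed/resnet/img_descriptions.json'))
```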
'{\n "img-5.jpeg": {\n "is_informative": true,\n "description": "This figure contains three line graphs showing training error rates over iterations for different neural network architectures. \\n\\nLeft panel: Shows error rates for plain networks (plain-20, plain-32, plain-44, plain-56) over approximately 6\\u00d710^4 iterations. The curves show varying convergence patterns, with the 56-layer and 20-layer networks achieving lower error rates around 10-13%, while deeper networks show less stable'
Once we have image descriptions, we insert them into the markdown by finding image references and adding formatted description blocks.
add_descs_to_pg
```python
def add_descs_to_pg(
    pg:str,      # Page markdown content
    descs:dict,  # Dictionary mapping image filenames to their descriptions
)->str:          # Page markdown with descriptions added
```
Add AI-generated descriptions to images in page
Image descriptions are inserted directly after the markdown image reference, wrapped in horizontal rules to visually separate them from the document flow. This preserves the original structure while making descriptions easily identifiable.
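A minimal sketch of this insertion, assuming standard `![alt](path)` image references and the block format shown in the example below (the real implementation may differ):

```python
import re
from pathlib import Path

def add_descs_to_pg_sketch(pg:str, descs:dict)->str:
    "Insert a description block after each image reference (hypothetical re-implementation)"
    def repl(m):
        d = descs.get(Path(m.group(1)).name)
        if not d or not d.get('is_informative'): return m.group(0)  # skip decorative images
        return (f"{m.group(0)}\n\nAI-generated image description:\n___\n"
                f"{d['description']}\n___\n")
    return re.sub(r'!\[[^\]]*\]\(([^)\s]+)\)', repl, pg)
```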
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradien
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.
Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.
Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training error. The 56-layer network (red line) maintains higher test error around 14-15%, while the 20-layer network (yellow/olive line) achieves lower test error around 10-11%.
Key observation: The deeper 56-layer network exhibits worse performance (higher error) than the shallower 20-layer network on both training and test sets, suggesting a degradation problem in very deep networks. This visualization likely illustrates the motivation for residual learning architectures that address the degradation problem in deep neural networks.
___
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanish
We process all pages in batch to efficiently add descriptions throughout the document.
add_descs_to_pgs
```python
def add_descs_to_pgs(
    pgs:list,    # List of page markdown strings
    descs:dict,  # Dictionary mapping image filenames to their descriptions
)->list:         # List of pages with descriptions added
```
Add AI-generated descriptions to images in all pages
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.
Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.
Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training err
The force parameter controls whether to regenerate image descriptions:
- force=False (default): loads existing descriptions from img_descriptions.json if it exists, saving time and API costs
- force=True: regenerates all descriptions by calling the vision LLM, even if cached descriptions exist
Important: If dst is the same as src, descriptions will be added to files that may already contain them from previous runs. To avoid duplicate descriptions, either:
- use a different destination directory each time, or
- ensure your source markdown files are clean before processing
add_img_descs
```python
def add_img_descs(
    src:str,                        # Path to source markdown directory
    dst:str=None,                   # Destination directory (defaults to src if None)
    model:str='claude-sonnet-4-5',  # Vision model for image description
    img_folder:str='img',           # Name of folder containing images
    semaphore:int=2,                # Max concurrent API requests
    delay:float=1,                  # Delay in seconds between API calls
    force:bool=False,               # Force regeneration even if cache exists
    progress:bool=True,             # Log progress messages
):
```
Describe all images in markdown document and insert descriptions inline
Here’s the complete workflow to process a document:
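A hedged sketch of the call that produced the logs below (source and destination paths are taken from the log messages):

```python
# assuming add_img_descs is async like the helpers it wraps
await add_img_descs('files/test/md_fixed/resnet', dst='files/test/md_enriched/resnet')
```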
__main__ - INFO - Describing 7 images...
__main__ - INFO - Saved descriptions to files/test/md_fixed/resnet/img_descriptions.json
__main__ - INFO - Adding descriptions to 12 pages...
__main__ - INFO - Done! Enriched pages saved to files/test/md_enriched/resnet
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
AI-generated image description:
___
Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.
___
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks