# Fix heading hierarchy and describe images in OCR’d markdown documents
This module refines markdown documents extracted from OCR’d PDFs through two main processes:
- **Heading Hierarchy Correction:** OCR often corrupts document structure, creating inconsistent heading levels (e.g., jumping from H1 to H4, or using H2 for both sections and subsections). We use LLMs to analyze the full heading structure and automatically fix these issues, ensuring proper hierarchical relationships. Page numbers can optionally be added to headings for easier navigation.
- **Image Description:** We classify images as informative (charts, diagrams, tables) or decorative (logos, backgrounds), then generate detailed descriptions for informative images using vision LLMs. These descriptions are inserted directly into the markdown, making visual content searchable and accessible for RAG systems and accessibility tools.
Both processes work incrementally on page-by-page markdown files, caching results to avoid redundant API calls.
## Heading Hierarchy
Functions for detecting and fixing markdown heading levels
OCR’d PDF files often have corrupted heading hierarchies: headings may jump levels incorrectly (e.g., from H1 to H4) or use inconsistent levels for sections at the same depth. This section provides tools to automatically detect and fix these issues using LLMs, while optionally adding page numbers for easier navigation.
The first step is extracting all headings from a markdown document so we can analyze their structure.
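A minimal sketch of that extraction step, assuming headings are standard ATX `#` lines (the helper name is illustrative, not necessarily the module’s):

```python
import re

def extract_headings(md: str) -> list[tuple[int, str]]:
    "Return a (level, text) pair for every markdown heading line."
    return [(len(m.group(1)), m.group(2).strip())
            for m in re.finditer(r'^(#{1,6}) +(.*)$', md, flags=re.M)]
```

For example, `extract_headings('# Intro\n## Setup')` returns `[(1, 'Intro'), (2, 'Setup')]`.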
The `add_pg_hdgs` function serves two important purposes (a sketch follows the list below):
1. Creating unique heading identifiers
When fixing heading hierarchies across an entire document, we need a way to distinguish between headings that have the same text but appear in different locations. For example, a document might have multiple “Introduction” or “Conclusion” headings in different chapters. By appending the page number to each heading, we create unique identifiers that allow us to build a lookup table mapping each specific heading instance to its corrected version. This assumes (reasonably) that the same heading text won’t appear twice on a single page.
2. Providing spatial context for LLMs
Adding page numbers gives LLMs valuable positional information when analyzing the document structure. The page number helps the model understand:

- where a heading sits in the overall document flow
- the relative distance between sections
- whether headings that seem related are actually close together or far apart
This spatial awareness can significantly improve the LLM’s ability to infer the correct hierarchical relationships between headings, especially in long documents where similar section names might appear at different structural levels.
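Here is a minimal sketch of `add_pg_hdgs`, assuming the ` ... page N` suffix format shown in the example below (the exact signature is an assumption):

```python
import re

def add_pg_hdgs(md: str, page: int) -> str:
    "Append ' ... page N' to every heading line in a page's markdown."
    return re.sub(r'^(#{1,6} .+)$', rf'\1 ... page {page}', md, flags=re.M)
```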
Before:

```md
# Deep Residual Learning for Image Recognition
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
#### Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unr
```
After:

```md
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
#### Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, i
```
We use a Pydantic model to ensure the LLM returns corrections in a structured format: a list of corrections, each mapping a heading index to its corrected version.
Collection of heading corrections returned by the LLM
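A sketch of what that model might look like (the class and field names are assumptions; only the index-to-correction mapping is described in the docs):

```python
from pydantic import BaseModel

class HeadingCorrection(BaseModel):
    idx: int        # index of the heading in the list shown to the LLM
    corrected: str  # the corrected heading line, e.g. '## Abstract'

class HeadingCorrections(BaseModel):
    "Collection of heading corrections returned by the LLM"
    corrections: list[HeadingCorrection]
```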
This prompt instructs the LLM on what types of heading hierarchy errors to fix while preserving the document’s intended structure. It focuses on three rules:

- fixing level jumps that skip intermediate levels (e.g., an H4 appearing directly under an H1 becomes an H2),
- fixing numbering inconsistencies where subsection depth doesn’t match the heading level, and
- preserving decreasing levels (moving back up the hierarchy), since those are usually intentional.
Now we can use an LLM to automatically detect and fix these heading hierarchy issues. The function uses litellm (wrapped by the Lisette package) to send the formatted headings to a language model along with our correction rules. The LLM analyzes the structure and returns only the headings that need fixing, mapped by their index numbers.
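A hedged sketch of that call, using litellm directly (the real implementation goes through Lisette; `RULES_PROMPT` stands in for the correction rules described above, and structured output via `response_format` is an assumption about the configured provider):

```python
import litellm

def fix_headings(headings: str, model: str = 'claude-sonnet-4-5') -> dict[int, str]:
    "Sketch: send the numbered headings plus rules, return {index: corrected heading}."
    resp = litellm.completion(
        model=model,
        messages=[{'role': 'user', 'content': f'{RULES_PROMPT}\n\n{headings}'}],
        response_format=HeadingCorrections)  # Pydantic model from the sketch above
    data = HeadingCorrections.model_validate_json(resp.choices[0].message.content)
    return {c.idx: c.corrected for c in data.corrections}
```

Applied to the ResNet example, the corrected headings look like this: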
```md
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framewor
```
Finally, we tie everything together in a single function that processes an entire document directory, fixing all heading hierarchy issues and optionally adding page numbers.
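A hypothetical shape for that driver, composing the sketches above (the function name, page-file globbing, and write-back details are all assumptions):

```python
from pathlib import Path

def fix_doc_headings(src: Path, dst: Path) -> None:
    "Sketch: tag headings with pages, fix them in one LLM call, apply per page."
    pages = sorted(src.glob('*.md'))
    tagged = [add_pg_hdgs(p.read_text(), n) for n, p in enumerate(pages, 1)]
    hdgs = [f'{"#" * lvl} {txt}' for t in tagged for lvl, txt in extract_headings(t)]
    fixes = fix_headings('\n'.join(f'{i}. {h}' for i, h in enumerate(hdgs)))
    lookup = {hdgs[i]: fixed for i, fixed in fixes.items()}  # tagged heading -> fix
    for t, p in zip(tagged, pages):
        (dst / p.name).write_text('\n'.join(lookup.get(ln, ln) for ln in t.splitlines()))
```

The processed pages then look like: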
```md
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
```
## Image Description
Tools for classifying and describing images in OCR’d markdown documents
The two-field Pydantic model filters decorative images via `is_informative` while providing rich description context for downstream RAG systems.
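Given the fields named in the prompt and the `ImgDescription` return type shown below, the model is presumably along these lines (a sketch, not the verbatim source):

```python
from pydantic import BaseModel

class ImgDescription(BaseModel):
    is_informative: bool  # False for logos, backgrounds, and other decorative images
    description: str      # detailed description, or a brief label when decorative
```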
The prompt uses a two-step approach: first classify the image as informative or decorative, then provide an appropriate level of detail in the description based on that classification.
```
describe_img (img_path:pathlib.Path, model:str='claude-sonnet-4-5',
              prompt:str='Analyze this image from an academic/technical
              document.\n\nStep 1: Determine if this image is informative
              for understanding the document content.\n- Informative:
              charts, diagrams, tables, technical illustrations,
              experimental results, architectural diagrams\n- Non-
              informative: logos, decorative images, generic photos, page
              backgrounds\n\nStep 2: \n- If informative: Provide a
              detailed description including the type of visualization,
              key elements and their relationships, important data or
              patterns, and relevant technical details.\n- If non-
              informative: Provide a brief label (e.g., "Company logo",
              "Decorative header image")\n\nReturn your response as JSON
              with \'is_informative\' (boolean) and \'description\'
              (string) fields.')
```
Describe a single image using AsyncChat
| | Type | Default | Details |
|---|---|---|---|
| img_path | Path | | Path to the image file |
| model | str | claude-sonnet-4-5 | Model to use |
| prompt | str | Analyze this image from an academic/technical document.<br>Step 1: Determine if this image is informative for understanding the document content. - Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams - Non-informative: logos, decorative images, generic photos, page backgrounds<br>Step 2: - If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details. - If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")<br>Return your response as JSON with 'is_informative' (boolean) and 'description' (string) fields. | Prompt for description |
| **Returns** | **ImgDescription** | | |
We process images asynchronously using `AsyncChat` to handle multiple images efficiently while respecting API rate limits.
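A sketch of how `describe_img` might drive `AsyncChat` (the calling convention of awaiting the chat with a `[image bytes, prompt]` list is an assumption borrowed from claudette-style APIs; `DEFAULT_PROMPT` abbreviates the full prompt shown above, and `parse_json_response` is sketched below):

```python
from pathlib import Path
from lisette import AsyncChat  # the docs name AsyncChat; exact API is assumed

DEFAULT_PROMPT = 'Analyze this image from an academic/technical document. ...'  # full text above

async def describe_img(img_path: Path, model: str = 'claude-sonnet-4-5',
                       prompt: str = DEFAULT_PROMPT) -> ImgDescription:
    "Sketch: send the image bytes plus the prompt, parse the JSON reply."
    chat = AsyncChat(model)
    res = await chat([img_path.read_bytes(), prompt])
    return ImgDescription(**parse_json_response(res))
```

Running it on Figure 1 of the ResNet paper returns, for example: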
```json
{"is_informative": true, "description": "Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets."}
```
Extract and parse JSON content from model response
| | Details |
|---|---|
| result | ModelResponse object from API call |
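A sketch of that extraction, assuming a litellm-style `ModelResponse` where the text lives at `choices[0].message.content` (the regex tolerates prose or code fences around the JSON):

```python
import json, re

def parse_json_response(result) -> dict:
    "Sketch: pull the first JSON object out of the model's text reply."
    txt = result.choices[0].message.content
    m = re.search(r'\{.*\}', txt, flags=re.S)
    return json.loads(m.group(0)) if m else {}
```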
The `describe_imgs` function orchestrates parallel image processing: it creates async tasks for each image, limits concurrency with a semaphore, adds delays between requests to avoid rate limits, and returns a dictionary mapping filenames to their descriptions for easy lookup during markdown enrichment.
```
describe_imgs (imgs:list[pathlib.Path], model:str='claude-sonnet-4-5',
               prompt:str='Analyze this image from an academic/technical
               document.\n\nStep 1: Determine if this image is
               informative for understanding the document content.\n-
               Informative: charts, diagrams, tables, technical
               illustrations, experimental results, architectural
               diagrams\n- Non-informative: logos, decorative images,
               generic photos, page backgrounds\n\nStep 2: \n- If
               informative: Provide a detailed description including the
               type of visualization, key elements and their
               relationships, important data or patterns, and relevant
               technical details.\n- If non-informative: Provide a brief
               label (e.g., "Company logo", "Decorative header
               image")\n\nReturn your response as JSON with
               \'is_informative\' (boolean) and \'description\' (string)
               fields.', semaphore:int=10, delay:float=0.1)
```
Describe multiple images in parallel with rate limiting
| | Type | Default | Details |
|---|---|---|---|
| imgs | list | | List of image file paths to describe |
| model | str | claude-sonnet-4-5 | Model to use for image description |
| prompt | str | Analyze this image from an academic/technical document.<br>Step 1: Determine if this image is informative for understanding the document content. - Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams - Non-informative: logos, decorative images, generic photos, page backgrounds<br>Step 2: - If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details. - If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")<br>Return your response as JSON with 'is_informative' (boolean) and 'description' (string) fields. | |
| semaphore | int | 10 | Limits the number of concurrent requests |
| delay | float | 0.1 | Delay between requests to avoid rate limits |
```python
{'img-5.jpeg': {'is_informative': True,
  'description': 'Three line graphs showing training error (%) versus iterations (1e4) for different neural network architectures. Left panel compares plain networks (plain-20, plain-32, plain-44, plain-56) showing error rates between 0-20%, with deeper networks performing worse. Middle panel shows ResNet architectures (ResNet-20, ResNet-32, ResNet-44, ResNet-56, ResNet-110) with error rates 0-20%, demonstrating that deeper ResNets achieve lower error rates, with 56-layer and 20-layer performance labeled. Right panel compares residual-110 and residual-1202 models showing error rates 0-20%, with both converging to similar performance around 5-7% error. The graphs illustrate the effectiveness of residual connections in enabling training of deeper networks compared to plain architectures.'},
 'img-0.jpeg': {'is_informative': True,
  'description': 'Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.'}}
```
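The rate-limiting pattern inside `describe_imgs` is roughly the following (a sketch: only the semaphore, delay, and filename-keyed result are described by the docs above):

```python
import asyncio
from pathlib import Path

async def describe_imgs(imgs: list[Path], model: str = 'claude-sonnet-4-5',
                        semaphore: int = 10, delay: float = 0.1) -> dict[str, dict]:
    "Sketch: bounded concurrency plus a small delay between requests."
    sem = asyncio.Semaphore(semaphore)
    async def one(p: Path):
        async with sem:
            await asyncio.sleep(delay)  # spread requests to stay under rate limits
            return p.name, (await describe_img(p, model)).model_dump()
    return dict(await asyncio.gather(*(one(p) for p in imgs)))
```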
We save descriptions to a JSON file so they can be reused without re-processing images, which saves API costs and time. The `add_img_descs` function checks for this cache file first.
```python
'{\n  "img-5.jpeg": {\n    "is_informative": true,\n    "description": "This figure contains three line graphs showing training error rates over iterations for different neural network architectures. \\n\\nLeft panel: Shows error rates for plain networks (plain-20, plain-32, plain-44, plain-56) over approximately 6\\u00d710^4 iterations. The curves show varying convergence patterns, with the 56-layer and 20-layer networks achieving lower error rates around 10-13%, while deeper networks show less stable'
```
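A sketch of that cache check (the `img_descriptions.json` name comes from the docs below; the helper name and image glob are assumptions):

```python
import json
from pathlib import Path

async def load_or_describe(img_dir: Path, force: bool = False) -> dict:
    "Sketch: reuse cached descriptions unless force is set."
    cache = img_dir / 'img_descriptions.json'
    if cache.exists() and not force:
        return json.loads(cache.read_text())
    descs = await describe_imgs(sorted(img_dir.glob('*.jpeg')))
    cache.write_text(json.dumps(descs, indent=2))
    return descs
```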
Once we have image descriptions, we insert them into the markdown by finding image references and adding formatted description blocks.
| | Type | Details |
|---|---|---|
| | | Dictionary mapping image filenames to their descriptions |
| **Returns** | **str** | Page markdown with descriptions added |
Image descriptions are inserted directly after the markdown image reference, wrapped in horizontal rules to visually separate them from the document flow. This preserves the original structure while making descriptions easily identifiable.
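A sketch of that insertion, following the block format shown in the output below (the helper name is hypothetical; the real function may match image references differently):

```python
import re
from pathlib import Path

def add_descs_to_page(page_md: str, descs: dict) -> str:
    "Sketch: append a description block after each informative image reference."
    def repl(m):
        d = descs.get(Path(m.group(1)).name)
        if not d or not d.get('is_informative'):
            return m.group(0)  # leave decorative images untouched
        return (m.group(0) + '\n\nAI-generated image description:\n___\n'
                + d['description'] + '\n___\n')
    return re.sub(r'!\[[^\]]*\]\(([^)]+)\)', repl, page_md)
```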
Before:

```md
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradien
```
After:

```md
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]

AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.
Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.
Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training error. The 56-layer network (red line) maintains higher test error around 14-15%, while the 20-layer network (yellow/olive line) achieves lower test error around 10-11%.
Key observation: The deeper 56-layer network exhibits worse performance (higher error) than the shallower 20-layer network on both training and test sets, suggesting a degradation problem in very deep networks. This visualization likely illustrates the motivation for residual learning architectures that address the degradation problem in deep neural networks.
___

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanish
```
We process all pages in batch to efficiently add descriptions throughout the document.
```md
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]

AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.
Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.
Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training err
```
The `force` parameter controls whether to regenerate image descriptions:

- `force=False` (default): loads existing descriptions from `img_descriptions.json` if it exists, saving time and API costs
- `force=True`: regenerates all descriptions by calling the vision LLM, even if cached descriptions exist
**Important:** If `dst` is the same as `src`, descriptions will be added to files that may already contain them from previous runs. To avoid duplicate descriptions, either:

- use a different destination directory each time, or
- ensure your source markdown files are clean before processing
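A hedged usage sketch (only `src`, `dst`, and `force` are confirmed above; the call may need to be awaited if the function is async):

```python
from pathlib import Path

# Hypothetical directory names; add_img_descs reads pages from src and
# writes description-enriched pages to dst.
add_img_descs(src=Path('ocr_pages'), dst=Path('ocr_pages_described'), force=False)
```

A processed page then looks like: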
```md
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]

AI-generated image description:
___
Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.
___

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks
```