Refine

Fix heading hierarchy and describe images in OCR’d markdown documents

This module refines markdown documents extracted from OCR’d PDFs through two main processes:

  • fixing the heading hierarchy, and
  • classifying and describing the images each page references.

Both processes work incrementally on page-by-page markdown files, caching results to avoid redundant API calls.

Heading Hierarchy

Functions for detecting and fixing markdown heading levels

OCR’d PDF files often have corrupted heading hierarchies: headings may jump levels incorrectly (e.g., from H1 to H4) or use inconsistent levels for sections at the same depth. This section provides tools to automatically detect and fix these issues using LLMs, and optionally to add page numbers for easier navigation.

The first step is extracting all headings from a markdown document so we can analyze their structure.


get_hdgs


def get_hdgs(
    md:str, # Markdown file string
)->L: # L of strings

Return the markdown headings
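A minimal sketch of what this might look like (an assumed implementation; the real get_hdgs may differ):

import re
from fastcore.all import L

def get_hdgs_sketch(md):
    "Collect every line that starts with 1-6 '#' characters"
    return L(l for l in md.splitlines() if re.match(r'#{1,6}\s', l))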


add_pg_hdgs


def add_pg_hdgs(
    md:str, # Markdown file string
    n:int, # Page number
)->str: # Markdown file string

Add page number to all headings in page markdown

The add_pg_hdgs function serves two important purposes:

1. Creating unique heading identifiers

When fixing heading hierarchies across an entire document, we need a way to distinguish between headings that have the same text but appear in different locations. For example, a document might have multiple “Introduction” or “Conclusion” headings in different chapters. By appending the page number to each heading, we create unique identifiers that allow us to build a lookup table mapping each specific heading instance to its corrected version. This assumes (reasonably) that the same heading text won’t appear twice on a single page.

2. Providing spatial context for LLMs

Adding page numbers gives LLMs valuable positional information when analyzing the document structure. The page number helps the model understand:

  • where a heading sits in the overall document flow,
  • the relative distance between sections, and
  • whether headings that seem related are actually close together or far apart.

This spatial awareness can significantly improve the LLM’s ability to infer the correct hierarchical relationships between headings, especially in long documents where similar section names might appear at different structural levels.
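Based on the before/after example that follows, a minimal sketch of the transformation (an assumed implementation; the real add_pg_hdgs may differ in details):

import re

def add_pg_hdgs_sketch(md, n):
    "Append ' ... page n' to every heading line in the page"
    return '\n'.join(f'{l} ... page {n}' if re.match(r'#{1,6}\s', l) else l
                     for l in md.splitlines())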

For instance:

pgs = read_pgs('files/test/md_all/attention-is-all-you-need', join=False)
pg0,pg0_with = pgs[0][:500],add_pg_hdgs(pgs[0], n=1)[:500]
print('Before:\n' + 80*'-' + f'\n{pg0}\n\nAfter:\n' + 80*'-' + f'\n{pg0_with}')
Before:
--------------------------------------------------------------------------------
# Attention Is All You Need

Ashish Vaswani*

Google Brain

avaswani@google.com

Noam Shazeer*

Google Brain

noam@google.com

Niki Parmar*

Google Research

nikip@google.com

Jakob Uszkoreit*

Google Research

usz@google.com

Llion Jones*

Google Research

llion@google.com

Aidan N. Gomez*†

University of Toronto

aidan@cs.toronto.edu

Łukasz Kaiser*

Google Brain

lukaszkaiser@google.com

Illia Polosukhin*‡

illia.polosukhin@gmail.com

# Abstract

The dominant sequence transduction models are 

After:
--------------------------------------------------------------------------------
# Attention Is All You Need ... page 1

Ashish Vaswani*

Google Brain

avaswani@google.com

Noam Shazeer*

Google Brain

noam@google.com

Niki Parmar*

Google Research

nikip@google.com

Jakob Uszkoreit*

Google Research

usz@google.com

Llion Jones*

Google Research

llion@google.com

Aidan N. Gomez*†

University of Toronto

aidan@cs.toronto.edu

Łukasz Kaiser*

Google Brain

lukaszkaiser@google.com

Illia Polosukhin*‡

illia.polosukhin@gmail.com

# Abstract ... page 1

The dominant sequence tr

read_pgs_pg


def read_pgs_pg(
    path:str, # Path to the markdown file
)->L: # List of markdown pages

Read all pages of a markdown file and add page numbers to all headings
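This is essentially a composition of read_pgs and add_pg_hdgs; a plausible sketch (assuming read_pgs returns pages in order):

from fastcore.all import L

def read_pgs_pg_sketch(path):
    pgs = read_pgs(path, join=False)  # one string per page
    return L(add_pg_hdgs(pg, n=i+1) for i,pg in enumerate(pgs))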

pgs = read_pgs_pg('files/test/md_all/attention-is-all-you-need')
hdgs = L([get_hdgs(pg) for pg in pgs]).concat()
hdgs
(#25) ['# Attention Is All You Need ... page 1','# Abstract ... page 1','## 2 Background ... page 2','## 3 Model Architecture ... page 2','# 3.1 Encoder and Decoder Stacks ... page 3','# 3.2 Attention ... page 3','# 3.2.1 Scaled Dot-Product Attention ... page 4','# 3.2.2 Multi-Head Attention ... page 4','#### 3.2.3 Applications of Attention in our Model ... page 5','### 3.3 Position-wise Feed-Forward Networks ... page 5','### 3.4 Embeddings and Softmax ... page 5','# 3.5 Positional Encoding ... page 6','# 4 Why Self-Attention ... page 6','## 5 Training ... page 7','### 5.1 Training Data and Batching ... page 7','### 5.2 Hardware and Schedule ... page 7','### 5.3 Optimizer ... page 7','### 5.4 Regularization ... page 7','# 6 Results ... page 8','# 6.1 Machine Translation ... page 8'...]

To make it easier for an LLM to reference specific headings when suggesting fixes, we format them with index numbers.


fmt_hdgs_idx


def fmt_hdgs_idx(
    hdgs:list, # List of markdown headings
)->str: # Formatted string with index

Format the headings with index
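Given the output below, the formatting is a simple enumeration; a sketch:

def fmt_hdgs_idx_sketch(hdgs):
    "Prefix each heading with its index so the LLM can reference it"
    return '\n'.join(f'{i}. {h}' for i,h in enumerate(hdgs))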

hdgs_fmt = fmt_hdgs_idx(hdgs)
print(hdgs_fmt)
0. # Attention Is All You Need ... page 1
1. # Abstract ... page 1
2. ## 2 Background ... page 2
3. ## 3 Model Architecture ... page 2
4. # 3.1 Encoder and Decoder Stacks ... page 3
5. # 3.2 Attention ... page 3
6. # 3.2.1 Scaled Dot-Product Attention ... page 4
7. # 3.2.2 Multi-Head Attention ... page 4
8. #### 3.2.3 Applications of Attention in our Model ... page 5
9. ### 3.3 Position-wise Feed-Forward Networks ... page 5
10. ### 3.4 Embeddings and Softmax ... page 5
11. # 3.5 Positional Encoding ... page 6
12. # 4 Why Self-Attention ... page 6
13. ## 5 Training ... page 7
14. ### 5.1 Training Data and Batching ... page 7
15. ### 5.2 Hardware and Schedule ... page 7
16. ### 5.3 Optimizer ... page 7
17. ### 5.4 Regularization ... page 7
18. # 6 Results ... page 8
19. # 6.1 Machine Translation ... page 8
20. # 6.2 Model Variations ... page 8
21. # 6.3 English Constituency Parsing ... page 9
22. # 7 Conclusion ... page 10
23. # References ... page 10
24. # Attention Visualizations ... page 13

We use a Pydantic model to ensure the LLM returns corrections in a structured format: a list of dictionaries mapping heading indices to their corrected versions.


HeadingCorrection


def HeadingCorrection(
    data:Any
)->None:

A single heading correction mapping an index to its corrected markdown heading


HeadingCorrections


def HeadingCorrections(
    data:Any
)->None:

Collection of heading corrections returned by the LLM
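A plausible shape for these two models (the field names here are assumptions; the real definitions may differ):

from pydantic import BaseModel

class HeadingCorrection(BaseModel):
    idx:int      # index of the heading in the formatted list
    heading:str  # corrected markdown heading, e.g. '## Abstract ... page 1'

class HeadingCorrections(BaseModel):
    corrections:list[HeadingCorrection]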

This prompt instructs the LLM on what types of heading hierarchy errors to fix while preserving the document’s intended structure. It focuses on three main issues:

  • level jumps that skip intermediate levels,
  • numbering inconsistencies where subsection depth doesn’t match heading level, and
  • decreasing levels (moving back up the hierarchy), which are legitimate and must be preserved.

Now we can use an LLM to automatically detect and fix these heading hierarchy issues. The function uses litellm (wrapped by the Lisette package) to send the formatted headings to a language model along with our correction rules. The LLM analyzes the structure and returns only the headings that need fixing, mapped by their index numbers.


fix_hdg_hierarchy


def fix_hdg_hierarchy(
    hdgs:list, # List of markdown headings
    prompt:str=None, # Prompt to use
    model:str='claude-sonnet-4-5', # Model to use
    api_key:str=None, # API key
)->dict: # Dictionary of index → corrected heading

Fix the heading hierarchy
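A rough sketch of the underlying call, using litellm’s structured-output support directly rather than Lisette (assumed wiring, reusing the hypothetical field names from the Pydantic sketch above):

import litellm

def fix_hdg_hierarchy_sketch(hdgs, prompt, model='claude-sonnet-4-5'):
    msgs = [{'role':'user', 'content': f'{prompt}\n\n{fmt_hdgs_idx(hdgs)}'}]
    r = litellm.completion(model=model, messages=msgs,
                           response_format=HeadingCorrections)
    cs = HeadingCorrections.model_validate_json(r.choices[0].message.content)
    return {c.idx: c.heading for c in cs.corrections}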

fixes = fix_hdg_hierarchy(hdgs)
fixes
{1: '## Abstract ... page 1',
 14: '### A. Object Detection Baselines ... page 10',
 15: '#### PASCAL VOC ... page 10',
 16: '#### MS COCO ... page 10',
 17: '### B. Object Detection Improvements ... page 10',
 18: '#### MS COCO ... page 10',
 19: '#### PASCAL VOC ... page 11',
 20: '#### ImageNet Detection ... page 11',
 21: '### C. ImageNet Localization ... page 12'}

The corrections come back keyed by heading index, but to apply them we need a lookup table from each actual heading text to its corrected version, so replacements can be made directly in the document.


mk_fixes_lut


def mk_fixes_lut(
    hdgs:list, # List of markdown headings
    model:str='claude-sonnet-4-5', # Model to use
    api_key:str=None, # API key
    prompt:str=None, # Prompt to use
)->dict: # Dictionary of old → new heading

Make a lookup table of fixes
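Building the lookup table is then a simple re-keying of the indexed fixes; a sketch:

def mk_fixes_lut_sketch(hdgs, **kwargs):
    fixes = fix_hdg_hierarchy(hdgs, **kwargs)
    # int() tolerates indices returned as either ints or strings
    return {hdgs[int(i)]: new for i,new in fixes.items()}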

lut_fixes = mk_fixes_lut(hdgs)
lut_fixes
{'#### Abstract ... page 1': '## Abstract ... page 1',
 '## A. Object Detection Baselines ... page 10': '### A. Object Detection Baselines ... page 10',
 '## PASCAL VOC ... page 10': '#### PASCAL VOC ... page 10',
 '## MS COCO ... page 10': '#### MS COCO ... page 10',
 '## B. Object Detection Improvements ... page 10': '### B. Object Detection Improvements ... page 10',
 '## PASCAL VOC ... page 11': '#### PASCAL VOC ... page 11',
 '## ImageNet Detection ... page 11': '#### ImageNet Detection ... page 11',
 '## C. ImageNet Localization ... page 12': '### C. ImageNet Localization ... page 12'}

Now we can apply the fixes to individual pages. We optionally add page numbers to headings for easier navigation in the final document.


apply_hdg_fixes


def apply_hdg_fixes(
    p:str, # Page to fix
    lut_fixes:dict, # Lookup table of fixes
)->str: # Page with fixes applied

Apply the fixes to the page
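Since each heading (with its page suffix) is unique, applying fixes can be a line-by-line substitution; a sketch:

def apply_hdg_fixes_sketch(p, lut_fixes):
    "Swap any heading line found in the lookup table for its corrected form"
    return '\n'.join(lut_fixes.get(l, l) for l in p.splitlines())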

p = read_pgs_pg('files/test/md_all/resnet')[0]
print(apply_hdg_fixes(p, lut_fixes)[:300])
# Deep Residual Learning for Image Recognition  ... page 1

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


## Abstract ... page 1

Deeper neural networks are more difficult to train. We present a residual learning framewor

Finally, we tie everything together in a single function that processes an entire document directory, fixing all heading hierarchy issues and optionally adding page numbers.


fix_hdgs


def fix_hdgs(
    src:str, # Path to source markdown directory
    model:str='claude-sonnet-4-5', # Model to use
    dst:str=None, # Destination directory (defaults to src if None)
    img_folder:str='img', # Name of folder containing images
    api_key:str=None, # API key
    prompt:str=None, # Prompt to use
): # Dictionary of old → new heading

Fix heading hierarchy in markdown document
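A sketch of the orchestration (assumed; the real function also copies the image folder and handles the api_key and prompt arguments):

from pathlib import Path
from fastcore.all import L

def fix_hdgs_sketch(src, dst=None):
    dst = Path(dst or src); dst.mkdir(parents=True, exist_ok=True)
    pgs = read_pgs_pg(src)  # pages with page-numbered headings
    lut = mk_fixes_lut(L(get_hdgs(pg) for pg in pgs).concat())
    for i,pg in enumerate(pgs):
        (dst/f'page_{i+1}.md').write_text(apply_hdg_fixes(pg, lut))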

fix_hdgs('files/test/md_all/resnet', dst='files/test/md_fixed/resnet')
!ls -R 'files/test/md_fixed/resnet'
files/test/md_fixed/resnet:
img            page_10.md  page_2.md  page_5.md  page_8.md
img_descriptions.json  page_11.md  page_3.md  page_6.md  page_9.md
page_1.md          page_12.md  page_4.md  page_7.md

files/test/md_fixed/resnet/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg
md = read_pgs('files/test/md_fixed/resnet')
print(md[:500])
# Deep Residual Learning for Image Recognition  ... page 1

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


## Abstract ... page 1

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins

Image Description

Tools for classifying and describing images in markdown documents

imgs = Path('files/test/md_fixed/resnet/img').ls(file_exts='.jpeg')
imgs
(#7) [Path('files/test/md_fixed/resnet/img/img-5.jpeg'),Path('files/test/md_fixed/resnet/img/img-0.jpeg'),Path('files/test/md_fixed/resnet/img/img-1.jpeg'),Path('files/test/md_fixed/resnet/img/img-2.jpeg'),Path('files/test/md_fixed/resnet/img/img-3.jpeg'),Path('files/test/md_fixed/resnet/img/img-6.jpeg'),Path('files/test/md_fixed/resnet/img/img-4.jpeg')]

ImgDescription


def ImgDescription(
    data:Any
)->None:

Image classification and description for OCR’d documents

The two-field Pydantic model filters decorative images via is_informative while providing rich description context for downstream RAG systems.

The prompt uses a two-step approach: first classify the image as informative or decorative, then provide an appropriate level of detail in the description based on that classification.


describe_img


def describe_img(
    img_path:Path, # Path to the image file
    model:str='claude-sonnet-4-5', # Model to use
    prompt:str='Analyze this image from an academic/technical document.\n\nStep 1: Determine if this image is informative for understanding the document content.\n- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams\n- Non-informative: logos, decorative images, generic photos, page backgrounds\n\nStep 2: \n- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.\n- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")\n\nReturn your response as JSON with \'is_informative\' (boolean) and \'description\' (string) fields.', # Prompt for description
)->ImgDescription:

Describe a single image using AsyncChat

We process images asynchronously using AsyncChat to handle multiple images efficiently while respecting API rate limits.

img = Path('files/test/md_fixed/resnet/img/img-0.jpeg')
r = await describe_img(img)
r

{"is_informative": true, "description": "Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets."}

  • id: chatcmpl-444fbd64-e128-4f47-b72b-ee254372fa78
  • model: claude-sonnet-4-5-20250929
  • finish_reason: stop
  • usage: Usage(completion_tokens=245, prompt_tokens=605, total_tokens=850, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

To avoid hitting API rate limits when processing multiple images, we use semaphore-based concurrency control with delays between requests.


limit


def limit(
    semaphore, # Semaphore for concurrency control
    coro, # Coroutine to execute
    delay:float=None, # Optional delay in seconds after execution
):

Execute coroutine with semaphore-based rate limiting and optional delay
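The behavior described maps directly onto asyncio primitives; a sketch:

import asyncio

async def limit_sketch(semaphore, coro, delay=None):
    async with semaphore:                     # cap concurrent executions
        res = await coro
        if delay: await asyncio.sleep(delay)  # space out subsequent requests
        return res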

img = Path('files/test/md_fixed/resnet/img/img-0.jpeg')
r = await limit(Semaphore(2), describe_img(img), delay=1)
r

{"is_informative": true, "description": "Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets."}

  • id: chatcmpl-74b8a349-87e1-4ceb-bc33-39cd648404e2
  • model: claude-sonnet-4-5-20250929
  • finish_reason: stop
  • usage: Usage(completion_tokens=245, prompt_tokens=605, total_tokens=850, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

parse_r


def parse_r(
    result, # ModelResponse object from API call
):

Extract and parse JSON content from model response
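Given litellm’s ModelResponse layout, a sketch:

import json

def parse_r_sketch(result):
    "Parse the JSON string the model returned as its message content"
    return json.loads(result.choices[0].message.content)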

The describe_imgs function orchestrates parallel image processing: it creates async tasks for each image, limits concurrency with a semaphore, adds delays between requests to avoid rate limits, and returns a dictionary mapping filenames to their descriptions for easy lookup during markdown enrichment.


describe_imgs


def describe_imgs(
    imgs:list, # List of image file paths to describe
    model:str='claude-sonnet-4-5', # Model to use for image description
    prompt:str='Analyze this image from an academic/technical document.\n\nStep 1: Determine if this image is informative for understanding the document content.\n- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams\n- Non-informative: logos, decorative images, generic photos, page backgrounds\n\nStep 2: \n- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.\n- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")\n\nReturn your response as JSON with \'is_informative\' (boolean) and \'description\' (string) fields.', # Prompt template for description
    semaphore:int=10, # Max concurrent API requests
    delay:float=0.1, # Delay in seconds between requests
)->dict: # Dict mapping filename to parsed description

Describe multiple images in parallel with rate limiting
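A sketch of that orchestration (assumed; prompt and model plumbing omitted):

import asyncio

async def describe_imgs_sketch(imgs, semaphore=10, delay=0.1):
    sem = asyncio.Semaphore(semaphore)
    rs = await asyncio.gather(*(limit(sem, describe_img(p), delay=delay)
                                for p in imgs))
    return {p.name: parse_r(r) for p,r in zip(imgs, rs)}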

descs = await describe_imgs(imgs[:2], semaphore=10, delay=0.1)
descs
{'img-5.jpeg': {'is_informative': True,
  'description': 'Three line graphs showing training error (%) versus iterations (1e4) for different neural network architectures. Left panel compares plain networks (plain-20, plain-32, plain-44, plain-56) showing error rates between 0-20%, with deeper networks performing worse. Middle panel shows ResNet architectures (ResNet-20, ResNet-32, ResNet-44, ResNet-56, ResNet-110) with error rates 0-20%, demonstrating that deeper ResNets achieve lower error rates, with 56-layer and 20-layer performance labeled. Right panel compares residual-110 and residual-1202 models showing error rates 0-20%, with both converging to similar performance around 5-7% error. The graphs illustrate the effectiveness of residual connections in enabling training of deeper networks compared to plain architectures.'},
 'img-0.jpeg': {'is_informative': True,
  'description': 'Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.'}}

save_img_descs


def save_img_descs(
    descs:dict, # Dictionary of image descriptions
    dst_fname:Path, # Path to save the JSON file
)->None:

Save image descriptions to JSON file

We save descriptions to a JSON file so they can be reused without re-processing images, which saves API costs and time. The add_img_descs function checks for this cache file first.
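A sketch, matching the pretty-printed JSON shown below:

import json
from pathlib import Path

def save_img_descs_sketch(descs, dst_fname):
    Path(dst_fname).write_text(json.dumps(descs, indent=2))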

save_img_descs(descs, 'files/test/md_fixed/resnet/img_descriptions.json')
Path('files/test/md_fixed/resnet/img_descriptions.json').read_text()[:500]
'{\n  "img-5.jpeg": {\n    "is_informative": true,\n    "description": "This figure contains three line graphs showing training error rates over iterations for different neural network architectures. \\n\\nLeft panel: Shows error rates for plain networks (plain-20, plain-32, plain-44, plain-56) over approximately 6\\u00d710^4 iterations. The curves show varying convergence patterns, with the 56-layer and 20-layer networks achieving lower error rates around 10-13%, while deeper networks show less stable'

Once we have image descriptions, we insert them into the markdown by finding image references and adding formatted description blocks.


add_descs_to_pg


def add_descs_to_pg(
    pg:str, # Page markdown content
    descs:dict, # Dictionary mapping image filenames to their descriptions
)->str: # Page markdown with descriptions added

Add AI-generated descriptions to images in page

Image descriptions are inserted directly after the markdown image reference, wrapped in horizontal rules to visually separate them from the document flow. This preserves the original structure while making descriptions easily identifiable.
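A sketch of the insertion, matching the format shown below (assumed implementation; here every described image gets a block, regardless of is_informative):

import re

def add_descs_to_pg_sketch(pg, descs):
    def repl(m):
        d = descs.get(m.group(1))  # filename inside the image ref
        if not d: return m.group(0)
        return (m.group(0) + '\nAI-generated image description:\n___\n'
                + d['description'] + '\n___\n')
    return re.sub(r'!\[[^\]]*\]\(([^)]+)\)', repl, pg)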

pgs = read_pgs('files/test/md_fixed/resnet', join=False)
print(pgs[0][2000:3000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradien
descs = json.loads(Path('files/test/md_fixed/resnet/img_descriptions.json').read_text())
new_pg = add_descs_to_pg(pgs[0], descs)
print(new_pg[2000:4000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.

Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.

Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training error. The 56-layer network (red line) maintains higher test error around 14-15%, while the 20-layer network (yellow/olive line) achieves lower test error around 10-11%.

Key observation: The deeper 56-layer network exhibits worse performance (higher error) than the shallower 20-layer network on both training and test sets, suggesting a degradation problem in very deep networks. This visualization likely illustrates the motivation for residual learning architectures that address the degradation problem in deep neural networks.
___

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanish

We process all pages in batch to efficiently add descriptions throughout the document.


add_descs_to_pgs


def add_descs_to_pgs(
    pgs:list, # List of page markdown strings
    descs:dict, # Dictionary mapping image filenames to their descriptions
)->list: # List of pages with descriptions added

Add AI-generated descriptions to images in all pages

enriched_pgs = add_descs_to_pgs(pgs, descs)
print(enriched_pgs[0][2000:3000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.

Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.

Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training err

The force parameter controls whether to regenerate image descriptions:

  • force=False (default): loads existing descriptions from img_descriptions.json if it exists, saving time and API costs.
  • force=True: regenerates all descriptions by calling the vision LLM, even if cached descriptions exist.

Important: If dst is the same as src, descriptions will be added to files that may already contain them from previous runs. To avoid duplicate descriptions, either:

  • use a different destination directory each time, or
  • ensure your source markdown files are clean before processing.


add_img_descs


def add_img_descs(
    src:str, # Path to source markdown directory
    dst:str=None, # Destination directory (defaults to src if None)
    model:str='claude-sonnet-4-5', # Vision model for image description
    img_folder:str='img', # Name of folder containing images
    semaphore:int=2, # Max concurrent API requests
    delay:float=1, # Delay in seconds between API calls
    force:bool=False, # Force regeneration even if cache exists
    progress:bool=True, # Log progress messages
):

Describe all images in markdown document and insert descriptions inline
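A sketch of the orchestration, including the cache check (assumed; the real function also logs progress and handles the model and progress parameters):

import json
from pathlib import Path

async def add_img_descs_sketch(src, dst=None, img_folder='img', force=False, **kw):
    src,dst = Path(src),Path(dst or src); dst.mkdir(parents=True, exist_ok=True)
    cache = src/'img_descriptions.json'
    if cache.exists() and not force:
        descs = json.loads(cache.read_text())  # reuse cached descriptions
    else:
        descs = await describe_imgs(sorted((src/img_folder).glob('*.jpeg')), **kw)
        save_img_descs(descs, cache)
    pgs = add_descs_to_pgs(read_pgs(src, join=False), descs)
    for i,pg in enumerate(pgs): (dst/f'page_{i+1}.md').write_text(pg)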

Here’s the complete workflow to process a document:

await add_img_descs('files/test/md_fixed/resnet', dst='files/test/md_enriched/resnet', force=True, progress=True)
__main__ - INFO - Describing 7 images...
__main__ - INFO - Saved descriptions to files/test/md_fixed/resnet/img_descriptions.json
__main__ - INFO - Adding descriptions to 12 pages...
__main__ - INFO - Done! Enriched pages saved to files/test/md_enriched/resnet
pgs = read_pgs('files/test/md_enriched/resnet', join=False)
print(pgs[0][2000:4000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.
___

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks