Refine

Fix heading hierarchy and describe images in OCR’d markdown documents

This module refines markdown documents extracted from OCR’d PDFs through two main processes:

1. Heading hierarchy fixing: detecting and correcting inconsistent heading levels, optionally tagging headings with page numbers.
2. Image description: classifying images as informative or decorative and inserting AI-generated descriptions inline.

Both processes work incrementally on page-by-page markdown files, caching results to avoid redundant API calls.

Heading Hierarchy

Functions for detecting and fixing markdown heading levels

Markdown extracted from OCR’d PDFs often has a corrupted heading hierarchy: headings may jump levels incorrectly (e.g., from H1 to H4) or use inconsistent levels for sections at the same depth. This section provides tools to automatically detect and fix these issues using LLMs, while also optionally adding page numbers for easier navigation.

The first step is extracting all headings from a markdown document so we can analyze their structure.
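
Conceptually, extracting headings just means keeping the lines that start with `#`. A minimal sketch of this step (the real get_hdgs may differ, for example in how it treats `#` lines inside code blocks):

```python
from fastcore.foundation import L

def get_hdgs_sketch(md: str) -> L:
    "Return the markdown heading lines of a page (simplified sketch of get_hdgs)."
    return L([line for line in md.splitlines() if line.lstrip().startswith('#')])
```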


source

get_hdgs

 get_hdgs (md:str)

Return the markdown headings

|         | Type | Details |
|---------|------|---------|
| md | str | Markdown file string |
| Returns | L | L of strings |

source

add_pg_hdgs

 add_pg_hdgs (md:str, n:int)

Add page number to all headings in page markdown

|         | Type | Details |
|---------|------|---------|
| md | str | Markdown file string |
| n | int | Page number |
| Returns | str | Markdown file string |

The add_pg_hdgs function serves two important purposes:

1. Creating unique heading identifiers

When fixing heading hierarchies across an entire document, we need a way to distinguish between headings that have the same text but appear in different locations. For example, a document might have multiple “Introduction” or “Conclusion” headings in different chapters. By appending the page number to each heading, we create unique identifiers that allow us to build a lookup table mapping each specific heading instance to its corrected version. This assumes (reasonably) that the same heading text won’t appear twice on a single page.

2. Providing spatial context for LLMs

Adding page numbers gives LLMs valuable positional information when analyzing the document structure. The page number helps the model understand:

  • Where a heading sits in the overall document flow
  • The relative distance between sections
  • Whether headings that seem related are actually close together or far apart

This spatial awareness can significantly improve the LLM’s ability to infer the correct hierarchical relationships between headings, especially in long documents where similar section names might appear at different structural levels.

For instance:

pgs = read_pgs('files/test/md_all/resnet', join=False)
pg0,pg0_with = pgs[0][:500],add_pg_hdgs(pgs[0], n=1)[:500]
print('Before:\n' + 80*'-' + f'\n{pg0}\n\nAfter:\n' + 80*'-' + f'\n{pg0_with}')
Before:
--------------------------------------------------------------------------------
# Deep Residual Learning for Image Recognition 

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


#### Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unr

After:
--------------------------------------------------------------------------------
# Deep Residual Learning for Image Recognition  ... page 1

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


#### Abstract ... page 1

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, i

source

read_pgs_pg

 read_pgs_pg (path:str)

Read all pages of a markdown file and add page numbers to all headings

|         | Type | Details |
|---------|------|---------|
| path | str | Path to the markdown file |
| Returns | L | List of markdown pages |

pgs = read_pgs_pg('files/test/md_all/resnet')
hdgs = L([get_hdgs(pg) for pg in pgs]).concat()
hdgs
(#22) ['# Deep Residual Learning for Image Recognition  ... page 1','#### Abstract ... page 1','## 1. Introduction ... page 1','## 2. Related Work ... page 2','## 3. Deep Residual Learning ... page 3','### 3.1. Residual Learning ... page 3','### 3.2. Identity Mapping by Shortcuts ... page 3','### 3.3. Network Architectures ... page 3','### 3.4. Implementation ... page 4','## 4. Experiments ... page 4','### 4.1. ImageNet Classification ... page 4','### 4.2. CIFAR-10 and Analysis ... page 7','### 4.3. Object Detection on PASCAL and MS COCO ... page 8','## References ... page 9','## A. Object Detection Baselines ... page 10','## PASCAL VOC ... page 10','## MS COCO ... page 10','## B. Object Detection Improvements ... page 10','## MS COCO ... page 10','## PASCAL VOC ... page 11'...]

To make it easier for an LLM to reference specific headings when suggesting fixes, we format them with index numbers.


source

fmt_hdgs_idx

 fmt_hdgs_idx (hdgs:list[str])

Format the headings with index

|         | Type | Details |
|---------|------|---------|
| hdgs | list | List of markdown headings |
| Returns | str | Formatted string with index |

hdgs_fmt = fmt_hdgs_idx(hdgs)
print(hdgs_fmt)
0. # Deep Residual Learning for Image Recognition  ... page 1
1. #### Abstract ... page 1
2. ## 1. Introduction ... page 1
3. ## 2. Related Work ... page 2
4. ## 3. Deep Residual Learning ... page 3
5. ### 3.1. Residual Learning ... page 3
6. ### 3.2. Identity Mapping by Shortcuts ... page 3
7. ### 3.3. Network Architectures ... page 3
8. ### 3.4. Implementation ... page 4
9. ## 4. Experiments ... page 4
10. ### 4.1. ImageNet Classification ... page 4
11. ### 4.2. CIFAR-10 and Analysis ... page 7
12. ### 4.3. Object Detection on PASCAL and MS COCO ... page 8
13. ## References ... page 9
14. ## A. Object Detection Baselines ... page 10
15. ## PASCAL VOC ... page 10
16. ## MS COCO ... page 10
17. ## B. Object Detection Improvements ... page 10
18. ## MS COCO ... page 10
19. ## PASCAL VOC ... page 11
20. ## ImageNet Detection ... page 11
21. ## C. ImageNet Localization ... page 12

We use a Pydantic model to ensure the LLM returns corrections in a structured format: a list of corrections, each mapping a heading index to its corrected version.


source

HeadingCorrection

 HeadingCorrection (index:int, corrected:str)

A single heading correction mapping an index to its corrected markdown heading


source

HeadingCorrections

 HeadingCorrections (corrections:list[__main__.HeadingCorrection])

Collection of heading corrections returned by the LLM
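
Based on the signatures above, the structured-output models look roughly like this (a sketch; the comments are assumptions, not the module’s own documentation):

```python
from pydantic import BaseModel

class HeadingCorrection(BaseModel):
    "A single heading correction mapping an index to its corrected markdown heading"
    index: int       # position of the heading in the numbered list sent to the LLM
    corrected: str   # the heading with its corrected level, e.g. '## Abstract ... page 1'

class HeadingCorrections(BaseModel):
    "Collection of heading corrections returned by the LLM"
    corrections: list[HeadingCorrection]
```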

This prompt instructs the LLM on what types of heading hierarchy errors to fix while preserving the document’s intended structure. It focuses on three main issues:

  • level jumps that skip intermediate levels (see the example below),
  • numbering inconsistencies where subsection depth doesn’t match heading level, and
  • decreasing levels (moving back up the hierarchy), which are legitimate and must be preserved.
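
For instance, a level jump and its correction might look like this (illustrative values, not taken from the test document):

```python
# A heading that jumps from H1 straight to H4 ...
before = ['# 1. Introduction', '#### 1.1. Motivation']   # skips H2 and H3
# ... should only descend one level at a time:
after  = ['# 1. Introduction', '## 1.1. Motivation']
```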

Now we can use an LLM to automatically detect and fix these heading hierarchy issues. The function uses litellm (wrapped by the Lisette package) to send the formatted headings to a language model along with our correction rules. The LLM analyzes the structure and returns only the headings that need fixing, mapped by their index numbers.


source

fix_hdg_hierarchy

 fix_hdg_hierarchy (hdgs:list[str], prompt:str=None, model:str='claude-
                    sonnet-4-5', api_key:str=None)

Fix the heading hierarchy

|         | Type | Default | Details |
|---------|------|---------|---------|
| hdgs | list |  | List of markdown headings |
| prompt | str | None | Prompt to use |
| model | str | claude-sonnet-4-5 | Model to use |
| api_key | str | None | API key |
| Returns | dict |  | Dictionary of index → corrected heading |

fixes = fix_hdg_hierarchy(hdgs)
fixes
{1: '## Abstract ... page 1',
 14: '### A. Object Detection Baselines ... page 10',
 15: '#### PASCAL VOC ... page 10',
 16: '#### MS COCO ... page 10',
 17: '### B. Object Detection Improvements ... page 10',
 18: '#### MS COCO ... page 10',
 19: '#### PASCAL VOC ... page 11',
 20: '#### ImageNet Detection ... page 11',
 21: '### C. ImageNet Localization ... page 12'}

The corrections come back keyed by heading index, but we need to map the actual heading text to its corrected version for easy replacement in the document.
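
Conceptually, the lookup table combines the index-keyed corrections with the original heading list. A rough sketch (not necessarily the exact implementation):

```python
def mk_fixes_lut_sketch(hdgs: list[str], fixes: dict[int, str]) -> dict[str, str]:
    "Turn index-keyed corrections into a text-to-text lookup table (simplified sketch)."
    return {hdgs[i]: new for i, new in fixes.items() if hdgs[i] != new}
```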


source

mk_fixes_lut

 mk_fixes_lut (hdgs:list[str], model:str='claude-sonnet-4-5',
               api_key:str=None, prompt:str=None)

Make a lookup table of fixes

|         | Type | Default | Details |
|---------|------|---------|---------|
| hdgs | list |  | List of markdown headings |
| model | str | claude-sonnet-4-5 | Model to use |
| api_key | str | None | API key |
| prompt | str | None | Prompt to use |
| Returns | dict |  | Dictionary of old → new heading |

lut_fixes = mk_fixes_lut(hdgs)
lut_fixes
{'#### Abstract ... page 1': '## Abstract ... page 1',
 '## A. Object Detection Baselines ... page 10': '### A. Object Detection Baselines ... page 10',
 '## PASCAL VOC ... page 10': '#### PASCAL VOC ... page 10',
 '## MS COCO ... page 10': '#### MS COCO ... page 10',
 '## B. Object Detection Improvements ... page 10': '### B. Object Detection Improvements ... page 10',
 '## PASCAL VOC ... page 11': '#### PASCAL VOC ... page 11',
 '## ImageNet Detection ... page 11': '#### ImageNet Detection ... page 11',
 '## C. ImageNet Localization ... page 12': '### C. ImageNet Localization ... page 12'}

Now we can apply the fixes to individual pages. We optionally add page numbers to headings for easier navigation in the final document.
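
Applying the fixes then amounts to substituting each original heading with its corrected version. A minimal sketch (the real implementation may be more careful, e.g. matching whole lines only):

```python
def apply_hdg_fixes_sketch(p: str, lut_fixes: dict[str, str]) -> str:
    "Replace each original heading in a page with its corrected version (simplified sketch)."
    for old, new in lut_fixes.items():
        p = p.replace(old, new)
    return p
```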


source

apply_hdg_fixes

 apply_hdg_fixes (p:str, lut_fixes:dict[str,str])

Apply the fixes to the page

|           | Type | Details |
|-----------|------|---------|
| p | str | Page to fix |
| lut_fixes | dict | Lookup table of fixes |
| Returns | str | Page with fixes applied |

pg_nb = 1
p = read_pgs_pg('files/test/md_all/resnet')[0]
print(apply_hdg_fixes(p, lut_fixes)[:300])
# Deep Residual Learning for Image Recognition  ... page 1

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


## Abstract ... page 1

Deeper neural networks are more difficult to train. We present a residual learning framewor

Finally, we tie everything together in a single function that processes an entire document directory, fixing all heading hierarchy issues and optionally adding page numbers.
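
Assuming the helpers above are in scope, the end-to-end flow looks roughly like this (a sketch: it omits copying the image folder and any caching, and the output filenames simply follow the page_N.md pattern seen in the listing below):

```python
from pathlib import Path
from fastcore.foundation import L

def fix_hdgs_sketch(src, dst, model='claude-sonnet-4-5'):
    "Rough outline of fix_hdgs: read pages, build the fix lookup table, rewrite each page."
    pgs = read_pgs_pg(src)                             # pages with page numbers added to headings
    hdgs = L([get_hdgs(pg) for pg in pgs]).concat()    # all headings across the document
    lut = mk_fixes_lut(hdgs, model=model)              # old heading -> corrected heading
    Path(dst).mkdir(parents=True, exist_ok=True)
    for i, pg in enumerate(pgs, start=1):
        (Path(dst)/f'page_{i}.md').write_text(apply_hdg_fixes(pg, lut))
    return lut
```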


source

fix_hdgs

 fix_hdgs (src:str, model:str='claude-sonnet-4-5', dst:str=None,
           img_folder:str='img', api_key:str=None, prompt:str=None)

Fix heading hierarchy in markdown document

|            | Type | Default | Details |
|------------|------|---------|---------|
| src | str |  | Path to source markdown directory |
| model | str | claude-sonnet-4-5 | Model to use |
| dst | str | None | Destination directory (defaults to src if None) |
| img_folder | str | img | Name of folder containing images |
| api_key | str | None | API key |
| prompt | str | None | Prompt to use |
| Returns | dict |  | Dictionary of old → new heading |

fix_hdgs('files/test/md_all/resnet', dst='files/test/md_fixed/resnet')
!ls -R 'files/test/md_fixed/resnet'
files/test/md_fixed/resnet:
img            page_10.md  page_2.md  page_5.md  page_8.md
img_descriptions.json  page_11.md  page_3.md  page_6.md  page_9.md
page_1.md          page_12.md  page_4.md  page_7.md

files/test/md_fixed/resnet/img:
img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
img-1.jpeg  img-3.jpeg  img-5.jpeg
md = read_pgs('files/test/md_fixed/resnet')
print(md[:500])
# Deep Residual Learning for Image Recognition  ... page 1

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


## Abstract ... page 1

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins

Image Description

Tools for classifying and describing images in markdown documents

imgs = Path('files/test/md_fixed/resnet/img').ls(file_exts='.jpeg')
imgs
(#7) [Path('files/test/md_fixed/resnet/img/img-5.jpeg'),Path('files/test/md_fixed/resnet/img/img-0.jpeg'),Path('files/test/md_fixed/resnet/img/img-1.jpeg'),Path('files/test/md_fixed/resnet/img/img-2.jpeg'),Path('files/test/md_fixed/resnet/img/img-3.jpeg'),Path('files/test/md_fixed/resnet/img/img-6.jpeg'),Path('files/test/md_fixed/resnet/img/img-4.jpeg')]

source

ImgDescription

 ImgDescription (is_informative:bool, description:str)

Image classification and description for OCR’d documents

This two-field Pydantic model lets us filter out decorative images via is_informative, while description provides rich context for downstream RAG systems.
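
From the signature above, the response model is roughly (a sketch; the comments are assumptions):

```python
from pydantic import BaseModel

class ImgDescription(BaseModel):
    "Image classification and description for OCR'd documents"
    is_informative: bool   # False for logos, decorative images, generic photos
    description: str       # detailed description if informative, brief label otherwise
```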

The prompt uses a two-step approach: first classify the image as informative or decorative, then provide an appropriate level of detail in the description based on that classification.


source

describe_img

 describe_img (img_path:pathlib.Path, model:str='claude-sonnet-4-5',
               prompt:str='Analyze this image from an academic/technical
               document.\n\nStep 1: Determine if this image is informative
               for understanding the document content.\n- Informative:
               charts, diagrams, tables, technical illustrations,
               experimental results, architectural diagrams\n- Non-
               informative: logos, decorative images, generic photos, page
               backgrounds\n\nStep 2: \n- If informative: Provide a
               detailed description including the type of visualization,
               key elements and their relationships, important data or
               patterns, and relevant technical details.\n- If non-
               informative: Provide a brief label (e.g., "Company logo",
               "Decorative header image")\n\nReturn your response as JSON
               with \'is_informative\' (boolean) and \'description\'
               (string) fields.')

Describe a single image using AsyncChat

|          | Type | Default | Details |
|----------|------|---------|---------|
| img_path | Path |  | Path to the image file |
| model | str | claude-sonnet-4-5 | Model to use |
| prompt | str | Analyze this image from an academic/technical document.<br><br>Step 1: Determine if this image is informative for understanding the document content.<br>- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams<br>- Non-informative: logos, decorative images, generic photos, page backgrounds<br><br>Step 2:<br>- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.<br>- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")<br><br>Return your response as JSON with 'is_informative' (boolean) and 'description' (string) fields. | Prompt for description |
| Returns | ImgDescription |  |  |

We process images asynchronously using AsyncChat to handle multiple images efficiently while respecting API rate limits.

img = Path('files/test/md_fixed/resnet/img/img-0.jpeg')
r = await describe_img(img)
r

{“is_informative”: true, “description”: “Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.”}

  • id: chatcmpl-444fbd64-e128-4f47-b72b-ee254372fa78
  • model: claude-sonnet-4-5-20250929
  • finish_reason: stop
  • usage: Usage(completion_tokens=245, prompt_tokens=605, total_tokens=850, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

To avoid hitting API rate limits when processing multiple images, we use semaphore-based concurrency control with delays between requests.
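
A plausible implementation of this rate limiter, sketched with standard asyncio primitives (not necessarily the exact code):

```python
import asyncio

async def limit_sketch(semaphore: asyncio.Semaphore, coro, delay: float = None):
    "Run a coroutine while holding the semaphore, optionally sleeping before releasing the slot."
    async with semaphore:
        result = await coro
        if delay:
            await asyncio.sleep(delay)
        return result
```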


source

limit

 limit (semaphore, coro, delay:float=None)

Execute coroutine with semaphore-based rate limiting and optional delay

|           | Type | Default | Details |
|-----------|------|---------|---------|
| semaphore |  |  | Semaphore for concurrency control |
| coro |  |  | Coroutine to execute |
| delay | float | None | Optional delay in seconds after execution |

img = Path('files/test/md_fixed/resnet/img/img-0.jpeg')
r = await limit(Semaphore(2), describe_img(img), delay=1)
r

{“is_informative”: true, “description”: “Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.”}

  • id: chatcmpl-74b8a349-87e1-4ceb-bc33-39cd648404e2
  • model: claude-sonnet-4-5-20250929
  • finish_reason: stop
  • usage: Usage(completion_tokens=245, prompt_tokens=605, total_tokens=850, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

source

parse_r

 parse_r (result)

Extract and parse JSON content from model response

|        | Details |
|--------|---------|
| result | ModelResponse object from API call |
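
Since litellm returns an OpenAI-style ModelResponse, parsing amounts to reading the message content and decoding it as JSON. A sketch (error handling omitted; the real parse_r may do more):

```python
import json

def parse_r_sketch(result):
    "Parse the JSON content out of a litellm ModelResponse (simplified sketch)."
    return json.loads(result.choices[0].message.content)
```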

The describe_imgs function orchestrates parallel image processing: it creates async tasks for each image, limits concurrency with a semaphore, adds delays between requests to avoid rate limits, and returns a dictionary mapping filenames to their descriptions for easy lookup during markdown enrichment.
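
Under the same assumptions, the orchestration might look like this sketch built from the limit, describe_img, and parse_r helpers above (exactly where parsing happens may differ in the real implementation):

```python
import asyncio

async def describe_imgs_sketch(imgs, model='claude-sonnet-4-5', semaphore=10, delay=0.1):
    "Describe images concurrently, capping in-flight requests and spacing them out."
    sem = asyncio.Semaphore(semaphore)
    tasks = [limit(sem, describe_img(img, model=model), delay=delay) for img in imgs]
    results = await asyncio.gather(*tasks)
    return {img.name: parse_r(r) for img, r in zip(imgs, results)}
```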


source

describe_imgs

 describe_imgs (imgs:list[pathlib.Path], model:str='claude-sonnet-4-5',
                prompt:str='Analyze this image from an academic/technical
                document.\n\nStep 1: Determine if this image is
                informative for understanding the document content.\n-
                Informative: charts, diagrams, tables, technical
                illustrations, experimental results, architectural
                diagrams\n- Non-informative: logos, decorative images,
                generic photos, page backgrounds\n\nStep 2: \n- If
                informative: Provide a detailed description including the
                type of visualization, key elements and their
                relationships, important data or patterns, and relevant
                technical details.\n- If non-informative: Provide a brief
                label (e.g., "Company logo", "Decorative header
                image")\n\nReturn your response as JSON with
                \'is_informative\' (boolean) and \'description\' (string)
                fields.', semaphore:int=10, delay:float=0.1)

Describe multiple images in parallel with rate limiting

|           | Type | Default | Details |
|-----------|------|---------|---------|
| imgs | list |  | List of image file paths to describe |
| model | str | claude-sonnet-4-5 | Model to use for image description |
| prompt | str | Analyze this image from an academic/technical document.<br><br>Step 1: Determine if this image is informative for understanding the document content.<br>- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams<br>- Non-informative: logos, decorative images, generic photos, page backgrounds<br><br>Step 2:<br>- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.<br>- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")<br><br>Return your response as JSON with 'is_informative' (boolean) and 'description' (string) fields. | Prompt template for description |
| semaphore | int | 10 | Max concurrent API requests |
| delay | float | 0.1 | Delay in seconds between requests |
| Returns | dict |  | Dict mapping filename to parsed description |

descs = await describe_imgs(imgs[:2], semaphore=10, delay=0.1)
descs
{'img-5.jpeg': {'is_informative': True,
  'description': 'Three line graphs showing training error (%) versus iterations (1e4) for different neural network architectures. Left panel compares plain networks (plain-20, plain-32, plain-44, plain-56) showing error rates between 0-20%, with deeper networks performing worse. Middle panel shows ResNet architectures (ResNet-20, ResNet-32, ResNet-44, ResNet-56, ResNet-110) with error rates 0-20%, demonstrating that deeper ResNets achieve lower error rates, with 56-layer and 20-layer performance labeled. Right panel compares residual-110 and residual-1202 models showing error rates 0-20%, with both converging to similar performance around 5-7% error. The graphs illustrate the effectiveness of residual connections in enabling training of deeper networks compared to plain architectures.'},
 'img-0.jpeg': {'is_informative': True,
  'description': 'Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.'}}

source

save_img_descs

 save_img_descs (descs:dict, dst_fname:pathlib.Path)

Save image descriptions to JSON file

|           | Type | Details |
|-----------|------|---------|
| descs | dict | Dictionary of image descriptions |
| dst_fname | Path | Path to save the JSON file |
| Returns | None |  |

We save descriptions to a JSON file so they can be reused without re-processing images, which saves API costs and time. The add_img_descs function checks for this cache file first.
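
Saving the cache is essentially a JSON dump keyed by filename; something like this sketch (the indent level is an assumption based on the file contents shown below):

```python
import json
from pathlib import Path

def save_img_descs_sketch(descs: dict, dst_fname):
    "Write the image-description cache to disk as pretty-printed JSON."
    Path(dst_fname).write_text(json.dumps(descs, indent=2))
```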

save_img_descs(descs, 'files/test/md_fixed/resnet/img_descriptions.json')
Path('files/test/md_fixed/resnet/img_descriptions.json').read_text()[:500]
'{\n  "img-5.jpeg": {\n    "is_informative": true,\n    "description": "This figure contains three line graphs showing training error rates over iterations for different neural network architectures. \\n\\nLeft panel: Shows error rates for plain networks (plain-20, plain-32, plain-44, plain-56) over approximately 6\\u00d710^4 iterations. The curves show varying convergence patterns, with the 56-layer and 20-layer networks achieving lower error rates around 10-13%, while deeper networks show less stable'


Once we have image descriptions, we insert them into the markdown by finding image references and adding formatted description blocks.


source

add_descs_to_pg

 add_descs_to_pg (pg:str, descs:dict)

Add AI-generated descriptions to images in page

|         | Type | Details |
|---------|------|---------|
| pg | str | Page markdown content |
| descs | dict | Dictionary mapping image filenames to their descriptions |
| Returns | str | Page markdown with descriptions added |

Image descriptions are inserted directly after the markdown image reference, wrapped in horizontal rules to visually separate them from the document flow. This preserves the original structure while making descriptions easily identifiable.
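
One way to do this insertion is a regex substitution over markdown image references, appending a description block after each match. A sketch assuming the formatting seen in the output below (the real add_descs_to_pg may differ, e.g. in how it treats non-informative images):

```python
import re
from pathlib import Path

def add_descs_to_pg_sketch(pg: str, descs: dict) -> str:
    "Append a description block after each image reference that has a cached description."
    def _insert(m):
        d = descs.get(Path(m.group(1)).name)
        if not d:
            return m.group(0)
        return (m.group(0) + '\nAI-generated image description:\n___\n'
                + d['description'] + '\n___\n')
    return re.sub(r'!\[[^\]]*\]\(([^)]+)\)', _insert, pg)
```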

pgs = read_pgs('files/test/md_fixed/resnet', join=False)
print(pgs[0][2000:3000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradien
descs = json.loads(Path('files/test/md_fixed/resnet/img_descriptions.json').read_text())
new_pg = add_descs_to_pg(pgs[0], descs)
print(new_pg[2000:4000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.

Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.

Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training error. The 56-layer network (red line) maintains higher test error around 14-15%, while the 20-layer network (yellow/olive line) achieves lower test error around 10-11%.

Key observation: The deeper 56-layer network exhibits worse performance (higher error) than the shallower 20-layer network on both training and test sets, suggesting a degradation problem in very deep networks. This visualization likely illustrates the motivation for residual learning architectures that address the degradation problem in deep neural networks.
___

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanish

We process all pages in batch to efficiently add descriptions throughout the document.


source

add_descs_to_pgs

 add_descs_to_pgs (pgs:list, descs:dict)

Add AI-generated descriptions to images in all pages

|         | Type | Details |
|---------|------|---------|
| pgs | list | List of page markdown strings |
| descs | dict | Dictionary mapping image filenames to their descriptions |
| Returns | list | List of pages with descriptions added |

enriched_pgs = add_descs_to_pgs(pgs, descs)
print(enriched_pgs[0][2000:3000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.

Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.

Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training err

The force parameter controls whether to regenerate image descriptions:

  • force=False (default): loads existing descriptions from img_descriptions.json if it exists, saving time and API costs
  • force=True: regenerates all descriptions by calling the vision LLM, even if cached descriptions exist
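
Roughly, the caching logic behaves like this sketch (load_or_make_descs is a hypothetical helper name; the cache location and exact flow inside add_img_descs may differ):

```python
import json
from pathlib import Path

async def load_or_make_descs(src, imgs, force=False, **kwargs):
    "Hypothetical helper: reuse cached descriptions unless force=True."
    cache = Path(src)/'img_descriptions.json'
    if cache.exists() and not force:
        return json.loads(cache.read_text())
    descs = await describe_imgs(imgs, **kwargs)
    save_img_descs(descs, cache)
    return descs
```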

Important: If dst is the same as src, descriptions will be added to files that may already contain them from previous runs. To avoid duplicate descriptions, either:

  • use a different destination directory each time, or
  • ensure your source markdown files are clean before processing.


source

add_img_descs

 add_img_descs (src:str, dst:str=None, model:str='claude-sonnet-4-5',
                img_folder:str='img', semaphore:int=2, delay:float=1,
                force:bool=False, progress:bool=True)

Describe all images in markdown document and insert descriptions inline

|            | Type | Default | Details |
|------------|------|---------|---------|
| src | str |  | Path to source markdown directory |
| dst | str | None | Destination directory (defaults to src if None) |
| model | str | claude-sonnet-4-5 | Vision model for image description |
| img_folder | str | img | Name of folder containing images |
| semaphore | int | 2 | Max concurrent API requests |
| delay | float | 1 | Delay in seconds between API calls |
| force | bool | False | Force regeneration even if cache exists |
| progress | bool | True | Log progress messages |

Here’s the complete workflow to process a document:

await add_img_descs('files/test/md_fixed/resnet', dst='files/test/md_enriched/resnet', force=True, progress=True)
pgs = read_pgs('files/test/md_enriched/resnet', join=False)
print(pgs[0][2000:4000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also

[^0]![img-0.jpeg](img-0.jpeg)
AI-generated image description:
___
Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.
___

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks