from cachy import enable_cachy
enable_cachy()
Refine
This module refines markdown documents extracted from OCR’d PDFs through two main processes:
Heading Hierarchy Correction: OCR often corrupts document structure, creating inconsistent heading levels (e.g., jumping from H1 to H4, or using H2 for both sections and subsections). We use LLMs to analyze the full heading structure and automatically fix these issues, ensuring proper hierarchical relationships. Page numbers can optionally be added to headings for easier navigation.
Image Description: We classify images as informative (charts, diagrams, tables) or decorative (logos, backgrounds), then generate detailed descriptions for informative images using vision LLMs. These descriptions are inserted directly into the markdown, making visual content searchable and accessible for RAG systems and accessibility tools.
Both processes work incrementally on page-by-page markdown files, caching results to avoid redundant API calls.
Heading Hierarchy
Functions for detecting and fixing markdown heading levels
OCR’d PDF files often have corrupted heading hierarchies - headings may jump levels incorrectly (e.g., from H1 to H4) or use inconsistent levels for sections at the same depth. This section provides tools to automatically detect and fix these issues using LLMs, while also optionally adding page numbers for easier navigation.
The first step is extracting all headings from a markdown document so we can analyze their structure.
get_hdgs
def get_hdgs(
md:str, # Markdown file string
)->L: # L of strings
Return the markdown headings
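The extraction can be sketched with a regex over ATX-style headings (a minimal sketch; the real `get_hdgs` returns a fastcore `L`, plain `list` is used here):

```python
import re

def get_hdgs_sketch(md: str) -> list:
    "Collect ATX-style heading lines (one to six leading #) from a markdown string."
    return [l for l in md.splitlines() if re.match(r'^#{1,6} ', l)]
```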
add_pg_hdgs
def add_pg_hdgs(
md:str, # Markdown file string,
n:int, # Page number
)->str: # Markdown file string
Add page number to all headings in page markdown
The add_pg_hdgs function serves two important purposes:
1. Creating unique heading identifiers
When fixing heading hierarchies across an entire document, we need a way to distinguish between headings that have the same text but appear in different locations. For example, a document might have multiple “Introduction” or “Conclusion” headings in different chapters. By appending the page number to each heading, we create unique identifiers that allow us to build a lookup table mapping each specific heading instance to its corrected version. This assumes (reasonably) that the same heading text won’t appear twice on a single page.
2. Providing spatial context for LLMs
Adding page numbers gives LLMs valuable positional information when analyzing the document structure. The page number helps the model understand:
- Where a heading sits in the overall document flow
- The relative distance between sections
- Whether headings that seem related are actually close together or far apart
This spatial awareness can significantly improve the LLM’s ability to infer the correct hierarchical relationships between headings, especially in long documents where similar section names might appear at different structural levels.
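The appending step can be sketched as a multiline regex substitution (a minimal sketch of the behavior shown below, not the actual implementation):

```python
import re

def add_pg_hdgs_sketch(md: str, n: int) -> str:
    "Append ' ... page N' to every heading line in a page's markdown."
    return re.sub(r'^(#{1,6} .*)$', rf'\1 ... page {n}', md, flags=re.M)
```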
For instance:
pgs = read_pgs('files/test/md_all/attention-is-all-you-need', join=False)
pg0,pg0_with = pgs[0][:100],add_pg_hdgs(pgs[0], n=1)[:100]
print('Before:\n' + 80*'-' + f'\n{pg0}\n\nAfter:\n' + 80*'-' + f'\n{pg0_with}')
Before:
--------------------------------------------------------------------------------
# Attention Is All You Need
Ashish Vaswani*
Google Brain
avaswani@google.com
Noam Shazeer*
Goog
After:
--------------------------------------------------------------------------------
# Attention Is All You Need ... page 1
Ashish Vaswani*
Google Brain
avaswani@google.com
Noam Sha
read_pgs_pg
def read_pgs_pg(
path:str, # Path to the markdown file
)->L: # List of markdown pages
Read all pages of a markdown file and add page numbers to all headings
pgs = read_pgs_pg('files/test/md_all/attention-is-all-you-need')
hdgs = L([get_hdgs(pg) for pg in pgs]).concat()
hdgs
['# Attention Is All You Need ... page 1', '# Abstract ... page 1', '## 2 Background ... page 2', '## 3 Model Architecture ... page 2', '# 3.1 Encoder and Decoder Stacks ... page 3', '# 3.2 Attention ... page 3', '# 3.2.1 Scaled Dot-Product Attention ... page 4', '# 3.2.2 Multi-Head Attention ... page 4', '#### 3.2.3 Applications of Attention in our Model ... page 5', '### 3.3 Position-wise Feed-Forward Networks ... page 5', '### 3.4 Embeddings and Softmax ... page 5', '# 3.5 Positional Encoding ... page 6', '# 4 Why Self-Attention ... page 6', '## 5 Training ... page 7', '### 5.1 Training Data and Batching ... page 7', '### 5.2 Hardware and Schedule ... page 7', '### 5.3 Optimizer ... page 7', '### 5.4 Regularization ... page 7', '# 6 Results ... page 8', '# 6.1 Machine Translation ... page 8', '# 6.2 Model Variations ... page 8', '# 6.3 English Constituency Parsing ... page 9', '# 7 Conclusion ... page 10', '# References ... page 10', '# Attention Visualizations ... page 13']
To make it easier for an LLM to reference specific headings when suggesting fixes, we format them with index numbers.
fmt_hdgs_idx
def fmt_hdgs_idx(
hdgs:list, # List of markdown headings
)->str: # Formatted string with index
Format the headings with index
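The formatting amounts to numbering each heading on its own line (a minimal sketch of the behavior shown below):

```python
def fmt_hdgs_idx_sketch(hdgs: list) -> str:
    "Prefix each heading with its index so the LLM can reference it unambiguously."
    return '\n'.join(f'{i}. {h}' for i, h in enumerate(hdgs))
```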
hdgs_fmt = fmt_hdgs_idx(hdgs)
print(hdgs_fmt)
0. # Attention Is All You Need ... page 1
1. # Abstract ... page 1
2. ## 2 Background ... page 2
3. ## 3 Model Architecture ... page 2
4. # 3.1 Encoder and Decoder Stacks ... page 3
5. # 3.2 Attention ... page 3
6. # 3.2.1 Scaled Dot-Product Attention ... page 4
7. # 3.2.2 Multi-Head Attention ... page 4
8. #### 3.2.3 Applications of Attention in our Model ... page 5
9. ### 3.3 Position-wise Feed-Forward Networks ... page 5
10. ### 3.4 Embeddings and Softmax ... page 5
11. # 3.5 Positional Encoding ... page 6
12. # 4 Why Self-Attention ... page 6
13. ## 5 Training ... page 7
14. ### 5.1 Training Data and Batching ... page 7
15. ### 5.2 Hardware and Schedule ... page 7
16. ### 5.3 Optimizer ... page 7
17. ### 5.4 Regularization ... page 7
18. # 6 Results ... page 8
19. # 6.1 Machine Translation ... page 8
20. # 6.2 Model Variations ... page 8
21. # 6.3 English Constituency Parsing ... page 9
22. # 7 Conclusion ... page 10
23. # References ... page 10
24. # Attention Visualizations ... page 13
We use a Pydantic model to ensure the LLM returns corrections in a structured format: a list of corrections, each mapping a heading index to its corrected version.
HeadingCorrection
def HeadingCorrection(
data:Any
)->None:
A single heading correction mapping an index to its corrected markdown heading
HeadingCorrections
def HeadingCorrections(
data:Any
)->None:
Collection of heading corrections returned by the LLM
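Although the doc generator renders them as functions above, these are Pydantic models. A minimal sketch with the fields the output spec below requires (`index` and `corrected`):

```python
from pydantic import BaseModel

class HeadingCorrection(BaseModel):
    "A single correction: heading index plus its corrected markdown heading."
    index: int
    corrected: str

class HeadingCorrections(BaseModel):
    "Collection of corrections returned by the LLM."
    corrections: list[HeadingCorrection]
```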
This prompt instructs the LLM on what types of heading hierarchy errors to fix while preserving the document’s intended structure. It focuses on three main issues:
- level jumps that skip intermediate levels,
- numbering inconsistencies where subsection depth doesn’t match heading level, and
- decreasing levels (moving back up the hierarchy), which it must preserve as valid.
Now we can use an LLM to automatically detect and fix these heading hierarchy issues. The function uses litellm (wrapped by the Lisette package) to send the formatted headings to a language model along with our correction rules. The LLM analyzes the structure and returns only the headings that need fixing, mapped by their index numbers.
fix_hdg_hierarchy
def fix_hdg_hierarchy(
hdgs:list, # List of markdown headings
prompt:str=None, # Prompt to use (instructions part only)
model:str='claude-sonnet-4-5', # Model to use
api_key:str=None, # API key
reasoning_effort:str=None, # Reasoning effort: 'low', 'medium', or 'high'
)->dict: # Dictionary of index → corrected heading
Fix the heading hierarchy
Exported source
prompt_fix_hdgs = """Fix markdown heading hierarchy errors while preserving the document's intended structure.
INPUT FORMAT: Each heading is prefixed with its index number (e.g., "0. # Title ... page 1")
ANALYSIS STEPS (think through these before outputting corrections):
1. For each numbered heading (e.g., "4.1", "2.a", "A.1"), identify its parent heading (e.g., "4", "2", "A")
2. Verify the child heading is exactly one # deeper than its parent
3. If not, mark it for correction
RULES - Apply these fixes in order:
1. **Single H1 rule**: Documents must have exactly ONE # heading (typically the document title at the top)
- If index 0 is already #, then all subsequent headings (index 1+) must be ## or deeper
- If no H1 exists, the first major heading should be #, and all others ## or deeper
- NO exceptions: appendices, references, and all sections are ## or deeper after the title
2. **Infer depth from numbering patterns**: If headings contain section numbers, deeper nesting means deeper heading level
- Parent section (e.g., "1", "2", "A") MUST be shallower than child (e.g., "1.1", "2.a", "A.1")
- Child section MUST be exactly one # deeper than parent
- Works with any numbering: "1/1.1/1.1.1", "A/A.1/A.1.a", "I/I.A/I.A.1", etc.
3. **Level jumps**: Headings can only increase by one # at a time when moving deeper
- Wrong: ## Section → ##### Subsection
- Fixed: ## Section → ### Subsection
4. **Decreasing levels is OK**: Moving back up the hierarchy (### to ##) is valid for new sections
5. **Unnumbered headings in numbered documents**: If the document uses numbered headings consistently, any unnumbered heading appearing within that structure is likely misclassified bold text and should be converted to regular text (output the heading text without any # symbols in the corrected field)
"""
Exported source
prompt_fix_hdgs_suffix = """
OUTPUT: Return a list of corrections, where each correction has:
- index: the heading's index number
- corrected: the fixed heading text (without the index prefix), or empty string "" to remove the heading entirely
IMPORTANT: Preserve the " ... page N" suffix in all corrected headings.
Only include headings that need changes.
Headings to analyze:
{headings_list}
"""
fixes = fix_hdg_hierarchy(hdgs)
fixes
{1: '## Abstract ... page 1',
4: '### 3.1 Encoder and Decoder Stacks ... page 3',
5: '### 3.2 Attention ... page 3',
6: '#### 3.2.1 Scaled Dot-Product Attention ... page 4',
7: '#### 3.2.2 Multi-Head Attention ... page 4',
8: '##### 3.2.3 Applications of Attention in our Model ... page 5',
9: '### 3.3 Position-wise Feed-Forward Networks ... page 5',
10: '### 3.4 Embeddings and Softmax ... page 5',
11: '### 3.5 Positional Encoding ... page 6',
12: '## 4 Why Self-Attention ... page 6',
13: '## 5 Training ... page 7',
14: '### 5.1 Training Data and Batching ... page 7',
15: '### 5.2 Hardware and Schedule ... page 7',
16: '### 5.3 Optimizer ... page 7',
17: '### 5.4 Regularization ... page 7',
18: '## 6 Results ... page 8',
19: '### 6.1 Machine Translation ... page 8',
20: '### 6.2 Model Variations ... page 8',
21: '### 6.3 English Constituency Parsing ... page 9',
22: '## 7 Conclusion ... page 10',
23: '## References ... page 10',
24: '## Attention Visualizations ... page 13'}
The corrections come back keyed by heading index, but for easy replacement in the document we need a lookup table from each original heading to its corrected version.
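The conversion step inside mk_fixes_lut can be sketched as follows (taking already-computed fixes as input; the real function also calls the LLM itself):

```python
def mk_fixes_lut_sketch(hdgs: list, fixes: dict) -> dict:
    "Turn an index -> corrected-heading dict into an old-heading -> corrected-heading dict."
    return {hdgs[int(i)]: new for i, new in fixes.items()}
```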
mk_fixes_lut
def mk_fixes_lut(
hdgs:list, # List of markdown headings
model:str='claude-sonnet-4-5', # Model to use
api_key:str=None, # API key
prompt:str=None, # Prompt to use (instructions part only)
reasoning_effort:str=None, # Reasoning effort: 'low', 'medium', or 'high'
)->dict: # Dictionary of old → new heading
Make a lookup table of fixes
lut_fixes = mk_fixes_lut(hdgs)
lut_fixes
{'# Abstract ... page 1': '## Abstract ... page 1',
'# 3.1 Encoder and Decoder Stacks ... page 3': '### 3.1 Encoder and Decoder Stacks ... page 3',
'# 3.2 Attention ... page 3': '### 3.2 Attention ... page 3',
'# 3.2.1 Scaled Dot-Product Attention ... page 4': '#### 3.2.1 Scaled Dot-Product Attention ... page 4',
'# 3.2.2 Multi-Head Attention ... page 4': '#### 3.2.2 Multi-Head Attention ... page 4',
'#### 3.2.3 Applications of Attention in our Model ... page 5': '##### 3.2.3 Applications of Attention in our Model ... page 5',
'### 3.3 Position-wise Feed-Forward Networks ... page 5': '### 3.3 Position-wise Feed-Forward Networks ... page 5',
'### 3.4 Embeddings and Softmax ... page 5': '### 3.4 Embeddings and Softmax ... page 5',
'# 3.5 Positional Encoding ... page 6': '### 3.5 Positional Encoding ... page 6',
'# 4 Why Self-Attention ... page 6': '## 4 Why Self-Attention ... page 6',
'## 5 Training ... page 7': '## 5 Training ... page 7',
'### 5.1 Training Data and Batching ... page 7': '### 5.1 Training Data and Batching ... page 7',
'### 5.2 Hardware and Schedule ... page 7': '### 5.2 Hardware and Schedule ... page 7',
'### 5.3 Optimizer ... page 7': '### 5.3 Optimizer ... page 7',
'### 5.4 Regularization ... page 7': '### 5.4 Regularization ... page 7',
'# 6 Results ... page 8': '## 6 Results ... page 8',
'# 6.1 Machine Translation ... page 8': '### 6.1 Machine Translation ... page 8',
'# 6.2 Model Variations ... page 8': '### 6.2 Model Variations ... page 8',
'# 6.3 English Constituency Parsing ... page 9': '### 6.3 English Constituency Parsing ... page 9',
'# 7 Conclusion ... page 10': '## 7 Conclusion ... page 10',
'# References ... page 10': '## References ... page 10',
'# Attention Visualizations ... page 13': '## Attention Visualizations ... page 13'}
Now we can apply the fixes to individual pages. We optionally add page numbers to headings for easier navigation in the final document.
apply_hdg_fixes
def apply_hdg_fixes(
p:str, # Page to fix
lut_fixes:dict, # Lookup table of fixes
)->str: # Page with fixes applied
Apply the fixes to the page
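Since each heading occupies its own line, applying the lookup table can be sketched as a line-by-line substitution (a minimal sketch, not the actual implementation):

```python
def apply_hdg_fixes_sketch(p: str, lut_fixes: dict) -> str:
    "Replace each line that appears as a key in the lookup table with its corrected form."
    return '\n'.join(lut_fixes.get(l, l) for l in p.splitlines())
```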
pg_nb = 1
p = read_pgs_pg('files/test/md_all/resnet')[0]
print(apply_hdg_fixes(p, lut_fixes)[:300])
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framework to
Finally, we tie everything together in a single function that processes an entire document directory, fixing all heading hierarchy issues and optionally adding page numbers.
fix_hdgs
def fix_hdgs(
src:str, # Path to source markdown directory
model:str='claude-sonnet-4-5', # LLM model to use for heading analysis
dst:str=None, # Path to destination directory (defaults to src, modifying in place)
img_folder:str='img', # Name of image subfolder to copy
api_key:str=None, # API key
prompt:str=None, # Prompt to use (instructions part only)
reasoning_effort:str=None, # Reasoning effort: 'low', 'medium', or 'high'
)->None:
Fix heading hierarchy in markdown document
fix_hdgs('files/test/md_all/resnet', dst='files/test/md_fixed/resnet')
!ls -R 'files/test/md_fixed/resnet'
files/test/md_fixed/resnet:
img page_10.md page_2.md page_5.md page_8.md
img_descriptions.json page_11.md page_3.md page_6.md page_9.md
page_1.md page_12.md page_4.md page_7.md
files/test/md_fixed/resnet/img:
img-0.jpeg img-10.jpeg img-2.jpeg img-4.jpeg img-6.jpeg img-8.jpeg
img-1.jpeg img-11.jpeg img-3.jpeg img-5.jpeg img-7.jpeg img-9.jpeg
md = read_pgs('files/test/md_fixed/resnet')
print(md[:500])
# Deep Residual Learning for Image Recognition ... page 1
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
## Abstract ... page 1
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead
Image Description
Tools for classifying and describing images in markdown documents
imgs = Path('files/test/md_fixed/resnet/img').ls(file_exts='.jpeg')
imgs
(#7) [Path('files/test/md_fixed/resnet/img/img-5.jpeg'),Path('files/test/md_fixed/resnet/img/img-0.jpeg'),Path('files/test/md_fixed/resnet/img/img-1.jpeg'),Path('files/test/md_fixed/resnet/img/img-2.jpeg'),Path('files/test/md_fixed/resnet/img/img-3.jpeg'),Path('files/test/md_fixed/resnet/img/img-6.jpeg'),Path('files/test/md_fixed/resnet/img/img-4.jpeg')]
ImgDescription
def ImgDescription(
data:Any
)->None:
Image classification and description for OCR’d documents
The two-field Pydantic model filters decorative images via is_informative while providing rich description context for downstream RAG systems.
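A minimal sketch of this model, with the two fields the prompt below asks the LLM to return:

```python
from pydantic import BaseModel

class ImgDescription(BaseModel):
    "Image classification and description for OCR'd documents."
    is_informative: bool  # False for logos, backgrounds, and other decorative images
    description: str      # detailed for informative images, a brief label otherwise
```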
The prompt uses a two-step approach: first classify the image as informative or decorative, then provide an appropriate level of detail in the description based on that classification.
describe_img
def describe_img(
img_path:Path, # Path to the image file
model:str='claude-sonnet-4-5', # Model to use
prompt:str='Analyze this image from an academic/technical document.\n\nStep 1: Determine if this image is informative for understanding the document content.\n- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams\n- Non-informative: logos, decorative images, generic photos, page backgrounds\n\nStep 2: \n- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.\n- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")\n\nReturn your response as JSON with \'is_informative\' (boolean) and \'description\' (string) fields.', # Prompt for description
)->ImgDescription:
Describe a single image using AsyncChat
We process images asynchronously using AsyncChat to handle multiple images efficiently while respecting API rate limits.
img = Path('files/test/md_fixed/resnet/img/img-0.jpeg')
r = await describe_img(img)
r
{"is_informative": true, "description": "Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets."}
- id: chatcmpl-444fbd64-e128-4f47-b72b-ee254372fa78
- model: claude-sonnet-4-5-20250929
- finish_reason: stop
- usage: Usage(completion_tokens=245, prompt_tokens=605, total_tokens=850, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)
To avoid hitting API rate limits when processing multiple images, we use semaphore-based concurrency control with delays between requests.
limit
def limit(
semaphore, # Semaphore for concurrency control
coro, # Coroutine to execute
delay:float=None, # Optional delay in seconds after execution
):
Execute coroutine with semaphore-based rate limiting and optional delay
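This pattern can be sketched with asyncio primitives (a minimal sketch; the real implementation may differ in details):

```python
import asyncio

async def limit_sketch(semaphore, coro, delay=None):
    "Run coro while holding the semaphore, optionally sleeping before releasing it."
    async with semaphore:
        res = await coro
        if delay: await asyncio.sleep(delay)
        return res
```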
img = Path('files/test/md_fixed/resnet/img/img-0.jpeg')
r = await limit(Semaphore(2), describe_img(img), delay=1)
r
{"is_informative": true, "description": "Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets."}
- id: chatcmpl-74b8a349-87e1-4ceb-bc33-39cd648404e2
- model: claude-sonnet-4-5-20250929
- finish_reason: stop
- usage: Usage(completion_tokens=245, prompt_tokens=605, total_tokens=850, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)
parse_r
def parse_r(
result, # ModelResponse object from API call
):
Extract and parse JSON content from model response
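Models sometimes wrap JSON in markdown fences, so a tolerant parse helps. A minimal sketch operating on the reply's content string (the real parse_r takes the full ModelResponse object):

```python
import json, re

def parse_json_content(content: str) -> dict:
    "Pull the first JSON object out of a model reply, tolerating ```json fences around it."
    m = re.search(r'\{.*\}', content, flags=re.S)
    if not m: raise ValueError('no JSON object found in response')
    return json.loads(m.group(0))
```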
The describe_imgs function orchestrates parallel image processing: it creates async tasks for each image, limits concurrency with a semaphore, adds delays between requests to avoid rate limits, and returns a dictionary mapping filenames to their descriptions for easy lookup during markdown enrichment.
describe_imgs
def describe_imgs(
imgs:list, # List of image file paths to describe
model:str='claude-sonnet-4-5', # Model to use for image description
prompt:str='Analyze this image from an academic/technical document.\n\nStep 1: Determine if this image is informative for understanding the document content.\n- Informative: charts, diagrams, tables, technical illustrations, experimental results, architectural diagrams\n- Non-informative: logos, decorative images, generic photos, page backgrounds\n\nStep 2: \n- If informative: Provide a detailed description including the type of visualization, key elements and their relationships, important data or patterns, and relevant technical details.\n- If non-informative: Provide a brief label (e.g., "Company logo", "Decorative header image")\n\nReturn your response as JSON with \'is_informative\' (boolean) and \'description\' (string) fields.', # Prompt template for description
semaphore:int=10, # Max concurrent API requests
delay:float=0.1, # Delay in seconds between requests
)->dict: # Dict mapping filename to parsed description
Describe multiple images in parallel with rate limiting
descs = await describe_imgs(imgs[:2], semaphore=10, delay=0.1)
descs
{'img-5.jpeg': {'is_informative': True,
'description': 'Three line graphs showing training error (%) versus iterations (1e4) for different neural network architectures. Left panel compares plain networks (plain-20, plain-32, plain-44, plain-56) showing error rates between 0-20%, with deeper networks performing worse. Middle panel shows ResNet architectures (ResNet-20, ResNet-32, ResNet-44, ResNet-56, ResNet-110) with error rates 0-20%, demonstrating that deeper ResNets achieve lower error rates, with 56-layer and 20-layer performance labeled. Right panel compares residual-110 and residual-1202 models showing error rates 0-20%, with both converging to similar performance around 5-7% error. The graphs illustrate the effectiveness of residual connections in enabling training of deeper networks compared to plain architectures.'},
'img-0.jpeg': {'is_informative': True,
'description': 'Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.'}}
save_img_descs
def save_img_descs(
descs:dict, # Dictionary of image descriptions
dst_fname:Path, # Path to save the JSON file
)->None:
Save image descriptions to JSON file
We save descriptions to a JSON file so they can be reused without re-processing images, which saves API costs and time. The add_img_descs function checks for this cache file first.
save_img_descs(descs, 'files/test/md_fixed/resnet/img_descriptions.json')
Path('files/test/md_fixed/resnet/img_descriptions.json').read_text()[:500]'{\n "img-5.jpeg": {\n "is_informative": true,\n "description": "This figure contains three line graphs showing training error rates over iterations for different neural network architectures. \\n\\nLeft panel: Shows error rates for plain networks (plain-20, plain-32, plain-44, plain-56) over approximately 6\\u00d710^4 iterations. The curves show varying convergence patterns, with the 56-layer and 20-layer networks achieving lower error rates around 10-13%, while deeper networks show less stable'
Once we have image descriptions, we insert them into the markdown by finding image references and adding formatted description blocks.
add_descs_to_pg
def add_descs_to_pg(
pg:str, # Page markdown content
descs:dict, # Dictionary mapping image filenames to their descriptions
)->str: # Page markdown with descriptions added
Add AI-generated descriptions to images in page
Image descriptions are inserted directly after the markdown image reference, wrapped in horizontal rules to visually separate them from the document flow. This preserves the original structure while making descriptions easily identifiable.
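Assuming the page references images with standard markdown syntax, the insertion can be sketched as follows (a hypothetical sketch; the real function's matching and block format may differ):

```python
import re

def add_descs_to_pg_sketch(pg: str, descs: dict) -> str:
    "Append a description block after each markdown image reference found in descs."
    def repl(m):
        name = m.group(1).rsplit('/', 1)[-1]   # filename is the lookup key
        d = descs.get(name)
        if not d or not d.get('is_informative'): return m.group(0)  # skip decorative images
        return f"{m.group(0)}\n\nAI-generated image description:\n___\n{d['description']}\n___"
    return re.sub(r'!\[[^\]]*\]\(([^)]+)\)', repl, pg)
```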
pgs = read_pgs('files/test/md_fixed/resnet', join=False)
print(pgs[0][2000:3000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradien
descs = json.loads(Path('files/test/md_fixed/resnet/img_descriptions.json').read_text())
new_pg = add_descs_to_pg(pgs[0], descs)
print(new_pg[2000:4000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.
Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.
Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training error. The 56-layer network (red line) maintains higher test error around 14-15%, while the 20-layer network (yellow/olive line) achieves lower test error around 10-11%.
Key observation: The deeper 56-layer network exhibits worse performance (higher error) than the shallower 20-layer network on both training and test sets, suggesting a degradation problem in very deep networks. This visualization likely illustrates the motivation for residual learning architectures that address the degradation problem in deep neural networks.
___
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanish
We process all pages in batch to efficiently add descriptions throughout the document.
add_descs_to_pgs
def add_descs_to_pgs(
pgs:list, # List of page markdown strings
descs:dict, # Dictionary mapping image filenames to their descriptions
)->list: # List of pages with descriptions added
Add AI-generated descriptions to images in all pages
enriched_pgs = add_descs_to_pgs(pgs, descs)
print(enriched_pgs[0][2000:3000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
AI-generated image description:
___
This figure contains two side-by-side line graphs comparing training and test error rates for neural networks with different layer depths (56-layer and 20-layer) over training iterations.
Left panel (Training Error): Shows training error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). The 56-layer network (red line) starts around 19% error and maintains higher error throughout training, ending around 6-7%. The 20-layer network (yellow/olive line) starts around 20% error and decreases more smoothly to approximately 3-4% by the end of training.
Right panel (Test Error): Shows test error percentage (y-axis, 0-20%) versus iterations (x-axis, 0-6 ×10^4). Both networks show similar patterns to training err
The force parameter controls whether to regenerate image descriptions:
- force=False (default): loads existing descriptions from img_descriptions.json if it exists, saving time and API costs
- force=True: regenerates all descriptions by calling the vision LLM, even if cached descriptions exist
Important: If dst is the same as src, descriptions will be added to files that may already contain them from previous runs. To avoid duplicate descriptions, either:
- Use a different destination directory each time, or
- Ensure your source markdown files are clean before processing
add_img_descs
def add_img_descs(
src:str, # Path to source markdown directory
dst:str=None, # Destination directory (defaults to src if None)
model:str='claude-sonnet-4-5', # Vision model for image description
img_folder:str='img', # Name of folder containing images
semaphore:int=2, # Max concurrent API requests
delay:float=1, # Delay in seconds between API calls
force:bool=False, # Force regeneration even if cache exists
progress:bool=True, # Log progress messages
):
Describe all images in markdown document and insert descriptions inline
Here’s the complete workflow to process a document:
await add_img_descs('files/test/md_fixed/resnet', dst='files/test/md_enriched/resnet', force=True, progress=True)
__main__ - INFO - Describing 7 images...
__main__ - INFO - Saved descriptions to files/test/md_fixed/resnet/img_descriptions.json
__main__ - INFO - Adding descriptions to 12 pages...
__main__ - INFO - Done! Enriched pages saved to files/test/md_enriched/resnet
pgs = read_pgs('files/test/md_enriched/resnet', join=False)
print(pgs[0][2000:4000])
3,16]$ on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks $[8,12,7,32,27]$ have also
[^0]
AI-generated image description:
___
Two side-by-side line graphs comparing training error (left) and test error (right) versus iteration number for neural networks with different layer depths. Both graphs show performance over approximately 60,000 iterations (6×10^4). The y-axis represents error percentage (0-20%), and the x-axis shows iterations in scientific notation. Two lines are plotted in each graph: a red line representing a 56-layer network and a yellow/green line representing a 20-layer network. In the training error plot, both networks show decreasing error over time, with the 20-layer network achieving lower training error (~2-3%) compared to the 56-layer network (~5-6%). In the test error plot, both networks show similar patterns with the 20-layer network achieving slightly better test error (~10-11%) compared to the 56-layer network (~13-14%). This visualization demonstrates the degradation problem in deep neural networks where deeper networks (56-layer) perform worse than shallower ones (20-layer) on both training and test sets.
___
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.
greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization $[23,9,37,13]$ and intermediate normalization layers [16], which enable networks