Extract

Report Sections Extraction

This module provides tools for automatically identifying and extracting core sections from evaluation reports. When working with large reports (50-200+ pages), we need to focus on key sections, such as executive summaries, introductions, conclusions, and recommendations, to support tagging and mapping exercises against evaluation frameworks (e.g., SRF, GCM).

Focusing on these sections helps:

Reduce noise from tangential content
Prevent confirmation bias by avoiding the tendency to flag any passing mention as a theme

The approach uses an LLM to parse a report’s table of contents, identify which sections contain substantive thematic content, and extract just those sections for further processing.

CoreSectionsOutput


def CoreSectionsOutput(
    data:Any
)->None:

Identify the core sections of the report

Exported source

class CoreSectionsOutput(BaseModel):
    "Identify the core sections of the report"
    section_paths: list[list[str]]
    reasoning: str

For instance, given a markdown document:

from mistocr.core import read_pgs

!ls ../../pipeline/data/md/4341695461234eee3deb51ac68871109

img     page_16.md  page_23.md  page_30.md  page_38.md  page_45.md
page_1.md   page_17.md  page_24.md  page_31.md  page_39.md  page_46.md
page_10.md  page_18.md  page_25.md  page_32.md  page_4.md   page_47.md
page_11.md  page_19.md  page_26.md  page_33.md  page_40.md  page_5.md
page_12.md  page_2.md   page_27.md  page_34.md  page_41.md  page_6.md
page_13.md  page_20.md  page_28.md  page_35.md  page_42.md  page_7.md
page_14.md  page_21.md  page_29.md  page_36.md  page_43.md  page_8.md
page_15.md  page_22.md  page_3.md   page_37.md  page_44.md  page_9.md

fname = "../../pipeline/data/md/4341695461234eee3deb51ac68871109"
sample_md = read_pgs(fname, join=True)

# For a self-contained walkthrough, override sample_md with a small synthetic report
sample_md = """# Report Title ... page 1

## Executive Summary ... page 1

This is a summary of key findings.

## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...

## 2. Findings ... page 3

Detailed findings.

## 3. Conclusions ... page 5

Main conclusions.

## 4. Recommendations ... page 6

Key recommendations."""

from toolslm.md_hier import create_heading_dict

hdgs = create_heading_dict(sample_md)
hdgs

{'Report Title ... page 1': {'Executive Summary ... page 1': {},
  '1. Introduction ... page 2': {'1.1 Objectives ... page 2': {}},
  '2. Findings ... page 3': {},
  '3. Conclusions ... page 5': {},
  '4. Recommendations ... page 6': {}}}

from mistocr.refine import fmt_hdgs_idx

print(fmt_hdgs_idx(hdgs))

0. Report Title ... page 1

import re

re.findall(r'^#+\s+.*$', sample_md, re.MULTILINE)

['# Report Title ... page 1',
 '## Executive Summary ... page 1',
 '## 1. Introduction ... page 2',
 '### 1.1 Objectives ... page 2',
 '## 2. Findings ... page 3',
 '## 3. Conclusions ... page 5',
 '## 4. Recommendations ... page 6']

From Raw Headings to Dictionary Paths

The create_heading_dict function (from toolslm.md_hier) parses a markdown document and returns a nested dictionary representing its heading structure. Each key is a heading title, and nested dictionaries represent subsections.

When users or an LLM identify sections by their raw markdown headings (e.g., "## 1. Introduction ... page 4"), we need to convert these into paths that can navigate this nested structure. This involves:

Stripping the markdown prefix (##) to get the dictionary key
Finding the full path to that key in the nested hierarchy

heading_to_key


def heading_to_key(
    hdg:str, # Raw markdown heading like "## 1. Intro ..."
)->str: # Heading text without markdown prefix

Strip markdown heading prefix: ‘## 1. Intro …’ → ‘1. Intro …’

assert heading_to_key("## 1. Introduction ... page 4") == "1. Introduction ... page 4"
assert heading_to_key("### 2.1. Context ... page 5") == "2.1. Context ... page 5"
assert heading_to_key("# Title") == "Title"
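The module's own implementation is not shown on this page. As a minimal sketch of the prefix stripping described above, assuming a regular expression is an acceptable approach (the name `heading_to_key_sketch` is hypothetical), it could look like this:

```python
import re

def heading_to_key_sketch(hdg: str) -> str:
    """Hypothetical stand-in: drop the leading run of '#' and surrounding whitespace."""
    # ^\s*#+\s* matches optional indentation, one or more '#', then any spaces
    return re.sub(r"^\s*#+\s*", "", hdg).strip()

print(heading_to_key_sketch("## 1. Introduction ... page 4"))
# prints: 1. Introduction ... page 4
```

Only the leading `#` run is removed, so a `#` appearing later in the title is left untouched.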

find_path


def find_path(
    hdgs:dict, # Nested dictionary of headings
    key:str, # Heading key to find
)->list: # Full path to key, or empty list if not found

Find full path to a heading key in nested dict

sample_hdgs = create_heading_dict(sample_md)
assert find_path(sample_hdgs, "1. Introduction ... page 2") == [
    "Report Title ... page 1",
    "1. Introduction ... page 2"
]
assert find_path(sample_hdgs, "1.1 Objectives ... page 2") == [
    "Report Title ... page 1",
    "1. Introduction ... page 2",
    "1.1 Objectives ... page 2"
]
assert find_path(sample_hdgs, "nonexistent") == []
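One plausible way to obtain this behavior is a depth-first search over the nested dictionary. The sketch below assumes plain nested dicts and uses the hypothetical name `find_path_sketch`; it is not necessarily how `find_path` is written.

```python
def find_path_sketch(hdgs: dict, key: str) -> list:
    """Hypothetical stand-in: depth-first search for `key`, returning the keys leading to it."""
    for k, children in hdgs.items():
        if k == key:
            return [k]
        # Recurse into the subsection dict; a non-empty result means a match further down
        sub = find_path_sketch(children, key)
        if sub:
            return [k] + sub
    return []  # key not present at this level or below

toc = {"Report": {"1. Intro": {"1.1 Objectives": {}}, "2. Findings": {}}}
print(find_path_sketch(toc, "1.1 Objectives"))
# prints: ['Report', '1. Intro', '1.1 Objectives']
```

If the same heading title occurs more than once in a report, this sketch returns the first match in document order.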

When the LLM or an end user identifies core sections, they might select both a parent section and its children (e.g., “Introduction” and “Introduction > Objectives”). To avoid duplicate content, we filter out any paths that are children of other selected paths.

rm_nested


def rm_nested(
    paths:list, # List of section paths, where each path is a list of keys
)->list: # Filtered list with nested paths removed

Remove paths that are children of other paths in the list

nested_paths = [
    ['Report Title ... page 1', '1. Introduction ... page 2'],
    ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'],
    ['Report Title ... page 1', '3. Conclusions ... page 5']
]

rm_nested(nested_paths)

[['Report Title ... page 1', '1. Introduction ... page 2'],
 ['Report Title ... page 1', '3. Conclusions ... page 5']]
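The filtering rule amounts to a prefix test: a path is dropped when another selected path is a strictly shorter prefix of it. A minimal sketch under that reading (the name `rm_nested_sketch` is hypothetical, and the real function may differ in details such as how exact duplicates are handled):

```python
def rm_nested_sketch(paths: list) -> list:
    """Hypothetical stand-in: keep paths that have no other selected path as a strict prefix."""
    def is_child(p, q):
        # p is a child of q when q is shorter and matches the start of p
        return len(q) < len(p) and p[:len(q)] == q
    return [p for p in paths if not any(is_child(p, q) for q in paths)]

selected = [
    ["Report", "1. Intro"],
    ["Report", "1. Intro", "1.1 Objectives"],
    ["Report", "3. Conclusions"],
]
print(rm_nested_sketch(selected))
# prints: [['Report', '1. Intro'], ['Report', '3. Conclusions']]
```

The original order of the surviving paths is preserved, which keeps the extracted text in document order.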

headings_to_paths


def headings_to_paths(
    hdgs:dict, # Nested dictionary of headings
    selected_headings:list, # Raw markdown headings like ["## 1. Intro ...", ...]
)->list: # Deduplicated paths through heading hierarchy

Convert raw headings to deduplicated paths

# From a nested heading, we get the full path for text extraction
selected = ["### 1.1 Objectives ... page 2"]
paths = headings_to_paths(sample_hdgs, selected)
assert paths == [[
    "Report Title ... page 1",
    "1. Introduction ... page 2", 
    "1.1 Objectives ... page 2"
]]

Navigating Nested Headings

Reports have a hierarchical structure (sections, subsections, etc.). We represent this as a nested dictionary using create_heading_dict from toolslm.md_hier. To extract text from a specific section, we need to navigate through this hierarchy using a path of keys.

get_text


def get_text(
    ks:list, # List of exact key strings forming path through nested dict
    hdgs:dict, # Nested dictionary of headings created by `create_heading_dict`
)->str: # Extracted markdown text for the section

Navigate through nested heading levels and return the text content

path = ['Report Title ... page 1', '3. Conclusions ... page 5']
print(get_text(path, hdgs))

## 3. Conclusions ... page 5

Main conclusions.
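The navigation step can be illustrated with a toy structure. The sketch below assumes that each node of the heading dictionary exposes its section text through a `text` attribute; the `Node` class and `get_text_sketch` are hypothetical stand-ins written for this example, not the library's types.

```python
from functools import reduce

class Node(dict):
    """Hypothetical node: child headings as dict items, section text as an attribute."""
    def __init__(self, text="", children=None):
        super().__init__(children or {})
        self.text = text

def get_text_sketch(ks: list, hdgs: dict) -> str:
    # Walk down one key per level; a KeyError signals a path that does not exist
    node = reduce(lambda d, k: d[k], ks, hdgs)
    return node.text

toc = Node(children={
    "Report": Node("# Report\n\n## 3. Conclusions\n\nMain conclusions.", {
        "3. Conclusions": Node("## 3. Conclusions\n\nMain conclusions."),
    })
})
print(get_text_sketch(["Report", "3. Conclusions"], toc))
```

Because the keys must match exactly, a path built from a slightly different heading string raises `KeyError` rather than silently returning the wrong section.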

LLM-Based Section Identification

Rather than using rigid pattern matching, we use an LLM to identify core sections. This handles multilingual reports, varied naming conventions, and unusual structures. The LLM receives the table of contents as a nested dictionary and returns paths to the most relevant sections.

identify_core_sections


def identify_core_sections(
    hdgs:dict, # Nested dictionary of report headings from `create_heading_dict`
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->dict: # Dictionary with 'section_paths' and 'reasoning' keys

Use LLM to identify core sections (exec summary, intro, conclusions, recommendations) from ToC

sections = identify_core_sections(hdgs)
sections

{'section_paths': [['Report Title ... page 1', 'Executive Summary ... page 1'],
  ['Report Title ... page 1', '1. Introduction ... page 2'],
  ['Report Title ... page 1',
   '1. Introduction ... page 2',
   '1.1 Objectives ... page 2'],
  ['Report Title ... page 1', '3. Conclusions ... page 5'],
  ['Report Title ... page 1', '4. Recommendations ... page 6']],
 'reasoning': "Selected all five key sections that reveal core themes: Executive Summary (page 1) provides high-level overview of what authors consider important; Introduction and Objectives (pages 2) establish the evaluation's purpose and scope; Conclusions (page 5) synthesize key findings and judgments; Recommendations (page 6) translate conclusions into actionable guidance. These sections total approximately 6 pages and represent the most strategic content for determining core themes. Excluded Findings section as it likely contains detailed evidence that supports rather than defines the core themes, though the selected sections should provide sufficient context about what findings matter most."}

Putting It All Together

The main entry point extract_sections combines all the pieces: it parses the report’s heading structure, identifies core sections (either via LLM or user-provided headings), removes nested duplicates to avoid repetition, and extracts the concatenated text ready for downstream processing.

get_h1_title


def get_h1_title(
    hdgs:dict
)->str:

Get the main title (first h1 heading) from heading dict
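Since Python dictionaries preserve insertion order, the first top-level key of the heading dictionary corresponds to the first h1 heading. A minimal sketch on that assumption follows; the name `get_h1_title_sketch` is hypothetical, and returning an empty string for an empty dictionary is a choice made for this example rather than documented behavior.

```python
def get_h1_title_sketch(hdgs: dict) -> str:
    """Hypothetical stand-in: return the first top-level key of the heading dict."""
    # iter(hdgs) yields keys in insertion order; fall back to "" when there are none
    return next(iter(hdgs), "")

toc = {"Report Title ... page 1": {"Executive Summary ... page 1": {}}}
print(get_h1_title_sketch(toc))
# prints: Report Title ... page 1
```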

extract_sections


def extract_sections(
    md:str, # Markdown text of full report
    selected_headings:list=None, # Raw headings like ["## 1. Intro ...", ...]; if None, uses LLM
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->str: # Concatenated text of all core sections

Extract core sections from report. Uses LLM auto-detection if selected_headings is None, otherwise uses provided headings.

text = extract_sections(sample_md, selected_headings=["## 1. Introduction ... page 2"])
assert "# Report Title ... page 1" in text
assert "Background information here" in text
assert "1.1 Objectives" in text
print(text)

# Report Title ... page 1
## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...

text = extract_sections(sample_md, selected_headings=[])
assert text == ""
text

''