Extract

Report Sections Extraction

This module provides tools for automatically identifying and extracting core sections from evaluation reports. When working with large reports (50-200+ pages), we need to focus on key sections—such as executive summaries, introductions, conclusions, and recommendations—to support tagging and mapping exercises against evaluation frameworks (e.g., SRF, GCM).

Focusing on these sections helps:

The approach uses an LLM to parse a report’s table of contents, identify which sections contain substantive thematic content, and extract just those sections for further processing.



CoreSectionsOutput


def CoreSectionsOutput(
    data:Any
)->None:

Identify the core sections of the report

Exported source
class CoreSectionsOutput(BaseModel):
    "Identify the core sections of the report"
    section_paths: list[list[str]]
    reasoning: str

For instance, given a report in markdown:

from mistocr.core import read_pgs
!ls ../../pipeline/data/md/4341695461234eee3deb51ac68871109
img     page_16.md  page_23.md  page_30.md  page_38.md  page_45.md
page_1.md   page_17.md  page_24.md  page_31.md  page_39.md  page_46.md
page_10.md  page_18.md  page_25.md  page_32.md  page_4.md   page_47.md
page_11.md  page_19.md  page_26.md  page_33.md  page_40.md  page_5.md
page_12.md  page_2.md   page_27.md  page_34.md  page_41.md  page_6.md
page_13.md  page_20.md  page_28.md  page_35.md  page_42.md  page_7.md
page_14.md  page_21.md  page_29.md  page_36.md  page_43.md  page_8.md
page_15.md  page_22.md  page_3.md   page_37.md  page_44.md  page_9.md
fname = "../../pipeline/data/md/4341695461234eee3deb51ac68871109"
sample_md = read_pgs(fname, join=True)
sample_md = """# Report Title ... page 1

## Executive Summary ... page 1

This is a summary of key findings.

## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...

## 2. Findings ... page 3

Detailed findings.

## 3. Conclusions ... page 5

Main conclusions.

## 4. Recommendations ... page 6

Key recommendations."""
from toolslm.md_hier import create_heading_dict
hdgs = create_heading_dict(sample_md)
hdgs
{'Report Title ... page 1': {'Executive Summary ... page 1': {},
  '1. Introduction ... page 2': {'1.1 Objectives ... page 2': {}},
  '2. Findings ... page 3': {},
  '3. Conclusions ... page 5': {},
  '4. Recommendations ... page 6': {}}}
from mistocr.refine import fmt_hdgs_idx
print(fmt_hdgs_idx(hdgs))
0. Report Title ... page 1
import re
re.findall(r'^#+\s+.*$', sample_md, re.MULTILINE)
['# Report Title ... page 1',
 '## Executive Summary ... page 1',
 '## 1. Introduction ... page 2',
 '### 1.1 Objectives ... page 2',
 '## 2. Findings ... page 3',
 '## 3. Conclusions ... page 5',
 '## 4. Recommendations ... page 6']

From Raw Headings to Dictionary Paths

The create_heading_dict function (from toolslm.md_hier) parses a markdown document and returns a nested dictionary representing its heading structure. Each key is a heading title, and nested dictionaries represent subsections.
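To make the nested structure concrete, here is a minimal sketch of how such a dictionary could be built from heading lines. It is a simplified stand-in and not the toolslm implementation: `build_heading_dict` is a hypothetical name, and the sketch ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

def build_heading_dict(md: str) -> dict:
    """Hypothetical stand-in: nest headings by their '#' depth."""
    root: dict = {}
    stack = [(0, root)]  # (heading level, dict holding that heading's children)
    for m in re.finditer(r'^(#{1,6})[ \t]+(.*?)[ \t]*$', md, re.MULTILINE):
        level, title = len(m.group(1)), m.group(2)
        # Pop until the top of the stack is a strictly shallower heading
        while stack[-1][0] >= level:
            stack.pop()
        children: dict = {}
        stack[-1][1][title] = children  # attach under the current parent
        stack.append((level, children))
    return root

toc = build_heading_dict("# Title\n\n## A\n\ntext\n\n### A.1\n\n## B\n")
print(toc)
```

Each key maps to a dictionary of its subsections, so a leaf section maps to an empty dictionary.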

When users or an LLM identify sections by their raw markdown headings (e.g., "## 1. Introduction ... page 4"), we need to convert these into paths that can navigate this nested structure. This involves:

  1. Stripping the markdown prefix (##) to get the dictionary key
  2. Finding the full path to that key in the nested hierarchy


heading_to_key


def heading_to_key(
    hdg:str, # Raw markdown heading like "## 1. Intro ..."
)->str: # Heading text without markdown prefix

Strip markdown heading prefix: ‘## 1. Intro …’ → ‘1. Intro …’

assert heading_to_key("## 1. Introduction ... page 4") == "1. Introduction ... page 4"
assert heading_to_key("### 2.1. Context ... page 5") == "2.1. Context ... page 5"
assert heading_to_key("# Title") == "Title"
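The exported source of heading_to_key is not shown on this page. A minimal sketch that satisfies the assertions above could look like the following; `heading_to_key_sketch` is a hypothetical name, and the regular expression is an assumption about how the prefix is stripped.

```python
import re

def heading_to_key_sketch(hdg: str) -> str:
    """Hypothetical version: drop leading '#' characters and surrounding whitespace."""
    return re.sub(r'^\s*#+\s*', '', hdg).strip()

key = heading_to_key_sketch("## 1. Introduction ... page 4")
print(key)
```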


find_path


def find_path(
    hdgs:dict, # Nested dictionary of headings
    key:str, # Heading key to find
)->list: # Full path to key, or empty list if not found

Find full path to a heading key in nested dict

sample_hdgs = create_heading_dict(sample_md)
assert find_path(sample_hdgs, "1. Introduction ... page 2") == [
    "Report Title ... page 1",
    "1. Introduction ... page 2"
]
assert find_path(sample_hdgs, "1.1 Objectives ... page 2") == [
    "Report Title ... page 1",
    "1. Introduction ... page 2",
    "1.1 Objectives ... page 2"
]
assert find_path(sample_hdgs, "nonexistent") == []
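One way to get this behaviour is a depth-first search over the nested dictionary. The sketch below is an assumption about the approach, not the module's actual code, and `find_path_sketch` is a hypothetical name.

```python
def find_path_sketch(hdgs: dict, key: str) -> list:
    """Hypothetical depth-first search; returns [] when the key is absent."""
    for title, children in hdgs.items():
        if title == key:
            return [title]
        sub = find_path_sketch(children, key)
        if sub:  # key was found somewhere below this heading
            return [title] + sub
    return []

toc = {"Report": {"1. Intro": {"1.1 Objectives": {}}, "2. Findings": {}}}
path = find_path_sketch(toc, "1.1 Objectives")
print(path)
```

If the same heading text appears in several places, this sketch returns the first match in document order.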

When the LLM or an end user identifies core sections, they might select both a parent section and its children (e.g., “Introduction” and “Introduction > Objectives”). To avoid duplicate content, we filter out any paths that are children of other selected paths.



rm_nested


def rm_nested(
    paths:list, # List of section paths, where each path is a list of keys
)->list: # Filtered list with nested paths removed

Remove paths that are children of other paths in the list

nested_paths = [
    ['Report Title ... page 1', '1. Introduction ... page 2'],
    ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'],
    ['Report Title ... page 1', '3. Conclusions ... page 5']
]

rm_nested(nested_paths)
[['Report Title ... page 1', '1. Introduction ... page 2'],
 ['Report Title ... page 1', '3. Conclusions ... page 5']]
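The filtering can be expressed as a prefix test: a path is dropped when another selected path is a strictly shorter prefix of it. The following is a sketch under that assumption, with `rm_nested_sketch` as a hypothetical name.

```python
def rm_nested_sketch(paths: list) -> list:
    """Hypothetical version: keep a path only if no other selected path is a strict prefix of it."""
    def is_child(p: list, q: list) -> bool:
        # p is a child of q when q is shorter and matches the start of p
        return len(q) < len(p) and p[:len(q)] == q
    return [p for p in paths if not any(is_child(p, q) for q in paths)]

paths = [
    ["Report", "1. Intro"],
    ["Report", "1. Intro", "1.1 Objectives"],
    ["Report", "3. Conclusions"],
]
kept = rm_nested_sketch(paths)
print(kept)
```

Because the prefix test is strict, two identical paths would both be kept by this sketch; exact duplicates would need a separate pass.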


headings_to_paths


def headings_to_paths(
    hdgs:dict, # Nested dictionary of headings
    selected_headings:list, # Raw markdown headings like ["## 1. Intro ...", ...]
)->list: # Deduplicated paths through heading hierarchy

Convert raw headings to deduplicated paths

# From a nested heading, we get the full path for text extraction
selected = ["### 1.1 Objectives ... page 2"]
paths = headings_to_paths(sample_hdgs, selected)
assert paths == [[
    "Report Title ... page 1",
    "1. Introduction ... page 2",
    "1.1 Objectives ... page 2"
]]
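The conversion chains the three steps described so far: strip the prefix, look up the path, then drop nested paths. The sketch below inlines hypothetical versions of those helpers so that it runs on its own. Silently skipping headings that are not found is an assumption; the real function may behave differently.

```python
import re

def to_paths_sketch(hdgs: dict, selected: list) -> list:
    """Hypothetical pipeline: raw headings -> keys -> full paths -> nested paths removed."""
    def key_of(h: str) -> str:
        return re.sub(r'^\s*#+\s*', '', h).strip()

    def find(d: dict, key: str) -> list:
        for title, children in d.items():
            if title == key:
                return [title]
            sub = find(children, key)
            if sub:
                return [title] + sub
        return []

    # Assumption: headings that cannot be located are dropped
    paths = [p for p in (find(hdgs, key_of(h)) for h in selected) if p]
    # Drop any path that has a strictly shorter selected path as its prefix
    return [p for p in paths
            if not any(len(q) < len(p) and p[:len(q)] == q for q in paths)]

toc = {"Report": {"1. Intro": {"1.1 Objectives": {}}, "2. Findings": {}}}
result = to_paths_sketch(toc, ["## 1. Intro", "### 1.1 Objectives", "## Missing"])
print(result)
```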

LLM-Based Section Identification

Rather than using rigid pattern matching, we use an LLM to intelligently identify core sections. This handles multilingual reports, varied naming conventions, and unusual structures. The LLM receives the table of contents as a nested dictionary and returns paths to the most relevant sections.



identify_core_sections


def identify_core_sections(
    hdgs:dict, # Nested dictionary of report headings from `create_heading_dict`
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->dict: # Dictionary with 'section_paths' and 'reasoning' keys

Use LLM to identify core sections (exec summary, intro, conclusions, recommendations) from ToC

sections = identify_core_sections(hdgs)
sections
{'section_paths': [['Report Title ... page 1', 'Executive Summary ... page 1'],
  ['Report Title ... page 1', '1. Introduction ... page 2'],
  ['Report Title ... page 1',
   '1. Introduction ... page 2',
   '1.1 Objectives ... page 2'],
  ['Report Title ... page 1', '3. Conclusions ... page 5'],
  ['Report Title ... page 1', '4. Recommendations ... page 6']],
 'reasoning': "Selected all five key sections that reveal core themes: Executive Summary (page 1) provides high-level overview of what authors consider important; Introduction and Objectives (pages 2) establish the evaluation's purpose and scope; Conclusions (page 5) synthesize key findings and judgments; Recommendations (page 6) translate conclusions into actionable guidance. These sections total approximately 6 pages and represent the most strategic content for determining core themes. Excluded Findings section as it likely contains detailed evidence that supports rather than defines the core themes, though the selected sections should provide sufficient context about what findings matter most."}
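Since the paths come from a model, it can be useful to confirm that each one resolves against the heading dictionary before extraction. The page does not say whether identify_core_sections performs such a check, so the helper below, `path_exists`, is a hypothetical addition for illustration.

```python
def path_exists(hdgs: dict, path: list) -> bool:
    """Hypothetical check: True if every key in `path` can be followed through the nested dict."""
    if not path:
        return False  # an empty path points at nothing
    node = hdgs
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

toc = {"Report": {"1. Intro": {"1.1 Objectives": {}}, "3. Conclusions": {}}}
returned = [
    ["Report", "1. Intro", "1.1 Objectives"],
    ["Report", "9. Annex"],  # a path the model made up
]
valid = [p for p in returned if path_exists(toc, p)]
print(valid)
```

Note also that the example output above contains both the Introduction and its Objectives child, which is exactly the case the nested-path filter is meant to handle.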

Putting It All Together

The main entry point extract_sections combines all the pieces: it parses the report’s heading structure, identifies core sections (either via LLM or user-provided headings), removes nested duplicates to avoid repetition, and extracts the concatenated text ready for downstream processing.



get_h1_title


def get_h1_title(
    hdgs:dict
)->str:

Get the main title (first h1 heading) from heading dict
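No example is given for get_h1_title. Assuming the top-level keys of the heading dictionary are the h1 headings in document order, a minimal sketch could be the following; `first_title_sketch` is a hypothetical name, and returning an empty string for an empty dictionary is an assumption.

```python
def first_title_sketch(hdgs: dict) -> str:
    """Hypothetical version: first top-level key, or '' when the dict is empty."""
    # Python dicts preserve insertion order, so the first key is the first h1 seen
    return next(iter(hdgs), "")

title = first_title_sketch({"Report Title ... page 1": {"Executive Summary ... page 1": {}}})
print(title)
```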



extract_sections


def extract_sections(
    md:str, # Markdown text of full report
    selected_headings:list=None, # Raw headings like ["## 1. Intro ...", ...]; if None, uses LLM
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->str: # Concatenated text of all core sections

Extract core sections from report. Uses LLM auto-detection if selected_headings is None, otherwise uses provided headings.

text = extract_sections(sample_md, selected_headings=["## 1. Introduction ... page 2"])
assert "# Report Title ... page 1" in text
assert "Background information here" in text
assert "1.1 Objectives" in text
print(text)
# Report Title ... page 1
## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...
text = extract_sections(sample_md, selected_headings=[])
assert text == ""
text
''
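The extraction step itself can be pictured as slicing the markdown from a heading up to the next heading of the same or a shallower level. The sketch below illustrates that idea only; `section_text_sketch` is a hypothetical helper, it matches headings by exact line text, and it is not how extract_sections is known to be implemented.

```python
import re

def section_text_sketch(md: str, heading: str) -> str:
    """Hypothetical: return the heading line plus everything up to the next heading of equal or shallower level."""
    lines = md.splitlines()
    heading = heading.strip()
    level = len(heading) - len(heading.lstrip('#'))  # number of leading '#'
    start = next((i for i, l in enumerate(lines) if l.strip() == heading), None)
    if start is None:
        return ""  # heading not present in the document
    end = len(lines)
    for i in range(start + 1, len(lines)):
        m = re.match(r'^(#+)\s', lines[i])
        if m and len(m.group(1)) <= level:
            end = i  # a sibling or parent heading closes the section
            break
    return "\n".join(lines[start:end]).strip()

md = "# T\n\n## A\n\na text\n\n### A.1\n\nsub\n\n## B\n\nb text"
out = section_text_sketch(md, "## A")
print(out)
```

A parent section's slice already contains its subsections, which is why selecting both a parent and its child would duplicate text without the nested-path filter.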