from pydantic import BaseModel

class CoreSectionsOutput(BaseModel):
    "Identify the core sections of the report"
    section_paths: list[list[str]]  # Each path lists heading keys, outermost first
    reasoning: str  # The model's explanation for its selection

This module provides tools for automatically identifying and extracting core sections from evaluation reports. When working with large reports (50-200+ pages), we need to focus on key sections (such as executive summaries, introductions, conclusions, and recommendations) to support tagging and mapping exercises against evaluation frameworks (e.g., SRF, GCM).
Focusing on these sections helps:
The approach uses an LLM to parse a report’s table of contents, identify which sections contain substantive thematic content, and extract just those sections for further processing.
Identify the core sections of the report
For instance, given a markdown document:
img page_16.md page_23.md page_30.md page_38.md page_45.md
page_1.md page_17.md page_24.md page_31.md page_39.md page_46.md
page_10.md page_18.md page_25.md page_32.md page_4.md page_47.md
page_11.md page_19.md page_26.md page_33.md page_40.md page_5.md
page_12.md page_2.md page_27.md page_34.md page_41.md page_6.md
page_13.md page_20.md page_28.md page_35.md page_42.md page_7.md
page_14.md page_21.md page_29.md page_36.md page_43.md page_8.md
page_15.md page_22.md page_3.md page_37.md page_44.md page_9.md
sample_md = """# Report Title ... page 1
## Executive Summary ... page 1
This is a summary of key findings.
## 1. Introduction ... page 2
Background information here.
### 1.1 Objectives ... page 2
The objectives are...
## 2. Findings ... page 3
Detailed findings.
## 3. Conclusions ... page 5
Main conclusions.
## 4. Recommendations ... page 6
Key recommendations."""

{'Report Title ... page 1': {'Executive Summary ... page 1': {},
'1. Introduction ... page 2': {'1.1 Objectives ... page 2': {}},
'2. Findings ... page 3': {},
'3. Conclusions ... page 5': {},
'4. Recommendations ... page 6': {}}}
The create_heading_dict function (from toolslm.md_hier) parses a markdown document and returns a nested dictionary representing its heading structure. Each key is a heading title, and nested dictionaries represent subsections.
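To make the nesting concrete, here is a minimal sketch of how headings can be folded into a nested dictionary by level. It is not the `toolslm.md_hier` implementation: `build_heading_tree` is a hypothetical stand-in that ignores body text and does not handle `#` lines inside fenced code blocks.

```python
import re

def build_heading_tree(md: str) -> dict:
    """Hypothetical stand-in for `create_heading_dict`: nest headings by level."""
    root: dict = {}
    # Stack of (heading level, dict holding that heading's children); root is level 0.
    stack = [(0, root)]
    for line in md.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if not m:
            continue  # body text is ignored in this sketch
        level, title = len(m.group(1)), m.group(2)
        # Pop back to the nearest ancestor with a strictly smaller level.
        while stack[-1][0] >= level:
            stack.pop()
        node: dict = {}
        stack[-1][1][title] = node
        stack.append((level, node))
    return root

tree = build_heading_tree("# A\n## B\n### C\n## D")
print(tree)
```

Each heading becomes a key whose value holds its subsections, so a leaf section maps to an empty dictionary.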
When users or an LLM identify sections by their raw markdown headings (e.g., "## 1. Introduction ... page 4"), we need to convert these into paths that can navigate this nested structure. This involves:
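- Stripping the markdown heading prefix (e.g., ##) to get the dictionary key
- Finding the full path to that key in the nested dictionary
Strip markdown heading prefix: ‘## 1. Intro …’ → ‘1. Intro …’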
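The prefix-stripping step can be sketched with a single regular expression. The function name `strip_heading_prefix` is an assumption for illustration; the module's own helper is not shown in this export.

```python
import re

def strip_heading_prefix(heading: str) -> str:
    """Drop leading '#' markers (and surrounding whitespace) from a raw heading."""
    # Requires whitespace after the hashes, so text like '#tag' is left alone.
    return re.sub(r"^\s*#{1,6}\s+", "", heading).strip()

print(strip_heading_prefix("## 1. Intro ..."))
```

A heading that carries no prefix passes through unchanged, so the function is safe to apply to keys that were already cleaned.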
Find full path to a heading key in nested dict
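One way to implement such a lookup is a depth-first search that accumulates keys on the way down. The sketch below uses a hypothetical name, `find_heading_path`, and is only meant to be consistent with the behaviour of `find_path` shown in the assertions that follow; the actual implementation may differ.

```python
def find_heading_path(tree: dict, target: str, prefix: tuple = ()) -> list:
    """Depth-first search: return the keys leading to `target`, or [] if absent."""
    for key, children in tree.items():
        path = prefix + (key,)
        if key == target:
            return list(path)
        # Search this heading's subsections before moving to its siblings.
        found = find_heading_path(children, target, path)
        if found:
            return found
    return []

toc = {"Report": {"Summary": {}, "1. Intro": {"1.1 Objectives": {}}}}
print(find_heading_path(toc, "1.1 Objectives"))
```

If the same heading text appears in several places, this sketch returns the first match in document order.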
sample_hdgs = create_heading_dict(sample_md)
assert find_path(sample_hdgs, "1. Introduction ... page 2") == [
"Report Title ... page 1",
"1. Introduction ... page 2"
]
assert find_path(sample_hdgs, "1.1 Objectives ... page 2") == [
"Report Title ... page 1",
"1. Introduction ... page 2",
"1.1 Objectives ... page 2"
]
assert find_path(sample_hdgs, "nonexistent") == []

When the LLM or an end user identifies core sections, they might select both a parent section and its children (e.g., “Introduction” and “Introduction > Objectives”). To avoid duplicate content, we filter out any paths that are children of other selected paths.
Remove paths that are children of other paths in the list
[['Report Title ... page 1', '1. Introduction ... page 2'],
['Report Title ... page 1', '3. Conclusions ... page 5']]
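The filtering rule amounts to dropping every path that has another selected path as a strict prefix. A minimal sketch, using the hypothetical name `drop_nested_paths`:

```python
def drop_nested_paths(paths: list) -> list:
    """Keep only paths that do not extend another path in the list."""
    def is_child(p: list, q: list) -> bool:
        # p is a child of q when q is a strict prefix of p.
        return len(p) > len(q) and p[:len(q)] == q
    return [p for p in paths if not any(is_child(p, q) for q in paths)]

selected = [
    ["Report", "1. Intro"],
    ["Report", "1. Intro", "1.1 Objectives"],
    ["Report", "3. Conclusions"],
]
print(drop_nested_paths(selected))
```

The comparison is quadratic in the number of paths, which is acceptable here because only a handful of sections are selected per report. Exact duplicate paths are not removed by this sketch.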
Convert raw headings to deduplicated paths
Reports have hierarchical structure (sections, subsections, etc.). We represent this as a nested dictionary using create_heading_dict from toolslm.md_hier. To extract text from a specific section, we need to navigate through this hierarchy using a path of keys.
Navigate through nested heading levels and return the text content
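The navigation can be sketched as a fold over the path, assuming each node in the heading dictionary also stores the text of its section. `TextNode` below is a hypothetical stand-in for whatever `create_heading_dict` returns, and `section_text` is an illustrative name.

```python
from functools import reduce

class TextNode(dict):
    """Hypothetical node: a dict of child headings that also stores its section text."""
    def __init__(self, text: str = "", children=None):
        super().__init__(children or {})
        self.text = text

def section_text(tree: dict, path: list) -> str:
    """Walk `path` key by key and return the text stored on the final node."""
    # Raises KeyError if any key in the path is missing from the tree.
    node = reduce(lambda d, key: d[key], path, tree)
    return node.text

objectives = TextNode("### 1.1 Objectives\nThe objectives are...")
intro = TextNode("## 1. Intro\nBackground.\n" + objectives.text, {"1.1 Objectives": objectives})
tree = {"Report": TextNode("# Report\n" + intro.text, {"1. Intro": intro})}
print(section_text(tree, ["Report", "1. Intro"]))
```

In this sketch a parent's text includes its subsections, which is why selecting both a parent and its child would duplicate content.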
Rather than using rigid pattern matching, we use an LLM to intelligently identify core sections. This handles multilingual reports, varied naming conventions, and unusual structures. The LLM receives the table of contents as a nested dictionary and returns paths to the most relevant sections.
def identify_core_sections(
    hdgs:dict, # Nested dictionary of report headings from `create_heading_dict`
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->dict: # Dictionary with 'section_paths' and 'reasoning' keys
Use LLM to identify core sections (exec summary, intro, conclusions, recommendations) from ToC
{'section_paths': [['Report Title ... page 1', 'Executive Summary ... page 1'],
['Report Title ... page 1', '1. Introduction ... page 2'],
['Report Title ... page 1',
'1. Introduction ... page 2',
'1.1 Objectives ... page 2'],
['Report Title ... page 1', '3. Conclusions ... page 5'],
['Report Title ... page 1', '4. Recommendations ... page 6']],
'reasoning': "Selected all five key sections that reveal core themes: Executive Summary (page 1) provides high-level overview of what authors consider important; Introduction and Objectives (pages 2) establish the evaluation's purpose and scope; Conclusions (page 5) synthesize key findings and judgments; Recommendations (page 6) translate conclusions into actionable guidance. These sections total approximately 6 pages and represent the most strategic content for determining core themes. Excluded Findings section as it likely contains detailed evidence that supports rather than defines the core themes, though the selected sections should provide sufficient context about what findings matter most."}
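The steps around the model call can be sketched without committing to a particular LLM client. The example below serializes the table of contents for the prompt and then checks a JSON reply against it. The function names, the prompt wording, and the path validation step are all assumptions for illustration; the source does not state that returned paths are validated, and the model call itself is replaced by a canned reply.

```python
import json

def toc_prompt(hdgs: dict) -> str:
    """Serialize the nested headings as JSON so the model sees the full hierarchy."""
    return "Table of contents:\n" + json.dumps(hdgs, indent=2, ensure_ascii=False)

def path_exists(hdgs: dict, path: list) -> bool:
    """Return True if every key in `path` can be followed through `hdgs`."""
    node = hdgs
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

def parse_core_sections(raw: str, hdgs: dict) -> dict:
    """Parse the model's JSON reply and discard empty or unknown paths."""
    reply = json.loads(raw)
    valid = [p for p in reply["section_paths"] if p and path_exists(hdgs, p)]
    return {"section_paths": valid, "reasoning": reply["reasoning"]}

hdgs = {"Report": {"Summary": {}, "1. Intro": {"1.1 Objectives": {}}}}
# Canned reply standing in for the LLM response; one path does not exist in the ToC.
raw = json.dumps({
    "section_paths": [["Report", "Summary"], ["Report", "Missing"]],
    "reasoning": "demo",
})
print(parse_core_sections(raw, hdgs))
```

Checking paths against the heading dictionary guards against a model returning a heading it paraphrased rather than copied.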
The main entry point extract_sections combines all the pieces: it parses the report’s heading structure, identifies core sections (either via LLM or user-provided headings), removes nested duplicates to avoid repetition, and extracts the concatenated text ready for downstream processing.
Get the main title (first h1 heading) from heading dict
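A minimal sketch of this lookup, assuming the top-level keys of the heading dictionary are the h1 headings in document order (`main_title` is a hypothetical name):

```python
def main_title(hdgs: dict):
    """Return the first top-level key, or None for an empty heading dict."""
    # Dicts preserve insertion order, so the first key is the first heading parsed.
    return next(iter(hdgs), None)

print(main_title({"Report Title ... page 1": {"Executive Summary ... page 1": {}}}))
```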
def extract_sections(
    md:str, # Markdown text of full report
    selected_headings:list=None, # Raw headings like ["## 1. Intro ...", ...]; if None, uses LLM
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->str: # Concatenated text of all core sections
Extract core sections from report. Uses LLM auto-detection if selected_headings is None, otherwise uses provided headings.
# Report Title ... page 1
## 1. Introduction ... page 2
Background information here.
### 1.1 Objectives ... page 2
The objectives are...