Extract

Report Sections Extraction

This module provides tools for automatically identifying and extracting core sections from evaluation reports. When working with large reports (50-200+ pages), we need to focus on key sections—such as executive summaries, introductions, conclusions, and recommendations—to support tagging and mapping exercises against evaluation frameworks (e.g., SRF, GCM).

Focusing on these sections helps:

The approach uses an LLM to parse a report’s table of contents, identify which sections contain substantive thematic content, and extract just those sections for further processing.


CoreSectionsOutput


def CoreSectionsOutput(
    data:Any
)->None:

Identify the core sections of the report

Exported source
class CoreSectionsOutput(BaseModel):
    "Identify the core sections of the report"
    section_paths: list[list[str]]
    reasoning: str

For instance, given a report in markdown:

from mistocr.core import read_pgs
!ls ../../pipeline/data/md/4341695461234eee3deb51ac68871109
img     page_16.md  page_23.md  page_30.md  page_38.md  page_45.md
page_1.md   page_17.md  page_24.md  page_31.md  page_39.md  page_46.md
page_10.md  page_18.md  page_25.md  page_32.md  page_4.md   page_47.md
page_11.md  page_19.md  page_26.md  page_33.md  page_40.md  page_5.md
page_12.md  page_2.md   page_27.md  page_34.md  page_41.md  page_6.md
page_13.md  page_20.md  page_28.md  page_35.md  page_42.md  page_7.md
page_14.md  page_21.md  page_29.md  page_36.md  page_43.md  page_8.md
page_15.md  page_22.md  page_3.md   page_37.md  page_44.md  page_9.md
fname = "../../pipeline/data/md/4341695461234eee3deb51ac68871109"
sample_md = read_pgs(fname, join=True)
sample_md = """# Report Title ... page 1

## Executive Summary ... page 1

This is a summary of key findings.

## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...

## 2. Findings ... page 3

Detailed findings.

## 3. Conclusions ... page 5

Main conclusions.

## 4. Recommendations ... page 6

Key recommendations."""
hdgs = create_heading_dict(sample_md)
hdgs
{'Final External Evaluation ... page 1': {'Evaluation scope ... page 3': {},
  'Evaluation criteria ... page 3': {},
  'Evaluation questions ... page 3': {},
  'Evaluation methodology ... page 5': {},
  'Ethics, norms and standards for evaluation ... page 5': {},
  'Hired Evaluator must abide with the following. ... page 6': {},
  'Evaluation deliverables ... page 6': {},
  'Specifications of roles ... page 6': {},
  'Time schedule ... page 7': {},
  'Qualifications of the Evaluator ... page 8': {},
  'Submission of application ... page 8': {},
  'REGIONAL INTERVIEWS ... page 14': {},
  'INTRODUCTION ... page 14': {},
  'Coherence ... page 14': {},
  'Effectiveness ... page 14': {},
  'Impact ... page 16': {},
  'Sustainability ... page 16': {},
  'Closure ... page 16': {},
  'NATIONAL INTERVIEWS ... page 16': {},
  'INTRODUCTION ... page 16': {},
  'Coherence ... page 17': {},
  'Effectiveness ... page 17': {},
  'Efficiency ... page 17': {'3.2.2. External alignment with efforts and organizations outside IOM  ... page 20': {},
   '3.3. Effectiveness  ... page 21': {},
   '3.3.1. Specific Objective 1: National and regional authorities in the field of migration governance are aware of, and act in accordance with, international and regional frameworks for migration governance and human rights standards. ... page 21': {'3.3.2. Specific Objective 2: The quality of national and cross-border cooperation on trafficking and smuggling cases between law enforcement, judicial and other state and nonstate actors, in coordination with existing regional initiatives and in accordance with international obligations and standards, is increased  ... page 26': {},
    '3.3.3. Specific Objective 3: Protection services for Victims of Trafficking and of vulnerable migrants are improved at local, national, and regional levels  ... page 28': {},
    "3.3.4. Factors influencing the programme's effectiveness  ... page 32": {'Security challenges ... page 32': {},
     'Political context and changing government priorities ... page 32': {},
     'Procurement challenges ... page 33': {}},
    '3.3.5. Integration of cross-cutting themes  ... page 35': {}},
   '3.4. Efficiency ... page 35': {},
   "3.4.1. Financial efficiency of IOM's BMM programme ... page 35": {'3.4.2. Efficiency of coordination and reporting  ... page 37': {}},
   '3.5. IMPACT  ... page 39': {},
   '3.6. Sustainability  ... page 40': {}},
  '4. CONCLUSIONS  ... page 42': {},
  '5. RECOMMENDATIONS  ... page 44': {}}}
from mistocr.refine import fmt_hdgs_idx
print(fmt_hdgs_idx(hdgs))
0. Final External Evaluation ... page 1
import re
re.findall(r'^#+\s+.*$', sample_md, re.MULTILINE)
['# Final External Evaluation ... page 1',
 '## Evaluation scope ... page 3',
 '## Evaluation criteria ... page 3',
 '## Evaluation questions ... page 3',
 '## Evaluation methodology ... page 5',
 '## Ethics, norms and standards for evaluation ... page 5',
 '## Hired Evaluator must abide with the following. ... page 6',
 '## Evaluation deliverables ... page 6',
 '## Specifications of roles ... page 6',
 '## Time schedule ... page 7',
 '## Qualifications of the Evaluator ... page 8',
 '## Submission of application ... page 8',
 '## REGIONAL INTERVIEWS ... page 14',
 '## INTRODUCTION ... page 14',
 '## Coherence ... page 14',
 '## Effectiveness ... page 14',
 '## Impact ... page 16',
 '## Sustainability ... page 16',
 '## Closure ... page 16',
 '## NATIONAL INTERVIEWS ... page 16',
 '## INTRODUCTION ... page 16',
 '## Coherence ... page 17',
 '## Effectiveness ... page 17',
 '## Efficiency ... page 17',
 '#### 3.2.2. External alignment with efforts and organizations outside IOM  ... page 20',
 '### 3.3. Effectiveness  ... page 21',
 '### 3.3.1. Specific Objective 1: National and regional authorities in the field of migration governance are aware of, and act in accordance with, international and regional frameworks for migration governance and human rights standards. ... page 21',
 '#### 3.3.2. Specific Objective 2: The quality of national and cross-border cooperation on trafficking and smuggling cases between law enforcement, judicial and other state and nonstate actors, in coordination with existing regional initiatives and in accordance with international obligations and standards, is increased  ... page 26',
 '#### 3.3.3. Specific Objective 3: Protection services for Victims of Trafficking and of vulnerable migrants are improved at local, national, and regional levels  ... page 28',
 "#### 3.3.4. Factors influencing the programme's effectiveness  ... page 32",
 '##### Security challenges ... page 32',
 '##### Political context and changing government priorities ... page 32',
 '##### Procurement challenges ... page 33',
 '#### 3.3.5. Integration of cross-cutting themes  ... page 35',
 '### 3.4. Efficiency ... page 35',
 "### 3.4.1. Financial efficiency of IOM's BMM programme ... page 35",
 '#### 3.4.2. Efficiency of coordination and reporting  ... page 37',
 '### 3.5. IMPACT  ... page 39',
 '### 3.6. Sustainability  ... page 40',
 '## 4. CONCLUSIONS  ... page 42',
 '## 5. RECOMMENDATIONS  ... page 44']
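
The nesting in `hdgs` is consistent with a single stack-based pass over the heading lines matched by the regex above: each heading is attached to the closest preceding heading with fewer `#` characters. The sketch below illustrates that idea under this assumption. `build_heading_tree` is a hypothetical name and is not the actual `create_heading_dict` implementation.

```python
import re

def build_heading_tree(md: str) -> dict:
    "Hypothetical sketch: nest markdown headings by their '#' depth"
    root: dict = {}
    stack = [(0, root)]  # (heading level, children dict), root has level 0
    for m in re.finditer(r'^(#+)\s+(.*)$', md, re.MULTILINE):
        level, title = len(m.group(1)), m.group(2).strip()
        # Pop until the top of the stack is a strictly shallower heading
        while stack[-1][0] >= level:
            stack.pop()
        node: dict = {}
        stack[-1][1][title] = node  # Attach to the current parent
        stack.append((level, node))
    return root

toy_md = "# Title\n\n## Intro\n\n### Goals\n\n## Conclusions\n"
tree = build_heading_tree(toy_md)
print(tree)
```

With this scheme a heading that skips a level (for example a `####` directly after a `##`) simply becomes a child of the nearest shallower heading, which would explain why sections such as 3.3 end up under "Efficiency ... page 17" in the output above.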

Handling Nested Selections

When the LLM identifies core sections, it might select both a parent section and its children (e.g., “Introduction” and “Introduction > Objectives”). To avoid duplicate content, we filter out any paths that are children of other selected paths.


rm_nested


def rm_nested(
    paths:list, # List of section paths, where each path is a list of keys
)->list: # Filtered list with nested paths removed

Remove paths that are children of other paths in the list

nested_paths = [
    ['Report Title ... page 1', '1. Introduction ... page 2'],
    ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'],
    ['Report Title ... page 1', '3. Conclusions ... page 5']
]

rm_nested(nested_paths)
[['Report Title ... page 1', '1. Introduction ... page 2'],
 ['Report Title ... page 1', '3. Conclusions ... page 5']]
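
One way to get this behaviour is a strict-prefix check: a path is dropped when another selected path is shorter and matches its leading elements. The following is a minimal sketch under that assumption. `drop_nested` is a hypothetical name, and the real `rm_nested` may be implemented differently.

```python
def drop_nested(paths: list[list[str]]) -> list[list[str]]:
    "Hypothetical sketch: keep a path only if no other path is its strict prefix"
    def is_child(p: list[str], q: list[str]) -> bool:
        # p is a child of q when q is shorter and matches the start of p
        return len(q) < len(p) and p[:len(q)] == q
    return [p for p in paths if not any(is_child(p, q) for q in paths)]

example_paths = [
    ['Report Title ... page 1', '1. Introduction ... page 2'],
    ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'],
    ['Report Title ... page 1', '3. Conclusions ... page 5'],
]
kept = drop_nested(example_paths)
print(kept)
```

The comparison is quadratic in the number of paths, which should be acceptable for the handful of sections an LLM typically selects.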

LLM-Based Section Identification

Rather than using rigid pattern matching, we use an LLM to intelligently identify core sections. This handles multilingual reports, varied naming conventions, and unusual structures. The LLM receives the table of contents as a nested dictionary and returns paths to the most relevant sections.


identify_core_sections


def identify_core_sections(
    hdgs:dict, # Nested dictionary of report headings from `create_heading_dict`
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->dict: # Dictionary with 'section_paths' and 'reasoning' keys

Use LLM to identify core sections (exec summary, intro, conclusions, recommendations) from ToC

sections = identify_core_sections(hdgs)
sections
{'section_paths': [['Report Title ... page 1', 'Executive Summary ... page 1'],
  ['Report Title ... page 1',
   '1. Introduction ... page 2',
   '1.1 Objectives ... page 2'],
  ['Report Title ... page 1', '3. Conclusions ... page 5'],
  ['Report Title ... page 1', '4. Recommendations ... page 6']],
 'reasoning': "Selected four core sections totaling approximately 6 pages that capture the report's essential themes: Executive Summary (page 1) provides the overview and key findings; Introduction/Objectives (pages 2-3) establishes the evaluation purpose and questions; Conclusions (pages 5-6) synthesizes findings; and Recommendations (page 6+) presents actionable insights. These sections represent where authors explicitly articulate what is important and core to the evaluation, avoiding the Findings section which likely contains supporting detail rather than thematic synthesis."}
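
Because the paths come from an LLM, it may be worth checking that each one actually resolves in the heading dictionary before extracting text. The guard below is only a sketch of that idea. `path_exists` is a hypothetical helper and not part of this module.

```python
def path_exists(hdgs: dict, path: list[str]) -> bool:
    "Hypothetical sketch: True if every key in `path` can be followed through `hdgs`"
    node = hdgs
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

toy_hdgs = {'Report Title ... page 1': {
    'Executive Summary ... page 1': {},
    '1. Introduction ... page 2': {'1.1 Objectives ... page 2': {}},
}}
ok = path_exists(toy_hdgs, ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'])
bad = path_exists(toy_hdgs, ['Report Title ... page 1', 'Introduction'])
print(ok, bad)
```

Paths that fail the check could be dropped or logged, so that a slightly misquoted heading does not silently produce an empty extraction.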

Putting It All Together

The main entry point combines all the pieces: parse the report structure, identify core sections, remove nested duplicates, and extract the text.


extract_sections


def extract_sections(
    md:str, # Markdown text of full report
    sp:str=None, # System prompt for section identification
    response_format:type=CoreSectionsOutput, # Pydantic model for structured output
    model:str='claude-sonnet-4-5', # LLM model to use for identification
)->str: # Concatenated text of all core sections

Extract and concatenate core sections (exec summary, intro, conclusions, recommendations) from report markdown

text = extract_sections(sample_md, model='claude-sonnet-4-5')
print(text[:200])
## Executive Summary ... page 1

This is a summary of key findings.
## 1. Introduction ... page 2

Background information here.

### 1.1 Objectives ... page 2

The objectives are...
## 3. Conclusions 
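
The text-extraction step can be pictured as slicing the markdown from a selected heading up to the next heading of equal or shallower depth, so that subsections stay attached to their parent. This is a minimal sketch under that assumption. `section_text` is a hypothetical helper, it matches on the heading title only (the last element of a path), and it is not the routine `extract_sections` uses internally.

```python
import re

def section_text(md: str, heading: str) -> str:
    "Hypothetical sketch: heading line plus body, up to the next heading of equal or shallower depth"
    lines = md.splitlines()
    start, level = None, None
    for i, line in enumerate(lines):
        m = re.match(r'^(#+)\s+(.*)$', line)
        if not m:
            continue
        if start is None:
            # Look for the first heading whose title matches exactly
            if m.group(2).strip() == heading:
                start, level = i, len(m.group(1))
        elif len(m.group(1)) <= level:
            # A sibling or shallower heading ends the section
            return "\n".join(lines[start:i]).strip()
    return "" if start is None else "\n".join(lines[start:]).strip()

doc = "# T\n\n## A\n\na text\n\n### A.1\n\nsub\n\n## B\n\nb text\n"
print(section_text(doc, "A"))
```

Matching on the title alone is ambiguous when a report repeats a heading, so a fuller version would need to walk the whole path.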

TO DO

The LLM sometimes gets confused when it is passed the nested dict of headings. Flattening the dict into a list of paths might make the identification more robust.

def flatten_paths(hdgs:dict, # Nested dictionary of headings
                  prefix:list=None # Path prefix for recursion
                 ) -> list[list[str]]: # List of all paths through the heading tree
    "Flatten nested heading dict into list of paths"
    prefix = prefix or []  # Default to an empty path; `None + [k]` would raise a TypeError
    paths = []
    for k, v in hdgs.items():
        current_path = prefix + [k]
        paths.append(current_path)
        if v:  # If there are children
            paths.extend(flatten_paths(v, current_path))
    return paths
hdgs
{'Report Title ... page 1': {'Executive Summary ... page 1': {},
  '1. Introduction ... page 2': {'1.1 Objectives ... page 2': {}},
  '2. Findings ... page 3': {},
  '3. Conclusions ... page 5': {},
  '4. Recommendations ... page 6': {}}}
for i,o in enumerate(flatten_paths(hdgs)): 
    print((i,o))
(0, ['Report Title ... page 1'])
(1, ['Report Title ... page 1', 'Executive Summary ... page 1'])
(2, ['Report Title ... page 1', '1. Introduction ... page 2'])
(3, ['Report Title ... page 1', '1. Introduction ... page 2', '1.1 Objectives ... page 2'])
(4, ['Report Title ... page 1', '2. Findings ... page 3'])
(5, ['Report Title ... page 1', '3. Conclusions ... page 5'])
(6, ['Report Title ... page 1', '4. Recommendations ... page 6'])