History

kempersc 3c80de87e0 add isil entries		2025-11-19 23:25:22 +01:00
..
demo_nlp_extractor.py	add isil entries	2025-11-19 23:25:22 +01:00
extract_identifiers.py	add isil entries	2025-11-19 23:25:22 +01:00
heritage_custodian_instances.yaml	add isil entries	2025-11-19 23:25:22 +01:00
README.md	add isil entries	2025-11-19 23:25:22 +01:00

README.md

Examples

This directory contains usage examples for the GLAM data extraction pipeline.

Available Examples

extract_identifiers.py

Demonstrates how to extract identifiers (ISIL, Wikidata, VIAF, KvK, URLs) from conversation JSON files.

Usage:

cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=./src:$PYTHONPATH python examples/extract_identifiers.py

What it does:

Loads a sample conversation JSON file
Parses the conversation structure
Extracts text from assistant messages
Runs identifier extraction using regex patterns
Displays results grouped by identifier type

Expected output:

=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4

Identifiers by scheme:
  ISIL: NL-ASDRM, NL-HANA
  URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl

Running Examples

All examples should be run from the project root with PYTHONPATH set:

# From project root
cd /Users/kempersc/Documents/claude/glam

# Set PYTHONPATH and run
PYTHONPATH=./src:$PYTHONPATH python examples/<example_name>.py

Future Examples

extract_from_csv.py - Parse Dutch ISIL registry and organizations CSV
extract_with_ner.py - Use subagent-based NER to extract institution names
geocode_locations.py - Geocode addresses to lat/lon coordinates
export_to_jsonld.py - Export extracted data to JSON-LD format
validate_schema.py - Validate data against LinkML schema
deduplicate.py - Find and merge duplicate institution records