| .. | ||
| demo_nlp_extractor.py | ||
| extract_identifiers.py | ||
| heritage_custodian_instances.yaml | ||
| README.md | ||
Examples
This directory contains usage examples for the GLAM data extraction pipeline.
Available Examples
extract_identifiers.py
Demonstrates how to extract identifiers (ISIL, Wikidata, VIAF, KvK, URLs) from conversation JSON files.
Usage:
cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=./src:$PYTHONPATH python examples/extract_identifiers.py
What it does:
- Loads a sample conversation JSON file
- Parses the conversation structure
- Extracts text from assistant messages
- Runs identifier extraction using regex patterns
- Displays results grouped by identifier type
Expected output:
=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4
Identifiers by scheme:
ISIL: NL-ASDRM, NL-HANA
URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl
Running Examples
All examples should be run from the project root with PYTHONPATH set:
# From project root
cd /Users/kempersc/Documents/claude/glam
# Set PYTHONPATH and run
PYTHONPATH=./src:$PYTHONPATH python examples/<example_name>.py
Future Examples
- extract_from_csv.py - Parse Dutch ISIL registry and organizations CSV
- extract_with_ner.py - Use subagent-based NER to extract institution names
- geocode_locations.py - Geocode addresses to lat/lon coordinates
- export_to_jsonld.py - Export extracted data to JSON-LD format
- validate_schema.py - Validate data against LinkML schema
- deduplicate.py - Find and merge duplicate institution records