# Examples

This directory contains usage examples for the GLAM data extraction pipeline.

## Available Examples

### extract_identifiers.py

Demonstrates how to extract identifiers (ISIL, Wikidata, VIAF, KvK, URLs) from conversation JSON files.

**Usage**:
```bash
# From the project root (adjust the path to your checkout)
cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=./src:$PYTHONPATH python examples/extract_identifiers.py
```
**What it does**:

1. Loads a sample conversation JSON file
2. Parses the conversation structure
3. Extracts text from assistant messages
4. Runs identifier extraction using regex patterns
5. Displays results grouped by identifier type

**Expected output**:
```
=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4

Identifiers by scheme:
ISIL: NL-ASDRM, NL-HANA
URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl
```
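The five steps listed under "What it does" can be sketched in a few lines of Python. This is a minimal illustration, not the code in `extract_identifiers.py`: the JSON layout (`name`, `chat_messages`, `sender`, `text`), the function names, the sample message text, and the simplified ISIL and URL patterns are all assumptions made for the example.

```python
import json
import re
from collections import defaultdict

# Simplified, hypothetical patterns; the real pipeline may use stricter ones.
PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]{1,11}\b"),
    "URL": re.compile(r"https?://[^\s<>\")\]]+"),
}


def assistant_text(conversation: dict) -> str:
    """Join the text of all assistant messages (assumed message layout)."""
    return "\n".join(
        msg.get("text", "")
        for msg in conversation.get("chat_messages", [])
        if msg.get("sender") == "assistant"
    )


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return unique matches per scheme, in order of first appearance."""
    found = defaultdict(list)
    for scheme, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            match = match.rstrip(".,;")  # drop trailing sentence punctuation
            if match not in found[scheme]:
                found[scheme].append(match)
    return dict(found)


def main() -> None:
    # Inline stand-in for a conversation JSON file; a real script would
    # read the file with json.load() instead.
    raw = json.dumps({
        "name": "Test Dutch GLAM Institutions",
        "chat_messages": [
            {"sender": "human", "text": "Which Dutch institutions have an ISIL code?"},
            {"sender": "assistant",
             "text": "The Rijksmuseum (NL-ASDRM): https://www.rijksmuseum.nl/en/rijksstudio."},
        ],
    })
    conversation = json.loads(raw)            # steps 1 and 2: load and parse
    text = assistant_text(conversation)       # step 3: assistant messages only
    identifiers = extract_identifiers(text)   # step 4: regex extraction

    # Step 5: display results grouped by identifier type
    print(f"=== Conversation: {conversation['name']} ===")
    print(f"Messages: {len(conversation['chat_messages'])}")
    print(f"Total identifiers found: {sum(len(v) for v in identifiers.values())}")
    print("\nIdentifiers by scheme:")
    for scheme, values in identifiers.items():
        print(f"{scheme}: {', '.join(values)}")


if __name__ == "__main__":
    main()
```

Keeping one compiled pattern per scheme in a dictionary makes it straightforward to add further schemes (for example Wikidata or VIAF) without changing the extraction loop.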
## Running Examples

Run every example from the project root with `PYTHONPATH` pointing at `src/`, so that the pipeline modules can be imported:
```bash
# From project root
cd /Users/kempersc/Documents/claude/glam

# Set PYTHONPATH and run
PYTHONPATH=./src:$PYTHONPATH python examples/<example_name>.py
```
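The `PYTHONPATH=./src:$PYTHONPATH` prefix adds the `src/` directory to Python's module search path. Because `./src` is a relative path, it only resolves correctly when the command is run from the project root. As an illustration of what the setting achieves, a script could get a similar effect by extending `sys.path` itself. This is a sketch, not something the examples are known to do; `add_src_to_path` is a hypothetical helper, and the `src/` layout is taken from the command above.

```python
import sys
from pathlib import Path


def add_src_to_path(project_root: Path) -> Path:
    """Prepend <project_root>/src to sys.path if it is not already there."""
    src_dir = (project_root / "src").resolve()
    if str(src_dir) not in sys.path:
        sys.path.insert(0, str(src_dir))
    return src_dir


if __name__ == "__main__":
    # Assumes the current working directory is the project root,
    # as in the shell commands above.
    print(f"Added to module search path: {add_src_to_path(Path.cwd())}")
```

Setting `PYTHONPATH` on the command line keeps the example scripts free of path handling, which is presumably why the commands above take that route.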
## Future Examples

- **extract_from_csv.py** - Parse Dutch ISIL registry and organizations CSV
- **extract_with_ner.py** - Use subagent-based NER to extract institution names
- **geocode_locations.py** - Geocode addresses to lat/lon coordinates
- **export_to_jsonld.py** - Export extracted data to JSON-LD format
- **validate_schema.py** - Validate data against LinkML schema
- **deduplicate.py** - Find and merge duplicate institution records