glam/examples/README.md

# Examples

This directory contains usage examples for the GLAM data extraction pipeline.

## Available Examples

### extract_identifiers.py

Demonstrates how to extract identifiers (ISIL, Wikidata, VIAF, KvK, URLs) from conversation JSON files.

**Usage**:
```bash
cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=./src:$PYTHONPATH python examples/extract_identifiers.py
```

**What it does**:
1. Loads a sample conversation JSON file
2. Parses the conversation structure
3. Extracts text from assistant messages
4. Runs identifier extraction using regex patterns
5. Displays results grouped by identifier type

**Expected output**:
```
=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4

Identifiers by scheme:
  ISIL: NL-ASDRM, NL-HANA
  URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl
```

## Running Examples

All examples should be run from the project root with PYTHONPATH set:

```bash
# From project root
cd /Users/kempersc/Documents/claude/glam

# Set PYTHONPATH and run
PYTHONPATH=./src:$PYTHONPATH python examples/<example_name>.py
```

## Future Examples

- **extract_from_csv.py** - Parse Dutch ISIL registry and organizations CSV
- **extract_with_ner.py** - Use subagent-based NER to extract institution names
- **geocode_locations.py** - Geocode addresses to lat/lon coordinates
- **export_to_jsonld.py** - Export extracted data to JSON-LD format
- **validate_schema.py** - Validate data against LinkML schema
- **deduplicate.py** - Find and merge duplicate institution records