glam/NEXT_STEPS.md

# Next Steps: Conversation JSON Extraction

**Status**: ✅ Phase 2A COMPLETE - Pattern-Based Extraction Working
**Priority**: HIGH
**Complexity**: Medium (Pattern-based completed, ML-based enhancement optional)
**Last Updated**: 2025-11-05

---

## ✅ COMPLETED: Pattern-Based NLP Extractor

**Implementation**: `src/glam_extractor/extractors/nlp_extractor.py` (630 lines, 90% test coverage)

**Capabilities**:
- ✅ Institution name extraction (pattern-based with keyword detection)
- ✅ Multilingual support (English, Dutch, Spanish, Portuguese, French, German, Greek)
- ✅ Institution type classification (13 types: MUSEUM, LIBRARY, ARCHIVE, etc.)
- ✅ Identifier extraction (ISIL, Wikidata, VIAF, KvK)
- ✅ Location extraction (city, country from patterns)
- ✅ Confidence scoring (0.0-1.0)
- ✅ Full provenance tracking (TIER_4_INFERRED)
- ✅ Deduplication

**Batch Processing Pipeline**: `scripts/batch_extract_institutions.py` (500+ lines)

**Test Results**:
- 21 tests, 20 passing (95% pass rate)
- Successfully tested on 3 conversation files
- Extracted 18 unique institutions from test run
- Exports to JSON and CSV formats

**Known Limitations** (Pattern-Based Approach):
1. Name variants ("Vietnamese Museum" vs "Vietnamese Museu")
2. Many institutions have UNKNOWN country (location patterns limited)
3. Complex names fail ("Museum of Modern Art" not matched by simple patterns)
4. No syntactic parsing, relies on keyword proximity

---

## Overview

Parse 139 conversation JSON files to extract GLAM institution data using pattern-based NLP.

**Goal**: Extract ~2,000-5,000 TIER_4_INFERRED heritage custodian records from global GLAM research conversations.

**Status**: Basic extraction working, ready for full batch processing or ML enhancement.

---

## Quick Start

### 1. List Available Conversations

```bash
# Count conversation files
find /Users/kempersc/Documents/claude/glam -name "*.json" -type f | wc -l

# Sample conversation names
ls -1 /Users/kempersc/Documents/claude/glam/*.json | head -20
```

### 2. Start Small - Test Extraction Pipeline

Pick **1-2 conversations** to develop and test extraction logic:

**Recommended Test Files**:
1. A Brazilian GLAM conversation (Portuguese, museums/libraries)
2. A Dutch province conversation (already know Dutch institutions from CSV)

### 3. ✅ Extraction Pipeline (WORKING)

```python
from pathlib import Path
from glam_extractor.parsers.conversation import ConversationParser
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.parse_file("2025-09-22T14-40-15-...-Brazilian_GLAM.json")

# Extract institutions
extractor = InstitutionExtractor()
result = extractor.extract_from_text(
    conversation.extract_all_text(),
    conversation_id=conversation.uuid
)

if result.success:
    for institution in result.value:
        print(f"{institution.name} ({institution.institution_type})")
        print(f"  Confidence: {institution.provenance.confidence_score}")
```

### 4. ✅ Batch Processing (WORKING)

```bash
# Process first 10 conversations
python scripts/batch_extract_institutions.py --limit 10

# Process all 139 conversations
python scripts/batch_extract_institutions.py

# Filter by country
python scripts/batch_extract_institutions.py --country Brazil

# Disable geocoding (faster)
python scripts/batch_extract_institutions.py --no-geocoding

# Custom output directory
python scripts/batch_extract_institutions.py --output-dir results/
```

---

## Extraction Tasks (from AGENTS.md)

### ✅ Phase 2A: Basic Entity Extraction (COMPLETE)

1. **✅ Institution Names** (Pattern-Based)
   - Capitalization patterns + keyword context
   - Multilingual keyword detection (7 languages)
   - Confidence scoring based on evidence

2. **✅ Locations** (Pattern Matching + GeoNames)
   - Extract cities, countries from "in [City]" patterns
   - GeoNames lookup for lat/lon enrichment
   - ISIL prefix → country code mapping

3. **✅ Identifiers** (Regex Pattern Matching)
   - ISIL codes: `[A-Z]{2}-[A-Za-z0-9]+` ✅
   - Wikidata IDs: `Q[0-9]+` ✅
   - VIAF IDs: `viaf.org/viaf/[0-9]+` ✅
   - KvK: `[0-9]{8}` ✅
   - URLs ✅

4. **✅ Institution Types** (Keyword Classification)
   - 13-type taxonomy (MUSEUM, LIBRARY, ARCHIVE, GALLERY, etc.)
   - Multilingual keyword matching
   - Defaults to MIXED when uncertain

### ⏳ Phase 2B: Advanced Extraction (NEXT PRIORITIES)

5. **Relationships** (Pattern Matching) - NOT STARTED
   - Parent organizations
   - Partnerships
   - Network memberships
   - **Next Step**: Create `src/glam_extractor/extractors/relationship_extractor.py`

6. **Collection Metadata** (Pattern Matching) - NOT STARTED
   - Collection names, types
   - Item counts, time periods
   - Subject areas
   - **Next Step**: Create `src/glam_extractor/extractors/collection_extractor.py`

7. **Digital Platforms** (Pattern Matching) - NOT STARTED
   - CMS systems mentioned
   - SPARQL endpoints
   - Collection portals
   - APIs and discovery services

8. **Metadata Standards** (Pattern Matching) - NOT STARTED
   - Dublin Core, MARC21, EAD, etc.
   - Schema.org, CIDOC-CRM

9. **Organizational Change Events** (Pattern Matching) - NOT STARTED
   - Mergers, closures, relocations
   - Name changes, reorganizations
   - See AGENTS.md Task 8 for details

---

## Implementation Strategy

### Option 1: Subagent-Based (Recommended)

**Pros**:
- Clean separation of concerns
- Flexible (subagent chooses best NER approach)
- No heavy dependencies in main code
- Easy to experiment

**Workflow**:
```python
# 1. Parse conversation
conversation = parser.parse_file(conv_path)

# 2. Launch subagent for NER
result = task_tool.invoke(
    subagent_type="general",
    description="Extract GLAM institutions",
    prompt=f"""
    Extract museum, library, and archive names from this text.

    Text: {conversation.extract_all_text()}

    Return JSON array with:
    - name: institution name
    - type: museum/library/archive/mixed
    - city: location (if mentioned)
    - confidence: 0.0-1.0
    """
)

# 3. Validate and convert to HeritageCustodian
institutions = json.loads(result)
custodians = [convert_to_custodian(inst, conversation) for inst in institutions]
```

### Option 2: Direct NER (Alternative)

**Pros**:
- Full control over NER pipeline
- Better for debugging

**Cons**:
- Adds spaCy dependency to main code
- More complex error handling

---

## Test-Driven Development Plan

### Step 1: Parse Single Conversation
```bash
# Create test
touch tests/parsers/test_conversation_extraction.py

# Test: Load conversation, extract institutions (manual fixtures)
pytest tests/parsers/test_conversation_extraction.py -v
```

### Step 2: Identifier Extraction (Regex-Based)
```python
# Easy win: Extract ISIL codes, Wikidata IDs
# High precision, no ML needed

def extract_isil_codes(text: str) -> List[str]:
    pattern = r'\b([A-Z]{2}-[A-Za-z0-9]+)\b'
    return re.findall(pattern, text)
```

### Step 3: NER via Subagent
```python
# Launch subagent to extract institution names
# Validate results with known institutions (e.g., Rijksmuseum)
```

### Step 4: Batch Processing
```python
# Process all 139 conversations
# Collect statistics (institutions per country, types, etc.)
```

---

## Expected Outputs

### Extraction Statistics (Estimate)

Based on 139 conversations covering 60+ countries:

- **Institutions extracted**: 2,000-5,000 (rough estimate)
- **Countries covered**: 60+
- **ISIL codes found**: 100-300
- **Wikidata links**: 500-1,000
- **Confidence distribution**:
  - High (0.8-1.0): 40%
  - Medium (0.6-0.8): 35%
  - Low (0.3-0.6): 25%

### Provenance Metadata

All conversation-extracted records:
```yaml
provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-05T..."
  extraction_method: "Subagent NER via Task tool"
  confidence_score: 0.75
  conversation_id: "conversation-uuid"
  source_url: null
  verified_date: null
  verified_by: null
```

---

## Cross-Linking Opportunities

### Dutch Conversations + CSV Data

5 Dutch province conversations exist:
- Limburg (NL)
- Gelderland (NL)
- Drenthe (NL)
- Groningen (NL)
- (+ general Dutch conversations)

**Validation Approach**:
1. Extract institutions from Dutch conversations
2. Match against ISIL registry (364 records)
3. Match against Dutch orgs CSV (1,351 records)
4. **Measure extraction accuracy** using known ground truth

**Expected Results**:
- Precision check: % of extracted names that match CSV data
- Recall check: % of CSV institutions mentioned in conversations
- Name variation analysis: Different spellings, abbreviations

---

## Files to Create

### Source Code
- `src/glam_extractor/extractors/ner.py` - NER via subagents
- `src/glam_extractor/extractors/institutions.py` - Institution extraction logic
- `src/glam_extractor/extractors/locations.py` - Location extraction + geocoding

### Tests
- `tests/extractors/test_ner.py` - Subagent NER tests
- `tests/extractors/test_institutions.py` - Institution extraction tests
- `tests/integration/test_conversation_pipeline.py` - End-to-end tests

### Scripts
- `extract_single_conversation.py` - Test single conversation extraction
- `extract_all_conversations.py` - Batch process all 139 files
- `validate_dutch_conversations.py` - Cross-validate with CSV data

---

## Success Criteria

### Phase 2A Complete When:
- ✅ Single conversation extraction works (1 test file)
- ✅ Identifier extraction (ISIL, Wikidata) via regex
- ✅ Institution name extraction via subagent
- ✅ Location extraction via subagent
- ✅ Provenance tracking (TIER_4, conversation_id)
- ✅ Validation against known institutions (Dutch CSV)

### Phase 2B Complete When:
- ✅ All 139 conversations processed
- ✅ 2,000+ heritage custodian records extracted
- ✅ Statistics report generated (institutions per country, types)
- ✅ Cross-linked with TIER_1 data (where applicable)
- ✅ Exported to JSON-LD/RDF

---

## Risks and Mitigations

### Risk 1: Low Extraction Quality
- **Mitigation**: Start with Dutch conversations (ground truth available)
- **Mitigation**: Use confidence scoring, flag low-confidence for review

### Risk 2: Multilingual NER Challenges
- **Mitigation**: Let subagents choose language-specific models
- **Mitigation**: Focus on English + Dutch first, expand later

### Risk 3: Duplicate Detection
- **Mitigation**: Implement fuzzy name matching
- **Mitigation**: Cross-reference with ISIL codes

### Risk 4: Performance (139 files × NER cost)
- **Mitigation**: Batch processing with progress tracking
- **Mitigation**: Cache subagent results

---

## References

- **Conversation parser**: `src/glam_extractor/parsers/conversation.py` (✅ complete, 90% coverage)
- **Conversation tests**: `tests/parsers/test_conversation.py` (25 tests passing)
- **Agent instructions**: `AGENTS.md` (NLP extraction tasks section)
- **Schema**: `schemas/heritage_custodian.yaml`
- **Progress**: `PROGRESS.md` (Phase 1 complete)

---

## Quick Commands

```bash
# Run existing conversation parser tests
pytest tests/parsers/test_conversation.py -v

# Count conversations by country (filename pattern)
ls -1 *.json | grep -o '\w\+\.json' | sort | uniq -c

# Test with a single conversation
python extract_single_conversation.py "2025-09-22T14-40-15-...-Brazilian_GLAM.json"

# Process all conversations
python extract_all_conversations.py

# Validate extraction quality (Dutch conversations)
python validate_dutch_conversations.py
```

---

## Immediate Next Actions

### Option A: Process All Conversations (Quick Win)
```bash
# Run batch extractor on all 139 files
python scripts/batch_extract_institutions.py

# Expected output: 2,000-5,000 institutions across 60+ countries
# Output: output/institutions.json, output/institutions.csv
```

### Option B: Improve Extraction Quality (Before Batch Run)

**Priority Tasks**:
1. **Fix location extraction** - Improve country detection (most are UNKNOWN)
2. **Improve name extraction** - Reduce variants ("Museum" vs "Museu")
3. **Add validation** - Cross-check with Dutch CSV data
4. **Add Nominatim geocoding** - For institutions without GeoNames match

**Implementation**:
- Option 1: Enhance pattern matching in `nlp_extractor.py`
- Option 2: Use subagent-based NER (spaCy/transformers) as originally planned

### Option C: Build Advanced Extractors

Create extractors for:
1. `relationship_extractor.py` - Organizational relationships
2. `collection_extractor.py` - Collection metadata
3. `event_extractor.py` - Organizational change events

---

**Recommended Next Action**:

**Run Option A first** to get baseline statistics, then assess quality and decide whether Option B enhancements are needed.

---

# NEW: Australian Heritage Institution Extraction (Trove API)

**Status**: ✅ Ready to Extract
**Priority**: HIGH
**Complexity**: LOW (Authoritative API, no NLP required)
**Date Added**: 2025-11-18

---

## Overview

Extract Australian heritage custodian organizations from the **Trove API** (National Library of Australia).

**What is Trove?**: Australia's national discovery service aggregating collections from libraries, archives, museums, galleries across Australia.

**What is NUC?**: National Union Catalogue symbols - Australia's unique identifiers for heritage institutions, equivalent to ISIL codes (format: `AU-{NUC}`).

---

## Quick Start: Run Trove Extraction

### 1. Get Trove API Key (5 minutes)

**Required**: Free API key from National Library of Australia

**Steps**:
1. Visit: https://trove.nla.gov.au/about/create-something/using-api
2. Click "Sign up for an API key"
3. Fill registration form (name, email, intended use: "Heritage institution research")
4. Check email for API key (arrives immediately)
5. Save the key securely

### 2. Run Extraction Script

```bash
cd /Users/kempersc/apps/glam

python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY
```

**What happens**:
- Fetches all Trove contributors (estimated 200-500 institutions)
- Retrieves full details for each (respects 200 req/min rate limit)
- Classifies institutions by GLAMORCUBESFIXPHDNT type
- Generates GHCID persistent identifiers (UUID v5, numeric)
- Exports to YAML, JSON, CSV formats
- Takes ~2-5 minutes

**Output**:
```
data/instances/
├── trove_contributors_YYYYMMDD_HHMMSS.yaml
├── trove_contributors_YYYYMMDD_HHMMSS.json
└── trove_contributors_YYYYMMDD_HHMMSS.csv
```

### 3. Validate Results

```bash
# Count institutions
wc -l data/instances/trove_contributors_*.csv

# View sample record
head -n 50 data/instances/trove_contributors_*.yaml

# Check type distribution
grep "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c
```

---

## What We Built

### ✅ Completed Implementation

1. **`scripts/extract_trove_contributors.py`** (697 lines)
   - ✅ Trove API v3 client with rate limiting
   - ✅ GHCID generator (UUID v5, numeric, base string)
   - ✅ Institution type classifier (GLAMORCUBESFIXPHDNT)
   - ✅ LinkML schema mapper (v0.2.1 compliant)
   - ✅ Multi-format exporter (YAML, JSON, CSV)
   - ✅ Provenance tracking (TIER_1_AUTHORITATIVE)
   - ✅ Type hints fixed (Optional parameters)

2. **`docs/AUSTRALIA_TROVE_EXTRACTION.md`** (comprehensive guide)
   - API documentation and usage
   - Data quality information
   - Troubleshooting guide
   - Integration strategies

### Data Quality

**TIER_1_AUTHORITATIVE** classification:
- ✅ Official source (National Library of Australia)
- ✅ Maintained registry (curated by NLA staff)
- ✅ Quality controlled (verified organizations)
- ✅ Standards compliant (NUC codes map to ISIL)
- ✅ Current data (regularly updated)

**Confidence Score**: 0.95 (very high)

---

## Expected Results

### Coverage

**Trove API** (what we're extracting now):
- **200-500 institutions** (organizations contributing to Trove)
- Major libraries (national, state, university)
- Government archives (state, municipal)
- Museums with digitized collections
- Galleries contributing to Trove

**Full ISIL Registry** (future enhancement):
- **800-1,200 institutions** (estimated)
- Includes non-contributing organizations
- Requires web scraping ILRS Directory

### Sample Output

```yaml
- id: https://w3id.org/heritage/custodian/au/nla
  ghcid_uuid: "550e8400-e29b-41d4-a716-446655440000"
  ghcid_numeric: 213324328442227739
  ghcid_current: AU-ACT-CAN-L-NLA
  name: National Library of Australia
  institution_type: L  # Library
  identifiers:
    - identifier_scheme: NUC
      identifier_value: NLA
    - identifier_scheme: ISIL
      identifier_value: AU-NLA
  homepage: https://www.nla.gov.au
  locations:
    - city: Canberra
      region: ACT
      country: AU
  provenance:
    data_source: TROVE_API
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: "2025-11-18T14:30:00Z"
    confidence_score: 0.95
```

---

## Advanced Options

### Custom Output Directory

```bash
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --output-dir data/instances/australia
```

### Adjust Rate Limiting

```bash
# Slower (safer if hitting rate limits)
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --delay 0.5  # 120 req/min instead of 200
```

### Export Specific Formats

```bash
# YAML and JSON only
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --formats yaml json
```

---

## Next Priorities (After Extraction)

### Priority 1: Data Enrichment

After extracting Trove data:

1. **Geocoding**: Convert cities to lat/lon
   ```bash
   python scripts/geocode_australian_institutions.py \
     --input data/instances/trove_contributors_*.yaml
   ```

2. **Wikidata Cross-referencing**: Find Q-numbers
   ```bash
   python scripts/enrich_australian_with_wikidata.py \
     --input data/instances/trove_contributors_*.yaml
   ```

### Priority 2: Full ISIL Coverage

**Current**: Trove API = subset (contributing organizations only)

**To Get Full Coverage**:
1. Build ILRS Directory scraper
2. Extract all ISIL codes (https://www.nla.gov.au/apps/ilrs/)
3. Merge with Trove data

```bash
# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py \
  --output data/raw/ilrs_full_registry.csv
```

### Priority 3: Integration

Merge Australian data with:
- Dutch ISIL registry (comparison study)
- Conversation extractions (find Australian institutions in JSON files)
- Global GHCID registry (unified RDF export)

---

## Documentation

### New Files

- **`scripts/extract_trove_contributors.py`** - Extraction script (ready to run)
- **`docs/AUSTRALIA_TROVE_EXTRACTION.md`** - Comprehensive guide
- **`NEXT_STEPS.md`** (this file) - Updated with Australian extraction

### Related Documentation

- **Agent Instructions**: `AGENTS.md` - Institution type taxonomy
- **Schema**: `schemas/heritage_custodian.yaml` - LinkML v0.2.1
- **GHCIDs**: `docs/PERSISTENT_IDENTIFIERS.md` - Identifier specification
- **Progress**: `PROGRESS.md` - Overall project status

---

## Troubleshooting

### "API key required" Error

**Solution**: Register at https://trove.nla.gov.au/about/create-something/using-api

### Rate Limit Errors (HTTP 429)

**Solution**: Increase delay between requests:
```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5
```

### No Contributors Found

**Solution**: Check API key validity, internet connection, Trove status (https://status.nla.gov.au)

---

## 🎯 Your Immediate Next Action

```bash
# Step 1: Get API key (5 minutes)
# Visit: https://trove.nla.gov.au/about/create-something/using-api

# Step 2: Run extraction (2-5 minutes)
python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY
```

**Expected Output**: 200-500 Australian heritage institutions
**Data Quality**: TIER_1_AUTHORITATIVE
**Confidence**: 0.95 (very high)

---

**Recommendation**: Run **Australian Trove extraction** before batch processing conversations. Trove provides clean, authoritative data that can serve as a quality benchmark for the conversation NLP extractions.