glam/NEXT_STEPS.md
2025-11-19 23:25:22 +01:00


Next Steps: Conversation JSON Extraction

Status: Phase 2A COMPLETE - Pattern-Based Extraction Working
Priority: HIGH
Complexity: Medium (Pattern-based completed, ML-based enhancement optional)
Last Updated: 2025-11-05


COMPLETED: Pattern-Based NLP Extractor

Implementation: src/glam_extractor/extractors/nlp_extractor.py (630 lines, 90% test coverage)

Capabilities:

  • Institution name extraction (pattern-based with keyword detection)
  • Multilingual support (English, Dutch, Spanish, Portuguese, French, German, Greek)
  • Institution type classification (13 types: MUSEUM, LIBRARY, ARCHIVE, etc.)
  • Identifier extraction (ISIL, Wikidata, VIAF, KvK)
  • Location extraction (city, country from patterns)
  • Confidence scoring (0.0-1.0)
  • Full provenance tracking (TIER_4_INFERRED)
  • Deduplication

Batch Processing Pipeline: scripts/batch_extract_institutions.py (500+ lines)

Test Results:

  • 21 tests, 20 passing (95% pass rate)
  • Successfully tested on 3 conversation files
  • Extracted 18 unique institutions from test run
  • Exports to JSON and CSV formats

Known Limitations (Pattern-Based Approach):

  1. Name variants are not merged ("Vietnamese Museum" vs "Vietnamese Museu")
  2. Many institutions have UNKNOWN country (location patterns are limited)
  3. Complex names fail ("Museum of Modern Art" is not matched by the simple patterns)
  4. No syntactic parsing; relies on keyword proximity
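
The name-variant and deduplication limitations above can be mitigated with fuzzy string matching. A minimal sketch using only the standard library; the 0.85 threshold is an assumption to tune against ground truth:

```python
from difflib import SequenceMatcher

def is_name_variant(a: str, b: str, threshold: float = 0.85) -> bool:
    """Heuristic check for name variants such as "Museum" vs "Museu".

    The 0.85 threshold is an assumption; tune it against known data.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

# One-character variants score well above the threshold
print(is_name_variant("Vietnamese Museum", "Vietnamese Museu"))  # True
```

A production deduplicator would also need to handle abbreviations and translations ("Museu" vs "Museum"), which pure edit distance does not capture.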

Overview

Parse 139 conversation JSON files to extract GLAM institution data using pattern-based NLP.

Goal: Extract ~2,000-5,000 TIER_4_INFERRED heritage custodian records from global GLAM research conversations.

Status: Basic extraction working, ready for full batch processing or ML enhancement.


Quick Start

1. List Available Conversations

# Count conversation files
find /Users/kempersc/Documents/claude/glam -name "*.json" -type f | wc -l

# Sample conversation names
ls -1 /Users/kempersc/Documents/claude/glam/*.json | head -20

2. Start Small - Test Extraction Pipeline

Pick 1-2 conversations to develop and test extraction logic:

Recommended Test Files:

  1. A Brazilian GLAM conversation (Portuguese, museums/libraries)
  2. A Dutch province conversation (already know Dutch institutions from CSV)

3. Extraction Pipeline (WORKING)

from pathlib import Path
from glam_extractor.parsers.conversation import ConversationParser
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

# Parse conversation
parser = ConversationParser()
conversation = parser.parse_file("2025-09-22T14-40-15-...-Brazilian_GLAM.json")

# Extract institutions
extractor = InstitutionExtractor()
result = extractor.extract_from_text(
    conversation.extract_all_text(),
    conversation_id=conversation.uuid
)

if result.success:
    for institution in result.value:
        print(f"{institution.name} ({institution.institution_type})")
        print(f"  Confidence: {institution.provenance.confidence_score}")

4. Batch Processing (WORKING)

# Process first 10 conversations
python scripts/batch_extract_institutions.py --limit 10

# Process all 139 conversations
python scripts/batch_extract_institutions.py

# Filter by country
python scripts/batch_extract_institutions.py --country Brazil

# Disable geocoding (faster)
python scripts/batch_extract_institutions.py --no-geocoding

# Custom output directory
python scripts/batch_extract_institutions.py --output-dir results/

Extraction Tasks (from AGENTS.md)

Phase 2A: Basic Entity Extraction (COMPLETE)

  1. Institution Names (Pattern-Based)

    • Capitalization patterns + keyword context
    • Multilingual keyword detection (7 languages)
    • Confidence scoring based on evidence
  2. Locations (Pattern Matching + GeoNames)

    • Extract cities, countries from "in [City]" patterns
    • GeoNames lookup for lat/lon enrichment
    • ISIL prefix → country code mapping
  3. Identifiers (Regex Pattern Matching)

    • ISIL codes: [A-Z]{2}-[A-Za-z0-9]+
    • Wikidata IDs: Q[0-9]+
    • VIAF IDs: viaf.org/viaf/[0-9]+
    • KvK: [0-9]{8}
    • URLs
  4. Institution Types (Keyword Classification)

    • 13-type taxonomy (MUSEUM, LIBRARY, ARCHIVE, GALLERY, etc.)
    • Multilingual keyword matching
    • Defaults to MIXED when uncertain
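
The identifier regexes in item 3 can be combined into a single extraction pass. A sketch; the exact anchoring (word boundaries) is an assumption and may need tuning:

```python
import re

# Patterns taken from the list above; \b anchoring is an assumption
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "WIKIDATA": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),
    "KVK": re.compile(r"\b[0-9]{8}\b"),
}

def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Run every identifier regex over the text and group hits by scheme."""
    return {scheme: p.findall(text) for scheme, p in IDENTIFIER_PATTERNS.items()}

found = extract_identifiers(
    "Rijksmuseum (Q190804, ISIL NL-AsdRM, https://viaf.org/viaf/123456789)"
)
print(found["WIKIDATA"])  # ['Q190804']
```

Note that a bare 8-digit KvK pattern is prone to false positives (years, phone numbers), so KvK hits deserve extra validation.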

Phase 2B: Advanced Extraction (NEXT PRIORITIES)

  1. Relationships (Pattern Matching) - NOT STARTED

    • Parent organizations
    • Partnerships
    • Network memberships
    • Next Step: Create src/glam_extractor/extractors/relationship_extractor.py
  2. Collection Metadata (Pattern Matching) - NOT STARTED

    • Collection names, types
    • Item counts, time periods
    • Subject areas
    • Next Step: Create src/glam_extractor/extractors/collection_extractor.py
  3. Digital Platforms (Pattern Matching) - NOT STARTED

    • CMS systems mentioned
    • SPARQL endpoints
    • Collection portals
    • APIs and discovery services
  4. Metadata Standards (Pattern Matching) - NOT STARTED

    • Dublin Core, MARC21, EAD, etc.
    • Schema.org, CIDOC-CRM
  5. Organizational Change Events (Pattern Matching) - NOT STARTED

    • Mergers, closures, relocations
    • Name changes, reorganizations
    • See AGENTS.md Task 8 for details
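
Since the relationship extractor is not started, here is a hedged starting-point sketch. The phrase patterns are assumptions; a real implementation would need multilingual variants and the entity boundaries produced by the NLP pipeline:

```python
import re

# Hypothetical starter patterns; English-only, assumed phrasings
RELATION_PATTERNS = [
    ("PARENT_OF", re.compile(r"(?P<child>[A-Z][\w ]+?) is part of (?P<parent>[A-Z][\w ]+)")),
    ("PARTNER_OF", re.compile(r"(?P<a>[A-Z][\w ]+?) partners with (?P<b>[A-Z][\w ]+)")),
]

def extract_relationships(text: str) -> list[tuple[str, tuple[str, ...]]]:
    """Return (relation_type, matched_names) pairs found in the text."""
    results = []
    for relation, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            results.append((relation, match.groups()))
    return results

print(extract_relationships("The Print Room is part of Rijksmuseum Amsterdam."))
```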

Implementation Strategy

Option 1: Subagent-Based NER

Pros:

  • Clean separation of concerns
  • Flexible (subagent chooses best NER approach)
  • No heavy dependencies in main code
  • Easy to experiment

Workflow:

import json

# 1. Parse conversation (parser and conv_path as in the Quick Start example)
conversation = parser.parse_file(conv_path)

# 2. Launch subagent for NER (task_tool.invoke is pseudocode for the Task tool)
result = task_tool.invoke(
    subagent_type="general",
    description="Extract GLAM institutions",
    prompt=f"""
    Extract museum, library, and archive names from this text.
    
    Text: {conversation.extract_all_text()}
    
    Return JSON array with:
    - name: institution name
    - type: museum/library/archive/mixed
    - city: location (if mentioned)
    - confidence: 0.0-1.0
    """
)

# 3. Validate and convert to HeritageCustodian
institutions = json.loads(result)
custodians = [convert_to_custodian(inst, conversation) for inst in institutions]
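
The workflow above calls convert_to_custodian without defining it. A hypothetical sketch of the validation step; the real HeritageCustodian fields are defined by schemas/heritage_custodian.yaml, so the field names and signature here are assumptions:

```python
def convert_to_custodian(inst: dict, conversation_id: str) -> dict:
    """Validate one subagent JSON record and attach TIER_4 provenance.

    Hypothetical sketch: field names are assumptions, not the schema.
    """
    if not inst.get("name"):
        raise ValueError("institution record missing required 'name'")
    confidence = float(inst.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {
        "name": inst["name"].strip(),
        "institution_type": inst.get("type", "mixed"),
        "city": inst.get("city"),
        "provenance": {
            "data_source": "CONVERSATION_NLP",
            "data_tier": "TIER_4_INFERRED",
            "confidence_score": confidence,
            "conversation_id": conversation_id,
        },
    }
```

Rejecting malformed records at this step keeps subagent hallucinations out of the dataset.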

Option 2: Direct NER (Alternative)

Pros:

  • Full control over NER pipeline
  • Better for debugging

Cons:

  • Adds spaCy dependency to main code
  • More complex error handling

Test-Driven Development Plan

Step 1: Parse Single Conversation

# Create test
touch tests/parsers/test_conversation_extraction.py

# Test: Load conversation, extract institutions (manual fixtures)
pytest tests/parsers/test_conversation_extraction.py -v

Step 2: Identifier Extraction (Regex-Based)

# Easy win: Extract ISIL codes, Wikidata IDs
# High precision, no ML needed

import re
from typing import List

def extract_isil_codes(text: str) -> List[str]:
    """Find ISIL-shaped codes, e.g. NL-AsdRM."""
    pattern = r'\b([A-Z]{2}-[A-Za-z0-9]+)\b'
    return re.findall(pattern, text)

Step 3: NER via Subagent

# Launch subagent to extract institution names
# Validate results with known institutions (e.g., Rijksmuseum)

Step 4: Batch Processing

# Process all 139 conversations
# Collect statistics (institutions per country, types, etc.)

Expected Outputs

Extraction Statistics (Estimate)

Based on 139 conversations covering 60+ countries:

  • Institutions extracted: 2,000-5,000 (rough estimate)
  • Countries covered: 60+
  • ISIL codes found: 100-300
  • Wikidata links: 500-1,000
  • Confidence distribution:
    • High (0.8-1.0): 40%
    • Medium (0.6-0.8): 35%
    • Low (0.3-0.6): 25%

Provenance Metadata

All conversation-extracted records:

provenance:
  data_source: CONVERSATION_NLP
  data_tier: TIER_4_INFERRED
  extraction_date: "2025-11-05T..."
  extraction_method: "Subagent NER via Task tool"
  confidence_score: 0.75
  conversation_id: "conversation-uuid"
  source_url: null
  verified_date: null
  verified_by: null

Cross-Linking Opportunities

Dutch Conversations + CSV Data

5 Dutch province conversations exist:

  • Limburg (NL)
  • Gelderland (NL)
  • Drenthe (NL)
  • Groningen (NL)
  • (+ general Dutch conversations)

Validation Approach:

  1. Extract institutions from Dutch conversations
  2. Match against ISIL registry (364 records)
  3. Match against Dutch orgs CSV (1,351 records)
  4. Measure extraction accuracy using known ground truth

Expected Results:

  • Precision check: % of extracted names that match CSV data
  • Recall check: % of CSV institutions mentioned in conversations
  • Name variation analysis: Different spellings, abbreviations
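
The precision and recall checks above reduce to simple set arithmetic once names are normalized. A sketch; the crude lowercasing normalization is an assumption and would likely need fuzzy matching in practice:

```python
def normalize(name: str) -> str:
    """Crude normalization; real matching likely needs fuzzy logic."""
    return " ".join(name.lower().split())

def precision_recall(extracted: list[str], ground_truth: list[str]) -> tuple[float, float]:
    """Precision and recall of extracted names against a known list."""
    extracted_set = {normalize(n) for n in extracted}
    truth_set = {normalize(n) for n in ground_truth}
    matched = extracted_set & truth_set
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(truth_set) if truth_set else 0.0
    return precision, recall

p, r = precision_recall(
    ["Rijksmuseum", "Vietnamese Museu"],
    ["Rijksmuseum", "Stedelijk Museum"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```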

Files to Create

Source Code

  • src/glam_extractor/extractors/ner.py - NER via subagents
  • src/glam_extractor/extractors/institutions.py - Institution extraction logic
  • src/glam_extractor/extractors/locations.py - Location extraction + geocoding

Tests

  • tests/extractors/test_ner.py - Subagent NER tests
  • tests/extractors/test_institutions.py - Institution extraction tests
  • tests/integration/test_conversation_pipeline.py - End-to-end tests

Scripts

  • extract_single_conversation.py - Test single conversation extraction
  • extract_all_conversations.py - Batch process all 139 files
  • validate_dutch_conversations.py - Cross-validate with CSV data

Success Criteria

Phase 2A Complete When:

  • Single conversation extraction works (1 test file)
  • Identifier extraction (ISIL, Wikidata) via regex
  • Institution name extraction via subagent
  • Location extraction via subagent
  • Provenance tracking (TIER_4, conversation_id)
  • Validation against known institutions (Dutch CSV)

Phase 2B Complete When:

  • All 139 conversations processed
  • 2,000+ heritage custodian records extracted
  • Statistics report generated (institutions per country, types)
  • Cross-linked with TIER_1 data (where applicable)
  • Exported to JSON-LD/RDF

Risks and Mitigations

Risk 1: Low Extraction Quality

  • Mitigation: Start with Dutch conversations (ground truth available)
  • Mitigation: Use confidence scoring, flag low-confidence for review

Risk 2: Multilingual NER Challenges

  • Mitigation: Let subagents choose language-specific models
  • Mitigation: Focus on English + Dutch first, expand later

Risk 3: Duplicate Detection

  • Mitigation: Implement fuzzy name matching
  • Mitigation: Cross-reference with ISIL codes

Risk 4: Performance (139 files × NER cost)

  • Mitigation: Batch processing with progress tracking
  • Mitigation: Cache subagent results
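
The result-caching mitigation for Risk 4 can be as simple as JSON files keyed by conversation UUID. A sketch; the cache directory is an assumption:

```python
import json
from pathlib import Path

# Cache location is an assumption; adjust to the project layout
CACHE_DIR = Path("output/cache")

def cached_extract(conversation_id: str, extract_fn) -> list[dict]:
    """Return cached extraction results, calling extract_fn only on a miss."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{conversation_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    results = extract_fn(conversation_id)
    cache_file.write_text(json.dumps(results, indent=2))
    return results
```

Re-running the batch then skips every conversation whose cache file already exists.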

References

  • Conversation parser: src/glam_extractor/parsers/conversation.py (complete, 90% coverage)
  • Conversation tests: tests/parsers/test_conversation.py (25 tests passing)
  • Agent instructions: AGENTS.md (NLP extraction tasks section)
  • Schema: schemas/heritage_custodian.yaml
  • Progress: PROGRESS.md (Phase 1 complete)

Quick Commands

# Run existing conversation parser tests
pytest tests/parsers/test_conversation.py -v

# Count conversations by country (filename pattern)
ls -1 *.json | grep -o '\w\+\.json' | sort | uniq -c

# Test with a single conversation
python extract_single_conversation.py "2025-09-22T14-40-15-...-Brazilian_GLAM.json"

# Process all conversations
python extract_all_conversations.py

# Validate extraction quality (Dutch conversations)
python validate_dutch_conversations.py

Immediate Next Actions

Option A: Process All Conversations (Quick Win)

# Run batch extractor on all 139 files
python scripts/batch_extract_institutions.py

# Expected output: 2,000-5,000 institutions across 60+ countries
# Output: output/institutions.json, output/institutions.csv

Option B: Improve Extraction Quality (Before Batch Run)

Priority Tasks:

  1. Fix location extraction - Improve country detection (most are UNKNOWN)
  2. Improve name extraction - Reduce variants ("Museum" vs "Museu")
  3. Add validation - Cross-check with Dutch CSV data
  4. Add Nominatim geocoding - For institutions without GeoNames match

Implementation:

  • Option 1: Enhance pattern matching in nlp_extractor.py
  • Option 2: Use subagent-based NER (spaCy/transformers) as originally planned

Option C: Build Advanced Extractors

Create extractors for:

  1. relationship_extractor.py - Organizational relationships
  2. collection_extractor.py - Collection metadata
  3. event_extractor.py - Organizational change events

Recommended Next Action:

Run Option A first to get baseline statistics, then assess quality and decide whether Option B enhancements are needed.


NEW: Australian Heritage Institution Extraction (Trove API)

Status: Ready to Extract
Priority: HIGH
Complexity: LOW (Authoritative API, no NLP required)
Date Added: 2025-11-18


Overview

Extract Australian heritage custodian organizations from the Trove API (National Library of Australia).

What is Trove?: Australia's national discovery service, aggregating collections from libraries, archives, museums, and galleries across Australia.

What is NUC?: National Union Catalogue symbols - Australia's unique identifiers for heritage institutions, equivalent to ISIL codes (format: AU-{NUC}).


Quick Start: Run Trove Extraction

1. Get Trove API Key (5 minutes)

Required: Free API key from National Library of Australia

Steps:

  1. Visit: https://trove.nla.gov.au/about/create-something/using-api
  2. Click "Sign up for an API key"
  3. Fill registration form (name, email, intended use: "Heritage institution research")
  4. Check email for API key (arrives immediately)
  5. Save the key securely

2. Run Extraction Script

cd /Users/kempersc/apps/glam

python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY

What happens:

  • Fetches all Trove contributors (estimated 200-500 institutions)
  • Retrieves full details for each (respects 200 req/min rate limit)
  • Classifies institutions by GLAMORCUBESFIXPHDNT type
  • Generates GHCID persistent identifiers (UUID v5, numeric)
  • Exports to YAML, JSON, CSV formats
  • Takes ~2-5 minutes
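
The GHCID generation step above can be sketched with the standard uuid module. The namespace UUID and the numeric derivation are assumptions here; the real definitions live in docs/PERSISTENT_IDENTIFIERS.md:

```python
import uuid

# Assumed namespace; the actual one is specified in PERSISTENT_IDENTIFIERS.md
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian")

def make_ghcid(base_string: str) -> tuple[str, int]:
    """Derive a deterministic UUID v5 and a numeric form from a base string."""
    ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, base_string)
    # Example numeric derivation (assumption): low 60 bits of the UUID integer
    ghcid_numeric = ghcid_uuid.int & ((1 << 60) - 1)
    return str(ghcid_uuid), ghcid_numeric

# UUID v5 is deterministic: the same base string always yields the same IDs
print(make_ghcid("AU-ACT-CAN-L-NLA") == make_ghcid("AU-ACT-CAN-L-NLA"))  # True
```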

Output:

data/instances/
├── trove_contributors_YYYYMMDD_HHMMSS.yaml
├── trove_contributors_YYYYMMDD_HHMMSS.json
└── trove_contributors_YYYYMMDD_HHMMSS.csv

3. Validate Results

# Count institutions
wc -l data/instances/trove_contributors_*.csv

# View sample record
head -n 50 data/instances/trove_contributors_*.yaml

# Check type distribution
grep "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c

What We Built

Completed Implementation

  1. scripts/extract_trove_contributors.py (697 lines)

    • Trove API v3 client with rate limiting
    • GHCID generator (UUID v5, numeric, base string)
    • Institution type classifier (GLAMORCUBESFIXPHDNT)
    • LinkML schema mapper (v0.2.1 compliant)
    • Multi-format exporter (YAML, JSON, CSV)
    • Provenance tracking (TIER_1_AUTHORITATIVE)
    • Type hints fixed (Optional parameters)
  2. docs/AUSTRALIA_TROVE_EXTRACTION.md (comprehensive guide)

    • API documentation and usage
    • Data quality information
    • Troubleshooting guide
    • Integration strategies
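
The rate limiting mentioned in the script summary amounts to enforcing a minimum interval between requests: 200 req/min means one request every 0.3 s. A minimal sketch of such a limiter:

```python
import time

class RateLimiter:
    """Minimal fixed-interval limiter for the 200 req/min Trove cap."""

    def __init__(self, requests_per_minute: int = 200) -> None:
        self.min_interval = 60.0 / requests_per_minute  # 0.3 s at 200 req/min
        self._last_request = 0.0

    def wait(self) -> None:
        """Sleep just long enough to honour the minimum interval."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

limiter = RateLimiter(200)
print(round(limiter.min_interval, 2))  # 0.3
```

Calling limiter.wait() before every API request keeps the client under the cap; the --delay flag simply raises this interval.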

Data Quality

TIER_1_AUTHORITATIVE classification:

  • Official source (National Library of Australia)
  • Maintained registry (curated by NLA staff)
  • Quality controlled (verified organizations)
  • Standards compliant (NUC codes map to ISIL)
  • Current data (regularly updated)

Confidence Score: 0.95 (very high)


Expected Results

Coverage

Trove API (what we're extracting now):

  • 200-500 institutions (organizations contributing to Trove)
  • Major libraries (national, state, university)
  • Government archives (state, municipal)
  • Museums with digitized collections
  • Galleries contributing to Trove

Full ISIL Registry (future enhancement):

  • 800-1,200 institutions (estimated)
  • Includes non-contributing organizations
  • Requires web scraping ILRS Directory

Sample Output

- id: https://w3id.org/heritage/custodian/au/nla
  ghcid_uuid: "550e8400-e29b-41d4-a716-446655440000"
  ghcid_numeric: 213324328442227739
  ghcid_current: AU-ACT-CAN-L-NLA
  name: National Library of Australia
  institution_type: L  # Library
  identifiers:
    - identifier_scheme: NUC
      identifier_value: NLA
    - identifier_scheme: ISIL
      identifier_value: AU-NLA
  homepage: https://www.nla.gov.au
  locations:
    - city: Canberra
      region: ACT
      country: AU
  provenance:
    data_source: TROVE_API
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: "2025-11-18T14:30:00Z"
    confidence_score: 0.95

Advanced Options

Custom Output Directory

python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --output-dir data/instances/australia

Adjust Rate Limiting

# Slower (safer if hitting rate limits)
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --delay 0.5  # 120 req/min instead of 200

Export Specific Formats

# YAML and JSON only
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --formats yaml json

Next Priorities (After Extraction)

Priority 1: Data Enrichment

After extracting Trove data:

  1. Geocoding: Convert cities to lat/lon

    python scripts/geocode_australian_institutions.py \
      --input data/instances/trove_contributors_*.yaml
    
  2. Wikidata Cross-referencing: Find Q-numbers

    python scripts/enrich_australian_with_wikidata.py \
      --input data/instances/trove_contributors_*.yaml
    

Priority 2: Full ISIL Coverage

Current: Trove API = subset (contributing organizations only)

To Get Full Coverage:

  1. Build ILRS Directory scraper
  2. Extract all ISIL codes (https://www.nla.gov.au/apps/ilrs/)
  3. Merge with Trove data

# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py \
  --output data/raw/ilrs_full_registry.csv

Priority 3: Integration

Merge Australian data with:

  • Dutch ISIL registry (comparison study)
  • Conversation extractions (find Australian institutions in JSON files)
  • Global GHCID registry (unified RDF export)

Documentation

New Files

  • scripts/extract_trove_contributors.py - Extraction script (ready to run)
  • docs/AUSTRALIA_TROVE_EXTRACTION.md - Comprehensive guide
  • NEXT_STEPS.md (this file) - Updated with Australian extraction

Related Documentation

  • Agent Instructions: AGENTS.md - Institution type taxonomy
  • Schema: schemas/heritage_custodian.yaml - LinkML v0.2.1
  • GHCIDs: docs/PERSISTENT_IDENTIFIERS.md - Identifier specification
  • Progress: PROGRESS.md - Overall project status

Troubleshooting

"API key required" Error

Solution: Register at https://trove.nla.gov.au/about/create-something/using-api

Rate Limit Errors (HTTP 429)

Solution: Increase delay between requests:

python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5

No Contributors Found

Solution: Check API key validity, internet connection, Trove status (https://status.nla.gov.au)


🎯 Your Immediate Next Action

# Step 1: Get API key (5 minutes)
# Visit: https://trove.nla.gov.au/about/create-something/using-api

# Step 2: Run extraction (2-5 minutes)
python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY

Expected Output: 200-500 Australian heritage institutions
Data Quality: TIER_1_AUTHORITATIVE
Confidence: 0.95 (very high)


Recommendation: Run Australian Trove extraction before batch processing conversations. Trove provides clean, authoritative data that can serve as a quality benchmark for the conversation NLP extractions.