Global GLAM Dataset: Implementation Phases

Overview

This document outlines the phased approach to building a comprehensive Global GLAM (Galleries, Libraries, Archives, Museums) dataset from 139 Claude conversation JSON files that document research on heritage institutions worldwide.

Phase 1: Foundation & Schema Design (Weeks 1-2)

1.1 LinkML Schema Development

Objective: Create a comprehensive heritage custodian ontology combining multiple standards

Tasks:

  • Design core LinkML schema integrating:
    • TOOI (Dutch government organizations ontology)
    • Schema.org (general web semantics)
    • CPOV (Core Public Organisation Vocabulary)
    • ISIL (International Standard Identifier for Libraries)
    • Domain-specific vocabularies:
      • RiC-O (Records in Contexts Ontology) for archives
      • BIBFRAME (Bibliographic Framework) for libraries
      • LIDO and CIDOC-CRM for museums
      • EBUCore for audiovisual and media archives

Deliverables:

  • schemas/heritage_custodian.yaml - Main LinkML schema (a minimal sketch follows below)
  • schemas/mixins/ - Modular schema components
  • docs/schema-documentation.md - Schema documentation
  • Example instances in YAML/JSON-LD
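
As a concrete starting point, here is a minimal sketch of the schema head that schemas/heritage_custodian.yaml might grow from; the class and slot names are placeholder assumptions, not the settled design. The YAML is embedded as a Python string and parsed with PyYAML so the sketch stays runnable, whereas the real file would be loaded and validated with the LinkML toolchain.

```python
# Illustrative LinkML schema head; class and slot names are placeholders,
# not the final heritage_custodian.yaml design.
import yaml  # PyYAML

SCHEMA_SKETCH = """
id: https://example.org/heritage_custodian   # placeholder IRI
name: heritage_custodian
prefixes:
  linkml: https://w3id.org/linkml/
  schema: http://schema.org/
default_range: string
classes:
  HeritageCustodian:
    description: A gallery, library, archive, or museum.
    slots: [name, institution_type, homepage, isil]
slots:
  name:
    slot_uri: schema:name
  institution_type: {}
  homepage:
    slot_uri: schema:url
  isil:
    description: International Standard Identifier for Libraries code.
"""

schema = yaml.safe_load(SCHEMA_SKETCH)
print(sorted(schema["classes"]))  # -> ['HeritageCustodian']
```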

1.2 Poetry Application Scaffolding

Objective: Set up Python development environment

Tasks:

  • Initialize the Poetry project and pin its core dependencies (e.g. linkml, spacy, crawl4ai)
  • Configure development tools (ruff, mypy, pytest)
  • Set up project structure:
    glam-extractor/
    ├── pyproject.toml
    ├── src/glam_extractor/
    │   ├── parsers/      # JSON conversation parsers
    │   ├── extractors/   # NLP-based data extraction
    │   ├── crawlers/     # crawl4ai integration
    │   ├── validators/   # LinkML validation
    │   └── exporters/    # Output formats
    ├── tests/
    └── data/
        ├── input/        # Conversation JSONs
        └── output/       # Extracted datasets
    

Deliverables:

  • Working Poetry project
  • CI/CD configuration
  • Development documentation

Phase 2: Data Parsing & Initial Extraction (Weeks 3-4)

2.1 Conversation Parser Development

Objective: Parse Claude conversation JSON files and extract text content

Tasks:

  • Implement JSON parser for conversation structure
  • Extract markdown content from messages
  • Handle citations and URL references
  • Create conversation metadata index (country, date, topics)

Deliverables:

  • ConversationParser class (sketched below)
  • Conversation index database (SQLite/DuckDB)
  • Parsed text corpus
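
A minimal sketch of the ConversationParser, assuming each export holds its messages in a top-level chat_messages list with the body in a text field; those field names are assumptions that must be verified against the actual conversation files.

```python
# ConversationParser sketch; the JSON field names (chat_messages, text)
# are assumptions about the export format, to be checked against real files.
import json
import re
from dataclasses import dataclass, field
from pathlib import Path

URL_RE = re.compile(r"https?://[^\s)\"'>]+")

@dataclass
class ParsedConversation:
    source_file: str
    texts: list[str] = field(default_factory=list)  # message bodies, in order
    urls: list[str] = field(default_factory=list)   # URLs cited in messages

class ConversationParser:
    def parse(self, path: Path) -> ParsedConversation:
        data = json.loads(path.read_text(encoding="utf-8"))
        parsed = ParsedConversation(source_file=path.name)
        # Assumed layout: a top-level "chat_messages" list whose items
        # carry the message body in a "text" field.
        for message in data.get("chat_messages", []):
            text = message.get("text", "")
            if text:
                parsed.texts.append(text)
                parsed.urls.extend(URL_RE.findall(text))
        return parsed

if __name__ == "__main__":
    parser = ConversationParser()
    for json_file in Path("data/input").glob("*.json"):
        conv = parser.parse(json_file)
        print(json_file.name, len(conv.texts), "messages,", len(conv.urls), "URLs")
```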

2.2 Sample Data Analysis

Objective: Analyze representative conversations to refine extraction patterns

Tasks:

  • Deep dive into 10-15 representative conversations covering:
    • Different countries/regions
    • Different GLAM types (galleries, libraries, archives, museums)
    • Various institution sizes (national, regional, local)
  • Document common patterns and structures
  • Identify edge cases

Deliverables:

  • Pattern analysis document
  • Extraction templates
  • Edge case catalog

Phase 3: NLP Extraction Pipeline (Weeks 5-7)

3.1 Entity Extraction

Objective: Extract heritage institutions and their attributes from conversation text

Tasks:

  • Implement Named Entity Recognition (NER) for:
    • Institution names
    • Addresses/locations
    • URLs and digital platforms
    • Contact information
    • Collection types
  • Implement entity linking to disambiguate institutions
  • Extract relationships (parent organizations, networks, partnerships)

Technologies:

  • spaCy or similar NLP framework
  • Custom entity recognition models
  • Regex patterns for structured data (URLs, codes)

Deliverables:

  • EntityExtractor class (seed-pattern sketch below)
  • Custom NER models
  • Entity disambiguation logic
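
Before any custom model exists, the EntityExtractor can seed spaCy's rule-based EntityRuler with GLAM keywords. The patterns below are illustrative only; a trained NER model would complement or replace them in the real pipeline.

```python
import spacy

# Blank pipeline + rule-based EntityRuler: runs without a model download.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Title-cased words followed by a GLAM keyword, e.g. "National Library".
    {"label": "GLAM_ORG",
     "pattern": [{"IS_TITLE": True, "OP": "+"},
                 {"LOWER": {"IN": ["museum", "library", "archives", "gallery"]}}]},
])

doc = nlp("They visited the National Library of the Netherlands and the City Archives.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # "National Library", "City Archives"
```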

3.2 Attribute Extraction

Objective: Extract structured attributes for each institution

Tasks:

  • Extract:
    • Institution types (museum, archive, library, etc.)
    • Geographic coverage
    • Collection subjects/themes
    • Digital platforms and repositories
    • Standards used (metadata, preservation, access)
    • Identifiers (ISIL, Wikidata, national IDs)
    • Funding sources
    • Consortium memberships
  • Map extracted attributes to LinkML schema

Deliverables:

  • AttributeExtractor class
  • Attribute mapping configuration (sketched below)
  • Quality metrics
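
The attribute mapping configuration can start as a plain dictionary from extracted attribute names to schema slots. The slot names below assume the illustrative Phase 1 schema sketch, and wikidata_id is a hypothetical slot.

```python
# Attribute-name -> schema-slot mapping; slot names assume the
# illustrative Phase 1 schema, and "wikidata_id" is a hypothetical slot.
ATTRIBUTE_TO_SLOT = {
    "institution type": "institution_type",
    "website": "homepage",
    "ISIL code": "isil",
    "Wikidata ID": "wikidata_id",
}

def map_attributes(raw: dict[str, str]) -> dict[str, str]:
    """Keep only attributes the schema knows, renamed to their slots."""
    return {ATTRIBUTE_TO_SLOT[key]: value
            for key, value in raw.items() if key in ATTRIBUTE_TO_SLOT}

print(map_attributes({"website": "https://example.org", "founded": "1885"}))
# -> {'homepage': 'https://example.org'}  ("founded" is dropped: no slot yet)
```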

Phase 4: Web Crawling & Enrichment (Weeks 8-10)

4.1 crawl4ai Integration

Objective: Fetch detailed information from URLs cited in conversations

Tasks:

  • Implement crawl4ai wrapper for:
    • Institution homepages
    • Digital repository platforms
    • Catalog systems
    • API documentation
  • Extract structured data from crawled pages:
    • OpenGraph metadata
    • Schema.org JSON-LD
    • Dublin Core metadata
    • Custom extraction rules per platform type
  • Implement rate limiting and ethical crawling (honor robots.txt, set an identifying User-Agent, throttle per host)

Deliverables:

  • WebCrawler class (minimal wrapper sketched below)
  • Platform-specific extractors
  • Crawl results database
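
A minimal wrapper sketch, assuming crawl4ai's AsyncWebCrawler/arun interface (verify against the installed version) and using a fixed polite delay as a crude stand-in for real per-host rate limiting.

```python
# Minimal WebCrawler sketch; assumes crawl4ai's AsyncWebCrawler interface
# (check against the installed version) and uses a fixed delay as a crude
# stand-in for per-host rate limiting.
import asyncio

from crawl4ai import AsyncWebCrawler

POLITE_DELAY_S = 2.0

async def crawl_urls(urls: list[str]) -> dict[str, str]:
    pages: dict[str, str] = {}
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            # result.markdown carries the page rendered as markdown
            pages[url] = str(result.markdown or "")
            await asyncio.sleep(POLITE_DELAY_S)
    return pages

if __name__ == "__main__":
    pages = asyncio.run(crawl_urls(["https://example.org"]))
    print({url: len(text) for url, text in pages.items()})
```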

4.2 Data Validation & Enrichment

Objective: Validate URLs and enrich extracted data

Tasks:

  • Verify URL availability (HTTP status checks)
  • Cross-reference with external authorities:
    • Wikidata
    • VIAF (Virtual International Authority File)
    • National library registries
    • UNESCO heritage databases
  • Resolve geographic coordinates
  • Normalize institution names

Deliverables:

  • DataEnricher class (Wikidata lookup sketched below)
  • External API integrations
  • Enrichment metrics
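
For the Wikidata cross-reference, a sketch against the public wbsearchentities endpoint. It returns ranked candidates rather than auto-accepting the top hit, since the DataEnricher should score matches (e.g. by country and institution type) before linking.

```python
# Wikidata candidate lookup via the public wbsearchentities API; matching
# on the top hit is naive, so candidates are returned for later scoring.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_candidates(name: str, limit: int = 3) -> list[dict]:
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "format": "json",
        "limit": limit,
    }, timeout=10)
    resp.raise_for_status()
    return [
        {"qid": hit["id"], "label": hit.get("label", ""),
         "description": hit.get("description", "")}
        for hit in resp.json().get("search", [])
    ]

print(wikidata_candidates("Rijksmuseum"))
```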

Phase 5: LinkML Instance Generation (Weeks 11-12)

5.1 Schema Mapping

Objective: Map extracted data to LinkML schema instances

Tasks:

  • Transform extracted entities to LinkML classes
  • Validate instances against schema
  • Handle missing/incomplete data
  • Generate unique identifiers (IRIs)
  • Maintain provenance (source conversations, extraction confidence)

Deliverables:

  • LinkMLMapper class (IRI/provenance sketch below)
  • Instance generator
  • Validation reports
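
Identifier minting and provenance can stay deliberately simple: derive a stable IRI from the normalized name and country so that re-running the pipeline reproduces the same identifiers. The base namespace and record layout below are placeholder assumptions.

```python
# Deterministic IRI minting plus a provenance record; the namespace and
# record layout are placeholders, not a settled design.
import hashlib
from dataclasses import dataclass

BASE_IRI = "https://example.org/glam/"  # placeholder namespace

def mint_iri(name: str, country: str) -> str:
    """Same (name, country) pair always yields the same IRI across runs."""
    digest = hashlib.sha256(f"{name}|{country}".lower().encode()).hexdigest()[:16]
    return f"{BASE_IRI}{digest}"

@dataclass
class Provenance:
    source_conversation: str      # e.g. the input JSON filename
    extraction_confidence: float  # 0.0-1.0, reported by the extractor

print(mint_iri("Rijksmuseum", "NL"))
```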

5.2 Data Quality Assurance

Objective: Ensure high-quality dataset output

Tasks:

  • Implement quality checks:
    • Completeness scoring
    • Duplicate detection
    • Inconsistency identification
  • Manual review process for flagged items
  • Create quality dashboard

Deliverables:

  • QualityChecker class (completeness sketch below)
  • Review interface
  • Quality metrics dashboard
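
Completeness scoring can start as the share of core slots an instance fills. The core-slot list and the review threshold mentioned below are illustrative assumptions.

```python
# Completeness = fraction of core slots carrying a non-empty value;
# the core-slot list is an illustrative assumption.
CORE_SLOTS = ["name", "institution_type", "homepage", "country"]

def completeness(instance: dict) -> float:
    filled = sum(1 for slot in CORE_SLOTS if instance.get(slot))
    return filled / len(CORE_SLOTS)

record = {"name": "Rijksmuseum", "homepage": "https://www.rijksmuseum.nl"}
print(f"{completeness(record):.2f}")  # 0.50 -> flag for manual review
```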

Phase 6: Export & Documentation (Weeks 13-14)

6.1 Multi-Format Export

Objective: Export dataset in multiple formats

Tasks:

  • Generate outputs:
    • RDF/Turtle - Linked data format
    • JSON-LD - JSON with linked data context
    • CSV/TSV - Tabular formats for analysis
    • YAML - Human-readable instances
    • SQLite/DuckDB - Queryable database
    • Parquet - Efficient columnar format
  • Create dataset statistics report

Deliverables:

  • DataExporter class with format plugins (registry sketch below)
  • Export configurations
  • Dataset documentation
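
The format-plugin idea can be as small as a registry mapping format names to writer functions. The sketch below shows two pandas-backed writers (Parquet needs pyarrow or fastparquet installed); the remaining formats would register the same way.

```python
# Exporter plugins as a registry of writer functions; new formats plug
# in by adding an entry. Parquet output requires pyarrow or fastparquet.
from collections.abc import Callable

import pandas as pd

EXPORTERS: dict[str, Callable[[pd.DataFrame, str], None]] = {
    "csv": lambda df, path: df.to_csv(path, index=False),
    "parquet": lambda df, path: df.to_parquet(path, index=False),
}

def export(df: pd.DataFrame, fmt: str, path: str) -> None:
    EXPORTERS[fmt](df, path)  # a KeyError doubles as "unknown format"

df = pd.DataFrame([{"name": "Rijksmuseum", "country": "NL"}])
export(df, "csv", "glam.csv")
export(df, "parquet", "glam.parquet")
```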

6.2 OpenCODE Agent Development

Objective: Create AI agents for extraction assistance

Tasks:

  • Write AGENTS.md with:
    • NLP extraction subagent instructions
    • Pattern recognition guidelines
    • Quality review agent specifications
  • Implement agent integration in pipeline
  • Test agent-assisted extraction

Deliverables:

  • AGENTS.md documentation
  • Agent integration code
  • Agent performance metrics

6.3 Documentation & Publication

Objective: Comprehensive project documentation

Tasks:

  • Write user guides:
    • Installation and setup
    • Running extraction pipeline
    • Querying the dataset
    • Contributing new sources
  • Create schema documentation site
  • Prepare dataset for publication:
    • License selection (CC0, CC-BY, ODbL)
    • Metadata record
    • DOI registration
    • GitHub repository preparation

Deliverables:

  • Complete documentation
  • Published dataset
  • Public repository

Phase 7: Maintenance & Extension (Ongoing)

7.1 Continuous Improvement

Tasks:

  • Monitor data quality metrics
  • Refine extraction algorithms based on feedback
  • Add support for new conversation sources
  • Update schema as standards evolve
  • Integrate community contributions

7.2 Dataset Updates

Tasks:

  • Re-crawl URLs periodically for updates
  • Process new conversation files
  • Maintain dataset versioning
  • Publish regular releases

Success Metrics

Quantitative

  • Coverage: Extract ≥80% of institutions mentioned in conversations
  • Accuracy: ≥90% precision on manual validation sample
  • Completeness: ≥70% of extracted institutions have URLs and types
  • Enrichment: ≥50% of institutions linked to external authorities

Qualitative

  • Schema can represent diverse GLAM institutions globally
  • Dataset is reusable by heritage sector practitioners
  • Documentation enables community contributions
  • Pipeline is maintainable and extensible

Risk Mitigation

Technical Risks

  • NLP extraction accuracy: Mitigate with human-in-the-loop review and iterative refinement
  • Web crawling reliability: Implement robust error handling and retry logic
  • Schema complexity: Start simple, iterate based on real data needs

Resource Risks

  • Processing time: Parallelize where possible, use incremental processing
  • Storage requirements: Use efficient formats (Parquet), compress archives

Timeline Summary

  • Weeks 1-2: Foundation (schemas, scaffolding)
  • Weeks 3-4: Parsing & analysis
  • Weeks 5-7: NLP extraction
  • Weeks 8-10: Web crawling & enrichment
  • Weeks 11-12: LinkML generation & QA
  • Weeks 13-14: Export & documentation
  • Ongoing: Maintenance & updates

Total estimated duration: 14 weeks to the initial release, followed by ongoing maintenance.