# Global GLAM Dataset: Implementation Phases

## Overview

This document outlines the phased approach to building a comprehensive Global GLAM (Galleries, Libraries, Archives, Museums) dataset from 139 Claude conversation JSON files containing heritage institution research worldwide.

## Phase 1: Foundation & Schema Design (Weeks 1-2)

### 1.1 LinkML Schema Development

**Objective**: Create a comprehensive heritage custodian ontology combining multiple standards

**Tasks**:
- Design core LinkML schema integrating:
  - **TOOI** (Dutch government organizations ontology)
  - **Schema.org** (general web semantics)
  - **CPOV** (Core Public Organization Vocabulary)
  - **ISIL** (International Standard Identifier for Libraries)
- Incorporate domain-specific vocabularies:
  - **RiC-O** (Records in Contexts - Archives)
  - **BIBFRAME** (Bibliographic Framework - Libraries)
  - **LIDO/CIDOC-CRM** (Museums)
  - **EBUCore** (Media archives)

**Deliverables**:
- `schemas/heritage_custodian.yaml` - Main LinkML schema
- `schemas/mixins/` - Modular schema components
- `docs/schema-documentation.md` - Schema documentation
- Example instances in YAML/JSON-LD

### 1.2 Poetry Application Scaffolding

**Objective**: Set up the Python development environment

**Tasks**:
- Initialize Poetry project with proper dependencies
- Configure development tools (ruff, mypy, pytest)
- Set up project structure:

```
glam-extractor/
├── pyproject.toml
├── src/glam_extractor/
│   ├── parsers/       # JSON conversation parsers
│   ├── extractors/    # NLP-based data extraction
│   ├── crawlers/      # crawl4ai integration
│   ├── validators/    # LinkML validation
│   └── exporters/     # Output formats
├── tests/
└── data/
    ├── input/         # Conversation JSONs
    └── output/        # Extracted datasets
```

**Deliverables**:
- Working Poetry project
- CI/CD configuration
- Development documentation

## Phase 2: Data Parsing & Initial Extraction (Weeks 3-4)

### 2.1 Conversation Parser Development

**Objective**: Parse Claude conversation JSON files and extract text content

**Tasks**:
- Implement JSON parser for conversation structure
- Extract markdown content from messages
- Handle citations and URL references
- Create conversation metadata index (country, date, topics)

**Deliverables**:
- `ConversationParser` class
- Conversation index database (SQLite/DuckDB)
- Parsed text corpus

### 2.2 Sample Data Analysis

**Objective**: Analyze representative conversations to refine extraction patterns

**Tasks**:
- Deep dive into 10-15 representative conversations covering:
  - Different countries/regions
  - Different GLAM types (galleries, libraries, archives, museums)
  - Various institution sizes (national, regional, local)
- Document common patterns and structures
- Identify edge cases

**Deliverables**:
- Pattern analysis document
- Extraction templates
- Edge case catalog

## Phase 3: NLP Extraction Pipeline (Weeks 5-7)

### 3.1 Entity Extraction

**Objective**: Extract heritage institutions and their attributes from conversation text

**Tasks**:
- Implement Named Entity Recognition (NER) for:
  - Institution names
  - Addresses/locations
  - URLs and digital platforms
  - Contact information
  - Collection types
- Create entity linking to disambiguate institutions
- Extract relationships (parent organizations, networks, partnerships)

**Technologies**:
- spaCy or similar NLP framework
- Custom entity recognition models
- Regex patterns for structured data (URLs, codes)

**Deliverables**:
- `EntityExtractor` class
- Custom NER models
- Entity disambiguation logic
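As a starting point for the `EntityExtractor`, the sketch below shows one plausible shape for rule-assisted NER, assuming spaCy 3.x with the `en_core_web_sm` pipeline plus an `EntityRuler` for GLAM-specific patterns. The labels, patterns, and the `extract_entities` helper are illustrative placeholders, not the project's trained models.

```python
"""Minimal EntityExtractor sketch: spaCy statistical NER plus a rule-based
EntityRuler layer for common GLAM phrases (patterns are examples only)."""

import re
import spacy

URL_RE = re.compile(r"https?://[^\s)\]]+")  # crude matcher for cited URLs

nlp = spacy.load("en_core_web_sm")

# Rule-based layer runs before the statistical NER so known GLAM phrases
# are tagged consistently; the pattern set here is deliberately tiny.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "GLAM_ORG", "pattern": [{"LOWER": "national"}, {"LOWER": "library"},
                                      {"LOWER": "of"}, {"IS_TITLE": True}]},
    {"label": "GLAM_ORG", "pattern": [{"IS_TITLE": True, "OP": "+"},
                                      {"LOWER": {"IN": ["museum", "archives", "gallery"]}}]},
])

def extract_entities(text: str) -> dict:
    """Return candidate institutions, places, and URLs from one message."""
    doc = nlp(text)
    return {
        "institutions": [e.text for e in doc.ents if e.label_ in ("ORG", "GLAM_ORG", "FAC")],
        "places": [e.text for e in doc.ents if e.label_ in ("GPE", "LOC")],
        "urls": URL_RE.findall(text),
    }

if __name__ == "__main__":
    sample = "The Rijksmuseum in Amsterdam publishes open data at https://data.rijksmuseum.nl."
    print(extract_entities(sample))
```

A custom-trained NER component would extend or replace the statistical model once the Phase 2 pattern analysis identifies recurring institution-name shapes and edge cases.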
### 3.2 Attribute Extraction

**Objective**: Extract structured attributes for each institution

**Tasks**:
- Extract:
  - Institution types (museum, archive, library, etc.)
  - Geographic coverage
  - Collection subjects/themes
  - Digital platforms and repositories
  - Standards used (metadata, preservation, access)
  - Identifiers (ISIL, Wikidata, national IDs)
  - Funding sources
  - Consortium memberships
- Map extracted attributes to the LinkML schema

**Deliverables**:
- `AttributeExtractor` class
- Attribute mapping configuration
- Quality metrics

## Phase 4: Web Crawling & Enrichment (Weeks 8-10)

### 4.1 crawl4ai Integration

**Objective**: Fetch detailed information from URLs cited in conversations (see the crawler sketch at the end of this phase)

**Tasks**:
- Implement crawl4ai wrapper for:
  - Institution homepages
  - Digital repository platforms
  - Catalog systems
  - API documentation
- Extract structured data from crawled pages:
  - OpenGraph metadata
  - Schema.org JSON-LD
  - Dublin Core metadata
  - Custom extraction rules per platform type
- Implement rate limiting and ethical crawling

**Deliverables**:
- `WebCrawler` class
- Platform-specific extractors
- Crawl results database

### 4.2 Data Validation & Enrichment

**Objective**: Validate URLs and enrich extracted data

**Tasks**:
- Verify URL availability (HTTP status checks)
- Cross-reference with external authorities:
  - Wikidata
  - VIAF (Virtual International Authority File)
  - National library registries
  - UNESCO heritage databases
- Resolve geographic coordinates
- Normalize institution names

**Deliverables**:
- `DataEnricher` class
- External API integrations
- Enrichment metrics
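To make the 4.1 crawler concrete, here is a minimal sketch of a `WebCrawler`-style wrapper. It assumes crawl4ai exposes an async `AsyncWebCrawler` context manager whose `arun(url=...)` returns the page HTML and a markdown rendering (verify against the installed crawl4ai version); the JSON-LD regex, the fixed delay, and the helper names are illustrative only.

```python
"""Sketch of a polite crawl loop over cited URLs, keeping raw markdown
and any embedded Schema.org JSON-LD. API details are assumptions."""

import asyncio
import json
import re

from crawl4ai import AsyncWebCrawler  # assumption: async API with arun()

JSONLD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_jsonld(html: str) -> list[dict]:
    """Pull embedded Schema.org JSON-LD blocks out of raw HTML."""
    blocks = []
    for raw in JSONLD_RE.findall(html or ""):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # ignore malformed blocks
    return blocks

async def crawl_institutions(urls: list[str], delay: float = 2.0) -> list[dict]:
    """Fetch each cited URL with a fixed delay (simple rate limiting)."""
    results = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            page = await crawler.arun(url=url)
            results.append({
                "url": url,
                "markdown": str(page.markdown),     # cleaned page text
                "jsonld": extract_jsonld(page.html),
            })
            await asyncio.sleep(delay)
    return results

if __name__ == "__main__":
    print(asyncio.run(crawl_institutions(["https://www.rijksmuseum.nl"])))
```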
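For the 4.2 enrichment step, a first pass could combine simple URL liveness checks with lookups against Wikidata's public `wbsearchentities` endpoint, as sketched below using `requests`. The function names and thresholds are assumptions; VIAF and national registries would follow the same pattern.

```python
"""Sketch of URL validation and Wikidata cross-referencing for the
DataEnricher. Endpoint parameters follow the public MediaWiki API."""

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def url_is_live(url: str, timeout: float = 10.0) -> bool:
    """HEAD request with a GET fallback; any status below 400 counts as live."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def wikidata_candidates(name: str, language: str = "en", limit: int = 5) -> list[dict]:
    """Return candidate Wikidata items for an institution name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    resp = requests.get(WIKIDATA_API, params=params, timeout=10)
    resp.raise_for_status()
    return [
        {"qid": hit["id"], "label": hit.get("label"), "description": hit.get("description")}
        for hit in resp.json().get("search", [])
    ]

if __name__ == "__main__":
    print(url_is_live("https://www.bl.uk"))
    print(wikidata_candidates("British Library"))
```

Candidate matches would still need the disambiguation logic from Phase 3 before an external identifier is written into the dataset.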
## Phase 5: LinkML Instance Generation (Weeks 11-12)

### 5.1 Schema Mapping

**Objective**: Map extracted data to LinkML schema instances

**Tasks**:
- Transform extracted entities to LinkML classes
- Validate instances against the schema
- Handle missing/incomplete data
- Generate unique identifiers (IRIs)
- Maintain provenance (source conversations, extraction confidence)

**Deliverables**:
- `LinkMLMapper` class
- Instance generator
- Validation reports

### 5.2 Data Quality Assurance

**Objective**: Ensure high-quality dataset output

**Tasks**:
- Implement quality checks:
  - Completeness scoring
  - Duplicate detection
  - Inconsistency identification
- Manual review process for flagged items
- Create quality dashboard

**Deliverables**:
- `QualityChecker` class
- Review interface
- Quality metrics dashboard

## Phase 6: Export & Documentation (Weeks 13-14)

### 6.1 Multi-Format Export

**Objective**: Export the dataset in multiple formats

**Tasks**:
- Generate outputs:
  - **RDF/Turtle** - Linked data format
  - **JSON-LD** - JSON with linked data context
  - **CSV/TSV** - Tabular formats for analysis
  - **YAML** - Human-readable instances
  - **SQLite/DuckDB** - Queryable database
  - **Parquet** - Efficient columnar format
- Create dataset statistics report

**Deliverables**:
- `DataExporter` class with format plugins
- Export configurations
- Dataset documentation

### 6.2 OpenCODE Agent Development

**Objective**: Create AI agents for extraction assistance

**Tasks**:
- Write `AGENTS.md` with:
  - NLP extraction subagent instructions
  - Pattern recognition guidelines
  - Quality review agent specifications
- Implement agent integration in the pipeline
- Test agent-assisted extraction

**Deliverables**:
- `AGENTS.md` documentation
- Agent integration code
- Agent performance metrics

### 6.3 Documentation & Publication

**Objective**: Produce comprehensive project documentation

**Tasks**:
- Write user guides:
  - Installation and setup
  - Running the extraction pipeline
  - Querying the dataset
  - Contributing new sources
- Create schema documentation site
- Prepare dataset for publication:
  - License selection (CC0, CC-BY, ODbL)
  - Metadata record
  - DOI registration
  - GitHub repository preparation

**Deliverables**:
- Complete documentation
- Published dataset
- Public repository

## Phase 7: Maintenance & Extension (Ongoing)

### 7.1 Continuous Improvement

**Tasks**:
- Monitor data quality metrics
- Refine extraction algorithms based on feedback
- Add support for new conversation sources
- Update the schema as standards evolve
- Integrate community contributions

### 7.2 Dataset Updates

**Tasks**:
- Re-crawl URLs periodically for updates
- Process new conversation files
- Maintain dataset versioning
- Publish regular releases

## Success Metrics

### Quantitative
- **Coverage**: Extract ≥80% of institutions mentioned in conversations
- **Accuracy**: ≥90% precision on a manual validation sample
- **Completeness**: ≥70% of extracted institutions have URLs and types
- **Enrichment**: ≥50% of institutions linked to external authorities

### Qualitative
- Schema can represent diverse GLAM institutions globally
- Dataset is reusable by heritage sector practitioners
- Documentation enables community contributions
- Pipeline is maintainable and extensible

## Risk Mitigation

### Technical Risks
- **NLP extraction accuracy**: Mitigate with human-in-the-loop review and iterative refinement
- **Web crawling reliability**: Implement robust error handling and retry logic
- **Schema complexity**: Start simple, iterate based on real data needs

### Resource Risks
- **Processing time**: Parallelize where possible, use incremental processing
- **Storage requirements**: Use efficient formats (Parquet), compress archives

## Timeline Summary

- **Weeks 1-2**: Foundation (schemas, scaffolding)
- **Weeks 3-4**: Parsing & analysis
- **Weeks 5-7**: NLP extraction
- **Weeks 8-10**: Web crawling & enrichment
- **Weeks 11-12**: LinkML generation & QA
- **Weeks 13-14**: Export & documentation
- **Ongoing**: Maintenance & updates

**Total estimated duration**: 14 weeks for the initial release, with ongoing maintenance