# Global GLAM Dataset: Implementation Phases
## Overview
This document outlines the phased approach to building a comprehensive Global GLAM (Galleries, Libraries, Archives, Museums) dataset from 139 Claude conversation JSON files containing research on heritage institutions worldwide.
## Phase 1: Foundation & Schema Design (Weeks 1-2)
### 1.1 LinkML Schema Development
**Objective**: Create a comprehensive heritage custodian ontology combining multiple standards
**Tasks**:
- Design a core LinkML schema integrating:
  - **TOOI** (Dutch government organizations ontology)
  - **Schema.org** (general web semantics)
  - **CPOV** (Core Public Organisation Vocabulary)
  - **ISIL** (International Standard Identifier for Libraries and Related Organizations)
- Incorporate domain-specific vocabularies:
  - **RiC-O** (Records in Contexts Ontology, for archives)
  - **BIBFRAME** (Bibliographic Framework, for libraries)
  - **LIDO/CIDOC-CRM** (for museums)
  - **EBUCore** (for media archives)
**Deliverables**:
- `schemas/heritage_custodian.yaml` - Main LinkML schema
- `schemas/mixins/` - Modular schema components
- `docs/schema-documentation.md` - Schema documentation
- Example instances in YAML/JSON-LD
### 1.2 Poetry Application Scaffolding
**Objective**: Set up Python development environment
**Tasks**:
- Initialize Poetry project with proper dependencies
- Configure development tools (ruff, mypy, pytest)
- Set up project structure:
```
glam-extractor/
├── pyproject.toml
├── src/glam_extractor/
│   ├── parsers/       # JSON conversation parsers
│   ├── extractors/    # NLP-based data extraction
│   ├── crawlers/      # crawl4ai integration
│   ├── validators/    # LinkML validation
│   └── exporters/     # Output formats
├── tests/
└── data/
    ├── input/         # Conversation JSONs
    └── output/        # Extracted datasets
```
**Deliverables**:
- Working Poetry project
- CI/CD configuration
- Development documentation
## Phase 2: Data Parsing & Initial Extraction (Weeks 3-4)
### 2.1 Conversation Parser Development
**Objective**: Parse Claude conversation JSON files and extract text content
**Tasks**:
- Implement a JSON parser for the conversation structure (see the sketch below)
- Extract markdown content from messages
- Handle citations and URL references
- Create conversation metadata index (country, date, topics)
**Deliverables**:
- `ConversationParser` class
- Conversation index database (SQLite/DuckDB)
- Parsed text corpus
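A minimal sketch of the `ConversationParser` follows. It assumes each export file carries a top-level `chat_messages` list whose entries hold a `text` field; the actual Claude export layout should be verified against the input files before relying on these field names.
```python
import json
import re
from dataclasses import dataclass, field
from pathlib import Path

URL_RE = re.compile(r"https?://[^\s)>\]]+")

@dataclass
class ParsedConversation:
    path: Path
    texts: list[str] = field(default_factory=list)  # message bodies
    urls: list[str] = field(default_factory=list)   # cited URLs

class ConversationParser:
    """Parses one exported conversation file into text plus cited URLs."""

    def parse(self, path: Path) -> ParsedConversation:
        data = json.loads(path.read_text(encoding="utf-8"))
        parsed = ParsedConversation(path=path)
        # Assumed export layout: {"chat_messages": [{"sender": ..., "text": ...}, ...]}
        for msg in data.get("chat_messages", []):
            text = msg.get("text", "")
            if text:
                parsed.texts.append(text)
                parsed.urls.extend(URL_RE.findall(text))
        return parsed

if __name__ == "__main__":
    for f in sorted(Path("data/input").glob("*.json")):
        conv = ConversationParser().parse(f)
        print(f"{f.name}: {len(conv.texts)} messages, {len(conv.urls)} URLs")
```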
### 2.2 Sample Data Analysis
**Objective**: Analyze representative conversations to refine extraction patterns
**Tasks**:
- Deep dive into 10-15 representative conversations covering:
  - Different countries/regions
  - Different GLAM types (galleries, libraries, archives, museums)
  - Various institution sizes (national, regional, local)
- Document common patterns and structures
- Identify edge cases
**Deliverables**:
- Pattern analysis document
- Extraction templates
- Edge case catalog
## Phase 3: NLP Extraction Pipeline (Weeks 5-7)
### 3.1 Entity Extraction
**Objective**: Extract heritage institutions and their attributes from conversation text
**Tasks**:
- Implement Named Entity Recognition (NER) for the following (see the sketch below):
  - Institution names
  - Addresses/locations
  - URLs and digital platforms
  - Contact information
  - Collection types
- Create entity linking to disambiguate institutions
- Extract relationships (parent organizations, networks, partnerships)
**Technologies**:
- spaCy or similar NLP framework
- Custom entity recognition models
- Regex patterns for structured data (URLs, codes)
**Deliverables**:
- `EntityExtractor` class
- Custom NER models
- Entity disambiguation logic
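A minimal `EntityExtractor` sketch using spaCy's pretrained pipeline. The model choice (`en_core_web_sm`) and the rough ISIL pattern are illustrative assumptions, not final pipeline decisions:
```python
import re
import spacy

URL_RE = re.compile(r"https?://[^\s)>\]]+")
# Rough ISIL shape per ISO 15511: short prefix, hyphen, local identifier.
ISIL_RE = re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:-]{1,11}\b")

class EntityExtractor:
    def __init__(self, model: str = "en_core_web_sm"):
        # Requires: python -m spacy download en_core_web_sm
        self.nlp = spacy.load(model)

    def extract(self, text: str) -> dict[str, list[str]]:
        doc = self.nlp(text)
        # ORG covers most institution names; GPE/LOC/FAC help with locations.
        return {
            "institutions": [e.text for e in doc.ents if e.label_ == "ORG"],
            "places": [e.text for e in doc.ents if e.label_ in {"GPE", "LOC", "FAC"}],
            "urls": URL_RE.findall(text),
            "isil_candidates": ISIL_RE.findall(text),
        }
```
A custom-trained NER model can later replace the stock pipeline behind the same `extract()` interface.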
### 3.2 Attribute Extraction
**Objective**: Extract structured attributes for each institution
**Tasks**:
- Extract:
  - Institution types (museum, archive, library, etc.)
  - Geographic coverage
  - Collection subjects/themes
  - Digital platforms and repositories
  - Standards used (metadata, preservation, access)
  - Identifiers (ISIL, Wikidata, national IDs)
  - Funding sources
  - Consortium memberships
- Map extracted attributes to LinkML schema slots (a rule-based sketch follows below)
**Deliverables**:
- `AttributeExtractor` class
- Attribute mapping configuration
- Quality metrics
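A rule-based sketch of the type-classification piece of the `AttributeExtractor`. The keyword lists are placeholders to be grown from the Phase 2 pattern analysis:
```python
# Keyword lists and type labels are placeholders pending the Phase 1 schema.
TYPE_KEYWORDS: dict[str, list[str]] = {
    "museum": ["museum", "gallery", "kunsthalle"],
    "library": ["library", "bibliotheek", "bibliothèque"],
    "archive": ["archive", "archief", "records office"],
}

def classify_institution(name: str, context: str = "") -> list[str]:
    """Return every GLAM type whose keywords appear in the name or context."""
    haystack = f"{name} {context}".lower()
    return [t for t, kws in TYPE_KEYWORDS.items() if any(k in haystack for k in kws)]

assert classify_institution("Nationaal Archief") == ["archive"]
assert classify_institution("Tate", "modern art gallery in London") == ["museum"]
```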
## Phase 4: Web Crawling & Enrichment (Weeks 8-10)
### 4.1 crawl4ai Integration
**Objective**: Fetch detailed information from URLs cited in conversations
**Tasks**:
- Implement a crawl4ai wrapper (sketched below) for:
  - Institution homepages
  - Digital repository platforms
  - Catalog systems
  - API documentation
- Extract structured data from crawled pages:
  - OpenGraph metadata
  - Schema.org JSON-LD
  - Dublin Core metadata
  - Custom extraction rules per platform type
- Implement rate limiting and ethical crawling practices
**Deliverables**:
- `WebCrawler` class
- Platform-specific extractors
- Crawl results database
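A sketch of the `WebCrawler` wrapper, assuming crawl4ai's documented `AsyncWebCrawler`/`arun` interface (verify against the installed version). It fetches each page once with a fixed delay and scrapes embedded Schema.org JSON-LD blocks:
```python
import asyncio
import json
import re

from crawl4ai import AsyncWebCrawler

JSONLD_RE = re.compile(
    r"""<script[^>]+type=["']application/ld\+json["'][^>]*>(.*?)</script>""",
    re.IGNORECASE | re.DOTALL,
)

async def crawl_pages(urls: list[str], delay: float = 1.0) -> dict[str, list]:
    """Fetch each URL once and collect any embedded JSON-LD blocks."""
    results: dict[str, list] = {}
    async with AsyncWebCrawler() as crawler:
        for url in urls:  # sequential on purpose: crude but polite rate limiting
            page = await crawler.arun(url=url)
            blocks = []
            if page.success and page.html:
                for raw in JSONLD_RE.findall(page.html):
                    try:
                        blocks.append(json.loads(raw))
                    except json.JSONDecodeError:
                        pass  # malformed JSON-LD is common in the wild; skip it
            results[url] = blocks
            await asyncio.sleep(delay)
    return results

# asyncio.run(crawl_pages(["https://www.rijksmuseum.nl/"]))
```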
### 4.2 Data Validation & Enrichment
**Objective**: Validate URLs and enrich extracted data
**Tasks**:
- Verify URL availability (HTTP status checks)
- Cross-reference with external authorities (see the lookup sketch below):
  - Wikidata
  - VIAF (Virtual International Authority File)
  - National library registries
  - UNESCO heritage databases
- Resolve geographic coordinates
- Normalize institution names
**Deliverables**:
- `DataEnricher` class
- External API integrations
- Enrichment metrics
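A sketch of one enrichment step using Wikidata's public `wbsearchentities` endpoint. The take-the-top-hits matching shown here is deliberately naive and would feed a human-reviewed disambiguation step:
```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_candidates(name: str, limit: int = 5) -> list[dict]:
    """Return candidate Wikidata matches for an institution name."""
    resp = requests.get(
        WIKIDATA_API,
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        headers={"User-Agent": "glam-extractor/0.1 (heritage dataset research)"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"qid": hit["id"], "label": hit.get("label"), "description": hit.get("description")}
        for hit in resp.json().get("search", [])
    ]

# wikidata_candidates("Rijksmuseum") -> [{"qid": "Q...", "label": "Rijksmuseum", ...}, ...]
```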
## Phase 5: LinkML Instance Generation (Weeks 11-12)
### 5.1 Schema Mapping
**Objective**: Map extracted data to LinkML schema instances
**Tasks**:
- Transform extracted entities to LinkML classes
- Validate instances against the schema (see the validation sketch below)
- Handle missing/incomplete data
- Generate unique identifiers (IRIs)
- Maintain provenance (source conversations, extraction confidence)
**Deliverables**:
- `LinkMLMapper` class
- Instance generator
- Validation reports
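A validation sketch using the `linkml.validator` module shipped with recent linkml releases. The `HeritageInstitution` class name and its slots are placeholders for whatever the Phase 1 schema defines:
```python
from linkml.validator import validate

# Placeholder instance; slot names must match the Phase 1 schema.
instance = {
    "id": "https://example.org/glam/nl/rijksmuseum",  # minted IRI (scheme TBD)
    "name": "Rijksmuseum",
    "institution_type": "museum",
    "homepage": "https://www.rijksmuseum.nl/",
}

report = validate(
    instance,
    "schemas/heritage_custodian.yaml",   # the Phase 1 deliverable
    target_class="HeritageInstitution",  # placeholder class name
)
for result in report.results:
    print(result.severity, result.message)
```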
### 5.2 Data Quality Assurance
**Objective**: Ensure high-quality dataset output
**Tasks**:
- Implement quality checks:
  - Completeness scoring
  - Duplicate detection (see the sketch below)
  - Inconsistency identification
- Establish a manual review process for flagged items
- Create quality dashboard
**Deliverables**:
- `QualityChecker` class
- Review interface
- Quality metrics dashboard
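A duplicate-detection sketch for the `QualityChecker`: normalize names, then flag pairs whose string similarity crosses a threshold. The 0.9 cutoff is a starting guess to tune against the manual review sample:
```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name: str) -> str:
    """Strip accents, case, and edge whitespace so "Musée"/"musee" compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower().strip()

def likely_duplicates(names: list[str], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Flag name pairs whose normalized similarity crosses the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
    ]

print(likely_duplicates(["Rijksmuseum Amsterdam", "rijksmuseum, amsterdam"]))
```
Pairwise comparison is quadratic; blocking on country or city keeps it tractable at dataset scale.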
## Phase 6: Export & Documentation (Weeks 13-14)
### 6.1 Multi-Format Export
**Objective**: Export dataset in multiple formats
**Tasks**:
- Generate outputs in each target format (export plugin sketch below):
  - **RDF/Turtle** - Linked data format
  - **JSON-LD** - JSON with linked data context
  - **CSV/TSV** - Tabular formats for analysis
  - **YAML** - Human-readable instances
  - **SQLite/DuckDB** - Queryable database
  - **Parquet** - Efficient columnar format
- Create dataset statistics report
**Deliverables**:
- `DataExporter` class with format plugins
- Export configurations
- Dataset documentation
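A sketch of the format-plugin pattern for the `DataExporter`: writers register themselves in a dictionary keyed by format name, so adding RDF or Parquet support later means registering one more function. Only the JSON and CSV writers are shown:
```python
import csv
import json
from pathlib import Path
from typing import Callable

WRITERS: dict[str, Callable[[list[dict], Path], None]] = {}

def writer(fmt: str):
    """Decorator registering a writer function under a format name."""
    def register(fn):
        WRITERS[fmt] = fn
        return fn
    return register

@writer("json")
def write_json(records: list[dict], out: Path) -> None:
    out.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")

@writer("csv")
def write_csv(records: list[dict], out: Path) -> None:
    fieldnames = sorted({key for record in records for key in record})
    with out.open("w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(records)

def export(records: list[dict], fmt: str, out: Path) -> None:
    WRITERS[fmt](records, out)  # a KeyError signals an unsupported format
```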
### 6.2 OpenCODE Agent Development
**Objective**: Create AI agents for extraction assistance
**Tasks**:
- Write `AGENTS.md` with:
  - NLP extraction subagent instructions
  - Pattern recognition guidelines
  - Quality review agent specifications
- Implement agent integration in pipeline
- Test agent-assisted extraction
**Deliverables**:
- `AGENTS.md` documentation
- Agent integration code
- Agent performance metrics
### 6.3 Documentation & Publication
**Objective**: Comprehensive project documentation
**Tasks**:
- Write user guides:
  - Installation and setup
  - Running the extraction pipeline
  - Querying the dataset
  - Contributing new sources
- Create schema documentation site
- Prepare dataset for publication:
  - License selection (CC0, CC-BY, ODbL)
  - Metadata record
  - DOI registration
  - GitHub repository preparation
- GitHub repository preparation
**Deliverables**:
- Complete documentation
- Published dataset
- Public repository
## Phase 7: Maintenance & Extension (Ongoing)
### 7.1 Continuous Improvement
**Tasks**:
- Monitor data quality metrics
- Refine extraction algorithms based on feedback
- Add support for new conversation sources
- Update schema as standards evolve
- Integrate community contributions
### 7.2 Dataset Updates
**Tasks**:
- Re-crawl URLs periodically for updates
- Process new conversation files
- Maintain dataset versioning
- Publish regular releases
## Success Metrics
### Quantitative
- **Coverage**: Extract ≥80% of institutions mentioned in conversations
- **Accuracy**: ≥90% precision on manual validation sample
- **Completeness**: ≥70% of extracted institutions have URLs and types
- **Enrichment**: ≥50% of institutions linked to external authorities
### Qualitative
- Schema can represent diverse GLAM institutions globally
- Dataset is reusable by heritage sector practitioners
- Documentation enables community contributions
- Pipeline is maintainable and extensible
## Risk Mitigation
### Technical Risks
- **NLP extraction accuracy**: Mitigate with human-in-the-loop review and iterative refinement
- **Web crawling reliability**: Implement robust error handling and retry logic
- **Schema complexity**: Start simple, iterate based on real data needs
### Resource Risks
- **Processing time**: Parallelize where possible, use incremental processing
- **Storage requirements**: Use efficient formats (Parquet), compress archives
## Timeline Summary
- **Weeks 1-2**: Foundation (schemas, scaffolding)
- **Weeks 3-4**: Parsing & analysis
- **Weeks 5-7**: NLP extraction
- **Weeks 8-10**: Web crawling & enrichment
- **Weeks 11-12**: LinkML generation & QA
- **Weeks 13-14**: Export & documentation
- **Ongoing**: Maintenance & updates
**Total estimated duration**: 14 weeks for initial release, with ongoing maintenance