# Global GLAM Dataset: Implementation Phases

## Overview

This document outlines the phased approach to building a comprehensive Global GLAM (Galleries, Libraries, Archives, Museums) dataset from 139 Claude conversation JSON files containing research on heritage institutions worldwide.

## Phase 1: Foundation & Schema Design (Weeks 1-2)

### 1.1 LinkML Schema Development

**Objective**: Create a comprehensive heritage custodian ontology combining multiple standards

**Tasks**:

- Design a core LinkML schema integrating:
  - **TOOI** (Dutch government organizations ontology)
  - **Schema.org** (general web semantics)
  - **CPOV** (Core Public Organisation Vocabulary)
  - **ISIL** (International Standard Identifier for Libraries and Related Organizations)
- Incorporate domain-specific vocabularies:
  - **RiC-O** (Records in Contexts Ontology - archives)
  - **BIBFRAME** (Bibliographic Framework - libraries)
  - **LIDO/CIDOC-CRM** (museums)
  - **EBUCore** (media archives)

**Deliverables**:

- `schemas/heritage_custodian.yaml` - Main LinkML schema
- `schemas/mixins/` - Modular schema components
- `docs/schema-documentation.md` - Schema documentation
- Example instances in YAML/JSON-LD

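To make the schema design concrete, a minimal LinkML sketch might look like the following. The class, slot, and enum names (`HeritageInstitution`, `isil_code`, `InstitutionType`) and the `example.org` namespace are illustrative placeholders, not settled schema decisions:

```yaml
# Illustrative sketch only; names and namespace are placeholders.
id: https://example.org/heritage-custodian
name: heritage_custodian
prefixes:
  linkml: https://w3id.org/linkml/
  schema: http://schema.org/
imports:
  - linkml:types
default_range: string

classes:
  HeritageInstitution:
    class_uri: schema:Organization
    description: A gallery, library, archive, or museum holding heritage collections
    slots:
      - name
      - institution_type
      - isil_code
      - homepage

slots:
  name:
    required: true
  institution_type:
    range: InstitutionType
  isil_code:
    description: ISO 15511 identifier, when one exists
  homepage:
    range: uri

enums:
  InstitutionType:
    permissible_values:
      gallery:
      library:
      archive:
      museum:
```

The standards listed above would then enter as mixins and additional `class_uri`/`slot_uri` mappings rather than as separate schemas.
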
### 1.2 Poetry Application Scaffolding

**Objective**: Set up the Python development environment

**Tasks**:

- Initialize a Poetry project with the required dependencies
- Configure development tools (ruff, mypy, pytest)
- Set up the project structure:

```
glam-extractor/
├── pyproject.toml
├── src/glam_extractor/
│   ├── parsers/       # JSON conversation parsers
│   ├── extractors/    # NLP-based data extraction
│   ├── crawlers/      # crawl4ai integration
│   ├── validators/    # LinkML validation
│   └── exporters/     # Output formats
├── tests/
└── data/
    ├── input/         # Conversation JSONs
    └── output/        # Extracted datasets
```

**Deliverables**:

- Working Poetry project
- CI/CD configuration
- Development documentation

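A starting `pyproject.toml` could look roughly like this; the dependency list and version constraints are illustrative and should be pinned to tested versions in practice:

```toml
[tool.poetry]
name = "glam-extractor"
version = "0.1.0"
description = "Extract a Global GLAM dataset from Claude conversation exports"
packages = [{ include = "glam_extractor", from = "src" }]

[tool.poetry.dependencies]
python = "^3.11"
linkml-runtime = "*"
crawl4ai = "*"
spacy = "*"

[tool.poetry.group.dev.dependencies]
ruff = "*"
mypy = "*"
pytest = "*"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```
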
## Phase 2: Data Parsing & Initial Extraction (Weeks 3-4)

### 2.1 Conversation Parser Development

**Objective**: Parse Claude conversation JSON files and extract text content

**Tasks**:

- Implement a JSON parser for the conversation structure
- Extract markdown content from messages
- Handle citations and URL references
- Create a conversation metadata index (country, date, topics)

**Deliverables**:

- `ConversationParser` class
- Conversation index database (SQLite/DuckDB)
- Parsed text corpus

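A minimal parser sketch, assuming an export layout with `name` and `chat_messages`/`text` fields; the actual field names must be confirmed against the real JSON files:

```python
import json
import re
from dataclasses import dataclass, field


@dataclass
class ParsedConversation:
    title: str
    messages: list[str] = field(default_factory=list)
    urls: list[str] = field(default_factory=list)


class ConversationParser:
    """Turn one conversation export into plain text plus cited URLs."""

    URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

    def parse(self, raw: str) -> ParsedConversation:
        data = json.loads(raw)
        # The field names below are assumptions about the export format.
        conv = ParsedConversation(title=data.get("name", ""))
        for msg in data.get("chat_messages", []):
            text = msg.get("text", "")
            conv.messages.append(text)
            conv.urls.extend(self.URL_RE.findall(text))
        return conv
```
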
### 2.2 Sample Data Analysis

**Objective**: Analyze representative conversations to refine extraction patterns

**Tasks**:

- Deep-dive into 10-15 representative conversations covering:
  - Different countries/regions
  - Different GLAM types (galleries, libraries, archives, museums)
  - Various institution sizes (national, regional, local)
- Document common patterns and structures
- Identify edge cases

**Deliverables**:

- Pattern analysis document
- Extraction templates
- Edge case catalog

## Phase 3: NLP Extraction Pipeline (Weeks 5-7)

### 3.1 Entity Extraction

**Objective**: Extract heritage institutions and their attributes from conversation text

**Tasks**:

- Implement Named Entity Recognition (NER) for:
  - Institution names
  - Addresses/locations
  - URLs and digital platforms
  - Contact information
  - Collection types
- Create entity linking to disambiguate institutions
- Extract relationships (parent organizations, networks, partnerships)

**Technologies**:

- spaCy or a similar NLP framework
- Custom entity recognition models
- Regex patterns for structured data (URLs, codes)

**Deliverables**:

- `EntityExtractor` class
- Custom NER models
- Entity disambiguation logic

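For the structured-data side, plain regular expressions go a long way before any NER model is involved. A sketch; the ISIL pattern follows the general ISO 15511 shape (1-4 character prefix, hyphen, up to 11 further characters) but is an approximation, not a full validator:

```python
import re

# Rough approximations of the target formats, not full validators.
ISIL_RE = re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9:/-]{1,11}\b")
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def extract_structured(text: str) -> dict[str, list[str]]:
    """Pull URLs and candidate ISIL codes out of free text."""
    return {"urls": URL_RE.findall(text), "isil": ISIL_RE.findall(text)}
```
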
### 3.2 Attribute Extraction

**Objective**: Extract structured attributes for each institution

**Tasks**:

- Extract:
  - Institution types (museum, archive, library, etc.)
  - Geographic coverage
  - Collection subjects/themes
  - Digital platforms and repositories
  - Standards used (metadata, preservation, access)
  - Identifiers (ISIL, Wikidata, national IDs)
  - Funding sources
  - Consortium memberships
- Map extracted attributes to the LinkML schema

**Deliverables**:

- `AttributeExtractor` class
- Attribute mapping configuration
- Quality metrics

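Institution-type inference can start as a keyword lookup before any model is trained. The table below is a toy illustration; a real mapping would be larger and multilingual, with ambiguous hits routed to manual review:

```python
# Toy keyword-to-type table; purely illustrative.
TYPE_KEYWORDS = {
    "museum": "museum",
    "gallery": "gallery",
    "library": "library",
    "archive": "archive",
    "archief": "archive",      # Dutch
    "bibliotheek": "library",  # Dutch
}


def infer_institution_types(text: str) -> set[str]:
    """Map free-text mentions to controlled institution types."""
    lowered = text.lower()
    return {t for kw, t in TYPE_KEYWORDS.items() if kw in lowered}
```
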
## Phase 4: Web Crawling & Enrichment (Weeks 8-10)

### 4.1 crawl4ai Integration

**Objective**: Fetch detailed information from URLs cited in conversations

**Tasks**:

- Implement a crawl4ai wrapper for:
  - Institution homepages
  - Digital repository platforms
  - Catalog systems
  - API documentation
- Extract structured data from crawled pages:
  - OpenGraph metadata
  - Schema.org JSON-LD
  - Dublin Core metadata
  - Custom extraction rules per platform type
- Implement rate limiting and ethical crawling (respect robots.txt, use crawl delays)

**Deliverables**:

- `WebCrawler` class
- Platform-specific extractors
- Crawl results database

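Once pages are fetched, the embedded Schema.org JSON-LD can be collected with the standard library alone, independently of which crawler produced the HTML. A sketch:

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect Schema.org JSON-LD blocks embedded in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed embedded JSON: skip rather than fail
```
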
### 4.2 Data Validation & Enrichment

**Objective**: Validate URLs and enrich extracted data

**Tasks**:

- Verify URL availability (HTTP status checks)
- Cross-reference with external authorities:
  - Wikidata
  - VIAF (Virtual International Authority File)
  - National library registries
  - UNESCO heritage databases
- Resolve geographic coordinates
- Normalize institution names

**Deliverables**:

- `DataEnricher` class
- External API integrations
- Enrichment metrics

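Name normalization in particular can be done deterministically before any external lookups. A sketch; the exact rules (accent folding, lowercasing, punctuation stripping) are starting assumptions to refine against real data:

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Fold accents, lowercase, drop punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^\w\s]", " ", ascii_only.lower())
    return re.sub(r"\s+", " ", cleaned).strip()
```
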
## Phase 5: LinkML Instance Generation (Weeks 11-12)

### 5.1 Schema Mapping

**Objective**: Map extracted data to LinkML schema instances

**Tasks**:

- Transform extracted entities into LinkML classes
- Validate instances against the schema
- Handle missing/incomplete data
- Generate unique identifiers (IRIs)
- Maintain provenance (source conversations, extraction confidence)

**Deliverables**:

- `LinkMLMapper` class
- Instance generator
- Validation reports

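IRI minting should be deterministic so that re-running the pipeline assigns stable identifiers. A sketch under an assumed `example.org` namespace:

```python
import hashlib
import re

BASE_IRI = "https://example.org/glam/"  # placeholder namespace


def mint_iri(name: str, country: str) -> str:
    """Deterministic IRI from a name slug plus a short hash suffix."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    digest = hashlib.sha256(f"{country}|{slug}".encode()).hexdigest()[:8]
    return f"{BASE_IRI}{country.lower()}/{slug}-{digest}"
```

The hash suffix keeps two same-named institutions in different countries (or later collisions) from sharing an IRI while staying reproducible across runs.
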
### 5.2 Data Quality Assurance

**Objective**: Ensure high-quality dataset output

**Tasks**:

- Implement quality checks:
  - Completeness scoring
  - Duplicate detection
  - Inconsistency identification
- Run a manual review process for flagged items
- Create a quality dashboard

**Deliverables**:

- `QualityChecker` class
- Review interface
- Quality metrics dashboard

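Completeness scoring can start as a weighted field count; the field lists and weights below are illustrative, not final:

```python
# Illustrative field lists and weights; tune against the real schema.
REQUIRED_FIELDS = ("name", "institution_type", "country")
OPTIONAL_FIELDS = ("homepage", "isil_code", "wikidata_id")


def completeness(record: dict) -> float:
    """Score in [0, 1]; required fields weigh double vs optional ones."""
    req = sum(2 for f in REQUIRED_FIELDS if record.get(f))
    opt = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return (req + opt) / (2 * len(REQUIRED_FIELDS) + len(OPTIONAL_FIELDS))
```
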
## Phase 6: Export & Documentation (Weeks 13-14)

### 6.1 Multi-Format Export

**Objective**: Export the dataset in multiple formats

**Tasks**:

- Generate outputs:
  - **RDF/Turtle** - Linked data format
  - **JSON-LD** - JSON with a linked data context
  - **CSV/TSV** - Tabular formats for analysis
  - **YAML** - Human-readable instances
  - **SQLite/DuckDB** - Queryable database
  - **Parquet** - Efficient columnar format
- Create a dataset statistics report

**Deliverables**:

- `DataExporter` class with format plugins
- Export configurations
- Dataset documentation

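Two of the simpler format plugins can be sketched with the standard library alone; the `@vocab` context and field list are placeholders:

```python
import csv
import io
import json


def to_csv(records: list[dict], fields: list[str]) -> str:
    """Tabular export; extra keys are ignored, missing keys left blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_jsonld(records: list[dict]) -> str:
    """JSON-LD export under a placeholder schema.org context."""
    doc = {"@context": {"@vocab": "http://schema.org/"}, "@graph": records}
    return json.dumps(doc, indent=2)
```
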
### 6.2 OpenCODE Agent Development

**Objective**: Create AI agents that assist with extraction

**Tasks**:

- Write `AGENTS.md` with:
  - NLP extraction subagent instructions
  - Pattern recognition guidelines
  - Quality review agent specifications
- Integrate the agents into the pipeline
- Test agent-assisted extraction

**Deliverables**:

- `AGENTS.md` documentation
- Agent integration code
- Agent performance metrics

### 6.3 Documentation & Publication

**Objective**: Produce comprehensive project documentation

**Tasks**:

- Write user guides covering:
  - Installation and setup
  - Running the extraction pipeline
  - Querying the dataset
  - Contributing new sources
- Create a schema documentation site
- Prepare the dataset for publication:
  - License selection (CC0, CC-BY, ODbL)
  - Metadata record
  - DOI registration
  - GitHub repository preparation

**Deliverables**:

- Complete documentation
- Published dataset
- Public repository

## Phase 7: Maintenance & Extension (Ongoing)

### 7.1 Continuous Improvement

**Tasks**:

- Monitor data quality metrics
- Refine extraction algorithms based on feedback
- Add support for new conversation sources
- Update the schema as standards evolve
- Integrate community contributions

### 7.2 Dataset Updates

**Tasks**:

- Re-crawl URLs periodically for updates
- Process new conversation files
- Maintain dataset versioning
- Publish regular releases

## Success Metrics

### Quantitative

- **Coverage**: Extract ≥80% of institutions mentioned in conversations
- **Accuracy**: ≥90% precision on a manually validated sample
- **Completeness**: ≥70% of extracted institutions have URLs and types
- **Enrichment**: ≥50% of institutions linked to external authorities

### Qualitative

- Schema can represent diverse GLAM institutions globally
- Dataset is reusable by heritage sector practitioners
- Documentation enables community contributions
- Pipeline is maintainable and extensible

## Risk Mitigation

### Technical Risks

- **NLP extraction accuracy**: Mitigate with human-in-the-loop review and iterative refinement
- **Web crawling reliability**: Implement robust error handling and retry logic
- **Schema complexity**: Start simple; iterate based on real data needs

### Resource Risks

- **Processing time**: Parallelize where possible; use incremental processing
- **Storage requirements**: Use efficient formats (Parquet); compress archives

## Timeline Summary

- **Weeks 1-2**: Foundation (schemas, scaffolding)
- **Weeks 3-4**: Parsing & analysis
- **Weeks 5-7**: NLP extraction
- **Weeks 8-10**: Web crawling & enrichment
- **Weeks 11-12**: LinkML generation & QA
- **Weeks 13-14**: Export & documentation
- **Ongoing**: Maintenance & updates

**Total estimated duration**: 14 weeks to initial release, with ongoing maintenance thereafter