Global GLAM Dataset: Implementation Phases
Overview
This document outlines the phased approach to building a comprehensive Global GLAM (Galleries, Libraries, Archives, Museums) dataset from 139 Claude conversation JSON files containing research on heritage institutions worldwide.
Phase 1: Foundation & Schema Design (Weeks 1-2)
1.1 LinkML Schema Development
Objective: Create a comprehensive heritage custodian ontology combining multiple standards
Tasks:
- Design core LinkML schema integrating:
- TOOI (Dutch government organizations ontology)
- Schema.org (general web semantics)
- CPOV (Core Public Organization Vocabulary)
- ISIL (International Standard Identifier for Libraries)
- Domain-specific vocabularies:
- RiC-O (Records in Contexts - Archives)
- BIBFRAME (Bibliographic Framework - Libraries)
- LIDO/CIDOC-CRM (Museums)
- EBUCore (Media archives)
Deliverables:
- schemas/heritage_custodian.yaml - Main LinkML schema (see the sketch below)
- schemas/mixins/ - Modular schema components
- docs/schema-documentation.md - Schema documentation
- Example instances in YAML/JSON-LD
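The fragment below sketches what the top of schemas/heritage_custodian.yaml might look like, loaded through linkml-runtime's SchemaView for a quick sanity check. The class, slots, and the example.org namespace are illustrative assumptions, not the final ontology.

```python
# A minimal, illustrative LinkML schema fragment embedded as a string.
# Class/slot names and the example.org namespace are assumptions, not the
# final heritage_custodian.yaml.
from linkml_runtime import SchemaView

SCHEMA_YAML = """
id: https://example.org/heritage_custodian
name: heritage_custodian
prefixes:
  linkml: https://w3id.org/linkml/
  schema: http://schema.org/
imports:
  - linkml:types
default_range: string

classes:
  HeritageInstitution:
    class_uri: schema:Organization   # ground the class in Schema.org semantics
    slots:
      - name
      - isil
      - homepage
      - institution_type

slots:
  name:
    description: Official name of the institution
  isil:
    description: International Standard Identifier for Libraries code
  homepage:
    range: uri
  institution_type:
    description: gallery, library, archive, or museum
"""

view = SchemaView(SCHEMA_YAML)     # also accepts a path to the YAML file
print(sorted(view.all_classes()))  # ['HeritageInstitution']
```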
1.2 Poetry Application Scaffolding
Objective: Set up Python development environment
Tasks:
- Initialize Poetry project with proper dependencies
- Configure development tools (ruff, mypy, pytest)
- Set up project structure:
glam-extractor/
├── pyproject.toml
├── src/glam_extractor/
│   ├── parsers/      # JSON conversation parsers
│   ├── extractors/   # NLP-based data extraction
│   ├── crawlers/     # crawl4ai integration
│   ├── validators/   # LinkML validation
│   └── exporters/    # Output formats
├── tests/
└── data/
    ├── input/        # Conversation JSONs
    └── output/       # Extracted datasets
Deliverables:
- Working Poetry project
- CI/CD configuration
- Development documentation
Phase 2: Data Parsing & Initial Extraction (Weeks 3-4)
2.1 Conversation Parser Development
Objective: Parse Claude conversation JSON files and extract text content
Tasks:
- Implement JSON parser for conversation structure
- Extract markdown content from messages
- Handle citations and URL references
- Create conversation metadata index (country, date, topics)
Deliverables:
- ConversationParser class (sketched below)
- Conversation index database (SQLite/DuckDB)
- Parsed text corpus
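A minimal sketch of the ConversationParser deliverable. The field names (chat_messages, text) assume the standard Claude export layout and should be checked against the actual files.

```python
# Sketch of a ConversationParser; field names (chat_messages, text) assume
# the Claude export layout and may need adjusting to the real files.
import json
import re
from pathlib import Path

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

class ConversationParser:
    def __init__(self, path: Path):
        self.data = json.loads(path.read_text(encoding="utf-8"))

    def texts(self) -> list[str]:
        """Return the text of every message in the conversation."""
        return [m.get("text", "") for m in self.data.get("chat_messages", [])]

    def urls(self) -> set[str]:
        """Collect every URL cited anywhere in the conversation."""
        return {u for t in self.texts() for u in URL_RE.findall(t)}

# Usage: index all conversation files
for f in sorted(Path("data/input").glob("*.json")):
    parser = ConversationParser(f)
    print(f.name, len(parser.texts()), "messages,", len(parser.urls()), "URLs")
```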
2.2 Sample Data Analysis
Objective: Analyze representative conversations to refine extraction patterns
Tasks:
- Deep dive into 10-15 representative conversations covering:
- Different countries/regions
- Different GLAM types (galleries, libraries, archives, museums)
- Various institution sizes (national, regional, local)
- Document common patterns and structures
- Identify edge cases
Deliverables:
- Pattern analysis document
- Extraction templates
- Edge case catalog
Phase 3: NLP Extraction Pipeline (Weeks 5-7)
3.1 Entity Extraction
Objective: Extract heritage institutions and their attributes from conversation text
Tasks:
- Implement Named Entity Recognition (NER) for:
- Institution names
- Addresses/locations
- URLs and digital platforms
- Contact information
- Collection types
- Create entity linking to disambiguate institutions
- Extract relationships (parent organizations, networks, partnerships)
Technologies:
- spaCy or similar NLP framework
- Custom entity recognition models
- Regex patterns for structured data (URLs, codes)
Deliverables:
- EntityExtractor class (sketched below)
- Custom NER models
- Entity disambiguation logic
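A minimal EntityExtractor sketch combining spaCy NER with regex patterns for structured data. It assumes the en_core_web_sm model is installed (python -m spacy download en_core_web_sm); the ISIL pattern is a rough approximation of the standard's prefix-dash-identifier shape, and NER output depends on the model used.

```python
# Sketch of an EntityExtractor: spaCy NER plus regex for structured data.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model

# Rough ISIL pattern: 1-4 letter prefix, dash, then up to 11 allowed chars.
ISIL_RE = re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:-]{1,11}\b")
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def extract_entities(text: str) -> dict:
    doc = nlp(text)
    return {
        # ORG and FAC cover most institution mentions; GPE gives locations
        "institutions": [e.text for e in doc.ents if e.label_ in ("ORG", "FAC")],
        "locations": [e.text for e in doc.ents if e.label_ == "GPE"],
        "isil_codes": ISIL_RE.findall(text),
        "urls": URL_RE.findall(text),
    }

print(extract_entities(
    "The Nationaal Archief (ISIL NL-HaNA) in The Hague publishes data "
    "at https://www.nationaalarchief.nl/."
))
```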
3.2 Attribute Extraction
Objective: Extract structured attributes for each institution
Tasks:
- Extract:
- Institution types (museum, archive, library, etc.)
- Geographic coverage
- Collection subjects/themes
- Digital platforms and repositories
- Standards used (metadata, preservation, access)
- Identifiers (ISIL, Wikidata, national IDs)
- Funding sources
- Consortium memberships
- Map extracted attributes to LinkML schema
Deliverables:
- AttributeExtractor class (sketched below)
- Attribute mapping configuration
- Quality metrics
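One simple approach to the institution-type attribute is keyword matching, as sketched below; the keyword map is an illustrative assumption that the Phase 2.2 pattern analysis would refine.

```python
# Sketch of keyword-based institution-type classification; the keyword map
# is an illustrative assumption, to be refined from the Phase 2.2 analysis.
TYPE_KEYWORDS = {
    "museum": ["museum", "gallery", "kunsthalle"],
    "library": ["library", "bibliotheek", "bibliothèque"],
    "archive": ["archive", "archief", "records office"],
}

def classify_institution(name: str, context: str = "") -> str | None:
    """Return the first GLAM type whose keywords appear in name or context."""
    haystack = f"{name} {context}".lower()
    for glam_type, keywords in TYPE_KEYWORDS.items():
        if any(kw in haystack for kw in keywords):
            return glam_type
    return None  # no match: flag the record for manual review

assert classify_institution("Rijksmuseum") == "museum"
assert classify_institution("Nationaal Archief") == "archive"
```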
Phase 4: Web Crawling & Enrichment (Weeks 8-10)
4.1 crawl4ai Integration
Objective: Fetch detailed information from URLs cited in conversations
Tasks:
- Implement crawl4ai wrapper for:
- Institution homepages
- Digital repository platforms
- Catalog systems
- API documentation
- Extract structured data from crawled pages:
- OpenGraph metadata
- Schema.org JSON-LD
- Dublin Core metadata
- Custom extraction rules per platform type
- Implement rate limiting and ethical crawling (respect robots.txt and crawl delays)
Deliverables:
- WebCrawler class (sketched below)
- Platform-specific extractors
- Crawl results database
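A sketch of the crawl4ai wrapper, assuming the library's documented AsyncWebCrawler interface (arun returning a result with .html and .markdown attributes); it crawls a page and pulls any embedded Schema.org JSON-LD blocks.

```python
# Sketch of the WebCrawler wrapper, assuming crawl4ai's AsyncWebCrawler
# interface; extracts Schema.org JSON-LD blocks from the fetched HTML.
import asyncio
import json
import re
from crawl4ai import AsyncWebCrawler

JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)

async def crawl_institution(url: str) -> dict:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        # Pull any Schema.org JSON-LD blocks embedded in the page
        jsonld = []
        for block in JSONLD_RE.findall(result.html or ""):
            try:
                jsonld.append(json.loads(block))
            except json.JSONDecodeError:
                pass  # skip malformed blocks
        return {"url": url, "markdown": result.markdown, "jsonld": jsonld}

data = asyncio.run(crawl_institution("https://www.rijksmuseum.nl/"))
print(len(data["jsonld"]), "JSON-LD blocks found")
```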
4.2 Data Validation & Enrichment
Objective: Validate URLs and enrich extracted data
Tasks:
- Verify URL availability (HTTP status checks)
- Cross-reference with external authorities:
- Wikidata
- VIAF (Virtual International Authority File)
- National library registries
- UNESCO heritage databases
- Resolve geographic coordinates
- Normalize institution names
Deliverables:
- DataEnricher class (sketched below)
- External API integrations
- Enrichment metrics
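Cross-referencing with Wikidata can start from the public wbsearchentities endpoint, as in the sketch below; returning several candidates rather than the top hit leaves room for a proper disambiguation step.

```python
# Sketch of a Wikidata lookup via the public wbsearchentities API; matching
# on the top hit alone would be naive, so several candidates are returned.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_candidates(name: str, limit: int = 3) -> list[dict]:
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "format": "json",
        "limit": limit,
    }, timeout=10)
    resp.raise_for_status()
    return [
        {"qid": hit["id"], "label": hit.get("label"),
         "description": hit.get("description")}
        for hit in resp.json().get("search", [])
    ]

for c in wikidata_candidates("Rijksmuseum"):
    print(c["qid"], c["label"], "-", c["description"])
```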
Phase 5: LinkML Instance Generation (Weeks 11-12)
5.1 Schema Mapping
Objective: Map extracted data to LinkML schema instances
Tasks:
- Transform extracted entities to LinkML classes
- Validate instances against schema
- Handle missing/incomplete data
- Generate unique identifiers (IRIs)
- Maintain provenance (source conversations, extraction confidence)
Deliverables:
- LinkMLMapper class (sketched below)
- Instance generator
- Validation reports
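A sketch of instance generation and validation. It assumes the linkml package's validate() helper and the hypothetical HeritageInstitution class from Phase 1.1; the IRI scheme and the ISIL value are placeholders.

```python
# Sketch of IRI minting and instance validation; assumes the linkml
# package's validate() helper and the Phase 1.1 schema sketch.
import uuid
from linkml.validator import validate

def mint_iri(name: str) -> str:
    """Deterministic IRI from the normalized name (placeholder scheme)."""
    key = uuid.uuid5(uuid.NAMESPACE_URL, name.lower().strip())
    return f"https://example.org/glam/{key}"

instance = {
    "name": "Rijksmuseum",
    "isil": "NL-AsdRM",  # illustrative value, not verified
    "homepage": "https://www.rijksmuseum.nl/",
    "institution_type": "museum",
}
print(mint_iri(instance["name"]))

report = validate(instance, "schemas/heritage_custodian.yaml",
                  "HeritageInstitution")
print("valid" if not report.results else report.results)
```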
5.2 Data Quality Assurance
Objective: Ensure high-quality dataset output
Tasks:
- Implement quality checks:
- Completeness scoring
- Duplicate detection
- Inconsistency identification
- Manual review process for flagged items
- Create quality dashboard
Deliverables:
- QualityChecker class (sketched below)
- Review interface
- Quality metrics dashboard
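Two of the quality checks can be prototyped with the standard library alone: completeness scoring over required fields and fuzzy duplicate detection via difflib. The required-field list and similarity threshold below are assumptions to tune.

```python
# Sketch of QualityChecker heuristics: field-completeness scoring plus fuzzy
# duplicate detection; the field list and threshold are assumptions to tune.
from difflib import SequenceMatcher

REQUIRED_FIELDS = ("name", "institution_type", "homepage", "country")

def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    return sum(bool(record.get(f)) for f in REQUIRED_FIELDS) / len(REQUIRED_FIELDS)

def likely_duplicates(records: list[dict], threshold: float = 0.9):
    """Yield pairs of records whose normalized names are near-identical."""
    names = [(r, r.get("name", "").lower().strip()) for r in records]
    for i, (a, na) in enumerate(names):
        for b, nb in names[i + 1:]:
            if SequenceMatcher(None, na, nb).ratio() >= threshold:
                yield a, b

records = [{"name": "Rijksmuseum Amsterdam"}, {"name": "Rijksmuseum, Amsterdam"}]
print(completeness(records[0]))               # 0.25: three required fields missing
print(len(list(likely_duplicates(records))))  # 1 flagged pair
```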
Phase 6: Export & Documentation (Weeks 13-14)
6.1 Multi-Format Export
Objective: Export dataset in multiple formats
Tasks:
- Generate outputs:
- RDF/Turtle - Linked data format
- JSON-LD - JSON with linked data context
- CSV/TSV - Tabular formats for analysis
- YAML - Human-readable instances
- SQLite/DuckDB - Queryable database
- Parquet - Efficient columnar format
- Create dataset statistics report
Deliverables:
- DataExporter class with format plugins (sketched below)
- Export configurations
- Dataset documentation
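A sketch of the plugin pattern behind the DataExporter: each format registers a callable, so new formats (Turtle via rdflib, DuckDB, etc.) can be added without touching the core. Paths and the JSON-LD context URL are placeholders.

```python
# Sketch of the DataExporter plugin pattern; each output format is a
# registered callable, so new formats plug in without changing the core.
import json
from pathlib import Path
from typing import Callable

import pandas as pd

EXPORTERS: dict[str, Callable[[list[dict], Path], None]] = {}

def exporter(fmt: str):
    def register(fn):
        EXPORTERS[fmt] = fn
        return fn
    return register

@exporter("jsonld")
def to_jsonld(records, path):
    # The "@context" URL is a placeholder; the real context derives from the schema
    doc = {"@context": "https://example.org/glam/context.jsonld",
           "@graph": records}
    path.write_text(json.dumps(doc, indent=2, ensure_ascii=False))

@exporter("csv")
def to_csv(records, path):
    pd.DataFrame(records).to_csv(path, index=False)

@exporter("parquet")
def to_parquet(records, path):
    pd.DataFrame(records).to_parquet(path)  # requires pyarrow

def export(records: list[dict], fmt: str, path: Path):
    EXPORTERS[fmt](records, path)

export([{"name": "Rijksmuseum", "institution_type": "museum"}],
       "csv", Path("data/output/institutions.csv"))
```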
6.2 OpenCODE Agent Development
Objective: Create AI agents for extraction assistance
Tasks:
- Write AGENTS.md with:
- NLP extraction subagent instructions
- Pattern recognition guidelines
- Quality review agent specifications
- Implement agent integration in pipeline
- Test agent-assisted extraction
Deliverables:
- AGENTS.md documentation
- Agent integration code
- Agent performance metrics
6.3 Documentation & Publication
Objective: Comprehensive project documentation
Tasks:
- Write user guides:
- Installation and setup
- Running extraction pipeline
- Querying the dataset
- Contributing new sources
- Create schema documentation site
- Prepare dataset for publication:
- License selection (CC0, CC-BY, ODbL)
- Metadata record
- DOI registration
- GitHub repository preparation
Deliverables:
- Complete documentation
- Published dataset
- Public repository
Phase 7: Maintenance & Extension (Ongoing)
7.1 Continuous Improvement
Tasks:
- Monitor data quality metrics
- Refine extraction algorithms based on feedback
- Add support for new conversation sources
- Update schema as standards evolve
- Integrate community contributions
7.2 Dataset Updates
Tasks:
- Re-crawl URLs periodically for updates
- Process new conversation files
- Maintain dataset versioning
- Publish regular releases
Success Metrics
Quantitative
- Coverage: Extract ≥80% of institutions mentioned in conversations
- Accuracy: ≥90% precision on manual validation sample
- Completeness: ≥70% of extracted institutions have URLs and types
- Enrichment: ≥50% of institutions linked to external authorities
Qualitative
- Schema can represent diverse GLAM institutions globally
- Dataset is reusable by heritage sector practitioners
- Documentation enables community contributions
- Pipeline is maintainable and extensible
Risk Mitigation
Technical Risks
- NLP extraction accuracy: Mitigate with human-in-the-loop review and iterative refinement
- Web crawling reliability: Implement robust error handling and retry logic
- Schema complexity: Start simple, iterate based on real data needs
Resource Risks
- Processing time: Parallelize where possible, use incremental processing
- Storage requirements: Use efficient formats (Parquet), compress archives
Timeline Summary
- Weeks 1-2: Foundation (schemas, scaffolding)
- Weeks 3-4: Parsing & analysis
- Weeks 5-7: NLP extraction
- Weeks 8-10: Web crawling & enrichment
- Weeks 11-12: LinkML generation & QA
- Weeks 13-14: Export & documentation
- Ongoing: Maintenance & updates
Total estimated duration: 14 weeks for initial release, with ongoing maintenance