# Global GLAM Dataset: Data Standardization Strategy

## Overview

This document describes how the existing structured datasets in the `data/` directory will be standardized, integrated, and fed into the Global GLAM dataset extraction pipeline alongside data mined from conversation files.

## Existing Data Assets

### 1. ISIL Codes Registry

**File**: `data/ISIL-codes_2025-08-01.csv`

#### Current Structure

```csv
Volgnr.,Plaats,Instelling,ISIL code,Toegekend op,Opmerking
1,Aalten,Nationaal Onderduikmuseum,NL-AtNOM,2021-03-17,
2,Alkmaar,Regionaal Archief Alkmaar,NL-AmrRAA,2009-08-18,
```

#### Data Characteristics

- **Scope**: Dutch heritage institutions with assigned ISIL codes
- **Coverage**: ~300+ institutions (estimated from the file)
- **Authority**: the official ISIL registry for the Netherlands
- **Fields**:
  - Sequential number (Volgnr.)
  - Place/city (Plaats)
  - Institution name (Instelling)
  - ISIL code (standardized identifier)
  - Assignment date (Toegekend op)
  - Remarks/notes (Opmerking)

#### Data Quality

- **Strengths**:
  - Authoritative source for Dutch ISIL codes
  - Standardized identifier format (`NL-{code}`)
  - Official assignment dates
  - Clean, structured data
- **Limitations**:
  - Netherlands only (not global)
  - Covers only institutions with ISIL codes (a subset of all institutions)
  - Limited metadata (no URLs, collections, or institution types)
  - CSV encoding issues (stray quote marks)

### 2. Dutch Organizations Registry

**File**: `data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`

#### Current Structure

```csv
Plaatsnaam bezoekadres,Straat en huisnummer bezoekadres,Organisatie,Koepelorganisatie,Webadres organisatie,Type organisatie,Opmerkingen Inez,ISIL-code (NA),Samenwerkingsverband / Platform,Systeem,Versnellen,Collectie Nederland,...
```

#### Data Characteristics

- **Scope**: comprehensive list of Dutch heritage organizations and services
- **Coverage**: province by province (Drenthe and Flevoland visible in the sample)
- **Richness**: very detailed metadata about each institution
- **Fields** (40+ columns):
  - **Location**: province, city, street address
  - **Organization**: name, parent organization, website
  - **Classification**: type (museum, archive, library, historical society)
  - **Identifiers**: ISIL code, Chamber of Commerce (KvK) number
  - **Platforms**: participation in various heritage platforms/networks
  - **Systems**: collection management systems used (Atlantis, MAIS Flexis, ZCBS, etc.)
  - **Projects**: participation in digitization projects (Versnellen, Collectie Nederland, etc.)
  - **Registers**: museum register, Rijkscollectie, Archives Portal Europe, etc.
  - **Specialized networks**: WO2Net, Modemuze, Maritiem Digitaal, etc.
  - **Linked data**: participation in linked data initiatives
  - **Notes**: extensive comments on organizational structure and data issues

#### Data Quality

- **Strengths**:
  - Extremely rich metadata
  - Real-world operational data (systems, platforms, projects)
  - Parent/child organization relationships
  - Network/consortium memberships
  - Curator notes on data quality issues
  - Physical addresses for geocoding
  - Web URLs for crawling
- **Limitations**:
  - Netherlands-specific (single country)
  - Mixed data quality (the notes indicate issues)
  - Some institutions lack KvK registration
  - Complex organizational relationships
  - Many empty cells (not all institutions participate in all platforms)

## Integration Strategy

### Phase 1: Standardization & Normalization

#### 1.1 ISIL Codes CSV Processing

**Parsing Steps**:

1. Fix CSV encoding issues (stray quotes)
2. Parse dates to ISO 8601 format
3. Validate the ISIL code format (`NL-{prefix}{suffix}`)
4. Normalize place names (standardize Dutch municipality names)
5.
Extract institution type hints from names

**Output Schema**:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

@dataclass
class ISILRecord:
    sequence_number: int
    city: str
    city_normalized: str              # Standardized municipality
    institution_name: str
    institution_name_normalized: str  # Cleaned name
    isil_code: str
    assigned_date: date
    remarks: Optional[str]
    # Derived fields
    institution_type_hint: Optional[str]        # Extracted from name
    province: Optional[str]                     # Looked up from city
    coordinates: Optional[Tuple[float, float]]  # Geocoded
```

**Enrichment**:

- Geocode cities to coordinates
- Map cities to provinces
- Infer institution types from names (e.g., "Archief" → archive, "Museum" → museum)
- Link to Wikidata entities where possible

#### 1.2 Dutch Organizations CSV Processing

**Parsing Steps**:

1. Parse the complex CSV with 40+ columns
2. Handle empty cells gracefully
3. Normalize organization types
4. Parse boolean flags (ja/nee → true/false)
5. Validate URLs
6. Parse ISIL codes
7. Extract system/platform names
8. Process curator notes

**Output Schema**:

```python
@dataclass
class DutchOrganizationRecord:
    # Core identity
    organization_name: str
    organization_type: str              # Normalized: 'museum', 'archive', 'library', etc.
    parent_organization: Optional[str]
    # Location
    province: str
    city: str
    street_address: str
    coordinates: Optional[Tuple[float, float]]
    # Digital presence
    website_url: Optional[str]
    # Identifiers
    isil_code: Optional[str]
    kvk_number: Optional[str]           # Dutch business registry
    # Platforms & networks
    consortium_memberships: List[str]
    platform_participations: List[str]
    # Technical systems
    collection_management_system: Optional[str]
    # Project participation
    versnellen_project: bool
    collectie_nederland: bool
    museum_register: bool
    rijkscollectie: bool
    bibliotheek_collectie: bool
    dc4eu: bool                         # Digital Culture for Europe
    archives_portal_europe: bool
    wo2net: bool                        # WWII network
    specialized_networks: List[str]     # Modemuze, Maritiem Digitaal, etc.
    # Linked data
    linked_data_participation: bool
    dataset_register: bool
    # Data quality
    curator_notes: Optional[str]
    data_issues: List[str]              # Extracted from notes
```

**Normalization**:

- Standardize organization types to a controlled vocabulary
- Parse platform/system names to canonical forms
- Convert "ja"/"nieuw"/"" values to booleans/enums
- Extract structured issues from free-text notes

#### 1.3 Cross-Dataset Linking

**ISIL Code Matching**:

```python
def link_datasets(isil_records: List[ISILRecord],
                  org_records: List[DutchOrganizationRecord]) -> List[LinkedRecord]:
    """Link the ISIL registry with the Dutch organizations registry."""
    # Match on ISIL code (primary key)
    # Match on normalized organization name + city (fallback)
    # Flag conflicts and duplicates
```

**Linking Strategy**:

1. **Primary match**: ISIL code (if present in both datasets)
2. **Secondary match**: fuzzy name matching + city
3. **Conflict resolution**: manual review queue for ambiguous matches

**Output**: merged records with provenance tracking

```python
@dataclass
class LinkedDutchInstitution:
    # Merged data from both sources
    institution_name: str
    isil_code: str
    # Provenance
    in_isil_registry: bool
    in_org_registry: bool
    matched_on: str   # 'isil_code', 'name_city', 'manual'
    # All fields from both records...
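# --- Illustrative sketch, not from the source: the secondary
# "fuzzy name matching + city" step of the linking strategy above.
# difflib is stdlib; the 0.85 cutoff is an assumed threshold.
from difflib import SequenceMatcher

def name_city_match(name_a: str, city_a: str,
                    name_b: str, city_b: str,
                    threshold: float = 0.85) -> bool:
    """Fallback match on normalized organization name plus city."""
    if city_a.strip().casefold() != city_b.strip().casefold():
        return False
    similarity = SequenceMatcher(None, name_a.casefold(), name_b.casefold()).ratio()
    return similarity >= threshold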
```

### Phase 2: LinkML Schema Mapping

#### 2.1 Schema Design for Existing Data

**Heritage Institution Base Class**:

```yaml
# schemas/heritage_custodian.yaml
classes:
  HeritageCustodian:
    slots:
      - name
      - institution_types
      - isil_code
      - geographic_coverage
      - digital_platforms
      - parent_organization
      - consortium_memberships

  DutchHeritageCustodian:
    is_a: HeritageCustodian
    mixins:
      - TOOIOrganization   # Dutch government org ontology
    slots:
      - kvk_number         # Dutch-specific
      - province
      - municipality
      - versnellen_participation
      - collectie_nederland_participation
      - collection_management_system
```

#### 2.2 Mapping Rules

**ISIL Registry → LinkML**:

```yaml
mappings:
  ISILRecord_to_HeritageCustodian:
    institution_name: name
    isil_code: isil_code
    city: geographic_coverage.municipality
    assigned_date: identifiers[isil].assigned_date
    institution_type_hint: institution_types   # with inference
```

**Dutch Organizations Registry → LinkML**:

```yaml
mappings:
  DutchOrganizationRecord_to_DutchHeritageCustodian:
    organization_name: name
    organization_type: institution_types
    parent_organization: parent_organization
    isil_code: isil_code
    kvk_number: identifiers[kvk].value
    website_url: digital_platforms[0].url
    collection_management_system: technical_infrastructure.cms
    consortium_memberships: organizational_relationships.consortia
    # ... (40+ field mappings)
```

#### 2.3 Instance Generation

**Process**:

1. Parse CSV → Python dataclasses
2. Normalize/enrich the data
3. Map to the LinkML schema
4. Validate against the schema
5. Generate instances (YAML/JSON-LD)
6. Store in an RDF graph

**Output Structure**:

```
output/
├── dutch_institutions/
│   ├── instances/
│   │   ├── NL-AsdSAA.yaml    # Stadsarchief Amsterdam
│   │   ├── NL-AsdRM.jsonld   # Rijksmuseum
│   │   └── ...
│   ├── merged/
│   │   └── dutch_institutions.ttl   # All as RDF
│   └── stats/
│       └── coverage_report.json
```

### Phase 3: Integration with Conversation-Mined Data

#### 3.1 Dual Data Sources

**Data Flow**:

```
┌─────────────────────┐     ┌──────────────────────┐
│   Existing CSVs     │     │  Conversation JSONs  │
│   - ISIL registry   │     │  - Global research   │
│   - Dutch orgs      │     │  - 139 countries     │
└──────────┬──────────┘     └──────────┬───────────┘
           │                           │
           ▼                           ▼
    ┌──────────────┐          ┌─────────────────┐
    │  CSV Parser  │          │  NLP Extraction │
    │ & Normalizer │          │     Pipeline    │
    └──────┬───────┘          └────────┬────────┘
           │                           │
           ├───────────────────────────┤
           │                           │
           ▼                           ▼
┌────────────────────────────────────────┐
│       LinkML Instance Generator        │
│   - Validate against schema            │
│   - Merge overlapping records          │
│   - Track provenance                   │
└──────────────┬─────────────────────────┘
               │
               ▼
    ┌─────────────────────┐
    │     Global GLAM     │
    │       Dataset       │
    │   (RDF + exports)   │
    └─────────────────────┘
```

#### 3.2 Data Fusion Rules

**Scenario 1: Institution in both CSV and conversation data**

```python
# Example: the Rijksmuseum appears in both the ISIL registry and
# Dutch conversations about museum digitization.
def merge_institution(csv_record: LinkedDutchInstitution,
                      extracted_record: ExtractedInstitution) -> HeritageCustodian:
    """Merge data from the authoritative CSV and mined conversations."""
    merged = HeritageCustodian()

    # Prefer the CSV for authoritative fields
    merged.isil_code = csv_record.isil_code       # Authoritative
    merged.name = csv_record.organization_name    # Official name
    merged.kvk_number = csv_record.kvk_number     # Official

    # Prefer conversation data for URLs (may be more current)
    merged.website_url = (extracted_record.urls[0]
                          if extracted_record.urls else csv_record.website_url)

    # Merge collections (union of both sources)
    merged.collections = list(set(
        csv_record.specialized_networks +       # From the CSV
        extracted_record.collection_subjects    # From NLP
    ))

    # Track provenance
    merged.provenance = [
        ProvenanceRecord(source="isil_registry", confidence=1.0),
        ProvenanceRecord(source="conversation_extraction", confidence=0.85),
    ]
    return merged
```

**Scenario 2: Institution only in the CSV**

```python
# Example: a small local archive in the ISIL registry but not mentioned
# in conversations (yet).
def csv_only_institution(csv_record: LinkedDutchInstitution) -> HeritageCustodian:
    """Convert a CSV-only record to a LinkML instance."""
    # All data from the CSV
    # Mark as authoritative but with limited metadata
    # May be enriched later by web crawling
```

**Scenario 3: Institution only in conversation data**

```python
# Example: a Brazilian museum mentioned in conversations but with no
# authoritative registry entry.
def conversation_only_institution(extracted: ExtractedInstitution) -> HeritageCustodian:
    """Use conversation-mined data with appropriate confidence scores."""
    # All data from NLP extraction
    # Lower confidence scores
    # Requires validation via web crawling
```

#### 3.3 Data Completeness Scoring

```python
@dataclass
class CompletenessScore:
    """Track data completeness per institution."""
    has_name: bool = True    # Always required
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False

    @property
    def score(self) -> float:
        """Completeness as a fraction in [0, 1]."""
        fields = list(self.__dataclass_fields__)
        return sum(getattr(self, f) for f in fields) / len(fields)
```

**Target Completeness**:

- CSV-sourced institutions: >80% completeness expected
- Conversation-mined institutions: >60% completeness target
- Combined sources: >90% completeness ideal

### Phase 4: Data Enhancement Pipeline

#### 4.1 Geocoding Addresses

**For CSV data**:

```python
from geopy.geocoders import Nominatim

def geocode_dutch_addresses(records: List[DutchOrganizationRecord]):
    """Geocode full street addresses from the Dutch org registry."""
    geocoder = Nominatim(user_agent="glam-dataset-builder")
    for record in records:
        if record.street_address:
            full_address = f"{record.street_address}, {record.city}, Netherlands"
            try:
                location = geocoder.geocode(full_address)
                if location:  # geocode() returns None when nothing is found
                    record.coordinates = (location.latitude, location.longitude)
            except Exception:
                # Fall back to city-level geocoding
                pass
```

**Advantage**: the CSV data has precise street addresses → high-quality geocoding

#### 4.2 URL Validation & Crawling

**For CSV URLs**:

```python
async def validate_and_crawl_csv_urls(records: List[DutchOrganizationRecord]):
    """Validate URLs from the CSV and crawl them for additional metadata."""
    for record in records:
        if record.website_url:
            # Validate that the URL is accessible
            status = await check_url(record.website_url)
            record.url_status = status
            if status == 200:
                # Crawl for Schema.org metadata, OpenGraph, etc.
                metadata = await crawl4ai.extract_metadata(record.website_url)
                # Enrich the record with crawled data
                if metadata.get('description'):
                    record.description = metadata['description']
                if metadata.get('collections'):
                    record.collections.extend(metadata['collections'])
```

#### 4.3 Wikidata Linking

**Strategy**:

1. Search Wikidata for institution name + city
2. Filter by instance type (museum Q33506, archive Q166118, library Q7075, etc.)
3. Validate match quality
4. Extract additional identifiers (VIAF, GND, etc.)

```python
def link_to_wikidata(institution: HeritageCustodian) -> Optional[str]:
    """Find the Wikidata entity for an institution."""
    # city_qid: Wikidata QID of the institution's city, resolved beforehand
    query = f"""
    SELECT ?item WHERE {{
      ?item rdfs:label "{institution.name}"@nl .
      ?item wdt:P31/wdt:P279* wd:Q33506 .   # instance of museum (or a subclass)
      ?item wdt:P131 wd:{city_qid} .
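      # (Assumed extension, not in the original sketch) fetch the external
      # identifiers mentioned in step 4 in the same query; P214 = VIAF, P227 = GND:
      OPTIONAL {{ ?item wdt:P214 ?viaf . }}
      OPTIONAL {{ ?item wdt:P227 ?gnd . }}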
      # located in administrative territory
    }}
    """
    # Returns the Wikidata QID
```

### Phase 5: Quality Assurance

#### 5.1 Data Validation

**CSV Data Checks**:

```python
import re

import validators  # third-party URL validation package

def validate_csv_data(record: DutchOrganizationRecord) -> List[ValidationIssue]:
    issues = []

    # Required fields
    if not record.organization_name:
        issues.append(ValidationIssue("Missing organization name", "error"))

    # ISIL format (fullmatch, so trailing garbage is also caught)
    if record.isil_code and not re.fullmatch(r'NL-[A-Za-z0-9]+', record.isil_code):
        issues.append(ValidationIssue("Invalid ISIL format", "warning"))

    # URL validity
    if record.website_url and not validators.url(record.website_url):
        issues.append(ValidationIssue("Invalid URL format", "warning"))

    # Curator notes review; compare in lowercase so the mixed-case
    # phrase "geen KvK" ("no KvK") actually matches
    if record.curator_notes and "geen kvk" in record.curator_notes.lower():
        issues.append(ValidationIssue("No KvK registration", "info"))

    return issues
```

#### 5.2 Duplicate Detection

**Within the CSV data**:

- Check for duplicate ISIL codes
- Check for duplicate names + addresses
- Flag for manual review

**Across datasets**:

- Match CSV institutions to conversation-mined data
- Detect near-duplicates with fuzzy matching
- Merge or flag for review

#### 5.3 Manual Review Interface

**Flagged Cases** (from curator notes in the CSV):

- Institutions without KvK registration
- Complex organizational hierarchies (museum part of a municipality)
- Name discrepancies (official name vs.
  popular name)
- Missing parent organization links

**Review Queue**:

```python
@dataclass
class ReviewItem:
    institution_id: str
    issue_type: str   # 'duplicate', 'missing_data', 'conflict', 'ambiguous'
    description: str
    suggested_resolution: Optional[str]
    manual_decision: Optional[str]
```

### Phase 6: Export & Integration

#### 6.1 Standalone CSV Dataset Exports

**Dutch Heritage Institutions Export**:

```
output/dutch_institutions/
├── dutch_heritage_institutions.csv      # Flattened CSV
├── dutch_heritage_institutions.ttl      # RDF Turtle
├── dutch_heritage_institutions.jsonld   # JSON-LD
├── metadata.yaml                        # Dataset metadata
└── README.md                            # Documentation
```

**Statistics**:

- Total institutions: ~400 (estimated)
- With ISIL codes: ~300
- With websites: ~350
- With geocoded addresses: ~380
- Participating in Versnellen: ~100
- In Archives Portal Europe: ~150

#### 6.2 Integration into the Global Dataset

**Global GLAM Dataset Structure**:

```
output/global_glam_dataset/
├── by_country/
│   ├── netherlands/
│   │   ├── authoritative/          # From CSV registries
│   │   │   └── nl_institutions.ttl
│   │   ├── conversation_mined/     # From Dutch conversations
│   │   │   └── nl_mined.ttl
│   │   └── merged/                 # Fused data
│   │       └── nl_complete.ttl
│   ├── brazil/
│   │   └── conversation_mined/     # No authoritative CSV available
│   │       └── br_institutions.ttl
│   └── ...
├── global_merged.ttl               # All countries combined
└── statistics/
    └── coverage_by_source.json
```

**Provenance Tracking**:

```turtle
    a hc:HeritageCustodian ;
    schema:name "Rijksmuseum"@nl ;
    hc:isil_code "NL-AsdRM" ;
    prov:wasDerivedFrom
        [ a prov:Entity ;
          prov:hadPrimarySource ;
          prov:generatedAtTime "2025-11-05T12:00:00Z"^^xsd:dateTime ] ,
        [ a prov:Entity ;
          prov:hadPrimarySource ;
          prov:generatedAtTime "2025-11-05T12:30:00Z"^^xsd:dateTime ] .
```

## Data Governance

### Data Sources Hierarchy (by authority)

**Tier 1: Authoritative Registries** (highest trust)

- ISIL code registry (official)
- National library/archive registries
- Government heritage databases
- Confidence score: 1.0

**Tier 2: Curated Institutional Data** (high trust)

- Dutch organizations CSV (curated by heritage professionals)
- Institutional websites (verified)
- Confidence score: 0.9

**Tier 3: Conversation-Mined Data** (medium trust)

- NLP-extracted from conversations
- Validated by web crawling
- Confidence score: 0.7-0.85

**Tier 4: Inferred Data** (lower trust)

- Geocoded from addresses
- Type inference from names
- Linked data from fuzzy matching
- Confidence score: 0.5-0.7

### Data Update Strategy

**CSV data**:

- **Frequency**: when new versions are released (quarterly/annually)
- **Process**:
  1. Detect changes (diff against the previous version)
  2. Re-process only the changed records
  3. Update the merged dataset
  4. Publish a new version with a changelog

**Conversation data**:

- **Frequency**: as new conversations are added
- **Process**: incremental processing of new files

**Web-crawled data**:

- **Frequency**: quarterly refresh of URLs
- **Process**: re-crawl all URLs and update metadata

### License Considerations

**CSV data**:

- ISIL registry: public data (check the specific license)
- Dutch orgs CSV: clarify the license with the data provider
- Recommended: request CC0 or CC-BY 4.0

**Derived dataset**:

- Combine data under a compatible license
- Provide attribution to the original sources
- Document provenance clearly

## Implementation Priority

### Phase 1 (Week 1): CSV Processing Infrastructure

- CSV parser with encoding fixes
- Data validation rules
- Output to Python dataclasses
- Basic statistics

### Phase 2 (Week 2): LinkML Schema Integration

- Design schema extensions for CSV fields
- Implement mapping rules
- Generate LinkML instances
- Validate against the schema

### Phase 3 (Week 3-4): Enrichment

- Geocoding pipeline
- URL validation
- Wikidata linking
- Web
  crawling integration

### Phase 4 (Week 5): Merging & Quality

- Implement data fusion logic
- Duplicate detection
- Conflict resolution
- Manual review queue

### Phase 5 (Week 6): Export & Documentation

- Multi-format export
- Statistics generation
- Documentation
- Publication

## Success Metrics

### Quantitative

- **Parse success rate**: >99% of CSV records successfully parsed
- **Geocoding success**: >95% of addresses geocoded
- **URL validity**: >90% of URLs accessible
- **Wikidata linking**: >60% of institutions linked to Wikidata
- **Merge rate**: >85% of ISIL registry entries matched to the Dutch org registry

### Qualitative

- CSV data provides "ground truth" for Dutch institutions
- Conversation data adds international coverage
- The merged dataset is superior to either source alone
- Clear provenance enables trust and verification

## Open Questions

1. **License for the Dutch organizations CSV**: need to confirm acceptable use and redistribution
2. **Update frequency**: how often are the CSV files updated by their maintainers?
3. **KvK integration**: should we fetch additional data from the Dutch Chamber of Commerce API?
4. **Parent organization resolution**: how to handle complex hierarchies (e.g., a museum owned by a municipality)?
5. **Platform data usage**: how to represent participation in platforms like "Versnellen" in the LinkML schema?
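As a closing illustration, the within-dataset duplicate check from section 5.2 (duplicate ISIL codes flagged for manual review) can be sketched as a minimal helper. This is a sketch only; the dict-shaped records and the function name are assumptions, not part of the pipeline above:

```python
from collections import Counter
from typing import Dict, Iterable, List

def find_duplicate_isil_codes(records: Iterable[Dict[str, str]]) -> List[str]:
    """Return ISIL codes that appear more than once, for the manual review queue."""
    counts = Counter(r["isil_code"] for r in records if r.get("isil_code"))
    return sorted(code for code, n in counts.items() if n > 1)
```

The same pattern extends to the duplicate name-plus-address check by counting normalized `(name, address)` tuples instead of ISIL codes.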