
Global GLAM Dataset: Data Standardization Strategy

Overview

This document describes how existing structured datasets in the data/ directory will be standardized, integrated, and fed into the Global GLAM dataset extraction pipeline alongside data mined from conversation files.

Existing Data Assets

1. ISIL Codes Registry

File: data/ISIL-codes_2025-08-01.csv

Current Structure

Volgnr.,Plaats,Instelling,ISIL code,Toegekend op,Opmerking
1,Aalten,Nationaal Onderduikmuseum,NL-AtNOM,2021-03-17,
2,Alkmaar,Regionaal Archief Alkmaar,NL-AmrRAA,2009-08-18,

Data Characteristics

  • Scope: Dutch heritage institutions with assigned ISIL codes
  • Coverage: ~300 institutions (estimated from the file)
  • Authority: Official ISIL registry for Netherlands
  • Fields:
    • Sequential number (Volgnr.)
    • Place/City (Plaats)
    • Institution name (Instelling)
    • ISIL code (standardized identifier)
    • Assignment date (Toegekend op)
    • Remarks/Notes (Opmerking)

Data Quality

  • Strengths:
    • Authoritative source for Dutch ISIL codes
    • Standardized identifier format (NL-{code})
    • Official assignment dates
    • Clean, structured data
  • Limitations:
    • Netherlands only (not global)
    • Only institutions with ISIL codes (subset of all institutions)
    • Limited metadata (no URLs, collections, types)
    • CSV encoding issues (extra quote marks)

2. Dutch Organizations Registry

File: data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv

Current Structure

Plaatsnaam bezoekadres,Straat en huisnummer bezoekadres,Organisatie,Koepelorganisatie,
Webadres organisatie,Type organisatie,Opmerkingen Inez,ISIL-code (NA),
Samenwerkingsverband / Platform,Systeem,Versnellen,Collectie Nederland,...

Data Characteristics

  • Scope: Comprehensive list of Dutch heritage organizations and services
  • Coverage: Province-by-province (Drenthe, Flevoland visible in sample)
  • Richness: Very detailed metadata about each institution
  • Fields (40+ columns):
    • Location: Province, city, street address
    • Organization: Name, parent organization, website
    • Classification: Type (museum, archive, library, historical society)
    • Identifiers: ISIL code, Chamber of Commerce (KvK) number
    • Platforms: Participation in various heritage platforms/networks
    • Systems: Collection management systems used (Atlantis, MAIS Flexis, ZCBS, etc.)
    • Projects: Participation in digitization projects (Versnellen, Collectie Nederland, etc.)
    • Registers: Museum register, Rijkscollectie, Archives Portal Europe, etc.
    • Specialized networks: WO2Net, Modemuze, Maritiem Digitaal, etc.
    • Linked Data: Participation in linked data initiatives
    • Notes: Extensive comments on organizational structure, data issues

Data Quality

  • Strengths:
    • Extremely rich metadata
    • Real-world operational data (systems, platforms, projects)
    • Parent/child organization relationships
    • Network/consortium memberships
    • Curator notes on data quality issues
    • Physical addresses for geocoding
    • Web URLs for crawling
  • Limitations:
    • Netherlands-specific (single country)
    • Mixed data quality (notes indicate issues)
    • Some institutions lack KvK registration
    • Complex organizational relationships
    • Many empty cells (not all institutions participate in all platforms)

Integration Strategy

Phase 1: Standardization & Normalization

1.1 ISIL Codes CSV Processing

Parsing Steps:

  1. Fix CSV encoding issues (extra quotes)
  2. Parse dates to ISO 8601 format
  3. Validate ISIL code format (NL-{prefix}{suffix})
  4. Normalize place names (standardize Dutch municipality names)
  5. Extract institution type hints from names
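
A minimal sketch of steps 1-3, using only the standard library; the column names follow the CSV header shown above, and the cleanup rules are illustrative:

import csv
import re
from datetime import datetime

ISIL_PATTERN = re.compile(r"NL-[A-Za-z0-9]+")

def parse_isil_row(row: dict) -> dict:
    """Clean one row from the ISIL CSV (column names as in the registry header)."""
    # Step 1: strip stray quote marks left by the encoding issues
    cleaned = {k: v.strip().strip('"') for k, v in row.items()}
    # Step 2: parse the assignment date into a date object (already ISO-shaped)
    cleaned["Toegekend op"] = datetime.strptime(
        cleaned["Toegekend op"], "%Y-%m-%d"
    ).date()
    # Step 3: validate the ISIL code format
    if not ISIL_PATTERN.fullmatch(cleaned["ISIL code"]):
        raise ValueError(f"Invalid ISIL code: {cleaned['ISIL code']}")
    return cleaned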

Output Schema:

from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass
class ISILRecord:
    sequence_number: int
    city: str
    city_normalized: str  # Standardized municipality
    institution_name: str
    institution_name_normalized: str  # Cleaned name
    isil_code: str
    assigned_date: date
    remarks: Optional[str]
    
    # Derived fields
    institution_type_hint: Optional[str]  # Extracted from name
    province: Optional[str]  # Looked up from city
    coordinates: Optional[Tuple[float, float]]  # Geocoded

Enrichment:

  • Geocode cities to coordinates
  • Map cities to provinces
  • Infer institution types from names (e.g., "Archief" → archive, "Museum" → museum)
  • Link to Wikidata entities where possible
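
A minimal sketch of the name-based type inference mentioned above; the keyword map is illustrative, not an exhaustive controlled vocabulary:

from typing import Optional

# Illustrative Dutch keyword → type map; a production mapping would be curated
TYPE_KEYWORDS = {
    "archief": "archive",
    "museum": "museum",
    "bibliotheek": "library",
    "historische vereniging": "historical_society",
}

def infer_institution_type(name: str) -> Optional[str]:
    """Infer an institution type hint from keywords in its name."""
    lowered = name.lower()
    for keyword, inst_type in TYPE_KEYWORDS.items():
        if keyword in lowered:
            return inst_type
    return None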

1.2 Dutch Organizations CSV Processing

Parsing Steps:

  1. Parse complex CSV with 40+ columns
  2. Handle empty cells gracefully
  3. Normalize organization types
  4. Parse boolean flags (ja/nee → true/false)
  5. Validate URLs
  6. Parse ISIL codes
  7. Extract system/platform names
  8. Process curator notes

Output Schema:

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DutchOrganizationRecord:
    # Core identity
    organization_name: str
    organization_type: str  # Normalized: 'museum', 'archive', 'library', etc.
    parent_organization: Optional[str]
    
    # Location
    province: str
    city: str
    street_address: str
    coordinates: Optional[Tuple[float, float]]
    
    # Digital presence
    website_url: Optional[str]
    
    # Identifiers
    isil_code: Optional[str]
    kvk_number: Optional[str]  # Dutch business registry
    
    # Platforms & networks
    consortium_memberships: List[str]
    platform_participations: List[str]
    
    # Technical systems
    collection_management_system: Optional[str]
    
    # Project participation
    versnellen_project: bool
    collectie_nederland: bool
    museum_register: bool
    rijkscollectie: bool
    bibliotheek_collectie: bool
    dc4eu: bool  # Digital Culture for Europe
    archives_portal_europe: bool
    wo2net: bool  # WWII network
    specialized_networks: List[str]  # Modemuze, Maritiem Digitaal, etc.
    
    # Linked data
    linked_data_participation: bool
    dataset_register: bool
    
    # Data quality
    curator_notes: Optional[str]
    data_issues: List[str]  # Extracted from notes

Normalization:

  • Standardize organization types to controlled vocabulary
  • Parse platform/system names to canonical forms
  • Convert "ja"/"nieuw"/"" to boolean/enum values
  • Extract structured issues from free-text notes
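
A minimal sketch of the flag conversion; the value set mirrors the ones noted above ("ja", "nieuw", empty, plus "nee" from the boolean columns), and unrecognized values are surfaced for review rather than silently coerced:

from enum import Enum
from typing import Optional

class ParticipationFlag(Enum):
    YES = "yes"
    NEW = "new"
    NO = "no"

def parse_participation_flag(raw: Optional[str]) -> ParticipationFlag:
    """Map free-text cell values to a controlled participation flag."""
    value = (raw or "").strip().lower()
    if value == "ja":
        return ParticipationFlag.YES
    if value == "nieuw":
        return ParticipationFlag.NEW
    if value in ("", "nee"):
        return ParticipationFlag.NO
    raise ValueError(f"Unrecognized flag value: {raw!r}")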

1.3 Cross-Dataset Linking

ISIL Code Matching:

def link_datasets(isil_records: List[ISILRecord], 
                  org_records: List[DutchOrganizationRecord]) -> List[LinkedRecord]:
    """
    Link ISIL registry with Dutch organizations registry
    """
    # Match on ISIL code (primary key)
    # Match on normalized organization name + city (fallback)
    # Flag conflicts and duplicates

Linking Strategy:

  1. Primary match: ISIL code (if present in both datasets)
  2. Secondary match: Fuzzy name matching + city
  3. Conflict resolution: Manual review queue for ambiguous matches
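
A sketch of the secondary match (step 2) using only the standard library's difflib; the 0.9 similarity cutoff is an assumption to be tuned against real data:

from difflib import SequenceMatcher

def fuzzy_name_match(isil_rec: ISILRecord,
                     org_rec: DutchOrganizationRecord,
                     threshold: float = 0.9) -> bool:
    """Secondary match: same city plus high name similarity."""
    if isil_rec.city_normalized.lower() != org_rec.city.lower():
        return False
    similarity = SequenceMatcher(
        None,
        isil_rec.institution_name_normalized.lower(),
        org_rec.organization_name.lower(),
    ).ratio()  # assumption: 0.9 keeps e.g. spelling variants, drops different orgs
    return similarity >= threshold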

Output: Merged records with provenance tracking

@dataclass
class LinkedDutchInstitution:
    # Merged data from both sources
    institution_name: str
    isil_code: str
    
    # Provenance
    in_isil_registry: bool
    in_org_registry: bool
    matched_on: str  # 'isil_code', 'name_city', 'manual'
    
    # All fields from both records...

Phase 2: LinkML Schema Mapping

2.1 Schema Design for Existing Data

Heritage Institution Base Class:

# schemas/heritage_custodian.yaml
classes:
  HeritageCustodian:
    slots:
      - name
      - institution_types
      - isil_code
      - geographic_coverage
      - digital_platforms
      - parent_organization
      - consortium_memberships
      
  DutchHeritageCustodian:
    is_a: HeritageCustodian
    mixins:
      - TOOIOrganization  # Dutch government org ontology
    slots:
      - kvk_number  # Dutch-specific
      - province
      - municipality
      - versnellen_participation
      - collectie_nederland_participation
      - collection_management_system

2.2 Mapping Rules

ISIL Registry → LinkML:

mappings:
  ISILRecord_to_HeritageCustodian:
    institution_name: name
    isil_code: isil_code
    city: geographic_coverage.municipality
    assigned_date: identifiers[isil].assigned_date
    institution_type_hint: institution_types (with inference)

Dutch Organizations Registry → LinkML:

mappings:
  DutchOrganizationRecord_to_DutchHeritageCustodian:
    organization_name: name
    organization_type: institution_types
    parent_organization: parent_organization
    isil_code: isil_code
    kvk_number: identifiers[kvk].value
    website_url: digital_platforms[0].url
    collection_management_system: technical_infrastructure.cms
    consortium_memberships: organizational_relationships.consortia
    # ... (40+ field mappings)

2.3 Instance Generation

Process:

  1. Parse CSV → Python dataclass
  2. Normalize/enrich data
  3. Map to LinkML schema
  4. Validate against schema
  5. Generate instances (YAML/JSON-LD)
  6. Store in RDF graph
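
A sketch of steps 3-5, assuming a recent linkml release that exposes linkml.validator.validate; the schema file and class name follow section 2.1, and emit_instance is a hypothetical helper:

from pathlib import Path

import yaml
from linkml.validator import validate  # assumption: linkml >= 1.5 validator API

SCHEMA = Path("schemas/heritage_custodian.yaml")

def emit_instance(record: dict, out_dir: Path) -> None:
    """Validate a mapped record against the schema, then write it as YAML."""
    report = validate(record, SCHEMA, target_class="DutchHeritageCustodian")
    if report.results:  # a non-empty results list means validation problems
        raise ValueError([r.message for r in report.results])
    out_path = out_dir / f"{record['isil_code']}.yaml"
    out_path.write_text(yaml.safe_dump(record, allow_unicode=True))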

Output Structure:

output/
├── dutch_institutions/
│   ├── instances/
│   │   ├── NL-AsdSAA.yaml           # Stadsarchief Amsterdam
│   │   ├── NL-AsdRM.jsonld          # Rijksmuseum
│   │   └── ...
│   ├── merged/
│   │   └── dutch_institutions.ttl   # All as RDF
│   └── stats/
│       └── coverage_report.json

Phase 3: Integration with Conversation-Mined Data

3.1 Dual Data Sources

Data Flow:

┌─────────────────────┐         ┌──────────────────────┐
│  Existing CSVs      │         │  Conversation JSONs  │
│  - ISIL registry    │         │  - Global research   │
│  - Dutch orgs       │         │  - 139 countries     │
└──────────┬──────────┘         └──────────┬───────────┘
           │                               │
           ▼                               ▼
    ┌──────────────┐              ┌─────────────────┐
    │ CSV Parser   │              │  NLP Extraction │
    │ & Normalizer │              │    Pipeline     │
    └──────┬───────┘              └────────┬────────┘
           │                               │
           ├───────────────────────────────┤
           │                               │
           ▼                               ▼
    ┌────────────────────────────────────────┐
    │    LinkML Instance Generator           │
    │  - Validate against schema             │
    │  - Merge overlapping records           │
    │  - Track provenance                    │
    └──────────────┬─────────────────────────┘
                   │
                   ▼
          ┌─────────────────────┐
          │  Global GLAM        │
          │  Dataset            │
          │  (RDF + exports)    │
          └─────────────────────┘

3.2 Data Fusion Rules

Scenario 1: Institution in both CSV and conversation data

# Example: Rijksmuseum appears in both ISIL registry and 
# Dutch conversations about museum digitization

def merge_institution(csv_record: LinkedDutchInstitution,
                     extracted_record: ExtractedInstitution) -> HeritageCustodian:
    """
    Merge data from authoritative CSV and mined conversation
    """
    merged = HeritageCustodian()
    
    # Prefer CSV for authoritative fields
    merged.isil_code = csv_record.isil_code  # Authoritative
    merged.name = csv_record.organization_name  # Official name
    merged.kvk_number = csv_record.kvk_number  # Official
    
    # Prefer conversation data for URLs (may be more current)
    merged.website_url = extracted_record.urls[0] if extracted_record.urls else csv_record.website_url
    
    # Merge collections (union of both sources)
    merged.collections = list(set(
        csv_record.specialized_networks +  # From CSV
        extracted_record.collection_subjects  # From NLP
    ))
    
    # Track provenance
    merged.provenance = [
        ProvenanceRecord(source="isil_registry", confidence=1.0),
        ProvenanceRecord(source="conversation_extraction", confidence=0.85)
    ]
    
    return merged

Scenario 2: Institution only in CSV

# Example: Small local archive in ISIL registry but not mentioned 
# in conversations (yet)

def csv_only_institution(csv_record: LinkedDutchInstitution) -> HeritageCustodian:
    """
    Convert CSV-only record to LinkML instance
    """
    # All data from CSV
    # Mark as authoritative but limited metadata
    # May be enriched later by web crawling

Scenario 3: Institution only in conversation data

# Example: Brazilian museum mentioned in conversations but no 
# authoritative registry entry

def conversation_only_institution(extracted: ExtractedInstitution) -> HeritageCustodian:
    """
    Use conversation-mined data with appropriate confidence scores
    """
    # All data from NLP extraction
    # Lower confidence scores
    # Requires validation via web crawling

3.3 Data Completeness Scoring

@dataclass
class CompletenessScore:
    """Track data completeness per institution"""
    has_name: bool = True  # Always required
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False
    
    @property
    def score(self) -> float:
        """Completeness as a fraction between 0.0 and 1.0"""
        # 'score' is a property, not a dataclass field, so no filtering is needed
        fields = list(self.__dataclass_fields__)
        return sum(getattr(self, f) for f in fields) / len(fields)
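
For example, a conversation-mined record carrying only a name, a type, and a location scores 3/8 = 0.375, well below the targets listed below.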

Target Completeness:

  • CSV-sourced institutions: >80% completeness expected
  • Conversation-mined institutions: >60% completeness target
  • Combined sources: >90% completeness ideal

Phase 4: Data Enhancement Pipeline

4.1 Geocoding Addresses

For CSV data:

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

def geocode_dutch_addresses(records: List[DutchOrganizationRecord]):
    """
    Geocode full street addresses from Dutch org registry
    """
    geocoder = Nominatim(user_agent="glam-dataset-builder")
    # Nominatim's usage policy allows at most one request per second
    geocode = RateLimiter(geocoder.geocode, min_delay_seconds=1)
    
    for record in records:
        if record.street_address:
            full_address = f"{record.street_address}, {record.city}, Netherlands"
            try:
                location = geocode(full_address)
                if location is not None:
                    record.coordinates = (location.latitude, location.longitude)
                # No hit on the full address: fall back to city-level geocoding
            except Exception:
                # Geocoder/network error: fall back to city-level geocoding
                pass

Advantage: CSV data has precise street addresses → high-quality geocoding

4.2 URL Validation & Crawling

For CSV URLs:

async def validate_and_crawl_csv_urls(records: List[DutchOrganizationRecord]):
    """
    Validate URLs from CSV and crawl for additional metadata
    """
    for record in records:
        if record.website_url:
            # Validate URL is accessible
            status = await check_url(record.website_url)
            record.url_status = status
            
            if status == 200:
                # Crawl for Schema.org metadata, OpenGraph, etc.
                metadata = await crawl4ai.extract_metadata(record.website_url)
                
                # Enrich record with crawled data
                if metadata.get('description'):
                    record.description = metadata['description']
                if metadata.get('collections'):
                    record.collections.extend(metadata['collections'])

4.3 Wikidata Linking

Strategy:

  1. Search Wikidata for institution name + city
  2. Filter by instance type (Q33506 museum, Q166118 archive, Q7075 library, etc.)
  3. Validate match quality
  4. Extract additional identifiers (VIAF, GND, etc.)

A sketch of the lookup (city_qid is assumed to be resolved from the institution's municipality beforehand):

def link_to_wikidata(institution: HeritageCustodian) -> Optional[str]:
    """
    Find Wikidata entity for institution
    """
    query = f"""
    SELECT ?item WHERE {{
      ?item rdfs:label "{institution.name}"@nl .
      ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or subclass)
      ?item wdt:P131 wd:{city_qid} .  # located in administrative territory
    }}
    """
    # Returns Wikidata QID

Phase 5: Quality Assurance

5.1 Data Validation

CSV Data Checks:

import re

import validators  # third-party 'validators' package

def validate_csv_data(record: DutchOrganizationRecord) -> List[ValidationIssue]:
    issues = []
    
    # Required fields
    if not record.organization_name:
        issues.append(ValidationIssue("Missing organization name", "error"))
    
    # ISIL format (fullmatch so trailing garbage is also caught)
    if record.isil_code and not re.fullmatch(r'NL-[A-Za-z0-9]+', record.isil_code):
        issues.append(ValidationIssue("Invalid ISIL format", "warning"))
    
    # URL validity
    if record.website_url and not validators.url(record.website_url):
        issues.append(ValidationIssue("Invalid URL format", "warning"))
    
    # Curator notes review (compare lowercase to lowercase, or it never matches)
    if record.curator_notes and "geen kvk" in record.curator_notes.lower():
        issues.append(ValidationIssue("No KvK registration", "info"))
    
    return issues

5.2 Duplicate Detection

Within CSV data:

  • Check for duplicate ISIL codes
  • Check for duplicate names + addresses
  • Flag for manual review
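
A minimal sketch of the duplicate ISIL check:

from collections import Counter
from typing import List

def find_duplicate_isil_codes(records: List[LinkedDutchInstitution]) -> List[str]:
    """Return ISIL codes that appear on more than one record."""
    counts = Counter(r.isil_code for r in records if r.isil_code)
    return [code for code, n in counts.items() if n > 1]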

Across datasets:

  • Match CSV institutions to conversation-mined data
  • Detect near-duplicates with fuzzy matching
  • Merge or flag for review

5.3 Manual Review Interface

Flagged Cases (from curator notes in CSV):

  • Institutions without KvK registration
  • Complex organizational hierarchies (museum part of municipality)
  • Name discrepancies (official name vs. popular name)
  • Missing parent organization links

Review Queue:

@dataclass
class ReviewItem:
    institution_id: str
    issue_type: str  # 'duplicate', 'missing_data', 'conflict', 'ambiguous'
    description: str
    suggested_resolution: Optional[str]
    manual_decision: Optional[str]

Phase 6: Export & Integration

6.1 Standalone CSV Dataset Exports

Dutch Heritage Institutions Export:

output/dutch_institutions/
├── dutch_heritage_institutions.csv        # Flattened CSV
├── dutch_heritage_institutions.ttl        # RDF Turtle
├── dutch_heritage_institutions.jsonld     # JSON-LD
├── metadata.yaml                          # Dataset metadata
└── README.md                              # Documentation

Statistics:

  • Total institutions: ~400 (estimated)
  • With ISIL codes: ~300
  • With websites: ~350
  • With geocoded addresses: ~380
  • Participating in Versnellen: ~100
  • In Archives Portal Europe: ~150

6.2 Integration into Global Dataset

Global GLAM Dataset Structure:

output/global_glam_dataset/
├── by_country/
│   ├── netherlands/
│   │   ├── authoritative/           # From CSV registries
│   │   │   └── nl_institutions.ttl
│   │   ├── conversation_mined/      # From Dutch conversations
│   │   │   └── nl_mined.ttl
│   │   └── merged/                  # Fused data
│   │       └── nl_complete.ttl
│   ├── brazil/
│   │   └── conversation_mined/      # No authoritative CSV available
│   │       └── br_institutions.ttl
│   └── ...
├── global_merged.ttl                # All countries combined
└── statistics/
    └── coverage_by_source.json

Provenance Tracking:

<https://glam-dataset.org/institution/NL-AsdRM> a hc:HeritageCustodian ;
    schema:name "Rijksmuseum"@nl ;
    hc:isil_code "NL-AsdRM" ;
    prov:wasDerivedFrom [
        a prov:Entity ;
        prov:hadPrimarySource <data/ISIL-codes_2025-08-01.csv> ;
        prov:generatedAtTime "2025-11-05T12:00:00Z"^^xsd:dateTime
    ] ,
    [
        a prov:Entity ;
        prov:hadPrimarySource <conversations/Dutch_GLAM_research.json> ;
        prov:generatedAtTime "2025-11-05T12:30:00Z"^^xsd:dateTime
    ] .

Data Governance

Data Sources Hierarchy (by authority)

Tier 1: Authoritative Registries (highest trust)

  • ISIL code registry (official)
  • National library/archive registries
  • Government heritage databases
  • Confidence score: 1.0

Tier 2: Curated Institutional Data (high trust)

  • Dutch organizations CSV (curated by heritage professionals)
  • Institutional websites (verified)
  • Confidence score: 0.9

Tier 3: Conversation-Mined Data (medium trust)

  • NLP-extracted from conversations
  • Validated by web crawling
  • Confidence score: 0.7-0.85

Tier 4: Inferred Data (lower trust)

  • Geocoded from addresses
  • Type inference from names
  • Linked data from fuzzy matching
  • Confidence score: 0.5-0.7

Data Update Strategy

CSV Data:

  • Frequency: When new versions are released (quarterly/annually)
  • Process:
    1. Detect changes (diff against previous version)
    2. Re-process only changed records
    3. Update merged dataset
    4. Publish new version with changelog
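
A sketch of the change detection in step 1, assuming the ISIL code is a stable per-row key (any stable key column would do):

import csv
import hashlib

def row_fingerprints(path: str, key: str = "ISIL code") -> dict:
    """Map each row's key to a hash of its full content."""
    with open(path, newline="", encoding="utf-8") as f:
        return {
            row[key]: hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
            for row in csv.DictReader(f)
        }

def changed_keys(old_csv: str, new_csv: str) -> set:
    """Keys whose rows are new or whose content changed between versions."""
    old, new = row_fingerprints(old_csv), row_fingerprints(new_csv)
    return {k for k, h in new.items() if old.get(k) != h}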

Conversation Data:

  • Frequency: As new conversations added
  • Process: Incremental processing of new files

Web-Crawled Data:

  • Frequency: Quarterly refresh of URLs
  • Process: Re-crawl all URLs, update metadata

License Considerations

CSV Data:

  • ISIL registry: Public data (check specific license)
  • Dutch orgs CSV: Clarify license with data provider
  • Recommended: Request CC0 or CC-BY 4.0

Derived Dataset:

  • Combine data under compatible license
  • Provide attribution to original sources
  • Document provenance clearly

Implementation Priority

Phase 1 (Week 1): CSV Processing Infrastructure

  • CSV parser with encoding fixes
  • Data validation rules
  • Output to Python dataclasses
  • Basic statistics

Phase 2 (Week 2): LinkML Schema Integration

  • Design schema extensions for CSV fields
  • Implement mapping rules
  • Generate LinkML instances
  • Validate against schema

Phase 3 (Week 3-4): Enrichment

  • Geocoding pipeline
  • URL validation
  • Wikidata linking
  • Web crawling integration

Phase 4 (Week 5): Merging & Quality

  • Implement data fusion logic
  • Duplicate detection
  • Conflict resolution
  • Manual review queue

Phase 5 (Week 6): Export & Documentation

  • Multi-format export
  • Statistics generation
  • Documentation
  • Publication

Success Metrics

Quantitative

  • Parse success rate: >99% of CSV records successfully parsed
  • Geocoding success: >95% of addresses geocoded
  • URL validity: >90% of URLs accessible
  • Wikidata linking: >60% linked to Wikidata
  • Merge rate: >85% of ISIL registry entries matched to Dutch org registry

Qualitative

  • CSV data provides "ground truth" for Dutch institutions
  • Conversation data adds international coverage
  • Merged dataset superior to either source alone
  • Clear provenance enables trust and verification

Open Questions

  1. License for Dutch organizations CSV: Need to confirm acceptable use and redistribution
  2. Update frequency: How often are CSV files updated by maintainers?
  3. KvK integration: Should we fetch additional data from Dutch Chamber of Commerce API?
  4. Parent organization resolution: How to handle complex hierarchies (museum owned by municipality)?
  5. Platform data usage: How to represent participation in platforms like "Versnellen" in LinkML schema?