# Global GLAM Dataset: Data Standardization Strategy
## Overview

This document describes how the existing structured datasets in the `data/` directory will be standardized, integrated, and fed into the Global GLAM dataset extraction pipeline alongside data mined from conversation files.
## Existing Data Assets

### 1. ISIL Codes Registry

**File**: `data/ISIL-codes_2025-08-01.csv`

#### Current Structure

```csv
Volgnr.,Plaats,Instelling,ISIL code,Toegekend op,Opmerking
1,Aalten,Nationaal Onderduikmuseum,NL-AtNOM,2021-03-17,
2,Alkmaar,Regionaal Archief Alkmaar,NL-AmrRAA,2009-08-18,
```
#### Data Characteristics

- **Scope**: Dutch heritage institutions with assigned ISIL codes
- **Coverage**: ~300+ institutions (estimated from the file)
- **Authority**: Official ISIL registry for the Netherlands
- **Fields**:
  - Sequential number (Volgnr.)
  - Place/city (Plaats)
  - Institution name (Instelling)
  - ISIL code (standardized identifier)
  - Assignment date (Toegekend op)
  - Remarks/notes (Opmerking)

#### Data Quality

- **Strengths**:
  - Authoritative source for Dutch ISIL codes
  - Standardized identifier format (NL-{code})
  - Official assignment dates
  - Clean, structured data
- **Limitations**:
  - Netherlands only (not global)
  - Only institutions with ISIL codes (a subset of all institutions)
  - Limited metadata (no URLs, collections, or types)
  - CSV encoding issues (stray quote marks)
### 2. Dutch Organizations Registry

**File**: `data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`

#### Current Structure

```csv
Plaatsnaam bezoekadres,Straat en huisnummer bezoekadres,Organisatie,Koepelorganisatie,
Webadres organisatie,Type organisatie,Opmerkingen Inez,ISIL-code (NA),
Samenwerkingsverband / Platform,Systeem,Versnellen,Collectie Nederland,...
```
#### Data Characteristics

- **Scope**: Comprehensive list of Dutch heritage organizations and services
- **Coverage**: Province by province (Drenthe and Flevoland visible in the sample)
- **Richness**: Very detailed metadata about each institution
- **Fields** (40+ columns):
  - **Location**: Province, city, street address
  - **Organization**: Name, parent organization, website
  - **Classification**: Type (museum, archive, library, historical society)
  - **Identifiers**: ISIL code, Chamber of Commerce (KvK) number
  - **Platforms**: Participation in various heritage platforms/networks
  - **Systems**: Collection management systems used (Atlantis, MAIS Flexis, ZCBS, etc.)
  - **Projects**: Participation in digitization projects (Versnellen, Collectie Nederland, etc.)
  - **Registers**: Museum register, Rijkscollectie, Archives Portal Europe, etc.
  - **Specialized networks**: WO2Net, Modemuze, Maritiem Digitaal, etc.
  - **Linked data**: Participation in linked data initiatives
  - **Notes**: Extensive comments on organizational structure and data issues

#### Data Quality

- **Strengths**:
  - Extremely rich metadata
  - Real-world operational data (systems, platforms, projects)
  - Parent/child organization relationships
  - Network/consortium memberships
  - Curator notes on data quality issues
  - Physical addresses for geocoding
  - Web URLs for crawling
- **Limitations**:
  - Netherlands-specific (single country)
  - Mixed data quality (the notes themselves flag issues)
  - Some institutions lack KvK registration
  - Complex organizational relationships
  - Many empty cells (not all institutions participate in all platforms)
## Integration Strategy

### Phase 1: Standardization & Normalization

#### 1.1 ISIL Codes CSV Processing
**Parsing Steps**:

1. Fix CSV encoding issues (stray quote marks)
2. Parse dates to ISO 8601 format
3. Validate the ISIL code format (NL-{prefix}{suffix})
4. Normalize place names (standardize Dutch municipality names)
5. Extract institution type hints from names

**Output Schema**:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass
class ISILRecord:
    sequence_number: int
    city: str
    city_normalized: str              # Standardized municipality name
    institution_name: str
    institution_name_normalized: str  # Cleaned name
    isil_code: str
    assigned_date: date
    remarks: Optional[str]

    # Derived fields
    institution_type_hint: Optional[str] = None        # Extracted from name
    province: Optional[str] = None                     # Looked up from city
    coordinates: Optional[Tuple[float, float]] = None  # Geocoded
```

**Enrichment**:

- Geocode cities to coordinates
- Map cities to provinces
- Infer institution types from names (e.g., "Archief" → archive, "Museum" → museum)
- Link to Wikidata entities where possible
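The name-based type inference can be sketched as a keyword lookup over the institution name. The keyword table below is illustrative only; the real mapping would be driven by the controlled vocabulary in the LinkML schema.

```python
# Illustrative keyword → type table (not the project's final vocabulary).
TYPE_KEYWORDS = {
    "archief": "archive",
    "museum": "museum",
    "bibliotheek": "library",
    "historische vereniging": "historical_society",
}

def infer_institution_type(name: str):
    """Guess an institution type from Dutch keywords in its name."""
    lowered = name.lower()
    for keyword, inst_type in TYPE_KEYWORDS.items():
        if keyword in lowered:
            return inst_type
    return None  # No hint found; leave institution_type_hint unset
```

For example, "Regionaal Archief Alkmaar" yields `"archive"`, while a name without any keyword leaves the hint unset.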
#### 1.2 Dutch Organizations CSV Processing

**Parsing Steps**:

1. Parse the complex CSV with 40+ columns
2. Handle empty cells gracefully
3. Normalize organization types
4. Parse boolean flags (ja/nee → true/false)
5. Validate URLs
6. Parse ISIL codes
7. Extract system/platform names
8. Process curator notes
**Output Schema**:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DutchOrganizationRecord:
    # Core identity
    organization_name: str
    organization_type: str  # Normalized: 'museum', 'archive', 'library', etc.
    parent_organization: Optional[str]

    # Location
    province: str
    city: str
    street_address: str
    coordinates: Optional[Tuple[float, float]]

    # Digital presence
    website_url: Optional[str]

    # Identifiers
    isil_code: Optional[str]
    kvk_number: Optional[str]  # Dutch business registry

    # Platforms & networks
    consortium_memberships: List[str] = field(default_factory=list)
    platform_participations: List[str] = field(default_factory=list)

    # Technical systems
    collection_management_system: Optional[str] = None

    # Project participation
    versnellen_project: bool = False
    collectie_nederland: bool = False
    museum_register: bool = False
    rijkscollectie: bool = False
    bibliotheek_collectie: bool = False
    dc4eu: bool = False  # Digital Culture for Europe
    archives_portal_europe: bool = False
    wo2net: bool = False  # WWII network
    specialized_networks: List[str] = field(default_factory=list)  # Modemuze, Maritiem Digitaal, etc.

    # Linked data
    linked_data_participation: bool = False
    dataset_register: bool = False

    # Data quality
    curator_notes: Optional[str] = None
    data_issues: List[str] = field(default_factory=list)  # Extracted from notes
```

**Normalization**:

- Standardize organization types to a controlled vocabulary
- Parse platform/system names to canonical forms
- Convert "ja"/"nieuw"/"" to boolean/enum values
- Extract structured issues from free-text notes
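The "ja"/"nieuw"/empty conversion can be handled by a small parser. The three-valued enum is an assumption here; the project may prefer plain booleans for columns that never contain "nieuw".

```python
from enum import Enum

class Participation(Enum):
    YES = "yes"
    NEW = "new"  # "nieuw": joined recently / onboarding
    NO = "no"

def parse_flag(cell) -> Participation:
    """Map a raw CSV cell ("ja", "nieuw", "", or None) to a Participation value."""
    value = (cell or "").strip().lower()
    if value == "ja":
        return Participation.YES
    if value == "nieuw":
        return Participation.NEW
    return Participation.NO
```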
#### 1.3 Cross-Dataset Linking

**ISIL Code Matching**:

```python
def link_datasets(isil_records: List[ISILRecord],
                  org_records: List[DutchOrganizationRecord]) -> List[LinkedRecord]:
    """
    Link the ISIL registry with the Dutch organizations registry.
    """
    # Match on ISIL code (primary key)
    # Match on normalized organization name + city (fallback)
    # Flag conflicts and duplicates
    ...
```

**Linking Strategy**:

1. **Primary match**: ISIL code (if present in both datasets)
2. **Secondary match**: Fuzzy name matching + city
3. **Conflict resolution**: Manual review queue for ambiguous matches
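The secondary (fuzzy) match can be sketched with the standard library's `difflib`; the 0.85 threshold is a starting guess to be tuned against the manual review queue.

```python
from difflib import SequenceMatcher

def name_city_similarity(a_name: str, a_city: str,
                         b_name: str, b_city: str) -> float:
    """Name similarity in [0, 1], gated on the two cities matching exactly."""
    if a_city.strip().lower() != b_city.strip().lower():
        return 0.0
    return SequenceMatcher(None, a_name.lower(), b_name.lower()).ratio()

def is_secondary_match(a_name: str, a_city: str,
                       b_name: str, b_city: str,
                       threshold: float = 0.85) -> bool:
    """Candidate pair for merging (or for the review queue near the cutoff)."""
    return name_city_similarity(a_name, a_city, b_name, b_city) >= threshold
```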
**Output**: Merged records with provenance tracking

```python
@dataclass
class LinkedDutchInstitution:
    # Merged data from both sources
    institution_name: str
    isil_code: str

    # Provenance
    in_isil_registry: bool
    in_org_registry: bool
    matched_on: str  # 'isil_code', 'name_city', 'manual'

    # All fields from both records...
```
### Phase 2: LinkML Schema Mapping

#### 2.1 Schema Design for Existing Data

**Heritage Institution Base Class**:

```yaml
# schemas/heritage_custodian.yaml
classes:
  HeritageCustodian:
    slots:
      - name
      - institution_types
      - isil_code
      - geographic_coverage
      - digital_platforms
      - parent_organization
      - consortium_memberships

  DutchHeritageCustodian:
    is_a: HeritageCustodian
    mixins:
      - TOOIOrganization  # Dutch government organization ontology
    slots:
      - kvk_number  # Dutch-specific
      - province
      - municipality
      - versnellen_participation
      - collectie_nederland_participation
      - collection_management_system
```
#### 2.2 Mapping Rules

**ISIL Registry → LinkML**:

```yaml
mappings:
  ISILRecord_to_HeritageCustodian:
    institution_name: name
    isil_code: isil_code
    city: geographic_coverage.municipality
    assigned_date: identifiers[isil].assigned_date
    institution_type_hint: institution_types  # with inference
```
**Dutch Organizations Registry → LinkML**:

```yaml
mappings:
  DutchOrganizationRecord_to_DutchHeritageCustodian:
    organization_name: name
    organization_type: institution_types
    parent_organization: parent_organization
    isil_code: isil_code
    kvk_number: identifiers[kvk].value
    website_url: digital_platforms[0].url
    collection_management_system: technical_infrastructure.cms
    consortium_memberships: organizational_relationships.consortia
    # ... (40+ field mappings)
```
#### 2.3 Instance Generation

**Process**:

1. Parse CSV → Python dataclass
2. Normalize/enrich data
3. Map to the LinkML schema
4. Validate against the schema
5. Generate instances (YAML/JSON-LD)
6. Store in an RDF graph

**Output Structure**:

```
output/
├── dutch_institutions/
│   ├── instances/
│   │   ├── NL-AsdSAA.yaml    # Stadsarchief Amsterdam
│   │   ├── NL-AsdRM.jsonld   # Rijksmuseum
│   │   └── ...
│   ├── merged/
│   │   └── dutch_institutions.ttl  # All as RDF
│   └── stats/
│       └── coverage_report.json
```
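The process above can be sketched end to end with a minimal stand-in record. The `@id` URI pattern follows the provenance example later in this document; the stand-in class and field names are placeholders rather than the final LinkML slot names.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MinimalRecord:
    # Stand-in for the richer ISILRecord/DutchOrganizationRecord classes
    institution_name: str
    isil_code: str
    city: str

def to_instance_dict(record: MinimalRecord) -> dict:
    """Serialize a record to a dict ready for YAML/JSON-LD emission."""
    doc = asdict(record)
    doc["@id"] = f"https://glam-dataset.org/institution/{record.isil_code}"
    return doc

instance = to_instance_dict(
    MinimalRecord("Regionaal Archief Alkmaar", "NL-AmrRAA", "Alkmaar")
)
print(json.dumps(instance, indent=2, ensure_ascii=False))
```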
### Phase 3: Integration with Conversation-Mined Data

#### 3.1 Dual Data Sources

**Data Flow**:

```
┌─────────────────────┐          ┌──────────────────────┐
│  Existing CSVs      │          │  Conversation JSONs  │
│  - ISIL registry    │          │  - Global research   │
│  - Dutch orgs       │          │  - 139 countries     │
└──────────┬──────────┘          └──────────┬───────────┘
           │                                │
           ▼                                ▼
   ┌──────────────┐                ┌─────────────────┐
   │  CSV Parser  │                │  NLP Extraction │
   │ & Normalizer │                │    Pipeline     │
   └──────┬───────┘                └────────┬────────┘
          │                                 │
          ├─────────────────────────────────┤
          │                                 │
          ▼                                 ▼
┌────────────────────────────────────────┐
│       LinkML Instance Generator        │
│  - Validate against schema             │
│  - Merge overlapping records           │
│  - Track provenance                    │
└──────────────┬─────────────────────────┘
               │
               ▼
     ┌─────────────────────┐
     │    Global GLAM      │
     │      Dataset        │
     │  (RDF + exports)    │
     └─────────────────────┘
```
#### 3.2 Data Fusion Rules

**Scenario 1: Institution in both CSV and conversation data**

```python
# Example: the Rijksmuseum appears in both the ISIL registry and
# Dutch conversations about museum digitization.

def merge_institution(csv_record: LinkedDutchInstitution,
                      extracted_record: ExtractedInstitution) -> HeritageCustodian:
    """
    Merge data from the authoritative CSV and the mined conversation.
    """
    merged = HeritageCustodian()

    # Prefer CSV for authoritative fields
    merged.isil_code = csv_record.isil_code     # Authoritative
    merged.name = csv_record.organization_name  # Official name
    merged.kvk_number = csv_record.kvk_number   # Official

    # Prefer conversation data for URLs (may be more current)
    merged.website_url = (extracted_record.urls[0]
                          if extracted_record.urls else csv_record.website_url)

    # Merge collections (union of both sources)
    merged.collections = list(set(
        csv_record.specialized_networks +     # From CSV
        extracted_record.collection_subjects  # From NLP
    ))

    # Track provenance
    merged.provenance = [
        ProvenanceRecord(source="isil_registry", confidence=1.0),
        ProvenanceRecord(source="conversation_extraction", confidence=0.85),
    ]

    return merged
```
**Scenario 2: Institution only in CSV**

```python
# Example: a small local archive in the ISIL registry but not (yet)
# mentioned in conversations.

def csv_only_institution(csv_record: LinkedDutchInstitution) -> HeritageCustodian:
    """
    Convert a CSV-only record to a LinkML instance.
    """
    # All data comes from the CSV.
    # Mark as authoritative but with limited metadata.
    # May be enriched later by web crawling.
    ...
```

**Scenario 3: Institution only in conversation data**

```python
# Example: a Brazilian museum mentioned in conversations with no
# authoritative registry entry.

def conversation_only_institution(extracted: ExtractedInstitution) -> HeritageCustodian:
    """
    Use conversation-mined data with appropriate confidence scores.
    """
    # All data comes from NLP extraction.
    # Assign lower confidence scores.
    # Requires validation via web crawling.
    ...
```
#### 3.3 Data Completeness Scoring

```python
from dataclasses import dataclass

@dataclass
class CompletenessScore:
    """Track data completeness per institution."""
    has_name: bool = True  # Always required
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False

    @property
    def score(self) -> float:
        """Completeness as a fraction of tracked fields."""
        # __dataclass_fields__ lists only the bool fields; properties
        # such as `score` itself are not included.
        fields = list(self.__dataclass_fields__)
        return sum(getattr(self, f) for f in fields) / len(fields)
```

**Target Completeness**:

- CSV-sourced institutions: >80% completeness expected
- Conversation-mined institutions: >60% completeness target
- Combined sources: >90% completeness ideal
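A quick check of the scoring logic; the class is repeated here so the example stands alone.

```python
from dataclasses import dataclass

@dataclass
class CompletenessScore:
    # Mirror of the fields sketched in §3.3 so this example is self-contained
    has_name: bool = True
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False

    @property
    def score(self) -> float:
        fields = list(self.__dataclass_fields__)
        return sum(getattr(self, f) for f in fields) / len(fields)

# A typical CSV-sourced record: name, type, location, ISIL, and website known
csv_sourced = CompletenessScore(
    has_type=True, has_location=True, has_isil=True, has_website=True
)
print(csv_sourced.score)  # → 0.625 (5 of 8 tracked fields present)
```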
### Phase 4: Data Enhancement Pipeline

#### 4.1 Geocoding Addresses

**For CSV data**:

```python
from typing import List

from geopy.exc import GeocoderServiceError
from geopy.geocoders import Nominatim

def geocode_dutch_addresses(records: List[DutchOrganizationRecord]):
    """
    Geocode full street addresses from the Dutch org registry.
    """
    geocoder = Nominatim(user_agent="glam-dataset-builder")

    for record in records:
        if record.street_address:
            full_address = f"{record.street_address}, {record.city}, Netherlands"
            try:
                location = geocoder.geocode(full_address)
                if location is not None:
                    record.coordinates = (location.latitude, location.longitude)
            except GeocoderServiceError:
                # Fall back to city-level geocoding
                pass
```

**Advantage**: the CSV data has precise street addresses, which enables high-quality geocoding.
#### 4.2 URL Validation & Crawling

**For CSV URLs**:

```python
async def validate_and_crawl_csv_urls(records: List[DutchOrganizationRecord]):
    """
    Validate URLs from the CSV and crawl them for additional metadata.
    """
    for record in records:
        if record.website_url:
            # Validate that the URL is accessible
            status = await check_url(record.website_url)
            record.url_status = status

            if status == 200:
                # Crawl for Schema.org metadata, OpenGraph, etc.
                metadata = await crawl4ai.extract_metadata(record.website_url)

                # Enrich the record with crawled data
                if metadata.get('description'):
                    record.description = metadata['description']
                if metadata.get('collections'):
                    record.collections.extend(metadata['collections'])
```
#### 4.3 Wikidata Linking

**Strategy**:

1. Search Wikidata for institution name + city
2. Filter by instance type (Q33506, Q166118, Q7075, etc.)
3. Validate match quality
4. Extract additional identifiers (VIAF, GND, etc.)

```python
def link_to_wikidata(institution: HeritageCustodian,
                     type_qid: str, city_qid: str) -> Optional[str]:
    """
    Find the Wikidata entity for an institution.
    """
    query = f"""
    SELECT ?item WHERE {{
        ?item rdfs:label "{institution.name}"@nl .
        ?item wdt:P31/wdt:P279* wd:{type_qid} .  # instance of a GLAM type (or subclass)
        ?item wdt:P131 wd:{city_qid} .           # located in administrative territory
    }}
    """
    # Execute against the Wikidata SPARQL endpoint; returns a QID or None
```
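Building the query through a helper keeps the quoting safe, since institution names can contain quote characters. The QIDs in the usage line are illustrative assumptions and should be double-checked against Wikidata before use.

```python
def build_wikidata_query(label: str, type_qid: str, city_qid: str) -> str:
    """Assemble the SPARQL lookup, escaping double quotes so the label
    cannot break out of the string literal."""
    safe_label = label.replace('"', '\\"')
    return f"""
SELECT ?item WHERE {{
  ?item rdfs:label "{safe_label}"@nl .
  ?item wdt:P31/wdt:P279* wd:{type_qid} .
  ?item wdt:P131 wd:{city_qid} .
}}
"""

# Illustrative QIDs (assumed: archive-type institution, Amsterdam)
query = build_wikidata_query("Stadsarchief Amsterdam", "Q166118", "Q727")
```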
### Phase 5: Quality Assurance

#### 5.1 Data Validation

**CSV Data Checks**:

```python
import re

import validators  # third-party URL validation package

def validate_csv_data(record: DutchOrganizationRecord) -> List[ValidationIssue]:
    issues = []

    # Required fields
    if not record.organization_name:
        issues.append(ValidationIssue("Missing organization name", "error"))

    # ISIL format
    if record.isil_code and not re.match(r'NL-[A-Za-z0-9]+', record.isil_code):
        issues.append(ValidationIssue("Invalid ISIL format", "warning"))

    # URL validity
    if record.website_url and not validators.url(record.website_url):
        issues.append(ValidationIssue("Invalid URL format", "warning"))

    # Curator notes review ("geen KvK" = no KvK registration; compare
    # against the lowercased needle so the lowercased haystack can match)
    if record.curator_notes and "geen kvk" in record.curator_notes.lower():
        issues.append(ValidationIssue("No KvK registration", "info"))

    return issues
```
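One caveat on the ISIL check above: `re.match` anchors only at the start of the string, so trailing junk still passes. A `fullmatch` variant rejects it:

```python
import re

# Anchored variant of the pattern used in validate_csv_data
ISIL_RE = re.compile(r"NL-[A-Za-z0-9]+")

def is_valid_dutch_isil(code: str) -> bool:
    """True only if the whole string is a well-formed Dutch ISIL code."""
    return ISIL_RE.fullmatch(code) is not None
```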
#### 5.2 Duplicate Detection

**Within CSV data**:

- Check for duplicate ISIL codes
- Check for duplicate names + addresses
- Flag for manual review

**Across datasets**:

- Match CSV institutions to conversation-mined data
- Detect near-duplicates with fuzzy matching
- Merge, or flag for review
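Duplicate ISIL detection is a simple grouping pass; plain dicts stand in for the record classes so this sketch is self-contained.

```python
from collections import defaultdict

def find_duplicate_isils(records):
    """Group records by ISIL code and return only the codes that occur
    more than once, mapped to their colliding records."""
    by_code = defaultdict(list)
    for rec in records:
        code = rec.get("isil_code")
        if code:  # skip records without an ISIL code
            by_code[code].append(rec)
    return {code: group for code, group in by_code.items() if len(group) > 1}
```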
#### 5.3 Manual Review Interface

**Flagged Cases** (from curator notes in the CSV):

- Institutions without KvK registration
- Complex organizational hierarchies (e.g., a museum that is part of a municipality)
- Name discrepancies (official name vs. popular name)
- Missing parent organization links

**Review Queue**:

```python
@dataclass
class ReviewItem:
    institution_id: str
    issue_type: str  # 'duplicate', 'missing_data', 'conflict', 'ambiguous'
    description: str
    suggested_resolution: Optional[str]
    manual_decision: Optional[str]
```
### Phase 6: Export & Integration

#### 6.1 Standalone CSV Dataset Exports

**Dutch Heritage Institutions Export**:

```
output/dutch_institutions/
├── dutch_heritage_institutions.csv     # Flattened CSV
├── dutch_heritage_institutions.ttl     # RDF Turtle
├── dutch_heritage_institutions.jsonld  # JSON-LD
├── metadata.yaml                       # Dataset metadata
└── README.md                           # Documentation
```

**Statistics** (estimated):

- Total institutions: ~400
- With ISIL codes: ~300
- With websites: ~350
- With geocoded addresses: ~380
- Participating in Versnellen: ~100
- In Archives Portal Europe: ~150
#### 6.2 Integration into the Global Dataset

**Global GLAM Dataset Structure**:

```
output/global_glam_dataset/
├── by_country/
│   ├── netherlands/
│   │   ├── authoritative/          # From CSV registries
│   │   │   └── nl_institutions.ttl
│   │   ├── conversation_mined/     # From Dutch conversations
│   │   │   └── nl_mined.ttl
│   │   └── merged/                 # Fused data
│   │       └── nl_complete.ttl
│   ├── brazil/
│   │   └── conversation_mined/     # No authoritative CSV available
│   │       └── br_institutions.ttl
│   └── ...
├── global_merged.ttl               # All countries combined
└── statistics/
    └── coverage_by_source.json
```

**Provenance Tracking**:

```turtle
<https://glam-dataset.org/institution/NL-AsdRM> a hc:HeritageCustodian ;
    schema:name "Rijksmuseum"@nl ;
    hc:isil_code "NL-AsdRM" ;
    prov:wasDerivedFrom [
        a prov:Entity ;
        prov:hadPrimarySource <data/ISIL-codes_2025-08-01.csv> ;
        prov:generatedAtTime "2025-11-05T12:00:00Z"^^xsd:dateTime
    ] ,
    [
        a prov:Entity ;
        prov:hadPrimarySource <conversations/Dutch_GLAM_research.json> ;
        prov:generatedAtTime "2025-11-05T12:30:00Z"^^xsd:dateTime
    ] .
```
## Data Governance

### Data Sources Hierarchy (by authority)

**Tier 1: Authoritative Registries** (highest trust)

- ISIL code registry (official)
- National library/archive registries
- Government heritage databases
- Confidence score: 1.0

**Tier 2: Curated Institutional Data** (high trust)

- Dutch organizations CSV (curated by heritage professionals)
- Institutional websites (verified)
- Confidence score: 0.9

**Tier 3: Conversation-Mined Data** (medium trust)

- NLP-extracted from conversations
- Validated by web crawling
- Confidence score: 0.7-0.85

**Tier 4: Inferred Data** (lower trust)

- Geocoded from addresses
- Type inference from names
- Linked data from fuzzy matching
- Confidence score: 0.5-0.7
### Data Update Strategy

**CSV Data**:

- **Frequency**: When new versions are released (quarterly/annually)
- **Process**:
  1. Detect changes (diff against the previous version)
  2. Re-process only changed records
  3. Update the merged dataset
  4. Publish a new version with a changelog

**Conversation Data**:

- **Frequency**: As new conversations are added
- **Process**: Incremental processing of new files

**Web-Crawled Data**:

- **Frequency**: Quarterly refresh of URLs
- **Process**: Re-crawl all URLs and update metadata

### License Considerations

**CSV Data**:

- ISIL registry: Public data (check the specific license)
- Dutch orgs CSV: Clarify the license with the data provider
- Recommended: Request CC0 or CC-BY 4.0

**Derived Dataset**:

- Combine data under a compatible license
- Provide attribution to original sources
- Document provenance clearly
## Implementation Priority

### Phase 1 (Week 1): CSV Processing Infrastructure

- CSV parser with encoding fixes
- Data validation rules
- Output to Python dataclasses
- Basic statistics

### Phase 2 (Week 2): LinkML Schema Integration

- Design schema extensions for CSV fields
- Implement mapping rules
- Generate LinkML instances
- Validate against the schema

### Phase 3 (Weeks 3-4): Enrichment

- Geocoding pipeline
- URL validation
- Wikidata linking
- Web crawling integration

### Phase 4 (Week 5): Merging & Quality

- Implement data fusion logic
- Duplicate detection
- Conflict resolution
- Manual review queue

### Phase 5 (Week 6): Export & Documentation

- Multi-format export
- Statistics generation
- Documentation
- Publication
## Success Metrics

### Quantitative

- **Parse success rate**: >99% of CSV records successfully parsed
- **Geocoding success**: >95% of addresses geocoded
- **URL validity**: >90% of URLs accessible
- **Wikidata linking**: >60% of institutions linked to Wikidata
- **Merge rate**: >85% of ISIL registry entries matched to the Dutch org registry

### Qualitative

- CSV data provides "ground truth" for Dutch institutions
- Conversation data adds international coverage
- The merged dataset is superior to either source alone
- Clear provenance enables trust and verification
## Open Questions

1. **License for the Dutch organizations CSV**: Confirm acceptable use and redistribution.
2. **Update frequency**: How often are the CSV files updated by their maintainers?
3. **KvK integration**: Should we fetch additional data from the Dutch Chamber of Commerce API?
4. **Parent organization resolution**: How do we handle complex hierarchies (e.g., a museum owned by a municipality)?
5. **Platform data representation**: How do we represent participation in platforms like "Versnellen" in the LinkML schema?