Dutch Organizations CSV Schema Mapping Analysis

Date: 2025-11-07
Schema Version: v0.2.0
Source: data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv

Executive Summary

This document maps all 32 columns from the Dutch organizations CSV to LinkML schema fields, identifies gaps, and recommends schema extensions.

Status:

  • Fully Mapped: 17 columns (53%)
  • ⚠️ Partially Mapped: 8 columns (25%) - require enum extensions or additional slots
  • Unmapped: 7 columns (22%) - require new schema features

Column-by-Column Mapping

Fully Mapped Fields (17/32)

| CSV Column | Schema Module | Class/Slot | Notes |
|---|---|---|---|
| Plaatsnaam bezoekadres | core.yaml | Location.city | Direct mapping |
| Straat en huisnummer bezoekadres | core.yaml | Location.street_address | Direct mapping |
| Organisatie | core.yaml | HeritageCustodian.name | Primary name field |
| Webadres organisatie | core.yaml | HeritageCustodian.homepage | Maps to foaf:homepage |
| Type organisatie | enums.yaml | HeritageCustodian.institution_type | Requires value normalization |
| ISIL-code (NA) | core.yaml | Identifier (scheme=ISIL) | Multivalued identifiers list |
| Museum register | dutch.yaml | DutchHeritageCustodian.in_museum_register | Boolean flag |
| Rijkscollectie | dutch.yaml | DutchHeritageCustodian.in_rijkscollectie | Boolean flag |
| Collectie Nederland | dutch.yaml | DutchHeritageCustodian.in_collectie_nederland | Boolean flag |
| Archieven.nl | dutch.yaml | DutchHeritageCustodian.in_archieven_nl | Boolean flag |
| Linked Data | collections.yaml | DigitalPlatform + custom property | Platform with capability flag |
| Datasetregister | collections.yaml | DigitalPlatform | Dataset registry participation |
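
The "Requires value normalization" note for Type organisatie could be handled with a small lookup table. The sketch below is only an illustration: the Dutch labels in `TYPE_MAP` and the enum values they map to are assumptions, not values taken from the CSV or from enums.yaml.

```python
# Hypothetical lookup: both the Dutch labels and the enum values are assumptions.
TYPE_MAP = {
    "museum": "MUSEUM",
    "archief": "ARCHIVE",
    "bibliotheek": "LIBRARY",
}


def normalize_type(raw):
    """Return an enum value for a raw 'Type organisatie' cell, or None if unknown."""
    if not raw:
        return None
    # Trim whitespace and ignore case so " Museum " and "museum" match the same key.
    key = raw.strip().lower()
    return TYPE_MAP.get(key)
```

Returning `None` for unknown labels, rather than guessing, leaves the decision about unrecognized types to a later review step.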

Additional implicit mappings:

  • Province (derived from city) → DutchHeritageCustodian.provincie
  • Country (always "NL") → Location.country
  • Data source → Provenance.data_source = CSV_REGISTRY
  • Data tier → Provenance.data_tier = TIER_1_AUTHORITATIVE

⚠️ Partially Mapped Fields (8/32)

These fields can be mapped to existing schema structures but require extensions or enum updates.

1. Koepelorganisatie (Umbrella Organization)

  • Current Mapping: HeritageCustodian.parent_organization
  • Gap: CSV has organization NAME only, schema expects HeritageCustodian reference
  • Solution:
    • Parse as string, resolve to actual HeritageCustodian ID during cross-linking phase
    • Store temporarily in description or create parent_organization_name slot
  • Recommendation: Add parent_organization_name: string slot for unresolved references

2. Samenwerkingsverband / Platform (Collaborative Network)

  • Current Mapping: DutchHeritageCustodian.samenwerkingsverband (multivalued string)
  • Gap: CSV mixes network names AND platform names (e.g., "Geheugen van Drenthe")
  • Solution: Split into two categories:
    • Networks → samenwerkingsverband
    • Platforms → DigitalPlatform
  • Recommendation: Parser logic to classify based on keywords

3. Systeem (System/Software)

  • Current Mapping: DigitalPlatform.platform_name
  • Gap: CSV values include systems NOT in DigitalPlatformTypeEnum:
    • "Atlantis" → COLLECTION_MANAGEMENT_SYSTEM
    • "MAIS Flexis?" → COLLECTION_MANAGEMENT_SYSTEM
    • "Adlib" → COLLECTION_MANAGEMENT_SYSTEM
    • "FileMaker" → GENERIC (new value)
  • Recommendation: Extend DigitalPlatformTypeEnum with:
    GENERIC:
      description: General-purpose software not specific to heritage sector
    

4. Versnellen (Acceleration Project)

  • Current Mapping: No direct mapping
  • Gap: Boolean flag indicating participation in Dutch digitization program
  • Solution: Add to DutchHeritageCustodian:
    in_versnellen_project:
      description: Participates in Versnellen digitization acceleration program
      range: boolean
    
  • Recommendation: NEW SLOT REQUIRED

5. Bibliotheek collectie (Library Collection)

  • Current Mapping: Collection.collection_type = "bibliographic"
  • Gap: CSV has boolean "has library collection", not collection metadata
  • Solution: Add boolean flag to indicate collection type presence:
    has_library_collection:
      description: Institution holds library collections
      range: boolean
    
  • Recommendation: NEW SLOT or use collection_type filter

6. in scope voor DC4EU (DC4EU Scope)

  • Current Mapping: None yet (candidates: Partnership object or boolean flag)
  • Gap: EU project participation not modeled
  • Solution: Either:
    • A) Add partnership: Partnership(partner_name="DC4EU", partnership_type="EU_PROJECT")
    • B) Add boolean: in_dc4eu_scope: boolean
  • Recommendation: Option B (boolean) simpler for this use case

7. DC4EU aansluit route (DC4EU Connection Route)

  • Current Mapping: No mapping
  • Gap: Technical integration pathway (e.g., "API", "OAI-PMH", "direct")
  • Solution: Add to DigitalPlatform:
    integration_method:
      description: Technical method for data integration (API, OAI-PMH, SPARQL, etc.)
      range: string
    
  • Recommendation: NEW SLOT for DigitalPlatform

8. Opmerkingen Inez + Opmerkingen (Remarks/Notes)

  • Current Mapping: HeritageCustodian.description (partial)
  • Gap: CSV has TWO note fields (one from editor, one general)
  • Solution: Concatenate both into description with attribution:
    description: >-
      Editor notes (Inez): {Opmerkingen Inez}
      General remarks: {Opmerkingen}  
    
  • Recommendation: Use existing description slot, document concatenation in parser

Unmapped Fields (7/32)

These fields require NEW schema features or represent specialized domain-specific platforms.

1. Archives Portal Europe

  • Type: Boolean flag for European aggregation platform
  • Schema Gap: Not covered by current Dutch extensions
  • Recommendation: Add to DutchHeritageCustodian:
    in_archives_portal_europe:
      description: Registered in Archives Portal Europe (APE)
      range: boolean
      slot_uri: dcterms:isPartOf
    

2. WO2Net (WWII Network)

  • Type: Thematic network for WWII heritage
  • Schema Gap: No slot for thematic networks
  • Recommendation: Add to DutchHeritageCustodian:
    in_wo2net:
      description: Participates in WO2Net (WWII heritage network)
      range: boolean
    

3. Modemuze (Fashion Museum Network)

  • Type: Specialized fashion heritage network
  • Schema Gap: Domain-specific network
  • Recommendation: Add to DutchHeritageCustodian:
    in_modemuze:
      description: Participates in Modemuze (fashion heritage network)
      range: boolean
    

4. Maritiem Digitaal (Maritime Digital)

  • Type: Maritime heritage aggregation platform
  • Schema Gap: Domain-specific platform
  • Recommendation: Add to DutchHeritageCustodian:
    in_maritiem_digitaal:
      description: Contributes to Maritiem Digitaal (maritime heritage platform)
      range: boolean
    

5. Delfts aardewerk (Delft Pottery)

  • Type: Delft pottery collections network
  • Schema Gap: Collection-specific network
  • Recommendation: Add to DutchHeritageCustodian:
    in_delfts_aardewerk:
      description: Participates in Delfts aardewerk network
      range: boolean
    

6. Stichting Academisch Erfgoed (Academic Heritage Foundation)

  • Type: Academic heritage network
  • Schema Gap: Network participation
  • Recommendation: Add to DutchHeritageCustodian:
    in_academisch_erfgoed:
      description: Member of Stichting Academisch Erfgoed
      range: boolean
    

7. Coleccion Aruba

  • Type: Aruba collections network
  • Schema Gap: Caribbean/colonial heritage network
  • Recommendation: Add to DutchHeritageCustodian:
    in_coleccion_aruba:
      description: Contributes to Coleccion Aruba
      range: boolean
    

8. Van Gogh Worldwide

  • Type: International Van Gogh collections network
  • Schema Gap: Artist-specific network
  • Recommendation: Add to DutchHeritageCustodian:
    in_van_gogh_worldwide:
      description: Participates in Van Gogh Worldwide
      range: boolean
    

9. OODE24 (Mondriaan)

  • Type: Mondriaan art network/project
  • Schema Gap: Artist/movement specific network
  • Recommendation: Add to DutchHeritageCustodian:
    in_oode24_mondriaan:
      description: Participates in OODE24 (Mondriaan project)
      range: boolean
    

10. Versnellen project (Acceleration Project Details)

  • Type: Freeform text describing project participation
  • Schema Gap: Project-specific metadata
  • Recommendation: Add to DutchHeritageCustodian:
    versnellen_project_details:
      description: Details about participation in Versnellen digitization project
      range: string
    

Schema Extension Recommendations

Priority 1: Critical Gaps (Required for Full CSV Parsing)

File: schemas/dutch.yaml (DutchHeritageCustodian extensions)

slots:
  # EU/International platforms
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf
  
  in_dc4eu_scope:
    description: In scope for DC4EU (Digital Collaboration for Europe) project
    range: boolean
  
  dc4eu_integration_method:
    description: Technical integration method for DC4EU (API, OAI-PMH, SPARQL, etc.)
    range: string
  
  # Digitization programs
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean
  
  versnellen_project_details:
    description: Details about Versnellen project participation (e.g., "Upgrade", "Aanschaf")
    range: string
  
  # Thematic networks
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean
  
  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean
  
  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean
  
  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk (Delft pottery network)
    range: boolean
  
  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed (academic heritage foundation)
    range: boolean
  
  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba (Caribbean heritage network)
    range: boolean
  
  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide (international Van Gogh collections network)
    range: boolean
  
  in_oode24_mondriaan:
    description: Participates in OODE24 (Mondriaan art project)
    range: boolean
  
  # Organizational
  parent_organization_name:
    description: Name of parent/umbrella organization (unresolved reference)
    range: string
    comments:
      - "Use this when CSV has organization name but not resolved HeritageCustodian ID"
      - "Resolve to parent_organization during cross-linking phase"
  
  # Collection indicators
  has_library_collection:
    description: Institution holds library collections (boolean indicator)
    range: boolean

Priority 2: Enum Extensions

File: schemas/enums.yaml

enums:
  DigitalPlatformTypeEnum:
    # Add new value:
    GENERIC:
      description: >-
        General-purpose software not specific to heritage sector.
        Examples: FileMaker, Microsoft Access, custom databases.        

Priority 3: Core Schema Enhancements

File: schemas/collections.yaml (DigitalPlatform extensions)

slots:
  integration_method:
    description: >-
      Technical method for data integration/harvesting.
      Examples: REST API, OAI-PMH, SPARQL endpoint, CSV export, direct database access.      
    range: string
    slot_uri: schema:applicationCategory

Parsing Strategy

Phase 1: Direct Mapping (Implemented)

# Map straightforward fields
custodian.name = row['Organisatie']
custodian.homepage = row['Webadres organisatie']
custodian.institution_type = normalize_type(row['Type organisatie'])
custodian.in_museum_register = parse_boolean(row['Museum register'])
# ... etc.

Phase 2: Complex Field Handling

Handling "Koepelorganisatie" (Umbrella Organizations)

# Store temporarily as string
# Empty cells become None so the cross-linking check below skips them
custodian.parent_organization_name = (row.get('Koepelorganisatie') or '').strip() or None

# Later, during cross-linking:
if custodian.parent_organization_name:
    parent = dataset.find_by_name(custodian.parent_organization_name)
    if parent:
        custodian.parent_organization = parent.id
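
The name lookup during cross-linking is likely to need tolerant matching, since the CSV holds free-text names. The following is a minimal sketch under stated assumptions: custodians are represented as plain dicts with `id` and `name` keys, and `normalize_name`, `build_name_index` and `resolve_parent` are hypothetical helpers, not part of the existing parser.

```python
import unicodedata


def normalize_name(name):
    """Casefold, strip accents and collapse whitespace for tolerant matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.casefold().split())


def build_name_index(custodians):
    """Map normalized name -> custodian id; ambiguous names map to None."""
    index = {}
    for custodian in custodians:
        key = normalize_name(custodian["name"])
        # A second record with the same normalized name makes the key ambiguous.
        index[key] = None if key in index else custodian["id"]
    return index


def resolve_parent(parent_name, index):
    """Return the parent id, or None when the name is empty, unknown or ambiguous."""
    if not parent_name:
        return None
    return index.get(normalize_name(parent_name))
```

Mapping ambiguous names to `None` keeps the unresolved string in parent_organization_name instead of linking to the wrong institution.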

Handling "Samenwerkingsverband / Platform"

value = (row.get('Samenwerkingsverband / Platform') or '').strip()

# Classify as network vs. platform; empty cells are skipped
if not value:
    pass
elif is_digital_platform(value):  # e.g., contains "digitaal", "portal"
    custodian.digital_platforms.append(
        DigitalPlatform(platform_name=value, platform_type="DISCOVERY_PORTAL")
    )
else:  # Network/consortium
    custodian.samenwerkingsverband.append(value)
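
One possible shape for the keyword-based `is_digital_platform` check is sketched below. The keywords "digitaal" and "portal" come from the comment above; "geheugen van" is an assumption added so that the "Geheugen van Drenthe" example is classified as a platform, and the real keyword list would need to be derived from the CSV values.

```python
# Keyword list is illustrative; "geheugen van" is an assumption, not from the parser.
PLATFORM_KEYWORDS = ("digitaal", "portal", "geheugen van")


def is_digital_platform(value):
    """Return True if the cell text contains any platform keyword (case-insensitive)."""
    text = value.casefold()
    return any(keyword in text for keyword in PLATFORM_KEYWORDS)
```

A plain substring test is easy to audit, but it will misclassify names that happen to contain a keyword, so an explicit allow-list may be preferable once the set of values is known.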

Handling "Systeem" (Software Systems)

system = (row.get('Systeem') or '').strip()  # tolerate missing or empty cells
if system:
    platform_type = classify_system(system)  # Map to enum
    custodian.digital_platforms.append(
        DigitalPlatform(
            platform_name=system,
            platform_type=platform_type
        )
    )

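def classify_system(name):
    """Map system names to DigitalPlatformTypeEnum."""
    cms_systems = ['Atlantis', 'Adlib', 'MAIS', 'TMS', 'Axiell']
    generic_systems = ['FileMaker', 'Access']
    lowered = name.lower()
    # Case-insensitive substring match, so variants such as
    # "MAIS Flexis?" or "Microsoft Access" are also recognized
    if any(cms.lower() in lowered for cms in cms_systems):
        return 'COLLECTION_MANAGEMENT_SYSTEM'
    elif any(generic.lower() in lowered for generic in generic_systems):
        return 'GENERIC'
    # ... etc.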

Handling Multiple Notes Fields

notes_parts = []
if row['Opmerkingen Inez']:
    notes_parts.append(f"Editor notes (Inez): {row['Opmerkingen Inez']}")
if row['Opmerkingen']:
    notes_parts.append(f"General remarks: {row['Opmerkingen']}")

if notes_parts:
    custodian.description = '\n\n'.join(notes_parts)

Phase 3: Boolean Flag Mapping

# Thematic networks (new boolean flags)
custodian.in_wo2net = parse_boolean(row['WO2Net'])
custodian.in_modemuze = parse_boolean(row['Modemuze'])
custodian.in_maritiem_digitaal = parse_boolean(row['Maritiem Digitaal'])
# ... etc.

def parse_boolean(value):
    """Parse Dutch CSV boolean representations."""
    if not value or value.strip() == '':
        return None
    value_lower = value.strip().lower()
    # Dutch: 'ja' = yes, 'nee' = no
    if value_lower in ['ja', 'yes', 'true', '1', 'x']:
        return True
    if value_lower in ['nee', 'no', 'false', '0']:
        return False
    return None  # Ambiguous
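
Because the network flags all follow the same pattern, the per-column assignments could be driven by a single lookup table. This is a sketch under assumptions: the column headers in `BOOLEAN_COLUMNS` beyond WO2Net, Modemuze and Maritiem Digitaal are guessed from the section titles in this document, `apply_boolean_flags` is a hypothetical helper, and `parse_boolean` is repeated only to keep the example self-contained.

```python
from types import SimpleNamespace

# CSV column -> schema slot. Headers other than the first three are assumed.
BOOLEAN_COLUMNS = {
    "WO2Net": "in_wo2net",
    "Modemuze": "in_modemuze",
    "Maritiem Digitaal": "in_maritiem_digitaal",
    "Archives Portal Europe": "in_archives_portal_europe",
    "Delfts aardewerk": "in_delfts_aardewerk",
    "Stichting Academisch Erfgoed": "in_academisch_erfgoed",
    "Coleccion Aruba": "in_coleccion_aruba",
    "Van Gogh Worldwide": "in_van_gogh_worldwide",
}


def parse_boolean(value):
    """Same rules as the parser above: True, False, or None when empty/ambiguous."""
    if not value or value.strip() == "":
        return None
    value_lower = value.strip().lower()
    if value_lower in ["ja", "yes", "true", "1", "x"]:
        return True
    if value_lower in ["nee", "no", "false", "0"]:
        return False
    return None


def apply_boolean_flags(custodian, row):
    """Set every mapped boolean slot on the custodian from one CSV row."""
    for column, slot in BOOLEAN_COLUMNS.items():
        setattr(custodian, slot, parse_boolean(row.get(column)))
    return custodian


# Usage with a stand-in object instead of the real custodian class:
example = apply_boolean_flags(SimpleNamespace(), {"WO2Net": "ja", "Modemuze": "Nee"})
```

Keeping the mapping in one dictionary also gives a single place to check when a column is renamed in a later CSV export.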

Validation Checklist

Before considering CSV fully mapped:

  • All 32 columns have documented mapping strategy
  • Schema extensions implemented in dutch.yaml
  • Enum extensions implemented in enums.yaml
  • Parser handles all field types (string, boolean, reference, multivalued)
  • Provenance metadata captures CSV source (TIER_1_AUTHORITATIVE)
  • Test coverage for each new slot
  • Cross-linking logic for parent_organization_name → parent_organization
  • Classification logic for Samenwerkingsverband (network vs. platform)
  • System name → DigitalPlatformTypeEnum mapping complete
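
The first checklist item could be verified mechanically by comparing the CSV header against the set of columns that have a documented mapping. The sketch below assumes a comma-delimited file and uses two hypothetical helpers, `read_header` and `unmapped_columns`; the actual delimiter and the full mapped-column set would come from the parser.

```python
import csv
import io


def read_header(csv_text, delimiter=","):
    """Return the first row of the CSV text as a list of column names."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return next(reader)


def unmapped_columns(header, mapped):
    """Return CSV columns with no documented mapping, preserving header order."""
    return [column for column in header if column.strip() not in mapped]


# Usage with a tiny inline sample instead of the real file:
sample = "Organisatie,Systeem,Onbekend\nx,y,z\n"
missing = unmapped_columns(read_header(sample), {"Organisatie", "Systeem"})
```

An empty result would indicate that every column in the header is covered; any names returned are columns still lacking a mapping.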

Impact Assessment

Schema Changes Required

  • dutch.yaml: +15 new slots, as listed under Priority 1 (mostly boolean flags for network participation)
  • enums.yaml: +1 enum value (DigitalPlatformTypeEnum.GENERIC)
  • collections.yaml: +1 slot (DigitalPlatform.integration_method)

Backward Compatibility

  • All changes are ADDITIVE (new optional slots)
  • No breaking changes to existing schema
  • Existing data remains valid

Data Quality Implications

  • TIER_1_AUTHORITATIVE data will have richer Dutch network participation metadata
  • Cross-linking with conversation data (TIER_4_INFERRED) will benefit from parent organization names
  • Boolean flags enable precise filtering (e.g., "all institutions in Modemuze network")

Next Steps

  1. Implement Priority 1 extensions in schemas/dutch.yaml (Ready to proceed)
  2. Update parser (src/glam_extractor/parsers/dutch_orgs.py) to use new slots
  3. Create test fixtures with real CSV rows exercising all field types
  4. Update documentation (AGENTS.md) with new Dutch-specific extraction patterns
  5. Regenerate LinkML artifacts (JSON Schema, Python dataclasses, SQL DDL)
  6. Validate with real data (1,351 institutions from CSV)

Open Questions

  1. Should thematic networks be modeled as booleans OR as Partnership objects?

    • Current recommendation: Booleans (simpler, CSV is just participation flags)
    • Alternative: Partnership(partner_name="Modemuze", partnership_type="THEMATIC_NETWORK")
    • Decision needed for long-term maintainability
  2. How to handle uncertain system names (e.g., "MAIS Flexis?")?

    • Option A: Strip "?" and parse as "MAIS Flexis"
    • Option B: Store in notes, mark confidence_score lower
    • Option C: Create platform_name_uncertain: boolean flag
  3. Should "Bibliotheek collectie" create actual Collection objects or just set a flag?

    • Current: Flag approach (has_library_collection: boolean)
    • Alternative: Create Collection(collection_type="bibliographic") stub
    • Depends on whether CSV will later provide actual collection metadata
  4. Geographic scope: Should Caribbean networks (Coleccion Aruba) be in dutch.yaml?

    • Aruba is a constituent country of the Kingdom of the Netherlands, located in the Caribbean rather than in Europe
    • Consider creating dutch_caribbean.yaml module?
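
For open question 2, Options A and C can be combined: strip the trailing "?" for the stored name and keep the uncertainty as a separate flag. The helper below is a hypothetical sketch of that idea, not a decision on the question; `split_uncertain` does not exist in the parser.

```python
def split_uncertain(name):
    """Return (clean_name, is_uncertain) for a raw 'Systeem' cell."""
    cleaned = name.strip()
    # A trailing question mark is treated as the editor's uncertainty marker.
    uncertain = cleaned.endswith("?")
    return cleaned.rstrip("?").strip(), uncertain
```

The boolean could then feed either a platform_name_uncertain slot (Option C) or a lowered confidence_score (Option B), depending on which option is chosen.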

Document Status: DRAFT
Next Review: After schema extensions implemented
Maintainer: GLAM Data Extraction Project