# Dutch Organizations CSV Schema Mapping Analysis

**Date**: 2025-11-07  
**Schema Version**: v0.2.0  
**Source**: `data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`

## Executive Summary

This document maps all 32 columns from the Dutch organizations CSV to LinkML schema fields, identifies gaps, and recommends schema extensions.

**Status**:

- ✅ **Fully Mapped**: 17 columns (53%)
- ⚠️ **Partially Mapped**: 8 columns (25%) - require enum extensions or additional slots
- ❌ **Unmapped**: 7 columns (22%) - require new schema features

---

## Column-by-Column Mapping

### ✅ Fully Mapped Fields (17/32)

| CSV Column | Schema Module | Class/Slot | Notes |
|------------|---------------|------------|-------|
| **Plaatsnaam bezoekadres** | `core.yaml` | `Location.city` | Direct mapping |
| **Straat en huisnummer bezoekadres** | `core.yaml` | `Location.street_address` | Direct mapping |
| **Organisatie** | `core.yaml` | `HeritageCustodian.name` | Primary name field |
| **Webadres organisatie** | `core.yaml` | `HeritageCustodian.homepage` | Maps to `foaf:homepage` |
| **Type organisatie** | `enums.yaml` | `HeritageCustodian.institution_type` | Requires value normalization |
| **ISIL-code (NA)** | `core.yaml` | `Identifier` (scheme=ISIL) | Multivalued identifiers list |
| **Museum register** | `dutch.yaml` | `DutchHeritageCustodian.in_museum_register` | Boolean flag |
| **Rijkscollectie** | `dutch.yaml` | `DutchHeritageCustodian.in_rijkscollectie` | Boolean flag |
| **Collectie Nederland** | `dutch.yaml` | `DutchHeritageCustodian.in_collectie_nederland` | Boolean flag |
| **Archieven.nl** | `dutch.yaml` | `DutchHeritageCustodian.in_archieven_nl` | Boolean flag |
| **Linked Data** | `collections.yaml` | `DigitalPlatform` + custom property | Platform with capability flag |
| **Datasetregister** | `collections.yaml` | `DigitalPlatform` | Dataset registry participation |

**Additional implicit mappings**:

- Province (derived from city) →
  `DutchHeritageCustodian.provincie`
- Country (always "NL") → `Location.country`
- Data source → `Provenance.data_source = CSV_REGISTRY`
- Data tier → `Provenance.data_tier = TIER_1_AUTHORITATIVE`

---

### ⚠️ Partially Mapped Fields (8/32)

These fields can be mapped to existing schema structures but require extensions or enum updates.

#### 1. **Koepelorganisatie** (Umbrella Organization)

- **Current Mapping**: `HeritageCustodian.parent_organization`
- **Gap**: CSV has organization NAME only, schema expects HeritageCustodian reference
- **Solution**:
  - Parse as string, resolve to actual HeritageCustodian ID during cross-linking phase
  - Store temporarily in `description` or create `parent_organization_name` slot
- **Recommendation**: Add `parent_organization_name: string` slot for unresolved references

#### 2. **Samenwerkingsverband / Platform** (Collaborative Network)

- **Current Mapping**: `DutchHeritageCustodian.samenwerkingsverband` (multivalued string)
- **Gap**: CSV mixes network names AND platform names (e.g., "Geheugen van Drenthe")
- **Solution**: Split into two categories:
  - Networks → `samenwerkingsverband`
  - Platforms → `DigitalPlatform`
- **Recommendation**: Parser logic to classify based on keywords

#### 3. **Systeem** (System/Software)

- **Current Mapping**: `DigitalPlatform.platform_name`
- **Gap**: CSV values include systems NOT in `DigitalPlatformTypeEnum`:
  - "Atlantis" → COLLECTION_MANAGEMENT_SYSTEM
  - "MAIS Flexis?" → COLLECTION_MANAGEMENT_SYSTEM
  - "Adlib" → COLLECTION_MANAGEMENT_SYSTEM
  - "FileMaker" → GENERIC (new value)
- **Recommendation**: Extend `DigitalPlatformTypeEnum` with:
  ```yaml
  GENERIC:
    description: General-purpose software not specific to heritage sector
  ```

#### 4. **Versnellen** (Acceleration Project)

- **Current Mapping**: No direct mapping
- **Gap**: Boolean flag indicating participation in Dutch digitization program
- **Solution**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean
  ```
- **Recommendation**: NEW SLOT REQUIRED

#### 5. **Bibliotheek collectie** (Library Collection)

- **Current Mapping**: `Collection.collection_type = "bibliographic"`
- **Gap**: CSV has boolean "has library collection", not collection metadata
- **Solution**: Add boolean flag to indicate collection type presence:
  ```yaml
  has_library_collection:
    description: Institution holds library collections
    range: boolean
  ```
- **Recommendation**: NEW SLOT or use `collection_type` filter

#### 6. **in scope voor DC4EU** (DC4EU Scope)

- **Current Mapping**: `Partnership` or boolean flag
- **Gap**: EU project participation not modeled
- **Solution**: Either:
  - A) Add partnership: `Partnership(partner_name="DC4EU", partnership_type="EU_PROJECT")`
  - B) Add boolean: `in_dc4eu_scope: boolean`
- **Recommendation**: Option B (boolean) simpler for this use case

#### 7. **DC4EU aansluit route** (DC4EU Connection Route)

- **Current Mapping**: No mapping
- **Gap**: Technical integration pathway (e.g., "API", "OAI-PMH", "direct")
- **Solution**: Add to `DigitalPlatform`:
  ```yaml
  integration_method:
    description: Technical method for data integration (API, OAI-PMH, SPARQL, etc.)
    range: string
  ```
- **Recommendation**: NEW SLOT for DigitalPlatform

#### 8. **Opmerkingen Inez** + **Opmerkingen** (Remarks/Notes)

- **Current Mapping**: `HeritageCustodian.description` (partial)
- **Gap**: CSV has TWO note fields (one from editor, one general)
- **Solution**: Concatenate both into `description` with attribution:
  ```yaml
  description: >-
    Editor notes (Inez): {Opmerkingen Inez}

    General remarks: {Opmerkingen}
  ```
- **Recommendation**: Use existing `description` slot, document concatenation in parser

---

### ❌ Unmapped Fields (7/32)

These fields require NEW schema features or represent specialized domain-specific platforms.

#### 1. **Archives Portal Europe**

- **Type**: Boolean flag for European aggregation platform
- **Schema Gap**: Not covered by current Dutch extensions
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf
  ```

#### 2. **WO2Net** (WWII Network)

- **Type**: Thematic network for WWII heritage
- **Schema Gap**: No slot for thematic networks
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean
  ```

#### 3. **Modemuze** (Fashion Museum Network)

- **Type**: Specialized fashion heritage network
- **Schema Gap**: Domain-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean
  ```

#### 4. **Maritiem Digitaal** (Maritime Digital)

- **Type**: Maritime heritage aggregation platform
- **Schema Gap**: Domain-specific platform
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean
  ```

#### 5. **Delfts aardewerk** (Delft Pottery)

- **Type**: Delft pottery collections network
- **Schema Gap**: Collection-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk network
    range: boolean
  ```

#### 6. **Stichting Academisch Erfgoed** (Academic Heritage Foundation)

- **Type**: Academic heritage network
- **Schema Gap**: Network participation
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed
    range: boolean
  ```

#### 7. **Coleccion Aruba**

- **Type**: Aruba collections network
- **Schema Gap**: Caribbean/colonial heritage network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba
    range: boolean
  ```

#### 8. **Van Gogh Worldwide**

- **Type**: International Van Gogh collections network
- **Schema Gap**: Artist-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide
    range: boolean
  ```

#### 9. **OODE24 (Mondriaan)**

- **Type**: Mondriaan art network/project
- **Schema Gap**: Artist/movement specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_oode24_mondriaan:
    description: Participates in OODE24 (Mondriaan project)
    range: boolean
  ```

#### 10. **Versnellen project** (Acceleration Project Details)

- **Type**: Freeform text describing project participation
- **Schema Gap**: Project-specific metadata
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  versnellen_project_details:
    description: Details about participation in Versnellen digitization project
    range: string
  ```

---

## Schema Extension Recommendations

### Priority 1: Critical Gaps (Required for Full CSV Parsing)

**File**: `schemas/dutch.yaml` (DutchHeritageCustodian extensions)

```yaml
slots:
  # EU/International platforms
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf

  in_dc4eu_scope:
    description: In scope for DC4EU (Digital Collaboration for Europe) project
    range: boolean

  dc4eu_integration_method:
    description: Technical integration method for DC4EU (API, OAI-PMH, SPARQL, etc.)
    range: string

  # Digitization programs
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean

  versnellen_project_details:
    description: Details about Versnellen project participation (e.g., "Upgrade", "Aanschaf")
    range: string

  # Thematic networks
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean

  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean

  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean

  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk (Delft pottery network)
    range: boolean

  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed (academic heritage foundation)
    range: boolean

  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba (Caribbean heritage network)
    range: boolean

  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide (international Van Gogh collections network)
    range: boolean

  in_oode24_mondriaan:
    description: Participates
      in OODE24 (Mondriaan art project)
    range: boolean

  # Organizational
  parent_organization_name:
    description: Name of parent/umbrella organization (unresolved reference)
    range: string
    comments:
      - "Use this when CSV has organization name but not resolved HeritageCustodian ID"
      - "Resolve to parent_organization during cross-linking phase"

  # Collection indicators
  has_library_collection:
    description: Institution holds library collections (boolean indicator)
    range: boolean
```

### Priority 2: Enum Extensions

**File**: `schemas/enums.yaml`

```yaml
enums:
  DigitalPlatformTypeEnum:
    permissible_values:
      # Add new value:
      GENERIC:
        description: >-
          General-purpose software not specific to heritage sector.
          Examples: FileMaker, Microsoft Access, custom databases.
```

### Priority 3: Core Schema Enhancements

**File**: `schemas/collections.yaml` (DigitalPlatform extensions)

```yaml
slots:
  integration_method:
    description: >-
      Technical method for data integration/harvesting.
      Examples: REST API, OAI-PMH, SPARQL endpoint, CSV export, direct database access.
    range: string
    slot_uri: schema:applicationCategory
```

---

## Parsing Strategy

### Phase 1: Direct Mapping (Implemented)

```python
# Map straightforward fields
custodian.name = row['Organisatie']
custodian.homepage = row['Webadres organisatie']
custodian.institution_type = normalize_type(row['Type organisatie'])
custodian.in_museum_register = parse_boolean(row['Museum register'])
# ... etc.
```

### Phase 2: Complex Field Handling

#### Handling "Koepelorganisatie" (Umbrella Organizations)

```python
# Store temporarily as string
custodian.parent_organization_name = row['Koepelorganisatie']

# Later, during cross-linking:
if custodian.parent_organization_name:
    parent = dataset.find_by_name(custodian.parent_organization_name)
    if parent:
        custodian.parent_organization = parent.id
```

#### Handling "Samenwerkingsverband / Platform"

```python
# Treat a missing cell as empty so blank values are never appended
value = (row['Samenwerkingsverband / Platform'] or '').strip()

# Classify as network vs. platform
if value and is_digital_platform(value):  # e.g., contains "digitaal", "portal"
    custodian.digital_platforms.append(
        DigitalPlatform(platform_name=value, platform_type="DISCOVERY_PORTAL")
    )
elif value:
    # Network/consortium
    custodian.samenwerkingsverband.append(value)
```

#### Handling "Systeem" (Software Systems)

```python
def classify_system(name):
    """Map system names to DigitalPlatformTypeEnum."""
    cms_systems = ['Atlantis', 'Adlib', 'MAIS', 'TMS', 'Axiell']
    # Substring match, so "MAIS Flexis?" is still recognized as a CMS
    if any(cms in name for cms in cms_systems):
        return 'COLLECTION_MANAGEMENT_SYSTEM'
    elif name in ['FileMaker', 'Access']:
        return 'GENERIC'
    # ... etc.
    return None  # Unrecognized system name


# Treat a missing cell as empty before stripping
system = (row['Systeem'] or '').strip()
if system:
    platform_type = classify_system(system)  # Map to enum
    custodian.digital_platforms.append(
        DigitalPlatform(
            platform_name=system,
            platform_type=platform_type
        )
    )
```

#### Handling Multiple Notes Fields

```python
notes_parts = []
if row['Opmerkingen Inez']:
    notes_parts.append(f"Editor notes (Inez): {row['Opmerkingen Inez']}")
if row['Opmerkingen']:
    notes_parts.append(f"General remarks: {row['Opmerkingen']}")

if notes_parts:
    custodian.description = '\n\n'.join(notes_parts)
```

### Phase 3: Boolean Flag Mapping

```python
# Thematic networks (new boolean flags)
custodian.in_wo2net = parse_boolean(row['WO2Net'])
custodian.in_modemuze = parse_boolean(row['Modemuze'])
custodian.in_maritiem_digitaal = parse_boolean(row['Maritiem Digitaal'])
# ... etc.

def parse_boolean(value):
    """Parse Dutch CSV boolean representations."""
    if not value or value.strip() == '':
        return None
    value_lower = value.strip().lower()
    # Dutch: 'ja' = yes, 'nee' = no
    if value_lower in ['ja', 'yes', 'true', '1', 'x']:
        return True
    if value_lower in ['nee', 'no', 'false', '0']:
        return False
    return None  # Ambiguous
```

---

## Validation Checklist

Before considering CSV fully mapped:

- [ ] All 32 columns have documented mapping strategy
- [ ] Schema extensions implemented in `dutch.yaml`
- [ ] Enum extensions implemented in `enums.yaml`
- [ ] Parser handles all field types (string, boolean, reference, multivalued)
- [ ] Provenance metadata captures CSV source (TIER_1_AUTHORITATIVE)
- [ ] Test coverage for each new slot
- [ ] Cross-linking logic for `parent_organization_name` → `parent_organization`
- [ ] Classification logic for `Samenwerkingsverband` (network vs. platform)
- [ ] System name → DigitalPlatformTypeEnum mapping complete

---

## Impact Assessment

### Schema Changes Required

- **dutch.yaml**: +15 new slots (mostly boolean flags for network participation)
- **enums.yaml**: +1 enum value (`DigitalPlatformTypeEnum.GENERIC`)
- **collections.yaml**: +1 slot (`DigitalPlatform.integration_method`)

### Backward Compatibility

- ✅ All changes are ADDITIVE (new optional slots)
- ✅ No breaking changes to existing schema
- ✅ Existing data remains valid

### Data Quality Implications

- **TIER_1_AUTHORITATIVE** data will have richer Dutch network participation metadata
- Cross-linking with conversation data (TIER_4_INFERRED) will benefit from parent organization names
- Boolean flags enable precise filtering (e.g., "all institutions in Modemuze network")

---

## Next Steps

1. **Implement Priority 1 extensions** in `schemas/dutch.yaml` ✅ (Ready to proceed)
2. **Update parser** (`src/glam_extractor/parsers/dutch_orgs.py`) to use new slots
3. **Create test fixtures** with real CSV rows exercising all field types
4. **Update documentation** (AGENTS.md) with new Dutch-specific extraction patterns
5. **Regenerate LinkML artifacts** (JSON Schema, Python dataclasses, SQL DDL)
6. **Validate with real data** (1,351 institutions from CSV)

---

## Open Questions

1. **Should thematic networks be modeled as booleans OR as Partnership objects?**
   - Current recommendation: Booleans (simpler, CSV is just participation flags)
   - Alternative: `Partnership(partner_name="Modemuze", partnership_type="THEMATIC_NETWORK")`
   - Decision needed for long-term maintainability

2. **How to handle uncertain system names (e.g., "MAIS Flexis?")?**
   - Option A: Strip "?" and parse as "MAIS Flexis"
   - Option B: Store in notes, mark confidence_score lower
   - Option C: Create `platform_name_uncertain: boolean` flag

3. **Should "Bibliotheek collectie" create actual Collection objects or just set a flag?**
   - Current: Flag approach (`has_library_collection: boolean`)
   - Alternative: Create `Collection(collection_type="bibliographic")` stub
   - Depends on whether CSV will later provide actual collection metadata

4. **Geographic scope**: Should Caribbean networks (Coleccion Aruba) be in `dutch.yaml`?
   - These are Netherlands-administered but geographically outside Europe
   - Consider creating `dutch_caribbean.yaml` module?

---

**Document Status**: DRAFT  
**Next Review**: After schema extensions implemented  
**Maintainer**: GLAM Data Extraction Project
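
---

## Appendix: Sketch for Open Question 2

Open Question 2 leaves the handling of uncertain system names such as "MAIS Flexis?" undecided. As a minimal sketch only, assuming Options A and C are combined, the snippet below strips the trailing "?" and records that it was present. `ParsedSystemName` and `parse_system_name` are hypothetical names introduced for illustration; they are not part of the current schema or parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedSystemName:
    """Hypothetical holder for a cleaned 'Systeem' value (not in the schema)."""
    name: str        # cleaned system name, e.g. "MAIS Flexis"
    uncertain: bool  # True if the CSV value ended with a "?" marker


def parse_system_name(raw: Optional[str]) -> Optional[ParsedSystemName]:
    """Strip a trailing '?' (Option A) and remember it was there (Option C)."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    uncertain = text.endswith("?")
    # Remove one or more trailing question marks plus leftover whitespace
    name = text.rstrip("?").strip()
    if not name:
        return None  # the cell held only "?" and carries no usable name
    return ParsedSystemName(name=name, uncertain=uncertain)
```

For example, `parse_system_name("MAIS Flexis?")` gives the name `MAIS Flexis` with `uncertain=True`. If Option B is preferred instead, the same flag could be used by the parser to lower a confidence score rather than being stored as a slot.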