Dutch Organizations CSV Schema Mapping Analysis
Date: 2025-11-07
Schema Version: v0.2.0
Source: data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
Executive Summary
This document maps all 32 columns from the Dutch organizations CSV to LinkML schema fields, identifies gaps, and recommends schema extensions.
Status:
- ✅ Fully Mapped: 17 columns (53%)
- ⚠️ Partially Mapped: 8 columns (25%) - require enum extensions or additional slots
- ❌ Unmapped: 7 columns (22%) - require new schema features
Column-by-Column Mapping
✅ Fully Mapped Fields (17/32)
| CSV Column | Schema Module | Class/Slot | Notes |
|---|---|---|---|
| Plaatsnaam bezoekadres | core.yaml | Location.city | Direct mapping |
| Straat en huisnummer bezoekadres | core.yaml | Location.street_address | Direct mapping |
| Organisatie | core.yaml | HeritageCustodian.name | Primary name field |
| Webadres organisatie | core.yaml | HeritageCustodian.homepage | Maps to foaf:homepage |
| Type organisatie | enums.yaml | HeritageCustodian.institution_type | Requires value normalization |
| ISIL-code (NA) | core.yaml | Identifier (scheme=ISIL) | Multivalued identifiers list |
| Museum register | dutch.yaml | DutchHeritageCustodian.in_museum_register | Boolean flag |
| Rijkscollectie | dutch.yaml | DutchHeritageCustodian.in_rijkscollectie | Boolean flag |
| Collectie Nederland | dutch.yaml | DutchHeritageCustodian.in_collectie_nederland | Boolean flag |
| Archieven.nl | dutch.yaml | DutchHeritageCustodian.in_archieven_nl | Boolean flag |
| Linked Data | collections.yaml | DigitalPlatform + custom property | Platform with capability flag |
| Datasetregister | collections.yaml | DigitalPlatform | Dataset registry participation |
Additional implicit mappings:
- Province (derived from city) → DutchHeritageCustodian.provincie
- Country (always "NL") → Location.country
- Data source → Provenance.data_source = CSV_REGISTRY
- Data tier → Provenance.data_tier = TIER_1_AUTHORITATIVE
⚠️ Partially Mapped Fields (8/32)
These fields can be mapped to existing schema structures but require extensions or enum updates.
1. Koepelorganisatie (Umbrella Organization)
- Current Mapping: HeritageCustodian.parent_organization
- Gap: CSV has organization NAME only, schema expects HeritageCustodian reference
- Solution:
  - Parse as string, resolve to actual HeritageCustodian ID during cross-linking phase
  - Store temporarily in description or create parent_organization_name slot
- Recommendation: Add parent_organization_name: string slot for unresolved references
2. Samenwerkingsverband / Platform (Collaborative Network)
- Current Mapping: DutchHeritageCustodian.samenwerkingsverband (multivalued string)
- Gap: CSV mixes network names AND platform names (e.g., "Geheugen van Drenthe")
- Solution: Split into two categories:
  - Networks → samenwerkingsverband
  - Platforms → DigitalPlatform
- Recommendation: Parser logic to classify based on keywords
3. Systeem (System/Software)
- Current Mapping: DigitalPlatform.platform_name
- Gap: CSV values include systems NOT in DigitalPlatformTypeEnum:
  - "Atlantis" → COLLECTION_MANAGEMENT_SYSTEM
  - "MAIS Flexis?" → COLLECTION_MANAGEMENT_SYSTEM
  - "Adlib" → COLLECTION_MANAGEMENT_SYSTEM
  - "FileMaker" → GENERIC (new value)
- Recommendation: Extend DigitalPlatformTypeEnum with:
    GENERIC:
      description: General-purpose software not specific to heritage sector
4. Versnellen (Acceleration Project)
- Current Mapping: No direct mapping
- Gap: Boolean flag indicating participation in Dutch digitization program
- Solution: Add to DutchHeritageCustodian:
    in_versnellen_project:
      description: Participates in Versnellen digitization acceleration program
      range: boolean
- Recommendation: NEW SLOT REQUIRED
5. Bibliotheek collectie (Library Collection)
- Current Mapping: Collection.collection_type = "bibliographic"
- Gap: CSV has boolean "has library collection", not collection metadata
- Solution: Add boolean flag to indicate collection type presence:
    has_library_collection:
      description: Institution holds library collections
      range: boolean
- Recommendation: NEW SLOT or use collection_type filter
6. in scope voor DC4EU (DC4EU Scope)
- Current Mapping: Partnership or boolean flag
- Gap: EU project participation not modeled
- Solution: Either:
  - A) Add partnership: Partnership(partner_name="DC4EU", partnership_type="EU_PROJECT")
  - B) Add boolean: in_dc4eu_scope: boolean
- Recommendation: Option B (boolean) simpler for this use case
7. DC4EU aansluit route (DC4EU Connection Route)
- Current Mapping: No mapping
- Gap: Technical integration pathway (e.g., "API", "OAI-PMH", "direct")
- Solution: Add to DigitalPlatform:
    integration_method:
      description: Technical method for data integration (API, OAI-PMH, SPARQL, etc.)
      range: string
- Recommendation: NEW SLOT for DigitalPlatform
8. Opmerkingen Inez + Opmerkingen (Remarks/Notes)
- Current Mapping: HeritageCustodian.description (partial)
- Gap: CSV has TWO note fields (one from editor, one general)
- Solution: Concatenate both into description with attribution:
    description: >-
      Editor notes (Inez): {Opmerkingen Inez}
      General remarks: {Opmerkingen}
- Recommendation: Use existing description slot, document concatenation in parser
❌ Unmapped Fields (7/32)
These fields require NEW schema features or represent specialized domain-specific platforms.
1. Archives Portal Europe
- Type: Boolean flag for European aggregation platform
- Schema Gap: Not covered by current Dutch extensions
- Recommendation: Add to DutchHeritageCustodian:
    in_archives_portal_europe:
      description: Registered in Archives Portal Europe (APE)
      range: boolean
      slot_uri: dcterms:isPartOf
2. WO2Net (WWII Network)
- Type: Thematic network for WWII heritage
- Schema Gap: No slot for thematic networks
- Recommendation: Add to DutchHeritageCustodian:
    in_wo2net:
      description: Participates in WO2Net (WWII heritage network)
      range: boolean
3. Modemuze (Fashion Museum Network)
- Type: Specialized fashion heritage network
- Schema Gap: Domain-specific network
- Recommendation: Add to DutchHeritageCustodian:
    in_modemuze:
      description: Participates in Modemuze (fashion heritage network)
      range: boolean
4. Maritiem Digitaal (Maritime Digital)
- Type: Maritime heritage aggregation platform
- Schema Gap: Domain-specific platform
- Recommendation: Add to DutchHeritageCustodian:
    in_maritiem_digitaal:
      description: Contributes to Maritiem Digitaal (maritime heritage platform)
      range: boolean
5. Delfts aardewerk (Delft Pottery)
- Type: Delft pottery collections network
- Schema Gap: Collection-specific network
- Recommendation: Add to DutchHeritageCustodian:
    in_delfts_aardewerk:
      description: Participates in Delfts aardewerk network
      range: boolean
6. Stichting Academisch Erfgoed (Academic Heritage Foundation)
- Type: Academic heritage network
- Schema Gap: Network participation
- Recommendation: Add to DutchHeritageCustodian:
    in_academisch_erfgoed:
      description: Member of Stichting Academisch Erfgoed
      range: boolean
7. Coleccion Aruba
- Type: Aruba collections network
- Schema Gap: Caribbean/colonial heritage network
- Recommendation: Add to DutchHeritageCustodian:
    in_coleccion_aruba:
      description: Contributes to Coleccion Aruba
      range: boolean
8. Van Gogh Worldwide
- Type: International Van Gogh collections network
- Schema Gap: Artist-specific network
- Recommendation: Add to DutchHeritageCustodian:
    in_van_gogh_worldwide:
      description: Participates in Van Gogh Worldwide
      range: boolean
9. OODE24 (Mondriaan)
- Type: Mondriaan art network/project
- Schema Gap: Artist/movement specific network
- Recommendation: Add to DutchHeritageCustodian:
    in_oode24_mondriaan:
      description: Participates in OODE24 (Mondriaan project)
      range: boolean
10. Versnellen project (Acceleration Project Details)
- Type: Freeform text describing project participation
- Schema Gap: Project-specific metadata
- Recommendation: Add to DutchHeritageCustodian:
    versnellen_project_details:
      description: Details about participation in Versnellen digitization project
      range: string
Schema Extension Recommendations
Priority 1: Critical Gaps (Required for Full CSV Parsing)
File: schemas/dutch.yaml (DutchHeritageCustodian extensions)
slots:
  # EU/International platforms
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf
  in_dc4eu_scope:
    description: In scope for DC4EU (Digital Collaboration for Europe) project
    range: boolean
  dc4eu_integration_method:
    description: Technical integration method for DC4EU (API, OAI-PMH, SPARQL, etc.)
    range: string
  # Digitization programs
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean
  versnellen_project_details:
    description: Details about Versnellen project participation (e.g., "Upgrade", "Aanschaf")
    range: string
  # Thematic networks
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean
  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean
  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean
  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk (Delft pottery network)
    range: boolean
  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed (academic heritage foundation)
    range: boolean
  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba (Caribbean heritage network)
    range: boolean
  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide (international Van Gogh collections network)
    range: boolean
  in_oode24_mondriaan:
    description: Participates in OODE24 (Mondriaan art project)
    range: boolean
  # Organizational
  parent_organization_name:
    description: Name of parent/umbrella organization (unresolved reference)
    range: string
    comments:
      - "Use this when CSV has organization name but not resolved HeritageCustodian ID"
      - "Resolve to parent_organization during cross-linking phase"
  # Collection indicators
  has_library_collection:
    description: Institution holds library collections (boolean indicator)
    range: boolean
Priority 2: Enum Extensions
File: schemas/enums.yaml
enums:
  DigitalPlatformTypeEnum:
    permissible_values:
      # Add new value:
      GENERIC:
        description: >-
          General-purpose software not specific to heritage sector.
          Examples: FileMaker, Microsoft Access, custom databases.
Priority 3: Core Schema Enhancements
File: schemas/collections.yaml (DigitalPlatform extensions)
slots:
  integration_method:
    description: >-
      Technical method for data integration/harvesting.
      Examples: REST API, OAI-PMH, SPARQL endpoint, CSV export, direct database access.
    range: string
    slot_uri: schema:applicationCategory
Parsing Strategy
Phase 1: Direct Mapping (Implemented)
# Map straightforward fields
custodian.name = row['Organisatie']
custodian.homepage = row['Webadres organisatie']
custodian.institution_type = normalize_type(row['Type organisatie'])
custodian.in_museum_register = parse_boolean(row['Museum register'])
# ... etc.
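The normalize_type helper used above is not defined in this document. A minimal sketch of one possible shape follows; the Dutch labels and the enum values in the lookup table are assumptions for illustration and would need to be aligned with the actual values in enums.yaml.

```python
from typing import Optional

# Hypothetical lookup table: both the Dutch labels and the enum values
# are illustrative assumptions, not taken from enums.yaml.
TYPE_ALIASES = {
    "museum": "MUSEUM",
    "archief": "ARCHIVE",
    "bibliotheek": "LIBRARY",
}


def normalize_type(raw: Optional[str]) -> Optional[str]:
    """Map a free-text 'Type organisatie' value to an enum-style string."""
    if raw is None:
        return None
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    if not key:
        return None
    # None signals an unknown label that needs manual review.
    return TYPE_ALIASES.get(key)
```

Returning None for unknown labels, instead of raising, keeps a single unexpected value from aborting the whole import; whether that is the right trade-off depends on how strictly the parser should validate.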
Phase 2: Complex Field Handling
Handling "Koepelorganisatie" (Umbrella Organizations)
# Store temporarily as string
custodian.parent_organization_name = row['Koepelorganisatie']

# Later, during cross-linking:
if custodian.parent_organization_name:
    parent = dataset.find_by_name(custodian.parent_organization_name)
    if parent:
        custodian.parent_organization = parent.id
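The lookup behind dataset.find_by_name is not specified here. One way it could work is a name index built once over all custodians, with light normalization so that case, accents, and stray spaces do not block a match. In the sketch below, plain dicts stand in for custodian records, and the helper names (name_key, build_name_index, find_parent_id) are hypothetical.

```python
import unicodedata
from typing import Dict, Iterable, Optional


def name_key(name: str) -> str:
    """Normalize a name for matching: drop accents, case, and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def build_name_index(custodians: Iterable[dict]) -> Dict[str, str]:
    """Map normalized name -> id; the first occurrence wins on duplicates."""
    index: Dict[str, str] = {}
    for custodian in custodians:
        index.setdefault(name_key(custodian["name"]), custodian["id"])
    return index


def find_parent_id(index: Dict[str, str], parent_name: Optional[str]) -> Optional[str]:
    """Resolve an unresolved parent name to an id, or None if no match."""
    if not parent_name or not parent_name.strip():
        return None
    return index.get(name_key(parent_name))
```

Exact matching on a normalized key is deliberately conservative: names that do not resolve stay in parent_organization_name for later review instead of being linked to a near match.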
Handling "Samenwerkingsverband / Platform"
value = row['Samenwerkingsverband / Platform']

# Classify as network vs. platform
if is_digital_platform(value):  # e.g., contains "digitaal", "portal"
    custodian.digital_platforms.append(
        DigitalPlatform(platform_name=value, platform_type="DISCOVERY_PORTAL")
    )
else:  # Network/consortium
    custodian.samenwerkingsverband.append(value)
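The is_digital_platform check is only described by its comment. A minimal keyword-based sketch is shown below; "digitaal" and "portal" come from that comment, "geheugen van" is derived from the "Geheugen van Drenthe" example, and the list as a whole is an assumption to be tuned against the real CSV values.

```python
# Assumed keyword list (see lead-in); extend after inspecting real data.
PLATFORM_KEYWORDS = ("digitaal", "portal", "geheugen van")


def is_digital_platform(value: str) -> bool:
    """Return True when the value looks like a platform, not a network."""
    text = value.casefold()
    return any(keyword in text for keyword in PLATFORM_KEYWORDS)
```

Substring matching will misclassify some names, so values that land in the else branch are worth sampling by hand before the mapping is considered complete.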
Handling "Systeem" (Software Systems)
system = row['Systeem'].strip()
if system:
    platform_type = classify_system(system)  # Map to enum
    custodian.digital_platforms.append(
        DigitalPlatform(
            platform_name=system,
            platform_type=platform_type
        )
    )

def classify_system(name):
    """Map system names to DigitalPlatformTypeEnum."""
    cms_systems = ['Atlantis', 'Adlib', 'MAIS', 'TMS', 'Axiell']
    if any(cms in name for cms in cms_systems):
        return 'COLLECTION_MANAGEMENT_SYSTEM'
    elif name in ['FileMaker', 'Access']:
        return 'GENERIC'
    # ... etc.
Handling Multiple Notes Fields
notes_parts = []
if row['Opmerkingen Inez']:
    notes_parts.append(f"Editor notes: {row['Opmerkingen Inez']}")
if row['Opmerkingen']:
    notes_parts.append(f"General remarks: {row['Opmerkingen']}")
if notes_parts:
    custodian.description = '\n\n'.join(notes_parts)
Phase 3: Boolean Flag Mapping
# Thematic networks (new boolean flags)
custodian.in_wo2net = parse_boolean(row['WO2Net'])
custodian.in_modemuze = parse_boolean(row['Modemuze'])
custodian.in_maritiem_digitaal = parse_boolean(row['Maritiem Digitaal'])
# ... etc.
def parse_boolean(value):
    """Parse Dutch CSV boolean representations."""
    if not value or value.strip() == '':
        return None
    value_lower = value.strip().lower()
    # Dutch: 'ja' = yes, 'nee' = no
    if value_lower in ['ja', 'yes', 'true', '1', 'x']:
        return True
    if value_lower in ['nee', 'no', 'false', '0']:
        return False
    return None  # Ambiguous
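Because most of the new slots are plain boolean flags, the per-column assignments could be driven by a single table instead of one line per column. The sketch below assumes the CSV headers match the column names used in this document; apply_boolean_flags and BOOLEAN_COLUMNS are hypothetical names, and parse_boolean is restated in compact form only so the example runs on its own.

```python
from types import SimpleNamespace


def parse_boolean(value):
    """Compact restatement of the parser above so this sketch is standalone."""
    text = (value or "").strip().lower()
    if text in ("ja", "yes", "true", "1", "x"):
        return True
    if text in ("nee", "no", "false", "0"):
        return False
    return None  # empty or ambiguous


# CSV column -> slot name. Headers are assumed to match this document.
BOOLEAN_COLUMNS = {
    "Museum register": "in_museum_register",
    "WO2Net": "in_wo2net",
    "Modemuze": "in_modemuze",
    "Maritiem Digitaal": "in_maritiem_digitaal",
}


def apply_boolean_flags(custodian, row):
    """Set every boolean slot from its CSV column; missing columns give None."""
    for column, slot in BOOLEAN_COLUMNS.items():
        setattr(custodian, slot, parse_boolean(row.get(column)))
    return custodian


# SimpleNamespace stands in for a real custodian object.
custodian = apply_boolean_flags(
    SimpleNamespace(),
    {"WO2Net": "ja", "Modemuze": "nee", "Maritiem Digitaal": "?"},
)
```

Keeping None distinct from False preserves the difference between "not in the network" and "not recorded", which matters for an authoritative-tier source.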
Validation Checklist
Before considering CSV fully mapped:
- All 32 columns have documented mapping strategy
- Schema extensions implemented in
dutch.yaml - Enum extensions implemented in
enums.yaml - Parser handles all field types (string, boolean, reference, multivalued)
- Provenance metadata captures CSV source (TIER_1_AUTHORITATIVE)
- Test coverage for each new slot
- Cross-linking logic for
parent_organization_name→parent_organization - Classification logic for
Samenwerkingsverband(network vs. platform) - System name → DigitalPlatformTypeEnum mapping complete
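The first checklist item can be checked mechanically by comparing the CSV header against the set of columns the parser knows about. This is a sketch under assumptions: it reads comma-delimited text (the real file may use a different delimiter), and unmapped_columns, MAPPED, and the sample header are hypothetical.

```python
import csv
import io
from typing import List, Set


def unmapped_columns(csv_text: str, mapped: Set[str]) -> List[str]:
    """Return header columns that have no documented mapping."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return [col.strip() for col in header if col.strip() not in mapped]


# Illustrative subset of mapped columns and a made-up extra column.
MAPPED = {"Organisatie", "Type organisatie", "Systeem"}
sample = "Organisatie,Type organisatie,Systeem,Nieuwe kolom\nX,museum,Adlib,ja\n"
missing = unmapped_columns(sample, MAPPED)
```

Running a check like this in the test suite would flag new or renamed columns as soon as the source CSV changes.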
Impact Assessment
Schema Changes Required
- dutch.yaml: +15 new slots (mostly boolean flags for network participation)
- enums.yaml: +1 enum value (DigitalPlatformTypeEnum.GENERIC)
- collections.yaml: +1 slot (DigitalPlatform.integration_method)
Backward Compatibility
- ✅ All changes are ADDITIVE (new optional slots)
- ✅ No breaking changes to existing schema
- ✅ Existing data remains valid
Data Quality Implications
- TIER_1_AUTHORITATIVE data will have richer Dutch network participation metadata
- Cross-linking with conversation data (TIER_4_INFERRED) will benefit from parent organization names
- Boolean flags enable precise filtering (e.g., "all institutions in Modemuze network")
Next Steps
- Implement Priority 1 extensions in schemas/dutch.yaml ✅ (Ready to proceed)
- Update parser (src/glam_extractor/parsers/dutch_orgs.py) to use new slots
- Create test fixtures with real CSV rows exercising all field types
- Update documentation (AGENTS.md) with new Dutch-specific extraction patterns
- Regenerate LinkML artifacts (JSON Schema, Python dataclasses, SQL DDL)
- Validate with real data (1,351 institutions from CSV)
Open Questions
- Should thematic networks be modeled as booleans OR as Partnership objects?
  - Current recommendation: Booleans (simpler, CSV is just participation flags)
  - Alternative: Partnership(partner_name="Modemuze", partnership_type="THEMATIC_NETWORK")
  - Decision needed for long-term maintainability
- How to handle uncertain system names (e.g., "MAIS Flexis?")?
  - Option A: Strip "?" and parse as "MAIS Flexis"
  - Option B: Store in notes, mark confidence_score lower
  - Option C: Create platform_name_uncertain: boolean flag
- Should "Bibliotheek collectie" create actual Collection objects or just set a flag?
  - Current: Flag approach (has_library_collection: boolean)
  - Alternative: Create Collection(collection_type="bibliographic") stub
  - Depends on whether CSV will later provide actual collection metadata
- Geographic scope: Should Caribbean networks (Coleccion Aruba) be in dutch.yaml?
  - These are Netherlands-administered but geographically outside Europe
  - Consider creating dutch_caribbean.yaml module?
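For the uncertain system names question, Options A and C are not mutually exclusive: the trailing "?" can be stripped for classification while the doubt is kept as a flag. A minimal sketch, assuming the uncertainty marker is always a trailing question mark; split_uncertain is a hypothetical helper name.

```python
from typing import Tuple


def split_uncertain(raw: str) -> Tuple[str, bool]:
    """Return (cleaned name, uncertain flag) for values such as 'MAIS Flexis?'."""
    text = raw.strip()
    uncertain = text.endswith("?")
    # Remove any trailing question marks, then leftover whitespace.
    return text.rstrip("?").strip(), uncertain
```

The returned flag could feed either a platform_name_uncertain slot (Option C) or a lowered confidence_score (Option B), so the open decision does not block the parser work.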
Document Status: DRAFT
Next Review: After schema extensions implemented
Maintainer: GLAM Data Extraction Project