# Dutch Organizations CSV Schema Mapping Analysis

**Date**: 2025-11-07
**Schema Version**: v0.2.0
**Source**: `data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`

## Executive Summary

This document maps all 32 columns from the Dutch organizations CSV to LinkML schema fields, identifies gaps, and recommends schema extensions.

**Status**:
- ✅ **Fully Mapped**: 17 columns (53%)
- ⚠️ **Partially Mapped**: 8 columns (25%) - require enum extensions or additional slots
- ❌ **Unmapped**: 7 columns (22%) - require new schema features

---

## Column-by-Column Mapping

### ✅ Fully Mapped Fields (17/32)

| CSV Column | Schema Module | Class/Slot | Notes |
|------------|---------------|------------|-------|
| **Plaatsnaam bezoekadres** | `core.yaml` | `Location.city` | Direct mapping |
| **Straat en huisnummer bezoekadres** | `core.yaml` | `Location.street_address` | Direct mapping |
| **Organisatie** | `core.yaml` | `HeritageCustodian.name` | Primary name field |
| **Webadres organisatie** | `core.yaml` | `HeritageCustodian.homepage` | Maps to `foaf:homepage` |
| **Type organisatie** | `enums.yaml` | `HeritageCustodian.institution_type` | Requires value normalization |
| **ISIL-code (NA)** | `core.yaml` | `Identifier` (scheme=ISIL) | Multivalued identifiers list |
| **Museum register** | `dutch.yaml` | `DutchHeritageCustodian.in_museum_register` | Boolean flag |
| **Rijkscollectie** | `dutch.yaml` | `DutchHeritageCustodian.in_rijkscollectie` | Boolean flag |
| **Collectie Nederland** | `dutch.yaml` | `DutchHeritageCustodian.in_collectie_nederland` | Boolean flag |
| **Archieven.nl** | `dutch.yaml` | `DutchHeritageCustodian.in_archieven_nl` | Boolean flag |
| **Linked Data** | `collections.yaml` | `DigitalPlatform` + custom property | Platform with capability flag |
| **Datasetregister** | `collections.yaml` | `DigitalPlatform` | Dataset registry participation |

**Additional implicit mappings**:
- Province (derived from city) → `DutchHeritageCustodian.provincie`
- Country (always "NL") → `Location.country`
- Data source → `Provenance.data_source = CSV_REGISTRY`
- Data tier → `Provenance.data_tier = TIER_1_AUTHORITATIVE`

---

### ⚠️ Partially Mapped Fields (8/32)

These fields can be mapped to existing schema structures but require extensions or enum updates.

#### 1. **Koepelorganisatie** (Umbrella Organization)
- **Current Mapping**: `HeritageCustodian.parent_organization`
- **Gap**: CSV has organization NAME only, schema expects HeritageCustodian reference
- **Solution**:
  - Parse as string, resolve to actual HeritageCustodian ID during cross-linking phase
  - Store temporarily in `description` or create `parent_organization_name` slot
- **Recommendation**: Add `parent_organization_name: string` slot for unresolved references

#### 2. **Samenwerkingsverband / Platform** (Collaborative Network)
- **Current Mapping**: `DutchHeritageCustodian.samenwerkingsverband` (multivalued string)
- **Gap**: CSV mixes network names AND platform names (e.g., "Geheugen van Drenthe")
- **Solution**: Split into two categories:
  - Networks → `samenwerkingsverband`
  - Platforms → `DigitalPlatform`
- **Recommendation**: Parser logic to classify based on keywords

#### 3. **Systeem** (System/Software)
- **Current Mapping**: `DigitalPlatform.platform_name`
- **Gap**: CSV values include systems NOT in `DigitalPlatformTypeEnum`:
  - "Atlantis" → COLLECTION_MANAGEMENT_SYSTEM
  - "MAIS Flexis?" → COLLECTION_MANAGEMENT_SYSTEM
  - "Adlib" → COLLECTION_MANAGEMENT_SYSTEM
  - "FileMaker" → GENERIC (new value)
- **Recommendation**: Extend `DigitalPlatformTypeEnum` with:
  ```yaml
  GENERIC:
    description: General-purpose software not specific to heritage sector
  ```

#### 4. **Versnellen** (Acceleration Project)
- **Current Mapping**: No direct mapping
- **Gap**: Boolean flag indicating participation in Dutch digitization program
- **Solution**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean
  ```
- **Recommendation**: NEW SLOT REQUIRED

#### 5. **Bibliotheek collectie** (Library Collection)
- **Current Mapping**: `Collection.collection_type = "bibliographic"`
- **Gap**: CSV has boolean "has library collection", not collection metadata
- **Solution**: Add boolean flag to indicate collection type presence:
  ```yaml
  has_library_collection:
    description: Institution holds library collections
    range: boolean
  ```
- **Recommendation**: NEW SLOT or use `collection_type` filter

#### 6. **in scope voor DC4EU** (DC4EU Scope)
- **Current Mapping**: `Partnership` or boolean flag
- **Gap**: EU project participation not modeled
- **Solution**: Either:
  - A) Add partnership: `Partnership(partner_name="DC4EU", partnership_type="EU_PROJECT")`
  - B) Add boolean: `in_dc4eu_scope: boolean`
- **Recommendation**: Option B (boolean) simpler for this use case

#### 7. **DC4EU aansluit route** (DC4EU Connection Route)
- **Current Mapping**: No mapping
- **Gap**: Technical integration pathway (e.g., "API", "OAI-PMH", "direct")
- **Solution**: Add to `DigitalPlatform`:
  ```yaml
  integration_method:
    description: Technical method for data integration (API, OAI-PMH, SPARQL, etc.)
    range: string
  ```
- **Recommendation**: NEW SLOT for DigitalPlatform

#### 8. **Opmerkingen Inez** + **Opmerkingen** (Remarks/Notes)
- **Current Mapping**: `HeritageCustodian.description` (partial)
- **Gap**: CSV has TWO note fields (one from editor, one general)
- **Solution**: Concatenate both into `description` with attribution:
  ```yaml
  description: >-
    Editor notes (Inez): {Opmerkingen Inez}
    General remarks: {Opmerkingen}
  ```
- **Recommendation**: Use existing `description` slot, document concatenation in parser

---

### ❌ Unmapped Fields (7/32)

These fields require NEW schema features or represent specialized domain-specific platforms.

#### 1. **Archives Portal Europe**
- **Type**: Boolean flag for European aggregation platform
- **Schema Gap**: Not covered by current Dutch extensions
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf
  ```

#### 2. **WO2Net** (WWII Network)
- **Type**: Thematic network for WWII heritage
- **Schema Gap**: No slot for thematic networks
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean
  ```

#### 3. **Modemuze** (Fashion Museum Network)
- **Type**: Specialized fashion heritage network
- **Schema Gap**: Domain-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean
  ```

#### 4. **Maritiem Digitaal** (Maritime Digital)
- **Type**: Maritime heritage aggregation platform
- **Schema Gap**: Domain-specific platform
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean
  ```

#### 5. **Delfts aardewerk** (Delft Pottery)
- **Type**: Delft pottery collections network
- **Schema Gap**: Collection-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk network
    range: boolean
  ```

#### 6. **Stichting Academisch Erfgoed** (Academic Heritage Foundation)
- **Type**: Academic heritage network
- **Schema Gap**: Network participation
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed
    range: boolean
  ```

#### 7. **Coleccion Aruba**
- **Type**: Aruba collections network
- **Schema Gap**: Caribbean/colonial heritage network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba
    range: boolean
  ```

#### 8. **Van Gogh Worldwide**
- **Type**: International Van Gogh collections network
- **Schema Gap**: Artist-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide
    range: boolean
  ```

#### 9. **OODE24 (Mondriaan)**
- **Type**: Mondriaan art network/project
- **Schema Gap**: Artist/movement specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  in_oode24_mondriaan:
    description: Participates in OODE24 (Mondriaan project)
    range: boolean
  ```

#### 10. **Versnellen project** (Acceleration Project Details)
- **Type**: Freeform text describing project participation
- **Schema Gap**: Project-specific metadata
- **Recommendation**: Add to `DutchHeritageCustodian`:
  ```yaml
  versnellen_project_details:
    description: Details about participation in Versnellen digitization project
    range: string
  ```

---

## Schema Extension Recommendations

### Priority 1: Critical Gaps (Required for Full CSV Parsing)

**File**: `schemas/dutch.yaml` (DutchHeritageCustodian extensions)

```yaml
slots:
  # EU/International platforms
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf

  in_dc4eu_scope:
    description: In scope for DC4EU (Digital Collaboration for Europe) project
    range: boolean

  dc4eu_integration_method:
    description: Technical integration method for DC4EU (API, OAI-PMH, SPARQL, etc.)
    range: string

  # Digitization programs
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean

  versnellen_project_details:
    description: Details about Versnellen project participation (e.g., "Upgrade", "Aanschaf")
    range: string

  # Thematic networks
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean

  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean

  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean

  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk (Delft pottery network)
    range: boolean

  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed (academic heritage foundation)
    range: boolean

  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba (Caribbean heritage network)
    range: boolean

  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide (international Van Gogh collections network)
    range: boolean

  in_oode24_mondriaan:
    description: Participates in OODE24 (Mondriaan art project)
    range: boolean

  # Organizational
  parent_organization_name:
    description: Name of parent/umbrella organization (unresolved reference)
    range: string
    comments:
      - "Use this when CSV has organization name but not resolved HeritageCustodian ID"
      - "Resolve to parent_organization during cross-linking phase"

  # Collection indicators
  has_library_collection:
    description: Institution holds library collections (boolean indicator)
    range: boolean
```

### Priority 2: Enum Extensions

**File**: `schemas/enums.yaml`

```yaml
enums:
  DigitalPlatformTypeEnum:
    permissible_values:
      # Add new value:
      GENERIC:
        description: >-
          General-purpose software not specific to heritage sector.
          Examples: FileMaker, Microsoft Access, custom databases.
```

### Priority 3: Core Schema Enhancements

**File**: `schemas/collections.yaml` (DigitalPlatform extensions)

```yaml
slots:
  integration_method:
    description: >-
      Technical method for data integration/harvesting.
      Examples: REST API, OAI-PMH, SPARQL endpoint, CSV export, direct database access.
    range: string
    slot_uri: schema:applicationCategory
```

---

## Parsing Strategy

### Phase 1: Direct Mapping (Implemented)
```python
# Map straightforward fields
custodian.name = row['Organisatie']
custodian.homepage = row['Webadres organisatie']
custodian.institution_type = normalize_type(row['Type organisatie'])
custodian.in_museum_register = parse_boolean(row['Museum register'])
# ... etc.
```

### Phase 2: Complex Field Handling

#### Handling "Koepelorganisatie" (Umbrella Organizations)
```python
# Store temporarily as string
custodian.parent_organization_name = row['Koepelorganisatie']

# Later, during cross-linking:
if custodian.parent_organization_name:
    parent = dataset.find_by_name(custodian.parent_organization_name)
    if parent:
        custodian.parent_organization = parent.id
```

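The cross-linking step can be sketched as a name index built over the already-parsed custodians. This is a minimal sketch under stated assumptions, not the project's implementation: the `Custodian` dataclass, `normalize_name`, and `resolve_parents` are hypothetical names, and comparing case-folded, whitespace-collapsed names is an assumption about how organization names should be matched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Custodian:
    """Hypothetical stand-in for HeritageCustodian with only the fields used here."""
    id: str
    name: str
    parent_organization_name: Optional[str] = None
    parent_organization: Optional[str] = None


def normalize_name(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.split()).casefold()


def resolve_parents(custodians: list[Custodian]) -> list[Custodian]:
    """Fill parent_organization where the parent name matches a known custodian.

    Returns the custodians whose parent name could not be resolved.
    If two custodians share a normalized name, the later one wins in the index.
    """
    index = {normalize_name(c.name): c for c in custodians}
    unresolved = []
    for c in custodians:
        if not c.parent_organization_name:
            continue
        parent = index.get(normalize_name(c.parent_organization_name))
        if parent is not None and parent is not c:
            c.parent_organization = parent.id
        else:
            unresolved.append(c)
    return unresolved
```

Returning the unresolved records keeps `parent_organization_name` available for manual review instead of silently dropping names that have no match in the dataset.
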
#### Handling "Samenwerkingsverband / Platform"
```python
value = row['Samenwerkingsverband / Platform'].strip()

# Classify as network vs. platform; empty cells are skipped
if not value:
    pass
elif is_digital_platform(value):  # e.g., contains "digitaal", "portal"
    custodian.digital_platforms.append(
        DigitalPlatform(platform_name=value, platform_type="DISCOVERY_PORTAL")
    )
else:  # Network/consortium
    custodian.samenwerkingsverband.append(value)
```

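One possible shape for the keyword check is sketched here. The keyword list contains only the two examples from the inline comment ("digitaal", "portal"); `PLATFORM_KEYWORDS` and `split_networks_and_platforms` are hypothetical names, and a real list would need to be curated against the actual CSV values.

```python
# Assumed keyword list, taken from the inline comment in the snippet above.
PLATFORM_KEYWORDS = ("digitaal", "portal")


def is_digital_platform(value: str, keywords: tuple[str, ...] = PLATFORM_KEYWORDS) -> bool:
    """Return True if the cell value looks like a platform name rather than a network."""
    lowered = value.casefold()
    return any(keyword in lowered for keyword in keywords)


def split_networks_and_platforms(values: list[str]) -> tuple[list[str], list[str]]:
    """Partition non-empty cell values into (networks, platforms)."""
    networks: list[str] = []
    platforms: list[str] = []
    for value in values:
        value = value.strip()
        if not value:
            continue  # skip empty cells
        if is_digital_platform(value):
            platforms.append(value)
        else:
            networks.append(value)
    return networks, platforms
```

With only these two keywords, a value such as "Geheugen van Drenthe" is classified as a network, so keyword matching alone is likely insufficient; an explicit lookup table of known platform names may be a safer complement.
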
#### Handling "Systeem" (Software Systems)
```python
def classify_system(name):
    """Map system names to DigitalPlatformTypeEnum."""
    cms_systems = ['Atlantis', 'Adlib', 'MAIS', 'TMS', 'Axiell']
    if any(cms in name for cms in cms_systems):
        return 'COLLECTION_MANAGEMENT_SYSTEM'
    elif name in ['FileMaker', 'Access']:
        return 'GENERIC'
    # ... etc.


system = row['Systeem'].strip()
if system:
    platform_type = classify_system(system)  # Map to enum
    custodian.digital_platforms.append(
        DigitalPlatform(
            platform_name=system,
            platform_type=platform_type
        )
    )
```

#### Handling Multiple Notes Fields
```python
notes_parts = []
if row['Opmerkingen Inez']:
    notes_parts.append(f"Editor notes: {row['Opmerkingen Inez']}")
if row['Opmerkingen']:
    notes_parts.append(f"General remarks: {row['Opmerkingen']}")

if notes_parts:
    custodian.description = '\n\n'.join(notes_parts)
```

### Phase 3: Boolean Flag Mapping
```python
# Thematic networks (new boolean flags)
custodian.in_wo2net = parse_boolean(row['WO2Net'])
custodian.in_modemuze = parse_boolean(row['Modemuze'])
custodian.in_maritiem_digitaal = parse_boolean(row['Maritiem Digitaal'])
# ... etc.

def parse_boolean(value):
    """Parse Dutch CSV boolean representations."""
    if not value or value.strip() == '':
        return None
    value_lower = value.strip().lower()
    # Dutch: 'ja' = yes, 'nee' = no
    if value_lower in ['ja', 'yes', 'true', '1', 'x']:
        return True
    if value_lower in ['nee', 'no', 'false', '0']:
        return False
    return None  # Ambiguous
```

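Because the boolean columns all follow the same pattern, the per-column assignments could be driven by a lookup table. The sketch below is illustrative only: `BOOLEAN_COLUMNS` and `map_boolean_flags` are hypothetical names, the table lists only a subset of the column/slot pairs named in this document, and it assumes the CSV headers are spelled exactly as the headings used here. `parse_boolean` is repeated with the same rules so the block runs on its own.

```python
# Column header -> slot name; a subset of the pairs named in this document.
BOOLEAN_COLUMNS = {
    "Museum register": "in_museum_register",
    "Rijkscollectie": "in_rijkscollectie",
    "Collectie Nederland": "in_collectie_nederland",
    "Archieven.nl": "in_archieven_nl",
    "Archives Portal Europe": "in_archives_portal_europe",
    "WO2Net": "in_wo2net",
    "Modemuze": "in_modemuze",
    "Maritiem Digitaal": "in_maritiem_digitaal",
}


def parse_boolean(value):
    """Same rules as parse_boolean above: True, False, or None when empty/ambiguous."""
    text = (value or "").strip().lower()
    if text in ("ja", "yes", "true", "1", "x"):
        return True
    if text in ("nee", "no", "false", "0"):
        return False
    return None


def map_boolean_flags(row: dict) -> dict:
    """Return {slot_name: True/False/None} for each boolean column present in row."""
    return {
        slot: parse_boolean(row.get(column))
        for column, slot in BOOLEAN_COLUMNS.items()
        if column in row
    }
```
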

---

## Validation Checklist

Before considering CSV fully mapped:

- [ ] All 32 columns have documented mapping strategy
- [ ] Schema extensions implemented in `dutch.yaml`
- [ ] Enum extensions implemented in `enums.yaml`
- [ ] Parser handles all field types (string, boolean, reference, multivalued)
- [ ] Provenance metadata captures CSV source (TIER_1_AUTHORITATIVE)
- [ ] Test coverage for each new slot
- [ ] Cross-linking logic for `parent_organization_name` → `parent_organization`
- [ ] Classification logic for `Samenwerkingsverband` (network vs. platform)
- [ ] System name → DigitalPlatformTypeEnum mapping complete

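The first checklist item could be checked mechanically by comparing the CSV header against the set of columns that have a documented mapping. This is a rough sketch: `unmapped_columns` is a hypothetical helper, and it assumes a comma-delimited file with a single header row, which has not been verified against the real CSV.

```python
import csv
import io


def unmapped_columns(csv_text: str, mapped: set[str]) -> list[str]:
    """Return header columns that have no documented mapping, in file order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return [col.strip() for col in header if col.strip() not in mapped]


# Tiny in-memory example; real use would read the header of the source CSV.
sample = "Organisatie,Type organisatie,Systeem\nA,museum,Adlib\n"
missing = unmapped_columns(sample, {"Organisatie", "Systeem"})
```
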

---

## Impact Assessment

### Schema Changes Required
- **dutch.yaml**: +15 new slots (the slots listed under Priority 1, mostly boolean flags for network participation)
- **enums.yaml**: +1 enum value (`DigitalPlatformTypeEnum.GENERIC`)
- **collections.yaml**: +1 slot (`DigitalPlatform.integration_method`)

### Backward Compatibility
- ✅ All changes are ADDITIVE (new optional slots)
- ✅ No breaking changes to existing schema
- ✅ Existing data remains valid

### Data Quality Implications
- **TIER_1_AUTHORITATIVE** data will have richer Dutch network participation metadata
- Cross-linking with conversation data (TIER_4_INFERRED) will benefit from parent organization names
- Boolean flags enable precise filtering (e.g., "all institutions in Modemuze network")

---

## Next Steps

1. **Implement Priority 1 extensions** in `schemas/dutch.yaml` ✅ (Ready to proceed)
2. **Update parser** (`src/glam_extractor/parsers/dutch_orgs.py`) to use new slots
3. **Create test fixtures** with real CSV rows exercising all field types
4. **Update documentation** (AGENTS.md) with new Dutch-specific extraction patterns
5. **Regenerate LinkML artifacts** (JSON Schema, Python dataclasses, SQL DDL)
6. **Validate with real data** (1,351 institutions from CSV)

---

## Open Questions

1. **Should thematic networks be modeled as booleans OR as Partnership objects?**
   - Current recommendation: Booleans (simpler, CSV is just participation flags)
   - Alternative: `Partnership(partner_name="Modemuze", partnership_type="THEMATIC_NETWORK")`
   - Decision needed for long-term maintainability

2. **How to handle uncertain system names (e.g., "MAIS Flexis?")?**
   - Option A: Strip "?" and parse as "MAIS Flexis"
   - Option B: Store in notes, mark confidence_score lower
   - Option C: Create `platform_name_uncertain: boolean` flag

3. **Should "Bibliotheek collectie" create actual Collection objects or just set a flag?**
   - Current: Flag approach (`has_library_collection: boolean`)
   - Alternative: Create `Collection(collection_type="bibliographic")` stub
   - Depends on whether CSV will later provide actual collection metadata

4. **Geographic scope**: Should Caribbean networks (Coleccion Aruba) be in `dutch.yaml`?
   - These are Netherlands-administered but geographically outside Europe
   - Consider creating `dutch_caribbean.yaml` module?

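For question 2, options A and C are not mutually exclusive: the parser could strip the marker and still record that the source value was uncertain. The sketch below illustrates that combination only; `parse_system_name` is a hypothetical helper, and treating a trailing "?" as the sole uncertainty marker is an assumption based on the single example "MAIS Flexis?".

```python
def parse_system_name(raw: str) -> tuple[str, bool]:
    """Return (cleaned_name, uncertain) for a Systeem cell.

    A trailing "?" is treated as an uncertainty marker and removed.
    """
    name = raw.strip()
    uncertain = name.endswith("?")
    if uncertain:
        name = name.rstrip("?").strip()
    return name, uncertain


name, uncertain = parse_system_name("MAIS Flexis?")
# name can feed platform_name; uncertain can feed a flag such as platform_name_uncertain
```
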

---

**Document Status**: DRAFT
**Next Review**: After schema extensions implemented
**Maintainer**: GLAM Data Extraction Project