glam/docs/CSV_SCHEMA_MAPPING.md
2025-11-19 23:25:22 +01:00

# Dutch Organizations CSV Schema Mapping Analysis
**Date**: 2025-11-07
**Schema Version**: v0.2.0
**Source**: `data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`
## Executive Summary
This document maps all 32 columns from the Dutch organizations CSV to LinkML schema fields, identifies gaps, and recommends schema extensions.
**Status**:
- ✅ **Fully Mapped**: 17 columns (53%)
- ⚠️ **Partially Mapped**: 8 columns (25%), requiring enum extensions or additional slots
- ❌ **Unmapped**: 7 columns (22%), requiring new schema features
---
## Column-by-Column Mapping
### ✅ Fully Mapped Fields (17/32)
| CSV Column | Schema Module | Class/Slot | Notes |
|------------|---------------|------------|-------|
| **Plaatsnaam bezoekadres** | `core.yaml` | `Location.city` | Direct mapping |
| **Straat en huisnummer bezoekadres** | `core.yaml` | `Location.street_address` | Direct mapping |
| **Organisatie** | `core.yaml` | `HeritageCustodian.name` | Primary name field |
| **Webadres organisatie** | `core.yaml` | `HeritageCustodian.homepage` | Maps to `foaf:homepage` |
| **Type organisatie** | `enums.yaml` | `HeritageCustodian.institution_type` | Requires value normalization |
| **ISIL-code (NA)** | `core.yaml` | `Identifier` (scheme=ISIL) | Multivalued identifiers list |
| **Museum register** | `dutch.yaml` | `DutchHeritageCustodian.in_museum_register` | Boolean flag |
| **Rijkscollectie** | `dutch.yaml` | `DutchHeritageCustodian.in_rijkscollectie` | Boolean flag |
| **Collectie Nederland** | `dutch.yaml` | `DutchHeritageCustodian.in_collectie_nederland` | Boolean flag |
| **Archieven.nl** | `dutch.yaml` | `DutchHeritageCustodian.in_archieven_nl` | Boolean flag |
| **Linked Data** | `collections.yaml` | `DigitalPlatform` + custom property | Platform with capability flag |
| **Datasetregister** | `collections.yaml` | `DigitalPlatform` | Dataset registry participation |

**Additional implicit mappings**:
- Province (derived from city) → `DutchHeritageCustodian.provincie`
- Country (always "NL") → `Location.country`
- Data source → `Provenance.data_source = CSV_REGISTRY`
- Data tier → `Provenance.data_tier = TIER_1_AUTHORITATIVE`
---
### ⚠️ Partially Mapped Fields (8/32)
These fields can be mapped to existing schema structures but require extensions or enum updates.
#### 1. **Koepelorganisatie** (Umbrella Organization)
- **Current Mapping**: `HeritageCustodian.parent_organization`
- **Gap**: CSV has organization NAME only, schema expects HeritageCustodian reference
- **Solution**:
  - Parse as string, then resolve to the actual HeritageCustodian ID during the cross-linking phase
  - Store temporarily in `description` or create a `parent_organization_name` slot
- **Recommendation**: Add `parent_organization_name: string` slot for unresolved references
#### 2. **Samenwerkingsverband / Platform** (Collaborative Network)
- **Current Mapping**: `DutchHeritageCustodian.samenwerkingsverband` (multivalued string)
- **Gap**: CSV mixes network names AND platform names (e.g., "Geheugen van Drenthe")
- **Solution**: Split into two categories:
  - Networks → `samenwerkingsverband`
  - Platforms → `DigitalPlatform`
- **Recommendation**: Parser logic to classify based on keywords
#### 3. **Systeem** (System/Software)
- **Current Mapping**: `DigitalPlatform.platform_name`
- **Gap**: CSV values include systems NOT in `DigitalPlatformTypeEnum`:
  - "Atlantis" → COLLECTION_MANAGEMENT_SYSTEM
  - "MAIS Flexis?" → COLLECTION_MANAGEMENT_SYSTEM
  - "Adlib" → COLLECTION_MANAGEMENT_SYSTEM
  - "FileMaker" → GENERIC (new value)
- **Recommendation**: Extend `DigitalPlatformTypeEnum` with:
```yaml
GENERIC:
  description: General-purpose software not specific to heritage sector
```
#### 4. **Versnellen** (Acceleration Project)
- **Current Mapping**: No direct mapping
- **Gap**: Boolean flag indicating participation in Dutch digitization program
- **Solution**: Add to `DutchHeritageCustodian`:
```yaml
in_versnellen_project:
  description: Participates in Versnellen digitization acceleration program
  range: boolean
```
- **Recommendation**: NEW SLOT REQUIRED
#### 5. **Bibliotheek collectie** (Library Collection)
- **Current Mapping**: `Collection.collection_type = "bibliographic"`
- **Gap**: CSV has boolean "has library collection", not collection metadata
- **Solution**: Add boolean flag to indicate collection type presence:
```yaml
has_library_collection:
  description: Institution holds library collections
  range: boolean
```
- **Recommendation**: NEW SLOT or use `collection_type` filter
#### 6. **in scope voor DC4EU** (DC4EU Scope)
- **Current Mapping**: `Partnership` or boolean flag
- **Gap**: EU project participation not modeled
- **Solution**: Either:
  - A) Add partnership: `Partnership(partner_name="DC4EU", partnership_type="EU_PROJECT")`
  - B) Add boolean: `in_dc4eu_scope: boolean`
- **Recommendation**: Option B (boolean) simpler for this use case
#### 7. **DC4EU aansluit route** (DC4EU Connection Route)
- **Current Mapping**: No mapping
- **Gap**: Technical integration pathway (e.g., "API", "OAI-PMH", "direct")
- **Solution**: Add to `DigitalPlatform`:
```yaml
integration_method:
  description: Technical method for data integration (API, OAI-PMH, SPARQL, etc.)
  range: string
```
- **Recommendation**: NEW SLOT for DigitalPlatform
#### 8. **Opmerkingen Inez** + **Opmerkingen** (Remarks/Notes)
- **Current Mapping**: `HeritageCustodian.description` (partial)
- **Gap**: CSV has TWO note fields (one from editor, one general)
- **Solution**: Concatenate both into `description` with attribution:
```yaml
description: >-
  Editor notes (Inez): {Opmerkingen Inez}
  General remarks: {Opmerkingen}
```
- **Recommendation**: Use existing `description` slot, document concatenation in parser
---
### ❌ Unmapped Fields (7/32)
These fields require NEW schema features or represent specialized domain-specific platforms.
#### 1. **Archives Portal Europe**
- **Type**: Boolean flag for European aggregation platform
- **Schema Gap**: Not covered by current Dutch extensions
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_archives_portal_europe:
  description: Registered in Archives Portal Europe (APE)
  range: boolean
  slot_uri: dcterms:isPartOf
```
#### 2. **WO2Net** (WWII Network)
- **Type**: Thematic network for WWII heritage
- **Schema Gap**: No slot for thematic networks
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_wo2net:
  description: Participates in WO2Net (WWII heritage network)
  range: boolean
```
#### 3. **Modemuze** (Fashion Museum Network)
- **Type**: Specialized fashion heritage network
- **Schema Gap**: Domain-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_modemuze:
  description: Participates in Modemuze (fashion heritage network)
  range: boolean
```
#### 4. **Maritiem Digitaal** (Maritime Digital)
- **Type**: Maritime heritage aggregation platform
- **Schema Gap**: Domain-specific platform
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_maritiem_digitaal:
  description: Contributes to Maritiem Digitaal (maritime heritage platform)
  range: boolean
```
#### 5. **Delfts aardewerk** (Delft Pottery)
- **Type**: Delft pottery collections network
- **Schema Gap**: Collection-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_delfts_aardewerk:
  description: Participates in Delfts aardewerk network
  range: boolean
```
#### 6. **Stichting Academisch Erfgoed** (Academic Heritage Foundation)
- **Type**: Academic heritage network
- **Schema Gap**: Network participation
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_academisch_erfgoed:
  description: Member of Stichting Academisch Erfgoed
  range: boolean
```
#### 7. **Coleccion Aruba**
- **Type**: Aruba collections network
- **Schema Gap**: Caribbean/colonial heritage network
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_coleccion_aruba:
  description: Contributes to Coleccion Aruba
  range: boolean
```
#### 8. **Van Gogh Worldwide**
- **Type**: International Van Gogh collections network
- **Schema Gap**: Artist-specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_van_gogh_worldwide:
  description: Participates in Van Gogh Worldwide
  range: boolean
```
#### 9. **OODE24 (Mondriaan)**
- **Type**: Mondriaan art network/project
- **Schema Gap**: Artist/movement specific network
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
in_oode24_mondriaan:
  description: Participates in OODE24 (Mondriaan project)
  range: boolean
```
#### 10. **Versnellen project** (Acceleration Project Details)
- **Type**: Freeform text describing project participation
- **Schema Gap**: Project-specific metadata
- **Recommendation**: Add to `DutchHeritageCustodian`:
```yaml
versnellen_project_details:
  description: Details about participation in Versnellen digitization project
  range: string
```
---
## Schema Extension Recommendations
### Priority 1: Critical Gaps (Required for Full CSV Parsing)
**File**: `schemas/dutch.yaml` (DutchHeritageCustodian extensions)
```yaml
slots:
  # EU/International platforms
  in_archives_portal_europe:
    description: Registered in Archives Portal Europe (APE)
    range: boolean
    slot_uri: dcterms:isPartOf
  in_dc4eu_scope:
    description: In scope for DC4EU (Digital Collaboration for Europe) project
    range: boolean
  dc4eu_integration_method:
    description: Technical integration method for DC4EU (API, OAI-PMH, SPARQL, etc.)
    range: string
  # Digitization programs
  in_versnellen_project:
    description: Participates in Versnellen digitization acceleration program
    range: boolean
  versnellen_project_details:
    description: Details about Versnellen project participation (e.g., "Upgrade", "Aanschaf")
    range: string
  # Thematic networks
  in_wo2net:
    description: Participates in WO2Net (WWII heritage network)
    range: boolean
  in_modemuze:
    description: Participates in Modemuze (fashion heritage network)
    range: boolean
  in_maritiem_digitaal:
    description: Contributes to Maritiem Digitaal (maritime heritage platform)
    range: boolean
  in_delfts_aardewerk:
    description: Participates in Delfts aardewerk (Delft pottery network)
    range: boolean
  in_academisch_erfgoed:
    description: Member of Stichting Academisch Erfgoed (academic heritage foundation)
    range: boolean
  in_coleccion_aruba:
    description: Contributes to Coleccion Aruba (Caribbean heritage network)
    range: boolean
  in_van_gogh_worldwide:
    description: Participates in Van Gogh Worldwide (international Van Gogh collections network)
    range: boolean
  in_oode24_mondriaan:
    description: Participates in OODE24 (Mondriaan art project)
    range: boolean
  # Organizational
  parent_organization_name:
    description: Name of parent/umbrella organization (unresolved reference)
    range: string
    comments:
      - "Use this when CSV has organization name but not resolved HeritageCustodian ID"
      - "Resolve to parent_organization during cross-linking phase"
  # Collection indicators
  has_library_collection:
    description: Institution holds library collections (boolean indicator)
    range: boolean
```
### Priority 2: Enum Extensions
**File**: `schemas/enums.yaml`
```yaml
enums:
  DigitalPlatformTypeEnum:
    permissible_values:
      # Add new value:
      GENERIC:
        description: >-
          General-purpose software not specific to heritage sector.
          Examples: FileMaker, Microsoft Access, custom databases.
```
### Priority 3: Core Schema Enhancements
**File**: `schemas/collections.yaml` (DigitalPlatform extensions)
```yaml
slots:
  integration_method:
    description: >-
      Technical method for data integration/harvesting.
      Examples: REST API, OAI-PMH, SPARQL endpoint, CSV export, direct database access.
    range: string
    slot_uri: schema:applicationCategory
```
---
## Parsing Strategy
### Phase 1: Direct Mapping (Implemented)
```python
# Map straightforward fields
custodian.name = row['Organisatie']
custodian.homepage = row['Webadres organisatie']
custodian.institution_type = normalize_type(row['Type organisatie'])
custodian.in_museum_register = parse_boolean(row['Museum register'])
# ... etc.
```
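
The snippet above calls `normalize_type` without showing it. A minimal sketch follows, assuming the `Type organisatie` column holds Dutch labels such as "museum" or "archief" and that the institution type enum uses names such as `MUSEUM`; both the labels and the enum names are illustrative assumptions, not values confirmed by the CSV or by `enums.yaml`. Unknown values return `None` so they can be reviewed rather than silently coerced.

```python
from typing import Optional

# Hypothetical lookup table: the Dutch labels and the enum names are
# illustrative assumptions, not taken from the CSV or from enums.yaml.
TYPE_LOOKUP = {
    "museum": "MUSEUM",
    "archief": "ARCHIVE",
    "bibliotheek": "LIBRARY",
}


def normalize_type(raw: Optional[str]) -> Optional[str]:
    """Return an enum name for a raw 'Type organisatie' value, or None if unknown."""
    if not raw:
        return None
    key = raw.strip().lower()
    return TYPE_LOOKUP.get(key)
```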
### Phase 2: Complex Field Handling
#### Handling "Koepelorganisatie" (Umbrella Organizations)
```python
# Store temporarily as string
custodian.parent_organization_name = row['Koepelorganisatie']
# Later, during cross-linking:
if custodian.parent_organization_name:
    parent = dataset.find_by_name(custodian.parent_organization_name)
    if parent:
        custodian.parent_organization = parent.id
```
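
The cross-linking step relies on `dataset.find_by_name`, which is not defined in this document. One way to sketch the lookup is a name index built once over all records. The `Custodian` dataclass, `name_key`, and `link_parents` below are hypothetical stand-ins for illustration; matching on casefolded, whitespace-collapsed names is an assumption about what counts as "the same organization".

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Custodian:
    """Hypothetical stand-in for HeritageCustodian with only the fields used here."""
    id: str
    name: str
    parent_organization_name: Optional[str] = None
    parent_organization: Optional[str] = None


def name_key(name: str) -> str:
    """Normalize a name for matching: casefold and collapse whitespace."""
    return " ".join(name.casefold().split())


def link_parents(custodians: Iterable[Custodian]) -> int:
    """Resolve parent_organization_name to an ID; return the number resolved."""
    records = list(custodians)
    index: Dict[str, Custodian] = {name_key(c.name): c for c in records}
    resolved = 0
    for c in records:
        if not c.parent_organization_name:
            continue
        parent = index.get(name_key(c.parent_organization_name))
        # Skip unknown names and self-references; they stay unresolved.
        if parent is not None and parent is not c:
            c.parent_organization = parent.id
            resolved += 1
    return resolved
```

Names that do not resolve keep their `parent_organization_name`, so they can be retried after more records are loaded.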
#### Handling "Samenwerkingsverband / Platform"
```python
value = row['Samenwerkingsverband / Platform']
# Classify as network vs. platform
if is_digital_platform(value):  # e.g., contains "digitaal", "portal"
    custodian.digital_platforms.append(
        DigitalPlatform(platform_name=value, platform_type="DISCOVERY_PORTAL")
    )
else:  # Network/consortium
    custodian.samenwerkingsverband.append(value)
```
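
`is_digital_platform` is referenced above but not shown. A keyword-based sketch could look like the following. The keywords "digitaal" and "portal" come from the comment in the snippet, and "geheugen van" is suggested by the "Geheugen van Drenthe" example; "portaal" is an added assumption. A real classifier would likely need a curated list built from the actual CSV values.

```python
# Hypothetical keyword list; only "digitaal" and "portal" are named in the
# document, the other entries are illustrative assumptions.
PLATFORM_KEYWORDS = ("digitaal", "portal", "portaal", "geheugen van")


def is_digital_platform(value: str) -> bool:
    """Return True if the value looks like a platform name rather than a network."""
    lowered = value.casefold()
    return any(keyword in lowered for keyword in PLATFORM_KEYWORDS)
```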
#### Handling "Systeem" (Software Systems)
```python
def classify_system(name):
    """Map system names to DigitalPlatformTypeEnum."""
    cms_systems = ['Atlantis', 'Adlib', 'MAIS', 'TMS', 'Axiell']
    if any(cms in name for cms in cms_systems):
        return 'COLLECTION_MANAGEMENT_SYSTEM'
    elif name in ['FileMaker', 'Access']:
        return 'GENERIC'
    # ... etc.


# Guard against missing values before stripping
system = (row['Systeem'] or '').strip()
if system:
    platform_type = classify_system(system)  # Map to enum
    custodian.digital_platforms.append(
        DigitalPlatform(
            platform_name=system,
            platform_type=platform_type
        )
    )
```
#### Handling Multiple Notes Fields
```python
notes_parts = []
if row['Opmerkingen Inez']:
    notes_parts.append(f"Editor notes (Inez): {row['Opmerkingen Inez']}")
if row['Opmerkingen']:
    notes_parts.append(f"General remarks: {row['Opmerkingen']}")
if notes_parts:
    custodian.description = '\n\n'.join(notes_parts)
```
### Phase 3: Boolean Flag Mapping
```python
def parse_boolean(value):
    """Parse Dutch CSV boolean representations."""
    if not value or value.strip() == '':
        return None
    value_lower = value.strip().lower()
    # Dutch: 'ja' = yes, 'nee' = no
    if value_lower in ['ja', 'yes', 'true', '1', 'x']:
        return True
    if value_lower in ['nee', 'no', 'false', '0']:
        return False
    return None  # Ambiguous


# Thematic networks (new boolean flags)
custodian.in_wo2net = parse_boolean(row['WO2Net'])
custodian.in_modemuze = parse_boolean(row['Modemuze'])
custodian.in_maritiem_digitaal = parse_boolean(row['Maritiem Digitaal'])
# ... etc.
```
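
With more than a dozen flag columns, a table-driven mapping may be easier to maintain than one assignment per column. The sketch below assumes a comma-delimited file and uses a made-up sample row; `FLAG_COLUMNS` and `extract_flags` are hypothetical names, and `parse_boolean` is repeated in compact form only to keep the example self-contained.

```python
import csv
import io

# Column name -> slot name, taken from the mappings proposed in this document.
FLAG_COLUMNS = {
    "WO2Net": "in_wo2net",
    "Modemuze": "in_modemuze",
    "Maritiem Digitaal": "in_maritiem_digitaal",
}


def parse_boolean(value):
    """Compact copy of the parser above: True, False, or None when unknown."""
    text = (value or "").strip().lower()
    if text in ("ja", "yes", "true", "1", "x"):
        return True
    if text in ("nee", "no", "false", "0"):
        return False
    return None


def extract_flags(row):
    """Build a slot -> value dict for every configured flag column."""
    return {slot: parse_boolean(row.get(column)) for column, slot in FLAG_COLUMNS.items()}


# Made-up sample data; the real file's delimiter and values may differ.
sample = io.StringIO(
    "Organisatie,WO2Net,Modemuze,Maritiem Digitaal\n"
    "Voorbeeld Museum,ja,,nee\n"
)
rows = list(csv.DictReader(sample))
flags = extract_flags(rows[0])
```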
---
## Validation Checklist
Before considering CSV fully mapped:
- [ ] All 32 columns have documented mapping strategy
- [ ] Schema extensions implemented in `dutch.yaml`
- [ ] Enum extensions implemented in `enums.yaml`
- [ ] Parser handles all field types (string, boolean, reference, multivalued)
- [ ] Provenance metadata captures CSV source (TIER_1_AUTHORITATIVE)
- [ ] Test coverage for each new slot
- [ ] Cross-linking logic for `parent_organization_name` → `parent_organization`
- [ ] Classification logic for `Samenwerkingsverband` (network vs. platform)
- [ ] System name → DigitalPlatformTypeEnum mapping complete
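
The first checklist item can be checked mechanically by comparing the CSV header against the documented mapping. This is a sketch only: `unmapped_columns` is a hypothetical helper, and `COLUMN_MAPPING` lists just three of the mappings from this document for brevity. Any column it returns still needs a mapping decision.

```python
def unmapped_columns(header, mapping):
    """Return CSV columns without a documented mapping, in header order."""
    known = {name.strip().casefold() for name in mapping}
    return [col for col in header if col.strip().casefold() not in known]


# Partial mapping for illustration; a real check would list all columns.
COLUMN_MAPPING = {
    "Organisatie": "HeritageCustodian.name",
    "Webadres organisatie": "HeritageCustodian.homepage",
    "Museum register": "DutchHeritageCustodian.in_museum_register",
}

header = ["Organisatie", "Webadres organisatie", "Museum register", "Modemuze"]
missing = unmapped_columns(header, COLUMN_MAPPING)
```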
---
## Impact Assessment
### Schema Changes Required
- **dutch.yaml**: +15 new slots, as listed under Priority 1 (mostly boolean flags for network participation)
- **enums.yaml**: +1 enum value (`DigitalPlatformTypeEnum.GENERIC`)
- **collections.yaml**: +1 slot (`DigitalPlatform.integration_method`)
### Backward Compatibility
- ✅ All changes are ADDITIVE (new optional slots)
- ✅ No breaking changes to existing schema
- ✅ Existing data remains valid
### Data Quality Implications
- **TIER_1_AUTHORITATIVE** data will have richer Dutch network participation metadata
- Cross-linking with conversation data (TIER_4_INFERRED) will benefit from parent organization names
- Boolean flags enable precise filtering (e.g., "all institutions in Modemuze network")
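
Because `parse_boolean` returns `None` for empty or ambiguous cells, filters on these flags need to distinguish "not a member" from "unknown". The sketch below uses a hypothetical `Record` stand-in and a `members_of` helper to show one way of selecting only confirmed members.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for DutchHeritageCustodian with one flag."""
    name: str
    in_modemuze: Optional[bool] = None


def members_of(records: List[Record], flag: str) -> List[Record]:
    """Keep records whose flag is exactly True; False and None are excluded."""
    return [r for r in records if getattr(r, flag, None) is True]


# Made-up records covering the three possible flag states.
records = [
    Record("Museum A", in_modemuze=True),
    Record("Museum B", in_modemuze=False),
    Record("Museum C"),  # flag unknown
]
modemuze_members = members_of(records, "in_modemuze")
```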
---
## Next Steps
1. **Implement Priority 1 extensions** in `schemas/dutch.yaml` ✅ (Ready to proceed)
2. **Update parser** (`src/glam_extractor/parsers/dutch_orgs.py`) to use new slots
3. **Create test fixtures** with real CSV rows exercising all field types
4. **Update documentation** (AGENTS.md) with new Dutch-specific extraction patterns
5. **Regenerate LinkML artifacts** (JSON Schema, Python dataclasses, SQL DDL)
6. **Validate with real data** (1,351 institutions from CSV)
---
## Open Questions
1. **Should thematic networks be modeled as booleans OR as Partnership objects?**
   - Current recommendation: Booleans (simpler, CSV is just participation flags)
   - Alternative: `Partnership(partner_name="Modemuze", partnership_type="THEMATIC_NETWORK")`
   - Decision needed for long-term maintainability
2. **How to handle uncertain system names (e.g., "MAIS Flexis?")?**
   - Option A: Strip "?" and parse as "MAIS Flexis"
   - Option B: Store in notes, mark confidence_score lower
   - Option C: Create `platform_name_uncertain: boolean` flag
3. **Should "Bibliotheek collectie" create actual Collection objects or just set a flag?**
   - Current: Flag approach (`has_library_collection: boolean`)
   - Alternative: Create `Collection(collection_type="bibliographic")` stub
   - Depends on whether CSV will later provide actual collection metadata
4. **Geographic scope**: Should Caribbean networks (Coleccion Aruba) be in `dutch.yaml`?
   - These are Netherlands-administered but geographically outside Europe
   - Consider creating `dutch_caribbean.yaml` module?
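
For question 2, Options A and C are compatible: the trailing "?" can be stripped for classification while the uncertainty is kept as a separate value. The sketch below illustrates that combination and is not a decision; `split_uncertainty` is a hypothetical helper, and it only treats a trailing question mark as the uncertainty marker.

```python
from typing import Tuple


def split_uncertainty(raw: str) -> Tuple[str, bool]:
    """Return (clean_name, uncertain) for values such as 'MAIS Flexis?'."""
    name = raw.strip()
    uncertain = name.endswith("?")
    if uncertain:
        # Drop every trailing "?" plus any whitespace left before it.
        name = name.rstrip("?").rstrip()
    return name, uncertain
```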
---
**Document Status**: DRAFT
**Next Review**: After schema extensions implemented
**Maintainer**: GLAM Data Extraction Project