Global GLAM Dataset: Data Standardization Strategy
Overview
This document describes how the existing structured datasets in the data/ directory will be standardized and integrated into the Global GLAM dataset extraction pipeline, alongside data mined from conversation files.
Existing Data Assets
1. ISIL Codes Registry
File: data/ISIL-codes_2025-08-01.csv
Current Structure
Volgnr.,Plaats,Instelling,ISIL code,Toegekend op,Opmerking
1,Aalten,Nationaal Onderduikmuseum,NL-AtNOM,2021-03-17,
2,Alkmaar,Regionaal Archief Alkmaar,NL-AmrRAA,2009-08-18,
Data Characteristics
- Scope: Dutch heritage institutions with assigned ISIL codes
- Coverage: approximately 300 institutions (estimated from the file)
- Authority: Official ISIL registry for Netherlands
- Fields:
- Sequential number (Volgnr.)
- Place/City (Plaats)
- Institution name (Instelling)
- ISIL code (standardized identifier)
- Assignment date (Toegekend op)
- Remarks/Notes (Opmerking)
Data Quality
- Strengths:
- Authoritative source for Dutch ISIL codes
- Standardized identifier format (NL-{code})
- Official assignment dates
- Clean, structured data
- Limitations:
- Netherlands only (not global)
- Only institutions with ISIL codes (subset of all institutions)
- Limited metadata (no URLs, collections, types)
- CSV encoding issues (extra quote marks)
2. Dutch Organizations Registry
File: data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv
Current Structure
Plaatsnaam bezoekadres,Straat en huisnummer bezoekadres,Organisatie,Koepelorganisatie,
Webadres organisatie,Type organisatie,Opmerkingen Inez,ISIL-code (NA),
Samenwerkingsverband / Platform,Systeem,Versnellen,Collectie Nederland,...
Data Characteristics
- Scope: Comprehensive list of Dutch heritage organizations and services
- Coverage: Province-by-province (Drenthe, Flevoland visible in sample)
- Richness: Very detailed metadata about each institution
- Fields (40+ columns):
- Location: Province, city, street address
- Organization: Name, parent organization, website
- Classification: Type (museum, archive, library, historical society)
- Identifiers: ISIL code, Chamber of Commerce (KvK) number
- Platforms: Participation in various heritage platforms/networks
- Systems: Collection management systems used (Atlantis, MAIS Flexis, ZCBS, etc.)
- Projects: Participation in digitization projects (Versnellen, Collectie Nederland, etc.)
- Registers: Museum register, Rijkscollectie, Archives Portal Europe, etc.
- Specialized networks: WO2Net, Modemuze, Maritiem Digitaal, etc.
- Linked Data: Participation in linked data initiatives
- Notes: Extensive comments on organizational structure, data issues
Data Quality
- Strengths:
- Extremely rich metadata
- Real-world operational data (systems, platforms, projects)
- Parent/child organization relationships
- Network/consortium memberships
- Curator notes on data quality issues
- Physical addresses for geocoding
- Web URLs for crawling
- Limitations:
- Netherlands-specific (single country)
- Mixed data quality (notes indicate issues)
- Some institutions lack KvK registration
- Complex organizational relationships
- Many empty cells (not all institutions participate in all platforms)
Integration Strategy
Phase 1: Standardization & Normalization
1.1 ISIL Codes CSV Processing
Parsing Steps:
- Fix CSV encoding issues (extra quotes)
- Parse dates to ISO 8601 format
- Validate ISIL code format (NL-{prefix}{suffix})
- Normalize place names (standardize Dutch municipality names)
- Extract institution type hints from names
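A minimal parsing sketch for these steps; the quote-stripping rule and the exact column labels are assumptions based on the sample rows above:
import csv
import re
from datetime import datetime

ISIL_PATTERN = re.compile(r'^NL-[A-Za-z0-9]+$')

def parse_isil_rows(path: str) -> list:
    """Parse the ISIL registry CSV, fixing stray quotes and validating codes."""
    records = []
    with open(path, encoding='utf-8-sig', newline='') as f:
        for row in csv.DictReader(f):
            # Strip the stray quote marks observed in the source file
            row = {k.strip('" '): (v or '').strip('" ') for k, v in row.items()}
            if not ISIL_PATTERN.match(row.get('ISIL code', '')):
                continue  # route to a review queue in the full pipeline
            records.append({
                'sequence_number': int(row['Volgnr.']),
                'city': row['Plaats'],
                'institution_name': row['Instelling'],
                'isil_code': row['ISIL code'],
                # Source dates already look ISO 8601; parse to confirm
                'assigned_date': datetime.strptime(row['Toegekend op'], '%Y-%m-%d').date(),
                'remarks': row['Opmerking'] or None,
            })
    return records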
Output Schema:
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass
class ISILRecord:
    sequence_number: int
    city: str
    city_normalized: str              # Standardized municipality
    institution_name: str
    institution_name_normalized: str  # Cleaned name
    isil_code: str
    assigned_date: date
    remarks: Optional[str]
    # Derived fields
    institution_type_hint: Optional[str]        # Extracted from name
    province: Optional[str]                     # Looked up from city
    coordinates: Optional[Tuple[float, float]]  # Geocoded
Enrichment:
- Geocode cities to coordinates
- Map cities to provinces
- Infer institution types from names (e.g., "Archief" → archive, "Museum" → museum)
- Link to Wikidata entities where possible
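The type-hint inference can be a simple keyword lookup; the keyword table below is a hypothetical starting point, not an exhaustive vocabulary:
from typing import Optional

# Hypothetical keyword → type table; extend with curator input
TYPE_KEYWORDS = {
    'archief': 'archive',
    'museum': 'museum',
    'bibliotheek': 'library',
    'oudheidkamer': 'historical_society',
}

def infer_institution_type(name: str) -> Optional[str]:
    """Return the first type whose keyword occurs in the institution name."""
    lowered = name.lower()
    for keyword, institution_type in TYPE_KEYWORDS.items():
        if keyword in lowered:
            return institution_type
    return None  # leave unset; may be filled from the organizations registry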
1.2 Dutch Organizations CSV Processing
Parsing Steps:
- Parse complex CSV with 40+ columns
- Handle empty cells gracefully
- Normalize organization types
- Parse boolean flags (ja/nee → true/false)
- Validate URLs
- Parse ISIL codes
- Extract system/platform names
- Process curator notes
Output Schema:
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DutchOrganizationRecord:
    # Core identity
    organization_name: str
    organization_type: str  # Normalized: 'museum', 'archive', 'library', etc.
    parent_organization: Optional[str]

    # Location
    province: str
    city: str
    street_address: str
    coordinates: Optional[Tuple[float, float]]

    # Digital presence
    website_url: Optional[str]

    # Identifiers
    isil_code: Optional[str]
    kvk_number: Optional[str]  # Dutch business registry

    # Platforms & networks
    consortium_memberships: List[str]
    platform_participations: List[str]

    # Technical systems
    collection_management_system: Optional[str]

    # Project participation
    versnellen_project: bool
    collectie_nederland: bool
    museum_register: bool
    rijkscollectie: bool
    bibliotheek_collectie: bool
    dc4eu: bool  # Digital Culture for Europe
    archives_portal_europe: bool
    wo2net: bool  # WWII network
    specialized_networks: List[str]  # Modemuze, Maritiem Digitaal, etc.

    # Linked data
    linked_data_participation: bool
    dataset_register: bool

    # Data quality
    curator_notes: Optional[str]
    data_issues: List[str]  # Extracted from notes
Normalization:
- Standardize organization types to controlled vocabulary
- Parse platform/system names to canonical forms
- Convert "ja"/"nieuw"/"" to boolean/enum values
- Extract structured issues from free-text notes
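A sketch of the flag normalization, assuming the observed cell values ('ja', 'nee', 'nieuw', and empty) are the full set:
from enum import Enum
from typing import Optional

class Participation(Enum):
    YES = 'ja'
    NO = 'nee'
    NEW = 'nieuw'    # recently joined; counts as participating
    UNKNOWN = ''     # empty cell

def parse_flag(cell: Optional[str]) -> Participation:
    """Map a raw CSV cell to a Participation value."""
    try:
        return Participation((cell or '').strip().lower())
    except ValueError:
        return Participation.UNKNOWN  # unexpected spellings go to review

def as_bool(flag: Participation) -> bool:
    """Collapse to the boolean fields of DutchOrganizationRecord."""
    return flag in (Participation.YES, Participation.NEW)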
1.3 Cross-Dataset Linking
ISIL Code Matching:
def link_datasets(isil_records: List[ISILRecord],
                  org_records: List[DutchOrganizationRecord]) -> List[LinkedRecord]:
    """
    Link the ISIL registry with the Dutch organizations registry
    """
    # Match on ISIL code (primary key)
    # Match on normalized organization name + city (fallback)
    # Flag conflicts and duplicates
Linking Strategy:
- Primary match: ISIL code (if present in both datasets)
- Secondary match: Fuzzy name matching + city
- Conflict resolution: Manual review queue for ambiguous matches
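The secondary match can be sketched with standard-library fuzzy matching; the 0.85 similarity threshold is an assumption to be tuned against the review queue:
from difflib import SequenceMatcher

def name_city_match(isil_name: str, isil_city: str,
                    org_name: str, org_city: str,
                    threshold: float = 0.85) -> bool:
    """Fallback match: similar normalized names within the same city."""
    if isil_city.strip().lower() != org_city.strip().lower():
        return False
    similarity = SequenceMatcher(None, isil_name.lower(), org_name.lower()).ratio()
    return similarity >= threshold  # below threshold → manual review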
Output: Merged records with provenance tracking
@dataclass
class LinkedDutchInstitution:
    # Merged data from both sources
    institution_name: str
    isil_code: str
    # Provenance
    in_isil_registry: bool
    in_org_registry: bool
    matched_on: str  # 'isil_code', 'name_city', 'manual'
    # All fields from both records...
Phase 2: LinkML Schema Mapping
2.1 Schema Design for Existing Data
Heritage Institution Base Class:
# schemas/heritage_custodian.yaml
classes:
  HeritageCustodian:
    slots:
      - name
      - institution_types
      - isil_code
      - geographic_coverage
      - digital_platforms
      - parent_organization
      - consortium_memberships

  DutchHeritageCustodian:
    is_a: HeritageCustodian
    mixins:
      - TOOIOrganization  # Dutch government org ontology
    slots:
      - kvk_number  # Dutch-specific
      - province
      - municipality
      - versnellen_participation
      - collectie_nederland_participation
      - collection_management_system
2.2 Mapping Rules
ISIL Registry → LinkML:
mappings:
  ISILRecord_to_HeritageCustodian:
    institution_name: name
    isil_code: isil_code
    city: geographic_coverage.municipality
    assigned_date: identifiers[isil].assigned_date
    institution_type_hint: institution_types  # with inference
Dutch Organizations Registry → LinkML:
mappings:
  DutchOrganizationRecord_to_DutchHeritageCustodian:
    organization_name: name
    organization_type: institution_types
    parent_organization: parent_organization
    isil_code: isil_code
    kvk_number: identifiers[kvk].value
    website_url: digital_platforms[0].url
    collection_management_system: technical_infrastructure.cms
    consortium_memberships: organizational_relationships.consortia
    # ... (40+ field mappings)
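One way to apply such mapping tables generically, assuming dotted paths denote nested keys (bracketed segments like identifiers[kvk] would need extra handling, omitted here):
def apply_mapping(record: dict, mapping: dict) -> dict:
    """Copy record fields into a nested target dict following dotted paths."""
    target = {}
    for source_field, target_path in mapping.items():
        value = record.get(source_field)
        if value is None:
            continue
        node = target
        *parents, leaf = target_path.split('.')
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return target

# e.g. apply_mapping({'city': 'Alkmaar'}, {'city': 'geographic_coverage.municipality'})
# → {'geographic_coverage': {'municipality': 'Alkmaar'}}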
2.3 Instance Generation
Process:
- Parse CSV → Python dataclass
- Normalize/enrich data
- Map to LinkML schema
- Validate against schema
- Generate instances (YAML/JSON-LD)
- Store in RDF graph
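The validation step might look like the following sketch, assuming the linkml package's validator API (linkml >= 1.5) and the schemas/heritage_custodian.yaml path used above:
from linkml.validator import validate  # pip install linkml

def validate_instance(instance: dict) -> bool:
    """Validate one generated instance against the HeritageCustodian class."""
    report = validate(
        instance,
        "schemas/heritage_custodian.yaml",
        target_class="HeritageCustodian",
    )
    for result in report.results:
        print(f"{result.severity}: {result.message}")
    return not report.results  # an empty results list means the instance is valid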
Output Structure:
output/
├── dutch_institutions/
│ ├── instances/
│ │ ├── NL-AsdSAA.yaml # Stadsarchief Amsterdam
│ │ ├── NL-AsdRM.jsonld # Rijksmuseum
│ │ └── ...
│ ├── merged/
│ │ └── dutch_institutions.ttl # All as RDF
│ └── stats/
│ └── coverage_report.json
Phase 3: Integration with Conversation-Mined Data
3.1 Dual Data Sources
Data Flow:
┌─────────────────────┐     ┌──────────────────────┐
│   Existing CSVs     │     │  Conversation JSONs  │
│  - ISIL registry    │     │  - Global research   │
│  - Dutch orgs       │     │  - 139 countries     │
└──────────┬──────────┘     └──────────┬───────────┘
           │                           │
           ▼                           ▼
    ┌──────────────┐          ┌─────────────────┐
    │  CSV Parser  │          │ NLP Extraction  │
    │ & Normalizer │          │    Pipeline     │
    └──────┬───────┘          └────────┬────────┘
           │                           │
           └─────────────┬─────────────┘
                         │
                         ▼
     ┌────────────────────────────────────────┐
     │       LinkML Instance Generator        │
     │  - Validate against schema             │
     │  - Merge overlapping records           │
     │  - Track provenance                    │
     └───────────────────┬────────────────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │    Global GLAM      │
              │      Dataset        │
              │  (RDF + exports)    │
              └─────────────────────┘
3.2 Data Fusion Rules
Scenario 1: Institution in both CSV and conversation data
# Example: the Rijksmuseum appears in both the ISIL registry and
# Dutch conversations about museum digitization
def merge_institution(csv_record: LinkedDutchInstitution,
                      extracted_record: ExtractedInstitution) -> HeritageCustodian:
    """
    Merge data from the authoritative CSV and the mined conversation
    """
    merged = HeritageCustodian()

    # Prefer CSV for authoritative fields
    merged.isil_code = csv_record.isil_code     # Authoritative
    merged.name = csv_record.organization_name  # Official name
    merged.kvk_number = csv_record.kvk_number   # Official

    # Prefer conversation data for URLs (may be more current)
    merged.website_url = (extracted_record.urls[0]
                          if extracted_record.urls
                          else csv_record.website_url)

    # Merge collections (union of both sources)
    merged.collections = list(set(
        csv_record.specialized_networks +     # From CSV
        extracted_record.collection_subjects  # From NLP
    ))

    # Track provenance
    merged.provenance = [
        ProvenanceRecord(source="isil_registry", confidence=1.0),
        ProvenanceRecord(source="conversation_extraction", confidence=0.85),
    ]
    return merged
Scenario 2: Institution only in CSV
# Example: a small local archive in the ISIL registry but not mentioned
# in conversations (yet)
def csv_only_institution(csv_record: LinkedDutchInstitution) -> HeritageCustodian:
    """
    Convert a CSV-only record to a LinkML instance
    """
    # All data from CSV
    # Mark as authoritative but with limited metadata
    # May be enriched later by web crawling
Scenario 3: Institution only in conversation data
# Example: a Brazilian museum mentioned in conversations but with no
# authoritative registry entry
def conversation_only_institution(extracted: ExtractedInstitution) -> HeritageCustodian:
    """
    Use conversation-mined data with appropriate confidence scores
    """
    # All data from NLP extraction
    # Lower confidence scores
    # Requires validation via web crawling
3.3 Data Completeness Scoring
from dataclasses import dataclass

@dataclass
class CompletenessScore:
    """Track data completeness per institution"""
    has_name: bool = True  # Always required
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False

    @property
    def score(self) -> float:
        """Completeness as a fraction (0.0-1.0) of filled fields"""
        # Properties are not dataclass fields, so no filtering is needed
        names = list(self.__dataclass_fields__)
        return sum(getattr(self, n) for n in names) / len(names)
Target Completeness:
- CSV-sourced institutions: >80% completeness expected
- Conversation-mined institutions: >60% completeness target
- Combined sources: >90% completeness ideal
Phase 4: Data Enhancement Pipeline
4.1 Geocoding Addresses
For CSV data:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderServiceError

def geocode_dutch_addresses(records: List[DutchOrganizationRecord]):
    """
    Geocode full street addresses from the Dutch org registry
    """
    geocoder = Nominatim(user_agent="glam-dataset-builder")
    for record in records:
        if record.street_address:
            full_address = f"{record.street_address}, {record.city}, Netherlands"
            try:
                location = geocoder.geocode(full_address)
                if location is not None:
                    record.coordinates = (location.latitude, location.longitude)
                    continue
            except GeocoderServiceError:
                pass
            # Fall back to city-level geocoding
Advantage: the CSV data contains precise street addresses, which enables high-quality geocoding.
4.2 URL Validation & Crawling
For CSV URLs:
async def validate_and_crawl_csv_urls(records: List[DutchOrganizationRecord]):
    """
    Validate URLs from the CSV and crawl them for additional metadata
    """
    for record in records:
        if record.website_url:
            # Validate that the URL is accessible
            status = await check_url(record.website_url)
            record.url_status = status
            if status == 200:
                # Crawl for Schema.org metadata, OpenGraph, etc.
                metadata = await crawl4ai.extract_metadata(record.website_url)
                # Enrich the record with crawled data
                if metadata.get('description'):
                    record.description = metadata['description']
                if metadata.get('collections'):
                    record.collections.extend(metadata['collections'])
4.3 Wikidata Linking
Strategy:
- Search Wikidata for institution name + city
- Filter by instance type (Q33506 museum, Q166118 archive, Q7075 library, etc.)
- Validate match quality
- Extract additional identifiers (VIAF, GND, etc.)
def link_to_wikidata(institution: HeritageCustodian, city_qid: str) -> Optional[str]:
    """
    Find the Wikidata entity for an institution
    """
    query = f"""
    SELECT ?item WHERE {{
      ?item rdfs:label "{institution.name}"@nl .
      ?item wdt:P31/wdt:P279* wd:Q33506 .  # instance of museum (or subclass)
      ?item wdt:P131 wd:{city_qid} .       # located in administrative territory
    }}
    """
    # Returns the Wikidata QID of the best match
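Executing that query against the public Wikidata endpoint could use SPARQLWrapper, sketched here:
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

def run_wikidata_query(query: str) -> list:
    """Run a SPARQL query and return matching Wikidata QIDs."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="glam-dataset-builder/0.1")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["item"]["value"].rsplit("/", 1)[-1]  # entity URI → QID
            for b in results["results"]["bindings"]]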
Phase 5: Quality Assurance
5.1 Data Validation
CSV Data Checks:
import re
import validators

def validate_csv_data(record: DutchOrganizationRecord) -> List[ValidationIssue]:
    issues = []

    # Required fields
    if not record.organization_name:
        issues.append(ValidationIssue("Missing organization name", "error"))

    # ISIL format
    if record.isil_code and not re.match(r'^NL-[A-Za-z0-9]+$', record.isil_code):
        issues.append(ValidationIssue("Invalid ISIL format", "warning"))

    # URL validity
    if record.website_url and not validators.url(record.website_url):
        issues.append(ValidationIssue("Invalid URL format", "warning"))

    # Curator notes review (compare lowercase against lowercase)
    if record.curator_notes and "geen kvk" in record.curator_notes.lower():
        issues.append(ValidationIssue("No KvK registration", "info"))

    return issues
5.2 Duplicate Detection
Within CSV data:
- Check for duplicate ISIL codes
- Check for duplicate names + addresses
- Flag for manual review
Across datasets:
- Match CSV institutions to conversation-mined data
- Detect near-duplicates with fuzzy matching
- Merge or flag for review
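A sketch of both checks, assuming ISILRecord-style attributes on the input records:
from collections import Counter
from difflib import SequenceMatcher

def find_duplicates(records: list) -> list:
    """Return human-readable duplicate warnings for the review queue."""
    warnings = []
    # Exact duplicates: the same ISIL code assigned more than once
    for code, count in Counter(r.isil_code for r in records if r.isil_code).items():
        if count > 1:
            warnings.append(f"ISIL {code} appears {count} times")
    # Near-duplicates: very similar names in the same city
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a.city == b.city and SequenceMatcher(
                    None, a.institution_name.lower(),
                    b.institution_name.lower()).ratio() > 0.9:
                warnings.append(f"Possible duplicate in {a.city}: "
                                f"{a.institution_name!r} / {b.institution_name!r}")
    return warnings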
5.3 Manual Review Interface
Flagged Cases (from curator notes in CSV):
- Institutions without KvK registration
- Complex organizational hierarchies (museum part of municipality)
- Name discrepancies (official name vs. popular name)
- Missing parent organization links
Review Queue:
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewItem:
    institution_id: str
    issue_type: str  # 'duplicate', 'missing_data', 'conflict', 'ambiguous'
    description: str
    suggested_resolution: Optional[str]
    manual_decision: Optional[str]
Phase 6: Export & Integration
6.1 Standalone CSV Dataset Exports
Dutch Heritage Institutions Export:
output/dutch_institutions/
├── dutch_heritage_institutions.csv # Flattened CSV
├── dutch_heritage_institutions.ttl # RDF Turtle
├── dutch_heritage_institutions.jsonld # JSON-LD
├── metadata.yaml # Dataset metadata
└── README.md # Documentation
Statistics:
- Total institutions: ~400 (estimated)
- With ISIL codes: ~300
- With websites: ~350
- With geocoded addresses: ~380
- Participating in Versnellen: ~100
- In Archives Portal Europe: ~150
6.2 Integration into Global Dataset
Global GLAM Dataset Structure:
output/global_glam_dataset/
├── by_country/
│ ├── netherlands/
│ │ ├── authoritative/ # From CSV registries
│ │ │ └── nl_institutions.ttl
│ │ ├── conversation_mined/ # From Dutch conversations
│ │ │ └── nl_mined.ttl
│ │ └── merged/ # Fused data
│ │ └── nl_complete.ttl
│ ├── brazil/
│ │ └── conversation_mined/ # No authoritative CSV available
│ │ └── br_institutions.ttl
│ └── ...
├── global_merged.ttl # All countries combined
└── statistics/
└── coverage_by_source.json
Provenance Tracking:
<https://glam-dataset.org/institution/NL-AsdRM> a hc:HeritageCustodian ;
    schema:name "Rijksmuseum"@nl ;
    hc:isil_code "NL-AsdRM" ;
    prov:wasDerivedFrom [
        a prov:Entity ;
        prov:hadPrimarySource <data/ISIL-codes_2025-08-01.csv> ;
        prov:generatedAtTime "2025-11-05T12:00:00Z"^^xsd:dateTime
    ] ,
    [
        a prov:Entity ;
        prov:hadPrimarySource <conversations/Dutch_GLAM_research.json> ;
        prov:generatedAtTime "2025-11-05T12:30:00Z"^^xsd:dateTime
    ] .
Data Governance
Data Sources Hierarchy (by authority)
Tier 1: Authoritative Registries (highest trust)
- ISIL code registry (official)
- National library/archive registries
- Government heritage databases
- Confidence score: 1.0
Tier 2: Curated Institutional Data (high trust)
- Dutch organizations CSV (curated by heritage professionals)
- Institutional websites (verified)
- Confidence score: 0.9
Tier 3: Conversation-Mined Data (medium trust)
- NLP-extracted from conversations
- Validated by web crawling
- Confidence score: 0.7-0.85
Tier 4: Inferred Data (lower trust)
- Geocoded from addresses
- Type inference from names
- Linked data from fuzzy matching
- Confidence score: 0.5-0.7
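These tiers could be encoded as a simple lookup used when emitting ProvenanceRecord entries; the values below are a sketch that collapses the stated ranges to single defaults:
# Default confidence per source; per-record adjustments can override
SOURCE_CONFIDENCE = {
    'isil_registry': 1.0,            # Tier 1: authoritative
    'dutch_org_registry': 0.9,       # Tier 2: curated
    'conversation_extraction': 0.8,  # Tier 3: mined (0.7-0.85 range)
    'geocoding': 0.6,                # Tier 4: inferred (0.5-0.7 range)
    'type_inference': 0.6,
    'fuzzy_linking': 0.55,
}

def confidence_for(source: str) -> float:
    """Look up the default confidence; unknown sources count as inferred."""
    return SOURCE_CONFIDENCE.get(source, 0.5)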
Data Update Strategy
CSV Data:
- Frequency: When new versions released (quarterly/annually)
- Process:
- Detect changes (diff against previous version)
- Re-process only changed records
- Update merged dataset
- Publish new version with changelog
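A sketch of the change-detection step, assuming ISIL codes are stable keys between versions:
import csv
import hashlib

def load_version(path: str) -> dict:
    """Map ISIL code → hash of the full row, for one CSV version."""
    with open(path, encoding='utf-8-sig', newline='') as f:
        return {
            row['ISIL code']: hashlib.sha256(
                '|'.join(v or '' for v in row.values()).encode('utf-8')
            ).hexdigest()
            for row in csv.DictReader(f)
        }

def diff_versions(old_path: str, new_path: str) -> dict:
    """Return added, removed, and changed ISIL codes between two versions."""
    old, new = load_version(old_path), load_version(new_path)
    return {
        'added': sorted(new.keys() - old.keys()),
        'removed': sorted(old.keys() - new.keys()),
        'changed': sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }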
Conversation Data:
- Frequency: As new conversations added
- Process: Incremental processing of new files
Web-Crawled Data:
- Frequency: Quarterly refresh of URLs
- Process: Re-crawl all URLs, update metadata
License Considerations
CSV Data:
- ISIL registry: Public data (check specific license)
- Dutch orgs CSV: Clarify license with data provider
- Recommended: Request CC0 or CC-BY 4.0
Derived Dataset:
- Combine data under compatible license
- Provide attribution to original sources
- Document provenance clearly
Implementation Priority
Phase 1 (Week 1): CSV Processing Infrastructure
- CSV parser with encoding fixes
- Data validation rules
- Output to Python dataclasses
- Basic statistics
Phase 2 (Week 2): LinkML Schema Integration
- Design schema extensions for CSV fields
- Implement mapping rules
- Generate LinkML instances
- Validate against schema
Phase 3 (Week 3-4): Enrichment
- Geocoding pipeline
- URL validation
- Wikidata linking
- Web crawling integration
Phase 4 (Week 5): Merging & Quality
- Implement data fusion logic
- Duplicate detection
- Conflict resolution
- Manual review queue
Phase 5 (Week 6): Export & Documentation
- Multi-format export
- Statistics generation
- Documentation
- Publication
Success Metrics
Quantitative
- Parse success rate: >99% of CSV records successfully parsed
- Geocoding success: >95% of addresses geocoded
- URL validity: >90% of URLs accessible
- Wikidata linking: >60% linked to Wikidata
- Merge rate: >85% of ISIL registry entries matched to Dutch org registry
Qualitative
- CSV data provides "ground truth" for Dutch institutions
- Conversation data adds international coverage
- Merged dataset superior to either source alone
- Clear provenance enables trust and verification
Open Questions
- License for Dutch organizations CSV: Need to confirm acceptable use and redistribution
- Update frequency: How often are CSV files updated by maintainers?
- KvK integration: Should we fetch additional data from Dutch Chamber of Commerce API?
- Parent organization resolution: How to handle complex hierarchies (museum owned by municipality)?
- Platform data usage: How to represent participation in platforms like "Versnellen" in LinkML schema?