# Global GLAM Dataset: Data Standardization Strategy
## Overview

This document describes how the existing structured datasets in the `data/` directory will be standardized, integrated, and fed into the Global GLAM dataset extraction pipeline alongside data mined from conversation files.
## Existing Data Assets

### 1. ISIL Codes Registry

**File**: `data/ISIL-codes_2025-08-01.csv`

#### Current Structure

```csv
Volgnr.,Plaats,Instelling,ISIL code,Toegekend op,Opmerking
1,Aalten,Nationaal Onderduikmuseum,NL-AtNOM,2021-03-17,
2,Alkmaar,Regionaal Archief Alkmaar,NL-AmrRAA,2009-08-18,
```
#### Data Characteristics

- **Scope**: Dutch heritage institutions with assigned ISIL codes
- **Coverage**: ~300+ institutions (estimated from the file)
- **Authority**: Official ISIL registry for the Netherlands
- **Fields**:
  - Sequential number (Volgnr.)
  - Place/city (Plaats)
  - Institution name (Instelling)
  - ISIL code (standardized identifier)
  - Assignment date (Toegekend op)
  - Remarks/notes (Opmerking)

#### Data Quality

- **Strengths**:
  - Authoritative source for Dutch ISIL codes
  - Standardized identifier format (NL-{code})
  - Official assignment dates
  - Clean, structured data
- **Limitations**:
  - Netherlands only (not global)
  - Only institutions with ISIL codes (a subset of all institutions)
  - Limited metadata (no URLs, collections, or types)
  - CSV encoding issues (stray quote marks)
### 2. Dutch Organizations Registry

**File**: `data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv`

#### Current Structure

```csv
Plaatsnaam bezoekadres,Straat en huisnummer bezoekadres,Organisatie,Koepelorganisatie,
Webadres organisatie,Type organisatie,Opmerkingen Inez,ISIL-code (NA),
Samenwerkingsverband / Platform,Systeem,Versnellen,Collectie Nederland,...
```
#### Data Characteristics

- **Scope**: Comprehensive list of Dutch heritage organizations and services
- **Coverage**: Province by province (Drenthe and Flevoland visible in the sample)
- **Richness**: Very detailed metadata about each institution
- **Fields** (40+ columns):
  - **Location**: Province, city, street address
  - **Organization**: Name, parent organization, website
  - **Classification**: Type (museum, archive, library, historical society)
  - **Identifiers**: ISIL code, Chamber of Commerce (KvK) number
  - **Platforms**: Participation in various heritage platforms/networks
  - **Systems**: Collection management systems used (Atlantis, MAIS Flexis, ZCBS, etc.)
  - **Projects**: Participation in digitization projects (Versnellen, Collectie Nederland, etc.)
  - **Registers**: Museum register, Rijkscollectie, Archives Portal Europe, etc.
  - **Specialized networks**: WO2Net, Modemuze, Maritiem Digitaal, etc.
  - **Linked data**: Participation in linked data initiatives
  - **Notes**: Extensive comments on organizational structure and data issues

#### Data Quality

- **Strengths**:
  - Extremely rich metadata
  - Real-world operational data (systems, platforms, projects)
  - Parent/child organization relationships
  - Network/consortium memberships
  - Curator notes on data quality issues
  - Physical addresses for geocoding
  - Web URLs for crawling
- **Limitations**:
  - Netherlands-specific (single country)
  - Mixed data quality (the notes themselves flag issues)
  - Some institutions lack KvK registration
  - Complex organizational relationships
  - Many empty cells (not all institutions participate in all platforms)
## Integration Strategy

### Phase 1: Standardization & Normalization

#### 1.1 ISIL Codes CSV Processing
**Parsing Steps**:

1. Fix CSV encoding issues (stray quote marks)
2. Parse dates to ISO 8601 format
3. Validate the ISIL code format (NL-{prefix}{suffix})
4. Normalize place names (standardize Dutch municipality names)
5. Extract institution type hints from names

**Output Schema**:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass
class ISILRecord:
    sequence_number: int
    city: str
    city_normalized: str              # Standardized municipality name
    institution_name: str
    institution_name_normalized: str  # Cleaned name
    isil_code: str
    assigned_date: date
    remarks: Optional[str]

    # Derived fields
    institution_type_hint: Optional[str] = None        # Extracted from name
    province: Optional[str] = None                     # Looked up from city
    coordinates: Optional[Tuple[float, float]] = None  # Geocoded
```

**Enrichment**:

- Geocode cities to coordinates
- Map cities to provinces
- Infer institution types from names (e.g., "Archief" → archive, "Museum" → museum)
- Link to Wikidata entities where possible
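The name-based type inference can be sketched as a keyword lookup over the institution name. The keyword table below is illustrative only; the real mapping would be driven by the controlled vocabulary in the LinkML schema.

```python
# Illustrative keyword → type table (not the project's final vocabulary).
TYPE_KEYWORDS = {
    "archief": "archive",
    "museum": "museum",
    "bibliotheek": "library",
    "historische vereniging": "historical_society",
}

def infer_institution_type(name: str):
    """Guess an institution type from Dutch keywords in its name."""
    lowered = name.lower()
    for keyword, inst_type in TYPE_KEYWORDS.items():
        if keyword in lowered:
            return inst_type
    return None  # No hint found; leave institution_type_hint unset
```

For example, "Regionaal Archief Alkmaar" yields `"archive"`, while a name without any keyword leaves the hint unset.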
#### 1.2 Dutch Organizations CSV Processing

**Parsing Steps**:

1. Parse the complex CSV with 40+ columns
2. Handle empty cells gracefully
3. Normalize organization types
4. Parse boolean flags (ja/nee → true/false)
5. Validate URLs
6. Parse ISIL codes
7. Extract system/platform names
8. Process curator notes
**Output Schema**:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DutchOrganizationRecord:
    # Core identity
    organization_name: str
    organization_type: str  # Normalized: 'museum', 'archive', 'library', etc.
    parent_organization: Optional[str]

    # Location
    province: str
    city: str
    street_address: str
    coordinates: Optional[Tuple[float, float]]

    # Digital presence
    website_url: Optional[str]

    # Identifiers
    isil_code: Optional[str]
    kvk_number: Optional[str]  # Dutch business registry

    # Platforms & networks
    consortium_memberships: List[str] = field(default_factory=list)
    platform_participations: List[str] = field(default_factory=list)

    # Technical systems
    collection_management_system: Optional[str] = None

    # Project participation
    versnellen_project: bool = False
    collectie_nederland: bool = False
    museum_register: bool = False
    rijkscollectie: bool = False
    bibliotheek_collectie: bool = False
    dc4eu: bool = False  # Digital Culture for Europe
    archives_portal_europe: bool = False
    wo2net: bool = False  # WWII network
    specialized_networks: List[str] = field(default_factory=list)  # Modemuze, Maritiem Digitaal, etc.

    # Linked data
    linked_data_participation: bool = False
    dataset_register: bool = False

    # Data quality
    curator_notes: Optional[str] = None
    data_issues: List[str] = field(default_factory=list)  # Extracted from notes
```

**Normalization**:

- Standardize organization types to a controlled vocabulary
- Parse platform/system names to canonical forms
- Convert "ja"/"nieuw"/"" to boolean/enum values
- Extract structured issues from free-text notes
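The "ja"/"nieuw"/empty conversion can be handled by a small parser. The three-valued enum is an assumption here; the project may prefer plain booleans for columns that never contain "nieuw".

```python
from enum import Enum

class Participation(Enum):
    YES = "yes"
    NEW = "new"  # "nieuw": joined recently / onboarding
    NO = "no"

def parse_flag(cell) -> Participation:
    """Map a raw CSV cell ("ja", "nieuw", "", or None) to a Participation value."""
    value = (cell or "").strip().lower()
    if value == "ja":
        return Participation.YES
    if value == "nieuw":
        return Participation.NEW
    return Participation.NO
```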
#### 1.3 Cross-Dataset Linking

**ISIL Code Matching**:

```python
def link_datasets(isil_records: List[ISILRecord],
                  org_records: List[DutchOrganizationRecord]) -> List[LinkedRecord]:
    """
    Link the ISIL registry with the Dutch organizations registry.
    """
    # Match on ISIL code (primary key)
    # Match on normalized organization name + city (fallback)
    # Flag conflicts and duplicates
    ...
```

**Linking Strategy**:

1. **Primary match**: ISIL code (if present in both datasets)
2. **Secondary match**: Fuzzy name matching + city
3. **Conflict resolution**: Manual review queue for ambiguous matches
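The secondary (fuzzy) match can be sketched with the standard library's `difflib`; the 0.85 threshold is a starting guess to be tuned against the manual review queue.

```python
from difflib import SequenceMatcher

def name_city_similarity(a_name: str, a_city: str,
                         b_name: str, b_city: str) -> float:
    """Name similarity in [0, 1], gated on the two cities matching exactly."""
    if a_city.strip().lower() != b_city.strip().lower():
        return 0.0
    return SequenceMatcher(None, a_name.lower(), b_name.lower()).ratio()

def is_secondary_match(a_name: str, a_city: str,
                       b_name: str, b_city: str,
                       threshold: float = 0.85) -> bool:
    """Candidate pair for merging (or for the review queue near the cutoff)."""
    return name_city_similarity(a_name, a_city, b_name, b_city) >= threshold
```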
**Output**: Merged records with provenance tracking

```python
@dataclass
class LinkedDutchInstitution:
    # Merged data from both sources
    institution_name: str
    isil_code: str

    # Provenance
    in_isil_registry: bool
    in_org_registry: bool
    matched_on: str  # 'isil_code', 'name_city', 'manual'

    # All fields from both records...
```
### Phase 2: LinkML Schema Mapping

#### 2.1 Schema Design for Existing Data

**Heritage Institution Base Class**:

```yaml
# schemas/heritage_custodian.yaml
classes:
  HeritageCustodian:
    slots:
      - name
      - institution_types
      - isil_code
      - geographic_coverage
      - digital_platforms
      - parent_organization
      - consortium_memberships

  DutchHeritageCustodian:
    is_a: HeritageCustodian
    mixins:
      - TOOIOrganization  # Dutch government organization ontology
    slots:
      - kvk_number  # Dutch-specific
      - province
      - municipality
      - versnellen_participation
      - collectie_nederland_participation
      - collection_management_system
```
#### 2.2 Mapping Rules

**ISIL Registry → LinkML**:

```yaml
mappings:
  ISILRecord_to_HeritageCustodian:
    institution_name: name
    isil_code: isil_code
    city: geographic_coverage.municipality
    assigned_date: identifiers[isil].assigned_date
    institution_type_hint: institution_types  # with inference
```
**Dutch Organizations Registry → LinkML**:

```yaml
mappings:
  DutchOrganizationRecord_to_DutchHeritageCustodian:
    organization_name: name
    organization_type: institution_types
    parent_organization: parent_organization
    isil_code: isil_code
    kvk_number: identifiers[kvk].value
    website_url: digital_platforms[0].url
    collection_management_system: technical_infrastructure.cms
    consortium_memberships: organizational_relationships.consortia
    # ... (40+ field mappings)
```
#### 2.3 Instance Generation

**Process**:

1. Parse CSV → Python dataclass
2. Normalize/enrich data
3. Map to the LinkML schema
4. Validate against the schema
5. Generate instances (YAML/JSON-LD)
6. Store in an RDF graph

**Output Structure**:

```
output/
├── dutch_institutions/
│   ├── instances/
│   │   ├── NL-AsdSAA.yaml    # Stadsarchief Amsterdam
│   │   ├── NL-AsdRM.jsonld   # Rijksmuseum
│   │   └── ...
│   ├── merged/
│   │   └── dutch_institutions.ttl  # All as RDF
│   └── stats/
│       └── coverage_report.json
```
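The process above can be sketched end to end with a minimal stand-in record. The `@id` URI pattern follows the provenance example later in this document; the stand-in class and field names are placeholders rather than the final LinkML slot names.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MinimalRecord:
    # Stand-in for the richer ISILRecord/DutchOrganizationRecord classes
    institution_name: str
    isil_code: str
    city: str

def to_instance_dict(record: MinimalRecord) -> dict:
    """Serialize a record to a dict ready for YAML/JSON-LD emission."""
    doc = asdict(record)
    doc["@id"] = f"https://glam-dataset.org/institution/{record.isil_code}"
    return doc

instance = to_instance_dict(
    MinimalRecord("Regionaal Archief Alkmaar", "NL-AmrRAA", "Alkmaar")
)
print(json.dumps(instance, indent=2, ensure_ascii=False))
```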
### Phase 3: Integration with Conversation-Mined Data

#### 3.1 Dual Data Sources

**Data Flow**:

```
┌─────────────────────┐          ┌──────────────────────┐
│  Existing CSVs      │          │  Conversation JSONs  │
│  - ISIL registry    │          │  - Global research   │
│  - Dutch orgs       │          │  - 139 countries     │
└──────────┬──────────┘          └──────────┬───────────┘
           │                                │
           ▼                                ▼
   ┌──────────────┐                ┌─────────────────┐
   │  CSV Parser  │                │  NLP Extraction │
   │ & Normalizer │                │    Pipeline     │
   └──────┬───────┘                └────────┬────────┘
          │                                 │
          ├─────────────────────────────────┤
          │                                 │
          ▼                                 ▼
┌────────────────────────────────────────┐
│       LinkML Instance Generator        │
│  - Validate against schema             │
│  - Merge overlapping records           │
│  - Track provenance                    │
└──────────────┬─────────────────────────┘
               │
               ▼
     ┌─────────────────────┐
     │    Global GLAM      │
     │      Dataset        │
     │  (RDF + exports)    │
     └─────────────────────┘
```
#### 3.2 Data Fusion Rules

**Scenario 1: Institution in both CSV and conversation data**

```python
# Example: the Rijksmuseum appears in both the ISIL registry and
# Dutch conversations about museum digitization.

def merge_institution(csv_record: LinkedDutchInstitution,
                      extracted_record: ExtractedInstitution) -> HeritageCustodian:
    """
    Merge data from the authoritative CSV and the mined conversation.
    """
    merged = HeritageCustodian()

    # Prefer CSV for authoritative fields
    merged.isil_code = csv_record.isil_code     # Authoritative
    merged.name = csv_record.organization_name  # Official name
    merged.kvk_number = csv_record.kvk_number   # Official

    # Prefer conversation data for URLs (may be more current)
    merged.website_url = (extracted_record.urls[0]
                          if extracted_record.urls else csv_record.website_url)

    # Merge collections (union of both sources)
    merged.collections = list(set(
        csv_record.specialized_networks +     # From CSV
        extracted_record.collection_subjects  # From NLP
    ))

    # Track provenance
    merged.provenance = [
        ProvenanceRecord(source="isil_registry", confidence=1.0),
        ProvenanceRecord(source="conversation_extraction", confidence=0.85),
    ]

    return merged
```
**Scenario 2: Institution only in CSV**

```python
# Example: a small local archive in the ISIL registry but not (yet)
# mentioned in conversations.

def csv_only_institution(csv_record: LinkedDutchInstitution) -> HeritageCustodian:
    """
    Convert a CSV-only record to a LinkML instance.
    """
    # All data comes from the CSV.
    # Mark as authoritative but with limited metadata.
    # May be enriched later by web crawling.
    ...
```

**Scenario 3: Institution only in conversation data**

```python
# Example: a Brazilian museum mentioned in conversations with no
# authoritative registry entry.

def conversation_only_institution(extracted: ExtractedInstitution) -> HeritageCustodian:
    """
    Use conversation-mined data with appropriate confidence scores.
    """
    # All data comes from NLP extraction.
    # Assign lower confidence scores.
    # Requires validation via web crawling.
    ...
```
#### 3.3 Data Completeness Scoring

```python
from dataclasses import dataclass

@dataclass
class CompletenessScore:
    """Track data completeness per institution."""
    has_name: bool = True  # Always required
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False

    @property
    def score(self) -> float:
        """Completeness as a fraction of tracked fields."""
        # __dataclass_fields__ lists only the bool fields; properties
        # such as `score` itself are not included.
        fields = list(self.__dataclass_fields__)
        return sum(getattr(self, f) for f in fields) / len(fields)
```

**Target Completeness**:

- CSV-sourced institutions: >80% completeness expected
- Conversation-mined institutions: >60% completeness target
- Combined sources: >90% completeness ideal
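A quick check of the scoring logic; the class is repeated here so the example stands alone.

```python
from dataclasses import dataclass

@dataclass
class CompletenessScore:
    # Mirror of the fields sketched in §3.3 so this example is self-contained
    has_name: bool = True
    has_type: bool = False
    has_location: bool = False
    has_isil: bool = False
    has_website: bool = False
    has_collections: bool = False
    has_systems: bool = False
    has_coordinates: bool = False

    @property
    def score(self) -> float:
        fields = list(self.__dataclass_fields__)
        return sum(getattr(self, f) for f in fields) / len(fields)

# A typical CSV-sourced record: name, type, location, ISIL, and website known
csv_sourced = CompletenessScore(
    has_type=True, has_location=True, has_isil=True, has_website=True
)
print(csv_sourced.score)  # → 0.625 (5 of 8 tracked fields present)
```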
### Phase 4: Data Enhancement Pipeline

#### 4.1 Geocoding Addresses

**For CSV data**:

```python
from typing import List

from geopy.exc import GeocoderServiceError
from geopy.geocoders import Nominatim

def geocode_dutch_addresses(records: List[DutchOrganizationRecord]):
    """
    Geocode full street addresses from the Dutch org registry.
    """
    geocoder = Nominatim(user_agent="glam-dataset-builder")

    for record in records:
        if record.street_address:
            full_address = f"{record.street_address}, {record.city}, Netherlands"
            try:
                location = geocoder.geocode(full_address)
                if location is not None:
                    record.coordinates = (location.latitude, location.longitude)
            except GeocoderServiceError:
                # Fall back to city-level geocoding
                pass
```

**Advantage**: the CSV data has precise street addresses, which enables high-quality geocoding.
#### 4.2 URL Validation & Crawling

**For CSV URLs**:

```python
async def validate_and_crawl_csv_urls(records: List[DutchOrganizationRecord]):
    """
    Validate URLs from the CSV and crawl them for additional metadata.
    """
    for record in records:
        if record.website_url:
            # Validate that the URL is accessible
            status = await check_url(record.website_url)
            record.url_status = status

            if status == 200:
                # Crawl for Schema.org metadata, OpenGraph, etc.
                metadata = await crawl4ai.extract_metadata(record.website_url)

                # Enrich the record with crawled data
                if metadata.get('description'):
                    record.description = metadata['description']
                if metadata.get('collections'):
                    record.collections.extend(metadata['collections'])
```
#### 4.3 Wikidata Linking

**Strategy**:

1. Search Wikidata for institution name + city
2. Filter by instance type (Q33506, Q166118, Q7075, etc.)
3. Validate match quality
4. Extract additional identifiers (VIAF, GND, etc.)

```python
def link_to_wikidata(institution: HeritageCustodian,
                     type_qid: str, city_qid: str) -> Optional[str]:
    """
    Find the Wikidata entity for an institution.
    """
    query = f"""
    SELECT ?item WHERE {{
        ?item rdfs:label "{institution.name}"@nl .
        ?item wdt:P31/wdt:P279* wd:{type_qid} .  # instance of a GLAM type (or subclass)
        ?item wdt:P131 wd:{city_qid} .           # located in administrative territory
    }}
    """
    # Execute against the Wikidata SPARQL endpoint; returns a QID or None
```
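Building the query through a helper keeps the quoting safe, since institution names can contain quote characters. The QIDs in the usage line are illustrative assumptions and should be double-checked against Wikidata before use.

```python
def build_wikidata_query(label: str, type_qid: str, city_qid: str) -> str:
    """Assemble the SPARQL lookup, escaping double quotes so the label
    cannot break out of the string literal."""
    safe_label = label.replace('"', '\\"')
    return f"""
SELECT ?item WHERE {{
  ?item rdfs:label "{safe_label}"@nl .
  ?item wdt:P31/wdt:P279* wd:{type_qid} .
  ?item wdt:P131 wd:{city_qid} .
}}
"""

# Illustrative QIDs (assumed: archive-type institution, Amsterdam)
query = build_wikidata_query("Stadsarchief Amsterdam", "Q166118", "Q727")
```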
### Phase 5: Quality Assurance

#### 5.1 Data Validation

**CSV Data Checks**:

```python
import re

import validators  # third-party URL validation package

def validate_csv_data(record: DutchOrganizationRecord) -> List[ValidationIssue]:
    issues = []

    # Required fields
    if not record.organization_name:
        issues.append(ValidationIssue("Missing organization name", "error"))

    # ISIL format
    if record.isil_code and not re.match(r'NL-[A-Za-z0-9]+', record.isil_code):
        issues.append(ValidationIssue("Invalid ISIL format", "warning"))

    # URL validity
    if record.website_url and not validators.url(record.website_url):
        issues.append(ValidationIssue("Invalid URL format", "warning"))

    # Curator notes review ("geen KvK" = no KvK registration; compare
    # against the lowercased needle so the lowercased haystack can match)
    if record.curator_notes and "geen kvk" in record.curator_notes.lower():
        issues.append(ValidationIssue("No KvK registration", "info"))

    return issues
```
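One caveat on the ISIL check above: `re.match` anchors only at the start of the string, so trailing junk still passes. A `fullmatch` variant rejects it:

```python
import re

# Anchored variant of the pattern used in validate_csv_data
ISIL_RE = re.compile(r"NL-[A-Za-z0-9]+")

def is_valid_dutch_isil(code: str) -> bool:
    """True only if the whole string is a well-formed Dutch ISIL code."""
    return ISIL_RE.fullmatch(code) is not None
```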
#### 5.2 Duplicate Detection

**Within CSV data**:

- Check for duplicate ISIL codes
- Check for duplicate names + addresses
- Flag for manual review

**Across datasets**:

- Match CSV institutions to conversation-mined data
- Detect near-duplicates with fuzzy matching
- Merge, or flag for review
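Duplicate ISIL detection is a simple grouping pass; plain dicts stand in for the record classes so this sketch is self-contained.

```python
from collections import defaultdict

def find_duplicate_isils(records):
    """Group records by ISIL code and return only the codes that occur
    more than once, mapped to their colliding records."""
    by_code = defaultdict(list)
    for rec in records:
        code = rec.get("isil_code")
        if code:  # skip records without an ISIL code
            by_code[code].append(rec)
    return {code: group for code, group in by_code.items() if len(group) > 1}
```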
#### 5.3 Manual Review Interface

**Flagged Cases** (from curator notes in the CSV):

- Institutions without KvK registration
- Complex organizational hierarchies (e.g., a museum that is part of a municipality)
- Name discrepancies (official name vs. popular name)
- Missing parent organization links

**Review Queue**:

```python
@dataclass
class ReviewItem:
    institution_id: str
    issue_type: str  # 'duplicate', 'missing_data', 'conflict', 'ambiguous'
    description: str
    suggested_resolution: Optional[str]
    manual_decision: Optional[str]
```
### Phase 6: Export & Integration

#### 6.1 Standalone CSV Dataset Exports

**Dutch Heritage Institutions Export**:

```
output/dutch_institutions/
├── dutch_heritage_institutions.csv     # Flattened CSV
├── dutch_heritage_institutions.ttl     # RDF Turtle
├── dutch_heritage_institutions.jsonld  # JSON-LD
├── metadata.yaml                       # Dataset metadata
└── README.md                           # Documentation
```

**Statistics** (estimated):

- Total institutions: ~400
- With ISIL codes: ~300
- With websites: ~350
- With geocoded addresses: ~380
- Participating in Versnellen: ~100
- In Archives Portal Europe: ~150
#### 6.2 Integration into the Global Dataset

**Global GLAM Dataset Structure**:

```
output/global_glam_dataset/
├── by_country/
│   ├── netherlands/
│   │   ├── authoritative/          # From CSV registries
│   │   │   └── nl_institutions.ttl
│   │   ├── conversation_mined/     # From Dutch conversations
│   │   │   └── nl_mined.ttl
│   │   └── merged/                 # Fused data
│   │       └── nl_complete.ttl
│   ├── brazil/
│   │   └── conversation_mined/     # No authoritative CSV available
│   │       └── br_institutions.ttl
│   └── ...
├── global_merged.ttl               # All countries combined
└── statistics/
    └── coverage_by_source.json
```

**Provenance Tracking**:

```turtle
<https://glam-dataset.org/institution/NL-AsdRM> a hc:HeritageCustodian ;
    schema:name "Rijksmuseum"@nl ;
    hc:isil_code "NL-AsdRM" ;
    prov:wasDerivedFrom [
        a prov:Entity ;
        prov:hadPrimarySource <data/ISIL-codes_2025-08-01.csv> ;
        prov:generatedAtTime "2025-11-05T12:00:00Z"^^xsd:dateTime
    ] ,
    [
        a prov:Entity ;
        prov:hadPrimarySource <conversations/Dutch_GLAM_research.json> ;
        prov:generatedAtTime "2025-11-05T12:30:00Z"^^xsd:dateTime
    ] .
```
## Data Governance

### Data Sources Hierarchy (by authority)

**Tier 1: Authoritative Registries** (highest trust)

- ISIL code registry (official)
- National library/archive registries
- Government heritage databases
- Confidence score: 1.0

**Tier 2: Curated Institutional Data** (high trust)

- Dutch organizations CSV (curated by heritage professionals)
- Institutional websites (verified)
- Confidence score: 0.9

**Tier 3: Conversation-Mined Data** (medium trust)

- NLP-extracted from conversations
- Validated by web crawling
- Confidence score: 0.7-0.85

**Tier 4: Inferred Data** (lower trust)

- Geocoded from addresses
- Type inference from names
- Linked data from fuzzy matching
- Confidence score: 0.5-0.7
### Data Update Strategy

**CSV Data**:

- **Frequency**: When new versions are released (quarterly/annually)
- **Process**:
  1. Detect changes (diff against the previous version)
  2. Re-process only changed records
  3. Update the merged dataset
  4. Publish a new version with a changelog

**Conversation Data**:

- **Frequency**: As new conversations are added
- **Process**: Incremental processing of new files

**Web-Crawled Data**:

- **Frequency**: Quarterly refresh of URLs
- **Process**: Re-crawl all URLs and update metadata

### License Considerations

**CSV Data**:

- ISIL registry: Public data (check the specific license)
- Dutch orgs CSV: Clarify the license with the data provider
- Recommended: Request CC0 or CC-BY 4.0

**Derived Dataset**:

- Combine data under a compatible license
- Provide attribution to original sources
- Document provenance clearly
## Implementation Priority

### Phase 1 (Week 1): CSV Processing Infrastructure

- CSV parser with encoding fixes
- Data validation rules
- Output to Python dataclasses
- Basic statistics

### Phase 2 (Week 2): LinkML Schema Integration

- Design schema extensions for CSV fields
- Implement mapping rules
- Generate LinkML instances
- Validate against the schema

### Phase 3 (Weeks 3-4): Enrichment

- Geocoding pipeline
- URL validation
- Wikidata linking
- Web crawling integration

### Phase 4 (Week 5): Merging & Quality

- Implement data fusion logic
- Duplicate detection
- Conflict resolution
- Manual review queue

### Phase 5 (Week 6): Export & Documentation

- Multi-format export
- Statistics generation
- Documentation
- Publication
## Success Metrics

### Quantitative

- **Parse success rate**: >99% of CSV records successfully parsed
- **Geocoding success**: >95% of addresses geocoded
- **URL validity**: >90% of URLs accessible
- **Wikidata linking**: >60% of institutions linked to Wikidata
- **Merge rate**: >85% of ISIL registry entries matched to the Dutch org registry

### Qualitative

- CSV data provides "ground truth" for Dutch institutions
- Conversation data adds international coverage
- The merged dataset is superior to either source alone
- Clear provenance enables trust and verification
## Open Questions

1. **License for the Dutch organizations CSV**: Confirm acceptable use and redistribution.
2. **Update frequency**: How often are the CSV files updated by their maintainers?
3. **KvK integration**: Should we fetch additional data from the Dutch Chamber of Commerce API?
4. **Parent organization resolution**: How do we handle complex hierarchies (e.g., a museum owned by a municipality)?
5. **Platform data representation**: How do we represent participation in platforms like "Versnellen" in the LinkML schema?