NL Custodian Entity Extraction - Final Report
Date: 2025-12-13
Project: GLAM Heritage Custodian Data Enrichment
Executive Summary
Processed 1,865 Dutch heritage custodian files and extracted named entities from 1,454 of them (78%). About 14% of the 13,433 extracted entity instances are now linked to Wikidata, with linking focused on high-frequency entities.
Pipeline Statistics
Source Data
| Metric | Count |
|---|---|
| Total NL custodian files | 1,865 |
| Files with web archives | 1,771 (95%) |
| Files with annotations | 1,737 (93%) |
| Files with extracted entities | 1,454 (78%) |
Entity Extraction
| Metric | Count |
|---|---|
| Total entities extracted | 13,433 |
| Unique entity names | ~2,500 |
| Average entities per file | 9.2 |
| Low-quality entities removed | 1,685 |
Wikidata Linking
| Metric | Count |
|---|---|
| Wikidata reference mappings | 58 |
| Entity instances linked | 1,928+ |
| Unique entity names linked | 58 |
| Coverage rate | 14% |
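The coverage rate appears to be the share of extracted entity instances that received a Wikidata link, not the share of unique names. A minimal sketch of that arithmetic, assuming the reported lower bound of 1,928 linked instances is the numerator (the formula is an inference from the tables, not something the report states):

```python
# Figures are taken from the tables above; the formulas are an assumption.
total_entities = 13_433      # total entities extracted
linked_instances = 1_928     # entity instances linked (reported as "1,928+")
files_with_entities = 1_454  # files with extracted entities

coverage = linked_instances / total_entities
avg_per_file = total_entities / files_with_entities

print(f"Coverage rate: {coverage:.1%}")                   # Coverage rate: 14.4%
print(f"Average entities per file: {avg_per_file:.1f}")   # Average entities per file: 9.2
```

Measured by unique names instead, 58 of roughly 2,500 would be about 2%, so the 14% figure is best read as instance-level coverage.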
Wikidata Reference Categories
| Category | Count | Examples |
|---|---|---|
| Countries | 1 | Nederland (+ aliases) |
| Provinces | 12 | All Dutch provinces |
| Regions | 1 | Twente |
| Settlements | 27 | Amsterdam, Den Haag, Zwolle, etc. |
| Heritage Institutions | 7 | Rijksmuseum, Eye Filmmuseum, etc. |
| Government | 1 | Ministerie van Algemene Zaken |
| Historical Events | 1 | Tweede Wereldoorlog |
| Total | 50 | |
Top Linked Entities (by frequency)
| Wikidata ID | Label | Count |
|---|---|---|
| Q55 | Netherlands | 232 |
| Q9899 | Amsterdam | 55 |
| Q36600 | The Hague | 25 |
| Q752 | Groningen | 21 |
| Q81652 | Zwolle | 19 |
| Q43631 | Leiden | 16 |
| Q34370 | Rotterdam | 15 |
| Q39297 | Utrecht (city) | 14 |
| Q1455944 | Twente | 14 |
| Q1101 | North Brabant | 14 |
Entity Types Extracted
- TOP.SET - Settlements (cities, towns)
- TOP.CTY - Countries
- TOP.ADM - Administrative regions (provinces)
- TOP.REG - Geographic regions
- GRP.HER - Heritage institutions
- GRP.GOV - Government organizations
- TMP.EVT - Historical events
Data Quality Notes
- Cleaned entities: Removed 1,685 low-quality entities, including:
  - Language codes (nl, en, de)
  - Numeric-only values
  - Generic labels (collectie, geschiedenis)
  - Control characters
- YAML fixes: Fixed 5 files with escape sequence errors
- Unlinked high-frequency terms: Many of the remaining unlinked entities are generic Dutch terms not suitable for Wikidata linking:
  - collectie, Collectie (collection)
  - geschiedenis (history)
  - Bibliotheek (library)
  - Gemeentearchief (municipal archive)
  - tentoonstellingen (exhibitions)
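The four cleanup rules listed above can be sketched as a single filter function. This is an illustration only: the function name `is_low_quality` and the two rule sets are hypothetical, and the actual logic in scripts/cleanup_entities.py may differ.

```python
import unicodedata

# Hypothetical rule sets; the real lists in scripts/cleanup_entities.py may differ.
LANGUAGE_CODES = {"nl", "en", "de"}
GENERIC_LABELS = {"collectie", "geschiedenis"}


def is_low_quality(name: str) -> bool:
    """Return True if a name is empty or matches one of the four removal rules."""
    text = name.strip()
    if not text:
        return True
    # Control characters (Unicode category Cc), e.g. a stray \x07 or \x1b.
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        return True
    lowered = text.lower()
    if lowered in LANGUAGE_CODES or lowered in GENERIC_LABELS:
        return True
    # Numeric-only values, allowing common separators such as "1.865" or "2025-12".
    has_digit = any(ch.isdigit() for ch in text)
    only_numeric = all(ch.isdigit() or ch in ".,-/ " for ch in text)
    return has_digit and only_numeric


candidates = ["Amsterdam", "nl", "2025", "Collectie", "Rijks\x07museum", "Twente"]
kept = [c for c in candidates if not is_low_quality(c)]
print(kept)  # ['Amsterdam', 'Twente']
```

Matching on the lowercased name treats "collectie" and "Collectie" as the same generic label, which is consistent with both spellings appearing in the unlinked list.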
Files and Locations
| Resource | Location |
|---|---|
| Custodian files | data/custodian/NL-*.yaml |
| Wikidata reference | data/reference/wikidata_entity_links.yaml |
| Apply links script | scripts/apply_wikidata_links.py |
| Entity cleanup script | scripts/cleanup_entities.py |
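To show how a reference file and a linking script of this kind might fit together, here is a minimal sketch of an exact-match lookup. The record layout (`qid`, `label`, `aliases`), the entity fields (`name`, `type`, `wikidata_id`), and both function names are assumptions made for illustration. The schema of wikidata_entity_links.yaml and the behavior of apply_wikidata_links.py are not described in this report.

```python
# Hypothetical in-memory form of the reference file: one record per Wikidata
# item, with the label and aliases that should resolve to it.
REFERENCE = [
    {"qid": "Q55", "label": "Netherlands", "aliases": ["Nederland"]},
    {"qid": "Q9899", "label": "Amsterdam", "aliases": []},
    {"qid": "Q1455944", "label": "Twente", "aliases": []},
]


def build_lookup(reference):
    """Map every label and alias, case-folded, to its Wikidata ID."""
    lookup = {}
    for record in reference:
        for name in [record["label"], *record["aliases"]]:
            lookup[name.casefold()] = record["qid"]
    return lookup


def link_entities(entities, lookup):
    """Attach 'wikidata_id' to each exactly matching entity; return the count."""
    linked = 0
    for entity in entities:
        qid = lookup.get(entity["name"].strip().casefold())
        if qid is not None:
            entity["wikidata_id"] = qid
            linked += 1
    return linked


# Illustrative entities; the type codes follow the list in this report.
entities = [
    {"name": "Nederland", "type": "TOP.CTY"},
    {"name": "Amsterdam", "type": "TOP.SET"},
    {"name": "Gemeentearchief", "type": "GRP.HER"},
]
lookup = build_lookup(REFERENCE)
count = link_entities(entities, lookup)
print(f"{count} of {len(entities)} entities linked")  # 2 of 3 entities linked
```

Exact matching on a curated list keeps false links rare at the cost of coverage, which fits the report's pattern of a small reference set linking a large share of high-frequency instances.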
Next Steps
- Expand Wikidata coverage: add more Dutch settlements and heritage institutions
- Entity resolution: link entities across custodian files
- Quality validation: cross-reference with authoritative sources
- Export to RDF: generate linked data exports