# NL Custodian Entity Extraction - Final Report

**Date:** 2025-12-13
**Project:** GLAM Heritage Custodian Data Enrichment

## Executive Summary

Extracted and validated named entities from 1,454 of 1,865 Dutch heritage custodian files (78%). Wikidata linking covers 14% of the 13,433 extracted entity instances (1,928+), concentrated on high-frequency entities.

## Pipeline Statistics

### Source Data

| Metric | Count |
|--------|-------|
| Total NL custodian files | 1,865 |
| Files with web archives | 1,771 (95%) |
| Files with annotations | 1,737 (93%) |
| Files with extracted entities | 1,454 (78%) |

### Entity Extraction

| Metric | Value |
|--------|-------|
| Total entities extracted | 13,433 |
| Unique entity names | ~2,500 |
| Average entities per file | 9.2 |
| Low-quality entities removed | 1,685 |

### Wikidata Linking

| Metric | Value |
|--------|-------|
| Wikidata reference mappings | 58 |
| Entity instances linked | 1,928+ |
| Unique entity names linked | 58 |
| Coverage rate | 14% |

## Wikidata Reference Categories

| Category | Count | Examples |
|----------|-------|----------|
| Countries | 1 | Nederland (+ aliases) |
| Provinces | 12 | All Dutch provinces |
| Regions | 1 | Twente |
| Settlements | 27 | Amsterdam, Den Haag, Zwolle, etc. |
| Heritage Institutions | 7 | Rijksmuseum, Eye Filmmuseum, etc. |
| Government | 1 | Ministerie van Algemene Zaken |
| Historical Events | 1 | Tweede Wereldoorlog |
| **Total** | **50** | |

## Top Linked Entities (by frequency)

| Wikidata ID | Label | Count |
|-------------|-------|-------|
| Q55 | Netherlands | 232 |
| Q9899 | Amsterdam | 55 |
| Q36600 | The Hague | 25 |
| Q752 | Groningen | 21 |
| Q81652 | Zwolle | 19 |
| Q43631 | Leiden | 16 |
| Q34370 | Rotterdam | 15 |
| Q39297 | Utrecht (city) | 14 |
| Q1455944 | Twente | 14 |
| Q1101 | North Brabant | 14 |

## Entity Types Extracted

- **TOP.SET** - Settlements (cities, towns)
- **TOP.CTY** - Countries
- **TOP.ADM** - Administrative regions (provinces)
- **TOP.REG** - Geographic regions
- **GRP.HER** - Heritage institutions
- **GRP.GOV** - Government organizations
- **TMP.EVT** - Historical events

## Data Quality Notes

1. **Cleaned entities:** Removed 1,685 low-quality entities, including:
   - Language codes (nl, en, de)
   - Numeric-only values
   - Generic labels (collectie, geschiedenis)
   - Control characters
2. **YAML fixes:** Fixed 5 files with escape sequence errors
3. **Unlinked high-frequency terms:** Many of the remaining unlinked entities are generic Dutch terms that are not suitable for Wikidata linking:
   - collectie, Collectie (collection)
   - geschiedenis (history)
   - Bibliotheek (library)
   - Gemeentearchief (municipal archive)
   - tentoonstellingen (exhibitions)

## Files and Locations

| Resource | Location |
|----------|----------|
| Custodian files | `data/custodian/NL-*.yaml` |
| Wikidata reference | `data/reference/wikidata_entity_links.yaml` |
| Apply links script | `scripts/apply_wikidata_links.py` |
| Entity cleanup script | `scripts/cleanup_entities.py` |

## Next Steps

1. **Expand Wikidata coverage** - Add more Dutch settlements and heritage institutions
2. **Entity resolution** - Link entities across custodian files
3. **Quality validation** - Cross-reference with authoritative sources
4. **Export to RDF** - Generate linked data exports
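
The cleanup rules listed under Data Quality Notes (language codes, numeric-only values, generic labels, control characters) could be expressed as a single predicate. The following is a minimal sketch, not the contents of `scripts/cleanup_entities.py`: the function name `is_low_quality`, the two rule sets, the case-insensitive matching, and the treatment of blank names are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical rule sets: the report only names these examples,
# the full lists used by the real cleanup script are not known.
LANGUAGE_CODES = {"nl", "en", "de"}
GENERIC_LABELS = {"collectie", "geschiedenis"}

# Digits, optionally grouped by '.' or ',' (e.g. "1945", "1,865").
NUMERIC_ONLY = re.compile(r"\d+([.,]\d+)*")


def is_low_quality(name: str) -> bool:
    """Return True if an entity name matches one of the cleanup rules."""
    text = name.strip()
    if not text:
        # Assumption: blank names are dropped as well.
        return True
    lowered = text.lower()
    if lowered in LANGUAGE_CODES or lowered in GENERIC_LABELS:
        return True
    if NUMERIC_ONLY.fullmatch(text):
        return True
    # Unicode category "Cc" covers ASCII and C1 control characters.
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        return True
    return False


def clean_entities(names):
    """Keep only the names that pass every cleanup rule."""
    return [n for n in names if not is_low_quality(n)]
```

For example, `clean_entities(["Amsterdam", "nl", "1945", "collectie"])` keeps only `"Amsterdam"`. Matching generic labels case-insensitively is a design choice of this sketch; a stricter variant could compare exact strings instead.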
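
The linking step and the coverage rate in the Wikidata Linking table can be illustrated with a name-to-ID lookup. This is a hedged sketch only: the dict layout, the `name` and `wikidata_id` keys, and the functions `apply_links`, `coverage`, and `top_linked` are hypothetical and do not describe the actual format of `data/reference/wikidata_entity_links.yaml` or the behaviour of `scripts/apply_wikidata_links.py`. The Wikidata IDs are the ones shown in the Top Linked Entities table.

```python
from collections import Counter

# Hypothetical excerpt of a name-to-ID mapping (layout is an assumption).
WIKIDATA_LINKS = {
    "Nederland": "Q55",
    "Amsterdam": "Q9899",
    "Den Haag": "Q36600",
    "Zwolle": "Q81652",
}


def apply_links(entities, links):
    """Return copies of the entity dicts, adding 'wikidata_id' when the name is mapped."""
    linked = []
    for entity in entities:
        enriched = dict(entity)  # do not mutate the input
        qid = links.get(entity["name"])
        if qid is not None:
            enriched["wikidata_id"] = qid
        linked.append(enriched)
    return linked


def coverage(entities):
    """Fraction of entity instances that carry a Wikidata ID."""
    if not entities:
        return 0.0
    return sum("wikidata_id" in e for e in entities) / len(entities)


def top_linked(entities, n=10):
    """Most frequent Wikidata IDs among linked entity instances."""
    counts = Counter(e["wikidata_id"] for e in entities if "wikidata_id" in e)
    return counts.most_common(n)
```

Under this definition, coverage is computed over entity instances rather than unique names, which matches the reported figures: 1,928 linked instances out of 13,433 extracted is roughly 14%.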