glam/reports/entity_extraction_final_report.md
2025-12-14 17:09:55 +01:00

NL Custodian Entity Extraction - Final Report

Date: 2025-12-13

Project: GLAM Heritage Custodian Data Enrichment

Executive Summary

Named entities were extracted and validated across 1,865 Dutch heritage custodian files, yielding 13,433 entities. Linking 58 high-frequency entity names to Wikidata covers roughly 14% of the extracted entity instances (1,928+ of 13,433).

Pipeline Statistics

Source Data

| Metric | Count (% of all files) |
| --- | --- |
| Total NL custodian files | 1,865 |
| Files with web archives | 1,771 (95%) |
| Files with annotations | 1,737 (93%) |
| Files with extracted entities | 1,454 (78%) |

Entity Extraction

| Metric | Count |
| --- | --- |
| Total entities extracted | 13,433 |
| Unique entity names | ~2,500 |
| Average entities per file (files with extracted entities) | 9.2 |
| Low-quality entities removed | 1,685 |

Wikidata Linking

| Metric | Count |
| --- | --- |
| Wikidata reference mappings | 58 |
| Entity instances linked | 1,928+ |
| Unique entity names linked | 58 |
| Coverage rate (linked instances / total entities) | 14% |
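
The derived figures in these tables can be re-checked with a few lines of arithmetic. This is a minimal sketch, assuming that the coverage rate is linked entity instances divided by total extracted entities and that the per-file average is taken over files that contain entities; the report does not state either definition explicitly.

```python
# Counts copied from the tables in this report.
total_files = 1865
files_with_entities = 1454
total_entities = 13433
linked_instances = 1928  # reported as "1,928+", so treat this as a lower bound

# Assumption: coverage = linked entity instances / total extracted entities.
coverage = linked_instances / total_entities

# Assumption: the average is taken over files that contain extracted entities.
avg_per_file = total_entities / files_with_entities

print(f"coverage: {coverage:.1%}")                                         # coverage: 14.4%
print(f"average entities per file: {avg_per_file:.1f}")                    # average entities per file: 9.2
print(f"files with entities: {files_with_entities / total_files:.0%}")     # files with entities: 78%
```

Under these assumptions the computed values agree with the reported 14%, 9.2, and 78%.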

Wikidata Reference Categories

| Category | Count | Examples |
| --- | --- | --- |
| Countries | 1 | Nederland (+ aliases) |
| Provinces | 12 | All Dutch provinces |
| Regions | 1 | Twente |
| Settlements | 27 | Amsterdam, Den Haag, Zwolle, etc. |
| Heritage Institutions | 7 | Rijksmuseum, Eye Filmmuseum, etc. |
| Government | 1 | Ministerie van Algemene Zaken |
| Historical Events | 1 | Tweede Wereldoorlog |
| Total | 50 | |

Top Linked Entities (by frequency)

| Wikidata ID | Label | Count |
| --- | --- | --- |
| Q55 | Netherlands | 232 |
| Q9899 | Amsterdam | 55 |
| Q36600 | The Hague | 25 |
| Q752 | Groningen | 21 |
| Q81652 | Zwolle | 19 |
| Q43631 | Leiden | 16 |
| Q34370 | Rotterdam | 15 |
| Q39297 | Utrecht (city) | 14 |
| Q1455944 | Twente | 14 |
| Q1101 | North Brabant | 14 |

Entity Types Extracted

  • TOP.SET - Settlements (cities, towns)
  • TOP.CTY - Countries
  • TOP.ADM - Administrative regions (provinces)
  • TOP.REG - Geographic regions
  • GRP.HER - Heritage institutions
  • GRP.GOV - Government organizations
  • TMP.EVT - Historical events
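
The type codes above follow a `CLASS.SUBTYPE` pattern, which makes them easy to group programmatically. The sketch below is illustrative only: the codes and descriptions come from this list, while the helper names `split_type` and `codes_in_class` are hypothetical and not part of the project's scripts.

```python
# Type codes and descriptions as listed in this report.
ENTITY_TYPES = {
    "TOP.SET": "Settlements (cities, towns)",
    "TOP.CTY": "Countries",
    "TOP.ADM": "Administrative regions (provinces)",
    "TOP.REG": "Geographic regions",
    "GRP.HER": "Heritage institutions",
    "GRP.GOV": "Government organizations",
    "TMP.EVT": "Historical events",
}


def split_type(code: str) -> tuple[str, str]:
    """Split a code such as 'TOP.SET' into its (class, subtype) parts."""
    top, _, sub = code.partition(".")
    return top, sub


def codes_in_class(prefix: str) -> list[str]:
    """Return all known codes whose class part equals `prefix`, sorted."""
    return sorted(c for c in ENTITY_TYPES if split_type(c)[0] == prefix)


print(codes_in_class("TOP"))  # ['TOP.ADM', 'TOP.CTY', 'TOP.REG', 'TOP.SET']
```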

Data Quality Notes

  1. Cleaned entities: Removed 1,685 low-quality entities, including:
    • Language codes (nl, en, de)
    • Numeric-only values
    • Generic labels (collectie, geschiedenis)
    • Control characters
  2. YAML fixes: Repaired escape-sequence errors in 5 files
  3. Unlinked high-frequency terms: Many of the remaining unlinked entities are generic Dutch terms that are not suitable for Wikidata linking:
    • collectie, Collectie (collection)
    • geschiedenis (history)
    • Bibliotheek (library)
    • Gemeentearchief (municipal archive)
    • tentoonstellingen (exhibitions)
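
The cleanup rules in note 1 can be sketched as a single predicate. This is not the code in `scripts/cleanup_entities.py`, which is not shown in this report; the function name `is_low_quality`, the case-insensitive matching, and the exact word lists are assumptions that reuse only the examples given above.

```python
import re

# Illustrative values taken from the examples in this report; the real
# lists used by the cleanup script are not shown and may be longer.
LANGUAGE_CODES = {"nl", "en", "de"}
GENERIC_LABELS = {"collectie", "geschiedenis"}

# ASCII control characters (0x00-0x1F and 0x7F).
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

# Digits, optionally grouped with '.' or ',' (e.g. "1945", "1.865").
NUMERIC_ONLY = re.compile(r"\d+(?:[.,]\d+)*")


def is_low_quality(name: str) -> bool:
    """Return True if an entity name matches one of the cleanup rules."""
    if CONTROL_CHARS.search(name):
        return True
    text = name.strip()
    lowered = text.lower()  # assumption: matching is case-insensitive
    if lowered in LANGUAGE_CODES or lowered in GENERIC_LABELS:
        return True
    return NUMERIC_ONLY.fullmatch(text) is not None


names = ["Amsterdam", "nl", "1945", "Collectie", "Zwolle\x07", "Rijksmuseum"]
kept = [n for n in names if not is_low_quality(n)]
print(kept)  # ['Amsterdam', 'Rijksmuseum']
```

Keeping the rules in one predicate makes it straightforward to count how many entities each run removes and to extend the word lists later.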

Files and Locations

| Resource | Location |
| --- | --- |
| Custodian files | `data/custodian/NL-*.yaml` |
| Wikidata reference | `data/reference/wikidata_entity_links.yaml` |
| Apply links script | `scripts/apply_wikidata_links.py` |
| Entity cleanup script | `scripts/cleanup_entities.py` |
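
To illustrate how a name-to-ID reference mapping might be applied to extracted entities, here is a minimal sketch. It is not the contents of `scripts/apply_wikidata_links.py`: the record fields (`name`, `type`, `wikidata_id`), the function names, and the in-memory dictionary standing in for the YAML reference file are all assumptions, and exact-name matching is used only for simplicity. The IDs are taken from the tables in this report.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the Wikidata reference file;
# the real YAML schema is not shown in this report.
WIKIDATA_LINKS = {
    "Nederland": "Q55",
    "Amsterdam": "Q9899",
    "Den Haag": "Q36600",
    "Zwolle": "Q81652",
}


def apply_links(entities, links):
    """Return copies of the records, adding 'wikidata_id' on an exact name match."""
    linked = []
    for entity in entities:
        record = dict(entity)  # copy so the input records stay untouched
        qid = links.get(record["name"])
        if qid is not None:
            record["wikidata_id"] = qid
        linked.append(record)
    return linked


def coverage(entities):
    """Fraction of entity instances that carry a Wikidata ID."""
    if not entities:
        return 0.0
    return sum("wikidata_id" in e for e in entities) / len(entities)


# Made-up sample records, for illustration only.
sample = [
    {"name": "Amsterdam", "type": "TOP.SET"},
    {"name": "Nederland", "type": "TOP.CTY"},
    {"name": "Bibliotheek", "type": "GRP.HER"},
    {"name": "Amsterdam", "type": "TOP.SET"},
]

result = apply_links(sample, WIKIDATA_LINKS)
counts = Counter(e["wikidata_id"] for e in result if "wikidata_id" in e)

print(coverage(result))       # 0.75
print(counts.most_common(1))  # [('Q9899', 2)]
```

Counting linked instances per ID in this way is one route to a frequency table like the one under "Top Linked Entities".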

Next Steps

  1. Expand Wikidata coverage - Add more Dutch settlements and heritage institutions
  2. Entity resolution - Link entities across custodian files
  3. Quality validation - Cross-reference with authoritative sources
  4. Export to RDF - Generate linked data exports