# NL Custodian Entity Extraction - Final Report

**Date:** 2025-12-13
**Project:** GLAM Heritage Custodian Data Enrichment

## Executive Summary

Named entities were extracted and validated for 1,865 Dutch heritage custodian files, of which 1,454 (78%) yielded entities. Of the 13,433 extracted entity instances, about 14% (1,928+) are linked to Wikidata, with linking focused on high-frequency entities.

## Pipeline Statistics

### Source Data
| Metric | Count (% of total) |
|--------|--------------------|
| Total NL custodian files | 1,865 |
| Files with web archives | 1,771 (95%) |
| Files with annotations | 1,737 (93%) |
| Files with extracted entities | 1,454 (78%) |

### Entity Extraction
| Metric | Value |
|--------|-------|
| Total entities extracted | 13,433 |
| Unique entity names | ~2,500 |
| Average entities per file (files with extracted entities) | 9.2 |
| Low-quality entities removed | 1,685 |

### Wikidata Linking
| Metric | Value |
|--------|-------|
| Wikidata reference mappings | 58 |
| Entity instances linked | 1,928+ |
| Unique entity names linked | 58 |
| Coverage rate (linked instances / total entities extracted) | 14% |

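The derived figures in the tables above can be reproduced from the reported counts. The following is a minimal sanity-check sketch, not part of the pipeline. It assumes that the coverage rate is linked entity instances divided by total extracted entities, and that the per-file average is taken over files with extracted entities; both readings are assumptions that happen to match the reported values.

```python
# Counts copied from the tables in this report.
total_files = 1865
files_with_entities = 1454
total_entities = 13433
linked_instances = 1928  # reported as "1,928+", so treated as a lower bound


def pct(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Assumed definition: linked instances over all extracted entity instances.
coverage = pct(linked_instances, total_entities)

# Assumed definition: average over files that yielded at least one entity.
avg_per_file = round(total_entities / files_with_entities, 1)

print(f"coverage: {coverage}%")
print(f"average entities per file: {avg_per_file}")
print(f"files with entities: {pct(files_with_entities, total_files)}%")
```
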
## Wikidata Reference Categories

| Category | Count | Examples |
|----------|-------|----------|
| Countries | 1 | Nederland (+ aliases) |
| Provinces | 12 | All Dutch provinces |
| Regions | 1 | Twente |
| Settlements | 27 | Amsterdam, Den Haag, Zwolle, etc. |
| Heritage Institutions | 7 | Rijksmuseum, Eye Filmmuseum, etc. |
| Government | 1 | Ministerie van Algemene Zaken |
| Historical Events | 1 | Tweede Wereldoorlog |
| **Total** | **50** | |

## Top Linked Entities (by frequency)

| Wikidata ID | Label | Count |
|-------------|-------|-------|
| Q55 | Netherlands | 232 |
| Q9899 | Amsterdam | 55 |
| Q36600 | The Hague | 25 |
| Q752 | Groningen | 21 |
| Q81652 | Zwolle | 19 |
| Q43631 | Leiden | 16 |
| Q34370 | Rotterdam | 15 |
| Q39297 | Utrecht (city) | 14 |
| Q1455944 | Twente | 14 |
| Q1101 | North Brabant | 14 |

## Entity Types Extracted

- **TOP.SET** - Settlements (cities, towns)
- **TOP.CTY** - Countries
- **TOP.ADM** - Administrative regions (provinces)
- **TOP.REG** - Geographic regions
- **GRP.HER** - Heritage institutions
- **GRP.GOV** - Government organizations
- **TMP.EVT** - Historical events

## Data Quality Notes

1. **Cleaned entities:** Removed 1,685 low-quality entities including:
   - Language codes (nl, en, de)
   - Numeric-only values
   - Generic labels (collectie, geschiedenis)
   - Control characters

2. **YAML fixes:** Fixed 5 files with escape sequence errors

3. **Unlinked high-frequency terms:** Many remaining unlinked entities are generic Dutch terms not suitable for Wikidata linking:
   - collectie, Collectie (collection)
   - geschiedenis (history)
   - Bibliotheek (library)
   - Gemeentearchief (municipal archive)
   - tentoonstellingen (exhibitions)

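The four cleanup rules in note 1 can be illustrated with a small filter. This is only a sketch under stated assumptions: the rule sets contain just the examples named in this report, the function names are hypothetical, and case-insensitive matching is an assumption. The logic of `scripts/cleanup_entities.py` itself is not shown here and may differ.

```python
import re

# Hypothetical rule sets; the report names only these examples.
LANGUAGE_CODES = {"nl", "en", "de"}
GENERIC_LABELS = {"collectie", "geschiedenis"}
# ASCII control characters (0x00-0x1F and 0x7F).
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
# Digits, optionally grouped with "." or "," (for example "1865" or "1.865").
NUMERIC_ONLY = re.compile(r"\d+(?:[.,]\d+)*")


def is_low_quality(name: str) -> bool:
    """Return True if an entity name matches one of the four cleanup rules."""
    if CONTROL_CHARS.search(name):
        return True
    text = name.strip()
    lowered = text.lower()  # assumption: rules are applied case-insensitively
    if lowered in LANGUAGE_CODES or lowered in GENERIC_LABELS:
        return True
    return NUMERIC_ONLY.fullmatch(text) is not None


def clean_entities(names):
    """Keep only the names that pass every rule."""
    return [n for n in names if not is_low_quality(n)]


sample = ["Amsterdam", "nl", "1865", "collectie", "Rijks\x07museum", "Zwolle"]
print(clean_entities(sample))
```
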
## Files and Locations

| Resource | Location |
|----------|----------|
| Custodian files | `data/custodian/NL-*.yaml` |
| Wikidata reference | `data/reference/wikidata_entity_links.yaml` |
| Apply links script | `scripts/apply_wikidata_links.py` |
| Entity cleanup script | `scripts/cleanup_entities.py` |

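As a rough illustration of how a reference mapping could be applied to extracted entities, the sketch below uses an in-memory dictionary in place of `data/reference/wikidata_entity_links.yaml`. The schema (`label`, `aliases`), the entity record fields, the alias lists, and the function names are all hypothetical; the Wikidata IDs are the ones listed under Top Linked Entities. It is not the code in `scripts/apply_wikidata_links.py`.

```python
# Hypothetical stand-in for the Wikidata reference file; the real schema
# is not shown in this report. The alias lists are illustrative assumptions.
WIKIDATA_LINKS = {
    "Q55": {"label": "Netherlands", "aliases": ["Nederland", "Netherlands"]},
    "Q36600": {"label": "The Hague", "aliases": ["Den Haag", "The Hague"]},
}


def build_lookup(links):
    """Map each lower-cased alias to its Wikidata ID."""
    lookup = {}
    for qid, entry in links.items():
        for alias in entry["aliases"]:
            lookup[alias.lower()] = qid
    return lookup


def apply_links(entities, lookup):
    """Return copies of the entity dicts, adding 'wikidata_id' on a match."""
    linked = []
    for entity in entities:
        enriched = dict(entity)
        qid = lookup.get(entity["name"].strip().lower())
        if qid is not None:
            enriched["wikidata_id"] = qid
        linked.append(enriched)
    return linked


# Hypothetical entity records; the field names are assumptions.
entities = [
    {"name": "Nederland", "type": "TOP.CTY"},
    {"name": "Den Haag", "type": "TOP.SET"},
    {"name": "Bibliotheek", "type": "GRP.HER"},
]

for record in apply_links(entities, build_lookup(WIKIDATA_LINKS)):
    print(record)
```

Generic terms such as `Bibliotheek` have no entry in the mapping and are left unlinked, consistent with the unlinked high-frequency terms listed under Data Quality Notes.
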
## Next Steps

1. **Expand Wikidata coverage** - Add more Dutch settlements and heritage institutions
2. **Entity resolution** - Link entities across custodian files
3. **Quality validation** - Cross-reference with authoritative sources
4. **Export to RDF** - Generate linked data exports