glam/reports/entity_extraction_final_report.md
2025-12-14 17:09:55 +01:00

# NL Custodian Entity Extraction - Final Report
**Date:** 2025-12-13
**Project:** GLAM Heritage Custodian Data Enrichment
## Executive Summary
Named entities were extracted and validated across 1,865 Dutch heritage custodian files. Of the 13,433 entities extracted, about 14% (1,928+ instances) are linked to Wikidata through 58 reference mappings that target high-frequency entities.
## Pipeline Statistics
### Source Data
| Metric | Count |
|--------|-------|
| Total NL custodian files | 1,865 |
| Files with web archives | 1,771 (95%) |
| Files with annotations | 1,737 (93%) |
| Files with extracted entities | 1,454 (78%) |
### Entity Extraction
| Metric | Count |
|--------|-------|
| Total entities extracted | 13,433 |
| Unique entity names | ~2,500 |
| Average entities per file (files with extracted entities) | 9.2 |
| Low-quality entities removed | 1,685 |
### Wikidata Linking
| Metric | Count |
|--------|-------|
| Wikidata reference mappings | 58 |
| Entity instances linked | 1,928+ |
| Unique entity names linked | 58 |
| Coverage rate (linked instances / total entities) | 14% |
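
The percentages in the tables above can be re-derived from the raw counts. The sketch below assumes that the coverage rate is linked instances divided by total entities extracted, and that the per-file average uses only files with extracted entities; both are inferences from the numbers, not definitions stated in this report.

```python
# Counts copied from the tables in this report.
total_files = 1865
files_with_archives = 1771
files_with_annotations = 1737
files_with_entities = 1454
total_entities = 13433
linked_instances = 1928  # reported as "1,928+", so treated here as a lower bound


def pct(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


archive_pct = pct(files_with_archives, total_files)
annotation_pct = pct(files_with_annotations, total_files)
entity_file_pct = pct(files_with_entities, total_files)
coverage_pct = pct(linked_instances, total_entities)

# Assumption: the average is taken over files that have at least one entity.
avg_per_file = round(total_entities / files_with_entities, 1)

print(archive_pct, annotation_pct, entity_file_pct, coverage_pct, avg_per_file)
```

Under these assumptions the computed values match the reported 95%, 93%, 78%, 14%, and 9.2.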
## Wikidata Reference Categories
| Category | Count | Examples |
|----------|-------|----------|
| Countries | 1 | Nederland (+ aliases) |
| Provinces | 12 | All Dutch provinces |
| Regions | 1 | Twente |
| Settlements | 27 | Amsterdam, Den Haag, Zwolle, etc. |
| Heritage Institutions | 7 | Rijksmuseum, Eye Filmmuseum, etc. |
| Government | 1 | Ministerie van Algemene Zaken |
| Historical Events | 1 | Tweede Wereldoorlog |
| **Total** | **50** | |
## Top Linked Entities (by frequency)
| Wikidata ID | Label | Count |
|-------------|-------|-------|
| Q55 | Netherlands | 232 |
| Q9899 | Amsterdam | 55 |
| Q36600 | The Hague | 25 |
| Q752 | Groningen | 21 |
| Q81652 | Zwolle | 19 |
| Q43631 | Leiden | 16 |
| Q34370 | Rotterdam | 15 |
| Q39297 | Utrecht (city) | 14 |
| Q1455944 | Twente | 14 |
| Q1101 | North Brabant | 14 |
## Entity Types Extracted
- **TOP.SET** - Settlements (cities, towns)
- **TOP.CTY** - Countries
- **TOP.ADM** - Administrative regions (provinces)
- **TOP.REG** - Geographic regions
- **GRP.HER** - Heritage institutions
- **GRP.GOV** - Government organizations
- **TMP.EVT** - Historical events
## Data Quality Notes
1. **Cleaned entities:** Removed 1,685 low-quality entities including:
   - Language codes (nl, en, de)
   - Numeric-only values
   - Generic labels (collectie, geschiedenis)
   - Control characters
2. **YAML fixes:** Fixed 5 files with escape sequence errors
3. **Unlinked high-frequency terms:** Many remaining unlinked entities are generic Dutch terms not suitable for Wikidata linking:
   - collectie, Collectie (collection)
   - geschiedenis (history)
   - Bibliotheek (library)
   - Gemeentearchief (municipal archive)
   - tentoonstellingen (exhibitions)
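
The four removal rules in note 1 could be expressed as a single predicate. This is a minimal sketch, not the contents of `scripts/cleanup_entities.py`: the function name `is_low_quality`, the rule sets, and the handling of separators in numeric values are all assumptions, seeded only with the examples listed above.

```python
import unicodedata

# Hypothetical rule sets, seeded with the examples given in this report.
LANGUAGE_CODES = {"nl", "en", "de"}
GENERIC_LABELS = {"collectie", "geschiedenis"}


def is_low_quality(name: str) -> bool:
    """Return True if an entity name matches one of the removal rules."""
    text = name.strip()
    if not text:
        # Assumption: empty names are also treated as low quality.
        return True
    # Control characters (Unicode category "Cc") inside the name.
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        return True
    lowered = text.lower()
    if lowered in LANGUAGE_CODES or lowered in GENERIC_LABELS:
        return True
    # Numeric-only values, ignoring common separators (assumed rule).
    digits = text.replace(".", "").replace(",", "").replace("-", "").replace(" ", "")
    return digits.isdigit()


sample = ["Amsterdam", "nl", "1945", "Collectie", "Tweede Wereldoorlog"]
kept = [name for name in sample if not is_low_quality(name)]
print(kept)  # ['Amsterdam', 'Tweede Wereldoorlog']
```

Matching on the lowercased name lets one rule cover both `collectie` and `Collectie`.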
## Files and Locations
| Resource | Location |
|----------|----------|
| Custodian files | `data/custodian/NL-*.yaml` |
| Wikidata reference | `data/reference/wikidata_entity_links.yaml` |
| Apply links script | `scripts/apply_wikidata_links.py` |
| Entity cleanup script | `scripts/cleanup_entities.py` |
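
To illustrate how a reference mapping might be applied to extracted entities, here is a hedged sketch using in-memory data. The schema of `wikidata_entity_links.yaml` and the logic of `apply_wikidata_links.py` are not shown in this report, so the `REFERENCE` structure, the `wikidata_id` key, and both function names are hypothetical. The Wikidata IDs are copied from the table of top linked entities above.

```python
# Hypothetical stand-in for the reference file: Wikidata ID -> names and aliases.
REFERENCE = {
    "Q55": ["Nederland", "Netherlands"],
    "Q36600": ["Den Haag", "The Hague"],
    "Q43631": ["Leiden"],
}


def build_index(reference: dict[str, list[str]]) -> dict[str, str]:
    """Map each case-folded name or alias to its Wikidata ID."""
    index = {}
    for qid, names in reference.items():
        for name in names:
            index[name.casefold()] = qid
    return index


def apply_links(entities: list[dict], index: dict[str, str]) -> int:
    """Add a 'wikidata_id' key to matching entities; return how many were linked."""
    linked = 0
    for entity in entities:
        qid = index.get(entity["name"].strip().casefold())
        if qid is not None:
            entity["wikidata_id"] = qid
            linked += 1
    return linked


entities = [{"name": "nederland"}, {"name": "Den Haag"}, {"name": "collectie"}]
print(apply_links(entities, build_index(REFERENCE)))  # 2
```

An exact, case-insensitive name lookup keeps linking predictable for a small curated reference list; generic terms such as `collectie` simply stay unlinked.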
## Next Steps
1. **Expand Wikidata coverage** - Add more Dutch settlements and heritage institutions
2. **Entity resolution** - Link entities across custodian files
3. **Quality validation** - Cross-reference with authoritative sources
4. **Export to RDF** - Generate linked data exports