NDE Heritage Institution Dataset - Status Report
Generated: 2025-12-01
Summary Statistics
| Metric | Count |
|---|---|
| Total Entries | 1,674 |
| Web Archives | 1,639 (97.9%) |
| Entries with Web Claims (xpath provenance) | 1,627 (97.2%) |
| Entries with CustodianName | 1,674 (100%) |
| Entries with GHCID | 1,673 (99.9%) |
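The percentages in the table can be re-derived from the raw counts. The following is a minimal sketch of that arithmetic; the function name `coverage` is made up for illustration, and the counts are copied from the table above.

```python
# Counts copied from the Summary Statistics table.
TOTAL_ENTRIES = 1674
COUNTS = {
    "Web Archives": 1639,
    "Entries with Web Claims (xpath provenance)": 1627,
    "Entries with CustodianName": 1674,
    "Entries with GHCID": 1673,
}


def coverage(count: int, total: int = TOTAL_ENTRIES) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for metric, count in COUNTS.items():
        print(f"{metric}: {count:,} ({coverage(count)}%)")
```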
CustodianName Sources
| Source | Count |
|---|---|
| Web: og:site_name | 569 |
| Web: title tag | 617 |
| Web: h1 tag | 330 |
| Web: schema.org | 11 |
| Wikidata | 81 |
| Original CSV entry | 66 |
GHCID Status
- Total GHCIDs generated: 1,673
- Collision groups: 91 (187 entries)
- Resolution strategy: First Batch rule (every entry in a collision group gets a name suffix)
- Entries without GHCID: 1 (missing city data)
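To illustrate the idea of "all get name suffixes", here is a hedged sketch of how collision groups might be detected and resolved. It is not the project's actual implementation: the dict keys `ghcid` and `name`, the `slugify` helper, and the hyphen-joined suffix format are all assumptions made for this example.

```python
import re
from collections import defaultdict


def slugify(name: str) -> str:
    """Lowercase the name and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def resolve_collisions(entries):
    """Append a name suffix to every entry whose base GHCID is shared.

    `entries` is a list of dicts with the hypothetical keys 'ghcid' and 'name'.
    Returns (resolved_entries, collision_groups).
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["ghcid"]].append(entry)

    # A collision group is any base GHCID claimed by more than one entry.
    collisions = {g: members for g, members in groups.items() if len(members) > 1}

    resolved = []
    for entry in entries:
        ghcid = entry["ghcid"]
        if ghcid in collisions:
            # Every member of the group gets a suffix, not only the later ones.
            ghcid = f"{ghcid}-{slugify(entry['name'])}"
        resolved.append({**entry, "ghcid": ghcid})
    return resolved, collisions


if __name__ == "__main__":
    sample = [
        {"ghcid": "A", "name": "Museum One"},
        {"ghcid": "A", "name": "Museum Two"},
        {"ghcid": "B", "name": "Archive"},
    ]
    resolved, collisions = resolve_collisions(sample)
    print([e["ghcid"] for e in resolved], len(collisions))
```

In this sketch, entries that share both the base identifier and the name would still collide after suffixing, so true duplicates would need to be deduplicated separately.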
Regional Distribution (Netherlands)
| Region | Code | Count |
|---|---|---|
| Noord-Holland | NH | 280 |
| Zuid-Holland | ZH | 272 |
| Overijssel | OV | 218 |
| Noord-Brabant | NB | 198 |
| Gelderland | GE | 182 |
| Limburg | LI | 114 |
| Utrecht | UT | 102 |
| Friesland | FR | 91 |
| Zeeland | ZE | 68 |
| Groningen | GR | 68 |
| Drenthe | DR | 57 |
| Flevoland | FL | 17 |
Web Claims Distribution
| Claim Type | Count |
|---|---|
| org_name | 3,899 |
| social_facebook | 1,398 |
| description_short | 1,172 |
| social_instagram | 1,026 |
| (claim type missing) | 972 |
| social_linkedin | 795 |
| phone | 599 |
| social_youtube | 561 |
| social_twitter | 463 |
Data Quality Notes
- All web_claims have XPath provenance - Every claim extracted from websites includes:
  - xpath - exact location in HTML
  - html_file - path to archived HTML
  - source_url - original URL
  - xpath_match_score - 1.0 for exact matches
- GHCID collisions - 91 collision groups identified:
  - Many are TRUE duplicates in source data (same institution listed multiple times)
  - All resolved with name suffixes per First Batch rule
- Missing data:
  - 35 entries without website (no URL in source data)
  - 1 entry without GHCID (missing city)
  - ~15 website fetch failures (timeout, SSL errors, 404)
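The provenance requirement on web_claims could be checked with a small validator. This is only a sketch: the four field names come from the notes above, but representing a claim as a flat dict, and the helper name `missing_provenance`, are assumptions. The example values are invented placeholders, not data from the dataset.

```python
# Provenance fields named in the Data Quality Notes.
PROVENANCE_FIELDS = ("xpath", "html_file", "source_url", "xpath_match_score")


def missing_provenance(claim: dict) -> list:
    """Return the provenance keys that are absent or empty in a claim dict."""
    return [field for field in PROVENANCE_FIELDS if claim.get(field) in (None, "")]


if __name__ == "__main__":
    # Placeholder claim; every value here is made up for illustration.
    claim = {
        "claim_type": "org_name",
        "xpath": "/html/head/title",
        "html_file": "web/example/index.html",
        "source_url": "https://example.org/",
        "xpath_match_score": 1.0,
    }
    print(missing_provenance(claim))
```

A claim passes when the returned list is empty; a score of 0.0 counts as present, since only missing or empty values are flagged.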
Files
- Entry files: data/nde/enriched/entries/*.yaml
- Web archives: data/nde/enriched/entries/web/*/
- Collision report: data/nde/enriched/ghcid_collision_report.json
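As a rough cross-check of the counts against what is on disk, the glob patterns above can be evaluated with `pathlib`. This sketch assumes one YAML file per entry and one subdirectory per web archive, which the report does not state explicitly; `count_artifacts` is a hypothetical helper name.

```python
from pathlib import Path


def count_artifacts(root: Path) -> dict:
    """Count entry files and web archive directories under a repository root."""
    entries_dir = root / "data" / "nde" / "enriched" / "entries"
    entry_files = list(entries_dir.glob("*.yaml"))
    # Assumption: each web archive lives in its own subdirectory of web/.
    archive_dirs = [p for p in entries_dir.glob("web/*") if p.is_dir()]
    return {"entry_files": len(entry_files), "web_archive_dirs": len(archive_dirs)}


if __name__ == "__main__":
    # Run from the repository root.
    print(count_artifacts(Path(".")))
```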