# NDE Heritage Institution Dataset - Status Report

**Generated**: 2025-12-01

## Summary Statistics

| Metric | Count |
|--------|-------|
| **Total Entries** | 1,674 |
| **Web Archives** | 1,639 (97.9%) |
| **Entries with Web Claims (xpath provenance)** | 1,627 (97.2%) |
| **Entries with CustodianName** | 1,674 (100%) |
| **Entries with GHCID** | 1,673 (99.9%) |

## CustodianName Sources

| Source | Count |
|--------|-------|
| Web: og:site_name | 569 |
| Web: title tag | 617 |
| Web: h1 tag | 330 |
| Web: schema.org | 11 |
| Wikidata | 81 |
| Original CSV entry | 66 |

## GHCID Status

- **Total GHCIDs generated**: 1,673
- **Collision groups**: 91 (187 entries)
- **Resolution strategy**: First Batch - all get name suffixes
- **Entries without GHCID**: 1 (missing city data)

### Regional Distribution (Netherlands)

| Region | Code | Count |
|--------|------|-------|
| Noord-Holland | NH | 280 |
| Zuid-Holland | ZH | 272 |
| Overijssel | OV | 218 |
| Noord-Brabant | NB | 198 |
| Gelderland | GE | 182 |
| Limburg | LI | 114 |
| Utrecht | UT | 102 |
| Friesland | FR | 91 |
| Zeeland | ZE | 68 |
| Groningen | GR | 68 |
| Drenthe | DR | 57 |
| Flevoland | FL | 17 |

## Web Claims Distribution

| Claim Type | Count |
|------------|-------|
| org_name | 3,899 |
| social_facebook | 1,398 |
| description_short | 1,172 |
| social_instagram | 1,026 |
| email | 972 |
| social_linkedin | 795 |
| phone | 599 |
| social_youtube | 561 |
| social_twitter | 463 |

## Data Quality Notes

1. **All web_claims have XPath provenance** - Every claim extracted from websites includes:
   - `xpath` - exact location in HTML
   - `html_file` - path to archived HTML
   - `source_url` - original URL
   - `xpath_match_score` - 1.0 for exact matches

2. **GHCID collisions** - 91 collision groups identified:
   - Many are TRUE duplicates in source data (same institution listed multiple times)
   - All resolved with name suffixes per First Batch rule

3. **Missing data**:
   - 35 entries without website (no URL in source data)
   - 1 entry without GHCID (missing city)
   - ~15 website fetch failures (timeout, SSL errors, 404)

## Files

- Entry files: `data/nde/enriched/entries/*.yaml`
- Web archives: `data/nde/enriched/entries/web/*/`
- Collision report: `data/nde/enriched/ghcid_collision_report.json`
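
The provenance fields listed under Data Quality Notes can be checked mechanically. The following is a minimal sketch, assuming each web claim is available as a plain dictionary once an entry file has been loaded. The helper names `missing_provenance` and `is_exact_match`, the `claim_type` and `claim_value` keys, and all example values are hypothetical; only the four provenance field names come from this report.

```python
from typing import Any

# Provenance fields this report says accompany every web claim.
REQUIRED_PROVENANCE_FIELDS = ("xpath", "html_file", "source_url", "xpath_match_score")


def missing_provenance(claim: dict[str, Any]) -> list[str]:
    """Return the provenance fields that are absent or empty in one claim."""
    return [
        field
        for field in REQUIRED_PROVENANCE_FIELDS
        if claim.get(field) in (None, "")
    ]


def is_exact_match(claim: dict[str, Any]) -> bool:
    """True when xpath_match_score is 1.0, which the report uses for exact matches."""
    return claim.get("xpath_match_score") == 1.0


# Hypothetical claim: every value below is invented for illustration.
example_claim = {
    "claim_type": "org_name",
    "claim_value": "Example Museum",
    "xpath": "/html/head/title",
    "html_file": "web/0001/index.html",
    "source_url": "https://example.org/",
    "xpath_match_score": 1.0,
}

print(missing_provenance(example_claim))  # fields still missing, if any
print(is_exact_match(example_claim))
```

A check like this could run over all claims in an entry to confirm the "all web_claims have XPath provenance" statement before a report is regenerated.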
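
The collision figures under GHCID Status (groups and the entries they contain) amount to grouping entries by identifier and keeping the identifiers that occur more than once. The sketch below illustrates that counting only; it does not reproduce the First Batch suffix rule, whose exact format is not described here. The `ghcid` and `name` keys, the function names, and the placeholder identifiers are assumptions, not the actual schema or GHCID format.

```python
from collections import defaultdict


def find_collision_groups(entries: list[dict]) -> dict[str, list[dict]]:
    """Group entries by GHCID and keep only IDs shared by two or more entries."""
    by_ghcid: dict[str, list[dict]] = defaultdict(list)
    for entry in entries:
        ghcid = entry.get("ghcid")
        if ghcid:  # an entry without a GHCID cannot take part in a collision
            by_ghcid[ghcid].append(entry)
    return {ghcid: group for ghcid, group in by_ghcid.items() if len(group) > 1}


def summarize(groups: dict[str, list[dict]]) -> tuple[int, int]:
    """Return (number of collision groups, total entries inside those groups)."""
    return len(groups), sum(len(group) for group in groups.values())


# Hypothetical entries with placeholder identifiers (not real GHCIDs).
entries = [
    {"name": "Museum A", "ghcid": "ID-1"},
    {"name": "Museum A (duplicate listing)", "ghcid": "ID-1"},
    {"name": "Archive B", "ghcid": "ID-2"},
    {"name": "Library C", "ghcid": None},  # e.g. missing city data
]

groups = find_collision_groups(entries)
print(summarize(groups))
```

Applied to the full dataset, `summarize` would yield the pair reported above as "91 (187 entries)", assuming the same definition of a collision group.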