NDE Heritage Institution Dataset - Status Report

Generated: 2025-12-01

Summary Statistics

| Metric | Count |
| --- | --- |
| Total Entries | 1,674 |
| Web Archives | 1,639 (97.9%) |
| Entries with Web Claims (xpath provenance) | 1,627 (97.2%) |
| Entries with CustodianName | 1,674 (100%) |
| Entries with GHCID | 1,673 (99.9%) |
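
The percentages above can be reproduced from the raw counts. The following is a minimal sketch, assuming each share is simply the count divided by the 1,674 total entries and rounded to one decimal; the names `TOTAL_ENTRIES`, `counts`, and `share` are illustrative and not taken from the dataset tooling.

```python
# Counts copied from the summary statistics; the variable names are hypothetical.
TOTAL_ENTRIES = 1674

counts = {
    "Web Archives": 1639,
    "Entries with Web Claims": 1627,
    "Entries with CustodianName": 1674,
    "Entries with GHCID": 1673,
}


def share(count, total=TOTAL_ENTRIES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {metric: share(count) for metric, count in counts.items()}

for metric, pct in shares.items():
    print(f"{metric}: {pct}%")
```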

CustodianName Sources

| Source | Count |
| --- | --- |
| Web: og:site_name | 569 |
| Web: title tag | 617 |
| Web: h1 tag | 330 |
| Web: schema.org | 11 |
| Wikidata | 81 |
| Original CSV entry | 66 |

GHCID Status

  • Total GHCIDs generated: 1,673
  • Collision groups: 91 (187 entries)
  • Resolution strategy: First Batch rule (every entry in a collision group gets a name suffix)
  • Entries without GHCID: 1 (missing city data)
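
The resolution strategy can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's actual implementation: the identifier values, the `name` and `ghcid` keys, the `slugify` helper, and the hyphen-joined suffix format are all hypothetical, since the report does not describe the GHCID layout.

```python
import re
from collections import defaultdict


def slugify(name):
    """Hypothetical helper: lowercase the name and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def resolve_collisions(entries):
    """Append a name suffix to every entry whose base GHCID is shared with another entry."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["ghcid"]].append(entry)

    resolved = []
    for ghcid, members in groups.items():
        for entry in members:
            if len(members) > 1:
                # Collision group: all members get a suffix, not just the later ones.
                final_id = f"{ghcid}-{slugify(entry['name'])}"
            else:
                final_id = ghcid
            resolved.append({**entry, "ghcid": final_id})
    return resolved


# Placeholder identifiers; real GHCIDs will look different.
sample = [
    {"name": "Museum A", "ghcid": "EXAMPLE-001"},
    {"name": "Museum B", "ghcid": "EXAMPLE-001"},
    {"name": "Archive C", "ghcid": "EXAMPLE-002"},
]
resolved_ids = [entry["ghcid"] for entry in resolve_collisions(sample)]
print(resolved_ids)
```

In this sketch, two entries with the same name would still end up with the same suffixed identifier, so true duplicates in the source data would need to be merged or handled separately.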

Regional Distribution (Netherlands)

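| Region | Code | Count |
| --- | --- | --- |
| Noord-Holland | NH | 280 |
| Zuid-Holland | ZH | 272 |
| Overijssel | OV | 218 |
| Noord-Brabant | NB | 198 |
| Gelderland | GE | 182 |
| Limburg | LI | 114 |
| Utrecht | UT | 102 |
| Friesland | FR | 91 |
| Zeeland | ZE | 68 |
| Groningen | GR | 68 |
| Drenthe | DR | 57 |
| Flevoland | FL | 17 |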
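
As a quick consistency check, the regional counts can be summed and compared with the 1,674 total entries. The sketch below uses only the figures from this report; the variable names are illustrative. The twelve regions add up to 1,667, and the report does not say where the remaining 7 entries fall, so the difference is best read as an open question rather than an error.

```python
# Regional counts copied from the distribution table; names are hypothetical.
TOTAL_ENTRIES = 1674

region_counts = {
    "NH": 280, "ZH": 272, "OV": 218, "NB": 198,
    "GE": 182, "LI": 114, "UT": 102, "FR": 91,
    "ZE": 68, "GR": 68, "DR": 57, "FL": 17,
}

regional_total = sum(region_counts.values())
unaccounted = TOTAL_ENTRIES - regional_total

print(f"Entries with a listed region: {regional_total}")
print(f"Entries not covered by the table: {unaccounted}")
```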

Web Claims Distribution

| Claim Type | Count |
| --- | --- |
| org_name | 3,899 |
| social_facebook | 1,398 |
| description_short | 1,172 |
| social_instagram | 1,026 |
| email | 972 |
| social_linkedin | 795 |
| phone | 599 |
| social_youtube | 561 |
| social_twitter | 463 |

Data Quality Notes

  1. All web_claims have XPath provenance. Every claim extracted from websites includes:
    • xpath - exact location in HTML
    • html_file - path to archived HTML
    • source_url - original URL
    • xpath_match_score - 1.0 for exact matches
  2. GHCID collisions. 91 collision groups identified:
    • Many are true duplicates in the source data (same institution listed multiple times)
    • All resolved with name suffixes per the First Batch rule
  3. Missing data:
    • 35 entries without website (no URL in source data)
    • 1 entry without GHCID (missing city)
    • ~15 website fetch failures (timeouts, SSL errors, 404 responses)
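
The provenance requirement in note 1 can be checked mechanically. The following is a minimal sketch, assuming each claim is available as a dictionary carrying the four provenance fields named above; the `claim_type` and `claim_value` keys, the example values, and the `missing_provenance` helper are hypothetical and only illustrate the idea.

```python
# The four provenance fields listed in the data quality notes.
REQUIRED_PROVENANCE_FIELDS = ("xpath", "html_file", "source_url", "xpath_match_score")


def missing_provenance(claim):
    """Return the provenance fields that are absent or empty in a claim dict."""
    return [
        field
        for field in REQUIRED_PROVENANCE_FIELDS
        if claim.get(field) in (None, "")
    ]


# Illustrative claim; the values and the claim_type/claim_value keys are assumptions.
complete_claim = {
    "claim_type": "org_name",
    "claim_value": "Example Museum",
    "xpath": "/html/head/title",
    "html_file": "web/example/index.html",
    "source_url": "https://example.org/",
    "xpath_match_score": 1.0,
}
incomplete_claim = {"claim_type": "email", "claim_value": "info@example.org"}

print(missing_provenance(complete_claim))
print(missing_provenance(incomplete_claim))
```

An empty list means the claim carries all four fields; anything else names what would need to be backfilled before the claim could be traced to its archived HTML.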

Files

  • Entry files: data/nde/enriched/entries/*.yaml
  • Web archives: data/nde/enriched/entries/web/*/
  • Collision report: data/nde/enriched/ghcid_collision_report.json
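
To compare the files on disk against the counts in this report, the entry files can be enumerated with the glob pattern listed above. This is a sketch only: it assumes the paths are relative to the repository root, and `count_entry_files` is a hypothetical helper rather than part of the dataset tooling.

```python
from pathlib import Path


def count_entry_files(repo_root="."):
    """Count YAML entry files under data/nde/enriched/entries (non-recursive)."""
    entries_dir = Path(repo_root) / "data" / "nde" / "enriched" / "entries"
    if not entries_dir.is_dir():
        # Nothing to count if the directory is missing at this root.
        return 0
    return sum(1 for _ in entries_dir.glob("*.yaml"))


if __name__ == "__main__":
    print(f"Entry files found: {count_entry_files()}")
```

Because the glob is not recursive, files inside the `web` subdirectory are not counted as entries.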