# NDE Heritage Institution Dataset - Status Report

**Generated**: 2025-12-01

## Summary Statistics

| Metric | Count |
|--------|-------|
| **Total Entries** | 1,674 |
| **Web Archives** | 1,639 (97.9%) |
| **Entries with Web Claims (XPath provenance)** | 1,627 (97.2%) |
| **Entries with CustodianName** | 1,674 (100%) |
| **Entries with GHCID** | 1,673 (99.9%) |

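
The percentages in the summary table are each count divided by the 1,674 total entries. A minimal sketch of that arithmetic follows; the dictionary keys and the `coverage` helper are illustrative names, not identifiers from the dataset tooling.

```python
# Counts copied from the summary table; key names are illustrative.
TOTAL_ENTRIES = 1674
counts = {
    "web_archives": 1639,
    "web_claims": 1627,
    "custodian_name": 1674,
    "ghcid": 1673,
}


def coverage(count: int, total: int = TOTAL_ENTRIES) -> str:
    """Format count as a percentage of total, to one decimal place."""
    return f"{100 * count / total:.1f}%"


for name, count in counts.items():
    print(f"{name}: {count:,} ({coverage(count)})")
```
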
## CustodianName Sources

| Source | Count |
|--------|-------|
| Web: og:site_name | 569 |
| Web: title tag | 617 |
| Web: h1 tag | 330 |
| Web: schema.org | 11 |
| Wikidata | 81 |
| Original CSV entry | 66 |

## GHCID Status

- **Total GHCIDs generated**: 1,673
- **Collision groups**: 91 (187 entries)
- **Resolution strategy**: First Batch rule (every entry in a collision group gets a name suffix)
- **Entries without GHCID**: 1 (missing city data)

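
The resolution strategy above can be sketched as follows. This is a rough illustration, not the project's actual implementation: the `ghcid` and `name` keys, the example identifiers, the `-` separator, and the `slugify` helper are all assumptions made for the example.

```python
import re
from collections import defaultdict


def slugify(name: str) -> str:
    """Lowercase the name and join its ASCII alphanumeric runs with underscores.

    Simplification: accented characters are dropped rather than transliterated.
    """
    return "_".join(re.findall(r"[a-z0-9]+", name.lower()))


def resolve_collisions(entries: list[dict]) -> dict[str, list[dict]]:
    """Append a name suffix to every entry in a colliding group.

    Returns the collision groups keyed by the original (shared) GHCID.
    """
    groups = defaultdict(list)
    for entry in entries:
        if entry.get("ghcid"):  # entries without a GHCID are left alone
            groups[entry["ghcid"]].append(entry)

    collisions = {g: members for g, members in groups.items() if len(members) > 1}
    for ghcid, members in collisions.items():
        for entry in members:  # all members get a suffix, not only latecomers
            entry["ghcid"] = f"{ghcid}-{slugify(entry['name'])}"
    return collisions


# Hypothetical entries; the identifier format here is invented for the demo.
entries = [
    {"name": "Museum A", "ghcid": "NL-NH-AMS-M"},
    {"name": "Museum B", "ghcid": "NL-NH-AMS-M"},
    {"name": "Archief C", "ghcid": "NL-UT-UTR-A"},
]
collisions = resolve_collisions(entries)
print(len(collisions), [e["ghcid"] for e in entries])
```

Under this scheme, true duplicates with identical names would still share an identifier after suffixing, so they would need deduplication rather than a suffix.
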
### Regional Distribution (Netherlands)

| Region | Code | Count |
|--------|------|-------|
| Noord-Holland | NH | 280 |
| Zuid-Holland | ZH | 272 |
| Overijssel | OV | 218 |
| Noord-Brabant | NB | 198 |
| Gelderland | GE | 182 |
| Limburg | LI | 114 |
| Utrecht | UT | 102 |
| Friesland | FR | 91 |
| Zeeland | ZE | 68 |
| Groningen | GR | 68 |
| Drenthe | DR | 57 |
| Flevoland | FL | 17 |

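
As a quick cross-check, the regional counts can be totalled and turned into shares. The sketch below uses only the figures from the table. The twelve regions sum to 1,667, six fewer than the 1,673 GHCIDs generated; this report does not say where the remaining six fall.

```python
# Counts copied from the regional distribution table, keyed by region code.
region_counts = {
    "NH": 280, "ZH": 272, "OV": 218, "NB": 198, "GE": 182, "LI": 114,
    "UT": 102, "FR": 91, "ZE": 68, "GR": 68, "DR": 57, "FL": 17,
}

total = sum(region_counts.values())

# Share of each region as a percentage of the table total.
shares = {code: round(100 * n / total, 1) for code, n in region_counts.items()}

print(total)
for code, share in shares.items():
    print(f"{code}: {share}%")
```
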
## Web Claims Distribution

| Claim Type | Count |
|------------|-------|
| org_name | 3,899 |
| social_facebook | 1,398 |
| description_short | 1,172 |
| social_instagram | 1,026 |
| email | 972 |
| social_linkedin | 795 |
| phone | 599 |
| social_youtube | 561 |
| social_twitter | 463 |

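
These counts are per claim rather than per entry: `org_name` alone (3,899) exceeds the 1,674 entries, so an entry can carry several claims of the same type. A tally of this kind could be produced roughly as below, assuming each entry is a mapping with a `web_claims` list whose items carry a `claim_type` key; the `claim_type` key name is a guess at the schema, not something this report specifies.

```python
from collections import Counter


def claim_type_counts(entries: list[dict]) -> Counter:
    """Count claims by type across all entries (assumed schema, see above)."""
    counts = Counter()
    for entry in entries:
        # Entries without a website have no web_claims at all.
        for claim in entry.get("web_claims", []):
            counts[claim["claim_type"]] += 1
    return counts


# Hypothetical entries for illustration only.
entries = [
    {"web_claims": [{"claim_type": "org_name"}, {"claim_type": "org_name"},
                    {"claim_type": "email"}]},
    {"web_claims": [{"claim_type": "org_name"}]},
    {},  # entry without a web archive
]
for claim_type, count in claim_type_counts(entries).most_common():
    print(f"{claim_type}: {count}")
```
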
## Data Quality Notes

1. **All web_claims have XPath provenance** - Every claim extracted from a website includes:
   - `xpath` - exact location in the HTML
   - `html_file` - path to the archived HTML
   - `source_url` - original URL
   - `xpath_match_score` - 1.0 for exact matches

2. **GHCID collisions** - 91 collision groups identified:
   - Many are true duplicates in the source data (the same institution listed multiple times)
   - All are resolved with name suffixes per the First Batch rule

3. **Missing data**:
   - 35 entries without a website (no URL in the source data)
   - 1 entry without a GHCID (missing city)
   - ~15 website fetch failures (timeouts, SSL errors, 404s)

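
The provenance requirement in note 1 lends itself to a simple check. The sketch below uses the four field names listed above; the `missing_provenance` helper and the sample claim values (XPath, file path, URL) are made up for illustration.

```python
# Field names taken from note 1; everything else here is illustrative.
REQUIRED_PROVENANCE = ("xpath", "html_file", "source_url", "xpath_match_score")


def missing_provenance(claim: dict) -> list[str]:
    """Return the provenance fields that are absent or empty in a claim."""
    return [
        field for field in REQUIRED_PROVENANCE
        if claim.get(field) in (None, "")
    ]


# Hypothetical claim; the values do not come from the dataset.
claim = {
    "claim_type": "org_name",
    "claim_value": "Example Museum",
    "xpath": "/html/head/title",
    "html_file": "web/0001/index.html",
    "source_url": "https://example.org/",
    "xpath_match_score": 1.0,
}
print(missing_provenance(claim))
```

A claim that passes returns an empty list; a score of `0.0` still counts as present, since only `None` and the empty string are treated as missing.
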
## Files

- Entry files: `data/nde/enriched/entries/*.yaml`
- Web archives: `data/nde/enriched/entries/web/*/`
- Collision report: `data/nde/enriched/ghcid_collision_report.json`
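
A small inventory over the paths listed above might look like the following sketch. It uses only the three locations from the list; the `inventory` function, its return keys, and the assumption that paths are relative to a repository root are illustrative.

```python
from pathlib import Path


def inventory(root: Path) -> dict:
    """Count entry files and web archive directories under an assumed repo root."""
    entries_dir = root / "data" / "nde" / "enriched" / "entries"
    entry_files = list(entries_dir.glob("*.yaml"))
    # One directory per archived website is assumed under entries/web/.
    archive_dirs = [p for p in (entries_dir / "web").glob("*") if p.is_dir()]
    report = root / "data" / "nde" / "enriched" / "ghcid_collision_report.json"
    return {
        "entry_files": len(entry_files),
        "web_archives": len(archive_dirs),
        "collision_report": report.is_file(),
    }


# Run from the repository root; prints zeros if the paths are absent.
print(inventory(Path(".")))
```
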