glam/data/nde/bu/enriched/DATASET_STATUS.md
2025-12-23 13:27:35 +01:00

# NDE Heritage Institution Dataset - Status Report
**Generated**: 2025-12-01
## Summary Statistics
| Metric | Count |
|--------|-------|
| **Total Entries** | 1,674 |
| **Entries with Web Archives** | 1,639 (97.9%) |
| **Entries with Web Claims (xpath provenance)** | 1,627 (97.2%) |
| **Entries with CustodianName** | 1,674 (100%) |
| **Entries with GHCID** | 1,673 (99.9%) |
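
The percentages in the table follow directly from the counts. A minimal Python sketch that recomputes them, assuming each share is taken against the 1,674 total entries and rounded to one decimal place (the dictionary keys and the `share` helper are illustrative names, not taken from the pipeline):

```python
# Counts copied from the summary table above.
TOTAL_ENTRIES = 1674

counts = {
    "web_archives": 1639,
    "web_claims": 1627,
    "custodian_name": 1674,
    "ghcid": 1673,
}


def share(count: int, total: int = TOTAL_ENTRIES) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


percentages = {name: share(count) for name, count in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```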
## CustodianName Sources
| Source | Count |
|--------|-------|
| Web: og:site_name | 569 |
| Web: title tag | 617 |
| Web: h1 tag | 330 |
| Web: schema.org | 11 |
| Wikidata | 81 |
| Original CSV entry | 66 |
## GHCID Status
- **Total GHCIDs generated**: 1,673
- **Collision groups**: 91 (187 entries)
- **Resolution strategy**: First Batch rule (every entry in a collision group gets a name suffix)
- **Entries without GHCID**: 1 (missing city data)
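
Collision detection amounts to grouping entries by GHCID and keeping the groups with more than one member. The sketch below illustrates that idea on made-up records. The GHCID strings, the `ghcid` and `name` keys, and the suffix format are all hypothetical, since this report does not specify them.

```python
from collections import defaultdict

# Made-up sample records (hypothetical GHCIDs and field names).
entries = [
    {"ghcid": "NL-XX-AAA-M-EX", "name": "Example Museum"},
    {"ghcid": "NL-XX-AAA-M-EX", "name": "Example Museum Annex"},
    {"ghcid": "NL-XX-BBB-A-EX", "name": "Example Archive"},
]


def find_collision_groups(records):
    """Group records by GHCID and return only groups with 2+ members."""
    groups = defaultdict(list)
    for record in records:
        groups[record["ghcid"]].append(record)
    return {ghcid: members for ghcid, members in groups.items() if len(members) > 1}


def add_name_suffixes(groups):
    """Give every member of a collision group a name-derived suffix.

    The slug format is an assumption, not the project's actual rule.
    """
    resolved = []
    for ghcid, members in groups.items():
        for member in members:
            slug = member["name"].lower().replace(" ", "-")
            resolved.append(f"{ghcid}-{slug}")
    return resolved


collisions = find_collision_groups(entries)
resolved_ids = add_name_suffixes(collisions)
print(len(collisions), "collision group(s),", len(resolved_ids), "entries")
```

Suffixing every member of a group, rather than only the later arrivals, mirrors the "all get name suffixes" wording above.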
### Regional Distribution (Netherlands)
| Region | Code | Count |
|--------|------|-------|
| Noord-Holland | NH | 280 |
| Zuid-Holland | ZH | 272 |
| Overijssel | OV | 218 |
| Noord-Brabant | NB | 198 |
| Gelderland | GE | 182 |
| Limburg | LI | 114 |
| Utrecht | UT | 102 |
| Friesland | FR | 91 |
| Zeeland | ZE | 68 |
| Groningen | GR | 68 |
| Drenthe | DR | 57 |
| Flevoland | FL | 17 |
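
As a quick arithmetic check, the twelve regional counts can be summed and compared with the number of generated GHCIDs. The sketch uses only figures from this report. The report does not say how entries outside these twelve regions are classified, so the difference is printed without interpretation.

```python
# Counts copied from the regional table above, keyed by region code.
regional_counts = {
    "NH": 280, "ZH": 272, "OV": 218, "NB": 198, "GE": 182, "LI": 114,
    "UT": 102, "FR": 91, "ZE": 68, "GR": 68, "DR": 57, "FL": 17,
}
TOTAL_GHCIDS = 1673  # from the GHCID Status section

regional_total = sum(regional_counts.values())
unlisted = TOTAL_GHCIDS - regional_total
print(f"Listed regions: {regional_total}, not in table: {unlisted}")
```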
## Web Claims Distribution
| Claim Type | Count |
|------------|-------|
| org_name | 3,899 |
| social_facebook | 1,398 |
| description_short | 1,172 |
| social_instagram | 1,026 |
| email | 972 |
| social_linkedin | 795 |
| phone | 599 |
| social_youtube | 561 |
| social_twitter | 463 |
## Data Quality Notes
1. **All web_claims have XPath provenance** - Every claim extracted from websites includes:
   - `xpath` - exact location in HTML
   - `html_file` - path to archived HTML
   - `source_url` - original URL
   - `xpath_match_score` - 1.0 for exact matches
2. **GHCID collisions** - 91 collision groups identified:
   - Many are TRUE duplicates in source data (same institution listed multiple times)
   - All resolved with name suffixes per First Batch rule
3. **Missing data**:
   - 35 entries without website (no URL in source data)
   - 1 entry without GHCID (missing city)
   - ~15 website fetch failures (timeout, SSL errors, 404)
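
The provenance requirement in note 1 can be checked mechanically. A minimal sketch, assuming each claim is a mapping that carries the four provenance fields listed there. The `claim_type` and `claim_value` keys, the sample values, and the `missing_provenance` helper are hypothetical.

```python
REQUIRED_PROVENANCE = ("xpath", "html_file", "source_url", "xpath_match_score")


def missing_provenance(claim: dict) -> list[str]:
    """Return the provenance fields that are absent or empty in a claim."""
    return [field for field in REQUIRED_PROVENANCE if claim.get(field) in (None, "")]


sample_claim = {
    "claim_type": "org_name",             # hypothetical key
    "claim_value": "Example Museum",      # hypothetical key and value
    "xpath": "/html/head/title",          # made-up sample values below
    "html_file": "web/0001/index.html",
    "source_url": "https://example.org/",
    "xpath_match_score": 1.0,
}

print(missing_provenance(sample_claim))
```

Returning the list of missing fields, rather than a bare boolean, makes it easier to report which part of the provenance is incomplete.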
## Files
- Entry files: `data/nde/enriched/entries/*.yaml`
- Web archives: `data/nde/enriched/entries/web/*/`
- Collision report: `data/nde/enriched/ghcid_collision_report.json`
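
A sketch for enumerating the entry files with the standard library, assuming the script runs from a directory where the relative paths above resolve. `list_entry_files` is an illustrative helper, not part of the project.

```python
from pathlib import Path


def list_entry_files(root: Path = Path("data/nde/enriched/entries")) -> list[Path]:
    """Return the entry YAML files directly under root, sorted by name."""
    return sorted(root.glob("*.yaml"))


entry_files = list_entry_files()
print(f"{len(entry_files)} entry files found")
```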