# NDE URL Discovery Report
**Date**: 2025-12-01
**Dataset**: NDE Enriched Entries
**Total Entries**: 1,674
## Summary Statistics
| Category | Count | Percentage |
|----------|-------|------------|
| With URL (any source) | 1,621 | 96.8% |
| Without URL | 53 | 3.2% |
| - With Wikidata ID | 31 | - |
| - Without Wikidata ID | 22 | - |
## URLs Discovered via Wikidata P856
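Querying Wikidata property P856 (official website) resolved **16 entries**: 15 now have an official website URL, and one (0212, Geldmuseum) turned out to have merged into another institution: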
| Entry ID | Wikidata | Institution | URL |
|----------|----------|-------------|-----|
| 0006 | Q81181279 | Rijckheyt | http://www.rijckheyt.nl |
| 0011 | Q110907483 | Stichting Nyen ä Wes van Nassau | https://www.nyenaenwasvannassau.nl/site/ |
| 0047 | Q110907518 | HKK Zuidkwartier | https://www.hkk-zuidkwartier.nl/ |
| 0059 | Q110907493 | Heemkunde Ravenstein | http://www.heemkunderavenstein.nl/ |
| 0107 | Q110907545 | Heemkunde Gemonde | http://www.heemkundegemonde.nl/ |
| 0135 | Q110995942 | Saet en Cruyt | https://www.saetencruyt.nl/ |
| 0211 | Q701 | Provincie Noord-Holland | https://www.noord-holland.nl/ |
| 0387 | Q2370540 | Collectie Six | http://nl.collectiesix.nl/ |
| 0507 | Q113119338 | Stichting Omroep Muziek | https://www.omroepmuziek.nl/ |
| 0700 | Q121225126 | Stadsarchief Oldenzaal | https://www.oldenzaal.nl/stadsarchief |
| 0839 | Q110279958 | Stichting Utrecht Altijd | https://www.utrechtaltijd.nl/ |
| 1022 | Q111509619 | Stichting Historische Behangsels | https://www.historischebehangsels.nl |
| 1195 | Q2334800 | Verzetsmuseum Zuid-Holland | http://www.verzetsmuseum-zh.nl |
| 1268 | Q2248381 | Spaarnestad Photo | http://www.spaarnestadphoto.nl |
| 1637 | Q59962362 | Bibliotheek De Groene Venen | https://www.bibliotheekdegroenevenen.nl |
| 0212 | Q1897962 | Geldmuseum | *Merged into DNB "De Nieuwe Schatkamer"* |
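
The P856 lookup can be sketched roughly as follows. This is a minimal illustration, not the script used for this report: it assumes the public Wikidata `wbgetclaims` API endpoint, and the function names and the User-Agent string are hypothetical. The network call is defined but not executed; the parsing step is shown on a trimmed offline sample.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_p856_request(qid: str) -> str:
    """Return the wbgetclaims URL that asks for the P856 claims of one item."""
    params = {
        "action": "wbgetclaims",
        "entity": qid,
        "property": "P856",
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urllib.parse.urlencode(params)}"


def extract_p856(response: dict) -> list[str]:
    """Collect official-website URLs from a wbgetclaims response.

    Deprecated claims and claims without a value are skipped.
    """
    claims = response.get("claims") or {}
    urls = []
    for claim in claims.get("P856", []):
        if claim.get("rank") == "deprecated":
            continue
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            urls.append(snak["datavalue"]["value"])
    return urls


def fetch_official_websites(qid: str, timeout: float = 10.0) -> list[str]:
    """Query Wikidata over the network (defined here, not called in this sketch)."""
    request = urllib.request.Request(
        build_p856_request(qid),
        headers={"User-Agent": "nde-url-discovery-sketch/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as handle:
        return extract_p856(json.load(handle))


# Offline sample shaped like a wbgetclaims response, trimmed to the fields used above.
sample = {
    "claims": {
        "P856": [
            {
                "rank": "normal",
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P856",
                    "datavalue": {"value": "http://www.rijckheyt.nl", "type": "string"},
                },
            }
        ]
    }
}
print(extract_p856(sample))
```

An empty result from `extract_p856` corresponds to the "has Wikidata ID but no P856" case, which is why those entries still need a web search.
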
## URLs Discovered via Web Search (Exa)
Web search found URLs for **10 additional entries**; entries 1170 and 1504 describe the same organization and share one URL:
| Entry ID | Institution | URL | Notes |
|----------|-------------|-----|-------|
| 0199 | Bibliotheek TU Kampen | https://tuu.nl/bibliotheek/ | Merged with TU Utrecht |
| 0389 | Heemkunde Ambt-Delden | https://www.heemkundedelden.nl/ | |
| 0413 | Historische Vereniging Old Deep'n | https://www.olddeepn.nl/ | Diepenveen |
| 0594 | Sambeeks Heem | https://www.sambeeksheem.nl/ | |
| 0627 | Stichting Weeshuisjes | https://www.weeshuisjes.nl/ | |
| 0715 | HDC Protestants Erfgoed | https://www.hdcvu.nl/ | Within VU Library |
| 0729 | Historische Vereniging Staphorst | https://www.historischeverenigingstaphorst.nl/ | |
| 0851 | Historische Vereniging Den Dolder | https://www.historischeverenigingdendolder.nl/ | |
| 1170 | Nederlandse Vereniging voor Papierknipkunst | https://papierknippen.nl/ | Original NDE entry |
| 1504 | Nederlandse Vereniging voor Papierknipkunst | https://papierknippen.nl/ | Same org, from NAN ISIL 2025-11-06 |
## Entries Without Dedicated Websites (Parent Organization Only)
| Entry ID | Institution | Parent Organization | Notes |
|----------|-------------|---------------------|-------|
| 1512 | Diocesane Commissie Kerkelijk Kunstbezit | Bisdom Roermond (https://bisdom-roermond.org/) | Commission managing diocesan church art - operates under diocese, no dedicated website |
## Problematic Entries
### NOT Heritage Custodians
| Entry ID | Wikidata | Name | Issue |
|----------|----------|------|-------|
| 0874 | Q2789869 | HEEMAF | Was an electrical equipment manufacturer (1906-1973), NOT a heritage institution |
**Recommendation**: Remove from dataset or mark as `status: NOT_HERITAGE`
### Closed/Defunct
| Entry ID | Wikidata | Name | Issue |
|----------|----------|------|-------|
| 1130 | Q110282061 | Museum Oud Westdorpe | Permanently closed |
**Recommendation**: Mark as `status: CLOSED`
### Duplicate Entries
| Entry IDs | Wikidata | Name | Issue |
|-----------|----------|------|-------|
| 0875, 0993 | Q110891769 | (Same institution) | Duplicate Wikidata ID |
| 0967, 0950 | Q110891782 (0967 only) | (Same institution) | Duplicate entries |
**Recommendation**: Merge records, keep one canonical entry
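
Duplicates that share a Wikidata ID can be found mechanically from the entry filenames. The sketch below assumes the `<entry id>_<QID>.yaml` naming seen in the filenames listed in this report; the function name is hypothetical. It cannot catch duplicates such as 0950/0967, where one file has no QID in its name, so those still need manual review.

```python
import re
from collections import defaultdict

# Filename pattern assumed from the examples in this report: <entry id>_<Wikidata QID>.yaml
FILENAME_RE = re.compile(r"^(?P<entry>\d{4})_(?P<qid>Q\d+)\.yaml$")


def find_duplicate_qids(filenames):
    """Map each Wikidata QID used by more than one file to the entry IDs that use it."""
    by_qid = defaultdict(list)
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match:  # names such as 0950_unknown.yaml carry no QID and are skipped
            by_qid[match["qid"]].append(match["entry"])
    return {qid: sorted(entries) for qid, entries in by_qid.items() if len(entries) > 1}


files = [
    "0874_Q2789869.yaml",
    "0875_Q110891769.yaml",
    "0950_unknown.yaml",
    "0967_Q110891782.yaml",
    "0993_Q110891769.yaml",
    "1130_Q110282061.yaml",
]
print(find_duplicate_qids(files))
```
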
## Entries Still Without URLs (~16)
These entries have no discoverable website:
| Entry ID | Name | Notes |
|----------|------|-------|
| Various | Small heemkundige kringen | May only have Facebook presence |
| Various | Historical societies | May be defunct or volunteer-run |
| 0210 | Greccio Museum (Leiden) | Located inside Hartebrug church, no dedicated website |
## Actions Taken
1. ✅ Created `docs/nde/` directory
2. ✅ Created this URL Discovery Report
3. ✅ Updated 25 YAML files with discovered URLs (15 Wikidata + 9 web search + 1 dedicated page)
4. ✅ Flagged problematic entries:
   - `0874_Q2789869.yaml` (HEEMAF): `NOT_HERITAGE` - was electrical manufacturer
   - `1130_Q110282061.yaml` (Museum Oud Westdorpe): `CLOSED` - permanently closed
   - `0875_Q110891769.yaml` & `0993_Q110891769.yaml`: `DUPLICATE` - same Wikidata ID
   - `0950_unknown.yaml` & `0967_Q110891782.yaml`: `DUPLICATE` - same institution
## Final URL Coverage
| Category | Count | Percentage |
|----------|-------|------------|
| **With URL** | 1,619 | **96.7%** |
| **Without URL** | 55 | 3.3% |
| **Flagged (duplicates, not heritage, closed)** | 5 | 0.3% |
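
The percentages in this table follow directly from the counts and the dataset total of 1,674 entries. A small sketch of the arithmetic (the helper name `share` is illustrative only):

```python
TOTAL_ENTRIES = 1674  # "Total Entries" from the report header


def share(count, total=TOTAL_ENTRIES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


counts = {"With URL": 1619, "Without URL": 55, "Flagged": 5}
for label, count in counts.items():
    print(f"{label}: {count:,} ({share(count)}%)")
```
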
## Remaining Entries Without URLs (55)
These entries have no discoverable website. Many fall into one or more of these groups:
- Small heemkundige kringen (local history societies)
- Entries with a Wikidata ID but no P856 (official website) property
- Defunct or volunteer-run organizations
- Organizations with a Facebook-only presence
**Example entries still missing URLs:**
- `0001_Q2679819`: Stichting Hunebedcentrum (has Wikidata, needs P856 check)
- `0139_de_hollandse_cirkel`: De Hollandse Cirkel (no Wikidata)
- `0144_Q2710899`: Nationaal Onderduik Museum (has Wikidata)
- `0148_Q69725772`: CollectieGelderland (has Wikidata)
- Various small historical societies in Overijssel
## Recommendations
1. **Query Wikidata again** for entries that have a Q-number but for which no P856 value was found
2. **Web search** the remaining entries by organization name
3. **Check for dedicated pages** on umbrella organization websites (as with Museum Greccio on the Hartebrug church site)
4. **Mark as `url_status: NOT_FOUND`** any entry for which no web presence exists
5. **Consider Facebook URLs** as a fallback for small volunteer organizations
## Technical Notes
- Added `digital_platforms` section with proper `platform_type` classification:
  - `OFFICIAL_WEBSITE` for standalone websites
  - `DEDICATED_PAGE` for pages on host websites (like Museum Greccio)
- Preserved `data_tier` in provenance:
  - `TIER_2_VERIFIED` for Wikidata P856 URLs
  - `TIER_4_INFERRED` for web search discovered URLs
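
As an illustration of how these two classifications combine, the sketch below builds one platform record as a plain dictionary. Only `platform_type`, `data_tier`, and their values come from this report; the keys `platform_url`, `provenance`, and `data_source`, the source labels `wikidata_p856` and `web_search`, and the function name are assumptions made for the example and may not match the actual schema.

```python
# Mapping stated in the notes above: discovery source -> data_tier.
# The source labels themselves are hypothetical.
TIER_BY_SOURCE = {
    "wikidata_p856": "TIER_2_VERIFIED",
    "web_search": "TIER_4_INFERRED",
}


def make_platform(url, source, dedicated_page=False):
    """Build one digital_platforms item with its platform_type and data_tier."""
    if source not in TIER_BY_SOURCE:
        raise ValueError(f"unknown discovery source: {source!r}")
    return {
        "platform_type": "DEDICATED_PAGE" if dedicated_page else "OFFICIAL_WEBSITE",
        "platform_url": url,  # hypothetical key name
        "provenance": {  # hypothetical nesting
            "data_source": source,
            "data_tier": TIER_BY_SOURCE[source],
        },
    }


entry = {
    "digital_platforms": [
        make_platform("http://www.rijckheyt.nl", "wikidata_p856"),
    ]
}
print(entry)
```
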
---
*Generated by NDE URL Discovery workflow*
*Last updated: 2025-12-01T16:30:00+00:00*
## Session Updates - December 2025
### 2025-12-01: NAN ISIL Batch Corrections
Fixed several issues with entries 1502-1513 (NAN ISIL 2025-11-06 batch):
| Entry ID | Issue | Resolution |
|----------|-------|------------|
| 1504 | Missing URL | Added https://papierknippen.nl/ (discovered via Exa) |
| 1508 | Wrong custodian_name ("Cookiesbeleid") | Corrected to "Parochiearchief Kampen" from NAN ISIL registry |
| 1508 | Wrong institution type (U) | Changed to A (Archive) |
| 1508 | Incorrect GHCID | Regenerated: NL-OV-KAM-A-PK |
| 1511 | Wrong custodian_name (exhibition title) | Already corrected to "Wereldmuseum Leiden" |
| 1511 | Wrong institution type (U) | Changed to M (Museum) |
| 1511 | GHCID based on exhibition title | Regenerated: NL-ZH-LEI-M-WL |
| 1512 | Wrong institution type (U) | Changed to H (Holy Sites - diocesan heritage commission) |
| 1512 | No website info | Added parent organization note (Bisdom Roermond) |
| 1512 | Incorrect GHCID | Regenerated: NL-LI-ROE-H-DCKK |
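
The three regenerated GHCIDs share a five-part, hyphen-separated shape. The sketch below splits an identifier into those parts. The component names (country, region, place, institution type, abbreviation) are inferred from these three examples only and are not an authoritative description of the GHCID format; the type labels are the ones used in the table above.

```python
from typing import NamedTuple

# Type codes as named in the table above; other codes are not covered here.
TYPE_LABELS = {"A": "Archive", "M": "Museum", "H": "Holy Sites"}


class Ghcid(NamedTuple):
    # Field names are an interpretation of the examples, not a specification.
    country: str
    region: str
    place: str
    institution_type: str
    abbreviation: str


def parse_ghcid(value: str) -> Ghcid:
    """Split a GHCID such as NL-OV-KAM-A-PK into its five hyphen-separated parts."""
    parts = value.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated parts, got {len(parts)}: {value!r}")
    return Ghcid(*parts)


for raw in ("NL-OV-KAM-A-PK", "NL-ZH-LEI-M-WL", "NL-LI-ROE-H-DCKK"):
    ghcid = parse_ghcid(raw)
    print(raw, "->", TYPE_LABELS.get(ghcid.institution_type, "unlisted type"))
```

A check like this would flag an identifier whose type letter no longer matches the corrected institution type, which is the situation the regenerated GHCIDs above resolve.
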