NDE Wikidata Enrichment Report

Date: 2025-11-17
Status: Test Batch Complete (10 records)
Next Phase: Scale to Full Dataset (1,351 records)


Executive Summary

The Wikidata enrichment of the first 10 records from the NDE Dutch Heritage Organizations dataset is complete. Of those 10 organizations, 8 were matched to Wikidata entities, an 80% success rate.

Key Achievements

  • Test Batch Processed: 10 organizations from Drenthe province
  • YAML File Updated: Added wikidata_id fields to matched records
  • LinkML Schema Updated: Added wikidata_id and wikidata_enrichment_status fields
  • Enrichment Logs Created: Complete SPARQL query history and results
  • Backup Created: Full backup before modifications


Enrichment Results

Summary Statistics

| Metric | Count | Percentage |
|---|---|---|
| Total Records Processed | 10 | 100% |
| Successfully Matched | 8 | 80% |
| No Match Found | 2 | 20% |
| ISIL Codes Present | 9 | 90% |
| Museums | 4 | 40% |
| Archives | 6 | 60% |
| Historical Societies | 0 | 0% |

Detailed Results

| # | Organization | Type | Wikidata ID | Status |
|---|---|---|---|---|
| 1 | Stichting Herinneringscentrum Kamp Westerbork | museum | Q22246632 | ✓ Matched |
| 2 | Stichting Hunebedcentrum | museum | Q2679819 | ✓ Matched |
| 3 | Regionaal Historisch Centrum (RHC) Drents Archief | archief | Q1978308 | ✓ Matched |
| 4 | Stichting Drents Museum | museum | Q1258370 | ✓ Matched |
| 5 | Stichting Drents Museum De Buitenplaats | museum | - | ✗ No match |
| 6 | Gemeente Aa en Hunze | archief | Q300665 | ✓ Matched |
| 7 | Gemeente Borger-Odoorn | archief | Q835118 | ✓ Matched |
| 8 | Gemeente Coevorden | archief | Q60453 | ✓ Matched |
| 9 | Gemeente De Wolden | archief | Q835108 | ✓ Matched |
| 10 | Samenwerkingsorganisatie De Wolden/Hoogeveen | archief | - | ✗ No match |
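
The summary figures can be recomputed from the detailed results. The following is a minimal sketch, not the project's actual script: the `RESULTS` list is copied by hand from the table above, and the `summarize` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

# (organization, type, wikidata_id) rows copied from the Detailed Results table.
# None marks a record for which no match was found.
RESULTS = [
    ("Stichting Herinneringscentrum Kamp Westerbork", "museum", "Q22246632"),
    ("Stichting Hunebedcentrum", "museum", "Q2679819"),
    ("Regionaal Historisch Centrum (RHC) Drents Archief", "archief", "Q1978308"),
    ("Stichting Drents Museum", "museum", "Q1258370"),
    ("Stichting Drents Museum De Buitenplaats", "museum", None),
    ("Gemeente Aa en Hunze", "archief", "Q300665"),
    ("Gemeente Borger-Odoorn", "archief", "Q835118"),
    ("Gemeente Coevorden", "archief", "Q60453"),
    ("Gemeente De Wolden", "archief", "Q835108"),
    ("Samenwerkingsorganisatie De Wolden/Hoogeveen", "archief", None),
]


def summarize(rows):
    """Count processed and matched records, overall and per organization type."""
    total = len(rows)
    matched = sum(1 for _, _, qid in rows if qid is not None)
    by_type = Counter(org_type for _, org_type, _ in rows)
    matched_by_type = Counter(t for _, t, qid in rows if qid is not None)
    return {
        "total": total,
        "matched": matched,
        "match_rate": matched / total if total else 0.0,
        "by_type": dict(by_type),
        "matched_by_type": dict(matched_by_type),
    }


summary = summarize(RESULTS)
print(summary)
```

Running the same function over each later batch would give comparable statistics without counting by hand.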

Enrichment Methodology

Tools Used

  1. Wikidata MCP Service (authenticated)

    • wikidata-authenticated_search_entity - Entity search by name
    • wikidata-authenticated_get_metadata - Verification of Q-numbers
    • wikidata-authenticated_execute_sparql - Custom SPARQL queries
  2. SPARQL Queries - For precise matching:

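    # Find Dutch municipalities whose label contains "coevorden".
    # DISTINCT collapses rows that match in more than one label language.
    SELECT DISTINCT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q2039348 .  # Instance of: Dutch municipality
      ?item rdfs:label ?label .    # Labels in any language
      FILTER(CONTAINS(LCASE(?label), "coevorden"))
      SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
    } LIMIT 5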
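
For the full dataset, the municipality query has to be generated per record rather than written by hand. The sketch below shows one way to do that; `build_municipality_query` and `QUERY_TEMPLATE` are hypothetical names, not part of the existing script. The template follows the query above, with `DISTINCT` so that a municipality matching in several label languages is returned once.

```python
# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """SELECT DISTINCT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:Q2039348 .
  ?item rdfs:label ?label .
  FILTER(CONTAINS(LCASE(?label), "{needle}"))
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "nl,en". }}
}} LIMIT {limit}"""


def build_municipality_query(name: str, limit: int = 5) -> str:
    """Return a SPARQL query that searches Dutch municipalities by label."""
    # Lower-case to match LCASE(?label), then escape backslashes and double
    # quotes so the name cannot break out of the SPARQL string literal.
    needle = name.lower().replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(needle=needle, limit=int(limit))


query = build_municipality_query("Coevorden")
print(query)
```

Escaping the name matters because organization names in the YAML file are free text and may contain quotes.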
    

Search Strategy

  1. Initial Search: Direct entity search by organization name
  2. Verification: Retrieve metadata (label, description) to confirm match
  3. Fallback Strategy: SPARQL queries for entities not found via search
  4. Quality Check: Manual verification of all Q-numbers
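
The first three steps of this strategy can be sketched as a single function. This is an illustration under stated assumptions, not the project's implementation: `enrich_record` is a hypothetical name, and the three callables stand in for the MCP tools, whose real return shapes are not shown in this report. Step 4 (manual verification) stays a human task and is not modeled.

```python
from typing import Callable, Iterable


def enrich_record(
    name: str,
    search: Callable[[str], Iterable[str]],
    verify: Callable[[str, str], bool],
    sparql_fallback: Callable[[str], Iterable[str]],
) -> dict:
    """Return the fields to add to a record for the given organization name."""
    # Try direct entity search first; the SPARQL fallback is only called
    # when no search candidate passes verification.
    for find_candidates in (search, sparql_fallback):
        for qid in find_candidates(name):
            if verify(qid, name):
                return {"wikidata_id": qid}
    return {"wikidata_enrichment_status": "no_match_found"}


# Stub lookups standing in for the real services (assumed data, for demo only).
SEARCH_HITS = {"Stichting Drents Museum": ["Q1258370"]}
SPARQL_HITS = {"Gemeente Coevorden": ["Q60453"]}


def demo(name: str) -> dict:
    return enrich_record(
        name,
        search=lambda n: SEARCH_HITS.get(n, []),
        verify=lambda qid, n: True,  # stub: accept every candidate
        sparql_fallback=lambda n: SPARQL_HITS.get(n, []),
    )


print(demo("Stichting Drents Museum"))
print(demo("Gemeente Coevorden"))
print(demo("Stichting Drents Museum De Buitenplaats"))
```

Passing the lookups in as callables keeps the control flow testable without network access.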

Success Factors

  • Museums with international recognition: 100% match rate (3 of 3: Drents Museum, Hunebedcentrum, Westerbork)
  • Municipal archives: 100% match rate (4 of 4, each matched to the Wikidata entity of its municipality)
  • Regional archives: 100% match rate (1 of 1: Drents Archief)

Challenges

  • Branch locations: "De Buitenplaats" (branch of Drents Museum) not in Wikidata
  • Collaborative organizations: Inter-municipal partnerships not typically in Wikidata
  • Specialized organizations: Small local societies are less likely to have Wikidata entries (none appeared in this batch, so this is an expectation rather than a measured result)

Technical Implementation

YAML File Structure Update

Before Enrichment:

- plaatsnaam_bezoekadres: Assen
  straat_en_huisnummer_bezoekadres: Brink 1
  organisatie: Stichting Drents Museum
  webadres_organisatie: https://drentsmuseum.nl/
  type_organisatie: museum
  isil-code_na: NL-AsnDM

After Enrichment:

- plaatsnaam_bezoekadres: Assen
  straat_en_huisnummer_bezoekadres: Brink 1
  organisatie: Stichting Drents Museum
  webadres_organisatie: https://drentsmuseum.nl/
  type_organisatie: museum
  isil-code_na: NL-AsnDM
  wikidata_id: Q1258370  # ← NEW FIELD

Records with No Match:

- plaatsnaam_bezoekadres: Eelde
  organisatie: Stichting Drents Museum De Buitenplaats
  type_organisatie: museum
  wikidata_enrichment_status: no_match_found  # ← Status tracking
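
The update rule illustrated by these three snippets can be sketched as follows. This is a minimal illustration operating on plain dictionaries, as a YAML loader would produce them; `apply_enrichment` is a hypothetical name and the actual logic in the enrichment script may differ.

```python
from typing import Optional


def apply_enrichment(record: dict, qid: Optional[str]) -> dict:
    """Return a copy of the record with the enrichment outcome added."""
    updated = dict(record)  # leave the caller's record untouched
    if qid:
        updated["wikidata_id"] = qid
        # A matched record should not keep a stale no-match status.
        updated.pop("wikidata_enrichment_status", None)
    else:
        updated["wikidata_enrichment_status"] = "no_match_found"
    return updated


drents_museum = {
    "plaatsnaam_bezoekadres": "Assen",
    "organisatie": "Stichting Drents Museum",
    "type_organisatie": "museum",
    "isil-code_na": "NL-AsnDM",
}
buitenplaats = {
    "plaatsnaam_bezoekadres": "Eelde",
    "organisatie": "Stichting Drents Museum De Buitenplaats",
    "type_organisatie": "museum",
}

print(apply_enrichment(drents_museum, "Q1258370"))
print(apply_enrichment(buitenplaats, None))
```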

LinkML Schema Updates

Added two new slots to nde_yaml_target.yaml:

slots:
  wikidata_id:
    description: Wikidata entity identifier (Q-number)
    range: string
    required: false
    pattern: "^Q[0-9]+$"
    comments:
      - "Added through Wikidata enrichment process"
      - "Links organization to Wikidata knowledge graph"
  
  wikidata_enrichment_status:
    description: Status of Wikidata enrichment process
    range: string
    required: false
    comments:
      - "Values: 'no_match_found', 'pending', 'verified', 'manual_review_needed'"
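
Outside of LinkML tooling, the same two constraints can be checked with a few lines of Python. The sketch below is an assumption-labeled illustration: `validate_record` is a hypothetical helper, and the allowed status values are taken from the comment on the `wikidata_enrichment_status` slot above.

```python
import re

# Same pattern as the schema; fullmatch makes the ^ and $ anchors implicit
# and also rejects a trailing newline, which "$" alone would tolerate.
QID_PATTERN = re.compile(r"Q[0-9]+")

# Values listed in the slot comment of the schema.
ALLOWED_STATUSES = {"no_match_found", "pending", "verified", "manual_review_needed"}


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    qid = record.get("wikidata_id")
    if qid is not None and not QID_PATTERN.fullmatch(str(qid)):
        problems.append(f"invalid wikidata_id: {qid!r}")
    status = record.get("wikidata_enrichment_status")
    if status is not None and status not in ALLOWED_STATUSES:
        problems.append(f"unknown wikidata_enrichment_status: {status!r}")
    return problems


print(validate_record({"wikidata_id": "Q1258370"}))
print(validate_record({"wikidata_id": "1258370"}))
print(validate_record({"wikidata_enrichment_status": "not_found"}))
```

Both fields are optional in the schema, so a record carrying neither is treated as valid here.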

Enrichment Log Files

All queries logged in /data/nde/sparql/:

  • Prepared queries: 10 JSON files with search parameters
  • Master query log: Consolidated history of all SPARQL queries
  • Enrichment log: Results summary with timestamps and success rates

Data Quality Assessment

Match Confidence Levels

| Wikidata ID | Organization | Confidence | Verification Method |
|---|---|---|---|
| Q22246632 | Herinneringscentrum Kamp Westerbork | High | Direct match, ISIL verified |
| Q2679819 | Hunebedcentrum | High | Direct match, exact name |
| Q1978308 | Drents Archief | High | Direct match, ISIL verified |
| Q1258370 | Drents Museum | High | Direct match, ISIL verified |
| Q300665 | Gemeente Aa en Hunze | High | SPARQL query, verified label |
| Q835118 | Gemeente Borger-Odoorn | High | SPARQL query, verified label |
| Q60453 | Gemeente Coevorden | High | SPARQL query, exact match |
| Q835108 | Gemeente De Wolden | High | SPARQL query, exact match |

All matches verified with wikidata-authenticated_get_metadata to confirm:

  • Correct Dutch label
  • Appropriate entity description
  • Geographic location (municipality in Drenthe)
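
Part of this check could be automated before the manual pass. The sketch below assumes the metadata arrives as a dictionary with `label` and optional `isil` keys; that shape is an assumption made for illustration, since the actual output of wikidata-authenticated_get_metadata is not shown in this report. `verify_candidate` is a hypothetical helper.

```python
from typing import Optional


def verify_candidate(
    metadata: dict,
    organization_name: str,
    expected_isil: Optional[str] = None,
) -> bool:
    """Accept a candidate only if its label and ISIL code agree with our record."""
    label = metadata.get("label", "").casefold().strip()
    name = organization_name.casefold().strip()
    if not label:
        return False
    # Wikidata labels are often shorter than the legal name in the dataset
    # (for example "Coevorden" versus "Gemeente Coevorden"), so accept
    # containment in either direction.
    if label not in name and name not in label:
        return False
    # When both sides carry an ISIL code, they must be identical.
    candidate_isil = metadata.get("isil")
    if expected_isil and candidate_isil and candidate_isil != expected_isil:
        return False
    return True


print(verify_candidate({"label": "Coevorden"}, "Gemeente Coevorden"))
print(verify_candidate({"label": "Drents Archief", "isil": "NL-AsnDA"},
                       "Regionaal Historisch Centrum (RHC) Drents Archief",
                       expected_isil="NL-AsnDA"))
```

Containment is a loose test, so a check like this would narrow the manual review rather than replace it.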

No-Match Analysis

1. Stichting Drents Museum De Buitenplaats

  • Reason: Branch location/extension of main museum
  • Parent Institution: Stichting Drents Museum (Q1258370)
  • Recommendation: Manual Wikidata entry creation OR link to parent institution
  • Website: https://dmdebuitenplaats.nl/

2. Samenwerkingsorganisatie De Wolden/Hoogeveen

  • Reason: Inter-municipal administrative partnership
  • Nature: Shared services organization
  • Recommendation: May not warrant Wikidata entry (administrative entity, not public-facing heritage institution)
  • ISIL Code: NL-HgvSWO (indicates archival function)

Next Steps

Immediate Tasks (Complete by Next Session)

  1. Scale to Full Dataset (1,341 remaining records)

    • Batch processing script with rate limiting
    • Estimated time: 2-3 hours
    • Wikidata API limits: 1,000 requests/hour (authenticated)
  2. Handle Ambiguous Matches

    • Fuzzy matching threshold: 85% similarity
    • Manual review queue for low-confidence matches
    • Document decision criteria
  3. Manual Wikidata Entry Creation

    • Identify high-priority organizations without Q-numbers
    • Create Wikidata entries for significant institutions
    • Follow Wikidata notability guidelines
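
The time estimate for the full run follows from the request budget. The sketch below shows the arithmetic and a simple pacing loop. It takes the 1,000 requests/hour figure from the task list above as given and assumes two requests per record (one search, one metadata lookup); both `estimate_hours` and `process_with_rate_limit` are hypothetical helpers, not existing project code.

```python
import time


def estimate_hours(records: int, requests_per_record: int, requests_per_hour: int) -> float:
    """Lower bound on run time when every request counts against the budget."""
    return records * requests_per_record / requests_per_hour


def process_with_rate_limit(items, handler, requests_per_hour=1000,
                            sleep=time.sleep, clock=time.monotonic):
    """Call handler(item) for each item, spacing calls evenly over the hour."""
    min_interval = 3600.0 / requests_per_hour  # seconds between calls
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(handler(item))
    return results


# 1,341 remaining records, assuming 2 requests each at 1,000 requests/hour.
print(round(estimate_hours(1341, 2, 1000), 3))
```

Under these assumptions the run takes about 2.7 hours, which is consistent with the 2-3 hour estimate. Records that need the SPARQL fallback would add a third request and push the total higher.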

Quality Assurance

  • Verify all Q-numbers resolve to correct entities
  • Check for duplicate Q-numbers across dataset
  • Validate ISIL code alignment with Wikidata identifiers
  • Cross-reference with existing Wikidata ISIL properties (P791)
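
The duplicate check is straightforward to automate. A minimal sketch, assuming records are dictionaries keyed as in the YAML file; `find_duplicate_qids` is a hypothetical helper name.

```python
from collections import defaultdict


def find_duplicate_qids(records):
    """Map each Q-number used by more than one record to the organization names."""
    names_by_qid = defaultdict(list)
    for record in records:
        qid = record.get("wikidata_id")
        if qid:
            names_by_qid[qid].append(record.get("organisatie"))
    return {qid: names for qid, names in names_by_qid.items() if len(names) > 1}


records = [
    {"organisatie": "Gemeente Coevorden", "wikidata_id": "Q60453"},
    {"organisatie": "Stichting Drents Museum", "wikidata_id": "Q1258370"},
    # Hypothetical second record pointing at the same entity, for demonstration.
    {"organisatie": "Gemeentearchief Coevorden (example)", "wikidata_id": "Q60453"},
    {"organisatie": "Stichting Drents Museum De Buitenplaats"},
]

print(find_duplicate_qids(records))
```

A duplicate is not necessarily an error here: because municipal archives are matched to the municipality entity, two services of the same municipality would legitimately share a Q-number. The check is best read as a review list.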

Documentation Updates

  • Update data/nde/README.md with enrichment status
  • Create batch processing statistics report
  • Document manual review decisions
  • Update LinkML schema version to v0.2.1

Integration with Main GLAM Project

  • Map NDE organizations to HeritageCustodian class
  • Generate GHCIDs for all organizations
  • Export to JSON-LD with Wikidata links
  • Validate against main project LinkML schema
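
As an illustration of the JSON-LD export step, the sketch below maps one enriched record to a small JSON-LD document. It is an assumption-heavy placeholder: the project's HeritageCustodian class and GHCID scheme are not defined in this report, so the generic schema.org `Organization` type stands in for them, and `to_jsonld` is a hypothetical helper.

```python
import json

# Canonical URI prefix for Wikidata entities.
WIKIDATA_ENTITY_BASE = "http://www.wikidata.org/entity/"


def to_jsonld(record: dict) -> dict:
    """Build a minimal JSON-LD description of one organization record."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Organization",  # placeholder for the project's own class
        "name": record["organisatie"],
    }
    if record.get("webadres_organisatie"):
        doc["url"] = record["webadres_organisatie"]
    if record.get("wikidata_id"):
        # sameAs links the record to its Wikidata entity.
        doc["sameAs"] = WIKIDATA_ENTITY_BASE + record["wikidata_id"]
    return doc


record = {
    "organisatie": "Stichting Drents Museum",
    "webadres_organisatie": "https://drentsmuseum.nl/",
    "wikidata_id": "Q1258370",
}
print(json.dumps(to_jsonld(record), indent=2))
```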

Lessons Learned

What Worked Well

  • SPARQL queries: Highly effective for municipalities and government organizations
  • Metadata verification: Prevented false positives
  • Incremental approach: Test batch revealed patterns before scaling
  • Logging everything: Complete audit trail for reproducibility

What Needs Improvement

⚠️ Branch locations: Need strategy for sub-organizations
⚠️ Collaborative entities: May require manual curation
⚠️ Historical societies: Lower Wikidata coverage, may need enrichment workflow

Recommendations for Full Dataset

  1. Batch by organization type: Process museums → archives → libraries → societies separately
  2. ISIL code leverage: Use ISIL codes (P791) in SPARQL queries for higher precision
  3. Municipality fallback: For municipal archives, always search for municipality entity
  4. Manual review threshold: Flag matches with confidence < 85% for human verification
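
The 85% review threshold could be applied with a string-similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible similarity measure. The report does not say which measure is intended, so this choice is an assumption, and `similarity` and `triage` are hypothetical helpers.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.85  # matches below this go to the manual review queue


def similarity(name: str, label: str) -> float:
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, name.casefold(), label.casefold()).ratio()


def triage(name: str, label: str) -> str:
    """Return the enrichment status suggested by the similarity score."""
    if similarity(name, label) >= REVIEW_THRESHOLD:
        return "verified"
    return "manual_review_needed"


print(triage("Drents Museum", "drents museum"))
print(triage("Stichting Drents Museum", "Drents Museum"))
print(triage("Drents Museum", "Hunebedcentrum"))
```

Legal-form prefixes such as "Stichting" or "Gemeente" lower the score even when the match is correct, so names may need normalizing before comparison if too many good matches land in the review queue.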

Files Modified

Updated Files

  • /data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml (259 KB → 259 KB)
  • /data/nde/linkml/nde_yaml_target.yaml (223 lines → 242 lines)

New Files Created

  • /scripts/update_nde_yaml_with_wikidata_test_batch.py (enrichment script)
  • /data/nde/sparql/enrichment_log_test_batch_20251117_115941.json (results log)
  • /data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.backup.20251117_115940.yaml (backup)
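
The backup file name above follows a `<stem>.backup.<YYYYMMDD_HHMMSS><suffix>` pattern. A minimal sketch of how such a backup could be produced before each batch run is shown below; `backup_path` and `create_backup` are hypothetical helpers, and the existing script may do this differently.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_path(source: Path, now: datetime) -> Path:
    """Return the timestamped backup path next to the source file."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return source.with_name(f"{source.stem}.backup.{stamp}{source.suffix}")


def create_backup(source: Path) -> Path:
    """Copy the source file to its timestamped backup path and return that path."""
    target = backup_path(source, datetime.now())
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target


example = backup_path(
    Path("/data/nde/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.yaml"),
    datetime(2025, 11, 17, 11, 59, 40),
)
print(example.name)
```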

Existing Files

  • 10 prepared SPARQL query JSONs (from previous session)
  • Master query log (from previous session)

Appendix: Sample Wikidata Entries

Example 1: Museum (Herinneringscentrum Kamp Westerbork)

Wikidata ID: Q22246632
Label (NL): Herinneringscentrum Kamp Westerbork
Description: memorial center in Hooghalen, Netherlands
Properties:

  • P31 (instance of): Q33506 (museum)
  • P17 (country): Q55 (Netherlands)
  • P131 (located in): Q242103 (Midden-Drenthe)
  • P856 (official website): https://kampwesterbork.nl/
  • P791 (ISIL): NL-HhlHCKW ← Matches our data!

Example 2: Archive (Regionaal Historisch Centrum Drents Archief)

Wikidata ID: Q1978308
Label (NL): Drents Archief
Description: regional archive in Assen, Netherlands
Properties:

  • P31 (instance of): Q7075 (library) + Q166118 (archive)
  • P17 (country): Q55 (Netherlands)
  • P131 (located in): Q10013 (Assen)
  • P856 (official website): https://www.drentsarchief.nl/
  • P791 (ISIL): NL-AsnDA ← Matches our data!
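
Both examples so far carry an ISIL code in P791 that agrees with the dataset, which suggests looking entities up by ISIL directly. The sketch below builds such a query. `build_isil_query` is a hypothetical helper, and the validation pattern is a deliberately conservative assumption about ISIL syntax, used mainly to keep unexpected characters out of the query string.

```python
import re

# Conservative shape check: short alphabetic prefix, hyphen, then an identifier
# made of letters, digits, ":", "/" or "-". Assumed for illustration.
ISIL_PATTERN = re.compile(r"[A-Za-z]{1,4}-[A-Za-z0-9:/-]{1,11}")


def build_isil_query(isil: str) -> str:
    """Return a SPARQL query that finds entities with the given ISIL (P791)."""
    if not ISIL_PATTERN.fullmatch(isil):
        raise ValueError(f"not a plausible ISIL code: {isil!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:P791 "{isil}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }\n'
        "} LIMIT 5"
    )


print(build_isil_query("NL-AsnDA"))
```

An exact identifier match avoids the ambiguity of label search, so a hit from this query would need less manual verification than a name-based match.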

Example 3: Municipality (Gemeente Coevorden)

Wikidata ID: Q60453
Label (NL): Coevorden
Description: gemeente in Drenthe, Nederland
Properties:

  • P31 (instance of): Q2039348 (Dutch municipality)
  • P17 (country): Q55 (Netherlands)
  • P131 (located in): Q770 (Drenthe)
  • P856 (official website): https://www.coevorden.nl/

Contact & Support

Project Lead: GLAM Data Extraction Project
Repository: /Users/kempersc/apps/glam
Documentation: /docs/NDE_WIKIDATA_ENRICHMENT_REPORT.md
Questions: See AGENTS.md for AI agent instructions and workflows


End of Report