Commit graph

7 commits

Author SHA1 Message Date
kempersc
fa5680f0dd Add initial versions of custodian hub UML diagrams in Mermaid and PlantUML formats
- Introduced custodian_hub_v3.mmd, custodian_hub_v4_final.mmd, and custodian_hub_v5_FINAL.mmd for Mermaid representation.
- Created custodian_hub_FINAL.puml and custodian_hub_v3.puml for PlantUML representation.
- Defined entities such as CustodianReconstruction, Identifier, TimeSpan, Agent, CustodianName, CustodianObservation, ReconstructionActivity, Appellation, ConfidenceMeasure, Custodian, LanguageCode, and SourceDocument.
- Established relationships and associations between entities, including temporal extents, observations, and reconstruction activities.
- Incorporated enumerations for various types, statuses, and classifications relevant to custodians and their activities.
2025-11-22 14:33:51 +01:00
kempersc
edb1e07941 Update schemata 2025-11-21 22:12:33 +01:00
kempersc
e6684e815b feat: Enhance hyponyms with additional labels and types for better classification 2025-11-20 07:52:23 +01:00
kempersc
38354539a6 feat: Add comprehensive harvester for Thüringen archives
- Implemented a new script to extract full metadata from 149 archive detail pages on archive-in-thueringen.de.
- Extracted data includes addresses, emails, phones, directors, collection sizes, opening hours, histories, and more.
- Introduced structured data parsing and error handling for robust data extraction.
- Added rate limiting between requests to limit load on the server.
- Results are saved in a JSON format with detailed metadata about the extraction process.
2025-11-20 00:25:45 +01:00
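The rate limiting and error handling described in commit 38354539a6 could look roughly like the sketch below. It is not the script from the commit: `RateLimiter`, `harvest`, and the `fetch` callable are hypothetical names chosen for illustration, and the one-request-per-interval policy is an assumption.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait().

    Hypothetical helper; the clock and sleep functions are injectable so the
    behaviour can be checked without real delays.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def harvest(urls, fetch, limiter):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns parsed metadata
    (assumed here). A failure on one page is recorded and does not stop
    the run.
    """
    results, errors = {}, {}
    for url in urls:
        limiter.wait()
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going; report the failure afterwards
            errors[url] = str(exc)
    return results, errors
```

Passing the limiter in as an argument keeps the delay policy separate from the parsing code, so the interval can be changed without touching the extraction logic.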
kempersc
3c80de87e0 Add ISIL entries 2025-11-19 23:25:22 +01:00
kempersc
5e9f54bd91 Deduplicate Brazilian institutions (212→121)
- Merged 91 duplicate Brazilian institution records
- Improved Wikidata coverage from 26.4% to 38.8% (+12.4pp)
- Created intelligent merge strategy:
  - Prefer records with higher confidence scores
  - Merge locations (prefer most complete)
  - Combine all unique identifiers
  - Combine all unique digital platforms
  - Combine all unique collections
- Added provenance notes documenting merges
- Created backup before deduplication
- Generated comprehensive deduplication report

Dataset changes:
- Total institutions: 13,502 → 13,411
- Brazilian institutions: 212 → 121
- Coverage: 47/121 institutions with Q-numbers (38.8%)
2025-11-11 22:08:34 +01:00
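The merge rules listed in commit 5e9f54bd91 can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the field names (`id`, `confidence`, `location`, `identifiers`, `digital_platforms`, `collections`, `provenance_notes`) are hypothetical, list items are assumed to be plain strings, and "most complete" is taken to mean "most non-empty fields".

```python
from copy import deepcopy


def _completeness(location):
    """Count the non-empty fields of a location dict."""
    return sum(1 for value in location.values() if value not in (None, "", [], {}))


def _union(records, field):
    """Combine a list-valued field across records, dropping repeats, keeping order."""
    seen, merged = set(), []
    for record in records:
        for item in record.get(field, []):
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def merge_duplicates(records):
    """Merge records that describe the same institution into one record."""
    if not records:
        raise ValueError("need at least one record")

    # Prefer the record with the highest confidence score as the base.
    base = max(records, key=lambda r: r.get("confidence", 0.0))
    merged = deepcopy(base)

    # Prefer the most complete location among all duplicates.
    locations = [r.get("location") or {} for r in records]
    merged["location"] = deepcopy(max(locations, key=_completeness))

    # Combine all unique identifiers, digital platforms, and collections.
    for field in ("identifiers", "digital_platforms", "collections"):
        merged[field] = _union(records, field)

    # Add a provenance note documenting the merge.
    note = "Merged records: " + ", ".join(r["id"] for r in records)
    merged["provenance_notes"] = list(base.get("provenance_notes", [])) + [note]
    return merged
```

Grouping the duplicates in the first place (by name, identifier, or another key) is a separate step that this sketch leaves out.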
kempersc
59c99bfb26 Brazil Batch 10: Enrich 8 institutions (26.4% coverage)
- Add Wikidata Q-numbers to 8 Brazilian institutions
- Coverage: 56/212 institutions (26.4%, +5.6pp gain)
- All Q-numbers validated via the authenticated Wikidata API
- Largest single batch gain yet
- Note: Duplicate entries detected, deduplication needed

Q-numbers added:
- Q10333651 - Museu da Borracha
- Q10387829 - UFAC Repository
- Q10345196 - Parque Memorial Quilombo dos Palmares
- Q1434444 - Teatro Amazonas
- Q116921020 - Centro Cultural dos Povos da Amazônia
- Q7894381 - UNIFAP
- Q16496091 - Arquivo Público do Estado da Bahia
- Q56695457 - Museu de Arqueologia e Etnologia da UFPR
2025-11-11 22:05:43 +01:00