kempersc/glam - Forgejo: Beyond coding. We Forge.

Commit graph

Author

SHA1

Message

Date

kempersc

38354539a6

feat: Add comprehensive harvester for Thüringen archives

- Implemented a new script to extract full metadata from 149 archive detail pages on archive-in-thueringen.de.
- Extracted data includes addresses, emails, phones, directors, collection sizes, opening hours, histories, and more.
- Introduced structured data parsing and error handling for robust data extraction.
- Added rate limiting to respect server load and improve scraping efficiency.
- Results are saved in a JSON format with detailed metadata about the extraction process.

2025-11-20 00:25:45 +01:00
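
The harvester commit above mentions rate limiting, per-page error handling, and a JSON result that carries metadata about the extraction run. The script itself is not shown here, so the following is only a minimal sketch of how those three pieces might fit together. The names `harvest` and `fake_fetch`, the fixed-delay strategy, and the metadata field names are all assumptions made for illustration, not the repository's code.

```python
import json
import time
from datetime import datetime, timezone


def harvest(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests, and collect results.

    `fetch` is any callable that takes a URL and returns a dict of parsed
    fields; it is injected so the loop can be exercised without a network.
    """
    records, errors = [], []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause between requests (simple rate limit)
        try:
            records.append({"url": url, "data": fetch(url)})
        except Exception as exc:  # one failing page should not stop the run
            errors.append({"url": url, "error": str(exc)})
    return {
        "metadata": {  # hypothetical field names
            "extracted_at": datetime.now(timezone.utc).isoformat(),
            "total_urls": len(urls),
            "succeeded": len(records),
            "failed": len(errors),
        },
        "records": records,
        "errors": errors,
    }


def fake_fetch(url):
    """Stand-in for a real HTTP request plus HTML parsing (hypothetical)."""
    if url.endswith("/bad"):
        raise ValueError("page could not be parsed")
    return {"name": "Archive at " + url}


if __name__ == "__main__":
    result = harvest(
        ["https://example.org/a", "https://example.org/bad"],
        fake_fetch,
        delay_seconds=0.0,
    )
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

Passing `fetch` and `sleep` as parameters keeps the loop testable offline; a real run would supply a function that downloads and parses one archive detail page.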

kempersc

3c80de87e0

add isil entries

2025-11-19 23:25:22 +01:00

kempersc

e5a532a8bc

Add comprehensive tests for NLP institution extraction and RDF partnership integration

- Introduced `test_nlp_extractor.py` with unit tests for the InstitutionExtractor, covering various extraction patterns (ISIL, Wikidata, VIAF, city names) and ensuring proper classification of institutions (museum, library, archive).
- Added tests for extracted entities and result handling to validate the extraction process.
- Created `test_partnership_rdf_integration.py` to validate the end-to-end process of extracting partnerships from a conversation and exporting them to RDF format.
- Implemented tests for temporal properties in partnerships and ensured compliance with W3C Organization Ontology patterns.
- Verified that extracted partnerships are correctly linked with PROV-O provenance metadata.

2025-11-19 23:20:47 +01:00
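
The test commit above refers to extraction patterns for ISIL codes, Wikidata and VIAF identifiers, and to classifying institutions as museum, library, or archive. As a rough illustration of what pattern-based extraction of this kind can look like, here is a small sketch using regular expressions. It is not the `InstitutionExtractor` from the repository: the function names, the simplified ISIL pattern (two-letter country prefix only), the keyword list, and the identifiers in the sample text are all made up for the example.

```python
import re

# Simplified patterns (assumptions, not the repository's own):
# ISIL: two-letter prefix, hyphen, then up to 11 identifier characters.
ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b")
# Wikidata item IDs: "Q" followed by a positive integer.
WIKIDATA_RE = re.compile(r"\bQ[1-9]\d*\b")
# VIAF IDs as they appear in viaf.org URLs.
VIAF_RE = re.compile(r"viaf\.org/viaf/(\d+)")

# Hypothetical keyword table for a naive type guess.
TYPE_KEYWORDS = {
    "museum": ("museum",),
    "library": ("library", "bibliotheek", "bibliothek"),
    "archive": ("archive", "archief", "archiv"),
}


def extract_identifiers(text):
    """Return every ISIL, Wikidata, and VIAF identifier found in `text`."""
    return {
        "isil": ISIL_RE.findall(text),
        "wikidata": WIKIDATA_RE.findall(text),
        "viaf": VIAF_RE.findall(text),
    }


def classify_institution(name):
    """Guess an institution type from keywords in its name, or None."""
    lowered = name.lower()
    for kind, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return kind
    return None


if __name__ == "__main__":
    # The identifiers below are invented placeholders, not real records.
    sample = (
        "Example Museum (ISIL NL-Example01, Wikidata Q12345), "
        "see https://viaf.org/viaf/123456789"
    )
    print(extract_identifiers(sample))
    print(classify_institution("Example Museum"))
```

A unit test in the style the commit describes would feed short strings like `sample` to the extractor and assert on the returned identifiers and type.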

3 commits