glam/data
kempersc 49f4054802 data(person/entity): add 83,845 LinkedIn profile extractions from company pages
Bulk extraction of heritage professional profiles from LinkedIn company pages
using the extract_persons_with_provenance.py script.

Key characteristics:
- Source: LinkedIn company 'People' pages for heritage institutions
- File format: {linkedin-slug}_{timestamp}.json
- Total size: ~3.6GB
- Includes: profile_data, heritage_relevance, affiliations, web_claims
- Provenance: Full XPath + archived HTML references (Rule 6 compliant)
- Dual timestamps: statement_created_at + source_archived_at (Rule 35)

Extraction metadata includes:
- extraction_agent: extract_persons_with_provenance.py
- source_file: Original archived HTML filename
- source_archived_at: When the LinkedIn page was captured
- schema_version: 1.0.0
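
A minimal sketch of how a consumer might check that a profile file carries the four metadata fields listed above. The top-level key name `extraction_metadata`, the helper `missing_metadata_fields`, and all sample values are assumptions for illustration; the actual layout written by extract_persons_with_provenance.py may differ.

```python
# Field names taken from the commit message; everything else is hypothetical.
REQUIRED_METADATA_FIELDS = (
    "extraction_agent",
    "source_file",
    "source_archived_at",
    "schema_version",
)


def missing_metadata_fields(record: dict) -> list[str]:
    """Return the required metadata fields that are absent or empty.

    Assumes the metadata lives under a top-level "extraction_metadata" key,
    which is an illustrative guess, not a confirmed part of the schema.
    """
    metadata = record.get("extraction_metadata") or {}
    return [name for name in REQUIRED_METADATA_FIELDS if not metadata.get(name)]


# Hypothetical sample record with placeholder values.
sample_record = {
    "extraction_metadata": {
        "extraction_agent": "extract_persons_with_provenance.py",
        "source_file": "example-institution.html",
        "source_archived_at": "2026-01-10T12:00:00+01:00",
        "schema_version": "1.0.0",
    }
}

incomplete_record = {
    "extraction_metadata": {"extraction_agent": "extract_persons_with_provenance.py"}
}

complete_missing = missing_metadata_fields(sample_record)
incomplete_missing = missing_metadata_fields(incomplete_record)
```

Returning the list of missing names, rather than a boolean, makes it easier to log which field a given file lacks when scanning a large batch.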

Note: URL-encoded filenames preserve international characters (Arabic,
Hebrew, Chinese, Turkish, accented Latin, etc.)
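
The naming scheme and the URL-encoding note can be illustrated with a short sketch. It assumes the slug is percent-encoded as UTF-8 and that the timestamp contains no underscore; the timestamp format, the sample slug, and the helpers `build_filename` and `decode_slug` are hypothetical and are not taken from the extraction script.

```python
from urllib.parse import quote, unquote


def build_filename(slug: str, timestamp: str) -> str:
    """Percent-encode the slug and append the timestamp (illustrative only)."""
    # safe="" encodes every reserved character, including "/".
    return f"{quote(slug, safe='')}_{timestamp}.json"


def decode_slug(filename: str) -> str:
    """Recover the human-readable slug from an encoded filename.

    Assumes the timestamp has no underscore, so the last "_" separates
    the slug from the timestamp.
    """
    stem = filename[: -len(".json")]
    encoded_slug, _, _timestamp = stem.rpartition("_")
    return unquote(encoded_slug)


# Hypothetical slug with a Turkish character and an assumed timestamp format.
name = build_filename("müze-istanbul", "20260110T132708Z")
slug = decode_slug(name)
```

Because percent-encoding is reversible, the original slug can be recovered from the filename alone, without opening the JSON file.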
2026-01-10 13:27:08 +01:00
custodian data(person/entity): add 83,845 LinkedIn profile extractions from company pages 2026-01-10 13:27:08 +01:00
custodian.backup.20251230
custodian_sample
entity_annotation
examples
extracted
google_maps_enrichment
instances
intangible_heritage
isil
json
jsonld
manual_enrichment
museum_register_nl
nde
ontology
person data(person): update profiles with web claims and PPID corrections 2026-01-10 12:56:28 +01:00
rag_eval chore: minor updates and evaluation results 2026-01-09 21:10:55 +01:00
raw
rdf
reference
reports
review
test
training
unified
validation
web/lap_gaza_report_2024
wikidata
wikpedia/Destruction_of_cultural_heritage_during_the_Israeli_invasion_of_the_Gaza_Strip
collision_edge_case_analysis.md
deduplication_improvement_summary.md
dutch_collision_report.txt
dutch_collision_stats.json
dutch_deduplication_report.txt
dutch_institutions_with_ghcids.yaml
extraction_checkpoint.json
failed_crawl_urls.txt
failed_crawl_urls_round1_backup.txt
failed_crawl_urls_round3_backup.txt
failed_crawl_urls_round4.txt
ISIL-codes_2025-08-01.csv
linkedin_locations.json
mexican_geography_analysis.yaml
missing_annotations_checkpoint.json
NDE-logo-RGB-basis-nl-blauw.png
reenrich_queue.json
sparql_templates.yaml
temp_conv1_artifact2.md
temp_conv2_artifact1.md
temp_mexican_conv1.json
temp_mexican_conv2.json
unenriched_urls_round2.txt
xxx_matches.json