Kamp Westerbork Digital Collection - Data Harvesting Analysis
Date: 2025-11-17
Institution: Herinneringscentrum Kamp Westerbork
URL: https://collecties.kampwesterbork.nl/
ISIL Code: NL-HhlHCKW
Executive Summary
Herinneringscentrum Kamp Westerbork operates a digital collection platform built on Atlantis CMS with IIIF image services, a Spinque search backend, and Sanity.io content management. The platform uses persistent HTTP URIs for linked data but offers no documented public API, so harvesting requires reverse-engineering the Next.js data endpoints and the Spinque search API.
Collections Management System
Primary System: Atlantis (by Atlantis Erfgoed)
- Provider: Atlantis Erfgoed
- Type: Web-based archive and collection management system
- Standards: ISAD(G), ISAAR(CPF), ISDF (archival standards)
- Confirmed Usage: Listed in Dutch Organizations CSV as using "Atlantis"
Atlantis Features:
- Multimedia information system (links images, audio/video, documents)
- Depot management for physical storage tracking
- Integrated online publication platform
- Plugin-based architecture for extensibility
Reference: Dutch Organizations CSV confirms "Atlantis" as their system:
Drenthe,Hooghalen,Oosthalen 8,Stichting Herinneringscentrum Kamp Westerbork,,https://kampwesterbork.nl/ ,museum,,NL-HhlHCKW,,Atlantis,ja,,ja
Digital Infrastructure Architecture
1. Frontend Framework
- Technology: Next.js (React-based, server-side rendered)
- Build ID: b6DuV8zPMsiXPQxO3Lrq8 (visible in /_next/static/ paths)
- Deployment: Static assets via _next/static/chunks/
Key Observation: The site uses Next.js data fetching via JSON endpoints like:
https://collecties.kampwesterbork.nl/_next/data/b6DuV8zPMsiXPQxO3Lrq8/nl/zoeken.json?term=Amsterdam
This is a harvestable endpoint for search results!
2. Search Backend: Spinque API
Endpoint: https://collecties.kampwesterbork.nl/api/spinque-proxy
Method: POST
Status: ✅ Active and responding (observed multiple 200 OK responses during search)
Observed Behavior:
- 3 POST requests fired on search query "Amsterdam"
- Likely structure: {query: "Amsterdam", page: 1, filters: {...}}
- Returns JSON with search results (persons, works, entities)
Search Result Statistics (from "Amsterdam" query):
- 65,581 total results
- 63,659 persons (5,305 pages at ~12 results/page)
- Documents: Brief (894), Briefkaart (179), Bewijs van verzending (103), etc.
- Pagination: /zoeken?term=Amsterdam&page=2 (URL-based) and &personPage=2 (person-specific)
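The page count quoted above follows from ceiling division of the result totals. A small check, assuming the observed "~12 results/page" is an exact page size (the constant names are illustrative, not taken from the site):

```python
import math

TOTAL_RESULTS = 65_581   # all hits for "Amsterdam", as reported above
PERSON_RESULTS = 63_659  # person hits for the same query
PAGE_SIZE = 12           # assumption: the observed ~12 results per page is exact


def page_count(total: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed to list `total` results."""
    return math.ceil(total / page_size)


person_pages = page_count(PERSON_RESULTS)
# Everything that is not a person (documents, entities, ...) for this query.
non_person_results = TOTAL_RESULTS - PERSON_RESULTS
```

At one request per second, walking all 5,305 person pages would take about 88 minutes, which gives a rough lower bound for a single-query crawl.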
Harvesting Strategy:
- Intercept or reverse-engineer Spinque API calls
- Use Chrome DevTools Network tab to capture POST request payload
- Replicate requests with Python requests library
- Iterate through pagination parameters
3. IIIF Image API 3.0
Base URL: https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/
Format: Standard IIIF Image API
https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/{IMAGE_ID}/full/{SIZE}/0/default.jpg
Examples:
- 17614498/full/!100,100/0/default.jpg (thumbnail, max 100x100px)
- 17614498/full/!300,300/0/default.jpg (medium, max 300x300px)
- 17614498/full/!600,600/0/default.jpg (large, max 600x600px)
- 17614498/full/max/0/default.jpg (full resolution)
IIIF Info Document: https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/{IMAGE_ID}/info.json
Compliance: Fully IIIF 3.0 compliant ✅
Harvesting: Use IIIF manifest URLs if available, or construct URLs from image IDs
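Constructing image URLs from numeric IDs can be sketched as follows. The URL template and example ID come from the section above; the function names and the max_edge parameter are illustrative choices:

```python
from typing import Optional

IIIF_BASE = "https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0"


def iiif_image_url(image_id: str, max_edge: Optional[int] = None) -> str:
    """Build a IIIF Image API URL for the full region, unrotated, as JPEG.

    max_edge=None requests the `max` size; otherwise the image is scaled
    to fit inside a max_edge x max_edge box (the `!w,h` size form).
    """
    size = "max" if max_edge is None else f"!{max_edge},{max_edge}"
    return f"{IIIF_BASE}/{image_id}/full/{size}/0/default.jpg"


def iiif_info_url(image_id: str) -> str:
    """URL of the info.json document describing the image."""
    return f"{IIIF_BASE}/{image_id}/info.json"


thumbnail = iiif_image_url("17614498", max_edge=100)
full_resolution = iiif_image_url("17614498")
```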
4. Content Management: Sanity.io
CDN: https://cdn.sanity.io/images/e44e8rzu/production/
Project ID: e44e8rzu
Dataset: production
Image Examples:
https://cdn.sanity.io/images/e44e8rzu/production/daba9f3b89cd9c794d4baa061296c803705bcf55-658x548.jpg?w=1599&fit=max
Sanity API Endpoint (potentially accessible):
https://e44e8rzu.api.sanity.io/v2021-10-21/data/query/production?query=*[_type == "story"]
Use Case: Sanity stores editorial content (stories/verhalen), not collection metadata
Harvesting: Query Sanity API for stories, but collection data is in Atlantis/Spinque
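The query string in the endpoint above needs URL-encoding before it is sent. A minimal sketch of building that URL, assuming the project ID, dataset, and API version shown above and assuming that a `story` document type exists (neither the type name nor public read access has been verified):

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sanity_query_url(groq: str,
                     project_id: str = "e44e8rzu",
                     dataset: str = "production",
                     api_version: str = "v2021-10-21") -> str:
    """Build a Sanity query URL with the GROQ expression URL-encoded."""
    base = f"https://{project_id}.api.sanity.io/{api_version}/data/query/{dataset}"
    return f"{base}?{urlencode({'query': groq})}"


# `story` is the document type guessed in this report, not a confirmed schema name.
url = sanity_query_url('*[_type == "story"]')

# Round-trip check: decoding the query string recovers the GROQ expression.
decoded_query = parse_qs(urlsplit(url).query)["query"][0]
```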
Persistent Identifier Schemes
1. Work/Object IDs
Format: https://data.kampwesterbork.nl/work/{IDENTIFIER}
Examples:
- TH0000017587 (ThesisHolocaust collection prefix)
- HCIS.00011671 (Herinneringscentrum Information System)
Web URL Structure (URL-encoded PIDs):
https://collecties.kampwesterbork.nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587
Note: Direct access to https://data.kampwesterbork.nl/work/TH0000017587 returns connection refused (likely internal-only or requires authentication).
2. Person IDs
Format: https://kampwesterbork.nl/data/person/{NUMERIC_ID}
Examples:
- 10907179 (Marie Schönberg-Amsterdam)
- 13442221 (Abraham Peekel)
- 14808721 (Henriette Kalker)
Web URL Structure:
https://collecties.kampwesterbork.nl/persoon/https%3A%2F%2Fkampwesterbork.nl%2Fdata%2Fperson%2F10907179
Harvesting: Person IDs are visible in search results - can be extracted from Spinque API responses.
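Both work and person pages embed the full persistent URI, percent-encoded, as the last path segment. The sketch below converts in both directions; the function names are illustrative:

```python
from urllib.parse import quote, unquote

SITE = "https://collecties.kampwesterbork.nl"


def page_url(kind: str, pid: str) -> str:
    """Web page URL for a record; kind is 'persoon' or 'werk'."""
    # safe="" forces ':' and '/' to be percent-encoded as well.
    return f"{SITE}/{kind}/{quote(pid, safe='')}"


def pid_from_page_url(url: str) -> str:
    """Recover the persistent URI from a record page URL."""
    # The encoded PID contains no literal '/', so it is the last path segment.
    return unquote(url.rsplit("/", 1)[-1])


person_pid = "https://kampwesterbork.nl/data/person/10907179"
person_page = page_url("persoon", person_pid)
```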
3. Entity/Concept IDs (Thesaurus)
Format: https://digitaalerfgoed.poolparty.biz/westerbork/{ID}
Examples:
- joodse%20gemeente%20nieuw-amsterdam (Joodse gemeente Nieuw-Amsterdam)
- doorgangskamp%20westerbork (Doorgangskamp Westerbork periods)
System: PoolParty Semantic Suite (controlled vocabulary management)
Use Case: Subject headings, locations, events, organizational changes (1939-1971 camp periods)
4. Image IDs
Format: Simple numeric identifiers (no URI prefix)
Examples: 17614498, 17614499, 15557762
Retrieval: Via IIIF Image API using numeric ID
Data Harvesting Strategies
Strategy 1: Next.js Data Endpoints ⭐ RECOMMENDED
Approach: Scrape Next.js JSON data files for server-side rendered pages
Example Endpoint:
curl "https://collecties.kampwesterbork.nl/_next/data/b6DuV8zPMsiXPQxO3Lrq8/nl/zoeken.json?term=Amsterdam&page=1"
Advantages:
- ✅ Returns structured JSON (not HTML parsing)
- ✅ Contains all data rendered on page (persons, works, entities)
- ✅ Pagination via ?page=N parameter
- ✅ No authentication required
Disadvantages:
- ⚠️ Build ID (b6DuV8zPMsiXPQxO3Lrq8) may change on redeployment
- ⚠️ Requires discovering build ID from homepage HTML or _buildManifest.js
Implementation:
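import re
import requests

def extract_build_id_from_html(html):
    """Return the Next.js build ID found in the page's embedded JSON."""
    match = re.search(r'"buildId":"([^"]+)"', html)
    if match is None:
        raise ValueError("buildId not found in page HTML")
    return match.group(1)

# 1. Discover current build ID
response = requests.get("https://collecties.kampwesterbork.nl/", timeout=30)
response.raise_for_status()
build_id = extract_build_id_from_html(response.text)
# 2. Query search data endpoint
base_url = f"https://collecties.kampwesterbork.nl/_next/data/{build_id}/nl/zoeken.json"
params = {"term": "Amsterdam", "page": 1}
data = requests.get(base_url, params=params, timeout=30).json()
# 3. Parse results (key names are assumed; verify against a real response)
persons = data['pageProps']['searchResults']['persons']
works = data['pageProps']['searchResults']['works']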
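Iterating over pages can be layered on top of the single request above. The sketch below takes the page-fetching function as a parameter so the loop can be exercised without network access; the pageProps/searchResults path is the same unverified assumption as in the snippet above, and iter_results and fetch_page are illustrative names:

```python
from typing import Callable, Dict, Iterator


def iter_results(fetch_page: Callable[[int], Dict],
                 key: str,
                 max_pages: int = 10) -> Iterator[Dict]:
    """Yield records from successive pages until a page comes back empty.

    fetch_page(page_number) must return the decoded JSON of one data page.
    key selects the result list, e.g. "persons" or "works".
    """
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        # Assumed JSON layout; missing keys are treated as "no results".
        records = data.get("pageProps", {}).get("searchResults", {}).get(key, [])
        if not records:
            break
        yield from records


# Stand-in for the real HTTP call: two pages of results, then nothing.
FAKE_PAGES = {
    1: {"pageProps": {"searchResults": {"persons": [{"id": "a"}, {"id": "b"}]}}},
    2: {"pageProps": {"searchResults": {"persons": [{"id": "c"}]}}},
}

collected = list(iter_results(lambda n: FAKE_PAGES.get(n, {}), "persons"))
```

In a real harvester, fetch_page would wrap the requests.get call with the page parameter. Since persons appear to paginate through personPage rather than page, the wrapper should set whichever parameter matches the key being harvested.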
Strategy 2: Spinque Search API (Reverse-Engineered)
Approach: Intercept and replicate POST requests to /api/spinque-proxy
Endpoint: https://collecties.kampwesterbork.nl/api/spinque-proxy
Observed Behavior:
- 3 POST requests per search query
- Likely separate calls for: persons, works, entities/topics
Reverse Engineering Steps:
- Open Chrome DevTools → Network tab
- Perform search query on website
- Find POST requests to spinque-proxy
- Copy request payload (JSON) and headers
- Replicate with Python requests.post()
Expected Payload Structure (hypothetical):
{
"query": "Amsterdam",
"page": 1,
"pageSize": 12,
"filters": {
"type": "person"
}
}
Advantages:
- ✅ Direct access to search engine
- ✅ Structured JSON responses
- ✅ May support advanced filtering
Disadvantages:
- ⚠️ Undocumented API (may change without notice)
- ⚠️ Requires reverse-engineering request format
- ⚠️ No guarantee of rate limit policy
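Once the real request format has been captured, the three per-search requests could be generated along these lines. This sketch only mirrors the hypothetical payload shown above: every field name, and the "work" and "entity" type values, are guesses that must be replaced with what DevTools actually shows.

```python
import json
from typing import Dict, List, Optional, Sequence


def build_payload(query: str,
                  page: int = 1,
                  page_size: int = 12,
                  record_type: Optional[str] = None) -> Dict:
    """Build one request body in the hypothetical format from this report."""
    payload: Dict = {"query": query, "page": page, "pageSize": page_size}
    if record_type is not None:
        payload["filters"] = {"type": record_type}
    return payload


def payloads_for_search(query: str,
                        record_types: Sequence[str] = ("person", "work", "entity")
                        ) -> List[Dict]:
    """One payload per record type, matching the three observed POSTs."""
    return [build_payload(query, record_type=t) for t in record_types]


bodies = payloads_for_search("Amsterdam")
first_body_json = json.dumps(bodies[0], sort_keys=True)
```

Each body would then be sent with requests.post(endpoint, json=body), reusing any headers copied from the captured request.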
Strategy 3: IIIF Manifest Harvesting (If Available)
Approach: Check for IIIF Presentation API manifests
Test Endpoints:
# Collection-level manifest
curl "https://kenniscentrum.kampwesterbork.nl/iiif/collection/top"
# Object-level manifest
curl "https://kenniscentrum.kampwesterbork.nl/iiif/presentation/v3/TH0000017587/manifest.json"
If manifests exist:
- ✅ Contains structured metadata (title, creator, date, description)
- ✅ Links to all images for an object
- ✅ IIIF standard = easy parsing with libraries (iiif-prezi3, pyld)
Status: ⚠️ Unknown - needs testing (not observed in initial session)
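If a manifest does turn out to exist, its label and image references can be read with plain dictionary access. The sketch below assumes the standard IIIF Presentation 3.0 layout (manifest → canvases → annotation pages → annotations → body); the sample manifest is hand-written for illustration and is not a response from the server:

```python
from typing import Dict, List, Optional


def first_label(manifest: Dict) -> Optional[str]:
    """Return the first label string from the manifest's language map."""
    for values in manifest.get("label", {}).values():
        if values:
            return values[0]
    return None


def image_body_ids(manifest: Dict) -> List[str]:
    """Collect the id of every painting annotation body in canvas order."""
    ids = []
    for canvas in manifest.get("items", []):
        for annotation_page in canvas.get("items", []):
            for annotation in annotation_page.get("items", []):
                body = annotation.get("body", {})
                if isinstance(body, dict) and "id" in body:
                    ids.append(body["id"])
    return ids


# Hand-written sample in Presentation 3.0 shape (illustrative only).
sample_manifest = {
    "type": "Manifest",
    "label": {"nl": ["Brief"]},
    "items": [{
        "type": "Canvas",
        "items": [{
            "type": "AnnotationPage",
            "items": [{
                "type": "Annotation",
                "motivation": "painting",
                "body": {
                    "id": "https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/17614498/full/max/0/default.jpg",
                    "type": "Image",
                },
            }],
        }],
    }],
}
```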
Strategy 4: Web Scraping with Playwright/Selenium
Approach: Automate browser to render JavaScript and extract data from DOM
Use Case: Fallback if JSON endpoints are unavailable
Advantages:
- ✅ Works even with heavy JavaScript rendering
- ✅ Can handle dynamic content loading
Disadvantages:
- ❌ Slow (browser overhead)
- ❌ Fragile (breaks if HTML structure changes)
- ❌ Resource-intensive
Not Recommended when JSON endpoints are available.
Strategy 5: Sitemap Crawling
Check for Sitemap:
curl "https://collecties.kampwesterbork.nl/sitemap.xml"
curl "https://collecties.kampwesterbork.nl/robots.txt"
If sitemap exists:
- ✅ Lists all public URLs (persons, works, entities)
- ✅ Can iterate through all records systematically
- ✅ Respects website's preferred crawl order
Implementation:
import xml.etree.ElementTree as ET
import requests

sitemap_url = "https://collecties.kampwesterbork.nl/sitemap.xml"
response = requests.get(sitemap_url, timeout=30)
response.raise_for_status()
root = ET.fromstring(response.content)
urls = [url.text for url in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')]
# harvest_work_page and harvest_person_page are placeholders for
# per-record harvesters that still have to be written.
for url in urls:
    if '/werk/' in url:
        harvest_work_page(url)
    elif '/persoon/' in url:
        harvest_person_page(url)
URL Pattern Analysis
Search Results Page
/zoeken?term={QUERY}&page={PAGE_NUMBER}&personPage={PERSON_PAGE}
Parameters:
- term: Search query (URL-encoded)
- page: Page number for works/documents (1-based)
- personPage: Page number for persons (1-based, separate pagination)
Example:
https://collecties.kampwesterbork.nl/zoeken?term=Amsterdam&page=1&personPage=1
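Because works and persons paginate independently, a crawler has to advance the two counters separately. A minimal sketch of building such URLs (search_url is an illustrative name):

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://collecties.kampwesterbork.nl/zoeken"


def search_url(term: str, page: int = 1, person_page: int = 1) -> str:
    """Search page URL with independent work and person page counters."""
    # urlencode percent-encodes the term, so spaces and diacritics are safe.
    query = urlencode({"term": term, "page": page, "personPage": person_page})
    return f"{SEARCH_BASE}?{query}"


first_page = search_url("Amsterdam")
third_person_page = search_url("Amsterdam", person_page=3)
```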
Person Detail Page
/persoon/{URL_ENCODED_PERSON_URI}
Example:
https://collecties.kampwesterbork.nl/persoon/https%3A%2F%2Fkampwesterbork.nl%2Fdata%2Fperson%2F10907179
Data Endpoint (hypothetical):
https://collecties.kampwesterbork.nl/_next/data/{BUILD_ID}/nl/persoon/https%3A%2F%2Fkampwesterbork.nl%2Fdata%2Fperson%2F10907179.json
Work/Object Detail Page
/werk/{URL_ENCODED_WORK_URI}
Example:
https://collecties.kampwesterbork.nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587
Data Endpoint (hypothetical):
https://collecties.kampwesterbork.nl/_next/data/{BUILD_ID}/nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587.json
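The hypothetical data endpoints for detail pages follow the same pattern as the page URLs, with the build ID and locale inserted and a .json suffix appended. A sketch under that assumption (next_data_url is an illustrative name, and the endpoints themselves are untested):

```python
from urllib.parse import quote

SITE = "https://collecties.kampwesterbork.nl"


def next_data_url(build_id: str, kind: str, pid: str, locale: str = "nl") -> str:
    """Assumed Next.js data endpoint for a record; kind is 'werk' or 'persoon'."""
    encoded_pid = quote(pid, safe="")  # encode ':' and '/' too
    return f"{SITE}/_next/data/{build_id}/{locale}/{kind}/{encoded_pid}.json"


work_data_url = next_data_url(
    "b6DuV8zPMsiXPQxO3Lrq8",  # build ID observed in this session; may be stale
    "werk",
    "https://data.kampwesterbork.nl/work/TH0000017587",
)
```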
Entity/Topic Detail Page
/entiteit/{URL_ENCODED_ENTITY_URI}
Example:
https://collecties.kampwesterbork.nl/entiteit/joodse%20gemeente%20nieuw-amsterdam
Story/Verhaal Page
/verhaal/{STORY_ID}
Example:
https://collecties.kampwesterbork.nl/verhaal/18268091
Note: Stories are editorial content (from Sanity CMS), not collection records.
Metadata Standards Detected
1. Schema.org/JSON-LD ✅
Embedded in HTML: Yes (visible in <script type="application/ld+json"> tags)
Use Case: Search engine optimization, semantic markup
Sample Fields (expected):
{
"@context": "https://schema.org",
"@type": "ArchiveComponent",
"name": "Brief van G. Bamberg-Herschel aan Janny Bosman",
"dateCreated": "1943-05-02",
"creator": {"@type": "Person", "name": "G. Bamberg-Herschel"},
"isPartOf": {"@type": "ArchiveOrganization", "name": "Herinneringscentrum Kamp Westerbork"}
}
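Embedded JSON-LD can be pulled out of a page with the standard-library HTML parser. The sketch below collects every application/ld+json script block and skips any that fail to parse; the sample HTML is made up for the test and only echoes the expected fields above:

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdExtractor(HTMLParser):
    """Collect decoded JSON-LD documents from <script> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.documents: List = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed blocks rather than abort the harvest


def extract_jsonld(html: str) -> List:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "ArchiveComponent", '
    '"name": "Brief van G. Bamberg-Herschel aan Janny Bosman"}'
    '</script><script>var x = 1;</script></head><body></body></html>'
)
documents = extract_jsonld(sample_html)
```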
2. IIIF Presentation API (Likely)
Standard: IIIF Presentation API 3.0
Evidence: IIIF Image API 3.0 is in use, which makes a Presentation API plausible, but no manifest endpoint has been observed yet
Harvesting Value: Manifests contain rich metadata + image sequences
3. Archival Standards (Backend - ISAD(G), ISAAR(CPF))
System: Atlantis uses ISAD(G) (General International Standard Archival Description)
Not Exposed via Web: Archival XML/EAD exports may require direct Atlantis API access
Data Extraction Priority Roadmap
Phase 1: Reconnaissance ⏳ In Progress
- Identify backend system (Atlantis)
- Discover IIIF Image API
- Locate Spinque search endpoint
- Analyze URL patterns
- TODO: Test IIIF Presentation API endpoints
- TODO: Capture Spinque API request/response format
- TODO: Check for sitemap.xml
Phase 2: Proof of Concept Harvester
- Extract build ID from homepage HTML
- Query Next.js data endpoint for search results (test with "Amsterdam")
- Parse JSON to extract:
- Person IDs and names
- Work IDs and titles
- Entity/topic URIs
- Iterate pagination (page 1-10 test)
- Fetch individual record pages via data endpoints
Expected Output: CSV/JSON with ~1,000 sample records
Phase 3: Full-Scale Harvesting
- Systematic crawl via:
- Sitemap.xml (if available), OR
- Alphabet search queries (A-Z), OR
- Numeric ID range scanning (person IDs appear sequential)
- Extract all metadata fields:
- Names (persons, creators, recipients)
- Dates (creation, death, events)
- Locations (cities, camps, addresses)
- Subjects (PoolParty thesaurus terms)
- Document types (brief, briefkaart, foto, etc.)
- Download IIIF images (thumbnail + full-res URLs)
- Cross-reference with:
- EHRI Portal (https://portal.ehri-project.eu/institutions/nl-002896)
- Wikidata (Q2668075 - Kamp Westerbork)
Phase 4: Data Integration
- Map to LinkML schema (HeritageCustodian + Collection + DigitalPlatform)
- Generate GHCID for institution: NL-DR-HHL-M-HCK (Netherlands, Drenthe, Hooghalen, Museum/Archive, Herinneringscentrum)
- Create RDF triples for Linked Open Data
- Export to formats:
- JSON-LD
- RDF/Turtle
- CSV (for analysis)
- Parquet (for data warehouse)
Rate Limiting & Ethical Harvesting
Observed Infrastructure
- CDN: Cloudflare (likely) or custom CDN for static assets
- IIIF Server: Dedicated subdomain (kenniscentrum.kampwesterbork.nl)
- Backend API: Next.js API routes + Spinque proxy
Recommended Practices
- Respect robots.txt (check for crawl-delay directive)
- Rate limit: 1-2 requests/second (conservative)
- User-Agent: Identify as research crawler
  User-Agent: GLAM-Harvester/1.0 (Heritage Data Research; contact@example.org)
- Cache responses: Avoid redundant requests (use ETags, Last-Modified headers)
- Avoid peak hours: Harvest during off-peak times (European late night)
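Two of these practices, rate limiting and honouring robots.txt, can be sketched with the standard library alone. The Throttle class and allowed function are illustrative names, and the robots.txt content in the example is invented, since the site's real file has not been checked yet:

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "GLAM-Harvester/1.0 (Heritage Data Research; contact@example.org)"


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against robots.txt content that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented robots.txt, for illustration only.
example_robots = "User-agent: *\nDisallow: /api/\n"
proxy_allowed = allowed(example_robots, "https://collecties.kampwesterbork.nl/api/spinque-proxy")
search_allowed = allowed(example_robots, "https://collecties.kampwesterbork.nl/zoeken?term=Amsterdam")
```

Calling throttle.wait() before every request with min_interval=1.0 keeps the crawler at or below one request per second, the conservative end of the range suggested above.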
API Access Summary
| Endpoint Type | URL | Status | Authentication | Harvestable |
|---|---|---|---|---|
| Spinque Search API | /api/spinque-proxy | ✅ Active | None | ✅ Yes (reverse-engineer) |
| Next.js Data | /_next/data/{BUILD_ID}/... | ✅ Active | None | ✅ Yes (JSON) |
| IIIF Image API | kenniscentrum.../iiif/image/3.0/ | ✅ Active | None | ✅ Yes (standard) |
| IIIF Presentation | kenniscentrum.../iiif/presentation/ | ❓ Unknown | None | ⏳ Test needed |
| Direct Data URIs | data.kampwesterbork.nl/work/ | ❌ Connection refused | Yes (likely) | ❌ No |
| Sanity CMS API | e44e8rzu.api.sanity.io/v2021-10-21/ | ❓ Unknown | API key | ⏳ Test needed |
| Atlantis API | Unknown | ❓ Unknown | Yes (likely) | ❌ Likely internal |
Collection Scope
Size Estimates (from Search Results)
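- Persons: 63,659 records matching "Amsterdam" (deportees, staff, correspondents)
- Works/Documents: roughly 1,900 non-person results for "Amsterdam" (65,581 total results minus 63,659 persons); the full collection is likely larger than this single query shows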
- Brieven (letters): ~1,000+
- Briefkaarten (postcards): ~200
- Foto's (photos): ~50
- Bulletins, verklaringen, bewijzen, etc.
- Time Period: 1939-1971 (focus 1942-1945)
- Geographic Scope: Netherlands (Amsterdam, other cities) → Westerbork → Deportation destinations
Thematic Coverage
- Daily life in Westerbork camp (1942-1945)
- Joodse Raad voor Amsterdam (Jewish Council) records
- Transport lists and camp administration
- Ego-documents (letters, diaries, memoirs thrown from trains)
- Post-war repatriation and commemoration
- Camp periods: refugee camp (1939-1942), transit camp (1942-1945), internment (1945-1948), military (1948-1949), repatriation (1950-1951), Schattenberg settlement (1951-1971)
Technical Contact Points
Known Infrastructure Providers
- Atlantis Erfgoed - Collections management system
  Website: https://www.atlantis-erfgoed.nl/
  Contact: info@atlantis-erfgoed.nl
- PoolParty - Thesaurus/controlled vocabulary
  Provider: Semantic Web Company
  Instance: https://digitaalerfgoed.poolparty.biz/westerbork/
- Sanity.io - Content management (stories)
  Project ID: e44e8rzu
- Spinque - Search engine
  Website: https://spinque.com/
Institutional Contact
Herinneringscentrum Kamp Westerbork
Oosthalen 8, 9414 TG Hooghalen, Netherlands
Tel: +31 (0)593-592600
Email: info@westerbork.nl
Website: https://www.kampwesterbork.nl/
Next Steps for Data Harvesting
Immediate Actions (Next Session)
- ✅ Test IIIF Presentation API:
  curl -I "https://kenniscentrum.kampwesterbork.nl/iiif/presentation/v3/TH0000017587/manifest.json"
- ✅ Capture Spinque API format:
  - Open DevTools Network tab
  - Perform search
  - Copy spinque-proxy POST request as cURL
  - Document request/response structure
- ✅ Check sitemap.xml:
  curl "https://collecties.kampwesterbork.nl/sitemap.xml"
  curl "https://collecties.kampwesterbork.nl/robots.txt"
- ✅ Extract sample record via Next.js data endpoint:
  # 1. Get build ID from homepage source
  curl "https://collecties.kampwesterbork.nl/" | grep -o '"buildId":"[^"]*"'
  # 2. Query work data
  curl "https://collecties.kampwesterbork.nl/_next/data/{BUILD_ID}/nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587.json"
Development Tasks
- Write Python harvester using requests + json
- Test Spinque pagination (verify max pages, rate limits)
- Build record parser for Next.js JSON schema
- Implement IIIF image downloader (selective: thumbnails vs full-res)
- Map fields to LinkML schema (HeritageCustodian + DigitalPlatform + Identifier)
Conclusion
Herinneringscentrum Kamp Westerbork's digital collection is harvestable via Next.js data endpoints and the Spinque search API, despite lacking a public REST API. The site's use of IIIF Image API 3.0 provides standardized image access, and the Atlantis CMS backend suggests potential for XML/EAD exports if direct system access is negotiated.
Recommended Harvesting Method: Next.js JSON endpoints (Strategy 1) as primary, with Spinque API (Strategy 2) as fallback for bulk search queries.
Estimated Effort:
- Proof-of-concept: 2-3 days (1,000 records)
- Full harvest: 1-2 weeks (65,000+ records, rate-limited)
- Data integration: 3-5 days (mapping, validation, RDF export)
Data Quality: High (authoritative archival institution, EHRI partner, structured metadata)
References
- EHRI Portal Entry: https://portal.ehri-project.eu/institutions/nl-002896
- Dutch ISIL Registry: NL-HhlHCKW (confirmed 2016-12-02)
- Dutch Organizations CSV: Row 190 (Drenthe, Hooghalen, Atlantis system)
- Atlantis Documentation: https://www.atlantis-erfgoed.nl/collectie-beheer/
- IIIF Image API 3.0 Spec: https://iiif.io/api/image/3.0/
- Next.js Data Fetching: https://nextjs.org/docs/basic-features/data-fetching
Report Author: AI Agent (OpenCODE)
Session Date: 2025-11-17
Project: GLAM Data Extraction (Global Heritage Custodian Identifiers)