glam/docs/sessions/kamp_westerbork_data_harvesting_analysis.md
2025-11-19 23:25:22 +01:00

Kamp Westerbork Digital Collection - Data Harvesting Analysis

Date: 2025-11-17
Institution: Herinneringscentrum Kamp Westerbork
URL: https://collecties.kampwesterbork.nl/
ISIL Code: NL-HhlHCKW


Executive Summary

Herinneringscentrum Kamp Westerbork operates a digital collection platform built on the Atlantis collections management system, with IIIF image services, a Spinque search backend, and Sanity.io for editorial content. The platform uses persistent HTTP URIs for linked data but offers no documented public API, so harvesting requires reverse-engineering the Next.js data endpoints and the Spinque search API.


Collections Management System

Primary System: Atlantis (by Atlantis Erfgoed)

  • Provider: Atlantis Erfgoed
  • Type: Web-based archive and collection management system
  • Standards: ISAD(G), ISAAR(CPF), ISDF (archival standards)
  • Confirmed Usage: Listed in Dutch Organizations CSV as using "Atlantis"

Atlantis Features:

  • Multimedia information system (links images, audio/video, documents)
  • Depot management for physical storage tracking
  • Integrated online publication platform
  • Plugin-based architecture for extensibility

Reference: Dutch Organizations CSV confirms "Atlantis" as their system:

Drenthe,Hooghalen,Oosthalen 8,Stichting Herinneringscentrum Kamp Westerbork,,https://kampwesterbork.nl/ ,museum,,NL-HhlHCKW,,Atlantis,ja,,ja

Digital Infrastructure Architecture

1. Frontend Framework

  • Technology: Next.js (React-based, server-side rendered)
  • Build ID: b6DuV8zPMsiXPQxO3Lrq8 (visible in /_next/static/ paths)
  • Deployment: Static assets via _next/static/chunks/

Key Observation: The site uses Next.js data fetching via JSON endpoints like:

https://collecties.kampwesterbork.nl/_next/data/b6DuV8zPMsiXPQxO3Lrq8/nl/zoeken.json?term=Amsterdam

This is a harvestable endpoint for search results!


2. Search Backend: Spinque API

Endpoint: https://collecties.kampwesterbork.nl/api/spinque-proxy
Method: POST
Status: Active and responding (observed multiple 200 OK responses during search)

Observed Behavior:

  • 3 POST requests fired on search query "Amsterdam"
  • Likely structure: {query: "Amsterdam", page: 1, filters: {...}}
  • Returns JSON with search results (persons, works, entities)

Search Result Statistics (from "Amsterdam" query):

  • 65,581 total results
  • 63,659 persons (5,305 pages at ~12 results/page)
  • Documents: Brief (894), Briefkaart (179), Bewijs van verzending (103), etc.
  • Pagination: /zoeken?term=Amsterdam&page=2 (URL-based) and &personPage=2 (person-specific)

Harvesting Strategy:

  1. Intercept or reverse-engineer Spinque API calls
  2. Use Chrome DevTools Network tab to capture POST request payload
  3. Replicate requests with Python requests library
  4. Iterate through pagination parameters

3. IIIF Image API 3.0

Base URL: https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/

Format: Standard IIIF Image API

https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/{IMAGE_ID}/full/{SIZE}/0/default.jpg

Examples:

  • 17614498/full/!100,100/0/default.jpg (thumbnail, max 100x100px)
  • 17614498/full/!300,300/0/default.jpg (medium, max 300x300px)
  • 17614498/full/!600,600/0/default.jpg (large, max 600x600px)
  • 17614498/full/max/0/default.jpg (largest size the server offers, normally full resolution)

IIIF Info Document: https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/{IMAGE_ID}/info.json

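Compliance: URL syntax matches IIIF Image API 3.0; confirm the declared compliance level (level0, level1, or level2) from the profile field in info.json
Harvesting: Use IIIF manifest URLs if available, or construct URLs from image IDs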
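
The URL template above can be wrapped in a small helper. This is a minimal sketch: the function names iiif_image_url and iiif_info_url and their defaults are illustrative and not part of any library; only the base URL and path layout come from the observations above.

```python
IIIF_BASE = "https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0"


def iiif_image_url(image_id, size="max", region="full", rotation="0",
                   quality="default", fmt="jpg"):
    """Build {base}/{id}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{IIIF_BASE}/{image_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(image_id):
    """URL of the info.json document describing one image."""
    return f"{IIIF_BASE}/{image_id}/info.json"


if __name__ == "__main__":
    # "!300,300" asks for a best fit inside a 300x300 box
    print(iiif_image_url(17614498, size="!300,300"))
    print(iiif_info_url(17614498))
```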


4. Content Management: Sanity.io

CDN: https://cdn.sanity.io/images/e44e8rzu/production/

Project ID: e44e8rzu
Dataset: production

Image Examples:

https://cdn.sanity.io/images/e44e8rzu/production/daba9f3b89cd9c794d4baa061296c803705bcf55-658x548.jpg?w=1599&fit=max

Sanity API Endpoint (potentially accessible):

https://e44e8rzu.api.sanity.io/v2021-10-21/data/query/production?query=*[_type == "story"]

Use Case: Sanity stores editorial content (stories/verhalen), not collection metadata
Harvesting: Query Sanity API for stories, but collection data is in Atlantis/Spinque
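
The GROQ query in the endpoint above contains spaces, brackets, and quotes, so it has to be percent-encoded before it is sent. A minimal sketch of building that URL follows; the helper name sanity_query_url is illustrative, and the document type "story" is this report's guess rather than a confirmed schema name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sanity_query_url(groq, project_id="e44e8rzu", dataset="production",
                     api_version="v2021-10-21"):
    """Return a Sanity HTTP query URL with the GROQ expression percent-encoded."""
    base = f"https://{project_id}.api.sanity.io/{api_version}/data/query/{dataset}"
    return f"{base}?{urlencode({'query': groq})}"


if __name__ == "__main__":
    # "story" is an assumed type name; adjust once the real schema is known
    url = sanity_query_url('*[_type == "story"]')
    print(url)
    # Round-trip check: the decoded parameter equals the original query
    print(parse_qs(urlsplit(url).query)["query"][0])
```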


Persistent Identifier Schemes

1. Work/Object IDs

Format: https://data.kampwesterbork.nl/work/{IDENTIFIER}

Examples:

  • TH0000017587 (ThesisHolocaust collection prefix)
  • HCIS.00011671 (Herinneringscentrum Information System)

Web URL Structure (URL-encoded PIDs):

https://collecties.kampwesterbork.nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587

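Note: Direct access to https://data.kampwesterbork.nl/work/TH0000017587 returns connection refused. Nothing accepts connections on that host from the public internet, so the URI most likely serves as an identifier only or resolves only on an internal network. An authentication requirement would normally surface as an HTTP 401 or 403 response rather than a refused connection.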


2. Person IDs

Format: https://kampwesterbork.nl/data/person/{NUMERIC_ID}

Examples:

  • 10907179 (Marie Schönberg-Amsterdam)
  • 13442221 (Abraham Peekel)
  • 14808721 (Henriette Kalker)

Web URL Structure:

https://collecties.kampwesterbork.nl/persoon/https%3A%2F%2Fkampwesterbork.nl%2Fdata%2Fperson%2F10907179

Harvesting: Person IDs are visible in search results - can be extracted from Spinque API responses.
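
Both work and person detail URLs embed the full persistent URI as a single percent-encoded path segment. The sketch below converts in both directions; the helper names detail_url and pid_from_detail_url are illustrative.

```python
from urllib.parse import quote, unquote

SITE = "https://collecties.kampwesterbork.nl"


def detail_url(kind, pid_uri):
    """Build a detail-page URL; kind is 'werk' or 'persoon'."""
    if kind not in ("werk", "persoon"):
        raise ValueError(f"unsupported kind: {kind!r}")
    # safe="" forces ':' and '/' to be encoded as well
    return f"{SITE}/{kind}/{quote(pid_uri, safe='')}"


def pid_from_detail_url(url):
    """Recover the persistent URI from a detail-page URL."""
    return unquote(url.rsplit("/", 1)[-1])


if __name__ == "__main__":
    print(detail_url("persoon", "https://kampwesterbork.nl/data/person/10907179"))
```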


3. Entity/Concept IDs (Thesaurus)

Format: https://digitaalerfgoed.poolparty.biz/westerbork/{ID}

Examples:

  • joodse%20gemeente%20nieuw-amsterdam (Joodse gemeente Nieuw-Amsterdam)
  • doorgangskamp%20westerbork (Doorgangskamp Westerbork periods)

System: PoolParty Semantic Suite (controlled vocabulary management)

Use Case: Subject headings, locations, events, organizational changes (1939-1971 camp periods)


4. Image IDs

Format: Simple numeric identifiers (no URI prefix)

Examples: 17614498, 17614499, 15557762

Retrieval: Via IIIF Image API using numeric ID


Data Harvesting Strategies

Strategy 1: Next.js Data Endpoints (Recommended)

Approach: Fetch the Next.js JSON data files that back the server-side rendered pages

Example Endpoint:

curl "https://collecties.kampwesterbork.nl/_next/data/b6DuV8zPMsiXPQxO3Lrq8/nl/zoeken.json?term=Amsterdam&page=1"

Advantages:

  • Returns structured JSON (not HTML parsing)
  • Contains all data rendered on page (persons, works, entities)
  • Pagination via ?page=N parameter
  • No authentication required

Disadvantages:

  • ⚠️ Build ID (b6DuV8zPMsiXPQxO3Lrq8) may change on redeployment
  • ⚠️ Requires discovering build ID from homepage HTML or _buildManifest.js

Implementation:

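import re
import requests

BASE = "https://collecties.kampwesterbork.nl"

def extract_build_id_from_html(html):
    """Return the Next.js build ID embedded in the page's inline JSON."""
    match = re.search(r'"buildId":"([^"]+)"', html)
    if match is None:
        raise ValueError("buildId not found in homepage HTML")
    return match.group(1)

# 1. Discover current build ID (changes on redeployment)
response = requests.get(f"{BASE}/", timeout=30)
response.raise_for_status()
build_id = extract_build_id_from_html(response.text)

# 2. Query search data endpoint
base_url = f"{BASE}/_next/data/{build_id}/nl/zoeken.json"
params = {"term": "Amsterdam", "page": 1}
data = requests.get(base_url, params=params, timeout=30).json()

# 3. Parse results (key names are assumptions; verify against a real response)
results = data.get("pageProps", {}).get("searchResults", {})
persons = results.get("persons", [])
works = results.get("works", [])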
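
Because the build ID can change on redeployment, a long-running harvest may need to rediscover it mid-run. The following is a sketch under two assumptions: a stale build ID yields HTTP 404, and fetch is a hypothetical callable returning (status_code, body_text), for example a thin wrapper around requests.get. The class name NextDataClient is illustrative.

```python
import json
import re

BUILD_ID_RE = re.compile(r'"buildId":"([^"]+)"')


def extract_build_id(html):
    match = BUILD_ID_RE.search(html)
    if match is None:
        raise ValueError("buildId not found in HTML")
    return match.group(1)


class NextDataClient:
    def __init__(self, base_url, fetch):
        self.base_url = base_url.rstrip("/")
        self.fetch = fetch  # hypothetical: fetch(url) -> (status, text)
        self.build_id = None

    def refresh_build_id(self):
        status, body = self.fetch(self.base_url + "/")
        if status != 200:
            raise RuntimeError(f"homepage returned HTTP {status}")
        self.build_id = extract_build_id(body)

    def get_data(self, path_and_query):
        """Fetch a _next/data JSON file, refreshing the build ID once on 404."""
        if self.build_id is None:
            self.refresh_build_id()
        for attempt in range(2):
            url = f"{self.base_url}/_next/data/{self.build_id}/{path_and_query}"
            status, body = self.fetch(url)
            if status == 404 and attempt == 0:
                self.refresh_build_id()  # assumed cause: stale build ID
                continue
            if status != 200:
                raise RuntimeError(f"{url} returned HTTP {status}")
            return json.loads(body)
```

Injecting fetch keeps the retry logic testable without network access; a 404 that persists after the refresh is raised as an error rather than retried again.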

Strategy 2: Spinque Search API (Reverse-Engineered)

Approach: Intercept and replicate POST requests to /api/spinque-proxy

Endpoint: https://collecties.kampwesterbork.nl/api/spinque-proxy

Observed Behavior:

  • 3 POST requests per search query
  • Likely separate calls for: persons, works, entities/topics

Reverse Engineering Steps:

  1. Open Chrome DevTools → Network tab
  2. Perform search query on website
  3. Find POST requests to spinque-proxy
  4. Copy request payload (JSON) and headers
  5. Replicate with Python requests.post()

Expected Payload Structure (hypothetical):

{
  "query": "Amsterdam",
  "page": 1,
  "pageSize": 12,
  "filters": {
    "type": "person"
  }
}

Advantages:

  • Direct access to search engine
  • Structured JSON responses
  • May support advanced filtering

Disadvantages:

  • ⚠️ Undocumented API (may change without notice)
  • ⚠️ Requires reverse-engineering request format
  • ⚠️ No guarantee of rate limit policy
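
Once a real request has been captured, replaying it could look roughly like the sketch below. The payload fields mirror the hypothetical structure shown above and are not confirmed; the function names are illustrative, and session is expected to be a requests.Session or any object with a compatible post method.

```python
SPINQUE_PROXY = "https://collecties.kampwesterbork.nl/api/spinque-proxy"


def build_search_payload(query, page=1, page_size=12, result_type=None):
    """Assemble the assumed JSON body; replace with the captured format."""
    payload = {"query": query, "page": page, "pageSize": page_size}
    if result_type is not None:
        payload["filters"] = {"type": result_type}
    return payload


def post_search(session, payload, timeout=30):
    """POST the payload as JSON and return the decoded response."""
    response = session.post(SPINQUE_PROXY, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(build_search_payload("Amsterdam", result_type="person"))
```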

Strategy 3: IIIF Manifest Harvesting (If Available)

Approach: Check for IIIF Presentation API manifests

Test Endpoints:

# Collection-level manifest
curl "https://kenniscentrum.kampwesterbork.nl/iiif/collection/top"

# Object-level manifest
curl "https://kenniscentrum.kampwesterbork.nl/iiif/presentation/v3/TH0000017587/manifest.json"

If manifests exist:

  • Contains structured metadata (title, creator, date, description)
  • Links to all images for an object
  • IIIF standard = easy parsing with libraries (iiif-prezi3, pyld)

Status: ⚠️ Unknown - needs testing (not observed in initial session)
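
If a manifest does turn up, the image service IDs can be read by walking the Presentation API 3.0 structure (Manifest, Canvas, AnnotationPage, Annotation, body, service). This is a sketch with plain dictionaries; the function name image_service_ids is illustrative, and the sample manifest is invented for demonstration, not retrieved from the server.

```python
def image_service_ids(manifest):
    """Collect image service IDs from a IIIF Presentation 3.0 manifest dict."""
    ids = []
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                body = annotation.get("body", {})
                bodies = body if isinstance(body, list) else [body]
                for item in bodies:
                    for service in item.get("service", []):
                        # v3 services use "id"; older ones use "@id"
                        service_id = service.get("id") or service.get("@id")
                        if service_id:
                            ids.append(service_id)
    return ids


if __name__ == "__main__":
    # Invented sample for illustration only
    sample = {"type": "Manifest", "items": [{"type": "Canvas", "items": [
        {"type": "AnnotationPage", "items": [{"type": "Annotation", "body": {
            "type": "Image", "service": [{
                "id": "https://kenniscentrum.kampwesterbork.nl/iiif/image/3.0/17614498",
                "type": "ImageService3"}]}}]}]}]}
    print(image_service_ids(sample))
```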


Strategy 4: Web Scraping with Playwright/Selenium

Approach: Automate browser to render JavaScript and extract data from DOM

Use Case: Fallback if JSON endpoints are unavailable

Advantages:

  • Works even with heavy JavaScript rendering
  • Can handle dynamic content loading

Disadvantages:

  • Slow (browser overhead)
  • Fragile (breaks if HTML structure changes)
  • Resource-intensive

Not Recommended when JSON endpoints are available.


Strategy 5: Sitemap Crawling

Check for Sitemap:

curl "https://collecties.kampwesterbork.nl/sitemap.xml"
curl "https://collecties.kampwesterbork.nl/robots.txt"

If sitemap exists:

  • Lists all public URLs (persons, works, entities)
  • Can iterate through all records systematically
  • Limits the crawl to URLs the site itself publishes for crawlers

Implementation:

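import xml.etree.ElementTree as ET
import requests

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
sitemap_url = "https://collecties.kampwesterbork.nl/sitemap.xml"

def harvest_work_page(url):
    """Placeholder: fetch and parse one work page."""
    print("work:", url)

def harvest_person_page(url):
    """Placeholder: fetch and parse one person page."""
    print("person:", url)

response = requests.get(sitemap_url, timeout=30)
response.raise_for_status()
root = ET.fromstring(response.content)

# Collect every <loc>; a sitemap index would list child sitemaps here instead
urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

for url in urls:
    if "/werk/" in url:
        harvest_work_page(url)
    elif "/persoon/" in url:
        harvest_person_page(url)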

URL Pattern Analysis

Search Results Page

/zoeken?term={QUERY}&page={PAGE_NUMBER}&personPage={PERSON_PAGE}

Parameters:

  • term: Search query (URL-encoded)
  • page: Page number for works/documents (1-based)
  • personPage: Page number for persons (1-based, separate pagination)

Example:

https://collecties.kampwesterbork.nl/zoeken?term=Amsterdam&page=1&personPage=1
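
The two pagination parameters can be generated programmatically, and the page count follows from the result total. A minimal sketch, assuming the observed page size of about 12 results; the helper names are illustrative.

```python
import math
from urllib.parse import urlencode

SEARCH_URL = "https://collecties.kampwesterbork.nl/zoeken"


def search_url(term, page=1, person_page=1):
    """Build a search URL; works and persons paginate independently."""
    params = {"term": term, "page": page, "personPage": person_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def page_count(total_results, per_page=12):
    """Number of pages needed to list total_results items."""
    return math.ceil(total_results / per_page)


if __name__ == "__main__":
    print(search_url("Amsterdam", person_page=2))
    print(page_count(63659))  # person results reported for "Amsterdam"
```

With 63,659 person results this gives 5,305 pages, matching the figure observed in the search interface.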

Person Detail Page

/persoon/{URL_ENCODED_PERSON_URI}

Example:

https://collecties.kampwesterbork.nl/persoon/https%3A%2F%2Fkampwesterbork.nl%2Fdata%2Fperson%2F10907179

Data Endpoint (hypothetical):

https://collecties.kampwesterbork.nl/_next/data/{BUILD_ID}/nl/persoon/https%3A%2F%2Fkampwesterbork.nl%2Fdata%2Fperson%2F10907179.json

Work/Object Detail Page

/werk/{URL_ENCODED_WORK_URI}

Example:

https://collecties.kampwesterbork.nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587

Data Endpoint (hypothetical):

https://collecties.kampwesterbork.nl/_next/data/{BUILD_ID}/nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587.json

Entity/Topic Detail Page

/entiteit/{URL_ENCODED_ENTITY_ID}

Example:

https://collecties.kampwesterbork.nl/entiteit/joodse%20gemeente%20nieuw-amsterdam

Story/Verhaal Page

/verhaal/{STORY_ID}

Example:

https://collecties.kampwesterbork.nl/verhaal/18268091

Note: Stories are editorial content (from Sanity CMS), not collection records.


Metadata Standards Detected

1. Schema.org/JSON-LD

Embedded in HTML: Yes (visible in <script type="application/ld+json"> tags)

Use Case: Search engine optimization, semantic markup

Sample Fields (expected):

{
  "@context": "https://schema.org",
  "@type": "ArchiveComponent",
  "name": "Brief van G. Bamberg-Herschel aan Janny Bosman",
  "dateCreated": "1943-05-02",
  "creator": {"@type": "Person", "name": "G. Bamberg-Herschel"},
  "isPartOf": {"@type": "ArchiveOrganization", "name": "Herinneringscentrum Kamp Westerbork"}
}
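
Embedded JSON-LD can be pulled out of a page with the standard library alone. The sketch below collects the contents of every script tag of type application/ld+json; the class and function names are illustrative, and the field values on a live page may differ from the expected sample above.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD documents from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            text = "".join(self._buffer).strip()
            if text:
                self.documents.append(json.loads(text))


def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```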

2. IIIF Presentation API (Likely)

Standard: IIIF Presentation API 3.0

Evidence: IIIF Image API 3.0 in use → high probability of Presentation API

Harvesting Value: Manifests contain rich metadata + image sequences


3. Archival Standards (Backend - ISAD(G), ISAAR(CPF))

System: Atlantis uses ISAD(G) (General International Standard Archival Description)

Not Exposed via Web: Archival XML/EAD exports may require direct Atlantis API access


Data Extraction Priority Roadmap

Phase 1: Reconnaissance (In Progress)

  • Done: Identify backend system (Atlantis)
  • Done: Discover IIIF Image API
  • Done: Locate Spinque search endpoint
  • Done: Analyze URL patterns
  • TODO: Test IIIF Presentation API endpoints
  • TODO: Capture Spinque API request/response format
  • TODO: Check for sitemap.xml

Phase 2: Proof of Concept Harvester

  1. Extract build ID from homepage HTML
  2. Query Next.js data endpoint for search results (test with "Amsterdam")
  3. Parse JSON to extract:
    • Person IDs and names
    • Work IDs and titles
    • Entity/topic URIs
  4. Iterate pagination (page 1-10 test)
  5. Fetch individual record pages via data endpoints

Expected Output: CSV/JSON with ~1,000 sample records
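
Writing the sample output needs only the standard library. This sketch flattens harvested records to CSV; the column names id, type, and label are assumptions chosen for illustration, since the real field names depend on the JSON returned by the data endpoints.

```python
import csv
import io


def records_to_csv(records, fieldnames=("id", "type", "label")):
    """Serialize a list of dicts to CSV text, ignoring unknown keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells
        writer.writerow({key: record.get(key, "") for key in fieldnames})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"id": "https://kampwesterbork.nl/data/person/13442221",
               "type": "person", "label": "Abraham Peekel"}]
    print(records_to_csv(sample))
```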


Phase 3: Full-Scale Harvesting

  1. Systematic crawl via:
    • Sitemap.xml (if available), OR
    • Alphabet search queries (A-Z), OR
    • Numeric ID range scanning (person IDs appear sequential)
  2. Extract all metadata fields:
    • Names (persons, creators, recipients)
    • Dates (creation, death, events)
    • Locations (cities, camps, addresses)
    • Subjects (PoolParty thesaurus terms)
    • Document types (brief, briefkaart, foto, etc.)
  3. Download IIIF images (thumbnail + full-res URLs)
  4. Cross-reference with:

Phase 4: Data Integration

  1. Map to LinkML schema (HeritageCustodian + Collection + DigitalPlatform)
  2. Generate GHCID for institution: NL-DR-HHL-M-HCK (Netherlands, Drenthe, Hooghalen, Museum/Archive, Herinneringscentrum)
  3. Create RDF triples for Linked Open Data
  4. Export to formats:
    • JSON-LD
    • RDF/Turtle
    • CSV (for analysis)
    • Parquet (for data warehouse)

Rate Limiting & Ethical Harvesting

Observed Infrastructure

  • CDN: Cloudflare (likely) or custom CDN for static assets
  • IIIF Server: Dedicated subdomain (kenniscentrum.kampwesterbork.nl)
  • Backend API: Next.js API routes + Spinque proxy

Recommended Practices

  1. Respect robots.txt (check for a Crawl-delay directive)
  2. Rate limit: 1-2 requests/second (conservative)
  3. User-Agent: Identify as a research crawler, for example:

    User-Agent: GLAM-Harvester/1.0 (Heritage Data Research; contact@example.org)

  4. Cache responses: Avoid redundant requests (use ETag and Last-Modified headers)
  5. Avoid peak hours: Harvest during off-peak times (late night, European time)
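
The rate limit and User-Agent recommendations can be enforced in code rather than left to discipline. The sketch below is one simple way to do it; the Throttle class is illustrative, and the clock and sleep parameters exist only so the logic can be tested without waiting.

```python
import time

HEADERS = {
    "User-Agent": "GLAM-Harvester/1.0 (Heritage Data Research; contact@example.org)"
}


class Throttle:
    """Block so that calls to wait() are at least 1/max_per_second apart."""

    def __init__(self, max_per_second=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() before each request, and passing HEADERS to the HTTP client, keeps the harvester within the 1-2 requests/second budget.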

API Access Summary

| Endpoint Type | URL | Status | Authentication | Harvestable |
|---|---|---|---|---|
| Spinque Search API | /api/spinque-proxy | Active | None | Yes (reverse-engineer) |
| Next.js Data | /_next/data/{BUILD_ID}/... | Active | None | Yes (JSON) |
| IIIF Image API | kenniscentrum.../iiif/image/3.0/ | Active | None | Yes (standard) |
| IIIF Presentation | kenniscentrum.../iiif/presentation/ | Unknown | None | Test needed |
| Direct Data URIs | data.kampwesterbork.nl/work/ | Connection refused | Yes (likely) | No |
| Sanity CMS API | e44e8rzu.api.sanity.io/v2021-10-21/ | Unknown | API key | Test needed |
| Atlantis API | Unknown | Unknown | Yes (likely) | Likely internal |

Collection Scope

Size Estimates (from Search Results)

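  • All figures come from the single query "Amsterdam" and are therefore lower bounds for the full collection
  • Persons: 63,659 records (deportees, staff, correspondents)
  • All result types combined: 65,581, which leaves about 1,900 non-person results (works, documents, entities)
    • Brieven (letters): 894 in this query, likely 1,000+ overall
    • Briefkaarten (postcards): 179 in this query, ~200 overall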
    • Foto's (photos): ~50
    • Bulletins, verklaringen, bewijzen, etc.
  • Time Period: 1939-1971 (focus 1942-1945)
  • Geographic Scope: Netherlands (Amsterdam, other cities) → Westerbork → Deportation destinations

Thematic Coverage

  • Daily life in Westerbork camp (1942-1945)
  • Joodse Raad voor Amsterdam (Jewish Council) records
  • Transport lists and camp administration
  • Ego-documents (diaries, memoirs, and letters, including letters thrown from trains)
  • Post-war repatriation and commemoration
  • Camp periods: refugee camp (1939-1942), transit camp (1942-1945), internment (1945-1948), military (1948-1949), repatriation (1950-1951), Schattenberg settlement (1951-1971)

Technical Contact Points

Known Infrastructure Providers

  1. Atlantis Erfgoed - Collections management system
    Website: https://www.atlantis-erfgoed.nl/
    Contact: info@atlantis-erfgoed.nl

  2. PoolParty - Thesaurus/controlled vocabulary
    Provider: Semantic Web Company
    Instance: https://digitaalerfgoed.poolparty.biz/westerbork/

  3. Sanity.io - Content management (stories)
    Project ID: e44e8rzu

  4. Spinque - Search engine
    Website: https://spinque.com/

Institutional Contact

Herinneringscentrum Kamp Westerbork
Oosthalen 8, 9414 TG Hooghalen, Netherlands
Tel: +31 (0)593-592600
Email: info@westerbork.nl
Website: https://www.kampwesterbork.nl/


Next Steps for Data Harvesting

Immediate Actions (Next Session)

  1. Test IIIF Presentation API:

    curl -I "https://kenniscentrum.kampwesterbork.nl/iiif/presentation/v3/TH0000017587/manifest.json"
    
  2. Capture Spinque API format:

    • Open DevTools Network tab
    • Perform search
    • Copy spinque-proxy POST request as cURL
    • Document request/response structure
  3. Check sitemap.xml:

    curl "https://collecties.kampwesterbork.nl/sitemap.xml"
    curl "https://collecties.kampwesterbork.nl/robots.txt"
    
  4. Extract sample record via Next.js data endpoint:

    # 1. Get build ID from homepage source
    curl "https://collecties.kampwesterbork.nl/" | grep -o '"buildId":"[^"]*"'
    
    # 2. Query work data
    curl "https://collecties.kampwesterbork.nl/_next/data/{BUILD_ID}/nl/werk/https%3A%2F%2Fdata.kampwesterbork.nl%2Fwork%2FTH0000017587.json"
    

Development Tasks

  1. Write Python harvester using requests + json
  2. Test Spinque pagination (verify max pages, rate limits)
  3. Build record parser for Next.js JSON schema
  4. Implement IIIF image downloader (selective: thumbnails vs full-res)
  5. Map fields to LinkML schema (HeritageCustodian + DigitalPlatform + Identifier)

Conclusion

Herinneringscentrum Kamp Westerbork's digital collection is harvestable via Next.js data endpoints and the Spinque search API, despite lacking a public REST API. The site's use of IIIF Image API 3.0 provides standardized image access, and the Atlantis CMS backend suggests potential for XML/EAD exports if direct system access is negotiated.

Recommended Harvesting Method: Next.js JSON endpoints (Strategy 1) as primary, with Spinque API (Strategy 2) as fallback for bulk search queries.

Estimated Effort:

  • Proof-of-concept: 2-3 days (1,000 records)
  • Full harvest: 1-2 weeks (65,000+ records, rate-limited)
  • Data integration: 3-5 days (mapping, validation, RDF export)
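
For context, the raw request time at the proposed rate limit is much shorter than the effort estimates, which also cover development, retries, and validation. The sketch below gives a rough lower bound, assuming one request per record and ignoring listing pages and image downloads; the function name crawl_hours is illustrative.

```python
def crawl_hours(n_requests, requests_per_second):
    """Hours needed to issue n_requests at a steady request rate."""
    return n_requests / requests_per_second / 3600


if __name__ == "__main__":
    for rate in (1, 2):
        hours = crawl_hours(65581, rate)
        print(f"{rate} req/s: {hours:.1f} hours")
```

At 1 request/second the 65,581 results take about 18.2 hours of request time, and about 9.1 hours at 2 requests/second.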

Data Quality: High (authoritative archival institution, EHRI partner, structured metadata)


References

  1. EHRI Portal Entry: https://portal.ehri-project.eu/institutions/nl-002896
  2. Dutch ISIL Registry: NL-HhlHCKW (confirmed 2016-12-02)
  3. Dutch Organizations CSV: Row 190 (Drenthe, Hooghalen, Atlantis system)
  4. Atlantis Documentation: https://www.atlantis-erfgoed.nl/collectie-beheer/
  5. IIIF Image API 3.0 Spec: https://iiif.io/api/image/3.0/
  6. Next.js Data Fetching: https://nextjs.org/docs/basic-features/data-fetching

Report Author: AI Agent (OpenCODE)
Session Date: 2025-11-17
Project: GLAM Data Extraction (Global Heritage Custodian Identifiers)