glam/CZECH_ARON_API_INVESTIGATION.md
2025-11-19 23:25:22 +01:00

# Czech Archives ARON API Investigation Results
- **Date**: 2025-11-19
- **API Base**: https://portal.nacr.cz/aron/api/aron
## Summary
- **FOUND**: Undocumented REST API for the ARON portal
- **WORKS**: API returns JSON data without authentication
- ⚠️ **CHALLENGE**: The list endpoint returns ALL records (505,884), so institutions still have to be filtered out
## API Endpoints Discovered
### 1. List All Records
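```bash
# Assumes the endpoint expects a JSON Content-Type header
curl -H 'Content-Type: application/json' -d '{"size": 100}' 'https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST'
```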
**Request Body** (`size` is the number of records per page, 100 appears to be the maximum; `searchAfter` is the optional pagination cursor taken from the previous response):
```json
{
  "size": 100,
  "searchAfter": "[...]"
}
```
**Response Structure** (`count` is the total number of records in the database, `items` is the array of records, `searchAfter` is the cursor for the next page):
```json
{
  "count": 505884,
  "items": [
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]",
  "aggregations": null
}
```
**Pagination**: Cursor-based, not page-numbered. Send the `searchAfter` value from the previous response in the next request body to fetch the following page.
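The cursor loop can be sketched as follows. This is a minimal illustration, not the code of the actual scraper: the names `iter_records` and `post_json` are hypothetical, the `Content-Type` header is an assumption, and the stop conditions (empty page, missing cursor, or unchanged cursor) are guesses, since the API's end-of-list behaviour has not been observed.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"


def post_json(body: dict) -> dict:
    """POST a JSON body to the list endpoint and return the decoded response."""
    request = urllib.request.Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed to be required
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(fetch_page: Callable[[dict], dict], size: int = 100) -> Iterator[dict]:
    """Yield every list item, following the searchAfter cursor page by page."""
    cursor: Optional[str] = None
    while True:
        body = {"size": size}
        if cursor is not None:
            body["searchAfter"] = cursor
        page = fetch_page(body)
        items = page.get("items") or []
        yield from items
        next_cursor = page.get("searchAfter")
        # Assumed stop conditions: empty page, no cursor, or a cursor that did not advance.
        if not items or not next_cursor or next_cursor == cursor:
            return
        cursor = next_cursor
```

Passing the fetch function as an argument keeps the loop testable without network access; a real run would call `iter_records(post_json)` and add a delay between requests.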
### 2. Get Individual Record Detail
```bash
curl "https://portal.nacr.cz/aron/api/aron/apu/$UUID"  # UUID: an "id" value from the list response
```
**Response includes**:
- `type`: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
- `parts`: Array of metadata parts with items
- `name`, `description`, `id`, etc.
**Example metadata items**:
- `INST~CODE`: Institution code
- `INST~SHORT~NAME`: Short name
- `INST~ADDRESS`: Address
- `INST~PHONE`: Phone
- `INST~EMAIL`: Email
- `INST~URL`: Website
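Pulling these items out of a detail response might look like the sketch below. The exact JSON shape of `parts` has not been documented here, so the code assumes each part carries an `items` list whose entries have `type` and `value` keys; that layout, the function name `extract_institution`, and the output key names are all assumptions to verify against a real response.

```python
from typing import Optional

# Item type codes come from the list above; the output key names are illustrative.
FIELD_MAP = {
    "INST~CODE": "code",
    "INST~SHORT~NAME": "short_name",
    "INST~ADDRESS": "address",
    "INST~PHONE": "phone",
    "INST~EMAIL": "email",
    "INST~URL": "url",
}


def extract_institution(detail: dict) -> Optional[dict]:
    """Return a flat record for an INSTITUTION detail response, else None."""
    if detail.get("type") != "INSTITUTION":
        return None
    record = {"id": detail.get("id"), "name": detail.get("name")}
    # ASSUMPTION: parts -> items -> {"type": ..., "value": ...}
    for part in detail.get("parts") or []:
        for item in part.get("items") or []:
            key = FIELD_MAP.get(item.get("type"))
            if key is not None and key not in record:
                record[key] = item.get("value")  # keep the first value seen
    return record
```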
## Key Findings
### Database Contents
- **Total records**: 505,884
- **Record types**: Institutions, archival fonds, originators, finding aids, etc.
- **Institutions**: Unknown count (need to filter by `type="INSTITUTION"` from detail API)
### Challenge: Identifying Institutions
The list API **does NOT include** the `type` field. This means:
- **Cannot filter institutions** at the list level
- **Must fetch the detail for each record** to check its `type` field
- ⚠️ **505,884 detail API calls required** to identify all institutions
### Web Interface Shows ~560 Institutions
The browser interface at `https://portal.nacr.cz/aron/institution` shows:
- 56 pages × 10 per page = ~560 institutions
- **This uses a different filter/query** that we haven't found yet
## Recommendation: Two Approaches
### Approach A: Smart Filtering (RECOMMENDED) ⭐
**Strategy**: Filter by name patterns in list API, then verify with detail API
1. **Phase 1**: Scan all 505,884 records in list API
- Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
- Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
- Estimated: ~5,000-10,000 candidates
2. **Phase 2**: Fetch details for candidates only
- Check `type == "INSTITUTION"`
- Extract full metadata
- Estimated: ~560 confirmed institutions
**Pros**:
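- Reduces detail API calls from ~505k to ~5,000-10,000 (plus about 5,059 list requests at 100 records per page)
- Faster execution (~2-3 hours)
- Stays within the self-imposed throttle of 2 req/sec (the API's real limits are undocumented)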
**Cons**:
- May miss institutions with unusual names
- Name-based filtering is heuristic
**Estimated time**: 2-3 hours total
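The Phase 1 name filter can be illustrated with a short sketch. The include and exclude terms are the ones listed above (the "etc." in the include list is not filled in); the function name `is_candidate` and the rule that an exclude term always wins are assumptions, not a description of the actual script.

```python
import unicodedata

INCLUDE_TERMS = ("archiv", "muzeum", "galerie")
EXCLUDE_TERMS = ("inventář", "pomůcka", "fond", "družstvo")


def is_candidate(name: str) -> bool:
    """Heuristic check: does a list-API record name look like an institution?"""
    # Normalize so composed and decomposed diacritics compare equal.
    lowered = unicodedata.normalize("NFC", name).casefold()
    if any(term in lowered for term in EXCLUDE_TERMS):
        return False  # assumed precedence: exclusion beats inclusion
    return any(term in lowered for term in INCLUDE_TERMS)
```

Plain substring matching does not handle Czech inflection: "pomůcka" does not match "pomůcky", for example. This is one concrete way the heuristic can let non-institutions through or drop real ones.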
### Approach B: Complete Scan (THOROUGH)
**Strategy**: Fetch details for ALL 505,884 records, filter by type
1. Fetch detail API for every record
2. Filter where `type == "INSTITUTION"`
3. Extract metadata
**Pros**:
- 100% coverage
- No false negatives
**Cons**:
- 505,884 API calls
- ~70 hours at 2 req/sec
- May hit rate limits
- Excessive load on server
**Estimated time**: 70 hours (3 days continuous)
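The time estimates for both approaches follow from simple arithmetic, reproduced here as a sketch. It counts only the throttle delay at 2 req/sec and ignores response latency and retries, so real runs will take somewhat longer; the variable names are illustrative.

```python
import math

TOTAL_RECORDS = 505_884
PAGE_SIZE = 100
REQUESTS_PER_SECOND = 2


def hours(requests: int, rate: float = REQUESTS_PER_SECOND) -> float:
    """Wall-clock hours needed to send `requests` at a fixed request rate."""
    return requests / rate / 3600


# Listing every record takes one request per page of 100.
list_requests = math.ceil(TOTAL_RECORDS / PAGE_SIZE)

# Approach A: full list scan plus 5,000-10,000 detail requests.
approach_a_low = hours(list_requests + 5_000)
approach_a_high = hours(list_requests + 10_000)

# Approach B: one detail request per record (the list scan adds under an hour).
approach_b_details_only = hours(TOTAL_RECORDS)
approach_b_with_list = hours(TOTAL_RECORDS + list_requests)

print(f"list requests: {list_requests}")
print(f"Approach A: {approach_a_low:.1f}-{approach_a_high:.1f} h")
print(f"Approach B: {approach_b_details_only:.1f} h (details only), "
      f"{approach_b_with_list:.1f} h (with list scan)")
```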
## Implemented Solution
Created: `scripts/scrapers/scrape_czech_archives_aron.py`
**Current implementation**: Approach A (smart filtering)
**Features**:
- Two-phase scraping (list → filter → details)
- Cursor-based pagination
- Rate limiting (0.5s delay = 2 req/sec)
- Resumable (can interrupt and restart)
- Progress display
- LinkML-compliant output
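How rate limiting and resumability can fit together is sketched below. This is not the code of `scrape_czech_archives_aron.py`; the `Throttle` class, the checkpoint functions, and the checkpoint file layout are hypothetical and only illustrate the two features.

```python
import json
import time
from pathlib import Path
from typing import Callable, List, Optional, Tuple


class Throttle:
    """Enforce a minimum interval between calls (0.5 s gives 2 req/sec)."""

    def __init__(self, min_interval: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def save_checkpoint(path: Path, cursor: Optional[str], candidate_ids: List[str]) -> None:
    """Write progress to a temp file, then rename it so an interrupt cannot leave a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"cursor": cursor, "candidate_ids": candidate_ids}),
                   encoding="utf-8")
    tmp.replace(path)


def load_checkpoint(path: Path) -> Tuple[Optional[str], List[str]]:
    """Return the saved cursor and candidate IDs, or a fresh state if no checkpoint exists."""
    if not path.exists():
        return None, []
    state = json.loads(path.read_text(encoding="utf-8"))
    return state["cursor"], state["candidate_ids"]
```

Saving the last `searchAfter` cursor together with the candidates found so far lets a restarted run continue from the same page instead of rescanning from the beginning.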
**Usage**:
```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py
```
**Output**: `data/instances/czech_archives_aron.yaml`
## Alternative: Find Specific Institution Endpoint
The browser interface must use a different endpoint or filter parameter. Possible options:
1. **Check for filter parameter**:
```bash
POST .../listview?listType=RECORD~TYPE
Body: {"recordType": "INSTITUTION"}
```
Status: Tried, didn't work (still returns 505k)
2. **Check for different listType**:
- `listType=EVIDENCE-LIST` (current, returns all)
- `listType=RECORD~TYPE` (tried, same result)
- Other types? (need to investigate)
3. **Check browser network tab more carefully**:
- The `/aron/institution` page makes a POST request
- Need to capture exact request body with filters
## Next Steps
### Option 1: Run Smart Filter Scraper (RECOMMENDED)
```bash
python3 scripts/scrapers/scrape_czech_archives_aron.py
```
- Expected: ~560 institutions in 2-3 hours
- Safe, respectful of server
### Option 2: Investigate Institution-Specific Endpoint
- Use browser DevTools on `/aron/institution` page
- Capture POST request body when loading page
- Look for filter parameters we missed
- **Try this before running the scraper**: a working institution filter would make the full list scan unnecessary
### Option 3: Contact Národní archiv
- Email: posta@nacr.cz or arch@mvcr.cz
- Ask: "Is there an API endpoint for institutions only?"
- Mention: "We found undocumented API but it returns 505k records"
## API Rate Limits
**Not documented**. The scraper therefore applies conservative, self-imposed limits:
- Max 2 requests/second (0.5s delay)
- Avoid burst requests
- Monitor for 429 errors
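One way to react to 429 responses is to retry with a growing delay, as in the sketch below. Whether this API ever returns 429 or sends a `Retry-After` header is unknown; the function names, retry count, and delays are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def backoff_delay(attempt: int, retry_after: Optional[str], base_delay: float = 1.0) -> float:
    """Use Retry-After when it is a whole number of seconds, else exponential backoff."""
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return base_delay * 2 ** attempt


def fetch_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0,
                       opener: Callable = urllib.request.urlopen,
                       sleep: Callable[[float], None] = time.sleep) -> bytes:
    """GET a URL, retrying only on HTTP 429 and re-raising every other error."""
    for attempt in range(max_attempts):
        try:
            with opener(url, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            retry_after = err.headers.get("Retry-After") if err.headers else None
            sleep(backoff_delay(attempt, retry_after, base_delay))
    raise RuntimeError("max_attempts must be at least 1")
```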
## Data Quality
**TIER_1_AUTHORITATIVE**:
- Source: Národní archiv (National Archives)
- Official national registry
- Real-time updates by archivists
- High quality metadata
**License**: Unknown (possibly CC0, as with the library data; not verified)
---
**Status**: API discovered, scraper ready, awaiting final decision on approach
**Next action**: Capture exact browser request to find institution filter, OR run smart filter scraper
**Estimated completion**: 2-3 hours for scraping