# Czech Archives ARON API Investigation Results

**Date**: 2025-11-19
**API Base**: https://portal.nacr.cz/aron/api/aron

## Summary

✅ **FOUND**: Undocumented REST API for the ARON portal
✅ **WORKS**: The API returns JSON data without authentication
⚠️ **CHALLENGE**: The list API returns ALL records (505,884), so institutions have to be filtered out separately

## API Endpoints Discovered

### 1. List All Records

```bash
POST https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST
```

**Request Body**:
```jsonc
{
  "size": 100,             // Records per page (maximum appears to be 100)
  "searchAfter": "[...]"   // Cursor for pagination (optional, from previous response)
}
```

**Response Structure**:
```jsonc
{
  "count": 505884,                     // Total records in database
  "items": [                           // Array of records
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]",  // Cursor for next page
  "aggregations": null
}
```

**Pagination**: Cursor-based (`searchAfter`), not page numbers. Each request passes the `searchAfter` value returned by the previous response.
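
The cursor loop described above can be sketched as follows. This is a minimal illustration, not the code of the actual scraper: the `Content-Type` header, the function names `fetch_page` and `iter_records`, and the stop condition (an empty `items` array or a missing or unchanged cursor) are assumptions, since the API is undocumented. No rate limiting is included here.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"


def fetch_page(size: int, search_after: Optional[str]) -> dict:
    """POST one list request. The Content-Type header is an assumption."""
    body = {"size": size}
    if search_after is not None:
        body["searchAfter"] = search_after
    request = urllib.request.Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(
    fetch: Callable[[int, Optional[str]], dict] = fetch_page,
    size: int = 100,
) -> Iterator[dict]:
    """Yield every list record by following the searchAfter cursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch(size, cursor)
        items = page.get("items") or []
        if not items:
            return  # assumed end-of-data signal
        yield from items
        next_cursor = page.get("searchAfter")
        if not next_cursor or next_cursor == cursor:
            return  # no further cursor, or the cursor did not advance
        cursor = next_cursor
```

Passing the fetch function as a parameter keeps the loop testable without touching the live server; a real run would add a delay between calls.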

### 2. Get Individual Record Detail

```bash
GET https://portal.nacr.cz/aron/api/aron/apu/{uuid}
```

**Response includes**:
- `type`: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
- `parts`: Array of metadata parts with items
- `name`, `description`, `id`, etc.

**Example metadata items**:
- `INST~CODE`: Institution code
- `INST~SHORT~NAME`: Short name
- `INST~ADDRESS`: Address
- `INST~PHONE`: Phone
- `INST~EMAIL`: Email
- `INST~URL`: Website
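
As a rough sketch of how these metadata items might be pulled out of a detail response: the exact shape of `parts` is not recorded in these notes, so the code below assumes each part has an `items` array whose entries carry a `type` key (e.g. `INST~CODE`) and a `value` key. Those two key names, the output field names, and the function name `extract_institution` are hypothetical and should be checked against a real response.

```python
from typing import Any, Dict, Optional

# Maps ARON item types (from the notes above) to output field names (hypothetical).
FIELD_MAP = {
    "INST~CODE": "code",
    "INST~SHORT~NAME": "short_name",
    "INST~ADDRESS": "address",
    "INST~PHONE": "phone",
    "INST~EMAIL": "email",
    "INST~URL": "url",
}


def extract_institution(detail: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a flat record for an institution, or None for any other type.

    ASSUMPTION: items look like {"type": "INST~CODE", "value": "..."}.
    """
    if detail.get("type") != "INSTITUTION":
        return None
    record: Dict[str, Any] = {"id": detail.get("id"), "name": detail.get("name")}
    for part in detail.get("parts") or []:
        for item in part.get("items") or []:
            field = FIELD_MAP.get(item.get("type"))
            if field and field not in record:
                record[field] = item.get("value")  # keep the first occurrence
    return record
```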

## Key Findings

### Database Contents
- **Total records**: 505,884
- **Record types**: Institutions, archival fonds, originators, finding aids, etc.
- **Institutions**: Unknown count (need to filter by `type == "INSTITUTION"` via the detail API)

### Challenge: Identifying Institutions

The list API **does NOT include** the `type` field. This means:

❌ **Cannot filter institutions** at list level
✅ **Must fetch detail for each record** to check the `type` field
⚠️ **505,884 API calls required** to identify all institutions

### Web Interface Shows ~560 Institutions

The browser interface at `https://portal.nacr.cz/aron/institution` shows:
- 56 pages × 10 per page = ~560 institutions
- **This uses a different filter/query** that we haven't found yet

## Recommendation: Two Approaches

### Approach A: Smart Filtering (RECOMMENDED) ⭐

**Strategy**: Filter by name patterns in list API, then verify with detail API

1. **Phase 1**: Scan all 505,884 records in list API
   - Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
   - Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
   - Estimated: ~5,000-10,000 candidates

2. **Phase 2**: Fetch details for candidates only
   - Check `type == "INSTITUTION"`
   - Extract full metadata
   - Estimated: ~560 confirmed institutions
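
The Phase 1 name filter can be sketched as a simple substring check. The include and exclude terms are the ones listed above (the "etc." in the notes means the real list is longer); the function name `is_candidate` and the choice to let an exclusion win over an inclusion are assumptions, not a description of the actual scraper.

```python
# Terms taken from the notes above; the real scraper may use more.
INCLUDE_TERMS = ("archiv", "muzeum", "galerie")
EXCLUDE_TERMS = ("inventář", "pomůcka", "fond", "družstvo")


def is_candidate(name: str) -> bool:
    """Heuristic: True if the record name looks like an institution."""
    lowered = name.casefold()
    # ASSUMPTION: an exclude term always overrides an include term.
    if any(term in lowered for term in EXCLUDE_TERMS):
        return False
    return any(term in lowered for term in INCLUDE_TERMS)
```

Because the match is on substrings, an exclude term such as "fond" also rejects inflected forms like "fondu", which is one way a genuine institution with an unusual name could be missed.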

**Pros**:
- Reduces detail API calls from ~505k to ~10k (plus roughly 5,060 list requests at 100 records per page)
- Faster execution (~2-3 hours)
- Stays within the self-imposed rate limit (2 req/sec)

**Cons**:
- May miss institutions with unusual names
- Name-based filtering is heuristic

**Estimated time**: 2-3 hours total

### Approach B: Complete Scan (THOROUGH)

**Strategy**: Fetch details for ALL 505,884 records, filter by type

1. Fetch detail API for every record
2. Filter where `type == "INSTITUTION"`
3. Extract metadata

**Pros**:
- 100% coverage
- No false negatives

**Cons**:
- 505,884 API calls
- ~70 hours at 2 req/sec
- May hit rate limits
- Excessive load on server

**Estimated time**: 70 hours (3 days continuous)
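
The time estimates for both approaches follow from request count × delay. The short calculation below reproduces them under the stated assumptions: 100 records per list page, a fixed 0.5 s delay, the upper candidate estimate of 10,000 for Approach A, and no allowance for network latency (so these are lower bounds).

```python
import math

TOTAL_RECORDS = 505_884
PAGE_SIZE = 100          # observed maximum page size
DELAY_S = 0.5            # 2 requests per second
CANDIDATES_A = 10_000    # upper end of the Phase 1 estimate


def hours(requests: int, delay_s: float = DELAY_S) -> float:
    """Wall-clock hours spent waiting between requests."""
    return requests * delay_s / 3600


# One list request per page of 100 records.
list_requests = math.ceil(TOTAL_RECORDS / PAGE_SIZE)

# Approach A: full list scan plus detail calls for candidates only.
approach_a_requests = list_requests + CANDIDATES_A

# Approach B: one detail call per record (the list scan needed to
# collect the IDs would add under an hour on top).
approach_b_requests = TOTAL_RECORDS

print(f"list pages: {list_requests}")
print(f"Approach A: {approach_a_requests} requests, {hours(approach_a_requests):.1f} h")
print(f"Approach B: {approach_b_requests} requests, {hours(approach_b_requests):.1f} h")
```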

## Implemented Solution

Created: `scripts/scrapers/scrape_czech_archives_aron.py`

**Current implementation**: Approach A (smart filtering)

**Features**:
- Two-phase scraping (list → filter → details)
- Cursor-based pagination
- Rate limiting (0.5s delay = 2 req/sec)
- Resumable (can interrupt and restart)
- Progress display
- LinkML-compliant output

**Usage**:
```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py
```

**Output**: `data/instances/czech_archives_aron.yaml`

## Alternative: Find Specific Institution Endpoint

The browser interface must use a different endpoint or filter parameter. Possible options:

1. **Check for filter parameter**:
   ```bash
   POST .../listview?listType=RECORD~TYPE
   Body: {"recordType": "INSTITUTION"}
   ```
   Status: Tried, didn't work (still returns 505k)

2. **Check for different listType**:
   - `listType=EVIDENCE-LIST` (current, returns all)
   - `listType=RECORD~TYPE` (tried, same result)
   - Other types? (need to investigate)

3. **Check browser network tab more carefully**:
   - The `/aron/institution` page makes a POST request
   - Need to capture exact request body with filters
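
One way to test candidate requests systematically is to compare the `count` each variant returns against the unfiltered total of 505,884: any variant that returns a smaller count is applying some filter. The sketch below assumes that `size: 1` is accepted by the server and that a JSON `Content-Type` header is appropriate; the names `post_listview` and `find_filtering_variants` are hypothetical. Only variants captured from the browser or listed above should be fed in, since no other `listType` values are known.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, List, Tuple

BASE_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview"
Variant = Tuple[str, dict]  # (listType value, extra request-body fields)


def post_listview(list_type: str, body: dict) -> dict:
    """POST one listview request. The header choice is an assumption."""
    url = BASE_URL + "?" + urllib.parse.urlencode({"listType": list_type})
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def find_filtering_variants(
    post: Callable[[str, dict], dict],
    variants: List[Variant],
    baseline_count: int = 505_884,
) -> List[Tuple[Variant, int]]:
    """Return the variants whose reported count is below the unfiltered total."""
    hits = []
    for list_type, extra in variants:
        # size=1 keeps responses small; only the count matters here.
        response = post(list_type, {"size": 1, **extra})
        count = response.get("count")
        if isinstance(count, int) and count < baseline_count:
            hits.append(((list_type, extra), count))
    return hits
```

A variant that reports roughly 560 records would be a strong sign that the institution filter has been found.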

## Next Steps

### Option 1: Run Smart Filter Scraper (RECOMMENDED)
```bash
python3 scripts/scrapers/scrape_czech_archives_aron.py
```
- Expected: ~560 institutions in 2-3 hours
- Safe, respectful of server

### Option 2: Investigate Institution-Specific Endpoint
- Use browser DevTools on `/aron/institution` page
- Capture POST request body when loading page
- Look for filter parameters we missed
- **Try this first before running scraper**

### Option 3: Contact Národní archiv
- Email: posta@nacr.cz or arch@mvcr.cz
- Ask: "Is there an API endpoint for institutions only?"
- Mention: "We found undocumented API but it returns 505k records"

## API Rate Limits

**Not documented**, so the scraper stays conservative:
- Max 2 requests/second (0.5s delay)
- Avoid burst requests
- Monitor for 429 errors
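
A minimal sketch of these three rules, assuming nothing about the actual scraper's internals: `Throttle` enforces a minimum interval between requests, and `call_with_backoff` retries on HTTP 429. The exponential backoff schedule (1 s, 2 s, 4 s, ...) and the retry limit are assumptions, since the server's real limits are unknown.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


class Throttle:
    """Allow at most one call per min_interval_s (0.5 s = 2 req/sec)."""

    def __init__(self, min_interval_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        delay = self._next_allowed - self._clock()
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = max(self._clock(), self._next_allowed) + self._interval


def call_with_backoff(request: Callable[[], T], throttle: Throttle,
                      max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run request(), retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        throttle.wait()
        try:
            return request()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # other HTTP errors are not rate-limit signals
            sleep(2 ** attempt)  # ASSUMED schedule: 1 s, 2 s, 4 s, ...
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```

Here `request` would be a zero-argument callable wrapping one API call, for example a `lambda` around `urllib.request.urlopen`.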

## Data Quality

**TIER_1_AUTHORITATIVE**:
- Source: Národní archiv (National Archives)
- Official national registry
- Real-time updates by archivists
- High quality metadata

**License**: Unknown (likely CC0 like library data)

---

**Status**: API discovered, scraper ready, awaiting final decision on approach
**Next action**: Capture exact browser request to find institution filter, OR run smart filter scraper
**Estimated completion**: 2-3 hours for scraping