glam/CZECH_ARON_API_INVESTIGATION.md

# Czech Archives ARON API Investigation Results

**Date**: 2025-11-19
**API Base**: https://portal.nacr.cz/aron/api/aron

## Summary

✅ **FOUND**: Undocumented REST API for ARON portal
✅ **WORKS**: API returns JSON data without authentication
⚠️ **CHALLENGE**: The list endpoint returns ALL records (505,884), so institutions must be identified by additional filtering

## API Endpoints Discovered

### 1. List All Records
```bash
POST https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST
```

**Request Body**:
```json
{
  "size": 100,
  "searchAfter": "[...]"
}
```
- `size`: records per page (the maximum appears to be 100)
- `searchAfter`: cursor for pagination (optional, taken from the previous response)

**Response Structure**:
```json
{
  "count": 505884,
  "items": [
    {
      "id": "uuid",
      "name": "Record name",
      "description": "Optional description",
      "order": 1
    }
  ],
  "searchAfter": "[\"next-cursor\"]",
  "aggregations": null
}
```
- `count`: total records in the database
- `items`: array of records
- `searchAfter`: cursor for the next page

**Pagination**: Uses cursor-based pagination (searchAfter), not page numbers
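
A minimal sketch of how the cursor could be followed, assuming the endpoint accepts a JSON body with a `Content-Type: application/json` header and that an empty page or a missing `searchAfter` marks the end of the data (neither is confirmed by documentation). The names `post_json` and `iter_records` are hypothetical and are not taken from the actual scraper.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator

LIST_URL = "https://portal.nacr.cz/aron/api/aron/apu/listview?listType=EVIDENCE-LIST"


def post_json(body: dict) -> dict:
    """POST a JSON body to the list endpoint and decode the JSON response."""
    request = urllib.request.Request(
        LIST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(post: Callable[[dict], dict] = post_json,
                 page_size: int = 100,
                 delay: float = 0.5) -> Iterator[dict]:
    """Yield every list item, following the searchAfter cursor page by page."""
    cursor = None
    while True:
        body = {"size": page_size}
        if cursor is not None:
            body["searchAfter"] = cursor
        page = post(body)
        items = page.get("items") or []
        yield from items
        cursor = page.get("searchAfter")
        # Assumption: an empty page or a missing cursor means no more data.
        if not items or not cursor:
            break
        time.sleep(delay)  # 0.5 s between requests = 2 req/sec
```

Passing the transport in as `post` keeps the pagination logic testable without touching the live server.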

### 2. Get Individual Record Detail
```bash
GET https://portal.nacr.cz/aron/api/aron/apu/{uuid}
```

**Response includes**:
- `type`: Record type (e.g., "INSTITUTION", "FOND", "ORIGINATOR")
- `parts`: Array of metadata parts with items
- `name`, `description`, `id`, etc.

**Example metadata items**:
- `INST~CODE`: Institution code
- `INST~SHORT~NAME`: Short name
- `INST~ADDRESS`: Address
- `INST~PHONE`: Phone
- `INST~EMAIL`: Email
- `INST~URL`: Website
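
The mapping from these metadata codes to flat fields might look like the sketch below. It assumes each entry in `parts` has an `items` list whose elements carry the code under a `type` key and the content under a `value` key; those two key names are guesses and should be checked against a real detail response. The function name and output field names are hypothetical.

```python
from typing import Optional

# Metadata codes observed in the detail API, mapped to illustrative field names.
INSTITUTION_FIELDS = {
    "INST~CODE": "code",
    "INST~SHORT~NAME": "short_name",
    "INST~ADDRESS": "address",
    "INST~PHONE": "phone",
    "INST~EMAIL": "email",
    "INST~URL": "url",
}


def extract_institution(record: dict) -> Optional[dict]:
    """Return a flat dict for INSTITUTION records, or None for other types."""
    if record.get("type") != "INSTITUTION":
        return None
    result = {"id": record.get("id"), "name": record.get("name")}
    for part in record.get("parts") or []:
        for item in part.get("items") or []:
            # Assumed item shape: {"type": "INST~CODE", "value": "..."}
            field = INSTITUTION_FIELDS.get(item.get("type"))
            if field and field not in result:
                result[field] = item.get("value")
    return result
```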

## Key Findings

### Database Contents
- **Total records**: 505,884
- **Record types**: Institutions, archival fonds, originators, finding aids, etc.
- **Institutions**: Exact count unconfirmed via the API (the web interface suggests ~560); confirming requires checking `type == "INSTITUTION"` in the detail API

### Challenge: Identifying Institutions

The list API **does NOT include** the `type` field. This means:

❌ **Cannot filter institutions** at list level
✅ **Must fetch detail for each record** to check `type` field
⚠️ **505,884 API calls required** to identify all institutions

### Web Interface Shows ~560 Institutions

The browser interface at `https://portal.nacr.cz/aron/institution` shows:
- 56 pages × 10 per page = ~560 institutions
- **This uses a different filter/query** that we haven't found yet

## Recommendation: Two Approaches

### Approach A: Smart Filtering (RECOMMENDED) ⭐

**Strategy**: Filter by name patterns in list API, then verify with detail API

1. **Phase 1**: Scan all 505,884 records in list API
   - Filter by name patterns (contains "archiv", "muzeum", "galerie", etc.)
   - Exclude non-institutions ("inventář", "pomůcka", "fond", "družstvo")
   - Estimated: ~5,000-10,000 candidates

2. **Phase 2**: Fetch details for candidates only
   - Check `type == "INSTITUTION"`
   - Extract full metadata
   - Estimated: ~560 confirmed institutions
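
The Phase 1 name filter can be sketched as follows. Only the patterns named above are included; the full pattern lists in the scraper may differ, and the function name `is_candidate` is hypothetical.

```python
# Substrings that suggest an institution name (from the list above).
INCLUDE = ("archiv", "muzeum", "galerie")
# Substrings that mark finding aids, fonds, and other non-institutions.
EXCLUDE = ("inventář", "pomůcka", "fond", "družstvo")


def is_candidate(name: str) -> bool:
    """Heuristic: keep names with an include term and no exclude term."""
    lowered = name.casefold()
    if any(term in lowered for term in EXCLUDE):
        return False
    return any(term in lowered for term in INCLUDE)


# Illustrative names only, not taken from API output.
print(is_candidate("Státní okresní archiv Kladno"))   # True
print(is_candidate("Inventář fondu Archiv města"))    # False
```

Exclusions are checked first because many fond and finding-aid titles also contain "archiv".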

**Pros**:
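- Reduces detail API calls from ~505k to ~10k (plus ~5,060 list calls to page through all records)
- Faster execution (~2-3 hours)
- Stays within the self-imposed rate limit (2 req/sec); the API's actual limits are undocumented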

**Cons**:
- May miss institutions with unusual names
- Name-based filtering is heuristic

**Estimated time**: 2-3 hours total

### Approach B: Complete Scan (THOROUGH)

**Strategy**: Fetch details for ALL 505,884 records, filter by type

1. Fetch detail API for every record
2. Filter where `type == "INSTITUTION"`
3. Extract metadata

**Pros**:
- 100% coverage
- No false negatives

**Cons**:
- 505,884 API calls
- ~70 hours at 2 req/sec
- May hit rate limits
- Excessive load on server

**Estimated time**: 70 hours (3 days continuous)
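
The time estimates for both approaches follow from the request counts and the 2 req/sec limit. A quick check, assuming the delay dominates and ignoring network latency, and taking the upper bound of 10,000 candidates for Approach A:

```python
import math

TOTAL_RECORDS = 505_884
PAGE_SIZE = 100
REQUESTS_PER_SECOND = 2.0


def hours_for(requests: int, rate: float = REQUESTS_PER_SECOND) -> float:
    """Wall-clock hours needed to issue `requests` calls at `rate` per second."""
    return requests / rate / 3600


list_calls = math.ceil(TOTAL_RECORDS / PAGE_SIZE)   # 5,059 list pages
approach_a = hours_for(list_calls + 10_000)          # list scan + candidate details
approach_b = hours_for(TOTAL_RECORDS)                # one detail call per record

print(f"Approach A: ~{approach_a:.1f} h, Approach B: ~{approach_b:.1f} h")
# Approach A: ~2.1 h, Approach B: ~70.3 h
```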

## Implemented Solution

Created: `scripts/scrapers/scrape_czech_archives_aron.py`

**Current implementation**: Approach A (smart filtering)

**Features**:
- Two-phase scraping (list → filter → details)
- Cursor-based pagination
- Rate limiting (0.5s delay = 2 req/sec)
- Resumable (can interrupt and restart)
- Progress display
- LinkML-compliant output
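
One way resumability could work is to persist the pagination cursor and the candidates found so far after each page. This is a sketch of that idea only, not the scraper's actual mechanism; the state layout and the function names are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"cursor": None, "candidates": []}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so an interrupt never leaves
    a half-written checkpoint behind."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, ensure_ascii=False)
    os.replace(tmp_name, path)
```

On restart, the scan would resume by sending the stored `cursor` as `searchAfter` in the next list request.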

**Usage**:
```bash
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/scrape_czech_archives_aron.py
```

**Output**: `data/instances/czech_archives_aron.yaml`

## Alternative: Find Specific Institution Endpoint

The browser interface must use a different endpoint or filter parameter. Possible options:

1. **Check for filter parameter**:
   ```bash
   POST .../listview?listType=RECORD~TYPE
   Body: {"recordType": "INSTITUTION"}
   ```
   Status: Tried, didn't work (still returns 505k)

2. **Check for different listType**:
   - `listType=EVIDENCE-LIST` (current, returns all)
   - `listType=RECORD~TYPE` (tried, same result)
   - Other types? (need to investigate)

3. **Check browser network tab more carefully**:
   - The `/aron/institution` page makes a POST request
   - Need to capture exact request body with filters
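
Once candidate request bodies have been captured from the browser, they can be replayed and judged by whether `count` drops below the full total. A sketch, assuming `post` is any callable that sends a body to the list endpoint and returns the decoded JSON; the function name and the labels used for candidates are hypothetical.

```python
from typing import Callable, Dict

TOTAL_RECORDS = 505_884


def find_effective_filters(post: Callable[[dict], dict],
                           candidate_bodies: Dict[str, dict],
                           total: int = TOTAL_RECORDS) -> Dict[str, int]:
    """Return the candidates whose response count is smaller than the
    unfiltered total, i.e. the bodies that actually filtered something."""
    hits = {}
    for label, body in candidate_bodies.items():
        count = post(body).get("count")
        if isinstance(count, int) and 0 < count < total:
            hits[label] = count
    return hits
```

A body that still returns 505,884 (like the `recordType` attempt above) is reported as ineffective simply by being absent from the result.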

## Next Steps

### Option 1: Run Smart Filter Scraper (RECOMMENDED)
```bash
python3 scripts/scrapers/scrape_czech_archives_aron.py
```
- Expected: ~560 institutions in 2-3 hours
- Safe, respectful of server

### Option 2: Investigate Institution-Specific Endpoint
- Use browser DevTools on `/aron/institution` page
- Capture POST request body when loading page
- Look for filter parameters we missed
- **Try this before running the scraper**: a working institution filter would make the full list scan unnecessary

### Option 3: Contact Národní archiv
- Email: posta@nacr.cz or arch@mvcr.cz
- Ask: "Is there an API endpoint for institutions only?"
- Mention: "We found undocumented API but it returns 505k records"

## API Rate Limits

**Not documented**. The scraper uses conservative defaults:
- Max 2 requests/second (0.5s delay)
- Avoid burst requests
- Monitor for 429 errors
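
Monitoring for 429 responses could be paired with exponential backoff, as in the sketch below. It assumes requests go through `urllib` (which raises `HTTPError` on a 429); the retry count and base delay are arbitrary illustrative values, and `call_with_backoff` is a hypothetical helper.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(action: Callable[[], T],
                      max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `action`, retrying on HTTP 429 with delays of 1 s, 2 s, 4 s, ..."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except urllib.error.HTTPError as error:
            # Re-raise anything that is not a rate-limit response,
            # and give up once the retry budget is spent.
            if error.code != 429 or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```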

## Data Quality

**TIER_1_AUTHORITATIVE**:
- Source: Národní archiv (National Archives)
- Official national registry
- Real-time updates by archivists
- High quality metadata

**License**: Unknown (possibly CC0, as with library data; unverified)

---

**Status**: API discovered, scraper ready, awaiting final decision on approach
**Next action**: Capture exact browser request to find institution filter, OR run smart filter scraper
**Estimated completion**: 2-3 hours for scraping