# Australian Heritage Institution Extraction - Trove API

## Overview

This document describes the extraction of Australian heritage custodian organizations from the **Trove API** (National Library of Australia).

## Data Source

- **Authority**: National Library of Australia (NLA)
- **System**: Trove API v3
- **ISIL Registry**: Australian Interlibrary Resource Sharing (ILRS) Directory
- **Coverage**: Organizations that contribute to Trove and the Australian National Bibliographic Database (ANBD)

### What is Trove?

Trove is Australia's national discovery service, aggregating collections from libraries, archives, museums, galleries, and other heritage institutions across Australia.

### What is NUC?

**NUC (National Union Catalogue)** symbols are unique identifiers for Australian heritage institutions. They function as Australia's ISIL equivalent:

- **Format**: `AU-{NUC}` (e.g., `AU-NLA` for National Library of Australia)
- **Scope**: Identifies contributing organizations in the Australian National Bibliographic Database
- **Standards**: Complies with ISO 15511 (ISIL) standard

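The `AU-{NUC}` mapping can be sketched as a pair of small helpers. This is only an illustration: the names `nuc_to_isil` and `isil_to_nuc` are hypothetical and are not taken from the extraction script, and stripping surrounding whitespace is an assumed normalization step.

```python
ISIL_PREFIX = "AU-"  # country prefix used in the AU-{NUC} format


def nuc_to_isil(nuc: str) -> str:
    """Build an ISIL code from a NUC symbol (hypothetical helper)."""
    nuc = nuc.strip()
    if not nuc:
        raise ValueError("NUC symbol must not be empty")
    return f"{ISIL_PREFIX}{nuc}"


def isil_to_nuc(isil: str) -> str:
    """Recover the NUC symbol from an Australian ISIL code (hypothetical helper)."""
    isil = isil.strip()
    if not isil.startswith(ISIL_PREFIX) or len(isil) == len(ISIL_PREFIX):
        raise ValueError(f"not an Australian ISIL code: {isil!r}")
    return isil[len(ISIL_PREFIX):]


print(nuc_to_isil("NLA"))     # AU-NLA
print(isil_to_nuc("AU-NLA"))  # NLA
```
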
## Extraction Script

### Location

`scripts/extract_trove_contributors.py`

### Features

- ✅ Extracts all Trove contributors via API v3
- ✅ Retrieves full metadata (name, NUC code, URLs, access policies)
- ✅ Maps to LinkML HeritageCustodian schema v0.2.1
- ✅ Generates GHCID persistent identifiers (UUID v5, numeric, base string)
- ✅ Classifies institutions using GLAMORCUBESFIXPHDNT taxonomy
- ✅ Exports to YAML, JSON, and CSV formats
- ✅ Tracks provenance metadata (TIER_1_AUTHORITATIVE)
- ✅ Respects API rate limits (200 requests/minute)

### Requirements

1. **Trove API Key** (free registration required)
   - Register at: https://trove.nla.gov.au/about/create-something/using-api
   - Follow "Sign up for an API key" instructions
   - Key is emailed immediately after registration

2. **Python Dependencies** (already installed in this project)
   - `requests` - HTTP client for API calls
   - `pyyaml` - YAML parsing and generation
   - `pydantic` - Data validation (LinkML schema)

### Usage

#### Basic Usage

```bash
python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY
```

#### Advanced Options

```bash
# Specify output directory
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --output-dir data/instances/australia

# Adjust rate limiting (default: 0.3s delay = 200 req/min)
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --delay 0.5

# Export specific formats only
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --formats yaml json
```

#### Command-Line Arguments

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--api-key` | ✅ Yes | - | Trove API key (from NLA registration) |
| `--output-dir` | No | `data/instances` | Output directory for exported files |
| `--delay` | No | `0.3` | Delay between API calls (seconds) |
| `--formats` | No | `yaml json csv` | Export formats (choose from: yaml, json, csv) |

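The argument table can be mirrored with `argparse` roughly as follows. This is a sketch built from the table alone, not the script's actual parser; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented arguments (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Extract Trove contributors")
    parser.add_argument("--api-key", required=True,
                        help="Trove API key (from NLA registration)")
    parser.add_argument("--output-dir", default="data/instances",
                        help="Output directory for exported files")
    parser.add_argument("--delay", type=float, default=0.3,
                        help="Delay between API calls (seconds)")
    parser.add_argument("--formats", nargs="+",
                        choices=["yaml", "json", "csv"],
                        default=["yaml", "json", "csv"],
                        help="Export formats")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--api-key", "YOUR_KEY", "--formats", "yaml", "json"])
print(args.output_dir, args.delay, args.formats)
```
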
### Output Files

The script generates timestamped files in the output directory:

```
data/instances/
├── trove_contributors_20251118_143000.yaml    # LinkML-compliant YAML
├── trove_contributors_20251118_143000.json    # JSON format
└── trove_contributors_20251118_143000.csv     # Flattened CSV
```

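The naming pattern shown above (`trove_contributors_YYYYMMDD_HHMMSS.<format>`) could be produced along these lines. The helper name `output_paths` is hypothetical, and whether the script stamps files in local time or UTC is not stated here, so the use of `datetime.now()` is an assumption.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, Iterable, Optional


def output_paths(output_dir: str, formats: Iterable[str],
                 now: Optional[datetime] = None) -> Dict[str, Path]:
    """Return one timestamped output path per export format (illustrative)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        fmt: Path(output_dir) / f"trove_contributors_{stamp}.{fmt}"
        for fmt in formats
    }


# A fixed timestamp reproduces the file names from the listing above.
paths = output_paths("data/instances", ["yaml", "json", "csv"],
                     now=datetime(2025, 11, 18, 14, 30, 0))
for path in paths.values():
    print(path.name)
```
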
### Output Schema

Records conform to LinkML `HeritageCustodian` schema (v0.2.1):

```yaml
- id: https://w3id.org/heritage/custodian/au/nla
  record_id: "uuid-v4-database-id"
  ghcid_uuid: "uuid-v5-persistent-id"
  ghcid_numeric: 213324328442227739
  ghcid_current: AU-ACT-CAN-L-NLA
  name: National Library of Australia
  official_name: National Library of Australia
  institution_type: L  # Library
  identifiers:
    - identifier_scheme: NUC
      identifier_value: NLA
      identifier_url: https://www.nla.gov.au/apps/ilrs/?action=IlrsSearch&term=NLA
    - identifier_scheme: ISIL
      identifier_value: AU-NLA
  homepage: https://www.nla.gov.au
  digital_platforms:
    - platform_name: Institutional Catalogue
      platform_url: https://catalogue.nla.gov.au
      platform_type: CATALOGUE
  locations:
    - city: Canberra
      region: ACT
      country: AU
  provenance:
    data_source: TROVE_API
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: "2025-11-18T14:30:00Z"
    extraction_method: "Trove API v3 /contributor endpoint with reclevel=full"
    confidence_score: 0.95
    source_url: https://api.trove.nla.gov.au/v3/contributor/NLA
```

## Institution Type Classification

The script automatically classifies institutions using the GLAMORCUBESFIXPHDNT taxonomy:

| Type | Code | Detection Keywords |
|------|------|-------------------|
| **Library** | L | library, bibliothek, biblioteca, bibliotheque |
| **Archive** | A | archive, archiv, archivo, records |
| **Museum** | M | museum, museo, musee (with 'museum' keyword) |
| **Gallery** | G | gallery (without 'museum') |
| **Education Provider** | E | university, college, school, institut |
| **Official Institution** | O | national, state, government, department, ministry |
| **Research Center** | R | research, institute, center, centre |
| **Society** | S | society, association, club, historical |
| **Unknown** | U | Default when no keywords match |

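A keyword classifier following the table might look like the sketch below. The keywords come from the table; everything else is assumed for illustration: the `RULES` structure, case-insensitive substring matching, and first-match-wins precedence in table order. The script's real `classify_institution_type()` may differ.

```python
# (code, keywords) pairs in table order; the first matching rule wins (assumed).
RULES = [
    ("L", ("library", "bibliothek", "biblioteca", "bibliotheque")),
    ("A", ("archive", "archiv", "archivo", "records")),
    ("M", ("museum", "museo", "musee")),
    ("G", ("gallery",)),  # reached only when no museum keyword matched
    ("E", ("university", "college", "school", "institut")),
    ("O", ("national", "state", "government", "department", "ministry")),
    ("R", ("research", "institute", "center", "centre")),
    ("S", ("society", "association", "club", "historical")),
]


def classify_institution_type(name: str) -> str:
    """Return the single-letter type code for an institution name."""
    lowered = name.lower()
    for code, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return code
    return "U"  # Unknown: no keyword matched


for example in ("National Library of Australia",
                "National Gallery of Australia",
                "Australian War Memorial"):
    print(example, "->", classify_institution_type(example))
```

Under this assumed ordering, a name containing "institute" is classified as E, because the `institut` keyword is checked before the Research Center rule. Substring matching is also deliberately naive, so "state" would match inside "estate".
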
## Data Quality

### TIER_1_AUTHORITATIVE Classification

Trove API data is classified as **TIER_1_AUTHORITATIVE** because:

- ✅ **Official source**: National Library of Australia (government agency)
- ✅ **Maintained registry**: Actively curated by NLA staff
- ✅ **Quality controlled**: Contributing organizations verified before inclusion
- ✅ **Standards compliant**: NUC codes map to ISIL standard (ISO 15511)
- ✅ **Current data**: Updated regularly by contributing institutions

### Confidence Score: 0.95

Records receive a confidence score of **0.95** (very high confidence) because:

- Data comes directly from authoritative API
- No NLP extraction or inference required
- Minimal ambiguity in classification
- 5% margin accounts for potential classification edge cases

## Coverage and Limitations

### What the Trove API Includes

✅ **Organizations that contribute to Trove**:

- Libraries contributing bibliographic records
- Archives providing digitized collections
- Museums sharing collection metadata
- Galleries with digitized artworks
- Universities contributing research outputs
- Research institutions with digital repositories

### What the Trove API Does NOT Include

❌ **Organizations not contributing to Trove**:

- Heritage institutions without digital presence
- Private collections not shared with Trove
- Recently established institutions pending registration
- Organizations that declined Trove participation

### Full ISIL Registry Coverage

For **complete** Australian ISIL coverage, additional extraction is needed:

**ILRS Directory** (https://www.nla.gov.au/apps/ilrs/)

- Full registry of all Australian ISIL codes
- Includes non-contributing institutions
- Contains detailed ILL policies, service levels, charges
- Requires web scraping (no public API)

**Recommendation**:

1. ✅ **Start with Trove API** (authoritative, easy to extract)
2. ⏳ **Supplement with ILRS scraping** (comprehensive coverage)
3. 🔗 **Cross-link datasets** using NUC/ISIL codes

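The cross-linking step in the recommendation amounts to a join on the ISIL code. The sketch below assumes each dataset has already been reduced to dictionaries with an `isil` key. That key, the other field names, the `cross_link` function, and the placeholder records are all hypothetical and are not output of either source.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def cross_link(trove: Iterable[Record], ilrs: Iterable[Record]
               ) -> Tuple[Dict[str, Record], List[str], List[str]]:
    """Join two record sets on ISIL code.

    Returns (merged records keyed by ISIL, codes only in Trove, codes only in ILRS).
    """
    trove_by_isil = {record["isil"]: record for record in trove}
    ilrs_by_isil = {record["isil"]: record for record in ilrs}
    shared = trove_by_isil.keys() & ilrs_by_isil.keys()
    # On conflicting keys the Trove value wins (an arbitrary choice for this sketch).
    merged = {code: {**ilrs_by_isil[code], **trove_by_isil[code]} for code in shared}
    trove_only = sorted(trove_by_isil.keys() - shared)
    ilrs_only = sorted(ilrs_by_isil.keys() - shared)
    return merged, trove_only, ilrs_only


# Placeholder records for illustration only.
trove = [{"isil": "AU-NLA", "name": "National Library of Australia"}]
ilrs = [
    {"isil": "AU-NLA", "ill_policy": "placeholder"},
    {"isil": "AU-XXXX", "name": "Placeholder institution"},
]
merged, trove_only, ilrs_only = cross_link(trove, ilrs)
print(sorted(merged), trove_only, ilrs_only)
```

Codes that appear only on the ILRS side are the institutions the scraping step would add beyond Trove.
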
## API Rate Limits

**Trove API v3 Rate Limit**: 200 requests per minute

The script automatically respects this limit with:

- Default delay: 0.3 seconds between requests (≈200 req/min)
- Configurable delay via `--delay` parameter
- Progress logging every 50 records

**Estimated extraction time** (for 500 contributors):

- At 200 req/min: ~2.5 minutes
- At 120 req/min (safer): ~4.2 minutes

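The figures above follow from simple arithmetic on the delay, as the sketch below shows. It assumes one request per contributor and ignores the time each request itself takes, so the results are lower bounds on run time and upper bounds on request rate. The function names are illustrative.

```python
def requests_per_minute(delay_seconds: float) -> float:
    """Maximum request rate implied by a fixed delay between calls."""
    return 60.0 / delay_seconds


def estimated_minutes(n_requests: int, delay_seconds: float) -> float:
    """Total waiting time for n_requests, ignoring response latency."""
    return n_requests * delay_seconds / 60.0


for delay in (0.3, 0.5):
    rate = requests_per_minute(delay)
    minutes = estimated_minutes(500, delay)
    print(f"delay={delay}s: {rate:.0f} req/min, {minutes:.1f} min for 500 requests")
```
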
## Next Steps

### 1. Obtain Trove API Key

Register at: https://trove.nla.gov.au/about/create-something/using-api

### 2. Run Extraction

```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY
```

### 3. Validate Output

Check the generated files in `data/instances/`:

```bash
# Count records (the line count includes any CSV header row)
wc -l data/instances/trove_contributors_*.csv

# View YAML structure
head -n 50 data/instances/trove_contributors_*.yaml

# Check institution type distribution
grep "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c
```

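The same checks can be run in Python against the CSV export, which avoids counting a header row or multi-line fields as records. This sketch assumes the CSV has a header row with an `institution_type` column; that column name is inferred from the YAML schema and is not confirmed for the flattened CSV. The sample rows are illustrative.

```python
import csv
import io
from collections import Counter
from typing import TextIO, Tuple


def summarize_csv(handle: TextIO) -> Tuple[int, Counter]:
    """Return (record count, institution-type counts) for one exported CSV."""
    reader = csv.DictReader(handle)  # consumes the header row
    type_counts: Counter = Counter()
    total = 0
    for row in reader:
        total += 1
        # Missing or empty type values are counted as "U" (assumption).
        type_counts[row.get("institution_type") or "U"] += 1
    return total, type_counts


# In-memory sample; for a real file use open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "name,institution_type\n"
    "National Library of Australia,L\n"
    "Example Historical Society,S\n"
)
total, by_type = summarize_csv(sample)
print(total, dict(by_type))
```
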
### 4. Integrate with Existing Data

The extracted records can be:

- ✅ **Merged** with other Australian heritage datasets
- ✅ **Cross-referenced** with Wikidata (Q-numbers for GHCIDs)
- ✅ **Enriched** with geocoding (cities → lat/lon)
- ✅ **Exported** to RDF/Turtle for semantic web integration

### 5. Enrich with Full ISIL Registry

For comprehensive coverage, build an ILRS Directory scraper:

```bash
# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py
```

- **Target**: https://www.nla.gov.au/apps/ilrs/
- **Method**: Playwright/Selenium web scraping
- **Output**: Additional ISIL codes not in Trove

## References

- **Trove API Documentation**: https://trove.nla.gov.au/about/create-something/using-api
- **ILRS Directory**: https://www.nla.gov.au/apps/ilrs/
- **ISIL Standard (ISO 15511)**: https://www.iso.org/standard/77849.html
- **LinkML Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **GHCID Specification**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`

## Troubleshooting

### "API key required" Error

**Problem**: Missing or invalid Trove API key

**Solution**:

1. Register at https://trove.nla.gov.au/about/create-something/using-api
2. Check email for API key
3. Use key with `--api-key` parameter

### Rate Limit Errors (HTTP 429)

**Problem**: Exceeding 200 requests/minute

**Solution**: Increase delay between requests:

```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5
```

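Besides a larger fixed delay, a client can back off and retry when it receives HTTP 429. The sketch below shows that general technique and is not a description of what the extraction script does. The function name `fetch_with_backoff`, its parameters, and the `fetch` callable, which stands in for the real HTTP request, are all hypothetical.

```python
import time
from typing import Callable, List, Tuple


def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 4,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep
                       ) -> Tuple[int, str]:
    """Call fetch(); on HTTP 429, wait and retry with a doubling delay."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(delay)
            delay *= 2  # exponential backoff
    raise RuntimeError(f"still rate-limited after {max_retries} retries")


# Demo with a stand-in for the HTTP call: two 429 responses, then success.
responses = iter([(429, ""), (429, ""), (200, "ok")])
waits: List[float] = []  # records the delays instead of actually sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.
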
### No Contributors Found

**Problem**: API returns empty response

**Solution**:

1. Check API key validity
2. Verify internet connection
3. Check Trove API status: https://status.nla.gov.au

### Classification Issues

**Problem**: Institutions classified as "UNKNOWN" (U)

**Solution**:

- Review institution names in output
- Add keywords to `classify_institution_type()` function
- Update classification logic in script

## Support

For questions or issues:

- Project documentation: `/Users/kempersc/apps/glam/docs/`
- Agent instructions: `/Users/kempersc/apps/glam/AGENTS.md`
- Schema reference: `/Users/kempersc/apps/glam/schemas/`

---

- **Status**: ✅ Ready to use
- **Version**: 1.0.0
- **Last Updated**: 2025-11-18
- **Maintainer**: GLAM Data Extraction Project