# Australian Heritage Institution Extraction - Trove API
## Overview
This document describes the extraction of Australian heritage custodian organizations from the **Trove API** (National Library of Australia).
## Data Source
**Authority**: National Library of Australia (NLA)
**System**: Trove API v3
**ISIL Registry**: Australian Interlibrary Resource Sharing (ILRS) Directory
**Coverage**: Organizations that contribute to Trove and the Australian National Bibliographic Database (ANBD)
### What is Trove?
Trove is Australia's national discovery service, aggregating collections from libraries, archives, museums, galleries, and other heritage institutions across Australia.
### What is NUC?
**NUC (National Union Catalogue)** symbols are unique identifiers for Australian heritage institutions. They function as Australia's ISIL equivalent:
- **Format**: `AU-{NUC}` (e.g., `AU-NLA` for National Library of Australia)
- **Scope**: Identifies contributing organizations in the Australian National Bibliographic Database
- **Standards**: Complies with ISO 15511 (ISIL) standard
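The NUC-to-ISIL mapping above is purely mechanical. A minimal sketch (the helper name `nuc_to_isil` is illustrative, not part of the extraction script):

```python
def nuc_to_isil(nuc: str) -> str:
    """Prefix an Australian NUC symbol with the ISO 15511 country code."""
    return f"AU-{nuc.strip().upper()}"


# e.g. the National Library of Australia's NUC symbol "NLA"
print(nuc_to_isil("NLA"))  # AU-NLA
```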
## Extraction Script
### Location
`scripts/extract_trove_contributors.py`
### Features
- ✅ Extracts all Trove contributors via API v3
- ✅ Retrieves full metadata (name, NUC code, URLs, access policies)
- ✅ Maps to LinkML HeritageCustodian schema v0.2.1
- ✅ Generates GHCID persistent identifiers (UUID v5, numeric, base string)
- ✅ Classifies institutions using GLAMORCUBESFIXPHDNT taxonomy
- ✅ Exports to YAML, JSON, and CSV formats
- ✅ Tracks provenance metadata (TIER_1_AUTHORITATIVE)
- ✅ Respects API rate limits (200 requests/minute)
### Requirements
1. **Trove API Key** (free registration required)
- Register at: https://trove.nla.gov.au/about/create-something/using-api
- Follow "Sign up for an API key" instructions
- Key is emailed immediately after registration
2. **Python Dependencies** (already installed in this project)
- `requests` - HTTP client for API calls
- `pyyaml` - YAML parsing and generation
- `pydantic` - Data validation (LinkML schema)
### Usage
#### Basic Usage
```bash
python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY
```
#### Advanced Options
```bash
# Specify output directory
python scripts/extract_trove_contributors.py \
--api-key YOUR_KEY \
--output-dir data/instances/australia
# Adjust rate limiting (default: 0.3s delay = 200 req/min)
python scripts/extract_trove_contributors.py \
--api-key YOUR_KEY \
--delay 0.5
# Export specific formats only
python scripts/extract_trove_contributors.py \
--api-key YOUR_KEY \
--formats yaml json
```
#### Command-Line Arguments
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--api-key` | ✅ Yes | - | Trove API key (from NLA registration) |
| `--output-dir` | No | `data/instances` | Output directory for exported files |
| `--delay` | No | `0.3` | Delay between API calls (seconds) |
| `--formats` | No | `yaml json csv` | Export formats (choose from: yaml, json, csv) |
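The argument table above translates into a small `argparse` setup. This is a sketch of how the script's CLI might be wired up; the actual parser in `extract_trove_contributors.py` may differ in detail:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI matching the argument table: --api-key, --output-dir, --delay, --formats."""
    parser = argparse.ArgumentParser(description="Extract Trove contributors")
    parser.add_argument("--api-key", required=True,
                        help="Trove API key (from NLA registration)")
    parser.add_argument("--output-dir", default="data/instances",
                        help="Output directory for exported files")
    parser.add_argument("--delay", type=float, default=0.3,
                        help="Delay between API calls (seconds)")
    parser.add_argument("--formats", nargs="+", default=["yaml", "json", "csv"],
                        choices=["yaml", "json", "csv"],
                        help="Export formats")
    return parser
```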
### Output Files
The script generates timestamped files in the output directory:
```
data/instances/
├── trove_contributors_20251118_143000.yaml # LinkML-compliant YAML
├── trove_contributors_20251118_143000.json # JSON format
└── trove_contributors_20251118_143000.csv # Flattened CSV
```
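The timestamp embedded in each filename follows the `YYYYMMDD_HHMMSS` pattern shown above. A sketch of how such paths could be built (the helper `output_path` is illustrative):

```python
from datetime import datetime
from pathlib import Path


def output_path(output_dir, fmt, now=None):
    """Build a timestamped export path like trove_contributors_20251118_143000.yaml."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"trove_contributors_{stamp}.{fmt}"
```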
### Output Schema
Records conform to LinkML `HeritageCustodian` schema (v0.2.1):
```yaml
- id: https://w3id.org/heritage/custodian/au/nla
record_id: "uuid-v4-database-id"
ghcid_uuid: "uuid-v5-persistent-id"
ghcid_numeric: 213324328442227739
ghcid_current: AU-NSW-CAN-L-NLA
name: National Library of Australia
official_name: National Library of Australia
institution_type: L # Library
identifiers:
- identifier_scheme: NUC
identifier_value: NLA
identifier_url: https://www.nla.gov.au/apps/ilrs/?action=IlrsSearch&term=NLA
- identifier_scheme: ISIL
identifier_value: AU-NLA
homepage: https://www.nla.gov.au
digital_platforms:
- platform_name: Institutional Catalogue
platform_url: https://catalogue.nla.gov.au
platform_type: CATALOGUE
locations:
- city: Canberra
region: ACT
country: AU
provenance:
data_source: TROVE_API
data_tier: TIER_1_AUTHORITATIVE
extraction_date: "2025-11-18T14:30:00Z"
extraction_method: "Trove API v3 /contributor endpoint with reclevel=full"
confidence_score: 0.95
source_url: https://api.trove.nla.gov.au/v3/contributor/NLA
```
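Exported records can be sanity-checked against the field names shown in the sample above. A minimal validator sketch (field names taken from the example record; the `validate_record` helper itself is not part of the project):

```python
REQUIRED_FIELDS = ("id", "name", "institution_type", "identifiers", "provenance")


def validate_record(rec):
    """Return a list of problems in one exported record (empty list = looks OK)."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS
                if field not in rec]
    # Every record should carry at least a NUC identifier
    schemes = {ident.get("identifier_scheme") for ident in rec.get("identifiers", [])}
    if "NUC" not in schemes:
        problems.append("no NUC identifier")
    return problems
```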
## Institution Type Classification
The script automatically classifies institutions using the GLAMORCUBESFIXPHDNT taxonomy:
| Type | Code | Detection Keywords |
|------|------|-------------------|
| **Library** | L | library, bibliothek, biblioteca, bibliotheque |
| **Archive** | A | archive, archiv, archivo, records |
| **Museum** | M | museum, museo, musee |
| **Gallery** | G | gallery (only when 'museum' is absent) |
| **Education Provider** | E | university, college, school, institut |
| **Official Institution** | O | national, state, government, department, ministry |
| **Research Center** | R | research, institute, center, centre |
| **Society** | S | society, association, club, historical |
| **Unknown** | U | Default when no keywords match |
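The table above amounts to case-insensitive keyword matching over the institution name. A sketch of the logic (keyword lists are taken from the table; the priority order and exact implementation in the real `classify_institution_type()` may differ):

```python
# Keyword lists from the classification table; first match wins.
TYPE_KEYWORDS = [
    ("L", ["library", "bibliothek", "biblioteca", "bibliotheque"]),
    ("A", ["archive", "archiv", "archivo", "records"]),
    ("M", ["museum", "museo", "musee"]),
    ("E", ["university", "college", "school", "institut"]),
    ("O", ["national", "state", "government", "department", "ministry"]),
    ("R", ["research", "institute", "center", "centre"]),
    ("S", ["society", "association", "club", "historical"]),
]


def classify_institution_type(name: str) -> str:
    """Map an institution name to a GLAMORCUBESFIXPHDNT type code."""
    lowered = name.lower()
    # Gallery rule from the table: 'gallery' counts only without 'museum'
    if "gallery" in lowered and "museum" not in lowered:
        return "G"
    for code, keywords in TYPE_KEYWORDS:
        if any(kw in lowered for kw in keywords):
            return code
    return "U"  # default when no keywords match
```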
## Data Quality
### TIER_1_AUTHORITATIVE Classification
Trove API data is classified as **TIER_1_AUTHORITATIVE** because:
- **Official source**: National Library of Australia (government agency)
- **Maintained registry**: Actively curated by NLA staff
- **Quality controlled**: Contributing organizations verified before inclusion
- **Standards compliant**: NUC codes map to the ISIL standard (ISO 15511)
- **Current data**: Updated regularly by contributing institutions
### Confidence Score: 0.95
Records receive a confidence score of **0.95** (very high confidence) because:
- Data comes directly from authoritative API
- No NLP extraction or inference required
- Minimal ambiguity in classification
- 5% margin accounts for potential classification edge cases
## Coverage and Limitations
### What the Trove API Includes
**Organizations that contribute to Trove**:
- Libraries contributing bibliographic records
- Archives providing digitized collections
- Museums sharing collection metadata
- Galleries with digitized artworks
- Universities contributing research outputs
- Research institutions with digital repositories
### What the Trove API Does NOT Include
**Organizations not contributing to Trove**:
- Heritage institutions without digital presence
- Private collections not shared with Trove
- Recently established institutions pending registration
- Organizations that declined Trove participation
### Full ISIL Registry Coverage
For **complete** Australian ISIL coverage, additional extraction is needed:
**ILRS Directory** (https://www.nla.gov.au/apps/ilrs/)
- Full registry of all Australian ISIL codes
- Includes non-contributing institutions
- Contains detailed ILL policies, service levels, charges
- Requires web scraping (no public API)
**Recommendation**:
1. **Start with Trove API** (authoritative, easy to extract)
2. **Supplement with ILRS scraping** (comprehensive coverage)
3. **Cross-link datasets** using NUC/ISIL codes
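Step 3 reduces to indexing both datasets by NUC symbol and merging. A sketch, assuming both datasets use the `identifiers` structure from the output schema above (the `cross_link` helper is illustrative, not an existing script):

```python
def _nuc_of(rec):
    """Extract the NUC symbol from a record's identifier list, if any."""
    for ident in rec.get("identifiers", []):
        if ident.get("identifier_scheme") == "NUC":
            return ident.get("identifier_value")
    return None


def cross_link(trove_records, ilrs_records):
    """Merge ILRS detail into Trove records by NUC symbol.

    Returns (merged, unmatched): Trove records enriched with ILRS fields,
    plus any Trove records with no ILRS counterpart.
    """
    ilrs_by_nuc = {_nuc_of(r): r for r in ilrs_records if _nuc_of(r)}
    merged, unmatched = [], []
    for rec in trove_records:
        extra = ilrs_by_nuc.get(_nuc_of(rec))
        if extra:
            # Trove fields (TIER_1_AUTHORITATIVE) win on conflicts
            merged.append({**extra, **rec})
        else:
            unmatched.append(rec)
    return merged, unmatched
```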
## API Rate Limits
**Trove API v3 Rate Limit**: 200 requests per minute
The script automatically respects this limit with:
- Default delay: 0.3 seconds between requests (≈200 req/min)
- Configurable delay via `--delay` parameter
- Progress logging every 50 records
**Estimated extraction time** (for 500 contributors):
- At 200 req/min: ~2.5 minutes
- At 120 req/min (safer): ~4.2 minutes
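The throttling and the runtime estimates above follow directly from the per-request delay. A sketch (function names are illustrative; the real script may throttle differently):

```python
import time


def throttled(items, delay=0.3):
    """Yield items with a pause between them to stay under the API rate limit."""
    for i, item in enumerate(items):
        if i:  # no sleep before the first request
            time.sleep(delay)
        yield item


def estimated_minutes(n_requests, delay=0.3):
    """Rough wall-clock estimate for an extraction run (ignores response time)."""
    return n_requests * delay / 60


# 500 contributors at the default 0.3 s delay (~200 req/min): 2.5 minutes
print(estimated_minutes(500, 0.3))
# 500 contributors at 0.5 s delay (~120 req/min): ~4.2 minutes
print(round(estimated_minutes(500, 0.5), 1))
```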
## Next Steps
### 1. Obtain Trove API Key
Register at: https://trove.nla.gov.au/about/create-something/using-api
### 2. Run Extraction
```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY
```
### 3. Validate Output
Check the generated files in `data/instances/`:
```bash
# Count records
wc -l data/instances/trove_contributors_*.csv
# View YAML structure
head -n 50 data/instances/trove_contributors_*.yaml
# Check institution type distribution
grep "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c
```
### 4. Integrate with Existing Data
The extracted records can be:
- **Merged** with other Australian heritage datasets
- **Cross-referenced** with Wikidata (Q-numbers for GHCIDs)
- **Enriched** with geocoding (cities → lat/lon)
- **Exported** to RDF/Turtle for semantic web integration
### 5. Enrich with Full ISIL Registry
For comprehensive coverage, build ILRS Directory scraper:
```bash
# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py
```
**Target**: https://www.nla.gov.au/apps/ilrs/
**Method**: Playwright/Selenium web scraping
**Output**: Additional ISIL codes not in Trove
## References
- **Trove API Documentation**: https://trove.nla.gov.au/about/create-something/using-api
- **ILRS Directory**: https://www.nla.gov.au/apps/ilrs/
- **ISIL Standard (ISO 15511)**: https://www.iso.org/standard/77849.html
- **LinkML Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **GHCID Specification**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`
## Troubleshooting
### "API key required" Error
**Problem**: Missing or invalid Trove API key
**Solution**:
1. Register at https://trove.nla.gov.au/about/create-something/using-api
2. Check email for API key
3. Use key with `--api-key` parameter
### Rate Limit Errors (HTTP 429)
**Problem**: Exceeding 200 requests/minute
**Solution**: Increase delay between requests:
```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5
```
### No Contributors Found
**Problem**: API returns empty response
**Solution**:
1. Check API key validity
2. Verify internet connection
3. Check Trove API status: https://status.nla.gov.au
### Classification Issues
**Problem**: Institutions classified as "UNKNOWN" (U)
**Solution**:
- Review institution names in output
- Add keywords to `classify_institution_type()` function
- Update classification logic in script
## Support
For questions or issues:
- Project documentation: `/Users/kempersc/apps/glam/docs/`
- Agent instructions: `/Users/kempersc/apps/glam/AGENTS.md`
- Schema reference: `/Users/kempersc/apps/glam/schemas/`
---
**Status**: ✅ Ready to use
**Version**: 1.0.0
**Last Updated**: 2025-11-18
**Maintainer**: GLAM Data Extraction Project