Australian Heritage Institution Extraction - Trove API
Overview
This document describes the extraction of Australian heritage custodian organizations from the Trove API (National Library of Australia).
Data Source
Authority: National Library of Australia (NLA)
System: Trove API v3
ISIL Registry: Australian Interlibrary Resource Sharing (ILRS) Directory
Coverage: Organizations that contribute to Trove and the Australian National Bibliographic Database (ANBD)
What is Trove?
Trove is Australia's national discovery service, aggregating collections from libraries, archives, museums, galleries, and other heritage institutions across Australia.
What is NUC?
NUC (National Union Catalogue) symbols are unique identifiers for Australian heritage institutions. They function as Australia's ISIL equivalent:
- Format: AU-{NUC} (e.g., AU-NLA for National Library of Australia)
- Scope: Identifies contributing organizations in the Australian National Bibliographic Database
- Standards: Complies with ISO 15511 (ISIL) standard
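The AU-{NUC} mapping described above can be sketched in a few lines. The function name nuc_to_isil is hypothetical and is not taken from the extraction script; real NUC symbols may need extra normalization that this sketch does not attempt.

```python
def nuc_to_isil(nuc_symbol: str) -> str:
    """Return the ISIL form (AU-{NUC}) of an Australian NUC symbol."""
    symbol = nuc_symbol.strip()
    if not symbol:
        raise ValueError("NUC symbol must not be empty")
    return f"AU-{symbol}"


print(nuc_to_isil("NLA"))  # AU-NLA
```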
Extraction Script
Location
scripts/extract_trove_contributors.py
Features
- ✅ Extracts all Trove contributors via API v3
- ✅ Retrieves full metadata (name, NUC code, URLs, access policies)
- ✅ Maps to LinkML HeritageCustodian schema v0.2.1
- ✅ Generates GHCID persistent identifiers (UUID v5, numeric, base string)
- ✅ Classifies institutions using GLAMORCUBESFIXPHDNT taxonomy
- ✅ Exports to YAML, JSON, and CSV formats
- ✅ Tracks provenance metadata (TIER_1_AUTHORITATIVE)
- ✅ Respects API rate limits (200 requests/minute)
Requirements
- Trove API Key (free registration required)
  - Register at: https://trove.nla.gov.au/about/create-something/using-api
  - Follow "Sign up for an API key" instructions
  - Key is emailed immediately after registration
- Python Dependencies (already installed in this project)
  - requests: HTTP client for API calls
  - pyyaml: YAML parsing and generation
  - pydantic: Data validation (LinkML schema)
Usage
Basic Usage
python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY
Advanced Options
# Specify output directory
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --output-dir data/instances/australia

# Adjust rate limiting (default: 0.3s delay = 200 req/min)
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --delay 0.5

# Export specific formats only
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --formats yaml json
Command-Line Arguments
| Argument | Required | Default | Description |
|---|---|---|---|
| --api-key | ✅ Yes | - | Trove API key (from NLA registration) |
| --output-dir | No | data/instances | Output directory for exported files |
| --delay | No | 0.3 | Delay between API calls (seconds) |
| --formats | No | yaml json csv | Export formats (choose from: yaml, json, csv) |
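For readers who want to see how arguments like these are typically declared, here is a minimal argparse sketch that mirrors the table. It is an illustration only: the function name build_parser and the help strings are assumptions, and the actual script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract Trove contributors (illustrative sketch)"
    )
    parser.add_argument("--api-key", required=True, help="Trove API key")
    parser.add_argument("--output-dir", default="data/instances",
                        help="Output directory for exported files")
    parser.add_argument("--delay", type=float, default=0.3,
                        help="Delay between API calls (seconds)")
    parser.add_argument("--formats", nargs="+", choices=["yaml", "json", "csv"],
                        default=["yaml", "json", "csv"], help="Export formats")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--api-key", "YOUR_KEY", "--formats", "yaml", "json"])
print(args.delay, args.formats)  # 0.3 ['yaml', 'json']
```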
Output Files
The script generates timestamped files in the output directory:
data/instances/
├── trove_contributors_20251118_143000.yaml # LinkML-compliant YAML
├── trove_contributors_20251118_143000.json # JSON format
└── trove_contributors_20251118_143000.csv # Flattened CSV
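The naming pattern shown above (trove_contributors_YYYYMMDD_HHMMSS.ext) can be reproduced as follows. This is a sketch under assumptions: the helper name output_paths is hypothetical, and whether the script stamps files in local time or UTC is not stated here.

```python
from datetime import datetime
from pathlib import Path


def output_paths(output_dir: str, formats: list, now: datetime) -> list:
    """Build one timestamped output path per requested export format."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return [Path(output_dir) / f"trove_contributors_{stamp}.{fmt}" for fmt in formats]


paths = output_paths("data/instances", ["yaml", "json", "csv"],
                     datetime(2025, 11, 18, 14, 30, 0))
for path in paths:
    print(path.name)
```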
Output Schema
Records conform to LinkML HeritageCustodian schema (v0.2.1):
- id: https://w3id.org/heritage/custodian/au/nla
  record_id: "uuid-v4-database-id"
  ghcid_uuid: "uuid-v5-persistent-id"
  ghcid_numeric: 213324328442227739
  ghcid_current: AU-NSW-CAN-L-NLA
  name: National Library of Australia
  official_name: National Library of Australia
  institution_type: L  # Library
  identifiers:
    - identifier_scheme: NUC
      identifier_value: NLA
      identifier_url: https://www.nla.gov.au/apps/ilrs/?action=IlrsSearch&term=NLA
    - identifier_scheme: ISIL
      identifier_value: AU-NLA
  homepage: https://www.nla.gov.au
  digital_platforms:
    - platform_name: Institutional Catalogue
      platform_url: https://catalogue.nla.gov.au
      platform_type: CATALOGUE
  locations:
    - city: Canberra
      region: ACT
      country: AU
  provenance:
    data_source: TROVE_API
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: "2025-11-18T14:30:00Z"
    extraction_method: "Trove API v3 /contributor endpoint with reclevel=full"
    confidence_score: 0.95
    source_url: https://api.trove.nla.gov.au/v3/contributor/NLA
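The source_url and extraction_method fields above suggest what a single contributor request looks like. The sketch below builds such a request with the standard library rather than the requests package the script depends on. The helper name build_contributor_request is hypothetical, and passing the key in an X-API-KEY header is an assumption that should be checked against the Trove API documentation.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

TROVE_API_BASE = "https://api.trove.nla.gov.au/v3"


def build_contributor_request(nuc: str, api_key: str) -> Request:
    """Build (but do not send) a request for one contributor record."""
    query = urlencode({"reclevel": "full"})
    url = f"{TROVE_API_BASE}/contributor/{quote(nuc, safe='')}?{query}"
    # Assumption: the API key travels in an X-API-KEY request header.
    return Request(url, headers={"X-API-KEY": api_key})


req = build_contributor_request("NLA", "YOUR_TROVE_API_KEY")
print(req.full_url)
# To send it: urllib.request.urlopen(req), which needs network access and a valid key.
```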
Institution Type Classification
The script automatically classifies institutions using the GLAMORCUBESFIXPHDNT taxonomy:
| Type | Code | Detection Keywords |
|---|---|---|
| Library | L | library, bibliothek, biblioteca, bibliotheque |
| Archive | A | archive, archiv, archivo, records |
| Museum | M | museum, museo, musee (with 'museum' keyword) |
| Gallery | G | gallery (without 'museum') |
| Education Provider | E | university, college, school, institut |
| Official Institution | O | national, state, government, department, ministry |
| Research Center | R | research, institute, center, centre |
| Society | S | society, association, club, historical |
| Unknown | U | Default when no keywords match |
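A keyword table like this is usually applied as an ordered, first-match lookup. The sketch below is one possible reading of the table and is not the script's classify_institution_type() implementation: the name classify_by_keywords, the substring matching, and the top-to-bottom precedence are all assumptions.

```python
# Ordered (code, keywords) rules; the first rule with a matching keyword wins.
RULES = [
    ("L", ("library", "bibliothek", "biblioteca", "bibliotheque")),
    ("A", ("archive", "archiv", "archivo", "records")),
    ("M", ("museum", "museo", "musee")),
    ("G", ("gallery",)),
    ("E", ("university", "college", "school", "institut")),
    ("O", ("national", "state", "government", "department", "ministry")),
    ("R", ("research", "institute", "center", "centre")),
    ("S", ("society", "association", "club", "historical")),
]


def classify_by_keywords(name: str) -> str:
    """Return the first type code whose keyword occurs in the lowercased name."""
    lowered = name.lower()
    for code, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return code
    return "U"  # Unknown: no keyword matched


print(classify_by_keywords("National Library of Australia"))  # L
print(classify_by_keywords("Museum and Art Gallery of the Northern Territory"))  # M
```

Under this ordering, placing Museum before Gallery yields the "gallery without museum" behaviour, and "library" takes precedence over "national". Note that "institut" under Education Provider also matches "institute", so the order of rules decides which type such names receive.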
Data Quality
TIER_1_AUTHORITATIVE Classification
Trove API data is classified as TIER_1_AUTHORITATIVE because:
- ✅ Official source: National Library of Australia (government agency)
- ✅ Maintained registry: Actively curated by NLA staff
- ✅ Quality controlled: Contributing organizations verified before inclusion
- ✅ Standards compliant: NUC codes map to ISIL standard (ISO 15511)
- ✅ Current data: Updated regularly by contributing institutions
Confidence Score: 0.95
Records receive a confidence score of 0.95 (very high confidence) because:
- Data comes directly from authoritative API
- No NLP extraction or inference required
- Minimal ambiguity in classification
- 5% margin accounts for potential classification edge cases
Coverage and Limitations
What the Trove API Includes
✅ Organizations that contribute to Trove:
- Libraries contributing bibliographic records
- Archives providing digitized collections
- Museums sharing collection metadata
- Galleries with digitized artworks
- Universities contributing research outputs
- Research institutions with digital repositories
What the Trove API Does NOT Include
❌ Organizations not contributing to Trove:
- Heritage institutions without digital presence
- Private collections not shared with Trove
- Recently established institutions pending registration
- Organizations that declined Trove participation
Full ISIL Registry Coverage
For complete Australian ISIL coverage, additional extraction is needed:
ILRS Directory (https://www.nla.gov.au/apps/ilrs/)
- Full registry of all Australian ISIL codes
- Includes non-contributing institutions
- Contains detailed ILL policies, service levels, charges
- Requires web scraping (no public API)
Recommendation:
- ✅ Start with Trove API (authoritative, easy to extract)
- ⏳ Supplement with ILRS scraping (comprehensive coverage)
- 🔗 Cross-link datasets using NUC/ISIL codes
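The cross-linking step could look roughly like the sketch below, which merges two record lists on their ISIL code. Everything here is illustrative: the flat "isil" key, the "ILRS" source label, the function name merge_by_isil, and the second sample record are hypothetical, and the ILRS scraper itself does not exist yet.

```python
def merge_by_isil(trove_records: list, ilrs_records: list) -> dict:
    """Index records by ISIL code and note which source(s) supplied each one."""
    merged = {rec["isil"]: dict(rec, sources=["TROVE_API"]) for rec in trove_records}
    for rec in ilrs_records:
        if rec["isil"] in merged:
            merged[rec["isil"]]["sources"].append("ILRS")
        else:
            merged[rec["isil"]] = dict(rec, sources=["ILRS"])
    return merged


trove = [{"isil": "AU-NLA", "name": "National Library of Australia"}]
ilrs = [
    {"isil": "AU-NLA", "name": "National Library of Australia"},
    {"isil": "AU-XEXAMPLE", "name": "Hypothetical non-contributing library"},
]
merged = merge_by_isil(trove, ilrs)
for isil, record in merged.items():
    print(isil, record["sources"])
```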
API Rate Limits
Trove API v3 Rate Limit: 200 requests per minute
The script automatically respects this limit with:
- Default delay: 0.3 seconds between requests (≈200 req/min)
- Configurable delay via the --delay parameter
- Progress logging every 50 records
Estimated extraction time (for 500 contributors):
- At 200 req/min: ~2.5 minutes
- At 120 req/min (safer): ~4.2 minutes
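The figures above follow directly from the delay value. A small calculation, assuming one request per contributor and ignoring network latency (both simplifications), reproduces them; the function names are illustrative.

```python
def requests_per_minute(delay_seconds: float) -> float:
    """Request rate implied by a fixed delay between calls."""
    return 60.0 / delay_seconds


def estimated_minutes(record_count: int, delay_seconds: float) -> float:
    """Total waiting time in minutes for record_count sequential requests."""
    return record_count * delay_seconds / 60.0


print(round(requests_per_minute(0.3)))        # 200
print(round(estimated_minutes(500, 0.3), 1))  # 2.5
print(round(estimated_minutes(500, 0.5), 1))  # 4.2
```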
Next Steps
1. Obtain Trove API Key
Register at: https://trove.nla.gov.au/about/create-something/using-api
2. Run Extraction
python scripts/extract_trove_contributors.py --api-key YOUR_KEY
3. Validate Output
Check the generated files in data/instances/:
# Count records (each file's line count includes the CSV header row)
wc -l data/instances/trove_contributors_*.csv
# View YAML structure
head -n 50 data/instances/trove_contributors_*.yaml
# Check institution type distribution (-h drops filename prefixes when several files match)
grep -h "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c
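The same distribution check can be done in Python against the JSON export. This sketch assumes the JSON file is a top-level list of records that each carry an institution_type field. That layout is inferred from the YAML example and is not confirmed, and the inline sample stands in for a real export file.

```python
import json
from collections import Counter


def type_distribution(records: list) -> Counter:
    """Count records per institution type code, treating missing values as 'U'."""
    return Counter(rec.get("institution_type", "U") for rec in records)


# Inline sample standing in for json.load(open(path_to_export)).
sample = json.loads(
    '[{"name": "A", "institution_type": "L"},'
    ' {"name": "B", "institution_type": "L"},'
    ' {"name": "C", "institution_type": "M"},'
    ' {"name": "D"}]'
)
distribution = type_distribution(sample)
for code, count in distribution.most_common():
    print(code, count)
```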
4. Integrate with Existing Data
The extracted records can be:
- ✅ Merged with other Australian heritage datasets
- ✅ Cross-referenced with Wikidata (Q-numbers for GHCIDs)
- ✅ Enriched with geocoding (cities → lat/lon)
- ✅ Exported to RDF/Turtle for semantic web integration
5. Enrich with Full ISIL Registry
For comprehensive coverage, build an ILRS Directory scraper:
# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py
Target: https://www.nla.gov.au/apps/ilrs/
Method: Playwright/Selenium web scraping
Output: Additional ISIL codes not in Trove
References
- Trove API Documentation: https://trove.nla.gov.au/about/create-something/using-api
- ILRS Directory: https://www.nla.gov.au/apps/ilrs/
- ISIL Standard (ISO 15511): https://www.iso.org/standard/77849.html
- LinkML Schema: /Users/kempersc/apps/glam/schemas/heritage_custodian.yaml
- GHCID Specification: /Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md
Troubleshooting
"API key required" Error
Problem: Missing or invalid Trove API key
Solution:
- Register at https://trove.nla.gov.au/about/create-something/using-api
- Check email for API key
- Pass the key with the --api-key parameter
Rate Limit Errors (HTTP 429)
Problem: Exceeding 200 requests/minute
Solution: Increase delay between requests:
python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5
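Beyond a larger fixed delay, a client can also back off and retry when it sees HTTP 429. The sketch below shows that pattern in isolation. It is not described as part of the extraction script, and the names fetch_with_backoff and fake_fetch, as well as the doubling schedule, are assumptions chosen for illustration.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it stops returning 429, doubling the wait each time.

    fetch must return a (status_code, body) tuple.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2
    return status, body  # still rate limited after the final attempt


# Demo with a fake fetch that is rate limited twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)  # (200, 'ok') [0.5, 1.0]
```

Injecting the sleep function keeps the demo instant; in real use the default time.sleep applies.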
No Contributors Found
Problem: API returns empty response
Solution:
- Check API key validity
- Verify internet connection
- Check Trove API status: https://status.nla.gov.au
Classification Issues
Problem: Institutions classified as "UNKNOWN" (U)
Solution:
- Review institution names in output
- Add keywords to the classify_institution_type() function
- Update classification logic in script
Support
For questions or issues:
- Project documentation: /Users/kempersc/apps/glam/docs/
- Agent instructions: /Users/kempersc/apps/glam/AGENTS.md
- Schema reference: /Users/kempersc/apps/glam/schemas/
Status: ✅ Ready to use
Version: 1.0.0
Last Updated: 2025-11-18
Maintainer: GLAM Data Extraction Project