# Australian Heritage Institution Extraction - Trove API

## Overview

This document describes the extraction of Australian heritage custodian organizations from the **Trove API** (National Library of Australia).

## Data Source

**Authority**: National Library of Australia (NLA)
**System**: Trove API v3
**ISIL Registry**: Australian Interlibrary Resource Sharing (ILRS) Directory
**Coverage**: Organizations that contribute to Trove and the Australian National Bibliographic Database (ANBD)

### What is Trove?

Trove is Australia's national discovery service, aggregating collections from libraries, archives, museums, galleries, and other heritage institutions across Australia.

### What is NUC?

**NUC (National Union Catalogue)** symbols are unique identifiers for Australian heritage institutions. They function as Australia's ISIL equivalent:

- **Format**: `AU-{NUC}` (e.g., `AU-NLA` for National Library of Australia)
- **Scope**: Identifies contributing organizations in the Australian National Bibliographic Database
- **Standards**: Complies with the ISO 15511 (ISIL) standard

## Extraction Script

### Location

`scripts/extract_trove_contributors.py`

### Features

- ✅ Extracts all Trove contributors via API v3
- ✅ Retrieves full metadata (name, NUC code, URLs, access policies)
- ✅ Maps to LinkML HeritageCustodian schema v0.2.1
- ✅ Generates GHCID persistent identifiers (UUID v5, numeric, base string)
- ✅ Classifies institutions using the GLAMORCUBESFIXPHDNT taxonomy
- ✅ Exports to YAML, JSON, and CSV formats
- ✅ Tracks provenance metadata (TIER_1_AUTHORITATIVE)
- ✅ Respects API rate limits (200 requests/minute)

### Requirements

1. **Trove API Key** (free registration required)
   - Register at: https://trove.nla.gov.au/about/create-something/using-api
   - Follow the "Sign up for an API key" instructions
   - The key is emailed immediately after registration
2. **Python Dependencies** (already installed in this project)
   - `requests` - HTTP client for API calls
   - `pyyaml` - YAML parsing and generation
   - `pydantic` - Data validation (LinkML schema)

### Usage

#### Basic Usage

```bash
python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY
```

#### Advanced Options

```bash
# Specify output directory
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --output-dir data/instances/australia

# Adjust rate limiting (default: 0.3s delay = 200 req/min)
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --delay 0.5

# Export specific formats only
python scripts/extract_trove_contributors.py \
    --api-key YOUR_KEY \
    --formats yaml json
```

#### Command-Line Arguments

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--api-key` | ✅ Yes | - | Trove API key (from NLA registration) |
| `--output-dir` | No | `data/instances` | Output directory for exported files |
| `--delay` | No | `0.3` | Delay between API calls (seconds) |
| `--formats` | No | `yaml json csv` | Export formats (choose from: yaml, json, csv) |

### Output Files

The script generates timestamped files in the output directory:

```
data/instances/
├── trove_contributors_20251118_143000.yaml  # LinkML-compliant YAML
├── trove_contributors_20251118_143000.json  # JSON format
└── trove_contributors_20251118_143000.csv   # Flattened CSV
```

### Output Schema

Records conform to the LinkML `HeritageCustodian` schema (v0.2.1):

```yaml
- id: https://w3id.org/heritage/custodian/au/nla
  record_id: "uuid-v4-database-id"
  ghcid_uuid: "uuid-v5-persistent-id"
  ghcid_numeric: 213324328442227739
  ghcid_current: AU-NSW-CAN-L-NLA
  name: National Library of Australia
  official_name: National Library of Australia
  institution_type: L  # Library
  identifiers:
    - identifier_scheme: NUC
      identifier_value: NLA
      identifier_url: https://www.nla.gov.au/apps/ilrs/?action=IlrsSearch&term=NLA
    - identifier_scheme: ISIL
      identifier_value: AU-NLA
  homepage: https://www.nla.gov.au
  digital_platforms:
    - platform_name: Institutional Catalogue
      platform_url: https://catalogue.nla.gov.au
      platform_type: CATALOGUE
  locations:
    - city: Canberra
      region: ACT
      country: AU
  provenance:
    data_source: TROVE_API
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: "2025-11-18T14:30:00Z"
    extraction_method: "Trove API v3 /contributor endpoint with reclevel=full"
    confidence_score: 0.95
    source_url: https://api.trove.nla.gov.au/v3/contributor/NLA
```

## Institution Type Classification

The script automatically classifies institutions using the GLAMORCUBESFIXPHDNT taxonomy:

| Type | Code | Detection Keywords |
|------|------|-------------------|
| **Library** | L | library, bibliothek, biblioteca, bibliotheque |
| **Archive** | A | archive, archiv, archivo, records |
| **Museum** | M | museum, museo, musee (with 'museum' keyword) |
| **Gallery** | G | gallery (without 'museum') |
| **Education Provider** | E | university, college, school, institut |
| **Official Institution** | O | national, state, government, department, ministry |
| **Research Center** | R | research, institute, center, centre |
| **Society** | S | society, association, club, historical |
| **Unknown** | U | Default when no keywords match |

## Data Quality

### TIER_1_AUTHORITATIVE Classification

Trove API data is classified as **TIER_1_AUTHORITATIVE** because:

- ✅ **Official source**: National Library of Australia (government agency)
- ✅ **Maintained registry**: Actively curated by NLA staff
- ✅ **Quality controlled**: Contributing organizations are verified before inclusion
- ✅ **Standards compliant**: NUC codes map to the ISIL standard (ISO 15511)
- ✅ **Current data**: Updated regularly by contributing institutions

### Confidence Score: 0.95

Records receive a confidence score of **0.95** (very high confidence) because:

- Data comes directly from the authoritative API
- No NLP extraction or inference is required
- Minimal ambiguity in classification
- 5% margin
accounts for potential classification edge cases

## Coverage and Limitations

### What the Trove API Includes

✅ **Organizations that contribute to Trove**:

- Libraries contributing bibliographic records
- Archives providing digitized collections
- Museums sharing collection metadata
- Galleries with digitized artworks
- Universities contributing research outputs
- Research institutions with digital repositories

### What the Trove API Does NOT Include

❌ **Organizations not contributing to Trove**:

- Heritage institutions without a digital presence
- Private collections not shared with Trove
- Recently established institutions pending registration
- Organizations that declined Trove participation

### Full ISIL Registry Coverage

For **complete** Australian ISIL coverage, additional extraction is needed from the **ILRS Directory** (https://www.nla.gov.au/apps/ilrs/):

- Full registry of all Australian ISIL codes
- Includes non-contributing institutions
- Contains detailed ILL policies, service levels, and charges
- Requires web scraping (no public API)

**Recommendation**:

1. ✅ **Start with the Trove API** (authoritative, easy to extract)
2. ⏳ **Supplement with ILRS scraping** (comprehensive coverage)
3. 🔗 **Cross-link datasets** using NUC/ISIL codes

## API Rate Limits

**Trove API v3 Rate Limit**: 200 requests per minute

The script automatically respects this limit with:

- Default delay: 0.3 seconds between requests (≈200 req/min)
- Configurable delay via the `--delay` parameter
- Progress logging every 50 records

**Estimated extraction time** (for 500 contributors):

- At 200 req/min: ~2.5 minutes
- At 120 req/min (safer): ~4.2 minutes

## Next Steps

### 1. Obtain a Trove API Key

Register at: https://trove.nla.gov.au/about/create-something/using-api

### 2. Run Extraction

```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY
```
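After an extraction run, it can be useful to reproduce the type assignment locally when spot-checking records. The keyword matching behind the GLAMORCUBESFIXPHDNT codes can be sketched roughly as below; the function name follows the `classify_institution_type()` mentioned under Troubleshooting, but the exact keyword lists and precedence in the real script are an assumption:

```python
def classify_institution_type(name: str) -> str:
    """Return a GLAMORCUBESFIXPHDNT-style code for an institution name.

    A rough sketch of the keyword matching; the real script's keyword
    lists and precedence may differ.
    """
    n = name.lower()
    # Order matters: specific heritage keywords are tested before generic
    # ones (e.g. "National Library" should yield L, not O for "national").
    if any(k in n for k in ("library", "bibliothek", "biblioteca", "bibliotheque")):
        return "L"
    if any(k in n for k in ("archive", "archiv", "archivo", "records")):
        return "A"
    if any(k in n for k in ("museum", "museo", "musee")):
        return "M"
    if "gallery" in n:  # gallery without a 'museum' keyword
        return "G"
    if any(k in n for k in ("university", "college", "school", "institut")):
        return "E"
    if any(k in n for k in ("national", "state", "government", "department", "ministry")):
        return "O"
    if any(k in n for k in ("research", "institute", "center", "centre")):
        return "R"
    if any(k in n for k in ("society", "association", "club", "historical")):
        return "S"
    return "U"  # Unknown: no keywords matched


if __name__ == "__main__":
    print(classify_institution_type("National Library of Australia"))   # L
    print(classify_institution_type("Art Gallery of New South Wales"))  # G
    print(classify_institution_type("Unnamed Collection"))              # U
```

Note the table's keyword sets overlap (e.g. `institut` vs `institute`), so whichever branch is tested first wins; the ordering above is one plausible resolution, not the script's documented behaviour.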
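The identifier fields in the output schema (`ghcid_uuid`, `ghcid_numeric`, the ISIL value) can likewise be derived deterministically from a NUC symbol. A minimal standard-library sketch — the UUID namespace and the numeric derivation shown here are illustrative assumptions; the authoritative rules live in the GHCID specification (`docs/PERSISTENT_IDENTIFIERS.md`):

```python
import uuid

# Hypothetical project namespace; the real GHCID namespace is defined in
# the GHCID specification, not here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian")


def nuc_to_isil(nuc: str) -> str:
    """Map an Australian NUC symbol to its ISIL form (AU-{NUC})."""
    return f"AU-{nuc.upper()}"


def ghcid_uuid(nuc: str) -> uuid.UUID:
    """Deterministic UUID v5 for a custodian, keyed on its ISIL (assumed rule)."""
    return uuid.uuid5(GHCID_NAMESPACE, nuc_to_isil(nuc))


def ghcid_numeric(u: uuid.UUID, bits: int = 60) -> int:
    """Illustrative numeric form: the low `bits` bits of the UUID's integer."""
    return u.int & ((1 << bits) - 1)


if __name__ == "__main__":
    u = ghcid_uuid("NLA")
    print(nuc_to_isil("NLA"))  # AU-NLA
    print(u)                   # stable across runs (UUID v5 is deterministic)
    print(ghcid_numeric(u))
```

Because UUID v5 hashes the namespace plus name, re-running the extraction always yields the same `ghcid_uuid` for a given NUC, which is the property a persistent identifier needs.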
### 3. Validate Output

Check the generated files in `data/instances/`:

```bash
# Count records
wc -l data/instances/trove_contributors_*.csv

# View YAML structure
head -n 50 data/instances/trove_contributors_*.yaml

# Check institution type distribution
grep "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c
```

### 4. Integrate with Existing Data

The extracted records can be:

- ✅ **Merged** with other Australian heritage datasets
- ✅ **Cross-referenced** with Wikidata (Q-numbers for GHCIDs)
- ✅ **Enriched** with geocoding (cities → lat/lon)
- ✅ **Exported** to RDF/Turtle for semantic web integration

### 5. Enrich with Full ISIL Registry

For comprehensive coverage, build an ILRS Directory scraper:

```bash
# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py
```

**Target**: https://www.nla.gov.au/apps/ilrs/
**Method**: Playwright/Selenium web scraping
**Output**: Additional ISIL codes not in Trove

## References

- **Trove API Documentation**: https://trove.nla.gov.au/about/create-something/using-api
- **ILRS Directory**: https://www.nla.gov.au/apps/ilrs/
- **ISIL Standard (ISO 15511)**: https://www.iso.org/standard/77849.html
- **LinkML Schema**: `/Users/kempersc/apps/glam/schemas/heritage_custodian.yaml`
- **GHCID Specification**: `/Users/kempersc/apps/glam/docs/PERSISTENT_IDENTIFIERS.md`

## Troubleshooting

### "API key required" Error

**Problem**: Missing or invalid Trove API key

**Solution**:

1. Register at https://trove.nla.gov.au/about/create-something/using-api
2. Check your email for the API key
3. Use the key with the `--api-key` parameter

### Rate Limit Errors (HTTP 429)

**Problem**: Exceeding 200 requests/minute

**Solution**: Increase the delay between requests:

```bash
python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5
```

### No Contributors Found

**Problem**: API returns an empty response

**Solution**:

1. Check API key validity
2. Verify your internet connection
3. Check Trove API status: https://status.nla.gov.au

### Classification Issues

**Problem**: Institutions classified as "UNKNOWN" (U)

**Solution**:

- Review institution names in the output
- Add keywords to the `classify_institution_type()` function
- Update the classification logic in the script

## Support

For questions or issues:

- Project documentation: `/Users/kempersc/apps/glam/docs/`
- Agent instructions: `/Users/kempersc/apps/glam/AGENTS.md`
- Schema reference: `/Users/kempersc/apps/glam/schemas/`

---

**Status**: ✅ Ready to use
**Version**: 1.0.0
**Last Updated**: 2025-11-18
**Maintainer**: GLAM Data Extraction Project