glam/docs/AUSTRALIA_TROVE_EXTRACTION.md
2025-11-19 23:25:22 +01:00

Australian Heritage Institution Extraction - Trove API

Overview

This document describes the extraction of Australian heritage custodian organizations from the Trove API (National Library of Australia).

Data Source

Authority: National Library of Australia (NLA)
System: Trove API v3
ISIL Registry: Australian Interlibrary Resource Sharing (ILRS) Directory
Coverage: Organizations that contribute to Trove and the Australian National Bibliographic Database (ANBD)

What is Trove?

Trove is Australia's national discovery service, aggregating collections from libraries, archives, museums, galleries, and other heritage institutions across Australia.

What is NUC?

NUC (National Union Catalogue) symbols are unique identifiers for Australian heritage institutions. They function as Australia's ISIL equivalent:

  • Format: AU-{NUC} (e.g., AU-NLA for National Library of Australia)
  • Scope: Identifies contributing organizations in the Australian National Bibliographic Database
  • Standards: The AU-prefixed form is an ISIL as defined by ISO 15511
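
The mapping between a NUC symbol and its ISIL form is a simple prefix operation. A minimal sketch follows; the helper names nuc_to_isil and isil_to_nuc are hypothetical and are not taken from the extraction script.

```python
def nuc_to_isil(nuc: str) -> str:
    """Prefix a NUC symbol with the Australian country code to form an ISIL."""
    symbol = nuc.strip()
    if not symbol:
        raise ValueError("NUC symbol must not be empty")
    return f"AU-{symbol}"


def isil_to_nuc(isil: str) -> str:
    """Recover the NUC symbol from an Australian ISIL such as 'AU-NLA'."""
    # Split on the first hyphen only, so symbols that contain
    # further punctuation are returned unchanged.
    prefix, sep, symbol = isil.partition("-")
    if prefix != "AU" or not sep or not symbol:
        raise ValueError(f"not an Australian ISIL: {isil!r}")
    return symbol
```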

Extraction Script

Location

scripts/extract_trove_contributors.py

Features

  • Extracts all Trove contributors via API v3
  • Retrieves full metadata (name, NUC code, URLs, access policies)
  • Maps to LinkML HeritageCustodian schema v0.2.1
  • Generates GHCID persistent identifiers (UUID v5, numeric, base string)
  • Classifies institutions using GLAMORCUBESFIXPHDNT taxonomy
  • Exports to YAML, JSON, and CSV formats
  • Tracks provenance metadata (TIER_1_AUTHORITATIVE)
  • Respects API rate limits (200 requests/minute)
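
The feature list mentions three GHCID forms (UUID v5, numeric, base string). The sketch below shows one way such identifiers could be derived deterministically from the base string. It is an illustration under stated assumptions, not the script's actual algorithm: the namespace value, the use of SHA-256, and the 64-bit truncation are all hypothetical choices.

```python
import hashlib
import uuid

# Assumption: a project-specific namespace derived from the custodian URI prefix.
# The real script may use a different namespace.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian/")


def ghcid_identifiers(ghcid: str) -> dict:
    """Derive a UUID v5 and a numeric identifier from a GHCID base string."""
    ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, ghcid)
    # Assumption: the numeric form is the first 8 bytes of a SHA-256 digest.
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    ghcid_numeric = int.from_bytes(digest[:8], "big")
    return {
        "ghcid_current": ghcid,
        "ghcid_uuid": str(ghcid_uuid),
        "ghcid_numeric": ghcid_numeric,
    }
```

Because both values depend only on the base string, re-running an extraction yields the same identifiers for the same institution.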

Requirements

  1. Trove API Key (free registration required)

  2. Python Dependencies (already installed in this project)

    • requests - HTTP client for API calls
    • pyyaml - YAML parsing and generation
    • pydantic - Data validation (LinkML schema)

Usage

Basic Usage

python scripts/extract_trove_contributors.py --api-key YOUR_TROVE_API_KEY

Advanced Options

# Specify output directory
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --output-dir data/instances/australia

# Adjust rate limiting (default: 0.3s delay = 200 req/min)
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --delay 0.5

# Export specific formats only
python scripts/extract_trove_contributors.py \
  --api-key YOUR_KEY \
  --formats yaml json

Command-Line Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| --api-key | Yes | - | Trove API key (from NLA registration) |
| --output-dir | No | data/instances | Output directory for exported files |
| --delay | No | 0.3 | Delay between API calls (seconds) |
| --formats | No | yaml json csv | Export formats (choose from: yaml, json, csv) |
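
The arguments listed above could be declared with argparse roughly as follows. This is a sketch that mirrors the documented flags and defaults; the function name build_parser and the help strings are assumptions, and the actual script may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command-line arguments."""
    parser = argparse.ArgumentParser(description="Extract Trove contributors")
    parser.add_argument("--api-key", required=True,
                        help="Trove API key (from NLA registration)")
    parser.add_argument("--output-dir", default="data/instances",
                        help="Output directory for exported files")
    parser.add_argument("--delay", type=float, default=0.3,
                        help="Delay between API calls (seconds)")
    # nargs="+" accepts one or more space-separated format names.
    parser.add_argument("--formats", nargs="+",
                        choices=["yaml", "json", "csv"],
                        default=["yaml", "json", "csv"],
                        help="Export formats")
    return parser
```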

Output Files

The script generates timestamped files in the output directory:

data/instances/
├── trove_contributors_20251118_143000.yaml  # LinkML-compliant YAML
├── trove_contributors_20251118_143000.json  # JSON format
└── trove_contributors_20251118_143000.csv   # Flattened CSV
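
The timestamped naming pattern above can be reproduced with a short helper. This is a sketch: the function name output_paths is hypothetical, and whether the script stamps files in UTC or local time is an assumption (UTC is used here).

```python
from datetime import datetime, timezone
from pathlib import Path


def output_paths(output_dir, formats, now=None):
    """Return one timestamped output path per export format."""
    # Assumption: UTC timestamps; pass `now` explicitly for reproducible names.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%d_%H%M%S")
    return {
        fmt: Path(output_dir) / f"trove_contributors_{stamp}.{fmt}"
        for fmt in formats
    }
```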

Output Schema

Records conform to LinkML HeritageCustodian schema (v0.2.1):

- id: https://w3id.org/heritage/custodian/au/nla
  record_id: "uuid-v4-database-id"
  ghcid_uuid: "uuid-v5-persistent-id"
  ghcid_numeric: 213324328442227739
  ghcid_current: AU-ACT-CAN-L-NLA
  name: National Library of Australia
  official_name: National Library of Australia
  institution_type: L  # Library
  identifiers:
    - identifier_scheme: NUC
      identifier_value: NLA
      identifier_url: https://www.nla.gov.au/apps/ilrs/?action=IlrsSearch&term=NLA
    - identifier_scheme: ISIL
      identifier_value: AU-NLA
  homepage: https://www.nla.gov.au
  digital_platforms:
    - platform_name: Institutional Catalogue
      platform_url: https://catalogue.nla.gov.au
      platform_type: CATALOGUE
  locations:
    - city: Canberra
      region: ACT
      country: AU
  provenance:
    data_source: TROVE_API
    data_tier: TIER_1_AUTHORITATIVE
    extraction_date: "2025-11-18T14:30:00Z"
    extraction_method: "Trove API v3 /contributor endpoint with reclevel=full"
    confidence_score: 0.95
    source_url: https://api.trove.nla.gov.au/v3/contributor/NLA
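
The extraction_method and source_url fields in the record above imply a request of the following shape. A minimal sketch: the helper name contributor_request is hypothetical, and sending the key in an X-API-KEY header is an assumption that should be checked against the Trove API v3 documentation.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.trove.nla.gov.au/v3/contributor"


def contributor_request(api_key, nuc=None):
    """Return (url, headers) for the contributor list or a single contributor."""
    # Without a NUC symbol the URL targets the contributor list endpoint.
    url = BASE_URL if nuc is None else f"{BASE_URL}/{quote(nuc, safe='')}"
    params = {"reclevel": "full"}  # full record level, as named in the provenance
    headers = {"X-API-KEY": api_key}  # assumption: key passed as a request header
    return f"{url}?{urlencode(params)}", headers
```

The returned pair can be handed to an HTTP client, for example requests.get(url, headers=headers).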

Institution Type Classification

The script automatically classifies institutions using the GLAMORCUBESFIXPHDNT taxonomy:

| Type | Code | Detection Keywords |
|---|---|---|
| Library | L | library, bibliothek, biblioteca, bibliotheque |
| Archive | A | archive, archiv, archivo, records |
| Museum | M | museum, museo, musee (with 'museum' keyword) |
| Gallery | G | gallery (without 'museum') |
| Education Provider | E | university, college, school, institut |
| Official Institution | O | national, state, government, department, ministry |
| Research Center | R | research, institute, center, centre |
| Society | S | society, association, club, historical |
| Unknown | U | Default when no keywords match |
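
Keyword classification of this kind can be sketched as a first-match scan over the institution name. The code below is illustrative only and is not the script's actual classify_institution_type() implementation; it assumes case-insensitive substring matching and that rules are checked in the order the table lists them.

```python
# Rules are checked top to bottom; the first matching rule wins (assumption).
RULES = [
    ("L", ("library", "bibliothek", "biblioteca", "bibliotheque")),
    ("A", ("archive", "archiv", "archivo", "records")),
    ("M", ("museum", "museo", "musee")),
    ("G", ("gallery",)),  # only reached when no museum keyword matched
    ("E", ("university", "college", "school", "institut")),
    ("O", ("national", "state", "government", "department", "ministry")),
    ("R", ("research", "institute", "center", "centre")),
    ("S", ("society", "association", "club", "historical")),
]


def classify_institution_type(name: str) -> str:
    """Return the single-letter type code for an institution name."""
    lowered = name.lower()
    for code, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return code
    return "U"  # default when no keywords match
```

Under this ordering, "National Library of Australia" resolves to L rather than O because the library rule is checked first. A name such as "Australian Institute of Marine Science" resolves to E, since the Education keyword "institut" matches before the Research keyword "institute"; rule order is therefore worth reviewing when tuning the keyword lists.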

Data Quality

TIER_1_AUTHORITATIVE Classification

Trove API data is classified as TIER_1_AUTHORITATIVE because:

  • Official source: National Library of Australia (government agency)
  • Maintained registry: Actively curated by NLA staff
  • Quality controlled: Contributing organizations verified before inclusion
  • Standards compliant: NUC codes map to ISIL standard (ISO 15511)
  • Current data: Updated regularly by contributing institutions

Confidence Score: 0.95

Records receive a confidence score of 0.95 (very high confidence) because:

  • Data comes directly from the authoritative API
  • No NLP extraction or inference is required
  • There is minimal ambiguity in classification
  • The remaining 5% margin accounts for potential classification edge cases

Coverage and Limitations

What the Trove API Includes

Organizations that contribute to Trove:

  • Libraries contributing bibliographic records
  • Archives providing digitized collections
  • Museums sharing collection metadata
  • Galleries with digitized artworks
  • Universities contributing research outputs
  • Research institutions with digital repositories

What the Trove API Does NOT Include

Organizations not contributing to Trove:

  • Heritage institutions without a digital presence
  • Private collections not shared with Trove
  • Recently established institutions pending registration
  • Organizations that declined Trove participation

Full ISIL Registry Coverage

For complete Australian ISIL coverage, additional extraction is needed:

ILRS Directory (https://www.nla.gov.au/apps/ilrs/)

  • Full registry of all Australian ISIL codes
  • Includes non-contributing institutions
  • Contains detailed ILL policies, service levels, charges
  • Requires web scraping (no public API)

Recommendation:

  1. Start with Trove API (authoritative, easy to extract)
  2. Supplement with ILRS scraping (comprehensive coverage)
  3. Cross-link datasets using NUC/ISIL codes
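
Step 3 amounts to a keyed merge on the ISIL code. The sketch below assumes each record is a dict carrying an "isil" key and that Trove values take precedence on conflict; the field names and the function merge_by_isil are hypothetical.

```python
def merge_by_isil(trove_records, ilrs_records):
    """Merge two record lists keyed on ISIL code; Trove values win on conflict."""
    merged = {
        record["isil"]: dict(record, sources=["TROVE_API"])
        for record in trove_records
    }
    for record in ilrs_records:
        isil = record["isil"]
        if isil in merged:
            entry = merged[isil]
            # Fill in only the fields Trove did not provide.
            for key, value in record.items():
                entry.setdefault(key, value)
            entry["sources"].append("ILRS")
        else:
            # Institution registered in ILRS but not contributing to Trove.
            merged[isil] = dict(record, sources=["ILRS"])
    return merged
```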

API Rate Limits

Trove API v3 Rate Limit: 200 requests per minute

The script automatically respects this limit with:

  • Default delay: 0.3 seconds between requests (≈200 req/min)
  • Configurable delay via --delay parameter
  • Progress logging every 50 records
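
A fixed-delay loop of the kind described above might look like this. It is a sketch, not the script's code: fetch_all and its parameters are hypothetical, and the fetch callable stands in for whatever function performs the actual API request.

```python
import time


def fetch_all(ids, fetch, delay=0.3, sleep=time.sleep, log_every=50):
    """Call fetch(id) for each id, pausing `delay` seconds between requests."""
    results = []
    for count, item in enumerate(ids, start=1):
        if count > 1:
            sleep(delay)  # pause between consecutive requests
        results.append(fetch(item))
        if count % log_every == 0:
            print(f"Fetched {count} records")
    return results
```

Passing sleep as a parameter keeps the loop testable without real waiting.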

Estimated extraction time (for 500 contributors):

  • At 200 req/min: ~2.5 minutes
  • At 120 req/min (safer): ~4.2 minutes
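
These estimates follow from dividing the record count by the request rate, assuming one request per contributor. A small worked sketch, with hypothetical helper names:

```python
def estimated_minutes(records: int, requests_per_minute: float) -> float:
    """Estimated run time in minutes, assuming one request per record."""
    return records / requests_per_minute


def delay_for_rate(requests_per_minute: float) -> float:
    """Delay in seconds between requests needed to hold a given rate."""
    return 60.0 / requests_per_minute


if __name__ == "__main__":
    for rate in (200, 120):
        minutes = estimated_minutes(500, rate)
        print(f"{rate} req/min: delay {delay_for_rate(rate):.2f}s, about {minutes:.1f} min")
```

The estimate ignores network latency, so real runs will take somewhat longer.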

Next Steps

1. Obtain Trove API Key

Register at: https://trove.nla.gov.au/about/create-something/using-api

2. Run Extraction

python scripts/extract_trove_contributors.py --api-key YOUR_KEY

3. Validate Output

Check the generated files in data/instances/:

# Count records (each line count includes the CSV header row)
wc -l data/instances/trove_contributors_*.csv

# View YAML structure
head -n 50 data/instances/trove_contributors_*.yaml

# Check institution type distribution
grep -h "institution_type:" data/instances/trove_contributors_*.yaml | sort | uniq -c
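
The same checks can be run in Python against the CSV export. This sketch assumes the flattened CSV has an institution_type column; the column name and the function type_distribution are assumptions.

```python
import csv
from collections import Counter


def type_distribution(csv_path):
    """Return (record_count, Counter of institution_type values) for a CSV export."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader consumes the header row, so it is not counted as a record.
        rows = list(csv.DictReader(handle))
    return len(rows), Counter(row["institution_type"] for row in rows)
```

Unlike wc -l, this count excludes the header row and handles quoted fields that span multiple lines.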

4. Integrate with Existing Data

The extracted records can be:

  • Merged with other Australian heritage datasets
  • Cross-referenced with Wikidata (Q-numbers for GHCIDs)
  • Enriched with geocoding (cities → lat/lon)
  • Exported to RDF/Turtle for semantic web integration

5. Enrich with Full ISIL Registry

For comprehensive coverage, build an ILRS Directory scraper:

# Future script (not yet implemented)
python scripts/scrape_ilrs_directory.py

Target: https://www.nla.gov.au/apps/ilrs/
Method: Playwright/Selenium web scraping
Output: Additional ISIL codes not in Trove

Troubleshooting

"API key required" Error

Problem: Missing or invalid Trove API key

Solution:

  1. Register at https://trove.nla.gov.au/about/create-something/using-api
  2. Check email for API key
  3. Use key with --api-key parameter

Rate Limit Errors (HTTP 429)

Problem: Exceeding 200 requests/minute

Solution: Increase delay between requests:

python scripts/extract_trove_contributors.py --api-key YOUR_KEY --delay 0.5
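
Besides raising the fixed delay, a client can retry after an HTTP 429 with a growing pause. The sketch below illustrates exponential backoff; it is not a feature the script is documented to have. The send callable is a hypothetical stand-in that performs one request and returns a (status_code, body) pair.

```python
import time


def get_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry send() while it reports HTTP 429, doubling the wait each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Still rate limited after the final attempt: return the last response.
    return status, body
```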

No Contributors Found

Problem: API returns empty response

Solution:

  1. Check API key validity
  2. Verify internet connection
  3. Check Trove API status: https://status.nla.gov.au

Classification Issues

Problem: Institutions classified as "UNKNOWN" (U)

Solution:

  • Review institution names in output
  • Add keywords to classify_institution_type() function
  • Update classification logic in script

Support

For questions or issues:

  • Project documentation: docs/
  • Agent instructions: AGENTS.md
  • Schema reference: schemas/

Status: Ready to use
Version: 1.0.0
Last Updated: 2025-11-18
Maintainer: GLAM Data Extraction Project