# Global GLAM Dataset: System Architecture

## Overview

This document describes the technical architecture for extracting, enriching, and publishing a comprehensive global GLAM (Galleries, Libraries, Archives, Museums) dataset from Claude conversation files.

## System Context Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                  Global GLAM Extraction System                  │
│                                                                 │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐   │
│  │ Conversation │ ───> │  Extraction  │ ───> │    LinkML    │   │
│  │   Parsers    │      │   Pipeline   │      │  Instances   │   │
│  └──────────────┘      └──────────────┘      └──────────────┘   │
│         │                     │                     │           │
│         ▼                     ▼                     ▼           │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐   │
│  │     Text     │      │ Web Crawler  │      │    Export    │   │
│  │    Corpus    │      │  (crawl4ai)  │      │   Formats    │   │
│  └──────────────┘      └──────────────┘      └──────────────┘   │
└─────────────────────────────────────────────────────────────────┘
          │                     │                     │
          ▼                     ▼                     ▼
  [139 JSON Files]        [Web Sources]      [Published Dataset]
```

## High-Level Architecture

### Extraction Architecture: Hybrid Approach

This system uses a **hybrid extraction architecture** combining pattern matching and subagent-based NER:

**Pattern Matching (Main Code)**:
- Extract structured identifiers: ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers
- URL extraction and validation
- Fast, deterministic, and free of NLP library dependencies
- Implemented in: `src/glam_extractor/extractors/identifiers.py`

**Subagent-Based NER**:
- Extract unstructured entities: institution names, locations, relationships
- Coding subagents autonomously choose NER tools (spaCy, transformers, GPT-4, etc.)
- Main application stays lightweight (no PyTorch, spaCy, or transformers dependencies)
- Flexible: extraction methods can be swapped without changing main code
- Implemented via: Task tool invocation in extractors

**Rationale**: See `docs/plan/global_glam/07-subagent-architecture.md` for the detailed architectural decision record.

### 1. Data Ingestion Layer

#### 1.1 Conversation Parser
**Purpose**: Extract structured content from Claude conversation JSON files

**Components**:
- `ConversationReader`: Read and validate JSON files
- `MessageExtractor`: Extract message content and metadata
- `CitationExtractor`: Parse URLs and references
- `ConversationIndexer`: Create searchable index of conversations

**Inputs**:
- Claude conversation JSON files (139 files covering global GLAM research)

**Outputs**:
- Structured conversation objects
- Text corpus (markdown content)
- Citation database (URLs, references)
- Conversation metadata index

**Technology**:
- Python `json` module for parsing
- `pydantic` for data validation
- DuckDB or SQLite for indexing

#### 1.2 Data Model
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Conversation:
    id: str
    created_at: datetime
    title: str
    messages: List[Message]
    citations: List[Citation]
    metadata: ConversationMetadata


@dataclass
class Message:
    role: str           # 'user' or 'assistant'
    content: str        # Markdown text
    timestamp: datetime


@dataclass
class Citation:
    url: str
    title: Optional[str]
    context: str        # Surrounding text


@dataclass
class ConversationMetadata:
    country: Optional[str]
    region: Optional[str]
    glam_types: List[str]   # ['museum', 'archive', 'library', 'gallery']
    languages: List[str]
```

### 2. Extraction & Processing Layer

#### 2.1 NLP Extraction Pipeline
**Purpose**: Extract heritage institutions and attributes using pattern matching and subagent-based NER

**Architecture**: This pipeline uses a **hybrid approach**:
- **Pattern matching** (in main code) for identifiers and structured data
- **Coding subagents** (via Task tool) for Named Entity Recognition
- See `docs/plan/global_glam/07-subagent-architecture.md` for detailed rationale

**Components**:

##### Entity Extractor (Main Code - Pattern Matching)
- **IdentifierExtractor**: Extract ISIL codes, Wikidata IDs, VIAF IDs, KvK numbers using regex patterns
- **URLExtractor**: URL pattern matching and validation
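
The structured identifiers above lend themselves to plain regular expressions; an illustrative sketch (these patterns are simplified examples, not the actual contents of `identifiers.py`):

```python
import re

# Illustrative patterns only -- the real IdentifierExtractor in
# src/glam_extractor/extractors/identifiers.py may use stricter rules.
PATTERNS = {
    # ISIL: country/agency prefix, hyphen, local unit (max 16 chars total)
    "isil": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]{1,11}\b"),
    # Wikidata item IDs: Q followed by digits
    "wikidata": re.compile(r"\bQ\d+\b"),
    # VIAF IDs, as they appear in viaf.org URLs
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    # Dutch KvK (Chamber of Commerce) numbers: exactly 8 digits
    "kvk": re.compile(r"\b\d{8}\b"),
}


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return all identifier candidates found in a text fragment."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```

Candidates still need verification (e.g. against the ISIL registry or Wikidata itself) before they are trusted; the regex pass only narrows the search space.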

##### Entity Extractor (Subagent-Based NER)
- **InstitutionExtractor**: Launch subagents to extract institution names using NLP/NER
- **LocationExtractor**: Launch subagents for geographic entity extraction
- **RelationshipExtractor**: Launch subagents to extract organizational relationships

##### Attribute Extractor (Subagent-Based)
- **TypeClassifier**: Subagents classify institution type (museum, archive, etc.)
- **CollectionExtractor**: Subagents extract collection subjects/themes
- **StandardsExtractor**: Subagents identify metadata/preservation standards used

##### Data Model
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ExtractedInstitution:
    name: str
    name_variants: List[str]
    institution_type: List[str]   # Can be multiple

    # Location
    country: Optional[str]
    region: Optional[str]
    city: Optional[str]
    address: Optional[str]
    coordinates: Optional[Tuple[float, float]]

    # Digital presence
    urls: List[str]
    repositories: List[DigitalRepository]

    # Identifiers
    isil_code: Optional[str]
    wikidata_id: Optional[str]
    national_ids: Dict[str, str]

    # Collections
    collection_subjects: List[str]
    collection_formats: List[str]

    # Technical
    metadata_standards: List[str]
    preservation_standards: List[str]
    access_protocols: List[str]

    # Organizational
    parent_organization: Optional[str]
    consortia: List[str]
    partnerships: List[str]

    # Provenance
    source_conversations: List[str]
    extraction_confidence: float
    extraction_method: str


@dataclass
class DigitalRepository:
    url: str
    platform: Optional[str]   # e.g., "DSpace", "Omeka", "CollectiveAccess"
    repository_type: str      # e.g., "institutional repository", "catalog", "portal"
    access_type: str          # "open", "restricted", "mixed"
```

**Technology**:
- **Pattern Matching** (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- **NER & Extraction** (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- **Task Orchestration**: Task tool for subagent invocation
- **Text Processing**: `langdetect` for language identification (main code)
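
Since subagents hand results back as structured JSON, the main code should validate that contract at the boundary. A sketch with `pydantic` (the `NERResult` shape is hypothetical; the real contract belongs to the extractors that launch the subagents):

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


# Hypothetical result contract for a NER subagent -- the actual
# schema is defined by the extractors that launch the subagents.
class NERResult(BaseModel):
    entity: str
    entity_type: str                    # e.g. "institution", "location"
    confidence: float
    source_conversation: Optional[str] = None


def parse_subagent_output(raw: str) -> List[NERResult]:
    """Validate a subagent's JSON reply; drop malformed entries."""
    results = []
    for item in json.loads(raw):
        try:
            results.append(NERResult(**item))
        except ValidationError:
            continue  # route to the manual review queue instead of failing
    return results
```

Rejected entries should be logged rather than silently discarded, so subagent contract drift shows up in the metrics.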

#### 2.2 Web Crawling & Enrichment
**Purpose**: Fetch and extract data from URLs cited in conversations

**Components**:

##### Web Crawler (crawl4ai)
- **URLValidator**: Check URL availability and redirects
- **PageFetcher**: Async crawling with rate limiting
- **MetadataExtractor**: Extract structured metadata from pages
- **ContentExtractor**: Platform-specific content extraction

##### Enrichment Services
- **GeocodingService**: Resolve addresses to coordinates (Nominatim)
- **WikidataLinker**: Link institutions to Wikidata entities
- **VIAFLinker**: Link to VIAF authority records
- **RegistryChecker**: Verify against national/international registries

**Technology**:
- **crawl4ai**: Async web crawling with LLM extraction
- **httpx**: Modern async HTTP client
- **beautifulsoup4**: HTML parsing
- **lxml**: XML/HTML processing
- **geopy**: Geocoding
- **SPARQLWrapper**: Query Wikidata/VIAF

**Data Flow**:
```
Citations → URL Validation → Crawl Queue
                  ↓
        crawl4ai Async Crawler
                  ↓
        ┌─────────┴─────────┐
        ▼                   ▼
 Metadata Extract    Content Extract
        │                   │
        └─────────┬─────────┘
                  ▼
       Enrichment Pipeline
                  ▼
   Enriched Institution Data
```

### 3. Schema & Validation Layer

#### 3.1 LinkML Schema Definition
**Purpose**: Define comprehensive heritage custodian ontology

**Schema Structure**:
```yaml
# heritage_custodian.yaml (simplified)
id: https://w3id.org/heritage-custodian
name: heritage-custodian-schema
title: Heritage Custodian Ontology

prefixes:
  hc: https://w3id.org/heritage-custodian/
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/
  skos: http://www.w3.org/2004/02/skos/core#
  isil: http://id.loc.gov/vocabulary/organizations/
  rico: https://www.ica.org/standards/RiC/ontology#

imports:
  - linkml:types
  - schema_org_subset
  - cpoc_subset
  - tooi_subset

classes:
  HeritageCustodian:
    is_a: schema:Organization
    mixins:
      - CPOCPublicOrganization
      - TOOIOrganization
    slots:
      - name
      - institution_types
      - geographic_coverage
      - digital_platforms
      - collections
      - identifiers

  DigitalPlatform:
    is_a: schema:WebSite
    slots:
      - platform_type
      - software_platform
      - access_protocol
      - metadata_standard

  Collection:
    slots:
      - collection_name
      - subjects
      - formats
      - temporal_coverage
      - rights_statements

slots:
  institution_types:
    multivalued: true
    range: InstitutionTypeEnum

  identifiers:
    multivalued: true
    range: Identifier

enums:
  InstitutionTypeEnum:
    permissible_values:
      ARCHIVE:
        meaning: rico:RecordCreator
      LIBRARY:
        meaning: schema:Library
      MUSEUM:
        meaning: schema:Museum
      GALLERY:
        meaning: schema:ArtGallery
```

**Modular Design**:
- `core/` - Core classes (HeritageCustodian, Collection)
- `mixins/` - Reusable components (TOOI, CPOC, Schema.org)
- `identifiers/` - Identifier systems (ISIL, Wikidata, etc.)
- `standards/` - Domain standards (RiC-O, BIBFRAME, LIDO)
- `enums/` - Controlled vocabularies

#### 3.2 Validation & Mapping
**Components**:
- **SchemaValidator**: Validate instances against LinkML schema
- **EntityMapper**: Map extracted data to LinkML classes
- **IRIGenerator**: Generate unique IRIs for instances
- **ProvenanceTracker**: Track extraction provenance

**Technology**:
- **linkml-runtime**: Schema validation and instance generation
- **linkml**: Schema development tools
- **rdflib**: RDF graph manipulation
- **jsonschema**: JSON validation

### 4. Storage Layer

#### 4.1 Intermediate Storage
**Purpose**: Store processed data during extraction pipeline

**Databases**:
- **Conversation Index**: DuckDB
  - Fast analytical queries
  - Full-text search on conversation content
  - Lightweight, embedded database

- **Extraction Cache**: SQLite
  - Store intermediate extraction results
  - Cache web crawling results
  - Track processing status

**Schema**:
```sql
-- Conversations
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at TIMESTAMP,
    title TEXT,
    country TEXT,
    region TEXT,
    content TEXT,              -- Full markdown content
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE citations (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT,
    url TEXT,
    title TEXT,
    context TEXT,
    crawled BOOLEAN DEFAULT FALSE,
    crawl_status TEXT,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);

-- Extracted entities
CREATE TABLE extracted_institutions (
    id TEXT PRIMARY KEY,
    name TEXT,
    institution_type TEXT[],   -- DuckDB list type
    country TEXT,
    raw_data JSON,             -- Full extracted data
    confidence FLOAT,
    status TEXT,               -- 'pending', 'validated', 'rejected'
    created_at TIMESTAMP
);

-- Crawl results
CREATE TABLE crawl_results (
    url TEXT PRIMARY KEY,
    status_code INTEGER,
    content_hash TEXT,
    metadata JSON,
    crawled_at TIMESTAMP
);
```

#### 4.2 Output Storage
**Purpose**: Store final dataset in multiple formats

**Formats**:
- **RDF Store**: Apache Jena TDB or Oxigraph
  - Native RDF/SPARQL support
  - Linked data queries

- **Document Store**: JSON-LD files
  - One file per institution
  - GitHub-friendly for version control

- **Analytical**: Parquet + DuckDB
  - Efficient columnar storage
  - Fast analytical queries

- **Relational**: PostgreSQL (optional)
  - If needed for applications
  - Can be generated from RDF

### 5. Export & Publishing Layer

#### 5.1 Export Pipeline
**Components**:
- **RDFExporter**: Export to Turtle/N-Triples/JSON-LD
- **TabularExporter**: Export to CSV/TSV/Excel
- **SQLExporter**: Generate SQL dumps
- **ParquetExporter**: Generate Parquet files
- **StatisticsGenerator**: Dataset statistics and reports
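
The TabularExporter's main chore is flattening multivalued fields into CSV cells; a stdlib-only sketch (the `|` separator and the column layout are illustrative, not the project's fixed convention):

```python
import csv
from typing import Any, Dict, List


def flatten(record: Dict[str, Any]) -> Dict[str, str]:
    """Join list-valued fields with '|' so each institution fits one CSV row."""
    return {
        key: "|".join(map(str, value)) if isinstance(value, list)
        else ("" if value is None else str(value))
        for key, value in record.items()
    }


def export_csv(institutions: List[Dict[str, Any]], path: str) -> None:
    """Write institutions to CSV with a union of all observed columns."""
    fieldnames = sorted({key for inst in institutions for key in inst})
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for inst in institutions:
            writer.writerow(flatten(inst))
```

Whatever separator is chosen must be documented in `data_dictionary.csv` so downstream users can re-split the multivalued columns.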

#### 5.2 Publishing Formats
```
dataset/
├── rdf/
│   ├── institutions.ttl          # All institutions in Turtle
│   ├── institutions.nt           # N-Triples format
│   └── jsonld/                   # Individual JSON-LD files
│       ├── inst_001.jsonld
│       └── ...
├── tabular/
│   ├── institutions.csv          # Flattened institution data
│   ├── collections.csv           # Collections table
│   ├── digital_platforms.csv     # Digital platforms table
│   └── data_dictionary.csv       # Column descriptions
├── database/
│   ├── glam_dataset.db           # SQLite database
│   ├── glam_dataset.duckdb       # DuckDB database
│   └── schema.sql                # SQL schema
├── analytics/
│   └── glam_dataset.parquet      # Parquet format
├── metadata/
│   ├── dataset_metadata.yaml     # Dataset description
│   ├── statistics.json           # Dataset statistics
│   └── provenance.jsonld         # PROV-O provenance
└── docs/
    ├── README.md                 # Dataset documentation
    ├── schema.html               # Schema documentation
    └── query_examples.md         # SPARQL/SQL examples
```

## Component Integration

### Data Flow Architecture

```
┌─────────────────┐
│  Conversation   │
│   JSON Files    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Conversation   │ ──> [DuckDB Index]
│     Parser      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Pattern-based  │ ──> Extract: ISIL, Wikidata, URLs
│   Extraction    │     (IdentifierExtractor in main code)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    SUBAGENT     │ ──> Launch subagents for NER
│    BOUNDARY     │     (InstitutionExtractor, LocationExtractor)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Coding      │ ──> Subagents use spaCy/transformers
│   Subagents     │     Return structured JSON results
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Web Crawler   │ ──> [Crawl Results DB]
│   (crawl4ai)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Enrichment    │ ──> [External APIs]
│    Services     │     (Wikidata, VIAF)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LinkML Mapper  │ ──> [Validation Errors]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Quality Check  │ ──> [Review Queue]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Exporters    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Published    │
│     Dataset     │
└─────────────────┘
```

### Processing Modes

#### 1. Batch Processing (Default)
- Process all 139 conversations sequentially
- Suitable for initial dataset creation
- Can be parallelized by conversation

#### 2. Incremental Processing
- Process new/updated conversations only
- Track processing timestamps
- Merge with existing dataset

#### 3. Targeted Processing
- Process specific conversations (by country, date, etc.)
- For dataset updates or testing

## Technology Stack

### Core Languages
- **Python 3.11+**: Main implementation language
- **YAML**: Schema definition (LinkML)
- **SPARQL**: RDF queries

### Key Libraries
- **Data Processing**: pandas, polars (for tabular data)
- **Pattern Matching**: Python `re` module, `rapidfuzz` (fuzzy matching)
- **Subagent Orchestration**: Task tool (for NER and complex extraction)
- **Web**: crawl4ai, httpx, beautifulsoup4
- **Semantic Web**: linkml-runtime, rdflib, pyshacl
- **Database**: duckdb, sqlite3, sqlalchemy
- **Validation**: pydantic, jsonschema
- **Export**: pyarrow (Parquet), openpyxl (Excel)

**Note**: NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction. See `docs/plan/global_glam/07-subagent-architecture.md` for details.

### Development Tools
- **Package Management**: Poetry
- **Code Quality**: ruff (linting), mypy (type checking)
- **Testing**: pytest, hypothesis
- **Documentation**: mkdocs, sphinx
- **CI/CD**: GitHub Actions

## Scalability Considerations

### Performance Optimizations
1. **Parallel Processing**: Process conversations concurrently
2. **Caching**: Cache web crawl results and API calls
3. **Incremental Processing**: Only process changed data
4. **Efficient Storage**: Use columnar formats (Parquet) for analytics

### Resource Management
- **Memory**: Stream large files; avoid loading all data into memory
- **Disk**: Compress intermediate results
- **Network**: Rate limiting for web crawling, batched API calls
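
For crawl throttling, a token bucket is usually enough; a minimal, dependency-free sketch (the injectable clock exists only to make the limiter testable):

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

An async crawler would typically await a small sleep when `try_acquire` returns False, ideally with a separate bucket per host so one slow site does not starve the rest of the queue.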

## Security & Ethics

### Web Crawling
- Respect robots.txt
- Implement rate limiting
- User-agent identification
- No aggressive crawling

### Data Privacy
- No personal data extraction
- Public information only
- Comply with institutional terms of service

### Licensing
- Clear dataset license (CC0, CC-BY, or ODbL)
- Respect source data licenses
- Attribution to source institutions

## Monitoring & Observability

### Logging
- Structured logging (JSON format)
- Log levels: DEBUG, INFO, WARNING, ERROR
- Per-component logging
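
Structured JSON logging needs nothing beyond the stdlib; a sketch of a formatter that emits one JSON object per line, with the logger name serving as the component field:

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Emit one JSON object per log line for machine-readable logs."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "component": record.name,       # per-component logger name
            "message": record.getMessage(),
            "time": self.formatTime(record),
        })


def get_logger(component: str) -> logging.Logger:
    """Component-scoped logger wired to the JSON formatter (added once)."""
    logger = logging.getLogger(component)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JSONFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

One-object-per-line output feeds directly into DuckDB's JSON reader for the metrics queries below.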

### Metrics
- Extraction success rates
- Web crawl success rates
- Validation pass/fail rates
- Processing time per conversation
- Dataset statistics over time

### Error Handling
- Graceful degradation
- Retry logic for transient failures
- Error categorization and reporting
- Manual review queue for edge cases

## Deployment Architecture

### Local Development
```
poetry install
poetry run glam-extract --input data/conversations/ --output data/dataset/
```

### Docker Container (Future)
```dockerfile
FROM python:3.11-slim
RUN pip install poetry
COPY . /app
WORKDIR /app
# --only main replaces the deprecated --no-dev flag (removed in newer Poetry)
RUN poetry install --only main
ENTRYPOINT ["poetry", "run", "glam-extract"]
```

### CI/CD Pipeline
- **GitHub Actions**: Run on push/PR
- **Tests**: Unit tests, integration tests
- **Quality Gates**: Linting, type checking, coverage
- **Dataset Build**: Weekly automated builds
- **Publication**: Auto-publish to GitHub Releases

## Future Enhancements

### Phase 2 Features
- Web UI for manual review and curation
- SPARQL endpoint for queries
- REST API for dataset access
- Automated duplicate detection across conversations
- Multi-language support (NLP for non-English content)
- Machine learning for institution type classification
- Automated monitoring of institution URLs (link rot detection)

### Integration Opportunities
- Wikidata integration (read/write)
- National heritage registries
- OpenStreetMap for geocoding
- IIIF for image collections
- OAI-PMH for metadata harvesting

## Related Documentation

- **Subagent Architecture**: `docs/plan/global_glam/07-subagent-architecture.md` - Detailed explanation of subagent-based NER approach
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md` - Code patterns and best practices
- **Data Standardization**: `docs/plan/global_glam/04-data-standardization.md` - Data tier system and provenance
- **Dependencies**: `docs/plan/global_glam/03-dependencies.md` - Technology stack and library choices
- **LinkML Schema**: `schemas/heritage_custodian.yaml` - Data model definition
- **Agent Instructions**: `AGENTS.md` - Instructions for AI agents working on this project