glam/schemas/20251121/MIGRATION_GUIDE.md
2025-11-21 22:12:33 +01:00


# Legal Form Migration Guide: Generic Enums → ISO 20275
**Purpose**: Migrate heritage institution data from generic legal form enumerations to ISO 20275 Entity Legal Forms (ELF) codes.
**Status**: Migration script ready for testing
**Date**: 2025-11-21
**Version**: 1.0.0
---
## Quick Start
### 1. Install Dependencies
```bash
pip install pyyaml
```
### 2. Run Migration Script
```bash
# Single file (Dutch institutions)
python scripts/migrate_legal_form_to_iso20275.py \
  --input data/nde/institutions.yaml \
  --output data/nde/institutions_migrated.yaml \
  --country NL

# Entire directory (auto-detect country)
python scripts/migrate_legal_form_to_iso20275.py \
  --input-dir data/nde/ \
  --output-dir data/nde_migrated/

# Dry run (preview changes)
python scripts/migrate_legal_form_to_iso20275.py \
  --input data/nde/institutions.yaml \
  --output /dev/null \
  --dry-run
```
### 3. Review Migration Report
The script generates a detailed report saved to `migration_report.txt`:
```
Migration Report
================
Total records processed: 1351
Successfully migrated: 1200
Unchanged (already ISO 20275): 50
Requiring manual review: 95
Errors: 6
Success rate: 88.8%
```
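
The success rate in the sample report appears to be the migrated count divided by the total processed, with unchanged records not counted as successes. That reading is an assumption inferred from the numbers (1200 / 1351 ≈ 88.8%), not taken from the script. A minimal sketch, with a hypothetical `success_rate` helper:

```python
def success_rate(migrated: int, total: int) -> float:
    """Percentage of processed records that were migrated automatically."""
    if total == 0:
        return 0.0
    return round(100 * migrated / total, 1)


# Counts copied from the sample report above.
counts = {"migrated": 1200, "unchanged": 50, "manual_review": 95, "errors": 6}
total = sum(counts.values())
rate = success_rate(counts["migrated"], total)
print(f"{total} records processed, success rate {rate}%")
```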
---
## Migration Mappings
### Netherlands (NL)
| Old Enum | ISO 20275 Code | Local Name | Confidence |
|----------|----------------|------------|------------|
| `STICHTING` | `V44D` | Stichting | 1.0 |
| `ASSOCIATION` | `33MN` | Vereniging met volledige rechtsbevoegdheid | 0.9 |
| `NGO` | `33MN` | Vereniging | 0.7 |
| `GOVERNMENT_AGENCY` | `A0W7` | Publiekrechtelijke rechtspersoon | 0.95 |
| `LIMITED_COMPANY` | `54M6` | Besloten vennootschap (BV) | 0.85 |
| `COOPERATIVE` | `NFFH` | Coöperatie | 1.0 |
| `TRUST` | `V44D` | Stichting | 0.6 |
### France (FR)
| Old Enum | ISO 20275 Code | Local Name | Confidence |
|----------|----------------|------------|------------|
| `STICHTING` | `9T5S` | Fondation | 0.8 |
| `ASSOCIATION` | `BEWI` | Association déclarée | 1.0 |
| `NGO` | `BEWI` | Association | 0.9 |
| `GOVERNMENT_AGENCY` | `5RDO` | Établissement public | 1.0 |
| `LIMITED_COMPANY` | `KMPN` | SARL | 0.9 |
| `COOPERATIVE` | `6HB6` | Société coopérative (SCOP) | 1.0 |
### Germany (DE)
| Old Enum | ISO 20275 Code | Local Name | Confidence |
|----------|----------------|------------|------------|
| `STICHTING` | `V2YH` | Stiftung | 1.0 |
| `ASSOCIATION` | `QZ3L` | Eingetragener Verein (e.V.) | 1.0 |
| `NGO` | `QZ3L` | e.V. | 0.9 |
| `GOVERNMENT_AGENCY` | `SQKS` | Körperschaft des öffentlichen Rechts | 1.0 |
| `LIMITED_COMPANY` | `XLWA` | GmbH | 0.9 |
| `COOPERATIVE` | `XAEA` | Eingetragene Genossenschaft (eG) | 1.0 |
### United Kingdom (GB/UK)
| Old Enum | ISO 20275 Code | Local Name | Confidence |
|----------|----------------|------------|------------|
| `STICHTING` | `FC0R` | Trust | 0.8 |
| `ASSOCIATION` | `9HLU` | Charity | 0.9 |
| `NGO` | `9HLU` | Charity | 0.95 |
| `GOVERNMENT_AGENCY` | `AVYY` | Public corporation | 1.0 |
| `LIMITED_COMPANY` | `CBL2` | Private company limited by shares (Ltd) | 0.9 |
| `COOPERATIVE` | `83XL` | Co-operative Society | 1.0 |
| `TRUST` | `FC0R` | Trust | 1.0 |
### United States (US)
| Old Enum | ISO 20275 Code | Local Name | Confidence |
|----------|----------------|------------|------------|
| `STICHTING` | `QQQ0` | 501(c)(3) Nonprofit Organization | 0.8 |
| `ASSOCIATION` | `QQQ0` | 501(c)(3) | 0.9 |
| `NGO` | `QQQ0` | 501(c)(3) | 0.95 |
| `GOVERNMENT_AGENCY` | `W2ES` | Government Entity | 1.0 |
| `LIMITED_COMPANY` | `CNQ3` | Business Corporation | 0.8 |
| `COOPERATIVE` | `S63E` | Cooperative | 1.0 |
| `TRUST` | `7TPC` | Trust | 1.0 |
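
One way to picture these tables in code is a lookup keyed by country and old enum value. The sketch below is illustrative only: `Mapping`, `MAPPINGS`, `COUNTRY_ALIASES`, and `lookup` are hypothetical names, only a few rows are copied from the tables, and treating `UK` as an alias for `GB` is an assumption based on the "GB/UK" heading. The migration script may store its mappings differently.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    elf_code: str
    local_name: str
    confidence: float


# Excerpt of the tables above, keyed by (country, old enum).
MAPPINGS = {
    ("NL", "STICHTING"): Mapping("V44D", "Stichting", 1.0),
    ("NL", "TRUST"): Mapping("V44D", "Stichting", 0.6),
    ("FR", "ASSOCIATION"): Mapping("BEWI", "Association déclarée", 1.0),
    ("DE", "COOPERATIVE"): Mapping("XAEA", "Eingetragene Genossenschaft (eG)", 1.0),
    ("GB", "TRUST"): Mapping("FC0R", "Trust", 1.0),
}

# Assumption: "UK" is accepted as an alias for the ISO 3166-1 code "GB".
COUNTRY_ALIASES = {"UK": "GB"}


def lookup(country: str, old_enum: str) -> Optional[Mapping]:
    """Return the mapping for a country/enum pair, or None if there is none."""
    code = country.strip().upper()
    code = COUNTRY_ALIASES.get(code, code)
    return MAPPINGS.get((code, old_enum.strip().upper()))


print(lookup("NL", "STICHTING"))
print(lookup("UK", "TRUST"))
```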
---
## Understanding Confidence Scores
**Confidence Range**: 0.0 (uncertain) to 1.0 (certain)
**Thresholds**:
- **1.0**: Exact one-to-one mapping (e.g., STICHTING → V44D in Netherlands)
- **0.9-0.99**: High confidence, common pattern (e.g., ASSOCIATION → BEWI in France)
- **0.7-0.89**: Moderate confidence, generally correct but verify (e.g., NGO → 33MN in Netherlands)
- **0.5-0.69**: Low confidence, requires manual review (e.g., TRUST → V44D in Netherlands)
- **< 0.5**: Very uncertain, manual mapping required
**Default threshold**: 0.7 (a record is migrated automatically only if its confidence is ≥ 0.7)
Change threshold:
```bash
python scripts/migrate_legal_form_to_iso20275.py \
--confidence-threshold 0.8 \
...
```
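
The bands and the threshold rule can be sketched as two small functions. The band labels and the names `confidence_band` and `auto_migrate` are hypothetical, and the comparison is assumed to be inclusive (a score equal to the threshold is migrated), which matches `NGO` → `33MN` at 0.7 being listed as migratable.

```python
def confidence_band(score: float) -> str:
    """Map a confidence score to the bands listed in this section."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be within [0.0, 1.0], got {score}")
    if score == 1.0:
        return "exact"
    if score >= 0.9:
        return "high"
    if score >= 0.7:
        return "moderate"
    if score >= 0.5:
        return "low"
    return "very uncertain"


def auto_migrate(score: float, threshold: float = 0.7) -> bool:
    """True if the record would be migrated without manual review."""
    return score >= threshold


for score in (1.0, 0.9, 0.7, 0.6):
    print(score, confidence_band(score), auto_migrate(score))
```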
---
## Manual Review Cases
The script flags records for manual review in these situations:
### 1. Low Confidence Mappings
**Example**: `TRUST` in Netherlands context
- Generic enum `TRUST` could map to `V44D` (stichting) or `FC0R` (trust)
- Confidence: 0.6 (below default threshold)
- **Action**: Verify legal documents, check KvK registry
### 2. Unknown Enum Values
**Example**: `PRIVATE_FOUNDATION` (not in standard mappings)
- No predefined mapping available
- **Action**: Consult ISO 20275 CSV file, check country-specific guide
### 3. Country Unknown
**Example**: Record without `locations` field
- Cannot determine country-specific mapping
- Uses default fallback mapping
- **Action**: Add country code, re-run migration
### 4. Ambiguous Cases
**Example**: `NGO` could be various legal forms
- France: `BEWI` (association) vs `9T5S` (fondation)
- US: `QQQ0` (501(c)(3)) vs `7TPC` (trust)
- **Action**: Check organizational charter, verify legal form
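
The first three situations can be detected mechanically; the fourth (ambiguous cases) needs human judgment and is not covered here. A rough sketch of the flagging logic, assuming a missing mapping is represented by `confidence=None` and using a hypothetical `review_reason` helper (the script's real checks and their order may differ):

```python
from typing import Optional


def review_reason(
    country: Optional[str],
    confidence: Optional[float],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return why a record needs manual review, or None if it does not."""
    if confidence is None:
        # No predefined mapping exists for this enum value.
        return "unknown enum value"
    if not country:
        # A fallback mapping was used because no country could be determined.
        return "country unknown"
    if confidence < threshold:
        return "low confidence"
    return None


print(review_reason("NL", 0.6))   # TRUST in a Dutch context
print(review_reason("NL", None))  # e.g. PRIVATE_FOUNDATION
print(review_reason(None, 0.9))   # record without a country
print(review_reason("FR", 1.0))   # no review needed
```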
---
## Provenance Tracking
The script automatically adds migration metadata to `provenance.notes`:
```yaml
provenance:
  data_source: CSV_REGISTRY
  notes: |
    Extracted from Dutch ISIL registry
    [MIGRATION 2025-11-21T14:30:00Z] legal_form migrated: 'STICHTING' → 'V44D' (ISO 20275).
    Country: NL. Confidence: 1.0. Mapped to Stichting (Stichting)
```
**Timestamp**: ISO 8601 format with timezone
**Format**: `[MIGRATION <timestamp>] legal_form migrated: '<old>' → '<new>' (ISO 20275). Country: <country>. Confidence: <score>. <notes>`
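
A note in this format could be produced with a helper along the following lines. `migration_note` is a hypothetical name, and the sketch assumes timestamps are always written in UTC with a trailing `Z`, as in the example above.

```python
from datetime import datetime, timezone
from typing import Optional


def migration_note(
    old: str,
    new: str,
    country: str,
    confidence: float,
    notes: str,
    when: Optional[datetime] = None,
) -> str:
    """Build a provenance note in the documented [MIGRATION ...] format."""
    when = when or datetime.now(timezone.utc)
    # Normalise to UTC; a naive datetime is interpreted as local time.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f"[MIGRATION {stamp}] legal_form migrated: '{old}' → '{new}' (ISO 20275). "
        f"Country: {country}. Confidence: {confidence}. {notes}"
    )


print(
    migration_note(
        "STICHTING", "V44D", "NL", 1.0, "Mapped to Stichting (Stichting)",
        when=datetime(2025, 11, 21, 14, 30, tzinfo=timezone.utc),
    )
)
```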
---
## Validation After Migration
### 1. Schema Validation
```bash
# Install LinkML CLI
pip install linkml
# Validate migrated data
linkml-validate -s schemas/20251121/linkml/02_organization_observation_reconstruction.yaml \
data/nde_migrated/institutions.yaml
```
### 2. Manual Spot Checks
Review 10-20 random records:
```bash
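# Extract 20 random records (assumes a top-level YAML list; shuffling raw
# lines with shuf would split records apart)
python - <<'EOF' > sample_review.yaml
import random, yaml
records = yaml.safe_load(open("data/nde_migrated/institutions.yaml", encoding="utf-8"))
print(yaml.safe_dump(random.sample(records, min(20, len(records))), allow_unicode=True, sort_keys=False))
EOF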
# Check:
# - legal_form matches expected ISO 20275 code
# - legal_form matches country (e.g., V44D for NL institutions)
# - Provenance notes document migration
```
### 3. Cross-Reference with ISO 20275 Registry
```bash
# Collect the ELF codes used in legal_form fields (quoted or unquoted)
sed -nE 's/^[[:space:]-]*legal_form:[[:space:]]*"?([A-Z0-9]{4})"?.*/\1/p' \
  data/nde_migrated/institutions.yaml | sort -u > used_codes.txt
# Compare with ISO 20275 CSV (first column holds the code; strip any CSV quoting)
cut -d',' -f1 data/ontology/2023-09-28-elf-code-list-v1.5.csv | tr -d '"' | sort -u > valid_codes.txt
comm -23 used_codes.txt valid_codes.txt # Should be empty (all codes valid)
```
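
The same cross-check can be done in Python, which avoids depending on how the CSV quotes its fields. This is a sketch under two assumptions: the code column is named `ELF Code`, and codes appear on `legal_form:` lines in the YAML. The function names are hypothetical.

```python
import csv
import io
import re

# Matches "legal_form: V44D" (optionally quoted, optionally followed by a
# comment) but not "old_legal_form:" or "new_legal_form:".
LEGAL_FORM_RE = re.compile(
    r"""^[ \t-]*legal_form:[ \t]*["']?([A-Z0-9]{4})["']?[ \t]*(?:\#.*)?$""",
    re.MULTILINE,
)


def used_codes(yaml_text: str) -> set:
    """ELF codes found in legal_form fields of the migrated YAML."""
    return set(LEGAL_FORM_RE.findall(yaml_text))


def valid_codes(csv_text: str, column: str = "ELF Code") -> set:
    """ELF codes listed in the registry CSV (column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


def unknown_codes(yaml_text: str, csv_text: str) -> set:
    """Codes used in the data but missing from the registry; should be empty."""
    return used_codes(yaml_text) - valid_codes(csv_text)
```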
---
## Edge Cases and Solutions
### Case 1: Mixed Legal Forms in Single Record
**Problem**: Institution changed legal form over time
```yaml
legal_form: STICHTING # Founded as foundation
# ... but became government agency in 2010
```
**Solution**: Use `ChangeEvent` to track legal form changes
```yaml
legal_form: A0W7 # Current form (government agency)
change_history:
  - change_type: LEGAL_CHANGE
    event_date: "2010-01-01"
    event_description: "Converted from private stichting to public entity"
    old_legal_form: V44D
    new_legal_form: A0W7
```
### Case 2: Foreign Entities Operating in Country
**Problem**: French association operating branch in Netherlands
```yaml
legal_form: ASSOCIATION
locations:
  - city: Amsterdam
    country: NL
```
**Solution**: Use parent country, not branch location
```yaml
legal_form: BEWI # French association (parent entity)
locations:
  - city: Paris
    country: FR
    is_headquarters: true
  - city: Amsterdam
    country: NL
    is_branch: true
```
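
The "parent country, not branch location" rule could be applied when choosing which country table to use. A minimal sketch, assuming locations are dictionaries with the `country` and `is_headquarters` keys shown above; `mapping_country` is a hypothetical helper, and returning `None` for ambiguous records (so they fall into manual review) is a design choice of this sketch.

```python
from typing import Optional


def mapping_country(locations: list) -> Optional[str]:
    """Pick the country whose legal-form table should be applied."""
    # Prefer the location explicitly marked as headquarters.
    for loc in locations:
        if loc.get("is_headquarters"):
            return loc.get("country")
    # Otherwise accept the country only if all locations agree on it.
    countries = {loc["country"] for loc in locations if loc.get("country")}
    if len(countries) == 1:
        return countries.pop()
    return None


locations = [
    {"city": "Paris", "country": "FR", "is_headquarters": True},
    {"city": "Amsterdam", "country": "NL", "is_branch": True},
]
print(mapping_country(locations))
```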
### Case 3: Multiple Legal Forms (Group Structure)
**Problem**: Holding company with multiple subsidiaries
```yaml
legal_name: Museum Group Holding
legal_form: ??? # Parent is BV, subsidiary is stichting
```
**Solution**: Create separate records, link via `parent_organization`
```yaml
# Parent
- id: https://w3id.org/heritage/org/museum-holding
  legal_name: Museum Group Holding BV
  legal_form: 54M6 # Dutch BV
# Subsidiary
- id: https://w3id.org/heritage/org/museum-foundation
  legal_name: Stichting Museum X
  legal_form: V44D # Dutch stichting
  parent_organization: https://w3id.org/heritage/org/museum-holding
```
---
## Troubleshooting
### Error: "ELF codes file not found"
**Cause**: Missing ISO 20275 CSV file
**Solution**:
```bash
# Download from GLEIF (if this saves an HTML page instead of a CSV,
# download the CSV manually from that page)
wget -O data/ontology/2023-09-28-elf-code-list-v1.5.csv \
https://www.gleif.org/en/about-lei/code-lists/iso-20275-entity-legal-forms-code-list/download-the-code-list
# Or specify custom path
python scripts/migrate_legal_form_to_iso20275.py \
--elf-codes /path/to/elf-codes.csv \
...
```
### Error: "Code 'XXXX' not found in ISO 20275 registry"
**Cause**: Invalid or non-existent ELF code
**Solutions**:
1. Check for typos (codes are case-sensitive)
2. Verify code is in latest ISO 20275 version
3. Check if code is inactive (`ELF Status ACTV/INAC` column)
4. Use country-specific guide to find correct code
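
The first three checks can be sketched as one function. It assumes the registry has been loaded into a dictionary from code to status (`ACTV` or `INAC`, following the column name mentioned above); `check_code` and the returned labels are hypothetical.

```python
import re

# ELF codes are four upper-case alphanumeric characters.
ELF_CODE_RE = re.compile(r"[A-Z0-9]{4}")


def check_code(code: str, registry: dict) -> str:
    """Classify a code as 'malformed', 'not found', 'inactive', or 'ok'."""
    if not ELF_CODE_RE.fullmatch(code):
        return "malformed"  # typo or wrong case, e.g. "v44d"
    status = registry.get(code)
    if status is None:
        return "not found"  # may be missing from this registry version
    if status != "ACTV":
        return "inactive"
    return "ok"


# Illustrative registry: the statuses here are made up for the example.
registry = {"V44D": "ACTV", "OLD1": "INAC"}
for code in ("V44D", "v44d", "XXXX", "OLD1"):
    print(code, check_code(code, registry))
```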
### Warning: "Low confidence mapping"
**Not an error**: the script has flagged the record for manual review instead of migrating it automatically.
**Actions**:
1. Review migration report for flagged records
2. Consult country-specific guide (`elf_codes/{country}/README.md`)
3. Verify with legal documents (statutes, KvK registry)
4. Update record manually if needed
---
## Country-Specific Guides
For detailed legal form mappings:
- **Netherlands**: `/schemas/20251121/elf_codes/netherlands/README.md` (30+ codes)
- **France**: `/schemas/20251121/elf_codes/france/README.md` (240+ codes)
- **Germany**: `/schemas/20251121/elf_codes/germany/README.md` (30+ codes)
- **United Kingdom**: `/schemas/20251121/elf_codes/uk/README.md` (40+ codes)
- **United States**: `/schemas/20251121/elf_codes/usa/README.md` (732+ codes)
---
## Testing the Migration
Run unit tests:
```bash
pytest tests/test_legal_form_migration.py -v
```
Test coverage:
- ELF code validation
- Country-specific mappings
- Confidence scoring
- Manual review flagging
- Provenance tracking
- Edge case handling
---
## Script Options Reference
```
usage: migrate_legal_form_to_iso20275.py [-h] [--input INPUT] [--output OUTPUT]
                                         [--input-dir INPUT_DIR] [--output-dir OUTPUT_DIR]
                                         [--country COUNTRY] [--elf-codes ELF_CODES]
                                         [--confidence-threshold CONFIDENCE_THRESHOLD]
                                         [--dry-run] [--report-only]
                                         [--report-path REPORT_PATH]

options:
  --input INPUT             Input YAML file
  --output OUTPUT           Output YAML file
  --input-dir INPUT_DIR     Input directory (batch mode)
  --output-dir OUTPUT_DIR   Output directory (batch mode)
  --country COUNTRY         Default country code (ISO 3166-1, e.g., NL, FR, DE)
  --elf-codes ELF_CODES     Path to ISO 20275 CSV file
  --confidence-threshold CONFIDENCE_THRESHOLD
                            Minimum confidence for auto-migration (0.0-1.0, default: 0.7)
  --dry-run                 Preview changes without writing files
  --report-only             Generate report only, no file output
  --report-path REPORT_PATH
                            Path to save migration report (default: migration_report.txt)
```
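
For orientation, the options above correspond to an `argparse` setup roughly like the following. This is a reconstruction from the usage text, not the script's actual source; the `float` type for the threshold and the `store_true` flags are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented command-line interface (illustrative only)."""
    parser = argparse.ArgumentParser(prog="migrate_legal_form_to_iso20275.py")
    parser.add_argument("--input", help="Input YAML file")
    parser.add_argument("--output", help="Output YAML file")
    parser.add_argument("--input-dir", help="Input directory (batch mode)")
    parser.add_argument("--output-dir", help="Output directory (batch mode)")
    parser.add_argument("--country", help="Default country code (ISO 3166-1)")
    parser.add_argument("--elf-codes", help="Path to ISO 20275 CSV file")
    parser.add_argument(
        "--confidence-threshold", type=float, default=0.7,
        help="Minimum confidence for auto-migration (0.0-1.0)",
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview changes without writing files")
    parser.add_argument("--report-only", action="store_true",
                        help="Generate report only, no file output")
    parser.add_argument("--report-path", default="migration_report.txt",
                        help="Path to save migration report")
    return parser


args = build_parser().parse_args(["--input", "institutions.yaml", "--dry-run"])
print(args.confidence_threshold, args.dry_run, args.report_path)
```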
---
## Next Steps After Migration
1. **Validate migrated data**
- Run LinkML schema validation
- Spot check 10-20 records manually
- Cross-reference with ISO 20275 registry
2. **Regenerate RDF files**
- Convert migrated YAML to RDF
- Validate triples with ontology alignment
- Update RDF_GENERATION_SUMMARY.md
3. **Update documentation**
- Document migration statistics
- Note any manual corrections made
- Update schema change log
4. **Commit changes**
- Commit migrated data files
- Commit migration report
- Tag release with migration completion
---
## References
- **ISO 20275 Standard**: https://www.gleif.org/en/about-lei/code-lists/iso-20275-entity-legal-forms-code-list
- **GLEIF Registry**: https://www.gleif.org
- **Schema Documentation**: `/schemas/20251121/linkml/02_organization_observation_reconstruction.yaml`
- **Country Guides**: `/schemas/20251121/elf_codes/{country}/README.md`
- **Migration Script**: `/scripts/migrate_legal_form_to_iso20275.py`
- **Unit Tests**: `/tests/test_legal_form_migration.py`
---
**Last Updated**: 2025-11-21
**Maintainer**: GLAM Ontology Project
**Version**: 1.0.0