# Rule 36: Original Language Preservation in Web Content Extraction

**Status**: ACTIVE
**Created**: 2025-12-31
**Applies to**: All web content extraction, mission statements, descriptions, organizational information

---

## Core Principle

**ALL extracted text content MUST be preserved in its original source language. Translation is STRICTLY FORBIDDEN during extraction.**

This rule applies to:
- Mission statements
- Vision statements
- Organizational descriptions
- About us content
- Historical narratives
- Collection descriptions
- Any textual content extracted from institutional websites

---

## Rationale

### 1. Emic Authenticity
The institution's own voice and terminology must be preserved. A Dutch museum's mission statement in Dutch is the **authoritative version**; any translation is a derivative.

### 2. Semantic Fidelity
Translation introduces interpretation and potential distortion:
- "Erfgoed" (Dutch) has connotations beyond "heritage"
- "Patrimonio" (Spanish) carries cultural weight lost in translation
- Institutional jargon may not have direct equivalents

### 3. Provenance Integrity
If content is translated during extraction:
- The stored text no longer matches what is published at `source_url`
- XPath provenance becomes invalid, because the referenced node contains different text
- Content hashes of the stored text no longer match the source, so authenticity cannot be verified
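
The hash argument can be illustrated with a minimal sketch. It assumes SHA-256 over NFC-normalized, whitespace-trimmed UTF-8 text; this rule does not prescribe a hashing scheme, and `content_hash` is a hypothetical helper name.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the text.

    Assumption: text is trimmed and NFC-normalized before hashing so that
    equivalent Unicode encodings of the same characters hash identically.
    """
    normalized = unicodedata.normalize("NFC", text.strip())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Text as published on the source page, a verbatim copy, and a translation.
source_text = "Het Rijksmuseum is het museum van Nederland."
stored_verbatim = "Het Rijksmuseum is het museum van Nederland."
stored_translated = "The Rijksmuseum is the museum of the Netherlands."

# A verbatim copy verifies against the source; a translation cannot.
verbatim_ok = content_hash(stored_verbatim) == content_hash(source_text)
translated_ok = content_hash(stored_translated) == content_hash(source_text)
print(verbatim_ok, translated_ok)
```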

### 4. Downstream Flexibility
Original language content allows:
- Users to request translations in their preferred language
- Machine translation with disclosed provenance
- Linguistic analysis and terminology extraction
- Multilingual search indexing

---

## Implementation Requirements

### LLM Extraction Prompts

All LLM prompts for content extraction MUST include explicit no-translation instructions:

```
CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT translate any content.
Preserve the original language (Dutch, Spanish, German, French, etc.).
If the source is in Dutch, the output must be in Dutch.
If the source is in Spanish, the output must be in Spanish.
```

### Language Detection and Storage

Every extracted text field MUST include:

```yaml
mission_statement:
  text: "Het Rijksmuseum is het museum van Nederland..."  # Original Dutch
  language: "nl"  # ISO 639-1 code detected from GHCID or content
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  extracted_verbatim: true  # Confirms no translation occurred
```

### Validation Checks

Before storing extracted content:
1. Verify that the `language` field matches the language expected from the GHCID country code
2. Flag mismatches for review (they may indicate English fallback pages)
3. Never silently translate to "normalize" content
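
The checks above could be sketched roughly as follows. The country-to-language table, the helper names, and the decision to treat a `language_note` as sufficient documentation are assumptions for illustration; the sketch also assumes the GHCID starts with a two-letter country code, as in the examples in this rule. The function only flags and never alters the text.

```python
from typing import Optional, Set

# Hypothetical mapping from GHCID country code to expected ISO 639-1 codes.
EXPECTED_LANGUAGES = {
    "NL": {"nl"},
    "AR": {"es"},
    "BE": {"nl", "fr"},
    "CH": {"de", "fr"},
}


def expected_languages(ghcid: str) -> Set[str]:
    """Look up expected languages from the first GHCID segment (assumed country)."""
    country = ghcid.split("-", 1)[0].upper()
    return EXPECTED_LANGUAGES.get(country, set())


def check_language(ghcid: str, field: dict) -> Optional[str]:
    """Return a review flag message, or None when nothing needs review."""
    expected = expected_languages(ghcid)
    language = field.get("language")
    if not expected:
        return f"no expected language configured for {ghcid}"
    if language in expected:
        return None
    if field.get("language_note"):
        return None  # mismatch is documented, e.g. an English fallback page
    return f"language {language!r} not in expected {sorted(expected)} for {ghcid}"


# An undocumented English text for a Dutch institution is flagged for review.
flag = check_language(
    "NL-NH-AMS-M-RM",
    {"text": "The Rijksmuseum is the museum of the Netherlands...", "language": "en"},
)
print(flag)
```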

---

## Anti-Patterns (FORBIDDEN)

### Translation During Extraction

```yaml
# WRONG - Translated from Dutch to English
mission_statement:
  text: "The Rijksmuseum is the museum of the Netherlands..."
  language: "en"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"  # Dutch URL!
```

### Language Mismatch Without Documentation

```yaml
# WRONG - No explanation for language mismatch
mission_statement:
  text: "We preserve cultural heritage..."  # English text
  language: "nl"  # Claims Dutch
```

### Mixing Languages

```yaml
# WRONG - Partial translation
mission_statement:
  text: "Het museum has the goal to preserve heritage..."
```

---

## Correct Implementation

### Dutch Institution (NL)

```yaml
ghcid: NL-NH-AMS-M-RM
mission_statement:
  text: |
    Het Rijksmuseum is het museum van Nederland. Kunst en geschiedenis
    nemen hier een bijzondere plek in. Al meer dan 200 jaar vertelt
    het Rijksmuseum het verhaal van Nederland.
  language: "nl"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  source_section: "Over ons"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"
```

### Spanish Institution (AR)

```yaml
ghcid: AR-C-BUE-M-MALBA
mission_statement:
  text: |
    MALBA tiene como misión coleccionar, preservar, investigar y difundir
    el arte latinoamericano desde principios del siglo XX hasta la actualidad.
  language: "es"
  source_url: "https://www.malba.org.ar/sobre-malba/"
  source_section: "Sobre MALBA"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"
```

### English Fallback Page (Documented)

```yaml
ghcid: NL-NH-AMS-M-VGM
mission_statement:
  text: |
    The Van Gogh Museum makes the life and work of Vincent van Gogh
    and the art of his time accessible to as many people as possible.
  language: "en"
  source_url: "https://www.vangoghmuseum.nl/en/about/organisation"
  source_section: "About - Organisation"
  extracted_verbatim: true
  language_note: "Extracted from English version of website; Dutch version not available at /nl/over-ons"
  extraction_timestamp: "2025-12-31T10:00:00Z"
```

---

## Language-Specific LLM Prompts

### Dutch (NL, BE, SR)

```
Je bent een expert in het analyseren van websites van erfgoedinstellingen.

KRITIEK: Extraheer de tekst EXACT zoals deze op de webpagina staat.
VERTAAL NIET. Behoud de originele Nederlandse tekst.
Als de bron in het Nederlands is, moet de output in het Nederlands zijn.
```

### Spanish (AR, CL, MX, ES, etc.)

```
Eres un experto en analizar sitios web de instituciones patrimoniales.

CRÍTICO: Extrae el texto EXACTAMENTE como aparece en la página web.
NO TRADUZCAS. Preserva el texto original en español.
Si la fuente está en español, la salida debe estar en español.
```

### Portuguese (BR, PT)

```
Você é um especialista em analisar sites de instituições patrimoniais.

CRÍTICO: Extraia o texto EXATAMENTE como aparece na página.
NÃO TRADUZA. Preserve o texto original em português.
Se a fonte está em português, a saída deve estar em português.
```

### German (DE, AT, CH)

```
Sie sind ein Experte für die Analyse von Websites von Kulturerbe-Institutionen.

KRITISCH: Extrahieren Sie den Text GENAU so, wie er auf der Webseite steht.
NICHT ÜBERSETZEN. Bewahren Sie den deutschen Originaltext.
Wenn die Quelle auf Deutsch ist, muss die Ausgabe auf Deutsch sein.
```

### French (FR, BE, CH, CA)

```
Vous êtes un expert dans l'analyse des sites web d'institutions patrimoniales.

CRITIQUE: Extrayez le texte EXACTEMENT tel qu'il apparaît sur la page.
NE TRADUISEZ PAS. Préservez le texte original en français.
Si la source est en français, la sortie doit être en français.
```

### Universal Fallback (English)

```
You are an expert in analyzing heritage institution websites.

CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT TRANSLATE. Preserve the original language text.
If the source is in Dutch, output must be in Dutch.
If the source is in Spanish, output must be in Spanish.
If the source is in German, output must be in German.
```
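
One way to choose between these prompts is a lookup keyed by language code that falls back to the universal prompt. The sketch below is only an illustration: `PROMPTS`, `UNIVERSAL_PROMPT`, and `select_prompt` are hypothetical names, only the Dutch entry is filled in, and keying by language rather than by country is an assumption made because countries such as BE and CH appear under more than one heading.

```python
from typing import Optional

# Prompt texts copied from this rule; other languages would be added the same way.
PROMPTS = {
    "nl": (
        "Je bent een expert in het analyseren van websites van erfgoedinstellingen.\n\n"
        "KRITIEK: Extraheer de tekst EXACT zoals deze op de webpagina staat.\n"
        "VERTAAL NIET. Behoud de originele Nederlandse tekst.\n"
        "Als de bron in het Nederlands is, moet de output in het Nederlands zijn."
    ),
}

UNIVERSAL_PROMPT = (
    "You are an expert in analyzing heritage institution websites.\n\n"
    "CRITICAL: Extract the text EXACTLY as it appears on the webpage.\n"
    "DO NOT TRANSLATE. Preserve the original language text.\n"
    "If the source is in Dutch, output must be in Dutch.\n"
    "If the source is in Spanish, output must be in Spanish.\n"
    "If the source is in German, output must be in German."
)


def select_prompt(language: Optional[str]) -> str:
    """Return the prompt for a language tag, or the universal fallback."""
    if language is None:
        return UNIVERSAL_PROMPT
    # Accept tags such as "nl-NL", "nl_BE" or "NL" by keeping the primary subtag.
    primary = language.strip().lower().replace("_", "-").split("-", 1)[0]
    return PROMPTS.get(primary, UNIVERSAL_PROMPT)


print(select_prompt("nl-BE").splitlines()[0])
print(select_prompt("ja").splitlines()[0])
```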

---

## Handling Multilingual Websites

Some institutions publish the same content in multiple languages on one page. Store each version verbatim and mark institution-provided translations:

```yaml
mission_statement:
  primary:
    text: "Het museum bewaart en toont kunst..."
    language: "nl"
  translations:
    - text: "The museum preserves and displays art..."
      language: "en"
      is_official_translation: true  # Institution-provided, not AI-translated
```
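
The structure above might be assembled as in the following sketch. The class and function names are hypothetical, and the sketch assumes the caller already has the verbatim text for each language found on the page, keyed by ISO 639-1 code. Every non-primary version is marked as an official translation because it was published by the institution rather than generated during extraction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextVariant:
    text: str
    language: str
    is_official_translation: bool = False


@dataclass
class MultilingualField:
    primary: TextVariant
    translations: List[TextVariant] = field(default_factory=list)


def build_field(variants: Dict[str, str], primary_language: str) -> MultilingualField:
    """Build a multilingual field from verbatim page texts keyed by language code."""
    if primary_language not in variants:
        raise ValueError(f"no text found for primary language {primary_language!r}")
    primary = TextVariant(variants[primary_language], primary_language)
    translations = [
        TextVariant(text, language, is_official_translation=True)
        for language, text in variants.items()
        if language != primary_language
    ]
    return MultilingualField(primary, translations)


record = build_field(
    {
        "nl": "Het museum bewaart en toont kunst...",
        "en": "The museum preserves and displays art...",
    },
    primary_language="nl",
)
print(record.primary.language, [t.language for t in record.translations])
```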

---

## Verification Checklist

Before committing extracted content:

- [ ] Text is in original source language
- [ ] `language` field matches content language
- [ ] `language` field matches expected language from GHCID (or mismatch documented)
- [ ] `source_url` matches the page the content was actually extracted from
- [ ] `extracted_verbatim: true` is set
- [ ] No mixed-language content (unless source is genuinely multilingual)
- [ ] LLM prompt included no-translation instruction
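
Some of these items can be checked mechanically before commit. The sketch below covers only the checks that need no language detection (required fields, the verbatim flag, and documented mismatches); whether the text really is in its source language still needs a detector or human review. `checklist_failures` and its `expected_language` parameter are hypothetical names.

```python
from typing import List, Optional


def checklist_failures(record: dict, expected_language: Optional[str] = None) -> List[str]:
    """Return the mechanically checkable checklist items that a record fails."""
    failures = []
    if not str(record.get("text", "")).strip():
        failures.append("text is empty")
    if not record.get("language"):
        failures.append("language is missing")
    if not record.get("source_url"):
        failures.append("source_url is missing")
    if record.get("extracted_verbatim") is not True:
        failures.append("extracted_verbatim is not true")
    mismatch = expected_language is not None and record.get("language") != expected_language
    if mismatch and not record.get("language_note"):
        failures.append("language mismatch is undocumented")
    return failures


# The documented English fallback example from this rule passes.
documented = {
    "text": "The Van Gogh Museum makes the life and work of Vincent van Gogh...",
    "language": "en",
    "source_url": "https://www.vangoghmuseum.nl/en/about/organisation",
    "extracted_verbatim": True,
    "language_note": "Extracted from English version of website",
}
# An English text with no URL, verbatim flag, or note fails several items.
undocumented = {"text": "We preserve cultural heritage...", "language": "en"}

print(checklist_failures(documented, expected_language="nl"))
print(checklist_failures(undocumented, expected_language="nl"))
```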

---

## References

- AGENTS.md Rule 6: WebObservation Claims MUST Have XPath Provenance
- AGENTS.md: Emic Name First-Letter Protocol (preserves native language names)
- ISO 639-1: Language codes
- W3C: Content-Language HTTP header

---

## Summary

| Action | Status |
|--------|--------|
| Translate during extraction | FORBIDDEN |
| Store original language text | REQUIRED |
| Document language in metadata | REQUIRED |
| Flag language mismatches | REQUIRED |
| Provide translations on request | ALLOWED (downstream) |