Rule 36: Original Language Preservation in Web Content Extraction
Status: ACTIVE
Created: 2025-12-31
Applies to: All web content extraction, mission statements, descriptions, organizational information
Core Principle
ALL extracted text content MUST be preserved in its original source language. Translation is STRICTLY FORBIDDEN during extraction.
This rule applies to:
- Mission statements
- Vision statements
- Organizational descriptions
- About us content
- Historical narratives
- Collection descriptions
- Any textual content extracted from institutional websites
Rationale
1. Emic Authenticity
The institution's own voice and terminology must be preserved. A Dutch museum's mission statement in Dutch is the authoritative version; any translation is a derivative.
2. Semantic Fidelity
Translation introduces interpretation and potential distortion:
- "Erfgoed" (Dutch) has connotations beyond "heritage"
- "Patrimonio" (Spanish) carries cultural weight lost in translation
- Institutional jargon may not have direct equivalents
3. Provenance Integrity
If content is translated during extraction:
- The source_url no longer matches the stored text
- XPath provenance becomes invalid
- Content hashes cannot verify authenticity
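The point about content hashes can be illustrated with a minimal sketch. It assumes the pipeline records a SHA-256 digest of the source text at extraction time; the function names and the whitespace/Unicode normalization step are hypothetical choices for illustration, not something this rule prescribes.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """SHA-256 of the text after NFC normalization and whitespace collapsing."""
    normalized = unicodedata.normalize("NFC", " ".join(text.split()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def matches_source(stored_text: str, source_hash: str) -> bool:
    """True only if the stored text hashes to the digest recorded from the source."""
    return content_hash(stored_text) == source_hash


# Digest recorded from the page at extraction time (illustrative value).
source_text = "Het Rijksmuseum is het museum van Nederland."
source_hash = content_hash(source_text)

verbatim_ok = matches_source("Het Rijksmuseum is het museum van Nederland.", source_hash)
translated_ok = matches_source("The Rijksmuseum is the museum of the Netherlands.", source_hash)
```

A verbatim copy verifies against the recorded digest, while a translated copy cannot, which is why translation breaks hash-based provenance.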
4. Downstream Flexibility
Original language content allows:
- Users to request translations in their preferred language
- Machine translation with disclosed provenance
- Linguistic analysis and terminology extraction
- Multilingual search indexing
Implementation Requirements
LLM Extraction Prompts
All LLM prompts for content extraction MUST include explicit no-translation instructions:
CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT translate any content.
Preserve the original language (Dutch, Spanish, German, French, etc.).
If the source is in Dutch, the output must be in Dutch.
If the source is in Spanish, the output must be in Spanish.
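One way to make sure the instruction is never forgotten is to build every prompt through a single helper. The sketch below is an assumption about how such a helper might look; `build_extraction_prompt` and `has_no_translation_instruction` are hypothetical names, and the guard is a plain substring check rather than anything robust.

```python
NO_TRANSLATION_BLOCK = (
    "CRITICAL: Extract the text EXACTLY as it appears on the webpage.\n"
    "DO NOT translate any content.\n"
    "Preserve the original language (Dutch, Spanish, German, French, etc.)."
)


def build_extraction_prompt(task: str) -> str:
    """Prepend the mandatory no-translation block to a task description."""
    return f"{NO_TRANSLATION_BLOCK}\n\n{task.strip()}"


def has_no_translation_instruction(prompt: str) -> bool:
    """Cheap guard to run before a prompt is sent to the model."""
    return "DO NOT translate" in prompt


prompt = build_extraction_prompt("Extract the mission statement from the HTML below.")
```

Running the guard just before each LLM call gives a simple way to confirm that the prompt included the no-translation instruction.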
Language Detection and Storage
Every extracted text field MUST include:
mission_statement:
  text: "Het Rijksmuseum is het museum van Nederland..." # Original Dutch
  language: "nl" # ISO 639-1 code detected from GHCID or content
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  extracted_verbatim: true # Confirms no translation occurred
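A required-field check for such a record might look like the sketch below. It assumes the record has already been loaded into a plain dict; `missing_or_invalid` is a hypothetical helper, and the language test only checks the two-letter shape of the code, not membership in the ISO 639-1 list.

```python
import re

REQUIRED_FIELDS = ("text", "language", "source_url", "extracted_verbatim")
TWO_LETTER_CODE = re.compile(r"^[a-z]{2}$")


def missing_or_invalid(field: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is storable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if name not in field]
    if "text" in field and not str(field["text"]).strip():
        problems.append("text is empty")
    if "language" in field and not TWO_LETTER_CODE.match(str(field["language"])):
        problems.append("language is not a two-letter ISO 639-1 style code")
    if "extracted_verbatim" in field and field["extracted_verbatim"] is not True:
        problems.append("extracted_verbatim must be true")
    return problems


record = {
    "text": "Het Rijksmuseum is het museum van Nederland...",
    "language": "nl",
    "source_url": "https://www.rijksmuseum.nl/nl/over-ons",
    "extracted_verbatim": True,
}
problems = missing_or_invalid(record)
```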
Validation Checks
Before storing extracted content:
- Verify language field matches expected language from GHCID country code
- Flag mismatches for review (may indicate English fallback pages)
- Never silently translate to "normalize" content
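These checks could be sketched as follows. The country-to-language table is an assumption built only from the country groupings listed in this rule and is deliberately partial; `check_language` and `LanguageCheck` are hypothetical names. A mismatch passes only when a language_note documents it, and nothing in the function alters the text.

```python
from dataclasses import dataclass
from typing import Optional

# Partial mapping taken from the country groupings in this rule; extend as needed.
EXPECTED_LANGUAGES = {
    "NL": {"nl"}, "SR": {"nl"}, "BE": {"nl", "fr"},
    "AR": {"es"}, "CL": {"es"}, "MX": {"es"}, "ES": {"es"},
    "BR": {"pt"}, "PT": {"pt"},
    "DE": {"de"}, "AT": {"de"}, "CH": {"de", "fr"},
    "FR": {"fr"}, "CA": {"fr"},
}


@dataclass
class LanguageCheck:
    ok: bool
    reason: str


def check_language(ghcid: str, language: str, language_note: Optional[str] = None) -> LanguageCheck:
    """Compare the stored language code with what the GHCID country code suggests."""
    country = ghcid.split("-", 1)[0].upper()
    expected = EXPECTED_LANGUAGES.get(country)
    if expected is None:
        return LanguageCheck(False, f"no expected language known for {country}; review manually")
    if language in expected:
        return LanguageCheck(True, "language matches GHCID country")
    if language_note:
        return LanguageCheck(True, "mismatch documented in language_note")
    return LanguageCheck(False, f"{language!r} not expected for {country}; flag for review")


dutch = check_language("NL-NH-AMS-M-RM", "nl")
undocumented = check_language("NL-NH-AMS-M-RM", "en")
documented = check_language("NL-NH-AMS-M-VGM", "en", "Extracted from English version of website")
```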
Anti-Patterns (FORBIDDEN)
Translation During Extraction
# WRONG - Translated from Dutch to English
mission_statement:
  text: "The Rijksmuseum is the museum of the Netherlands..."
  language: "en"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons" # Dutch URL!
Language Mismatch Without Documentation
# WRONG - No explanation for language mismatch
mission_statement:
  text: "We preserve cultural heritage..." # English text
  language: "nl" # Claims Dutch
Mixing Languages
# WRONG - Partial translation
mission_statement:
  text: "Het museum has the goal to preserve heritage..."
Correct Implementation
Dutch Institution (NL)
ghcid: NL-NH-AMS-M-RM
mission_statement:
  text: |
    Het Rijksmuseum is het museum van Nederland. Kunst en geschiedenis
    nemen hier een bijzondere plek in. Al meer dan 200 jaar vertelt
    het Rijksmuseum het verhaal van Nederland.
  language: "nl"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  source_section: "Over ons"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"
Spanish Institution (AR)
ghcid: AR-C-BUE-M-MALBA
mission_statement:
  text: |
    MALBA tiene como misión coleccionar, preservar, investigar y difundir
    el arte latinoamericano desde principios del siglo XX hasta la actualidad.
  language: "es"
  source_url: "https://www.malba.org.ar/sobre-malba/"
  source_section: "Sobre MALBA"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"
English Fallback Page (Documented)
ghcid: NL-NH-AMS-M-VGM
mission_statement:
  text: |
    The Van Gogh Museum makes the life and work of Vincent van Gogh
    and the art of his time accessible to as many people as possible.
  language: "en"
  source_url: "https://www.vangoghmuseum.nl/en/about/organisation"
  source_section: "About - Organisation"
  extracted_verbatim: true
  language_note: "Extracted from English version of website; Dutch version not available at /nl/over-ons"
  extraction_timestamp: "2025-12-31T10:00:00Z"
Language-Specific LLM Prompts
Dutch (NL, BE, SR)
Je bent een expert in het analyseren van websites van erfgoedinstellingen.
KRITIEK: Extraheer de tekst EXACT zoals deze op de webpagina staat.
VERTAAL NIET. Behoud de originele Nederlandse tekst.
Als de bron in het Nederlands is, moet de output in het Nederlands zijn.
Spanish (AR, CL, MX, ES, etc.)
Eres un experto en analizar sitios web de instituciones patrimoniales.
CRÍTICO: Extrae el texto EXACTAMENTE como aparece en la página web.
NO TRADUZCAS. Preserva el texto original en español.
Si la fuente está en español, la salida debe estar en español.
Portuguese (BR, PT)
Você é um especialista em analisar sites de instituições patrimoniais.
CRÍTICO: Extraia o texto EXATAMENTE como aparece na página.
NÃO TRADUZA. Preserve o texto original em português.
Se a fonte está em português, a saída deve estar em português.
German (DE, AT, CH)
Sie sind ein Experte für die Analyse von Websites von Kulturerbe-Institutionen.
KRITISCH: Extrahieren Sie den Text GENAU so, wie er auf der Webseite steht.
NICHT ÜBERSETZEN. Bewahren Sie den deutschen Originaltext.
Wenn die Quelle auf Deutsch ist, muss die Ausgabe auf Deutsch sein.
French (FR, BE, CH, CA)
Vous êtes un expert dans l'analyse des sites web d'institutions patrimoniales.
CRITIQUE: Extrayez le texte EXACTEMENT tel qu'il apparaît sur la page.
NE TRADUISEZ PAS. Préservez le texte original en français.
Si la source est en français, la sortie doit être en français.
Universal Fallback (English)
You are an expert in analyzing heritage institution websites.
CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT TRANSLATE. Preserve the original language text.
If the source is in Dutch, output must be in Dutch.
If the source is in Spanish, output must be in Spanish.
If the source is in German, output must be in German.
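Choosing among these prompts can be sketched as a lookup with the English text as fallback. The sketch keys on language rather than country, because BE and CH appear under more than one language in the groupings above. The dictionary holds only the no-translation line of a few prompts for brevity, and `select_prompt_line` is a hypothetical helper name.

```python
FALLBACK_LANGUAGE = "en"

# Only the no-translation line of each prompt is shown; other languages omitted for brevity.
NO_TRANSLATION_LINES = {
    "nl": "VERTAAL NIET. Behoud de originele Nederlandse tekst.",
    "es": "NO TRADUZCAS. Preserva el texto original en español.",
    "de": "NICHT ÜBERSETZEN. Bewahren Sie den deutschen Originaltext.",
    "en": "DO NOT TRANSLATE. Preserve the original language text.",
}


def select_prompt_line(language_tag: str) -> str:
    """Pick the no-translation line for a tag such as 'nl', 'nl-BE' or 'de_CH'."""
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    return NO_TRANSLATION_LINES.get(primary, NO_TRANSLATION_LINES[FALLBACK_LANGUAGE])


belgian_dutch = select_prompt_line("nl-BE")
swiss_german = select_prompt_line("de_CH")
unknown = select_prompt_line("ja")
```

Any language without a dedicated prompt falls through to the universal English text, which still forbids translation.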
Handling Multilingual Websites
Some institutions publish the same content in several languages on one page. Store the original as primary and keep institution-provided translations alongside it:
mission_statement:
  primary:
    text: "Het museum bewaart en toont kunst..."
    language: "nl"
  translations:
    - text: "The museum preserves and displays art..."
      language: "en"
      is_official_translation: true # Institution-provided, not AI-translated
Verification Checklist
Before committing extracted content:
- Text is in original source language
- language field matches content language
- language field matches expected language from GHCID (or mismatch documented)
- source_url matches the actual page content was extracted from
- extracted_verbatim: true is set
- No mixed-language content (unless source is genuinely multilingual)
- LLM prompt included no-translation instruction
References
- AGENTS.md Rule 6: WebObservation Claims MUST Have XPath Provenance
- AGENTS.md: Emic Name First-Letter Protocol (preserves native language names)
- ISO 639-1: Language codes
- W3C: Content-Language HTTP header
Summary
| Action | Status |
|---|---|
| Translate during extraction | FORBIDDEN |
| Store original language text | REQUIRED |
| Document language in metadata | REQUIRED |
| Flag language mismatches | REQUIRED |
| Provide translations on request | ALLOWED (downstream) |