# Rule 36: Original Language Preservation in Web Content Extraction

**Status**: ACTIVE

**Created**: 2025-12-31

**Applies to**: All web content extraction, mission statements, descriptions, organizational information

---

## Core Principle

**ALL extracted text content MUST be preserved in its original source language. Translation is STRICTLY FORBIDDEN during extraction.**

This rule applies to:

- Mission statements
- Vision statements
- Organizational descriptions
- About us content
- Historical narratives
- Collection descriptions
- Any textual content extracted from institutional websites

---

## Rationale

### 1. Emic Authenticity

The institution's own voice and terminology must be preserved. A Dutch museum's mission statement in Dutch is the **authoritative version** - any translation is a derivative.

### 2. Semantic Fidelity

Translation introduces interpretation and potential distortion:

- "Erfgoed" (Dutch) has connotations beyond "heritage"
- "Patrimonio" (Spanish) carries cultural weight lost in translation
- Institutional jargon may not have direct equivalents

### 3. Provenance Integrity

If content is translated during extraction:

- The `source_url` no longer matches the stored text
- XPath provenance becomes invalid
- Content hashes cannot verify authenticity

### 4. Downstream Flexibility

Original language content allows:

- Users to request translations in their preferred language
- Machine translation with disclosed provenance
- Linguistic analysis and terminology extraction
- Multilingual search indexing

---

## Implementation Requirements

### LLM Extraction Prompts

All LLM prompts for content extraction MUST include explicit no-translation instructions:

```
CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT translate any content. Preserve the original language (Dutch, Spanish, German, French, etc.).

If the source is in Dutch, the output must be in Dutch.
If the source is in Spanish, the output must be in Spanish.
```

### Language Detection and Storage

Every extracted text field MUST include:

```yaml
mission_statement:
  text: "Het Rijksmuseum is het museum van Nederland..."  # Original Dutch
  language: "nl"  # ISO 639-1 code detected from GHCID or content
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  extracted_verbatim: true  # Confirms no translation occurred
```

### Validation Checks

Before storing extracted content:

1. Verify that the `language` field matches the language expected from the GHCID country code
2. Flag mismatches for review (a mismatch may indicate an English fallback page)
3. Never silently translate to "normalize" content

---

## Anti-Patterns (FORBIDDEN)

### Translation During Extraction

```yaml
# WRONG - Translated from Dutch to English
mission_statement:
  text: "The Rijksmuseum is the museum of the Netherlands..."
  language: "en"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"  # Dutch URL!
```

### Language Mismatch Without Documentation

```yaml
# WRONG - No explanation for language mismatch
mission_statement:
  text: "We preserve cultural heritage..."  # English text
  language: "nl"  # Claims Dutch
```

### Mixing Languages

```yaml
# WRONG - Partial translation
mission_statement:
  text: "Het museum has the goal to preserve heritage..."
```

---

## Correct Implementation

### Dutch Institution (NL)

```yaml
ghcid: NL-NH-AMS-M-RM
mission_statement:
  text: |
    Het Rijksmuseum is het museum van Nederland.
    Kunst en geschiedenis nemen hier een bijzondere plek in.
    Al meer dan 200 jaar vertelt het Rijksmuseum het verhaal van Nederland.
  language: "nl"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  source_section: "Over ons"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"
```

### Spanish Institution (AR)

```yaml
ghcid: AR-C-BUE-M-MALBA
mission_statement:
  text: |
    MALBA tiene como misión coleccionar, preservar, investigar y difundir
    el arte latinoamericano desde principios del siglo XX hasta la actualidad.
  language: "es"
  source_url: "https://www.malba.org.ar/sobre-malba/"
  source_section: "Sobre MALBA"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"
```

### English Fallback Page (Documented)

```yaml
ghcid: NL-NH-AMS-M-VGM
mission_statement:
  text: |
    The Van Gogh Museum makes the life and work of Vincent van Gogh
    and the art of his time accessible to as many people as possible.
  language: "en"
  source_url: "https://www.vangoghmuseum.nl/en/about/organisation"
  source_section: "About - Organisation"
  extracted_verbatim: true
  language_note: "Extracted from English version of website; Dutch version not available at /nl/over-ons"
  extraction_timestamp: "2025-12-31T10:00:00Z"
```

---

## Language-Specific LLM Prompts

### Dutch (NL, BE, SR)

```
Je bent een expert in het analyseren van websites van erfgoedinstellingen.

KRITIEK: Extraheer de tekst EXACT zoals deze op de webpagina staat.
VERTAAL NIET. Behoud de originele Nederlandse tekst.

Als de bron in het Nederlands is, moet de output in het Nederlands zijn.
```

### Spanish (AR, CL, MX, ES, etc.)

```
Eres un experto en analizar sitios web de instituciones patrimoniales.

CRÍTICO: Extrae el texto EXACTAMENTE como aparece en la página web.
NO TRADUZCAS. Preserva el texto original en español.

Si la fuente está en español, la salida debe estar en español.
```

### Portuguese (BR, PT)

```
Você é um especialista em analisar sites de instituições patrimoniais.

CRÍTICO: Extraia o texto EXATAMENTE como aparece na página.
NÃO TRADUZA. Preserve o texto original em português.

Se a fonte está em português, a saída deve estar em português.
```

### German (DE, AT, CH)

```
Sie sind ein Experte für die Analyse von Websites von Kulturerbe-Institutionen.

KRITISCH: Extrahieren Sie den Text GENAU so, wie er auf der Webseite steht.
NICHT ÜBERSETZEN. Bewahren Sie den deutschen Originaltext.

Wenn die Quelle auf Deutsch ist, muss die Ausgabe auf Deutsch sein.
```

### French (FR, BE, CH, CA)

```
Vous êtes un expert dans l'analyse des sites web d'institutions patrimoniales.

CRITIQUE: Extrayez le texte EXACTEMENT tel qu'il apparaît sur la page.
NE TRADUISEZ PAS. Préservez le texte original en français.

Si la source est en français, la sortie doit être en français.
```

### Universal Fallback (English)

```
You are an expert in analyzing heritage institution websites.

CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT TRANSLATE. Preserve the original language text.

If the source is in Dutch, output must be in Dutch.
If the source is in Spanish, output must be in Spanish.
If the source is in German, output must be in German.
```

---

## Handling Multilingual Websites

Some institutions publish content in multiple languages on the same page. Store the primary-language text first and record institution-provided translations separately:

```yaml
mission_statement:
  primary:
    text: "Het museum bewaart en toont kunst..."
    language: "nl"
  translations:
    - text: "The museum preserves and displays art..."
      language: "en"
      is_official_translation: true  # Institution-provided, not AI-translated
```

---

## Verification Checklist

Before committing extracted content:

- [ ] Text is in original source language
- [ ] `language` field matches content language
- [ ] `language` field matches expected language from GHCID (or mismatch documented)
- [ ] `source_url` matches the page the content was actually extracted from
- [ ] `extracted_verbatim: true` is set
- [ ] No mixed-language content (unless source is genuinely multilingual)
- [ ] LLM prompt included no-translation instruction

---

## References

- AGENTS.md Rule 6: WebObservation Claims MUST Have XPath Provenance
- AGENTS.md: Emic Name First-Letter Protocol (preserves native language names)
- ISO 639-1: Language codes
- W3C: Content-Language HTTP header

---

## Summary

| Action | Status |
|--------|--------|
| Translate during extraction | FORBIDDEN |
| Store original language text | REQUIRED |
| Document language in metadata | REQUIRED |
| Flag language mismatches | REQUIRED |
| Provide translations on request | ALLOWED (downstream) |
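
The metadata-level parts of the Validation Checks and Verification Checklist can be automated. The following is a minimal sketch, not a prescribed implementation: the function name `check_record` and the `EXPECTED_LANGUAGES` table are hypothetical, and the country-to-language mapping is an assumption derived only from the country codes listed under the language-specific prompt headings. It checks field presence and GHCID consistency; it does not detect the actual language of `text`, which would need a separate language detector.

```python
# Hypothetical mapping from GHCID country prefix to expected ISO 639-1 codes.
# Assumption: built from the prompt headings in this rule; not exhaustive.
EXPECTED_LANGUAGES = {
    "NL": {"nl"}, "SR": {"nl"}, "BE": {"nl", "fr"},
    "AR": {"es"}, "CL": {"es"}, "MX": {"es"}, "ES": {"es"},
    "BR": {"pt"}, "PT": {"pt"},
    "DE": {"de"}, "AT": {"de"}, "CH": {"de", "fr"},
    "FR": {"fr"}, "CA": {"fr"},
}


def check_record(ghcid: str, field_data: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    # Required metadata from "Language Detection and Storage".
    for key in ("text", "language", "source_url"):
        if not field_data.get(key):
            problems.append(f"missing {key}")
    if field_data.get("extracted_verbatim") is not True:
        problems.append("extracted_verbatim is not true")

    # The GHCID starts with the country code, e.g. "NL-NH-AMS-M-RM".
    country = ghcid.split("-", 1)[0]
    expected = EXPECTED_LANGUAGES.get(country)
    language = field_data.get("language")

    # Flag (never "fix") a mismatch unless it is documented in language_note.
    if language and expected is not None and language not in expected:
        if not field_data.get("language_note"):
            problems.append(
                f"language '{language}' unexpected for {country}: "
                "add language_note or send for review"
            )
    return problems


# Usage: the documented English fallback example passes.
fallback = {
    "text": "The Van Gogh Museum makes the life and work of Vincent van Gogh...",
    "language": "en",
    "source_url": "https://www.vangoghmuseum.nl/en/about/organisation",
    "extracted_verbatim": True,
    "language_note": "Extracted from English version of website",
}
print(check_record("NL-NH-AMS-M-VGM", fallback))  # prints []
```

The checker only reports problems and never rewrites `text`, in keeping with the requirement that mismatches are flagged for review rather than silently normalized.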