glam/.opencode/ORIGINAL_LANGUAGE_PRESERVATION_RULE.md
2026-01-02 02:10:18 +01:00

Rule 36: Original Language Preservation in Web Content Extraction

Status: ACTIVE
Created: 2025-12-31
Applies to: All web content extraction, mission statements, descriptions, organizational information


Core Principle

ALL extracted text content MUST be preserved in its original source language. Translation is STRICTLY FORBIDDEN during extraction.

This rule applies to:

  • Mission statements
  • Vision statements
  • Organizational descriptions
  • About us content
  • Historical narratives
  • Collection descriptions
  • Any textual content extracted from institutional websites

Rationale

1. Emic Authenticity

The institution's own voice and terminology must be preserved. A Dutch museum's mission statement in Dutch is the authoritative version; any translation is a derivative.

2. Semantic Fidelity

Translation introduces interpretation and potential distortion:

  • "Erfgoed" (Dutch) has connotations beyond "heritage"
  • "Patrimonio" (Spanish) carries cultural weight lost in translation
  • Institutional jargon may not have direct equivalents

3. Provenance Integrity

If content is translated during extraction:

  • The source_url no longer matches the stored text
  • XPath provenance becomes invalid
  • Content hashes cannot verify authenticity
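
The hash argument can be illustrated with a minimal sketch. It assumes the content hash is a SHA-256 digest of the UTF-8 text recorded at extraction time; this rule does not name a hash algorithm, and `content_hash` is a hypothetical helper.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the text (assumed hashing scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Text as it appears on the source page (Dutch original).
source_text = "Het Rijksmuseum is het museum van Nederland..."
# Hash recorded at extraction time.
recorded_hash = content_hash(source_text)

# A verbatim copy verifies against the recorded hash.
verbatim_ok = content_hash(source_text) == recorded_hash

# A translated copy can no longer be verified against it.
translated_text = "The Rijksmuseum is the museum of the Netherlands..."
translated_ok = content_hash(translated_text) == recorded_hash
```

Here `verbatim_ok` is true and `translated_ok` is false: once the stored text differs from the page, the hash has nothing to verify against.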

4. Downstream Flexibility

Original language content allows:

  • Users to request translations in their preferred language
  • Machine translation with disclosed provenance
  • Linguistic analysis and terminology extraction
  • Multilingual search indexing

Implementation Requirements

LLM Extraction Prompts

All LLM prompts for content extraction MUST include explicit no-translation instructions:

CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT translate any content.
Preserve the original language (Dutch, Spanish, German, French, etc.).
If the source is in Dutch, the output must be in Dutch.
If the source is in Spanish, the output must be in Spanish.
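
One way to guarantee the instruction is never omitted is to build every extraction prompt through a single helper. The sketch below is illustrative only: `build_extraction_prompt`, its parameters, and the `WEBPAGE CONTENT:` delimiter are assumptions, not part of any existing pipeline.

```python
# The no-translation block quoted above, kept in one place.
NO_TRANSLATION_INSTRUCTION = (
    "CRITICAL: Extract the text EXACTLY as it appears on the webpage.\n"
    "DO NOT translate any content.\n"
    "Preserve the original language (Dutch, Spanish, German, French, etc.).\n"
    "If the source is in Dutch, the output must be in Dutch.\n"
    "If the source is in Spanish, the output must be in Spanish."
)


def build_extraction_prompt(task: str, page_text: str) -> str:
    """Combine a task description, the no-translation block, and the page text."""
    return (
        f"{task}\n\n"
        f"{NO_TRANSLATION_INSTRUCTION}\n\n"
        f"WEBPAGE CONTENT:\n{page_text}"
    )


prompt = build_extraction_prompt(
    task="Extract the institution's mission statement.",
    page_text="Het Rijksmuseum is het museum van Nederland...",
)
```

Routing all prompts through one function also makes the checklist item "LLM prompt included no-translation instruction" easy to confirm.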

Language Detection and Storage

Every extracted text field MUST include the verbatim text, its language code, the source URL, and a verbatim flag:

mission_statement:
  text: "Het Rijksmuseum is het museum van Nederland..."  # Original Dutch
  language: "nl"  # ISO 639-1 code detected from GHCID or content
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  extracted_verbatim: true  # Confirms no translation occurred
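
The same record could be enforced in code before it is serialized. This is a minimal sketch: the `ExtractedText` class is hypothetical, and the two-lowercase-letter check only tests the shape of an ISO 639-1 code, not whether the code is actually assigned.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractedText:
    """One extracted text field with the metadata this rule requires."""

    text: str
    language: str  # ISO 639-1 code, e.g. "nl"
    source_url: str
    extracted_verbatim: bool = True

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("text must not be empty")
        code = self.language
        # Shape check only: two lowercase ASCII letters.
        if not (len(code) == 2 and code.isascii() and code.isalpha() and code.islower()):
            raise ValueError(f"language must be a two-letter ISO 639-1 code, got {code!r}")


record = ExtractedText(
    text="Het Rijksmuseum is het museum van Nederland...",
    language="nl",
    source_url="https://www.rijksmuseum.nl/nl/over-ons",
)
record_dict = asdict(record)  # plain dict, ready for YAML serialization
```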

Validation Checks

Before storing extracted content:

  1. Verify that the language field matches a language expected for the GHCID country code (some countries, such as BE and CH, have more than one)
  2. Flag mismatches for review (may indicate English fallback pages)
  3. Never silently translate to "normalize" content
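
The first two checks might look like the sketch below. It assumes the GHCID begins with a two-letter country code, as in the examples in this document, and the `EXPECTED_LANGUAGES` table is a hypothetical, incomplete mapping derived from the country groupings in the prompt section of this rule.

```python
from typing import Optional

# Hypothetical mapping; extend as needed. Countries may have several languages.
EXPECTED_LANGUAGES = {
    "NL": {"nl"}, "SR": {"nl"}, "BE": {"nl", "fr"},
    "AR": {"es"}, "CL": {"es"}, "MX": {"es"}, "ES": {"es"},
    "BR": {"pt"}, "PT": {"pt"},
    "DE": {"de"}, "AT": {"de"}, "CH": {"de", "fr"},
    "FR": {"fr"}, "CA": {"fr"},
}


def language_mismatch(ghcid: str, language: str) -> Optional[str]:
    """Return a review message if the language is unexpected, else None."""
    country = ghcid.split("-", 1)[0].upper()
    expected = EXPECTED_LANGUAGES.get(country)
    if expected is None:
        return f"{ghcid}: no expected language configured for country {country}"
    if language not in expected:
        return (
            f"{ghcid}: language {language!r} not in expected {sorted(expected)}; "
            "flag for review (possible English fallback page)"
        )
    return None


ok = language_mismatch("NL-NH-AMS-M-RM", "nl")
flagged = language_mismatch("NL-NH-AMS-M-VGM", "en")
```

The function only reports a mismatch; it never alters the text, in line with the third check.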

Anti-Patterns (FORBIDDEN)

Translation During Extraction

# WRONG - Translated from Dutch to English
mission_statement:
  text: "The Rijksmuseum is the museum of the Netherlands..."
  language: "en"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"  # Dutch URL!

Language Mismatch Without Documentation

# WRONG - No explanation for language mismatch
mission_statement:
  text: "We preserve cultural heritage..."  # English text
  language: "nl"  # Claims Dutch

Mixing Languages

# WRONG - Partial translation
mission_statement:
  text: "Het museum has the goal to preserve heritage..."

Correct Implementation

Dutch Institution (NL)

ghcid: NL-NH-AMS-M-RM
mission_statement:
  text: |
    Het Rijksmuseum is het museum van Nederland. Kunst en geschiedenis 
    nemen hier een bijzondere plek in. Al meer dan 200 jaar vertelt 
    het Rijksmuseum het verhaal van Nederland.    
  language: "nl"
  source_url: "https://www.rijksmuseum.nl/nl/over-ons"
  source_section: "Over ons"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"

Spanish Institution (AR)

ghcid: AR-C-BUE-M-MALBA
mission_statement:
  text: |
    MALBA tiene como misión coleccionar, preservar, investigar y difundir 
    el arte latinoamericano desde principios del siglo XX hasta la actualidad.    
  language: "es"
  source_url: "https://www.malba.org.ar/sobre-malba/"
  source_section: "Sobre MALBA"
  extracted_verbatim: true
  extraction_timestamp: "2025-12-31T10:00:00Z"

English Fallback Page (Documented)

ghcid: NL-NH-AMS-M-VGM
mission_statement:
  text: |
    The Van Gogh Museum makes the life and work of Vincent van Gogh 
    and the art of his time accessible to as many people as possible.    
  language: "en"
  source_url: "https://www.vangoghmuseum.nl/en/about/organisation"
  source_section: "About - Organisation"
  extracted_verbatim: true
  language_note: "Extracted from English version of website; Dutch version not available at /nl/over-ons"
  extraction_timestamp: "2025-12-31T10:00:00Z"

Language-Specific LLM Prompts

Dutch (NL, BE, SR)

Je bent een expert in het analyseren van websites van erfgoedinstellingen.

KRITIEK: Extraheer de tekst EXACT zoals deze op de webpagina staat.
VERTAAL NIET. Behoud de originele Nederlandse tekst.
Als de bron in het Nederlands is, moet de output in het Nederlands zijn.

Spanish (AR, CL, MX, ES, etc.)

Eres un experto en analizar sitios web de instituciones patrimoniales.

CRÍTICO: Extrae el texto EXACTAMENTE como aparece en la página web.
NO TRADUZCAS. Preserva el texto original en español.
Si la fuente está en español, la salida debe estar en español.

Portuguese (BR, PT)

Você é um especialista em analisar sites de instituições patrimoniais.

CRÍTICO: Extraia o texto EXATAMENTE como aparece na página.
NÃO TRADUZA. Preserve o texto original em português.
Se a fonte está em português, a saída deve estar em português.

German (DE, AT, CH)

Sie sind ein Experte für die Analyse von Websites von Kulturerbe-Institutionen.

KRITISCH: Extrahieren Sie den Text GENAU so, wie er auf der Webseite steht.
NICHT ÜBERSETZEN. Bewahren Sie den deutschen Originaltext.
Wenn die Quelle auf Deutsch ist, muss die Ausgabe auf Deutsch sein.

French (FR, BE, CH, CA)

Vous êtes un expert dans l'analyse des sites web d'institutions patrimoniales.

CRITIQUE: Extrayez le texte EXACTEMENT tel qu'il apparaît sur la page.
NE TRADUISEZ PAS. Préservez le texte original en français.
Si la source est en français, la sortie doit être en français.

Universal Fallback (English)

You are an expert in analyzing heritage institution websites.

CRITICAL: Extract the text EXACTLY as it appears on the webpage.
DO NOT TRANSLATE. Preserve the original language text.
If the source is in Dutch, output must be in Dutch.
If the source is in Spanish, output must be in Spanish.
If the source is in German, output must be in German.
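
Selecting among these prompts can be sketched as a lookup with the English prompt as fallback. The `PROMPTS` table and `select_prompt` function are hypothetical, and only two entries are shown for brevity. Keying by language code rather than country code is a deliberate choice here, since BE and CH appear under more than one heading above.

```python
# Abbreviated prompt table keyed by ISO 639-1 language code (illustrative).
PROMPTS = {
    "nl": (
        "KRITIEK: Extraheer de tekst EXACT zoals deze op de webpagina staat.\n"
        "VERTAAL NIET. Behoud de originele Nederlandse tekst."
    ),
    "en": (
        "CRITICAL: Extract the text EXACTLY as it appears on the webpage.\n"
        "DO NOT TRANSLATE. Preserve the original language text."
    ),
}
FALLBACK_LANGUAGE = "en"


def select_prompt(language: str) -> str:
    """Return the prompt for a language code, or the universal fallback."""
    return PROMPTS.get(language.lower(), PROMPTS[FALLBACK_LANGUAGE])


dutch_prompt = select_prompt("nl")
unknown_prompt = select_prompt("ja")  # no entry, so the English fallback is used
```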

Handling Multilingual Websites

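Some institutions publish content in multiple languages on the same page. Store the source-language text as primary and list institution-provided versions under translations: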

mission_statement:
  primary:
    text: "Het museum bewaart en toont kunst..."
    language: "nl"
  translations:
    - text: "The museum preserves and displays art..."
      language: "en"
      is_official_translation: true  # Institution-provided, not AI-translated

Verification Checklist

Before committing extracted content:

  • Text is in original source language
  • language field matches content language
  • language field matches expected language from GHCID (or mismatch documented)
  • source_url matches the page the content was actually extracted from
  • extracted_verbatim: true is set
  • No mixed-language content (unless source is genuinely multilingual)
  • LLM prompt included no-translation instruction

References

  • AGENTS.md Rule 6: WebObservation Claims MUST Have XPath Provenance
  • AGENTS.md: Emic Name First-Letter Protocol (preserves native language names)
  • ISO 639-1: Language codes
  • W3C: Content-Language HTTP header

Summary

| Action | Status |
|--------|--------|
| Translate during extraction | FORBIDDEN |
| Store original language text | REQUIRED |
| Document language in metadata | REQUIRED |
| Flag language mismatches | REQUIRED |
| Provide translations on request | ALLOWED (downstream) |