#!/usr/bin/env python3
"""
Optimize AGENTS.md by condensing verbose rules into summaries with .opencode references.

This script:
1. Preserves the header (lines 1-22)
2. Condenses rules 0-34 (lines 23-2775) into enhanced summaries with .opencode references
3. Preserves all unique reference content (lines 2776-5332) in full
"""

from pathlib import Path

# Enhanced rule mappings with more detail for key rules
RULE_CONTENT = {
    0: """### Rule 0: LinkML Schemas Are the Single Source of Truth

🚨 **CRITICAL**: LinkML schema files in `schemas/20251121/linkml/` are the authoritative definition of the Heritage Custodian Ontology.

**Key Points**:
- ALL derived files (RDF, TypeDB, UML) are GENERATED - never edit them directly
- Always use full timestamps (`YYYYMMDD_HHMMSS`) in generated filenames
- Primary schema: `schemas/20251121/linkml/01_custodian_name.yaml`

**Workflow**:
```
1. EDIT LinkML schema
2. REGENERATE: gen-owl → rdfpipe → all 8 RDF formats
3. REGENERATE: gen-yuml → UML diagrams
4. UPDATE: TypeDB schema (manual)
5. VALIDATE: linkml-validate
```

**See**: `.opencode/SCHEMA_GENERATION_RULES.md` for complete generation rules

---
""",
    1: """### Rule 1: Ontology Files Are Your Primary Reference

🚨 **CRITICAL**: Before designing any schema, class, or property, consult the base ontologies.

**Required Steps**:
1. READ base ontology files in `/data/ontology/`
2. SEARCH for existing classes and properties
3. DOCUMENT your ontology alignment with rationale
4.
NEVER invent custom properties when ontology equivalents exist

**Available Ontologies**:
- `tooiont.ttl` - TOOI (Dutch government)
- `core-public-organisation-ap.ttl` - CPOV (EU public sector)
- `schemaorg.owl` - Schema.org (web semantics)
- `CIDOC_CRM_v7.1.3.rdf` - CIDOC-CRM (cultural heritage)
- `RiC-O_1-1.rdf` - Records in Contexts (archival)
- `pico.ttl` - PiCo (person observations)

**See**: `.opencode/HYPER_MODULAR_STRUCTURE.md` for complete documentation

---
""",
    2: """### Rule 2: Wikidata Entities Are NOT Ontology Classes

🚨 **CRITICAL**: Files in `data/wikidata/GLAMORCUBEPSXHFN/` contain Wikidata Q-numbers for institution TYPES, NOT formal ontology class definitions.

**Workflow**: `Wikidata Q-number → Analyze semantics → Search ontologies → Map to ontology class → Document rationale`

**Note**: Full rule content preserved in the Appendix below (no .opencode equivalent).

---
""",
    3: """### Rule 3: Multi-Aspect Modeling is Mandatory

🚨 **CRITICAL**: Every heritage entity has MULTIPLE ontological aspects with INDEPENDENT temporal lifecycles.

**Required Aspects**:

| Aspect | Ontology Class | Temporal Example |
|--------|---------------|------------------|
| Place | `crm:E27_Site` | Building: 1880-present |
| Custodian | `cpov:PublicOrganisation` | Foundation: 1994-present |
| Legal Form | `org:FormalOrganization` | Registration: 1994-present |
| Collections | `rico:RecordSet` | Accession dates vary |
| People | `pico:PersonObservation` | Employment: 2020-present |
| Events | `crm:E10_Transfer_of_Custody` | Discrete timestamps |

**Note**: Full rule content preserved in the Appendix below (no .opencode equivalent).

---
""",
    4: """### Rule 4: Technical Classes Are Excluded from Visualizations

🚨 **CRITICAL**: Some LinkML classes exist solely for validation (e.g., `Container` with `tree_root: true`). These have NO semantic significance and MUST be excluded from UML diagrams.
**Excluded Classes**: `Container` (tree_root for validation only)

**See**: `.opencode/LINKML_TECHNICAL_CLASSES.md` for complete documentation

---
""",
    5: """### Rule 5: NEVER Delete Enriched Data - Additive Only

🚨 **CRITICAL**: Data enrichment is ADDITIVE ONLY. Never delete or overwrite existing enriched content.

**Protected Data Types**:

| Source | Protected Fields |
|--------|------------------|
| Google Maps | `reviews`, `rating`, `photo_count`, `popular_times`, `place_id` |
| OpenStreetMap | `osm_id`, `osm_type`, `osm_tags`, `amenity`, `heritage` |
| Wikidata | `wikidata_id`, `claims`, `sitelinks`, `aliases` |
| Website Scrape | `organization_details`, `collections`, `contact`, `social_media` |
| ISIL Registry | `isil_code`, `assigned_date`, `remarks` |

**See**: `.opencode/DATA_PRESERVATION_RULES.md` for complete documentation

---
""",
    6: """### Rule 6: WebObservation Claims MUST Have XPath Provenance

🚨 **CRITICAL**: Every claim extracted from a webpage MUST have an XPath pointer to the exact location in the archived HTML. Claims without XPath provenance are FABRICATED.

**Required Fields**:
```yaml
claim_type: full_name
claim_value: "Institution Name"
source_url: https://example.org/about
retrieved_on: "2025-11-29T12:28:00Z"
xpath: /html/body/div[1]/h1
html_file: web/GHCID/example.org/rendered.html
xpath_match_score: 1.0
```

**Scope**: Applies to the `WebClaim` and `WebObservation` classes. Other classes (LinkupTimelineEvent, GoogleMapsEnrichment) have different provenance models.

**See**: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` for complete documentation

---
""",
    7: """### Rule 7: Deployment is LOCAL via SSH/rsync (NO CI/CD)

🚨 **CRITICAL**: NO GitHub Actions. ALL deployments are executed locally via SSH and rsync.
**Server**: `91.98.224.44` (Hetzner Cloud)

**Two Frontend Apps** (MONOREPO):

| Domain | Local Directory | Server Path |
|--------|-----------------|-------------|
| bronhouder.nl | `/frontend/` | `/var/www/glam-frontend/` |
| archief.support | `/apps/archief-assistent/` | `/var/www/archief-assistent/` |

**Deployment Commands**:
```bash
./infrastructure/deploy.sh --frontend   # bronhouder.nl
./infrastructure/deploy.sh --data       # Data files only
./infrastructure/deploy.sh --status     # Check server
```

**See**: `.opencode/DEPLOYMENT_RULES.md` and `.opencode/MONOREPO_FRONTEND_APPS.md`

---
""",
    8: """### Rule 8: Legal Form Terms MUST Be Filtered from CustodianName

🚨 **CRITICAL**: Exception to the emic principle - legal forms are ALWAYS filtered from CustodianName.

**Example**: `Stichting Rijksmuseum` → CustodianName: `Rijksmuseum`, Legal Form: `Stichting`

**Terms to Filter** (by language):
- Dutch: Stichting, B.V., N.V., Coöperatie
- English: Foundation, Inc., Ltd., LLC, Corp.
- German: Stiftung, e.V., GmbH, AG
- French: Fondation, S.A., S.A.R.L.

**NOT Filtered** (part of identity): Vereniging, Association, Society, Verein

**See**: `.opencode/LEGAL_FORM_FILTERING_RULE.md` for complete documentation

---
""",
    9: """### Rule 9: Enum-to-Class Promotion - Single Source of Truth

🚨 **CRITICAL**: When an enum is promoted to a class hierarchy, the original enum MUST be deleted. Never maintain parallel enum/class definitions.

**Archive Location**: `schemas/20251121/linkml/archive/enums/`

**See**: `.opencode/ENUM_TO_CLASS_PRINCIPLE.md` for complete documentation

---
""",
    10: """### Rule 10: CH-Annotator is the Entity Annotation Convention

🚨 **CRITICAL**: All entity annotation follows the `ch_annotator-v1_7_0` convention.

**9 Hypernym Types**: AGT (Agent), GRP (Group), TOP (Toponym), GEO (Geometry), TMP (Temporal), APP (Appellation), ROL (Role), WRK (Work), QTY (Quantity)

**Heritage Institutions**: `GRP.HER` with GLAMORCUBESFIXPHDNT subtypes (GRP.HER.MUS, GRP.HER.LIB, GRP.HER.ARC, etc.)
**See**: `.opencode/CH_ANNOTATOR_CONVENTION.md` for complete documentation

---
""",
    11: """### Rule 11: Z.AI GLM API for LLM Tasks (NOT BigModel)

🚨 **CRITICAL**: Use the Z.AI Coding Plan endpoint, NOT the regular BigModel API.

**Configuration**:
- API URL: `https://api.z.ai/api/coding/paas/v4/chat/completions`
- Environment Variable: `ZAI_API_TOKEN`
- Models: `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.6`
- Cost: Free (0 per token)

**See**: `.opencode/ZAI_GLM_API_RULES.md` for complete documentation

---
""",
    12: """### Rule 12: Person Data Reference Pattern - Avoid Inline Duplication

🚨 **CRITICAL**: Person profiles are stored in `data/custodian/person/entity/`. Custodian files reference them via `person_profile_path` - NEVER duplicate 50+ lines of profile data inline.

**File Naming**: `{linkedin-slug}_{ISO-timestamp}.json`

**See**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` for complete documentation

---
""",
    13: """### Rule 13: Custodian Type Annotations on LinkML Schema Elements

🚨 **CRITICAL**: All schema elements MUST have a `custodian_types` annotation with GLAMORCUBESFIXPHDNT single-letter codes.

**Annotation Keys**: `custodian_types` (list), `custodian_types_rationale` (string), `custodian_types_primary` (string)

**Universal**: Use `["*"]` for elements applying to all types.

**See**: `.opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md` for complete documentation

---
""",
    14: """### Rule 14: Exa MCP LinkedIn Profile Extraction

🚨 **CRITICAL**: Use `exa_crawling_exa` with a direct URL for comprehensive LinkedIn profile extraction.

**Tool Priority**:
1. `exa_crawling_exa` - Profile URL known (preferred)
2. `exa_linkedin_search_exa` - Profile URL unknown
3.
`exa_web_search_exa` - Fallback search

**Output**: `data/custodian/person/entity/{linkedin-slug}_{timestamp}.json`

**See**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` for complete documentation

---
""",
    15: """### Rule 15: Connection Data Registration - Full Network Preservation

🚨 **CRITICAL**: ALL LinkedIn connections must be fully registered in dedicated connections files.

**File Location**: `data/custodian/person/{slug}_connections_{timestamp}.json`

**Required**: `source_metadata`, a `connections[]` array, and `network_analysis` with a heritage type breakdown

**See**: `.opencode/CONNECTION_DATA_REGISTRATION_RULE.md` for complete documentation

---
""",
    16: """### Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages

🚨 **CRITICAL**: Store the actual CDN URL, NOT the overlay page URL.

- ❌ WRONG: `linkedin.com/in/{slug}/overlay/photo/` (derivable, useless)
- ✅ CORRECT: `media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...`

**See**: `.opencode/LINKEDIN_PHOTO_CDN_RULE.md` for complete documentation

---
""",
    17: """### Rule 17: LinkedIn Connection Unique Identifiers

🚨 **CRITICAL**: Every connection gets a unique ID, including abbreviated and anonymous names.

**Format**: `{target_slug}_conn_{index:04d}_{name_slug}`

**Name Types**: `full`, `abbreviated` (Amy B.), `anonymous` (LinkedIn Member)

**See**: `.opencode/LINKEDIN_CONNECTION_ID_RULE.md` for complete documentation

---
""",
    18: """### Rule 18: Custodian Staff Parsing from LinkedIn Company Pages

🚨 **CRITICAL**: Use `scripts/parse_custodian_staff.py` for staff registration parsing.

**Staff ID Format**: `{custodian_slug}_staff_{index:04d}_{name_slug}`

**See**: `.opencode/CUSTODIAN_STAFF_PARSING_RULE.md` for complete documentation

---
""",
    19: """### Rule 19: HTML-Only LinkedIn Extraction (Preferred Method)

🚨 **CRITICAL**: Use ONLY manually saved HTML files for LinkedIn data extraction.
**Data Completeness**: HTML = 100% (including profile URLs), MD copy-paste = ~90%

**Script**: `scripts/parse_linkedin_html.py`

**How to Save**: Navigate → Scroll to load all → File > Save Page As > "Webpage, Complete"

**See**: `.opencode/HTML_ONLY_LINKEDIN_EXTRACTION_RULE.md` for complete documentation

---
""",
    20: """### Rule 20: Person Entity Profiles - Individual File Storage

🚨 **CRITICAL**: Person profiles are stored as individual files in `data/custodian/person/entity/`.

**File Naming**: `{linkedin-slug}_{ISO-timestamp}.json`

**Required**: ALL profiles MUST use structured JSON with `extraction_agent: "claude-opus-4.5"`. Raw content dumps are NOT acceptable.

**See**: `.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md` for complete documentation

---
""",
    21: """### Rule 21: Data Fabrication is Strictly Prohibited

🚨 **CRITICAL**: ALL DATA MUST BE REAL AND VERIFIABLE. Fabricating any data is strictly prohibited.

**❌ FORBIDDEN**:
- Creating fake names, job titles, companies
- Inventing education history or skills
- Generating placeholder data when extraction fails
- Creating fictional LinkedIn URLs

**✅ ALLOWED**:
- Skip profiles that cannot be extracted
- Return `null` or empty fields for missing data
- Mark profiles with `extraction_error: true`
- Log why extraction failed

**See**: `.opencode/DATA_FABRICATION_PROHIBITION.md` for complete documentation

---
""",
    22: """### Rule 22: Custodian YAML Files Are the Single Source of Truth

🚨 **CRITICAL**: `data/custodian/*.yaml` is the SINGLE SOURCE OF TRUTH for all enrichment data.
**Data Hierarchy**:
```
data/custodian/*.yaml  ← SINGLE SOURCE OF TRUTH
        ↓
Ducklake → PostgreSQL → TypeDB → Oxigraph → Qdrant
(All databases are DERIVED - never add data independently)
        ↓
REST API → Frontend (both DERIVED)
```

**Workflow**: FETCH → VALIDATE → WRITE TO YAML → Import to database → Verify

**See**: `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` for complete documentation

---
""",
    23: """### Rule 23: Social Media Link Validation - No Generic Links

🚨 **CRITICAL**: Social media links MUST be institution-specific, NOT generic platform homepages.

**Invalid**: `facebook.com/`, `facebook.com/facebook`, `twitter.com/twitter`
**Valid**: `facebook.com/rijksmuseum/`, `twitter.com/rijksmuseum`

**See**: `.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md` for complete documentation

---
""",
    24: """### Rule 24: Unused Import Investigation - Check Before Removing

🚨 **CRITICAL**: Before removing unused imports, INVESTIGATE whether they indicate incomplete implementations.

**Checklist**:
1. Was it recently used? (`git log -p --all -S 'ImportName'`)
2. Is there a TODO/FIXME?
3. Pattern mismatch (old vs new syntax)?
4. Incomplete feature?
5. Conditional usage (`TYPE_CHECKING` blocks)?

**See**: `.opencode/UNUSED_IMPORT_INVESTIGATION_RULE.md` for complete documentation

---
""",
    25: """### Rule 25: Digital Platform Discovery Enrichment

🚨 **CRITICAL**: Every heritage custodian MUST be enriched with digital platform discovery data.

**Discover**: Collection management systems, discovery portals, external integrations, APIs

**Required Provenance**: `retrieval_agent`, `retrieval_timestamp`, `source_url`, `xpath_base`, `html_file`

**See**: `.opencode/DIGITAL_PLATFORM_DISCOVERY_RULE.md` for complete documentation

---
""",
    26: """### Rule 26: Person Data Provenance - Web Claims for Staff Information

🚨 **CRITICAL**: All person/staff data MUST have web claim provenance with verifiable sources.
**Required Fields**: `claim_type`, `claim_value`, `source_url`, `retrieved_on`, `retrieval_agent`
**Recommended**: `xpath`, `xpath_match_score`

**See**: `.opencode/PERSON_DATA_PROVENANCE_RULE.md` for complete documentation

---
""",
    27: """### Rule 27: Person-Custodian Data Architecture

🚨 **CRITICAL**: Person entity files are the SINGLE SOURCE OF TRUTH for all person data.

**In Person Entity File**: `extraction_metadata`, `profile_data`, `web_claims`, `affiliations`

**In Custodian YAML**: `person_id`, `person_name`, `role_title`, `affiliation_provenance`, `linkedin_profile_path` (reference only)

**NEVER**: Put `web_claims` in custodian YAML files

**See**: `.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md` for complete documentation

---
""",
    28: """### Rule 28: Web Claims Deduplication - No Redundant Claims

🚨 **CRITICAL**: Do not duplicate claims unless a genuine variation exists with uncertainty.

**Eliminate**: Favicon variants, same value from different extractions, dynamic content

**Document**: Removed claims in the `removed_claims` section for an audit trail

**See**: `.opencode/WEB_CLAIMS_DEDUPLICATION_RULE.md` for complete documentation

---
""",
    29: """### Rule 29: Anonymous Profile Name Derivation from LinkedIn Slugs

🚨 **CRITICAL**: Names CAN be derived from hyphenated LinkedIn slugs - this is data transformation, NOT fabrication.

**Dutch Particles**: Keep lowercase when not the first word (van, de, den, der)

**Known Compound Slugs**: Use the mapping for `jponjee` → "J. Ponjee", etc.

**See**: `.opencode/ANONYMOUS_PROFILE_NAME_RULE.md` for complete documentation

---
""",
    30: """### Rule 30: Person Profile Extraction Confidence Scoring

🚨 **CRITICAL**: Every enriched profile MUST have a confidence score (0.50-0.95) for data extraction quality.
**Distinct from**: Heritage sector relevance score (different purpose)

**Scoring Factors**:
- Clear job title: +0.10 to +0.15
- Named institution: +0.05 to +0.10
- Privacy-abbreviated name: -0.15 to -0.20
- Intern/trainee: -0.10

**See**: `.opencode/PERSON_PROFILE_CONFIDENCE_SCORING.md` for complete documentation

---
""",
    31: """### Rule 31: Organizational Subdivision Extraction

🚨 **CRITICAL**: ALWAYS capture organizational subdivisions as structured data.

**Types**: department, team, unit, division, section, lab_or_center, office

**Store in**: `affiliations[].subdivision` with `type`, `name`, `parent_subdivision`, `extraction_source`

**See**: `.opencode/ORGANIZATIONAL_SUBDIVISION_EXTRACTION.md` for complete documentation

---
""",
    32: """### Rule 32: Government Ministries Are Heritage Custodians (Type O)

🚨 **CRITICAL**: Government ministries ARE heritage custodians due to statutory record-keeping obligations.

**Heritage Relevance Scores**:

| Role Category | Score Range |
|---------------|-------------|
| Records Management | 0.40-0.50 |
| IT/Systems (records) | 0.30-0.40 |
| Policy/Advisory | 0.25-0.35 |
| Administrative | 0.15-0.25 |

**See**: `.opencode/GOVERNMENT_MINISTRY_HERITAGE_RULE.md` for complete documentation

---
""",
    33: """### Rule 33: GHCID Collision Duplicate Detection

🚨 **CRITICAL**: Duplicate detection is MANDATORY in GHCID collision resolution.

**Decision Matrix**:
- ALL details match → DUPLICATE (keep earliest, archive later)
- Same name, different city → NOT DUPLICATE (keep both, add suffix)
- Same name, same city, different Wikidata IDs → NOT DUPLICATE
- When in doubt → Keep both files (can merge later)

**See**: `.opencode/GHCID_COLLISION_DUPLICATE_DETECTION.md` for complete documentation

---
""",
    34: """### Rule 34: Linkup is the Preferred Web Scraper

🚨 **CRITICAL**: Use Linkup as the primary web scraper. Firecrawl credits are limited.
**Tool Priority**:

| Priority | Tool | When to Use |
|----------|------|-------------|
| 1st | `linkup_linkup-search` | General research, finding pages |
| 2nd | `linkup_linkup-fetch` | Fetching a known URL |
| 3rd | `firecrawl_*` | Only when Linkup fails |
| 4th | `playwright_*` | Interactive pages, HTML archival |

**Two-Phase for XPath Provenance** (Rule 6 compliance):
1. Linkup for discovery
2. Playwright for archival with XPath extraction

**See**: `.opencode/LINKUP_PREFERRED_WEB_SCRAPER_RULE.md` for complete documentation

---
""",
}


def main():
    agents_path = Path("/Users/kempersc/apps/glam/AGENTS.md")
    output_path = Path("/Users/kempersc/apps/glam/AGENTS_OPTIMIZED.md")

    # Read the original file
    with open(agents_path, "r", encoding="utf-8") as f:
        original_lines = f.readlines()
    print(f"Original file: {len(original_lines)} lines")

    # Extract sections
    header = original_lines[:22]            # Lines 1-22 (0-indexed: 0-21)
    unique_content = original_lines[2775:]  # Lines 2776+ (0-indexed: 2775+)
    print(f"Header: {len(header)} lines")
    print(f"Unique content: {len(unique_content)} lines")

    # Extract the full content of Rules 2-3 for the appendix
    rule2_start = None
    rule4_start = None
    for i, line in enumerate(original_lines):
        if "### Rule 2: Wikidata Entities Are NOT Ontology Classes" in line:
            rule2_start = i
        elif "### Rule 4:" in line:
            rule4_start = i
            break

    # Compare against None explicitly: a valid match at index 0 is falsy
    if rule2_start is not None and rule4_start is not None:
        rules_2_3_content = original_lines[rule2_start:rule4_start]
        print(f"Rules 2-3 full content: {len(rules_2_3_content)} lines")
    else:
        rules_2_3_content = []
        print("Warning: Could not extract Rules 2-3 boundaries")

    # Build the output
    output_lines = []

    # Header
    output_lines.extend(header)
    output_lines.append("\n")

    # Enhanced rules section
    output_lines.append("## 🚨 CRITICAL RULES FOR ALL AGENTS\n\n")
    output_lines.append(
        "This section summarizes 35 critical rules. "
        "Each rule has complete documentation in `.opencode/` files.\n\n"
    )
    for rule_num in range(35):
        output_lines.append(RULE_CONTENT[rule_num])
        output_lines.append("\n")

    # Appendix for rules without .opencode files
    if rules_2_3_content:
        output_lines.append("## Appendix: Full Rule Content (No .opencode Equivalent)\n\n")
        output_lines.append(
            "The following rules have no separate .opencode file and are preserved in full:\n\n"
        )
        output_lines.extend(
            line if line.endswith("\n") else line + "\n" for line in rules_2_3_content
        )
        output_lines.append("\n")

    # Unique content (Project Overview onward)
    output_lines.extend(unique_content)

    # Write the output
    with open(output_path, "w", encoding="utf-8") as f:
        f.writelines(output_lines)

    final_lines = len(output_lines)
    print(f"\nOutput file: {final_lines} lines")
    print(
        f"Reduction: {len(original_lines)} → {final_lines} "
        f"({len(original_lines) - final_lines} lines removed)"
    )
    print(f"Saved to: {output_path}")
    return output_path


if __name__ == "__main__":
    main()
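# --- Illustrative sanity check (a sketch, not part of the original script) ---
# main() indexes RULE_CONTENT[rule_num] for rule_num in range(35), so a single
# missing key would raise KeyError halfway through building the output file.
# A small guard like this hypothetical helper could be called before the build
# loop to fail fast with a readable message:
def check_rule_content_complete(rules, expected_count=35):
    """Raise KeyError if the rule table has gaps or unexpected keys."""
    missing = sorted(set(range(expected_count)) - set(rules))
    extra = sorted(set(rules) - set(range(expected_count)))
    if missing or extra:
        raise KeyError(
            f"RULE_CONTENT mismatch - missing: {missing}, extra: {extra}"
        )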