# Entity Linking for Heritage Custodians

## Overview

This document defines entity linking strategies for resolving extracted heritage institution mentions to canonical knowledge bases (Wikidata, VIAF, the ISIL registry) and the local Heritage Custodian Ontology knowledge graph.

## Entity Linking Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                       Entity Linking Pipeline                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Extracted Entity ──► Candidate Generation ──► Candidate Ranking     │
│       (NER)              (Multi-source)        (Features + ML)       │
│                                                                      │
│                                 │                      │             │
│                                 ▼                      ▼             │
│                        ┌─────────────────┐    ┌─────────────────┐    │
│                        │    Knowledge    │    │ Disambiguation  │    │
│                        │      Bases      │    │     Module      │    │
│                        ├─────────────────┤    └────────┬────────┘    │
│                        │ • Wikidata      │             │             │
│                        │ • VIAF          │             ▼             │
│                        │ • ISIL Registry │    ┌─────────────────┐    │
│                        │ • GeoNames      │    │  NIL Detection  │    │
│                        │ • Local KG      │    │  (No KB Entry)  │    │
│                        └─────────────────┘    └────────┬────────┘    │
│                                                        │             │
│                                                        ▼             │
│                                         Linked Entity (or NIL)       │
└──────────────────────────────────────────────────────────────────────┘
```

## Knowledge Bases

### Primary Knowledge Bases

| KB | Property | Use Case | Lookup Method |
|----|----------|----------|---------------|
| **Wikidata** | Q-entities | Primary reference KB | SPARQL + API |
| **VIAF** | Authority IDs | Organization authorities | SRU API |
| **ISIL** | Library/archive codes | Unique institution IDs | Direct lookup |
| **GeoNames** | Place IDs | Location disambiguation | API + DB |
| **Local KG** | GHCID | Internal entity resolution | TypeDB query |

### Identifier Cross-Reference Table

```python
IDENTIFIER_PROPERTIES = {
    "wikidata": {
        "isil": "P791",   # ISIL identifier
        "viaf": "P214",   # VIAF ID
        "isni": "P213",   # ISNI
        "ror": "P6782",   # ROR ID
        "gnd": "P227",    # GND ID (German)
        "loc": "P244",    # Library of Congress authority ID
        "bnf": "P268",    # BnF ID (French)
        "nta": "P1006",   # Nationale Thesaurus voor Auteurs (Dutch)
    }
}
```
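
Downstream SPARQL can be generated straight from this table so a single query fetches every known external identifier. A minimal sketch, using an abridged copy of the table above (the helper name `identifier_optionals` is ours, not part of the pipeline):

```python
# Abridged copy of IDENTIFIER_PROPERTIES for illustration.
IDENTIFIER_PROPERTIES = {
    "wikidata": {
        "isil": "P791",
        "viaf": "P214",
        "isni": "P213",
    }
}

def identifier_optionals(subject_var: str = "?item") -> str:
    """Build one SPARQL OPTIONAL clause per identifier property."""
    clauses = [
        f"OPTIONAL {{ {subject_var} wdt:{pid} ?{name} }}"
        for name, pid in IDENTIFIER_PROPERTIES["wikidata"].items()
    ]
    return "\n".join(clauses)

print(identifier_optionals())
```

Because the clauses are `OPTIONAL`, entities missing any given identifier still match; absent identifiers simply come back unbound.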

## DSPy Entity Linker Module

### EntityLinker Signature

```python
import dspy
from typing import List, Optional
from pydantic import BaseModel, Field


class LinkedEntity(BaseModel):
    """A linked entity with KB reference."""

    mention_text: str = Field(description="Original mention text")
    canonical_name: str = Field(description="Canonical name from KB")
    kb_id: str = Field(description="Knowledge base identifier")
    kb_source: str = Field(description="KB source: wikidata, viaf, isil, geonames, local")
    confidence: float = Field(ge=0.0, le=1.0)

    # Additional identifiers discovered
    wikidata_id: Optional[str] = None
    viaf_id: Optional[str] = None
    isil_code: Optional[str] = None
    ghcid: Optional[str] = None

    # Disambiguation features
    type_match: bool = Field(default=False, description="KB type matches expected type")
    location_match: bool = Field(default=False, description="Location context matches")


class EntityLinkerOutput(BaseModel):
    linked_entities: List[LinkedEntity]
    nil_entities: List[str] = Field(description="Mentions with no KB match (NIL)")


class EntityLinker(dspy.Signature):
    """Link extracted heritage institution mentions to knowledge bases.

    Linking strategy:
    1. Generate candidates from multiple KBs (Wikidata, VIAF, ISIL, local KG)
    2. Score candidates using name similarity, type matching, location context
    3. Apply disambiguation for ambiguous cases
    4. Detect NIL entities (no KB entry exists)

    Priority:
    - ISIL code match → highest confidence (unique identifier)
    - Wikidata exact match → high confidence
    - VIAF authority match → high confidence
    - Local KG GHCID match → medium confidence
    - Fuzzy name match → lower confidence, requires verification
    """

    entities: List[str] = dspy.InputField(desc="Extracted entity mentions to link")
    entity_types: List[str] = dspy.InputField(desc="Expected types (GLAMORCUBESFIXPHDNT)")
    context: str = dspy.InputField(desc="Surrounding text for disambiguation")
    country_hint: Optional[str] = dspy.InputField(default=None, desc="Country context")

    linked: EntityLinkerOutput = dspy.OutputField(desc="Linked entities")
```

## Candidate Generation

### Multi-Source Candidate Generator

```python
class CandidateGenerator:
    """Generate entity candidates from multiple knowledge bases."""

    def __init__(self):
        self.wikidata_client = WikidataClient()
        self.viaf_client = VIAFClient()
        self.isil_registry = ISILRegistry()
        self.geonames_client = GeoNamesClient()
        self.local_kg = TypeDBClient()

    def generate_candidates(
        self,
        mention: str,
        entity_type: str,
        country_hint: Optional[str] = None,
        max_candidates: int = 10,
    ) -> List[Candidate]:
        """Generate candidates from all sources."""

        candidates = []

        # 1. ISIL Registry (exact match for known codes)
        if self._looks_like_isil(mention):
            isil_candidate = self.isil_registry.lookup(mention)
            if isil_candidate:
                candidates.append(Candidate(
                    kb_id=mention,
                    kb_source="isil",
                    name=isil_candidate["name"],
                    score=1.0,  # Exact match
                ))

        # 2. Wikidata (label search + type filter)
        wd_candidates = self.wikidata_client.search_entities(
            query=mention,
            instance_of=self._type_to_wikidata_class(entity_type),
            country=country_hint,
            limit=max_candidates,
        )
        candidates.extend(wd_candidates)

        # 3. VIAF (organization search)
        if entity_type in ["A", "L", "M", "O", "R"]:  # Formal organizations
            viaf_candidates = self.viaf_client.search_organizations(
                query=mention,
                limit=max_candidates // 2,
            )
            candidates.extend(viaf_candidates)

        # 4. Local KG (GHCID lookup)
        local_candidates = self.local_kg.search_custodians(
            name_query=mention,
            custodian_type=entity_type,
            country=country_hint,
            limit=max_candidates // 2,
        )
        candidates.extend(local_candidates)

        return self._deduplicate(candidates)

    def _type_to_wikidata_class(self, glamor_type: str) -> str:
        """Map GLAMORCUBESFIXPHDNT type to Wikidata class."""
        TYPE_MAP = {
            "G": "Q1007870",   # art gallery
            "L": "Q7075",      # library
            "A": "Q166118",    # archive
            "M": "Q33506",     # museum
            "O": "Q2659904",   # government agency
            "R": "Q31855",     # research institute
            "B": "Q167346",    # botanical garden
            "E": "Q3918",      # university
            "S": "Q988108",    # historical society
            "H": "Q16970",     # church (with collections)
            "D": "Q35127",     # website / digital platform
        }
        return TYPE_MAP.get(glamor_type, "Q43229")  # Default: organization

    def _looks_like_isil(self, text: str) -> bool:
        import re
        # Simplified pattern: two-letter country prefix plus local part
        return bool(re.match(r"^[A-Z]{2}-[A-Za-z0-9]+$", text))
```
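
The `Candidate` object and `_deduplicate` helper are used throughout this document but never defined. A minimal sketch, with field names inferred from how the ranking and pipeline code accesses them:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    """Assumed shape of a KB candidate, inferred from usage in this doc."""
    kb_id: str
    kb_source: str            # "wikidata", "viaf", "isil", "geonames", "local"
    name: str
    score: float = 0.0
    description: str = ""
    isil: Optional[str] = None
    viaf: Optional[str] = None

def deduplicate(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the first candidate seen for each (kb_source, kb_id) pair."""
    seen, unique = set(), []
    for c in candidates:
        key = (c.kb_source, c.kb_id)
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```

Deduplication keys on `(kb_source, kb_id)` rather than name, since the same institution legitimately appears once per knowledge base and cross-KB merging happens later, at ranking time.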

### Wikidata Candidate Search

```python
class WikidataClient:
    """Wikidata entity search and lookup."""

    ENDPOINT = "https://query.wikidata.org/sparql"

    def search_entities(
        self,
        query: str,
        instance_of: Optional[str] = None,
        country: Optional[str] = None,
        limit: int = 10,
    ) -> List[Candidate]:
        """Search Wikidata entities by label."""

        # Build SPARQL query with filters
        filters = []
        if instance_of:
            filters.append(f"?item wdt:P31/wdt:P279* wd:{instance_of} .")
        if country:
            country_qid = self._country_to_qid(country)
            if country_qid:
                filters.append(f"?item wdt:P17 wd:{country_qid} .")

        filter_clause = "\n".join(filters)

        sparql = f"""
        SELECT ?item ?itemLabel ?itemDescription ?isil ?viaf WHERE {{
          SERVICE wikibase:mwapi {{
            bd:serviceParam wikibase:api "EntitySearch" .
            bd:serviceParam wikibase:endpoint "www.wikidata.org" .
            bd:serviceParam mwapi:search "{query}" .
            bd:serviceParam mwapi:language "en,nl,de,fr" .
            ?item wikibase:apiOutputItem mwapi:item .
          }}
          {filter_clause}
          OPTIONAL {{ ?item wdt:P791 ?isil }}
          OPTIONAL {{ ?item wdt:P214 ?viaf }}
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,nl,de,fr" }}
        }}
        LIMIT {limit}
        """

        results = self._execute_sparql(sparql)

        return [
            Candidate(
                kb_id=r["item"]["value"].split("/")[-1],
                kb_source="wikidata",
                name=r.get("itemLabel", {}).get("value", ""),
                description=r.get("itemDescription", {}).get("value", ""),
                isil=r.get("isil", {}).get("value"),
                viaf=r.get("viaf", {}).get("value"),
                score=0.0,  # Score computed later
            )
            for r in results
        ]

    def get_entity_details(self, qid: str) -> dict:
        """Get full entity details from Wikidata."""

        sparql = f"""
        SELECT ?prop ?propLabel ?value ?valueLabel WHERE {{
          wd:{qid} ?prop ?value .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,nl" }}
        }}
        """

        return self._execute_sparql(sparql)
```
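
`_country_to_qid` is referenced above but not shown. For a corpus limited to a handful of countries, a static lookup table is likely sufficient; a hypothetical sketch (the QIDs themselves are the canonical Wikidata items for these countries):

```python
from typing import Optional

# Hypothetical helper: extend this map as the corpus grows.
COUNTRY_QIDS = {
    "netherlands": "Q55",
    "germany": "Q183",
    "france": "Q142",
    "belgium": "Q31",
}

def country_to_qid(country: str) -> Optional[str]:
    """Resolve a country name to its Wikidata QID, or None if unknown."""
    return COUNTRY_QIDS.get(country.strip().lower())
```

Returning `None` for unknown countries matters: the caller only adds the `wdt:P17` filter when a QID was found, so an unrecognized hint degrades gracefully instead of filtering out everything.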

### VIAF Authority Search

```python
import requests


class VIAFClient:
    """VIAF (Virtual International Authority File) client."""

    SRU_ENDPOINT = "https://viaf.org/viaf/search"

    def search_organizations(
        self,
        query: str,
        limit: int = 10,
    ) -> List[Candidate]:
        """Search VIAF for corporate bodies."""

        # SRU CQL query
        cql_query = f'local.corporateNames all "{query}"'

        params = {
            "query": cql_query,
            "maximumRecords": limit,
            "httpAccept": "application/json",
            "recordSchema": "BriefVIAF",
        }

        response = requests.get(self.SRU_ENDPOINT, params=params)
        data = response.json()

        candidates = []
        for record in data.get("records", []):
            viaf_id = record.get("viafID")
            main_heading = record.get("mainHeadingEl", {}).get("datafield", {})
            name = self._extract_name(main_heading)

            candidates.append(Candidate(
                kb_id=viaf_id,
                kb_source="viaf",
                name=name,
                score=0.0,
            ))

        return candidates

    def get_authority_cluster(self, viaf_id: str) -> dict:
        """Get all authority records linked to a VIAF cluster."""

        url = f"https://viaf.org/viaf/{viaf_id}/viaf.json"
        response = requests.get(url)

        if response.status_code == 200:
            return response.json()
        return {}
```
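
`_extract_name` is left undefined above. VIAF headings arrive as MARC-style datafields whose JSON shape varies (subfields may be a single dict or a list); a defensive sketch assuming subfield code `a` carries the heading, which is a convention rather than a guarantee:

```python
def extract_name(datafield: dict) -> str:
    """Pull the main heading from a MARC-style datafield dict.

    Assumed shape: {"subfield": {...}} or {"subfield": [{...}, ...]},
    where each subfield has "@code" and "#text" keys. Code "a"
    conventionally holds the name.
    """
    subfields = datafield.get("subfield", [])
    if isinstance(subfields, dict):
        subfields = [subfields]  # normalize single subfield to a list
    for sf in subfields:
        if sf.get("@code") == "a":
            return sf.get("#text", "")
    return ""
```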

### ISIL Registry Lookup

```python
import sqlite3


class ISILRegistry:
    """ISIL (International Standard Identifier for Libraries) registry."""

    def __init__(self, db_path: str = "data/reference/isil_registry.db"):
        self.db_path = db_path

    def lookup(self, isil_code: str) -> Optional[dict]:
        """Look up institution by ISIL code."""

        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        cursor.execute("""
            SELECT name, city, country, institution_type, notes
            FROM isil_registry
            WHERE isil_code = ?
        """, (isil_code,))

        row = cursor.fetchone()
        conn.close()

        if row:
            return {
                "isil_code": isil_code,
                "name": row[0],
                "city": row[1],
                "country": row[2],
                "institution_type": row[3],
                "notes": row[4],
            }
        return None

    def search_by_name(
        self,
        name: str,
        country: Optional[str] = None,
        limit: int = 10,
    ) -> List[dict]:
        """Search ISIL registry by institution name."""

        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        query = """
            SELECT isil_code, name, city, country, institution_type
            FROM isil_registry
            WHERE name LIKE ?
        """
        params = [f"%{name}%"]

        if country:
            query += " AND country = ?"
            params.append(country)

        query += f" LIMIT {limit}"

        cursor.execute(query, params)
        rows = cursor.fetchall()
        conn.close()

        return [
            {
                "isil_code": row[0],
                "name": row[1],
                "city": row[2],
                "country": row[3],
                "institution_type": row[4],
            }
            for row in rows
        ]
```
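
The lookup above can be exercised end to end against an in-memory database built with the assumed schema. The ISIL code and institution below are fictitious, for illustration only:

```python
import sqlite3

# Build a throwaway in-memory registry with the schema the class expects,
# then run the same parameterized lookup query.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE isil_registry (
        isil_code TEXT PRIMARY KEY,
        name TEXT, city TEXT, country TEXT,
        institution_type TEXT, notes TEXT
    )
""")
conn.execute(
    "INSERT INTO isil_registry VALUES (?, ?, ?, ?, ?, ?)",
    ("NL-Example01", "Example Stadsarchief", "Utrecht", "NL", "archive", ""),
)
row = conn.execute(
    "SELECT name, city FROM isil_registry WHERE isil_code = ?",
    ("NL-Example01",),
).fetchone()
print(row)  # ('Example Stadsarchief', 'Utrecht')
conn.close()
```

Parameterized `?` placeholders keep mention text out of the SQL itself; only the `LIMIT` value is interpolated, and it is an integer under the caller's control.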

## Candidate Ranking

### Feature-Based Ranking

```python
from sentence_transformers import SentenceTransformer, util


class CandidateRanker:
    """Rank entity candidates using multiple features."""

    def __init__(self):
        self.name_matcher = NameMatcher()
        self.type_checker = TypeChecker()
        self.location_matcher = LocationMatcher()
        # Embedder used by _context_similarity; model choice illustrative
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def rank_candidates(
        self,
        mention: str,
        candidates: List[Candidate],
        context: str,
        expected_type: str,
        location_context: Optional[str] = None,
    ) -> List[Candidate]:
        """Rank candidates by combined feature score."""

        for candidate in candidates:
            # Feature 1: Name similarity
            name_score = self.name_matcher.similarity(mention, candidate.name)

            # Feature 2: Type match
            type_score = self.type_checker.type_match_score(
                candidate.kb_source,
                candidate.kb_id,
                expected_type,
            )

            # Feature 3: Location context
            location_score = 0.0
            if location_context:
                location_score = self.location_matcher.location_match_score(
                    candidate,
                    location_context,
                )

            # Feature 4: Context similarity
            context_score = self._context_similarity(candidate, context)

            # Feature 5: Source priority
            source_score = self._source_priority(candidate.kb_source)

            # Combine scores (weighted)
            candidate.score = (
                0.35 * name_score +
                0.25 * type_score +
                0.15 * location_score +
                0.15 * context_score +
                0.10 * source_score
            )

        # Sort by score descending
        candidates.sort(key=lambda c: c.score, reverse=True)
        return candidates

    def _source_priority(self, source: str) -> float:
        """Priority score for KB source (ISIL > Wikidata > VIAF > local)."""
        PRIORITIES = {
            "isil": 1.0,      # Unique identifier
            "wikidata": 0.9,  # Rich entity data
            "viaf": 0.8,      # Authority file
            "local": 0.7,     # Local KG
            "geonames": 0.6,  # Place data
        }
        return PRIORITIES.get(source, 0.5)

    def _context_similarity(self, candidate: Candidate, context: str) -> float:
        """Semantic similarity between candidate description and context."""
        if not candidate.description:
            return 0.5

        # Use sentence embeddings
        context_emb = self.embedder.encode(context)
        desc_emb = self.embedder.encode(candidate.description)

        return float(util.cos_sim(context_emb, desc_emb)[0][0])
```
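
A worked example of the weighted combination: a candidate with strong name and type evidence but middling context still clears the 0.7 auto-link bar. The feature values are illustrative, not from a real candidate:

```python
# Weights from rank_candidates; they sum to 1.0 so the combined
# score stays in [0, 1] when each feature is in [0, 1].
WEIGHTS = {"name": 0.35, "type": 0.25, "location": 0.15, "context": 0.15, "source": 0.10}
features = {"name": 0.9, "type": 1.0, "location": 0.8, "context": 0.6, "source": 0.9}

score = sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
print(round(score, 3))  # 0.865
```

Name similarity dominates by design: a wrong name is rarely rescued by good location or context evidence, while the reverse happens often.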

### Name Matching

```python
class NameMatcher:
    """Fuzzy name matching for entity linking."""

    def __init__(self):
        self.normalizer = NameNormalizer()

    def similarity(self, mention: str, candidate_name: str) -> float:
        """Compute name similarity score."""

        # Normalize both names
        norm_mention = self.normalizer.normalize(mention)
        norm_candidate = self.normalizer.normalize(candidate_name)

        # Exact match
        if norm_mention == norm_candidate:
            return 1.0

        # Token overlap (Jaccard)
        mention_tokens = set(norm_mention.split())
        candidate_tokens = set(norm_candidate.split())
        jaccard = len(mention_tokens & candidate_tokens) / len(mention_tokens | candidate_tokens)

        # Levenshtein ratio
        from rapidfuzz import fuzz
        levenshtein = fuzz.ratio(norm_mention, norm_candidate) / 100.0

        # Token sort ratio (order-independent)
        token_sort = fuzz.token_sort_ratio(norm_mention, norm_candidate) / 100.0

        # Combine scores
        return 0.4 * jaccard + 0.3 * levenshtein + 0.3 * token_sort


class NameNormalizer:
    """Normalize institution names for matching."""

    # Skip words by language (legal forms, articles)
    SKIP_WORDS = {
        "nl": ["stichting", "de", "het", "van", "voor", "en", "te"],
        "en": ["the", "of", "and", "for", "foundation", "trust", "inc"],
        "de": ["der", "die", "das", "und", "für", "stiftung", "e.v."],
        "fr": ["le", "la", "les", "de", "du", "et", "fondation"],
    }

    def normalize(self, name: str, language: str = "nl") -> str:
        """Normalize institution name."""

        import unicodedata
        import re

        # Lowercase
        name = name.lower()

        # Remove diacritics
        name = unicodedata.normalize("NFD", name)
        name = "".join(c for c in name if unicodedata.category(c) != "Mn")

        # Remove punctuation
        name = re.sub(r"[^\w\s]", " ", name)

        # Remove skip words
        skip = set(self.SKIP_WORDS.get(language, []))
        tokens = [t for t in name.split() if t not in skip]

        # Collapse whitespace
        return " ".join(tokens)
```
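
Normalization in action on two Dutch-language names, stdlib only; the diacritic and skip-word steps mirror `NameNormalizer.normalize` with the `nl` word list:

```python
import re
import unicodedata

SKIP_NL = {"stichting", "de", "het", "van", "voor", "en", "te"}

def normalize_nl(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, drop Dutch skip words."""
    name = name.lower()
    name = unicodedata.normalize("NFD", name)
    name = "".join(c for c in name if unicodedata.category(c) != "Mn")
    name = re.sub(r"[^\w\s]", " ", name)
    return " ".join(t for t in name.split() if t not in SKIP_NL)

print(normalize_nl("Stichting Het Zeeuws Archief"))  # zeeuws archief
print(normalize_nl("Bibliothèque Royale"))           # bibliotheque royale
```

Dropping legal-form words like "stichting" is what lets a mention such as "Zeeuws Archief" match the registered name "Stichting Het Zeeuws Archief" exactly rather than fuzzily.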

### Type Checking

```python
class TypeChecker:
    """Check if candidate type matches expected type."""

    # Wikidata class mappings for GLAMORCUBESFIXPHDNT
    WIKIDATA_TYPE_MAP = {
        "G": ["Q1007870", "Q207694"],    # art gallery, museum of art
        "L": ["Q7075", "Q856234"],       # library, national library
        "A": ["Q166118", "Q2860091"],    # archive, national archive
        "M": ["Q33506", "Q17431399"],    # museum, museum building
        "O": ["Q2659904", "Q327333"],    # government agency, public body
        "R": ["Q31855", "Q7315155"],     # research institute, research center
        "B": ["Q167346", "Q43501"],      # botanical garden, zoo
        "E": ["Q3918", "Q875538"],       # university, public university
        "S": ["Q988108", "Q15911314"],   # historical society, heritage organization
        "H": ["Q16970", "Q839954"],      # church, religious institute
        "D": ["Q35127", "Q856584"],      # website, digital library
    }

    def type_match_score(
        self,
        kb_source: str,
        kb_id: str,
        expected_type: str,
    ) -> float:
        """Score type compatibility."""

        if kb_source == "wikidata":
            return self._wikidata_type_match(kb_id, expected_type)
        elif kb_source == "isil":
            return 0.9  # ISIL implies library/archive type
        elif kb_source == "viaf":
            return 0.8  # VIAF implies organization

        return 0.5  # Unknown

    def _wikidata_type_match(self, qid: str, expected_type: str) -> float:
        """Check if Wikidata entity type matches expected."""

        expected_classes = self.WIKIDATA_TYPE_MAP.get(expected_type, [])
        if not expected_classes:
            return 0.5

        # Query Wikidata for instance_of
        sparql = f"""
        SELECT ?class WHERE {{
          wd:{qid} wdt:P31/wdt:P279* ?class .
          VALUES ?class {{ {' '.join(f'wd:{c}' for c in expected_classes)} }}
        }}
        LIMIT 1
        """

        results = wikidata_execute_sparql(sparql)

        if results:
            return 1.0  # Direct type match

        # Check for broader match
        sparql_broad = f"""
        SELECT ?class WHERE {{
          wd:{qid} wdt:P31 ?class .
        }}
        LIMIT 5
        """

        results_broad = wikidata_execute_sparql(sparql_broad)
        if results_broad:
            return 0.6  # Has some type, but not exact match

        return 0.3  # No type information
```
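
The `VALUES` clause built in `_wikidata_type_match` restricts solutions to the expected classes, so any returned row means the entity's `P31/P279*` closure reaches an acceptable type. The string assembly in isolation:

```python
# Same f-string join as in _wikidata_type_match.
expected_classes = ["Q7075", "Q856234"]
values_clause = "VALUES ?class { " + " ".join(f"wd:{c}" for c in expected_classes) + " }"
print(values_clause)  # VALUES ?class { wd:Q7075 wd:Q856234 }
```

Putting the class restriction in `VALUES` rather than a `FILTER` lets the query engine bind `?class` first and walk the subclass hierarchy from both ends, which is usually cheaper on the public endpoint.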

## Disambiguation Strategies

### Context-Based Disambiguation

```python
class DisambiguationModule(dspy.Module):
    """Disambiguate between multiple candidate matches."""

    def __init__(self):
        super().__init__()
        self.disambiguator = dspy.ChainOfThought(DisambiguationSignature)

    def forward(
        self,
        mention: str,
        candidates: List[Candidate],
        context: str,
    ) -> Optional[Candidate]:
        # Format candidates for the LLM
        candidate_descriptions = "\n".join([
            f"- {c.kb_source}:{c.kb_id} - {c.name}: {c.description or 'No description'}"
            for c in candidates[:5]  # Top 5
        ])

        result = self.disambiguator(
            mention=mention,
            candidates=candidate_descriptions,
            context=context,
        )

        # Parse result and find matching candidate
        selected_id = result.selected_id
        for candidate in candidates:
            if f"{candidate.kb_source}:{candidate.kb_id}" == selected_id:
                return candidate

        # Return top candidate if parsing fails
        return candidates[0] if candidates else None


class DisambiguationSignature(dspy.Signature):
    """Select the correct entity from candidates.

    Given a mention, multiple candidate matches, and surrounding context,
    determine which candidate is the correct entity reference.

    Consider:
    - Name similarity (exact vs partial match)
    - Type compatibility (is it the right kind of institution?)
    - Location context (does location match?)
    - Contextual clues (other entities, topics mentioned)
    """

    mention: str = dspy.InputField(desc="Entity mention text")
    candidates: str = dspy.InputField(desc="Formatted candidate list")
    context: str = dspy.InputField(desc="Surrounding text context")

    selected_id: str = dspy.OutputField(desc="Selected candidate ID (format: source:id)")
    reasoning: str = dspy.OutputField(desc="Explanation for selection")
```

### Geographic Disambiguation

```python
class LocationMatcher:
    """Disambiguate entities using location context."""

    def __init__(self):
        self.geonames = GeoNamesClient()

    def location_match_score(
        self,
        candidate: Candidate,
        location_context: str,
    ) -> float:
        """Score location compatibility."""

        # Extract locations from context
        context_locations = self._extract_locations(location_context)
        if not context_locations:
            return 0.5  # No location to match

        # Get candidate location
        candidate_location = self._get_candidate_location(candidate)
        if not candidate_location:
            return 0.5  # No candidate location

        # Compare locations
        for context_loc in context_locations:
            # Same city
            if self._same_city(context_loc, candidate_location):
                return 1.0

            # Same region
            if self._same_region(context_loc, candidate_location):
                return 0.8

            # Same country
            if self._same_country(context_loc, candidate_location):
                return 0.6

        return 0.2  # No location match

    def _get_candidate_location(self, candidate: Candidate) -> Optional[dict]:
        """Get location for candidate from KB."""

        if candidate.kb_source == "wikidata":
            sparql = f"""
            SELECT ?city ?country ?coords WHERE {{
              OPTIONAL {{ wd:{candidate.kb_id} wdt:P131 ?city }}
              OPTIONAL {{ wd:{candidate.kb_id} wdt:P17 ?country }}
              OPTIONAL {{ wd:{candidate.kb_id} wdt:P625 ?coords }}
            }}
            LIMIT 1
            """
            results = wikidata_execute_sparql(sparql)
            if results:
                return {
                    "city": results[0].get("city", {}).get("value"),
                    "country": results[0].get("country", {}).get("value"),
                    "coords": results[0].get("coords", {}).get("value"),
                }

        elif candidate.kb_source == "isil":
            # ISIL country from code prefix
            country_code = candidate.kb_id.split("-")[0]
            return {"country_code": country_code}

        return None
```
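
The ISIL-prefix shortcut above works because most ISIL prefixes are ISO 3166-1 alpha-2 country codes. A few registry-specific (non-country) prefixes exist, so production code should tolerate longer prefixes rather than assume two letters:

```python
def isil_country(isil_code: str) -> str:
    """Return the prefix before the first hyphen (usually a country code)."""
    return isil_code.split("-", 1)[0]

print(isil_country("NL-HaNA"))  # NL
```

Splitting on the first hyphen only (`maxsplit=1`) matters because local identifiers may themselves contain hyphens.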

## NIL Detection

### NIL Entity Classifier

```python
from datetime import datetime
from typing import Optional, Tuple


class NILDetector:
    """Detect entities with no knowledge base entry (NIL)."""

    def __init__(self, nil_threshold: float = 0.4):
        self.nil_threshold = nil_threshold

    def is_nil(
        self,
        mention: str,
        top_candidate: Optional[Candidate],
        context: str,
    ) -> Tuple[bool, str]:
        """Determine if mention refers to a NIL entity.

        Returns:
            (is_nil, reason)
        """

        # No candidates found
        if top_candidate is None:
            return True, "no_candidates_found"

        # Top candidate score below threshold
        if top_candidate.score < self.nil_threshold:
            return True, f"low_confidence_score_{top_candidate.score:.2f}"

        # Name too dissimilar
        name_sim = NameMatcher().similarity(mention, top_candidate.name)
        if name_sim < 0.5:
            return True, f"name_mismatch_{name_sim:.2f}"

        # Type mismatch (if type info available)
        # ...

        return False, "valid_match"

    def create_nil_entity(
        self,
        mention: str,
        entity_type: str,
        context: str,
        provenance: dict,
    ) -> dict:
        """Create a NIL entity record for later KB population."""

        return {
            "mention_text": mention,
            "entity_type": entity_type,
            "context_snippet": context[:500],
            "nil_reason": "no_kb_match",
            "provenance": provenance,
            "created_date": datetime.now().isoformat(),
            "status": "pending_verification",
        }
```
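
The decision cascade above, replayed on bare values to show which rule fires first (same ordering and default threshold; the standalone function is ours, for illustration):

```python
NIL_THRESHOLD = 0.4

def is_nil(top_score, name_similarity):
    """Mirror NILDetector.is_nil on plain floats; top_score None = no candidates."""
    if top_score is None:
        return True, "no_candidates_found"
    if top_score < NIL_THRESHOLD:
        return True, f"low_confidence_score_{top_score:.2f}"
    if name_similarity < 0.5:
        return True, f"name_mismatch_{name_similarity:.2f}"
    return False, "valid_match"

print(is_nil(0.35, 0.9))  # (True, 'low_confidence_score_0.35')
print(is_nil(0.8, 0.9))   # (False, 'valid_match')
```

Recording the firing rule in the reason string is what makes the pending NIL queue reviewable: a `name_mismatch` record needs different human attention than a `no_candidates_found` one.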

## Full Entity Linking Pipeline

```python
class EntityLinkingPipeline(dspy.Module):
    """Complete entity linking pipeline."""

    def __init__(self):
        super().__init__()
        self.candidate_generator = CandidateGenerator()
        self.candidate_ranker = CandidateRanker()
        self.disambiguator = DisambiguationModule()
        self.nil_detector = NILDetector()

    def forward(
        self,
        entities: List[dict],  # [{mention, type, context}]
        country_hint: Optional[str] = None,
    ) -> EntityLinkerOutput:

        linked_entities = []
        nil_entities = []

        for entity in entities:
            mention = entity["mention"]
            entity_type = entity["type"]
            context = entity["context"]

            # 1. Generate candidates
            candidates = self.candidate_generator.generate_candidates(
                mention=mention,
                entity_type=entity_type,
                country_hint=country_hint,
            )

            if not candidates:
                nil_entities.append(mention)
                continue

            # 2. Rank candidates
            ranked = self.candidate_ranker.rank_candidates(
                mention=mention,
                candidates=candidates,
                context=context,
                expected_type=entity_type,
                location_context=country_hint,
            )

            # 3. Disambiguate if needed
            if len(ranked) > 1 and ranked[0].score - ranked[1].score < 0.1:
                # Close scores - need disambiguation
                selected = self.disambiguator(
                    mention=mention,
                    candidates=ranked[:5],
                    context=context,
                )
            else:
                selected = ranked[0]

            # 4. NIL detection
            is_nil, nil_reason = self.nil_detector.is_nil(
                mention=mention,
                top_candidate=selected,
                context=context,
            )

            if is_nil:
                nil_entities.append(mention)
                continue

            # 5. Create linked entity
            linked_entities.append(LinkedEntity(
                mention_text=mention,
                canonical_name=selected.name,
                kb_id=selected.kb_id,
                kb_source=selected.kb_source,
                confidence=selected.score,
                wikidata_id=selected.kb_id if selected.kb_source == "wikidata" else None,
                viaf_id=selected.viaf,
                isil_code=selected.isil,
                type_match=selected.score > 0.7,
            ))

        return EntityLinkerOutput(
            linked_entities=linked_entities,
            nil_entities=nil_entities,
        )
```

## Confidence Thresholds

| Scenario | Threshold | Action |
|----------|-----------|--------|
| **Exact ISIL match** | 1.0 | Auto-link |
| **Wikidata exact name + type** | ≥0.9 | Auto-link |
| **Fuzzy match, high context** | ≥0.7 | Auto-link |
| **Fuzzy match, low context** | 0.5-0.7 | Flag for review |
| **Low score** | <0.5 | Mark as NIL |
| **No candidates** | 0.0 | Create NIL record |
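
A dispatch helper mirroring the table; threshold values are copied from the rows above, and the function name is ours:

```python
def linking_action(score: float, exact_isil: bool = False) -> str:
    """Map a combined confidence score to the action from the threshold table."""
    if exact_isil or score >= 0.7:
        return "auto_link"
    if score >= 0.5:
        return "flag_for_review"
    if score > 0.0:
        return "mark_nil"
    return "create_nil_record"
```

Keeping this mapping in one function means a threshold change (say, tightening auto-link to 0.75) touches a single place instead of being scattered through the pipeline.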

## See Also

- [04-entity-extraction.md](./04-entity-extraction.md) - NER patterns and extraction
- [07-sparql-templates.md](./07-sparql-templates.md) - Wikidata SPARQL queries
- [06-retrieval-patterns.md](./06-retrieval-patterns.md) - KG retrieval strategies
- [AGENTS.md](../../AGENTS.md) - Rule 1 (Ontology consultation), Rule 10 (CH-Annotator)