Algorithm ========= This page explains the graph-based algorithm that powers Crosstem's derivational stemming. Graph Representation -------------------- Crosstem treats morphology as a **graph problem** instead of a rules-based problem: * **Nodes**: Individual words (e.g., "organize", "organization", "organizational") * **Edges**: Derivational relationships (DERIVED_FROM connections) * **Graph Structure**: Pre-computed from MorphyNet linguistic data, stored as nested JSON dictionaries Why Graph-Based? ---------------- Traditional stemmers use suffix-stripping rules:: # Porter Stemmer approach "organizational" → strip "al" → "organization" → strip "tion" → "organiz" → ERROR: overstemming Graph-based approach:: # Crosstem approach "organizational" → DERIVED_FROM → "organization" → DERIVED_FROM → "organize" → [STOP: root found] BFS Traversal ------------- Crosstem uses **Breadth-First Search (BFS)** to find the morphological root: 1. Start at the input word 2. Explore all words it derives from (direct parents) 3. Score each candidate based on linguistic features 4. Continue traversing until reaching the best root 5. Maximum depth: 3 hops Multi-Hop Example ~~~~~~~~~~~~~~~~~ :: "organizational" (adjective) ↓ hop 1 "organization" (noun) ↓ hop 2 "organize" (verb) ← ROOT FOUND This 2-hop traversal is impossible for single-pass suffix strippers. Scoring Function ---------------- At each hop, candidates are scored based on three factors: 1. **Word Length** (shorter = better) * Penalty: +2 points per hop depth * Rationale: Roots tend to be shorter than derivatives 2. **Part of Speech** (verbs = roots) * Verbs: -10 points (strong preference) * Nouns: -5 points (moderate preference) * Rationale: Many derivations start from verbal roots 3. **Productivity** (high derivatives = real root) * Must exceed language-specific threshold * Filters out archaic/rare words * Example: "run" (33 derivatives) vs "hap" (8 derivatives) Scoring Example ~~~~~~~~~~~~~~~ :: Candidate: "organize" (verb, 13 derivations) - Length penalty: +2 (depth 2) - POS bonus: -10 (verb) - Productivity: ✓ passes threshold (≥5 for English verbs) - Final score: -8 (lower is better) Candidate: "hap" (noun, 8 derivations) - Length penalty: +2 (depth 2) - POS bonus: -5 (noun) - Productivity: ✗ fails threshold (≥9 for English nouns) - Result: FILTERED OUT Productivity Thresholds ----------------------- Different languages have different morphological productivity. Thresholds are calibrated per language: English (Rich Morphology) ~~~~~~~~~~~~~~~~~~~~~~~~~ * Verbs: ≥5 derivations * Nouns: ≥9 derivations * Rationale: English has extensive derivational morphology French/Italian (Moderate) ~~~~~~~~~~~~~~~~~~~~~~~~~~ * Verbs: ≥4 derivations * Nouns: ≥5 derivations * Rationale: Romance languages have moderate productivity Russian/Slavic (Low) ~~~~~~~~~~~~~~~~~~~~ * Verbs: ≥3 derivations * Nouns: ≥2-3 derivations * Rationale: Slavic morphology relies more on inflection than derivation German (Compound-Heavy) ~~~~~~~~~~~~~~~~~~~~~~~ * Verbs: ≥4 derivations * Nouns: ≥3 derivations * Rationale: German uses compounding more than derivation Why Faster Than Porter? ------------------------ Porter Stemmer ~~~~~~~~~~~~~~ * Applies regex rules sequentially on every character * Multiple conditional branches * String manipulation at each step * ~90,000 words/second Crosstem ~~~~~~~~ * Hash table lookup (O(1) average case) * Pre-computed graph stored in memory * No regex, no string manipulation * Just dictionary access + BFS traversal * ~547,000 words/second **Result: 6× faster while being more accurate** Algorithm Pseudocode -------------------- :: function find_root(word, language): graph = load_graph(language) if word not in graph: return word # Unknown word, return as-is queue = [(word, 0)] # (word, depth) visited = set() best_candidate = word best_score = infinity while queue: current, depth = queue.pop(0) if depth > MAX_DEPTH: break for parent in graph[current].derives_from: if parent in visited: continue visited.add(parent) # Check productivity threshold if not meets_productivity(parent, language): continue # Score candidate score = calculate_score(parent, depth) if score < best_score: best_score = score best_candidate = parent # Add to queue for further exploration queue.append((parent, depth + 1)) return best_candidate Limitations ----------- Data Quality ~~~~~~~~~~~~ The algorithm is only as good as the input data: * Bad edges in MorphyNet → wrong roots * Example: "democracy" incorrectly derives from "democrat" in the data * Mitigation: Productivity filtering catches many errors Arbitrary Thresholds ~~~~~~~~~~~~~~~~~~~~ Productivity thresholds are calibrated empirically: * Based on percentile analysis of each language * May not generalize to all domains * Trade-off between precision and coverage Finite Corpus ~~~~~~~~~~~~~ The graph only contains words in MorphyNet: * Domain-specific jargon not included * Neologisms and slang missing * Historical/archaic terms may be incomplete Future Improvements ------------------- Potential enhancements to the algorithm: 1. **Contextual scoring**: Consider surrounding words in sentences 2. **Frequency weighting**: Prefer common words over rare ones 3. **Semantic similarity**: Use embeddings to validate derivational relationships 4. **Dynamic thresholds**: Learn optimal thresholds per corpus 5. **Error correction**: Detect and fix bad edges in the graph References ---------- * MorphyNet: https://morphynet.org/ * BFS algorithm: https://en.wikipedia.org/wiki/Breadth-first_search * Derivational morphology: https://en.wikipedia.org/wiki/Derivation_(linguistics)