Algorithm
This page explains the graph-based algorithm that powers Crosstem’s derivational stemming.
Graph Representation
Crosstem treats morphology as a graph problem instead of a rules-based problem:
Nodes: Individual words (e.g., “organize”, “organization”, “organizational”)
Edges: Derivational relationships (DERIVED_FROM connections)
Graph Structure: Pre-computed from MorphyNet linguistic data, stored as nested JSON dictionaries
Why Graph-Based?
Traditional stemmers use suffix-stripping rules:
# Porter Stemmer approach
"organizational" → strip "al" → "organization"
→ strip "tion" → "organiz"
→ ERROR: overstemming
Graph-based approach:
# Crosstem approach
"organizational" → DERIVED_FROM → "organization"
→ DERIVED_FROM → "organize"
→ [STOP: root found]
BFS Traversal
Crosstem uses Breadth-First Search (BFS) to find the morphological root:
Start at the input word
Explore all words it derives from (direct parents)
Score each candidate based on linguistic features
Continue traversing until reaching the best root
Maximum depth: 3 hops
Multi-Hop Example
"organizational" (adjective)
↓ hop 1
"organization" (noun)
↓ hop 2
"organize" (verb) ← ROOT FOUND
This 2-hop traversal is impossible for single-pass suffix strippers.
Scoring Function
At each hop, candidates are scored based on three factors:
Word Length (shorter = better)
Penalty: +2 points per hop depth
Rationale: Roots tend to be shorter than derivatives
Part of Speech (verbs = roots)
Verbs: -10 points (strong preference)
Nouns: -5 points (moderate preference)
Rationale: Many derivations start from verbal roots
Productivity (high derivatives = real root)
Must exceed language-specific threshold
Filters out archaic/rare words
Example: “run” (33 derivatives) vs “hap” (8 derivatives)
Scoring Example
Candidate: "organize" (verb, 13 derivations)
- Length penalty: +2 (depth 2)
- POS bonus: -10 (verb)
- Productivity: ✓ passes threshold (≥5 for English verbs)
- Final score: -8 (lower is better)
Candidate: "hap" (noun, 8 derivations)
- Length penalty: +2 (depth 2)
- POS bonus: -5 (noun)
- Productivity: ✗ fails threshold (≥9 for English nouns)
- Result: FILTERED OUT
Productivity Thresholds
Different languages have different morphological productivity. Thresholds are calibrated per language:
English (Rich Morphology)
Verbs: ≥5 derivations
Nouns: ≥9 derivations
Rationale: English has extensive derivational morphology
French/Italian (Moderate)
Verbs: ≥4 derivations
Nouns: ≥5 derivations
Rationale: Romance languages have moderate productivity
Russian/Slavic (Low)
Verbs: ≥3 derivations
Nouns: ≥2-3 derivations
Rationale: Slavic morphology relies more on inflection than derivation
German (Compound-Heavy)
Verbs: ≥4 derivations
Nouns: ≥3 derivations
Rationale: German uses compounding more than derivation
Why Faster Than Porter?
Porter Stemmer
Applies regex rules sequentially on every character
Multiple conditional branches
String manipulation at each step
~90,000 words/second
Crosstem
Hash table lookup (O(1) average case)
Pre-computed graph stored in memory
No regex, no string manipulation
Just dictionary access + BFS traversal
~547,000 words/second
Result: 6× faster while being more accurate
Algorithm Pseudocode
function find_root(word, language):
graph = load_graph(language)
if word not in graph:
return word # Unknown word, return as-is
queue = [(word, 0)] # (word, depth)
visited = set()
best_candidate = word
best_score = infinity
while queue:
current, depth = queue.pop(0)
if depth > MAX_DEPTH:
break
for parent in graph[current].derives_from:
if parent in visited:
continue
visited.add(parent)
# Check productivity threshold
if not meets_productivity(parent, language):
continue
# Score candidate
score = calculate_score(parent, depth)
if score < best_score:
best_score = score
best_candidate = parent
# Add to queue for further exploration
queue.append((parent, depth + 1))
return best_candidate
Limitations
Data Quality
The algorithm is only as good as the input data:
Bad edges in MorphyNet → wrong roots
Example: “democracy” incorrectly derives from “democrat” in the data
Mitigation: Productivity filtering catches many errors
Arbitrary Thresholds
Productivity thresholds are calibrated empirically:
Based on percentile analysis of each language
May not generalize to all domains
Trade-off between precision and coverage
Finite Corpus
The graph only contains words in MorphyNet:
Domain-specific jargon not included
Neologisms and slang missing
Historical/archaic terms may be incomplete
Future Improvements
Potential enhancements to the algorithm:
Contextual scoring: Consider surrounding words in sentences
Frequency weighting: Prefer common words over rare ones
Semantic similarity: Use embeddings to validate derivational relationships
Dynamic thresholds: Learn optimal thresholds per corpus
Error correction: Detect and fix bad edges in the graph
References
MorphyNet: https://morphynet.org/
BFS algorithm: https://en.wikipedia.org/wiki/Breadth-first_search
Derivational morphology: https://en.wikipedia.org/wiki/Derivation_(linguistics)