Algorithm

This page explains the graph-based algorithm that powers Crosstem’s derivational stemming.

Graph Representation

Crosstem treats morphology as a graph problem instead of a rules-based problem:

Nodes: Individual words (e.g., “organize”, “organization”, “organizational”)
Edges: Derivational relationships (DERIVED_FROM connections)
Graph Structure: Pre-computed from MorphyNet linguistic data, stored as nested JSON dictionaries

Why Graph-Based?

Traditional stemmers use suffix-stripping rules:

# Porter Stemmer approach
"organizational" → strip "al" → "organization"
                 → strip "tion" → "organiz"
                 → ERROR: overstemming

Graph-based approach:

# Crosstem approach
"organizational" → DERIVED_FROM → "organization"
                 → DERIVED_FROM → "organize"
                 → [STOP: root found]

BFS Traversal

Crosstem uses Breadth-First Search (BFS) to find the morphological root:

Start at the input word
Explore all words it derives from (direct parents)
Score each candidate based on linguistic features
Continue traversing until reaching the best root
Maximum depth: 3 hops

Multi-Hop Example

"organizational" (adjective)
     ↓ hop 1
"organization" (noun)
     ↓ hop 2
"organize" (verb) ← ROOT FOUND

This 2-hop traversal is impossible for single-pass suffix strippers.

Scoring Function

At each hop, candidates are scored based on three factors:

Word Length (shorter = better)
- Penalty: +2 points per hop depth
- Rationale: Roots tend to be shorter than derivatives
Part of Speech (verbs = roots)
- Verbs: -10 points (strong preference)
- Nouns: -5 points (moderate preference)
- Rationale: Many derivations start from verbal roots
Productivity (high derivatives = real root)
- Must exceed language-specific threshold
- Filters out archaic/rare words
- Example: “run” (33 derivatives) vs “hap” (8 derivatives)

Scoring Example

Candidate: "organize" (verb, 13 derivations)
- Length penalty: +2 (depth 2)
- POS bonus: -10 (verb)
- Productivity: ✓ passes threshold (≥5 for English verbs)
- Final score: -8 (lower is better)

Candidate: "hap" (noun, 8 derivations)
- Length penalty: +2 (depth 2)
- POS bonus: -5 (noun)
- Productivity: ✗ fails threshold (≥9 for English nouns)
- Result: FILTERED OUT

Productivity Thresholds

Different languages have different morphological productivity. Thresholds are calibrated per language:

English (Rich Morphology)

Verbs: ≥5 derivations
Nouns: ≥9 derivations
Rationale: English has extensive derivational morphology

French/Italian (Moderate)

Verbs: ≥4 derivations
Nouns: ≥5 derivations
Rationale: Romance languages have moderate productivity

Russian/Slavic (Low)

Verbs: ≥3 derivations
Nouns: ≥2-3 derivations
Rationale: Slavic morphology relies more on inflection than derivation

German (Compound-Heavy)

Verbs: ≥4 derivations
Nouns: ≥3 derivations
Rationale: German uses compounding more than derivation

Why Faster Than Porter?

Porter Stemmer

Applies regex rules sequentially on every character
Multiple conditional branches
String manipulation at each step
~90,000 words/second

Crosstem

Hash table lookup (O(1) average case)
Pre-computed graph stored in memory
No regex, no string manipulation
Just dictionary access + BFS traversal
~547,000 words/second

Result: 6× faster while being more accurate

Algorithm Pseudocode

function find_root(word, language):
    graph = load_graph(language)

    if word not in graph:
        return word  # Unknown word, return as-is

    queue = [(word, 0)]  # (word, depth)
    visited = set()
    best_candidate = word
    best_score = infinity

    while queue:
        current, depth = queue.pop(0)

        if depth > MAX_DEPTH:
            break

        for parent in graph[current].derives_from:
            if parent in visited:
                continue

            visited.add(parent)

            # Check productivity threshold
            if not meets_productivity(parent, language):
                continue

            # Score candidate
            score = calculate_score(parent, depth)

            if score < best_score:
                best_score = score
                best_candidate = parent

            # Add to queue for further exploration
            queue.append((parent, depth + 1))

    return best_candidate

Limitations

Data Quality

The algorithm is only as good as the input data:

Bad edges in MorphyNet → wrong roots
Example: “democracy” incorrectly derives from “democrat” in the data
Mitigation: Productivity filtering catches many errors

Arbitrary Thresholds

Productivity thresholds are calibrated empirically:

Based on percentile analysis of each language
May not generalize to all domains
Trade-off between precision and coverage

Finite Corpus

The graph only contains words in MorphyNet:

Domain-specific jargon not included
Neologisms and slang missing
Historical/archaic terms may be incomplete

Future Improvements

Potential enhancements to the algorithm:

Contextual scoring: Consider surrounding words in sentences
Frequency weighting: Prefer common words over rare ones
Semantic similarity: Use embeddings to validate derivational relationships
Dynamic thresholds: Learn optimal thresholds per corpus
Error correction: Detect and fix bad edges in the graph

References

MorphyNet: https://morphynet.org/
BFS algorithm: https://en.wikipedia.org/wiki/Breadth-first_search
Derivational morphology: https://en.wikipedia.org/wiki/Derivation_(linguistics)