Algorithm

This page explains the graph-based algorithm that powers Crosstem’s derivational stemming.

Graph Representation

Crosstem treats morphology as a graph problem instead of a rules-based problem:

  • Nodes: Individual words (e.g., “organize”, “organization”, “organizational”)

  • Edges: Derivational relationships (DERIVED_FROM connections)

  • Graph Structure: Pre-computed from MorphyNet linguistic data, stored as nested JSON dictionaries

Why Graph-Based?

Traditional stemmers use suffix-stripping rules:

# Porter Stemmer approach
"organizational" → strip "al" → "organization"
                 → strip "tion" → "organiz"
                 → ERROR: overstemming

Graph-based approach:

# Crosstem approach
"organizational" → DERIVED_FROM → "organization"
                 → DERIVED_FROM → "organize"
                 → [STOP: root found]

BFS Traversal

Crosstem uses Breadth-First Search (BFS) to find the morphological root:

  1. Start at the input word

  2. Explore all words it derives from (direct parents)

  3. Score each candidate based on linguistic features

  4. Continue traversing until reaching the best root

  5. Maximum depth: 3 hops

Multi-Hop Example

"organizational" (adjective)
     ↓ hop 1
"organization" (noun)
     ↓ hop 2
"organize" (verb) ← ROOT FOUND

This 2-hop traversal is impossible for single-pass suffix strippers.

Scoring Function

At each hop, candidates are scored based on three factors:

  1. Word Length (shorter = better)

    • Penalty: +2 points per hop depth

    • Rationale: Roots tend to be shorter than derivatives

  2. Part of Speech (verbs = roots)

    • Verbs: -10 points (strong preference)

    • Nouns: -5 points (moderate preference)

    • Rationale: Many derivations start from verbal roots

  3. Productivity (high derivatives = real root)

    • Must exceed language-specific threshold

    • Filters out archaic/rare words

    • Example: “run” (33 derivatives) vs “hap” (8 derivatives)

Scoring Example

Candidate: "organize" (verb, 13 derivations)
- Length penalty: +2 (depth 2)
- POS bonus: -10 (verb)
- Productivity: ✓ passes threshold (≥5 for English verbs)
- Final score: -8 (lower is better)

Candidate: "hap" (noun, 8 derivations)
- Length penalty: +2 (depth 2)
- POS bonus: -5 (noun)
- Productivity: ✗ fails threshold (≥9 for English nouns)
- Result: FILTERED OUT

Productivity Thresholds

Different languages have different morphological productivity. Thresholds are calibrated per language:

English (Rich Morphology)

  • Verbs: ≥5 derivations

  • Nouns: ≥9 derivations

  • Rationale: English has extensive derivational morphology

French/Italian (Moderate)

  • Verbs: ≥4 derivations

  • Nouns: ≥5 derivations

  • Rationale: Romance languages have moderate productivity

Russian/Slavic (Low)

  • Verbs: ≥3 derivations

  • Nouns: ≥2-3 derivations

  • Rationale: Slavic morphology relies more on inflection than derivation

German (Compound-Heavy)

  • Verbs: ≥4 derivations

  • Nouns: ≥3 derivations

  • Rationale: German uses compounding more than derivation

Why Faster Than Porter?

Porter Stemmer

  • Applies regex rules sequentially on every character

  • Multiple conditional branches

  • String manipulation at each step

  • ~90,000 words/second

Crosstem

  • Hash table lookup (O(1) average case)

  • Pre-computed graph stored in memory

  • No regex, no string manipulation

  • Just dictionary access + BFS traversal

  • ~547,000 words/second

Result: 6× faster while being more accurate

Algorithm Pseudocode

function find_root(word, language):
    graph = load_graph(language)

    if word not in graph:
        return word  # Unknown word, return as-is

    queue = [(word, 0)]  # (word, depth)
    visited = set()
    best_candidate = word
    best_score = infinity

    while queue:
        current, depth = queue.pop(0)

        if depth > MAX_DEPTH:
            break

        for parent in graph[current].derives_from:
            if parent in visited:
                continue

            visited.add(parent)

            # Check productivity threshold
            if not meets_productivity(parent, language):
                continue

            # Score candidate
            score = calculate_score(parent, depth)

            if score < best_score:
                best_score = score
                best_candidate = parent

            # Add to queue for further exploration
            queue.append((parent, depth + 1))

    return best_candidate

Limitations

Data Quality

The algorithm is only as good as the input data:

  • Bad edges in MorphyNet → wrong roots

  • Example: “democracy” incorrectly derives from “democrat” in the data

  • Mitigation: Productivity filtering catches many errors

Arbitrary Thresholds

Productivity thresholds are calibrated empirically:

  • Based on percentile analysis of each language

  • May not generalize to all domains

  • Trade-off between precision and coverage

Finite Corpus

The graph only contains words in MorphyNet:

  • Domain-specific jargon not included

  • Neologisms and slang missing

  • Historical/archaic terms may be incomplete

Future Improvements

Potential enhancements to the algorithm:

  1. Contextual scoring: Consider surrounding words in sentences

  2. Frequency weighting: Prefer common words over rare ones

  3. Semantic similarity: Use embeddings to validate derivational relationships

  4. Dynamic thresholds: Learn optimal thresholds per corpus

  5. Error correction: Detect and fix bad edges in the graph

References