User Guide

This guide covers detailed usage of Crosstem’s features.

Derivational Stemming

Basic Usage

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')
root = stemmer.stem('organization')
print(root)  # organize

The stemmer finds the morphological root by traversing derivational relationships in the linguistic graph.

Understanding Stems

Crosstem returns linguistic roots, not just suffix-stripped words:

# Traditional stemmers
Porter('organization')  → 'organ'      # WRONG (overstemming)
Lancaster('organization') → 'org'      # WRONG (aggressive)

# Lemmatizers
WordNet('organization') → 'organization'  # Preserves POS boundary

# Crosstem
Crosstem('organization') → 'organize'  # TRUE ROOT (crosses POS)

Multi-Hop Derivations

Some words require multiple steps to reach their root:

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')

# 2-hop example
print(stemmer.stem('organizational'))
# organizational → organization → organize

# 3-hop example
print(stemmer.stem('destructiveness'))
# destructiveness → destructive → destruction → destruct
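
To make the hop counts concrete, the sketch below records every intermediate form on the way to the root. It is only an illustration of walking derivational edges, not Crosstem's actual implementation: the `PARENT` map is hand-written to mirror the two chains above, and `derivation_chain` is a hypothetical helper that is not part of the Crosstem API.

```python
from typing import Dict, List

# Hand-written edges that mirror the two example chains (illustrative only).
PARENT: Dict[str, str] = {
    "organizational": "organization",
    "organization": "organize",
    "destructiveness": "destructive",
    "destructive": "destruction",
    "destruction": "destruct",
}


def derivation_chain(word: str, max_hops: int = 10) -> List[str]:
    """Return the word followed by each successive parent, up to max_hops."""
    chain = [word]
    # max_hops also protects against cycles in malformed data.
    while chain[-1] in PARENT and len(chain) <= max_hops:
        chain.append(PARENT[chain[-1]])
    return chain


for w in ("organizational", "destructiveness"):
    chain = derivation_chain(w)
    print(f"{len(chain) - 1} hops: {' -> '.join(chain)}")
# 2 hops: organizational -> organization -> organize
# 3 hops: destructiveness -> destructive -> destruction -> destruct
```

A word that is missing from the map comes back as a one-element chain, which matches the "returned unchanged" behavior described under Error Handling.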

Batch Processing

Stem a whole list of words with a single, reused stemmer instance:

words = [
    'organization', 'organizational', 'organize',
    'organizing', 'organizer', 'reorganize'
]

stems = [stemmer.stem(word) for word in words]
print(stems)
# ['organize', 'organize', 'organize',
#  'organize', 'organize', 'organize']
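
When a batch contains many repeated tokens, it can help to stem each distinct word only once and group the results. The sketch below assumes nothing about Crosstem beyond a callable that maps a word to its stem; `group_by_stem` and `toy_stem` are hypothetical names used for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_stem(words: Iterable[str],
                  stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group words under their stem, calling `stem` once per distinct word."""
    cache: Dict[str, str] = {}
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        if word not in cache:
            cache[word] = stem(word)
        groups[cache[word]].append(word)
    return dict(groups)


# Stand-in for stemmer.stem so the sketch runs without Crosstem installed.
TOY_ROOTS = {"organization": "organize", "organizer": "organize", "runner": "run"}


def toy_stem(word: str) -> str:
    return TOY_ROOTS.get(word, word)


print(group_by_stem(["organization", "runner", "organizer", "organization"], toy_stem))
# {'organize': ['organization', 'organizer', 'organization'], 'run': ['runner']}
```

With Crosstem, `stemmer.stem` would be passed as the second argument in place of `toy_stem`.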

Word Families

Get All Derivatives

Find all words derived from a root:

stemmer = DerivationalStemmer('eng')

family = stemmer.get_word_family('organize')
print(f"Found {len(family)} related words")
print(sorted(family))

Example output:

Found 43 related words
['disorganization', 'disorganize', 'disorganized',
 'organ', 'organic', 'organism', 'organization',
 'organizational', 'organize', 'organized',
 'organizer', 'reorganization', 'reorganize', ...]

Use Cases

Word families are useful for:

  • Information retrieval: Find all variants of a search term

  • Corpus analysis: Group related terms together

  • Vocabulary learning: Discover word relationships

  • Text normalization: Standardize related forms
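
As a rough sketch of the information-retrieval case, a search term can be stemmed to its root and then expanded to the root's word family before matching documents. The helpers `expand_query` and `matching_docs`, and the toy stand-ins for `stem` and `get_word_family`, are assumptions made for this example and are not part of Crosstem.

```python
from typing import Callable, Iterable, List, Set


def expand_query(terms: Iterable[str],
                 stem: Callable[[str], str],
                 family: Callable[[str], Iterable[str]]) -> Set[str]:
    """Expand each term to the word family of its root (plus the term itself)."""
    expanded: Set[str] = set()
    for term in terms:
        expanded.add(term)
        expanded.update(family(stem(term)))
    return expanded


def matching_docs(docs: List[str], vocabulary: Set[str]) -> List[str]:
    """Return documents that contain at least one vocabulary word."""
    return [d for d in docs if vocabulary & set(d.lower().split())]


# Toy stand-ins for stemmer.stem and stemmer.get_word_family.
TOY_ROOTS = {"organization": "organize", "reorganize": "organize"}
TOY_FAMILIES = {"organize": {"organize", "organization", "reorganize"}}


def toy_stem(word: str) -> str:
    return TOY_ROOTS.get(word, word)


def toy_family(root: str) -> Set[str]:
    return TOY_FAMILIES.get(root, {root})


docs = ["they reorganize every year", "a new organization chart", "nothing relevant here"]
vocab = expand_query(["organization"], toy_stem, toy_family)
print(sorted(vocab))  # ['organization', 'organize', 'reorganize']
print(matching_docs(docs, vocab))
# ['they reorganize every year', 'a new organization chart']
```

Whitespace tokenization is used here only to keep the example short; a real pipeline would use its own tokenizer.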

Inflectional Analysis

Basic Usage

from crosstem import InflectionAnalyzer

analyzer = InflectionAnalyzer('eng')

# Get all inflections of a word
inflections = analyzer.get_inflections('run')
print(inflections)
# {'run', 'runs', 'running', 'ran'}
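
Because `get_inflections` returns every form of a base word, the mapping can be inverted to look up the base form of an inflected token. The sketch below assumes only that the method returns an iterable of strings, as the output above suggests; `build_form_index` and the toy lookup are hypothetical.

```python
from typing import Callable, Dict, Iterable


def build_form_index(lemmas: Iterable[str],
                     get_inflections: Callable[[str], Iterable[str]]) -> Dict[str, str]:
    """Map every inflected form back to the base form it was listed under."""
    index: Dict[str, str] = {}
    for lemma in lemmas:
        for form in get_inflections(lemma):
            index.setdefault(form, lemma)  # first lemma wins on collisions
    return index


# Toy stand-in for analyzer.get_inflections.
TOY_INFLECTIONS = {
    "run": {"run", "runs", "running", "ran"},
    "go": {"go", "goes", "going", "went", "gone"},
}


def toy_get_inflections(lemma: str) -> Iterable[str]:
    return TOY_INFLECTIONS.get(lemma, {lemma})


index = build_form_index(["run", "go"], toy_get_inflections)
print(index["went"])  # go
print(index["ran"])   # run
```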

Difference from Stemming

  • Inflections: Same word, different grammatical form (run/runs/ran)

  • Derivations: Related words, different meaning (organize/organization)

Crosstem handles both:

# Inflectional analysis
analyzer.get_inflections('go')  # {'go', 'goes', 'going', 'went', 'gone'}

# Derivational stemming
stemmer.stem('going')         # 'go'
stemmer.stem('organization')  # 'organize' (a different word)

Etymology Tracing

Setup

First, download the etymology dataset:

from crosstem import download_etymology, is_etymology_downloaded

if not is_etymology_downloaded():
    download_etymology()

Basic Usage

from crosstem import EtymologyLinker

linker = EtymologyLinker()

# Trace etymology of a word
etymology = linker.get_etymology('English', 'organize')
print(etymology)

Etymology Relationships

The etymology data includes several relationship types:

  • INHERITED_FROM: Word inherited from ancestor language

  • BORROWED_FROM: Word borrowed/loaned from another language

  • DERIVED_FROM: Word derived from another word

  • ETYMOLOGICAL_ORIGIN_OF: Inverse relationship
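
One way to picture these relationship types is as labeled edges between (language, word) pairs. The sketch below is purely illustrative: the `Relation` enum, the `Edge` tuple, and the `invert` helper are hypothetical and say nothing about how Crosstem stores or returns etymology data.

```python
from enum import Enum
from typing import List, NamedTuple


class Relation(Enum):
    INHERITED_FROM = "inherited_from"
    BORROWED_FROM = "borrowed_from"
    DERIVED_FROM = "derived_from"
    ETYMOLOGICAL_ORIGIN_OF = "etymological_origin_of"


class Edge(NamedTuple):
    language: str
    word: str
    relation: Relation
    other_language: str
    other_word: str


def invert(edge: Edge) -> Edge:
    """Turn 'X <relation> Y' into 'Y is the etymological origin of X'."""
    return Edge(edge.other_language, edge.other_word,
                Relation.ETYMOLOGICAL_ORIGIN_OF, edge.language, edge.word)


# Hand-written example edges (not taken from the Crosstem dataset).
edges: List[Edge] = [
    Edge("English", "beef", Relation.BORROWED_FROM, "Old French", "boef"),
    Edge("English", "father", Relation.INHERITED_FROM, "Old English", "fæder"),
]

borrowed = [e for e in edges if e.relation is Relation.BORROWED_FROM]
print(borrowed[0].word)          # beef
print(invert(borrowed[0]).word)  # boef
```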

Cross-Lingual Queries

linker = EtymologyLinker()

# Find words borrowed into English from French
french_loans = linker.get_borrowed_words('English', 'French')
print(f"Found {len(french_loans)} French loanwords")

Multi-Language Support

Supported Languages

Crosstem supports 15 languages with full derivational data:

Code  Language        Words
cat   Catalan         ~50K
ces   Czech           ~40K
deu   German          ~120K
eng   English         ~150K
fin   Finnish         ~60K
fra   French          ~90K
hbs   Serbo-Croatian  ~35K
hun   Hungarian       ~45K
ita   Italian         ~80K
mon   Mongolian       ~25K
pol   Polish          ~55K
por   Portuguese      ~70K
rus   Russian         ~100K
spa   Spanish         ~85K
swe   Swedish         ~65K

Usage Example

# German
de_stemmer = DerivationalStemmer('deu')
print(de_stemmer.stem('Organisierung'))  # organisieren

# French
fr_stemmer = DerivationalStemmer('fra')
print(fr_stemmer.stem('organisateur'))  # organiser

# Spanish
es_stemmer = DerivationalStemmer('spa')
print(es_stemmer.stem('organizador'))  # organizar

# Russian
ru_stemmer = DerivationalStemmer('rus')
print(ru_stemmer.stem('организация'))  # организовать

Language-Specific Behavior

Each language has calibrated productivity thresholds:

  • English: High threshold (rich derivational morphology)

  • German: Moderate (compound-heavy)

  • Russian: Low threshold (inflection-heavy)

See Algorithm for details on language-specific tuning.

Performance Tips

Memory Usage

  • Base package: ~280 MB (derivational data for 15 languages)

  • Etymology data: ~1 GB (optional)

  • In-memory graph: Loaded once per language

Speed Optimization

# ✓ GOOD: Reuse stemmer instance
stemmer = DerivationalStemmer('eng')
for word in large_corpus:
    stem = stemmer.stem(word)

# ✗ BAD: Creating new instance each time
for word in large_corpus:
    stemmer = DerivationalStemmer('eng')  # Reloads graph!
    stem = stemmer.stem(word)
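
When stemmers for several languages are needed in different parts of a program, one way to guarantee reuse is a small cached factory. The sketch below uses `functools.lru_cache` with a stand-in class so it runs without Crosstem; `ToyStemmer` and `get_stemmer` are hypothetical names.

```python
from functools import lru_cache


class ToyStemmer:
    """Stand-in for DerivationalStemmer that counts how often it is built."""
    instances = 0

    def __init__(self, lang: str) -> None:
        ToyStemmer.instances += 1
        self.lang = lang

    def stem(self, word: str) -> str:
        return word


@lru_cache(maxsize=None)
def get_stemmer(lang: str) -> ToyStemmer:
    """Build one stemmer per language code and reuse it on later calls."""
    return ToyStemmer(lang)  # with Crosstem: DerivationalStemmer(lang)


for lang in ["eng", "deu", "eng", "eng"]:
    get_stemmer(lang).stem("word")

print(ToyStemmer.instances)  # 2
```

Four lookups construct only two objects, one per distinct language code, so each graph would be loaded once.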

Backend behavior:

# Default: use Rust backend when available
stemmer = DerivationalStemmer('eng')

# Force pure-Python backend for debugging/parity checks
stemmer_py = DerivationalStemmer('eng', use_rust_backend=False)

Benchmark: when the Rust backend is active, it is typically about 2-3x faster than the pure-Python fallback and more than 10x faster than Porter on the bundled benchmarks.
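
To check the speedup on your own data, a small timing harness is usually enough. The sketch below is not the bundled benchmark; `time_stemmer` is a hypothetical helper that times any word-to-stem callable with `time.perf_counter`, and `str.lower` stands in for a real stemmer so the example runs anywhere.

```python
import time
from typing import Callable, Iterable


def time_stemmer(stem: Callable[[str], str],
                 words: Iterable[str],
                 repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    word_list = list(words)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in word_list:
            stem(word)
        best = min(best, time.perf_counter() - start)
    return best


sample = ["organization", "organizational", "reorganize"] * 1000
elapsed = time_stemmer(str.lower, sample)  # str.lower is only a stand-in
print(f"{len(sample)} words in {elapsed:.6f} s")
```

With Crosstem, `stemmer.stem` and `stemmer_py.stem` from the backend example above could be timed on the same word list and compared. Taking the best of several passes reduces noise from other processes.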

Error Handling

Unknown Words

If a word is not in the graph, it’s returned unchanged:

stemmer = DerivationalStemmer('eng')

print(stemmer.stem('neologism123'))  # neologism123
print(stemmer.stem('organization'))  # organize (known word: root returned)
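
Note that an unchanged result does not by itself prove a word is unknown, since a root such as 'organize' also stems to itself. A simple way to review coverage is to separate words that changed from words that did not. `split_by_change` and `toy_stem` below are hypothetical helpers for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_by_change(words: Iterable[str],
                    stem: Callable[[str], str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate words whose stem differs from words returned unchanged."""
    changed: Dict[str, str] = {}
    unchanged: List[str] = []
    for word in words:
        root = stem(word)
        if root == word:
            unchanged.append(word)
        else:
            changed[word] = root
    return changed, unchanged


# Stand-in for stemmer.stem so the sketch runs without Crosstem installed.
def toy_stem(word: str) -> str:
    return {"organization": "organize"}.get(word, word)


changed, unchanged = split_by_change(
    ["organization", "organize", "neologism123"], toy_stem)
print(changed)    # {'organization': 'organize'}
print(unchanged)  # ['organize', 'neologism123']
```

The unchanged list mixes true roots with out-of-vocabulary tokens, so it is a starting point for inspection rather than a definitive list of unknown words.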

Invalid Language

try:
    stemmer = DerivationalStemmer('invalid')
except ValueError as e:
    print(f"Error: {e}")
    # Error: Language 'invalid' not supported
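
To fail fast before any data is loaded, the codes from the Supported Languages table can also be checked up front. The `SUPPORTED_LANGUAGES` mapping below is copied from that table; `check_language` is a hypothetical helper, and the whitespace and case normalization it applies is a choice made for this sketch, not documented Crosstem behavior.

```python
# Codes and names copied from the Supported Languages table.
SUPPORTED_LANGUAGES = {
    "cat": "Catalan", "ces": "Czech", "deu": "German", "eng": "English",
    "fin": "Finnish", "fra": "French", "hbs": "Serbo-Croatian",
    "hun": "Hungarian", "ita": "Italian", "mon": "Mongolian",
    "pol": "Polish", "por": "Portuguese", "rus": "Russian",
    "spa": "Spanish", "swe": "Swedish",
}


def check_language(code: str) -> str:
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        options = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Language '{code}' not supported; expected one of: {options}")
    return normalized


print(check_language("eng"))  # eng

try:
    check_language("invalid")
except ValueError as err:
    print(f"Error: {err}")
```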

Missing Etymology Data

from crosstem import is_etymology_downloaded, download_etymology

if not is_etymology_downloaded():
    print("Etymology data not found. Downloading...")
    download_etymology()

Best Practices

  1. Reuse instances: Don’t create new stemmers for each word

  2. Batch processing: Process lists of words in one go

  3. Check language support: Verify language code before initialization

  4. Download etymology once: Check if data exists before downloading

  5. Handle unknown words: Plan for words that are not in the graph

Next Steps