User Guide

This guide covers detailed usage of Crosstem’s features.

Derivational Stemming

Basic Usage

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')
root = stemmer.stem('organization')
print(root)  # organize

The stemmer finds the morphological root by traversing derivational relationships in the linguistic graph.

Understanding Stems

Crosstem returns linguistic roots, not just suffix-stripped strings:

# Traditional stemmers
Porter('organization')  → 'organ'      # WRONG (overstemming)
Lancaster('organization') → 'org'      # WRONG (aggressive)

# Lemmatizers
WordNet('organization') → 'organization'  # Preserves POS boundary

# Crosstem
Crosstem('organization') → 'organize'  # TRUE ROOT (crosses POS)

Multi-Hop Derivations

Some words require multiple steps to reach their root:

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')

# 2-hop example
print(stemmer.stem('organizational'))
# organizational → organization → organize

# 3-hop example
print(stemmer.stem('destructiveness'))
# destructiveness → destructive → destruction → destruct
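
To make the idea of hops concrete, here is a minimal sketch of following derived-to-base links one step at a time. The edge table and the trace_to_root helper are hypothetical illustrations and do not reflect Crosstem's internal graph or algorithm:

```python
# Toy derivational edges (derived word -> its base). Illustrative data only;
# this is NOT Crosstem's actual graph or traversal logic.
TOY_EDGES = {
    "organizational": "organization",
    "organization": "organize",
    "destructiveness": "destructive",
    "destructive": "destruction",
    "destruction": "destruct",
}


def trace_to_root(word, edges):
    """Follow derived -> base links until no further link exists."""
    path = [word]
    seen = {word}
    while path[-1] in edges:
        base = edges[path[-1]]
        if base in seen:  # guard against cycles in the data
            break
        path.append(base)
        seen.add(base)
    return path


for start in ("organizational", "destructiveness"):
    path = trace_to_root(start, TOY_EDGES)
    print(f"{len(path) - 1} hops: {' → '.join(path)}")
```

Each hop is one edge in the table, so the hop count is simply the path length minus one.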

Batch Processing

Process multiple words efficiently:

words = [
    'organization', 'organizational', 'organize',
    'organizing', 'organizer', 'reorganize'
]

stems = [stemmer.stem(word) for word in words]
print(stems)
# ['organize', 'organize', 'organize',
#  'organize', 'organize', 'organize']
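
When a batch contains repeated tokens, it may help to stem each distinct word once and group the results. The sketch below assumes only that you pass in a str-to-str callable such as stemmer.stem; group_by_stem and toy_stem are hypothetical helpers, and toy_stem exists only so the example runs without the library installed:

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Group words under the value returned by `stem` (any str -> str callable)."""
    cache = {}  # stem each distinct word only once
    groups = defaultdict(list)
    for word in words:
        if word not in cache:
            cache[word] = stem(word)
        groups[cache[word]].append(word)
    return dict(groups)


# Stand-in for stemmer.stem; the mapping is toy data for illustration.
TOY_STEMS = {"organization": "organize", "organizer": "organize", "running": "run"}


def toy_stem(word):
    return TOY_STEMS.get(word, word)


groups = group_by_stem(
    ["organization", "organizer", "running", "organization"], toy_stem
)
print(groups)
```

In real use, pass stemmer.stem in place of toy_stem.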

Word Families

Get All Derivatives

Find all words derived from a root:

stemmer = DerivationalStemmer('eng')

family = stemmer.get_word_family('organize')
print(f"Found {len(family)} related words")
print(sorted(family))

Example output:

Found 43 related words
['disorganization', 'disorganize', 'disorganized',
 'organ', 'organic', 'organism', 'organization',
 'organizational', 'organize', 'organized',
 'organizer', 'reorganization', 'reorganize', ...]

Use Cases

Word families are useful for:

Information retrieval: Find all variants of a search term
Corpus analysis: Group related terms together
Vocabulary learning: Discover word relationships
Text normalization: Standardize related forms
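
As an illustration of the information-retrieval case, the sketch below expands query terms with their word families and checks a document against the expanded set. It assumes only a callable that returns an iterable of related words, such as stemmer.get_word_family; expand_query, matches, and toy_family are hypothetical names, and the family data is a toy stand-in:

```python
import string


def expand_query(terms, get_family):
    """Return the query terms plus their family members, lowercased."""
    expanded = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded.update(word.lower() for word in get_family(key))
    return expanded


def matches(document, expanded_terms):
    """True if any whitespace-separated token of the document is in the set."""
    tokens = (tok.strip(string.punctuation) for tok in document.lower().split())
    return any(tok in expanded_terms for tok in tokens)


# Toy stand-in for stemmer.get_word_family; illustrative data only.
TOY_FAMILIES = {"organize": {"organize", "organization", "organizer", "reorganize"}}


def toy_family(word):
    return TOY_FAMILIES.get(word, set())


terms = expand_query(["Organize"], toy_family)
print(matches("The organizer persisted.", terms))   # True
print(matches("Nothing relevant here.", terms))     # False
```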

Inflectional Analysis

Basic Usage

from crosstem import InflectionAnalyzer

analyzer = InflectionAnalyzer('eng')

# Get all inflections of a word
inflections = analyzer.get_inflections('run')
print(inflections)
# {'run', 'runs', 'running', 'ran'}

Difference from Stemming

Inflections: Same word, different grammatical form (run/runs/ran)
Derivations: Related words, different meaning (organize/organization)

Crosstem handles both:

# Inflectional analysis
print(analyzer.get_inflections('go'))  # {'go', 'goes', 'going', 'went', 'gone'}

# Derivational stemming
print(stemmer.stem('going'))         # go
print(stemmer.stem('organization'))  # organize (a different word!)

Etymology Tracing

Setup

First, download the etymology dataset:

from crosstem import download_etymology, is_etymology_downloaded

if not is_etymology_downloaded():
    download_etymology()

Basic Usage

from crosstem import EtymologyLinker

linker = EtymologyLinker()

# Trace etymology of a word
etymology = linker.get_etymology('English', 'organize')
print(etymology)

Etymology Relationships

The etymology data includes several relationship types:

INHERITED_FROM: Word inherited from ancestor language
BORROWED_FROM: Word borrowed/loaned from another language
DERIVED_FROM: Word derived from another word
ETYMOLOGICAL_ORIGIN_OF: Inverse relationship
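
To show how the inverse relationship relates to the three forward ones, here is a minimal sketch that builds an origin index from forward edges. The (word, relation, source) record layout and the invert helper are assumptions made for illustration; the guide does not specify the format that EtymologyLinker returns:

```python
from collections import defaultdict

# Toy forward edges as (word, relation, source). Hypothetical layout,
# not the format used by Crosstem's etymology dataset.
EDGES = [
    ("English:beef", "BORROWED_FROM", "Old French:boef"),
    ("English:father", "INHERITED_FROM", "Old English:fæder"),
    ("English:kindness", "DERIVED_FROM", "English:kind"),
]


def invert(edges):
    """Map each source to the (relation, word) pairs it is the origin of."""
    index = defaultdict(list)
    for word, relation, source in edges:
        index[source].append((relation, word))
    return dict(index)


origins = invert(EDGES)
print(origins["English:kind"])  # [('DERIVED_FROM', 'English:kindness')]
```

Reading the index from source to word is what an ETYMOLOGICAL_ORIGIN_OF edge expresses.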

Cross-Lingual Queries

linker = EtymologyLinker()

# Find words borrowed into English from French
french_loans = linker.get_borrowed_words('English', 'French')
print(f"Found {len(french_loans)} French loanwords")

Multi-Language Support

Supported Languages

Crosstem supports 15 languages with full derivational data:

Code	Language	Words
cat	Catalan	~50K
ces	Czech	~40K
deu	German	~120K
eng	English	~150K
fin	Finnish	~60K
fra	French	~90K
hbs	Serbo-Croatian	~35K
hun	Hungarian	~45K
ita	Italian	~80K
mon	Mongolian	~25K
pol	Polish	~55K
por	Portuguese	~70K
rus	Russian	~100K
spa	Spanish	~85K
swe	Swedish	~65K

Usage Example

# German
de_stemmer = DerivationalStemmer('deu')
print(de_stemmer.stem('Organisierung'))  # organisieren

# French
fr_stemmer = DerivationalStemmer('fra')
print(fr_stemmer.stem('organisateur'))  # organiser

# Spanish
es_stemmer = DerivationalStemmer('spa')
print(es_stemmer.stem('organizador'))  # organizar

# Russian
ru_stemmer = DerivationalStemmer('rus')
print(ru_stemmer.stem('организация'))  # организовать

Language-Specific Behavior

Each language has calibrated productivity thresholds:

English: High threshold (rich derivational morphology)
German: Moderate (compound-heavy)
Russian: Low threshold (inflection-heavy)

See Algorithm for details on language-specific tuning.

Performance Tips

Memory Usage

Base package: ~280 MB (derivational data for 15 languages)
Etymology data: ~1 GB (optional)
In-memory graph: Loaded once per language

Speed Optimization

# ✓ GOOD: Reuse stemmer instance
stemmer = DerivationalStemmer('eng')
for word in large_corpus:
    stem = stemmer.stem(word)

# ✗ BAD: Creating new instance each time
for word in large_corpus:
    stemmer = DerivationalStemmer('eng')  # Reloads graph!
    stem = stemmer.stem(word)
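
If several parts of a program need a stemmer, one way to guarantee reuse is to memoize construction per language code. This is a sketch under stated assumptions: get_stemmer is a hypothetical helper, and FakeStemmer is a stand-in that counts constructions so the example runs without the library; in real use the factory would build DerivationalStemmer(lang) instead:

```python
from functools import lru_cache


class FakeStemmer:
    """Stand-in for an expensive-to-construct stemmer; counts constructions."""

    loads = 0

    def __init__(self, lang):
        FakeStemmer.loads += 1
        self.lang = lang

    def stem(self, word):
        return word


@lru_cache(maxsize=None)
def get_stemmer(lang):
    # Assumption: replace FakeStemmer with DerivationalStemmer in real code.
    return FakeStemmer(lang)


for word in ["organization", "organizer", "organizing"]:
    get_stemmer("eng").stem(word)  # constructed on the first call only
get_stemmer("deu")

print(FakeStemmer.loads)  # 2
```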

Backend behavior:

# Default: use Rust backend when available
stemmer = DerivationalStemmer('eng')

# Force pure-Python backend for debugging/parity checks
stemmer_py = DerivationalStemmer('eng', use_rust_backend=False)

Benchmark: When active, the Rust backend is typically about 2-3x faster than the pure-Python fallback and more than 10x faster than Porter on the bundled benchmarks.
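
Speedups depend on hardware and vocabulary, so it can be worth measuring on your own word list. The harness below is a generic sketch that times any str-to-str callable; time_stemmer is a hypothetical helper, and str.lower stands in for stemmer.stem so the example runs without the library:

```python
import time


def time_stemmer(stem, words, repeats=3):
    """Return the best wall-clock time in seconds for one pass over `words`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            stem(word)
        best = min(best, time.perf_counter() - start)
    return best


words = ["organization", "running", "destructiveness"] * 1000
elapsed = time_stemmer(str.lower, words)  # str.lower stands in for stemmer.stem
print(f"{len(words) / elapsed:,.0f} words/sec")
```

To compare backends, call time_stemmer once with each stemmer's stem method on the same list.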

Error Handling

Unknown Words

If a word is not in the graph, it’s returned unchanged:

stemmer = DerivationalStemmer('eng')

print(stemmer.stem('neologism123'))  # neologism123
print(stemmer.stem('organization'))  # organize

Invalid Language

try:
    stemmer = DerivationalStemmer('invalid')
except ValueError as e:
    print(f"Error: {e}")
    # Error: Language 'invalid' not supported
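
To check a code before constructing a stemmer, you can validate it against the codes listed under Supported Languages. This is a minimal sketch: check_language and SUPPORTED_LANGUAGES are hypothetical names, not part of Crosstem's API, and the set is copied by hand from the table in this guide:

```python
# Codes copied from the Supported Languages table; keep in sync manually.
SUPPORTED_LANGUAGES = {
    "cat", "ces", "deu", "eng", "fin", "fra", "hbs", "hun",
    "ita", "mon", "pol", "por", "rus", "spa", "swe",
}


def check_language(code):
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Language {code!r} not supported; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return normalized


print(check_language(" ENG "))  # eng
```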

Missing Etymology Data

from crosstem import is_etymology_downloaded, download_etymology

if not is_etymology_downloaded():
    print("Etymology data not found. Downloading...")
    download_etymology()

Best Practices

Reuse instances: Don’t create new stemmers for each word
Batch processing: Process lists of words in one go
Check language support: Verify language code before initialization
Download etymology once: Check if data exists before downloading
Handle unknown words: Plan for words not in the graph, which are returned unchanged

Next Steps

See Examples for real-world use cases
Read Algorithm to understand how it works
Check API Reference for complete method documentation
Learn about Languages for language-specific details