User Guide
This guide covers detailed usage of Crosstem’s features.
Derivational Stemming
Basic Usage
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('eng')
root = stemmer.stem('organization')
print(root) # organize
The stemmer finds the morphological root by traversing derivational relationships in the linguistic graph.
Understanding Stems
Crosstem returns linguistic roots, not just suffix-stripped words:
# Traditional stemmers
Porter('organization') → 'organ' # WRONG (overstemming)
Lancaster('organization') → 'org' # WRONG (aggressive)
# Lemmatizers
WordNet('organization') → 'organization' # Preserves POS boundary
# Crosstem
Crosstem('organization') → 'organize' # TRUE ROOT (crosses POS)
Multi-Hop Derivations
Some words require multiple steps to reach their root:
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('eng')
# 2-hop example
print(stemmer.stem('organizational'))
# organizational → organization → organize
# 3-hop example
print(stemmer.stem('destructiveness'))
# destructiveness → destructive → destruction → destruct
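To make the hop counts concrete, here is a minimal sketch of following derivation edges to a root. The `DERIVED_FROM` dict and the `trace_to_root` helper are hypothetical stand-ins written for this illustration, using only the chains listed above; they are not Crosstem's internal data structures or API.

```python
# Toy derivation edges (word -> the word it is derived from), copied from the
# chains shown above. Hand-written stand-in, NOT Crosstem's actual graph.
DERIVED_FROM = {
    "organizational": "organization",
    "organization": "organize",
    "destructiveness": "destructive",
    "destructive": "destruction",
    "destruction": "destruct",
}


def trace_to_root(word, edges=DERIVED_FROM):
    """Follow derivation edges until no parent remains; return the full path."""
    path = [word]
    seen = {word}
    while path[-1] in edges:
        parent = edges[path[-1]]
        if parent in seen:  # guard against cycles in malformed data
            break
        path.append(parent)
        seen.add(parent)
    return path


for w in ("organizational", "destructiveness", "unknown"):
    chain = trace_to_root(w)
    print(" -> ".join(chain), f"({len(chain) - 1} hops)")
```

A word with no outgoing edge is its own root, so the path for an unlisted word contains just that word and zero hops.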
Batch Processing
Process multiple words efficiently:
words = [
    'organization', 'organizational', 'organize',
    'organizing', 'organizer', 'reorganize'
]
stems = [stemmer.stem(word) for word in words]
print(stems)
# ['organize', 'organize', 'organize',
# 'organize', 'organize', 'organize']
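For large corpora with many repeated tokens, one possible refinement is to stem each distinct token only once and then aggregate counts per stem. The sketch below assumes nothing about Crosstem beyond a callable that maps a word to its stem; `stem_corpus` and `toy_stem` are hypothetical names, and `toy_stem` stands in for `stemmer.stem` so the example runs on its own.

```python
from collections import Counter


def stem_corpus(tokens, stem):
    """Stem each distinct token once, then aggregate token counts per stem."""
    token_counts = Counter(tokens)
    stem_of = {token: stem(token) for token in token_counts}
    stem_counts = Counter()
    for token, n in token_counts.items():
        stem_counts[stem_of[token]] += n
    return stem_of, stem_counts


# Stand-in for stemmer.stem (assumption: any callable str -> str works here).
_TOY = {"organization": "organize", "organizer": "organize"}
calls = []


def toy_stem(word):
    calls.append(word)  # record how often the stemmer is actually invoked
    return _TOY.get(word, word)


tokens = ["organization", "organizer", "organization", "organize", "organization"]
stem_of, stem_counts = stem_corpus(tokens, toy_stem)
print(stem_counts)  # Counter({'organize': 5})
print(len(calls))   # 3 (one call per distinct token, not per occurrence)
```

With a real stemmer, pass `stemmer.stem` in place of `toy_stem`.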
Word Families
Get All Derivatives
Find all words derived from a root:
stemmer = DerivationalStemmer('eng')
family = stemmer.get_word_family('organize')
print(f"Found {len(family)} related words")
print(sorted(family))
Example output:
Found 43 related words
['disorganization', 'disorganize', 'disorganized',
'organ', 'organic', 'organism', 'organization',
'organizational', 'organize', 'organized',
'organizer', 'reorganization', 'reorganize', ...]
Use Cases
Word families are useful for:
Information retrieval: Find all variants of a search term
Corpus analysis: Group related terms together
Vocabulary learning: Discover word relationships
Text normalization: Standardize related forms
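As an illustration of the information-retrieval use case, the sketch below expands a query with each term's word family before matching documents. It only assumes a callable that returns an iterable of related words, as `get_word_family` does above; `expand_query`, `search`, and the toy family data are hypothetical and exist only for this example.

```python
def expand_query(terms, get_family):
    """Return the query terms plus every member of each term's word family."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= set(get_family(term))
    return expanded


def search(documents, terms, get_family):
    """Return documents containing at least one expanded query term."""
    wanted = expand_query(terms, get_family)
    return [doc for doc in documents if wanted & set(doc.lower().split())]


# Toy stand-in for stemmer.get_word_family (hand-written, not real output).
TOY_FAMILY = {
    "organize": {"organize", "organizer", "reorganize", "organization"},
}


def toy_family(word):
    return TOY_FAMILY.get(word, set())


docs = [
    "The organizer booked a room",
    "We need to reorganize the shelves",
    "Nothing relevant here",
]
print(search(docs, ["organize"], toy_family))
# ['The organizer booked a room', 'We need to reorganize the shelves']
```

Whitespace tokenization is used here for brevity; a real pipeline would likely strip punctuation first.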
Inflectional Analysis
Basic Usage
from crosstem import InflectionAnalyzer
analyzer = InflectionAnalyzer('eng')
# Get all inflections of a word
inflections = analyzer.get_inflections('run')
print(inflections)
# {'run', 'runs', 'running', 'ran'}
Difference from Stemming
Inflections: Same word, different grammatical form (run/runs/ran)
Derivations: Related words, different meaning (organize/organization)
Crosstem handles both:
# Inflectional analysis
print(analyzer.get_inflections('go')) # {'go', 'goes', 'going', 'went', 'gone'}
# Derivational stemming
print(stemmer.stem('going')) # go
print(stemmer.stem('organization')) # organize (a different word!)
Etymology Tracing
Setup
First, download the etymology dataset:
from crosstem import download_etymology, is_etymology_downloaded
if not is_etymology_downloaded():
    download_etymology()
Basic Usage
from crosstem import EtymologyLinker
linker = EtymologyLinker()
# Trace etymology of a word
etymology = linker.get_etymology('English', 'organize')
print(etymology)
Etymology Relationships
The etymology data includes several relationship types:
INHERITED_FROM: Word inherited from ancestor language
BORROWED_FROM: Word borrowed/loaned from another language
DERIVED_FROM: Word derived from another word
ETYMOLOGICAL_ORIGIN_OF: Inverse relationship
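This guide does not show the shape of the value returned by `get_etymology`, so the following is only a sketch of how the four relationship types could be modeled and filtered in client code. The `Relation` enum, the `Edge` tuple, and the two sample edges are assumptions made for illustration; they are not Crosstem types or Crosstem output.

```python
from enum import Enum
from typing import NamedTuple


class Relation(Enum):
    """The relationship types listed above (names taken from this guide)."""
    INHERITED_FROM = "INHERITED_FROM"
    BORROWED_FROM = "BORROWED_FROM"
    DERIVED_FROM = "DERIVED_FROM"
    ETYMOLOGICAL_ORIGIN_OF = "ETYMOLOGICAL_ORIGIN_OF"


class Edge(NamedTuple):
    """Hypothetical record: (lang, word) --relation--> (related_lang, related_word)."""
    lang: str
    word: str
    relation: Relation
    related_lang: str
    related_word: str


def filter_edges(edges, relation):
    """Keep only the edges with the given relationship type."""
    return [e for e in edges if e.relation is relation]


# Hand-written sample edges for illustration only.
EDGES = [
    Edge("English", "beef", Relation.BORROWED_FROM, "Old French", "boef"),
    Edge("English", "father", Relation.INHERITED_FROM, "Old English", "fæder"),
]

for e in filter_edges(EDGES, Relation.BORROWED_FROM):
    print(f"{e.word} ({e.lang}) <- {e.related_word} ({e.related_lang})")
```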
Cross-Lingual Queries
linker = EtymologyLinker()
# Find words borrowed into English from French
french_loans = linker.get_borrowed_words('English', 'French')
print(f"Found {len(french_loans)} French loanwords")
Multi-Language Support
Supported Languages
Crosstem supports 15 languages with full derivational data:
| Code | Language | Words |
|---|---|---|
| cat | Catalan | ~50K |
| ces | Czech | ~40K |
| deu | German | ~120K |
| eng | English | ~150K |
| fin | Finnish | ~60K |
| fra | French | ~90K |
| hbs | Serbo-Croatian | ~35K |
| hun | Hungarian | ~45K |
| ita | Italian | ~80K |
| mon | Mongolian | ~25K |
| pol | Polish | ~55K |
| por | Portuguese | ~70K |
| rus | Russian | ~100K |
| spa | Spanish | ~85K |
| swe | Swedish | ~65K |
Usage Example
# German
de_stemmer = DerivationalStemmer('deu')
print(de_stemmer.stem('Organisierung')) # organisieren
# French
fr_stemmer = DerivationalStemmer('fra')
print(fr_stemmer.stem('organisateur')) # organiser
# Spanish
es_stemmer = DerivationalStemmer('spa')
print(es_stemmer.stem('organizador')) # organizar
# Russian
ru_stemmer = DerivationalStemmer('rus')
print(ru_stemmer.stem('организация')) # организовать
Language-Specific Behavior
Each language has calibrated productivity thresholds:
English: High threshold (rich derivational morphology)
German: Moderate (compound-heavy)
Russian: Low threshold (inflection-heavy)
See Algorithm for details on language-specific tuning.
Performance Tips
Memory Usage
Base package: ~280 MB (derivational data for 15 languages)
Etymology data: ~1 GB (optional)
In-memory graph: Loaded once per language
Speed Optimization
# ✓ GOOD: Reuse stemmer instance
stemmer = DerivationalStemmer('eng')
for word in large_corpus:
    stem = stemmer.stem(word)
# ✗ BAD: Creating new instance each time
for word in large_corpus:
    stemmer = DerivationalStemmer('eng') # Reloads graph!
    stem = stemmer.stem(word)
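When several parts of an application need stemmers for different languages, one way to guarantee reuse is a small cached factory. The sketch below uses `functools.lru_cache` from the standard library; `get_stemmer` is a hypothetical helper, and `FakeStemmer` is a stand-in that counts constructions so the example runs without Crosstem installed.

```python
from functools import lru_cache


class FakeStemmer:
    """Stand-in that counts constructions; swap in DerivationalStemmer."""

    loads = 0

    def __init__(self, lang):
        FakeStemmer.loads += 1  # each construction models one graph load
        self.lang = lang

    def stem(self, word):
        return word


@lru_cache(maxsize=None)
def get_stemmer(lang):
    """Build one stemmer per language code and reuse it on later calls."""
    return FakeStemmer(lang)


for lang in ["eng", "deu", "eng", "eng", "deu"]:
    get_stemmer(lang).stem("x")

print(FakeStemmer.loads)  # 2 (one per distinct language)
```

Whether a shared instance is safe to use across threads is not stated in this guide, so treat that as something to verify before sharing the cache in concurrent code.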
Backend behavior:
# Default: use Rust backend when available
stemmer = DerivationalStemmer('eng')
# Force pure-Python backend for debugging/parity checks
stemmer_py = DerivationalStemmer('eng', use_rust_backend=False)
Benchmark: The active Rust backend is typically about 2-3x faster than the pure-Python fallback and more than 10x faster than Porter on the bundled benchmarks.
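To check throughput on your own word list, a small timing harness built on the standard `timeit` module is usually enough. This is a sketch, not the bundled benchmark: `words_per_second` is a hypothetical helper that accepts any callable mapping a word to a string, and `str.lower` is used below purely as a placeholder workload.

```python
import timeit


def words_per_second(stem, words, repeat=3):
    """Time one full pass over `words`; report throughput from the best run."""

    def run():
        for w in words:
            stem(w)

    best = min(timeit.repeat(run, number=1, repeat=repeat))
    return len(words) / best if best > 0 else float("inf")


# Placeholder workload; pass stemmer.stem and stemmer_py.stem to compare backends.
sample = ["Organization", "Destructiveness", "Running"] * 1000
rate = words_per_second(str.lower, sample)
print(f"{rate:,.0f} words/s")
```

Taking the minimum over several repeats reduces noise from other processes; construct the stemmers outside the timed function so graph loading is not counted.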
Error Handling
Unknown Words
If a word is not in the graph, it’s returned unchanged:
stemmer = DerivationalStemmer('eng')
print(stemmer.stem('neologism123')) # neologism123
print(stemmer.stem('organization')) # organize (known word)
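Because unknown words come back unchanged, and a root such as 'organize' also stems to itself in the batch example earlier, an unchanged result alone does not tell you whether a word was found. A simple sketch for separating the two outcomes so the unchanged words can be reviewed follows; `split_by_change` and `toy_stem` are hypothetical names, and `toy_stem` stands in for `stemmer.stem`.

```python
def split_by_change(words, stem):
    """Partition words into those the stemmer changed and those it left as-is."""
    changed = {}
    unchanged = []
    for word in words:
        root = stem(word)
        if root == word:
            unchanged.append(word)  # either already a root or not in the graph
        else:
            changed[word] = root
    return changed, unchanged


# Stand-in for stemmer.stem (hand-written mapping, not real output).
_TOY = {"organization": "organize"}


def toy_stem(word):
    return _TOY.get(word, word)


changed, unchanged = split_by_change(
    ["organization", "organize", "neologism123"], toy_stem
)
print(changed)    # {'organization': 'organize'}
print(unchanged)  # ['organize', 'neologism123']
```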
Invalid Language
try:
    stemmer = DerivationalStemmer('invalid')
except ValueError as e:
    print(f"Error: {e}")
    # Error: Language 'invalid' not supported
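If you prefer to validate a language code before constructing a stemmer, the codes from the Supported Languages table can be kept in a small lookup. This is a sketch: `SUPPORTED_LANGUAGES` and `check_language` are hypothetical names, the dict is copied from the table in this guide rather than read from Crosstem, and the lowercasing step is a design choice of this example, not documented library behavior.

```python
# Codes copied from the Supported Languages table in this guide.
SUPPORTED_LANGUAGES = {
    "cat": "Catalan", "ces": "Czech", "deu": "German", "eng": "English",
    "fin": "Finnish", "fra": "French", "hbs": "Serbo-Croatian",
    "hun": "Hungarian", "ita": "Italian", "mon": "Mongolian",
    "pol": "Polish", "por": "Portuguese", "rus": "Russian",
    "spa": "Spanish", "swe": "Swedish",
}


def check_language(code):
    """Return the normalized code, or raise ValueError listing valid codes."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        known = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Language {code!r} not supported; expected one of: {known}"
        )
    return normalized


print(check_language(" ENG "))  # eng
try:
    check_language("invalid")
except ValueError as err:
    print(f"Error: {err}")
```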
Missing Etymology Data
from crosstem import is_etymology_downloaded, download_etymology
if not is_etymology_downloaded():
    print("Etymology data not found. Downloading...")
    download_etymology()
Best Practices
Reuse instances: Don’t create new stemmers for each word
Batch processing: Process lists of words in one go
Check language support: Verify language code before initialization
Download etymology once: Check if data exists before downloading
Handle unknown words: Plan for words that are not in the graph
Next Steps
See Examples for real-world use cases
Read Algorithm to understand how it works
Check API Reference for complete method documentation
Learn about Languages for language-specific details