User Guide ========== This guide covers detailed usage of Crosstem's features. Derivational Stemming --------------------- Basic Usage ~~~~~~~~~~~ :: from crosstem import DerivationalStemmer stemmer = DerivationalStemmer('eng') root = stemmer.stem('organization') print(root) # organize The stemmer finds the morphological root by traversing derivational relationships in the linguistic graph. Understanding Stems ~~~~~~~~~~~~~~~~~~~ Crosstem returns **linguistic roots**, not just prefix-stripped words: :: # Traditional stemmers Porter('organization') → 'organ' # WRONG (overstemming) Lancaster('organization') → 'org' # WRONG (aggressive) # Lemmatizers WordNet('organization') → 'organization' # Preserves POS boundary # Crosstem Crosstem('organization') → 'organize' # TRUE ROOT (crosses POS) Multi-Hop Derivations ~~~~~~~~~~~~~~~~~~~~~~ Some words require multiple steps to reach their root:: from crosstem import DerivationalStemmer stemmer = DerivationalStemmer('eng') # 2-hop example print(stemmer.stem('organizational')) # organizational → organization → organize # 3-hop example print(stemmer.stem('destructiveness')) # destructiveness → destructive → destruction → destruct Batch Processing ~~~~~~~~~~~~~~~~ Process multiple words efficiently:: words = [ 'organization', 'organizational', 'organize', 'organizing', 'organizer', 'reorganize' ] stems = [stemmer.stem(word) for word in words] print(stems) # ['organize', 'organize', 'organize', # 'organize', 'organize', 'organize'] Word Families ------------- Get All Derivatives ~~~~~~~~~~~~~~~~~~~ Find all words derived from a root:: stemmer = DerivationalStemmer('eng') family = stemmer.get_word_family('organize') print(f"Found {len(family)} related words") print(sorted(family)) Example output:: Found 43 related words ['disorganization', 'disorganize', 'disorganized', 'organ', 'organic', 'organism', 'organization', 'organizational', 'organize', 'organized', 'organizer', 'reorganization', 'reorganize', ...] Use Cases ~~~~~~~~~ Word families are useful for: * **Information retrieval**: Find all variants of a search term * **Corpus analysis**: Group related terms together * **Vocabulary learning**: Discover word relationships * **Text normalization**: Standardize related forms Inflectional Analysis --------------------- Basic Usage ~~~~~~~~~~~ :: from crosstem import InflectionAnalyzer analyzer = InflectionAnalyzer('eng') # Get all inflections of a word inflections = analyzer.get_inflections('run') print(inflections) # {'run', 'runs', 'running', 'ran'} Difference from Stemming ~~~~~~~~~~~~~~~~~~~~~~~~~ * **Inflections**: Same word, different grammatical form (run/runs/ran) * **Derivations**: Related words, different meaning (organize/organization) Crosstem handles both:: # Inflectional analysis analyzer.get_inflections('go') → {'go', 'goes', 'going', 'went', 'gone'} # Derivational stemming stemmer.stem('going') → 'go' stemmer.stem('organization') → 'organize' # different word! Etymology Tracing ----------------- Setup ~~~~~ First, download the etymology dataset:: from crosstem import download_etymology, is_etymology_downloaded if not is_etymology_downloaded(): download_etymology() Basic Usage ~~~~~~~~~~~ :: from crosstem import EtymologyLinker linker = EtymologyLinker() # Trace etymology of a word etymology = linker.get_etymology('English', 'organize') print(etymology) Etymology Relationships ~~~~~~~~~~~~~~~~~~~~~~~ The etymology data includes several relationship types: * **INHERITED_FROM**: Word inherited from ancestor language * **BORROWED_FROM**: Word borrowed/loaned from another language * **DERIVED_FROM**: Word derived from another word * **ETYMOLOGICAL_ORIGIN_OF**: Inverse relationship Cross-Lingual Queries ~~~~~~~~~~~~~~~~~~~~~ :: linker = EtymologyLinker() # Find words borrowed into English from French french_loans = linker.get_borrowed_words('English', 'French') print(f"Found {len(french_loans)} French loanwords") Multi-Language Support ---------------------- Supported Languages ~~~~~~~~~~~~~~~~~~~ Crosstem supports 15 languages with full derivational data: +----------+------------------+---------+ | Code | Language | Words | +==========+==================+=========+ | cat | Catalan | ~50K | +----------+------------------+---------+ | ces | Czech | ~40K | +----------+------------------+---------+ | deu | German | ~120K | +----------+------------------+---------+ | eng | English | ~150K | +----------+------------------+---------+ | fin | Finnish | ~60K | +----------+------------------+---------+ | fra | French | ~90K | +----------+------------------+---------+ | hbs | Serbo-Croatian | ~35K | +----------+------------------+---------+ | hun | Hungarian | ~45K | +----------+------------------+---------+ | ita | Italian | ~80K | +----------+------------------+---------+ | mon | Mongolian | ~25K | +----------+------------------+---------+ | pol | Polish | ~55K | +----------+------------------+---------+ | por | Portuguese | ~70K | +----------+------------------+---------+ | rus | Russian | ~100K | +----------+------------------+---------+ | spa | Spanish | ~85K | +----------+------------------+---------+ | swe | Swedish | ~65K | +----------+------------------+---------+ Usage Example ~~~~~~~~~~~~~ :: # German de_stemmer = DerivationalStemmer('deu') print(de_stemmer.stem('Organisierung')) # organisieren # French fr_stemmer = DerivationalStemmer('fra') print(fr_stemmer.stem('organisateur')) # organiser # Spanish es_stemmer = DerivationalStemmer('spa') print(es_stemmer.stem('organizador')) # organizar # Russian ru_stemmer = DerivationalStemmer('rus') print(ru_stemmer.stem('организация')) # организовать Language-Specific Behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each language has calibrated productivity thresholds: * **English**: High threshold (rich derivational morphology) * **German**: Moderate (compound-heavy) * **Russian**: Low threshold (inflection-heavy) See :doc:`algorithm` for details on language-specific tuning. Performance Tips ---------------- Memory Usage ~~~~~~~~~~~~ * Base package: ~280 MB (derivational data for 15 languages) * Etymology data: ~1 GB (optional) * In-memory graph: Loaded once per language Speed Optimization ~~~~~~~~~~~~~~~~~~ :: # ✓ GOOD: Reuse stemmer instance stemmer = DerivationalStemmer('eng') for word in large_corpus: stem = stemmer.stem(word) # ✗ BAD: Creating new instance each time for word in large_corpus: stemmer = DerivationalStemmer('eng') # Reloads graph! stem = stemmer.stem(word) Backend behavior:: # Default: use Rust backend when available stemmer = DerivationalStemmer('eng') # Force pure-Python backend for debugging/parity checks stemmer_py = DerivationalStemmer('eng', use_rust_backend=False) Benchmark: Active Rust backend is typically ~2-3x faster than Python fallback and >10x faster than Porter on bundled benchmarks. Error Handling -------------- Unknown Words ~~~~~~~~~~~~~ If a word is not in the graph, it's returned unchanged:: stemmer = DerivationalStemmer('eng') print(stemmer.stem('neologism123')) # neologism123 print(stemmer.stem('known_word')) # Invalid Language ~~~~~~~~~~~~~~~~ :: try: stemmer = DerivationalStemmer('invalid') except ValueError as e: print(f"Error: {e}") # Error: Language 'invalid' not supported Missing Etymology Data ~~~~~~~~~~~~~~~~~~~~~~ :: from crosstem import is_etymology_downloaded, download_etymology if not is_etymology_downloaded(): print("Etymology data not found. Downloading...") download_etymology() Best Practices -------------- 1. **Reuse instances**: Don't create new stemmers for each word 2. **Batch processing**: Process lists of words in one go 3. **Check language support**: Verify language code before initialization 4. **Download etymology once**: Check if data exists before downloading 5. **Handle unknown words**: Plan for words not in the corpus Next Steps ---------- * See :doc:`examples` for real-world use cases * Read :doc:`algorithm` to understand how it works * Check :doc:`api` for complete method documentation * Learn about :doc:`languages` for language-specific details