API Reference

This page documents the public classes, helper functions, exceptions, and constants in Crosstem.

DerivationalStemmer

class DerivationalStemmer(language: str = 'eng', use_rust_backend: bool = True)

Main class for finding morphological roots through derivational relationships.

Parameters:

language (str) – ISO 639-3 language code (e.g., 'eng', 'deu', 'fra')
use_rust_backend (bool) – Use Rust backend when available; falls back to Python if unavailable

Raises:

ValueError – If language is not supported

Example:

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')
root = stemmer.stem('organization')

stem(word: str, use_derivations: bool = True) → str

Find the morphological root of a word using BFS graph traversal.

Parameters: word (str) – The word to stem
Returns: The morphological root, or the original word if it is not in the graph
Return type: str

Algorithm: Uses breadth-first search through derivational relationships, scoring candidates based on word length, part of speech, and productivity.

Example:

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')

# Cross-POS stemming
print(stemmer.stem('organization'))    # organize (noun → verb)
print(stemmer.stem('beautiful'))       # beauty (adj → noun)

# Multi-hop traversal
print(stemmer.stem('organizational'))  # organize (2 hops)
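The breadth-first traversal described for stem() can be pictured on a toy graph. This is a minimal sketch, not Crosstem's implementation: TOY_GRAPH, TOY_POS, and toy_stem are hypothetical names, the graph data is made up for illustration, and the scoring rule (shortest word wins, verbs win ties) is an assumption that ignores productivity entirely.

```python
from collections import deque

# Hypothetical toy data for illustration only; not Crosstem's graph.
TOY_GRAPH = {
    "organizational": ["organization"],
    "organization": ["organizational", "organize"],
    "organize": ["organization", "organizer"],
    "organizer": ["organize"],
}
TOY_POS = {
    "organizational": "ADJ",
    "organization": "N",
    "organize": "V",
    "organizer": "N",
}


def toy_stem(word, graph=TOY_GRAPH, pos=TOY_POS, max_depth=3):
    """Return a root candidate reachable from `word` within `max_depth` hops."""
    if word not in graph:
        return word  # unknown words are returned unchanged

    seen = {word}
    queue = deque([(word, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))

    # Assumed scoring: shortest word first, verbs preferred on ties,
    # alphabetical order as a final deterministic tie-breaker.
    return min(seen, key=lambda w: (len(w), pos.get(w) != "V", w))


print(toy_stem("organizational"))  # organize
```

With this toy data, 'organizational' reaches 'organize' in two hops, matching the multi-hop case in the example above.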

get_word_family(word: str, max_depth: int = 2) → list

Get all words derived from the given root word.

Parameters: word (str) – The root word
Returns: Sorted list of words in the derivational family
Return type: list
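To show how a depth limit such as max_depth bounds a word family, here is a small sketch on made-up data. TOY_DERIVED and toy_word_family are hypothetical; whether Crosstem includes the root word itself in the returned list is not stated on this page, so including it here is an assumption.

```python
from collections import deque

# Hypothetical toy data: root -> directly derived words. Not Crosstem data.
TOY_DERIVED = {
    "organize": ["organizer", "organization"],
    "organization": ["organizational"],
    "organizational": ["organizationally"],
}


def toy_word_family(word, max_depth=2, derived=TOY_DERIVED):
    """Collect words reachable within `max_depth` derivation steps, sorted."""
    family = {word}  # assumption: the root is part of its own family
    queue = deque([(word, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue  # children of this word would exceed the limit
        for child in derived.get(current, []):
            if child not in family:
                family.add(child)
                queue.append((child, depth + 1))
    return sorted(family)


print(toy_word_family("organize", max_depth=1))
# ['organization', 'organize', 'organizer']
```

Raising max_depth to 2 adds 'organizational' in this toy graph, while 'organizationally' stays out of reach until a depth of 3.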

get_derivations(word: str) → list

Get derivational links for a word.

Parameters: word (str) – Input word
Returns: List of derivation objects with form, pos, and relation
Return type: list

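Example:

from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('eng')
family = stemmer.get_word_family('organize')
print(len(family))  # 43 related words
derivations = stemmer.get_derivations('organize')
print(derivations)  # derivation objects with form, pos, and relation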

InflectionAnalyzer

class InflectionAnalyzer(language: str)

Analyzer for inflectional morphology (grammatical variations of the same word).

Parameters: language (str) – ISO 639-3 language code
Raises: ValueError – If language is not supported

Example:

from crosstem import InflectionAnalyzer

analyzer = InflectionAnalyzer('eng')
inflections = analyzer.get_inflections('run')

get_inflections(word: str) → set

Get all inflectional forms of a word.

Parameters: word (str) – The base word
Returns: Set of inflected forms
Return type: set

Example:

from crosstem import InflectionAnalyzer

analyzer = InflectionAnalyzer('eng')

print(analyzer.get_inflections('run'))
# {'run', 'runs', 'running', 'ran'}

print(analyzer.get_inflections('go'))
# {'go', 'goes', 'going', 'went', 'gone'}
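Because get_inflections() maps a base word to its forms, a common follow-up is to invert those sets into a form-to-base lookup. The sketch below uses the two sets from the example output above as hard-coded data so that it runs without Crosstem installed; build_lemma_index is a hypothetical helper, not part of the library.

```python
# Inflection sets copied from the example output above. In real use they
# would come from analyzer.get_inflections(base) for each base word.
INFLECTIONS = {
    "run": {"run", "runs", "running", "ran"},
    "go": {"go", "goes", "going", "went", "gone"},
}


def build_lemma_index(inflections):
    """Invert {base: forms} into {form: set of bases}."""
    index = {}
    for base, forms in inflections.items():
        for form in forms:
            # A form may belong to more than one base, so keep a set.
            index.setdefault(form, set()).add(base)
    return index


LEMMA_INDEX = build_lemma_index(INFLECTIONS)
print(LEMMA_INDEX["went"])  # {'go'}
```

Storing a set per form keeps ambiguous forms visible instead of silently picking one base.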

EtymologyLinker

class EtymologyLinker

Class for tracing cross-lingual etymology relationships.

Note

Requires etymology data to be downloaded first using download_etymology().

Example:

from crosstem import EtymologyLinker, download_etymology

download_etymology()  # One-time download
linker = EtymologyLinker()

get_etymology(language: str, word: str) → dict

Get etymology information for a word.

Parameters:

language (str) – Full language name (e.g., 'English', 'French')
word (str) – The word to look up

Returns:

Dictionary of etymology relationships

Return type:

dict

Relationship types:

INHERITED_FROM: Inherited from ancestor language
BORROWED_FROM: Borrowed/loaned from another language
DERIVED_FROM: Derived from another word
ETYMOLOGICAL_ORIGIN_OF: Source of another word

Example:

from crosstem import EtymologyLinker

linker = EtymologyLinker()
etymology = linker.get_etymology('English', 'organize')
print(etymology)
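The four relationship types point in two directions: three name a source of the word, and ETYMOLOGICAL_ORIGIN_OF names a descendant. The sketch below separates the two. The exact shape of the returned dictionary is not documented on this page, so the layout used here (relationship type mapped to a list of entries) is an assumption, and SAMPLE holds placeholder values rather than real etymology data.

```python
# Relationship type names as listed above.
SOURCE_RELATIONS = ("INHERITED_FROM", "BORROWED_FROM", "DERIVED_FROM")
DESCENDANT_RELATION = "ETYMOLOGICAL_ORIGIN_OF"

# Placeholder data in an ASSUMED shape: {relationship type: [entries]}.
SAMPLE = {
    "BORROWED_FROM": ["placeholder-source"],
    "ETYMOLOGICAL_ORIGIN_OF": ["placeholder-descendant-1", "placeholder-descendant-2"],
}


def split_by_direction(etymology):
    """Return (sources, descendants) from an etymology dictionary."""
    sources, descendants = [], []
    for relation, entries in etymology.items():
        if relation in SOURCE_RELATIONS:
            sources.extend(entries)
        elif relation == DESCENDANT_RELATION:
            descendants.extend(entries)
        # Unknown relationship types are ignored in this sketch.
    return sources, descendants


sources, descendants = split_by_direction(SAMPLE)
print(len(sources), len(descendants))  # 1 2
```

If the real return value nests its entries differently, only the loop body would need to change.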

get_borrowed_words(target_lang: str, source_lang: str) → list

Find all words borrowed from one language into another.

Parameters:

target_lang (str) – Language that borrowed words
source_lang (str) – Language that provided words

Returns:

List of borrowed words

Return type:

list

Example:

from crosstem import EtymologyLinker

linker = EtymologyLinker()
french_loans = linker.get_borrowed_words('English', 'French')
print(f"Found {len(french_loans)} French loanwords")

Helper Functions

download_etymology() → None

Download the etymology dataset (~1 GB) from GitHub Releases.

Shows a progress bar during download and validates the file after completion.

Example:

from crosstem import download_etymology
download_etymology()

is_etymology_downloaded() → bool

Check if etymology data is available.

Returns: True if etymology.json exists, False otherwise
Return type: bool

Example:

from crosstem import is_etymology_downloaded

if not is_etymology_downloaded():
    print("Please download etymology data first")
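Since the dataset is large, a caller may prefer to download it only when it is missing. The helper below is a minimal sketch of that check-then-download pattern. ensure_etymology is a hypothetical name, and it takes the two functions as arguments so the logic can be exercised without Crosstem installed or a network connection.

```python
def ensure_etymology(is_downloaded, download):
    """Run `download` only if `is_downloaded` reports missing data.

    Returns True when a download was triggered, False otherwise.
    """
    if is_downloaded():
        return False
    download()
    return True


# Stand-in callables that simulate a machine without the data.
state = {"present": False}
triggered = ensure_etymology(
    lambda: state["present"],
    lambda: state.update(present=True),
)
print(triggered, state["present"])  # True True
```

With Crosstem, the call would presumably be ensure_etymology(is_etymology_downloaded, download_etymology), using the two helpers documented in this section.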

remove_etymology() → None

Remove downloaded etymology data to free disk space.

Example:

from crosstem import remove_etymology
remove_etymology()

Supported Languages

SUPPORTED_LANGUAGES: list

List of supported ISO 639-3 language codes:

[
    'cat',  # Catalan
    'ces',  # Czech
    'deu',  # German
    'eng',  # English
    'fin',  # Finnish
    'fra',  # French
    'hbs',  # Serbo-Croatian
    'hun',  # Hungarian
    'ita',  # Italian
    'mon',  # Mongolian
    'pol',  # Polish
    'por',  # Portuguese
    'rus',  # Russian
    'spa',  # Spanish
    'swe',  # Swedish
]
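A language code can be checked against this list before constructing a stemmer or analyzer. The sketch below copies the list so that it is self-contained; in real code it would presumably be imported from the package instead. check_language is a hypothetical helper, and the whitespace and case normalization is an added convenience, not documented Crosstem behavior.

```python
# Copied from the list above.
SUPPORTED = [
    'cat', 'ces', 'deu', 'eng', 'fin', 'fra', 'hbs', 'hun',
    'ita', 'mon', 'pol', 'por', 'rus', 'spa', 'swe',
]


def check_language(code, supported=SUPPORTED):
    """Return the normalized code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()  # assumption: tolerate ' ENG ' style input
    if normalized not in supported:
        raise ValueError(f"Language '{code}' not supported")
    return normalized


print(check_language(" ENG "))  # eng
```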

Exceptions

exception ValueError

Raised when an invalid language code is provided:

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('invalid')
# ValueError: Language 'invalid' not supported

exception FileNotFoundError

Raised when attempting to use etymology features without downloading data:

from crosstem import EtymologyLinker

linker = EtymologyLinker()  # Without downloading first
# FileNotFoundError: Etymology data not found
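One way to handle an unsupported language is to catch the ValueError and retry with a default. The following is a sketch of that pattern only. build_with_fallback and fake_stemmer are hypothetical: the stand-in constructor mimics the documented ValueError so the example runs without Crosstem, and falling back to 'eng' is an illustrative choice, not a library recommendation.

```python
def build_with_fallback(factory, language, fallback="eng"):
    """Call factory(language); on ValueError, retry with `fallback`.

    Returns (object, language actually used).
    """
    try:
        return factory(language), language
    except ValueError:
        return factory(fallback), fallback


def fake_stemmer(language):
    """Stand-in for a constructor such as DerivationalStemmer."""
    if language not in ("eng", "deu"):
        raise ValueError(f"Language '{language}' not supported")
    return f"stemmer:{language}"


print(build_with_fallback(fake_stemmer, "invalid"))  # ('stemmer:eng', 'eng')
```

Returning the language that was actually used lets the caller log or report the fallback instead of hiding it.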

Constants

MAX_DEPTH: int = 3

Maximum depth for BFS traversal when finding roots.

PRODUCTIVITY_THRESHOLDS: dict

Language-specific productivity thresholds for filtering candidates:

{
    'eng': {'V': 5, 'N': 9},    # English
    'deu': {'V': 4, 'N': 3},    # German
    'fra': {'V': 4, 'N': 5},    # French
    'rus': {'V': 3, 'N': 2},    # Russian
    # ... other languages
}
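This page does not say how the thresholds are applied, so the sketch below shows one plausible reading only: a candidate passes when its productivity reaches the threshold for its part of speech. Treating productivity as a simple count, using "greater than or equal" as the comparison, and letting unlisted languages or parts of speech pass are all assumptions. passes_threshold is a hypothetical helper.

```python
# Values copied from the excerpt above; other languages are elided there.
THRESHOLDS = {
    'eng': {'V': 5, 'N': 9},
    'deu': {'V': 4, 'N': 3},
    'fra': {'V': 4, 'N': 5},
    'rus': {'V': 3, 'N': 2},
}


def passes_threshold(language, pos, productivity, thresholds=THRESHOLDS):
    """Assumed rule: keep a candidate if productivity >= its POS threshold."""
    minimum = thresholds.get(language, {}).get(pos)
    if minimum is None:
        return True  # assumption: no threshold means no filtering
    return productivity >= minimum


print(passes_threshold('eng', 'N', 8))  # False
print(passes_threshold('deu', 'N', 8))  # True
```

The same productivity value passes for German nouns but not for English nouns, which shows why the thresholds are keyed by language.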

Type Hints

All public methods include type hints for better IDE support:

from crosstem import DerivationalStemmer

def process_text(text: str, language: str = 'eng') -> list[str]:
    """Process text and return stems."""
    stemmer = DerivationalStemmer(language)
    words = text.split()
    return [stemmer.stem(word) for word in words]