Languages
Language-specific details and coverage information.
Supported Languages
Crosstem currently supports 15 languages with full derivational morphology data.
Germanic and Romance Languages
English (eng)
Coverage: ~150,000 words
Productivity: High (rich derivational morphology)
Thresholds: Verbs ≥5, Nouns ≥9
Notes: Best coverage, most extensively tested
German (deu)
Coverage: ~120,000 words
Productivity: Moderate (compound-heavy)
Thresholds: Verbs ≥4, Nouns ≥3
Notes: Handles compound words well
French (fra)
Coverage: ~90,000 words
Productivity: Moderate
Thresholds: Verbs ≥4, Nouns ≥5
Notes: Romance language patterns
Italian (ita)
Coverage: ~80,000 words
Productivity: Moderate
Thresholds: Verbs ≥4, Nouns ≥5
Notes: Similar to French
Spanish (spa)
Coverage: ~85,000 words
Productivity: Moderate-Low
Thresholds: Verbs ≥3, Nouns ≥4
Notes: Romance language, extensive verbal system
Portuguese (por)
Coverage: ~70,000 words
Productivity: Moderate-Low
Thresholds: Verbs ≥3, Nouns ≥4
Notes: Similar to Spanish
Catalan (cat)
Coverage: ~50,000 words
Productivity: Moderate
Thresholds: Verbs ≥3, Nouns ≥4
Notes: Romance language spoken in Catalonia
Swedish (swe)
Coverage: ~65,000 words
Productivity: Moderate
Thresholds: Verbs ≥3, Nouns ≥4
Notes: North Germanic language
Slavic Languages
Russian (rus)
Coverage: ~100,000 words
Productivity: Low (inflection-heavy)
Thresholds: Verbs ≥3, Nouns ≥2
Notes: Rich inflectional system, less derivation
Polish (pol)
Coverage: ~55,000 words
Productivity: Low
Thresholds: Verbs ≥3, Nouns ≥3
Notes: Complex inflectional morphology
Czech (ces)
Coverage: ~40,000 words
Productivity: Low
Thresholds: Verbs ≥3, Nouns ≥3
Notes: West Slavic language
Serbo-Croatian (hbs)
Coverage: ~35,000 words
Productivity: Low
Thresholds: Verbs ≥2, Nouns ≥2
Notes: South Slavic language
Other Languages
Finnish (fin)
Coverage: ~60,000 words
Productivity: Moderate
Thresholds: Verbs ≥3, Nouns ≥4
Notes: Finno-Ugric language, agglutinative
Hungarian (hun)
Coverage: ~45,000 words
Productivity: Moderate
Thresholds: Verbs ≥3, Nouns ≥3
Notes: Finno-Ugric language, agglutinative
Mongolian (mon)
Coverage: ~25,000 words
Productivity: Moderate-Low
Thresholds: Verbs ≥2, Nouns ≥2
Notes: Mongolic language
Language Codes
Crosstem uses ISO 639-3 language codes:
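| Code | Language | Example Usage |
|---|---|---|
| cat | Catalan | `DerivationalStemmer('cat')` |
| ces | Czech | `DerivationalStemmer('ces')` |
| deu | German | `DerivationalStemmer('deu')` |
| eng | English | `DerivationalStemmer('eng')` |
| fin | Finnish | `DerivationalStemmer('fin')` |
| fra | French | `DerivationalStemmer('fra')` |
| hbs | Serbo-Croatian | `DerivationalStemmer('hbs')` |
| hun | Hungarian | `DerivationalStemmer('hun')` |
| ita | Italian | `DerivationalStemmer('ita')` |
| mon | Mongolian | `DerivationalStemmer('mon')` |
| pol | Polish | `DerivationalStemmer('pol')` |
| por | Portuguese | `DerivationalStemmer('por')` |
| rus | Russian | `DerivationalStemmer('rus')` |
| spa | Spanish | `DerivationalStemmer('spa')` |
| swe | Swedish | `DerivationalStemmer('swe')` |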
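Since the stemmer is selected by code, it can help to validate user-supplied codes before constructing one. The following is a minimal sketch, not part of Crosstem: the names `SUPPORTED_LANGUAGES` and `normalize_language_code` are hypothetical, and the mapping is simply copied from the table above.

```python
# Hypothetical helper, not part of the Crosstem API.
# The mapping below is copied from the language code table.
SUPPORTED_LANGUAGES = {
    "cat": "Catalan",
    "ces": "Czech",
    "deu": "German",
    "eng": "English",
    "fin": "Finnish",
    "fra": "French",
    "hbs": "Serbo-Croatian",
    "hun": "Hungarian",
    "ita": "Italian",
    "mon": "Mongolian",
    "pol": "Polish",
    "por": "Portuguese",
    "rus": "Russian",
    "spa": "Spanish",
    "swe": "Swedish",
}


def normalize_language_code(code: str) -> str:
    """Return the lowercase ISO 639-3 code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Unsupported language code {code!r}; expected one of: {supported}"
        )
    return normalized


print(normalize_language_code("ENG"))  # eng
```

The returned code can then be passed to `DerivationalStemmer`. Two-letter ISO 639-1 codes such as `en` are rejected by this sketch, which makes the three-letter requirement explicit.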
Productivity Thresholds
Each language has calibrated thresholds for filtering candidates:
| Language | Verbs | Nouns | Rationale |
|---|---|---|---|
| English | ≥5 | ≥9 | Rich derivational morphology |
| German | ≥4 | ≥3 | Compound-heavy language |
| French | ≥4 | ≥5 | Romance language patterns |
| Italian | ≥4 | ≥5 | Similar to French |
| Spanish | ≥3 | ≥4 | Lower productivity |
| Portuguese | ≥3 | ≥4 | Similar to Spanish |
| Russian | ≥3 | ≥2 | Inflection-heavy |
| Polish | ≥3 | ≥3 | Slavic patterns |
| Czech | ≥3 | ≥3 | Similar to Polish |
| Finnish | ≥3 | ≥4 | Agglutinative morphology |
| Hungarian | ≥3 | ≥3 | Agglutinative morphology |
| Others | ≥2-3 | ≥2-4 | Conservative thresholds |
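The thresholds can be read as a per-language lookup keyed by part of speech. The sketch below is illustrative only: this page does not say which quantity Crosstem compares against a threshold, so the `score` argument and the names `THRESHOLDS` and `meets_threshold` are assumptions. Values for the "Others" row are taken from the per-language entries earlier on this page.

```python
# Hypothetical lookup, not part of the Crosstem API.
# (verb_threshold, noun_threshold) per ISO 639-3 code, copied from this page.
THRESHOLDS = {
    "eng": (5, 9),
    "deu": (4, 3),
    "fra": (4, 5),
    "ita": (4, 5),
    "spa": (3, 4),
    "por": (3, 4),
    "cat": (3, 4),
    "swe": (3, 4),
    "rus": (3, 2),
    "pol": (3, 3),
    "ces": (3, 3),
    "hbs": (2, 2),
    "fin": (3, 4),
    "hun": (3, 3),
    "mon": (2, 2),
}


def meets_threshold(lang: str, pos: str, score: int) -> bool:
    """Return True if `score` reaches the threshold for `lang` and `pos`.

    `score` stands in for whatever productivity measure is being filtered;
    its exact definition is an assumption of this sketch.
    """
    verb_min, noun_min = THRESHOLDS[lang]
    if pos == "verb":
        minimum = verb_min
    elif pos == "noun":
        minimum = noun_min
    else:
        raise ValueError(f"Unknown part of speech: {pos!r}")
    return score >= minimum


print(meets_threshold("eng", "noun", 9))  # True
print(meets_threshold("eng", "noun", 8))  # False
```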
Language-Specific Examples
English
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('eng')
# Noun → Verb
print(stemmer.stem('organization')) # organize
print(stemmer.stem('destruction')) # destruct
# Adjective → Noun
print(stemmer.stem('beautiful')) # beauty
# Noun → Adjective
print(stemmer.stem('happiness')) # happy
German
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('deu')
print(stemmer.stem('Organisation')) # organisieren
print(stemmer.stem('Organisierung')) # organisieren
print(stemmer.stem('Schönheit')) # schön
French
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('fra')
print(stemmer.stem('organisation')) # organiser
print(stemmer.stem('organisateur')) # organiser
print(stemmer.stem('beauté')) # beau
Spanish
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('spa')
print(stemmer.stem('organización')) # organizar
print(stemmer.stem('organizador')) # organizar
print(stemmer.stem('belleza')) # bello
Russian
from crosstem import DerivationalStemmer
stemmer = DerivationalStemmer('rus')
print(stemmer.stem('организация')) # организовать
print(stemmer.stem('красота')) # красивый
Data Sources
Language data comes from:
MorphyNet v1.0: Derivational morphology
Source: https://morphynet.org/
License: CC BY-SA 4.0
Coverage: 15 languages
UniMorph: Inflectional morphology
Source: https://unimorph.github.io/
License: CC BY-SA 3.0
Coverage: Subset of supported languages
Wiktionary: Etymology relationships
Source: Wiktionary dumps
License: CC BY-SA 3.0
Coverage: 2,265 languages
Future Languages
Potential additions (dependent on data availability):
Arabic
Chinese (Mandarin)
Japanese
Korean
Hindi
Turkish
Dutch
Norwegian
Danish
To request language support, please open an issue on GitHub.
Language Limitations
Coverage Gaps
Domain jargon: Technical/medical terms may be missing
Neologisms: New words not in training data
Slang: Informal language not well-represented
Archaic terms: Historical words may have incomplete data
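Because of these gaps, callers may want a fallback for words that have no derivational data. The wrapper below is a minimal sketch under stated assumptions: this page does not document how a missing word is signalled, so the hypothetical `stem_or_fallback` treats `None`, an empty string, or a `LookupError` as "no stem available". A plain dictionary stands in for a real stemmer so the example runs on its own.

```python
from typing import Callable, Optional


def stem_or_fallback(stem_fn: Callable[[str], Optional[str]], word: str) -> str:
    """Return stem_fn(word), or the word itself when no stem is available.

    Assumption: a missing word shows up as None, an empty string,
    or a LookupError (which includes KeyError).
    """
    try:
        result = stem_fn(word)
    except LookupError:
        return word
    return result if result else word


# Toy lexicon standing in for a real stemmer's stem method.
toy_lexicon = {"organization": "organize", "happiness": "happy"}

print(stem_or_fallback(toy_lexicon.get, "organization"))  # organize
print(stem_or_fallback(toy_lexicon.get, "yeet"))  # yeet
```

With a real stemmer, the bound method could be passed in place of `toy_lexicon.get`, for example `stem_or_fallback(stemmer.stem, word)`.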
Morphological Patterns
Compounds: Some compound words may not decompose correctly
Irregular forms: Irregular derivations may be missing
Borrowed words: Recently borrowed words may lack derivational data
Regional variants: Dialect-specific forms may not be included
Performance Variations
English: Best tested, highest quality
Major European: Well-tested, good quality
Slavic: Good coverage; lower derivational productivity means thresholds may need tuning
Other: Adequate coverage, less extensively tested
Contributing Languages
To add a new language:
Obtain derivational morphology data
Format as MorphyNet-compatible JSON
Calibrate productivity thresholds
Add test cases
Submit pull request
See Contributing for detailed guidelines.
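As a rough illustration of the formatting step, the sketch below converts tab-separated derivation rows into JSON records. Both sides are assumptions: the input is taken to have six tab-separated columns (source word, target word, source POS, target POS, morpheme, affix type), and the output field names in `FIELDS` are hypothetical, so check the Contributing guidelines for the schema Crosstem actually expects.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical field names; the real schema may differ.
FIELDS = ["source", "target", "source_pos", "target_pos", "morpheme", "type"]


def tsv_to_records(tsv_text: str) -> List[Dict[str, str]]:
    """Parse tab-separated derivation rows into a list of dictionaries."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    records = []
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed rows
        records.append(dict(zip(FIELDS, row)))
    return records


sample = "time\ttimeless\tN\tADJ\tless\tsuffix\n"
print(json.dumps(tsv_to_records(sample), ensure_ascii=False, indent=2))
```

Rows with the wrong number of columns are skipped silently here; a real import script would probably log them instead.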