Languages

Language-specific details and coverage information.

Supported Languages

Crosstem currently supports 15 languages with full derivational morphology data.

European Languages

English (eng)

  • Coverage: ~150,000 words

  • Productivity: High (rich derivational morphology)

  • Thresholds: Verbs ≥5, Nouns ≥9

  • Notes: Best coverage, most extensively tested

German (deu)

  • Coverage: ~120,000 words

  • Productivity: Moderate (compound-heavy)

  • Thresholds: Verbs ≥4, Nouns ≥3

  • Notes: Handles compound words well

French (fra)

  • Coverage: ~90,000 words

  • Productivity: Moderate

  • Thresholds: Verbs ≥4, Nouns ≥5

  • Notes: Romance language patterns

Italian (ita)

  • Coverage: ~80,000 words

  • Productivity: Moderate

  • Thresholds: Verbs ≥4, Nouns ≥5

  • Notes: Similar to French

Spanish (spa)

  • Coverage: ~85,000 words

  • Productivity: Moderate-Low

  • Thresholds: Verbs ≥3, Nouns ≥4

  • Notes: Romance language, extensive verbal system

Portuguese (por)

  • Coverage: ~70,000 words

  • Productivity: Moderate-Low

  • Thresholds: Verbs ≥3, Nouns ≥4

  • Notes: Similar to Spanish

Catalan (cat)

  • Coverage: ~50,000 words

  • Productivity: Moderate

  • Thresholds: Verbs ≥3, Nouns ≥4

  • Notes: Romance language spoken in Catalonia

Swedish (swe)

  • Coverage: ~65,000 words

  • Productivity: Moderate

  • Thresholds: Verbs ≥3, Nouns ≥4

  • Notes: North Germanic language

Slavic Languages

Russian (rus)

  • Coverage: ~100,000 words

  • Productivity: Low (inflection-heavy)

  • Thresholds: Verbs ≥3, Nouns ≥2

  • Notes: Rich inflectional system, less derivation

Polish (pol)

  • Coverage: ~55,000 words

  • Productivity: Low

  • Thresholds: Verbs ≥3, Nouns ≥3

  • Notes: Complex inflectional morphology

Czech (ces)

  • Coverage: ~40,000 words

  • Productivity: Low

  • Thresholds: Verbs ≥3, Nouns ≥3

  • Notes: West Slavic language

Serbo-Croatian (hbs)

  • Coverage: ~35,000 words

  • Productivity: Low

  • Thresholds: Verbs ≥2, Nouns ≥2

  • Notes: South Slavic language

Other Languages

Finnish (fin)

  • Coverage: ~60,000 words

  • Productivity: Moderate

  • Thresholds: Verbs ≥3, Nouns ≥4

  • Notes: Finno-Ugric language, agglutinative

Hungarian (hun)

  • Coverage: ~45,000 words

  • Productivity: Moderate

  • Thresholds: Verbs ≥3, Nouns ≥3

  • Notes: Finno-Ugric language, agglutinative

Mongolian (mon)

  • Coverage: ~25,000 words

  • Productivity: Moderate-Low

  • Thresholds: Verbs ≥2, Nouns ≥2

  • Notes: Mongolic language

Language Codes

Crosstem uses ISO 639-3 language codes:

Code  Language        Example Usage
----  --------------  --------------------------
cat   Catalan         DerivationalStemmer('cat')
ces   Czech           DerivationalStemmer('ces')
deu   German          DerivationalStemmer('deu')
eng   English         DerivationalStemmer('eng')
fin   Finnish         DerivationalStemmer('fin')
fra   French          DerivationalStemmer('fra')
hbs   Serbo-Croatian  DerivationalStemmer('hbs')
hun   Hungarian       DerivationalStemmer('hun')
ita   Italian         DerivationalStemmer('ita')
mon   Mongolian       DerivationalStemmer('mon')
pol   Polish          DerivationalStemmer('pol')
por   Portuguese      DerivationalStemmer('por')
rus   Russian         DerivationalStemmer('rus')
spa   Spanish         DerivationalStemmer('spa')
swe   Swedish         DerivationalStemmer('swe')
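Since the constructor takes a three-letter code, it can help to validate user input before creating a stemmer. The following is a minimal sketch, not part of Crosstem: the `SUPPORTED_LANGUAGES` mapping and the `resolve_language_code` helper are hypothetical names, and the mapping is simply copied from the table above.

```python
# Hypothetical helper, not part of the Crosstem API.
# Codes and names are copied from the language code table.
SUPPORTED_LANGUAGES = {
    "cat": "Catalan",
    "ces": "Czech",
    "deu": "German",
    "eng": "English",
    "fin": "Finnish",
    "fra": "French",
    "hbs": "Serbo-Croatian",
    "hun": "Hungarian",
    "ita": "Italian",
    "mon": "Mongolian",
    "pol": "Polish",
    "por": "Portuguese",
    "rus": "Russian",
    "spa": "Spanish",
    "swe": "Swedish",
}


def resolve_language_code(value: str) -> str:
    """Return the ISO 639-3 code for a supported code or English language name."""
    key = value.strip().lower()
    if key in SUPPORTED_LANGUAGES:
        return key
    for code, name in SUPPORTED_LANGUAGES.items():
        if name.lower() == key:
            return code
    raise ValueError(f"Unsupported language: {value!r}")
```

The result can then be passed on, for example `DerivationalStemmer(resolve_language_code('German'))`. How the constructor itself reacts to an unsupported code is not documented on this page, which is why the check is done up front here.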

Productivity Thresholds

Each language has calibrated thresholds for filtering candidates:

Language    Verbs  Nouns  Rationale
----------  -----  -----  ----------------------------
English     ≥5     ≥9     Rich derivational morphology
German      ≥4     ≥3     Compound-heavy language
French      ≥4     ≥5     Romance language patterns
Italian     ≥4     ≥5     Similar to French
Spanish     ≥3     ≥4     Lower productivity
Portuguese  ≥3     ≥4     Similar to Spanish
Russian     ≥3     ≥2     Inflection-heavy
Polish      ≥3     ≥3     Slavic patterns
Czech       ≥3     ≥3     Similar to Polish
Finnish     ≥3     ≥4     Agglutinative morphology
Hungarian   ≥3     ≥3     Agglutinative morphology
Others      ≥2-3   ≥2-4   Conservative thresholds
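To make the role of the thresholds concrete, here is a minimal sketch of threshold-based filtering. It is an illustration under assumptions, not Crosstem's implementation: this page does not define how productivity is measured, so the sketch assumes each candidate carries an integer productivity score and is kept when that score meets the minimum for its part of speech. The names `THRESHOLDS`, `passes_threshold`, and `filter_candidates` are hypothetical, and the scores in the usage note are invented for illustration.

```python
# Minimum productivity per part of speech, copied from the table above.
# Assumption: a candidate is kept when its score is >= the listed minimum.
THRESHOLDS = {
    "eng": {"verb": 5, "noun": 9},
    "deu": {"verb": 4, "noun": 3},
    "fra": {"verb": 4, "noun": 5},
    "ita": {"verb": 4, "noun": 5},
    "spa": {"verb": 3, "noun": 4},
    "por": {"verb": 3, "noun": 4},
    "rus": {"verb": 3, "noun": 2},
    "pol": {"verb": 3, "noun": 3},
    "ces": {"verb": 3, "noun": 3},
    "fin": {"verb": 3, "noun": 4},
    "hun": {"verb": 3, "noun": 3},
}


def passes_threshold(lang: str, pos: str, productivity: int) -> bool:
    """Check one candidate against the minimum for its language and part of speech.

    Raises KeyError for languages or parts of speech missing from THRESHOLDS.
    """
    return productivity >= THRESHOLDS[lang][pos]


def filter_candidates(lang, candidates):
    """Keep the words whose (word, pos, productivity) entry meets the threshold."""
    return [
        word
        for word, pos, productivity in candidates
        if passes_threshold(lang, pos, productivity)
    ]
```

With invented scores, `filter_candidates('eng', [('organize', 'verb', 7), ('organ', 'noun', 4)])` keeps only `'organize'`, because the English noun threshold is 9 while the verb threshold is 5.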

Language-Specific Examples

English

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('eng')

# Noun → Verb
print(stemmer.stem('organization'))    # organize
print(stemmer.stem('destruction'))     # destruct

# Adjective → Noun
print(stemmer.stem('beautiful'))       # beauty

# Noun → Adjective
print(stemmer.stem('happiness'))       # happy

German

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('deu')

print(stemmer.stem('Organisation'))    # organisieren
print(stemmer.stem('Organisierung'))   # organisieren
print(stemmer.stem('Schönheit'))       # schön

French

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('fra')

print(stemmer.stem('organisation'))    # organiser
print(stemmer.stem('organisateur'))    # organiser
print(stemmer.stem('beauté'))          # beau

Spanish

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('spa')

print(stemmer.stem('organización'))    # organizar
print(stemmer.stem('organizador'))     # organizar
print(stemmer.stem('belleza'))         # bello

Russian

from crosstem import DerivationalStemmer

stemmer = DerivationalStemmer('rus')

print(stemmer.stem('организация'))     # организовать
print(stemmer.stem('красота'))         # красивый

Data Sources

Language data comes from:

  1. MorphyNet v1.0: Derivational morphology

  2. UniMorph: Inflectional morphology

  3. Wiktionary: Etymology relationships

    • Source: Wiktionary dumps

    • License: CC BY-SA 3.0

    • Coverage: 2,265 languages

Future Languages

Potential additions (dependent on data availability):

  • Arabic

  • Chinese (Mandarin)

  • Japanese

  • Korean

  • Hindi

  • Turkish

  • Dutch

  • Norwegian

  • Danish

To request language support, please open an issue on GitHub.

Language Limitations

Coverage Gaps

  • Domain jargon: Technical/medical terms may be missing

  • Neologisms: New words may be absent from the source data

  • Slang: Informal language not well-represented

  • Archaic terms: Historical words may have incomplete data
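Because of these gaps, callers may want a fallback for words the data does not cover. The sketch below is one possible wrapper, written under assumptions: this page does not say what `stem()` does for an unknown word, so the code treats a raised `LookupError` or `ValueError`, a `None`, or an empty string as a miss. The `stem_with_fallback` helper and the `Stemmer` protocol are hypothetical names, not part of Crosstem.

```python
from typing import Callable, Optional, Protocol


class Stemmer(Protocol):
    """Anything with a stem() method, such as a DerivationalStemmer instance."""

    def stem(self, word: str) -> Optional[str]: ...


def stem_with_fallback(
    stemmer: Stemmer,
    word: str,
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the stem, or a fallback result when the word is not covered.

    Assumption: a miss shows up as an exception, None, or an empty string.
    """
    try:
        result = stemmer.stem(word)
    except (LookupError, ValueError):
        result = None
    if result:
        return result
    # No usable stem: defer to the fallback, or return the word unchanged.
    return fallback(word) if fallback is not None else word
```

A call such as `stem_with_fallback(stemmer, 'blockchainify', fallback=str.lower)` would then return the lowercased input whenever the stemmer has no entry for it.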

Morphological Patterns

  • Compounds: Some compound words may not decompose correctly

  • Irregular forms: Irregular derivations may be missing

  • Borrowed words: Recently borrowed words may lack derivational data

  • Regional variants: Dialect-specific forms may not be included

Performance Variations

  • English: Best tested, highest quality

  • Major European: Well-tested, good quality

  • Slavic: Good coverage; lower derivational productivity means thresholds may need tuning

  • Other: Adequate coverage, less extensively tested

Contributing Languages

To add a new language:

  1. Obtain derivational morphology data

  2. Format as MorphyNet-compatible JSON

  3. Calibrate productivity thresholds

  4. Add test cases

  5. Submit pull request

See Contributing for detailed guidelines.
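As a rough illustration of step 2, the sketch below converts tab-separated derivation records into JSON. Everything about the layout is an assumption: the input is assumed to have six columns (source word, target word, source part of speech, target part of speech, affix, affix type), and the output schema, keyed by the derived word, is hypothetical. The JSON structure Crosstem actually expects is described in the Contributing guidelines, not here.

```python
import csv
import io
import json
from collections import defaultdict

# Assumed column order for the tab-separated input (hypothetical field names).
FIELDS = ("source", "target", "source_pos", "target_pos", "affix", "affix_type")


def tsv_to_json(tsv_text: str) -> str:
    """Group derivation records by derived word and serialize them as JSON."""
    by_target = defaultdict(list)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, row))
        target = record.pop("target")
        by_target[target].append(record)
    return json.dumps(by_target, ensure_ascii=False, indent=2, sort_keys=True)
```

Keying by the derived word keeps lookups from a surface form to its base cheap, which matches the direction of the `stem()` examples on this page; a real contribution should follow whatever schema the project specifies.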