Languages
=========

Language-specific details and coverage information.

Supported Languages
-------------------

Crosstem currently supports 15 languages with full derivational morphology data.

European Languages
~~~~~~~~~~~~~~~~~~

English (eng)
^^^^^^^^^^^^^

* **Coverage**: ~150,000 words
* **Productivity**: High (rich derivational morphology)
* **Thresholds**: Verbs ≥5, Nouns ≥9
* **Notes**: Best coverage, most extensively tested

German (deu)
^^^^^^^^^^^^

* **Coverage**: ~120,000 words
* **Productivity**: Moderate (compound-heavy)
* **Thresholds**: Verbs ≥4, Nouns ≥3
* **Notes**: Handles compound words well

French (fra)
^^^^^^^^^^^^

* **Coverage**: ~90,000 words
* **Productivity**: Moderate
* **Thresholds**: Verbs ≥4, Nouns ≥5
* **Notes**: Romance language patterns

Italian (ita)
^^^^^^^^^^^^^

* **Coverage**: ~80,000 words
* **Productivity**: Moderate
* **Thresholds**: Verbs ≥4, Nouns ≥5
* **Notes**: Similar to French

Spanish (spa)
^^^^^^^^^^^^^

* **Coverage**: ~85,000 words
* **Productivity**: Moderate-Low
* **Thresholds**: Verbs ≥3, Nouns ≥4
* **Notes**: Romance language, extensive verbal system

Portuguese (por)
^^^^^^^^^^^^^^^^

* **Coverage**: ~70,000 words
* **Productivity**: Moderate-Low
* **Thresholds**: Verbs ≥3, Nouns ≥4
* **Notes**: Similar to Spanish

Catalan (cat)
^^^^^^^^^^^^^

* **Coverage**: ~50,000 words
* **Productivity**: Moderate
* **Thresholds**: Verbs ≥3, Nouns ≥4
* **Notes**: Romance language spoken in Catalonia

Swedish (swe)
^^^^^^^^^^^^^

* **Coverage**: ~65,000 words
* **Productivity**: Moderate
* **Thresholds**: Verbs ≥3, Nouns ≥4
* **Notes**: North Germanic language

Slavic Languages
~~~~~~~~~~~~~~~~

Russian (rus)
^^^^^^^^^^^^^

* **Coverage**: ~100,000 words
* **Productivity**: Low (inflection-heavy)
* **Thresholds**: Verbs ≥3, Nouns ≥2
* **Notes**: Rich inflectional system, less derivation

Polish (pol)
^^^^^^^^^^^^

* **Coverage**: ~55,000 words
* **Productivity**: Low
* **Thresholds**: Verbs ≥3, Nouns ≥3
* **Notes**: Complex inflectional morphology

Czech (ces)
^^^^^^^^^^^

* **Coverage**: ~40,000 words
* **Productivity**: Low
* **Thresholds**: Verbs ≥3, Nouns ≥3
* **Notes**: West Slavic language

Serbo-Croatian (hbs)
^^^^^^^^^^^^^^^^^^^^

* **Coverage**: ~35,000 words
* **Productivity**: Low
* **Thresholds**: Verbs ≥2, Nouns ≥2
* **Notes**: South Slavic language

Other Languages
~~~~~~~~~~~~~~~

Finnish (fin)
^^^^^^^^^^^^^

* **Coverage**: ~60,000 words
* **Productivity**: Moderate
* **Thresholds**: Verbs ≥3, Nouns ≥4
* **Notes**: Finno-Ugric language, agglutinative

Hungarian (hun)
^^^^^^^^^^^^^^^

* **Coverage**: ~45,000 words
* **Productivity**: Moderate
* **Thresholds**: Verbs ≥3, Nouns ≥3
* **Notes**: Finno-Ugric language, agglutinative

Mongolian (mon)
^^^^^^^^^^^^^^^

* **Coverage**: ~25,000 words
* **Productivity**: Moderate-Low
* **Thresholds**: Verbs ≥2, Nouns ≥2
* **Notes**: Mongolic language

Language Codes
--------------

Crosstem uses ISO 639-3 language codes:

.. list-table::
   :header-rows: 1
   :widths: 15 25 60

   * - Code
     - Language
     - Example Usage
   * - cat
     - Catalan
     - ``DerivationalStemmer('cat')``
   * - ces
     - Czech
     - ``DerivationalStemmer('ces')``
   * - deu
     - German
     - ``DerivationalStemmer('deu')``
   * - eng
     - English
     - ``DerivationalStemmer('eng')``
   * - fin
     - Finnish
     - ``DerivationalStemmer('fin')``
   * - fra
     - French
     - ``DerivationalStemmer('fra')``
   * - hbs
     - Serbo-Croatian
     - ``DerivationalStemmer('hbs')``
   * - hun
     - Hungarian
     - ``DerivationalStemmer('hun')``
   * - ita
     - Italian
     - ``DerivationalStemmer('ita')``
   * - mon
     - Mongolian
     - ``DerivationalStemmer('mon')``
   * - pol
     - Polish
     - ``DerivationalStemmer('pol')``
   * - por
     - Portuguese
     - ``DerivationalStemmer('por')``
   * - rus
     - Russian
     - ``DerivationalStemmer('rus')``
   * - spa
     - Spanish
     - ``DerivationalStemmer('spa')``
   * - swe
     - Swedish
     - ``DerivationalStemmer('swe')``
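
Because the codes are three-letter ISO 639-3 identifiers, a two-letter code such as ``en`` is not in the table. The following is a minimal sketch of validating a code before constructing a stemmer. The names ``SUPPORTED_LANGUAGES`` and ``normalize_language_code`` are hypothetical helpers written for this example and are not part of the Crosstem API.

```python
# ISO 639-3 codes from the table above, mapped to language names.
# Hypothetical helper data; not provided by Crosstem itself.
SUPPORTED_LANGUAGES = {
    "cat": "Catalan",
    "ces": "Czech",
    "deu": "German",
    "eng": "English",
    "fin": "Finnish",
    "fra": "French",
    "hbs": "Serbo-Croatian",
    "hun": "Hungarian",
    "ita": "Italian",
    "mon": "Mongolian",
    "pol": "Polish",
    "por": "Portuguese",
    "rus": "Russian",
    "spa": "Spanish",
    "swe": "Swedish",
}


def normalize_language_code(code: str) -> str:
    """Return the lowercase ISO 639-3 code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Unsupported language code {code!r}; expected one of: {supported}"
        )
    return normalized
```

The returned value can then be passed on, for example ``DerivationalStemmer(normalize_language_code(user_input))``.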

Productivity Thresholds
-----------------------

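Each language has calibrated verb and noun thresholds for filtering candidates. Languages without their own row (Catalan, Swedish, Serbo-Croatian, Mongolian) fall under *Others*; their exact values are listed in the per-language sections above.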

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 50

   * - Language
     - Verbs
     - Nouns
     - Rationale
   * - English
     - ≥5
     - ≥9
     - Rich derivational morphology
   * - German
     - ≥4
     - ≥3
     - Compound-heavy language
   * - French
     - ≥4
     - ≥5
     - Romance language patterns
   * - Italian
     - ≥4
     - ≥5
     - Similar to French
   * - Spanish
     - ≥3
     - ≥4
     - Lower productivity
   * - Portuguese
     - ≥3
     - ≥4
     - Similar to Spanish
   * - Russian
     - ≥3
     - ≥2
     - Inflection-heavy
   * - Polish
     - ≥3
     - ≥3
     - Slavic patterns
   * - Czech
     - ≥3
     - ≥3
     - Similar to Polish
   * - Finnish
     - ≥3
     - ≥4
     - Agglutinative morphology
   * - Hungarian
     - ≥3
     - ≥3
     - Agglutinative morphology
   * - Others
     - ≥2-3
     - ≥2-4
     - Conservative thresholds
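
The thresholds can be pictured as a per-language lookup. This page does not define which quantity is compared against a threshold, so the sketch below assumes each candidate carries an integer productivity score. ``Thresholds``, ``THRESHOLDS``, and ``passes_threshold`` are hypothetical names for illustration, not Crosstem internals.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    verbs: int
    nouns: int


# Values copied from this page (the table above and the per-language sections).
THRESHOLDS = {
    "eng": Thresholds(verbs=5, nouns=9),
    "deu": Thresholds(verbs=4, nouns=3),
    "fra": Thresholds(verbs=4, nouns=5),
    "ita": Thresholds(verbs=4, nouns=5),
    "spa": Thresholds(verbs=3, nouns=4),
    "por": Thresholds(verbs=3, nouns=4),
    "cat": Thresholds(verbs=3, nouns=4),
    "swe": Thresholds(verbs=3, nouns=4),
    "rus": Thresholds(verbs=3, nouns=2),
    "pol": Thresholds(verbs=3, nouns=3),
    "ces": Thresholds(verbs=3, nouns=3),
    "hbs": Thresholds(verbs=2, nouns=2),
    "fin": Thresholds(verbs=3, nouns=4),
    "hun": Thresholds(verbs=3, nouns=3),
    "mon": Thresholds(verbs=2, nouns=2),
}


def passes_threshold(lang: str, pos: str, score: int) -> bool:
    """Check an assumed integer score against the threshold for `pos`.

    `pos` must be "verbs" or "nouns"; `score` is an assumption of this sketch.
    """
    if pos not in Thresholds._fields:
        raise ValueError(f"pos must be one of {Thresholds._fields}, got {pos!r}")
    return score >= getattr(THRESHOLDS[lang], pos)
```

Under this assumption, an English noun candidate with a score of 8 would be filtered out, while the same score would pass for German.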

Language-Specific Examples
---------------------------

English
~~~~~~~

::

   from crosstem import DerivationalStemmer
   
   stemmer = DerivationalStemmer('eng')
   
   # Noun → Verb
   print(stemmer.stem('organization'))    # organize
   print(stemmer.stem('destruction'))     # destruct
   
   # Adjective → Noun
   print(stemmer.stem('beautiful'))       # beauty

   # Noun → Adjective
   print(stemmer.stem('happiness'))       # happy

German
~~~~~~

::

   from crosstem import DerivationalStemmer

   stemmer = DerivationalStemmer('deu')
   
   print(stemmer.stem('Organisation'))    # organisieren
   print(stemmer.stem('Organisierung'))   # organisieren
   print(stemmer.stem('Schönheit'))       # schön

French
~~~~~~

::

   from crosstem import DerivationalStemmer

   stemmer = DerivationalStemmer('fra')
   
   print(stemmer.stem('organisation'))    # organiser
   print(stemmer.stem('organisateur'))    # organiser
   print(stemmer.stem('beauté'))          # beau

Spanish
~~~~~~~

::

   from crosstem import DerivationalStemmer

   stemmer = DerivationalStemmer('spa')
   
   print(stemmer.stem('organización'))    # organizar
   print(stemmer.stem('organizador'))     # organizar
   print(stemmer.stem('belleza'))         # bello

Russian
~~~~~~~

::

   from crosstem import DerivationalStemmer

   stemmer = DerivationalStemmer('rus')
   
   print(stemmer.stem('организация'))     # организовать
   print(stemmer.stem('красота'))         # красивый

Data Sources
------------

Language data comes from:

1. **MorphyNet v1.0**: Derivational morphology
   
   * Source: https://morphynet.org/
   * License: CC BY-SA 4.0
   * Coverage: 15 languages

2. **UniMorph**: Inflectional morphology
   
   * Source: https://unimorph.github.io/
   * License: CC BY-SA 3.0
   * Coverage: Subset of supported languages

3. **Wiktionary**: Etymology relationships
   
   * Source: Wiktionary dumps
   * License: CC BY-SA 3.0
   * Coverage: 2,265 languages

Future Languages
----------------

Potential additions (dependent on data availability):

* Arabic
* Chinese (Mandarin)
* Japanese
* Korean
* Hindi
* Turkish
* Dutch
* Norwegian
* Danish

To request language support, please open an issue on GitHub.

Language Limitations
--------------------

Coverage Gaps
~~~~~~~~~~~~~

* **Domain jargon**: Technical/medical terms may be missing
* **Neologisms**: Recently coined words may be absent from the source data
* **Slang**: Informal language not well-represented
* **Archaic terms**: Historical words may have incomplete data
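
Callers that process open-domain text may want to guard against the gaps listed above. This page does not say how ``stem`` behaves for a word with no data, so the sketch below assumes it either returns an empty value or raises a ``LookupError``. ``stem_with_fallback`` is a hypothetical wrapper, and the demo uses a plain dictionary in place of a real stemmer.

```python
from typing import Callable, Optional


def stem_with_fallback(stem_fn: Callable[[str], Optional[str]], word: str) -> str:
    """Return stem_fn(word), or the word unchanged when no stem is available."""
    try:
        result = stem_fn(word)
    except LookupError:
        # Assumed failure mode for unknown words (also covers KeyError).
        return word
    # Treat None or an empty string as "no data for this word".
    return result if result else word


# Stand-in lookup for demonstration only; this is not Crosstem data.
demo_table = {"happiness": "happy"}

print(stem_with_fallback(demo_table.get, "happiness"))      # happy
print(stem_with_fallback(demo_table.get, "doomscrolling"))  # doomscrolling
```

With a real stemmer, the bound method can be passed directly, for example ``stem_with_fallback(stemmer.stem, word)``.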

Morphological Patterns
~~~~~~~~~~~~~~~~~~~~~~

* **Compounds**: Some compound words may not decompose correctly
* **Irregular forms**: Irregular derivations may be missing
* **Borrowed words**: Recently borrowed words may lack derivational data
* **Regional variants**: Dialect-specific forms may not be included

Performance Variations
~~~~~~~~~~~~~~~~~~~~~~

* **English**: Best tested, highest quality
* **Major European**: Well-tested, good quality
* **Slavic**: Good coverage; lower derivational productivity means thresholds may need tuning
* **Other**: Adequate coverage, less extensively tested

Contributing Languages
----------------------

To add a new language:

1. Obtain derivational morphology data
2. Format as MorphyNet-compatible JSON
3. Calibrate productivity thresholds
4. Add test cases
5. Submit a pull request

See :doc:`contributing` for detailed guidelines.