Contributing
We welcome contributions to Crossstem! This guide explains how to set up a development environment, report bugs, and submit changes.
Ways to Contribute
Bug reports: Found an issue? Report it on GitHub
Feature requests: Suggest new features or improvements
Code contributions: Submit pull requests with fixes or enhancements
Documentation: Improve docs, add examples, fix typos
Language data: Add support for new languages
Testing: Write tests, report edge cases
Getting Started
Development Setup
Fork the repository on GitHub
Clone your fork locally:
git clone https://github.com/YOUR_USERNAME/crossstem.git
cd crossstem
Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install in development mode:
pip install -e .
Install development dependencies:
pip install pytest pytest-cov black flake8 mypy
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=crossstem
# Run specific test file
pytest tests/test_stemmer.py
Code Style
We use Black for formatting, flake8 for linting, and mypy for type checking:
# Format code
black crossstem/
# Check linting
flake8 crossstem/
# Type checking
mypy crossstem/
Reporting Bugs
Before reporting a bug:
Check if it’s already reported in GitHub Issues
Make sure you’re using the latest version
Test with a minimal reproducible example
Bug Report Template
**Describe the bug**
A clear description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Import Crossstem
2. Call stemmer.stem('word')
3. See error
**Expected behavior**
What you expected to happen.
**Actual behavior**
What actually happened.
**Environment**
- OS: [e.g., Windows 10, Ubuntu 22.04]
- Python version: [e.g., 3.9.7]
- Crossstem version: [e.g., 0.2.0]
**Minimal example**
```python
from crossstem import DerivationalStemmer
stemmer = DerivationalStemmer('eng')
print(stemmer.stem('problematic_word'))
```
Feature Requests
We’re open to new features! Please describe:
Use case: Why is this feature needed?
Proposal: How should it work?
Examples: Show example usage
Alternatives: What alternatives exist?
Pull Requests
PR Checklist
Before submitting a PR:
[ ] Code follows the project style (Black + flake8)
[ ] All tests pass
[ ] New tests added for new features
[ ] Documentation updated
[ ] CHANGELOG.md updated
[ ] Commit messages are clear
PR Process
Create a feature branch:
git checkout -b feature/amazing-feature
Make your changes
Add tests for new functionality
Ensure all tests pass
Commit with clear messages:
git commit -m "Add amazing feature"
Push to your fork:
git push origin feature/amazing-feature
Open a Pull Request on GitHub
Adding Languages
To add support for a new language:
Data Requirements
Derivational data: MorphyNet-compatible JSON format
Inflectional data: UniMorph-compatible TSV format
Minimum coverage: At least 20,000 words
License: Must be open license (CC BY-SA or similar)
Format Example
Derivational data (<lang>_derivations.json):
{
  "word1": {
    "derives_from": ["parent1", "parent2"],
    "derives_to": ["child1", "child2"],
    "pos": "V"
  },
  "word2": {
    "derives_from": [],
    "derives_to": ["child3"],
    "pos": "N"
  }
}
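Before submitting a new data file, it can help to check that every entry has the shape shown above. The following is a minimal sketch of such a check, not part of Crossstem itself: the function name `validate_derivations` and the error messages are hypothetical, and the only rules it enforces are the ones visible in the format example (three keys, two lists of strings, one string).

```python
import json

# Keys taken from the format example; nothing else is assumed about the schema.
REQUIRED_KEYS = {"derives_from", "derives_to", "pos"}


def validate_derivations(data: dict) -> list:
    """Return a list of human-readable problems found in the data (hypothetical helper)."""
    problems = []
    for word, entry in data.items():
        if not isinstance(entry, dict):
            problems.append(f"{word}: entry is not an object")
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{word}: missing keys {sorted(missing)}")
            continue
        for key in ("derives_from", "derives_to"):
            value = entry[key]
            if not isinstance(value, list) or not all(isinstance(w, str) for w in value):
                problems.append(f"{word}: {key} must be a list of strings")
        if not isinstance(entry["pos"], str):
            problems.append(f"{word}: pos must be a string")
    return problems


# Small in-memory sample with two deliberate mistakes.
sample = json.loads("""
{
  "word1": {"derives_from": ["parent1"], "derives_to": [], "pos": "V"},
  "word2": {"derives_from": [], "derives_to": "child3", "pos": "N"},
  "word3": {"derives_from": []}
}
""")

for problem in validate_derivations(sample):
    print(problem)
```

For a real file, the same function could be applied to the result of `json.load` on `<lang>_derivations.json`, and `len(data)` compared against the 20,000-word minimum.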
Calibrating Thresholds
Analyze productivity distribution:
python scripts/analyze_productivity.py <lang>
Set thresholds in crossstem/stemmer.py:
PRODUCTIVITY_THRESHOLDS = {
    'new': {'V': 3, 'N': 4},  # Your language
    # ... existing languages
}
Test stemming quality:
python scripts/test_language.py <lang>
Adjust thresholds based on results
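To get a feel for what a productivity distribution might look like before running the project scripts, here is a rough sketch. It assumes, purely for illustration, that a word's productivity is the number of entries in its `derives_to` list; the actual definition used by `scripts/analyze_productivity.py` may differ. The helper name `productivity_by_pos` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def productivity_by_pos(data: dict) -> dict:
    """Group the number of direct derivatives per word by part of speech.

    Assumption: productivity == len(entry["derives_to"]).
    """
    counts = defaultdict(list)
    for entry in data.values():
        counts[entry["pos"]].append(len(entry["derives_to"]))
    return dict(counts)


# Toy data in the derivational format described earlier in this guide.
toy = {
    "a": {"derives_from": [], "derives_to": ["b", "c", "d"], "pos": "V"},
    "b": {"derives_from": ["a"], "derives_to": ["e"], "pos": "V"},
    "c": {"derives_from": ["a"], "derives_to": [], "pos": "N"},
    "d": {"derives_from": ["a"], "derives_to": ["f", "g"], "pos": "N"},
}

for pos, values in sorted(productivity_by_pos(toy).items()):
    print(pos, "median:", median(values), "max:", max(values))
```

Summary statistics like these, computed per part of speech, are one plausible starting point for choosing the `'V'` and `'N'` values before checking stemming quality.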
Adding Tests
Create tests/test_<lang>.py:
from crossstem import DerivationalStemmer


def test_<lang>_stemming():
    stemmer = DerivationalStemmer('<lang>')
    # Test cases
    assert stemmer.stem('word1') == 'expected_root1'
    assert stemmer.stem('word2') == 'expected_root2'
    # Multi-hop cases
    assert stemmer.stem('derived_word') == 'root'
Documentation Updates
Add language to docs/source/languages.rst
Update README.md with the new language count
Add examples in docs/source/examples.rst
Improving Algorithm
If you have ideas for improving the BFS algorithm:
Open an issue to discuss the approach
Provide benchmark results showing improvement
Include examples of edge cases it handles better
Ensure it doesn’t regress existing behavior
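When discussing changes, a small reference model of the search can make proposals easier to compare. The sketch below is a generic breadth-first search over `derives_from` links in the data format described under Adding Languages. It is not Crossstem's actual implementation: it ignores productivity thresholds, and the function name `find_root` and the toy graph are hypothetical.

```python
from collections import deque


def find_root(word: str, data: dict) -> str:
    """Follow derives_from links breadth-first until a word with no parents is found."""
    queue = deque([word])
    seen = {word}
    while queue:
        current = queue.popleft()
        parents = data.get(current, {}).get("derives_from", [])
        if not parents:
            return current  # first parentless word reached is the closest root
        for parent in parents:
            if parent not in seen:  # guard against cycles in the data
                seen.add(parent)
                queue.append(parent)
    return word  # every path looped back; fall back to the input word


# Toy two-hop chain: derived_word -> middle -> root
toy_graph = {
    "derived_word": {"derives_from": ["middle"], "derives_to": [], "pos": "N"},
    "middle": {"derives_from": ["root"], "derives_to": ["derived_word"], "pos": "V"},
    "root": {"derives_from": [], "derives_to": ["middle"], "pos": "V"},
}

print(find_root("derived_word", toy_graph))  # prints: root
```

A proposed improvement can then be described in terms of how it departs from this baseline, for example by changing when the search stops.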
Testing Strategy
Benchmark against the Porter stemmer on common word lists
Test accuracy on hand-labeled examples
Measure speed with large corpora
Verify behavior across all 15 languages
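The accuracy and speed checks above can be combined in one small harness. The following sketch accepts any callable that maps a word to a root, so the same code could be pointed at `DerivationalStemmer(...).stem` or at a baseline stemmer. The function name `evaluate` and the stand-in stemmer `strip_s` are hypothetical and exist only to keep the example self-contained.

```python
import time


def evaluate(stem, labeled_pairs):
    """Return (accuracy, seconds) for a stem function over (word, expected_root) pairs."""
    start = time.perf_counter()
    correct = sum(1 for word, root in labeled_pairs if stem(word) == root)
    elapsed = time.perf_counter() - start
    return correct / len(labeled_pairs), elapsed


def strip_s(word):
    """Stand-in stemmer used only for this example: drops a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Hand-labeled examples; the last one is a deliberate failure case for strip_s.
pairs = [("cats", "cat"), ("dogs", "dog"), ("glass", "glass")]

accuracy, seconds = evaluate(strip_s, pairs)
print(f"accuracy={accuracy:.2f} time={seconds:.6f}s")
```

Running the same labeled pairs through the current and the proposed algorithm gives directly comparable numbers to include in a pull request.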
Code Organization
Project Structure
crossstem/
├── __init__.py # Package exports
├── stemmer.py # DerivationalStemmer class
├── analyzer.py # InflectionAnalyzer class
├── etymology_linker.py # EtymologyLinker class
├── download.py # Etymology download utilities
├── exceptions.py # Custom exceptions
└── data/ # Language data files
Testing Structure
tests/
├── test_stemmer.py # Stemming tests
├── test_analyzer.py # Inflection tests
├── test_etymology.py # Etymology tests
└── test_<lang>.py # Language-specific tests
Adding Documentation
Documentation is built with Sphinx and hosted on Read the Docs.
Local Build
cd docs/
pip install -r requirements.txt
make html
View at docs/build/html/index.html
Adding Pages
Create docs/source/<page>.rst
Add to index.rst table of contents
Build and verify locally
Submit PR
Docstring Style
Use Google-style docstrings:
def stem(self, word: str) -> str:
    """Find the morphological root of a word.

    Args:
        word: The word to stem

    Returns:
        The morphological root

    Example:
        >>> stemmer = DerivationalStemmer('eng')
        >>> stemmer.stem('organization')
        'organize'
    """
Code Review Process
All PRs are reviewed by maintainers. We look for:
Correctness: Does it work as intended?
Tests: Is it well-tested?
Documentation: Is it documented?
Style: Does it follow conventions?
Performance: Does it maintain speed?
Feedback may include:
Requests for changes
Suggestions for improvements
Questions about design decisions
Please be patient and constructive in discussions.
Release Process
Versioning
We follow Semantic Versioning (semver):
MAJOR: Incompatible API changes
MINOR: New features, backwards-compatible
PATCH: Bug fixes, backwards-compatible
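As an illustration of how the three components change, here is a small sketch of a version bump. It handles only plain `MAJOR.MINOR.PATCH` strings (no pre-release or build suffixes), and the function name `bump` is hypothetical, not a Crossstem or packaging API.

```python
def bump(version: str, part: str) -> str:
    """Return the next version string for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # incompatible API change resets minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # new feature resets patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.2.0", "patch"))  # prints: 0.2.1
print(bump("0.2.0", "minor"))  # prints: 0.3.0
print(bump("0.2.0", "major"))  # prints: 1.0.0
```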
Maintainer Responsibilities
Review and merge PRs
Update CHANGELOG.md
Create GitHub releases
Publish to PyPI
Update documentation
Community Guidelines
Be respectful and constructive
Focus on the issue, not the person
Assume good intentions
Ask questions when unclear
Give credit to contributors
License
By contributing, you agree that your contributions will be licensed under the MIT License.
Questions?
Open an issue on GitHub
Tag maintainers: @droidmaximus
Thank you for contributing to Crossstem!