├── tests ├── __init__.py ├── test_signwriting.py ├── test_chinese.py ├── test_japanese.py ├── test_word_stopping_criteria.py └── test_pretokenizer.py ├── examples ├── __init__.py └── tokens_parity.py ├── words_segmentation ├── __init__.py ├── signwriting.py ├── tokenizer.py ├── chinese.py ├── japanese.py ├── pretokenizer.py └── languages.py ├── .gitignore ├── assets ├── tokenization-parity.png └── tokenization-parity-words.png ├── .github └── workflows │ ├── lint.yaml │ ├── test.yaml │ └── release.yaml ├── LICENSE ├── pyproject.toml └── README.md /tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /examples/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /words_segmentation/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | .claude/ 3 | *.egg-info 4 | build/ 5 | dist/ 6 | .env 7 | __pycache__/ 8 | *.pyc 9 | *.pyo -------------------------------------------------------------------------------- /assets/tokenization-parity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sign/words-segmentation/main/assets/tokenization-parity.png -------------------------------------------------------------------------------- /assets/tokenization-parity-words.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sign/words-segmentation/main/assets/tokenization-parity-words.png -------------------------------------------------------------------------------- /words_segmentation/signwriting.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | from signwriting.formats.swu import re_swu 4 | 5 | 6 | def segment_signwriting(text: str) -> list[str]: 7 | return re.findall(re_swu['sign'], text) 8 | -------------------------------------------------------------------------------- /.github/workflows/lint.yaml: -------------------------------------------------------------------------------- 1 | name: Lint 2 | 3 | 4 | on: 5 | push: 6 | branches: [ main ] 7 | pull_request: 8 | branches: [ main ] 9 | 10 | 11 | jobs: 12 | test: 13 | name: Lint 14 | runs-on: ubuntu-latest 15 | 16 | steps: 17 | - uses: actions/checkout@v5 18 | 19 | - name: Setup uv 20 | uses: astral-sh/setup-uv@v6.5.0 21 | with: 22 | python-version: "3.12" 23 | enable-cache: true 24 | activate-environment: true 25 | 26 | - name: Install dependencies 27 | run: uv pip install ".[dev]" 28 | 29 | - name: Lint code 30 | run: uv run ruff check . 
-------------------------------------------------------------------------------- /.github/workflows/test.yaml: -------------------------------------------------------------------------------- 1 | name: Test 2 | 3 | 4 | on: 5 | push: 6 | branches: [ main ] 7 | pull_request: 8 | branches: [ main ] 9 | 10 | 11 | jobs: 12 | test: 13 | name: Test 14 | runs-on: ubuntu-latest 15 | 16 | steps: 17 | - uses: actions/checkout@v5 18 | 19 | - name: Setup uv 20 | uses: astral-sh/setup-uv@v6.5.0 21 | with: 22 | python-version: "3.12" 23 | enable-cache: true 24 | activate-environment: true 25 | 26 | - name: Install dependencies 27 | run: uv pip install ".[dev]" 28 | 29 | - name: Test Code 30 | run: uv run pytest -n auto --dist loadscope 31 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 sign 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /.github/workflows/release.yaml: -------------------------------------------------------------------------------- 1 | name: Publish Python Package 2 | on: 3 | release: 4 | types: [ created ] 5 | 6 | jobs: 7 | pypi-publish: 8 | name: Upload release to PyPI 9 | runs-on: ubuntu-latest 10 | environment: 11 | name: pypi 12 | url: https://pypi.org/p/words-segmentation 13 | permissions: 14 | id-token: write 15 | steps: 16 | - uses: actions/checkout@v5 17 | 18 | - uses: actions/setup-python@v6 19 | with: 20 | python-version: "3.12" 21 | 22 | - name: Extract release version 23 | id: get_version 24 | run: echo "version=${GITHUB_REF#refs/tags/}" >> $GITHUB_ENV 25 | 26 | - name: Update version in pyproject.toml 27 | run: | 28 | sed -i 's/^version = .*/version = "${{ env.version }}"/' pyproject.toml 29 | 30 | - name: Install build dependencies 31 | run: pip install build 32 | 33 | - name: Build a binary wheel dist 34 | run: | 35 | rm -rf dist 36 | python -m build 37 | 38 | - name: Publish distribution 📦 to PyPI 39 | uses: pypa/gh-action-pypi-publish@release/v1 40 | -------------------------------------------------------------------------------- /tests/test_signwriting.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.signwriting import segment_signwriting 4 | 5 | 6 | def test_segment_single_sign(): 7 | sign = "𝠀񆄱񈠣񍉡𝠃𝤛𝤵񍉡𝣴𝣵񆄱𝤌𝤆񈠣𝤉𝤚" 8 | result = segment_signwriting(sign) 9 | assert result == [sign] 10 | 11 | def test_segment_single_sign_no_prefix(): 12 | sign = "𝠃𝤛𝤵񍉡𝣴𝣵񆄱𝤌𝤆񈠣𝤉𝤚" 13 | result = segment_signwriting(sign) 14 | assert result == [sign] 15 | 16 | def test_segment_with_space(): 17 | signs = [ 18 | "𝠀񀀒񀀚񋚥񋛩𝠃𝤟𝤩񋛩𝣵𝤐񀀒𝤇𝣤񋚥𝤐𝤆񀀚𝣮𝣭", 19 | "𝠀񂇢񂇈񆙡񋎥񋎵𝠃𝤛𝤬񂇈𝤀𝣺񂇢𝤄𝣻񋎥𝤄𝤗񋎵𝤃𝣟񆙡𝣱𝣸", 20 | "𝠃𝤙𝤞񀀙𝣷𝤀񅨑𝣼𝤀񆉁𝣳𝣮" 21 | ] 22 | result = segment_signwriting(" ".join(signs)) 23 | assert result == signs 24 | 25 | def test_segment_no_space(): 26 | signs = [ 27 | "𝠀񀀒񀀚񋚥񋛩𝠃𝤟𝤩񋛩𝣵𝤐񀀒𝤇𝣤񋚥𝤐𝤆񀀚𝣮𝣭", 28 | "𝠀񂇢񂇈񆙡񋎥񋎵𝠃𝤛𝤬񂇈𝤀𝣺񂇢𝤄𝣻񋎥𝤄𝤗񋎵𝤃𝣟񆙡𝣱𝣸", 29 | "𝠃𝤙𝤞񀀙𝣷𝤀񅨑𝣼𝤀񆉁𝣳𝣮" 30 | ] 31 | result = segment_signwriting("".join(signs)) 32 | assert result == signs 33 | 34 | if __name__ == "__main__": 35 | pytest.main([__file__, "-v"]) 36 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "words-segmentation" 3 | description = "Text segmentation into words for multiple languages." 
4 | version = "0.0.1" 5 | authors = [ 6 | { name = "Amit Moryossef", email = "amit@sign.mt" }, 7 | ] 8 | readme = "README.md" 9 | requires-python = ">=3.10" 10 | dependencies = [ 11 | "transformers[torch]", 12 | "utf8-tokenizer", 13 | "fugashi[unidic-lite]", # For Japanese word segmentation 14 | "jieba", # For Chinese word segmentation 15 | "signwriting", # For SignWriting segmentation 16 | ] 17 | 18 | [project.optional-dependencies] 19 | dev = [ 20 | "ruff", 21 | "pytest", 22 | "pytest-xdist", # For parallel test execution 23 | ] 24 | 25 | 26 | [tool.setuptools] 27 | packages = [ 28 | "words_segmentation", 29 | ] 30 | 31 | [tool.ruff] 32 | line-length = 120 33 | 34 | [tool.ruff.lint.per-file-ignores] 35 | "examples/tokens_parity.py" = ["E501"] 36 | 37 | [tool.ruff.lint] 38 | select = [ 39 | "E", # pycodestyle errors 40 | "W", # pycodestyle warnings 41 | "F", # pyflakes 42 | "C90", # mccabe complexity 43 | "I", # isort 44 | "N", # pep8-naming 45 | "UP", # pyupgrade 46 | "B", # flake8-bugbear 47 | "PT", # flake8-pytest-style 48 | "W605", # invalid escape sequence 49 | "BLE", # flake8-blind-except 50 | ] 51 | 52 | [tool.pytest.ini_options] 53 | addopts = "-v" 54 | testpaths = ["words_segmentation", "tests"] 55 | -------------------------------------------------------------------------------- /words_segmentation/tokenizer.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | from transformers import AutoTokenizer, PreTrainedTokenizer 4 | from transformers.tokenization_utils_base import TextInput 5 | 6 | from words_segmentation.pretokenizer import text_to_words, words_to_text 7 | 8 | 9 | class WordsSegmentationTokenizer(PreTrainedTokenizer): 10 | """ 11 | Custom Tokenizer implementation, 12 | extending PreTrainedTokenizer for basic Hugging Face ecosystem support. 
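    Example (illustrative, mirroring the README usage; output per the pretokenizer rules below):
        >>> WordsSegmentationTokenizer(max_bytes=16).tokenize("hello world!")
        ['hello ', 'world!']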
13 | """ 14 | 15 | def __init__(self, max_bytes: int = math.inf, **kwargs): 16 | super().__init__(**kwargs) 17 | self.max_bytes = max_bytes 18 | 19 | @property 20 | def vocab_size(self) -> float: 21 | return math.inf 22 | 23 | def add_tokens(self, *args, **kwargs): 24 | raise NotImplementedError("WordsSegmentationTokenizer does not support adding tokens") 25 | 26 | def get_vocab(self): 27 | return {} 28 | 29 | def _tokenize(self, text: TextInput, **kwargs): 30 | return text_to_words(text, max_bytes=self.max_bytes) 31 | 32 | def tokenize(self, text: TextInput, **kwargs): 33 | return self._tokenize(text, **kwargs) 34 | 35 | def _encode_plus(self, text: TextInput, **kwargs): 36 | raise Exception("WordsSegmentationTokenizer can not encode to ids") 37 | 38 | def _convert_token_to_id(self, token: str): 39 | raise Exception("WordsSegmentationTokenizer can not convert to ids") 40 | 41 | def _convert_id_to_token(self, index: int): 42 | raise Exception("WordsSegmentationTokenizer can not decode ids") 43 | 44 | def convert_tokens_to_string(self, tokens: list[str]): 45 | """Converts a sequence of tokens (string) in a single string.""" 46 | return words_to_text(tokens) 47 | 48 | def build_inputs_with_special_tokens(self, **unused_kwargs): 49 | raise Exception("WordsSegmentationTokenizer does not use special tokens") 50 | 51 | def save_vocabulary(self, save_directory: str, filename_prefix: str | None = None): 52 | return () 53 | 54 | def to_dict(self): 55 | return {"max_bytes": self.max_bytes} 56 | 57 | 58 | AutoTokenizer.register(WordsSegmentationTokenizer, slow_tokenizer_class=WordsSegmentationTokenizer) 59 | 60 | if __name__ == "__main__": 61 | tokenizer = WordsSegmentationTokenizer() 62 | print(tokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦", max_bytes=16)) 63 | -------------------------------------------------------------------------------- /tests/test_chinese.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.chinese import has_chinese, segment_chinese 4 | 5 | 6 | def test_has_chinese_simple(): 7 | """Test has_chinese with simple Chinese characters.""" 8 | assert has_chinese("你好") 9 | assert has_chinese("中文") 10 | assert has_chinese("我来到北京清华大学") 11 | 12 | 13 | def test_has_chinese_mixed_content(): 14 | """Test has_chinese with mixed Chinese and other characters.""" 15 | assert has_chinese("hello 你好") 16 | assert has_chinese("mixed 你好") 17 | assert has_chinese("123 中文 abc") 18 | assert has_chinese("English 中文 混合") 19 | 20 | 21 | def test_has_chinese_no_chinese(): 22 | """Test has_chinese with non-Chinese text.""" 23 | assert not has_chinese("hello") 24 | assert not has_chinese("English text") 25 | assert not has_chinese("123456") 26 | assert not has_chinese("!@#$%^&*()") 27 | assert not has_chinese("こんにちは") # Japanese 28 | assert not has_chinese("עברית") # Hebrew 29 | 30 | 31 | def test_has_chinese_empty_string(): 32 | """Test has_chinese with empty string.""" 33 | assert not has_chinese("") 34 | 35 | 36 | def test_has_chinese_whitespace_only(): 37 | """Test has_chinese with whitespace only.""" 38 | assert not has_chinese(" ") 39 | assert not has_chinese("\n\t") 40 | assert not has_chinese(" ") 41 | 42 | 43 | def test_segment_chinese_simple(): 44 | """Test segment_chinese with simple Chinese text.""" 45 | result = segment_chinese("你好") 46 | assert result == ["你好"] 47 | 48 | 49 | def test_segment_chinese_mixed(): 50 | """Test segment_chinese with mixed Chinese and English.""" 51 | result = segment_chinese("hello 
我来到北京清华大学 world") 52 | assert result == ['hello', ' ', '我', '来到', '北京', '清华大学', ' ', 'world'] 53 | 54 | 55 | def test_segment_chinese_empty(): 56 | """Test segment_chinese with empty string.""" 57 | result = segment_chinese("") 58 | assert result == [] 59 | 60 | 61 | def test_segment_chinese_complex(): 62 | """Test segment_chinese with complex Chinese sentence.""" 63 | result = segment_chinese("小明硕士毕业于中国科学院计算所") 64 | assert result == ['小明', '硕士', '毕业', '于', '中国科学院', '计算所'] 65 | 66 | 67 | def test_segment_chinese_compound_words(): 68 | """Test segment_chinese with compound words.""" 69 | result = segment_chinese("中文分词测试") 70 | assert result == ['中文', '分词', '测试'] 71 | 72 | 73 | if __name__ == "__main__": 74 | pytest.main([__file__, "-v"]) 75 | -------------------------------------------------------------------------------- /words_segmentation/chinese.py: -------------------------------------------------------------------------------- 1 | """ 2 | Chinese text pretokenization utilities. 3 | 4 | This module provides functions for detecting and segmenting Chinese text using the jieba 5 | library for word segmentation. 6 | """ 7 | 8 | from functools import cache 9 | 10 | import regex 11 | 12 | 13 | def has_chinese(text: str) -> bool: 14 | """ 15 | Check if the given text contains Chinese characters. 16 | 17 | Uses Unicode Han ideograph property to detect Chinese characters including: 18 | - CJK Unified Ideographs (U+4E00-U+9FFF) 19 | - CJK Extension A, B, C, D, E, F, and G 20 | - CJK Compatibility Ideographs 21 | 22 | Args: 23 | text: The input text to check for Chinese characters 24 | 25 | Returns: 26 | True if Chinese characters are found, False otherwise 27 | """ 28 | # Match any Han ideograph using Unicode property 29 | return bool(regex.search(r'[\p{Han}]', text)) 30 | 31 | 32 | @cache 33 | def get_chinese_segmenter(): 34 | """ 35 | Get a cached instance of the jieba Chinese word segmenter. 36 | 37 | Jieba is a popular Chinese text segmentation library that uses a combination of 38 | dictionary-based matching and statistical models to segment Chinese text into words. 39 | The segmenter is cached to avoid repeated initialization overhead. 40 | 41 | Returns: 42 | jieba module instance for text segmentation 43 | 44 | Raises: 45 | ImportError: If the jieba library is not installed 46 | """ 47 | try: 48 | import jieba 49 | except ImportError: 50 | print("Error: jieba library not found. Please install it with: pip install jieba") 51 | raise 52 | 53 | return jieba 54 | 55 | 56 | def segment_chinese(text: str) -> list[str]: 57 | """ 58 | Segment Chinese text into a list of words. 59 | 60 | Uses jieba's precise segmentation mode to break Chinese text into individual words. 61 | This preprocessing step helps the tokenizer better 62 | understand Chinese text structure. 63 | 64 | Args: 65 | text: The Chinese text to segment 66 | 67 | Returns: 68 | List of Chinese words 69 | 70 | Example: 71 | >>> segment_chinese("我爱北京天安门") 72 | ['我', '爱', '北京', '天安门'] 73 | """ 74 | jieba = get_chinese_segmenter() 75 | # Use jieba.cut() for precise segmentation 76 | segments = jieba.cut(text) 77 | # Materialize the lazy generator into a list of words 78 | return list(segments) 79 | -------------------------------------------------------------------------------- /words_segmentation/japanese.py: -------------------------------------------------------------------------------- 1 | """ 2 | Japanese text pretokenization utilities. 
3 | 4 | This module provides functions for detecting and segmenting Japanese text using the fugashi 5 | library with MeCab for morphological analysis. 6 | """ 7 | 8 | from functools import cache 9 | 10 | import regex 11 | 12 | 13 | def has_japanese(text: str) -> bool: 14 | """ 15 | Check if the given text contains Japanese characters. 16 | 17 | Detects Japanese writing systems including: 18 | - Hiragana (U+3040-U+309F): Phonetic script for native words 19 | - Katakana (U+30A0-U+30FF): Phonetic script for foreign words 20 | - Kanji (Han ideographs): Chinese characters used in Japanese 21 | - Half-width Katakana (U+FF65-U+FF9F): Narrow katakana variants 22 | 23 | Args: 24 | text: The input text to check for Japanese characters 25 | 26 | Returns: 27 | True if Japanese characters are found, False otherwise 28 | """ 29 | # Match Hiragana, Katakana, or Han ideographs (Kanji) 30 | return bool(regex.search(r'[\p{Hiragana}\p{Katakana}\p{Han}]', text)) 31 | 32 | 33 | @cache 34 | def get_japanese_tagger(): 35 | """ 36 | Get a cached instance of the fugashi Japanese morphological analyzer. 37 | 38 | Fugashi is a Python wrapper for MeCab, a morphological analyzer for Japanese. 39 | The tagger is configured with the '-Owakati' option to output space-separated 40 | words without part-of-speech information. The instance is cached to avoid 41 | repeated initialization overhead. 42 | 43 | Returns: 44 | fugashi.Tagger instance configured for word segmentation 45 | 46 | Raises: 47 | ImportError: If the fugashi library or unidic-lite dictionary is not installed 48 | """ 49 | try: 50 | from fugashi import Tagger 51 | except ImportError: 52 | print("Error: fugashi library not found. Please install it with: pip install 'fugashi[unidic-lite]'") 53 | raise 54 | 55 | # -Owakati: Output format that produces space-separated words only 56 | return Tagger('-Owakati') 57 | 58 | 59 | def segment_japanese(text: str) -> list[str]: 60 | """ 61 | Segment Japanese text into a list of words using morphological analysis. 62 | 63 | Uses MeCab via fugashi to perform morphological analysis and word segmentation 64 | of Japanese text. This handles the complex task of word boundary detection in 65 | Japanese, which doesn't use spaces between words. 
66 | 67 | Args: 68 | text: The Japanese text to segment 69 | 70 | Returns: 71 | List of Japanese words 72 | 73 | Example: 74 | >>> segment_japanese("私は学生です") 75 | ['私', 'は', '学生', 'です'] 76 | """ 77 | tagger = get_japanese_tagger() 78 | # Parse the text and return the list of morphemes 79 | return [str(word) for word in tagger(text)] 80 | -------------------------------------------------------------------------------- /tests/test_japanese.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.japanese import has_japanese, segment_japanese 4 | 5 | 6 | def test_has_japanese_hiragana(): 7 | """Test has_japanese with Hiragana characters.""" 8 | assert has_japanese("こんにちは") 9 | assert has_japanese("ひらがな") 10 | assert has_japanese("あいうえお") 11 | 12 | 13 | def test_has_japanese_katakana(): 14 | """Test has_japanese with Katakana characters.""" 15 | assert has_japanese("カタカナ") 16 | assert has_japanese("コンピューター") 17 | assert has_japanese("アメリカ") 18 | 19 | 20 | def test_has_japanese_kanji(): 21 | """Test has_japanese with Kanji characters.""" 22 | assert has_japanese("漢字") 23 | assert has_japanese("日本語") 24 | assert has_japanese("学生") 25 | 26 | 27 | def test_has_japanese_mixed_content(): 28 | """Test has_japanese with mixed Japanese and other characters.""" 29 | assert has_japanese("hello こんにちは") 30 | assert has_japanese("私は学生です。") 31 | assert has_japanese("123 カタカナ abc") 32 | assert has_japanese("English 日本語 混合") 33 | 34 | 35 | def test_has_japanese_no_japanese(): 36 | """Test has_japanese with non-Japanese text.""" 37 | assert not has_japanese("hello") 38 | assert not has_japanese("English text") 39 | assert not has_japanese("123456") 40 | assert not has_japanese("!@#$%^&*()") 41 | assert not has_japanese("עברית") 42 | assert not has_japanese("العربية") 43 | 44 | 45 | def test_has_japanese_empty_string(): 46 | """Test has_japanese with empty string.""" 47 | assert not has_japanese("") 48 | 49 | 50 | def test_has_japanese_whitespace_only(): 51 | """Test has_japanese with whitespace only.""" 52 | assert not has_japanese(" ") 53 | assert not has_japanese("\n\t") 54 | assert not has_japanese("   ") 55 | 56 | 57 | def test_segment_japanese_simple(): 58 | """Test segment_japanese with simple Japanese text.""" 59 | result = segment_japanese("こんにちは") 60 | assert result == ['こんにちは'] 61 | 62 | 63 | def test_segment_japanese_mixed(): 64 | """Test segment_japanese with mixed Japanese and English.""" 65 | result = segment_japanese("hello 私は学生です。 world") 66 | assert result == ['hello', '私', 'は', '学生', 'です', '。', 'world'] 67 | 68 | 69 | def test_segment_japanese_empty(): 70 | """Test segment_japanese with empty string.""" 71 | result = segment_japanese("") 72 | assert result == [] 73 | 74 | 75 | def test_segment_japanese_complex(): 76 | """Test segment_japanese with complex Japanese sentence.""" 77 | result = segment_japanese("私は東京大学の学生です。") 78 | assert result == ['私', 'は', '東京', '大学', 'の', '学生', 'です', '。'] 79 | 80 | 81 | def test_segment_japanese_katakana(): 82 | """Test segment_japanese with Katakana text.""" 83 | result = segment_japanese("コンピューター") 84 | assert result == ['コンピューター'] 85 | 86 | 87 | if __name__ == "__main__": 88 | pytest.main([__file__, "-v"]) 89 | -------------------------------------------------------------------------------- /words_segmentation/pretokenizer.py: -------------------------------------------------------------------------------- 1 | import math 2 | import re 3 | from collections.abc import Iterable 4 | from
itertools import chain 5 | 6 | import regex 7 | import torch 8 | from transformers import PreTrainedTokenizer, StoppingCriteria, add_start_docstrings 9 | from transformers.generation.stopping_criteria import STOPPING_CRITERIA_INPUTS_DOCSTRING 10 | from utf8_tokenizer.control import CONTROl_TOKENS_PATTERN 11 | 12 | from words_segmentation.languages import segment_text 13 | 14 | _COMPILED_GRAPHEME_PATTERN = regex.compile(r"\X") 15 | _COMPLETE_WORD_PATTERNS = [ 16 | rf"[{CONTROl_TOKENS_PATTERN}]", # Control tokens are always complete 17 | rf"[^\s{CONTROl_TOKENS_PATTERN}]+\s", # Words with trailing space are complete 18 | ] 19 | 20 | 21 | def words_to_text(words: Iterable[str]) -> str: 22 | return ''.join(words) 23 | 24 | 25 | def text_to_words(text: str, max_bytes: int = math.inf) -> list[str]: 26 | words = chain.from_iterable(segment_text(text)) 27 | 28 | if max_bytes == math.inf: 29 | return list(words) 30 | 31 | chunks = (utf8_chunks_grapheme_safe(word, max_bytes=max_bytes) for word in words) 32 | return list(chain.from_iterable(chunks)) 33 | 34 | 35 | def utf8_chunks_grapheme_safe(text: str, max_bytes: int = 16) -> Iterable[str]: 36 | """ 37 | Split a string into chunks of at most max_bytes bytes, without splitting grapheme clusters. 38 | Except, if there is a single grapheme cluster longer than max_bytes, it will be in its own chunk. 👩‍👩‍👧‍👦 39 | """ 40 | text_bytes = text.encode("utf-8") 41 | if len(text_bytes) <= max_bytes: 42 | yield text 43 | return 44 | 45 | clusters = _COMPILED_GRAPHEME_PATTERN.findall(text) 46 | if len(clusters) == 1: 47 | yield text 48 | return 49 | 50 | curr = [] 51 | curr_bytes = 0 52 | for cluster in clusters: 53 | cluster_bytes = len(cluster.encode("utf-8")) 54 | if curr_bytes + cluster_bytes > max_bytes: 55 | if curr: 56 | yield "".join(curr) 57 | curr = [cluster] 58 | curr_bytes = cluster_bytes 59 | else: 60 | curr.append(cluster) 61 | curr_bytes += cluster_bytes 62 | if curr: 63 | yield "".join(curr) 64 | 65 | 66 | def is_word_complete(text: str) -> bool: 67 | for pattern in _COMPLETE_WORD_PATTERNS: 68 | if re.fullmatch(pattern, text): 69 | return True 70 | 71 | # TODO: not clear how to know if a word full of whitespaces is complete 72 | # maybe if _TOKEN_PATTERN is not a full match, but then need to "delete" the last token. 
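    # Illustrative expectations (see tests/test_pretokenizer.py): "hello " and "\x01" are complete,
    # while "hello", " ", and "" are not.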
73 | return False 74 | 75 | 76 | class WordStoppingCriteria(StoppingCriteria): 77 | def __init__(self, tokenizer: PreTrainedTokenizer): 78 | self.tokenizer = tokenizer 79 | 80 | @add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING) 81 | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor: 82 | texts = self.tokenizer.batch_decode(input_ids.tolist()) 83 | is_done = [is_word_complete(text) for text in texts] 84 | return torch.tensor(is_done, dtype=torch.bool, device=input_ids.device) 85 | 86 | -------------------------------------------------------------------------------- /examples/tokens_parity.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import pandas as pd 3 | from transformers import GPT2TokenizerFast 4 | 5 | from words_segmentation.tokenizer import WordsSegmentationTokenizer 6 | 7 | tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4') 8 | words_tokenizer = WordsSegmentationTokenizer() 9 | 10 | 11 | # Warm chinese and japanese models, because it prints stuff 12 | words_tokenizer.tokenize("体") 13 | 14 | texts = { 15 | "English": "Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate.", 16 | "Italian": "I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona.", 17 | "German": "Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten.", 18 | "Chinese": "团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。", 19 | "Japanese": "ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。", 20 | "Finnish": "Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö.", 21 | "Russian": "Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека.", 22 | "Arabic": "تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد.", 23 | "Hebrew": "סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם.", 24 | "Greek": "Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο.", 25 | "Tamil": "பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும்.", 26 | "Kannada": "ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ 
ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ.", 27 | "Shan": "ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။", 28 | } 29 | 30 | data = { 31 | "Language": [], 32 | "Bytes (UTF-8)": [], 33 | "Tokens (GPT-4)": [], 34 | "Words (Whitespace+)": [], 35 | } 36 | 37 | print("| Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) |") 38 | print("|----------|-------------------------|-------------|----------------|--------------------|") 39 | for lang, text in texts.items(): 40 | data["Language"].append(lang) 41 | num_bytes = len(text.encode('utf-8')) 42 | data["Bytes (UTF-8)"].append(num_bytes) 43 | num_tokens = len(tokenizer.tokenize(text)) 44 | data["Tokens (GPT-4)"].append(num_tokens) 45 | num_words = len(words_tokenizer.tokenize(text)) 46 | data["Words (Whitespace+)"].append(num_words) 47 | print(f"| {lang} | {text} | {num_bytes} | {num_tokens} | {num_words} |") 48 | 49 | df = pd.DataFrame(data) 50 | 51 | # Plot 52 | plt.figure(figsize=(10, 6)) 53 | df.set_index("Language").plot(kind="bar", figsize=(12, 7)) 54 | 55 | plt.title("Text Size Across Languages (Bytes, Tokens, Words)") 56 | plt.ylabel("Count") 57 | plt.xlabel("Language") 58 | plt.xticks(rotation=45) 59 | plt.legend(title="Measure") 60 | plt.tight_layout() 61 | plt.savefig("../assets/tokenization-parity-words.png") 62 | -------------------------------------------------------------------------------- /words_segmentation/languages.py: -------------------------------------------------------------------------------- 1 | """ 2 | Script-aware segmentation with per-language callbacks. 3 | - Uses Unicode Script_Extensions (scx) to segment Han/Hiragana/Katakana, etc. 4 | - Falls back to a Default branch that avoids those scripts. 5 | - Each non-default segment is passed to its language-specific callback. 6 | """ 7 | 8 | from collections.abc import Callable, Iterable 9 | from functools import cache 10 | from itertools import chain 11 | from typing import Any, TypedDict 12 | 13 | import regex 14 | from utf8_tokenizer.control import CONTROl_TOKENS_PATTERN 15 | 16 | from words_segmentation.chinese import segment_chinese 17 | from words_segmentation.japanese import segment_japanese 18 | from words_segmentation.signwriting import segment_signwriting 19 | 20 | # Three classes of tokens inside the Default branch: 21 | # 1) Control tokens (always atomic) 22 | # 2) "Words" = runs of non-space, non-control + optional trailing single space 23 | # 3) Whitespace runs 24 | _TOKEN_PATTERN = ( 25 | rf"[{CONTROl_TOKENS_PATTERN}]" # 1) Control tokens 26 | rf"|[^\s{CONTROl_TOKENS_PATTERN}]+\s?" # 2) Word (+ optional trailing space) 27 | r"|\s+" # 3) Whitespace runs 28 | ) 29 | _COMPILED_TOKEN_PATTERN = regex.compile(_TOKEN_PATTERN) 30 | 31 | 32 | def text_to_unbound_words(text: str) -> list[str]: 33 | """Tokenize a non-scripted span using the token rules above.""" 34 | return _COMPILED_TOKEN_PATTERN.findall(text) 35 | 36 | 37 | class LanguageSpec(TypedDict): 38 | scripts: tuple[str, ...] 
# e.g., ("Han",) or ("Han", "Hiragana", "Katakana") 39 | callback: Callable[[str], Any] # called with the matched span 40 | 41 | 42 | LANGUAGE_SPECS: dict[str, LanguageSpec] = { 43 | "SignWriting": { 44 | "scripts": ("SignWriting",), 45 | "callback": segment_signwriting, 46 | }, 47 | "Chinese": { 48 | "scripts": ("Han",), 49 | "callback": segment_chinese, 50 | }, 51 | "Japanese": { 52 | "scripts": ("Han", "Hiragana", "Katakana"), 53 | "callback": segment_japanese, 54 | }, 55 | "Default": { 56 | "scripts": tuple(), 57 | "callback": text_to_unbound_words, 58 | }, 59 | } 60 | 61 | 62 | def _union_scx(scripts: tuple[str, ...]) -> str: 63 | """Create a non-capturing alternation for a set of Script_Extensions.""" 64 | parts = [fr"\p{{scx={s}}}" for s in scripts] 65 | return "(?:" + "|".join(parts) + ")" 66 | 67 | 68 | @cache 69 | def build_regex_from_languages() -> regex.Pattern: 70 | """ 71 | Compile the master regex with named groups for each language plus Default. 72 | 73 | Precedence: dict order in LANGUAGE_SPECS — first match wins if script sets overlap. 74 | Default branch: consumes runs that do NOT begin with any of the listed scripts. 75 | """ 76 | # Explicit language branches (skip Default — it has no 'scripts') 77 | branches: list[str] = [] 78 | for name, spec in LANGUAGE_SPECS.items(): 79 | if spec["scripts"]: 80 | branches.append(fr"(?P<{name}>{_union_scx(spec['scripts'])}+)") 81 | 82 | # Default: refuse any char that begins one of the explicit-script branches 83 | all_scripts = tuple(sorted({s for spec in LANGUAGE_SPECS.values() for s in spec.get("scripts", ())})) 84 | forbidden = _union_scx(all_scripts) if all_scripts else r"$a" # impossible atom if no scripts exist 85 | default_branch = fr"(?P(?:(?!{forbidden})\X)+)" 86 | 87 | # Combined pattern (verbose mode for readability) 88 | pattern = r"(?x)(?:" + "|".join(branches + [default_branch]) + r")" 89 | return regex.compile(pattern) 90 | 91 | 92 | def segment_text(text: str) -> Iterable[Any]: 93 | """ 94 | Iterate over callback results for each matched span. 95 | - Non-Default groups call their language callback. 96 | - Default group calls its callback if present in LANGUAGE_SPECS. 
97 | """ 98 | pat = build_regex_from_languages() 99 | for m in pat.finditer(text): 100 | group_name = m.lastgroup 101 | spec = LANGUAGE_SPECS.get(group_name) 102 | yield spec["callback"](m.group(0)) 103 | 104 | 105 | if __name__ == "__main__": 106 | sample = "東京abcかなカナ漢字123 אני אחד私は学生です" 107 | # Stream results as produced by callbacks 108 | print(list(chain.from_iterable(segment_text(sample)))) 109 | -------------------------------------------------------------------------------- /tests/test_word_stopping_criteria.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | import torch 3 | 4 | from words_segmentation.pretokenizer import WordStoppingCriteria 5 | 6 | 7 | class MockTokenizer: 8 | """Mock tokenizer for testing WordStoppingCriteria.""" 9 | 10 | def decode(self, token_ids): 11 | """Simple mock decode that converts token IDs to characters.""" 12 | if isinstance(token_ids, torch.Tensor): 13 | token_ids = token_ids.tolist() 14 | # Simple mapping: use chr() for decoding, allow control characters 15 | return ''.join(chr(tid % 128) for tid in token_ids) 16 | 17 | def batch_decode(self, token_ids_list): 18 | """Batch decode that converts a list of token IDs to a list of strings.""" 19 | return [self.decode(token_ids) for token_ids in token_ids_list] 20 | 21 | 22 | def test_word_stopping_criteria_basic(): 23 | """Test WordStoppingCriteria with basic functionality on CPU.""" 24 | tokenizer = MockTokenizer() 25 | criteria = WordStoppingCriteria(tokenizer) 26 | 27 | # Test with complete word (has trailing space) - ASCII 'h','e','l','l','o',' ' = 104,101,108,108,111,32 28 | input_ids = torch.tensor([[104, 101, 108, 108, 111, 32]]) # "hello " 29 | scores = torch.zeros((1, 100)) 30 | result = criteria(input_ids, scores) 31 | 32 | assert isinstance(result, torch.Tensor) 33 | assert result.dtype == torch.bool 34 | assert result.device.type == "cpu" 35 | assert result.shape == (1,) 36 | assert result[0].item() is True # "hello " is complete 37 | 38 | 39 | def test_word_stopping_criteria_incomplete(): 40 | """Test WordStoppingCriteria with incomplete word.""" 41 | tokenizer = MockTokenizer() 42 | criteria = WordStoppingCriteria(tokenizer) 43 | 44 | # Test with incomplete word (no trailing space) - ASCII 'h','e','l','l','o' = 104,101,108,108,111 45 | input_ids = torch.tensor([[104, 101, 108, 108, 111]]) # "hello" 46 | scores = torch.zeros((1, 100)) 47 | result = criteria(input_ids, scores) 48 | 49 | assert isinstance(result, torch.Tensor) 50 | assert result.dtype == torch.bool 51 | assert result.device.type == "cpu" 52 | assert result.shape == (1,) 53 | assert result[0].item() is False # "hello" is incomplete 54 | 55 | 56 | def test_word_stopping_criteria_batch(): 57 | """Test WordStoppingCriteria with batch of inputs.""" 58 | tokenizer = MockTokenizer() 59 | criteria = WordStoppingCriteria(tokenizer) 60 | 61 | # Batch with mixed complete and incomplete words (same length with padding) 62 | input_ids = torch.tensor([ 63 | [104, 101, 108, 108, 111, 32], # "hello " - complete 64 | [104, 101, 108, 108, 111, 0], # "hello" (with padding) - incomplete 65 | [119, 111, 114, 108, 100, 32], # "world " - complete 66 | ]) 67 | scores = torch.zeros((3, 100)) 68 | result = criteria(input_ids, scores) 69 | 70 | assert isinstance(result, torch.Tensor) 71 | assert result.dtype == torch.bool 72 | assert result.device.type == "cpu" 73 | assert result.shape == (3,) 74 | assert result[0].item() is True # "hello " is complete 75 | assert result[1].item() is False # 
"hello" is incomplete 76 | assert result[2].item() is True # "world " is complete 77 | 78 | 79 | @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") 80 | def test_word_stopping_criteria_cuda_device(): 81 | """Test WordStoppingCriteria respects CUDA device.""" 82 | tokenizer = MockTokenizer() 83 | criteria = WordStoppingCriteria(tokenizer) 84 | 85 | # Test with input on CUDA 86 | input_ids = torch.tensor([[104, 101, 108, 108, 111, 32]], device="cuda") # "hello " 87 | scores = torch.zeros((1, 100), device="cuda") 88 | result = criteria(input_ids, scores) 89 | 90 | assert isinstance(result, torch.Tensor) 91 | assert result.dtype == torch.bool 92 | assert result.device.type == "cuda" # Should match input device 93 | assert result.shape == (1,) 94 | assert result[0].item() is True 95 | 96 | 97 | def test_word_stopping_criteria_control_token(): 98 | """Test WordStoppingCriteria with control tokens.""" 99 | tokenizer = MockTokenizer() 100 | criteria = WordStoppingCriteria(tokenizer) 101 | 102 | # Control token (e.g., \x01) 103 | input_ids = torch.tensor([[1]]) # Control token 104 | scores = torch.zeros((1, 100)) 105 | result = criteria(input_ids, scores) 106 | 107 | assert isinstance(result, torch.Tensor) 108 | assert result.dtype == torch.bool 109 | assert result.shape == (1,) 110 | assert result[0].item() is True # Control tokens are always complete 111 | 112 | 113 | def test_word_stopping_criteria_utf8_tokenizer(): 114 | """Test WordStoppingCriteria with UTF8Tokenizer to verify batch_decode compatibility.""" 115 | from utf8_tokenizer.tokenizer import UTF8Tokenizer 116 | 117 | tokenizer = UTF8Tokenizer() 118 | criteria = WordStoppingCriteria(tokenizer) 119 | 120 | # Create test cases with UTF8Tokenizer's encoded tokens 121 | # UTF8Tokenizer adds SOT (0x02) and EOT (0x03) control tokens 122 | # We test that batch_decode produces the same results as individual decode calls 123 | 124 | # Encode some test strings 125 | test_texts = ["hello world ", "test"] 126 | test_ids = [tokenizer.encode(text) for text in test_texts] 127 | 128 | # Pad to same length 129 | max_len = max(len(ids) for ids in test_ids) 130 | padded_ids = [ids + [0] * (max_len - len(ids)) for ids in test_ids] 131 | 132 | input_ids = torch.tensor(padded_ids) 133 | scores = torch.zeros((len(padded_ids), 100)) 134 | 135 | # Get results using batch_decode (new implementation) 136 | result = criteria(input_ids, scores) 137 | 138 | # Verify by checking batch_decode produces same results as individual decode 139 | batch_decoded = tokenizer.batch_decode(input_ids.tolist()) 140 | individual_decoded = [tokenizer.decode(ids) for ids in input_ids] 141 | 142 | assert batch_decoded == individual_decoded, "batch_decode should produce same results as individual decode calls" 143 | 144 | # Verify result properties 145 | assert isinstance(result, torch.Tensor) 146 | assert result.dtype == torch.bool 147 | assert result.shape == (2,) 148 | 149 | 150 | if __name__ == "__main__": 151 | pytest.main([__file__, "-v"]) 152 | -------------------------------------------------------------------------------- /tests/test_pretokenizer.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.pretokenizer import ( 4 | is_word_complete, 5 | text_to_words, 6 | utf8_chunks_grapheme_safe, 7 | ) 8 | 9 | 10 | def test_utf8_chunks_english(): 11 | """Test utf8_chunks_grapheme_safe with English text.""" 12 | text = "hello world" 13 | chunks = 
list(utf8_chunks_grapheme_safe(text, max_bytes=5)) 14 | assert chunks == ["hello", " worl", "d"] 15 | assert len(chunks) == 3 16 | assert all(len(chunk.encode('utf-8')) <= 5 for chunk in chunks) 17 | 18 | 19 | def test_utf8_chunks_hebrew(): 20 | """Test utf8_chunks_grapheme_safe with Hebrew text.""" 21 | text = "עמית מוריוסף" 22 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=8)) 23 | assert chunks == ['עמית', ' מור', 'יוסף'] 24 | assert "".join(chunks) == text 25 | assert len(chunks) == 3 26 | assert all(len(chunk.encode('utf-8')) <= 8 for chunk in chunks) 27 | 28 | 29 | def test_utf8_chunks_emoji(): 30 | """Test utf8_chunks_grapheme_safe with basic emoji.""" 31 | text = "hello 😀 world" 32 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=8)) 33 | assert chunks == ['hello ', '😀 wor', 'ld'] 34 | assert "".join(chunks) == text 35 | assert len(chunks) == 3 36 | assert all(len(chunk.encode('utf-8')) <= 8 for chunk in chunks) 37 | 38 | 39 | def test_utf8_chunks_long_emoji_cluster(): 40 | """Test utf8_chunks_grapheme_safe with complex emoji cluster.""" 41 | text = "👩‍👩‍👧‍👦" 42 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=5)) 43 | assert chunks == [text] 44 | assert len(chunks) == 1 45 | assert len(text.encode('utf-8')) > 5 46 | 47 | 48 | def test_utf8_chunks_single_grapheme(): 49 | """Test special case of single grapheme cluster.""" 50 | text = "a" 51 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=16)) 52 | assert chunks == ["a"] 53 | 54 | 55 | def test_utf8_chunks_mixed_content(): 56 | """Test with mixed English, Hebrew, and emoji.""" 57 | text = "hello עמית 👋" 58 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=10)) 59 | assert "".join(chunks) == text 60 | 61 | 62 | def test_text_to_words_json(): 63 | """Test text_to_words with JSON string.""" 64 | json_text = '{"name": "test", "value": 123}' 65 | words = text_to_words(json_text, max_bytes=10) 66 | assert words == ['{"name": ', '"test", ', '"value": ', "123}"] 67 | assert "".join(words) == json_text 68 | assert len(words) == 4 69 | 70 | 71 | def test_text_to_words_long_string(): 72 | """Test text_to_words with long string.""" 73 | long_text = ("This is a very long string that should be split into multiple chunks " 74 | "when processed by the text_to_words function with appropriate byte limits.") 75 | words = text_to_words(long_text, max_bytes=8) 76 | assert "".join(words) == long_text 77 | 78 | 79 | def test_text_to_words_short_string(): 80 | """Test text_to_words with short string.""" 81 | short_text = "hi" 82 | words = text_to_words(short_text, max_bytes=16) 83 | assert words == ["hi"] 84 | 85 | 86 | def test_text_to_words_multiline_code(): 87 | """Test similar to the processor test example.""" 88 | text = """ 89 | def foo(): 90 | return "bar" 91 | """.strip() 92 | words = text_to_words(text, max_bytes=10) 93 | assert words == ['def ', 'foo():\n', ' ', 'return ', '"bar"'] 94 | assert "".join(words) == text 95 | assert len(words) == 5 96 | assert 'def ' in words 97 | assert ' ' in words 98 | 99 | 100 | def test_text_to_words_whitespace(): 101 | """Test proper whitespace handling.""" 102 | text = "hello world" 103 | words = text_to_words(text, max_bytes=8) 104 | assert words == ["hello ", " ", "world"] 105 | assert "".join(words) == text 106 | assert len(words) == 3 107 | assert " " in words 108 | 109 | 110 | def test_text_to_words_mixed_unicode(): 111 | """Test with mixed content including unicode.""" 112 | text = "hello עמית! 
🌟 {'key': 'value'}" 113 | words = text_to_words(text, max_bytes=10) 114 | assert "".join(words) == text 115 | 116 | 117 | def test_text_to_words_empty(): 118 | """Test with empty string.""" 119 | words = text_to_words("", max_bytes=16) 120 | assert words == [] 121 | 122 | 123 | def test_text_to_words_only_whitespace(): 124 | """Test with only whitespace.""" 125 | text = " \n\t " 126 | words = text_to_words(text, max_bytes=16) 127 | assert text == words[0] 128 | assert "".join(words) == text 129 | assert len(words) == 1 130 | assert all(c.isspace() for c in words[0]) 131 | 132 | 133 | def test_text_to_words_json_unicode(): 134 | """Test JSON containing unicode characters.""" 135 | json_text = '{"message":"שלום world 🌍","count": 42}' 136 | words = text_to_words(json_text, max_bytes=6) 137 | assert "".join(words) == json_text 138 | assert len(words) > 5 139 | assert any("🌍" in word for word in words) 140 | 141 | 142 | def test_is_word_complete_control_tokens(): 143 | """Test is_word_complete with control tokens.""" 144 | assert is_word_complete("\x01") 145 | assert is_word_complete("\x02") 146 | assert is_word_complete("\x03") 147 | assert is_word_complete("\x08") 148 | assert is_word_complete("\x7F") 149 | 150 | 151 | def test_is_word_complete_words_with_space(): 152 | """Test is_word_complete with words that have trailing space.""" 153 | assert is_word_complete("hello ") 154 | assert is_word_complete("world ") 155 | assert is_word_complete("test ") 156 | assert is_word_complete("עמית ") 157 | assert is_word_complete("🌟 ") 158 | 159 | 160 | def test_is_word_complete_incomplete_words(): 161 | """Test is_word_complete with incomplete words (no trailing space).""" 162 | assert not is_word_complete("hello") 163 | assert not is_word_complete("world") 164 | assert not is_word_complete("test") 165 | assert not is_word_complete("עמית") 166 | assert not is_word_complete("🌟") 167 | 168 | 169 | def test_is_word_complete_whitespace_only(): 170 | """Test is_word_complete with whitespace-only strings.""" 171 | assert not is_word_complete(" ") 172 | assert not is_word_complete(" ") 173 | assert not is_word_complete("\n") 174 | assert not is_word_complete("\t") 175 | assert not is_word_complete(" \n\t ") 176 | 177 | 178 | def test_is_word_complete_empty_string(): 179 | """Test is_word_complete with empty string.""" 180 | assert not is_word_complete("") 181 | 182 | def test_is_word_complete_unicode_with_space(): 183 | """Test is_word_complete with unicode characters and trailing space.""" 184 | assert is_word_complete("שלום ") 185 | assert is_word_complete("مرحبا ") 186 | assert is_word_complete("こんにちは ") 187 | assert not is_word_complete("שלום") 188 | assert not is_word_complete("مرحبا") 189 | assert not is_word_complete("こんにちは") 190 | 191 | 192 | if __name__ == "__main__": 193 | pytest.main([__file__, "-v"]) 194 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Words Segmentation 2 | 3 | This repository contains a pretokenizer that segments text into "words" for further processing. 4 | 5 | We define three classes of tokens: 6 | 7 | 1. `C0` Control tokens (always atomic) 8 | 2. "Words" = runs of non-space, non-control + optional single trailing whitespace 9 | 3. Whitespace runs 10 | 11 | For any script where the default is not suitable, you can implement a custom pretokenizer. 
12 | Modify `LANGUAGE_SPECS` in [languages.py](./words_segmentation/languages.py) to add a custom function for specific 13 | scripts. 14 | 15 | For example: 16 | 17 | ```python 18 | LANGUAGE_SPECS: dict[str, LanguageSpec] = { 19 | "Chinese": { 20 | "scripts": ("Han",), 21 | "callback": segment_chinese, 22 | }, 23 | "Japanese": { 24 | "scripts": ("Han", "Hiragana", "Katakana"), 25 | "callback": segment_japanese, 26 | }, 27 | } 28 | ``` 29 | 30 | Then, with a `max_bytes` parameter, we split long words into smaller chunks while preserving 31 | Unicode grapheme boundaries. 32 | 33 | ## Usage 34 | 35 | Install: 36 | 37 | ```bash 38 | pip install words-segmentation 39 | ``` 40 | 41 | Pretokenize text using a Hugging Face Tokenizer implementation: 42 | 43 | ```python 44 | from words_segmentation.tokenizer import WordsSegmentationTokenizer 45 | 46 | pretokenizer = WordsSegmentationTokenizer(max_bytes=16) 47 | tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦") 48 | # ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩‍👩‍👧‍👦'] 49 | ``` 50 | 51 | ## [Writing systems without word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) 52 | 53 | Perhaps there will come a day when we could have a universal pretokenizer that works for all languages. 54 | Until then, we need to handle some writing systems with custom logic. 55 | We implement custom fallback pretokenizers for the following writing systems: 56 | 57 | - [x] [Chinese characters](https://en.wikipedia.org/wiki/Chinese_characters) - 58 | using [jieba](https://github.com/fxsjy/jieba) 59 | - [x] [Japanese writing system](https://en.wikipedia.org/wiki/Japanese_writing_system) - 60 | using [fugashi](https://github.com/polm/fugashi) 61 | - [ ] [Balinese script](https://en.wikipedia.org/wiki/Balinese_script) 62 | - [ ] [Burmese alphabet](https://en.wikipedia.org/wiki/Burmese_alphabet) 63 | - [ ] [Chữ Hán](https://en.wikipedia.org/wiki/Ch%E1%BB%AF_H%C3%A1n) 64 | - [ ] [Chữ Nôm](https://en.wikipedia.org/wiki/Ch%E1%BB%AF_N%C3%B4m) 65 | - [ ] [Hanja](https://en.wikipedia.org/wiki/Hanja) 66 | - [ ] [Javanese script](https://en.wikipedia.org/wiki/Javanese_script) 67 | - [ ] [Khmer script](https://en.wikipedia.org/wiki/Khmer_script) 68 | - [ ] [Lao script](https://en.wikipedia.org/wiki/Lao_script) 69 | - [ ] [ʼPhags-pa script](https://en.wikipedia.org/wiki/%CA%BCPhags-pa_script) 70 | - [ ] [Rasm](https://en.wikipedia.org/wiki/Rasm) 71 | - [ ] [Sawndip](https://en.wikipedia.org/wiki/Sawndip) 72 | - [ ] [Scriptio continua](https://en.wikipedia.org/wiki/Scriptio_continua) 73 | - [ ] [S'gaw Karen alphabet](https://en.wikipedia.org/wiki/S%27gaw_Karen_alphabet) 74 | - [ ] [Tai Tham script](https://en.wikipedia.org/wiki/Tai_Tham_script) 75 | - [ ] [Thai script](https://en.wikipedia.org/wiki/Thai_script) 76 | - [ ] [Tibetan script](https://en.wikipedia.org/wiki/Tibetan_script) 77 | - [ ] [Vietnamese alphabet](https://en.wikipedia.org/wiki/Vietnamese_alphabet) 78 | - [ ] [Western Pwo alphabet](https://en.wikipedia.org/wiki/Western_Pwo_alphabet) 79 | 80 | ## Tokenization Parity 81 | 82 | [Foroutan and Meister et al. (2025)](https://www.arxiv.org/pdf/2508.04796) note that: 83 | > In multilingual models, the same meaning can take far more tokens in some languages, 84 | > penalizing users of underrepresented languages with worse performance and higher API costs.
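The measurements below can be reproduced with this package. A minimal sketch (the full script, which also renders the plot, is [`examples/tokens_parity.py`](./examples/tokens_parity.py)):

```python
from transformers import GPT2TokenizerFast

from words_segmentation.tokenizer import WordsSegmentationTokenizer

gpt4_tokenizer = GPT2TokenizerFast.from_pretrained("Xenova/gpt-4")
words_tokenizer = WordsSegmentationTokenizer()

text = "我爱北京天安门"  # any sentence from the table below works here
print(len(text.encode("utf-8")))           # Bytes (UTF-8)
print(len(gpt4_tokenizer.tokenize(text)))  # Tokens (GPT-4)
print(len(words_tokenizer.tokenize(text))) # Words (Whitespace+)
```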
85 | 86 | [![Tokenization Parity](assets/tokenization-parity.png)](https://www.linkedin.com/posts/sina-ahmadi-aba470287_dont-speak-english-you-must-pay-more-activity-7360959825893036035-vnFN) 87 | 88 | Let's consider the same example, for whitespace pre-tokenization parity: 89 | 90 | | Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) | 91 | |----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------------|---------------------| 92 | | English | Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate. | 173 | 40 | 34 | 93 | | Italian | I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona. | 230 | 58 | 43 | 94 | | German | Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten. | 256 | 64 | 40 | 95 | | Chinese | 团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。 | 177 | 64 | 34 | 96 | | Japanese | ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。 | 227 | 74 | 48 | 97 | | Finnish | Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö. | 212 | 79 | 30 | 98 | | Russian | Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека. | 409 | 100 | 32 | 99 | | Arabic | تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد. | 341 | 140 | 33 | 100 | | Hebrew | סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם. | 281 | 151 | 31 | 101 | | Greek | Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο. | 394 | 193 | 36 | 102 | | Tamil | பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும். | 587 | 293 | 26 | 103 | | Kannada | ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ. 
| 565 | 361 | 26 | 104 | | Shan | ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။ | 669 | 531 | 23 | 105 | 106 | #### Bytes Efficiency 107 | 108 | English really is the most efficient language in terms of bytes count, which is not suprising given its Latin alphabet, 109 | without diacritics or ligatures (with 1 byte per character). 110 | Other languages that use the Latin alphabet are also relatively efficient (e.g. Italian, German, Finnish), but their 111 | use of diacritics and ligatures increases the byte count. 112 | 113 | Languages that use non-Latin scripts (e.g. Arabic, Hebrew, Shan) have a much higher byte count, due to the need for 114 | multiple bytes per character in UTF-8 encoding. Hebrew and Arabic use two bytes per character, 115 | while Shan uses three bytes per character, not counting ligatures. 116 | 117 | #### Tokenization Efficiency (GPT-4) 118 | 119 | English is also the most efficient language in terms of token count, which is not suprising given that the tokenizer 120 | was trained primarily on English text. 121 | Other languages that use the Latin alphabet are also relatively efficient, but the moment we move to non-Latin scripts, 122 | the token count increases significantly (up to 13x for Shan). 123 | 124 | #### Words Efficiency 125 | 126 | Assuming whitespace tokenization as a proxy for words, we see that English is not the most efficient language. 127 | This makes sense, from a language efficiency perspective, that there is no computational bias towards English. 128 | Languages distribute between 23 and 43 words for the same sentence, with English right in the middle with 34. 129 | 130 | ![Tokenization Parity - Words](assets/tokenization-parity-words.png) 131 | 132 | ## Cite 133 | 134 | If you use this code in your research, please consider citing the work: 135 | 136 | ```bibtex 137 | @misc{moryossef2025words, 138 | title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World}, 139 | author={Moryossef, Amit}, 140 | howpublished={\url{https://github.com/sign/words-segmentation}}, 141 | year={2025} 142 | } 143 | ``` --------------------------------------------------------------------------------