├── tests ├── __init__.py ├── test_signwriting.py ├── test_chinese.py ├── test_japanese.py ├── test_word_stopping_criteria.py └── test_pretokenizer.py ├── examples ├── __init__.py └── tokens_parity.py ├── words_segmentation ├── __init__.py ├── signwriting.py ├── tokenizer.py ├── chinese.py ├── japanese.py ├── pretokenizer.py └── languages.py ├── .gitignore ├── assets ├── tokenization-parity.png └── tokenization-parity-words.png ├── .github └── workflows │ ├── lint.yaml │ ├── test.yaml │ └── release.yaml ├── LICENSE ├── pyproject.toml └── README.md /tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /examples/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /words_segmentation/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | .claude/ 3 | *.egg-info 4 | build/ 5 | dist/ 6 | .env 7 | __pycache__/ 8 | *.pyc 9 | *.pyo -------------------------------------------------------------------------------- /assets/tokenization-parity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sign/words-segmentation/main/assets/tokenization-parity.png -------------------------------------------------------------------------------- /assets/tokenization-parity-words.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sign/words-segmentation/main/assets/tokenization-parity-words.png -------------------------------------------------------------------------------- /words_segmentation/signwriting.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | from signwriting.formats.swu import re_swu 4 | 5 | 6 | def segment_signwriting(text: str) -> list[str]: 7 | return re.findall(re_swu['sign'], text) 8 | -------------------------------------------------------------------------------- /.github/workflows/lint.yaml: -------------------------------------------------------------------------------- 1 | name: Lint 2 | 3 | 4 | on: 5 | push: 6 | branches: [ main ] 7 | pull_request: 8 | branches: [ main ] 9 | 10 | 11 | jobs: 12 | test: 13 | name: Lint 14 | runs-on: ubuntu-latest 15 | 16 | steps: 17 | - uses: actions/checkout@v5 18 | 19 | - name: Setup uv 20 | uses: astral-sh/setup-uv@v6.5.0 21 | with: 22 | python-version: "3.12" 23 | enable-cache: true 24 | activate-environment: true 25 | 26 | - name: Install dependencies 27 | run: uv pip install ".[dev]" 28 | 29 | - name: Lint code 30 | run: uv run ruff check . 
-------------------------------------------------------------------------------- /.github/workflows/test.yaml: -------------------------------------------------------------------------------- 1 | name: Test 2 | 3 | 4 | on: 5 | push: 6 | branches: [ main ] 7 | pull_request: 8 | branches: [ main ] 9 | 10 | 11 | jobs: 12 | test: 13 | name: Test 14 | runs-on: ubuntu-latest 15 | 16 | steps: 17 | - uses: actions/checkout@v5 18 | 19 | - name: Setup uv 20 | uses: astral-sh/setup-uv@v6.5.0 21 | with: 22 | python-version: "3.12" 23 | enable-cache: true 24 | activate-environment: true 25 | 26 | - name: Install dependencies 27 | run: uv pip install ".[dev]" 28 | 29 | - name: Test Code 30 | run: uv run pytest -n auto --dist loadscope 31 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 sign 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /.github/workflows/release.yaml: -------------------------------------------------------------------------------- 1 | name: Publish Python Package 2 | on: 3 | release: 4 | types: [ created ] 5 | 6 | jobs: 7 | pypi-publish: 8 | name: Upload release to PyPI 9 | runs-on: ubuntu-latest 10 | environment: 11 | name: pypi 12 | url: https://pypi.org/p/words-segmentation 13 | permissions: 14 | id-token: write 15 | steps: 16 | - uses: actions/checkout@v5 17 | 18 | - uses: actions/setup-python@v6 19 | with: 20 | python-version: "3.12" 21 | 22 | - name: Extract release version 23 | id: get_version 24 | run: echo "version=${GITHUB_REF#refs/tags/}" >> $GITHUB_ENV 25 | 26 | - name: Update version in pyproject.toml 27 | run: | 28 | sed -i 's/^version = .*/version = "${{ env.version }}"/' pyproject.toml 29 | 30 | - name: Install build dependencies 31 | run: pip install build 32 | 33 | - name: Build a binary wheel dist 34 | run: | 35 | rm -rf dist 36 | python -m build 37 | 38 | - name: Publish distribution 📦 to PyPI 39 | uses: pypa/gh-action-pypi-publish@release/v1 40 | -------------------------------------------------------------------------------- /tests/test_signwriting.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.signwriting import segment_signwriting 4 | 5 | 6 | def test_segment_single_sign(): 7 | sign = "𝠀񆄱񈠣񍉡𝠃𝤛𝤵񍉡𝣴𝣵񆄱𝤌𝤆񈠣𝤉𝤚" 8 | result = segment_signwriting(sign) 9 | assert result == [sign] 10 | 11 | def test_segment_single_sign_no_prefix(): 12 | sign = "𝠃𝤛𝤵񍉡𝣴𝣵񆄱𝤌𝤆񈠣𝤉𝤚" 13 | result = segment_signwriting(sign) 14 | assert result == [sign] 15 | 16 | def test_segment_with_space(): 17 | signs = [ 18 | "𝠀񀀒񀀚񋚥񋛩𝠃𝤟𝤩񋛩𝣵𝤐񀀒𝤇𝣤񋚥𝤐𝤆񀀚𝣮𝣭", 19 | "𝠀񂇢񂇈񆙡񋎥񋎵𝠃𝤛𝤬񂇈𝤀𝣺񂇢𝤄𝣻񋎥𝤄𝤗񋎵𝤃𝣟񆙡𝣱𝣸", 20 | "𝠃𝤙𝤞񀀙𝣷𝤀񅨑𝣼𝤀񆉁𝣳𝣮" 21 | ] 22 | result = segment_signwriting(" ".join(signs)) 23 | assert result == signs 24 | 25 | def test_segment_no_space(): 26 | signs = [ 27 | "𝠀񀀒񀀚񋚥񋛩𝠃𝤟𝤩񋛩𝣵𝤐񀀒𝤇𝣤񋚥𝤐𝤆񀀚𝣮𝣭", 28 | "𝠀񂇢񂇈񆙡񋎥񋎵𝠃𝤛𝤬񂇈𝤀𝣺񂇢𝤄𝣻񋎥𝤄𝤗񋎵𝤃𝣟񆙡𝣱𝣸", 29 | "𝠃𝤙𝤞񀀙𝣷𝤀񅨑𝣼𝤀񆉁𝣳𝣮" 30 | ] 31 | result = segment_signwriting("".join(signs)) 32 | assert result == signs 33 | 34 | if __name__ == "__main__": 35 | pytest.main([__file__, "-v"]) 36 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "words-segmentation" 3 | description = "Text segmentation into words for multiple languages." 
4 | version = "0.0.1" 5 | authors = [ 6 | { name = "Amit Moryossef", email = "amit@sign.mt" }, 7 | ] 8 | readme = "README.md" 9 | requires-python = ">=3.10" 10 | dependencies = [ 11 | "transformers[torch]", 12 | "utf8-tokenizer", 13 | "fugashi[unidic-lite]", # For Japanese word segmentation 14 | "jieba", # For Chinese word segmentation 15 | "signwriting", # For SignWriting segmentation 16 | ] 17 | 18 | [project.optional-dependencies] 19 | dev = [ 20 | "ruff", 21 | "pytest", 22 | "pytest-xdist", # For parallel test execution 23 | ] 24 | 25 | 26 | [tool.setuptools] 27 | packages = [ 28 | "words_segmentation", 29 | ] 30 | 31 | [tool.ruff] 32 | line-length = 120 33 | 34 | [tool.ruff.lint.per-file-ignores] 35 | "examples/tokens_parity.py" = ["E501"] 36 | 37 | [tool.ruff.lint] 38 | select = [ 39 | "E", # pycodestyle errors 40 | "W", # pycodestyle warnings 41 | "F", # pyflakes 42 | "C90", # mccabe complexity 43 | "I", # isort 44 | "N", # pep8-naming 45 | "UP", # pyupgrade 46 | "B", # flake8-bugbear 47 | "PT", # flake8-pytest-style 48 | "W605", # invalid escape sequence 49 | "BLE", # flake8-blind-except 50 | ] 51 | 52 | [tool.pytest.ini_options] 53 | addopts = "-v" 54 | testpaths = ["words_segmentation", "tests"] 55 | -------------------------------------------------------------------------------- /words_segmentation/tokenizer.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | from transformers import AutoTokenizer, PreTrainedTokenizer 4 | from transformers.tokenization_utils_base import TextInput 5 | 6 | from words_segmentation.pretokenizer import text_to_words, words_to_text 7 | 8 | 9 | class WordsSegmentationTokenizer(PreTrainedTokenizer): 10 | """ 11 | Custom Tokenizer implementation, 12 | extending PreTrainedTokenizer for basic Hugging Face ecosystem support. 
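    Example (illustrative, mirroring the README usage; output per the pretokenizer rules below):
        >>> WordsSegmentationTokenizer(max_bytes=16).tokenize("hello world!")
        ['hello ', 'world!']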
13 | """ 14 | 15 | def __init__(self, max_bytes: int = math.inf, **kwargs): 16 | super().__init__(**kwargs) 17 | self.max_bytes = max_bytes 18 | 19 | @property 20 | def vocab_size(self) -> float: 21 | return math.inf 22 | 23 | def add_tokens(self, *args, **kwargs): 24 | raise NotImplementedError("WordsSegmentationTokenizer does not support adding tokens") 25 | 26 | def get_vocab(self): 27 | return {} 28 | 29 | def _tokenize(self, text: TextInput, **kwargs): 30 | return text_to_words(text, max_bytes=self.max_bytes) 31 | 32 | def tokenize(self, text: TextInput, **kwargs): 33 | return self._tokenize(text, **kwargs) 34 | 35 | def _encode_plus(self, text: TextInput, **kwargs): 36 | raise Exception("WordsSegmentationTokenizer can not encode to ids") 37 | 38 | def _convert_token_to_id(self, token: str): 39 | raise Exception("WordsSegmentationTokenizer can not convert to ids") 40 | 41 | def _convert_id_to_token(self, index: int): 42 | raise Exception("WordsSegmentationTokenizer can not decode ids") 43 | 44 | def convert_tokens_to_string(self, tokens: list[str]): 45 | """Converts a sequence of tokens (string) in a single string.""" 46 | return words_to_text(tokens) 47 | 48 | def build_inputs_with_special_tokens(self, **unused_kwargs): 49 | raise Exception("WordsSegmentationTokenizer does not use special tokens") 50 | 51 | def save_vocabulary(self, save_directory: str, filename_prefix: str | None = None): 52 | return () 53 | 54 | def to_dict(self): 55 | return {"max_bytes": self.max_bytes} 56 | 57 | 58 | AutoTokenizer.register(WordsSegmentationTokenizer, slow_tokenizer_class=WordsSegmentationTokenizer) 59 | 60 | if __name__ == "__main__": 61 | tokenizer = WordsSegmentationTokenizer() 62 | print(tokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦", max_bytes=16)) 63 | -------------------------------------------------------------------------------- /tests/test_chinese.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.chinese import has_chinese, segment_chinese 4 | 5 | 6 | def test_has_chinese_simple(): 7 | """Test has_chinese with simple Chinese characters.""" 8 | assert has_chinese("你好") 9 | assert has_chinese("中文") 10 | assert has_chinese("我来到北京清华大学") 11 | 12 | 13 | def test_has_chinese_mixed_content(): 14 | """Test has_chinese with mixed Chinese and other characters.""" 15 | assert has_chinese("hello 你好") 16 | assert has_chinese("mixed 你好") 17 | assert has_chinese("123 中文 abc") 18 | assert has_chinese("English 中文 混合") 19 | 20 | 21 | def test_has_chinese_no_chinese(): 22 | """Test has_chinese with non-Chinese text.""" 23 | assert not has_chinese("hello") 24 | assert not has_chinese("English text") 25 | assert not has_chinese("123456") 26 | assert not has_chinese("!@#$%^&*()") 27 | assert not has_chinese("こんにちは") # Japanese 28 | assert not has_chinese("עברית") # Hebrew 29 | 30 | 31 | def test_has_chinese_empty_string(): 32 | """Test has_chinese with empty string.""" 33 | assert not has_chinese("") 34 | 35 | 36 | def test_has_chinese_whitespace_only(): 37 | """Test has_chinese with whitespace only.""" 38 | assert not has_chinese(" ") 39 | assert not has_chinese("\n\t") 40 | assert not has_chinese(" ") 41 | 42 | 43 | def test_segment_chinese_simple(): 44 | """Test segment_chinese with simple Chinese text.""" 45 | result = segment_chinese("你好") 46 | assert result == ["你好"] 47 | 48 | 49 | def test_segment_chinese_mixed(): 50 | """Test segment_chinese with mixed Chinese and English.""" 51 | result = segment_chinese("hello 
我来到北京清华大学 world") 52 | assert result == ['hello', ' ', '我', '来到', '北京', '清华大学', ' ', 'world'] 53 | 54 | 55 | def test_segment_chinese_empty(): 56 | """Test segment_chinese with empty string.""" 57 | result = segment_chinese("") 58 | assert result == [] 59 | 60 | 61 | def test_segment_chinese_complex(): 62 | """Test segment_chinese with complex Chinese sentence.""" 63 | result = segment_chinese("小明硕士毕业于中国科学院计算所") 64 | assert result == ['小明', '硕士', '毕业', '于', '中国科学院', '计算所'] 65 | 66 | 67 | def test_segment_chinese_compound_words(): 68 | """Test segment_chinese with compound words.""" 69 | result = segment_chinese("中文分词测试") 70 | assert result == ['中文', '分词', '测试'] 71 | 72 | 73 | if __name__ == "__main__": 74 | pytest.main([__file__, "-v"]) 75 | -------------------------------------------------------------------------------- /words_segmentation/chinese.py: -------------------------------------------------------------------------------- 1 | """ 2 | Chinese text pretokenization utilities. 3 | 4 | This module provides functions for detecting and segmenting Chinese text using the jieba 5 | library for word segmentation. 6 | """ 7 | 8 | from functools import cache 9 | 10 | import regex 11 | 12 | 13 | def has_chinese(text: str) -> bool: 14 | """ 15 | Check if the given text contains Chinese characters. 16 | 17 | Uses Unicode Han ideograph property to detect Chinese characters including: 18 | - CJK Unified Ideographs (U+4E00-U+9FFF) 19 | - CJK Extension A, B, C, D, E, F, and G 20 | - CJK Compatibility Ideographs 21 | 22 | Args: 23 | text: The input text to check for Chinese characters 24 | 25 | Returns: 26 | True if Chinese characters are found, False otherwise 27 | """ 28 | # Match any Han ideograph using Unicode property 29 | return bool(regex.search(r'[\p{Han}]', text)) 30 | 31 | 32 | @cache 33 | def get_chinese_segmenter(): 34 | """ 35 | Get a cached instance of the jieba Chinese word segmenter. 36 | 37 | Jieba is a popular Chinese text segmentation library that uses a combination of 38 | dictionary-based matching and statistical models to segment Chinese text into words. 39 | The segmenter is cached to avoid repeated initialization overhead. 40 | 41 | Returns: 42 | jieba module instance for text segmentation 43 | 44 | Raises: 45 | ImportError: If the jieba library is not installed 46 | """ 47 | try: 48 | import jieba 49 | except ImportError: 50 | print("Error: jieba library not found. Please install it with: pip install jieba") 51 | raise 52 | 53 | return jieba 54 | 55 | 56 | def segment_chinese(text: str) -> list[str]: 57 | """ 58 | Segment Chinese text into a list of words. 59 | 60 | Uses jieba's precise segmentation mode to break Chinese text into individual words. 61 | This preprocessing step helps the tokenizer better 62 | understand Chinese text structure. 63 | 64 | Args: 65 | text: The Chinese text to segment 66 | 67 | Returns: 68 | List of Chinese words 69 | 70 | Example: 71 | >>> segment_chinese("我爱北京天安门") 72 | ['我', '爱', '北京', '天安门'] 73 | """ 74 | jieba = get_chinese_segmenter() 75 | # Use jieba.cut() for precise segmentation 76 | segments = jieba.cut(text) 77 | # Materialize the lazy generator into a list of words 78 | return list(segments) 79 | -------------------------------------------------------------------------------- /words_segmentation/japanese.py: -------------------------------------------------------------------------------- 1 | """ 2 | Japanese text pretokenization utilities. 
3 | 4 | This module provides functions for detecting and segmenting Japanese text using the fugashi 5 | library with MeCab for morphological analysis. 6 | """ 7 | 8 | from functools import cache 9 | 10 | import regex 11 | 12 | 13 | def has_japanese(text: str) -> bool: 14 | """ 15 | Check if the given text contains Japanese characters. 16 | 17 | Detects Japanese writing systems including: 18 | - Hiragana (U+3040-U+309F): Phonetic script for native words 19 | - Katakana (U+30A0-U+30FF): Phonetic script for foreign words 20 | - Kanji (Han ideographs): Chinese characters used in Japanese 21 | - Half-width Katakana (U+FF65-U+FF9F): Narrow katakana variants 22 | 23 | Args: 24 | text: The input text to check for Japanese characters 25 | 26 | Returns: 27 | True if Japanese characters are found, False otherwise 28 | """ 29 | # Match Hiragana, Katakana, or Han ideographs (Kanji) 30 | return bool(regex.search(r'[\p{Hiragana}\p{Katakana}\p{Han}]', text)) 31 | 32 | 33 | @cache 34 | def get_japanese_tagger(): 35 | """ 36 | Get a cached instance of the fugashi Japanese morphological analyzer. 37 | 38 | Fugashi is a Python wrapper for MeCab, a morphological analyzer for Japanese. 39 | The tagger is configured with the '-Owakati' option to output space-separated 40 | words without part-of-speech information. The instance is cached to avoid 41 | repeated initialization overhead. 42 | 43 | Returns: 44 | fugashi.Tagger instance configured for word segmentation 45 | 46 | Raises: 47 | ImportError: If the fugashi library or unidic-lite dictionary is not installed 48 | """ 49 | try: 50 | from fugashi import Tagger 51 | except ImportError: 52 | print("Error: fugashi library not found. Please install it with: pip install 'fugashi[unidic-lite]'") 53 | raise 54 | 55 | # -Owakati: Output format that produces space-separated words only 56 | return Tagger('-Owakati') 57 | 58 | 59 | def segment_japanese(text: str) -> list[str]: 60 | """ 61 | Segment Japanese text into a list of words using morphological analysis. 62 | 63 | Uses MeCab via fugashi to perform morphological analysis and word segmentation 64 | of Japanese text. This handles the complex task of word boundary detection in 65 | Japanese, which doesn't use spaces between words. 
66 | 67 | Args: 68 | text: The Japanese text to segment 69 | 70 | Returns: 71 | List of Japanese words 72 | 73 | Example: 74 | >>> segment_japanese("私は学生です") 75 | ['私', 'は', '学生', 'です'] 76 | """ 77 | tagger = get_japanese_tagger() 78 | # Parse the text and return the list of morphemes 79 | return [str(word) for word in tagger(text)] 80 | -------------------------------------------------------------------------------- /tests/test_japanese.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.japanese import has_japanese, segment_japanese 4 | 5 | 6 | def test_has_japanese_hiragana(): 7 | """Test has_japanese with Hiragana characters.""" 8 | assert has_japanese("こんにちは") 9 | assert has_japanese("ひらがな") 10 | assert has_japanese("あいうえお") 11 | 12 | 13 | def test_has_japanese_katakana(): 14 | """Test has_japanese with Katakana characters.""" 15 | assert has_japanese("カタカナ") 16 | assert has_japanese("コンピューター") 17 | assert has_japanese("アメリカ") 18 | 19 | 20 | def test_has_japanese_kanji(): 21 | """Test has_japanese with Kanji characters.""" 22 | assert has_japanese("漢字") 23 | assert has_japanese("日本語") 24 | assert has_japanese("学生") 25 | 26 | 27 | def test_has_japanese_mixed_content(): 28 | """Test has_japanese with mixed Japanese and other characters.""" 29 | assert has_japanese("hello こんにちは") 30 | assert has_japanese("私は学生です。") 31 | assert has_japanese("123 カタカナ abc") 32 | assert has_japanese("English 日本語 混合") 33 | 34 | 35 | def test_has_japanese_no_japanese(): 36 | """Test has_japanese with non-Japanese text.""" 37 | assert not has_japanese("hello") 38 | assert not has_japanese("English text") 39 | assert not has_japanese("123456") 40 | assert not has_japanese("!@#$%^&*()") 41 | assert not has_japanese("עברית") 42 | assert not has_japanese("العربية") 43 | 44 | 45 | def test_has_japanese_empty_string(): 46 | """Test has_japanese with empty string.""" 47 | assert not has_japanese("") 48 | 49 | 50 | def test_has_japanese_whitespace_only(): 51 | """Test has_japanese with whitespace only.""" 52 | assert not has_japanese(" ") 53 | assert not has_japanese("\n\t") 54 | assert not has_japanese("   ") 55 | 56 | 57 | def test_segment_japanese_simple(): 58 | """Test segment_japanese with simple Japanese text.""" 59 | result = segment_japanese("こんにちは") 60 | assert result == ['こんにちは'] 61 | 62 | 63 | def test_segment_japanese_mixed(): 64 | """Test segment_japanese with mixed Japanese and English.""" 65 | result = segment_japanese("hello 私は学生です。 world") 66 | assert result == ['hello', '私', 'は', '学生', 'です', '。', 'world'] 67 | 68 | 69 | def test_segment_japanese_empty(): 70 | """Test segment_japanese with empty string.""" 71 | result = segment_japanese("") 72 | assert result == [] 73 | 74 | 75 | def test_segment_japanese_complex(): 76 | """Test segment_japanese with complex Japanese sentence.""" 77 | result = segment_japanese("私は東京大学の学生です。") 78 | assert result == ['私', 'は', '東京', '大学', 'の', '学生', 'です', '。'] 79 | 80 | 81 | def test_segment_japanese_katakana(): 82 | """Test segment_japanese with Katakana text.""" 83 | result = segment_japanese("コンピューター") 84 | assert result == ['コンピューター'] 85 | 86 | 87 | if __name__ == "__main__": 88 | pytest.main([__file__, "-v"]) 89 | -------------------------------------------------------------------------------- /words_segmentation/pretokenizer.py: -------------------------------------------------------------------------------- 1 | import math 2 | import re 3 | from collections.abc import Iterable 4 | from
itertools import chain 5 | 6 | import regex 7 | import torch 8 | from transformers import PreTrainedTokenizer, StoppingCriteria, add_start_docstrings 9 | from transformers.generation.stopping_criteria import STOPPING_CRITERIA_INPUTS_DOCSTRING 10 | from utf8_tokenizer.control import CONTROl_TOKENS_PATTERN 11 | 12 | from words_segmentation.languages import segment_text 13 | 14 | _COMPILED_GRAPHEME_PATTERN = regex.compile(r"\X") 15 | _COMPLETE_WORD_PATTERNS = [ 16 | rf"[{CONTROl_TOKENS_PATTERN}]", # Control tokens are always complete 17 | rf"[^\s{CONTROl_TOKENS_PATTERN}]+\s", # Words with trailing space are complete 18 | ] 19 | 20 | 21 | def words_to_text(words: Iterable[str]) -> str: 22 | return ''.join(words) 23 | 24 | 25 | def text_to_words(text: str, max_bytes: int = math.inf) -> list[str]: 26 | words = chain.from_iterable(segment_text(text)) 27 | 28 | if max_bytes == math.inf: 29 | return list(words) 30 | 31 | chunks = (utf8_chunks_grapheme_safe(word, max_bytes=max_bytes) for word in words) 32 | return list(chain.from_iterable(chunks)) 33 | 34 | 35 | def utf8_chunks_grapheme_safe(text: str, max_bytes: int = 16) -> Iterable[str]: 36 | """ 37 | Split a string into chunks of at most max_bytes bytes, without splitting grapheme clusters. 38 | Except, if there is a single grapheme cluster longer than max_bytes, it will be in its own chunk. 👩‍👩‍👧‍👦 39 | """ 40 | text_bytes = text.encode("utf-8") 41 | if len(text_bytes) <= max_bytes: 42 | yield text 43 | return 44 | 45 | clusters = _COMPILED_GRAPHEME_PATTERN.findall(text) 46 | if len(clusters) == 1: 47 | yield text 48 | return 49 | 50 | curr = [] 51 | curr_bytes = 0 52 | for cluster in clusters: 53 | cluster_bytes = len(cluster.encode("utf-8")) 54 | if curr_bytes + cluster_bytes > max_bytes: 55 | if curr: 56 | yield "".join(curr) 57 | curr = [cluster] 58 | curr_bytes = cluster_bytes 59 | else: 60 | curr.append(cluster) 61 | curr_bytes += cluster_bytes 62 | if curr: 63 | yield "".join(curr) 64 | 65 | 66 | def is_word_complete(text: str) -> bool: 67 | for pattern in _COMPLETE_WORD_PATTERNS: 68 | if re.fullmatch(pattern, text): 69 | return True 70 | 71 | # TODO: not clear how to know if a word full of whitespaces is complete 72 | # maybe if _TOKEN_PATTERN is not a full match, but then need to "delete" the last token. 
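    # Illustrative expectations (see tests/test_pretokenizer.py): "hello " and "\x01" are complete,
    # while "hello", " ", and "" are not.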
73 | return False 74 | 75 | 76 | class WordStoppingCriteria(StoppingCriteria): 77 | def __init__(self, tokenizer: PreTrainedTokenizer): 78 | self.tokenizer = tokenizer 79 | 80 | @add_start_docstrings(STOPPING_CRITERIA_INPUTS_DOCSTRING) 81 | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor: 82 | texts = self.tokenizer.batch_decode(input_ids.tolist()) 83 | is_done = [is_word_complete(text) for text in texts] 84 | return torch.tensor(is_done, dtype=torch.bool, device=input_ids.device) 85 | 86 | -------------------------------------------------------------------------------- /examples/tokens_parity.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import pandas as pd 3 | from transformers import GPT2TokenizerFast 4 | 5 | from words_segmentation.tokenizer import WordsSegmentationTokenizer 6 | 7 | tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4') 8 | words_tokenizer = WordsSegmentationTokenizer() 9 | 10 | 11 | # Warm chinese and japanese models, because it prints stuff 12 | words_tokenizer.tokenize("体") 13 | 14 | texts = { 15 | "English": "Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate.", 16 | "Italian": "I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona.", 17 | "German": "Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten.", 18 | "Chinese": "团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。", 19 | "Japanese": "ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。", 20 | "Finnish": "Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö.", 21 | "Russian": "Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека.", 22 | "Arabic": "تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد.", 23 | "Hebrew": "סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם.", 24 | "Greek": "Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο.", 25 | "Tamil": "பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும்.", 26 | "Kannada": "ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ 
ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ.", 27 | "Shan": "ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။", 28 | } 29 | 30 | data = { 31 | "Language": [], 32 | "Bytes (UTF-8)": [], 33 | "Tokens (GPT-4)": [], 34 | "Words (Whitespace+)": [], 35 | } 36 | 37 | print("| Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) |") 38 | print("|----------|-------------------------|-------------|----------------|--------------------|") 39 | for lang, text in texts.items(): 40 | data["Language"].append(lang) 41 | num_bytes = len(text.encode('utf-8')) 42 | data["Bytes (UTF-8)"].append(num_bytes) 43 | num_tokens = len(tokenizer.tokenize(text)) 44 | data["Tokens (GPT-4)"].append(num_tokens) 45 | num_words = len(words_tokenizer.tokenize(text)) 46 | data["Words (Whitespace+)"].append(num_words) 47 | print(f"| {lang} | {text} | {num_bytes} | {num_tokens} | {num_words} |") 48 | 49 | df = pd.DataFrame(data) 50 | 51 | # Plot 52 | plt.figure(figsize=(10, 6)) 53 | df.set_index("Language").plot(kind="bar", figsize=(12, 7)) 54 | 55 | plt.title("Text Size Across Languages (Bytes, Tokens, Words)") 56 | plt.ylabel("Count") 57 | plt.xlabel("Language") 58 | plt.xticks(rotation=45) 59 | plt.legend(title="Measure") 60 | plt.tight_layout() 61 | plt.savefig("../assets/tokenization-parity-words.png") 62 | -------------------------------------------------------------------------------- /words_segmentation/languages.py: -------------------------------------------------------------------------------- 1 | """ 2 | Script-aware segmentation with per-language callbacks. 3 | - Uses Unicode Script_Extensions (scx) to segment Han/Hiragana/Katakana, etc. 4 | - Falls back to a Default branch that avoids those scripts. 5 | - Each non-default segment is passed to its language-specific callback. 6 | """ 7 | 8 | from collections.abc import Callable, Iterable 9 | from functools import cache 10 | from itertools import chain 11 | from typing import Any, TypedDict 12 | 13 | import regex 14 | from utf8_tokenizer.control import CONTROl_TOKENS_PATTERN 15 | 16 | from words_segmentation.chinese import segment_chinese 17 | from words_segmentation.japanese import segment_japanese 18 | from words_segmentation.signwriting import segment_signwriting 19 | 20 | # Three classes of tokens inside the Default branch: 21 | # 1) Control tokens (always atomic) 22 | # 2) "Words" = runs of non-space, non-control + optional trailing single space 23 | # 3) Whitespace runs 24 | _TOKEN_PATTERN = ( 25 | rf"[{CONTROl_TOKENS_PATTERN}]" # 1) Control tokens 26 | rf"|[^\s{CONTROl_TOKENS_PATTERN}]+\s?" # 2) Word (+ optional trailing space) 27 | r"|\s+" # 3) Whitespace runs 28 | ) 29 | _COMPILED_TOKEN_PATTERN = regex.compile(_TOKEN_PATTERN) 30 | 31 | 32 | def text_to_unbound_words(text: str) -> list[str]: 33 | """Tokenize a non-scripted span using the token rules above.""" 34 | return _COMPILED_TOKEN_PATTERN.findall(text) 35 | 36 | 37 | class LanguageSpec(TypedDict): 38 | scripts: tuple[str, ...] 
# e.g., ("Han",) or ("Han", "Hiragana", "Katakana") 39 | callback: Callable[[str], Any] # called with the matched span 40 | 41 | 42 | LANGUAGE_SPECS: dict[str, LanguageSpec] = { 43 | "SignWriting": { 44 | "scripts": ("SignWriting",), 45 | "callback": segment_signwriting, 46 | }, 47 | "Chinese": { 48 | "scripts": ("Han",), 49 | "callback": segment_chinese, 50 | }, 51 | "Japanese": { 52 | "scripts": ("Han", "Hiragana", "Katakana"), 53 | "callback": segment_japanese, 54 | }, 55 | "Default": { 56 | "scripts": tuple(), 57 | "callback": text_to_unbound_words, 58 | }, 59 | } 60 | 61 | 62 | def _union_scx(scripts: tuple[str, ...]) -> str: 63 | """Create a non-capturing alternation for a set of Script_Extensions.""" 64 | parts = [fr"\p{{scx={s}}}" for s in scripts] 65 | return "(?:" + "|".join(parts) + ")" 66 | 67 | 68 | @cache 69 | def build_regex_from_languages() -> regex.Pattern: 70 | """ 71 | Compile the master regex with named groups for each language plus Default. 72 | 73 | Precedence: dict order in LANGUAGE_SPECS — first match wins if script sets overlap. 74 | Default branch: consumes runs that do NOT begin with any of the listed scripts. 75 | """ 76 | # Explicit language branches (skip Default — it has no 'scripts') 77 | branches: list[str] = [] 78 | for name, spec in LANGUAGE_SPECS.items(): 79 | if spec["scripts"]: 80 | branches.append(fr"(?P<{name}>{_union_scx(spec['scripts'])}+)") 81 | 82 | # Default: refuse any char that begins one of the explicit-script branches 83 | all_scripts = tuple(sorted({s for spec in LANGUAGE_SPECS.values() for s in spec.get("scripts", ())})) 84 | forbidden = _union_scx(all_scripts) if all_scripts else r"$a" # impossible atom if no scripts exist 85 | default_branch = fr"(?P(?:(?!{forbidden})\X)+)" 86 | 87 | # Combined pattern (verbose mode for readability) 88 | pattern = r"(?x)(?:" + "|".join(branches + [default_branch]) + r")" 89 | return regex.compile(pattern) 90 | 91 | 92 | def segment_text(text: str) -> Iterable[Any]: 93 | """ 94 | Iterate over callback results for each matched span. 95 | - Non-Default groups call their language callback. 96 | - Default group calls its callback if present in LANGUAGE_SPECS. 
97 | """ 98 | pat = build_regex_from_languages() 99 | for m in pat.finditer(text): 100 | group_name = m.lastgroup 101 | spec = LANGUAGE_SPECS.get(group_name) 102 | yield spec["callback"](m.group(0)) 103 | 104 | 105 | if __name__ == "__main__": 106 | sample = "東京abcかなカナ漢字123 אני אחד私は学生です" 107 | # Stream results as produced by callbacks 108 | print(list(chain.from_iterable(segment_text(sample)))) 109 | -------------------------------------------------------------------------------- /tests/test_word_stopping_criteria.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | import torch 3 | 4 | from words_segmentation.pretokenizer import WordStoppingCriteria 5 | 6 | 7 | class MockTokenizer: 8 | """Mock tokenizer for testing WordStoppingCriteria.""" 9 | 10 | def decode(self, token_ids): 11 | """Simple mock decode that converts token IDs to characters.""" 12 | if isinstance(token_ids, torch.Tensor): 13 | token_ids = token_ids.tolist() 14 | # Simple mapping: use chr() for decoding, allow control characters 15 | return ''.join(chr(tid % 128) for tid in token_ids) 16 | 17 | def batch_decode(self, token_ids_list): 18 | """Batch decode that converts a list of token IDs to a list of strings.""" 19 | return [self.decode(token_ids) for token_ids in token_ids_list] 20 | 21 | 22 | def test_word_stopping_criteria_basic(): 23 | """Test WordStoppingCriteria with basic functionality on CPU.""" 24 | tokenizer = MockTokenizer() 25 | criteria = WordStoppingCriteria(tokenizer) 26 | 27 | # Test with complete word (has trailing space) - ASCII 'h','e','l','l','o',' ' = 104,101,108,108,111,32 28 | input_ids = torch.tensor([[104, 101, 108, 108, 111, 32]]) # "hello " 29 | scores = torch.zeros((1, 100)) 30 | result = criteria(input_ids, scores) 31 | 32 | assert isinstance(result, torch.Tensor) 33 | assert result.dtype == torch.bool 34 | assert result.device.type == "cpu" 35 | assert result.shape == (1,) 36 | assert result[0].item() is True # "hello " is complete 37 | 38 | 39 | def test_word_stopping_criteria_incomplete(): 40 | """Test WordStoppingCriteria with incomplete word.""" 41 | tokenizer = MockTokenizer() 42 | criteria = WordStoppingCriteria(tokenizer) 43 | 44 | # Test with incomplete word (no trailing space) - ASCII 'h','e','l','l','o' = 104,101,108,108,111 45 | input_ids = torch.tensor([[104, 101, 108, 108, 111]]) # "hello" 46 | scores = torch.zeros((1, 100)) 47 | result = criteria(input_ids, scores) 48 | 49 | assert isinstance(result, torch.Tensor) 50 | assert result.dtype == torch.bool 51 | assert result.device.type == "cpu" 52 | assert result.shape == (1,) 53 | assert result[0].item() is False # "hello" is incomplete 54 | 55 | 56 | def test_word_stopping_criteria_batch(): 57 | """Test WordStoppingCriteria with batch of inputs.""" 58 | tokenizer = MockTokenizer() 59 | criteria = WordStoppingCriteria(tokenizer) 60 | 61 | # Batch with mixed complete and incomplete words (same length with padding) 62 | input_ids = torch.tensor([ 63 | [104, 101, 108, 108, 111, 32], # "hello " - complete 64 | [104, 101, 108, 108, 111, 0], # "hello" (with padding) - incomplete 65 | [119, 111, 114, 108, 100, 32], # "world " - complete 66 | ]) 67 | scores = torch.zeros((3, 100)) 68 | result = criteria(input_ids, scores) 69 | 70 | assert isinstance(result, torch.Tensor) 71 | assert result.dtype == torch.bool 72 | assert result.device.type == "cpu" 73 | assert result.shape == (3,) 74 | assert result[0].item() is True # "hello " is complete 75 | assert result[1].item() is False # 
"hello" is incomplete 76 | assert result[2].item() is True # "world " is complete 77 | 78 | 79 | @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") 80 | def test_word_stopping_criteria_cuda_device(): 81 | """Test WordStoppingCriteria respects CUDA device.""" 82 | tokenizer = MockTokenizer() 83 | criteria = WordStoppingCriteria(tokenizer) 84 | 85 | # Test with input on CUDA 86 | input_ids = torch.tensor([[104, 101, 108, 108, 111, 32]], device="cuda") # "hello " 87 | scores = torch.zeros((1, 100), device="cuda") 88 | result = criteria(input_ids, scores) 89 | 90 | assert isinstance(result, torch.Tensor) 91 | assert result.dtype == torch.bool 92 | assert result.device.type == "cuda" # Should match input device 93 | assert result.shape == (1,) 94 | assert result[0].item() is True 95 | 96 | 97 | def test_word_stopping_criteria_control_token(): 98 | """Test WordStoppingCriteria with control tokens.""" 99 | tokenizer = MockTokenizer() 100 | criteria = WordStoppingCriteria(tokenizer) 101 | 102 | # Control token (e.g., \x01) 103 | input_ids = torch.tensor([[1]]) # Control token 104 | scores = torch.zeros((1, 100)) 105 | result = criteria(input_ids, scores) 106 | 107 | assert isinstance(result, torch.Tensor) 108 | assert result.dtype == torch.bool 109 | assert result.shape == (1,) 110 | assert result[0].item() is True # Control tokens are always complete 111 | 112 | 113 | def test_word_stopping_criteria_utf8_tokenizer(): 114 | """Test WordStoppingCriteria with UTF8Tokenizer to verify batch_decode compatibility.""" 115 | from utf8_tokenizer.tokenizer import UTF8Tokenizer 116 | 117 | tokenizer = UTF8Tokenizer() 118 | criteria = WordStoppingCriteria(tokenizer) 119 | 120 | # Create test cases with UTF8Tokenizer's encoded tokens 121 | # UTF8Tokenizer adds SOT (0x02) and EOT (0x03) control tokens 122 | # We test that batch_decode produces the same results as individual decode calls 123 | 124 | # Encode some test strings 125 | test_texts = ["hello world ", "test"] 126 | test_ids = [tokenizer.encode(text) for text in test_texts] 127 | 128 | # Pad to same length 129 | max_len = max(len(ids) for ids in test_ids) 130 | padded_ids = [ids + [0] * (max_len - len(ids)) for ids in test_ids] 131 | 132 | input_ids = torch.tensor(padded_ids) 133 | scores = torch.zeros((len(padded_ids), 100)) 134 | 135 | # Get results using batch_decode (new implementation) 136 | result = criteria(input_ids, scores) 137 | 138 | # Verify by checking batch_decode produces same results as individual decode 139 | batch_decoded = tokenizer.batch_decode(input_ids.tolist()) 140 | individual_decoded = [tokenizer.decode(ids) for ids in input_ids] 141 | 142 | assert batch_decoded == individual_decoded, "batch_decode should produce same results as individual decode calls" 143 | 144 | # Verify result properties 145 | assert isinstance(result, torch.Tensor) 146 | assert result.dtype == torch.bool 147 | assert result.shape == (2,) 148 | 149 | 150 | if __name__ == "__main__": 151 | pytest.main([__file__, "-v"]) 152 | -------------------------------------------------------------------------------- /tests/test_pretokenizer.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from words_segmentation.pretokenizer import ( 4 | is_word_complete, 5 | text_to_words, 6 | utf8_chunks_grapheme_safe, 7 | ) 8 | 9 | 10 | def test_utf8_chunks_english(): 11 | """Test utf8_chunks_grapheme_safe with English text.""" 12 | text = "hello world" 13 | chunks = 
list(utf8_chunks_grapheme_safe(text, max_bytes=5)) 14 | assert chunks == ["hello", " worl", "d"] 15 | assert len(chunks) == 3 16 | assert all(len(chunk.encode('utf-8')) <= 5 for chunk in chunks) 17 | 18 | 19 | def test_utf8_chunks_hebrew(): 20 | """Test utf8_chunks_grapheme_safe with Hebrew text.""" 21 | text = "עמית מוריוסף" 22 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=8)) 23 | assert chunks == ['עמית', ' מור', 'יוסף'] 24 | assert "".join(chunks) == text 25 | assert len(chunks) == 3 26 | assert all(len(chunk.encode('utf-8')) <= 8 for chunk in chunks) 27 | 28 | 29 | def test_utf8_chunks_emoji(): 30 | """Test utf8_chunks_grapheme_safe with basic emoji.""" 31 | text = "hello 😀 world" 32 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=8)) 33 | assert chunks == ['hello ', '😀 wor', 'ld'] 34 | assert "".join(chunks) == text 35 | assert len(chunks) == 3 36 | assert all(len(chunk.encode('utf-8')) <= 8 for chunk in chunks) 37 | 38 | 39 | def test_utf8_chunks_long_emoji_cluster(): 40 | """Test utf8_chunks_grapheme_safe with complex emoji cluster.""" 41 | text = "👩‍👩‍👧‍👦" 42 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=5)) 43 | assert chunks == [text] 44 | assert len(chunks) == 1 45 | assert len(text.encode('utf-8')) > 5 46 | 47 | 48 | def test_utf8_chunks_single_grapheme(): 49 | """Test special case of single grapheme cluster.""" 50 | text = "a" 51 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=16)) 52 | assert chunks == ["a"] 53 | 54 | 55 | def test_utf8_chunks_mixed_content(): 56 | """Test with mixed English, Hebrew, and emoji.""" 57 | text = "hello עמית 👋" 58 | chunks = list(utf8_chunks_grapheme_safe(text, max_bytes=10)) 59 | assert "".join(chunks) == text 60 | 61 | 62 | def test_text_to_words_json(): 63 | """Test text_to_words with JSON string.""" 64 | json_text = '{"name": "test", "value": 123}' 65 | words = text_to_words(json_text, max_bytes=10) 66 | assert words == ['{"name": ', '"test", ', '"value": ', "123}"] 67 | assert "".join(words) == json_text 68 | assert len(words) == 4 69 | 70 | 71 | def test_text_to_words_long_string(): 72 | """Test text_to_words with long string.""" 73 | long_text = ("This is a very long string that should be split into multiple chunks " 74 | "when processed by the text_to_words function with appropriate byte limits.") 75 | words = text_to_words(long_text, max_bytes=8) 76 | assert "".join(words) == long_text 77 | 78 | 79 | def test_text_to_words_short_string(): 80 | """Test text_to_words with short string.""" 81 | short_text = "hi" 82 | words = text_to_words(short_text, max_bytes=16) 83 | assert words == ["hi"] 84 | 85 | 86 | def test_text_to_words_multiline_code(): 87 | """Test similar to the processor test example.""" 88 | text = """ 89 | def foo(): 90 | return "bar" 91 | """.strip() 92 | words = text_to_words(text, max_bytes=10) 93 | assert words == ['def ', 'foo():\n', ' ', 'return ', '"bar"'] 94 | assert "".join(words) == text 95 | assert len(words) == 5 96 | assert 'def ' in words 97 | assert ' ' in words 98 | 99 | 100 | def test_text_to_words_whitespace(): 101 | """Test proper whitespace handling.""" 102 | text = "hello world" 103 | words = text_to_words(text, max_bytes=8) 104 | assert words == ["hello ", " ", "world"] 105 | assert "".join(words) == text 106 | assert len(words) == 3 107 | assert " " in words 108 | 109 | 110 | def test_text_to_words_mixed_unicode(): 111 | """Test with mixed content including unicode.""" 112 | text = "hello עמית! 
🌟 {'key': 'value'}" 113 | words = text_to_words(text, max_bytes=10) 114 | assert "".join(words) == text 115 | 116 | 117 | def test_text_to_words_empty(): 118 | """Test with empty string.""" 119 | words = text_to_words("", max_bytes=16) 120 | assert words == [] 121 | 122 | 123 | def test_text_to_words_only_whitespace(): 124 | """Test with only whitespace.""" 125 | text = " \n\t " 126 | words = text_to_words(text, max_bytes=16) 127 | assert text == words[0] 128 | assert "".join(words) == text 129 | assert len(words) == 1 130 | assert all(c.isspace() for c in words[0]) 131 | 132 | 133 | def test_text_to_words_json_unicode(): 134 | """Test JSON containing unicode characters.""" 135 | json_text = '{"message":"שלום world 🌍","count": 42}' 136 | words = text_to_words(json_text, max_bytes=6) 137 | assert "".join(words) == json_text 138 | assert len(words) > 5 139 | assert any("🌍" in word for word in words) 140 | 141 | 142 | def test_is_word_complete_control_tokens(): 143 | """Test is_word_complete with control tokens.""" 144 | assert is_word_complete("\x01") 145 | assert is_word_complete("\x02") 146 | assert is_word_complete("\x03") 147 | assert is_word_complete("\x08") 148 | assert is_word_complete("\x7F") 149 | 150 | 151 | def test_is_word_complete_words_with_space(): 152 | """Test is_word_complete with words that have trailing space.""" 153 | assert is_word_complete("hello ") 154 | assert is_word_complete("world ") 155 | assert is_word_complete("test ") 156 | assert is_word_complete("עמית ") 157 | assert is_word_complete("🌟 ") 158 | 159 | 160 | def test_is_word_complete_incomplete_words(): 161 | """Test is_word_complete with incomplete words (no trailing space).""" 162 | assert not is_word_complete("hello") 163 | assert not is_word_complete("world") 164 | assert not is_word_complete("test") 165 | assert not is_word_complete("עמית") 166 | assert not is_word_complete("🌟") 167 | 168 | 169 | def test_is_word_complete_whitespace_only(): 170 | """Test is_word_complete with whitespace-only strings.""" 171 | assert not is_word_complete(" ") 172 | assert not is_word_complete(" ") 173 | assert not is_word_complete("\n") 174 | assert not is_word_complete("\t") 175 | assert not is_word_complete(" \n\t ") 176 | 177 | 178 | def test_is_word_complete_empty_string(): 179 | """Test is_word_complete with empty string.""" 180 | assert not is_word_complete("") 181 | 182 | def test_is_word_complete_unicode_with_space(): 183 | """Test is_word_complete with unicode characters and trailing space.""" 184 | assert is_word_complete("שלום ") 185 | assert is_word_complete("مرحبا ") 186 | assert is_word_complete("こんにちは ") 187 | assert not is_word_complete("שלום") 188 | assert not is_word_complete("مرحبا") 189 | assert not is_word_complete("こんにちは") 190 | 191 | 192 | if __name__ == "__main__": 193 | pytest.main([__file__, "-v"]) 194 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Words Segmentation 2 | 3 | This repository contains a pretokenizer that segments text into "words" for further processing. 4 | 5 | We define three classes of tokens: 6 | 7 | 1. `C0` Control tokens (always atomic) 8 | 2. "Words" = runs of non-space, non-control + optional single trailing whitespace 9 | 3. Whitespace runs 10 | 11 | For any script where the default is not suitable, you can implement a custom pretokenizer. 
12 | Modify `LANGUAGE_SPECS` in [languages.py](./words_segmentation/languages.py) to add a custom function for specific 13 | scripts. 14 | 15 | For example: 16 | 17 | ```python 18 | LANGUAGE_SPECS: dict[str, LanguageSpec] = { 19 | "Chinese": { 20 | "scripts": ("Han",), 21 | "callback": segment_chinese, 22 | }, 23 | "Japanese": { 24 | "scripts": ("Han", "Hiragana", "Katakana"), 25 | "callback": segment_japanese, 26 | }, 27 | } 28 | ``` 29 | 30 | Then, with a `max_bytes` parameter, we split long words into smaller chunks while preserving 31 | Unicode grapheme boundaries. 32 | 33 | ## Usage 34 | 35 | Install: 36 | 37 | ```bash 38 | pip install words-segmentation 39 | ``` 40 | 41 | Pretokenize text using a Hugging Face Tokenizer implementation: 42 | 43 | ```python 44 | from words_segmentation.tokenizer import WordsSegmentationTokenizer 45 | 46 | pretokenizer = WordsSegmentationTokenizer(max_bytes=16) 47 | tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦") 48 | # ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩‍👩‍👧‍👦'] 49 | ``` 50 | 51 | ## [Writing systems without word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) 52 | 53 | Perhaps there will come a day when we could have a universal pretokenizer that works for all languages. 54 | Until then, we need to handle some writing systems with custom logic. 55 | We implement custom fallback pretokenizers for the following writing systems: 56 | 57 | - [x] [Chinese characters](https://en.wikipedia.org/wiki/Chinese_characters) - 58 | using [jieba](https://github.com/fxsjy/jieba) 59 | - [x] [Japanese writing system](https://en.wikipedia.org/wiki/Japanese_writing_system) - 60 | using [fugashi](https://github.com/polm/fugashi) 61 | - [ ] [Balinese script](https://en.wikipedia.org/wiki/Balinese_script) 62 | - [ ] [Burmese alphabet](https://en.wikipedia.org/wiki/Burmese_alphabet) 63 | - [ ] [Chữ Hán](https://en.wikipedia.org/wiki/Ch%E1%BB%AF_H%C3%A1n) 64 | - [ ] [Chữ Nôm](https://en.wikipedia.org/wiki/Ch%E1%BB%AF_N%C3%B4m) 65 | - [ ] [Hanja](https://en.wikipedia.org/wiki/Hanja) 66 | - [ ] [Javanese script](https://en.wikipedia.org/wiki/Javanese_script) 67 | - [ ] [Khmer script](https://en.wikipedia.org/wiki/Khmer_script) 68 | - [ ] [Lao script](https://en.wikipedia.org/wiki/Lao_script) 69 | - [ ] [ʼPhags-pa script](https://en.wikipedia.org/wiki/%CA%BCPhags-pa_script) 70 | - [ ] [Rasm](https://en.wikipedia.org/wiki/Rasm) 71 | - [ ] [Sawndip](https://en.wikipedia.org/wiki/Sawndip) 72 | - [ ] [Scriptio continua](https://en.wikipedia.org/wiki/Scriptio_continua) 73 | - [ ] [S'gaw Karen alphabet](https://en.wikipedia.org/wiki/S%27gaw_Karen_alphabet) 74 | - [ ] [Tai Tham script](https://en.wikipedia.org/wiki/Tai_Tham_script) 75 | - [ ] [Thai script](https://en.wikipedia.org/wiki/Thai_script) 76 | - [ ] [Tibetan script](https://en.wikipedia.org/wiki/Tibetan_script) 77 | - [ ] [Vietnamese alphabet](https://en.wikipedia.org/wiki/Vietnamese_alphabet) 78 | - [ ] [Western Pwo alphabet](https://en.wikipedia.org/wiki/Western_Pwo_alphabet) 79 | 80 | ## Tokenization Parity 81 | 82 | [Foroutan and Meister et al. (2025)](https://www.arxiv.org/pdf/2508.04796) note that: 83 | > In multilingual models, the same meaning can take far more tokens in some languages, 84 | > penalizing users of underrepresented languages with worse performance and higher API costs.
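The measurements below can be reproduced with this package. A minimal sketch (the full script, which also renders the plot, is [`examples/tokens_parity.py`](./examples/tokens_parity.py)):

```python
from transformers import GPT2TokenizerFast

from words_segmentation.tokenizer import WordsSegmentationTokenizer

gpt4_tokenizer = GPT2TokenizerFast.from_pretrained("Xenova/gpt-4")
words_tokenizer = WordsSegmentationTokenizer()

text = "我爱北京天安门"  # any sentence from the table below works here
print(len(text.encode("utf-8")))           # Bytes (UTF-8)
print(len(gpt4_tokenizer.tokenize(text)))  # Tokens (GPT-4)
print(len(words_tokenizer.tokenize(text))) # Words (Whitespace+)
```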
85 | 86 | [![Tokenization Parity](assets/tokenization-parity.png)](https://www.linkedin.com/posts/sina-ahmadi-aba470287_dont-speak-english-you-must-pay-more-activity-7360959825893036035-vnFN) 87 | 88 | Let's consider the same example, for whitespace pre-tokenization parity: 89 | 90 | | Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) | 91 | |----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------------|---------------------| 92 | | English | Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate. | 173 | 40 | 34 | 93 | | Italian | I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona. | 230 | 58 | 43 | 94 | | German | Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten. | 256 | 64 | 40 | 95 | | Chinese | 团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。 | 177 | 64 | 34 | 96 | | Japanese | ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。 | 227 | 74 | 48 | 97 | | Finnish | Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö. | 212 | 79 | 30 | 98 | | Russian | Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека. | 409 | 100 | 32 | 99 | | Arabic | تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد. | 341 | 140 | 33 | 100 | | Hebrew | סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם. | 281 | 151 | 31 | 101 | | Greek | Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο. | 394 | 193 | 36 | 102 | | Tamil | பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும். | 587 | 293 | 26 | 103 | | Kannada | ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ. 
| 565 | 361 | 26 | 104 | | Shan | ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။ | 669 | 531 | 23 | 105 | 106 | #### Bytes Efficiency 107 | 108 | English really is the most efficient language in terms of bytes count, which is not suprising given its Latin alphabet, 109 | without diacritics or ligatures (with 1 byte per character). 110 | Other languages that use the Latin alphabet are also relatively efficient (e.g. Italian, German, Finnish), but their 111 | use of diacritics and ligatures increases the byte count. 112 | 113 | Languages that use non-Latin scripts (e.g. Arabic, Hebrew, Shan) have a much higher byte count, due to the need for 114 | multiple bytes per character in UTF-8 encoding. Hebrew and Arabic use two bytes per character, 115 | while Shan uses three bytes per character, not counting ligatures. 116 | 117 | #### Tokenization Efficiency (GPT-4) 118 | 119 | English is also the most efficient language in terms of token count, which is not suprising given that the tokenizer 120 | was trained primarily on English text. 121 | Other languages that use the Latin alphabet are also relatively efficient, but the moment we move to non-Latin scripts, 122 | the token count increases significantly (up to 13x for Shan). 123 | 124 | #### Words Efficiency 125 | 126 | Assuming whitespace tokenization as a proxy for words, we see that English is not the most efficient language. 127 | This makes sense, from a language efficiency perspective, that there is no computational bias towards English. 128 | Languages distribute between 23 and 43 words for the same sentence, with English right in the middle with 34. 129 | 130 | ![Tokenization Parity - Words](assets/tokenization-parity-words.png) 131 | 132 | ## Cite 133 | 134 | If you use this code in your research, please consider citing the work: 135 | 136 | ```bibtex 137 | @misc{moryossef2025words, 138 | title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World}, 139 | author={Moryossef, Amit}, 140 | howpublished={\url{https://github.com/sign/words-segmentation}}, 141 | year={2025} 142 | } 143 | ``` --------------------------------------------------------------------------------