├── CHANGELOG.md
├── LICENSE
├── README.md
├── composer.json
├── pint.json
└── src
    └── Grapheme.php


/CHANGELOG.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soloterm/grapheme/502334c51a208750e8fc418228c8956e585ba34b/CHANGELOG.md


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Aaron Francis

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Grapheme

[![Latest Version on Packagist](https://img.shields.io/packagist/v/soloterm/grapheme)](https://packagist.org/packages/soloterm/grapheme)
[![Total Downloads](https://img.shields.io/packagist/dt/soloterm/grapheme)](https://packagist.org/packages/soloterm/grapheme)
[![License](https://img.shields.io/packagist/l/soloterm/grapheme)](https://packagist.org/packages/soloterm/grapheme)

A highly optimized PHP library for calculating the display width of Unicode graphemes in terminal environments.
Accurately determine how many columns a character will occupy in the terminal, including complex emoji, combining marks,
and more.

This library was built to support [Solo](https://github.com/soloterm/solo), your all-in-one Laravel command to tame local development.

## Why Use This Library?

Building CLI applications can be challenging when it comes to handling modern Unicode text:

- Emoji and CJK characters take up 2 cells in most terminals
- Zero-width characters (joiners, marks, etc.) don't affect layout but can cause width calculation errors
- Complex text like emoji with skin tones or flags require special handling
- PHP's built-in functions don't fully address these edge cases

This library solves these problems by providing an accurate, performant, and thoroughly tested way to determine the
display width of any character or grapheme cluster.
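
To make the point about PHP's built-in functions concrete, here is a minimal sketch (not part of the library) of what the standard length functions report for two graphemes that each render as a single unit. It assumes the `mbstring` extension is loaded. `mb_strwidth()` is left out because its results for emoji can differ between PHP versions.

```php
<?php

// "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
$accented = "e\u{0301}";

// Thumbs up (U+1F44D) followed by a light skin tone modifier (U+1F3FB): one visible emoji.
$thumbsUp = "\u{1F44D}\u{1F3FB}";

// strlen() counts bytes and mb_strlen() counts code points. Neither counts terminal cells.
echo strlen($accented), ' ', mb_strlen($accented, 'UTF-8'), PHP_EOL; // 3 2
echo strlen($thumbsUp), ' ', mb_strlen($thumbsUp, 'UTF-8'), PHP_EOL; // 8 2
```

According to the usage examples in this README, `Grapheme::wcwidth()` reports 1 for the first grapheme and 2 for the second.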

## Installation

```bash
composer require soloterm/grapheme
```

## Usage

```php
use SoloTerm\Grapheme\Grapheme;

// Basic characters (width: 1)
Grapheme::wcwidth('a'); // Returns: 1
Grapheme::wcwidth('Я'); // Returns: 1

// East Asian characters (width: 2)
Grapheme::wcwidth('文'); // Returns: 2
Grapheme::wcwidth('あ'); // Returns: 2

// Emoji (width: 2)
Grapheme::wcwidth('😀'); // Returns: 2
Grapheme::wcwidth('🚀'); // Returns: 2

// Complex emoji with modifiers (width: 2)
Grapheme::wcwidth('👍🏻'); // Returns: 2
Grapheme::wcwidth('👨‍👩‍👧‍👦'); // Returns: 2

// Zero-width characters (width: 0)
Grapheme::wcwidth("\u{200B}"); // Returns: 0 (Zero-width space)

// Characters with combining marks (width: 1)
Grapheme::wcwidth('é'); // Returns: 1
Grapheme::wcwidth("e\u{0301}"); // Returns: 1 (e + combining acute)

// Special cases
Grapheme::wcwidth("⚠\u{FE0E}"); // Returns: 1 (Warning sign in text presentation)
Grapheme::wcwidth("⚠\u{FE0F}"); // Returns: 2 (Warning sign in emoji presentation)
```

## Features

- **Highly optimized** for performance with early-return paths and smart caching
- **Comprehensive Unicode support** including:
    - CJK (Chinese, Japanese, Korean) characters
    - Emoji (including skin tone modifiers, gender modifiers, flags)
    - Zero-width characters and control codes
    - Combining marks and accents
    - Regional indicators and flags
    - Variation selectors
- **Carefully tested** against a wide range of Unicode characters
- **Minimal dependencies** - only requires PHP 8.2+ and an optional intl extension
- **Compatible** with most terminal environments

## Terminal Compatibility

This library aims to match the behavior of `wcwidth()` in modern terminal emulators.
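
As an illustration of how a per-grapheme width is typically used, the sketch below pads a string to a fixed number of terminal columns. `padToColumns` is a hypothetical helper, not part of this library. It takes the width function as a callable so that `[Grapheme::class, 'wcwidth']` can be passed in, and it assumes that PCRE's `\X` is an acceptable way to split text into grapheme clusters.

```php
<?php

/**
 * Hypothetical helper: pad $text with trailing spaces up to $columns terminal cells.
 *
 * $widthOf is any callable that returns the cell width of one grapheme cluster.
 */
function padToColumns(string $text, int $columns, callable $widthOf): string
{
    // \X matches one extended grapheme cluster. On invalid UTF-8,
    // preg_match_all() fails and we treat the text as having no clusters.
    preg_match_all('/\X/u', $text, $matches);

    $width = 0;
    foreach ($matches[0] ?? [] as $cluster) {
        $width += $widthOf($cluster);
    }

    // Never truncate: text wider than $columns is returned unchanged.
    return $text . str_repeat(' ', max(0, $columns - $width));
}

// With the library installed, usage might look like:
// echo padToColumns('文字', 6, [\SoloTerm\Grapheme\Grapheme::class, 'wcwidth']), '|';
```

Passing the width function in keeps the helper independent of any one implementation, which also makes it easy to exercise with a stub.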

## Requirements

- PHP 8.2 or higher
- The `symfony/polyfill-intl-normalizer` package is included as a dependency
- The `ext-intl` extension is recommended for best performance

## Under the Hood

The library uses a series of optimized patterns and checks to accurately determine character width:

1. Fast paths for ASCII and zero-width characters
2. Special handling for complex scripts like Devanagari
3. Smart detection of emoji and variation selectors
4. Proper handling of zero-width joiners (ZWJ) and other invisible characters
5. Caching of results for improved performance

## Testing

```bash
composer test
```

The test suite includes over 150 different test cases covering many possible Unicode scenarios. Please feel free to add
more.

## Contributing

Contributions are welcome! Please feel free to submit a pull request.

## License

The MIT License (MIT).

## Support

This is free! If you want to support me:

- Sponsor my open source work: [aaronfrancis.com/backstage](https://aaronfrancis.com/backstage)
- Check out my courses:
    - [Mastering Postgres](https://masteringpostgres.com)
    - [High Performance SQLite](https://highperformancesqlite.com)
    - [Screencasting](https://screencasting.com)
- Help spread the word about things I make

## Credits

Solo was developed by Aaron Francis. If you like it, please let me know!

- Twitter: https://twitter.com/aarondfrancis
- Website: https://aaronfrancis.com
- YouTube: https://youtube.com/@aarondfrancis
- GitHub: https://github.com/aarondfrancis/solo

--------------------------------------------------------------------------------
/composer.json:
--------------------------------------------------------------------------------
{
    "name": "soloterm/grapheme",
    "description": "A PHP package to measure the width of unicode strings rendered to a terminal.",
    "type": "library",
    "license": "MIT",
    "authors": [
        {
            "name": "Aaron Francis",
            "email": "aarondfrancis@gmail.com"
        }
    ],
    "minimum-stability": "dev",
    "require": {
        "php": "^8.1",
        "symfony/polyfill-intl-normalizer": "^1.27.0"
    },
    "require-dev": {
        "phpunit/phpunit": "^10.5|^11"
    },
    "autoload": {
        "psr-4": {
            "SoloTerm\\Grapheme\\": "src/"
        }
    },
    "autoload-dev": {
        "psr-4": {
            "SoloTerm\\Grapheme\\Tests\\": "tests/"
        }
    },
    "suggest": {
        "ext-intl": "For best performance"
    },
    "scripts": {
        "test": "vendor/bin/phpunit"
    }
}

--------------------------------------------------------------------------------
/pint.json:
--------------------------------------------------------------------------------
{
    "preset": "laravel",
    "rules": {
        "not_operator_with_successor_space": false,
        "heredoc_to_nowdoc": false,
        "phpdoc_summary": false,
        "concat_space": {
            "spacing": "one"
        },
        "function_declaration": {
            "closure_fn_spacing": "none"
        },
        "trailing_comma_in_multiline": false
    },
    "xxx_header_comment": {
        "header": "@author Aaron Francis \n\n@link https://aaronfrancis.com\n@link https://x.com/aarondfrancis",
        "comment_type": "PHPDoc",
        "location": "after_declare_strict"
    }
}
--------------------------------------------------------------------------------
/src/Grapheme.php:
--------------------------------------------------------------------------------
<?php
/**
 * @author Aaron Francis
 *
 * @link https://aaronfrancis.com
 * @link https://x.com/aarondfrancis
 */

namespace SoloTerm\Grapheme;

use Normalizer;

class Grapheme
{
    public static $cache = [];

    protected static $maybeNeedsNormalizationPattern = '/[\p{M}\x{0300}-\x{036F}\x{1AB0}-\x{1AFF}\x{1DC0}-\x{1DFF}\x{20D0}-\x{20FF}]/u';

    protected static $specialCharsPattern = '/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}-\x{2064}\x{034F}\x{061C}\x{202A}-\x{202E}]|[\p{M}\x{0300}-\x{036F}\x{1AB0}-\x{1AFF}\x{1DC0}-\x{1DFF}\x{20D0}-\x{20FF}]|[\x{FE0E}\x{FE0F}]|[\x{1F000}-\x{1FFFF}\x{2600}-\x{26FF}\x{2700}-\x{27BF}]|[\x{1100}-\x{11FF}\x{3000}-\x{303F}\x{3130}-\x{318F}\x{AC00}-\x{D7AF}\x{3400}-\x{4DBF}\x{4E00}-\x{9FFF}\x{F900}-\x{FAFF}\x{FF00}-\x{FFEF}]/u';

    protected static $variationSelectorsPattern = '/[\x{FE0E}\x{FE0F}]/u';

    protected static $emojiPattern = '/[\x{1F000}-\x{1FFFF}\x{2600}-\x{26FF}\x{2700}-\x{27BF}]/u';

    protected static $eastAsianPattern = '/[\x{1100}-\x{11FF}\x{3000}-\x{303F}\x{3130}-\x{318F}\x{AC00}-\x{D7AF}\x{3400}-\x{4DBF}\x{4E00}-\x{9FFF}\x{F900}-\x{FAFF}\x{FF00}-\x{FFEF}]/u';

    protected static $textStyleEmojiPattern = '/^[\x{2600}-\x{26FF}\x{2700}-\x{27BF}]$/u';

    protected static $flagSequencePattern = '/\p{Regional_Indicator}{2}|\x{1F3F4}[\x{E0060}-\x{E007F}]+/u';

    protected static $asciiZwjPattern = '/^[\x00-\x7F][\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}-\x{2064}]+$/u';

    protected static $devanagariPattern = '/\p{Devanagari}/u';

    protected static $singleLetterWithCombiningMarksPattern = '/^\p{L}\p{M}+$/u';

    protected static $skinTonePattern = '/[\x{1F3FB}-\x{1F3FF}]/u';

    protected static $flagEmojiPattern = '/^[\x{1F1E6}-\x{1F1FF}]{2}$/u';

    protected static $emojiZwjPattern =
        '/^[\x{1F300}-\x{1F6FF}][\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}-\x{2064}]+$/u';

    protected static $emojiWithZwjPattern = '/[\x{1F300}-\x{1F6FF}]/u';

    // Compiled patterns for filtering
    protected static $zwjFilterPattern = '/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}-\x{2064}\x{034F}\x{061C}\x{202A}-\x{202E}]+/u';

    protected static $onlyCombiningMarksPattern = '/^[\p{M}\x{0300}-\x{036F}\x{1AB0}-\x{1AFF}\x{1DC0}-\x{1DFF}\x{20D0}-\x{20FF}]+$/u';

    protected static $baseCharCombiningZwjPattern = '/^\p{L}[\p{M}\x{0300}-\x{036F}\x{1AB0}-\x{1AFF}\x{1DC0}-\x{1DFF}\x{20D0}-\x{20FF}]+[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}-\x{2064}\x{034F}\x{061C}\x{202A}-\x{202E}]+$/u';

    protected static $hasZeroWidthPattern = '/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}-\x{2064}\x{034F}\x{180E}\x{180B}-\x{180D}\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{FFF9}-\x{FFFB}\x{1160}\x{115F}\x{3164}]/u';

    protected static $textPresentationSymbolsPattern = '/^[\x{2600}-\x{26FF}\x{2700}-\x{27BF}\x{1F100}-\x{1F1FF}]$/u';

    public static function wcwidth(string $grapheme): int
    {
        // Check cache first (fastest path)
        if (isset(static::$cache[$grapheme])) {
            return static::$cache[$grapheme];
        }

        // Fast path for pure ASCII: If strlen == mb_strlen, it's single-byte only → width 1
        if (strlen($grapheme) === mb_strlen($grapheme)) {
            return static::$cache[$grapheme] = 1;
        }

        // Fast path: zero-width character check for single characters
        if (mb_strlen($grapheme) === 1 && preg_match(static::$hasZeroWidthPattern, $grapheme)) {
            return static::$cache[$grapheme] = 0;
        }

        // Handle ASCII + zero-width sequences (like "a\u{200D}")
        if (preg_match(static::$asciiZwjPattern, $grapheme)) {
            return static::$cache[$grapheme] = 1;
        }

        // Check for special flag sequence patterns (Scotland, England, etc.)
        if (preg_match(static::$flagSequencePattern, $grapheme)) {
            return static::$cache[$grapheme] = 2;
        }

        // Devanagari conjuncts and other complex scripts
        if (preg_match(static::$devanagariPattern, $grapheme)) {
            return static::$cache[$grapheme] = 1;
        }

        // Only normalize if there's a chance of combining marks
        if (preg_match(static::$maybeNeedsNormalizationPattern, $grapheme)) {
            $grapheme = Normalizer::normalize($grapheme, Normalizer::NFC);
        }

        // Special cases for characters followed by ZWJ/ZWNJ
        if (mb_strpos($grapheme, "\u{200D}") !== false || mb_strpos($grapheme, "\u{200C}") !== false) {
            // Check if it's a single character + ZWJ sequence
            if (mb_strlen(preg_replace(static::$zwjFilterPattern, '', $grapheme)) === 1) {
                // If it's an emoji + ZWJ, it should be width 2
                if (preg_match(static::$emojiZwjPattern, $grapheme)) {
                    return static::$cache[$grapheme] = 2;
                }

                // If it's a CJK/wide char + ZWJ, it should be width 2
                if (preg_match(static::$eastAsianPattern, mb_substr($grapheme, 0, 1))) {
                    return static::$cache[$grapheme] = 2;
                }

                // Otherwise, it should be width 1 (ASCII, Latin, etc.
                // + ZWJ)
                return static::$cache[$grapheme] = 1;
            }

            // If it's an emoji ZWJ sequence
            if (preg_match(static::$emojiWithZwjPattern, $grapheme)) {
                return static::$cache[$grapheme] = 2;
            }
        }

        // Handle variation selectors
        if (preg_match(static::$variationSelectorsPattern, $grapheme)) {
            $baseChar = preg_replace(static::$variationSelectorsPattern, '', $grapheme);

            // Text style variation selector for emoji-capable symbols
            if (mb_strpos($grapheme, "\u{FE0E}") !== false) {
                // Check if it's an emoji-capable character
                if (preg_match(static::$textStyleEmojiPattern, $baseChar)) {
                    return static::$cache[$grapheme] = 1;
                }
            }

            // Check if emoji with variation selector
            if (preg_match(static::$emojiPattern, $baseChar)) {
                return static::$cache[$grapheme] = 2;
            }

            // Check if East Asian character with variation selector
            if (preg_match(static::$eastAsianPattern, $baseChar)) {
                return static::$cache[$grapheme] = 2;
            }

            // Otherwise, measure the base character
            $width = mb_strwidth($baseChar, 'UTF-8');

            return static::$cache[$grapheme] = ($width > 0) ?
                $width : 1;
        }

        // Check if the grapheme contains any zero-width characters
        $hasZeroWidth = preg_match(static::$hasZeroWidthPattern, $grapheme) === 1;

        // If it has zero-width characters, we need special handling
        if ($hasZeroWidth) {
            // First, handle text with just formatting characters
            $filtered = preg_replace(static::$zwjFilterPattern, '', $grapheme);

            // If nothing is left after removing zero-width chars, or only combining marks left
            if ($filtered === '' || preg_match(static::$onlyCombiningMarksPattern, $filtered)) {
                return static::$cache[$grapheme] = 0;
            }

            // Handle base char + combining marks + ZWJ
            if (preg_match(static::$baseCharCombiningZwjPattern, $grapheme)) {
                return static::$cache[$grapheme] = 1;
            }

            // If it's a single character + zero-width chars
            if (mb_strlen($filtered) === 1) {
                if (preg_match(static::$eastAsianPattern, $filtered)) {
                    return static::$cache[$grapheme] = 2;
                }

                return static::$cache[$grapheme] = 1;
            }
        }

        // Check for special characters - if none, do direct width calculation
        if (!preg_match(static::$specialCharsPattern, $grapheme)) {
            return static::$cache[$grapheme] = mb_strwidth($grapheme, 'UTF-8');
        }

        // Single letter followed by combining marks
        if (preg_match(static::$singleLetterWithCombiningMarksPattern, $grapheme)) {
            return static::$cache[$grapheme] = 1;
        }

        // Handle skin tones or flags (single grapheme).
        // grapheme_strlen() comes from ext-intl, which composer.json only suggests,
        // so fall back to PCRE's \X (one extended grapheme cluster) when it is missing.
        $isSingleGrapheme = function_exists('grapheme_strlen')
            ? grapheme_strlen($grapheme) === 1
            : preg_match('/^\X\z/u', $grapheme) === 1;

        if ($isSingleGrapheme) {
            if (preg_match(static::$skinTonePattern, $grapheme)) {
                return static::$cache[$grapheme] = 2;
            }
            if (preg_match(static::$flagEmojiPattern, $grapheme)) {
                return static::$cache[$grapheme] = 2;
            }
        }

        // Handle symbols that should be width 1 in text presentation
        if
        (preg_match(static::$textPresentationSymbolsPattern, $grapheme) && mb_strpos($grapheme, "\u{FE0F}") === false) {
            return static::$cache[$grapheme] = 1;
        }

        // Default fallback to mb_strwidth, carefully filtering zero-width characters
        $filtered = preg_replace(static::$zwjFilterPattern, '', $grapheme);
        $width = mb_strwidth($filtered, 'UTF-8');

        return static::$cache[$grapheme] = ($width > 0) ? $width : 1;
    }
}

--------------------------------------------------------------------------------